MASTER'S THESIS PROPOSAL
COMPUTER SCIENCE DEPT. 8/22/2001
Digital technologies such as fax and email have revolutionized the world of
information exchange. Yet there is a desire for more advanced technologies
that would allow us to exchange information on paper as efficiently and safely
as we do on digital media. For example, imagine a technology that could detect
tampering in a signed affidavit. Detecting changes in an electronic file is
easy, but authenticating it is still a research problem (watermarks may
ultimately solve it). While paper documents carry the signature of the person
for authentication, sophisticated technology is required to detect tampering.
Sometimes the human eye cannot detect such changes (e.g., signature verifiers
are required to detect forged signatures). Xerox addressed this need with an
innovative solution in the form of Dataglyphs.
The Dataglyph [1] technology is a practical and extensible
method for bestowing on paper documents properties traditionally associated
with digital documents. These properties enable perfect reproduction of paper
documents transmitted electronically over noisy channels, and the ability to
apply digital cryptographic techniques to paper documents. The result is a
human- and machine-readable paper document that can be losslessly converted to
digital format and back, automatically verified to be free from tampering,
mined or stripped of annotations, copied without generational loss, and
automatically repaired after severe damage.
Currently, however, the Dataglyph technology is limited to small documents,
because an individual Dataglyph can store only tens of kilobytes of data. It
is desirable to overcome this restriction, since users wish to apply the
security and accuracy protections afforded by Dataglyphs to any kind of
document, irrespective of its size.
I wish to address the problem of storing and retrieving large amounts of data
in Dataglyphs. Because of the limited glyph-carrying capacity of a page, large
amounts of data must be split across individual pages. Several aspects, such
as file splitting, page shuffling, and the recovery of lost or heavily damaged
pages, must be addressed when extending this technology from the single-page
domain to the multi-page domain. The goal is to solve the open problems of
multi-page Dataglyph documents and to implement the solution, with careful
attention to the applications (tamper detection, annotation lifting, perfect
copying, etc.) and how they affect multi-page support.
For example, what is to be done if users
- want to make a perfect copy of a single page, but the data encoded in the
Dataglyph pertaining to that page does not fit into an individual glyph and is
stored across two or more pages?
- do not want to authenticate a document with missing pages?
- want to make a perfect copy of a 10-page document while scanning only 5 of
its pages?
- need 6 pages of glyphs to store a 5-page document?
- want to reconstruct the original document even if pages appear out of order?
Thus, after considering all the reasonable ways to split the data between
glyphs, classifying those possibilities, weighing the tradeoffs, and devising
a metric for evaluating the algorithms, I will propose a solution that is
simple and maximally robust.
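As a concrete illustration, the bookkeeping needed for reassembly could be carried in a small per-packet header. The following C sketch is purely hypothetical; the field names and sizes are illustrative and do not describe the actual Dataglyph packet format:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-packet header for multi-page Dataglyph data.
   All fields are illustrative assumptions, not the real layout. */
struct glyph_packet_hdr {
    uint32_t doc_id;      /* identifies the document a page belongs to   */
    uint16_t seq;         /* sequence number: restores page order        */
    uint16_t total;       /* total packet count: reveals missing pages   */
    uint32_t payload_len; /* bytes of document data in this packet       */
    uint32_t payload_crc; /* integrity check over the payload            */
};
```

A header like this would let pages be reassembled in order (seq), missing pages be detected (total), and corrupted payloads be rejected (payload_crc).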
Col·lo·cate
(kŏl’ə-kāt)
v. tr. To place together in proper order; arrange side by side
I propose to use Reed-Solomon codes [5] for error correction. Reed-Solomon codes are an important subclass of the non-binary BCH (Bose-Chaudhuri-Hocquenghem) codes and operate over Galois field arithmetic. BCH codes are a multiple-error-correcting generalization of Hamming codes, and they are cyclic codes. As a class, they are the best known nonrandom codes for channels in which errors affect successive symbols independently. Furthermore, simply implemented decoding procedures have been devised for these codes.
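To make the field arithmetic concrete, the following C sketch shows multiplication in GF(2^8) using the commonly chosen primitive polynomial 0x11d; the polynomial is an assumption here, and the actual implementation may pick a different one:

```c
#include <assert.h>

/* Multiplication in GF(2^8) modulo the primitive polynomial
   x^8 + x^4 + x^3 + x^2 + 1 (0x11d), shift-and-add style. */
static unsigned char gf_mul(unsigned char a, unsigned char b)
{
    unsigned char p = 0;
    while (b) {
        if (b & 1)
            p ^= a;                               /* add (XOR) this multiple of a */
        b >>= 1;
        if (a & 0x80)
            a = (unsigned char)((a << 1) ^ 0x1d); /* reduce x^8 mod 0x11d */
        else
            a <<= 1;
    }
    return p;
}
```

With gf_mul (and a corresponding division), Reed-Solomon encoding and decoding reduce to polynomial arithmetic over these 256 field elements, which is also why a byte-level codeword is capped at 2^8 - 1 = 255 symbols.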
To support page loss recovery, I will use techniques analogous to those used by redundant disk storage arrays [6]; classic RAID relies on simple parity rather than Reed-Solomon codes. An N-packet message is augmented with M additional packets of error-correction information, and the message can then be recovered even if any M of the packets are lost. Here, Reed-Solomon erasure codes are used, with codewords spanning the collection of packets. The current Reed-Solomon implementation operates at the byte level, restricting the total number of packets to 255.
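The single-parity case, the RAID-style analogue mentioned above, can be sketched in C as follows. This is illustrative only: one XOR parity packet recovers any one lost packet, whereas the proposed system uses Reed-Solomon codewords so that up to M of the packets may be lost:

```c
#include <assert.h>
#include <string.h>

#define NPKT 4   /* data packets per group (illustrative) */
#define PLEN 8   /* bytes per packet (illustrative)       */

/* Compute the parity packet as the XOR of all data packets. */
static void make_parity(const unsigned char data[NPKT][PLEN],
                        unsigned char parity[PLEN])
{
    memset(parity, 0, PLEN);
    for (int i = 0; i < NPKT; i++)
        for (int j = 0; j < PLEN; j++)
            parity[j] ^= data[i][j];
}

/* Rebuild packet `lost` by XOR-ing the parity with the survivors. */
static void recover(unsigned char data[NPKT][PLEN], int lost,
                    const unsigned char parity[PLEN])
{
    memcpy(data[lost], parity, PLEN);
    for (int i = 0; i < NPKT; i++)
        if (i != lost)
            for (int j = 0; j < PLEN; j++)
                data[lost][j] ^= data[i][j];
}
```

Generalizing from one XOR parity packet to M Reed-Solomon check packets buys tolerance of any M lost pages, at the cost of Galois-field arithmetic instead of plain XOR.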
In summary, collocated Dataglyphs address the problem of rendering large glyph
messages onto multiple printed pages, and then recovering the glyph data from
rescanned versions of the multi-page documents, even when pages are missing,
reordered, or damaged.
The algorithm will be implemented in C on the Unix platform.
Testing: Time and memory analysis will be carried out.
Data Set: The application will be tested on the PDF and TIFF-FX file formats,
wrapped in a MIME [4] header.
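For illustration, a MIME header wrapping a PDF payload might look like the following; the field values are hypothetical, not the wrapper actually used:

```
MIME-Version: 1.0
Content-Type: application/pdf
Content-Transfer-Encoding: base64
Content-Description: thesis proposal

JVBERi0x...
```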
Microsoft Word documents are too big to handle, and the formatting of the
document cannot be preserved in glyphs, as is the case with HTML files.
The background of this document is a Dataglyph, and it encodes only as much of
the data from this PDF file as it can fit; thus both pages have the same glyph
in their backgrounds. My program will address this problem, so that even if
the pages are damaged, lost, or reordered, I will be able to make a perfect
copy of this document and detect any tampering, should you happen to modify
this proposal.
Past research has not dealt directly with this problem. Both the packetization
and error-correction techniques discussed are well known and widely deployed
across several problem domains. In particular, magnetic storage arrays (e.g.,
RAID), electronic network protocols [3] (e.g., TCP/IP [2]), and digital audio
systems (e.g., audio compact discs) use similar techniques.
As far as I know, this is
the first time these techniques have been applied to the problem of splitting
large messages across multiple Dataglyphs.
Much of the effort has been in analyzing the myriad techniques used in other
problem domains and choosing ones appropriate for use with Dataglyphs. The
decision will be made based on criteria such as storage-space efficiency,
simplicity, computational speed, the ability to use proven and well-known
techniques, the ability to handle arbitrarily large messages, and performance
in situations where pages are destroyed or considered in isolation. Additional
engineering effort will go toward implementation.
I am using off-the-shelf DG500 Dataglyphs to store packets. I am not modifying
or taking advantage of the internal structural details of the DG500 Dataglyphs
in any way; such details are irrelevant to this technique.
1. DataGlyphs. www.dataglyphs.com
2. Andrew S. Tanenbaum. Computer Networks, Third Edition. Prentice Hall, Inc.
3. RFC 1717 - The PPP Multilink Protocol (MP)
4. RFC 1521 - MIME
5. Peterson and Weldon. Error-Correcting Codes, Second Edition. The MIT Press,
Cambridge, Massachusetts, and London, England.
6. J. S. Plank. "A Tutorial on Reed-Solomon Coding for Fault-Tolerance in
RAID-like Systems." Software: Practice & Experience, 27(9), September 1997,
p. 995.