MASTER'S THESIS PROPOSAL
COMPUTER SCIENCE DEPT. 8/22/2001
Digital technologies such as fax and email have revolutionized the world of
information exchange. Yet there is a desire for more advanced technologies
that would allow us to exchange information on paper as efficiently and safely
as we do on digital media. For example, imagine a technology that could detect
tampering in a signed affidavit. Detecting changes in an electronic file is
easy, but authenticating it is still a research problem (watermarks may
ultimately solve it). While paper documents carry the signature of the person
for authentication, sophisticated technology is required to detect tampering.
Sometimes the human eye cannot detect such changes (e.g., signature verifiers
are required to detect forged signatures). Xerox addressed this need with an
innovative solution in the form of Dataglyphs.
The Dataglyph [1] technology is a practical and extensible
method for bestowing on paper documents properties traditionally associated
with digital documents. These properties enable perfect reproduction of paper
documents transmitted electronically over noisy channels, and the ability to
apply digital cryptographic techniques to paper documents. The result is a
human- and machine-readable paper document that can be losslessly converted to
digital format and back, automatically verified to be free from tampering,
mined or stripped of annotations, copied without generational loss, and
automatically repaired after severe damage.
Currently, however, the Dataglyph technology is limited to small documents,
because an individual Dataglyph can store only tens of kilobytes of data. It
is desirable to overcome this restriction, since users wish to apply the
security and accuracy protections afforded by Dataglyphs to any kind of
document, irrespective of its size.
I wish to address the problem of storing and retrieving large amounts of data
in Dataglyphs. Because of the limited glyph-carrying capacity of a page, large
amounts of data must be split across individual pages. Several aspects, such
as file splitting, page shuffling, and the recovery of lost or heavily damaged
pages, must be addressed when extending this technology from the single-page
domain to the multi-page domain. The goal is to solve the open problems of
multi-page Dataglyph documents and to implement the solution, with careful
attention to the applications (tamper detection, annotation lifting, perfect
copying, etc.) and how they affect multi-page support.
For example, what is to be done if users
- want to make a perfect copy of a single page, but the data encoded in the
Dataglyph pertaining to that page does not fit into an individual glyph and is
stored across two or more pages?
- do not want to authenticate a document with missing pages?
- want to make a perfect copy of a 10-page document while scanning only 5 of
its pages?
- need 6 pages of glyphs to store a 5-page document?
- want to reconstruct the original document even if pages appear out of order?
Thus, after considering all the reasonable ways to split the data between
glyphs, classifying those possibilities, weighing the tradeoffs, and devising
a metric for evaluating the algorithms, I will propose a solution that is
simple and maximally robust.
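As a concrete illustration, the bookkeeping needed for reassembly could be carried in a small per-packet header. The following C sketch is purely hypothetical; the field names and sizes are illustrative and do not describe the actual Dataglyph packet format:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-packet header for multi-page Dataglyph data.
   All fields are illustrative assumptions, not the real layout. */
struct glyph_packet_hdr {
    uint32_t doc_id;      /* identifies the document a page belongs to   */
    uint16_t seq;         /* sequence number: restores page order        */
    uint16_t total;       /* total packet count: reveals missing pages   */
    uint32_t payload_len; /* bytes of document data in this packet       */
    uint32_t payload_crc; /* integrity check over the payload            */
};
```

A header like this would let pages be reassembled in order (seq), missing pages be detected (total), and corrupted payloads be rejected (payload_crc).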
Col·lo·cate
(kŏl’ə-kāt)
v. tr. To place together in proper order; arrange side by side
I propose to use Reed-Solomon codes [5] for error correction. Reed-Solomon codes are an important subclass of the non-binary BCH (Bose-Chaudhuri-Hocquenghem) codes and operate over Galois field arithmetic. BCH codes are a multiple-error-correcting generalization of Hamming codes, and they are cyclic codes. As a class, they are the best known nonrandom codes for channels in which errors affect successive symbols independently. Furthermore, simply implemented decoding procedures have been devised for these codes.
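To make the field arithmetic concrete, the following C sketch shows multiplication in GF(2^8) using the commonly chosen primitive polynomial 0x11d; the polynomial is an assumption here, and the actual implementation may pick a different one:

```c
#include <assert.h>

/* Multiplication in GF(2^8) modulo the primitive polynomial
   x^8 + x^4 + x^3 + x^2 + 1 (0x11d), shift-and-add style. */
static unsigned char gf_mul(unsigned char a, unsigned char b)
{
    unsigned char p = 0;
    while (b) {
        if (b & 1)
            p ^= a;                               /* add (XOR) this multiple of a */
        b >>= 1;
        if (a & 0x80)
            a = (unsigned char)((a << 1) ^ 0x1d); /* reduce x^8 mod 0x11d */
        else
            a <<= 1;
    }
    return p;
}
```

With gf_mul (and a corresponding division), Reed-Solomon encoding and decoding reduce to polynomial arithmetic over these 256 field elements, which is also why a byte-level codeword is capped at 2^8 - 1 = 255 symbols.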
To support page loss recovery, I will use techniques analogous to those used by redundant disk storage arrays [6]; classic RAID relies on simple parity rather than Reed-Solomon codes. An N-packet message is augmented with M additional packets of error-correction information, and the message can then be recovered even if any M of the packets are lost. Here, Reed-Solomon erasure codes are used, with codewords spanning the collection of packets. The current Reed-Solomon implementation operates at the byte level, restricting the total number of packets to 255.
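The single-parity case, the RAID-style analogue mentioned above, can be sketched in C as follows. This is illustrative only: one XOR parity packet recovers any one lost packet, whereas the proposed system uses Reed-Solomon codewords so that up to M of the packets may be lost:

```c
#include <assert.h>
#include <string.h>

#define NPKT 4   /* data packets per group (illustrative) */
#define PLEN 8   /* bytes per packet (illustrative)       */

/* Compute the parity packet as the XOR of all data packets. */
static void make_parity(const unsigned char data[NPKT][PLEN],
                        unsigned char parity[PLEN])
{
    memset(parity, 0, PLEN);
    for (int i = 0; i < NPKT; i++)
        for (int j = 0; j < PLEN; j++)
            parity[j] ^= data[i][j];
}

/* Rebuild packet `lost` by XOR-ing the parity with the survivors. */
static void recover(unsigned char data[NPKT][PLEN], int lost,
                    const unsigned char parity[PLEN])
{
    memcpy(data[lost], parity, PLEN);
    for (int i = 0; i < NPKT; i++)
        if (i != lost)
            for (int j = 0; j < PLEN; j++)
                data[lost][j] ^= data[i][j];
}
```

Generalizing from one XOR parity packet to M Reed-Solomon check packets buys tolerance of any M lost pages, at the cost of Galois-field arithmetic instead of plain XOR.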
In summary, collocated Dataglyphs address the problem of rendering large glyph
messages onto multiple printed pages, and then recovering the glyph data from
rescanned versions of the multi-page documents, even when pages are missing,
reordered, or damaged.
The algorithm will be implemented in C on the Unix platform.
Testing: Time and memory analysis will be carried out.
Data Set: The application will be tested on the PDF and TIFF-FX file formats,
wrapped in a MIME [4] header.
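For illustration, a MIME header wrapping a PDF payload might look like the following; the field values are hypothetical, not the wrapper actually used:

```
MIME-Version: 1.0
Content-Type: application/pdf
Content-Transfer-Encoding: base64
Content-Description: thesis proposal

JVBERi0x...
```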
Microsoft Word documents are too big to handle, and the formatting of the
document cannot be preserved in glyphs, as is the case with HTML files.
The background of this document is a Dataglyph, and it encodes only as much of
the data from this PDF file as it can fit; thus both pages have the same glyph
in their backgrounds. My program will address this problem, so that even if
the pages are damaged, lost, or reordered, I will be able to make a perfect
copy of this document and detect any tampering, should you happen to modify
this proposal.
Past research has not dealt directly with this problem. Both the packetization
and error-correction techniques discussed are well known and widely deployed
across several problem domains. In particular, magnetic storage arrays (e.g.,
RAID), electronic network protocols [3] (e.g., TCP/IP [2]), and digital audio
systems (e.g., audio compact discs) use similar techniques.
As far as I know, this is
the first time these techniques have been applied to the problem of splitting
large messages across multiple Dataglyphs.
Much of the effort has been in analyzing the myriad techniques used in other
problem domains and choosing ones appropriate for use with Dataglyphs. The
decision will be made based on criteria such as storage-space efficiency,
simplicity, computational speed, the ability to use proven and well-known
techniques, the ability to handle arbitrarily large messages, and performance
in situations where pages are destroyed or considered in isolation. Additional
engineering effort will go toward implementation.
I am using off-the-shelf DG500 Dataglyphs to store packets. I am not modifying
or taking advantage of the internal structural details of the DG500 Dataglyphs
in any way; such details are irrelevant to this technique.
1. DataGlyphs. www.dataglyphs.com
2. Andrew S. Tanenbaum. Computer Networks, Third Edition. Prentice Hall, Inc.
3. RFC 1717 - The PPP Multilink Protocol (MP)
4. RFC 1521 - MIME
5. Peterson and Weldon. Error-Correcting Codes, Second Edition. The MIT Press,
Cambridge, Massachusetts, and London, England.
6. J. S. Plank. "A Tutorial on Reed-Solomon Coding for Fault-Tolerance in
RAID-like Systems." Software: Practice & Experience, 27(9), September 1997,
p. 995.