[1327] in cryptography@c2.net mail archive


text formatting in literary works (tiny change from coderpunks posting)

daemon@ATHENA.MIT.EDU (Antonomasia)
Wed Aug 13 10:49:24 1997

Date: Wed, 13 Aug 1997 03:18:52 +0100
From: Antonomasia <ant@notatla.demon.co.uk>
To: cryptography@c2.net


[A tiny change from the coderpunks posting is marked near the end.]

Text Formatting in Literary Works


PGP Inc have recently published in book form the source code for
PGP 5.0.  It happens that publication and export in this form is
still permitted in the US, and copies of the book were legally
exported to Europe, where they were scanned into computers and fed
through optical character recognition (OCR) software resulting in
magnetic source code approximating to that from which the book
was prepared.

About 80 volunteers proof-read this intermediate result, leading
eventually to successful compilation of PGP 5.0 from the sources
now legally outside the US.  The final burst of proof-reading was
carried out at the HIP97 (Hacking In Progress) event (an informal
computer security conference).  These thoughts on text formatting
are the result of that weekend of intense code-correction.

Firstly I would like to praise the efforts of PGP Inc in producing
the book.  Dave Del Torto assures me it was not straightforward,
and several weeks of strain lie behind it.  My initial reaction,
"Your source code is revolting", was overcritical; on reflection it
points to a shortcoming in the design goals of the publication.
The gap between ideal conditions for scanning and checking and those
of the real world seems not to have been recognised.
The proof-reading should be as easy as possible even without a good
copy to compare against!  For part of the time this was how we had to
work, as paper had been mislaid - perhaps delegated to a volunteer
who could not be found later.  Also, evident errors are more easily
fixed than those which need to be found by comparison.


1) Whitespace - WYSINWYG

   It often happened (especially with the infamous line 159)
   that the OCR output mixed spaces and tabs on a line in a
   way that was impossible to read accurately on paper.
   The variable width of a tab hid the number of spaces
   following it, which could be anywhere from 0 to 8.

   Two solutions are fairly obvious.  One is to use checksums that
   ignore unquoted whitespace, to arrive at different but hopefully
   equivalent code.  The other (my preference) is to adopt a whitespace
   convention.  This could be something like:

        Every run of unquoted whitespace (not counting newlines) is a
        single space.

        There are no trailing spaces.  All blank lines are empty.

        Any necessary exceptions to this (assembler portions or whatever)
        are labeled as such in nearby comments and the local convention
        is defined.
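
A minimal sketch of how such a convention might be applied mechanically,
in Python.  The function name normalise_line is my own invention, and the
quote handling is deliberately naive: it knows about backslash escapes
inside '...' and "..." but nothing about comments, so an apostrophe in a
comment would be treated as an opening quote.

```python
def normalise_line(line: str) -> str:
    """Collapse each run of unquoted spaces/tabs to one space.

    Trailing whitespace is dropped, so a blank line comes out empty.
    Text inside single or double quotes is copied unchanged.
    """
    out = []
    quote = None           # the quote character we are inside, if any
    escaped = False        # previous character was a backslash (in quotes)
    pending_space = False  # a run of unquoted whitespace has been seen
    for ch in line.rstrip("\r\n"):
        if quote:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == quote:
                quote = None
        elif ch in " \t":
            pending_space = True
        else:
            if pending_space:
                out.append(" ")
                pending_space = False
            out.append(ch)
            if ch in "\"'":
                quote = ch
    return "".join(out)
```

Under this sketch a line indented with two tabs and the same line indented
with a tab and five spaces both come out with one leading space, so the
reader on paper never has to count.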


2) No Comments

The processing of full comments amounted to perhaps 30% of the proofreading
effort.  Exclusion of comments is probably a good idea.  Extravagant comments
such as tables and diagrams are right out.

Usually source code has two audiences (compilers & humans).  In this work
there was a third (the scanner).  One option to try to suit all these is to
publish commented and uncommented versions.  A Fortran standard I have read
had the full standard on one side of a page, and a subset of it on the
opposite page.  Use of blank lines ensured identical pagination so you
could read and compare both.  Readers of dual-translation Bibles will
be familiar with the scheme.


3) Properties of OCR

It appears that OCR software applies rules on the likely content of natural
language text.  Some familiar errors are:

  "*bn" -> "*ten",   "cfb" -> "cib",   "cfb" -> "cEb"

Names could be chosen so as to be more easily recognised by this software.


4) Checksums suffer too

Checksums were sometimes misread by the software.  After some practice
0000L0 was clearly 0000b0, and [01] had routine mistranslations into
letters, but some checksums were wrong and still appeared legal.
One we fixed was legally-formed, but wrong in two digits.

Standard hexadecimal may not be the ideal representation for the checksums.
Maybe another alphabet of similar size would result in fewer errors,
[A-P] or something.  Bearing (3) in mind, I lean toward the word format
used in one-time passwords (RFC 1938).  That dictionary allows 11 bits
to be conveyed by a word, but some experimentation around that idea may
lead to a scheme that works well.
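
As a rough illustration of the [A-P] idea, here is a Python sketch that
maps each 4-bit nibble of a checksum to one letter.  The function names
and the default width of six symbols are assumptions of mine (the width
is only a guess from the 0000b0 example), not the format used in the book.
The point is that the alphabet contains no digits, so any digit appearing
in a scanned checksum is evidently wrong.

```python
ALPHABET = "ABCDEFGHIJKLMNOP"  # 16 letters, one per 4-bit nibble


def encode_checksum(value: int, nibbles: int = 6) -> str:
    """Write value as `nibbles` letters from A-P, most significant first."""
    if not 0 <= value < 16 ** nibbles:
        raise ValueError("checksum out of range")
    return "".join(
        ALPHABET[(value >> (4 * i)) & 0xF] for i in reversed(range(nibbles))
    )


def decode_checksum(text: str) -> int:
    """Inverse of encode_checksum; raises ValueError on any other character."""
    value = 0
    for ch in text:
        value = (value << 4) | ALPHABET.index(ch)
    return value
```

The same splitting works for the word format: take 11 bits at a time and
use them as an index into a 2048-word list.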

Also checksums would benefit from redundancy so that if a line is
faulty because of an error in the checksum, rather than in the code,
it can be recognised as such.
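
One possible form of that redundancy, sketched in Python: append to the
checksum a check digit computed over the checksum itself.  Everything
here is an assumption for illustration (CRC-32 as the line checksum, an
XOR of the hex digits as the check digit, the function names); the book
uses its own checksum, which I am not reproducing.

```python
import zlib


def line_checksum(line: str) -> int:
    """Stand-in line checksum: CRC-32 of the line's bytes."""
    return zlib.crc32(line.encode("utf-8"))


def protect(checksum: int) -> str:
    """Return 8 hex digits plus one check digit (XOR of those 8 digits)."""
    digits = f"{checksum:08x}"
    check = 0
    for d in digits:
        check ^= int(d, 16)
    return digits + f"{check:x}"


def diagnose(line: str, field: str) -> str:
    """Say whether the fault lies in the checksum field or in the code."""
    try:
        nibbles = [int(ch, 16) for ch in field]
    except ValueError:
        return "bad checksum field"   # e.g. a letter outside 0-9a-f
    if len(nibbles) != 9:
        return "bad checksum field"
    parity = 0
    for n in nibbles:
        parity ^= n
    if parity != 0:
        return "bad checksum field"   # the field fails its own check
    if int(field[:8], 16) != line_checksum(line):
        return "bad code line"        # field is self-consistent, code is not
    return "ok"
```

A single XOR digit catches any one misread digit in the field; two
misreads can cancel, so a real scheme would probably want something
stronger.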

I did not tackle any large array initialisations such as S-boxes,
but I doubt that code optimisation is a priority in such sections
(this is code for distribution, and can be tweaked later if need be)
and a more robust binary representation is likely to help there.

5) Letter Case

Many names in the code were structured like pgpWordsHere or
PgpWordsHere.  The initial 'p' was sometimes wrongly changed
into upper case.  If there was a convention here I failed to
see it.  Maybe one case-style will do.  Maybe different styles
could be used to mark variables and functions ....


6) A Data Dictionary

The addition of a scannable data dictionary would help.  Besides setting
out the formatting conventions in use, it would help in determining
changes in code that looked valid.  I'm unsure how much detail would
help, but some variable names would probably be appropriate.

 | Names including underscores, especially at the beginning, could
 | be listed.  These sometimes turned to spaces.

This dictionary would be scanned and distributed electronically to
proof-readers.  They would then be equipped to tackle much of the work
without visible good source.
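
A sketch of how such a dictionary might be built and used, in Python.
The function names are hypothetical, and the identifier pattern is
crude: it also picks up language keywords and words inside comments and
strings, which a real tool would want to filter.

```python
import re

# Anything shaped like a C identifier.
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_dictionary(source: str) -> set:
    """Collect every identifier-like token from the good source."""
    return set(IDENT.findall(source))


def underscore_names(dictionary: set) -> list:
    """Names containing underscores, which OCR sometimes turned to spaces."""
    return sorted(name for name in dictionary if "_" in name)


def unknown_names(ocr_line: str, dictionary: set) -> list:
    """Tokens on a scanned line that the dictionary does not contain."""
    return [name for name in IDENT.findall(ocr_line) if name not in dictionary]
```

For example, with a dictionary built from a line that defines _pgp_init
and uses pgpWordsHere, a scanned "PgpWordsHere" or "pgp init" is
reported as unknown, which points the proof-reader at the case change or
the lost underscores without needing the paper.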



If you have further remarks on the preparation of the PGP book series,
Dave Del Torto <ddt@pgp.com> would love to hear from you.


keywords: pgp5 hip97 ocr export

--
##############################################################
# Antonomasia   ant@notatla.demon.co.uk                      #
# See http://www.notatla.demon.co.uk/                        #
#### !!! PGP 5.0 beta available now at ftp.replay.com !!! ####
