Figure 1 is a flowchart of a processing technique 10 illustrating how an input document 12 is processed in steps 14-34 to provide output 36 for downstream processes, according to the invention. (Note that, in Figures 2 and 3, the processing technique 10 is equivalently referred to as "Processing Step 1".) For the processing technique 10 to function, the input document 12 must be prepared. A prepared input document 12 is one that is marked up in a way that is understood by the processing technique 10. The task of the processing technique 10 of Figure 1 is to take an input document and transform and/or prepare it for all the downstream processes. This involves a number of steps, including checking headers and implementing any instructions contained therein that are relevant to steps 14 and 22, and checking the text body for recognized tags and processing them in a pre-determined way, or as instructed in the headers. If recognized tags are not in the form preferred by the program, they are converted to the preferred form of tagging (18,26,30), which will be explained below.
Most Markup for the input document is optional, but the use of Markup is recommended as it affects the presentation of the document. In particular, although a digital document without any Markup can be processed, the system is then severely limited because it can do little more than add Object Citation Numbers. In practice at least a title would be provided and the document structure would be defined, either by a pattern defining the structure, or by specific tagging of headings with their level in the document structure.
The prepared input document 12 is in some primary text representation format, such as ASCII or Unicode (ASCII is currently used). The markup in the prepared input document is characterized by being visible as tags that are instructions to the process (rather than to the human reader), and by being easily understood by a human as a simple set of tags. Wherever possible, visual mnemonics and current text practices (in mail, chat and newsgroups) are used. Syntax highlighters can be used to make markup easily visible within the text. Extensive help is available to a user concerning the markup tags and their meaning and effect on processes.
In a typical document, such as an article or news story, not much markup is required. The alternative modes of markup are provided for flexibility and to provide options that simplify document preparation. The most appropriate markup depends on the nature of the contents of the document being prepared, and the form in which it is received. In a document which has a structure that can be defined by a pattern in a header, virtually no markup is required: a pattern header, and preferably also a title, are all that is needed. Optional additional semantic information about the document, which adds to its value in several downstream processes for searching purposes, may also be included. In the text body any font and paragraph appearance modifications could be tagged, and any headings not caught by the defined pattern would be tagged with their Level.
A prepared input document must contain information about the document structure, which may be done by providing descriptions in the appropriate header, and/or by explicit Tagging of a heading with a Level. If a pattern can be used to define the document structure, as explained below, then only the pattern for the level need be provided in the document header. A combination of both pattern descriptors and manual Tagging may be used.
Nothing needs to be done for the process to assign object citation numbers, because this is automatically done in the processing technique 10. However, if a particular paragraph or object should not be numbered, it has to be marked. At present, a dash or tilde ("-" or "~") followed by a hash at the end of the line of a paragraph is used to indicate that the object is not to be numbered. If a tilde is used, it is kept and presented but not numbered; if a dash is used, the unnumbered object is dropped from the text in output forms that do not need it (this permits the creation of dummy levels in html, that do not appear in the LaTeX/pdf output).
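The marking convention described above can be illustrated with a minimal sketch. It assumes the marker is exactly the two characters "~#" or "-#" at the very end of the paragraph, possibly followed by trailing whitespace; the function name `numbering_mode` and the three mode labels are hypothetical and are not taken from the program itself.

```python
def numbering_mode(paragraph: str) -> tuple[str, str]:
    """Classify a paragraph by its end-of-line numbering marker.

    Returns (mode, text), where text has the marker stripped off.
    The mode labels are illustrative names, not part of the described system.
    """
    text = paragraph.rstrip()
    if text.endswith("~#"):
        # Tilde: the object is kept and presented, but receives no number.
        return "unnumbered_kept", text[:-2].rstrip()
    if text.endswith("-#"):
        # Dash: unnumbered, and droppable in outputs that do not need it.
        return "unnumbered_droppable", text[:-2].rstrip()
    # No marker: the object is numbered automatically.
    return "numbered", text
```

A downstream step could then skip the citation counter for either unnumbered mode, and omit "unnumbered_droppable" objects entirely from output forms that do not need them.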
Referring again to Figure 1, the document data is processed in a stream, one object at a time. An object, except (at present) in the case of a table, corresponds roughly to a paragraph of information, that is, any run of text that is not interrupted by an empty line (two carriage returns). Examples of objects include a heading, an ordinary text paragraph, and a reference to an external image placed on its own. Tables are processed according to their own rules as a single object, and numbered accordingly. Poetry and blocks of code are delimited as objects in the same way, but are processed line by line. Thus, steps 14 through 34 are performed on every Object within a document, and multiple passes over the entire document may occur to generate the data required as input to the downstream processes. The entire output 36 is then utilized as an input for further downstream processing.
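The division of a document body into objects can be sketched as follows. This is only an illustration under simplifying assumptions: every blank-line-delimited block is treated as one ordinary object, and the special handling of tables, poetry, code blocks, footnotes and unnumbered objects is ignored. The function names `split_objects` and `assign_ocn` are hypothetical.

```python
import re


def split_objects(body: str) -> list[str]:
    """Split a document body into objects separated by one or more empty lines."""
    # Normalize line endings so that "\r\n" and "\r" behave like "\n".
    normalized = body.replace("\r\n", "\n").replace("\r", "\n").strip()
    # An empty line is two newlines with, at most, whitespace between them.
    chunks = re.split(r"\n\s*\n", normalized)
    return [chunk for chunk in chunks if chunk.strip()]


def assign_ocn(objects: list[str]) -> list[tuple[int, str]]:
    """Pair each object with a sequential object citation number, starting at 1."""
    return list(enumerate(objects, start=1))
```

For example, a body consisting of a heading, a two-line paragraph and a further paragraph, each separated by empty lines, yields three objects numbered 1 to 3.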
Footnotes are the other special case, as they may have different representations and are subject to their own numbering system, because Footnotes "belong" to the object from which they are referenced.
In each case a directory is created (at the location the program has been instructed to use), named after the file in which the input data is stored, without the suffix recognized by the program (so that directory names can be meaningful to humans). All output data that is created for storage on the file-system for a given document is placed within that document's directory.
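The naming rule for the per-document directory can be sketched as below. The source does not state which suffixes the program recognizes, so the tuple `RECOGNIZED_SUFFIXES` is purely a placeholder assumption, as is the function name `output_dir_for`.

```python
from pathlib import Path

# Hypothetical list of suffixes; the actual recognized suffixes are not given.
RECOGNIZED_SUFFIXES = (".txt",)


def output_dir_for(input_file: str, output_root: str) -> Path:
    """Return the directory in which all output for one document is stored."""
    name = Path(input_file).name
    for suffix in RECOGNIZED_SUFFIXES:
        if name.endswith(suffix):
            # Drop only a suffix the program recognizes.
            name = name[: -len(suffix)]
            break
    return Path(output_root) / name
```

Calling `output_dir_for(...).mkdir(parents=True, exist_ok=True)` would then create the directory before any output is written.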
The program begins by checking the document header for processing instructions (steps 14 and 18). Headers are currently represented by 0 at the beginning of the line, followed by an open curly bracket, a tilde and the associated name (e.g. 0{~toc, 0{~markup, 0{~skin, 0{~links, or for semantic header data 0{~creator, 0{~title, 0{~date, etc.). This is then followed by any relevant information associated with the tag.
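The header representation described above can be read with a short sketch such as the following. It assumes that each header occupies a single line, that the name consists of word characters, and that no closing bracket follows the name (none appears in the examples given). The function name `parse_headers` is hypothetical.

```python
import re

# "0" at the start of a line, then "{~", then the header name,
# then any associated information up to the end of the line.
HEADER_PATTERN = re.compile(r"^0\{~(\w+)[ \t]*(.*)$", re.MULTILINE)


def parse_headers(document: str) -> dict[str, str]:
    """Collect header names and their associated information from a document."""
    return {name: value.strip() for name, value in HEADER_PATTERN.findall(document)}
```

Processing instructions (such as toc or markup) and semantic data (such as title or creator) are returned in the same dictionary here; a fuller implementation would likely separate them.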
Another example of a header processing instruction is one that instructs that headings should be automatically numbered, with, for example, level 4 being the top level, and for a certain number of levels down. Level 4 headings would then be given the numbers 1, 2, 3 and so on, while level 5 headings would be assigned 1.1, 1.2, 1.3, 2.1 and so on as found, and level 6 headings would be assigned 1.1.1, 1.1.2 and so on.
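The automatic heading numbering described above can be sketched with one counter per numbered level. This is a minimal illustration, not the program's own routine: the function name `number_headings` and its parameters `top_level` and `depth` are assumptions, and headings outside the numbered range are simply left without a number.

```python
def number_headings(levels, top_level=4, depth=3):
    """Return a number string (or None) for each heading level encountered in order."""
    counters = [0] * depth  # one counter per numbered level
    labels = []
    for level in levels:
        index = level - top_level
        if index < 0 or index >= depth:
            # Heading lies outside the range selected for numbering.
            labels.append(None)
            continue
        counters[index] += 1
        # A new heading restarts the numbering of all deeper levels.
        for deeper in range(index + 1, depth):
            counters[deeper] = 0
        labels.append(".".join(str(c) for c in counters[: index + 1]))
    return labels
```

With level 4 as the top level, the heading sequence 4, 5, 5, 6, 4, 5 is numbered 1, 1.1, 1.2, 1.2.1, 2, 2.1.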
The result is a transformation that includes a standardized output 36, and all structural information and numbering, including object citation numbering for a common citation system that is used in all subsequent processing. The processing method 10 can be called by each subsequent process to generate its output for use by the downstream process, or the output 36 can be saved to be read and processed by downstream processes.
Figure 2 is a diagrammatic example of an input document, and the output resulting from utilizing the processing technique of Figure 1. It shows that a document is divided into a header, which contains processing instructions and/or semantic information about the document, and the document body. The body of the input document 12 is made up of units of two types: Content Units (CU) and Note Units (NU). CUs comprise substantive Objects, and non-substantive Objects that are given a tag indicating that they should not be serialized; usually most or all Objects are substantive. Note Units (NU) include footnotes and endnotes, which may be contained within an Object, placed after an Object, or placed at the end of the document in the order in which they occur in relation to other Note Units. The output document 36 in Figure 2 shows heading levels that have been identified and assigned Levels; substantive Objects (objects that have not been given an un-serialized tag) that have been given an Object Citation Number (OCN); and Note Units (footnotes/endnotes) that have been assigned a note number (NN) and standardized in their representation. All Note Units are now contained within the Object from which they are referenced, at the location from which they are referenced (those which were not already represented in this way have been moved to their appropriate location and transformed to the appropriate representation).