
SiSU - SiSU information Structuring Universe - Structured information, Serialized Units
Ralph Amissah

Structured information, Serialized Units

SiSU - from less markup than the most elementary equivalent html, you can have more

1. Description

1.1 Outline
1.2 Short summary of features
1.3 How it works
1.4 Simple markup
1.4.1 Sparse markup requirement, try to get the most out of markup
1.4.2 Single markup file provides multiple output formats
1.4.3 Syntax relatively easy to read and remember
1.4.4 Kept simple by having a limited publishing feature set, and features identified as most important, are available across several document types
1.5 Designed with usability in mind
1.6 Code separate from content
1.7 Object citation numbering, a text or object positioning / citation system - "paragraph" (or text object) numbering, that remains same and usable across all output formats by people and machine
1.8 Handling of Dublin Core meta-tags making use of the Resource Description Framework
1.9 Easy directory management
1.10 Document Version Control Information
1.11 Table of contents
1.12 Auto-numbering of headings
1.13 Numbering and cross-hyperlinking of endnotes
1.14 "Skinnable"
1.15 Multiple Outputs
1.15.1 html - several presentations: full length & segmented; css & table based
1.15.2 EPUB
1.15.3 XML
1.15.4 ODT:ODF, Open Document Format - ISO/IEC 26300:2006
1.15.5 PDF - portrait and landscape, (through the generation of LaTeX output which is then transformed to pdf)
1.15.6 Search - loading/populating of relational database while retaining document structure information, object citation numbering and other features (currently PostgreSQL and/or SQLite)
1.15.7 Search - database frontend sample, utilising database and SiSU features, including object citation numbering (backend currently PostgreSQL)
1.15.8 Other forms
1.16 Concordance / Word Map or rudimentary index
1.17 Managed (document) directory, database, or site structure
1.18 Batch processing
1.19 Integration to superior Gnu/Linux and Unix tools
1.19.1 Backup and version control
1.19.2 Editor support
1.20 Modular design, need something new add a module

2. Markup and Output Examples

2.1 Markup examples
2.2 A few book (and other) examples
2.2.1 "Viral Spiral", David Bollier
"The Wealth of Networks", Yochai Benkler
"Two Bits", Christopher Kelty
"Free Culture", Lawrence Lessig
"CONTENT", Cory Doctorow
"Democratizing Innovation", by Eric von Hippel
"Free as in Freedom: Richard Stallman's Crusade for Free Software", by Sam Williams
"Free For All: How Linux and the Free Software Movement Undercut the High Tech Titans", by Peter Wayner
"The Cathedral and the Bazaar", by Eric S. Raymond
"Down and out in the Magic Kingdom", Cory Doctorow
"Little Brother", Cory Doctorow
"For the Win", Cory Doctorow
"Accelerando", Charles Stross
"Tainaron", Leena Krohn
"Sphinx or Robot", Leena Krohn
"War and Peace", Leo Tolstoy, PG Etext 2600
"Don Quixote", Miguel de Cervantes [Saavedra], translated by John Ormsby, PG Etext 996
"Gulliver's Travels", Jonathan Swift, transcribed from the 1892 George Bell and Sons edition by David Price, PG Etext 829
"Alice's Adventures in Wonderland", Lewis Carroll, PG Etext 11
"Through The Looking-Glass", Lewis Carroll, PG Etext 12
"Alice's Adventures in Wonderland" and "Through The Looking-Glass", Lewis Carroll, PG Etexts 11 and 12
"Gnu Public License 2", (GPL 2) Free Software Foundation
"Gnu Public License v3 - Third discussion draft", (GPLv3) Free Software Foundation
"Debian Social Contract"
"Debian Constitution v1.3", (simple/default markup)
"Debian Constitution v1.3", (markup adjusted for output to more closely match the original)
"Debian Constitution v1.2", (simple/default markup)
"Debian Constitution v1.2", (markup adjusted for output to more closely match the original)
"A Uniform Sales Terminology", Vikki Rogers and Albert Kritzer
"The Autonomous Contract" 1997 - markup sample
"The Autonomous Contract Revisited" - markup sample
"United Nations Convention on Contracts for the International Sale of Goods"
/PECL/ the "Principles of European Contract Law"
2.3 SQL - PostgreSQL, SQLite
2.4 Lex Mercatoria as an example
2.5 For good measure the markup for a document with lots of (simple) tables
2.6 And a link to the output of a reported case

3. A Checklist of Output Features

4. Introduction to SiSU Markup

4.1 Summary
4.2 Markup Examples
4.2.1 Online
4.2.2 Installed

5. Markup of Headers

5.1 Sample Header
5.2 Available Headers

6. Markup of Substantive Text

6.1 Heading Levels
6.2 Font Attributes
6.3 Indentation and bullets
6.4 Footnotes / Endnotes
6.5 Links
6.5.1 Naked URLs within text, dealing with urls
6.5.2 Linking Text
6.5.3 Linking Images
6.6 Grouped Text
6.6.1 Tables
6.6.2 Poem
6.6.3 Group
6.6.4 Code
6.7 Book index

7. Composite documents markup

Markup Syntax History

8. Notes related to Files-types and Markup Syntax

9. Commands Summary

9.1 Description
9.2 Document Processing Command Flags

10. command line modifiers

11. database commands

12. Shortcuts, Shorthand for multiple flags

12.1 Command Line with Flags - Batch Processing

Technical Information

13. Technical notes

13.1 See abandoned U.S. Provisional Patent Application

14. Diagram / Chart

14.1 The Chart
14.2 I/O
14.3 The Program
14.4 Software utilised
14.4.1 SiSU
14.4.2 SiSU Modules

15. SiSU development environment and technologies of interest, including data formats

15.1 Development environment, Debian
15.2 Programming language, Ruby
15.3 SGML & XML Family
15.3.1 SGML
15.3.2 XML Family
15.4 TeX Family
15.5 Pdf
15.6 Relational Databases, SQL
15.7 Other Databases
15.8 Text Search
15.9 Character Encoding, Unicode
15.10 Information Visualization
15.11 Metadata - semantic
15.12 Syndication, Web feed formats
15.13 Other
15.14 Editors
15.15 Version Control
15.16 Licenses

A Summary of notable events

16. A history of SiSU and its outputs including search

A Chronological history of developments on SiSU

1993

1994

1995

1996

1997

1998

1999

2000

2001

2002

2003

January
February
March
April
June
July
August
September
November
December

2004

January
February
March
April
May
June
July
August
September
October
November
December

2005

January
February
March
April
May
June
July
August
September
October
November
December

2006

January
February
March
April
May
June
July
August
September
October
November
December

2007

January
February
March
April
May
June
July
August
September
November
December

2008

January
February
April
June
September
October
November
December

2009

January
December

2010

March

FAQ, Howto, Installation, etc.

HowTo

17. Getting Help

17.1 SiSU "man" pages
17.2 SiSU built-in help
17.3 Command Line with Flags - Batch Processing

18. Setup, initialisation

18.1 initialise output directory
18.1.1 Use of search functionality, an example using sqlite
18.2 misc
18.2.1 url for output files -u -U
18.2.2 toggle screen color
18.2.3 verbose mode
18.2.4 quiet mode
18.2.5 maintenance mode intermediate files kept -M
18.2.6 start the webrick server
18.3 remote placement of output

19. Configuration Files

20. Markup

20.1 Headers
20.2 Font Face
20.2.1 Bold
20.2.2 Italics
20.2.3 Underscore
20.2.4 Strikethrough
20.3 Endnotes
20.4 Links
20.5 Number Titles
20.6 Line operations
20.7 Tables
20.8 Grouped Text
20.9 Composite Document

21. Change Appearance

21.1 Skins
21.2 CSS

Extracts from the README

22. README

22.1 Online Information, places to look
22.2 Installation
22.2.1 Debian
22.2.2 RPM
22.2.3 Source package .tgz
22.2.4 to use setup.rb
22.2.5 to use install (prepared with "Rake")
22.2.6 to use install (prepared with "Rant")
22.3 Dependencies
22.4 Quick start
22.5 Configuration files
22.6 Use General Overview
22.7 Help
22.8 Directory Structure
22.9 Configuration File
22.10 Markup
22.11 Additional Things
22.12 License
22.13 SiSU Standard

Extracts from man 8 sisu

23. Post Installation Setup

23.1 Post Installation Setup - Quick start
23.2 Document markup directory
23.2.1 Configuration files
23.2.2 Debian INSTALLATION Note
23.2.3 Document Resource Configuration
23.2.4 Skins

24. FAQ - Frequently Asked/Answered Questions

24.1 Why are urls produced with the -v (and -u) flag that point to a web server on port 8081 ?
24.2 I cannot find my output, where is it?
24.3 I do not get any pdf output, why?
24.4 Where is the latex (or some other interim) output?
24.5 Why isn't SiSU markup XML
24.6 LaTeX claims to be a document preparation system for high-quality typesetting. Can the same be said about SiSU?
24.7 Can the SiSU markup be used to prepare for a LaTex automatic building of an index to the work?
24.8 Can the conversion from SiSU to LaTeX be modified if we have special needs for the LaTeX, or do we need to modify the LaTeX manually?
24.9 How do I create GIN or GiST index in Postgresql for use in SiSU
24.10 Are there some examples of using Ferret Search with a SiSU repository?
24.11 Have you had any reports of building SiSU from tar on Mac OS 10.4?
24.12 Where is version 1?
24.13 What is the difference between version 1 and 2?

Installation

25. Installation

25.1 Debian
25.2 Other Unix / Linux
25.2.1 source tarball

26. SiSU Components, Dependencies and Notes

26.1 sisu
26.2 sisu-complete
26.3 sisu-examples
26.4 sisu-pdf
26.5 sisu-postgresql
26.6 sisu-remote
26.7 sisu-sqlite

27. Quickstart - Getting Started Howto

27.1 Installation
27.1.1 Debian Installation
27.1.2 RPM Installation
27.1.3 Installation from source
27.2 Testing SiSU, generating output
27.2.1 basic text, plaintext, html, XML, ODF, EPUB
27.2.2 LaTeX / pdf
27.2.3 relational database - postgresql, sqlite
27.3 Getting Help
27.3.1 The man pages
27.3.2 Built in help
27.3.3 The home page
27.4 Markup Samples

28. SiSU Components, Dependencies and Notes

29. Breakage and Fixes

31st October 2006 - SiSU < 0.48.3 break against Ruby > 1.8.5-3, break on cyclic include; Fixed SiSU: >=0.48.3 (see notes)
21st September 2005 - Avoid ruby-1.8.3 (2005-09-21) and (2005-10-12), Ruby Segfaults; Fixed: later versions of Ruby (see notes)

License, Standard

30. License

31. Things SiSU Standard

Download information

Download information

32. Download SiSU - Linux/Unix

SiSU Current Version - Linux/Unix
Source (tarball tar.gz)
Git (source control management)
Debian
RPM

Changelog - sisu

33. SiSU Version Manifest / changelog

Current version
3.0
Previous versions
2.7
2.6
2.5
2.4
2.3
2.2
2.1
2.0
1.0
0.71
0.70
0.69
0.68
0.67
0.66
0.65
0.64
0.63
0.62
0.61
0.60
0.59
0.58
0.57
0.56
0.55
0.54
0.53
0.52
0.51
0.50
0.49
0.48
0.47
0.46
0.45
0.44
0.43
0.42
0.41
0.40
0.39
0.38
0.37
0.36
0.35
0.34
0.33
0.32
0.31
0.30
0.29
0.28
0.27
0.26
0.25
0.24
0.23
0.22
0.21
0.20
0.18
0.16
0.14
0.12
0.10
0.8
0.6
0.4
0.2
0.1
Release

Changelog - sisu-markup-samples

34. Version Manifest / changelog - SiSU Markup Samples

Current version
2.0
1.1
1.0

Method for providing digital documents including a common citation structure

[SiSU Provisional Patent Application of 2004 based on much older idea and work on SiSU, Abandoned]

The 'Invention' described (and diagrams) by Ralph Amissah.
Provisional patent application text prepared by Stephan Filipek of Winston & Strawn LLP

35. 1. Background

36. 2. Definitions

37. 3. Brief Descriptions of the Drawings

38. 4. Detailed Description of the Preferred Embodiments

39. 5. Document Processing, examples of subsequent steps

40. 6. Advantages of the Invention

41. 7. THE CLAIMS

Post Filing Appendix

42. Post Filing Appendix: Reasons for Abandonment of Patent Process Claim

Endnotes

Endnotes

Metadata

SiSU Metadata, document information

Manifest

SiSU Manifest, alternative outputs etc.

Method for providing digital documents including a common citation structure

[SiSU Provisional Patent Application of 2004 based on much older idea and work on SiSU, Abandoned]

The 'Invention' described (and diagrams) by Ralph Amissah.
Provisional patent application text prepared by Stephan Filipek of Winston & Strawn LLP

38. 4. Detailed Description of the Preferred Embodiments

Figure 1 is a flowchart of a processing technique 10 illustrating how an input document 12 is processed in steps 14-34 to provide output 36 for downstream processes, according to the invention. (Note that in Figures 2 and 3 the processing technique 10 is equivalently referred to as "Processing Step 1".) For the processing technique 10 to function, the input document 12 must be prepared. A prepared input document 12 is one that is marked up in a way that is understood by the processing technique 10. The task of the processing technique 10 of Figure 1 is to take an input document and transform and/or prepare it for all the downstream processes. This involves a number of steps, including checking headers and implementing any instructions contained therein that are relevant to steps 14 and 22, and checking the text body for recognized tags and processing them in a pre-determined way, or as instructed in the headers. Recognized tags that are not in the form preferred by the program are converted to the preferred form of tagging (18, 26, 30), as explained below.

Most Markup for the input document is optional, but its use is recommended because it affects the presentation of the document. Although a digital document without any Markup can be processed, the system is then crippled, because it can do little more than add Object Citation Numbers. In practice at least a title would be provided and the document structure would be defined, either by a pattern defining the structure, or by specifically tagging headings with their level in the document structure.

The prepared input document 12 is in some primary text representation format, such as ASCII or UNICODE (ASCII is currently used). The markup in the prepared input document is visible as tags that are instructions to the process (rather than to the human reader), yet is easily understood by a human as a simple set of tags. Wherever possible, visual mnemonics and current text practices (in mail, chat and newsgroups) are used. Syntax highlighters can be used to make markup easily visible within the text. Extensive help is available to the user concerning the markup tags, their meaning, and their effect on processes.

In a typical document, such as an article or news story, not much markup is required. Alternative modes of markup are provided for flexibility and to simplify document preparation. The most appropriate markup depends on the nature of the contents of the document being prepared, and the form in which it is received. In a document whose structure can be defined by a pattern in a header, virtually no markup is required: a pattern-defining header, preferably together with a title, is all that is needed. Optional additional semantic information about the document, which adds to its value in several downstream processes for searching purposes, may also be included. In the text body any font and paragraph appearance modifications could be tagged, and any headings not caught by the defined pattern would be tagged with their Level.

A prepared input document must contain information about the document structure, which may be done by providing descriptions in the appropriate header, and/or by explicit Tagging of a heading with a Level. If a pattern can be used to define the document structure, as explained below, then only the pattern for the level need be provided in the document header. A combination of both pattern descriptors and manual Tagging may be used.

Nothing needs to be done for the process to assign object citation numbers, because this is done automatically by the processing technique 10. However, if a particular paragraph or object should not be numbered, it has to be marked. At present, a dash or tilde ("-" or "~") followed by a hash at the end of the line of a paragraph is used to indicate that the object is not to be numbered. If a tilde is used, the object is kept and presented but not numbered; if a dash is used, the unnumbered object is dropped from the text in output forms that do not need it (this permits the creation of dummy levels in html that do not appear in the LaTeX/pdf output).

Referring again to Figure 1, the document data is processed in a stream, one object at a time. An object, except (at present) in the case of a table, corresponds roughly to a paragraph of information, that is, any run of text that is not broken by an empty line (two carriage returns). Examples of objects include a heading, an ordinary text paragraph, and a reference to an external image placed on its own. Tables are processed according to their own rules as a single object, and numbered accordingly. Poetry and blocks of code are delimited as objects in the same way, but are processed line by line. Thus, steps 14 through 34 are performed on every Object within a document, and multiple passes of the entire document may occur to generate the data required as input to the downstream processes. The entire output 36 is then utilized as an input for further downstream processing.
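The object stream and the numbering rules described above can be sketched roughly as follows. This is a minimal illustration, not the program's actual code: the function names are hypothetical, tables, poems and code blocks are not treated specially, and the `<~n>` form of the object citation number is the one given later in this description.

```python
import re

# Hypothetical helper names; the marker syntax follows the description above:
# a dash or tilde followed by a hash at the end of a paragraph.
SKIP_MARKER = re.compile(r"\s*([-~])#\s*$")

def split_objects(text):
    """Split a document body into objects separated by one or more empty lines."""
    return [block.strip() for block in re.split(r"\n\s*\n", text) if block.strip()]

def number_objects(text):
    """Return the objects with object citation numbers appended as <~n>."""
    numbered = []
    ocn = 0
    for obj in split_objects(text):
        match = SKIP_MARKER.search(obj)
        if match:
            # "~#": keep the object, unnumbered. "-#": also unnumbered; some
            # outputs may later drop it, which this sketch does not model.
            numbered.append(obj[:match.start()])
            continue
        ocn += 1
        numbered.append(f"{obj}<~{ocn}>")
    return numbered
```

Calling `number_objects` on a body of three paragraphs, the second of which ends in `~#`, would number the first and third paragraphs consecutively and leave the second unnumbered.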

Footnotes are the other special case, as they may have different representations and are subject to their own numbering system; footnotes "belong" to the object from which they are referenced.

In each case a directory is created (at the location the program has been instructed to use), using the file name in which the input data is stored without the suffix recognized by the program (so these can contain human meaningful names). All output data that is created for storage on the file-system for a given document is placed within that document's directory.
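As a small sketch of the directory naming rule above: the function name and the suffix set below are assumptions made for illustration, since the text does not list the suffixes the program recognizes.

```python
from pathlib import Path

# Assumed suffixes for illustration; the text does not list the recognized ones.
RECOGNIZED_SUFFIXES = {".sst", ".ssm"}

def output_directory(output_root, input_file):
    """Return the per-document output directory for an input file."""
    name = Path(input_file).name
    suffix = Path(name).suffix
    # Strip the suffix only when it is one the program recognizes.
    stem = Path(name).stem if suffix in RECOGNIZED_SUFFIXES else name
    return Path(output_root) / stem
```

The function only computes the path; actually creating the directory would be a further step, for example with `Path.mkdir(parents=True, exist_ok=True)`.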

The program begins by checking the document header for processing instructions (steps 14 and 18). Headers are currently represented by 0 at the beginning of the line, followed by an open curly bracket, a tilde and the associated name (e.g. 0{~toc, 0{~markup, 0{~skin, 0{~links, or for semantic header data 0{~creator, 0{~title, 0{~date, etc.). This is then followed by any relevant information associated with the tag.
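A header line of the form described above could be recognized along these lines. This is only a sketch under the stated syntax (a 0, an open curly bracket, a tilde, the name, then the associated information); the function name is hypothetical and multi-line header values are not handled.

```python
import re

# Matches a header line such as "0{~title Some Title".
HEADER_LINE = re.compile(r"^0\{~(\w+)\s*(.*)$")

def parse_headers(lines):
    """Collect header name/value pairs from lines, ignoring non-header lines."""
    headers = {}
    for line in lines:
        match = HEADER_LINE.match(line)
        if match:
            name, value = match.groups()
            headers[name] = value.strip()
    return headers
```

Lines that do not start with the header prefix are simply skipped, so the same function can be run over a whole file.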

  • The most important header processing instruction from a processing perspective (if it exists, as it is optional) is Pattern Matching information about document levels (14). Established computer techniques are used for matching patterns in text, and for defining what pattern to use for each level to mark up the structure of the document. If, for example, a line of text of level 4 starts with the word Article, or Chapter, this can be defined, and level 4 of the document is taken care of. Alternatively, if level 5 is characterized by a line that starts with a number of the pattern digit, stop, digit, stop, that can be defined, and level 5 would be taken care of. The technique currently used for pattern matching is Regular Expressions. Instead of, or in addition to, providing a structuring pattern matching header instruction, it is possible to manually specify that a given line of text is a heading of a particular level, by marking it with the appropriate tag. The tag currently used, placed at the beginning of a line containing a heading of a given level, is a digit followed by an open curly brace, from level 1 up to level 8 (1{ to 8{ ). Levels 1 to 3 are document heading dividers; a typical example might be the text name, divided into level 2 Parts and level 3 Sections. Levels 4 to 8 are text headings, with a typical example of level 4 being chapters or articles, and sub-levels of this being sub-titles within level 4. However, as instructions are provided, the user may determine the level assigned to each type of heading, such as sections, chapters, articles.
  • Another example of a header processing instruction is one that instructs that headings should be automatically numbered, with, for example, level 4 being the top level, and for a certain number of levels down. Level 4 would then be given the numbers 1, 2, 3 and so on, while level 5 would be assigned 1.1, 1.2, 1.3 and 2.1 as found, and level 6 would be assigned 1.1.1, 1.1.2 and so on.
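The two instructions above, level detection and automatic numbering, can be sketched together as follows. The patterns, names and the `top_level` parameter are hypothetical choices for illustration; only the "Article or Chapter" and "digit, stop, digit, stop" examples and the `1{` to `8{` tags come from the description.

```python
import re

# Hypothetical per-level patterns, as a header instruction might define them.
LEVEL_PATTERNS = {
    4: re.compile(r"^(Article|Chapter)\b"),
    5: re.compile(r"^\d+\.\d+\.\s"),
}
# Manual tag: a digit 1-8 followed by an open curly brace at the start of a line.
MANUAL_TAG = re.compile(r"^([1-8])\{\s*(.*)$")

def heading_level(line):
    """Return (level, text) for a heading line, or (None, line) for body text."""
    manual = MANUAL_TAG.match(line)
    if manual:
        return int(manual.group(1)), manual.group(2)
    for level, pattern in LEVEL_PATTERNS.items():
        if pattern.match(line):
            return level, line
    return None, line

def auto_number(headings, top_level=4):
    """Number (level, text) headings hierarchically, starting at top_level."""
    counters = []
    numbered = []
    for level, text in headings:
        depth = level - top_level  # 0 for the top numbered level
        if depth < 0:
            numbered.append(text)  # levels above top_level stay unnumbered
            continue
        # Pad if a level was skipped, then bump this depth and reset deeper ones.
        counters = counters[:depth + 1] + [0] * (depth + 1 - len(counters))
        counters[depth] += 1
        numbered.append(".".join(str(n) for n in counters) + " " + text)
    return numbered
```

In this sketch manual tags take precedence over patterns, which is a design choice of the example rather than something the description specifies.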

  • Processing instructions may refer to complicated instruction sets, or to alternative default values (comprehensive or otherwise) that are stored in another file; or just change the behavior of the program. They may be used to produce extensive changes to the output that is produced.
  • Some other processing instructions are latent at this stage, being used by a downstream process that recognizes them. For example, a processing instruction may inform that the document is to have a particular appearance, possibly belonging to a different organization than the main body of texts (and this would make reference to an alternative template or process to apply in the downstream preparation of the document for relevant output modules). Another example would be an instruction that is specific to a particular downstream module, such as an instruction concerning what separation should be given between levels by the LaTeX process, e.g. deciding that each Chapter is to start on a new page, and each Section in the next column (if there is only one column per page, then on the next page).
  • Document headers may also contain semantic information about the document, such as its title, who it was created by, when it was created, its subject matter, language and so on. Nothing is done to these in the initial processing method shown in Figure 1, unless there are none at all, in which case the first heading/title is used to create a semantic title. In the absence of an explicitly defined first top level heading/title, the first content line/paragraph is turned into such a title and used for the semantic title tag. This semantic information is attached to the document it belongs to. It is particularly valuable for additional search possibilities in relational databases, and in the creation of RSS feeds and the like. Adding this data to the document is optional, but extremely valuable. For current purposes the Dublin Core is used as a defined standard, but other systems could easily be accommodated.
  • Document headers, semantic information, and processing instructions may be arbitrarily extended. They are just ignored if not understood by downstream processes.
  • Object Citation Numbers are assigned 36 to every object found, unless there is a tagged instruction that a particular object should be skipped (this is usually done on a per object basis). This includes, for example, each heading, each paragraph, each table, each image. The current representation of an object citation number (step 32) is a pair of angle brackets containing a tilde and the number assigned to the Object, appended to the end of the Object (the line, paragraph and the like representing the Object). So the first object would be represented by <~1>.
  • Footnotes/endnotes are numbered, and if they are not in the standard form, which is currently to embed them within the text at the point at which they occur, they are transformed from the alternative input representation (step 32). The current standard form is to enclose the content of an endnote within curly braces, with a tilde before the opening brace and a tilde after the closing brace, as follows: ~{an endnote}~ . Alternative input representations (which are converted to the standard representation described above) place a marker where the endnote occurs within the text, and have the endnote, identified as such, placed after the paragraph containing the marker, in the order in which it occurs. For practical reasons (related to human legibility of input), this will usually be either immediately after the paragraph that contains the marker, or at the end of the document (in the order in which the endnotes occur within the text).
  • The possibilities for markup in a prepared input document are fairly flexible. To reduce the complexity of downstream processing, the first processing method standardizes the representations/markup for various things. For example, where there may be alternative ways of preparing the input document for font faces such as emphasis, bold, underscore, or alternative ways of representing footnotes, their representation is standardized by the processing technique illustrated in Figure 1.
  • Pre-processing of a document can be done on text from one or more other sources, such as a text or word processor that is able to save in a format 12 understood by the processing technique 10, or a process that combines term-sheets and standard form documents to produce an input file 12 (represented as reference number 1000 in Figure 3), or as described by the relational database, to produce the format required for a prepared input document 12 ready for subsequent processing (represented as reference number 700 in Figure 3).
  • The result is a transformation that includes a standardized output 36, and all structural information and numbering, including object citation numbering for a common citation system that is used in all subsequent processing. The processing method 10 can be called by each subsequent process to generate its output for use by the downstream process, or the output 36 can be saved to be read and processed by downstream processes.
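The handling of endnotes in the standard embedded form, ~{an endnote}~, might be sketched as below. The function name and the `[n]` marker style are arbitrary choices for this illustration, and the alternative input representations are not covered.

```python
import re

# Standard embedded form described above: ~{an endnote}~
ENDNOTE = re.compile(r"~\{(.*?)\}~")

def number_endnotes(objects):
    """Replace embedded endnotes with numbered markers; return (objects, notes)."""
    notes = []

    def replace(match):
        notes.append(match.group(1).strip())
        return f"[{len(notes)}]"  # marker style is an arbitrary choice here

    marked = [ENDNOTE.sub(replace, obj) for obj in objects]
    return marked, notes
```

Because the notes are numbered in the order they are encountered across all objects, the numbering is independent of the object citation numbers, in line with endnotes having their own numbering system.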

Figure 2 is a diagrammatic example of an input document, and of the output resulting from the processing technique of Figure 1. It shows that a document is divided into a header, which contains processing instructions and/or semantic information about the document, and the document body. The body of the input document (12) is made up of units of two types: Content Units (CU) and Note Units (NU). CUs are substantive Objects, together with non-substantive Objects, which are given a tag indicating that they should not be serialized; usually most or all Objects are substantive. Note Units (NU) include footnotes and endnotes, each of which may be contained within an Object, placed after an Object, or placed at the end of the document in the order in which it occurs in relation to other Note Units. The output document 36 in Figure 2 shows heading levels that have been identified and assigned Levels; substantive Objects (objects that have not been given an un-serialized tag) that have been given an Object Citation Number (OCN); and Note Units (footnotes/endnotes) that have been assigned a note number (NN) and standardized in their representation. All Note Units are now contained within the Object from which they are referenced, and at the location from which they are referenced (those which were not already represented in this way have been moved to their appropriate location and transformed to the appropriate representation).






