SiSU - SiSU information Structuring Universe - Structured information, Serialized Units Ralph Amissah
Rights: Copyright © 2009 Ralph Amissah
SiSU is a flexible document preparation, generation, publishing and search system. 1
SiSU ("SiSU information Structuring Universe" or "Structured information, Serialized Units"), 2 is a Unix command line oriented framework for document structuring, publishing and search, featuring minimalistic markup, multiple standard outputs, a common citation system, and granular search.
Using markup applied to a document, SiSU can produce plain text, HTML, XHTML, XML, OpenDocument, EPUB, LaTeX or PDF files, and populate an SQL database with objects 3 (equating generally to paragraph-sized chunks) so searches may be performed and matches returned with that degree of granularity (e.g. your search criteria are met by these documents and at these locations within each document). Document output formats share a common object numbering system for locating content. This is particularly suitable for "published" works (finalized texts as opposed to works that are frequently changed or updated) for which it provides a fixed means of reference to content.
SiSU is the data/information structuring and transforming tool, that has resulted from work on one of the oldest law web projects. It makes possible the one time, simple human readable markup of documents, that SiSU can then publish in various forms, suitable for paper 4 , web 5 and relational database 6 presentations, retaining common data-structure and meta-information across the output/presentation formats. Several requirements of legal and scholarly publication on the web have been addressed, including the age old need to be able to reliably cite/pinpoint text within a document, to easily make footnotes/endnotes, to allow for semantic document meta-tagging, and to keep required markup to a minimum. These and other features of interest are listed and described below. A few points are worth making early (and will be repeated a number of times):
(i) The SiSU document generator was the first to place material on the web with a system that makes citation possible across different document types, using paragraph, or rather object citation numbering, 7 a text positioning system available for the pinpointing of text. Introduced in 1997, it is a simple idea from which much benefit derives, and SiSU remains today, to the best of my knowledge, the only multiple format e-book/electronic-document system on the web that gives you this possibility (including for relational databases).
(ii) Markup is done once for the multiple formats produced.
(iii) Markup is simple, and human readable (with a little practice); in almost all cases less and simpler markup is required than for basic html. In any event the markup required is very much simpler than the html, EPUB, LaTeX, [lout], structured XML, ODF (OpenDocument), PostgreSQL or SQLite feed etc. that you can have SiSU generate for you.
(iv) SiSU is a batch processor, dealing with as many files as you need to generate at a time.
(v) Scalability is dependent on your file system (in my case Reiserfs), the database (currently Postgresql and/or SQLite) and your hardware.
SiSU Sabaki 8 (or just SiSU) is the provisional name given to the software described here that helps structure documents for web and other publication. The name SiSU is a loose anagram for something along the lines of "SiSU is structuring unit", or "SiSU, information structuring unit" or "simple - information structuring unit" or the more descriptive "Structured information, Serialized Units" or what it may be directed towards "*semantic* and information structuring universe", 9 tongue in cheek, only just. Guess I'll get away with "Simple - information Structuring Universe". SiSU is also a Finnish word roughly meaning guts, inner strength and perseverance. 10
SiSU was born of the need to find a way, with minimal effort, and for as wide a range of document types as possible, to produce high quality publishing output in a variety of document formats. As such it was necessary to find a simple document representation that would work across a large number of document types, and the most convenient way(s) to produce acceptable output formats. The project leading to this program was started in 1993 (together with the trade law project now known as Lex Mercatoria) as an investigation of how to effectively/efficiently place documents on the web. The unified document handling, together with features such as paragraph numbering, endnote handling and tables... appeared in 1996/97. SiSU was originally written in Perl, 11 and converted to Ruby, 12 in 2000, one of the most impressive programming languages in existence! In its current form it has been written to run on the Gnu/Linux platform, and in particular on Debian, 13 taking advantage of many of the wonderful projects that are available there.
SiSU markup is based on requiring the minimum markup needed to determine the structure of a document. (This can be as little as saying in a header to look for the word Book at a specified level and the word Chapter at another level). SiSU then breaks a document into its smallest parts (at a heading, and paragraph level) while retaining all structural information. This break up of the document and information on its structure is taken advantage of in the transformations made in generating the very different output types that can be created, and in providing as much as can be for what each output type is best at doing, e.g. LaTeX (professional document typesetting, easy conversion to pdf or Postscript), EPUB, XML (in this case, structural representation), ODF (OpenDocument [experimental]), SQL (e.g. document search; representing constituent parts of documents based on their structure, headings, chapters, paragraphs as required; user control). 14
From markup that is simpler and more sparse than html you get:
For more see the short summary of features provided below.
SiSU processes files with minimal tagging to produce various document outputs including html, EPUB, ODF, LaTeX (which is converted to pdf) and if required loads the structured information into an SQL database (PostgreSQL and SQLite have been used for this). SiSU produces an intermediate processing format. 15
SiSU was originally used in constructing Lex Mercatoria ‹http://lexmercatoria.org/› or ‹http://www.jus.uio.no/lm/› (one of the oldest law web sites), and considerable thought went into producing output that would be suitable for legal and academic writings (that do not have formulae) given the limitations of html, and publication in a wide variety of "formats", in particular in relation to the convenient and accurate citation of text. However, the construction of Lex Mercatoria uses only a fraction of the features available from SiSU today, /vis/ generation of flat file structures, rather than in addition the building of ("granular") SQL database content, (at an object level with relevant relational tables, and other outputs also available).
(i) markup syntax: (a) simpler than html, (b) mnemonic, influenced by mail/messaging/wiki markup practices, (c) human readable, and easily writable,
(ii) (a) minimal markup requirement, (b) single file marked up for multiple outputs,
notes:
* documents are prepared in a single UTF-8 file using a minimalistic mnemonic syntax. Typical literature, documents like "War and Peace" require almost no markup, and most of the headers are optional.
* markup is easily readable/parsed by the human eye, (basic markup is simpler and more sparse than the most basic html), [this may also be converted to XML representations of the same input/source document].
* markup defines document structure (this may be done once in a header pattern-match description, or for heading levels individually); basic text attributes (bold, italics, underscore, strike-through etc.) as required; and semantic information related to the document (header information, extended beyond the Dublin core and easily further extended as required); the headers may also contain processing instructions.
(iii) (a) multiple outputs, primarily industry established and institutionally accepted open standard formats, include amongst others: plaintext (UTF-8); html; EPUB; (structured) XML; ODF (Open Document text); LaTeX; PDF (via LaTeX); SQL type databases (currently PostgreSQL and SQLite). Also produces: concordance files; document content certificates (md5 or sha256 digests of headings, paragraphs, images etc.) and html manifests (and sitemaps of content). (b) takes advantage of the strengths implicit in these very different output types, (e.g. PDFs produced using typesetting of LaTeX, databases populated with documents at an individual object/paragraph level, making possible granular search (and related possibilities))
(iv) outputs share a common numbering system (dubbed "object citation numbering" (ocn)) that is meaningful (to man and machine) across various digital outputs whether paper, screen, or database oriented, (PDF, html, EPUB, XML, Opendocument, sqlite, postgresql), this numbering system can be used to reference content.
(v) SQL databases are populated at an object level (roughly headings, paragraphs, verse, tables) and become searchable with that degree of granularity, the output information provides the object/paragraph numbers which are relevant across all generated outputs; it is also possible to look at just the matching paragraphs of the documents in the database; [output indexing also works well with search indexing tools like Hyper Estraier].
(vi) use of semantic meta-tags in headers permit the addition of semantic information on documents, (the available fields are easily extended)
(vii) creates organised directory/file structure for (file-system) output, easily mapped with its clearly defined structure, with all text objects numbered, you know in advance where in each document output type, a bit of text will be found (e.g. from an SQL search, you know where to go to find the prepared html output or PDF etc.)... there is more; easy directory management and document associations, the document preparation (sub-)directory may be used to determine output (sub-)directory, the skin used, and the SQL database used,
(viii) "Concordance file" wordmap, consisting of all the words in a document and their (text/ object) locations within the text, (and the possibility of adding vocabularies),
(ix) document content certification and comparison considerations: the document and each object within it stamped with an md5 hash making it possible to easily check or guarantee that the substantive content of a document is unchanged.
(x) SiSU's minimalist markup makes for meaningful "diffing" of the substantive content of markup-files,
(xi) easily skinnable, document appearance on a project/site wide, directory wide, or document instance level easily controlled/changed,
(xii) in many cases a regular expression may be used (once in the document header) to define all or part of a document's structure, obviating or reducing the need to provide structural markup within the document,
(xiii) prepared files may be batch processed; documents produced are static files so this needs to be done only once, but may be repeated for various reasons as desired (updated content, addition of new output formats, updated technology document presentations/representations)
(xiv) possible to pre-process, which permits: the easy creation of standard form documents, and templates/term-sheets, or; building of composite documents (master documents) from other sisu marked up documents, or marked up parts, i.e. import documents or parts of text into a main document should this be desired
(xv) there is a considerable degree of future-proofing, output representations are "upgradeable", and new document formats may be added: (a) modular, (thanks in no small part to Ruby) another output format required, write another module.... (b) easy to update output formats (eg html, XHTML, EPUB, LaTeX/PDF produced can be updated in program and run against whole document set), (c) easy to add, modify, or have alternative syntax rules for input, should you need to,
(xvi) scalability, dependent on your file-system (ext3, Reiserfs, XFS, whatever) and on the relational database used (currently Postgresql and SQLite), and your hardware,
(xvii) only marked up files need be backed up, to secure the larger document set produced,
(xviii) document management,
(xix) Syntax highlighting for SiSU markup is available for a number of text editors.
(xx) remote operations: (a) run SiSU on a remote server, (having prepared sisu markup documents locally or on that server, i.e. this solution where sisu is installed on the remote server, would work whatever type of machine you chose to prepare your markup documents on), (b) generated document outputs may be posted by sisu to remote sites (using rsync/scp) (c)document source (plaintext utf-8) if shared on the net may be identified by its url and processed locally to produce the different document outputs.
(xxi) document source may be bundled together (automatically) with associated documents (multiple language versions or master document with inclusions) and images and sent as a zip file called a sisupod, if shared on the net these too may be processed locally to produce the desired document outputs, these may be downloaded, shared as email attachments, or processed by running sisu against them, either using a url or the filename.
(xxii) for basic document generation, the only software dependency is Ruby, and a few standard Unix tools (this covers plaintext, html, EPUB, XML, ODF, LaTeX). To use a database you of course need that, and to convert the LaTeX generated to PDF, a LaTeX processor like tetex or texlive.
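The concordance file of item (viii) above can be pictured with a minimal Python sketch. It is an illustration only, not SiSU's own (Ruby) code: the function name `concordance`, the tokenising rule and the sample objects are all assumptions made for the example. It maps each word to the object numbers in which it occurs.

```python
import re
from collections import defaultdict

def concordance(objects):
    """Map each lower-cased word to the sorted object numbers containing it.

    `objects` is a list of text objects in document order; object numbers
    are assumed to start at 1, mirroring sequential object numbering.
    """
    index = defaultdict(set)
    for ocn, text in enumerate(objects, start=1):
        # Hypothetical tokeniser: runs of ASCII letters and apostrophes.
        for word in re.findall(r"[a-z']+", text.lower()):
            index[word].add(ocn)
    return {word: sorted(ocns) for word, ocns in sorted(index.items())}

# Sample objects invented for the illustration.
objects = ["Contract Law", "A contract requires consent.", "Consent must be free."]
wordmap = concordance(objects)
print(wordmap["contract"])  # → [1, 2]
```

Because the locations are object numbers rather than page numbers, the same wordmap would apply to every output format of the document.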
as a developer's tool it is flexible and extensible
SiSU was developed in relation to legal documents, and is strong across a wide variety of texts (law, literature...). SiSU handles images but is not suitable for formulae/ statistics, or for technical writing at this time.
SiSU has been developed and has been in use for several years. Requirements to cover a wide range of documents within its use domain have been explored.
Some modules are more mature than others, the most mature being html and LaTeX / pdf. PostgreSQL and search functions are useable and together with /ocn/ unique (to the best of my knowledge). The XML output document set is "well formed" but largely proof of concept.
SiSU markup is fairly minimalistic, it consists of: a (largely optional) document header, made up of information about the document (such as when it was published, who authored it, and granting what rights) and any processing instructions; and markup within text which is related to document structure and typeface. SiSU must be able to discern the structure of a document, (text headings and their levels in relation to each other), either from information provided in the instruction header or from markup within the text (or from a combination of both). Processing is done against an abstraction of the document comprising information on the document's structure and its objects, 16 which the program serializes (providing the object numbers) and which are assigned hash sum values based on their content. This abstraction of information about document structure, objects, (and hash sums), provides considerable flexibility in representing documents in different ways and for different purposes (e.g. search, document layout, publishing, content certification, concordance etc.), and makes it possible to take advantage of some of the strengths of established ways of representing documents, (or indeed to create new ones).
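The idea of assigning hash sums to serialized objects can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not SiSU's implementation: the function name `certify` and the sample texts are hypothetical, and md5 is used only because the text mentions it (sha256 would work the same way via `hashlib.sha256`).

```python
import hashlib

def certify(objects):
    """Return (object number, md5 hex digest) pairs for a list of text objects."""
    return [
        (ocn, hashlib.md5(text.encode("utf-8")).hexdigest())
        for ocn, text in enumerate(objects, start=1)
    ]

# Two hypothetical versions of the same short document.
original = ["Chapter 1", "It was a dark night.", "The end."]
revised = ["Chapter 1", "It was a dark and stormy night.", "The end."]

# Compare digests object by object to find where the content differs.
changed = [
    ocn
    for (ocn, old), (_, new) in zip(certify(original), certify(revised))
    if old != new
]
print(changed)  # → [2]
```

Comparing digests per object, rather than per file, is what lets a change be pinned to a specific numbered location.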
SiSU markup is based on requiring the minimum markup needed to determine the structure of a document. (This can be as little as saying in a header to look for the word Book at a specified level and the word Chapter at another level). SiSU then breaks a document into its smallest parts (at a heading, and paragraph level) while retaining all structural information. This break up of the document and information on its structure is taken advantage of in the transformations made in generating the very different output types that can be created, and in providing as much as can be for what each output type is best at doing, e.g. LaTeX (professional document typesetting, easy conversion to pdf or Postscript), EPUB, XML (in this case, structural representation), ODF (OpenDocument), SQL (e.g. document search; representing constituent parts of documents based on their structure, headings, chapters, paragraphs as required; user control). 17
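As a rough illustration of determining structure by pattern matching (looking for the word Book at one level and Chapter at another), here is a small Python sketch. The patterns, the function name `classify` and the sample lines are assumptions made for the example; SiSU's actual header syntax for this is not reproduced here.

```python
import re

# Hypothetical rule set: level 1 headings begin with "Book",
# level 2 headings begin with "Chapter".
HEADING_PATTERNS = [
    (1, re.compile(r"^Book\s+\S+")),
    (2, re.compile(r"^Chapter\s+\S+")),
]

def classify(line):
    """Return ('heading', level) or ('paragraph', None) for one line of text."""
    for level, pattern in HEADING_PATTERNS:
        if pattern.match(line):
            return ("heading", level)
    return ("paragraph", None)

# Sample lines invented for the illustration.
text = ["Book One", "Chapter 1", "Some opening paragraph text.", "Chapter 2"]
structure = [classify(line) for line in text]
print(structure)
```

With rules of this kind stated once, the body of a conventionally laid out book needs little or no structural markup of its own.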
One of its strengths is that very small amounts of initial tagging is required for the program to generate its output.
This is a basic markup example:
Emphasis has been on simplicity and minimalism in markup requirements. Design philosophy is to try to keep the amount of markup required low, for whatever has been determined to be acceptable output. 19
SiSU's markup is more minimalistic and simpler than (the equivalent) html and for it, you get considerably more than just html, as this preparation gives you all available output formats, upon request.
For each document, there is only one (input, minimalistically marked up) file from which all the available output types are generated. 20
Eg. the markup example:
Produces the following output:
(and in addition to these: PostgreSQL, SQLite, texinfo and YAML 33 versions if desired)
Syntax is kept simple and mnemonic. 34
To keep SiSU markup sparse and simple SiSU deliberately provides a limited publishing feature set, including: indent levels; bold; italics; superscript; subscript; simple tables; images; tables of contents; and endnotes. In most cases these are available across the different output formats.
The publishing feature set may be expanded as required.
Output is designed to be uniform, easy to read, navigate and cite.
Code 35 is separated from content. This means that when changes are desired in the output presentation, the code that produces them, and not the marked up text data set (which could be thousands of documents) is modified. Separating code from content makes large scale changes to output appearance trivial, and permits the easy addition of new output modules.
Object citation numbering is a simple object (text) positioning and citation system that is human relevant and machine useable, used by SiSU for all manner of presentations, and that is available for use in all text mappings. It is based on the automated sequential numbering of objects (roughly paragraphs, (headings, tables, verse) or other blocks of text or images etc.). The text positioning system (in which I claim copyright) is invaluable for publishing requiring the citing of text across multiple output formats, and for the general mapping of text within a document:
I claim copyright on the system I use which is the most basic of all, numbering all text in headings and paragraphs sequentially (with tables and images being treated as a single paragraph) and only footnotes/endnotes not following this numbering, as their position in text is not strictly determined, (a change from footnotes to endnotes would change their numbering), footnotes instead "belong" to the paragraph from which they are referenced, and have sequential numbers of their own.
SiSU has a paragraph numbering system, that remains the same regardless of the output format. This provides an effective means of citation, pinpointing text accurately in all output formats, using the same reference. This is particularly useful where text has to be located across different output formats - for example once html is printed the number of pages and pages on which given text is found will vary depending on the browser, its settings the font size setting etc. Similarly SiSU produces pdf in different forms, eg. on the example site Lex Mercatoria as portrait and landscape documents - here too page numbering varies, but paragraph numbering is the same, vis a vis all versions of the text (portrait and landscape pdf and the html versions of the text, and as stored (with "paragraphs" as records) to the PostgreSQL or SQLite database).
These numbers are placed in the text margins and are intended to be independent of and not to interfere with authors' tagging. [The citation system (object citation numbering system, automated "paragraph numbering") which is automatically generated and is common and identical across all document formats] The paragraph numbering system is more accurately described as a (text) object numbering system, as headings are also numbered... all headings and paragraphs are numbered sequentially. Endnotes are automatically numbered independently and rather "belong" to the paragraph from which they are referenced, as an endnote does not (necessarily) form a part of a document's sequence, (they may be produced as either endnotes or footnotes (or both depending on what output you choose to look at - if you take the segmented html version document provided as an example, you will find that the endnotes are placed both at the end of each section, and in a separate section of their own called endnotes, and these are hyper-linked)). An attractive feature of providing citation numbering in this way is that it is independent of the document structure... it remains the same regardless of what is done about the document structure.
The rules have been kept very simple: unique incremental object citation numbers are assigned to headings, paragraphs, verse, tables and images. It is possible to manually override this feature on a per heading or comment basis, though this should be used exceptionally; it may be of use where there is a substantive text, and the addition of a minor comment by the publisher that should not be mapped as part of the text.
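The numbering rules can be sketched as follows. This Python fragment is only an illustration under assumptions (the function name `number_objects`, the tuple layout of the input and the sample blocks are invented): every heading, paragraph or table gets the next object number, while endnotes keep a separate counter and are attached to the object that references them.

```python
def number_objects(blocks):
    """Assign sequential object numbers to blocks.

    `blocks` is a list of (kind, text, notes) tuples. Endnotes do not take
    part in the main sequence: they get their own running numbers and stay
    attached to the object from which they are referenced.
    """
    numbered, ocn, note_no = [], 0, 0
    for kind, text, notes in blocks:
        ocn += 1
        attached = []
        for note in notes:
            note_no += 1
            attached.append((note_no, note))
        numbered.append({"ocn": ocn, "kind": kind, "text": text, "notes": attached})
    return numbered

# Sample blocks invented for the illustration.
blocks = [
    ("heading", "Introduction", []),
    ("paragraph", "First paragraph.", ["A note."]),
    ("table", "col a | col b", []),
    ("paragraph", "Second paragraph.", ["Another note.", "A third note."]),
]
doc = number_objects(blocks)
print([obj["ocn"] for obj in doc])  # → [1, 2, 3, 4]
```

Moving the notes from footnotes to endnotes would not disturb the object numbers, which is the property the text relies on.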
The object citation number markers contain additional numbering information with regard to the document structure, that can be used for alternative presentations, including such detail as the type of object (heading, paragraph, table, image, etc.), numbered sequentially.
An advantage is that the numbering remains the same regardless of document structure.
Text object ("paragraph") numbering is the same for all output versions of the same document, vis html, epub, pdf, pgsql, etc.
In the relational database, as individual text objects of a document are stored (and indexed) together with object numbers, and all versions of the document have the same numbering, the results of searches may be tailored just to provide the location of the search result in all available document formats.
Note: there is a bug in the released behaviour of object citation numbering, (not certain when it was introduced): tables should be numbered, i.e. each table gets an ocn, required amongst other things for the relational database. This will be corrected in a future release. Citation numbering of existing documents that contain tables will change.
This provides the means of providing semantic information about a document, both as computer processable meta-tags, and as human readable information that may be of value for classification purposes.
This information is provided both in html metatags, and (where available) under the section titled "Document Information - Metadata", near the end of a document, for example in the segmented html version of this text at: ‹http://www.jus.uio.no/sisu/SiSU/metadata.html›
1. Directory file association, skins and special image management, made simpler. 39
The last part of the name of the work directory in which markup is being done, or rather from where SiSU is run in order to generate document output, is used in determining the sub-directory name for output files, that is created in the document output directory. This provides a rather easy way to associate documents e.g. of a given subject, or by owner.
/www/docs
  /intellectual_property
  /arbitration
  /contract_law
/www/docs
  /ralph
  /sisu
all are placed in their own directories within the directory structure created. Similar rules are used in the creation of sql type databases (though they can be overridden).
There are a couple of further associations with these directories.
Directory wide skins.
Directory specific images.
2. If there is a "directory skin", that is a skin of the same name as the directory, it is used in the generation of the documents within it, rather than the default skin, unless the document has a specific skin associated with it.
a. default skin (always available)
b. directory skin (precedence over default if exists)
c. document skin (takes precedence wherever document requests a specific skin)
Skins are defined in the document skin directory, and if a directory association is desired a softlink is made to the relevant skin. A directory skin is auto loaded if a skin exists with the same name as the directory stub (and there is no specific document skin).
3. If the working directory has within it a sub-directory called image_local, the images within that directory are used for references to images, that are not part of the default site build.
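The directory association described in item 1 above (the last part of the work directory name determines the output sub-directory) can be sketched in Python. The function name `output_dir`, the default output root and the example work directory are assumptions for illustration; how SiSU itself resolves these paths is not shown here.

```python
from pathlib import PurePosixPath

def output_dir(work_dir, output_root="/www/docs"):
    """Derive the output sub-directory from the last part of the work directory."""
    stub = PurePosixPath(work_dir).name  # e.g. "arbitration"
    return str(PurePosixPath(output_root) / stub)

# Hypothetical work directory; only its final component matters.
print(output_dir("/home/ralph/work/arbitration"))  # → /www/docs/arbitration
```

Documents prepared in work directories that share a final name therefore land in the same output sub-directory, which is what groups them by subject or owner.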
The possibility of citing an exact document version.
Permits the inclusion of document version control information to the document body and metatags. 40 This provides a much more certain method of referring to the exact version of a particular document, (assuming that the document is from a trusted source, that will retain earlier versions of a document). 41
This information (where available) is provided under the section of the document titled "Document Information - MetaData", near the end of a document, for example in the segmented html version of this text at: ‹http://www.jus.uio.no/sisu/SiSU/metadata.html›
SiSU produces a rudimentary table of contents based on document headings.
Headings can be automatically numbered, (and automatically named for hyper-linking)
SiSU can automatically number footnotes/endnotes. This is the default operation where no number is provided.
Footnotes/endnotes may also be manually numbered. Where a number, or numbers are provided for a footnote/endnote, this does not increment the automatic footnote/endnote number counter.
In the html output footnotes/endnotes are cross-hyper-linked (to their reference point and vice versa). In the pdf output footnotes are linked from their reference point only.
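The interaction between automatic and manual note numbering can be shown with a short Python sketch. It is an illustration only: the function name `number_notes` and the input layout are assumptions. The point it demonstrates is the one stated above, that a manually supplied number does not advance the automatic counter.

```python
def number_notes(notes):
    """Label notes given as (manual_number_or_None, text) pairs.

    Notes without a manual number take the next automatic number;
    manually numbered notes keep their label and leave the counter alone.
    """
    counter = 0
    labelled = []
    for manual, text in notes:
        if manual is None:
            counter += 1
            labelled.append((str(counter), text))
        else:
            labelled.append((str(manual), text))  # counter left untouched
    return labelled

# Sample notes invented for the illustration.
notes = [(None, "first"), (7, "manually numbered"), (None, "second")]
labelled = number_notes(notes)
print(labelled)  # → [('1', 'first'), ('7', 'manually numbered'), ('2', 'second')]
```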
SiSU is skinnable, on a site-wide, directory-wide and per document basis, so different looking versions of things may be produced with little difficulty. There is a default skin which may be modified, as the background site skin, and each working directory may have a skin associated with it, as may each individual document. The hierarchy of application is document, directory, then site... ie if a document skin exists it gets precedence.
Whilst it is skinnable, the default output styles are selected to work across the widest possible range of document types.
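The order of precedence (document skin, then directory skin, then the default) can be written down as a small Python function. The name `select_skin`, its parameters and the sample skin names are hypothetical; the sketch only mirrors the hierarchy described above.

```python
def select_skin(document_skin, directory_stub, available_skins, default="default"):
    """Choose a skin: document skin first, then directory skin, then default.

    `document_skin` is the skin a document asks for (or None);
    `directory_stub` is the last part of the working directory name;
    `available_skins` is the set of skin names that exist.
    """
    if document_skin:
        return document_skin
    if directory_stub in available_skins:
        return directory_stub
    return default

skins = {"arbitration", "lexmercatoria"}
print(select_skin(None, "arbitration", skins))             # → arbitration
print(select_skin("lexmercatoria", "arbitration", skins))  # → lexmercatoria
print(select_skin(None, "contract_law", skins))            # → default
```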
From markup that is simpler and more sparse than html you get:
As many output formats/presentations as one cares to write modules for - several types of html (e.g. structure based on css, or structure based on tables); LaTeX/pdf and Lout/pdf; pgsql other databases easily added; yaml...
Most documents are produced in single and segmented html versions, described below:
The Scroll (full length text presentations)
The full length of the text in a single scrollable document. 43 As a rule the files they are saved in are named: /doc/ or more precisely doc.html
For various reasons texts may only be provided in this form (such as this one which is short), though most are also provided as segmented texts.
"Scroll" is a reference to the historical scroll, a single long document/ parchment, and also no doubt to what you will have to do to get to the bottom of the text. 44
The Segmented Text
The text divided into segments (such as articles or chapters depending on the text) 45 As a rule the files they are saved in are named: /toc/ and /index/ or more precisely toc.html and index.html
If you know exactly what you are looking for, loading a segment of text is faster (the segments being smaller). Occasionally longer documents such as the WTA 1994 ‹http://www.jus.uio.no/lm/wta.1994/toc› are only provided in segmented form.
Cascading Style Sheet, and Table based html
SiSU outputs html, two current standard forms available are:
and
table based [largely discontinued] 46
The html is tested across several browsers
I would like to remind you that there are other excellent browsers out there, many of which have long supported practical features like tabbing.
The html is tested across several browsers, including:
Also lighter weight graphical browsers:
And for console/text browsing:
The html tables output is rendered more accurately across a wider variety of browsers, including older versions (than the html css output).
SiSU generates EPUB documents.
SiSU generates well formed XML, and multiple versions. An XML SAX version with a flat/shallow structure, and XML DOM version with a deeper (embedded) structure. There is also a released working xhtml module. Examples of SAX and DOM versions are provided within this document.
SiSU generates Open Document Output format.
SiSU outputs LaTeX if required, which is easily transformed to PDF. 60 PDF documents are generated on the site from the same source files and Ruby program that produce html. Landscape oriented pdf has been introduced, providing easier screen viewing; these documents are also currently formatted to have fewer pages than their portrait equivalents (saving paper).
SiSU (from the same markup input file) automatically feeds into PostgreSQL 64 and/or SQLite 65 database (could be any other of the better relational databases) 66 - together with all additional information related to document structure, and the alternative ways in which it is generated on the site retained. As regards scaling of the database, it is as scalable as the database (here Postgresql or SQLite) and hardware allow. I will prune the images later.
This is one of the more interesting output forms, as all the structural data for the documents are retained (though can be ignored by the user of the database should they so choose). All site texts/documents are (currently) streamed to four pgsql database tables:
There is of course the possibility to add further structures.
At this level SiSU loads a relational database with documents broken in to their smallest logical structurally constituent parts, as text objects, with their object citation number and all other structural information needed to construct the structured document. Text is stored (at this text object level) with and without elementary markup tagging, the stripped version being so as to facilitate ease of searching.
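A minimal sketch of this storage scheme, using Python's standard `sqlite3` module, follows. Everything specific in it is an assumption made for illustration: the table name `objects`, its columns, the sample documents, and the `strip_markup` helper, which removes only two inline marker forms modelled on bold and italic markup. It does not reproduce the schema SiSU actually creates.

```python
import re
import sqlite3

def strip_markup(text):
    """Remove a small assumed subset of inline markers, e.g. *{ bold }* and /{ italic }/."""
    return re.sub(r"[*/]\{\s*(.*?)\s*\}[*/]", r"\1", text)

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per text object, stored with and without markup.
conn.execute("CREATE TABLE objects (doc TEXT, ocn INTEGER, body TEXT, clean TEXT)")

documents = {
    "sales": ["Formation of contract", "An offer must be *{ definite }*."],
    "arbitration": ["Scope", "The /{ tribunal }/ decides on the offer."],
}
for doc, objects in documents.items():
    for ocn, body in enumerate(objects, start=1):
        conn.execute(
            "INSERT INTO objects VALUES (?, ?, ?, ?)",
            (doc, ocn, body, strip_markup(body)),
        )

def search(term):
    """Return (doc, ocn) locations whose stripped text contains the term."""
    cur = conn.execute(
        "SELECT doc, ocn FROM objects WHERE clean LIKE ? ORDER BY doc, ocn",
        (f"%{term}%",),
    )
    return cur.fetchall()

hits = search("offer")
print(hits)  # → [('arbitration', 2), ('sales', 2)]
```

Searching the stripped column avoids matches being broken up by markup characters, while the returned object numbers point to the same location in every other output of the document.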
Because the document structure of sites created is clearly defined, and the text object citation system is available for all forms of output, it is possible to search the sql database, and either read results from that database, or just as simply map the results to the html output, which has richer text markup.
The combination of the SiSU citation system with a relational database is pretty powerful, giving rise to several possibilities. As individual text objects of a document are stored (and indexed) together with object numbers, and all versions of the document have the same numbering, complex searches can be tailored to return just the locations of the search results relevant for all available output formats, with live links to the precise locations in the database or in html/xml documents; or, the structural information provided makes it possible to search the full contents of the database and return the headings under which the search content appears, or to search only headings etc. (as the Dublin Core is incorporated it is easy to make use of that as well).
This is a larger scale project, (with the front end largely ignored and little developed), though the "infrastructure" has been in place since 2002.
Sample search frontend 67 A small database and sample query front-end (search form) that makes use of the citation system, object citation numbering, to demonstrate functionality. 68
SiSU can provide information on which documents are matched and at what locations within each document the matches are found. These results are relevant across all outputs using object citation numbering, which includes html, EPUB, XML, LaTeX, PDF and indeed the SQL database. You can then refer to one of the other outputs or in the SQL database expand the text within the matched objects (paragraphs) in the documents matched.
(further work needs to be done on the sample search form, which is rudimentary and only passes simple booleans correctly at present to the SQL engine)
A few canned searches, showing object numbers. Search for:
Note that the searches done in this form are case sensitive.
Expand those same searches, showing the matching text in each document:
Note you may set results either for documents matched and object number locations within each matched document meeting the search criteria; or display the names of the documents matched along with the objects (paragraphs) that meet the search criteria. 69
OCN index mode (object citation number): the numbers displayed are relevant (and may be used to reference the match) in any SiSU generated rendition of the text; 70 the links provided are to the locations of matches within the html generated by SiSU.
Paragraph mode: you may alternatively display the text of each paragraph in which the match was made; again the object/paragraph numbers are relevant to any SiSU generated/published text.
Several options for output - select database to search, show results in index view (links to locations within text), show results with text, echo search in form, show what was searched, create and show a "canned url" for search, show available search fields. Also shows counters for the number of documents in which the search was found and the number of locations within documents where found. [could consider sorting by document with most occurrences of the search result].
Simple search, results with files in which search found, and text object (paragraph or endnote) where found within files.
There are other forms as well: YAML files, Ruby Marshal dumps, and document pre-processing (processing of documents prior to the steps described here, to produce input suitable for the program). Snap in a new module as required/desired; well formed XML is no problem.
Concordance /WordMaps: 71 SiSU produces a rudimentary index based on the words within the text, making use of paragraph numbers to identify text locations. This is generated in html and hyper-linked, but identifies these word locations in the other document formats as well. Though it is possible to search using a search engine, this is a means for browsing an alphabetical list of words which may suggest other useful content.
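The kind of word index described above can be sketched in a few lines. This is only an illustration under simple assumptions (words are runs of letters, paragraph numbers start at 1, case is folded); the `concordance` function is a hypothetical helper and not the way SiSU itself builds its concordance.

```python
import re
from collections import defaultdict

def concordance(paragraphs):
    """Map each lower-cased word to the sorted paragraph numbers containing it."""
    index = defaultdict(set)
    for number, text in enumerate(paragraphs, start=1):
        # [^\W\d_]+ matches runs of letters only (no digits or underscores)
        for word in re.findall(r"[^\W\d_]+", text):
            index[word.lower()].add(number)
    # Alphabetical order, so the result can be browsed as a word list
    return {word: sorted(nums) for word, nums in sorted(index.items())}
```

The paragraph numbers in the result play the role of the object numbers, so each entry points to the same location in every output format.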
SiSU builds the web site (or more generically provides a suitable directory structure) - placing various output texts in the hierarchy of the web-site (or db), which (for directories) is a sub-directory with the name of the text file.
SiSU is a batch processing tool, handling and transforming multiple (or individual) documents (in many ways) with a single instruction.
As should be clear from the above description of SiSU, it makes use of existing programs found on Gnu/Linux and Unix; those already mentioned include the LaTeX to pdf converters and the databases PostgreSQL and SQLite.
Unix provides many tools for version control. For documents Subversion, CVS and even the old RCS are useful for the per-document histories they provide.
For writing code, superior (more recent) version control systems exist. These can also be used for documents, though they tend to take stamps of changes across the repository as a whole, rather than for each individual file that is tracked (as CVS and RCS do). My personal preference is for distributed systems such as Git, Mercurial or Darcs, of which I use Git for both code and documents.
Several backup tools exist. At the base level I tend to use rdiff.
SiSU documents are prepared / marked up in utf-8 text; you are free to use the text editor of your choice.
Syntax highlighting is provided for a number of editors, amongst them Vim, Kwrite, Kate, Gedit and diakonos. These may be found with configuration instructions at ‹http://www.sisudoc.org/sisu/sisu_syntax_highlighting/doc.html› Vim 72 as of version 7 has built in syntax highlighting for SiSU.
Need a new output format that does not already exist? Write a new module.
Prefer a new input syntax? You could write a new syntax matching the existing design, though my personal preference is some uniformity in entry appearance. Where necessary it has been fairly easy to extend the design parameters. It is intended to incorporate some additional basic semantic tagging (book, article, author etc.). However, keeping the requirements for input minimal and relatively simple has been a design goal.
Current markup examples and document output samples are provided at ‹http://www.jus.uio.no/sisu/SiSU/examples.html›
For some documents hardly any markup is required at all, other than a header and an indication of what the levels to be taken into account by the program in generating its output are.
"Viral Spiral", David Bollier
document manifest 74
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"The Wealth of Networks", Yochai Benkler
document manifest 75
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"Two Bits", Christopher Kelty
document manifest 76
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"Free Culture", Lawrence Lessig
document manifest 77
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"CONTENT", Cory Doctorow
document manifest 78
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"Democratizing Innovation", by Eric von Hippel
document manifest 79
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"Free as in Freedom: Richard Stallman's Crusade for Free Software", by Sam Williams
document manifest 80
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"Free For All: How Linux and the Free Software Movement Undercut the High Tech Titans", by Peter Wayner
document manifest 81
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"The Cathedral and the Bazaar", by Eric S. Raymond
document manifest 82
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"Down and out in the Magic Kingdom", Cory Doctorow
document manifest 83
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"Little Brother", Cory Doctorow
document manifest 84
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"For the Win", Cory Doctorow
document manifest 85
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"Accelerando", Charles Stross
document manifest 86
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"Tainaron", Leena Krohn
document manifest 87
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"Sphinx or Robot", Leena Krohn
document manifest 88
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"War and Peace", Leo Tolstoy 89
document manifest 90
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"Don Quixote", Miguel de Cervantes [Saavedra]
document manifest 91
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"Gulliver's Travels", Jonathan Swift
document manifest 92
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"Alice's Adventures in Wonderland", Lewis Carroll
document manifest 93
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"Through The Looking-Glass", Lewis Carroll
document manifest 94
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"Alice's Adventures in Wonderland" and "Through The Looking-Glass", Lewis Carroll
document manifest 95
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"Gnu Public License 2", (GPL 2) Free Software Foundation
document manifest 96
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"Gnu Public License 3 - Third discussion draft", (GPL v3 draft3) Free Software Foundation
document manifest 97
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"Debian Social Contract"
document manifest 98
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"Debian Constitution v1.3"
document manifest 99
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"Debian Constitution v1.3", (markup adjusted for output to more closely match the original)
document manifest 100
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"Debian Constitution v1.2 (more translations)"
document manifest 101
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"Debian Constitution (more translations)", (markup adjusted for output to more closely match the original)
document manifest 102
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"A Uniform Sales Terminology", Vikki Rogers and Albert Kritzer
document manifest 103
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"The Autonomous Contract" 1997 - markup sample
document manifest 104
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"The Autonomous Contract Revisited" - markup sample 105
document manifest 106
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"United Nations Convention on Contracts for the International Sale of Goods" 107
document manifest 108
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
"Principles of European Contract Law"
document manifest 109
html, segmented text
html, scroll, document in one
epub
pdf, landscape
pdf, portrait
odf:odt, open document text
xhtml scroll
xml, sax
xml, dom
plain text utf-8
concordance
dcc, document content certificate (digests)
markup source text
markup source (zipped) pod
A Sample search form is available at ‹http://search.sisudoc.org›
There is quite a bit to peruse if you explore the site Lex Mercatoria:
or perhaps:
SiSU is not optimised for table making, but does handle simple tables.
This table gives an indication of the features that are available for various forms of output of SiSU.
sisu-2.0.0 on 2010-03-06
| feature | txt | ltx/pdf | HTML | EPUB | XML/s | XML/d | ODF | SQLite | pgSQL |
|---|---|---|---|---|---|---|---|---|---|
| headings | * | * | * | * | * | * | * | * | * |
| footnotes | * | * | * | * | * | * | * | * | * |
| bold, underscore, italics | . | * | * | * | * | * | * | * | * |
| strikethrough | . | * | * | * | * | * | * | ||
| superscript, subscript | . | * | * | * | * | * | * | ||
| extended ascii set (utf-8) | * | * | * | * | * | * | * | * | |
| indents | * | * | * | * | * | * | * | ||
| bullets | . | * | * | * | * | * | . | ||
| groups | |||||||||
| * tables | * | * | * | . | . | . | . | . | |
| * poem | * | * | * | * | . | . | * | . | . |
| * code | * | * | * | * | . | . | * | . | . |
| url | * | * | * | * | * | * | * | . | . |
| links | * | * | * | * | * | * | * | . | . |
| images | - | * | * | * | T | T | * | T | T |
| image caption | - | * | * | * | |||||
| table of contents | * | * | * | * | * | . | |||
| page header/footer? | - | * | * | * | * | * | t | ||
| line break | * | * | * | * | * | * | * | ||
| page break | * | * | |||||||
| segments | * | * | |||||||
| skins | * | * | * | * | * | * | |||
| ocn | . | * | * | * | * | * | -? | * | * |
| auto-heading numbers | * | * | * | * | * | * | * | * | * |
| minor list numbering | * | * | * | * | * | * | * | * | * |
| special characters | . | . | . | . |
sisu-1.0.0 on 2009-10-28
| feature | txt | ltx/pdf | HTML | XML/s | XML/d | ODF | SQLite | pgSQL |
|---|---|---|---|---|---|---|---|---|
| headings | * | * | * | * | * | * | * | * |
| footnotes | * | * | * | * | * | * | * | * |
| bold, underscore, italics | . | * | * | * | * | * | * | * |
| strikethrough | . | * | * | * | * | * | ||
| superscript, subscript | . | * | * | * | * | * | ||
| extended ascii set (utf-8) | * | * | * | * | * | * | * | |
| indents | * | * | * | * | * | * | ||
| bullets | . | * | * | * | * | . | ||
| groups | ||||||||
| * tables | * | * | . | . | . | . | . | |
| * poem | * | * | * | . | . | * | . | . |
| * code | * | * | * | . | . | * | . | . |
| url | * | * | * | * | * | * | . | . |
| links | * | * | * | * | * | * | . | . |
| images | - | * | * | T | T | * | T | T |
| image caption | - | * | * | |||||
| table of contents | * | * | * | * | . | |||
| page header/footer? | - | * | * | * | * | t | ||
| line break | * | * | * | * | * | * | ||
| page break | * | * | ||||||
| segments | * | |||||||
| skins | * | * | * | * | * | |||
| ocn | . | * | * | * | * | -? | * | * |
| auto-heading numbers | * | * | * | * | * | * | * | * |
| minor list numbering | * | * | * | * | * | * | * | * |
| special characters | . | . | . |
sisu-0.36.6 on 2006-01-23
| feature | txt | ltx/pdf | HTML | XHTML | XML/s | XML/d | ODF | SQLite | pgSQL |
|---|---|---|---|---|---|---|---|---|---|
| headings | * | * | * | * | * | * | * | * | * |
| footnotes | * | * | * | * | * | * | * | * | * |
| bold, underscore, italics | . | * | * | * | * | * | * | * | * |
| strikethrough | . | * | * | * | * | * | * | ||
| superscript, subscript | . | * | * | * | * | * | * | ||
| extended ascii set (utf-8) | * | * | * | * | * | * | * | * | |
| indents | * | * | * | * | * | * | * | ||
| bullets | . | * | * | * | * | * | . | ||
| groups | |||||||||
| * tables | * | * | . | . | . | . | . | . | |
| * poem | * | * | * | . | . | . | * | . | . |
| * code | * | * | * | . | . | . | * | . | . |
| url | * | * | * | * | * | * | * | . | . |
| links | * | * | * | * | * | * | * | . | . |
| images | - | * | * | T | T | T | * | T | T |
| image caption | - | * | * | ||||||
| table of contents | * | * | * | * | * | . | |||
| page header/footer? | - | * | * | * | * | * | t | ||
| line break | * | * | * | * | * | * | * | ||
| page break | * | * | |||||||
| segments | * | ||||||||
| skins | * | * | * | * | * | * | |||
| ocn | . | * | * | * | * | * | -? | * | * |
| auto-heading numbers | * | * | * | * | * | * | * | * | * |
| minor list numbering | * | * | * | * | * | * | * | * | * |
| special characters | . | . | . |
Done
* yes/done
. partial
- not available/appropriate
Not Done
T task todo
t lesser task/todo
not done
SiSU source documents are plaintext (UTF-8) 115 files
All paragraphs are separated by an empty line.
Markup is comprised of:
Some interactive help on markup is available, by typing sisu and selecting markup or sisu --help markup
To check the markup in a file:
sisu --identify [filename].sst
For a brief descriptive summary of markup history:
sisu --query-history
or if for a particular version:
sisu --query-0.38
Online markup examples are available together with the respective outputs produced from ‹http://www.jus.uio.no/sisu/SiSU/examples.html› or from ‹http://www.jus.uio.no/sisu/sisu_examples/›
There is of course this document, which provides a cursory overview of sisu markup and the respective output produced: ‹http://www.jus.uio.no/sisu/sisu_markup/›
an alternative presentation of markup syntax: /usr/share/doc/sisu/on_markup.txt.gz
With SiSU installed sample skins may be found in: /usr/share/doc/sisu/markup-samples (or equivalent directory) and if sisu-markup-samples is installed also under: /usr/share/doc/sisu/markup-samples-non-free
Headers contain either: semantic meta-data about a document, which can be used by any output module of the program, or; processing instructions.
Note: the first line of a document may include information on the markup version used in the form of a comment. Comments are a percentage mark at the start of a paragraph (and as the first character in a line of text) followed by a space and the comment:
% this would be a comment
This current document is loaded by a master document that has a header similar to this one:
% SiSU master 2.0
@title: SiSU
:subtitle: Manual
@creator: :author: Amissah, Ralph
@rights: Copyright (C) Ralph Amissah 2007, part of SiSU documentation, License GPL 3
@classify:
:type: information
:topic_register: SiSU:manual;electronic documents:SiSU:manual
:subject: ebook, epublishing, electronic book, electronic publishing,
electronic document, electronic citation, data structure,
citation systems, search
% used_by: manual
@date:
:published: 2008-05-22
:created: 2002-08-28
:issued: 2002-08-28
:available: 2002-08-28
:modified: 2010-03-03
@make:
:num_top: 1
:breaks: new=C; break=1
:skin: skin_sisu_manual
:bold: /Gnu|Debian|Ruby|SiSU/
:manpage: name=sisu - documents: markup, structuring, publishing in multiple standard formats, and search;
synopsis=sisu [-abcDdeFhIiMmNnopqRrSsTtUuVvwXxYyZz0-9] [filename/wildcard ]
. sisu [-Ddcv] [instruction]
. sisu [-CcFLSVvW]
. sisu --v2 [operations]
. sisu --v3 [operations]
@links:
{ SiSU Homepage }http://www.sisudoc.org/
{ SiSU Manual }http://www.sisudoc.org/sisu/sisu_manual/
{ Book Samples & Markup Examples }http://www.jus.uio.no/sisu/SiSU/examples.html
{ SiSU Download }http://www.jus.uio.no/sisu/SiSU/download.html
{ SiSU Changelog }http://www.jus.uio.no/sisu/SiSU/changelog.html
{ SiSU Git repo }http://git.sisudoc.org/?p=code/sisu.git;a=summary
{ SiSU List Archives }http://lists.sisudoc.org/pipermail/sisu/
{ SiSU @ Debian }http://packages.qa.debian.org/s/sisu.html
{ SiSU Project @ Debian }http://qa.debian.org/developer.php?login=sisu@lists.sisudoc.org
{ SiSU @ Wikipedia }http://en.wikipedia.org/wiki/SiSU
Header tags appear at the beginning of a document and provide meta information on the document (such as the Dublin Core), or information as to how the document as a whole is to be processed. All header instructions take the form @headername: or, on the next line and indented by one space, :subheadername: All Dublin Core meta tags are available.
@identifier: information or instructions
where the "identifier" is a tag recognised by the program, and the "information" or "instructions" belong to the tag/identifier specified
Note: a header where used should only be used once; all headers apart from @title: are optional; the @structure: header is used to describe document structure, and can be useful to know.
This is a sample header
% SiSU 2.0 [declared file-type identifier with markup version]
@title: [title text] [this header is the only one that is mandatory]
:subtitle: [subtitle if any]
:language: English
@creator:
:author: [Lastname, First names]
:illustrator: [Lastname, First names]
:translator: [Lastname, First names]
:prepared_by: [Lastname, First names]
@date:
:published: [year or yyyy-mm-dd]
:created: [year or yyyy-mm-dd]
:issued: [year or yyyy-mm-dd]
:available: [year or yyyy-mm-dd]
:modified: [year or yyyy-mm-dd]
:valid: [year or yyyy-mm-dd]
:added_to_site: [year or yyyy-mm-dd]
:translated: [year or yyyy-mm-dd]
@rights:
:copyright: Copyright (C) [Year and Holder]
:license: [Use License granted]
:text: [Year and Holder]
:translation: [Name, Year]
:illustrations: [Name, Year]
@classify:
:topic_register: SiSU:markup sample:book;book:novel:fantasy
:type:
:subject:
:description:
:keywords:
:abstract:
:isbn: [ISBN]
:loc: [Library of Congress classification]
:dewey: [Dewey classification]
:pg: [Project Gutenberg text number]
@links: { SiSU }http://www.sisudoc.org
{ FSF }http://www.fsf.org
@make:
:skin: skin_name [skins change default settings related to the appearance of documents generated]
:num_top: 1
:headings: [text to match for each level
(e.g. PART; Chapter; Section; Article; or another: none; BOOK|FIRST|SECOND; none; CHAPTER;)]
:breaks: new=:C; break=1
:promo: sisu, ruby, sisu_search_libre, open_society
:bold: [regular expression of words/phrases to be made bold]
:italics: [regular expression of words/phrases to italicise]
@original:
:language: [language]
@notes:
:comment:
:prefix: [prefix is placed just after table of contents]
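To make the header layout described above concrete, here is a minimal sketch that reads @headername: and :subheadername: lines into a nested dictionary. It is an illustration only and not SiSU's own parser: the `parse_header` function and the `_value` key used for text placed directly after a top-level header are assumptions of this example, and continuation lines are simply ignored.

```python
import re

HEADER = re.compile(r"^@(\w+):\s*(.*)$")
SUBHEADER = re.compile(r"^\s*:(\w+):\s*(.*)$")

def parse_header(text):
    """Collect @header: and :subheader: lines into a nested dict."""
    result, current = {}, None
    for line in text.splitlines():
        if line.startswith("% ") or not line.strip():
            continue  # comment or blank line
        top = HEADER.match(line)
        if top:
            current = result.setdefault(top.group(1), {})
            line = top.group(2)  # whatever follows "@name:" on the same line
        sub = SUBHEADER.match(line)
        if sub and current is not None:
            current[sub.group(1)] = sub.group(2)
        elif top and line:
            # "_value" is this sketch's own key for text directly after @name:
            current["_value"] = line
    return result
```

Given a header containing `@title: SiSU` followed by `:subtitle: Manual`, the result would hold both values under the `title` key, which is one simple way to keep a header and its sub-headers together.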
Heading levels are :A~, :B~, :C~, 1~, 2~, 3~ ... :A - :C being part / section headings, followed by other heading levels, and 1 - 6 being headings followed by substantive text or sub-headings. :A~ is usually the title; :A~? is a conditional level 1 heading (used where a stand-alone document may be imported into another)
:A~ [heading text] Top level heading [this usually has similar content to the title @title: ] NOTE: the heading levels described here are in 0.38 notation, see heading
:B~ [heading text] Second level heading [this is a heading level divider]
:C~ [heading text] Third level heading [this is a heading level divider]
1~ [heading text] Top level heading preceding substantive text of document or sub-heading 2, the heading level that would normally be marked 1. or 2. or 3. etc. in a document, and the level on which sisu by default would break html output into named segments, names are provided automatically if none are given (a number), otherwise takes the form 1~my_filename_for_this_segment
2~ [heading text] Second level heading preceding substantive text of document or sub-heading 3 , the heading level that would normally be marked 1.1 or 1.2 or 1.3 or 2.1 etc. in a document.
3~ [heading text] Third level heading preceding substantive text of document, that would normally be marked 1.1.1 or 1.1.2 or 1.2.1 or 2.1.1 etc. in a document
1~filename level 1 heading,
% the primary division such as Chapter that is followed by substantive text, and may be further subdivided (this is the level on which by default html segments are made)
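The heading markers described above lend themselves to a simple pattern match. The sketch below recognises :A~ to :C~ and 1~ to 6~, with an optional segment name directly after the tilde; `parse_heading` is a hypothetical helper written for this illustration, and the conditional form :A~? is deliberately not handled.

```python
import re

# Level marker, optional segment name glued to the tilde, then the heading text
HEADING = re.compile(r"^(:[A-C]|[1-6])~(\w*)\s+(.*)$")

def parse_heading(line):
    """Return (level, segment_name, text) for a heading line, else None."""
    m = HEADING.match(line)
    if not m:
        return None
    level, name, text = m.groups()
    return level.lstrip(":"), name or None, text
```

A line such as `1~intro Introduction` would yield the level `1`, the segment name `intro` and the heading text, while ordinary paragraphs return `None`.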
markup example:
normal text, *{emphasis}*, !{bold text}!, /{italics}/, _{underscore}_, "{citation}",
^{superscript}^, ,{subscript},, +{inserted text}+, -{strikethrough}-, #{monospace}#
normal text
*{emphasis}* [note: can be configured to be represented by bold, italics or underscore]
!{bold text}!
/{italics}/
_{underscore}_
"{citation}"
^{superscript}^
,{subscript},
+{inserted text}+
-{strikethrough}-
#{monospace}#
resulting output:
normal text, emphasis, bold text, italics, underscore, citation, superscript, subscript, inserted text, strikethrough, monospace
normal text
emphasis [note: can be configured to be represented by bold, italics or underscore]
bold text
italics
underscore
citation
superscript
subscript
inserted text
strikethrough
monospace
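The inline markers listed above all share one shape: a marker character, an opening brace, the text, a closing brace, and the same marker again. A minimal sketch of converting them to HTML follows; the choice of HTML tags and the `inline_to_html` function are assumptions of this example (SiSU's own output may differ), and no HTML escaping or nesting is attempted.

```python
import re

# Marker character -> HTML tag; the tag choices are this sketch's assumption.
TAGS = {
    "*": "em", "!": "b", "/": "i", "_": "u", '"': "cite",
    "^": "sup", ",": "sub", "+": "ins", "-": "del", "#": "code",
}
# The back-reference \1 requires the closing marker to equal the opening one
INLINE = re.compile(r'([*!/_"^,+\-#])\{(.+?)\}\1')

def inline_to_html(text):
    """Replace marker{text}marker spans with the corresponding HTML element."""
    def repl(m):
        tag = TAGS[m.group(1)]
        return f"<{tag}>{m.group(2)}</{tag}>"
    return INLINE.sub(repl, text)
```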
markup example:
ordinary paragraph
_1 indent paragraph one step
_2 indent paragraph two steps
_9 indent paragraph nine steps
resulting output:
ordinary paragraph
indent paragraph one step
indent paragraph two steps
indent paragraph nine steps
markup example:
_* bullet text
_1* bullet text, first indent
_2* bullet text, two step indent
resulting output:
Numbered List (not to be confused with headings/titles, (document structure))
markup example:
# numbered list numbered list 1., 2., 3, etc.
_# numbered list numbered list indented a., b., c., d., etc.
Footnotes and endnotes are marked up at the location where they would be indicated within a text. They are automatically numbered. The output type determines whether footnotes or endnotes will be produced.
markup example:
~{ a footnote or endnote }~
resulting output:
markup example:
normal text~{ self contained endnote marker & endnote in one }~ continues
resulting output:
normal text 117 continues
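The automatic numbering of notes can be pictured as follows. This sketch replaces each ~{ note }~ with a bracketed number and collects the note texts in order; the `number_notes` function and the bracketed marker style are assumptions made for illustration, and the unnumbered ~{* }~ and ~{+ }~ variants are left untouched.

```python
import re

# Numbered notes only: a note body starting with * or + is skipped
NOTE = re.compile(r"\s*~\{(?![*+])\s*(.*?)\s*\}~")

def number_notes(text):
    """Replace each ~{ note }~ with a numbered marker and collect the notes."""
    notes = []

    def repl(m):
        notes.append(m.group(1))
        return f"[{len(notes)}]"

    return NOTE.sub(repl, text), notes
```

The returned list can then be rendered as footnotes or endnotes, depending on the output type.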
markup example:
normal text ~{* unnumbered asterisk footnote/endnote, insert multiple asterisks if required }~ continues
normal text ~{** another unnumbered asterisk footnote/endnote }~ continues
resulting output:
normal text * continues
normal text ** continues
markup example:
normal text ~[* editors notes, numbered asterisk footnote/endnote series ]~ continues
normal text ~[+ editors notes, numbered asterisk footnote/endnote series ]~ continues
resulting output:
normal text *1 continues
normal text +1 continues
Alternative endnote pair notation for footnotes/endnotes:
% note the endnote marker "~^"
normal text~^ continues
^~ endnote text following the paragraph in which the marker occurs
the standard and pair notation cannot be mixed in the same document
urls found within text are marked up automatically. A url within text is automatically hyperlinked to itself and by default decorated with angled braces, unless it is contained within a code block (in which case it is passed as normal text), or escaped by a preceding underscore (in which case the decoration is omitted).
markup example:
normal text http://www.sisudoc.org/ continues
resulting output:
normal text ‹http://www.sisudoc.org/› continues
An escaped url without decoration
markup example:
normal text _http://www.sisudoc.org/ continues
deb http://www.jus.uio.no/sisu/archive unstable main non-free
resulting output:
normal text http://www.sisudoc.org/ continues
deb http://www.jus.uio.no/sisu/archive unstable main non-free
where a code block is used there is neither decoration nor hyperlinking, code blocks are discussed later in this document
resulting output:
deb http://www.jus.uio.no/sisu/archive unstable main non-free
deb-src http://www.jus.uio.no/sisu/archive unstable main non-free
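The url rules above (hyperlink and decorate by default, omit the decoration when escaped with an underscore) can be sketched as a single substitution. The `decorate_urls` function and the HTML anchor output are assumptions of this example; code blocks, which are passed through untouched, would have to be excluded before calling it.

```python
import re

# Optional escaping underscore, then the url up to the next whitespace
URL = re.compile(r"(_?)(https?://[^\s<>]+)")

def decorate_urls(text):
    """Hyperlink bare urls; an underscore prefix suppresses the angle decoration."""
    def repl(m):
        escaped, url = m.groups()
        link = f'<a href="{url}">{url}</a>'
        return link if escaped else f"‹{link}›"
    return URL.sub(repl, text)
```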
To link text or an image to a url the markup is as follows
markup example:
about { SiSU }http://url.org markup
resulting output:
about SiSU markup
A shortcut notation is available so the url link may also be provided automatically as a footnote
markup example:
about {~^ SiSU }http://url.org markup
resulting output:
Internal document links to a tagged location, including an ocn
markup example:
about { text links }#link_text
resulting output:
about text links
Shared document collection link
markup example:
about { SiSU book markup examples }:SiSU/examples.html
resulting output:
markup example:
{ tux.png 64x80 }image
% various url linked images
{tux.png 64x80 "a better way" }http://www.sisudoc.org/
{GnuDebianLinuxRubyBetterWay.png 100x101 "Way Better - with Gnu/Linux, Debian and Ruby" }http://www.sisudoc.org/
{~^ ruby_logo.png "Ruby" }http://www.ruby-lang.org/en/
resulting output:
linked url footnote shortcut
{~^ [text to link] }http://url.org
% maps to: { [text to link] }http://url.org ~{ http://url.org }~
% which produces hyper-linked text within a document/paragraph, with an endnote providing the url for the text location used in the hyperlink
note at a heading level the same is automatically achieved by providing names to headings 1, 2 and 3 i.e. 2~[name] and 3~[name] or in the case of auto-heading numbering, without further intervention.
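The mapping performed by the linked url footnote shortcut can be written as a single substitution. The Python sketch below is illustrative only; the name expand_link_shortcut and the regular expression are assumptions, and the url is taken to run up to the next whitespace.

```python
import re

# {~^ link text }http://url   maps to   { link text }http://url ~{ http://url }~
SHORTCUT = re.compile(r"\{~\^\s*(.*?)\s*\}(\S+)")

def expand_link_shortcut(text):
    """Expand the footnote shortcut into a link plus an endnote holding the url."""
    return SHORTCUT.sub(r"{ \1 }\2 ~{ \2 }~", text)
```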
Tables may be prepared in either of two forms
markup example:
table{ c3; 40; 30; 30;
This is a table
this would become column two of row one
column three of row one is here
And here begins another row
column two of row two
column three of row two, and so on
}table
resulting output:
| This is a table | this would become column two of row one | column three of row one is here |
| And here begins another row | column two of row two | column three of row two, and so on |
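To make the first table form concrete, here is a small Python sketch that reads the markup as shown above, one cell per line, and groups the cells into rows using the column count from the opening line. It is an assumption-laden illustration rather than SiSU's parser: the name parse_table is hypothetical, and the numbers after the column count are taken here to be relative column widths.

```python
def parse_table(markup):
    """Parse the 'table{ cN; w1; w2; ...' form into (widths, rows)."""
    lines = [line.strip() for line in markup.strip().splitlines() if line.strip()]
    header, cells = lines[0], lines[1:-1]      # the last line is the closing }table
    fields = [f.strip() for f in header[len("table{"):].split(";") if f.strip()]
    columns = int(fields[0].lstrip("c"))       # "c3" means three columns
    widths = [int(w) for w in fields[1:]]      # assumed: relative column widths
    # Group consecutive cells into rows of `columns` cells each.
    rows = [cells[i:i + columns] for i in range(0, len(cells), columns)]
    return widths, rows
```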
a second form may be easier to work with in cases where there is not much information in each column
markup example: 120
!_ Table 3.1: Contributors to Wikipedia, January 2001 - June 2005
{table~h 24; 12; 12; 12; 12; 12; 12;}
|Jan. 2001|Jan. 2002|Jan. 2003|Jan. 2004|July 2004|June 2006
Contributors* | 10| 472| 2,188| 9,653| 25,011| 48,721
Active contributors** | 9| 212| 846| 3,228| 8,442| 16,945
Very active contributors*** | 0| 31| 190| 692| 1,639| 3,016
No. of English language articles| 25| 16,000| 101,000| 190,000| 320,000| 630,000
No. of articles, all languages | 25| 19,000| 138,000| 490,000| 862,000|1,600,000
\* Contributed at least ten times; \** at least 5 times in last month; \*\** more than 100 times in last month.
resulting output:
Table 3.1: Contributors to Wikipedia, January 2001 - June 2005
| | Jan. 2001 | Jan. 2002 | Jan. 2003 | Jan. 2004 | July 2004 | June 2006 |
|---|---|---|---|---|---|---|
| Contributors* | 10 | 472 | 2,188 | 9,653 | 25,011 | 48,721 |
| Active contributors** | 9 | 212 | 846 | 3,228 | 8,442 | 16,945 |
| Very active contributors*** | 0 | 31 | 190 | 692 | 1,639 | 3,016 |
| No. of English language articles | 25 | 16,000 | 101,000 | 190,000 | 320,000 | 630,000 |
| No. of articles, all languages | 25 | 19,000 | 138,000 | 490,000 | 862,000 | 1,600,000 |
* Contributed at least ten times; ** at least 5 times in last month; *** more than 100 times in last month.
basic markup:
poem{
Your poem here
}poem
Each verse in a poem is given an object number.
markup example:
poem{
`Fury said to a
mouse, That he
met in the
house,
"Let us
both go to
law: I will
prosecute
YOU. --Come,
I'll take no
denial; We
must have a
trial: For
really this
morning I've
nothing
to do."
Said the
mouse to the
cur, "Such
a trial,
dear Sir,
With
no jury
or judge,
would be
wasting
our
breath."
"I'll be
judge, I'll
be jury,"
Said
cunning
old Fury:
"I'll
try the
whole
cause,
and
condemn
you
to
death."'
}poem
resulting output:
`Fury said to a
mouse, That he
met in the
house,
"Let us
both go to
law: I will
prosecute
YOU. --Come,
I'll take no
denial; We
must have a
trial: For
really this
morning I've
nothing
to do."
Said the
mouse to the
cur, "Such
a trial,
dear Sir,
With
no jury
or judge,
would be
wasting
our
breath."
"I'll be
judge, I'll
be jury,"
Said
cunning
old Fury:
"I'll
try the
whole
cause,
and
condemn
you
to
death."'
basic markup:
group{
Your grouped text here
}group
A group is treated as an object and given a single object number.
markup example:
group{
`Fury said to a
mouse, That he
met in the
house,
"Let us
both go to
law: I will
prosecute
YOU. --Come,
I'll take no
denial; We
must have a
trial: For
really this
morning I've
nothing
to do."
Said the
mouse to the
cur, "Such
a trial,
dear Sir,
With
no jury
or judge,
would be
wasting
our
breath."
"I'll be
judge, I'll
be jury,"
Said
cunning
old Fury:
"I'll
try the
whole
cause,
and
condemn
you
to
death."'
}group
resulting output:
`Fury said to a
mouse, That he
met in the
house,
"Let us
both go to
law: I will
prosecute
YOU. --Come,
I'll take no
denial; We
must have a
trial: For
really this
morning I've
nothing
to do."
Said the
mouse to the
cur, "Such
a trial,
dear Sir,
With
no jury
or judge,
would be
wasting
our
breath."
"I'll be
judge, I'll
be jury,"
Said
cunning
old Fury:
"I'll
try the
whole
cause,
and
condemn
you
to
death."'
Code tags code{ ... }code (used in the same way as the group and poem tags described above) escape regular SiSU markup, and have been used extensively within this document to provide examples of SiSU markup. You cannot, however, use code tags to escape code tags.
A code-block is treated as an object and given a single object number. [an option to number each line of code may be considered at some later time]
use of code tags instead of poem compared, resulting output:
`Fury said to a
mouse, That he
met in the
house,
"Let us
both go to
law: I will
prosecute
YOU. --Come,
I'll take no
denial; We
must have a
trial: For
really this
morning I've
nothing
to do."
Said the
mouse to the
cur, "Such
a trial,
dear Sir,
With
no jury
or judge,
would be
wasting
our
breath."
"I'll be
judge, I'll
be jury,"
Said
cunning
old Fury:
"I'll
try the
whole
cause,
and
condemn
you
to
death."'
From SiSU 2.7.7 on you can number codeblocks by placing a hash after the opening code tag code{# as demonstrated here:
1 ┆ `Fury said to a
2 ┆ mouse, That he
3 ┆ met in the
4 ┆ house,
5 ┆ "Let us
6 ┆ both go to
7 ┆ law: I will
8 ┆ prosecute
9 ┆ YOU. --Come,
10 ┆ I'll take no
11 ┆ denial; We
12 ┆ must have a
13 ┆ trial: For
14 ┆ really this
15 ┆ morning I've
16 ┆ nothing
17 ┆ to do."
18 ┆ Said the
19 ┆ mouse to the
20 ┆ cur, "Such
21 ┆ a trial,
22 ┆ dear Sir,
23 ┆ With
24 ┆ no jury
25 ┆ or judge,
26 ┆ would be
27 ┆ wasting
28 ┆ our
29 ┆ breath."
30 ┆ "I'll be
31 ┆ judge, I'll
32 ┆ be jury,"
33 ┆ Said
34 ┆ cunning
35 ┆ old Fury:
36 ┆ "I'll
37 ┆ try the
38 ┆ whole
39 ┆ cause,
40 ┆ and
41 ┆ condemn
42 ┆ you
43 ┆ to
44 ┆ death."'
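The effect of code{# on a block can be imitated with a short Python sketch. It only reproduces the look of the numbered output shown above; the function name number_lines and the exact separator are assumptions.

```python
def number_lines(code, sep=" ┆ "):
    """Prefix every line of a code block with its line number."""
    lines = code.splitlines()
    return "\n".join("%d%s%s" % (n, sep, line) for n, line in enumerate(lines, 1))
```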
To make an index, append to the paragraph the book index term that relates to it, using an equal sign and curly braces.
Currently two levels are provided, a main term and if needed a sub-term. Sub-terms are separated from the main term by a colon.
Paragraph containing main term and sub-term.
={Main term:sub-term}
The index syntax starts on a new line, but there should not be an empty line between paragraph and index markup.
The structure of the resulting index would be:
Main term, 1
sub-term, 1
Several terms may relate to a paragraph; they are separated by a semicolon. If the term refers to more than one paragraph, indicate the number of paragraphs.
Paragraph containing main term, second term and sub-term.
={first term; second term: sub-term}
The structure of the resulting index would be:
First term, 1,
Second term, 1,
sub-term, 1
If multiple sub-terms appear under one paragraph, they are separated under the main term heading from each other by a pipe symbol.
Paragraph containing main term, second term and sub-term.
={Main term:sub-term+1|second sub-term}
A paragraph that continues discussion of the first sub-term
The plus one in the example provided indicates the first sub-term spans one additional paragraph. The logical structure of the resulting index would be:
Main term, 1,
sub-term, 1-2,
second sub-term, 1,
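The index syntax is regular enough to be parsed with a few string splits. The following Python sketch is one possible reading of the rules above (semicolons between main terms, a colon before sub-terms, pipes between sub-terms, and +N for a span of N additional objects); it is not SiSU's implementation. The name parse_index, the ocn parameter and the returned dictionary layout are assumptions, and terms that themselves contain a plus sign are not handled.

```python
import re

def parse_index(markup, ocn=1):
    """Parse ={...} book index markup attached to the object numbered `ocn`.

    Returns {main term: {"range": str, "sub": {sub-term: str}}}.
    """
    body = re.fullmatch(r"=\{(.+?)\}", markup.strip()).group(1)

    def split_span(term):
        # "term+2" on object 1 covers objects 1-3; a bare term covers object 1 only.
        name, _, extra = term.strip().partition("+")
        last = ocn + int(extra) if extra else ocn
        return name.strip(), str(ocn) if last == ocn else "%d-%d" % (ocn, last)

    index = {}
    for entry in body.split(";"):              # main terms
        main, _, subs = entry.partition(":")   # main term : sub-terms
        name, span = split_span(main)
        index[name] = {"range": span, "sub": {}}
        if subs:
            for sub in subs.split("|"):        # sub-terms
                sub_name, sub_span = split_span(sub)
                index[name]["sub"][sub_name] = sub_span
    return index
```

The ocn argument stands for the object number of the paragraph the markup is appended to, so the same markup yields different ranges depending on where it sits in the document.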
It is possible to build a document by creating a master document that requires other documents. The documents required may be complete documents that could be generated independently, or they could be markup snippets, prepared so as to be easily available to be placed within another text. If the calling document is a master document (built from other documents), it should be named with the suffix .ssm. Within this document you would provide information on the other documents that should be included within the text. These may be other documents that would be processed in a regular way (.sst, regular markup files), or markup bits prepared only for inclusion within a master document (.ssi, insert/information files). A secondary file of the composite document is built prior to processing, with the same prefix and the suffix ._sst
basic markup for importing a document into a master document
<< filename1.sst
<< filename2.ssi
The form described above should be relied on. Within the Vim editor it results in the text thus linked becoming hyperlinked to the document it is calling in, which is convenient for editing. Alternative markup forms for the importation of documents, which have been under consideration and occasionally supported, are:
<< filename.ssi
<<{filename.ssi}
% using textlink alternatives
<< |filename.ssi|@|^|
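The way a composite document is assembled from a master document can be sketched as follows. This Python example is a simplified illustration, not SiSU's own processing: it handles only the basic "<< filename" form, does not recurse into included files, and returns the composite text rather than writing the ._sst file. The name build_composite is hypothetical.

```python
import re
from pathlib import Path

# A line consisting of "<< " followed by a .sst or .ssi file name.
INCLUDE = re.compile(r"^<<\s+(\S+\.ss[ti])\s*$")

def build_composite(master_path):
    """Inline every '<< file' line of a .ssm master document and return the text."""
    master = Path(master_path)
    out = []
    for line in master.read_text(encoding="utf-8").splitlines():
        match = INCLUDE.match(line)
        if match:
            # Included files are looked up relative to the master document.
            included = master.parent / match.group(1)
            out.append(included.read_text(encoding="utf-8").rstrip("\n"))
        else:
            out.append(line)
    return "\n".join(out) + "\n"
```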
2.0 introduced new headers and is therefore incompatible with 1.0, though it is otherwise the same with the addition of a couple of tags (i.e. a superset)
0.38 is substantially current for version 1.0
deprecated 0.16 is supported, though file names were changed at 0.37
provides a short history of changes to SiSU markup
SiSU 2.0 (2010-03-06:09/6) same as 1.0, apart from the changing of headers and the addition of a monospace tag; related headers are now grouped, e.g.
@title:
:subtitle:
@creator:
:author:
:translator:
:illustrator:
@rights:
:text:
:illustrations:
see document markup samples, and sisu --help headers
the monospace tag takes the form of a hash '#'
#{ this enclosed text would be monospaced }#
1.0 (2009-12-19:50/6) same as 0.69
0.69 (2008-09-16:37/2) (same as 1.0) and as previous (0.57) with the addition of book index tags
/^={.+?}$/
e.g. appended to a paragraph, on a new-line (without a blank line in between) logical structure produced assuming this is the first text "object"
={GNU/Linux community distribution:Debian+2|Fedora|Gentoo;Free Software Foundation+5}
Free Software Foundation, 1-6
GNU/Linux community distribution, 1
Debian, 1-3
Fedora, 1
Gentoo, 1
0.66 (2008-02-24:07/7) same as previous, adds semantic tags, [experimental and not-used]
/[:;]{.+?}[:;][a-z+]/
0.57 (2007w34/4) SiSU 0.57 is the same as 0.42 with the introduction of a shortcut to use the headers @title and @creator in the first heading [expanded using the contents of the headers @title: and @author:]
:A~ @title by @author
0.52 (2007w14/6) declared document type identifier at start of text/document:
SiSU 0.52
or, backward compatible using the comment marker:
% SiSU 0.38
variations include 'SiSU (text|master|insert) [version]' and 'sisu-[version]'
0.51 (2007w13/6) skins changed (simplified), markup unchanged
0.42 (2006w27/4) * (asterisk) type endnotes, used e.g. in relation to author
SiSU 0.42 is the same as 0.38 with the introduction of some additional endnote types,
Introduces some variations on endnotes, in particular the use of the asterisk
~{* for example for describing an author }~ and ~{** for describing a second author }~
* for example for describing an author
** for describing a second author
and
~[* my note ]~ or ~[+ another note ]~
which numerically increments an asterisk and plus respectively
*1 my note +1 another note
0.38 (2006w15/7) introduced new/alternative notation for headers, e.g. @title: (instead of 0~title), and accompanying document structure markup, :A,:B,:C,1,2,3 (maps to previous 1,2,3,4,5,6)
SiSU 0.38 introduced alternative experimental header and heading/structure markers,
@headername: and headers :A~ :B~ :C~ 1~ 2~ 3~
as the equivalent of:
0~headername and headers 1~ 2~ 3~ 4~ 5~ 6~
The internal document markup of SiSU 0.16 remains valid and standard, though note that SiSU 0.37 introduced a new file naming convention.
SiSU has in effect two sets of levels to be considered. Using 0.38 notation these are the A-C headings/levels, which precede ordinary paragraphs (pre-substantive text), and the 1-3 headings/levels, which are followed by ordinary text. This may be conceptualised as levels A,B,C, 1,2,3, and using such letter number notation, in effect: A must exist, and optional B and C may follow in sequence (not strict); 1 must exist, and optional 2 and 3 may follow in sequence, i.e. there are two independent heading level sequences A,B,C and 1,2,3 (using the 0.16 standard notation, 1,2,3 and 4,5,6). On the positive side, the 0.38 A,B,C,1,2,3 alternative makes explicit an aspect of structuring documents in SiSU that is not otherwise obvious to the newcomer (though it appears more complicated, it is more in your face and likely to be understood fairly quickly); the substantive text follows levels 1,2,3 and it is 'nice' to do most work in those levels.
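The correspondence between the 0.16 and 0.38 notations can be expressed as a simple line-by-line conversion. The Python sketch below is illustrative only and is not a tool shipped with SiSU; the name convert_line is hypothetical, and it assumes that headers and headings each start at the beginning of a line.

```python
import re

# 0.16 level number -> 0.38 level marker, following the mapping given above
LEVELS = {"1": ":A", "2": ":B", "3": ":C", "4": "1", "5": "2", "6": "3"}

def convert_line(line):
    """Rewrite one 0.16-style header or heading line in 0.38 notation."""
    header = re.match(r"0~(\w+)\s*(.*)", line)
    if header:                                   # 0~title Text -> @title: Text
        return "@%s: %s" % header.groups()
    heading = re.match(r"([1-6])~(.*)", line)
    if heading:                                  # 4~ Text -> 1~ Text
        return LEVELS[heading.group(1)] + "~" + heading.group(2)
    return line                                  # ordinary text is left alone
```

Each line is converted in a single pass, so a 0.16 level 4 heading that becomes 1~ is not converted a second time.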
0.37 (2006w09/7) introduced new file naming convention, .sst (text), .ssm (master), .ssi (insert), markup syntax unchanged
SiSU 0.37 introduced new file naming convention, using the file extensions .sst .ssm and .ssi to replace .s1 .s2 .s3 .r1 .r2 .r3 and .si
this is captured by the following file 'rename' instruction:
rename 's/\.s[123]$/\.sst/' *.s{1,2,3}
rename 's/\.r[123]$/\.ssm/' *.r{1,2,3}
rename 's/\.si$/\.ssi/' *.si
The internal document markup remains unchanged, from SiSU 0.16
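Where the rename utility used above is not available, the same mapping of file extensions can be computed in Python. This sketch only works out the new name and leaves the actual renaming to the caller; the name new_name and the RULES table are assumptions that mirror the three rename instructions.

```python
import re

# Old extension pattern -> new extension, mirroring the rename instructions
RULES = [(r"\.s[123]$", ".sst"), (r"\.r[123]$", ".ssm"), (r"\.si$", ".ssi")]

def new_name(filename):
    """Return the 0.37-style name for a pre-0.37 SiSU markup file name."""
    for pattern, suffix in RULES:
        if re.search(pattern, filename):
            return re.sub(pattern, suffix, filename)
    return filename  # already in the new convention, or not a SiSU file
```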
0.35 (2005w52/3) sisupod, zipped content file introduced
0.23 (2005w36/2) utf-8 for markup file
0.22 (2005w35/3) image dimensions may be omitted if rmagick is available to be relied upon
0.20.4 (2005w33/4) header 0~links
0.16 (2005w25/2) substantial changes introduced to make markup cleaner, header 0~title type, and headings [1-6]~ introduced, also percentage sign (%) at start of a text line as comment marker
SiSU 0.16 (0.15 development branch) introduced the use of
the header 0~ and headings/structure 1~ 2~ 3~ 4~ 5~ 6~
in place of the 0.1 header, heading/structure notation
SiSU 0.1 headers and headings structure represented by header 0{~ and headings/structure 1{ 2{ 3{ 4{~ 5{ 6{
SiSU is a document publishing system that, from a simple single marked-up document, produces multiple output formats, including: plaintext, html, xhtml, XML, epub, odt (odf text), LaTeX, pdf, info, and SQL (PostgreSQL and SQLite), which share numbered text objects ("object citation numbering") and the same document structure information. For more see: ‹http://www.jus.uio.no/sisu›
-a [filename/wildcard]
produces plaintext with Unix linefeeds and without markup, (object numbers are omitted), has footnotes at end of each paragraph that contains them [ -A for equivalent dos (linefeed) output file] [see -e for endnotes]. (Options include: --endnotes for endnotes --footnotes for footnotes at the end of each paragraph --unix for unix linefeed (default) --msdos for msdos linefeed)
-b [filename/wildcard]
see --xhtml
--color-toggle [filename/wildcard]
screen toggle ansi screen colour on or off depending on default set (unless -c flag is used: if sisurc colour default is set to 'true', output to screen will be with colour, if sisurc colour default is set to 'false' or is undefined screen output will be without colour). Alias -c
--concordance [filename/wildcard]
produces concordance (wordmap) a rudimentary index of all the words in a document. (Concordance files are not generated for documents of over 260,000 words unless this limit is increased in the file sisurc.yml). Alias -w
-C [--init-site]
configures/initialises the shared output directory files (config files such as css and dtd files are not updated if they already exist, unless the modifier is used). -C --init-site configures/initialises the site more extensively than -C on its own: shared output directory files are force updated, so existing shared output config files such as css and dtd files are updated if this modifier is used.
-CC
configures/initialises the shared output directory files; the equivalent of -C --init-site. More extensive than -C on its own: shared output directory files are force updated, so existing shared output config files such as css and dtd files are updated if -CC is used.
-c [filename/wildcard]
see --color-toggle
--dal [filename/wildcard/url]
assumed for most other flags, creates new intermediate files for processing (document abstraction) that is used in all subsequent processing of other output. This step is assumed for most processing flags. To skip it see -n. Alias -m
--delete [filename/wildcard]
see --zap
-D [instruction] [filename]
see --pg
-d [--db-[database type (sqlite|pg)]] --[instruction] [filename]
see --sqlite
--epub [filename/wildcard]
produces an epub document, [sisu version 2 only] (filename.epub). Alias -e
-e [filename/wildcard]
see --epub
-F [--webserv=webrick]
see --sample-search-form
--git [filename/wildcard]
produces or updates markup source file structure in a git repo (experimental and subject to change). Alias -g
-g [filename/wildcard]
see --git
--harvest *.ss[tm]
makes two lists of sisu output based on the sisu markup documents in a directory: list of author and authors works (year and titles), and; list by topic with titles and author. Makes use of header metadata fields (author, title, date, topic_register). Can be used with maintenance (-M) and remote placement (-R) flags.
--help [topic]
provides help on the selected topic, where topics (keywords) include: list, (com)mands, short(cuts), (mod)ifiers, (env)ironment, markup, syntax, headers, headings, endnotes, tables, example, customise, skin, (dir)ectories, path, (lang)uage, db, install, setup, (conf)igure, convert, termsheet, search, sql, features, license
--html [filename/wildcard]
produces html output, segmented text with table of contents (toc.html and index.html) and the document in a single file (scroll.html). Alias -h
-h [filename/wildcard]
see --html
-I [filename/wildcard]
see --texinfo
-i [filename/wildcard]
see --manpage
-L
prints license information.
--machine [filename/wildcard/url]
see --dal (document abstraction level/layer)
--maintenance [filename/wildcard/url]
maintenance mode files created for processing preserved and their locations indicated. (also see -V). Alias -M
--manpage [filename/wildcard]
produces man page of file, not suitable for all outputs. Alias -i
-M [filename/wildcard/url]
see --maintenance
-m [filename/wildcard/url]
see --dal (document abstraction level/layer)
--no-ocn
[with --html --pdf or --epub] switches off object citation numbering. Produce output without identifying numbers in margins of html or LaTeX/pdf output.
-N [filename/wildcard/url]
document digest or document content certificate ( DCC ) as md5 digest tree of the document: the digest for the document, and digests for each object contained within the document (together with information on software versions that produced it) (digest.txt). -NV for verbose digest output to screen.
-n [filename/wildcard/url]
skip the creation of intermediate processing files (document abstraction) if they already exist, this skips the equivalent of -m which is otherwise assumed by most processing flags.
--odf [filename/wildcard/url]
see --odt
--odt [filename/wildcard/url]
output basic document in opendocument file format (opendocument.odt). Alias -o
-o [filename/wildcard/url]
see --odt
--pdf [filename/wildcard]
produces LaTeX pdf (portrait.pdf & landscape.pdf). Default paper size is set in config file, or document header, or provided with additional command line parameter, e.g. --papersize-a4 preset sizes include: 'A4', U.S. 'letter' and 'legal' and book sizes 'A5' and 'B5' (system defaults to A4). Alias -p
--pg [instruction] [filename]
database postgresql ( --pgsql may be used instead) possible instructions, include: --createdb; --create; --dropall; --import [filename]; --update [filename]; --remove [filename]; see database section below. Alias -D
--po [language_directory/filename language_directory]
see --po4a
--po4a [language_directory/filename language_directory]
produces .pot and po files for the file in the languages specified by the language directory. SiSU markup is placed in subdirectories named with the language code, e.g. en/ fr/ es/. The sisu config file must set the output directory structure to multilingual. v3, experimental
-P [language_directory/filename language_directory]
see --po4a
-p [filename/wildcard]
see --pdf
--quiet [filename/wildcard]
quiet less output to screen.
-q [filename/wildcard]
see --quiet
--rsync [filename/wildcard]
copies sisu output files to remote host using rsync. This requires that sisurc.yml has been provided with information on hostname and username, and that you have your "keys" and ssh agent in place. Note that the behavior of rsync differs depending on whether -R is used alone or together with other flags. Used alone, the rsync --delete parameter is sent, which is useful for cleaning the remote directory (when -R is used together with other flags, it is not). Also see --scp. Alias -R
-R [filename/wildcard]
see --rsync
-r [filename/wildcard]
see --scp
--sample-search-form [--webserv=webrick]
generate examples of (naive) cgi search form for sqlite and pgsql depends on your already having used sisu to populate an sqlite and/or pgsql database, (the sqlite version scans the output directories for existing sisu_sqlite databases, so it is first necessary to create them, before generating the search form) see -d -D and the database section below. If the optional parameter --webserv=webrick is passed, the cgi examples created will be set up to use the default port set for use by the webrick server, (otherwise the port is left blank and the system setting used, usually 80). The samples are dumped in the present work directory which must be writable, (with screen instructions given that they be copied to the cgi-bin directory). -Fv (in addition to the above) provides some information on setting up hyperestraier for sisu. Alias -F
--scp [filename/wildcard]
copies sisu output files to remote host using scp. This requires that sisurc.yml has been provided with information on hostname and username, and that you have your "keys" and ssh agent in place. Also see --rsync. Alias -r
--sqlite --[instruction] [filename]
database type default set to sqlite, (for which --sqlite may be used instead) or to specify another database --db-[pgsql, sqlite] (however see -D) possible instructions include: --createdb; --create; --dropall; --import [filename]; --update [filename]; --remove [filename]; see database section below. Alias -d
--sisupod
produces a sisupod, a zipped sisu directory of markup files including sisu markup source files and the directory's local configuration file, images and skins. Note: this only includes the configuration files or skins contained in ./_sisu, not those in ~/.sisu. Also see the -S [filename/wildcard] option. Note: this option is tested only with zsh. Alias -S
--sisupod [filename/wildcard]
produces a zipped file of the prepared document specified along with associated images, by default named sisupod.zip; they may alternatively be named with the filename extension .ssp. This provides a quick way of gathering the relevant parts of a sisu document which can then for example be emailed. A sisupod includes the sisu markup source file (along with associated documents if a master file, or available in multilingual versions), together with related images and skin. SiSU commands can be run directly against a sisupod contained in a local directory, or provided as a url on a remote site. As there is a security issue with skins provided by other users, they are not applied unless the flag --trust or --trusted is added to the command instruction; it is recommended that files that are not your own are treated as untrusted. The directory structure of the unzipped file is understood by sisu, and sisu commands can be run within it. Note: if you wish to send multiple files, it quickly becomes more space efficient to zip the sisu markup directory, rather than the individual files, for sending. See the -S option without [filename/wildcard]. Alias -S
--source [filename/wildcard]
copies sisu markup file to output directory. Alias -s
-S
see --sisupod
-S [filename/wildcard]
see --sisupod
-s [filename/wildcard]
see --source
--texinfo [filename/wildcard]
produces texinfo and info file, (view with pinfo). Alias -I
--txt [filename/wildcard]
produces plaintext with Unix linefeeds and without markup, (object numbers are omitted), has footnotes at end of each paragraph that contains them [ -A for equivalent dos (linefeed) output file] [see -e for endnotes]. (Options include: --endnotes for endnotes --footnotes for footnotes at the end of each paragraph --unix for unix linefeed (default) --msdos for msdos linefeed). Alias -t
-T [filename/wildcard (*.termsheet.rb)]
standard form document builder, preprocessing feature
-t [filename/wildcard]
see --txt
--urls [filename/wildcard]
prints url output list/map for the available processing flags options and resulting files that could be requested, (can be used to get a list of processing options in relation to a file, together with information on the output that would be produced), -u provides url output mapping for those flags requested for processing. The default assumes sisu_webrick is running and provides webrick url mappings where appropriate, but these can be switched to file system paths in sisurc.yml. Alias -U
-U [filename/wildcard]
see --urls
-u [filename/wildcard]
provides url mapping of output files for the flags requested for processing, also see -U
--v2 [filename/wildcard]
invokes the sisu v2 document parser/generator. This is the default and is normally omitted.
--v3 [filename/wildcard]
invokes the sisu v3 document parser/generator. Currently under development and incomplete, v3 requires >= ruby1.9.2p180. You may run sisu3 instead.
--verbose [filename/wildcard]
provides verbose output of what is being generated, where output is placed (and error messages if any), as with -u flag provides a url mapping of files created for each of the processing flag requests. Alias -v
-V
on its own, provides SiSU version and environment information (sisu --help env)
-V [filename/wildcard]
even more verbose than the -v flag.
-v
on its own, provides SiSU version information
-v [filename/wildcard]
see --verbose
--webrick
starts Ruby's webrick web server, pointed at the sisu output directories; the default port is set to 8081 and can be changed in the resource configuration files. [tip: the webrick server requires link suffixes, so html output should be created using the -h option rather than -H ; also, note -F webrick ]. Alias -W
-W
see --webrick
--wordmap [filename/wildcard]
see --concordance
-w [filename/wildcard]
see --concordance
--xhtml [filename/wildcard]
produces xhtml/XML output for browser viewing (sax parsing). Alias -b
--xml-dom [filename/wildcard]
produces XML output with deep document structure, in the nature of dom. Alias -X
--xml-sax [filename/wildcard]
produces XML output shallow structure (sax parsing). Alias -x
-X [filename/wildcard]
see --xml-dom
-x [filename/wildcard]
see --xml-sax
-Y [filename/wildcard]
produces a short sitemap entry for the document, based on html output and the sisu_manifest. --sitemaps generates/updates the sitemap index of existing sitemaps. (Experimental, [g,y,m announcement this week])
-y [filename/wildcard]
produces an html summary of output generated (hyperlinked to content) and document specific metadata (sisu_manifest.html). This step is assumed for most processing flags.
--zap [filename/wildcard]
Zap, if used with other processing flags deletes output files of the type about to be processed, prior to processing. If -Z is used as the lone processing related flag (or in conjunction with a combination of -[mMvVq]), will remove the related document output directory. Alias -Z
-Z [filename/wildcard]
see --zap
--no-ocn
[with --html --pdf or --epub] switches off object citation numbering. Produce output without identifying numbers in margins of html or LaTeX/pdf output.
--no-annotate
strips output text of editor endnotes *2 denoted by asterisk or dagger/plus sign
--no-asterisk
strips output text of editor endnotes *3 denoted by asterisk sign
--no-dagger
strips output text of editor endnotes +2 denoted by dagger/plus sign
dbi - database interface
-D or --pgsql set for postgresql -d or --sqlite default set for sqlite -d is modifiable with --db=[database type (pgsql or sqlite)]
--pg -v --createall
initial step, creates required relations (tables, indexes) in existing postgresql database (a database should be created manually and given the same name as working directory, as requested) (rb.dbi) [ -dv --createall sqlite equivalent] it may be necessary to run sisu -Dv --createdb initially NOTE: at the present time for postgresql it may be necessary to manually create the database. The command would be 'createdb [database name]' where database name would be SiSU_[present working directory name (without path)]. Please use only alphanumerics and underscores.
--pg -v --import
[filename/wildcard] imports data specified to postgresql db (rb.dbi) [ -dv --import sqlite equivalent]
--pg -v --update
[filename/wildcard] updates/imports specified data to postgresql db (rb.dbi) [ -dv --update sqlite equivalent]
--pg --remove
[filename/wildcard] removes specified data from postgresql db (rb.dbi) [ -d --remove sqlite equivalent]
--pg --dropall
kills data and drops (postgresql or sqlite) db, tables & indexes [ -d --dropall sqlite equivalent]
The -v is for verbose output.
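The naming rule for a manually created postgresql database, SiSU_ followed by the name of the present working directory, can be checked with a short Python sketch. The function name database_name is hypothetical, and rejecting directory names that contain anything other than alphanumerics and underscores is an assumption based on the advice above.

```python
import os
import re

def database_name(directory="."):
    """Return the database name expected for a given working directory."""
    # Directory name without its path, e.g. /srv/docs/my_books -> my_books
    name = os.path.basename(os.path.abspath(directory))
    if not re.fullmatch(r"[A-Za-z0-9_]+", name):
        raise ValueError("use only alphanumerics and underscores: %r" % name)
    return "SiSU_" + name
```

The result is the name that would be passed to createdb before running the --createall step.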
--update [filename/wildcard]
Checks existing file output and runs the flags required to update this output. This means that if only html and pdf output was requested on previous runs, only the -hp flags will be applied, and only these outputs will be generated this time, together with the summary. This can be very convenient if you offer different outputs of different files, and just want to do the same again.
-0 to -5 [filename or wildcard]
Default shorthand mappings (note that the defaults can be changed/configured in the sisurc.yml file):
-0
-mNhwpAobxXyYv [this is the default action run when no options are given, i.e. on 'sisu [filename]']
-1
-mhewpy
-2
-mhewpaoy
-3
-mhewpAobxXyY
-4
-mhewpAobxXDyY --import
-5
-mhewpAobxXDyY --update
add -v for verbose mode and -c for color, e.g. sisu -2vc [filename or wildcard]
consider -u for appended url info or -v for verbose output
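The default shorthand mappings amount to a lookup table. The Python sketch below shows how a combined flag such as -2vc could be expanded; it is an illustration under the defaults listed above (which can be changed in sisurc.yml), not SiSU's own option handling, and the name expand_shortcut is hypothetical.

```python
# Default shorthand mappings as listed above (configurable in sisurc.yml)
SHORTCUTS = {
    "0": "-mNhwpAobxXyYv",
    "1": "-mhewpy",
    "2": "-mhewpaoy",
    "3": "-mhewpAobxXyY",
    "4": "-mhewpAobxXDyY --import",
    "5": "-mhewpAobxXDyY --update",
}

def expand_shortcut(flags):
    """Expand e.g. '-2vc' into the flag string it stands for."""
    digit, extra = flags[1], flags[2:]          # '-2vc' -> '2', 'vc'
    base, _, modifier = SHORTCUTS[digit].partition(" ")
    # Extra single-letter flags join the short flags; any --modifier stays last.
    return (base + extra + " " + modifier).strip()
```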
In the data directory run sisu -mh filename or wildcard, e.g. "sisu -h cisg.sst" or "sisu -h *.{sst,ssm}" to produce html versions of all documents.
Running sisu (alone without any flags, filenames or wildcards) brings up the interactive help, as does any sisu command that is not recognised. Enter to escape.
This section has been much reduced in content since the release of SiSU which it predated. It provides links to some relevant information.
The description provided in the abandoned U.S. Provisional Patent Application may be of interest as it provides greater detail and by and large supersedes the description given here ‹http://www.jus.uio.no/sisu/sisu_provisional_patent_application_200408› and accompanying diagrams ‹http://www.jus.uio.no/sisu/diagram/sisu_provisional_patent_application_diagram_200408.pdf› and reasons for abandoning ‹http://www.jus.uio.no/sisu/SiSU/2005.html#ppa›
Of particular interest is the ease of streaming documents to a relational database, at an object (roughly paragraph) level and the potential for increased precision in the presentation of matches that results thereby. The ability to serialise html, latex, xml, sql, (whatever) is also inherent in / incidental to the design.
This is the short form of an old summary based on design decisions of 2002. It predates the release of SiSU by a number of years, and should possibly be removed.
A rough chart of the SiSU program structure can be found here:
What follows is a brief description of the chart's components, based on the numbers and letters used in the chart.
A Input text ascii with minimalistic human markup requirements
B Machine intermediate processing output, used by all other modules. For the time being there is a selection: a human readable form; a Ruby marshal dump of the same; and a YAML file. 121 Once the intermediate stage is created, if no changes to the input (i.e. A) are made, it is possible to start with B as input for the program (i.e. to skip stage A and the processing required to get to stage B). This might be of interest if document appearance is modified but not content. Abstract document structure is "created" here, with the pre-processing of the likes of tables, numbering (headings, paragraphs etc.) and endnotes, to ensure that all subsequent processing is based on the same integral document structures.
C Various final publication outputs that all share a common citation numbering system
html - there are possibilities for output based on tables or output based on css
pdf - landscape and portrait currently set to A4 paper size
XML - with a flat structure (SAX), and with a deeper (embedded) structure (DOM)
sql - data in sql database retaining document structure; this is in some ways similar to B output, as it is likely to be further processed for presentation.
1 data feed controller for other program components
2 Creation of intermediate stage B, which contains information related to document structure used by all subsequent data output modules.
3 Parameter extraction. Program takes data related to the document being processed.
4 Relates primarily to appearance/ design, how the site or document should look:
4a Initialised variables used for "typesetting", e.g. margin widths etc., called by the program. These can be set at 3 levels: i. the default program-wide settings, ii. alternative site-wide settings, iii. settings for an individual document
4b Template, includes formatting classes e.g. for appearance of html (whether table based or css) or for pdf output. (For examples of templates at work, see the examples provided earlier of html output in css and tables versions, and of pdf landscape and portrait outputs that result from templates that provide different LaTeX output for the resulting pdfs.)
5 Here we have the logic engines that process B, the intermediate machine generated data, and call upon the relevant templates to produce the different presentations of the document.
5a html module - to construct html documents
5b LaTeX module - to construct LaTeX, which is then fed to pdflatex to produce pdf files
5c SQL module - to import data into PostgreSQL database retaining document structure detail and other detail common to the other output formats. This keeps all information regarding document structure in four relational database tables, one containing semantic and other headers, a second substantive texts, a third endnotes, a fourth pre-formatted texts. (the flexibility exists to carry this further)
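Stage B above mentions a Ruby marshal dump and a YAML file as intermediate forms. The following minimal Ruby sketch shows only the general technique of serialising a structure of numbered objects both ways and reading it back; the shape of the sample data and the helper names are hypothetical and are not SiSU's actual intermediate format.

```ruby
require 'yaml'

# Hypothetical abstract document: a list of numbered objects.
# Plain hashes with string keys are used so that the YAML form stays simple.
def sample_document
  [
    { 'number' => 1, 'kind' => 'heading',   'text' => 'Title' },
    { 'number' => 2, 'kind' => 'paragraph', 'text' => 'First paragraph.' }
  ]
end

# Serialise to both intermediate forms and read them back.
def round_trip(doc)
  marshal_dump = Marshal.dump(doc) # binary Ruby serialisation
  yaml_dump    = doc.to_yaml       # human readable text serialisation
  { marshal: Marshal.load(marshal_dump), yaml: YAML.load(yaml_dump) }
end

restored = round_trip(sample_document)
puts restored[:yaml].first['text']
```

Either restored form could then be handed to an output module without re-parsing the marked-up source, which is the point made about starting from B.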
SiSU is written in Ruby and assumes Linux OS (development has been on Debian/Gnu/Linux)
SiSU generates
html output
LaTeX output, then uses LaTeX (and /pdflatex/ LaTeX to pdf) for pdf output
Lout output, then uses Lout to produce postscript (and postscript to pdf conversion), [not currently maintained]
sql output (database feed) e.g. PostgreSQL, making use of Ruby dbi or pgsql modules to be used by PostgreSQL, or sqlite, making use of Ruby dbi or sqlite modules to be used by sqlite
Not required but taken advantage of if available:
tidy (XML, xhtml well formed check)
trang (relaxng, rnc to dtd conversion)
there are other modules ... see this document.
SiSU started as a way to make html manageable, together with the core concept of making text citable through the use of object character numbering. LaTeX/pdf provided a way of making near print quality output, and of demonstrating how conveniently the concept worked across different output formats. Relational database storage using the same concept underscored this, and the concept makes database search results relevant to locating results quickly in all output formats that use object character numbers.
There are a number of data formats and technologies that are of particular interest to SiSU, and to keep an eye on more generally. These links are kept here for convenience. Note that whilst all the technologies mentioned are of interest in the context of SiSU, not all of them are supported by SiSU.
Organisations
*OASIS* - Organization for the Advancement of Structured Information Standards, 132 wikipedia entry 133
Information
Organisations
Technologies
Organisations
Licenses
Note: a much more comprehensive history can be gleaned from the Chronology pages, which however also contain all sorts of additional random information and opinion of the author, and, since the release of SiSU as Software Libre under the GPL, from the document changelog.
While working with legal texts in an academic environment, on a site that was first called Ananse, then The International Trade Law Monitor and later still Lex Mercatoria, 263 I was faced with a number of issues, those of interest here being technical. Amongst them was the relatively fast evolution of html (in which text was prepared for the Web), which made it cumbersome to have to continually update text/document representations to reflect the improvements in what was possible with the latest html markup. There was also the fact that some of the strengths of html were limitations in other document representational contexts, e.g. good document rendition across multiple screens was a different problem from ideal paper rendition. Also, within an academic and law environment, one of the limits of html that repeatedly presented as critical with regard to academic writing was the fact that it was not possible to reliably cite the location of content within a document. HTML rendered differently in different browsers; change the font size and it again came out differently. This led to work on figuring out how these limitations could be overcome, which resulted amongst other things in the early development of the object number system, that could be used independently of page numbers to locate text.
The use case came to be scholarly writings and conventions in law, and more generally writings in literature, the humanities and law, and a smaller section of the social sciences.
SiSU came to be through a series of steps which started from seeking to overcome these problems, starting with the recognition that multiple document format types could be generated (and technically updated as need be) from a single lightly structured prepared source text/document, and that these multiple output formats could share a common numbering system for the referencing of text within a document and further, that to achieve this text could be usefully represented as individual objects identified by these object numbers, and these could be the building blocks from which the alternative document representations and formats could be built, to take advantage of many of the individual and distinct native strengths of various primary standard ways in existence, for the convenient representation or extraction of text, each idealised for a different context, amongst them html, XML, ODF, LaTeX, pdf and (SQL type) relational databases.
Seeking to achieve the requirement of minimal effort (in the form of preparation and maintenance) relative to payoff as regards the described objectives: the idea was to have a document structure meta-markup that with as little effort as possible initially and over time (it should be possible to develop (change or add) output formats without having to think about the original source document), was able to the greatest extent possible, to take advantage of as many of the most interesting features available in each of the most important standard document representational methods, viz. html, XML, ODF, LaTeX, PDF and SQL type relational databases, from that common prepared document source, and that resulted in a meaningful common way of identifying text content.
This resulted in: (a) a minimalist/light structured markup from which the primary benefits of multiple document representation types could be generated. 264 Keeping markup/preparation relatively minimalist and easy to remember, and independent of the development/evolution of document output representations, in order to keep document preparation effort to a minimum, both initially and with regard to maintenance over time; (b) having an abstraction layer for the representation of the document, that was generated independently of the prepared source, which represented text as numbered objects that could be utilised in any of the final document output representational forms in a shared/common/similar way for the location of content within a document. 265 Separating markup from abstraction and subsequent outputs meant that the markup syntax and underlying output generating modules could be developed/evolved independently of each other. You could arbitrarily change the markup syntax (or have alternative preparation syntaxes) provided you could generate the abstraction layer, from which subsequent outputs would result. Or you could change the abstraction layer and related output generation modules whilst retaining the markup syntax.
The first technical work that in any way relates to the way SiSU works dates back to earlyish in the history of the site Lex Mercatoria, which was at the time called Ananse (and later the International Trade Law Project and then International Trade Law Monitor). Looking for more convenient ways to manage site content, while at the University of Tromsø, I had a young student, Tommy Johansen, look at it over a summer. I (and Geofrey Armstrong) at the time gathered content for the site. Tommy Johansen wrote some Perl scripts for generating html content, which were used early in the site's history and which were convenient in particular for: (a) producing uniform output, (b) separating code from markup, (c) their ability to produce tables of content, (d) the possibility of matching text in a header to segment text (not yet regular expressions). After Tommy Johansen left, his scripts were used pretty much unchanged for a good while, and though this was before text objects, or object numbers, document abstraction, or any document representation other than html, these were features that were retained by what was to become SiSU.
In 1997/1998 object numbers were introduced to html output, overcoming the problem of the precise location of text within a fixed/published html document. The possibility of using text objects (and object numbers) for other forms of output was conceived around the same time as the introduction of object numbers to html, as it was clear that this system should have wider use across different types of output. 266
In 1999 I was switching from Windows to Gnu/Linux... first Red Hat then SuSE. 267 As far as SiSU was concerned, the program was written in Perl and relatively easy to port. 268
In 2000 I was switching from Perl to Ruby... well that was the end of 2000, November (Dave Thomas' book which I was waiting for from the beginning of the year was published at last, and I finally received my copy). 269
By June 2001 SiSU was generating LaTeX output that was converted to both portrait and landscape pdf that shared the same object numbers as the html output.
In May 2002 tired of waiting for the version dubbed Woody, I was switching to Debian... 270
SiSU search was finally actually implemented in 2002, 271 in the form of the database structure that made object search possible and the ability to populate the database with objects with corresponding object numbers from the same document source as other output formats. I did not have much of an immediate incentive to implement search as I did not have an online database. However, having an implementation and showing it around was the reason for the initial opening of these pages and placing a description of what SiSU did on the Net in November 2002, ‹http://www.jus.uio.no/sisu› and updated regularly if haphazardly 272 since, and a pdf chart/diagram that included the relational database aspect as a feature, which should still be available at ‹http://www.jus.uio.no/sisu/diagram/sisu.chart.pdf› (prepared in 2002). 273
Concordance files, first called "wordmaps", were introduced the same year, 2002. The search front-end has continued to evolve, and screen-shots of that were made in 2004.
In June 2004 an IBM software innovations evaluator (at first reluctantly) met me, (he was busy at the time, though the contact was arranged through an IBM Manager met at a Linux show, who was curious about what a lawyer was doing with Linux and programming, he asked what is it you are doing and said "we [IBM] should have a look at it"), anyhow, the software innovations evaluator had a look at SiSU and gave it a very positive/enthusiastic review (so naturally I thought he was great), this was not a code review, mind, it was a "review"/reaction based on what SiSU did and how it did it, and the implications of it all ... what it meant could be done. To paraphrase, he said:
We have large document management systems. We can search over a hundred thousand documents and tell you that your search criteria is met by say 300 of them, but there is no way we can tell you without going in to each document, where those matches are... once you open a document we can highlight matches.
He wrote a letter I kept and published as a souvenir.
"Ralph Good to meet with you today, I was very impressed with your software.
[colleague's name] - in summary - Ralph has built an application that runs on linux and takes ASCII documents and pulls them apart in to the smallest constituent parts, storing them as XML, PDF and HTML, the HTML are hyperlinked up so the document can be browsed in its full form. the format and text data created is stored in a database.
This has potential in any place that needs the power of full text search whilst holding the structural concepts of the document i.e. legal, pharma, education, research.. which ones we need to figure out, ..."
He suggested I get a software patent. I reluctantly agreed to investigate (that story is told elsewhere).
Subsequent meetings with IBM were odd ;-) 274
Well the person who arranged the original meeting with the "software innovations evaluator", did say that IBM was such a large organisation that different groups were working on different projects and had different interests, and frequently it was a question of meeting the right people; and that there usually were multiple entry points which could be quite different in their interests and responses. Interesting encounters, entertaining mail.
I was an example of a prime beneficiary of Software Libre, and one who had come to understand/know (believe if you prefer) through use that it was technically superior to proprietary software.
In January 2005 SiSU was first released under GPL.
May 2005 first Debian packages for SiSU. I had visited Wookey earlier in the year as a shortcut to building my first Debian package.
In July 2005 at Debconf5, Helsinki, 275 SiSU was first uploaded into Debian, by Gunnar Wolf.
At Debconf5 after talking to various people, it was clarified to me that generating hash sums was a fast and not particularly memory intensive process, so the decision was made to incorporate md5 or optionally sha256 hash sums into the document abstraction representation, as this makes possible several additional/alternative forms of document representation that rely on the hashes for unique identification of objects (also across document collections). Document Content Certificates were introduced shortly afterwards that make use of the hash sums to identify objects - headings, paragraphs, footnotes, images etc. and make it possible to evidence the existence of a document's contents without actually publishing it... or show a summary proving that the document remains unchanged.
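The idea of identifying objects by hash sums can be sketched with Ruby's standard Digest library. This is an illustration of the general technique only: the helpers object_digest and content_certificate are hypothetical names, and the sketch makes no claim about how SiSU normalises text before hashing or how its Document Content Certificates are laid out.

```ruby
require 'digest'

# Hash a single object's text with sha256 (default) or md5.
def object_digest(text, algorithm = :sha256)
  case algorithm
  when :sha256 then Digest::SHA256.hexdigest(text)
  when :md5    then Digest::MD5.hexdigest(text)
  else raise ArgumentError, "unsupported algorithm: #{algorithm}"
  end
end

# Hypothetical "certificate": a list of [object number, digest] pairs that
# can be published without publishing the text itself.
def content_certificate(objects, algorithm = :sha256)
  objects.each_with_index.map do |text, index|
    [index + 1, object_digest(text, algorithm)]
  end
end

objects = ['A heading', 'A first paragraph.', 'A second paragraph.']
content_certificate(objects).each { |number, digest| puts "#{number} #{digest}" }
```

Comparing two such lists shows which numbered objects, if any, have changed between two versions of a document.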
In March 2005 with internationalisation in mind, character representation for source documents was switched over to Unicode UTF-8 ... and as a result output is readily available across most languages in: html, XML and SQL database representation (PostgreSQL and SQLite), tested to be OK even for Chinese... LaTeX / PDF output, and ODF, work across several European languages, but need further implementation work for other languages that are not yet covered.
Open Document Format output was first introduced to a SiSU release late in 2005 (October).
Manifests that summarise the generated output made available, were also introduced late in 2005 as were Zipped versions of SiSU markup containing all related documents and images (sisupod.zip). These latter being a bit interesting as they gather the constituent parts of a document, which include the source document and any images, (and in the case of multilingual documents, may contain multiple language versions of the source document), in a single zipped file, which can be emailed, and which outputs can also be generated from.
In 2006 I got to visit Oaxtepec, Mexico for Debconf6
Alternative XML representations for SiSU markup were introduced in 2006 shortly after Subtech... they provide 3 forms of XML (SAX, DOM and a Node based tree, that can be converted to and from SiSU markup). These work, though they are largely proof of concept and require further work, especially as regards what the XML should most conveniently be.
Since the release of SiSU, code and features have continued to evolve gently... Over the years many "requirements" have been requested, and incorporated, too many to make mention of here, including amongst them things like "canned search" in the sample cgi search forms to fairly complex footnote alternatives, and alternative XML representations of the input text. Since 2005 (SiSU becoming Software Libre), most of these have been mentioned in the changelog, and a few others may be evident from the Chronology pages dating back to 1993.
Wookey has been a Debian mentor (he introduced me to Debian packaging, and did uploads subsequent to the initial upload of SiSU). In recent times the greatest indirect support (i.e. not coding/programming or developing SiSU directly, which has to date been a solo effort running for around 10 years) has come from the young Daniel Baumann, who is amazing in providing feedback, especially in relation to how to package and things technical in Debian, and who has been extremely generous with his time and expertise.
It was not until March 2007 that a sample search database was put online which can be found at ‹http://search.sisudoc.org›
A rule of thumb for SiSU remains that what it does - the idea, and what it means can be done is more beautiful than the code, which is again a lot more beautiful than these descriptive pages... for which there has been little time and attention, but which indeed I return to and have plans to work on.
October 3, 1993 Ananse aka the International Trade Law Monitor and then Lex Mercatoria, is live online from this date.
The origins of SiSU were intertwined with those of the International Trade Law project, first named Ananse (subsequently named the International Trade Law Monitor and then Lex Mercatoria) which was started at the Law Faculty of the University of Tromsø, and had a web presence from this date. From this date the efforts that resulted in SiSU had begun and progress was visible on the Net.
The project, which presented legal content (conventions, treaties related to international commercial law) on the web through the site LexMercatoria (aka. Ananse, The International Trade Law Monitor) and resulted in the exploration of the techniques by which this was best done, started out as a single multi-faceted project which began in 1993 at the University of Tromsø. The activities of providing legal information, and developing content generating technologies were conceptually easily distinguishable, though most of the early history of what became SiSU was shared/common (between the law content, and the programming for the generation of documents) until LexMercatoria, (the law content of the site, and domain) was acquired in 2000 by the International Law Publishers, Cameron May.
Lex Mercatoria is dedicated to the provision of information on international commercial law with subsidiary interests in commerce and (mostly open standard) Net technologies that may be of interest to law academics and professionals worldwide. As such Lex Mercatoria provides information and links related to international commerce and trade law. The LM presents the full texts and where relevant country implementation details of several of the most important conventions and other documents used in international trade and commerce. These materials are presented by subject (e.g. free trade, sale of goods, transport, insurance, payment) and chronologically, and there are information pages on trade related organisations. LM also maintains extensive links to other sites related to the subject of international commerce.
The subsidiary interests result in a rather large scope of interest for which we try to keep a manageable set of links. Lex Mercatoria is interested in global commerce, both traditional and electronic, and in following the use made of the Web and Net for its promotion. It is interested in the legal and technological infrastructure that exists and that is being developed to facilitate global commerce (both traditional and electronic). More generally Lex Mercatoria is also interested in the means by which paper is replaced electronically in commerce and publishing. Lex Mercatoria is particularly interested in the use of Open Standards and in the availability of adequate information on matters related to the conduct of global commerce. As such interests include:
Another attempt to describe Lex Mercatoria's origins and purpose:
Lex Mercatoria was begun in 1993 at the Law Faculty of the University of Tromsø, in Northern Norway. It was originally named Ananse and then the International Trade Law Monitor. It was the first legal website devoted to a particular subject area (admittedly a general and broad one) namely, international trade and commercial law. Lex Mercatoria provides the text of some of the more important treaties, conventions, model laws, rules aimed at harmonizing international trade/commerce, and sets of links to sites that are of interest for (the working of) international commerce. Lex Mercatoria has continued in its original spirit to grow its independent and egalitarian set of link collections in response to a continuous exploration of the use and implications of the Net for international commercial law, international commerce and publishing. Recognising the problems for information management resulting from the glut of information available on the web an attempt is made to organise and restrict the links provided to those that are likely to be most useful in the area targeted.
Lex Mercatoria is particularly interested in uses made of the Net (both in international commercial law and in technology related to electronic commerce) for the provision and development of: open (and harmonizing) standards; and for readily available deep and accurate information.
Always remembering that we are a small unit and will continue to do what we can, we have defined our objective broadly and generously as being:
"To investigate the potential of W3 as an information resource, with regard to legal research and education. This we plan to do taking a practical example, - focusing on international trade law as a limited and vitally important area of law that is of global interest". [This we shall pursue as far as we are able.]
This statement of "our objective" dates back to the project's conception in 1993. It ought now be moderated, but its spirit remains unaltered. Within this time span The Web has proven its worth, independently of any individual's efforts or investigations - its creators apart.
We however have multiple objectives, which include:
The area of attention of Lex Mercatoria has expanded somewhat with the developments in use of the Net as they pertain to international commerce, a short description is attempted in the next section.
The history and more general information on LexMercatoria may be found at ‹http://www.lexmercatoria.org/› or ‹http://www.jus.uio.no/lm/› its home pages, or more specifically off information pages on the site ‹http://www.jus.uio.no/lm/lm.information/toc.html›
July - August, 1994 The first steps towards automation on the Trade Law Monitor, a number of Perl scripts, for presentation of convention texts by Tommy Johansen.
January 1995 We were visited by the Director Professor Nicholas Triffin and Executive Secretary Albert Kritzer, of the Institute of International Commercial Law (IICL), Pace University School of Law. The IICL, under the direction of Professor Albert Kritzer, are engaged in a Project on the United Nations Convention on Contracts for the International Sale of Goods. Professor Kritzer has significant publications on this Convention.
This visit was the most important event to happen to the International Trade Law site at the time, and generated a lot of positive press.
11th April 1995 In this volume of PC Magazine 276 our "Trade Law Library Page" 277 was selected as one of PC Magazine's top 100 Web Sites. The one other law site selected was the Legal Information Institute at Cornell University. 278
16th June 1995 New/ Reorganized, less cluttered: International Trade Law - Home Page All work from this time is done on a "new" second server, which remains unofficial. The original server still open to the public. An attempt is made to maintain both servers. The intention is that data transfer and mirroring between the new NT and original UNIX HP server running NCSA Mosaic Server should be seamless.
August, 1995 Extensively listed by the Yale University United Nations Scholars' Workstation August 12, 1995: (1) Decision made to ensure that the ITL site is portable and not tied physically to any given location. (2) Decision to transfer from UNIX to Windows NT platform. All substantial additions and changes since mid-June have been on this server.
17th November 1995 "Evaluation of the ITL" 279 positive Project evaluation of the ITL (International Trade Law Project) by Professor Olav Torvund, of the Norwegian Research Center for Computers and Law, for the Information Technology, Oslo. ITL presentation of International Trade Law materials on The Internet using World Wide Web. This also gives a history of the effort.
Ralph Amissah - SubTech: attended the "Fourth International Conference on Substantive Technology in Law School and Law Practice", hosted by the University of Quebec, Montreal.
August, 1996 US Library of Congress - complimentary remarks on the work up to this time.
US Library of Congress 280 "Guide to Law Online Linking Page to:
INTERNATIONAL TRADE LAW SITES
INTERNATIONAL TRADE LAW Treaties, etc. (from Tromsø, Norway) This superb site created by Ralph Amissah and hosted by the University of Tromsoe, in Norway, is one of the very finest law Web sites in the world. It provides an extensive list of international trade conventions and related instruments, including rules and model Laws, and often provides hypertext access to the full texts. This basic list is arranged by decades, but the component lists, arranged by topic, may often be more useful, and may be accessed directly ..." [bold text added for emphasis] verified 06/1997 (till 02/2001)
February 7, 1997 "On the Net and the liberation of information that 'wants' to be free" 281 published on ITL - updated 17 and submitted for paper publication. Work paper contributed to the publication prepared in commemoration of the 10th Anniversary of the Law Faculty of the University of Tromsø. 282 Now I must confess that I did not know about the FSS or OSS at the time, and it would have been a good thing/useful to incorporate these ideas.
This article was attached as part of an official submission to the judges of the Washington Supreme Court and Court Commissioner's Office, prepared by Mr. Bradley Hillis; Office of the Administrator for the Courts, State of Washington, U.S. in October 1997.
March 14, 1997 Rudimentary Electronic Citation System and Electronic ID or Document Verification System complete. Presentation of Article "On the Net..." provided as a practical example, the substantive text remains the same as that of February 17. Electronic Citation is mentioned in passing within the article that is provided as an example, at e§ 75 and e§ 148. These are the numbers found at the end of "paragraphs" marked ecs § # in a ghost or shadow colour and in superscript, or: "ecs § ..." [this was subsequently changed to { 75 } and then just the number displayed in the page margin]. Although provided as hyperlinks here, these numbers are particularly useful in written citation of the text, as different browsers format html texts differently, and most browsers print the same document out in different font sizes, resulting in different page numbering. After some consideration, it appears to me that for the present time the preferred citation system is the simplest, numbering sequentially all elements of the substantive document - including title, author, headings and paragraphs. Anything else requires decisions as to what may be best and why and how to achieve uniformity of adoption. For example should headings be numbered differently from ordinary text? What about the author's numbering of such headings if any? Should sub-headings be numbered differently from headings, why not? If so how? Until such questions are decided, this is my take, our "ghost 'paragraph' numbering" will be incorporated into future texts presented at this site without other suitable means of referencing.
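The "simplest" citation scheme described above, numbering sequentially every element of the substantive document regardless of its type, can be sketched in a few lines of Ruby. The input shape (pairs of kind and text) and the helper number_objects are hypothetical, chosen only to illustrate the rule; this is not the code used on the site.

```ruby
# Number every element of the substantive text sequentially, treating title,
# author, headings and paragraphs alike. Input is a list of [kind, text]
# pairs (a hypothetical shape used only for this sketch).
def number_objects(elements)
  elements.each_with_index.map do |(kind, text), index|
    { number: index + 1, kind: kind, text: text }
  end
end

document = [
  [:title,     'On the Net and the liberation of information'],
  [:author,    'Ralph Amissah'],
  [:heading,   'Introduction'],
  [:paragraph, 'Some opening text.']
]

number_objects(document).each do |object|
  puts "#{object[:number]} [#{object[:kind]}] #{object[:text]}"
end
```

Because the kind of element plays no part in the numbering, no decisions are needed about how headings or sub-headings should be numbered differently, which is the argument made in the entry above.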
March 1997 At this time am also working on an Electronic ID or Document Verification System, using the hash value of the ascii content of text to ascertain that a published version has not been changed. The tools are already available to do this, but it is a new idea, and challenge for me.
Summer 1997 "Missing Specifications in International Sales, Article 65 of the CISG" published in the Pace International Law Review 283
September 1997 Paper: The Autonomous Contract: Reflecting the borderless electronic-commercial environment in contracting. Presented at the XIII Nordic Conference on Legal Informatics 17th - 19th September 1997 and published in "Elektronisk handel - rettslige aspekter. Nordisk årsbok i rettsinformatikk 1997" (Electronic Commerce - Legal Aspects. The Nordic yearbook of Legal Informatics 1997) edited by Randi Punsvik. ISBN 82 518 3686 7.
October 1997 "On the Net" 284 article was attached as part of an official submission to the Judges of the Washington Supreme Court and Court Commissioner's Office that was prepared by Mr. Bradley Hillis; Office of the Administrator for the Courts, State of Washington, USA.
November 7, 1997 "The Autonomous Contract: Reflecting the borderless electronic-commercial environment in contracting" complete 285
December, 1997 Ralph Amissah - paper: Missing Specifications in International Sales: Article 65 of the United Nations Convention on Contracts for the International Sale of Goods, 9 Pace International Law Review (1997) 239-255, December 1997.
1998 All work done off-line. One site update in April. Updating continued after that offline.
January 1998 Ralph Amissah - Guest Speaker at the Association of American Law Schools Annual Meeting, San Francisco by invitation and under the sponsorship of the National Center for Automated Information Research. Topic: Thinking and Teaching about Law in A Global Context as an Exercise in Common Enterprise. (presentation of "The Trade Law Monitor: Recognizing, Understanding and Taking Advantage of the Discontinuity in Information Dissemination that the Net Represents"). With respect to the substantive technology related to the project, a few ideas related to and implemented in SiSU at the time were presented (the name SiSU came later). These included citation independent of format (independent of page numbers): numbering of everything sequentially as an object, headings, paragraphs etc., except footnotes/endnotes, which belong to the object/paragraph that references them (and which were not sequential as they could be either footnotes or endnotes). The point being that work that was taking place at the time to set rules for distinguishing headings and other objects and numbering them differently added little value and was more of a hindrance than an aid. Also presented was document authentication (which less has been done with, but is also evident in the work).
April 1998 SiSU pre-processing of standard form documents against termsheets to produce Banking legal documentation sets.
somewhere in 1998 Finally understood that OSS and FSS work, and that they have produced and continue to produce some of the very best software in existence today 286
somewhere in 1998 Generated/published a version of "Tainaron - Mail from another city" by Leena Krohn 287 using SiSU 288
from late 1998 - 1999 Extensive work with "rationalising" the design and maintenance of the site - close to 90% of the site as a result is automatically generated from various Perl scripts that identify what to do with each text. Optimisation for more recent versions of the browsers: Opera; Internet Explorer; and Netscape (in roughly that order). Scripts do large batches. Finally an easy/convenient way to handle tables.
February 1999 Decision made: GNU/Linux identified as the most attractive way forward. Perl works as it should on the platform. I have had a good time with NT but it is resource hungry. (More recently I hear MS has plans to do something to address its shortcomings in the Perl department.) 289
a better way
March 1999 Lex Mercatoria site down. Critical hard disk failure. Have been working on a new site - all texts being generated by Perl scripts, which greatly improve the ease of maintenance. A trip to Norway is called for. Question is whether to get the old site back up, or push on to have the new site ready as soon as possible.
8th March 1999 Ralph Amissah - made a Fellow of the Institute of International Commercial Law, School of Law, Pace University, White Plains, NY, USA
17th May 1999 New site is ready, planned hosting in Norway and the US as detailed in the credits at the bottom of the pages.
More efficient techniques used in creating the site.
May 27-29, 1999 Lex Mercatoria back on the air and grateful to the Law Faculty of the University of Oslo for hosting the site. Somewhat streamlined and, for the time being, possibly slightly smaller than we were, but technically superior to anything that we have been (construction of the site is fully automated with only one page being manually constructed) and with the potential to become better yet. At this time the home page is the only manually generated page on the site, which is once again hosted on a UNIX platform (Sun Solaris running Apache) which happens to be what the University of Oslo uses.
May Scripts (numbering system etc.) have been used at the request of Albert Kritzer and Richard Hainebach to produce a Kluwer text: Uniform Law for International Sales, Sales under the 1980 United Nations Convention, Third Edition by John O. Honnold, Schnader Professor of Commercial Law Emeritus University of Pennsylvania, Secretary, UNCITRAL, and Chief, U.N. International Trade Law Branch, 1969 - 1974, Kluwer Law International. Also kindly made available by Kluwer for testing of scripts: International Project Finance by Hoffman. At some point content from the Trade Law Project (prepared by our scripts) is noticed within the Kluwer Arbitration site; we did not have a problem with this, but the direction of content flow should remain clear.
2nd June 1999 LexMercatoria regenerated with first set of "bugs" cleared; most documents should now have titles, which are required for meaningful query results from the search engine. (Any fresh bugs will be corrected in the next update.)
14th July 1999 There has been quite an extensive update of the site though much remains to be done. For a trial period of three weeks we will try to wean you off our old home page and trust you will be able to find your way about our new one. If your browser supports redirection, you will be redirected to the auto-generated page one minute after the old home page has been fully loaded. Unless there is good reason to reconsider we are likely to phase out the old home page, in time.
Download times for the site would improve considerably if we dropped the use of tables on long documents, and we are considering this. The difference is particularly noticeable if you (like myself at present) are not amongst the privileged with broadband Net access. There are bound to be a few bugs. Not all files have yet been transferred from the old site to the new, though the new site contains a more up to date set of documents. Our old file system was insensitive to case whereas the new file system is case sensitive, so some links may not yet be fully compliant. Patience, these and any other issues will be addressed.
6th December 1999 Another new interface for the site is under test, the result of another generation of improvement in our site building tools (collectively fondly nicknamed SiSU). Information on the text presentations and navigation is available 290 . There is much greater consistency in presentation, and viewing should have been enhanced and (for the most part) made faster across most graphical browsers and platforms. What we unfortunately do not provide examples of, and so you will not see, is that it is particularly well suited to the electronic publication of books, and has been tested on several legal academic and practitioners' texts of over 500 pages. In parts of the site there are likely to be some "bugs"; these, however bad they look, should from a technical standpoint be minor to correct.
Status as of year end 1999 The document providing information on the text presentations and navigation contains a summary of the year from that perspective which is copied below:
The site has undergone a facelift for the Millennium, but in most respects our focus with regard to the presentation of documents has remained the same. We hope it results in an improved user experience.
In 1993 we boldly set out amongst other things:
"To explore, utilize and demonstrate the potential of the new IT mediums insofar as they pertain to our chosen subject area."
We have largely achieved this goal in demonstrating how various complicated legal (and other) documents of different content, structures and sizes can be presented on the Net using simple html.
If we have been limited in the possibilities that we have explored and utilized, our path has been selected by figuring out what could be achieved most effectively/successfully with limited resources. We have stuck to a few basic tools and rules of thumb, and have gained considerable experience in: getting the most out of the basic text markup language of the Web, html, without frills; efficient site management; the selection and effective use of basic tools (an editor, markup languages, scripting languages); and how to efficiently maintain cross platform (server and browser) compatibility in our product, through the selection and careful use of inter-operable and preferably open standards, and focus of effort on (a few of) what we determine to be key complementary technologies. Our approach has been to identify simple, effective and efficient tools and solutions and to get the most out of them. In effect we have been exploring what can be made of technologies that are available to anyone on the Net. We have also kept an eye on other IT technologies that we do not necessarily use but provide for your perusal and benefit through the maintenance of an information technology compendium.
In the construction of this site our primary focus has, since the outset (1993), been on presenting texts using html in a convenient manner. It has in part represented an experiment in how best this might be done for our purposes. The results remain as good as can be found anywhere for publications using html 4.0.
Our aim has been to create, provide and efficiently maintain high quality, usable presentations of texts (legal, academic, practitioner's, & including conventions, rules, contracts) whilst avoiding unnecessary complexity. Indeed, so far this has been achieved using the most basic of markup languages on the Net, plain html, with the help of Perl scripts 291 for its transformation from ascii.
Our 1996 list of design criteria for text presentations has now been met and implemented consistently throughout the site [though a few bugs may still remain]. Whilst most individual requirements set were met as early as 1997, presentations have been continuously improved upon. The rationalisation of how best to achieve consistent presentation across various types of text, and its implementation, is a feature of 1999. 292 An idea of these criteria may be gleaned from the contents of this document.
The year's changes improve the site and provide greater utility from text presentations, including: greater consistency between different types of presentation; improved navigation of the site and individual texts; faster loading and better rendition of texts across different types of browser, the main ones we support being Opera, Internet Explorer, Netscape Navigator, (and we expect Konqueror).
The programs that generate the site have been tested on several books (academic and practitioner's texts) of over 500 pages, and the results are particularly well suited for their electronic presentation. The text navigation and presentation features (generated by the site generation program) come into their own on these longer texts, in which it is easy to appreciate the utility of the resulting document presentations.
So on the technical front we are now, in a sense, free to set new goals, and indeed may look in a number of additional directions. The site has concentrated on making the most of html presentations across most modern browsers, and without making concession to having different presentations for different types of browser. In future we may also present texts in RTF and possibly pdf, but our primary additional focus will be on XML and we will look at xhtml. PHP, being open source and designed for cross-platform functionality, is of interest. We may if requested go back to having (in addition) html presentations without our paragraph numbering. In mentioning these possibilities we perhaps run a bit ahead of ourselves, as far as this text is concerned.
Introduce a navigation page describing how to use the auto-generated pages on Lex Mercatoria ‹http://www.jus.uio.no/lm/navigation/doc.html›
"Always remembering that we remain a small unit and will continue to do what we can."
16th February 2000 First read about Ruby around this date (appears in diary), with the comment "just read of, apparently combines the best features of Perl and Python". 293 Immediately installed Ruby, 294 and started reading ruby-talk. 295
28th April 2000 Ruby Talk item lists my having voted for the Ruby Newsgroup by this date. 296
June 2000 Ralph Amissah - paper Revisiting the Autonomous Contract presented at the Schmitthoff Symposium 2000, Law and Trade in the 21st Century, Legal Problems in International Business at the Dawn of the New Millennium, held by The Centre for Commercial Law Studies, Queen Mary and Westfield College, University of London.
July 2000 Ralph Amissah - at LII, Cornell Law School, (Professor Tom Bruce and Professor Peter Martin) "Summit" for 18 participants on Emerging Public Legal Information Standards, session leader for Site Structuring for an International Audience.
8th July 2000 Lex Mercatoria ‹http://www.lexmercatoria.org/› is acquired by Cameron May, internationally renowned law publishers and conference organizers. Ralph Amissah, the original site author and owner, remains actively involved with the site. The programs (SiSU) that were and continue to be used to generate Lex Mercatoria remain with Ralph Amissah.