GLOSS: main concepts

GLOSS converts plain text into XML. The plain text may be written specially as input for GLOSS, or may be an existing legacy document. Normally the output from GLOSS will be processed further, for example by XSLT stylesheets.

The main stages carried out by the computer in glossing a text are as follows:

  1. The plain text is broken into units called tokens.
  2. The tokens are scanned by modes of the glosser. A mode configures and invokes the tokenizer to return a sequence of tokens, then processes those tokens.
  3. The glosser builds up an XML dataset, an in-memory representation of the output XML document.
  4. The glosser may optionally transform the XML dataset, for example using an XSLT transformation.
  5. The glosser prints the XML dataset out.
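
The five stages above can be sketched in a few lines of Python. This is only an illustration of the pipeline shape, not GLOSS itself: the function names, the word/punctuation token rule and the element names are all assumptions made up for the example, and a simple tree edit stands in for the optional XSLT step.

```python
import re
import xml.etree.ElementTree as ET

def tokenize(text):
    """Stage 1: break plain text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def build_dataset(tokens):
    """Stages 2-3: scan the tokens and build an in-memory XML tree."""
    root = ET.Element("text")
    for tok in tokens:
        # Hypothetical rule: word-like tokens become <word>, the rest <punct>.
        tag = "word" if tok[0].isalnum() or tok[0] == "_" else "punct"
        ET.SubElement(root, tag).text = tok
    return root

def transform(root):
    """Stage 4 (optional): post-process the tree; here, drop punctuation."""
    for node in root.findall("punct"):
        root.remove(node)
    return root

def print_dataset(root):
    """Stage 5: serialise the tree as textual XML."""
    return ET.tostring(root, encoding="unicode")

xml_out = print_dataset(transform(build_dataset(tokenize("Hello, world!"))))
print(xml_out)
```

Keeping each stage as a separate function mirrors the description: the transformation stage can be skipped or swapped without touching the tokenizer or the printer.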

GLOSS has its own built-in tokenizer that recognises certain kinds of tokens, including XML names, numbers, base64 sections, individual characters and strings. The kinds of tokens recognised are also context dependent, so the tokenizer often works differently at different stages of the process, depending on which mode the system is in.
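
To illustrate what a context-dependent tokenizer means in practice, here is a minimal Python sketch. The mode names ("prose" and "chars"), the token kinds and the regular expressions are hypothetical and chosen only for the example; the real GLOSS tokenizer defines its own token kinds and configuration.

```python
import re

# Hypothetical token patterns per mode; GLOSS's real token kinds differ.
PATTERNS = {
    # In "prose" mode, names and numbers are recognised before single characters.
    "prose": re.compile(r"(?P<name>[A-Za-z_][\w.-]*)|(?P<number>\d+)|(?P<char>\S)"),
    # In "chars" mode, every non-space character is its own token.
    "chars": re.compile(r"(?P<char>\S)"),
}

def tokenize(text, mode):
    """Return (kind, value) pairs recognised under the given mode."""
    return [(m.lastgroup, m.group()) for m in PATTERNS[mode].finditer(text)]

prose_tokens = tokenize("x1 = 42", "prose")
char_tokens = tokenize("x1 = 42", "chars")
print(prose_tokens)
print(char_tokens)
```

The same input yields three tokens in one mode and five in the other, which is the sense in which the tokenizer's behaviour depends on the current mode.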

Amongst the token types GLOSS uses are individual characters and user-defined punctuation combinations, so in principle a very wide variety of text files can be parsed by the system. Not every transformation will be feasible to write or to operate, however: GLOSS was designed with certain specific transformations in mind. If what you need cannot be done in GLOSS, there are other options, such as writing your own tokenizer or a text preprocessor, or applying XSLT or a similar transformation afterwards.

When a mode receives a token it can handle, its actions typically involve adding data to the DOM representation of the XML being built up in memory. Quite complex data constructs can be added from a single token. The processing of one token may also involve entering other modes and scanning one or more subtokens. A mode may complete its set of actions in a number of ways, described elsewhere, including a return command and a limit on the number of tokens allowed.
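
The behaviour of a mode can be sketched as a function that consumes tokens and adds nodes to a tree. The Python below is an assumption-laden illustration, not GLOSS syntax: the `list_mode` function, the bracket tokens, the `limit` parameter and the element names are all hypothetical. It shows the three ideas from the paragraph above: adding data per token, entering a nested mode for subtokens, and finishing either by an explicit return token or by a token limit.

```python
import xml.etree.ElementTree as ET

def list_mode(tokens, parent, limit=None):
    """Consume tokens, adding nodes to parent; return the unread tokens."""
    handled = 0
    while tokens and (limit is None or handled < limit):
        tok = tokens[0]
        if tok == "]":
            # Explicit return from this mode.
            return tokens[1:]
        if tok == "[":
            # Enter a nested mode to scan the subtokens.
            sub = ET.SubElement(parent, "list")
            tokens = list_mode(tokens[1:], sub)
        else:
            # Ordinary token: add data to the tree.
            ET.SubElement(parent, "item").text = tok
            tokens = tokens[1:]
        handled += 1
    return tokens

root = ET.Element("list")
rest = list_mode("a [ b c ] d".split(), root)
xml_out = ET.tostring(root, encoding="unicode")
print(xml_out)

# With a token limit, the mode stops early and leaves the rest unread.
limited_root = ET.Element("list")
unread = list_mode(["a", "b", "c"], limited_root, limit=2)
```

Returning the unread tokens lets the calling mode carry on from where the nested mode stopped.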

The DOM representing the XML dataset in memory is stored using a special XML vocabulary designed to represent the required syntactical details of a textual XML document in XML itself. (This is necessary because in-memory XML datasets on their own do not uniquely determine any particular textual representation.) The GLOSS printer class recognises the special representation tags and transforms them to a textual representation in the intended way.
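
The parenthetical point can be checked with Python's standard `xml.etree.ElementTree` module, which is used here purely as an illustration and is not part of GLOSS. Two documents that differ in quoting style, empty-element form and the use of CDATA parse to the same in-memory tree, so the tree alone cannot say which text to print. The `shape` helper is a made-up name for the example.

```python
import xml.etree.ElementTree as ET

# Two different textual documents...
a = ET.fromstring('<p class="x"><br/>1 &lt; 2</p>')
b = ET.fromstring("<p class='x'><br></br><![CDATA[1 < 2]]></p>")

def shape(e):
    """Reduce an element to a comparable tuple of everything the tree stores."""
    return (e.tag, e.attrib, e.text, e.tail, [shape(c) for c in e])

# ...that are indistinguishable once parsed.
same = shape(a) == shape(b)
print(same)
```

A vocabulary that records such details explicitly, as described above, is one way to let the printer reproduce the intended textual form.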

Before the XML dataset in memory is printed, it can be processed further with XSLT. This happens in the standard HTML processing, for example, to resolve cross-references and generate a menu bar. XSLT allows several transformations that would be impossible in GLOSS alone: GLOSS's job is to generate XML in as straightforward a way as possible for further processing with standard XML tools.

This page is copyright. Web page design and creation by GLOSS.