GLOSS: the program

From early experiments, I discovered that no single system of tokenising and parsing an input document would work in a sufficiently efficient way for all documents, or indeed for all parts of a single document. For example, a section of a document may be presented as:

Thus processing a document involves one or more modes, and the way the input is broken into tokens is modular in the sense that it depends on the mode in force. Moreover, because of the variety of input document types likely to be encountered, this modularity should not be hard-wired into the processing program but must be supplied externally, as a configuration file or files for each type of document being parsed. Hence the concept of a Modular Vocabulary (MV), which is central to the GLOSS system.
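To make the idea of externally defined modes concrete, here is a minimal sketch in Java. It is not GLOSS code: the class `ModeTokeniser`, its methods, and the notion that a mode is nothing more than a delimiter pattern are assumptions made purely for illustration. The only point it carries over from the text is that the tokenising rules are data handed to the program rather than logic compiled into it.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration only; GLOSS itself reads its modes from MV files.
final class ModeTokeniser {
    // Mode name -> delimiter pattern used to split text while in that mode.
    private final Map<String, Pattern> modes = new LinkedHashMap<>();

    // Registers a mode; in a real system this would come from configuration.
    void defineMode(String name, String delimiterRegex) {
        modes.put(name, Pattern.compile(delimiterRegex));
    }

    // Splits the text according to the rules of the named mode.
    List<String> tokenise(String mode, String text) {
        Pattern delimiter = modes.get(mode);
        if (delimiter == null) {
            throw new IllegalArgumentException("unknown mode: " + mode);
        }
        return Arrays.asList(delimiter.split(text));
    }
}
```

With a "words" mode defined as `\s+` and a "cells" mode defined as `\s*\|\s*`, the same tokeniser treats running prose and a row of bar-separated cells differently without any change to the program itself.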

The GLOSS program itself is a command-line Java application. It reads a configuration (MV) file that defines the vocabularies for the transformation, and then transforms a plain text input file into a marked-up XML output file.

The external configuration files, or MV files, are also written in XML and work in a way reminiscent of XSLT. GLOSS uses the file extension .mv for such files. To enable the maximum possible re-use of code, MV files usually include sets of modes from various other files, and GLOSS uses the file extension .modes to denote a set of modes, i.e., a file to be included from an MV file.

The text input and XML output files may have any names you wish, but by convention an input file for GLOSS should have the suffix .gloss, and the rest of the filename should indicate the resulting output file name. So file.xml.gloss is a file created with a text editor which, when processed, produces the XML file file.xml. On operating systems that allow only a single dot in filenames, I propose that names such as file_xml.gloss be used instead.
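The naming convention can be expressed as a small helper. The sketch below is an assumption for illustration, not part of GLOSS: the class `GlossNames` and the method `outputName` are hypothetical, and treating the last underscore as the stand-in for the dot is my reading of the single-dot proposal. It expects a bare filename without any directory part.

```java
// Hypothetical helper; not part of the GLOSS distribution.
final class GlossNames {
    private static final String SUFFIX = ".gloss";

    private GlossNames() {}

    // Derives the output filename implied by a .gloss input filename.
    static String outputName(String inputName) {
        if (!inputName.endsWith(SUFFIX) || inputName.length() == SUFFIX.length()) {
            throw new IllegalArgumentException("not a .gloss filename: " + inputName);
        }
        String stem = inputName.substring(0, inputName.length() - SUFFIX.length());
        if (stem.indexOf('.') >= 0) {
            // Usual case: file.xml.gloss names the output file.xml directly.
            return stem;
        }
        // Single-dot case (assumed): the last underscore stands in for the dot.
        int underscore = stem.lastIndexOf('_');
        if (underscore > 0 && underscore < stem.length() - 1) {
            return stem.substring(0, underscore) + "." + stem.substring(underscore + 1);
        }
        // No extension can be inferred, so the stem is returned unchanged.
        return stem;
    }
}
```

Under these assumptions, both `file.xml.gloss` and `file_xml.gloss` map to `file.xml`, while a name such as `notes.gloss` carries no output extension and is returned as `notes`.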
