Modes

1 Introduction

A mode describes the current execution pattern of the glosser, and at all times the glosser is operating in some particular mode. Normally, the execution pattern is to scan for one of the acceptable tokens and process it according to a table of actions defined in the content of the <mv:mode> element. When a token has been processed, the input is scanned for the next token. This can carry on indefinitely but is usually concluded when a <mv:return/> command is executed.

The basic structure of a mode is as follows.

<mv:mode name="..." accept="..." ...>

  <mv:match type="..." value="...">
    <!-- actions -->
  </mv:match>

  <mv:match ...>
    ...
  </mv:match>

  ...

</mv:mode>

Here, the list of <mv:match> elements specifies the action to be taken for each token type or value. This is a simplified picture: as well as <mv:match> elements, a mode may contain <mv:include> elements and a <mv:default> element.

The accept attribute lists a number of allowed token types, in a definite order of priority, for the scanner to look for. The allowed token types are documented fully in the tokens DTD documentation, but include:

elt, an xml element name, such as banana.
attr, an attribute name token, such as @value.
punc, a user-defined punctuation sequence, such as --.
uc, an arbitrary unicode character, c.
ns, any unicode character other than a whitespace character, c.

and so on. More information on accept="..." can be found below.

When a token has been obtained, the mode looks for the first match in its list that matches the token and passes control to the contents of this mv:match element. More information on matches can be found below.

2 The accept attribute

As mentioned, the accept attribute to <mv:mode> defines a list of token types and punctuation that should be scanned for when in this mode. The value of the accept attribute is a list of token types or punctuation combinations, separated by |. The restrictions are: (1) the first character of the value may not be $ and (2) the punctuation sequence cannot contain a |. (Values of accept starting with $ are reserved for possible future expansion.)

The available token types are:

attr, attribute
b64, base 64 data
char, character
cref, character reference
elt, element
eos, end-of-stream
eref, entity reference
fp, floating point
hex, hexadecimal
int, integer
label, label
ns, nonspace unicode character
pdef, parameter definition
pi, processing instruction
pref, parameter reference
uc, unicode character
uri, uri

Blank space and comments will be skipped unless uc is given as one of the types in accept="...". Thus the accept attribute is important to drive the token scanner correctly. The presence of the label type also alters the way tokenization takes place; this is more specialist (and more experimental) and is not described here.

For example, accept="elt|attr|[[|]]|ns" indicates that the tokenizer should accept elt tokens, attr tokens, punc tokens with value [[, punc tokens with value ]], or non-space unicode characters, and look for such tokens in this order. Since uc is not specified, whitespace and comments in the input should be skipped.

The token types are documented more fully in the Token Types DTD. Note that ns is not strictly a distinct token type, but a version of uc that allows the tokenizer to skip space characters. Thus, in the above example, if an ns is matched, the token is actually returned as a token of type uc. Also, the punc type is not listed above because it is not itself written in accept; instead, a user-defined punctuation sequence is written there. If this sequence is found, it is returned by the tokenizer as a punc token. The empty punctuation sequence (usually given at the end of a list of token types, as in accept="...|") is allowed and turns out to be rather useful, as we shall see. Thus accept="" means to skip spaces and comments and only accept the empty punctuation sequence.
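The splitting rules for accept="..." can be illustrated with a small Python sketch. This is not GLOSS code: the function names parse_accept and skips_whitespace are hypothetical, and the sketch assumes that any item that is not one of the token type names listed above is treated as a punctuation sequence.

```python
# Token type names copied from the list in this section.
TOKEN_TYPES = {
    "attr", "b64", "char", "cref", "elt", "eos", "eref", "fp", "hex",
    "int", "label", "ns", "pdef", "pi", "pref", "uc", "uri",
}


def parse_accept(value):
    """Split an accept value into (kind, text) pairs, in priority order."""
    if value.startswith("$"):
        # Values starting with $ are reserved for possible future expansion.
        raise ValueError("accept values starting with $ are reserved")
    items = []
    for part in value.split("|"):
        if part in TOKEN_TYPES:
            items.append(("type", part))
        else:
            # Assumption: anything else, including the empty string,
            # is a user-defined punctuation sequence.
            items.append(("punc", part))
    return items


def skips_whitespace(items):
    """Blank space and comments are skipped unless uc is listed."""
    return ("type", "uc") not in items
```

With this sketch, parse_accept("elt|attr|[[|]]|ns") yields the two token types, the two punctuation sequences [[ and ]], and then ns, in that order, while parse_accept("") yields only the empty punctuation sequence.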

If there is no available token of a type listed in the mode's accept="..." the tokenizer returns the null token to the mode and control is passed back (as if via <mv:return/>) to the parent mode.

3 The use-indentation attribute

The use-indentation attribute has allowed values true and false.

Every token has a depth. If the token is the first token read on its line, its depth is normally its indentation, that is, the column number of its first character; otherwise its depth is 10000000 plus that column number. (Presence of the experimental label token type in the accept attribute changes these rules. This is not discussed here.) A mode has a parent token and looks for child tokens. If use-indentation="true", the depth of every child token must be greater than the depth of the parent token. If no such token can be found, the mode is automatically closed, as if <mv:return/> had been executed. To ignore the depth of tokens, set use-indentation="false". The default is true.
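The depth rule can be sketched in Python as follows. The names token_depth and is_child are hypothetical, columns are assumed to be plain integers, and the special handling of label tokens is not modelled.

```python
# Offset quoted in the text for tokens that are not first on their line.
NOT_FIRST_ON_LINE = 10000000


def token_depth(column, first_on_line):
    """Depth of a token starting at the given column."""
    if first_on_line:
        return column
    return NOT_FIRST_ON_LINE + column


def is_child(parent_depth, child_depth, use_indentation=True):
    """Return True if a token may be read as a child of the parent token."""
    if not use_indentation:
        # use-indentation="false": depth is ignored.
        return True
    # The child must be strictly deeper than the parent.
    return child_depth > parent_depth
```

Under these rules, a token that follows its parent on the same line always counts as deeper than a parent that starts the line, whereas a token that starts a later line counts as a child only if it is indented further than the parent.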

4 The children attribute

A mode counts the number of its children as it is executed. Normally any number of children is allowed, but if children="..." is specified, this sets the maximum number of children allowed. If a token is read that would exceed this number, the new token is put back onto the input and control is passed back to the parent mode, as if an <mv:abort/> command had been executed.

See also the discussion of the children-total="...", children-adjust="...", and children-remaining="..." attributes of the mv:match elements below.

5 The mv:match element

A mode typically contains a list of mv:match elements (and may also contain <mv:default> and <mv:include> elements). Matches are tried in the order they are presented in the mode.

There are two groups of attributes to <mv:match>. The first group defines when a token that has been read matches the specification, so that the actions in this match are the ones to be taken. The second group can be used to adjust the number of child tokens permitted in this mode. These two groups are discussed below.

If a match does take place and there are not too many child tokens, then control is passed to the contents of the <mv:match> element, and the commands there are executed, usually resulting in data being added to the intermediate XML infoset. The commands in the content of the match node can also specify that further grandchildren tokens should be processed in another mode, in which case control will be passed back to this mode after a <mv:return/> command, <mv:abort/> command, when the number of child tokens is exceeded, or if indentation rules so dictate.

It is an error (giving GlossException Illegal token) if a token of a type listed in the mode's accept="..." is found but there is no corresponding <mv:match> for that token.

6 Attributes to mv:match specifying the match

mv:match takes a mandatory attribute type="..." specifying the type of the matching token. The value of this attribute is an item from the above list, except that ns is not a valid type (use uc instead) and punctuation sequences should be specified with type="punc".

To further specify the match, any of the following attributes may also be added. Every such attribute from the following list that is given should agree if a match is to be made. If none of these is given then a match is made on the basis of the token type only.

data="...", matches the token's $d value against the data given in .... The data value is usually the token data as entered, sometimes less leading or trailing delimiters such as ' or ".
prefix="...", matches the token's $p value against the data given in .... The prefix value is usually the part of the fullname before the leading :, or "" if there is no such prefix.
localname="...", matches the token's $n value against the data given in .... The localname value is usually the part of the fullname after the leading :, or the whole of fullname if there is no such prefix.
fullname="...", matches the token's $q value against the data given in .... The fullname value is usually the name part of the data, name of element or attribute, or hexadecimal code point for a unicode character, character entity or char.
value="...", matches the token's $v value against the data given in .... The value is usually the value part of the data, unicode character for a character reference, character entity or char, or the data part of a string, int, fp, hex, etc.
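The way these attributes combine, together with the rule that matches are tried in order, can be sketched in Python. This is only an illustration: tokens and matches are represented here as plain dictionaries, the names matches and first_match are hypothetical, only plain equality is modelled (not the extended $ syntax), and <mv:default> and <mv:include> are left out.

```python
# Maps each match attribute to the token field it is compared with
# ($d, $p, $n, $q, $v in the text).
MATCH_KEYS = {
    "data": "d",
    "prefix": "p",
    "localname": "n",
    "fullname": "q",
    "value": "v",
}


def matches(spec, token):
    """Return True if every attribute given in spec agrees with the token."""
    if spec["type"] != token["type"]:
        return False
    return all(
        token.get(field) == spec[attr]
        for attr, field in MATCH_KEYS.items()
        if attr in spec
    )


def first_match(specs, token):
    """Return the index of the first matching spec, or None if none match."""
    for index, spec in enumerate(specs):
        if matches(spec, token):
            return index
    return None
```

For example, a spec giving only type="elt" matches every element token, so it acts as a catch-all and should come after more specific specs such as one that also gives localname="banana".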

7 Extended syntax for matches

To specify a set of alternatives or ranges of values an extended syntax is available for the above attributes. To use it give $ as the first character in data="$...", prefix="$...", localname="$...", fullname="$...", value="$...". Then separate items with ||, as in localname="$apple||orange||banana". In the case of data and value and the types cref, char, uc, fp, int, hex, items separated by || can also be ranges, with start and end separated by --, as in value="$A--Z||a--z||0--9".

To use the characters |, -, \ in the matched data, escape them with \ as in data="$'||£||$||\-||\|".

These extended matches are experimental. TO DO: some things don't work, e.g., data="$'||£||\|||\-||\|" splits in the wrong place; also more documentation on extended matches is needed.
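The documented syntax (not the behaviour of the current implementation, which is noted above to split some inputs incorrectly) can be sketched in Python. The names parse_extended and finish are hypothetical. The sketch assumes that a lone unescaped | or - is taken literally, and it does not check that ranges are only used with data or value on the token types that allow them.

```python
def finish(parts):
    """Turn the collected pieces of one item into a literal or a range."""
    if len(parts) == 1:
        return ("lit", parts[0])
    return ("range", parts[0], parts[1])


def parse_extended(value):
    """Parse an extended match value such as "$A--Z||0--9" into items."""
    if not value.startswith("$"):
        # Not extended syntax: the whole value is one literal.
        return [("lit", value)]
    text = value[1:]
    items = []
    parts = [""]  # one piece, or two once an unescaped -- has been seen
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            # Escaped character: take the next character literally.
            parts[-1] += text[i + 1]
            i += 2
        elif text.startswith("||", i):
            # Unescaped || separates alternatives.
            items.append(finish(parts))
            parts = [""]
            i += 2
        elif text.startswith("--", i) and len(parts) == 1:
            # Unescaped -- separates the start and end of a range.
            parts.append("")
            i += 2
        else:
            parts[-1] += text[i]
            i += 1
    items.append(finish(parts))
    return items
```

Scanning left to right and consuming escapes first means that \| followed by || is read as a literal | and then a separator.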

8 children-YYY attributes for matches

The three children-YYY="..." attributes modify the allowed number of children tokens after the token has been read but before the number of children has been checked and before any of the match commands are executed.

These attributes are as follows.

children-total="XXX", set the total number of child-tokens for this mode to XXX.
children-remaining="XXX", set the total number of child-tokens for this mode to be the required value so that a further XXX children (not including the current token) are allowed.
children-adjust="XXX", increase the total number of child-tokens for this mode by XXX (which is normally positive, but could be negative if the number of child tokens is to be decreased).

In all cases, the value of the attribute is a base 10 integer string.
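The arithmetic behind the three attributes can be sketched in Python. The function name apply_children_attrs is hypothetical, and several points are assumptions rather than documented behaviour: that the running count already includes the current token when the attributes are applied, that the attributes are applied in the order total, remaining, adjust if more than one is given, and that adjusting an unlimited mode leaves it unlimited.

```python
def apply_children_attrs(limit, seen, attrs):
    """Return the new child limit after a token has been read.

    limit: current maximum number of children, or None for unlimited.
    seen:  child tokens read so far, including the current one (assumption).
    attrs: dict of children-* attribute strings from the matching element.
    """
    if "children-total" in attrs:
        # Set the total outright.
        limit = int(attrs["children-total"])
    if "children-remaining" in attrs:
        # Allow exactly this many further children after the current token.
        limit = seen + int(attrs["children-remaining"])
    if "children-adjust" in attrs and limit is not None:
        # Raise (or, with a negative value, lower) the existing total.
        limit += int(attrs["children-adjust"])
    return limit
```

For instance, if two child tokens have been read and the current match carries children-remaining="1", the total becomes 3, so exactly one more child is accepted before the mode ends.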

9 The mv:default element

The mv:default element is like mv:match except that a match is guaranteed. mv:default does not take type, value, etc., attributes, but does take the children-total="XXX", children-remaining="XXX", and children-adjust="XXX" attributes with the same meaning.

10 The mv:include element

As well as <mv:match> elements, the mode may contain <mv:include mode="modename"/> elements, meaning: act as if all the content of the mode with name "modename" is included at this point. The attribute mode="modename" is fully interpolated, so modename can be evaluated from parameter values. However, for compatibility with future versions of GLOSS, it is recommended that this feature be used only for parameters whose value is known at set-up time and not for parameters that change value during execution. (This is for efficiency, as in future versions modes may be hashed at the set-up stage.)

The mv:include element is empty, and has one mandatory attribute mode="modename" for the mode to be included. Note that the included mode's accept and other attributes are all ignored, so the include mechanism allows a set of modes to be used in two or more different contexts and ways.
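The effect of <mv:include> can be pictured as splicing the included mode's content into the including mode. The Python sketch below is only an illustration: modes are represented as hypothetical dictionaries, the function name expand is invented, and both the support for nested includes and the check for circular includes are assumptions, not documented behaviour.

```python
def expand(modes, name, active=()):
    """Return the content of a mode with every include replaced by content.

    modes:  dict mapping mode names to {"accept": ..., "content": [...]},
            where each content entry is a (kind, payload) pair.
    active: names of modes currently being expanded, to detect cycles.
    """
    if name in active:
        raise ValueError("circular include of mode " + name)
    content = []
    for kind, payload in modes[name]["content"]:
        if kind == "include":
            # Only the content is spliced in; the included mode's accept
            # and other attributes are ignored.
            content.extend(expand(modes, payload, active + (name,)))
        else:
            content.append((kind, payload))
    return content
```

Because only the content is copied, the same included mode can serve including modes that have quite different accept values.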

11 Summary: cases when a mode ends

A mode ends when:

A <mv:return/> command is executed.
An <mv:abort/> command is executed. In this case the last token read is un-got.
A null token is read since no token of one of the correct types given in accept="..." is available.
A token is read, but its depth is less than or equal to the parent token's depth and the attribute use-indentation="true" is present or implied in the <mv:mode>. In this case the last token read is un-got. The default for use-indentation is true.
The attribute children="number" (representing the maximum number of direct child tokens allowed for this mode) is present or implied in the <mv:mode> or has been dynamically modified, and this number of child tokens has already been scanned by the mode. In this case the last token read is un-got.

The depth of a token is usually the column number at which it starts (column numbers are counted from 0) if it is the first token on its line, or the column number plus 10000000 if it is not the first token on that line. (This works in all cases except where labels are present. The depth of a label is the depth of the first non-label token following it, and for the sake of this computation labels are considered as whitespace. See the Tokenizer module for more details.)

12 Content of mv:default and mv:match elements

The elements mv:default and mv:match contain a list of commands or other XML data to be added to the intermediate XML infoset at the current position.


This page is copyright. Web page design and creation by GLOSS.