DTD for XR: XML representation of textual XML files

This software is part of the gloss system. Author: Richard Kaye, 2006-7, copyright reserved. Licence: This software may be used under the conditions of the latest version of the GPL. No warranty.



XML documents exist in many forms in a computer system, notably as in-memory infosets and as textual files. An in-memory XML infoset does not itself determine a unique textual representation or serialization of the XML document, though other applications, especially non-XML text-based applications, may require a particular textual format for an XML file. This DTD sets out to define an accurate XML representation of an external textual XML file.

Valid XML documents matching one of the three main elements here (document, document-fragment and dtd) will be called XR documents. XR documents may exist in any form, such as in memory, or in a file with a particular character encoding, but it is expected that they will in the main be created or transformed in memory, and printed with a special program similar to an XSLT stylesheet. (For technical reasons due to the way character data must be escaped, it is infeasible to write an XSLT stylesheet to print out such data.)

The GLOSS system contains such a program, uk.ac.bham.gloss.XRPrinter. This DTD and its documentation define the relationships and content model of the XR elements and the way a processing program should output each type of node of an XR document.

An additional program, uk.ac.bham.gloss.XRNormalizer, provides a pre-processing stage that transforms an XML document loosely conforming to this specification into one exactly matching the specification.

The current scope and progress of the work here is as follows:

Namespace and prefixes

This DTD is designed to be used as a standalone DTD or to be included in other DTDs. The standard prefix is xr, but this may be redefined using DTD parameter entities in the usual way. This documentation refers to elements in this namespace with the xr: prefix.

xr.nsprefix defines the prefix for these elements; redefine this to use another prefix.

<!ENTITY % xr.nsprefix "xr">

xr.namespace defines the namespace for these elements.

<!ENTITY % xr.namespace "'http://gloss.bham.ac.uk/xmlns/xmlrepresentation'">

xr.prefixed is INCLUDE or IGNORE according to whether these elements should be prefixed.

<!ENTITY % xr.prefixed "INCLUDE">
<![%xr.prefixed;[
  <!ENTITY % xr.nsattr "xmlns:%xr.nsprefix;">
  <!ENTITY % xr.prefix "%xr.nsprefix;:">
  <!ENTITY % xr.nsattrdecl "xmlns:%xr.nsprefix; CDATA #FIXED %xr.namespace;">
]]>
<!ENTITY % xr.nsattr "xmlns">
<!ENTITY % xr.prefix "">
<!ENTITY % xr.nsattrdecl "xmlns CDATA #FIXED %xr.namespace;">
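
As an illustration only, the effect of these parameter entities on element names can be mimicked in a few lines of Python. The function names and keyword arguments below are hypothetical and are not part of the GLOSS software; they simply mirror xr.nsprefix and xr.prefixed.

```python
def qualified_name(local, nsprefix="xr", prefixed="INCLUDE"):
    """Mimic the expansion of %xr.prefix; followed by a local element name."""
    if prefixed == "INCLUDE":
        return f"{nsprefix}:{local}"
    return local


def namespace_attribute(nsprefix="xr", prefixed="INCLUDE"):
    """Mimic the expansion of %xr.nsattr; (the namespace declaration attribute)."""
    if prefixed == "INCLUDE":
        return f"xmlns:{nsprefix}"
    return "xmlns"


print(qualified_name("text"))                      # prefixed form
print(qualified_name("text", prefixed="IGNORE"))   # unprefixed form
```

With the defaults, element names take the xr: prefix; setting xr.prefixed to IGNORE drops the prefix and uses the default-namespace attribute instead.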

Element Names

This is a list of element names defined here. Items on this list describe features in the output XML file.

General elements and elements for XML documents and document-fragments

<!ENTITY % xr.attribute "%xr.prefix;attribute">
<!ENTITY % xr.cdata "%xr.prefix;cdata">
<!ENTITY % xr.character-reference "%xr.prefix;character-reference">
<!ENTITY % xr.comment "%xr.prefix;comment">
<!ENTITY % xr.doctype "%xr.prefix;doctype">
<!ENTITY % xr.document "%xr.prefix;document">
<!ENTITY % xr.document-fragment "%xr.prefix;document-fragment">
<!ENTITY % xr.element "%xr.prefix;element">
<!ENTITY % xr.entity-reference "%xr.prefix;entity-reference">
<!ENTITY % xr.group "%xr.prefix;group">
<!ENTITY % xr.literal "%xr.prefix;literal">
<!ENTITY % xr.processing-instruction "%xr.prefix;processing-instruction">
<!ENTITY % xr.text "%xr.prefix;text">

Additional elements used in DTDs

<!ENTITY % xr.ANY "%xr.prefix;ANY">
<!ENTITY % xr.attlistdecl "%xr.prefix;attlistdecl">
<!ENTITY % xr.calts "%xr.prefix;calts">
<!ENTITY % xr.CDATA "%xr.prefix;CDATA">
<!ENTITY % xr.clist "%xr.prefix;clist">
<!ENTITY % xr.conditional "%xr.prefix;conditional">
<!ENTITY % xr.copt "%xr.prefix;copt">
<!ENTITY % xr.coptrep "%xr.prefix;coptrep">
<!ENTITY % xr.crep "%xr.prefix;crep">
<!ENTITY % xr.dtd "%xr.prefix;dtd">
<!ENTITY % xr.dtdtext "%xr.prefix;dtdtext">
<!ENTITY % xr.elementdecl "%xr.prefix;elementdecl">
<!ENTITY % xr.enumeration "%xr.prefix;enumeration">
<!ENTITY % xr.EMPTY "%xr.prefix;EMPTY">
<!ENTITY % xr.ENTITIES "%xr.prefix;ENTITIES">
<!ENTITY % xr.ENTITY "%xr.prefix;ENTITY">
<!ENTITY % xr.FIXED "%xr.prefix;FIXED">
<!ENTITY % xr.ID "%xr.prefix;ID">
<!ENTITY % xr.entitydecl "%xr.prefix;entitydecl">
<!ENTITY % xr.IDREF "%xr.prefix;IDREF">
<!ENTITY % xr.IDREFS "%xr.prefix;IDREFS">
<!ENTITY % xr.IGNORE "%xr.prefix;IGNORE">
<!ENTITY % xr.IMPLIED "%xr.prefix;IMPLIED">
<!ENTITY % xr.INCLUDE "%xr.prefix;INCLUDE">
<!ENTITY % xr.mixed "%xr.prefix;mixed">
<!ENTITY % xr.name "%xr.prefix;name">
<!ENTITY % xr.ndata "%xr.prefix;ndata">
<!ENTITY % xr.NMTOKEN "%xr.prefix;NMTOKEN">
<!ENTITY % xr.NMTOKENS "%xr.prefix;NMTOKENS">
<!ENTITY % xr.notationdecl "%xr.prefix;notationdecl">
<!ENTITY % xr.notationtype "%xr.prefix;notationtype">
<!ENTITY % xr.PCDATA "%xr.prefix;PCDATA">
<!ENTITY % xr.pentity-reference "%xr.prefix;pentity-reference">
<!ENTITY % xr.pentitydecl "%xr.prefix;pentitydecl">
<!ENTITY % xr.publicid "%xr.prefix;publicid">
<!ENTITY % xr.publiclit "%xr.prefix;publiclit">
<!ENTITY % xr.q "%xr.prefix;q">
<!ENTITY % xr.qq "%xr.prefix;qq">
<!ENTITY % xr.REQUIRED "%xr.prefix;REQUIRED">
<!ENTITY % xr.systemid "%xr.prefix;systemid">

Mark-up and semantics for XML documents and document-fragments

Some parameter entities

Parameter entity xr.reference lists elements allowed for a reference; since such references are not expanded and may be internal characters these are allowed by this DTD in text or attribute values. However, further XML well-formedness and validity constraints may apply to the resulting document.

<!ENTITY % xr.reference "%xr.character-reference;|%xr.entity-reference;">

Parameter entity xr.corpi lists elements allowed more-or-less anywhere in a document; it maps to comment or processing-instruction.

<!ENTITY % xr.corpi "%xr.comment;|%xr.processing-instruction;">

Parameter entity xr.dtd-content lists all elements allowed at top level in a (full) DTD.

<!ENTITY % xr.dtd-content "%xr.comment;|%xr.processing-instruction;|%xr.elementdecl;|

Parameter entity xr.intdtd-content lists all elements allowed at top level in an internal DTD.

<!ENTITY % xr.intdtd-content "%xr.comment;|%xr.processing-instruction;|%xr.elementdecl;|

Parameter entity xr.element-content lists elements allowed in the content of an element.

<!ENTITY % xr.element-content
            %xr.element;|%xr.entity-reference;|%xr.processing-instruction;|%xr.text;" >

document element

The document element is a top-level element representing an XML document and its XML declaration. Attribute xmldecl="include" or "omit" determines if the XML declaration <?xml ... ?> should be present, and attributes version, encoding and standalone provide the additional data for this declaration. When version is required, "1.0" is the default.

The attribute bom provides a suggestion to the processing application whether to insert a Byte-Order-Mark (\uFEFF) in the output. This may be overridden by the application; for example, it is expected that a BOM will always be included where it is mandatory, irrespective of the value of the bom attribute.

Output encoding is UTF-8 by default, but is selected by the encoding attribute. Legal values for encoding depend on the platform, but should include:

US-ASCII: Seven-bit ASCII, a.k.a. ISO646-US, a.k.a. the Basic Latin block of the Unicode character set
ISO-8859-1: ISO Latin Alphabet No. 1, a.k.a. ISO-LATIN-1
UTF-8: Eight-bit UCS Transformation Format
UTF-16BE: Sixteen-bit UCS Transformation Format, big-endian byte order
UTF-16LE: Sixteen-bit UCS Transformation Format, little-endian byte order
UTF-16: Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark

<!ELEMENT %xr.document; ((%xr.corpi;)*,(%xr.doctype;)?,(%xr.corpi;)*,(%xr.element;),(%xr.corpi;)*) >
<!ATTLIST %xr.document;
    bom (include|omit) #IMPLIED
    xmldecl (include|omit) #IMPLIED
    version CDATA #IMPLIED
    encoding CDATA #IMPLIED 
    standalone CDATA #IMPLIED >
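
A minimal sketch of how a printer might build the XML declaration from these attributes is given below. The function name is hypothetical, and the treatment of an absent xmldecl attribute (here, anything other than "include" omits the declaration) is an assumption, since the DTD leaves the attribute #IMPLIED. BOM handling is deliberately left out.

```python
def xml_declaration(xmldecl, version=None, encoding=None, standalone=None):
    """Build the <?xml ... ?> declaration for an xr:document, or "" if omitted."""
    if xmldecl != "include":
        # Assumption: only the explicit value "include" produces a declaration.
        return ""
    fields = [
        ("version", version or "1.0"),   # "1.0" is the documented default
        ("encoding", encoding),
        ("standalone", standalone),
    ]
    body = " ".join(f'{key}="{value}"' for key, value in fields if value)
    return f"<?xml {body}?>"


print(xml_declaration("include", encoding="UTF-8", standalone="yes"))
```

The same function would serve for the dtd and document-fragment elements by simply never passing standalone.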

dtd element

The dtd element is a top level element representing an XML DTD and its xml encoding declaration. Attribute xmldecl="include" or "omit" determines if the encoding declaration <?xml ... ?> should be present and attributes version and encoding provide further data for this declaration. When version is required, "1.0" is the default. Attribute "bom" is used as for the "document" element. Output encoding is UTF-8 by default, but is selected by the encoding attribute as for document.

<!ELEMENT %xr.dtd; ((%xr.dtd-content;)*) >
<!ATTLIST %xr.dtd;
    bom (include|omit) #IMPLIED
    xmldecl (include|omit) #IMPLIED
    version CDATA #IMPLIED 
    encoding CDATA #IMPLIED >

document-fragment element

The document-fragment element is a top level element representing an XML text declaration for a document fragment. Attribute xmldecl="include" or "omit" determines if the encoding declaration <?xml ... ?> should be present and attributes version and encoding provide further data for this declaration. When version is required, "1.0" is the default. Attribute "bom" is used as for the "document" element. Output encoding is UTF-8 by default, but is selected by the encoding attribute as for document.

<!ELEMENT %xr.document-fragment; ((%xr.element-content;)*) >
<!ATTLIST %xr.document-fragment;
    bom (include|omit) #IMPLIED
    xmldecl (include|omit) #IMPLIED
    version CDATA #IMPLIED 
    encoding CDATA #IMPLIED >

doctype element

The doctype element represents the DOCTYPE declaration of an XML document. Attributes define the name of the root element, the PUBLIC identifier and the SYSTEM identifier. The root element name is determined automatically if not present, and it is recommended that it is omitted where possible. The content of the node represents the internal DTD subset (if any). Standard XML rules apply to DTDs. See the section on DTDs for more information.

<!ELEMENT %xr.doctype; ((%xr.intdtd-content;)*) >
<!ATTLIST %xr.doctype;
    public CDATA #IMPLIED 
    system CDATA #IMPLIED >

literal element

The literal element can only contain text and signifies that the literal text be inserted into the output. White space is significant in the content of this element. There is no guarantee that the resulting output text is well-formed XML.

<!ELEMENT %xr.literal; (#PCDATA)>

attribute element

The attribute element provides an attribute to an element, namely the element defined by the parent node of this element. An attribute may be defined many times, in which case it is the first value that takes precedence.

Use as:

<xr:attribute name="attname" [ns="URI"]>value</xr:attribute>

The name of the attribute is given in the "name" attribute. If the name of the attribute has a namespace-prefix the namespace MUST be provided in the ns attribute, and conversely, if the attribute has no prefix then the "ns" attribute must not be given. Processors should report any cases contrary to these rules as errors.

It is possible that a single prefix is associated with several conflicting namespaces by use of the "element" and "attribute" tags. Processors must signal all such errors.

<!ELEMENT %xr.attribute; (#PCDATA|%xr.reference;)*>
<!ATTLIST %xr.attribute;
    ns    CDATA #IMPLIED >
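
The name and ns rules above can be sketched as a small checker. This is an illustration under stated assumptions, not the GLOSS implementation: the function name is hypothetical, attributes are modelled as (name, ns) pairs with ns set to None when the ns attribute is absent, and no exception is made for any reserved prefix.

```python
def find_namespace_errors(attributes):
    """Return a list of error messages for a sequence of (name, ns) pairs."""
    errors = []
    bindings = {}  # prefix -> first namespace URI seen for that prefix
    for name, ns in attributes:
        prefix, colon, _local = name.partition(":")
        if colon and ns is None:
            errors.append(f"{name}: a prefixed name requires an ns attribute")
        elif not colon and ns is not None:
            errors.append(f"{name}: an unprefixed name must not have an ns attribute")
        elif colon:
            bound = bindings.setdefault(prefix, ns)
            if bound != ns:
                errors.append(f"{name}: prefix {prefix!r} is already bound to {bound}")
    return errors


for message in find_namespace_errors([("a:x", "urn:one"), ("a:y", "urn:two")]):
    print(message)
```

Collecting the messages rather than stopping at the first problem matches the requirement that processors signal all such errors.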

cdata element

The cdata element creates one (or more) CDATA sections on the output. The data should be the text content of the element and on printing should be escaped by replacing all "]]>" with "]]]]><![CDATA[>". Therefore, any data may occur inside cdata, but users should not expect a single cdata element to necessarily result in a single CDATA section.

<!ELEMENT %xr.cdata; (#PCDATA)>
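
The escaping rule for cdata can be sketched as follows. The function name is hypothetical; the replacement string is the one given in the text above.

```python
def print_cdata(data):
    """Serialize text as one or more CDATA sections.

    Each "]]>" in the data is split across two adjacent sections so that
    the terminator never appears inside a section.
    """
    return "<![CDATA[" + data.replace("]]>", "]]]]><![CDATA[>") + "]]>"


print(print_cdata("if (a[b[0]]>c) { ... }"))
```

A parser reading the result sees two adjacent CDATA sections whose contents concatenate back to the original data.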

comment element

The comment element describes a comment to appear on the output. The data is the text content of the node (white space, as always, being significant), escaped by replacing every hyphen in a run of two or more hyphens with "&#x2D;". (So for example "--" becomes "&#x2D;&#x2D;" and "---" becomes "&#x2D;&#x2D;&#x2D;".) The data is normally wrapped with "<!-- " and " -->", but the single space character in either of these combinations may be omitted if it precedes (respectively follows) whitespace in the text.

<!ELEMENT %xr.comment; (#PCDATA)>
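
A minimal sketch of the comment rule follows. The function name is hypothetical. Note that a plain string replacement of "--" would leave the third hyphen of "---" untouched, so the sketch escapes every hyphen in a run of two or more, which agrees with the "---" example in the text.

```python
import re


def print_comment(data):
    """Serialize comment text, escaping runs of two or more hyphens."""
    escaped = re.sub(r"-{2,}", lambda m: "&#x2D;" * len(m.group(0)), data)
    # The padding space is dropped where the text already supplies whitespace.
    opener = "<!--" if escaped[:1].isspace() else "<!-- "
    closer = "-->" if escaped[-1:].isspace() else " -->"
    return opener + escaped + closer


print(print_comment("a---b"))
```

Single hyphens pass through unchanged, since only the double-hyphen combination is a problem inside a comment.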

element element

The element element defines an element node with the name and namespace given. Except for the prefixes xml and xmlns, it is an error to omit the ns attribute if name has a prefix. It is an error to give two or more different ns values corresponding to the same prefix in an element and its attributes. Elements SHOULD be printed where feasible with the empty element syntax <name/>, though it may not be possible to guarantee this.

<!ELEMENT %xr.element; ((%xr.element-content;)*) >
<!ATTLIST %xr.element;
    xmlns:xml CDATA "http://www.w3.org/XML/1998/namespace"
    xml:id ID #IMPLIED

character-reference element

entity-reference element

The character-reference and entity-reference elements define a character reference or entity reference &name;. The character-reference and entity-reference elements are operationally identical, though character-reference should be used for explicit (hexadecimal or decimal) numerical characters whereas entity-reference should be used for a named reference to a sequence of 1 or more characters. Processors MUST NOT expand the reference and NEED NOT check that the name is valid. The string name is given by the name attribute.

<!ELEMENT %xr.character-reference; EMPTY>
<!ATTLIST %xr.character-reference; name CDATA #REQUIRED >
<!ELEMENT %xr.entity-reference; EMPTY>
<!ATTLIST %xr.entity-reference; name CDATA #REQUIRED >

processing-instruction element

The processing-instruction element represents a PI. The node should be printed as <?target data?>, using the value of the target attribute as the target and the PCDATA content as the data. Any ?> in the content is transformed to ?&gt; on output. The space separating target and data may be omitted if data starts with ' '.

<!ELEMENT %xr.processing-instruction; (#PCDATA) >
<!ATTLIST %xr.processing-instruction;
    target CDATA #REQUIRED >
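
The printing rule for a PI might be sketched like this. The function name is hypothetical, and two points are assumptions rather than statements of the specification: the closing ">" of any "?>" in the data is written with the &gt; escape, and empty data produces no separating space.

```python
def print_processing_instruction(target, data=""):
    """Serialize a processing instruction as <?target data?>."""
    escaped = data.replace("?>", "?&gt;")  # assumption: escape the '>' as &gt;
    if escaped == "" or escaped.startswith(" "):
        separator = ""   # no space needed: data is empty or supplies its own
    else:
        separator = " "
    return f"<?{target}{separator}{escaped}?>"


print(print_processing_instruction("xml-stylesheet", 'href="a.xsl" type="text/xsl"'))
```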

text element

The text element is a place-holder for text, output as escaped PCDATA. The data is printed by escaping it with the XML escapes &amp; &lt; &gt; &quot; and possibly also &apos;.

<!ELEMENT %xr.text; (#PCDATA|%xr.reference;)*>
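
As a sketch of how a text element with mixed content might be printed, the following escapes character data with the standard predefined XML entities and writes reference children unexpanded as &name;. The data model is an assumption made for the example: children are either plain strings or (kind, name) tuples standing for character-reference and entity-reference elements. The function names are hypothetical.

```python
def escape_text(data):
    """Escape character data; '&' must be handled first."""
    return (data.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace('"', "&quot;"))


def print_text(children):
    """Serialize the children of a text element without expanding references."""
    pieces = []
    for child in children:
        if isinstance(child, str):
            pieces.append(escape_text(child))
        else:
            _kind, name = child          # e.g. ("entity-reference", "nbsp")
            pieces.append(f"&{name};")   # printed as-is, never expanded
    return "".join(pieces)


print(print_text(["a < b ", ("entity-reference", "nbsp"), ' "q"']))
```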

group element

The group element is available to group together a section of an XR document. It is not permitted in any elements mentioned here, but is available for looser versions of this document type. (See other information on normalizing XR documents for more details.)

<!ELEMENT %xr.group; ANY>

Mark-up and semantics for DTDs

Marking-up DTDs presents a problem, as DTDs are conceptually text-based and use parameter entities as macros. Most DTD elements defined here allow a choice of either structured content or else very free dtd-text (e.g., using parameter entities). This encourages the free dtd-text to follow the maximum amount of general structure of the declarations. (Something that is required by the XML standards anyway.) It also enables the XML representation of the DTD to be as clear as possible about the structure of the DTD.

Avoiding use of the element dtdtext completely except in the austere form when dtdtext is only allowed to contain a parameter-entity-reference would aid DTD processing considerably, and would allow almost all extant DTDs to be written. This is allowed as an experimental DTD option here.

<!ENTITY % xr.austere "IGNORE">
<![%xr.austere;[
  <!ENTITY % xr.dtdtextdecl "<!ELEMENT %xr.dtdtext; (%xr.pentity-reference;) >" >
]]>

It was tempting to define this XML vocabulary with a syntax using attributes much more heavily, but this would have complicated the issues of processing considerably, and ordinary XML elements are much simpler.

Basic DTD text

XR mark-up for DTDs allows the source to simply provide plain DTD text as an alternative. (This is to cater for macro-like processing that the DTD processor does using parameter entities.) Users (or application software) are encouraged not to use plain text but to mark-up their DTDs as far as possible using the specific mark-up defined here. Occasionally there is no option, however.

The xr.dtdreference entity lists elements allowed for references that may occur in a DTD. The element dtdtext can contain any DTD constructs. It is intended that dtdtext be used as a way of showing the structure, or of bracketing DTD constructs, i.e., content of dtdtext should be entire grammatical DTD constructs. This DTD cannot guarantee that, however.

The name element is used to mark-up the fact that the content is a valid name, usually the object being declared.

Many DTD constructs involve quoted strings. Quotes as plain text can be problematic. It is better to mark-up quoted strings with the q or qq elements. The XR printer will wrap these with ' ... ' or " ... " respectively.

Text or CDATA node children of dtdtext, name, q, qq are printed verbatim (i.e., unescaped) by the XR printer or other XR processors. This rule may not apply to descendants other than immediate children. The usual rules for the standard xml:space attributes apply.

<!ENTITY % xr.dtdreference "%xr.character-reference;|%xr.entity-reference;|%xr.pentity-reference;">
<!ENTITY % xr.dtdtextdecl 
    "<!ELEMENT %xr.dtdtext; (#PCDATA|%xr.dtdreference;|%xr.text;|%xr.q;|%xr.qq;|%xr.dtd-content;)* >" >
<!ELEMENT %xr.name; (#PCDATA|%xr.dtdreference;|%xr.text;)*>
<!ELEMENT %xr.q; (#PCDATA|%xr.dtdreference;|%xr.text;|%xr.qq;)* >
<!ELEMENT %xr.qq; (#PCDATA|%xr.dtdreference;|%xr.text;|%xr.q;)* >

Entity references

A pentity-reference contains the name of a parameter entity. The XR printer will wrap it with % ... ;. Only PCDATA is allowed as content; this should be white-space trimmed by processors, and the result should be a valid name.

<!ELEMENT %xr.pentity-reference; (#PCDATA) >
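
The trimming and wrapping described for pentity-reference could look like the sketch below. The function name is hypothetical, and the name check is a deliberately simplified ASCII-only approximation of the XML Name production, not the full rule.

```python
import re

# Simplified, ASCII-only approximation of an XML Name (an assumption).
_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9._:\-]*\Z")


def print_pentity_reference(content):
    """Trim the PCDATA content and wrap it as a parameter-entity reference."""
    name = content.strip()
    if not _NAME.match(name):
        raise ValueError(f"not a valid name: {name!r}")
    return f"%{name};"


print(print_pentity_reference("  xr.prefix \n"))
```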

element and attribute declarations

We next have a complex set of declarations for element and attribute declarations, content specification, and attribute types and defaults. Entities children, contentspec, atttype, defaultdecl and attdef define valid child-content-specifications, content-specifications, attribute-types, attribute-default-declarations and attribute definitions, respectively. Each has dtdtext as a fall-back option.

<!ENTITY  % xr.children "%xr.dtdtext;|%xr.clist;|%xr.calts;|%xr.copt;|%xr.crep;|%xr.coptrep;">
<!ENTITY  % xr.contentspec "%xr.dtdtext;|%xr.EMPTY;|%xr.ANY;|%xr.PCDATA;|%xr.mixed;|%xr.children;">
<!ENTITY  % xr.atttype "%xr.CDATA;|%xr.ID;|%xr.IDREF;|%xr.IDREFS;|%xr.ENTITY;|%xr.ENTITIES;|
<!ENTITY  % xr.defaultdecl "%xr.dtdtext;|%xr.REQUIRED;|%xr.IMPLIED;|%xr.FIXED;|%xr.q;|%xr.qq;">
<!ENTITY  % xr.attdef "%xr.dtdtext;|(%xr.name;,(%xr.atttype;),(%xr.defaultdecl;))">

These are used in the definitions of elementdecl and attlistdecl, the mark-up specifications for element-declarations and attribute-list-definitions. The definitions here should be self-explanatory. XR-processors or printers should wrap elementdecl with <!ELEMENT ... > and attlistdecl with <!ATTLIST ... >.

<!ELEMENT %xr.elementdecl; (%xr.dtdtext;|(%xr.name;,(%xr.contentspec;)))>
<!ELEMENT %xr.attlistdecl; (%xr.dtdtext;|(%xr.name;,(%xr.attdef;)*))>

Element content specifications

We now have a group of elements for element-content-specifications. These (and their intended display) are: clist resulting in ( child , child , ... , child ); calts resulting in ( child | child | ... | child ); copt resulting in child ?; crep resulting in child +; coptrep resulting in child *; mixed resulting in (#PCDATA| child | child | ... | child )*; ANY resulting in ANY; EMPTY resulting in EMPTY; PCDATA resulting in (#PCDATA). In some cases, the XR-processor not only wraps the child-content but adds a separator between children.

<!ELEMENT %xr.clist; (%xr.dtdtext;|(%xr.name;|%xr.children;)*) >
<!ELEMENT %xr.calts; (%xr.dtdtext;|(%xr.name;|%xr.children;)*) >
<!ELEMENT %xr.copt; (%xr.dtdtext;|(%xr.name;|%xr.children;)) >
<!ELEMENT %xr.crep; (%xr.dtdtext;|(%xr.name;|%xr.children;)) >
<!ELEMENT %xr.coptrep; (%xr.dtdtext;|(%xr.name;|%xr.children;)) >
<!ELEMENT %xr.mixed; (%xr.dtdtext;|(%xr.name;)*) >
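
The intended displays listed above lend themselves to a small recursive printer. The sketch below is illustrative only: the function name and the data model are assumptions (a content specification is either a name string or a (tag, children) tuple), the dtdtext fall-back is not handled, and the output omits the optional white space shown in the displays.

```python
SUFFIX = {"copt": "?", "crep": "+", "coptrep": "*"}


def print_contentspec(node):
    """Serialize a content-specification tree to DTD syntax."""
    if isinstance(node, str):
        return node                                   # a name element
    tag, children = node
    parts = [print_contentspec(child) for child in children]
    if tag == "clist":
        return "(" + ",".join(parts) + ")"            # sequence
    if tag == "calts":
        return "(" + "|".join(parts) + ")"            # alternatives
    if tag in SUFFIX:
        (only,) = parts                               # exactly one child allowed
        return only + SUFFIX[tag]
    if tag == "mixed":
        return "(" + "|".join(["#PCDATA"] + parts) + ")*"
    if tag in ("ANY", "EMPTY"):
        return tag
    if tag == "PCDATA":
        return "(#PCDATA)"
    raise ValueError(f"unknown content specification: {tag}")


spec = ("clist", ["head", ("copt", ["body"]), ("coptrep", [("calts", ["a", "b"])])])
print(print_contentspec(spec))
```

Here the separator (a comma or a vertical bar) is added by the printer, which is the behaviour the last sentence of the paragraph above refers to.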

Attribute types and defaults

Next defined are the elements for mark-up of attribute definitions. Again, these are self-explanatory, with the following displays intended: CDATA resulting in CDATA; ID resulting in ID; IDREF resulting in IDREF; IDREFS resulting in IDREFS; ENTITY resulting in ENTITY; ENTITIES resulting in ENTITIES; NMTOKEN resulting in NMTOKEN; NMTOKENS resulting in NMTOKENS; notationtype resulting in NOTATION ( name | name | ... | name ); enumeration resulting in ( name | name | ... | name ); REQUIRED resulting in #REQUIRED; IMPLIED resulting in #IMPLIED; FIXED resulting in #FIXED child.

<!ELEMENT %xr.notationtype; (%xr.dtdtext;|(%xr.name;)*) >
<!ELEMENT %xr.enumeration; (%xr.dtdtext;|(%xr.name;)*) >
<!ELEMENT %xr.FIXED; (#PCDATA|%xr.dtdreference;|%xr.text;|%xr.q;|%xr.qq;)* >

Entity and notation declarations

The elements systemid, publicid and publiclit match grammar elements with the same names in the XML specification. They print out as SYSTEM data, PUBLIC pdata sdata and PUBLIC pdata respectively. (The difference between a public literal and a public identifier is that a fall-back system URI is not needed for a public literal. Don't ask me, I didn't write the spec!) The data, pdata, sdata etc. are quoted, so should be presented in q or qq elements. dtdtext presents a fall-back for nonstructured systemid, etc. (In other words you have to get it right yourself!)

<!ELEMENT %xr.systemid; (%xr.dtdtext;|%xr.q;|%xr.qq;)>
<!ELEMENT %xr.publicid; (%xr.dtdtext;|((%xr.q;|%xr.qq;),(%xr.q;|%xr.qq;)))>
<!ELEMENT %xr.publiclit; (%xr.dtdtext;|%xr.q;|%xr.qq;)>
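
A sketch of the print-out for these three elements, including the quoting performed for q and qq, is given below. The function names and the data model are assumptions made for the example: each quoted string is a (tag, text) pair with tag "q" or "qq", the text is printed verbatim, and the dtdtext fall-back is not handled.

```python
QUOTES = {"q": "'", "qq": '"'}
KEYWORD = {"systemid": "SYSTEM", "publicid": "PUBLIC", "publiclit": "PUBLIC"}
LITERALS = {"systemid": 1, "publicid": 2, "publiclit": 1}  # quoted strings required


def print_quoted(tag, text):
    """Wrap text in single quotes for q, double quotes for qq."""
    quote = QUOTES[tag]
    return quote + text + quote


def print_external_id(kind, *literals):
    """Serialize systemid, publicid or publiclit from (tag, text) pairs."""
    if len(literals) != LITERALS[kind]:
        raise ValueError(f"{kind} needs {LITERALS[kind]} quoted string(s)")
    return " ".join([KEYWORD[kind]] + [print_quoted(t, s) for t, s in literals])


print(print_external_id("systemid", ("q", "xr.dtd")))
```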

The tags pentitydecl, entitydecl, notationdecl mark-up parameter-entity, entity and notation declarations. In all cases you can provide the content yourself with dtdtext content and the content will be printed wrapped with <!ENTITY % ... >, <!ENTITY ... >, or <!NOTATION ... >. Otherwise you must specify the name of the object being declared and its data. The data is either a string (marked-up with q or qq) or a system or public id (in the case of entitydecl, optionally with an NDATA notation name) or (for notationdecl) a publiclit. ndata is marked-up content that resolves to an XML name, the name of a notation.

<!ELEMENT %xr.ndata; (#PCDATA|%xr.dtdreference;|%xr.text;)*>
<!ELEMENT %xr.pentitydecl; (%xr.dtdtext;|(%xr.name;,(%xr.q;|%xr.qq;|%xr.publicid;|%xr.systemid;)))>
<!ELEMENT %xr.entitydecl; (%xr.dtdtext;|(%xr.name;,(%xr.q;|%xr.qq;|((%xr.publicid;|%xr.systemid;),(%xr.ndata;)?))))>
<!ELEMENT %xr.notationdecl; (%xr.dtdtext;|(%xr.name;,(%xr.systemid;|%xr.publicid;|%xr.publiclit;)))>


The tags INCLUDE, IGNORE, conditional mark-up conditional inclusions. Content should be provided as shown in dtd-content. The attribute "name" is the name of a parameter-entity controlling inclusion and in this form produces <![%name;[ ... ]]>. (If you want something more complicated than this you'll have to code it with dtdtext.)

<!ELEMENT %xr.INCLUDE; (%xr.dtd-content;)*>
<!ELEMENT %xr.IGNORE; (%xr.dtd-content;)*>
<!ELEMENT %xr.conditional; ((%xr.attribute;)?,(%xr.dtd-content;)*)>
<!ATTLIST %xr.conditional; name CDATA #IMPLIED > 
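
The wrapping for conditional sections can be sketched as follows. The function name is hypothetical, the content is assumed to be already-printed DTD text, and the line breaks around the content are a layout choice rather than a requirement.

```python
def print_conditional(tag, content, name=None):
    """Serialize an INCLUDE, IGNORE or conditional element as a conditional section."""
    if tag in ("INCLUDE", "IGNORE"):
        keyword = tag
    elif tag == "conditional":
        if not name:
            raise ValueError("conditional needs the name of a parameter entity")
        keyword = f"%{name};"
    else:
        raise ValueError(f"not a conditional element: {tag}")
    return f"<![{keyword}[\n{content}\n]]>"


print(print_conditional("conditional", '<!ENTITY % xr.prefix "xr:">', name="xr.prefixed"))
```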

See the documentation elsewhere for more about this DTD and the GLOSS XML representation.

This page is copyright. Web page design and creation by GLOSS.