The unicode problem

1 What to do when converting XML to [La]TeX?

Input XML documents are, in principle, `richer' than TeX or LaTeX, in that they allow arbitrary unicode characters as well as XML's tree structure. Provided the intended target document does not include glyphs outside TeX's capabilities, the document may be encoded in XML in many ways. Which is the `right' one, or the best one?

The other problem is that (La)TeX has a strange legacy encoding of characters, with different characters even occupying the same position in different fonts. In any case, not all the characters you want ever seem to be available. This seems to be a major problem without any solution, despite many people spending much time on it.

There is also a minor issue with the special characters of TeX, which must be encoded correctly. These include $ _ { } # & < > % ~ ' " ^ \ [ ]. This issue seems to be resolved.
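
As an illustration of the special-character issue only, here is a minimal Python sketch (not the method used by the files here). The function name escape_tex_text and the choice of replacement macros are assumptions, and they apply to text mode only. The quote characters ' and " are left untouched in this sketch, since their correct treatment depends on the font encoding and language packages in use.

```python
# Hypothetical helper: the replacement macros below are assumptions
# that are only appropriate in TeX's text mode.
_TEX_TEXT_ESCAPES = {
    "\\": r"\textbackslash{}",
    "$": r"\$",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "#": r"\#",
    "&": r"\&",
    "%": r"\%",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
    "<": r"\textless{}",
    ">": r"\textgreater{}",
    "[": "{[}",   # braces stop [ being read as an optional argument
    "]": "{]}",
}
_TABLE = str.maketrans(_TEX_TEXT_ESCAPES)


def escape_tex_text(s: str) -> str:
    """Escape TeX special characters in one pass over the string."""
    return s.translate(_TABLE)


print(escape_tex_text("50% of $x_1 & {y}"))
# prints: 50\% of \$x\_1 \& \{y\}
```

Doing the replacement in a single pass with str.translate matters: replacing characters one after another would re-escape the backslashes and braces introduced by earlier replacements.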

I am not going to spend much more time on this. The files here should convert a minimal amount of XHTML+MathML into LaTeX. If it doesn't work, try something else.

2 An incomplete technical discussion

The choice to be made here is technical and somewhat difficult. It is necessary to prioritise the options. Are your priorities...

...to use standard LaTeX2e as the TeX processor?
...to use any available TeX processor but only allow standard DVI files as output?
...to use some other software (operating on the XML) as the norm, but to provide LaTeX as a fall-back?
...to only use safe 7-bit characters at all times in gl, xml and tex files?

The options at present seem to be...

...to use omega (lambda) or ConTeXt as the TeX processor with UTF-8 input encodings and, if possible, produce dvi files that work with standard fonts? (E.g., unicode.tex/utf8-tex from Bruno Haible.)
...to use TeXML or some other intermediate program to convert XML+unicode to ordinary TeX? (http://getfo.sourceforge.net/texml/)
...to restrict the input more fully to ensure that all operations are 7-bit characters only?
...to go all-out with some new typesetting program such as ConTeXt or omega?
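
For the 7-bit option, a small sketch of how one might check a file and keep the XML side in plain ASCII. The helper names find_non_ascii and to_ascii_xml are hypothetical. The sketch relies on Python's standard xmlcharrefreplace error handler, which writes each non-ASCII character as a numeric character reference; such references are only valid in XML character data and attribute values, not in element or attribute names.

```python
import unicodedata


def find_non_ascii(text):
    """Return (line, column, character, Unicode name) for each non-ASCII character."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((lineno, col, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits


def to_ascii_xml(text):
    """Rewrite non-ASCII characters as numeric character references."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = "<p>caf\u00e9 \u03b1</p>"
for hit in find_non_ascii(sample):
    print(hit)
print(to_ascii_xml(sample))
# last line printed: <p>caf&#233; &#945;</p>
```

This keeps the xml files 7-bit, but it does not by itself say what the characters should become in the tex files.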

I couldn't get TeXML working (wrong version of Python/library), but this route seems flawed anyway, as it requires knowing which mode TeX is in (maths or text?) at compile time. Thus using real unicode requires more modern TeX software. ConTeXt looks good, but... We need to stick with safe 7-bit characters for now.
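
To make the mode problem concrete, here is a rough Python sketch of a 7-bit mapping. The table SEVEN_BIT and the function to_seven_bit are hypothetical, and the three entries are only examples. The point is that every character needs two different expansions, so the converter has to be told which mode it is in.

```python
# Hypothetical, deliberately tiny table:
# character -> (text-mode TeX, math-mode TeX)
SEVEN_BIT = {
    "\u00e9": (r"\'{e}", r"\acute{e}"),               # e with acute accent
    "\u03b1": (r"\ensuremath{\alpha}", r"\alpha "),   # Greek alpha
    "\u2264": (r"\ensuremath{\leq}", r"\leq "),       # less-than-or-equal
}


def to_seven_bit(text, mode):
    """Replace non-ASCII characters by 7-bit TeX for the given mode."""
    if mode not in ("text", "math"):
        raise ValueError("mode must be 'text' or 'math'")
    index = 0 if mode == "text" else 1
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in SEVEN_BIT:
            out.append(SEVEN_BIT[ch][index])
        else:
            # Fail loudly rather than emit something TeX cannot typeset.
            raise KeyError("no 7-bit TeX known for U+%04X" % ord(ch))
    return "".join(out)


print(to_seven_bit("caf\u00e9", "text"))   # prints: caf\'{e}
print(to_seven_bit("x \u2264 y", "math"))  # prints: x \leq  y
```

A converter working on the XML tree can usually supply the mode itself (for example, inside or outside a MathML element), which is one reason to do the mapping there rather than in a separate character-level pass.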

Ughhh... I give up!