Output encodings

This section explains how character data in the generated HTML output are encoded. There are various aspects to this topic, and it is quite easy to get totally confused. Because of this, I will first explain how character data change their encoding during the default processing steps, and later how this behaviour can be modified.

The standard way of encoding characters

Phase 1: Parsing XML, and the internal representation

Most of the character data are read from the XML files containing the UI definition, but some strings are also added dynamically by the program (e.g. read from a database, or another background store). The XML data are parsed, and the result is an in-memory representation as an XML tree. This tree can be seen as the reference point for the various recoding steps, as it expresses what is meant. An example makes this clearer:

<ui:variable name="company">
  <ui:string-value>Meyer &amp; Son</ui:string-value>
</ui:variable>

This literal XML fragment is parsed, and represented as a tree:

  |
  ui:variable
      |
      +-- attribute "name" has value "company"
      +-- ui:string-value
              |
              +-- text "Meyer & Son"

In particular, the ampersand is now represented as a literal ampersand, and does not need any escaping notation.

Of course, there are many more data structures than just XML trees. We have declared a variable here, and this creates a container for the variable. The important point is that the initial value of the variable can be taken directly from the XML tree; here it is "Meyer & Son". If the value is later changed (e.g. overwritten by some database record), no encoding changes are necessary. The general idea is that the internal representation never escapes characters.
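
The unescaping during parsing can be observed with any XML parser. The following is a minimal sketch that uses Python's standard xml.etree.ElementTree instead of WDialog's own parser. The wrapper element root and the namespace URI urn:example:ui are assumptions, added only so that the ui: prefix is bound.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; it only binds the "ui" prefix for this demo.
UI_NS = "urn:example:ui"

source = f"""
<root xmlns:ui="{UI_NS}">
  <ui:variable name="company">
    <ui:string-value>Meyer &amp; Son</ui:string-value>
  </ui:variable>
</root>
"""

root = ET.fromstring(source)
variable = root.find(f"{{{UI_NS}}}variable")
value = variable.find(f"{{{UI_NS}}}string-value").text

# The parser has already resolved &amp; into a plain ampersand.
print(variable.get("name"))  # company
print(value)                 # Meyer & Son
```

The string held in value is what the internal representation stores; no escaping notation survives the parsing step.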

Phase 2: Internal processing

In order to get HTML output, the XML tree needs to be transformed, for example, template calls must be expanded. The transformation never changes the way character data are encoded.

Phase 3: Writing the HTML output

The result of the transformation step is an HTML tree that must be written as a text stream. There are essentially two major cases:

Element context: This simply means that the HTML node to write occurs as a subelement within an outer HTML node. HTML tags are printed with the normal tag syntax: <tag>...</tag>. Character data are HTML-escaped, i.e. < is printed as &lt; etc.
For example, this HTML tree
```
  |
  b
  |
  +-- text "Meyer & Son"
```
is printed as
```
<b>Meyer &amp; Son</b>
```
It is also possible that the HTML node has attributes. These are HTML-escaped, too, e.g. if the value attribute has the value "Meyer & Son", the whole input element is printed as:
```
<input type="button" value="Meyer &amp; Son">
```
Attribute context: Here, the HTML node to write occurs inside the attribute value of an outer HTML node. What? Well, this is a consequence of the template expansion algorithm. For example:
```
<ui:template name="bold_meyer">
  <b>Meyer &amp; Son</b>
</ui:template>
...
<ui:template name="make_button" from-caller="value">
  <input type="button" value="$value"/>
</ui:template>
...
<t:make_button>
  <p:value><t:bold_meyer/></p:value>
</t:make_button>
```
Here, the HTML subtree <b>Meyer & Son</b> is finally inserted as the value of the value attribute! The tree looks like:
```
  |
  input
  |
  +-- attribute "type" has value "button"
  +-- attribute "value":
        |
        +-- b
            |
            +-- text "Meyer & Son"
```
This case is handled in two steps. First, the HTML subtree within the attribute is linearized into a single string. Second, the string is printed as attribute, and this is the same algorithm as above, i.e. HTML meta characters are escaped.
Linearization: HTML elements are printed in tag notation. Text nodes are simply left as they are, i.e. no HTML-escaping happens in this step.
In the example, the result of the linearization is the string "<b>Meyer & Son</b>", and this string is printed as attribute, leading to the final result
```
<input type="button" value="&lt;b&gt;Meyer &amp; Son&lt;/b&gt;">
```
I know that it is a bit surprising that this case exists, but I think it is treated in a straightforward way.
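
The two cases can be modelled in a few lines. This is only a sketch of the behaviour described above, not WDialog's implementation: the classes Text and Element and the functions html_escape, linearize and write are hypothetical names, and the treatment of attributes that occur inside a linearized subtree is an assumption, since the text does not specify it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

@dataclass
class Text:
    data: str

@dataclass
class Element:
    tag: str
    # An attribute value is itself a list of nodes (see the template example).
    attrs: Dict[str, List["Node"]] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)

Node = Union[Text, Element]
VOID = {"input", "br"}  # elements printed without an end tag

def html_escape(text: str) -> str:
    return (text.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace('"', "&quot;"))

def linearize(nodes: List[Node]) -> str:
    """Attribute context, step 1: tag notation, text nodes left unescaped."""
    return "".join(write(n, escape_text=False) for n in nodes)

def write(node: Node, escape_text: bool = True) -> str:
    if isinstance(node, Text):
        return html_escape(node.data) if escape_text else node.data
    # Attribute context, step 2: the linearized string is escaped as a whole.
    attrs = "".join(f' {name}="{html_escape(linearize(value))}"'
                    for name, value in node.attrs.items())
    if node.tag in VOID:
        return f"<{node.tag}{attrs}>"
    inner = "".join(write(c, escape_text) for c in node.children)
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"

bold = Element("b", children=[Text("Meyer & Son")])
button = Element("input", attrs={"type": [Text("button")], "value": [bold]})

print(write(bold))    # <b>Meyer &amp; Son</b>
print(write(button))  # <input type="button" value="&lt;b&gt;Meyer &amp; Son&lt;/b&gt;">
```

Because the text node is not escaped during linearization, the ampersand is escaped exactly once, when the complete attribute string is printed.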

How to modify the way output is encoded

Forcing the algorithm for attribute context

One drawback of the normal output encoding is that it is impossible to generate raw HTML dynamically. Imagine you have a database containing HTML pages. How do you include the pages in your generated output?

Let us assume the variable html_page contains the page. If you include it by

<ui:dynamic variable="html_page"/>

the ui:dynamic statement expands to a text node, and the normal encoding escapes all HTML meta characters. The result is that the browser displays the code of the page as such, but does not interpret it.

It is possible to force the algorithm that is used for attribute context. The important point is that this algorithm does not escape within text nodes. The ui:special element selects this algorithm, e.g.

<ui:special>
  <ui:dynamic variable="html_page"/>
</ui:special>

Now the HTML meta characters are left as they are, without any escaping. The browser interprets the HTML code.
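
The difference between the two ways of including the page can be sketched as follows. The function render_text and its special flag are hypothetical; they only model whether the text node is printed inside ui:special or not.

```python
def html_escape(text: str) -> str:
    return (text.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace('"', "&quot;"))

def render_text(text: str, special: bool = False) -> str:
    """Print a text node; special=True models being inside ui:special."""
    return text if special else html_escape(text)

html_page = "<h1>News</h1><p>Meyer &amp; Son opened a shop.</p>"

# Default: the browser shows the markup as visible text.
print(render_text(html_page))
# Inside ui:special: the markup reaches the browser unchanged.
print(render_text(html_page, special=True))
```

Since nothing is escaped in the second case, whatever is stored in html_page reaches the browser verbatim.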

Additional output encodings

The HTML pre tag preserves the formatting of the inner character block. Sometimes it would be nice to simulate the effect of pre without using it, by replacing spaces with &nbsp;, newlines with <br>, and by expanding tabs. The ui:encode element allows one to add an escaping algorithm to the currently active set of encoders:

<ui:encode enc="pre">
This is the first line.
Second line.
</ui:encode>

The two lines are first encoded by the HTML-escaping algorithm, the default algorithm. The ui:encode element takes the result of this, and applies pre-style escaping to it. The printed HTML code is:

This&nbsp;is&nbsp;the&nbsp;first&nbsp;line.<br>
Second&nbsp;line.<br>
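
The chaining of the two encoders can be sketched like this. The function names are hypothetical, and two details are assumptions: that a line break is kept after each <br> (as in the output shown above), and that tabs are expanded on the already escaped text.

```python
def html_escape(text: str) -> str:
    """Default encoding: escape the four HTML meta characters."""
    return (text.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace('"', "&quot;"))

def pre_encode(escaped: str, tab_width: int = 8) -> str:
    """pre-style encoding, applied to text that is already HTML-escaped."""
    lines = []
    for line in escaped.splitlines():
        line = line.expandtabs(tab_width)          # tabs become runs of spaces
        lines.append(line.replace(" ", "&nbsp;") + "<br>")
    # Assumption: each <br> is followed by a line break in the output.
    return "\n".join(lines) + "\n"

body = "This is the first line.\nSecond line.\n"
print(pre_encode(html_escape(body)), end="")
```

The order matters: if pre_encode ran first, the default encoder would escape the ampersand of every &nbsp; and the angle brackets of every <br>.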

Another example: You want to generate a Javascript function that pops up an alert box on the screen:

<ui:template name="alert" from-caller="body">
  <script type="text/javascript">
    <ui:special>
window.alert("${body/js}");
    </ui:special>
  </script>
</ui:template>

The ui:special element turns HTML-escaping off. The /js notation applies the js encoding to the value of body. This encoding escapes characters that cannot occur literally inside a Javascript string literal, e.g. the quotation mark itself.
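
To illustrate what such an encoding has to do, here is a minimal sketch. The function js_encode is a hypothetical stand-in for the js encoding; the concrete escape sequences it chooses (a backslash before quotes and backslashes, \xHH for the remaining characters) are an assumption, not necessarily what WDialog emits.

```python
def js_encode(text: str) -> str:
    """Make text safe for use inside a Javascript string literal."""
    out = []
    for ch in text:
        if ch in ("\\", '"', "'"):
            out.append("\\" + ch)             # backslash-escape quotes and backslash
        elif ch in "<%" or ord(ch) < 0x20:
            out.append("\\x%02x" % ord(ch))   # hex escape for <, % and control chars
        else:
            out.append(ch)
    return "".join(out)

body = 'He said "stop" </script>'
print('window.alert("%s");' % js_encode(body))
# window.alert("He said \"stop\" \x3c/script>");
```

Escaping < keeps a value containing </script> from ending the script element early, which matters here because HTML-escaping is switched off.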

The list of defined output encodings

The following names can be used in ui:encode, and when expanding parameters (${param/encname}) and in bracket expressions ($[expr/encname]):

html: The HTML-escaping algorithm substitutes &lt; for <, &gt; for >, &quot; for ", and &amp; for &.
pre: This encoding substitutes &nbsp; for spaces, <br> for newline characters, and expands tabs (tab width is 8).
para: Multiple newline characters are replaced by <p>.
js: The characters \, ", ', <, % and control characters are escaped according to the Javascript rules such that the string can be used inside a Javascript string literal.
jslong: One problem with js is that Javascript interpreters do not like long lines. To be on the safe side, jslong should be used instead. It puts "+\n+" sequences into the string so that the resulting lines do not become too long.

You can define your own encodings by calling the method add_output_encoding of the application object.

The encodings can be referred to at a number of places:

ui:encode: The element ui:encode applies the encoding to what is printed for the subelements.
Parameters: The syntax ${param/enc} applies the encoding enc to the value of the template parameter param.
Bracket expressions: The syntax $[expr/enc] applies the encoding enc to the result of the bracket expression expr.
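
How a named encoding is looked up and applied to a parameter can be sketched as follows. Everything here is a hypothetical model: the table ENCODINGS, the function expand, and the simplified js encoder are not WDialog's API, and the sketch ignores the default HTML-escaping that happens later when the output is written.

```python
import re

def html_escape(text: str) -> str:
    return (text.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace('"', "&quot;"))

def js_encode(text: str) -> str:
    # Simplified: only quotes and backslashes are handled in this sketch.
    return re.sub(r'([\\"\'])', r"\\\1", text)

# Hypothetical registry, mapping encoding names to functions.
ENCODINGS = {"html": html_escape, "js": js_encode}

# Matches ${param} and ${param/enc}.
PARAM = re.compile(r"\$\{(\w+)(?:/(\w+))?\}")

def expand(template: str, params: dict) -> str:
    def replace(match):
        value = params[match.group(1)]
        enc = match.group(2)
        return ENCODINGS[enc](value) if enc else value
    return PARAM.sub(replace, template)

print(expand('window.alert("${body/js}");', {"body": 'Say "hi"'}))
# window.alert("Say \"hi\"");
```

A user-defined encoding would correspond to adding one more entry to the table, which is roughly the role the text assigns to add_output_encoding.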