LMNL Tutorial

Introduction

This tutorial introduces you to LMNL, the Layered Markup and Annotation Language. We've littered this tutorial with references to the data model, syntax and reified LMNL documents, so that when you're ready, you can move on to those more formal documents, but this should act as your introduction to LMNL.

The brief overview is that LMNL documents contain character data which is marked up using named and occasionally overlapping ranges. Ranges can have annotations, which can themselves be annotated and can have structured content. To support authoring, especially collaborative authoring, markup is namespaced and divided into layers, which might reflect different views on the text.

Last modified 11 Oct 2002 by Jeni Tennison.

LMNL Documents

It's traditional to start a tutorial with a “Hello World!” example. So to kick us off, let's look at the “Hello World!” LMNL document:

Hello World!

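Yep, that's it. Not particularly hard, is it? In fact any text document is a legal LMNL document as long as it follows two rules:

  1. it doesn't contain any of the markup-significant characters [, { or &
  2. it's encoded in UTF-8 or UTF-16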

The next couple of sections show you what to do if you want to break either of these rules — how to escape the markup-significant characters using the predefined entities, and how to use an encoding other than UTF-8 or UTF-16 in your document.

Predefined Entities

So what happens if you want to include a [, { or & in your document? What if you wanted to have the document say “Hello World, & Welcome!”? These characters are special in LMNL because they indicate markup. If you want to use them as just characters within some text, you have to escape them so that a parser reading the document doesn't misinterpret them as markup.

You can escape the [, { and & characters using the predefined entities: &lsqb;, &lcub; and &amp;. So the “Hello World, & Welcome!” document would look like:

Hello World, &amp; Welcome!

There are seven predefined entities in LMNL, but &lsqb;, &lcub; and &amp; are the only ones that you have to use in the normal run of things (it doesn't hurt if you use the others, but there's no need to). The predefined entities are:

We'll see later that you can also define your own entities if you want to, to give shorthands for characters that are hard to type.

While we're looking at characters, we'll just quickly mention that you can use character references in LMNL as well, if you need to insert a Unicode character that isn't supported by the encoding that you're using, or if it's simply easier to type that way! Character references are the same as in XML. For example, you can use &#x429; or &#1065; to include a “CYRILLIC CAPITAL LETTER SHCHA” (Щ) in your text.
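If you're generating LMNL from a program, character references are easy to produce mechanically. Here's a minimal Python sketch (the function names are ours, not part of LMNL) that builds hexadecimal and decimal references, and that escapes the markup-significant characters with references rather than with the predefined entities. It assumes, as in XML, that a character reference to [, { or & is read as plain text rather than as markup.

```python
MARKUP_CHARS = "[{&"  # the characters the tutorial says indicate markup


def hex_ref(ch: str) -> str:
    """Return an XML-style hexadecimal character reference for one character."""
    return f"&#x{ord(ch):X};"


def dec_ref(ch: str) -> str:
    """Return an XML-style decimal character reference for one character."""
    return f"&#{ord(ch)};"


def escape_with_refs(text: str) -> str:
    """Replace each markup-significant character with a hexadecimal reference."""
    return "".join(hex_ref(c) if c in MARKUP_CHARS else c for c in text)


print(hex_ref("Щ"))  # &#x429;
print(dec_ref("Щ"))  # &#1065;
print(escape_with_refs("Hello World, & Welcome!"))  # Hello World, &#x26; Welcome!
```

Using numeric references here is just a convenience for the sketch; in hand-written documents the predefined entities are usually easier to read.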

LMNL Declaration

If you want to use an encoding other than UTF-8 or UTF-16, or if you simply want to state for the record that the document is a LMNL document, then you should include a LMNL declaration right at the top of the file, before anything else (even spaces). A basic LMNL declaration looks like:

[!lmnl version="0.2"]

By the way, that's a legal LMNL document as well; a LMNL document doesn't have to have any content.

The version declaration specifies the version of LMNL being used in the document. We're on version 0.2 at the moment because we haven't finished drafting LMNL yet.

If you want to specify the encoding of the document (the way in which the characters in the document are written, as bytes, on disk) then you can use an encoding declaration. For example, to say that our “Hello World!” document has been saved using ISO-8859-1 (the encoding that's used by many Western text editors) you can use:

[!lmnl version="0.1" encoding="ISO-8859-1"]
Hello World!

Note that you must specify the version of LMNL that you're using if you include a LMNL declaration; you can't just specify the encoding of the LMNL document without specifying the version.
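To make the rules about the declaration concrete, here's a rough Python sketch of how a program might read one. The function name and the regular expression are our own invention; in particular, the sketch assumes that version always comes before encoding, as it does in the examples above, and it only looks at the very start of the document because the declaration has to come before anything else.

```python
import re

# Assumption: version comes first and encoding, if present, comes second.
DECLARATION = re.compile(
    r'^\[!lmnl\s+version="([^"]*)"(?:\s+encoding="([^"]*)")?\s*\]'
)


def read_declaration(document: str):
    """Return (version, encoding) from a leading LMNL declaration, or None."""
    match = DECLARATION.match(document)
    if match is None:
        return None  # no declaration, or one without a version
    return match.group(1), match.group(2)


print(read_declaration('[!lmnl version="0.1" encoding="ISO-8859-1"]\nHello World!'))
print(read_declaration("Hello World!"))
```

A declaration that gives only an encoding doesn't match the pattern, which mirrors the rule that the version is required.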

Ranges

Having a text document be recognised as a LMNL document is all very well, but the point of a markup language is that you mark up things — you indicate the structure of a document by embedding tags, markers in the normal text of the document.

To illustrate mark up, we're going to need a document that's a bit more sophisticated than the simple “Hello World!” example, so we'll use this extract from http://www.zeta.org.au/~annskea/Trickstr.htm:

Trickster has never been restricted to one society. In European countries he 
appears in the guise of Jester or Fool, and his roots in the human psyche are 
deep. Alan Garner has collected Trickster stories from many countries in his 
book The Guizer and he writes:

        If we take the elements from which our emotions are built 
        and give them separate names such as Mother, Hero, Father, 
        King, Child, Queen, the element that I think marks most of 
        us is that of the Fool. It is where our humanity lies. For 
        the Fool is the advocate of uncertainty: he is at once 
        creator and destroyer, bringer of help and harm. He draws 
        a boundary for chaos, so that we can make sense of the 
        rest. He is the shadow that shapes the light. Psychology 
        calls him Trickster. I have called him Guizer.
        Guizer is the proper word for an actor in a mumming play. 
        He is comical, grotesque, stupid, cunning, ambiguous. He 
        is sometimes part animal, and always part something else. 
        The something else is what is so special. He is the 
        dawning godhead in Man.

Tags

Using LMNL, we can mark up any contiguous sequence of characters within this text as a range. Ranges within text are indicated by tags which mark the start and end of the range. For example, we can mark up the extract above as a [paragraph] range and an [extract] range:

[paragraph}
Trickster has never been restricted to one society. In European countries he 
appears in the guise of Jester or Fool, and his roots in the human psyche are 
deep. Alan Garner has collected Trickster stories from many countries in his 
book The Guizer and he writes:
{paragraph]
[extract}
If we take the elements from which our emotions are built and give them 
separate names such as Mother, Hero, Father, King, Child, Queen, the element 
that I think marks most of us is that of the Fool. It is where our humanity 
lies. For the Fool is the advocate of uncertainty: he is at once creator and 
destroyer, bringer of help and harm. He draws a boundary for chaos, so that we 
can make sense of the rest. He is the shadow that shapes the light. Psychology 
calls him Trickster. I have called him Guizer. Guizer is the proper word for an 
actor in a mumming play. He is comical, grotesque, stupid, cunning, ambiguous. 
He is sometimes part animal, and always part something else. The something else 
is what is so special. He is the dawning godhead in Man.
{extract]

Usually, the start tag looks like [name} while the end tag looks like {name]. Within a document, every start tag must have a matching end tag and vice versa. You can also have empty tags that mark points within the text. Empty tags look like [name]. And you can have tags without names that indicate anonymous ranges. Empty tags and anonymous ranges really come into their own when you start adding annotations (as we'll see later).

Overlapping Ranges

If you're familiar with SGML and XML you're probably thinking “OK, but this is exactly what SGML and XML do. What's so different?” What's different in LMNL is that tags indicate ranges rather than elements, and, unlike elements, ranges can overlap each other. For example, if I wanted to mark up the section of the text that refers to and quotes from the book “The Guizer”, I could do so despite the fact that this range runs across the [paragraph] and [extract]:

[paragraph}
Trickster has never been restricted to one society. In European countries he 
appears in the guise of Jester or Fool, and his roots in the human psyche are 
deep. [reference}Alan Garner has collected Trickster stories from many 
countries in his book The Guizer and he writes:
{paragraph]
[extract}
If we take the elements from which our emotions are built and give them 
separate names such as Mother, Hero, Father, King, Child, Queen, the element 
that I think marks most of us is that of the Fool. It is where our humanity 
lies. For the Fool is the advocate of uncertainty: he is at once creator and 
destroyer, bringer of help and harm. He draws a boundary for chaos, so that we 
can make sense of the rest. He is the shadow that shapes the light. Psychology 
calls him Trickster. I have called him Guizer. Guizer is the proper word for an 
actor in a mumming play. He is comical, grotesque, stupid, cunning, ambiguous. 
He is sometimes part animal, and always part something else. The something else 
is what is so special. He is the dawning godhead in Man.{reference]
{extract]

Enabling ranges to overlap is incredibly useful. It's often very hard to squeeze a document's structure into a neat tree, for example if you're including comments, marking up insertions and deletions or marking up text that has multiple structures such as the Bible (chapters and verses vs. sections and paragraphs). This isn't to say that tree structures are useless — of course they're incredibly useful, not least because they're easy to process — but they don't meet everyone's requirements.
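One way to picture the difference from a tree is to think of each range as a name plus a start and end offset into the text. The Python sketch below does just that; the Range class, the function name and the offsets are made up for illustration (they only mimic the shape of the example above) and aren't taken from the LMNL data model documents.

```python
from typing import NamedTuple


class Range(NamedTuple):
    name: str
    start: int  # offset of the first character in the range
    end: int    # offset just past the last character in the range


def overlaps_without_nesting(a: Range, b: Range) -> bool:
    """True if the ranges share some text but neither contains the other."""
    shares_text = a.start < b.end and b.start < a.end
    a_inside_b = b.start <= a.start and a.end <= b.end
    b_inside_a = a.start <= b.start and b.end <= a.end
    return shares_text and not (a_inside_b or b_inside_a)


# Made-up offsets with the same shape as the example above
paragraph = Range("paragraph", 0, 22)
extract = Range("extract", 22, 41)
reference = Range("reference", 11, 40)

print(overlaps_without_nesting(reference, paragraph))  # True
print(overlaps_without_nesting(reference, extract))    # True
print(overlaps_without_nesting(paragraph, extract))    # False
```

A tree of elements could never hold a [reference] like this one, because it would need to be partly inside and partly outside both of the other ranges.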

Tag Identifiers

If you're particularly alert, you may have noticed that there's a potential problem with allowing overlapping ranges when two ranges with the same name overlap each other. For example:

[keyphrase}overlapping [keyphrase}ranges{keyphrase] with 
identifiers{keyphrase]

Given this piece of LMNL, we can see that there are two [keyphrase]s, but it's not clear whether they're supposed to be:

  1. “overlapping ranges” and “ranges with identifiers” or
  2. “overlapping ranges with identifiers” and “ranges”

To overcome this problem, you can assign an identifier to a start tag, in which case it can only match an end tag with the same identifier. For example, to mark up the two [keyphrase]s “overlapping ranges” and “ranges with identifiers”, you could use:

[keyphrase=key1}overlapping [keyphrase}ranges{keyphrase=key1] with 
identifiers{keyphrase]

The values of the identifiers aren't important; they're not carried through into the data model, so applications can't use them for linking, for example. They also don't have to be unique within the document.

Because an identifier explicitly says which start tag an end tag matches, it's not actually necessary to have the name of the range in the end tag. You can use the shorthand:

[keyphrase=key1}overlapping [keyphrase}ranges{=key1] with 
identifiers{keyphrase]

to give the same two ranges as above.

If you don't use an identifier, or if the same identifier is used twice on ranges with the same name in the same scope, then an application will interpret the document as if the ranges were nested inside each other. For example, both the following documents give the same pair of [keyphrase] ranges, “overlapping ranges with identifiers” and “ranges”:

[keyphrase}overlapping [keyphrase}ranges{keyphrase] with 
identifiers{keyphrase]

[keyphrase=key1}overlapping [keyphrase=key1}ranges{=key1] with 
identifiers{keyphrase=key1]
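If you'd like to see the matching rules in action, here's a small Python sketch of them. It's only an illustration under some loud assumptions: it understands nothing but simple tags ([name}, [name=id}, {name], {name=id] and {=id]), it ignores annotations, layers and namespaces, and the function name is ours. Each end tag is matched against the most recent unmatched start tag that has the same identifier (and the same name, if the end tag gives one), which is what produces the nested reading when identifiers are missing or repeated.

```python
import re

# Simple tags only: [name}, [name=id}, {name], {name=id] and {=id].
TAG = re.compile(r"\[(\w+)(?:=(\w+))?\}|\{(\w*)(?:=(\w+))?\]")


def find_ranges(lmnl: str):
    """Return (name, covered_text) pairs, in the order the end tags appear."""
    pieces = []     # plain text found between tags
    open_tags = []  # unmatched start tags as (name, identifier, start_offset)
    spans = []      # matched ranges as (name, start_offset, end_offset)
    pos = 0         # position in the LMNL source
    offset = 0      # position in the plain text
    for match in TAG.finditer(lmnl):
        chunk = lmnl[pos:match.start()]
        pieces.append(chunk)
        offset += len(chunk)
        pos = match.end()
        if match.group(1) is not None:  # a start tag
            open_tags.append((match.group(1), match.group(2), offset))
            continue
        name, ident = match.group(3), match.group(4)
        # Search backwards from the most recent start tag, which gives nesting
        for i in range(len(open_tags) - 1, -1, -1):
            open_name, open_ident, start = open_tags[i]
            if open_ident == ident and name in ("", open_name):
                del open_tags[i]
                spans.append((open_name, start, offset))
                break
        else:
            raise ValueError(f"no start tag matches {match.group(0)}")
    pieces.append(lmnl[pos:])
    if open_tags:
        raise ValueError("unmatched start tag")
    text = "".join(pieces)
    return [(name, text[start:end]) for name, start, end in spans]


print(find_ranges(
    "[keyphrase=key1}overlapping [keyphrase}ranges{=key1] with identifiers{keyphrase]"
))
```

With the identifier the sketch reports “overlapping ranges” and “ranges with identifiers”; take the identifier away and it reports “ranges” and “overlapping ranges with identifiers” instead.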

Annotations

As you've seen, LMNL enables you to label ranges of text within a document — give them a name. LMNL also allows you to annotate them — add meta-information to a range. For example, we can label our document as an extract, and add a [href] annotation that points to the page where we got it:

[extract [href}http://www.zeta.org.au/~annskea/Trickstr.htm{href]}
[paragraph}
Trickster has never been restricted to one society. In European countries he 
appears in the guise of Jester or Fool, and his roots in the human psyche are 
deep. [reference}Alan Garner has collected Trickster stories from many 
countries in his book The Guizer and he writes:
{paragraph]
[extract}
If we take the elements from which our emotions are built and give them 
separate names such as Mother, Hero, Father, King, Child, Queen, the element 
that I think marks most of us is that of the Fool. It is where our humanity 
lies. For the Fool is the advocate of uncertainty: he is at once creator and 
destroyer, bringer of help and harm. He draws a boundary for chaos, so that we 
can make sense of the rest. He is the shadow that shapes the light. Psychology 
calls him Trickster. I have called him Guizer. Guizer is the proper word for an 
actor in a mumming play. He is comical, grotesque, stupid, cunning, ambiguous. 
He is sometimes part animal, and always part something else. The something else 
is what is so special. He is the dawning godhead in Man.{reference]
{extract]
{extract]

As you can see, an annotation can go within a start tag, and it's delimited with its own start and end tags. Annotations look a lot like ranges, but they have the important feature that, unlike ranges, they can't overlap. Because it's guaranteed that annotations don't overlap, you can use the shorthand {] for an annotation's end tag if you want. For example:

[extract [href}http://www.zeta.org.au/~annskea/Trickstr.htm{]}
[paragraph}
Trickster has never been restricted to one society. In European countries he 
appears in the guise of Jester or Fool, and his roots in the human psyche are 
deep. [reference}Alan Garner has collected Trickster stories from many 
countries in his book The Guizer and he writes:
{paragraph]
[extract}
If we take the elements from which our emotions are built and give them 
separate names such as Mother, Hero, Father, King, Child, Queen, the element 
that I think marks most of us is that of the Fool. It is where our humanity 
lies. For the Fool is the advocate of uncertainty: he is at once creator and 
destroyer, bringer of help and harm. He draws a boundary for chaos, so that we 
can make sense of the rest. He is the shadow that shapes the light. Psychology 
calls him Trickster. I have called him Guizer. Guizer is the proper word for an 
actor in a mumming play. He is comical, grotesque, stupid, cunning, ambiguous. 
He is sometimes part animal, and always part something else. The something else 
is what is so special. He is the dawning godhead in Man.{reference]
{extract]
{extract]

Whether or not you use the shorthand for annotation end tags is up to you; it can make it easier to type annotations, especially when they have simple values, but if you have complex annotations it can make it harder to keep track of where you are.

Annotations in End Tags

Annotations don't have to go in the start tag of a range; you can also put them in an end tag if that's more appropriate. For example, it sometimes feels more natural to put a citation or a comment after the thing that you're citing or commenting on. In this next example, the [reference] range has a [cite] annotation in its end tag:

[extract [href}http://www.zeta.org.au/~annskea/Trickstr.htm{]}
[paragraph}
Trickster has never been restricted to one society. In European countries he 
appears in the guise of Jester or Fool, and his roots in the human psyche are 
deep. [reference}Alan Garner has collected Trickster stories from many 
countries in his book The Guizer and he writes:
{paragraph]
[extract}
If we take the elements from which our emotions are built and give them 
separate names such as Mother, Hero, Father, King, Child, Queen, the element 
that I think marks most of us is that of the Fool. It is where our humanity 
lies. For the Fool is the advocate of uncertainty: he is at once creator and 
destroyer, bringer of help and harm. He draws a boundary for chaos, so that we 
can make sense of the rest. He is the shadow that shapes the light. Psychology 
calls him Trickster. I have called him Guizer. Guizer is the proper word for an 
actor in a mumming play. He is comical, grotesque, stupid, cunning, ambiguous. 
He is sometimes part animal, and always part something else. The something else 
is what is so special. He is the dawning godhead in Man.
{reference [cite}Garner, A., The Guizer: A Book of Fools, London, Hamish Hamilton, 1975, p.9.{]]
{extract]
{extract]

Multiple Annotations

A tag can contain as many annotations as you like, and they don't necessarily have to have different names (unlike attributes in SGML or XML). The following example adds a [book] range with an [ISBN] and a number of [buy] annotations that list URLs from which you can buy the book:

[extract [href}http://www.zeta.org.au/~annskea/Trickstr.htm{]}
[paragraph}
Trickster has never been restricted to one society. In European countries he 
appears in the guise of Jester or Fool, and his roots in the human psyche are 
deep. [reference}Alan Garner has collected Trickster stories from many 
countries in his book 
[book [ISBN}0241892228{]
      [buy}http://www.allbookstores.com/book/compare/0241892228{]
      [buy}http://www.abebooks.com/{]
      [buy}http://www.bookfinder.com/{]}The Guizer{book] and he writes:
{paragraph]
[extract}
If we take the elements from which our emotions are built and give them 
separate names such as Mother, Hero, Father, King, Child, Queen, the element 
that I think marks most of us is that of the Fool. It is where our humanity 
lies. For the Fool is the advocate of uncertainty: he is at once creator and 
destroyer, bringer of help and harm. He draws a boundary for chaos, so that we 
can make sense of the rest. He is the shadow that shapes the light. Psychology 
calls him Trickster. I have called him Guizer. Guizer is the proper word for an 
actor in a mumming play. He is comical, grotesque, stupid, cunning, ambiguous. 
He is sometimes part animal, and always part something else. The something else 
is what is so special. He is the dawning godhead in Man.
{reference [cite}Garner, A., The Guizer: A Book of Fools, London, Hamish Hamilton, 1975, p.9.{]]
{extract]
{extract]

Unlike attributes in elements, the order of annotations is preserved. Note that that doesn't necessarily mean the order matters — whether the annotations have to appear in a particular order or not depends on the markup language.
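If you're writing an application, the practical upshot is that a name-to-value map (the usual way of holding XML attributes) isn't enough for annotations: it would lose both the repeated [buy] annotations and their order. Here's a minimal Python sketch that keeps them in an ordered list instead. The class names are hypothetical, and this version treats annotation values as plain strings.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    name: str
    value: str


@dataclass
class TaggedRange:
    name: str
    text: str
    # A list, not a dict: order is kept and names may repeat
    annotations: list = field(default_factory=list)

    def values(self, name: str) -> list:
        """Return the values of every annotation with this name, in order."""
        return [a.value for a in self.annotations if a.name == name]


book = TaggedRange("book", "The Guizer", [
    Annotation("ISBN", "0241892228"),
    Annotation("buy", "http://www.allbookstores.com/book/compare/0241892228"),
    Annotation("buy", "http://www.abebooks.com/"),
    Annotation("buy", "http://www.bookfinder.com/"),
])

print(book.values("buy"))
```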

Structure in Annotations

Another major difference between attributes in SGML and XML and annotations in LMNL is that annotations can themselves have structure. You can put ranges inside annotations; you can put annotations on annotations (and annotations on annotations on ranges inside annotations and so on). There is no limit. This means that the only decision you have to make when deciding whether to use an annotation or a range to hold a piece of information is whether it is “content” or “metadata”.

In this next example, the [href] annotation has itself been annotated with a [title] annotation, giving the title of the referenced page, and the value of the [cite] annotation has been marked up with ranges that indicate the different parts of the citation:

[extract 
  [href
    [title}Ted Hughes and Crow{]
    }http://www.zeta.org.au/~annskea/Trickstr.htm{]}
[paragraph}
Trickster has never been restricted to one society. In European countries he 
appears in the guise of Jester or Fool, and his roots in the human psyche are 
deep. [reference}Alan Garner has collected Trickster stories from many 
countries in his book 
[book [ISBN}0241892228{]
      [buy}http://www.allbookstores.com/book/compare/0241892228{]
      [buy}http://www.abebooks.com/{]
      [buy}http://www.bookfinder.com/{]}The Guizer{book] and he writes:
{paragraph]
[extract}
If we take the elements from which our emotions are built and give them 
separate names such as Mother, Hero, Father, King, Child, Queen, the element 
that I think marks most of us is that of the Fool. It is where our humanity 
lies. For the Fool is the advocate of uncertainty: he is at once creator and 
destroyer, bringer of help and harm. He draws a boundary for chaos, so that we 
can make sense of the rest. He is the shadow that shapes the light. Psychology 
calls him Trickster. I have called him Guizer. Guizer is the proper word for an 
actor in a mumming play. He is comical, grotesque, stupid, cunning, ambiguous. 
He is sometimes part animal, and always part something else. The something else 
is what is so special. He is the dawning godhead in Man.
{reference 
  [cite}[author}Garner, A.{author], [title}The Guizer: A Book of Fools{title], 
        London, [publisher}Hamish Hamilton{publisher], [year}1975{year], 
        p.[page}9{page].{cite]]
{extract]
{extract]

Extra Twists

Really, the above is all you need to know in order to use LMNL quite happily, so it might be a good idea to stop now and try to use what you've learned. The rest of the tutorial discusses some of the more esoteric aspects of LMNL.

Comments

Like any good language, LMNL has a syntax for including comments within the text. Comments don't get passed on to applications, so they're useful for commenting out bits of LMNL that you want to ignore. A comment looks like:

[!-- This is a comment --]

Comments can go pretty much anywhere within a document, including within start tags and end tags.
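Since comments aren't passed on to applications, one of the first things a simple processor might do is throw them away. Here's a tiny Python sketch of that step. It assumes that comments don't nest and that a comment ends at the first --] after it starts, which this tutorial doesn't spell out.

```python
import re

# Assumption: a comment runs from "[!--" to the first "--]" that follows it.
COMMENT = re.compile(r"\[!--.*?--\]", re.DOTALL)


def strip_comments(lmnl: str) -> str:
    """Remove every comment, leaving the rest of the document untouched."""
    return COMMENT.sub("", lmnl)


print(strip_comments("Hello [!-- This is a comment --]World!"))  # Hello World!
```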

Entities

We introduced you to the predefined entities at the beginning of this tutorial, and hinted that you could create your own as well. Well, guess what, you can! You can declare entities with an entity declaration anywhere you like in your LMNL document, including within start and end tags, as long as it's before the first use of that entity. For example, to declare an &nbsp; entity in order to easily insert non-breaking spaces in your document, you could use:

[!lmnl version="0.1"]
[!entity nbsp="&#xA0;"]
Hello World!

Entities in LMNL are quite restricted compared to entities in XML, however. They're roughly the same as internal entities, but (and this is important) they can't contain markup. The point of entities in LMNL is to provide names for characters to save you from having to remember their Unicode code point or the precise sequence of keys you have to type to get them, not as a general include mechanism. We'll eventually be layering stuff on top of LMNL to provide inclusions...

Anyway, back to entities. It would be a real pain if you had to declare every single entity that you wanted to use in your document, so there's a quick and easy way to borrow entities from other documents: an entities declaration. For example, if html.lmnl contained a bunch of entity declarations (including one for &nbsp;), I could reuse them in my document by importing them as follows:

[!lmnl version="0.1"]
[!entities href="html.lmnl"]
Hello World!

The html.lmnl document might look like:

[!lmnl version="0.1"]
This document contains declarations of the following entities for 
you to use in your own documents:

[!entity nbsp  = " "]
[entity [name}nbsp{]  [char} {] }non-breaking space character{entity]

[!entity iexcl = "¡"]
[entity [name}iexcl{] [char}¡{]}inverted exclamation mark{entity]

[!entity cent  = "¢"]
[entity [name}cent{]  [char}¢{] }cent sign{entity]

[!entity pound = "£"]
[entity [name}pound{] [char}£{]}pound sign{entity]
...

html.lmnl is a document in its own right, with its own content. When you use an entities declaration, you pull in the entity declarations from that document, regardless of the content. Documents (such as html.lmnl) that exist purely to define bunches of entities can have content that describes the entities they declare (as in the example above) or can have no content at all if they want.

You can't override entities by redeclaring them with a different value after they've already been declared. It's probably a bad idea to use an entity for any text that you might want to override anyway, so this shouldn't be too much of a restriction.
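The two rules about entities (declare before use, and no overriding) are easy to check as you read through a document. The Python sketch below shows one way to do it. Treat it as a guess at the behaviour rather than a description of a real LMNL processor: the function name is ours, it leaves out the predefined entities and entities declarations, it takes entity values literally, and raising an error on a conflicting redeclaration is our assumption (this tutorial only says that you can't override).

```python
import re

# One pattern for both constructs, so they are handled in document order.
TOKEN = re.compile(r'\[!entity\s+(\w+)\s*=\s*"([^"]*)"\]|&(\w+);')


def expand_entities(lmnl: str) -> str:
    """Drop entity declarations and replace entity references by their values."""
    entities = {}
    out = []
    pos = 0
    for match in TOKEN.finditer(lmnl):
        out.append(lmnl[pos:match.start()])
        pos = match.end()
        if match.group(1) is not None:  # an entity declaration
            name, value = match.group(1), match.group(2)
            if entities.get(name, value) != value:
                # Assumption: a conflicting redeclaration is reported as an error
                raise ValueError(f"entity {name!r} is already declared")
            entities[name] = value
        else:  # an entity reference
            name = match.group(3)
            if name not in entities:
                raise KeyError(f"entity {name!r} is used before it is declared")
            out.append(entities[name])
    out.append(lmnl[pos:])
    return "".join(out)


print(expand_entities('[!entity pound="£"]Cost: &pound;5'))  # Cost: £5
```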

Namespaces

If you're used to XML, you're probably wondering whether LMNL uses namespaces. The answer is that it does. Namespaces are built in to LMNL; when we talk about the name of a range or an annotation, we're really referring to its expanded name — a pair of a namespace and a local name — and the names that we use in tags are actually qualified names that are resolved into these expanded names.

Namespace declarations in LMNL are quite different in effect from those in XML, however.

First, namespace declarations can appear anywhere within a LMNL document, including within start tags and inside annotations, and they have a scope that extends from that point on in the document. Once you've associated a prefix with a namespace, that's it — you can't change the prefix for the namespace, nor the namespace for the prefix. This guarantees that all LMNL documents are “sane” (see http://www.flightlab.com/~joe/sgml/sanity.txt for the definition of namespace sanity).

Second, there isn't such a thing as the “default namespace” in LMNL: if a name has a prefix then it's in a namespace, if it hasn't then it's not, and this applies to both ranges and annotations.

Third, prefixes are only significant within LMNL syntax — LMNL applications usually don't have access to what prefix was used on a particular range. Importantly, this means that if you want to include qualified names in content then you have to use another method (for example an annotation) to associate prefixes to namespaces.
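Here's how those rules might look to a program that resolves qualified names into expanded names. This is a Python sketch with made-up class and method names, not an API from any LMNL implementation. It assumes that repeating an identical declaration is harmless, and it rejects any attempt to change what a prefix or a namespace is bound to.

```python
class NamespaceTable:
    """Prefix bindings that, once made, hold for the rest of the document."""

    def __init__(self):
        self._by_prefix = {}
        self._by_uri = {}

    def declare(self, prefix: str, uri: str) -> None:
        if self._by_prefix.get(prefix, uri) != uri:
            raise ValueError(f"prefix {prefix!r} is already bound to another namespace")
        if self._by_uri.get(uri, prefix) != prefix:
            raise ValueError(f"namespace {uri!r} already has another prefix")
        self._by_prefix[prefix] = uri
        self._by_uri[uri] = prefix

    def expand(self, qname: str):
        """Return (namespace, local name); the namespace is None without a prefix."""
        prefix, colon, local = qname.partition(":")
        if not colon:
            return (None, qname)  # no prefix means no namespace, never a default
        if prefix not in self._by_prefix:
            raise KeyError(f"prefix {prefix!r} has not been declared")
        return (self._by_prefix[prefix], local)


ns = NamespaceTable()
ns.declare("bib", "http://www.example.com/bibliographic")
ns.declare("p", "http://www.example.com/paper")

print(ns.expand("bib:reference"))
print(ns.expand("href"))
```

Notice that expand() hands back the namespace rather than the prefix, which matches the point that applications don't usually get to see which prefix was used.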

So here's our example again. This time the [reference] range, and the ranges held within its [cite] annotation are in the namespace http://www.example.com/bibliographic (associated with the prefix bib) and the other ranges are in the namespace http://www.example.com/paper (associated with the prefix p). The annotations are all in no namespace aside from the [ISBN] annotation, which is in the bibliographic namespace:

[!lmnl version="0.1"]
[!ns bib="http://www.example.com/bibliographic"]
[!ns p="http://www.example.com/paper"]
[p:extract 
  [href
    [title}Ted Hughes and Crow{]
    }http://www.zeta.org.au/~annskea/Trickstr.htm{]}
[p:paragraph}
Trickster has never been restricted to one society. In European countries he 
appears in the guise of Jester or Fool, and his roots in the human psyche are 
deep. [bib:reference}Alan Garner has collected Trickster stories from many 
countries in his book 
[p:book [bib:ISBN}0241892228{]
        [buy}http://www.allbookstores.com/book/compare/0241892228{]
        [buy}http://www.abebooks.com/{]
        [buy}http://www.bookfinder.com/{]}The Guizer{p:book] and he writes:
{p:paragraph]
[p:extract}
If we take the elements from which our emotions are built and give them 
separate names such as Mother, Hero, Father, King, Child, Queen, the element 
that I think marks most of us is that of the Fool. It is where our humanity 
lies. For the Fool is the advocate of uncertainty: he is at once creator and 
destroyer, bringer of help and harm. He draws a boundary for chaos, so that we 
can make sense of the rest. He is the shadow that shapes the light. Psychology 
calls him Trickster. I have called him Guizer. Guizer is the proper word for an 
actor in a mumming play. He is comical, grotesque, stupid, cunning, ambiguous. 
He is sometimes part animal, and always part something else. The something else 
is what is so special. He is the dawning godhead in Man.
{bib:reference 
  [cite}[bib:author}Garner, A.{bib:author], [bib:title}The Guizer: A Book of Fools{bib:title], 
        London, [bib:publisher}Hamish Hamilton{bib:publisher], [bib:year}1975{bib:year], 
        p.[bib:page}9{bib:page].{cite]]
{p:extract]
{p:extract]

And here's the same document again, this time with the namespace declaration for the bibliographic namespace in a different place:

[!lmnl version="0.1"]
[!ns p="http://www.example.com/paper"]
[p:extract 
  [href
    [title}Ted Hughes and Crow{]
    }http://www.zeta.org.au/~annskea/Trickstr.htm{]}
[p:paragraph}
Trickster has never been restricted to one society. In European countries he 
appears in the guise of Jester or Fool, and his roots in the human psyche are 
deep.
[!ns bib="http://www.example.com/bibliographic"]
[bib:reference}Alan Garner has collected Trickster stories from many 
countries in his book 
[p:book [bib:ISBN}0241892228{]
        [buy}http://www.allbookstores.com/book/compare/0241892228{]
        [buy}http://www.abebooks.com/{]
        [buy}http://www.bookfinder.com/{]}The Guizer{p:book] and he writes:
{p:paragraph]
[p:extract}
If we take the elements from which our emotions are built and give them 
separate names such as Mother, Hero, Father, King, Child, Queen, the element 
that I think marks most of us is that of the Fool. It is where our humanity 
lies. For the Fool is the advocate of uncertainty: he is at once creator and 
destroyer, bringer of help and harm. He draws a boundary for chaos, so that we 
can make sense of the rest. He is the shadow that shapes the light. Psychology 
calls him Trickster. I have called him Guizer. Guizer is the proper word for an 
actor in a mumming play. He is comical, grotesque, stupid, cunning, ambiguous. 
He is sometimes part animal, and always part something else. The something else 
is what is so special. He is the dawning godhead in Man.
{bib:reference 
  [cite}[bib:author}Garner, A.{bib:author], [bib:title}The Guizer: A Book of Fools{bib:title], 
        London, [bib:publisher}Hamish Hamilton{bib:publisher], [bib:year}1975{bib:year], 
        p.[bib:page}9{bib:page].{cite]]
{p:extract]
{p:extract]

Since namespace declarations are always scoped to “the whole of the rest of the document”, no matter where they appear, it's good practice to put them all at the top of the document. They're allowed to appear anywhere so that it's easy for streaming applications to serialise LMNL documents.

Layers

So far the documents that we've looked at have been what are known as flat documents. They consist purely of text and a single set of ranges over that text. In LMNL terms, they have two layers: a text layer and a layer containing ranges that range over the characters in the text layer. Diagrammatically, it might look like:

Figure: Flat LMNL

There are two kinds of extensions that we can make to flat LMNL.

First, we can add other layers that contain ranges that range over the text layer. This is especially useful because the different layers can provide different views of the same document without worrying about the ranges they contain interacting with each other. For example, another person could add their own markup to label the year, month and day differently:

Figure: LMNL with Two Layers over the Text

Second, we can add other layers that contain ranges that range over the ranges in another layer. This is mainly used by applications that derive structure based on the presence of ranges in a layer. For example, an application might detect the fact that [year], [month] and [day] ranges occur in sequence, and from that deduce a [date] range:

Figure: LMNL with Ranges over Ranges

In the rest of this section we'll see how to represent these constructions using LMNL syntax.

Declaring Layers

If you want to create anything other than a flat document in LMNL, you have to declare the layers that the document holds using layer declarations. A layer declaration specifies a name for the layer and indicates the base that the ranges held within the layer range over. For example, to declare a layer called type that contains ranges that range over the ranges in the layer called lexical, you can use:

[!layer name="type" base="lexical"]

The base attribute can take two special values: #text and #default. The value #text means that the layer ranges over the text within the document. For example, to say that the lexical layer contains ranges that range over the text in the document, and that the type layer contains ranges that range over the lexical layer, we can use:

[!lmnl version="0.1"]
[!layer name="lexical" base="#text"]
[!layer name="type" base="lexical"]
2002-09-12

The special #default value in the base attribute means that the layer ranges over the default layer — a layer that contains all the ranges that aren't explicitly associated with another layer. The default layer is a convenience because it means you can add layers to a flat document without having to change the tags that it contains.

Like namespace declarations, layer declarations can go practically anywhere in the document (including within tags) and have a scope that spans from that point on. The only thing that actually limits where you put them is that they have to come before the first range that belongs to that layer and before any layer declarations that refer to it as a base. Like namespace declarations, it's a good idea to put all the layer declarations at the start of the document, but they're allowed anywhere because that makes LMNL documents easier to stream.
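The ordering rule for bases can be sketched as a small check in Python. This is only an illustration under stated assumptions: the function name check_layer_declarations is hypothetical, the declarations are assumed to have already been extracted from the document as (name, base) pairs in document order, and the rule about ranges appearing after their layer's declaration is not covered.

```python
SPECIAL_BASES = {"#text", "#default"}


def check_layer_declarations(declarations):
    """Return error messages for (name, base) pairs listed in document order."""
    declared = set()
    errors = []
    for name, base in declarations:
        # A named base has to be declared earlier in the document.
        if base not in SPECIAL_BASES and base not in declared:
            errors.append(
                f"layer {name!r} refers to base {base!r} before it is declared"
            )
        declared.add(name)
    return errors


good = [("lexical", "#text"), ("type", "lexical")]
bad = [("type", "lexical"), ("lexical", "#text")]
good_errors = check_layer_declarations(good)
bad_errors = check_layer_declarations(bad)
```

The first list mirrors the declarations shown earlier and produces no errors; the second swaps them, so the type layer is reported as referring to a base that has not been declared yet.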

Layers and Ranges

How do you know which ranges belong to which layers? Well, you have to associate ranges with the layers that they belong to using a layer identifier. Here's an example where the default layer contains the ranges [year], [month] and [day], the type layer contains the [date] range which ranges over these three ranges, and the fr layer contains [an], [mois] and [jour] ranges:

[!lmnl version="0.1"]
[!layer name="fr" base="#text"]
[!layer name="type" base="#default"]
[date~type
  }[an~fr}[year}2002{year]{an~fr
  ]-[mois~fr}[month}09{month]{mois~fr
  ]-[jour~fr}[day}12{day]{jour~fr]{date~type]

In case you're wondering, within a start or end tag the range identifier comes before the layer identifier, so you might have [keyphrase=key1~jt}...{=key1~jt].

Reified LMNL

Now you've got your head around layers, it's time to look more closely at the details of how physical LMNL documents get processed. We've described the syntax of LMNL here in terms of how it gets interpreted as a basic LMNL data model. The trouble is that this way of interpreting LMNL documents leads to some unintuitive results. Take the following example:

[graphic [src}crow.gif{]}A crow{graphic]

We can interpret this piece of LMNL as a [graphic] range that ranges over the text “A crow”. Diagrammatically, it would look like:

Figure: LMNL of a [graphic] Range

Now let's say we add a [link] range to the same layer:

[link}[graphic [src}crow.gif{]}A crow{graphic]{link [href}crow.xml{]]

The layer now contains two ranges — a [link] range and a [graphic] range, both of which span the same set of characters:

Figure: LMNL with Two Ranges over the Same Span

But here's where our problems start, because this same picture of two ranges spanning over the same set of characters could be generated from four different LMNL documents:

[link}[graphic [src}crow.gif{]}A crow{graphic]{link [href}crow.xml{]]

[link}[graphic [src}crow.gif{]}A crow{link [href}crow.xml{]]{graphic]

[graphic [src}crow.gif{]}[link}A crow{graphic]{link [href}crow.xml{]]

[graphic [src}crow.gif{]}[link}A crow{link [href}crow.xml{]]{graphic]

An author, though, is likely to mean something different by each of these markup possibilities. In the first case, they might mean that the graphic crow.gif (or the text “A crow” if the graphic can't be shown) should be linked to crow.xml. In the last case, they might mean that should the graphic not be displayable, an application should show the text “A crow” with a link to crow.xml around it.
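The loss of information can be seen in a short Python sketch. It is a simplification made for illustration: the four documents are written by hand as streams of start, text and end events rather than parsed from LMNL syntax, annotations are left out, and flat_ranges and doc are hypothetical helper names.

```python
def flat_ranges(events):
    """Turn a stream of tag and text events into a set of (name, start, end)."""
    offset = 0
    open_at = {}
    ranges = set()
    for kind, value in events:
        if kind == "text":
            offset += len(value)
        elif kind == "start":
            open_at[value] = offset
        elif kind == "end":
            ranges.add((value, open_at.pop(value), offset))
    return ranges


def doc(starts, ends):
    """Build the event stream for one ordering of the start and end tags."""
    return (
        [("start", name) for name in starts]
        + [("text", "A crow")]
        + [("end", name) for name in ends]
    )


# The four documents above, in the same order.
documents = [
    doc(["link", "graphic"], ["graphic", "link"]),
    doc(["link", "graphic"], ["link", "graphic"]),
    doc(["graphic", "link"], ["graphic", "link"]),
    doc(["graphic", "link"], ["link", "graphic"]),
]
results = [flat_ranges(d) for d in documents]
```

All four entries in results are the same set of two ranges, each spanning offsets 0 to 6, so the order in which the author wrote the tags cannot be recovered from the ranges alone.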

So, what to do? There is an argument that says that if the author intended the [link] to be over the [graphic] rather than the text “A crow” then the [link] and the [graphic] should be on different layers:

Figure: LMNL with [link] Ranging over [graphic]

But layering and containment are two different things — just because a range contains another range doesn't mean that it should be on a different layer from that range — so this isn't a valid solution.

Instead, we have reified LMNL. In reified LMNL, the text layer in the document is the LMNL syntax itself:

Figure: LMNL Document used as Text Layer

An application then builds layers on top of this text layer, pulling out the important features of the LMNL syntax. For example, it could build a syntax layer as in the following:

Figure: Syntactic Ranges over LMNL Document Text Layer

Eventually, the application comes to what's known as the reified LMNL layer. The reified LMNL layer contains a set of ranges in the reified LMNL namespace of http://www.lmnl.org/namespace/reified: [rl:document], [rl:range], [rl:annotation], [rl:value] and [rl:text]. The following diagram shows the reified LMNL layer for the document we're looking at:

Figure: Reified LMNL Layer

There's nothing to say that the processor has to go through the intermediate syntax layer before getting to the reified LMNL layer. There can be as many or as few layers between the LMNL text and the reified LMNL layer as is useful for an application.

Don't worry if you can't make out all the lines in the diagram. In this example, the reified LMNL layer forms a nice tree structure. If we were to write out the reified LMNL layer in LMNL, it would look something like:

[!lmnl version="0.1"]
[!ns rl="http://www.lmnl.org/namespace/reified"]
[rl:document
   }[rl:range [name}link{]
      }[rl:value
         }[rl:range [name}graphic{]
            }[rl:annotation [name}src{]
               }[rl:value
                  }[rl:text [characters}crow.gif{]
               ]{rl:value
            ]{rl:annotation
            ][rl:value
               }[rl:text [characters}A crow{]
            ]{rl:value
         ]{rl:range
      ]{rl:value
      ][rl:annotation [name}href{]
         }[rl:value
            }[rl:text [characters}crow.xml{]
         ]{rl:value
      ]{rl:annotation
   ]{rl:range
]{rl:document]

The important thing about this layer is that it contains all the relevant information that you would normally get from the LMNL document — the character data, the range and annotation markup and so on — and maintains the relationships between the ranges. In this example, if it needs to, an application can tell that the [graphic] range is inside the [link] range (rather than the other way round) because the reified [rl:range] that represents the [graphic] range is inside the reified [rl:range] that represents the [link] range.
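As a rough illustration of that containment test, the Python sketch below mirrors the reified listing as a tree of nodes and asks whether one named range sits inside another. Node, find_ranges and contains are hypothetical names made up for this example, and a real processor would work on reified ranges in a layer rather than on a ready-made tree.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str   # "document", "range", "annotation", "value" or "text"
    name: str = ""  # range or annotation name, or the characters of a text node
    children: list = field(default_factory=list)


def find_ranges(node, name):
    """Yield every reified range node called name at or below node."""
    if node.kind == "range" and node.name == name:
        yield node
    for child in node.children:
        yield from find_ranges(child, name)


def contains(tree, outer, inner):
    """True if a range called inner sits inside a range called outer."""
    for outer_node in find_ranges(tree, outer):
        for child in outer_node.children:
            if next(find_ranges(child, inner), None) is not None:
                return True
    return False


def text_value(characters):
    return Node("value", children=[Node("text", characters)])


tree = Node("document", children=[
    Node("range", "link", [
        Node("value", children=[
            Node("range", "graphic", [
                Node("annotation", "src", [text_value("crow.gif")]),
                text_value("A crow"),
            ]),
        ]),
        Node("annotation", "href", [text_value("crow.xml")]),
    ]),
])

link_holds_graphic = contains(tree, "link", "graphic")
graphic_holds_link = contains(tree, "graphic", "link")
```

For this tree, link_holds_graphic is true and graphic_holds_link is false, which is exactly the distinction that the flat picture of two ranges over the same span could not express.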

The reified LMNL layer also retains other occasionally useful information such as the prefix you've used for a particular namespace or the name you've used for a particular layer. These things are meaningless at the “pure” level of the LMNL data model, but people retain an attachment to meaningful prefixes and names, so they're not just thrown away.

You don't really need to know any of the details of the reified LMNL layer unless you're implementing a LMNL processor; if you're doing that, you'll have to dive into the real spec.