This document describes the LMNL data model. The definitions in this document can be used in other documents, for example API or syntax definitions, that need to refer to the information available within a LMNL document and the relationships between the components that it contains.
This document starts by defining the basic terms of character, namespace name, local part and expanded name. These are terms that are borrowed from the XML canon, and refer to concepts that are used throughout the LMNL data model.
This document goes on to describe the conceptual foundations of LMNL. These are the concepts of the layer, the range and the annotation. In brief, a LMNL document consists of one or more layers, arranged on top of each other. The bottom layer is a text layer that contains a sequence of characters. Other layers contain ranges, which label a particular subsequence of items from the layer underneath them with a name, providing information about the structure of the underlying layer. Ranges can also be annotated to provide meta-information about the range that they span over. These annotations hold values that are themselves text layers, and can therefore have their own internal structure as well.
This document then describes two subsets of LMNL data models: the flat subset, which comprises simple data models that contain only one layer of ranges; and the tree subset, which comprises data models that can be represented by tree structures (that is, they contain no overlapping ranges).
Last modified 8 Oct 2002 by Jeni Tennison.
LMNL borrows a number of basic concepts from XML, Namespaces in XML and XPath. Rather than restate them here, we simply provide links to the relevant definitions in these Recommendations.
| Concept | Notes |
|---|---|
| character | A character in LMNL is any Unicode character excluding […] LMNL requires that sequences of characters are normalized, as in Section 2.13 (Normalization Checking) of XML 1.1. Any sequence of characters used in LMNL must be Unicode normalized per Unicode Normalization Form NFC for reasons described in Character Model for the World Wide Web. |
| namespace name | A namespace name is a URI that identifies a particular markup language. We also sometimes use the term namespace URI, or simply namespace, to refer to the namespace name. Two namespace names are equal if they are equal on a character-by-character basis. |
| local part | A local part of a name, sometimes called a local name, follows the same format as that defined in Namespaces in XML. Technically, local parts need only follow the slightly looser rules described in Appendix B (Suggestions for XML Names) of XML 1.1, but for compatibility with XML 1.0, LMNL data models should use names that follow the XML 1.0 rules. Two local parts are equal if they are equal on a character-by-character basis. |
| expanded name | An expanded name is a pair of a namespace name and a local part. The related term qualified name is sometimes used synonymously with expanded name, but technically refers to the syntactic construction of a local part optionally preceded by a prefix and a colon. Two expanded names are equal if the namespace names are equal and the local parts are equal. |
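
The equality rules for names, and the NFC requirement for character sequences, can be illustrated with a minimal Python sketch. The `ExpandedName` type, the `is_nfc` helper and the example URIs are hypothetical names chosen for illustration; they are not part of LMNL.

```python
import unicodedata
from typing import NamedTuple


class ExpandedName(NamedTuple):
    """A (namespace name, local part) pair; hypothetical illustration type."""
    namespace_name: str
    local_part: str


def is_nfc(text: str) -> bool:
    """True if text is already in Unicode Normalization Form NFC."""
    return unicodedata.normalize("NFC", text) == text


# Tuple equality compares both parts character by character,
# which matches the equality rule for expanded names.
a = ExpandedName("http://example.org/ns", "date")
b = ExpandedName("http://example.org/ns", "date")
c = ExpandedName("http://example.org/NS", "date")  # differs only in case

same = a == b        # equal namespace names and equal local parts
different = a != c   # namespace names differ on a character basis

composed = "caf\u00e9"      # 'e with acute' as one code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent
```

Because the comparison is character by character, two namespace names that differ only in case are treated as different names in this sketch.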
A layer is a sequence of characters or a sequence of ranges. Layers layer on top of each other: each layer is based on a base underneath it and may have many overlays layered on top of it. There is no limit to the number of layers that can be layered on top of each other.
If a layer holds ranges, the ranges in that layer's content range over the items held by the layer underneath (the layer's base). Layers that contain a sequence of characters can only appear at the bottom of the pile. These layers are known as text layers.
A LMNL document is a text layer. Because each layer has its own set of overlays, containing ranges, which themselves have annotations, by extension a LMNL document contains any number of layers, ranges and annotations.
A layer has the following properties:
| Property | Notes |
|---|---|
| base | The layer that holds the sequence of characters or ranges over which the ranges in this layer span, if there is one. If this layer is a text layer (and therefore the layer does not have a base) then this property is empty. |
| content | A sequence of items, which can either be a sequence of characters or a sequence of ranges. A layer whose content is a sequence of characters is a text layer. |
| overlays | A (possibly empty) sequence of layers whose base is this layer. |
Layers do not have any explicit relationship with each other (except in so far as they might have the same base).
Layer A and layer B have equal content if each item in layer A's content is equal to the equivalently positioned item of layer B's content and vice versa (two text layers have equal content if they hold the same sequence of characters). Two layers have equal bases if neither of them has a base or if their bases have equal content and equal bases.
Two layers are equal if they have equal bases, equal content and equal overlays. Layer A and layer B have equal overlays if neither of them has any overlays or if each of layer A's overlays has equal content and equal overlays to layer B's equivalently positioned overlay and vice versa.
A range is a span over a sequence of items within a layer. A range might be constructed automatically, through a range constructor that recognises sequences within the items in a layer, or manually, by someone marking up a document. Ranges usually have names, which label the span of text; ranges that do not have names are termed anonymous ranges.
A range has the following properties:
| Property | Notes |
|---|---|
| owner layer | A layer to which the range belongs. A range will appear within its owner layer's content. |
| name | An expanded name that provides a label for the range. If this range is an anonymous range (and therefore doesn't have a name), then this property is empty. |
| start | An integer indicating the start point of the range in the content of the base of the range's owner layer. Counting starts at 0 (the point before the first item in the content). |
| length | An integer indicating the length of the range. A range with a length of 0 is a point. |
| annotations | A (possibly empty) sequence of annotations that provide meta-information about the content of the range. |
It's possible to derive two further properties from those listed above:
| Property | Notes |
|---|---|
| end | An integer indicating the end point of the range in the content of the base of the range's owner layer. This is equal to the range's start plus its length. |
| value | A (possibly empty) sequence of characters or ranges from the range's owner layer's base that are within the range. This is equal to the subsequence of the range's owner layer's base's content starting at the range's start and ending at the range's end. |
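
As a rough illustration of how the stored and derived properties fit together, here is a minimal Python sketch. The class and attribute names are assumptions made for this example, annotations are left out for brevity, and the sample text is a date string chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union


@dataclass(eq=False)
class Layer:
    # A text layer holds a str; any other layer holds a list of ranges.
    content: Union[str, List["Range"]]
    base: Optional["Layer"] = None          # empty (None) for a text layer
    overlays: List["Layer"] = field(default_factory=list)


@dataclass(eq=False)
class Range:
    owner_layer: Layer
    name: Optional[Tuple[str, str]]         # expanded name, None if anonymous
    start: int
    length: int

    @property
    def end(self) -> int:
        # Derived property: start plus length.
        return self.start + self.length

    @property
    def value(self):
        # Derived property: the slice of the owner layer's base content.
        return self.owner_layer.base.content[self.start:self.end]


doc = Layer(content="2002-08-23")           # text layer, no base
l1 = Layer(content=[], base=doc)            # ranges over characters
doc.overlays.append(l1)
year = Range(l1, ("", "year"), 0, 4)
month = Range(l1, ("", "month"), 5, 2)
day = Range(l1, ("", "day"), 8, 2)
l1.content.extend([year, month, day])

l2 = Layer(content=[], base=l1)             # ranges over ranges
l1.overlays.append(l2)
date = Range(l2, ("", "date"), 0, 3)
l2.content.append(date)

point = Range(l1, None, 4, 0)               # anonymous range of length 0
```

In this sketch `month.value` is the string `"08"`, while `date.value` is the list of the three ranges in `l1`, since `date` ranges over ranges rather than characters.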
Ranges do not have any explicit relationship with each other (except in so far as they might be part of the same owner layer). However, it is possible to derive relationships between ranges based on their properties.
Two ranges are equal if their owner layers' bases have equal content, and if the ranges are both anonymous or have equal names, the same start, the same length and if their annotations are equal. Range A and range B have equal annotations if each of range A's annotations is equal to the equivalently positioned annotation of range B and vice versa.
Two ranges are clones if they have the same owner layer, the same start and the same length.
Range A encloses range B if they have the same owner layer, are not clones, range B's start is greater than or equal to range A's start and range B's end is less than or equal to range A's end. The enclosing ranges of a range are all those ranges that enclose the range. The closest enclosing ranges of a range are those enclosing ranges of the range that do not enclose any other of the range's enclosing ranges. The enclosed ranges of a range are all those ranges that are enclosed by (or within) the range. The closest enclosed ranges of a range are those enclosed ranges of the range that are not within any other of the range's enclosed ranges.
Range A and range B overlap if they have the same owner layer and either the start or end of range B (but not both) is greater than the start of range A and less than the end of range A. If the start of range B is within A then range A overlaps the start of range B. If the end of range B is within A then range A overlaps the end of range B.
Range A precedes range B if they have the same owner layer and range A's end is less than or equal to range B's start. Range A follows range B if they have the same owner layer and range A's start is greater than or equal to range B's end.
These definitions partition the sequence of ranges within a particular layer with respect to a given range. Given a range A, all the other ranges in the layer must precede A, enclose A, overlap the start of A, be a clone of A, be within A, overlap the end of A or follow A. Each other range in the layer can only be related to A in one of these seven ways.
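The seven-way classification can be sketched in Python as follows. The `Span` type and the `relation` function are hypothetical names, and both ranges are assumed to share an owner layer. A zero-length range at a boundary can satisfy more than one of the definitions as written, so the order of the checks below is an assumption of this sketch and not something the definitions prescribe.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Start and length of a range; hypothetical illustration type."""
    start: int
    length: int

    @property
    def end(self) -> int:
        return self.start + self.length


def relation(b: Span, a: Span) -> str:
    """Describe how range b relates to range a (same owner layer assumed)."""
    if b.start == a.start and b.length == a.length:
        return "clone"
    if b.end <= a.start:
        return "precedes"
    if b.start >= a.end:
        return "follows"
    if b.start <= a.start and a.end <= b.end:
        return "encloses"
    if a.start <= b.start and b.end <= a.end:
        return "within"
    # Only partial overlaps remain at this point.
    if b.start < a.start:
        return "overlaps start"   # a's start lies strictly inside b
    return "overlaps end"         # a's end lies strictly inside b


a = Span(4, 4)  # covers positions 4 to 8
results = [relation(Span(s, n), a) for s, n in
           [(0, 4), (2, 8), (2, 4), (4, 4), (5, 2), (6, 4), (8, 1)]]
```

Each of the seven sample ranges falls into a different category, in the order precedes, encloses, overlaps start, clone, within, overlaps end, follows.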
The start order of a sequence of ranges is defined such that range A is before range B in start order if they have the same owner layer and range A's start is less than range B's start. If the start points of the two ranges are the same then range A is before range B if its length is greater than range B's length. If the lengths of the two ranges are the same, then range A is before range B if it was before range B in the original sequence. The reverse start order is the precise reverse of this ordering.
All the ranges that precede a range are before it in start order; all the ranges that follow a range are after it in start order. All the ranges that overlap the start of a range are before it in start order; all the ranges that overlap the end of a range are after it in start order. All the enclosing ranges of a range are before it in start order; all the enclosed ranges of a range are after it in start order. The clones of a range might appear before or after it in start order, but not before or after another range that is not a clone of the range.
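Start order maps naturally onto a stable sort. The sketch below assumes ranges are represented as `(label, start, length)` tuples, a representation chosen only for this example; Python's `sorted` is stable, so ranges with the same start and length keep their original relative order.

```python
from typing import List, Tuple

RangeTuple = Tuple[str, int, int]  # (label, start, length); illustrative only


def start_order(ranges: List[RangeTuple]) -> List[RangeTuple]:
    # Earlier start first; for equal starts, the longer range first.
    # Ties on both keep their original order because sorted() is stable.
    return sorted(ranges, key=lambda r: (r[1], -r[2]))


def reverse_start_order(ranges: List[RangeTuple]) -> List[RangeTuple]:
    # The precise reverse of start order.
    return list(reversed(start_order(ranges)))


sample = [("b", 0, 2), ("a", 0, 4), ("c", 3, 1), ("d", 0, 2)]
ordered = [label for label, _, _ in start_order(sample)]
reverse = [label for label, _, _ in reverse_start_order(sample)]
```

Here `b` and `d` are clones, so their relative position is decided by where they appeared in the original sequence.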
An annotation is essentially a name-value pair, but one which can have annotations itself and whose value is a text layer and can therefore have internal structure.
Annotations have the following properties:
| Property | Notes |
|---|---|
| owner | A range (or annotation) that the annotation annotates. An annotation will appear within its owner range's annotations (or within its owner annotation's annotations). |
| name | An expanded name that provides a label for the annotation. |
| value | A text layer, that is the value for the annotation. An annotation must have a value, even if it is a text layer with empty content. |
| annotations | A sequence of annotations that provide meta-information about the value of the annotation. |
Note that annotations cannot be anonymous.
Annotations do not have any explicit relationship with each other (except in so far as they might have the same owner).
Two annotations are equal if they have equal names, if their values are equal, and if their annotations are equal. Annotation A and annotation B have equal annotations if each of annotation A's annotations is equal to the equivalently positioned annotation of annotation B and vice versa.
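The recursive shape of annotation equality can be shown with a small Python sketch. As a simplifying assumption, each annotation's value is modeled here as a plain string standing in for the content of its text layer, so overlays on the value are ignored; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True, eq=False)
class Annotation:
    name: Tuple[str, str]                     # expanded name
    value: str                                # stands in for a text layer
    annotations: Tuple["Annotation", ...] = ()


def equal_annotation_lists(xs, ys) -> bool:
    # Equal if the annotations match pairwise, position by position.
    return len(xs) == len(ys) and all(
        annotations_equal(x, y) for x, y in zip(xs, ys))


def annotations_equal(a: Annotation, b: Annotation) -> bool:
    # Equal names, equal values, and (recursively) equal annotations.
    return (a.name == b.name
            and a.value == b.value
            and equal_annotation_lists(a.annotations, b.annotations))


abbr = Annotation(("", "abbreviation"), "Aug")
first = Annotation(("", "name"), "August", (abbr,))
second = Annotation(("", "name"), "August",
                    (Annotation(("", "abbreviation"), "Aug"),))
bare = Annotation(("", "name"), "August")
```

In this sketch `first` and `second` are equal, while `bare` differs from both because it lacks the nested annotation.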
There are two subsets of the LMNL data model that are useful in certain circumstances: the flat subset and the tree subset.
The flat subset of LMNL data models consists of those data models in which each text layer has at most one overlay, and no other layers have any overlays. In other words, flat LMNL does not contain any layers that contain ranges that range over sequences of ranges.
Data models belonging to the flat subset are the easiest kind to create and understand.
It's possible to flatten any LMNL data model by combining layers into a single layer.
First, each layer whose base is not a text layer should be combined with its base, starting with those layers that don't have any overlays. Layer A and its base, layer B, can be combined to create layer C. Layer C's base is the base of layer B. Layer C's content is initialised to a sequence of ranges, each one equal to the equivalently positioned range in layer B's content. A new range is then added to layer C for each of the ranges in layer A. Given a range R in layer A, the new range is inserted into the sequence at R's start. The new range's name is the same as range R's. Its start is equal to the start of the first range in R's value and its end is equal to the end of the last range in R's value. Layer C does not have any overlays.
The result of this process is a data model in which each text layer has a number of overlays, none of which have any overlays themselves. The next step is to combine the overlays of each text layer, so that each text layer has a maximum of one overlay.
A sequence of layers that are overlays of the same layer and don't themselves have overlays can be combined to create a new layer. The base of the new layer is the base of the layers that are being combined. The content of the new layer is a sequence of ranges that are equal to the ranges in each of the layers being combined, arranged in start order.
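The core of the first step, re-expressing a range over ranges as a range over the underlying text, can be sketched as follows. This is a simplified illustration under several assumptions: ranges are `(name, start, length)` records, annotations are dropped, every range being lifted has a non-empty value, and the combined ranges are arranged in start order instead of being inserted at a particular index. The names `R`, `lift` and `combine` are hypothetical.

```python
from typing import List, NamedTuple


class R(NamedTuple):
    name: str
    start: int
    length: int


def lift(upper: R, lower: List[R]) -> R:
    """Re-express a range over `lower`'s ranges as a range over their base."""
    covered = lower[upper.start:upper.start + upper.length]  # upper's value
    first, last = covered[0], covered[-1]
    new_start = first.start
    new_end = last.start + last.length
    return R(upper.name, new_start, new_end - new_start)


def combine(layer_a: List[R], layer_b: List[R]) -> List[R]:
    """Combine layer A with its base, layer B, into a single list of ranges."""
    merged = list(layer_b) + [lift(r, layer_b) for r in layer_a]
    # Start order: earlier start first, then longer range first (stable).
    return sorted(merged, key=lambda r: (r.start, -r.length))


l1 = [R("year", 0, 4), R("month", 5, 2), R("day", 8, 2)]  # over "2002-08-23"
l2 = [R("date", 0, 3)]                                    # over l1's ranges
flat = combine(l2, l1)
```

With these inputs, the `date` range that spanned three ranges becomes a range over ten characters and sorts ahead of `year`, which shares its start but is shorter.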
The tree subset of LMNL data models are those data models that can be represented as a tree structure, for example using XML. Data models in the tree subset satisfy four rules:
These rules ensure that the LMNL document can be turned into a node tree as defined in XPath. To satisfy the somewhat tighter rules of well-formed XML documents (namely that there must be a single document element that contains the rest of the content of the document), it must be the case that, after flattening, there is a range whose start is 0 and whose length is equal to the number of characters in the document's content.
Changing a LMNL data model so that it satisfies these rules is a complex operation; there is no set algorithm for doing so.
Following is an example of a LMNL data model for a document holding a date. For conciseness, character sequences are represented as strings (e.g. "foo" is the character sequence { 'f', 'o', 'o' }).

    document d1 { base: {}, content: "2002-08-23", overlays: { l1 } }
    layer l1 { base: { d1 }, content: { r1, r2, r3 }, overlays: { l2 } }
    range r1 { name: { "", "year" }, owner layer: l1, start: 0, length: 4, annotations: {} }
    range r2 { name: { "", "month" }, owner layer: l1, start: 5, length: 2, annotations: { a1 } }
    annotation a1 { name: { "", "name" }, value: l3, owner: r2, annotations: { a2 } }
    layer l3 { base: {}, content: "August", overlays: {} }
    annotation a2 { name: { "", "abbreviation" }, value: l4, owner: a1, annotations: {} }
    layer l4 { base: {}, content: "Aug", overlays: {} }
    range r3 { name: { "", "day" }, owner layer: l1, start: 8, length: 2, annotations: {} }
    layer l2 { base: l1, content: { r4 }, overlays: {} }
    range r4 { name: { "", "date" }, owner layer: l2, start: 0, length: 3, annotations: { a3 } }
    annotation a3 { name: { "", "day-of-week" }, value: l5, owner: r4, annotations: {} }
    layer l5 { base: {}, content: "Friday", overlays: {} }

Thanks to the following for their comments on this document: John Cowan, Gavin Nicol, Wendell Piez.
© 2002 by the authors and LMNL.org. All rights reserved.