This document describes the LMNL syntax, which is a syntax that can be used to represent LMNL data models. It isn't the only syntax that could be used with LMNL data models, but it's fairly simple and it covers all possible LMNL data models, which gives it an advantage over XML syntax, for example.
Within this document, each piece of syntax is described in terms of how it maps on to both the LMNL data model and to reified LMNL. If a parser interprets a document in LMNL syntax as representing a LMNL data model, it will lose some potentially useful information, such as which of two ranges that are clones is within the other, or whether they simply overlap. If a parser needs to retain this information (and we imagine that most will) it should map the syntax on to a reified LMNL layer instead.
Last modified 11 Oct 2002 by Jeni Tennison.
A document is in LMNL syntax if it
satisfies the rules described in this document. A document in LMNL syntax maps
to a LMNL document in a
LMNL data model or a
[rl:document] range
in a reified LMNL layer.
Taken as a whole, a document in LMNL
syntax must match the document production.
| [1] |
|
::= |
|
Note that as a consequence of these productions, an empty document
is in fact recognised as a document in LMNL
syntax. Indeed, any text document in UTF-8 or UTF-16 that does not
contain any unescaped [, { or & is
recognised as a document in LMNL syntax.
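As a rough illustration of the last point, the following Python sketch tests whether a piece of text contains none of the three characters that introduce markup. The function name `is_markup_free` is hypothetical and not part of LMNL, and the sketch deliberately ignores the separate requirements on legal characters and normalization.

```python
# Characters that introduce markup in LMNL syntax, per the paragraph above.
MARKUP_STARTERS = ("[", "{", "&")


def is_markup_free(text: str) -> bool:
    """Return True if text contains none of '[', '{' or '&'.

    Such text needs no escaping and is trivially a document in LMNL syntax
    (assuming it also uses only legal characters).
    """
    return not any(ch in text for ch in MARKUP_STARTERS)
```

Text for which this returns False may still be a document in LMNL syntax; it simply has to be parsed to find out.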
LMNL syntax contains intermixed character data and markup.
Markup comes in two flavours: syntactic markup and semantic markup. Syntactic markup is markup that might not be passed through to an application, and consists of the LMNL declaration, layer declarations, namespace declarations, entity declarations, entities declarations, comments, entity references, character references and whitespace before the first character data or semantic markup in the document. Semantic markup must be passed through to an application, and consists of start tags, end tags and empty tags.
Character data is all text in the document that is not markup.
| [2] |
|
::= |
|
The character data will be part
of the content of a
text layer in the
data model. In the
reified LMNL layer, it will
map to characters in the
[characters]
annotation of a [rl:text] range.
The prolog of a document in LMNL syntax contains an optional LMNL declaration followed by any number of layer declarations, namespace declarations, entity declarations, entities declarations and comments, which may be separated with whitespace.
| [3] |
|
::= |
|
| [4] |
|
::= |
|
| [5] |
|
::= |
|
The content of a document in LMNL syntax is any character data interspersed with tags, namespace declarations, layer declarations, comments and character or entity references.
| [6] |
|
::= |
|
Within content, every start tag must have a matching end tag and vice versa. See the following section on tags for more details.
Tags indicate the starts and ends of
ranges and
annotations. There are three kinds
of tags: start tags, which indicate the start of a
range or annotation; end tags, which indicate the end
of a range or annotation; and empty tags which
indicate a range with a length
of 0 or an annotation whose
value has no
characters in its content.
This section describes tags for ranges. Tags for annotations are described later.
| [7] |
|
::= |
|
| [8] |
|
::= |
|
| [9] |
|
::= |
|
| [10] |
|
::= |
|
| [11] |
|
::= |
|
Each StartTag must be paired with an EndTag that follows it within a piece of content. A StartTag and EndTag only match if they have matching tag names, matching tag identifiers and matching tag layers. It is not an error if an EndTag matches more than one StartTag based on its name, identifier and layer (although a parser should issue a warning if this occurs); in this case the EndTag matches the nearest StartTag (the one that occurs latest in the document prior to the EndTag).
An EmptyTag maps to a
[rl:range] range in the
reified LMNL layer; the
[rl:range] range will have within it a number of
[rl:annotation]
ranges, but no [rl:value] range. A
StartTag and EndTag pair similarly maps to a
[rl:range] range in the reified LMNL layer. This
[rl:range] range contains a number of [rl:annotation]
ranges mirroring those Annotations in the StartTag, followed
by a [rl:value] range that covers everything between the
StartTag and EndTag, followed by a number of
[rl:annotation] ranges mirroring those Annotations in the
EndTag.
The tag name indicates the
name of the
range represented by the
EmptyTag or by the StartTag and EndTag pair. The tag
name is a qualified name, which is
resolved to provide the
expanded name for the range.
This expanded name is represented in a
reified LMNL layer as a
[name] annotation
on the [rl:range]
range. The value of
the [name] annotation is the
local part of the expanded name.
If the namespace name of the
expanded name is not an empty string then the [name] annotation
has a [namespace]
annotation whose value is the namespace name. This
[namespace] annotation should have a
[syn:prefix]
annotation whose value is the Prefix of the qualified name.
| [12] |
|
::= |
|
A StartTag and EndTag have matching tag names if the end tag does not have a tag name (which includes the case where neither tag has one), or if the tag names of the two tags resolve to the same expanded name.
Note that since, within a single document, a Prefix cannot be associated with more than one namespace name and a namespace name cannot be associated with more than one prefix, tag names can be matched on a character-by-character basis, without necessarily being resolved.
The identifier of a tag is used to disambiguate situations where two ranges with the same name or two ranges without names overlap.
| [13] |
|
::= |
|
| [14] |
|
::= |
|
A StartTag and EndTag have matching identifiers if neither tag has an identifier or if the two identifiers are equal.
A tag identifier is represented in
the reified LMNL layer
through a [syn:id]
annotation on a [rl:range] range. The
value of the
[syn:id] annotation is the identifier.
Tag identifiers do not have to be unique within a document. Note that they are not carried through into the LMNL data model.
The layer of a tag is used to identify the layer to which the range represented by the EmptyTag or by the StartTag and EndTag pair belongs. The tag layer is identified by matching the LayerName with that of one of the layers declared earlier in the document. If an EmptyTag or a StartTag and EndTag pair does not have a LayerSpec then the range belongs to a default layer. It is an error if the LayerName is not the same as the name of an in-scope layer.
| [15] |
|
::= |
|
If present, the LayerIdentifier maps to the
[owner-layer]
annotation on the [rl:range] range. The
value of the
[owner-layer] annotation contains the LayerName. The
[owner-layer] annotation has a [base] annotation whose value is
equal to the base of the in-scope layer referenced by the LayerName.
If an EmptyTag or a StartTag and EndTag pair does
not have a LayerSpec then the [rl:range] range does not
have an [owner-layer] annotation.
A StartTag and EndTag have matching layers if neither tag has a tag layer or if their two tag layers are the same.
Note that since, within a single document, a Prefix cannot be associated with more than one namespace name and a namespace name cannot be associated with more than one prefix, tag layers can be matched by comparing the LayerNames on a character-by-character basis, without necessarily resolving them.
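The matching rules for tag names, identifiers and layers given above, together with the "nearest StartTag" rule, can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Tag` class and function names are hypothetical, names and layers are compared as plain strings (which the notes above permit), and `open_tags` is assumed to hold the still-unmatched StartTags in document order.

```python
import warnings
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Tag:
    """Hypothetical record of the parts of a tag that take part in matching."""
    name: Optional[str] = None
    identifier: Optional[str] = None
    layer: Optional[str] = None


def tags_match(start: Tag, end: Tag) -> bool:
    # Names match if the end tag has no name, or if the names are the same.
    names = end.name is None or start.name == end.name
    # Identifiers match if neither tag has one, or if they are equal.
    identifiers = start.identifier == end.identifier
    # Layers match if neither tag has one, or if they are the same.
    layers = start.layer == end.layer
    return names and identifiers and layers


def find_start_tag(open_tags: List[Tag], end: Tag) -> int:
    """Return the index of the StartTag that the given EndTag closes."""
    candidates = [i for i, start in enumerate(open_tags) if tags_match(start, end)]
    if not candidates:
        raise ValueError("end tag has no matching start tag")
    if len(candidates) > 1:
        # Not an error, but a parser should issue a warning.
        warnings.warn("end tag matches more than one start tag; using the nearest")
    return candidates[-1]  # the one occurring latest before the end tag
```

Because the search is not restricted to the most recently opened tag, overlapping ranges fall out naturally: an EndTag may close a StartTag that was opened before other, still open, StartTags.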
MetaData appears within tags and comprises namespace declarations, layer declarations, comments and annotations, with intermingled whitespace.
| [16] |
|
::= |
|
Annotations in
LMNL syntax map on to
annotations in the
data model and
[rl:annotation]
ranges in a reified LMNL
layer. The tags that delimit annotations within a
tag are much like those used for ranges
within content, but they are guaranteed to match and therefore do not
need identifiers and can in fact use an abbreviated syntax.
| [17] |
|
::= |
|
| [18] |
|
::= |
|
| [19] |
|
::= |
|
| [20] |
|
::= |
|
| [21] |
|
::= |
|
| [22] |
|
::= |
|
| [23] |
|
::= |
|
The name of
the annotation is the
expanded name you get from
resolving the AnnotationName. As with
ranges, this expanded name is represented in a
reified LMNL layer as a
[name] annotation
on the [rl:annotation]
range. The value of
the [name] annotation is the
local part of the expanded name.
If the namespace name of the
expanded name is not an empty string then the [name] annotation
has a [namespace]
annotation whose value is the namespace name. This
[namespace] annotation should have a
[syn:prefix]
annotation whose value is the Prefix of the qualified name.
Syntactic markup affects the
way in which semantic markup is interpreted,
and thus the LMNL data model that is created
from a document in LMNL syntax, but the
constructs that it declares are not part of the LMNL data model and thus
usually the information that it conveys (which prefix
is associated with which namespace
name, for example) is not retained when a document in LMNL syntax is
parsed. To make this markup easy to identify, most syntactic markup starts with
the characters [! and ends with the character ]; the
exceptions are entity and character references, which start with the character
& and end with the character ;, and leading and
trailing whitespace.
The reified LMNL layer does retain certain syntactic information, namely those parts that involve natural language comprehension and cannot therefore be easily reconstructed: tag identifiers, namespace prefixes, layer names, entity references and comments. It's perfectly acceptable for an application such as an editor to retain more syntactic markup, but most LMNL applications are expected to operate over the LMNL data model or a reified LMNL layer.
Documents in LMNL syntax should begin with a LMNL declaration which specifies the version of LMNL syntax being used and the encoding of the document.
| [24] |
|
::= |
|
The VersionNum for this version of the
LMNL syntax is 0.2. A parser must
raise an error if it encounters a document using a version of the LMNL syntax
that it does not support.
The LMNL version will remain less than 1.0 during the drafting phase.
| [25] |
|
::= |
|
| [26] |
|
::= |
|
The EncodingDecl specifies the encoding used for a document in LMNL syntax. All LMNL parsers must be able to read
documents in both the UTF-8 and UTF-16 encodings.
Documents encoded in UTF-16 must begin with the Byte Order Mark in
the same way as described for XML entities in the
XML Recommendation,
and the encoding declaration specifies the encoding used for the document as
defined for XML.
| [27] |
|
::= |
|
| [28] |
|
::= |
|
A layer declaration declares
a layer and its
base. A layer declaration
associates a LayerName with a layer. These names are used to resolve
references to the layer, in order to associate
ranges with the layer and to refer to
it as a base of another layer. Layer
names are not carried through into the data
model, but they are used within the
[owner-layer]
annotation of [rl:range] ranges in the
reified LMNL layer, as
described earlier.
| [29] |
|
::= |
|
| [30] |
|
::= |
|
| [31] |
|
::= |
|
| [32] |
|
::= |
|
Layer declarations can occur anywhere within the prolog or content of a document and within start tags, end tags and empty tags. The “scope” of a layer declaration is the entire document following the layer declaration. It is not possible to “undeclare” a layer.
As with namespace declarations, which are also allowed anywhere in the document, the ability of annotations to declare layers locally facilitates the serialisation of LMNL documents by streaming processors.
It is an error if a layer declaration specifies a name that has already been used in another layer declaration. Once a layer has been declared with a particular base, the base cannot be changed.
There are two predefined layers: a predefined text layer that contains all the character data in the scope, and a predefined default layer that contains all the ranges that are represented by tags that do not have LayerIdentifiers.
The predefined layers make it possible to write documents in LMNL syntax without declaring or referring to particular layers. Such documents represent data models that are part of the flat subset of LMNL.
A LayerName specified within the baseSpec
indicates the base for the
layer. It is an error if this expanded
name is not equal to the
name of an in-scope layer. If the baseSpec contains the special value
"#text" then the base is the predefined text layer. If the
baseSpec contains the special value "#default" then the
base is the predefined default
layer.
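The constraints on layer declarations described above (no name may be declared twice, and a base must be either an in-scope layer or one of the two predefined layers) can be sketched in Python as follows. The `LayerScope` class is hypothetical, and the sketch assumes that every declaration supplies both a name and a base.

```python
from typing import Dict

# The two predefined layers named by the special baseSpec values.
PREDEFINED_LAYERS = ("#text", "#default")


class LayerScope:
    """Tracks the layers declared so far in a document (hypothetical helper)."""

    def __init__(self) -> None:
        self._bases: Dict[str, str] = {}  # layer name -> name of its base

    def declare(self, name: str, base: str) -> None:
        # A name that has already been declared cannot be declared again,
        # so the base of a layer can never change.
        if name in self._bases:
            raise ValueError(f"layer {name!r} has already been declared")
        # The base must be a predefined layer or a layer declared earlier.
        if base not in PREDEFINED_LAYERS and base not in self._bases:
            raise ValueError(f"base {base!r} is not an in-scope layer")
        self._bases[name] = base

    def base_of(self, name: str) -> str:
        # Used when a tag refers to a layer: the layer must be in scope.
        if name not in self._bases:
            raise ValueError(f"{name!r} is not an in-scope layer")
        return self._bases[name]
```

Since layers cannot be undeclared, a single growing table is enough; no stack of nested scopes is needed.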
| [33] |
|
::= |
|
| [34] |
|
::= |
|
| [35] |
|
::= |
|
A namespace declaration associates a Prefix with a namespace name, which is a URI that identifies a markup language. Namespace declarations are used when resolving a qualified name to give an expanded name. The expanded name might be an entity name, a range identifier, a range name or an annotation name.
| [36] |
|
::= |
|
Namespace declarations can occur anywhere within the prolog or content of a document and within start tags, end tags and empty tags. The “scope” of a namespace declaration is the entire document following the namespace declaration. It is not possible to “undeclare” a namespace.
It is an error if a namespace declaration associates a namespace name with a Prefix that has already been associated with a different namespace name. Once a namespace has been declared with a particular prefix, that prefix cannot be associated with a different namespace name. It is also an error for more than one prefix to be bound to the same namespace name.
The Prefix associated with a
namespace is represented in
the reified LMNL layer with
[syn:prefix]
annotations on [namespace]
annotations on [name] and
[syn:id] annotations.
Comments may appear almost anywhere in a document in LMNL syntax, except inside other syntactic markup.
| [37] |
|
::= |
|
Note that comments in
LMNL syntax are not subject to the
XML restriction that forbids
"--" (double hyphen) within comments.
Comments are not carried through into
the data model, but they are represented by
[syn:comment] ranges
in the reified LMNL layer.
In LMNL syntax, an entity is a pair of an entity name and an entity value. An entity name is a Name. An entity value is some replacement text that should be used to replace an entity reference that references the entity's name.
An entity is declared with an entity declaration, which may be held in an external file referred to by an entities declaration. An entity's value is included in a document via an entity reference.
LMNL entities are similar to entities in XML, but more tightly constrained. Entities in LMNL are the equivalent of internal entities in XML, with the additional constraint that they can only include character data, character references and entity references, not other markup.
An entity declaration declares an entity by associating an entity name with an entity value.
| [38] |
|
::= |
|
| [39] |
|
::= |
|
The entity name is the EntityName. The entity value is the EntityValue (without quotes), with all character references replaced by the characters that they represent.
Entity declarations can occur anywhere within the prolog or content of a document and within start tags, end tags and empty tags. The “scope” of an entity declaration is the entire document following the entity declaration. It is not possible to “undeclare” an entity. It is an error for an entity declaration to declare an entity with the same name as an existing entity, unless it also has the same value.
An entities declaration declares a number of entities at once by referring to an external document in LMNL syntax.
| [40] |
|
::= |
|
The entities declaration imports all the entities declared by the entity declarations in the referenced document, recursively (such that entities declarations in the referenced document themselves cause the import of more entity declarations).
Note that this is not a textual include. In particular, the qualified names used in the entity declarations in the referenced document are resolved in the context of the namespace declarations within that file. In addition, the external document referenced by an entities declaration can have its own content, for example to describe the entities defined in the file, but this content will be ignored.
It is not an error if the document referenced by the entities declaration is unreachable. It is an error if the referenced document is not a document in LMNL syntax (which entails that the entire document must be parsed even though the entity declarations can only appear in the prolog).
An entity reference refers
to an entity. When an entity reference is
encountered, the value of the referenced entity
is retrieved and processed, in place of the entity reference itself, as though
it were part of the document at the location the entity reference was
recognised. The replacement text may
include both character data and entity
references, which are themselves recognised and replaced. The replacement text
must not contain any other markup; it is an error if either [ or
{ is encountered in the replacement text of an entity.
| [41] |
|
::= |
|
The entity reference is replaced by the value of the entity named in the EntityReference. It is an error if there is no such entity.
Note that it is possible for an entity declaration and an entity reference to use different qualified names, as long as those qualified names resolve to the same expanded name.
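The declaration and replacement rules for entities described above can be sketched in Python as follows. Several things here are assumptions made for illustration only: the function names are hypothetical, entity names are treated as simple unprefixed names, the regular expression is a rough approximation of an entity reference (`&`, a name, `;`), stored values are assumed to have had their character references replaced already, and the check for self-referencing entities is an addition that this document does not itself require.

```python
import re
from typing import Dict, Tuple

# Rough approximation of an entity reference. Character references such as
# "&#91;" do not match, because '#' cannot start the name.
ENTITY_REF = re.compile(r"&([A-Za-z_][A-Za-z0-9_.-]*);")


def declare_entity(entities: Dict[str, str], name: str, value: str) -> None:
    """Record an entity; redeclaring is only allowed with the same value."""
    existing = entities.get(name)
    if existing is not None and existing != value:
        raise ValueError(f"entity {name!r} is already declared with another value")
    entities[name] = value


def expand(text: str, entities: Dict[str, str], active: Tuple[str, ...] = ()) -> str:
    """Replace every entity reference in text, including nested ones."""

    def replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in entities:
            raise ValueError(f"no such entity: {name!r}")
        if name in active:
            # Assumption: a self-referencing entity is reported, not looped on.
            raise ValueError(f"entity {name!r} refers to itself")
        value = entities[name]
        if "[" in value or "{" in value:
            raise ValueError(f"entity {name!r} contains markup")
        return expand(value, entities, active + (name,))

    return ENTITY_REF.sub(replace, text)
```

For example, after declaring `co` as `LMNL.org` and `sig` as `by &co;`, expanding `Written &sig;!` yields `Written by LMNL.org!`.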
An entity reference is not
represented in the data model, but it is
represented in the reified LMNL
layer, as a [syn:entity] range. The
[syn:entity] range has a [name] annotation that
represents the entity name. The
value of the
[name] annotation is the entity
name. The [syn:entity] range
encloses
[rl:text] and
[syn:entity] ranges representing the entity value.
Entity and
character references can both be used to
escape the left square bracket, left curly bracket, ampersand, and other
delimiters. A set of predefined entities
(amp, lsqb, rsqb, lcub,
rcub, apos and quot) is specified for
this purpose. Numeric character references may also be used; they are expanded
immediately when recognized and must be treated as character data, so the
numeric character references "&#91;",
"&#123;" and "&#38;" may be used to escape
[, { and & when they occur in
character data.
All LMNL syntax parsers must
recognize these predefined entities
whether they are declared or not. If the
entities lsqb, lcub or amp are declared,
they must be declared as entities whose
replacement text is a
character reference to the respective
character (left square bracket,
left curly bracket or ampersand) being escaped; the double escaping is required
for these entities so that references to them do not generate an error. If the
entities rsqb, rcub, apos or
quot are declared, they must be declared as entities whose
replacement text is the single character being escaped (or a character
reference to that character; the double escaping here is unnecessary but
harmless). For example:
[!entity lsqb="&#38;#91;"]
[!entity rsqb="]"]
[!entity lcub="&#38;#123;"]
[!entity rcub="}"]
[!entity amp="&#38;#38;"]
[!entity apos="'"]
[!entity quot="&#34;"]
Only the lsqb, lcub and
amp entities are strictly necessary; the
others are provided for balance and to enable entity
values to easily include apostrophes and double quotes whichever
delimiter they use.
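A serialiser can use the three strictly necessary predefined entities to escape character data. The following Python sketch shows one way to do so; the function name `escape_character_data` is hypothetical, and numeric character references would serve equally well.

```python
# The three characters that must be escaped in character data, mapped to
# references to the corresponding predefined entities.
ESCAPES = {
    "&": "&amp;",
    "[": "&lsqb;",
    "{": "&lcub;",
}


def escape_character_data(text: str) -> str:
    """Escape '&', '[' and '{' so that text is read back as character data."""
    # Working one character at a time means the '&' introduced by an escape
    # is never itself escaped again.
    return "".join(ESCAPES.get(ch, ch) for ch in text)
```

The closing delimiters `]` and `}` are left alone, in line with the observation above that only the lsqb, lcub and amp entities are strictly necessary.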
A character reference is a way of referring to a character by its code point rather than including it as a native character, perhaps because the encoding used for the document does not support the character, or simply because it is easier to insert than the native character. Character references follow the same syntax as character references in XML.
| [42] |
|
::= |
|
If the character reference
begins with "&#x", the digits and letters up to the
terminating ; provide a hexadecimal representation of the
character's code point in ISO/IEC 10646. If it begins just with
"&#", the digits up to the terminating ; provide
a decimal representation of the character's code point. Character references
must represent legal LMNL
characters.
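The two forms of character reference can be decoded as in the following Python sketch. The function name `decode_character_references` is hypothetical, and the sketch does not check that the resulting character is a legal LMNL character, which a conforming parser would have to do.

```python
import re

# "&#x" followed by hexadecimal digits, or "&#" followed by decimal digits,
# terminated by ";" (the same shape as character references in XML).
CHAR_REF = re.compile(r"&#(x[0-9a-fA-F]+|[0-9]+);")


def decode_character_references(text: str) -> str:
    """Replace each character reference with the character it denotes."""

    def replace(match: "re.Match[str]") -> str:
        body = match.group(1)
        if body.startswith("x"):
            code_point = int(body[1:], 16)  # hexadecimal form
        else:
            code_point = int(body)  # decimal form
        # chr() raises ValueError for values beyond the Unicode range.
        return chr(code_point)

    return CHAR_REF.sub(replace, text)
```

Anything that does not have exactly this shape, including entity references such as `&amp;`, is left untouched for later processing.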
The character
represented by a character reference
will be part of the content of
a text layer in the
data model. In the
reified LMNL layer, it will
map to one of the characters in the
[characters]
annotation of a [rl:text] range.
This section describes some of the common syntactic constructs that are used in the LMNL syntax. Most of them should be familiar from the XML Recommendation and Namespaces in XML.
The legal characters in LMNL syntax are the same as the legal characters in the LMNL data model — any Unicode character, excluding control characters and the surrogate blocks. As in the LMNL data model, documents in LMNL syntax must be Unicode normalized per Unicode Normalization Form NFC for reasons described in Character Model for the World Wide Web.
| [43] |
|
::= |
|
Whitespace is considered to be any sequence of space, tab or line-feed characters.
| [44] |
|
::= |
|
As in XML, line endings must be normalized by a LMNL parser as if,
before parsing, it translated certain line-ending sequences into a single
#xA character. These line-ending sequences are:
#xD #x85
#xD #xA
#x85
#xD (not followed by #xA or #x85)
#x2028
LMNL syntax represents expanded names using qualified names. Qualified names consist of an optional Prefix followed by a LocalPart, both of which are names.
Note that no Names in LMNL syntax can contain colons.
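The line-ending normalization described earlier in this section can be sketched in Python as follows. The function name `normalize_line_endings` is hypothetical, and the sketch assumes the document has already been decoded from UTF-8 or UTF-16 into a string.

```python
import re

# The two-character sequences are listed first so that they are preferred
# over a lone #xD; the lone #xD alternative therefore only applies when it
# is not followed by #xA or #x85.
LINE_ENDING = re.compile("\r\n|\r\x85|\x85|\u2028|\r")


def normalize_line_endings(text: str) -> str:
    """Translate each recognised line-ending sequence into a single #xA."""
    return LINE_ENDING.sub("\n", text)
```

Applying the function a second time changes nothing, since its output contains no #xD, #x85 or #x2028 characters.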
| [45] |
|
::= |
|
| [46] |
|
::= |
|
| [47] |
|
::= |
|
| [48] |
|
::= |
|
| [49] |
|
::= |
|
| [50] |
|
::= |
|
A qualified name is resolved to an expanded name by identifying the namespace declaration that appears earlier in the document that associates the Prefix of the qualified name with a namespace name. It is an error if there is no namespace declaration for the specified Prefix appearing before the qualified name in the document. If the qualified name doesn't have a Prefix then the namespace name of the expanded name is the empty string. The expanded name represented by the qualified name is the pair of this namespace name and the LocalPart of the qualified name.
Namespace declarations are scoped such that it doesn't matter what context a namespace declaration appears in (within a start tag or in the content of an annotation, for example): it will always be in scope from that point on through the document. Namespace declarations cut across the structures defined by the semantic markup in the document.
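The resolution of qualified names, together with the constraints on namespace declarations given earlier, can be sketched in Python as follows. The `NamespaceScope` class and the namespace name `http://example.org/ns` are hypothetical. To avoid assuming anything about how a Prefix and LocalPart are written together, the sketch takes them as two separate arguments.

```python
from typing import Dict, Optional, Tuple


class NamespaceScope:
    """Tracks namespace declarations seen so far (hypothetical helper)."""

    def __init__(self) -> None:
        self._name_by_prefix: Dict[str, str] = {}
        self._prefix_by_name: Dict[str, str] = {}

    def declare(self, prefix: str, namespace_name: str) -> None:
        bound = self._name_by_prefix.get(prefix)
        if bound is not None and bound != namespace_name:
            raise ValueError(f"prefix {prefix!r} is already bound to {bound!r}")
        other = self._prefix_by_name.get(namespace_name)
        if other is not None and other != prefix:
            raise ValueError(f"{namespace_name!r} is already bound to {other!r}")
        self._name_by_prefix[prefix] = namespace_name
        self._prefix_by_name[namespace_name] = prefix

    def resolve(self, prefix: Optional[str], local_part: str) -> Tuple[str, str]:
        """Return the expanded name as a (namespace name, local part) pair."""
        if prefix is None:
            return ("", local_part)  # no Prefix: the namespace name is empty
        if prefix not in self._name_by_prefix:
            raise ValueError(f"no namespace declaration for prefix {prefix!r}")
        return (self._name_by_prefix[prefix], local_part)
```

Because declarations stay in scope for the rest of the document and cannot be undone, a parser can keep one such table and call `resolve` at the point where each qualified name is read.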
Literal data is any quoted string not containing the quotation mark used as a delimiter for that string. Literals are used for specifying the values of entities (EntityValue) and identifiers for namespace names and locations of entity declarations (URILiteral).
| [51] |
|
::= |
|
| [52] |
|
::= |
|
| [53] |
|
::= |
|
Note that URILiterals can be parsed without scanning for markup. Their values (between the quotes) should be valid URIs as defined in RFC 2396.
Following is an example of a document in LMNL syntax for the example provided in the data model specification:
[!lmnl version="0.1" encoding="ISO-8859-1"]
[!layer name="types" base="#default"]
[date~types [day-of-week}Friday{]
}[year}2002{year
]-[month [name}August{name [abbreviation}Aug{abbreviation]]}08{month
]-[day}23{day]{date~types]
Thanks to the following for their comments on this document: John Cowan, Gavin Nicol, Wendell Piez.
© 2002 by the authors and LMNL.org. All rights reserved.