LMNL Syntax

Introduction

This document describes the LMNL syntax, one syntax that can be used to represent LMNL data models. It isn't the only syntax that could be used with LMNL data models, but it is fairly simple and it can represent every possible LMNL data model, which gives it an advantage over XML syntax, for example.

Within this document, each piece of syntax is described in terms of how it maps on to both the LMNL data model and to reified LMNL. If a parser interprets a document in LMNL syntax as representing a LMNL data model, it will lose some potentially useful information, such as which of two ranges that are clones is within the other, or whether they simply overlap. If a parser needs to retain this information (and we imagine that most will) it should map the syntax on to a reified LMNL layer instead.

Last modified 11 Oct 2002 by Jeni Tennison.

Documents

A document is in LMNL syntax if it satisfies the rules described in this document. A document in LMNL syntax maps to a LMNL document in a LMNL data model or a [rl:document] range in a reified LMNL layer.

Taken as a whole, a document in LMNL syntax must match the document production.

[1] document ::= prolog content

Note that as a consequence of these productions, an empty document is in fact recognised as a document in LMNL syntax. Indeed, any text document in UTF-8 or UTF-16 that does not contain any unescaped [, { or & is recognised as a document in LMNL syntax.

LMNL syntax contains intermixed character data and markup.

Markup comes in two flavours: syntactic markup and semantic markup. Syntactic markup is markup that might not be passed through to an application, and consists of the LMNL declaration, layer declarations, namespace declarations, entity declarations, entities declarations, comments, entity references, character references and whitespace before the first character data or semantic markup in the document. Semantic markup must be passed through to an application, and consists of start tags, end tags and empty tags.

Character data is all text in the document that is not markup.

[2] CharData ::= [^[{&]*

The character data will be part of the content of a text layer in the data model. In the reified LMNL layer, it will map to characters in the [characters] annotation of a [rl:text] range.

Prolog

The prolog of a document in LMNL syntax contains an optional LMNL declaration followed by any number of layer declarations, namespace declarations, entity declarations, entities declarations and comments, which may be separated with whitespace.

[3] prolog ::= LMNLDeclaration? Misc*
[4] Misc ::= ScopedDecls | Comment | S
[5] ScopedDecls ::= LayerDeclaration | NSDeclaration | EntityDeclaration | EntitiesDeclaration

Content

The content of a document in LMNL syntax is any character data interspersed with tags, scoped declarations (layer, namespace, entity and entities declarations), comments and character or entity references.

[6] content ::= CharData? ((Tag | ScopedDecls | Reference | Comment) CharData?)*

Within content, every start tag must have a matching end tag and vice versa. See the following section on tags for more details.

Tags

Tags indicate the starts and ends of ranges and annotations. There are three kinds of tags: start tags, which indicate the start of a range or annotation; end tags, which indicate the end of a range or annotation; and empty tags which indicate a range with a length of 0 or an annotation whose value has no characters in its content.

This section describes tags for ranges. Tags for annotations are described later.

[7] Tag ::= StartTag | EndTag | EmptyTag
[8] StartTag ::= '[' TagContent '}'
[9] EndTag ::= '{' TagContent ']'
[10] EmptyTag ::= '[' TagContent ']'
[11] TagContent ::= (TagName S?)? (IdentitySpec S?)? LayerIdentifier? (S MetaData)? S?

Each StartTag must be paired with an EndTag that follows it within a piece of content. A StartTag and EndTag only match if they have matching tag names, matching tag identifiers and matching tag layers. It is not an error if an EndTag matches more than one StartTag based on its name, identifier and layer (although a parser should issue a warning if this occurs); in this case the EndTag matches the nearest StartTag (the one that occurs latest in the document prior to the EndTag).
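The nearest-match rule can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: tags are modelled as hypothetical (name, identifier, layer) tuples, and two tags are treated as matching when those tuples are equal, which is simpler than the full matching rules given in the following sections. Because ranges may overlap, the open start tags are kept in a list that is searched backwards rather than in a strict stack.

```python
from typing import List, Optional, Tuple

# A tag key is (name, identifier, layer). This shape is an assumption made
# for the sketch; the syntax itself does not prescribe it.
TagKey = Tuple[Optional[str], Optional[str], Optional[str]]


def pair_tags(events: List[Tuple[str, TagKey]]) -> List[Tuple[int, int]]:
    """Pair each end tag with the nearest preceding unmatched start tag.

    `events` holds ("start" | "end", key) entries in document order.
    Returns (start_index, end_index) pairs in the order the end tags occur.
    """
    open_starts: List[Tuple[int, TagKey]] = []
    pairs: List[Tuple[int, int]] = []
    for index, (kind, key) in enumerate(events):
        if kind == "start":
            open_starts.append((index, key))
            continue
        # Search backwards so that the latest matching start tag wins.
        for pos in range(len(open_starts) - 1, -1, -1):
            start_index, start_key = open_starts[pos]
            if start_key == key:
                pairs.append((start_index, index))
                del open_starts[pos]
                break
        else:
            raise ValueError(f"end tag at {index} has no matching start tag")
    if open_starts:
        raise ValueError("start tag without a matching end tag")
    return pairs
```

With two overlapping ranges a and b (start a, start b, end a, end b) the sketch pairs each end tag with its own start tag even though the ranges do not nest.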

An EmptyTag maps to a [rl:range] range in the reified LMNL layer; the [rl:range] range will have within it a number of [rl:annotation] ranges, but no [rl:value] range. A StartTag and EndTag pair similarly maps to a [rl:range] range in the reified LMNL layer. This [rl:range] range contains a number of [rl:annotation] ranges mirroring those Annotations in the StartTag, followed by a [rl:value] range that covers everything between the StartTag and EndTag, followed by a number of [rl:annotation] ranges mirroring those Annotations in the EndTag.

Tag Names

The tag name indicates the name of the range represented by the EmptyTag or by the StartTag and EndTag pair. The tag name is a qualified name, which is resolved to provide the expanded name for the range. This expanded name is represented in a reified LMNL layer as a [name] annotation on the [rl:range] range. The value of the [name] annotation is the local part of the expanded name. If the namespace name of the expanded name is not an empty string then the [name] annotation has a [namespace] annotation whose value is the namespace name. This [namespace] annotation should have a [syn:prefix] annotation whose value is the Prefix of the qualified name.

[12] TagName ::= QName

A StartTag and EndTag have matching tag names if the end tag does not have a tag name or if neither tag has a tag name, or if the tag names of the two tags resolve to the same expanded name.

Note that since, within a single document, a Prefix cannot be associated with more than one namespace name and a namespace name cannot be associated with more than one prefix, tag names can be matched on a character-by-character basis, without necessarily being resolved.

Tag Identifiers

The identifier of a tag is used to disambiguate situations where two ranges with the same name or two ranges without names overlap.

[13] IdentitySpec ::= '=' S? Identifier
[14] Identifier ::= Name

A StartTag and EndTag have matching identifiers if neither tag has an identifier or if the two identifiers are equal.

A tag identifier is represented in the reified LMNL layer through a [syn:id] annotation on a [rl:range] range. The value of the [syn:id] annotation is the identifier.

Tag identifiers do not have to be unique within a document. Note that they are not carried through into the LMNL data model.

Tag Layers

The layer of a tag is used to identify the layer to which the range represented by the EmptyTag or by the StartTag and EndTag pair belongs. The tag layer is identified by matching the LayerName with that of one of the layers declared earlier in the document. If an EmptyTag or a StartTag and EndTag pair does not have a LayerIdentifier then the range belongs to a default layer. It is an error if the LayerName is not the same as the name of an in-scope layer.

[15] LayerIdentifier ::= S? '~' S? LayerName

If present, the LayerIdentifier maps to the [owner-layer] annotation on the [rl:range] range. The value of the [owner-layer] annotation contains the LayerName. The [owner-layer] annotation has a [base] annotation whose value is equal to the base of the in-scope layer referenced by the LayerName. If an EmptyTag or a StartTag and EndTag pair does not have a LayerIdentifier then the [rl:range] range does not have an [owner-layer] annotation.

A StartTag and EndTag have matching layers if neither tag has a tag layer or if their two tag layers are the same.
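The three matching rules (names, identifiers and layers) can be combined into one predicate. The following Python sketch is illustrative only: the Tag class and its field names are assumptions, and tag names are assumed to have been resolved to comparable values already.

```python
from typing import NamedTuple, Optional


class Tag(NamedTuple):
    """Hypothetical record of the parts of a tag that take part in matching."""
    name: Optional[str] = None        # tag name, assumed already resolved
    identifier: Optional[str] = None  # tag identifier, if any
    layer: Optional[str] = None       # LayerName, if any


def tags_match(start: Tag, end: Tag) -> bool:
    # Names match if the end tag has no name or both names are the same.
    names_ok = end.name is None or start.name == end.name
    # Identifiers match if neither tag has one or both are equal.
    ids_ok = start.identifier == end.identifier
    # Layers match if neither tag has one or both are the same.
    layers_ok = start.layer == end.layer
    return names_ok and ids_ok and layers_ok
```

For instance, tags_match(Tag("p"), Tag()) is true because the end tag omits its name, whereas tags_match(Tag("p", "a"), Tag("p")) is false because only one of the two tags carries an identifier.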

Note that since, within a single document, a Prefix cannot be associated with more than one namespace name and a namespace name cannot be associated with more than one prefix, tag layers can be matched by comparing the LayerNames on a character-by-character basis, without necessarily resolving them.

Metadata

MetaData appears within tags and comprises scoped declarations (layer, namespace, entity and entities declarations), comments and annotations, with intermingled whitespace.

[16] MetaData ::= (ScopedDecls | Comment | Annotation | S)*

Annotations

Annotations in LMNL syntax map on to annotations in the data model and [rl:annotation] ranges in a reified LMNL layer. The tags that delimit annotations within a tag are much like those used for ranges within content, but they are guaranteed to match and therefore do not need identifiers and can in fact use an abbreviated syntax.

[17] Annotation ::= (AnnotationStartTag content AnnotationEndTag) | EmptyAnnotationTag
[18] AnnotationStartTag ::= '[' AnnotationName (S MetaData)? S? '}'
[19] AnnotationEndTag ::= AbbreviatedAnnotationEndTag | FullAnnotationEndTag
[20] AbbreviatedAnnotationEndTag ::= '{]'
[21] FullAnnotationEndTag ::= '{' AnnotationName (S MetaData)? S? ']'
[22] EmptyAnnotationTag ::= '[' AnnotationName (S MetaData)? S? ']'
[23] AnnotationName ::= QName

The name of the annotation is the expanded name you get from resolving the AnnotationName. As with ranges, this expanded name is represented in a reified LMNL layer as a [name] annotation on the [rl:annotation] range. The value of the [name] annotation is the local part of the expanded name. If the namespace name of the expanded name is not an empty string then the [name] annotation has a [namespace] annotation whose value is the namespace name. This [namespace] annotation should have a [syn:prefix] annotation whose value is the Prefix of the qualified name.

Syntactic Markup

Syntactic markup affects the way in which semantic markup is interpreted, and thus the LMNL data model that is created from a document in LMNL syntax. However, the constructs that it declares are not part of the LMNL data model, so the information that it conveys (which prefix is associated with which namespace name, for example) is usually not retained when a document in LMNL syntax is parsed. To make this markup easy to identify, most syntactic markup starts with the characters [! and ends with the character ]; the exceptions are entity and character references, which start with the character & and end with the character ;, and leading and trailing whitespace.

The reified LMNL layer does retain certain syntactic information, namely those parts that involve natural language comprehension and cannot therefore be easily reconstructed: tag identifiers, namespace prefixes, layer names, entity references and comments. It's perfectly acceptable for an application such as an editor to retain more syntactic markup, but most LMNL applications are expected to operate over the LMNL data model or a reified LMNL layer.

LMNL Declaration

Documents in LMNL syntax should begin with a LMNL declaration which specifies the version of LMNL syntax being used and the encoding of the document.

[24] LMNLDeclaration ::= '[!lmnl' VersionInfo EncodingDecl? S? ']'

The VersionNum for this version of the LMNL syntax is 0.2. A parser must raise an error if it encounters a document using a version of the LMNL syntax that it does not support.

The LMNL version will remain less than 1.0 during the drafting phase.

[25] VersionInfo ::= S 'version' Eq (("'" VersionNum "'") | ('"' VersionNum '"'))
[26] VersionNum ::= ([a-zA-Z0-9_.:] | '-')+

The EncodingDecl specifies the encoding used for a document in LMNL syntax. All LMNL parsers must be able to read documents in both the UTF-8 and UTF-16 encodings. Documents encoded in UTF-16 must begin with the Byte Order Mark in the same way as described for XML entities in the XML Recommendation, and the encoding declaration specifies the encoding used for the document as defined for XML.

[27] EncodingDecl ::= S 'encoding' Eq (("'" EncName "'") | ('"' EncName '"'))
[28] EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')*
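As an illustration of productions [24] to [28], the following Python sketch recognises a LMNL declaration with a regular expression assembled from the productions. The function name and the supported parameter are hypothetical; the sketch only extracts the version and encoding and does not attempt to decode the document.

```python
import re

_S = r"[ \t\n]+"                         # production [44]
_EQ = r"[ \t\n]*=[ \t\n]*"               # production [51]
_VERSION_NUM = r"[a-zA-Z0-9_.:-]+"       # production [26]
_ENC_NAME = r"[A-Za-z][A-Za-z0-9._-]*"   # production [28]


def _quoted(group, body):
    # The same body in single or double quotes, captured as
    # <group>_s or <group>_d depending on the delimiter used.
    return ("(?:'(?P<%s_s>%s)'" % (group, body)) + ('|"(?P<%s_d>%s)")' % (group, body))


_LMNL_DECL = re.compile(
    r"\[!lmnl"
    + _S + "version" + _EQ + _quoted("version", _VERSION_NUM)
    + "(?:" + _S + "encoding" + _EQ + _quoted("encoding", _ENC_NAME) + ")?"
    + r"[ \t\n]*\]"
)


def parse_lmnl_declaration(text, supported=("0.2",)):
    """Return (version, encoding) or None if text has no leading declaration."""
    match = _LMNL_DECL.match(text)
    if match is None:
        return None
    version = match.group("version_s") or match.group("version_d")
    encoding = match.group("encoding_s") or match.group("encoding_d")
    if version not in supported:
        # A parser must raise an error for versions it does not support.
        raise ValueError(f"unsupported LMNL syntax version {version!r}")
    return version, encoding
```

Returning None when no declaration is present reflects the fact that the declaration is optional in the prolog production.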

Layer Declarations

A layer declaration declares a layer and its base. A layer declaration associates a LayerName with a layer. These names are used to resolve references to the layer, in order to associate ranges with the layer and to refer to it as a base of another layer. Layer names are not carried through into the data model, but they are used within the [owner-layer] annotation of [rl:range] ranges in the reified LMNL layer, as described earlier.

[29] LayerDeclaration ::= '[!layer' LayerNameSpec baseSpec S? ']'
[30] LayerNameSpec ::= S 'name' Eq LayerNameLiteral
[31] LayerNameLiteral ::= (("'" LayerName "'") | ('"' LayerName '"'))
[32] LayerName ::= Name

Layer declarations can occur anywhere within the prolog or content of a document and within start tags, end tags and empty tags. The “scope” of a layer declaration is the entire document following the layer declaration. It is not possible to “undeclare” a layer.

As with namespace declarations, which are allowed anywhere in the document, the ability for annotations to declare layers locally facilitates the serialisation of LMNL documents by streaming processors.

It is an error if a layer declaration specifies a name that has already been used in another layer declaration. Once a layer has been declared with a particular base, the base cannot be changed.

There are two predefined layers: a predefined text layer that contains all the character data in the scope, and a predefined default layer that contains all the ranges that are represented by tags that do not have LayerIdentifiers.

The predefined layers make it possible to write documents in LMNL syntax without declaring or referring to particular layers. Such documents represent data models that are part of the flat subset of LMNL.

A LayerName specified within the baseSpec indicates the base for the layer. It is an error if this expanded name is not equal to the name of an in-scope layer. If the baseSpec contains the special value "#text" then the base is the predefined text layer. If the baseSpec contains the special value "#default" then the base is the predefined default layer.

[33] baseSpec ::= S 'base' Eq (TextLayer | DefaultLayer | LayerNameLiteral)
[34] TextLayer ::= "'#text'" | '"#text"'
[35] DefaultLayer ::= "'#default'" | '"#default"'
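The constraints on layer declarations (no redeclared names, and a base that is either predefined or already in scope) could be enforced with a small registry. The following Python sketch is illustrative; the class and method names are hypothetical.

```python
class LayerRegistry:
    """Tracks declared layers and their bases in document order."""

    # Special base values that refer to the two predefined layers.
    PREDEFINED = ("#text", "#default")

    def __init__(self):
        self._bases = {}

    def declare(self, name, base):
        # A name that has already been used cannot be declared again,
        # which also means that a layer's base can never change.
        if name in self._bases:
            raise ValueError(f"layer {name!r} is already declared")
        # The base must be a predefined layer or a layer declared earlier.
        if base not in self.PREDEFINED and base not in self._bases:
            raise ValueError(f"base {base!r} is not an in-scope layer")
        self._bases[name] = base

    def base_of(self, name):
        if name not in self._bases:
            raise KeyError(f"layer {name!r} is not in scope")
        return self._bases[name]
```

Because a layer is only recorded after its base has been checked, a declaration cannot name itself or a later layer as its base.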

Namespace Declarations

A namespace declaration associates a Prefix with a namespace name, which is a URI that identifies a markup language. Namespace declarations are used when resolving a qualified name to give an expanded name. The expanded name might be an entity name, a range identifier, a range name or an annotation name.

[36] NSDeclaration ::= '[!ns' S Prefix Eq URILiteral S? ']'

Namespace declarations can occur anywhere within the prolog or content of a document and within start tags, end tags and empty tags. The “scope” of a namespace declaration is the entire document following the namespace declaration. It is not possible to “undeclare” a namespace.

It is an error if a namespace declaration associates a namespace name with a Prefix that has already been associated with a different namespace name. Once a namespace has been declared with a particular prefix, that prefix cannot be associated with a different namespace name. It is also an error for more than one prefix to be bound to the same namespace name.
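These rules make the association between prefixes and namespace names one-to-one within a document. A minimal Python sketch of a table that enforces this follows; the class and method names are hypothetical, and repeating an identical declaration is assumed to be harmless because the text only rules out conflicting ones.

```python
class NamespaceBindings:
    """One-to-one table of prefixes and namespace names."""

    def __init__(self):
        self._uri_by_prefix = {}
        self._prefix_by_uri = {}

    def declare(self, prefix, uri):
        bound_uri = self._uri_by_prefix.get(prefix)
        if bound_uri is not None and bound_uri != uri:
            raise ValueError(f"prefix {prefix!r} is already bound to {bound_uri!r}")
        bound_prefix = self._prefix_by_uri.get(uri)
        if bound_prefix is not None and bound_prefix != prefix:
            raise ValueError(f"namespace {uri!r} is already bound to {bound_prefix!r}")
        self._uri_by_prefix[prefix] = uri
        self._prefix_by_uri[uri] = prefix

    def lookup(self, prefix):
        # Raises KeyError if the prefix has not been declared.
        return self._uri_by_prefix[prefix]
```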

The Prefix associated with a namespace is represented in the reified LMNL layer with [syn:prefix] annotations on [namespace] annotations on [name] and [syn:id] annotations.

Comments

Comments may appear in the prolog, in content and within the metadata of tags, but not inside other syntactic markup.

[37] Comment ::= '[!--' (Char - '--]')* '--]'

Note that comments in LMNL syntax do not have the XML restriction of not allowing "--" (double hyphen) within comments.
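A non-greedy regular expression is one simple way to recognise comments, since a comment ends at the first --] after its opening delimiter. The following Python sketch is illustrative only; it scans a string in isolation and does not check that the comment appears in a position where comments are allowed.

```python
import re

# '[!--', then as little text as possible, then the first '--]'.
# DOTALL lets a comment span several lines.
_COMMENT = re.compile(r"\[!--(.*?)--\]", re.DOTALL)


def comment_bodies(text):
    """Return the text between the delimiters of each comment found."""
    return _COMMENT.findall(text)
```

A double hyphen inside the comment body is accepted, as in comment_bodies("a[!-- x -- y --]b").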

Comments are not carried through into the data model, but they are represented by [syn:comment] ranges in the reified LMNL layer.

Entities

In LMNL syntax, an entity is a pair of an entity name and an entity value. An entity name is a Name. An entity value is some replacement text that should be used to replace an entity reference that references the entity's name.

An entity is declared with an entity declaration, which may be held in an external file referred to by an entities declaration. An entity's value is included in a document via an entity reference.

LMNL entities are similar to, but more tightly constrained than, entities in XML. Entities in LMNL are the equivalent of internal entities in XML, with the additional constraint that they can only include character data, character references and entity references, not other markup.

Entity Declarations

An entity declaration declares an entity by associating an entity name with an entity value.

[38] EntityDeclaration ::= '[!entity' S EntityName Eq EntityValue S? ']'
[39] EntityName ::= Name

The entity name is the EntityName. The entity value is the EntityValue (without quotes), with all character references replaced by the characters that they represent.

Entity declarations can occur anywhere within the prolog or content of a document and within start tags, end tags and empty tags. The “scope” of an entity declaration is the entire document following the entity declaration. It is not possible to “undeclare” an entity. It is an error for an entity declaration to declare an entity with the same name as an existing entity, unless it also has the same value.
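The following Python sketch illustrates how an entity value might be derived from the text between the quotes of an EntityValue, and how the redeclaration rule could be checked. The function names and the dictionary used as the entity table are assumptions made for the illustration.

```python
import re

# Decimal or hexadecimal character reference, as in production [42].
_CHAR_REF = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")


def entity_value(literal_body):
    """Replace character references in the unquoted EntityValue text."""
    def decode(match):
        hex_digits, dec_digits = match.groups()
        return chr(int(hex_digits, 16) if hex_digits else int(dec_digits))
    # The substituted text is not rescanned, so a double-escaped value such
    # as '&#38;#91;' yields the character reference '&#91;', not '['.
    return _CHAR_REF.sub(decode, literal_body)


def declare_entity(entities, name, literal_body):
    """Record an entity, rejecting a redeclaration with a different value."""
    value = entity_value(literal_body)
    if name in entities and entities[name] != value:
        raise ValueError(f"entity {name!r} is already declared with another value")
    entities[name] = value
```

Entity references inside the value are left untouched here; they are only replaced when the entity itself is referenced.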

Entities Declarations

An entities declaration declares a number of entities at once by referring to an external document in LMNL syntax.

[40] EntitiesDeclaration ::= '[!entities' S 'href' Eq URILiteral S? ']'

The entities declaration imports all the entities declared by the entity declarations in the referenced document, recursively (such that entities declarations in the referenced document themselves cause the import of more entity declarations).

Note that this is not a textual include. In particular, the qualified names used in the entity declarations in the referenced document are resolved in the context of the namespace declarations within that file. In addition, the external document referenced by an entities declaration can have its own content, for example to describe the entities defined in the file, but this content will be ignored.

It is not an error if the document referenced by the entities declaration is unreachable. It is an error if the referenced document is not a document in LMNL syntax (which entails that the entire document must be parsed even though the entity declarations can only appear in the prolog).

Entity References

An entity reference refers to an entity. When an entity reference is encountered, the value of the referenced entity is retrieved and processed, in place of the entity reference itself, as though it were part of the document at the location the entity reference was recognised. The replacement text may include both character data and entity references, which are themselves recognised and replaced. The replacement text must not contain any other markup; it is an error if either [ or { is encountered in the replacement text of an entity.

[41] EntityReference ::= '&' EntityName ';'

The entity reference is replaced by the value of the entity named in the EntityReference. It is an error if there is no such entity.
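The replacement process can be sketched in Python as a recursive expansion. This is an illustration under assumptions rather than a definitive algorithm: the entity table is a plain dictionary from entity names to entity values, the function names are hypothetical, and the guard against an entity that refers to itself is an addition that the text above does not specify.

```python
import re

# A character reference (hexadecimal or decimal) or an entity reference.
_REF = re.compile(r"&(?:#x([0-9a-fA-F]+)|#([0-9]+)|([^&;#\s]+));")


def _check_literal(literal):
    # Literal '[' or '{' would be markup; a stray '&' is not a reference.
    for ch in "[{&":
        if ch in literal:
            raise ValueError(f"unexpected {ch!r} in entity replacement text")


def expand_entity(name, entities, _active=()):
    """Return the character data that a reference to `name` stands for."""
    if name not in entities:
        raise KeyError(f"no entity named {name!r}")
    if name in _active:
        # Assumption: circular references are rejected.
        raise ValueError(f"entity {name!r} refers to itself")
    text = entities[name]
    out = []
    pos = 0
    for match in _REF.finditer(text):
        literal = text[pos:match.start()]
        _check_literal(literal)
        out.append(literal)
        hex_digits, dec_digits, ref_name = match.groups()
        if hex_digits:
            # Character references become character data immediately.
            out.append(chr(int(hex_digits, 16)))
        elif dec_digits:
            out.append(chr(int(dec_digits)))
        else:
            out.append(expand_entity(ref_name, entities, _active + (name,)))
        pos = match.end()
    tail = text[pos:]
    _check_literal(tail)
    out.append(tail)
    return "".join(out)
```

Characters produced by character references are not checked again, which is why an entity whose value is a character reference to [ can be referenced without error.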

Note that it is possible for an entity declaration and an entity reference to use different qualified names, as long as those qualified names resolve to the same expanded name.

An entity reference is not represented in the data model, but it is represented in the reified LMNL layer, as a [syn:entity] range. The [syn:entity] range has a [name] annotation that represents the entity name. The value of the [name] annotation is the entity name. The [syn:entity] range encloses [rl:text] and [syn:entity] ranges representing the entity value.

Predefined Entities

Entity and character references can both be used to escape the left square bracket, left curly bracket, ampersand, and other delimiters. A set of predefined entities (amp, lsqb, rsqb, lcub, rcub, apos and quot) is specified for this purpose. Numeric character references may also be used; they are expanded immediately when recognized and must be treated as character data, so the numeric character references "&#91;", "&#123;" and "&#38;" may be used to escape [, { and & when they occur in character data.

All LMNL syntax parsers must recognize these predefined entities whether they are declared or not. If the entities lsqb, lcub or amp are declared, they must be declared as entities whose replacement text is a character reference to the respective character (left square bracket, left curly bracket or ampersand) being escaped; the double escaping is required for these entities so that references to them do not generate an error. If the entities rsqb, rcub, apos or quot are declared, they must be declared as entities whose replacement text is the single character being escaped (or a character reference to that character; the double escaping here is unnecessary but harmless). For example:

[!entity lsqb="&#38;#91;"]
[!entity rsqb="&#93;"]
[!entity lcub="&#38;#123;"]
[!entity rcub="&#125;"]
[!entity amp="&#38;#38;"]
[!entity apos="&#39;"]
[!entity quot="&#34;"]

Only the lsqb, lcub and amp entities are strictly necessary; the others are provided for balance and to enable entity values to easily include apostrophes and double quotes whichever delimiter they use.

Character References

A character reference is a way of referring to a character by its code point rather than including it as a native character, perhaps because the encoding used for the document does not support the character, or simply because it is easier to insert than the native character. Character references follow the same syntax as character references in XML.

[42] CharRef ::= ('&#' [0-9]+ ';') | ('&#x' [0-9a-fA-F]+ ';')

If the character reference begins with "&#x", the digits and letters up to the terminating ; provide a hexadecimal representation of the character's code point in ISO/IEC 10646. If it begins just with "&#", the digits up to the terminating ; provide a decimal representation of the character's code point. Character references must represent legal LMNL characters.

The character represented by a character reference will be part of the content of a text layer in the data model. In the reified LMNL layer, it will map to one of the characters in the [characters] annotation of a [rl:text] range.

Common Syntactic Constructs

This section describes some of the common syntactic constructs that are used in the LMNL syntax. Most of them should be familiar from the XML Recommendation and Namespaces in XML.

Characters

The legal characters in LMNL syntax are the same as the legal characters in the LMNL data model — any Unicode character, excluding control characters and the surrogate blocks. As in the LMNL data model, documents in LMNL syntax must be Unicode normalized per Unicode Normalization Form NFC for reasons described in Character Model for the World Wide Web.

[43] Char ::= #x9 | #xA | #xD | [#x20-#x7E] | #x85 | [#xA0-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
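Production [43] and the normalization requirement translate directly into simple checks. The following Python sketch is illustrative; the function names are hypothetical, and the NFC test relies on the standard unicodedata module.

```python
import unicodedata

# Inclusive code point ranges copied from production [43].
_CHAR_RANGES = (
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD), (0x20, 0x7E), (0x85, 0x85),
    (0xA0, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
)


def is_legal_char(ch):
    """True if the single character `ch` matches the Char production."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in _CHAR_RANGES)


def is_nfc(text):
    """True if `text` is already in Unicode Normalization Form NFC."""
    return unicodedata.normalize("NFC", text) == text
```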

Whitespace

Whitespace is considered to be any sequence of space, tab or line-feed characters.

[44] S ::= (#x20 | #x9 | #xA)+

As in XML, a LMNL parser must normalize line endings as if, before parsing, it translated certain line-ending sequences into a single #xA character. These line-ending sequences are:

Qualified Names

LMNL syntax represents expanded names using qualified names. Qualified names consist of an optional Prefix followed by a LocalPart, both of which are names.

Note that no Names in LMNL syntax can contain colons.

[45] QName ::= (Prefix ':')? LocalPart
[46] Prefix ::= Name
[47] LocalPart ::= Name
[48] Name ::= NameStartChar (NameChar)*
[49] NameStartChar ::= [A-Z] | "_" | [a-z] | [#xC0-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xEFFFF]
[50] NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]

A qualified name is resolved to an expanded name by identifying the namespace declaration that appears earlier in the document that associates the Prefix of the qualified name with a namespace name. It is an error if there is no namespace declaration for the specified Prefix appearing before the qualified name in the document. If the qualified name doesn't have a Prefix then the namespace name of the expanded name is the empty string. The expanded name represented by the qualified name is the pair of this namespace name and the LocalPart of the qualified name.
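The resolution rule can be sketched in a few lines of Python. The function name is hypothetical, and the in-scope namespace declarations are assumed to be available as a dictionary from prefixes to namespace names built from the declarations seen so far.

```python
def resolve_qname(qname, namespaces):
    """Resolve a qualified name to a (namespace name, local part) pair."""
    prefix, colon, local_part = qname.partition(":")
    if not colon:
        # No Prefix: the namespace name is the empty string.
        return ("", qname)
    if prefix not in namespaces:
        raise ValueError(f"no namespace declaration for prefix {prefix!r}")
    return (namespaces[prefix], local_part)
```

Splitting at the first colon is sufficient because Names in LMNL syntax cannot contain colons.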

Namespace declarations are scoped such that it doesn't matter what context a namespace declaration appears in (within a start tag, or in the content of an annotation, for example): it will always be in scope from that point on through the document. Namespace declarations cut across the structures defined by the semantic markup in the document.

Literals

Literal data is any quoted string not containing the quotation mark used as a delimiter for that string. Literals are used for specifying the values of entities (EntityValue) and identifiers for namespace names and locations of entity declarations (URILiteral).

[51] Eq ::= S? '=' S?
[52] EntityValue ::= ('"' ([^&[{"] | Reference)* '"') | ("'" ([^&[{'] | Reference)* "'")
[53] URILiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")

Note that URILiterals can be parsed without scanning for markup. Their values (between the quotes) should be valid URIs as defined in RFC 2396.

Example Document

Following is an example of a document in LMNL syntax for the example provided in the data model specification:

[!lmnl version="0.2" encoding="ISO-8859-1"]
[!layer name="types" base="#default"]
[date~types [day-of-week}Friday{]
  }[year}2002{year
  ]-[month [name}August{name [abbreviation}Aug{abbreviation]]}08{month
  ]-[day}23{day]{date~types]

Acknowledgements

Thanks to the following for their comments on this document: John Cowan, Gavin Nicol, Wendell Piez.