Explanations
node
In GS, everything is a node. A node has this form:
node = '<' specialType? name? attribute* body? attribute* '>'
name
Nodes have zero or one name.
name = rawCharacters | quotedStr | boundedStr
Examples:
<>
<tagName>
<'quoted name'>
<'quoted \'name\' with escaping'>
<|'strange 'name' with bounded escaping|'>
body
Nodes have zero or one of the 4 body types:
body = bodyText | bodyList | bodyMap | bodyMixed
- Body text '""', defines a terminal text.
bodyText = formattable? ( quotedText | boundedText )
- Body list '[]', defines a list of nodes as children.
bodyList = '[' ( node | simpleNode )* ']'
- Body map '{}', defines a set of properties (name-node pairs) as children.
bodyMap = '{' ( property | node )* '}'
property = name ( '=' ( node | simpleNode ) )?
- Body mixed '``', also defines a list of nodes as children, but with a different, author-friendly syntax suited to document-oriented content with paragraphs and inline tags.
bodyMixed = formattable? '`' ( mixedText | node )* '`'
Examples:
<noBody>
<text "text node">
<text "text \"node\" with escaping">
<text !"text "node" with bounded escaping!">
<list[<child>]>
<map{ property= <child>}>
<mixed `paragraph with <em `inline`> tags`>
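Because each body type starts with a distinct character, a parser can pick the body rule from the first one or two characters. The sketch below illustrates this; `BodyKind` and `bodyKindAt` are hypothetical names, not part of any GS implementation, and the function only classifies the body without parsing it.

```typescript
// Hypothetical classification of the four body types described above.
type BodyKind = "text" | "list" | "map" | "mixed";

// Returns the kind of body starting at `pos`, or null if no body starts there.
// Only bodyText and bodyMixed accept the optional formattable flag '~'.
function bodyKindAt(src: string, pos: number): BodyKind | null {
  let i = pos;
  const formattable = src.charAt(i) === "~";
  if (formattable) i++;
  switch (src.charAt(i)) {
    case '"': // quotedText
    case "!": // boundedText
      return "text";
    case "`":
      return "mixed";
    case "[":
      return formattable ? null : "list";
    case "{":
      return formattable ? null : "map";
    default:
      return null;
  }
}
```

For example, `bodyKindAt("[<child>]", 0)` yields `"list"`, while `bodyKindAt("~[", 0)` yields `null` because the grammar does not allow `~` before a list.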
attribute
A node can have zero or more attributes before and after its body.
An attribute is a name-value pair, the value is optional.
An attribute, like a node, can have a special type.
attribute = specialType? name ( '=' value )?
value = rawCharacters | formattable? quotedStr | formattable? boundedStr
Examples:
< name=value quoted='value with any unicode 😊' bounded=|'value with 'bounded' escaping|' attWithoutValue>
< 'name quoted'=1 |'name with 'bounded' escaping|'=2>
specialType
Nodes and attributes can have zero or one of the 4 special types:
specialType = '#' | '&' | '%' | '?'
- Comment '#', for inserting comments in the content or wrapping some nodes to exclude them.
- Meta '&', for inserting metadata in the content without changing the content itself. It is similar to annotations in programming languages.
- Instruction '%', for inserting processing instructions in interaction with the content.
- Syntax '?', for declaring the current GS profile. It is useful for editors or validators in order to emit more dedicated errors and warnings. For example, in a GS-ML profile, the map body should not be used in this content (not allowed in DOM and XML). In a GS-ON profile, named nodes should not be used, etc.
Examples:
<# "text in the comment">
<#TODO by=Mark "typed comment">
<#THREAD [<comment by=Mark "Structured comment">]>
<& "meta node. It can be also typed or structured like comments">
<%repeat count=10 {}>
<?ON>
<?ML>
<?OM>
<specialTypesInAtt %instruction #todo='comment' &meta ?ML>
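A reader of GS content can recognize the special type from the single character that follows `<` (or that starts an attribute). The sketch below maps each mark to the role described above; `SPECIAL_TYPES` and `specialTypeAt` are hypothetical names chosen for illustration.

```typescript
// The four special type marks and the roles listed above.
const SPECIAL_TYPES = {
  "#": "comment",
  "&": "meta",
  "%": "instruction",
  "?": "syntax",
} as const;

type SpecialTypeMark = keyof typeof SPECIAL_TYPES;
type SpecialTypeName = (typeof SPECIAL_TYPES)[SpecialTypeMark];

// Returns the special type whose mark is at `pos`, or null for a regular node or attribute.
function specialTypeAt(src: string, pos: number): SpecialTypeName | null {
  const c = src.charAt(pos);
  return c in SPECIAL_TYPES ? SPECIAL_TYPES[c as SpecialTypeMark] : null;
}
```

A tool that only wants the content itself could, for instance, skip every node for which `specialTypeAt` returns `"comment"`.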
simpleNode
A simple node is a node without specialType, name or attribute, which has only a body. In this case, the < and > marks can be omitted.
simpleNode = body | rawCharacters
Examples:
"text"
rawCharacters
[<child>]
{name= <child>}
`mixed text and <child "node">`
formattable
The formattable flag '~' can be added before values and texts: it indicates that this character sequence can be formatted and indented, for example by editors.
formattable = '~'
Only one change is allowed: any space sequence can be replaced by any other space sequence. A space sequence is defined by the regular expression: [ \t\n\r]+
Examples:
~"Long text"
<title ~"Long title...">
<product description=~'Long description...'{}>
~`Long mixed <span ~`text and node`>`
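One consequence of the formattable rule is that two formattable texts are interchangeable when they differ only in their space sequences. A minimal sketch of that comparison follows; `normalizeSpaces` and `sameFormattableText` are hypothetical helpers, and collapsing to a single space is just one convenient canonical form.

```typescript
// The space sequence defined above.
const SPACE_SEQUENCE = /[ \t\n\r]+/g;

// Collapse every space sequence to a single space (one possible canonical form).
function normalizeSpaces(text: string): string {
  return text.replace(SPACE_SEQUENCE, " ");
}

// True when `b` can be obtained from `a` by replacing space sequences
// with other space sequences, which is the only change '~' allows.
function sameFormattableText(a: string, b: string): boolean {
  return normalizeSpaces(a) === normalizeSpaces(b);
}
```

Note that a space sequence is never empty, so a formatter may re-indent `Long text` but may not join it into `Longtext`.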
rawCharacters
rawCharacters is the limited character set that can be used without delimiters for names, values and body text in simple nodes.
It is defined by a regular expression:
rawCharacters = /[a-zA-Z0-9_:\-.\/]+/
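When writing GS, a serializer can emit a name or value as-is when it fits rawCharacters and fall back to a quoted string otherwise. The sketch below assumes the rawCharacters token from the Tokens section; `isRawCharacters` and `writeNameOrValue` are hypothetical helpers, and the fallback escapes only the backslash and the single quote.

```typescript
// rawCharacters token, anchored so that the whole string must match.
const RAW_CHARACTERS = /^[a-zA-Z0-9_:\-.\/]+$/;

// True when `s` can be written without delimiters.
function isRawCharacters(s: string): boolean {
  return RAW_CHARACTERS.test(s);
}

// Write a name or value, adding single quotes only when required.
// Assumption: escaping '\' and ''' with a backslash is sufficient here.
function writeNameOrValue(s: string): string {
  if (isRawCharacters(s)) return s;
  return "'" + s.replace(/[\\']/g, (c) => "\\" + c) + "'";
}
```

For example, `writeNameOrValue("tagName")` returns the string unchanged, while a string containing a space or a quote is wrapped in single quotes.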
quotedStr
When a name or value uses a character not allowed in rawCharacters, it must be delimited by single quotes.
It is defined by a regular expression:
quotedStr = /'([^'\\]|\\['"`<bfnrt]|\\u[0-9A-Fa-f]{6})*'/
Characters in this sequence can be escaped by a '\'. See Commons quoted escaping rules.
boundedStr
A string delimited by a boundary can be used in names and values:
|boundary'any|boundary'
Where:
- boundary can be any sequence of characters except '. The start and end boundaries MUST be the same.
- any can be any character sequence but MUST NOT contain the |boundary' sequence.
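When writing a bounded string, the boundary has to be chosen so that the closing mark does not occur inside the content. A minimal sketch of one way to do this: start with an empty boundary and grow it until there is no collision. `writeBoundedStr` is a hypothetical helper and the filler character `x` is an arbitrary choice.

```typescript
// Write `content` as |boundary'content|boundary'.
function writeBoundedStr(content: string): string {
  let boundary = "";
  // Grow the boundary until the mark |boundary' does not occur in the content.
  // This terminates because the mark eventually becomes longer than the content.
  while (content.includes("|" + boundary + "'")) {
    boundary += "x"; // any character except ' is allowed in a boundary
  }
  const mark = "|" + boundary + "'";
  return mark + content + mark;
}
```

With the empty boundary the result looks like the examples above, such as `|'strange 'name'|'`; a longer boundary is used only when the content itself contains `|'`.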
quotedText
Quoted text is used in the bodyText. It is defined by a regular expression:
quotedText = /"([^"\\]|\\['"`<bfnrt]|\\u[0-9A-Fa-f]{6})*"/
Characters in this sequence can be escaped by a '\'. See Commons quoted escaping rules.
boundedText
Text delimited by a boundary can be used in a bodyText:
!boundary"any!boundary"
Where:
- boundary can be any sequence of characters except ". The start and end boundaries MUST be the same.
- any can be any character sequence but MUST NOT contain the !boundary" sequence.
mixedText
Mixed text is used in the bodyMixed. It is defined by a regular expression:
mixedText = /([^<\\]|\\['"`<bfnrt]|\\u[0-9A-Fa-f]{6})*/
Characters in this sequence can be escaped by a '\'. See Commons quoted escaping rules.
Commons quoted escaping rules
In quotedStr, quotedText and mixedText, some characters can be escaped:
- \\, \', \", \`, \< escape the second character
- \b escapes the backspace character (unicode 08)
- \f escapes the form feed character (unicode 0C)
- \n escapes the line feed character (unicode 0A)
- \r escapes the carriage return character (unicode 0D)
- \t escapes the tabulation character (unicode 09)
Moreover, any unicode character can be escaped with '\u' followed by six hexadecimal digits corresponding to its unicode number.
For example, \u01F60A corresponds to the 😊 character.
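The escaping rules above can be sketched as a small decoding function applied to the inner content of a quotedStr, quotedText or mixedText. `unescapeGs` and `SIMPLE_ESCAPES` are hypothetical names; rejecting unknown escapes with an error is an assumption of this sketch, not something the rules prescribe.

```typescript
// Single-character escapes from the list above, mapped to the character they produce.
const SIMPLE_ESCAPES: Record<string, string> = {
  "\\": "\\", "'": "'", '"': '"', "`": "`", "<": "<",
  b: "\b", f: "\f", n: "\n", r: "\r", t: "\t",
};

// Decode escape sequences in `raw` (the text between the delimiters).
function unescapeGs(raw: string): string {
  return raw.replace(/\\(u[0-9A-Fa-f]{6}|[\s\S])/g, (whole: string, esc: string) => {
    if (esc.length === 7) {
      // \u followed by six hexadecimal digits: a unicode number.
      // String.fromCodePoint throws a RangeError above 10FFFF.
      return String.fromCodePoint(parseInt(esc.slice(1), 16));
    }
    const ch = SIMPLE_ESCAPES[esc];
    if (ch === undefined) throw new Error(`Unknown escape: ${whole}`);
    return ch;
  });
}
```

Six digits are enough to address any unicode character directly, so no surrogate pair handling is needed when decoding `\u01F60A`.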
Full syntax definition
The GS syntax is formalized in three parts:
- A grammar of rules for combining tokens
- A set of tokens defined by regular expressions
- Two specific supplementary rules for parser implementations
Grammar
GS = (nodeLike s*)*
nodeLike = node | simpleNode
simpleNode = body | rawCharacters
node = '<' specialType? name? attr* s* (body attr* s*)? '>'
name = rawCharacters | quotedStr | boundedStr
attr = s* specialType? name (s* '=' s* value)?
value = rawCharacters | formattable? quotedStr | formattable? boundedStr
body = bodyList | bodyText | bodyMap | bodyMixed
bodyList = '[' s* (nodeLike s*)* ']'
bodyText = formattable? (quotedText | boundedText)
bodyMap = '{' s* ((prop | node) s*)* '}'
prop = name ('=' s* nodeLike)?
bodyMixed = formattable? '`' (mixedText | node)* '`'
specialType = '#' | '&' | '%' | '?'
formattable = '~'
Tokens
Tokens are defined as regular expressions:
s = /[ \t\n\r]/
rawCharacters = /[a-zA-Z0-9_:\-.\/]+/
quotedStr = /'([^'\\]|\\['"`<bfnrt]|\\u[0-9A-Fa-f]{6})*'/
boundedStr = /\|[^']*'.*\|[^']*'/
quotedText = /"([^"\\]|\\['"`<bfnrt]|\\u[0-9A-Fa-f]{6})*"/
boundedText = /![^"]*".*![^"]*"/
mixedText = /([^<\\]|\\['"`<bfnrt]|\\u[0-9A-Fa-f]{6})*/
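As an illustration of how these token definitions might be used in a tokenizer, the sketch below compiles a few of them as sticky regular expressions (the `y` flag), so that a match must start exactly at the current position. `TOKENS` and `matchAt` are hypothetical names; the bounded tokens and mixedText are left out because they need special handling.

```typescript
// A subset of the tokens above, compiled with the sticky flag.
const TOKENS = {
  s: /[ \t\n\r]/y,
  rawCharacters: /[a-zA-Z0-9_:\-.\/]+/y,
  quotedStr: /'([^'\\]|\\['"`<bfnrt]|\\u[0-9A-Fa-f]{6})*'/y,
  quotedText: /"([^"\\]|\\['"`<bfnrt]|\\u[0-9A-Fa-f]{6})*"/y,
};

// Try to match `token` exactly at `pos`; returns the matched text or null.
function matchAt(token: RegExp, src: string, pos: number): string | null {
  token.lastIndex = pos; // a sticky regex only matches starting at lastIndex
  const m = token.exec(src);
  return m === null ? null : m[0];
}
```

Calling `matchAt(TOKENS.rawCharacters, "<tagName>", 1)` returns `tagName`, and the same call at position 0 returns `null` because `<` is not part of rawCharacters.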
Supplementary rules
Two features in GS need specific rules:
- The mixedText regular expression must only be applied in the bodyMixed rule. A GS parser implementation needs to be context sensitive just at this point.
- For boundedStr and boundedText tokens, the start boundary and the end boundary must be identical. The regular expression does not permit this check. For example, in boundedText, the boundary is defined by ![^"]*": the same sequence of characters must start the token and end it. A GS parser implementation needs to use a custom tokenizer just for these two tokens.
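The second rule can be sketched as a small custom tokenizer step: read the opening mark, then search for the next occurrence of exactly the same mark. `readBounded` is a hypothetical function, not taken from the gs-js parser; it handles boundedText with the `!` and `"` delimiters and boundedStr with `|` and `'`.

```typescript
// Read a bounded token starting at `pos`.
// opener/closer are "!" and '"' for boundedText, "|" and "'" for boundedStr.
// Returns the content and the position just after the token, or null if malformed.
function readBounded(
  src: string,
  pos: number,
  opener: string,
  closer: string
): { content: string; end: number } | null {
  if (src.charAt(pos) !== opener) return null;
  // The boundary runs up to the first closer character.
  const closeIdx = src.indexOf(closer, pos + 1);
  if (closeIdx < 0) return null;
  const mark = src.slice(pos, closeIdx + 1); // e.g. !" or !end"
  // The token ends at the next occurrence of the identical mark.
  const endIdx = src.indexOf(mark, closeIdx + 1);
  if (endIdx < 0) return null;
  return { content: src.slice(closeIdx + 1, endIdx), end: endIdx + mark.length };
}
```

Since the content must not contain the mark, the first occurrence found is always the real end of the token, so no backtracking is needed.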
Except for these two specific cases, implementing an efficient GS parser is straightforward. The GS syntax doesn't need speculative try and rewind strategies (like HTML).
You can find a parser implementation in TypeScript here: https://github.com/generic-syntax/gs-js/blob/master/src/core/gsParser.ts.