Generic-Syntax definition

Explanations

node

In GS, everything is a node. A node has this form:

node = '<' specialType? name? attribute* body? attribute* '>'

name

Nodes have zero or one name.

name = rawCharacters | quotedStr | boundedStr

Examples:

<>
<tagName>
<'quoted name'>
<'quoted \'name\' with escaping'>
<|'strange 'name' with bounded escaping|'>

body

Nodes have zero or one of the 4 body types:

body = bodyText | bodyList | bodyMap | bodyMixed

Examples:

<noBody>
<text "text node">
<text "text \"node\" with escaping">
<text !"text "node" with bounded escaping!">
<list[<child>]>
<map{ property= <child>}>
<mixed `paragraph with <em `inline`> tags`>

attribute

Nodes can have zero or more attributes before and after their body.

An attribute is a name-value pair; the value is optional.

An attribute, like a node, can have a special type.

attribute = specialType? name ( '=' value )?

value = rawCharacters | formattable? quotedStr | formattable? boundedStr

Examples:

< name=value quoted='value with any unicode 😊' bounded=|'value with 'bounded' escaping|' attWithoutValue>
< 'name quoted'=1 |'name with 'bounded' escaping|'=2>

specialType

Nodes and attributes can have zero or one of the 4 special types:

specialType = '#' | '&' | '%' | '?'

Examples:

<# "text in the comment">
<#TODO by=Mark "typed comment">
<#THREAD [<comment by=Mark "Structured comment">]>
<& "meta node. It can be also typed or structured like comments">
<%repeat count=10 {}>
<?ON>
<?ML>
<?OM>
<specialTypesInAtt %instruction #todo='comment' &meta ?ML>

simpleNode

A simple node is a node without a specialType, name or attribute, having only a body. In this case, the < and > marks can be omitted.

simpleNode = body | rawCharacters

Examples:

"text"
rawCharacters
[<child>]
{name= <child>}
`mixed text and <child "node">`

formattable

The formattable flag '~' can be added before values and texts: it indicates that the character sequence can be reformatted and indented, for example by editors.

formattable = '~'

Only one change is allowed: any space sequence can be replaced by any other space sequence. A space sequence is defined by the regular expression: [ \t\n\r]+.

Examples:

~"Long text"
<title ~"Long title...">
<product description=~'Long description...'{}>
~`Long mixed <span ~`text and node`>`
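
The whitespace rule for formattable content can be illustrated with a minimal sketch. The helper names canonicalSpaces and sameUpToFormatting are hypothetical and not part of GS; the sketch only assumes the space sequence definition [ \t\n\r]+ given above.

```typescript
// Space sequence as defined in the text: [ \t\n\r]+
const SPACE_SEQUENCE = /[ \t\n\r]+/g;

// Hypothetical helper: collapse every space sequence to a single space, so that
// two formattable texts can be compared regardless of how an editor indented them.
function canonicalSpaces(text: string): string {
  return text.replace(SPACE_SEQUENCE, " ");
}

// Hypothetical helper: true when `b` can be obtained from `a` by only replacing
// space sequences with other (non-empty) space sequences.
function sameUpToFormatting(a: string, b: string): boolean {
  return canonicalSpaces(a) === canonicalSpaces(b);
}
```

Note that a space sequence is never empty, so a formatter following this rule cannot introduce whitespace where there was none, nor remove a separation entirely.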

rawCharacters

rawCharacters is the limited character set that can be used without delimiters in names, values and the body text of simple nodes. It is defined by a regular expression:

rawCharacters = [a-zA-Z0-9_:\-./]+

quotedStr

When a name or value uses a character not allowed in rawCharacters, it must be delimited by single quotes. It is defined by a regular expression:

quotedStr = '([^'\\]|\\['"`<bfnrt]|\\u[0-9A-Fa-f]{6})*'

Characters in this sequence can be escaped by a '\'. See Commons quoted escaping rules.
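
As an illustration, the rawCharacters and quotedStr expressions can be used to check whether a name or value may be written without delimiters, and whether a quoted token is well formed. This is only a sketch: the expressions are copied from the definitions above and anchored to the whole string, and the helper names isRaw and isQuotedStr are hypothetical.

```typescript
// Regular expressions copied from the definitions above, anchored with ^ and $
// so that they must match the whole string.
const RAW_CHARACTERS = /^[a-zA-Z0-9_:\-.\/]+$/;
const QUOTED_STR = /^'([^'\\]|\\['"`<bfnrt]|\\u[0-9A-Fa-f]{6})*'$/;

// Hypothetical helper: can this name or value be written without delimiters?
function isRaw(s: string): boolean {
  return RAW_CHARACTERS.test(s);
}

// Hypothetical helper: is this source fragment a well-formed quotedStr token?
function isQuotedStr(token: string): boolean {
  return QUOTED_STR.test(token);
}
```

For instance, tagName passes isRaw, while a name containing a space does not and therefore needs quotes or a boundary.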

boundedStr

A string delimited by a boundary can be used in names and values.

boundedStr = '|' boundary? ''' any '|' boundary? '''

Where:

quotedText

Quoted text is used in the bodyText. It is defined by a regular expression:

quotedText = "([^"\\]|\\['"`<bfnrt]|\\u[0-9A-Fa-f]{6})*"

Characters in this sequence can be escaped by a '\'. See Commons quoted escaping rules.

boundedText

A text delimited by a boundary can be used in a bodyText.

boundedText = '!' boundary? '"' any '!' boundary? '"'

Where:

mixedText

Mixed text is used in the bodyMixed. It is defined by a regular expression:

mixedText = ([^<\\]|\\['"`<bfnrt]|\\u[0-9A-Fa-f]{6})*

Characters in this sequence can be escaped by a '\'. See Commons quoted escaping rules.

Commons quoted escaping rules

In quotedStr, quotedText and mixedText some characters can be escaped:

Moreover, any Unicode character can be escaped with '\u' followed by six hexadecimal digits giving its code point. For example, \u01F60A corresponds to the 😊 character.
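
A minimal sketch of how these escapes might be decoded is given below. The function name unescapeGs is hypothetical. The mapping of \b, \f, \n, \r and \t to the conventional control characters is an assumption, as is the choice to let \', \", \` and \< stand for the character itself; only the six-digit \u form is taken directly from the text above.

```typescript
// ASSUMPTION: the single-letter escapes map to the conventional control characters.
const SIMPLE_ESCAPES: Record<string, string> = {
  b: "\b",
  f: "\f",
  n: "\n",
  r: "\r",
  t: "\t",
};

// Hypothetical helper: decode the content of a quotedStr, quotedText or mixedText.
// Throws a RangeError if the six hex digits exceed the Unicode range (0x10FFFF).
function unescapeGs(content: string): string {
  return content.replace(
    /\\(?:u([0-9A-Fa-f]{6})|(['"`<bfnrt]))/g,
    (_match: string, hex?: string, ch?: string): string => {
      if (hex !== undefined) {
        // \u followed by six hexadecimal digits: the Unicode code point.
        return String.fromCodePoint(parseInt(hex, 16));
      }
      const c = ch as string;
      const control = SIMPLE_ESCAPES[c];
      // ASSUMPTION: \' \" \` and \< stand for the escaped character itself.
      return control !== undefined ? control : c;
    },
  );
}
```

With this sketch, the escaped form \u01F60A decodes to the 😊 character from the example above.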

Full syntax definition

The GS syntax is formalized in three parts: Grammar, Tokens and Supplementary rules.

Grammar

GS = (nodeLike s*)*

nodeLike = node | simpleNode
simpleNode = body | rawCharacters

node = '<' specialType? name? attr* s* (body attr* s*)? '>'

name = rawCharacters | quotedStr | boundedStr
attr = s* specialType? name (s* '=' s* value)?
value = rawCharacters | formattable? quotedStr | formattable? boundedStr

body = bodyList | bodyText | bodyMap | bodyMixed

bodyList = '[' s* (nodeLike s*)* ']'

bodyText = formattable? (quotedText | boundedText)

bodyMap = '{' s* ((prop | node) s*)* '}'
prop = name ('=' s* nodeLike)?

bodyMixed = formattable? '`' (mixedText | node)* '`'

specialType = '#' | '&' | '%' | '?'
formattable = '~'
	

Tokens

Tokens are defined as regular expressions:

s = /[ \t\n\r]/

rawCharacters = /[a-zA-Z0-9_:\-.\/]+/

quotedStr = /'([^'\\]|\\['"`<bfnrt]|\\u[0-9A-Fa-f]{6})*'/
boundedStr = /\|[^']*'.*\|[^']*'/

quotedText = /"([^"\\]|\\['"`<bfnrt]|\\u[0-9A-Fa-f]{6})*"/
boundedText = /![^"]*".*![^"]*"/

mixedText = /([^<\\]|\\['"`<bfnrt]|\\u[0-9A-Fa-f]{6})*/
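
The boundedStr and boundedText expressions above use a greedy `.*` and do not, by themselves, tie the closing boundary to the opening one. A scanner would probably handle this by hand. The sketch below is one possible approach, not the reference implementation: the function name readBounded is hypothetical, and it assumes that the closing delimiter repeats the opening boundary exactly and that the token ends at the first such closing delimiter.

```typescript
// Hypothetical helper. `open` is '|' for boundedStr or '!' for boundedText, and
// `quote` is "'" or '"' respectively. Returns the content and the index just past
// the closing delimiter, or null if there is no complete bounded token at `start`.
function readBounded(
  src: string,
  start: number,
  open: "|" | "!",
  quote: "'" | '"',
): { content: string; end: number } | null {
  if (src.charAt(start) !== open) return null;
  const quoteAt = src.indexOf(quote, start + 1);
  if (quoteAt < 0) return null;
  // Everything between the opening mark and the first quote is the boundary (may be empty).
  const boundary = src.slice(start + 1, quoteAt);
  // ASSUMPTION: the closing delimiter repeats the same boundary, and the first
  // occurrence of it ends the token.
  const closing = open + boundary + quote;
  const closeAt = src.indexOf(closing, quoteAt + 1);
  if (closeAt < 0) return null;
  return {
    content: src.slice(quoteAt + 1, closeAt),
    end: closeAt + closing.length,
  };
}
```

Because the content is searched for one fixed closing delimiter, it needs no escaping, which is what allows quotes to appear freely inside a bounded string.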

Supplementary rules

Two features in GS need specific rules:

Except for these two specific cases, implementing an efficient GS parser is straightforward. Unlike HTML, the GS syntax does not require speculative try-and-rewind strategies.
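
To give an idea of what single-pass parsing without rewinding looks like, here is a minimal sketch for a small subset of GS: nodes with a raw name, raw attribute names and values, and either a quoted text body or a list body of nodes. Special types, bounded strings, maps, mixed bodies, simple nodes and the formattable flag are deliberately left out. The GsNode interface and the parseNode function are hypothetical names introduced for this illustration and are not taken from the reference parser.

```typescript
interface GsNode {
  name: string | null;
  attrs: Array<[string, string | null]>; // [name, value]; value is null when omitted
  text: string | null;                   // content of a quoted bodyText, escapes not decoded
  children: GsNode[];                    // nodes of a bodyList
}

const isSpace = (c: string): boolean => c === " " || c === "\t" || c === "\n" || c === "\r";
const isRawChar = (c: string): boolean => /^[a-zA-Z0-9_:\-.\/]$/.test(c);

function parseNode(src: string, start: number): { node: GsNode; end: number } {
  let pos = start; // only ever moves forward: no rewind
  const skipSpaces = (): void => {
    while (pos < src.length && isSpace(src.charAt(pos))) pos++;
  };
  const readRaw = (): string => {
    const from = pos;
    while (pos < src.length && isRawChar(src.charAt(pos))) pos++;
    return src.slice(from, pos); // empty string when no raw character is present
  };
  const expect = (ch: string): void => {
    if (src.charAt(pos) !== ch) throw new Error(`expected ${ch} at offset ${pos}`);
    pos++;
  };
  const node: GsNode = { name: null, attrs: [], text: null, children: [] };
  const readAttrs = (): void => {
    for (;;) {
      skipSpaces();
      const key = readRaw();
      if (key === "") return; // next character starts a body or closes the node
      skipSpaces();
      if (src.charAt(pos) === "=") {
        pos++;
        skipSpaces();
        const value = readRaw();
        if (value === "") throw new Error(`expected a raw value at offset ${pos}`);
        node.attrs.push([key, value]);
      } else {
        node.attrs.push([key, null]);
      }
    }
  };

  expect("<");
  const name = readRaw(); // a name, if any, directly follows '<'
  if (name !== "") node.name = name;
  readAttrs();
  if (src.charAt(pos) === '"') {
    pos++;
    const from = pos;
    while (pos < src.length && src.charAt(pos) !== '"') {
      pos += src.charAt(pos) === "\\" ? 2 : 1; // skip the character after a backslash
    }
    node.text = src.slice(from, pos);
    expect('"');
    readAttrs();
  } else if (src.charAt(pos) === "[") {
    pos++;
    skipSpaces();
    while (src.charAt(pos) === "<") {
      const child = parseNode(src, pos);
      node.children.push(child.node);
      pos = child.end;
      skipSpaces();
    }
    expect("]");
    readAttrs();
  }
  expect(">");
  return { node, end: pos };
}
```

Every decision in this sketch is taken by looking at the current character only, which is the property the paragraph above refers to.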

You can find a parser implementation in TypeScript here: https://github.com/generic-syntax/gs-js/blob/master/src/core/gsParser.ts.