GS definition

Explanations

node

In GS, everything is a node. A node has this form:

node = '<' specialType? name? attribute* body? attribute* '>'

name

Nodes have zero or one name.

name = rawCharacters | quotedStr | boundedStr

Examples:

<>
<tagName>
<'quoted name'>
<'quoted \'name\' with escaping'>
<|'strange 'name' with bounded escaping|'>

body

Nodes have zero or one of the 4 body types:

body = bodyText | bodyList | bodyMap | bodyMixed

Body text '""' defines a terminal text.
bodyText = formattable? ( quotedText | boundedText )
Body list '[]' defines a list of nodes as children.
bodyList = '[' ( node | simpleNode )* ']'
Body map '{}' defines a set of properties (name-node pairs) as children.
bodyMap = '{' ( property | node )* '}'

property = name ( '=' ( node | simpleNode ) )?
Body mixed '``' also defines a list of nodes as children, but with a different, author-friendly syntax suited to document-oriented content with paragraphs and inline tags.
bodyMixed = formattable? '`' ( mixedText | node )* '`'

Examples:

<noBody>
<text "text node">
<text "text \"node\" with escaping">
<text !"text "node" with bounded escaping!">
<list[<child>]>
<map{ property= <child>}>
<mixed `paragraph with <em `inline`> tags`>

attribute

Nodes can have zero or more attributes before and after their body.

An attribute is a name-value pair; the value is optional.

An attribute, like a node, can have a special type.

attribute = specialType? name ( '=' value )?

value = rawCharacters | formattable? quotedStr | formattable? boundedStr

Examples:

< name=value quoted='value with any unicode 😊' bounded=|'value with 'bounded' escaping|' attWithoutValue>
< 'name quoted'=1 |'name with 'bounded' escaping|'=2>

specialType

Nodes and attributes can have zero or one of the 4 special types:

specialType = '#' | '&' | '%' | '?'

Comment '#', for inserting comments in the content or wrapping nodes to exclude them.
Meta '&', for inserting metadata in the content without changing the content itself. It is similar to annotations in programming languages.
Instruction '%', for inserting processing instructions that interact with the content.
Syntax '?', for declaring the current GS profile. Editors and validators can use it to emit more specific errors and warnings. For example, in a GS-ML profile the map body should not be used (it is not allowed in DOM and XML); in a GS-ON profile, named nodes should not be used; and so on.

Examples:

<# "text in the comment">
<#TODO by=Mark "typed comment">
<#THREAD [<comment by=Mark "Structured comment">]>
<& "meta node. It can be also typed or structured like comments">
<%repeat count=10 {}>
<?ON>
<?ML>
<?OM>
<specialTypesInAtt %instruction #todo='comment' &meta ?ML>

simpleNode

A simple node is a node without a specialType, name or attribute: it has only a body. In this case, the < and > marks can be omitted.

simpleNode = body | rawCharacters

Examples:

"text"
rawCharacters
[<child>]
{name= <child>}
`mixed text and <child "node">`
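As a reading aid, the node structure described so far can be sketched as a small set of TypeScript types. This is only one possible in-memory model: every type and field name below is hypothetical and does not come from the GS specification or from the gs-js implementation.

```typescript
// Hypothetical in-memory model; none of these names come from the GS specification.
type SpecialType = "#" | "&" | "%" | "?";

interface GsAttribute {
  specialType?: SpecialType;
  name: string;
  value?: string;        // absent for attributes written without '='
  formattable?: boolean; // true when the value was prefixed with '~'
}

interface GsProperty {
  propName?: string; // absent when the map entry is a plain node
  node?: GsNode;     // absent when the property has no '=' part
}

type GsBody =
  | { kind: "text"; text: string; formattable: boolean }
  | { kind: "list"; children: GsNode[] }
  | { kind: "map"; entries: GsProperty[] }
  | { kind: "mixed"; content: Array<string | GsNode>; formattable: boolean };

interface GsNode {
  specialType?: SpecialType;
  name?: string;
  attributes: GsAttribute[]; // assumption: position relative to the body is not recorded
  body?: GsBody;
}

// A simple node is a node with only a body, so it needs no separate type.
function simpleText(text: string): GsNode {
  return { attributes: [], body: { kind: "text", text: text, formattable: false } };
}

// Model of the example <list[<child>]>
const listExample: GsNode = {
  name: "list",
  attributes: [],
  body: { kind: "list", children: [{ name: "child", attributes: [] }] },
};
```

The discriminated union on `kind` mirrors the rule that a node has zero or one of the four body types.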

formattable

The formattable flag '~' can be added before values and texts. It indicates that the character sequence can be reformatted and indented, for example by editors.

formattable = '~'

The only change allowed is simple: any space sequence can be replaced by any other space sequence. A space sequence is defined by the regular expression [ \t\n\r]+.

Examples:

~"Long text"
<title ~"Long title...">
<product description=~'Long description...'{}>
~`Long mixed <span ~`text and node`>`
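The replacement rule for formattable texts can be illustrated with a short TypeScript sketch. The helper names are hypothetical. The idea is that two formattable texts are interchangeable exactly when they become identical once every space sequence is collapsed to a single space.

```typescript
// The space sequence from the rule above, applied globally.
const SPACE_SEQUENCE = /[ \t\n\r]+/g;

// Hypothetical helper: collapse each space sequence to a single space.
function normalizeSpaces(text: string): string {
  return text.replace(SPACE_SEQUENCE, " ");
}

// Hypothetical helper: true when `a` can be turned into `b` by only
// replacing space sequences with other space sequences.
function sameFormattableText(a: string, b: string): boolean {
  return normalizeSpaces(a) === normalizeSpaces(b);
}
```

Note that a space sequence contains at least one character, so a formatter may change `Long text` into `Long` followed by a line break and indentation, but may not join it into `Longtext`.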

rawCharacters

rawCharacters is the limited character set that can be used without delimiters for names, values and the body text of simple nodes. It is defined by a regular expression:

rawCharacters = [a-zA-Z0-9_:\-./]+
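A writer or validator typically needs to decide whether a given string can be emitted as rawCharacters. A minimal TypeScript sketch of that check follows; the helper name is hypothetical, and the regular expression is the one above, anchored to the whole string.

```typescript
// Same character class as the rawCharacters token, anchored to the whole string.
const RAW_CHARACTERS = /^[a-zA-Z0-9_:\-.\/]+$/;

// Hypothetical helper: true when `s` can be written without delimiters.
function isRawCharacters(s: string): boolean {
  return RAW_CHARACTERS.test(s);
}
```

With this check, `tagName` and `a/b.c:d-e_1` qualify, while `quoted name` (contains a space) and the empty string do not.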

quotedStr

When a name or value uses a character not allowed in rawCharacters, it must be delimited by single quotes. It is defined by a regular expression:

quotedStr = '([^'\\]|\\['"`<bfnrt]|\\u[0-9A-Fa-f]{6})*'

Characters in this sequence can be escaped by a '\'. See Commons quoted escaping rules.
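For illustration, here is a minimal TypeScript sketch of how a writer might produce a quotedStr from an arbitrary string. The helper name is hypothetical. It escapes only the single quote and the backslash; the `\\` escape is taken from the list in Commons quoted escaping rules, while the quotedStr regular expression as written above does not list it, so treat that part as an assumption.

```typescript
// Hypothetical helper: wrap any string in single quotes as a quotedStr.
// Only the backslash and the single quote are escaped; every other
// character is allowed as-is by the [^'\\] branch of the token.
function toQuotedStr(s: string): string {
  // "$&" inserts the matched character, so ' becomes \' and \ becomes \\.
  return "'" + s.replace(/[\\']/g, "\\$&") + "'";
}
```

Applied to the name `quoted 'name' with escaping`, this yields the same form as the earlier example `'quoted \'name\' with escaping'`.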

boundedStr

A string delimited by a boundary can be used in names and values.

boundedStr = '|' boundary? ''' any '|' boundary? '''

Where:

boundary can be any sequence of characters except '. The start and end boundary MUST be the same.
any can be any character sequence but MUST NOT contain the |boundary' sequence.

quotedText

Quoted text is used in the bodyText. It is defined by a regular expression:

quotedText = "([^"\\]|\\['"`<bfnrt]|\\u[0-9A-Fa-f]{6})*"

Characters in this sequence can be escaped by a '\'. See Commons quoted escaping rules.

boundedText

A text delimited by a boundary can be used in a bodyText.

boundedText = '!' boundary? '"' any '!' boundary? '"'

Where:

boundary can be any sequence of characters except ". The start and end boundary MUST be the same.
any can be any character sequence but MUST NOT contain the !boundary" sequence.
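Because the closing delimiter must repeat the opening one exactly, bounded tokens are easier to read with a few lines of code than with a regular expression. The following TypeScript function is a minimal sketch, not the gs-js implementation: the names `BoundedToken` and `readBounded` are hypothetical, and the marker is assumed to be a single character. It handles both boundedStr (marker `|`, quote `'`) and boundedText (marker `!`, quote `"`).

```typescript
interface BoundedToken {
  boundary: string; // characters between the marker and the quote (may be empty)
  content: string;  // raw content, taken verbatim (no escaping applies)
  end: number;      // index just past the closing delimiter
}

// Hypothetical helper: reads a bounded token starting at `start`,
// or returns null if there is none or it is not terminated.
function readBounded(
  src: string,
  start: number,
  marker: string, // '|' for boundedStr, '!' for boundedText
  quote: string   // "'" for boundedStr, '"' for boundedText
): BoundedToken | null {
  if (src.charAt(start) !== marker) return null;
  // The boundary runs up to the first quote, so it cannot contain one.
  const quotePos = src.indexOf(quote, start + 1);
  if (quotePos < 0) return null;
  // Full opening delimiter, for example !" or !xyz"
  const delimiter = src.substring(start, quotePos + 1);
  const contentStart = quotePos + 1;
  // The content ends at the first repetition of the same delimiter.
  const closePos = src.indexOf(delimiter, contentStart);
  if (closePos < 0) return null;
  return {
    boundary: delimiter.substring(1, delimiter.length - 1),
    content: src.substring(contentStart, closePos),
    end: closePos + delimiter.length,
  };
}
```

For example, reading `!x"a !" b!x"` with marker `!` and quote `"` gives the boundary `x` and the content `a !" b`: the inner `!"` does not end the token because it lacks the boundary.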

mixedText

Mixed text is used in the bodyMixed. It is defined by a regular expression:

mixedText = ([^<\\]|\\['"`<bfnrt]|\\u[0-9A-Fa-f]{6})*

Characters in this sequence can be escaped by a '\'. See Commons quoted escaping rules.

Commons quoted escaping rules

In quotedStr, quotedText and mixedText some characters can be escaped:

\\, \', \", \`, \< each stand for their second character
\b stands for the backspace character (U+0008)
\f stands for the form feed character (U+000C)
\n stands for the line feed character (U+000A)
\r stands for the carriage return character (U+000D)
\t stands for the tabulation character (U+0009)

Moreover, any Unicode character can be escaped with '\u' followed by six hexadecimal digits giving its code point. For example, \u01F60A corresponds to the 😊 character.
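The escaping rules above can be sketched as a single unescape function in TypeScript. This is an illustrative sketch under stated assumptions, not the gs-js implementation: the names `unescapeGs` and `codePointToString` are hypothetical, and rejecting unknown escapes with an error is a design choice of this sketch. The input is the content between the delimiters, without the delimiters themselves.

```typescript
// Single-letter escapes and the characters they stand for.
const SIMPLE_ESCAPES: Record<string, string> = {
  b: "\b", f: "\f", n: "\n", r: "\r", t: "\t",
};

// Characters that, after a backslash, stand for themselves: \ ' " ` <
const SELF_ESCAPED = "\\'\"`<";

// Converts a code point to a string, using a surrogate pair above U+FFFF.
function codePointToString(cp: number): string {
  if (cp > 0x10ffff) throw new Error("Code point out of range: " + cp.toString(16));
  if (cp <= 0xffff) return String.fromCharCode(cp);
  const offset = cp - 0x10000;
  return String.fromCharCode(0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff));
}

// Hypothetical helper: resolves the escapes of a quotedStr, quotedText or mixedText.
function unescapeGs(raw: string): string {
  // Match either \u plus six hex digits, or a backslash plus any one character.
  return raw.replace(/\\(u[0-9A-Fa-f]{6}|[\s\S])/g, (match: string, esc: string) => {
    if (esc.length === 7) return codePointToString(parseInt(esc.substring(1), 16));
    const simple = SIMPLE_ESCAPES[esc];
    if (simple !== undefined) return simple;
    if (SELF_ESCAPED.indexOf(esc) >= 0) return esc;
    throw new Error("Invalid escape sequence: " + match);
  });
}
```

For instance, the content of the earlier example `"text \"node\" with escaping"` unescapes to `text "node" with escaping`, and `\u01F60A` unescapes to 😊.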

Full syntax definition

The GS syntax is formalized in three parts:

A grammar of rules for combining tokens
A set of tokens defined by regular expressions
Two specific supplementary rules for parser implementations

Grammar

GS = (nodeLike s*)*

nodeLike = node | simpleNode
simpleNode = body | rawCharacters

node = '<' specialType? name? attr* s* (body attr* s*)? '>'

name = rawCharacters | quotedStr | boundedStr
attr = s* specialType? name (s* '=' s* value)?
value = rawCharacters | formattable? quotedStr | formattable? boundedStr

body = bodyList | bodyText | bodyMap | bodyMixed

bodyList = '[' s* (nodeLike s*)* ']'

bodyText = formattable? (quotedText | boundedText)

bodyMap = '{' s* ((prop | node) s*)* '}'
prop = name ('=' s* nodeLike)?

bodyMixed = formattable? '`' (mixedText | node)* '`'

specialType = '#' | '&' | '%' | '?'
formattable = '~'

Tokens

Tokens are defined as regular expressions:

s = /[ \t\n\r]/

rawCharacters = /[a-zA-Z0-9_:\-.\/]+/

quotedStr = /'([^'\\]|\\['"`<bfnrt]|\\u[0-9A-Fa-f]{6})*'/
boundedStr = /\|[^']*'.*\|[^']*'/

quotedText = /"([^"\\]|\\['"`<bfnrt]|\\u[0-9A-Fa-f]{6})*"/
boundedText = /![^"]*".*![^"]*"/

mixedText = /([^<\\]|\\['"`<bfnrt]|\\u[0-9A-Fa-f]{6})*/
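As a rough illustration of how these tokens could be used, the TypeScript sketch below tries a token at a given position of the source. The names `TOKENS` and `matchToken` are hypothetical. The bounded tokens and mixedText are deliberately left out, since they are covered by the supplementary rules. Slicing the source on every call keeps the sketch simple; a real tokenizer would more likely use sticky regular expressions (the `y` flag) to avoid the copy.

```typescript
// Regular-expression tokens copied from the list above, anchored with '^'
// so that they only match at the start of the slice handed to them.
const TOKENS = {
  s: /^[ \t\n\r]/,
  rawCharacters: /^[a-zA-Z0-9_:\-.\/]+/,
  quotedStr: /^'([^'\\]|\\['"`<bfnrt]|\\u[0-9A-Fa-f]{6})*'/,
  quotedText: /^"([^"\\]|\\['"`<bfnrt]|\\u[0-9A-Fa-f]{6})*"/,
};

type TokenName = keyof typeof TOKENS;

// Hypothetical helper: returns the token text found at `pos`, or null.
function matchToken(token: TokenName, src: string, pos: number): string | null {
  const m = TOKENS[token].exec(src.substring(pos));
  return m ? m[0] : null;
}
```

For example, trying `rawCharacters` at position 1 of `<tagName>` returns `tagName`, and trying `quotedStr` on an unterminated `'abc` returns null.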

Supplementary rules

Two features in GS need specific rules:

The mixedText regular expression must only be applied inside the bodyMixed rule. A GS parser implementation needs to be context-sensitive at this one point.
For boundedStr and boundedText tokens, the start boundary and the end boundary must be identical, which a regular expression cannot check. For example, in boundedText the start delimiter matches ![^"]*", and the same sequence of characters must end the token. A GS parser implementation needs a custom tokenizer for these two tokens.

Except for these two specific cases, implementing an efficient GS parser is straightforward. The GS syntax does not need speculative try-and-rewind strategies (as HTML does).

You can find a parser implementation in TypeScript here: https://github.com/generic-syntax/gs-js/blob/master/src/core/gsParser.ts.
