Friday 3 May 2019

TOML Syntax Diagrams

I recently discovered TOML - Tom's Obvious, Minimal Language. I really like it. It comes close to JSON's simplicity but adds some features that make it kinder to humans. It also adds date-time values which makes up for a lot.

But I also really like the way JSON is documented on json.org. The syntax diagrams (sometimes called railroad diagrams) provide a concise visual representation of a syntax.

So with an online railroad diagram generator and starting from the ABNF on the TOML GitHub page, I put these together. The result has considerably fewer rules than the ABNF on GitHub. The aim was to reduce the number of diagrams, shorten the output, and make the whole thing easier to grok.

To this end, the whitespace rules that are in the ABNF have been removed completely here. Instead, the places where whitespace is allowed are described below each diagram.

7th of May, 2019 - Updated with a description of each diagram. Fixed a couple of issues.

Feedback welcome.





TOML:






TOML     ::= ( Key '=' Value | '[' Key ']' | '[[' Key ']]' )? Comment? ( Newline ( Key '=' Value | '[' Key ']' | '[[' Key ']]' )? Comment? )*



A TOML document is a table or dictionary; it consists of key-value pairs and is equivalent to a JSON object. The values themselves can be strings, numbers, booleans, dates, arrays of values, or nested tables.

It is a line-oriented document. This means that apart from multi-line strings and arrays (described later), everything happens on one line. Blank lines are OK and used to improve readability.

There are three main types of lines:
  • Lines that describe key-value pairs. This is the first option above where key and value are separated by an equals-sign.
  • Lines that set a parent key for the key-value pairs that follow. The parent key is set by putting a key in square-brackets.
  • Lines that append an element to a table entry that contains an array. The parent key of the array is put in double square brackets.
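
As a rough illustration of the three line types, here is a minimal Python sketch that classifies a single line with regular expressions. The names `KEY`, `LINE_KINDS` and `classify_line` are my own for this example, and it only handles unquoted keys, so treat it as a sketch rather than a real parser.

```python
import re

# Unquoted (possibly dotted) keys only; quoted keys are left out to keep this short.
KEY = r"[A-Za-z0-9_-]+(?:[ \t]*\.[ \t]*[A-Za-z0-9_-]+)*"

# Order matters: '[[' must be tried before '['.
LINE_KINDS = [
    ("array-of-tables", re.compile(rf"^[ \t]*\[\[[ \t]*({KEY})[ \t]*\]\]")),
    ("table", re.compile(rf"^[ \t]*\[[ \t]*({KEY})[ \t]*\]")),
    ("key-value", re.compile(rf"^[ \t]*({KEY})[ \t]*=")),
]


def classify_line(line):
    """Return (kind, key) for one line, or ("other", None) for blank/comment lines."""
    for kind, pattern in LINE_KINDS:
        match = pattern.match(line)
        if match:
            return kind, match.group(1)
    return "other", None
```

For example, `classify_line("[server.http]")` reports a table line with the key `server.http`.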




Key      ::= ( [A-Za-z0-9#x2D_]+ | QuotedString ) ( '.' ( [A-Za-z0-9#x2D_]+ | QuotedString ) )*




A key is a string containing Unicode characters. In a TOML document, unquoted keys are sequences of upper or lower case letters, digits, dashes or underscores. To contain any other character, a key must be quoted using single or double quotes.

All keys are relative to the current parent key (if it has been set in either of the square-bracket or double-square-bracket forms above), or the root if no parent key has been set yet.

A path down through a series of nested tables is indicated by separating the elements of the path by period characters. A quoted string can be used to include a period in a key itself. Spaces or tabs are allowed either side of the period.
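
To make the dotted-path idea concrete, here is a small Python sketch that splits a key into its path elements while leaving periods inside quoted elements alone. `SEGMENT` and `split_key` are hypothetical names for this example, and escape sequences inside double-quoted elements are not decoded.

```python
import re

# One path element: a bare key, a double-quoted string, or a single-quoted string,
# with optional spaces or tabs on either side.
SEGMENT = re.compile(r'''[ \t]*([A-Za-z0-9_-]+|"(?:[^"\\]|\\.)*"|'[^']*')[ \t]*''')


def split_key(key):
    """Split a dotted key into path elements, keeping quoted periods intact."""
    parts, pos = [], 0
    while True:
        match = SEGMENT.match(key, pos)
        if not match:
            raise ValueError(f"bad key element at offset {pos}: {key!r}")
        text = match.group(1)
        # Strip surrounding quotes; escapes are left undecoded in this sketch.
        parts.append(text[1:-1] if text[0] in "\"'" else text)
        pos = match.end()
        if pos == len(key):
            return parts
        if key[pos] != ".":
            raise ValueError(f"expected '.' at offset {pos}: {key!r}")
        pos += 1
```

So `site."google.com".port` yields three elements, the middle one containing a period.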





Value    ::= MultilineString

           | QuotedString

           | 'true'

           | 'false'

           | Array

           | InlineTable

           | DateTime

           | Number




As mentioned above, the values themselves can be strings, numbers, booleans, dates, arrays of values, or nested tables of values.

Booleans are simply the text 'true' or 'false'. Other value types are understood by either a token that introduces them, or the format of their data.

The different formats are described below.

Comment:





Comment  ::= '#' [#x0009#x0020-#x10FFFF]*




Comments, as in many scripting languages, start with a hash or pound character, and comments end at the end of a line. A comment can appear either on a line by itself, or at the end of a line of content.

Spaces or tabs are allowed either side of tokens. This means at the start of a line, and either side of the equals sign, the square brackets, or the double square brackets.




Newline  ::= #x000D? #x000A




The document can be Unix or DOS formatted. Lines end in either a newline character or carriage-return and newline combination.





QuotedString
         ::= '"' ( [#x0020-#x0021#x0023-#x005B#x005D-#x007E#x0080-#x10FFFF] | Escaped )* '"'

           | "'" [#x0009#x0020-#x0026#x0028-#x007E#x0080-#x10FFFF]* "'"




A quoted string is introduced by either single or double quotes and must be closed by the same character on the same line. It can contain Unicode characters, and double-quoted strings support escape sequences using backslash. Single-quoted strings don't support escape sequences and hence can't contain single quotes.

Spaces or tabs within a string are preserved.
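
The two alternatives translate fairly directly into regular expressions. The following Python sketch copies the character ranges from the rule above; the names `BASIC`, `LITERAL` and `string_kind` are mine, chosen only for this example.

```python
import re

# Double-quoted: any allowed character, or a backslash escape from the Escaped rule.
BASIC = re.compile(
    r'"(?:[\x20-\x21\x23-\x5B\x5D-\x7E\x80-\U0010FFFF]'
    r'|\\(?:["\\bfnrt]|u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8}))*"'
)

# Single-quoted: no escapes at all, so a single quote can never appear inside.
LITERAL = re.compile(r"'[\x09\x20-\x26\x28-\x7E\x80-\U0010FFFF]*'")


def string_kind(text):
    """Return "double-quoted", "single-quoted", or None if text is neither."""
    if BASIC.fullmatch(text):
        return "double-quoted"
    if LITERAL.fullmatch(text):
        return "single-quoted"
    return None
```

Note how a backslash is just an ordinary character in the single-quoted form, which makes it handy for Windows paths and regular expressions.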





MultilineString
         ::= '"""' ( [#x0020-#x005B#x005D-#x007E#x0080-#x10FFFF] | Escaped | '\'? Newline )* '"""'

           | "'''" ( [#x0009#x0020-#x007E#x0080-#x10FFFF] | Newline )* "'''"




Multiline strings, as the name suggests, allow a string to appear over a number of lines. They start with either a triple double quote or a triple single quote, and end when a matching triple appears.

A backslash at the end of a line consumes all following newlines, spaces, or tabs.

A newline immediately after the opening token is trimmed. All other newlines, spaces, and tabs inside the string are preserved.





Escaped  ::= '\' ( '"' | '\' | 'b' | 'f' | 'n' | 'r' | 't' | 'uXXXX' | 'UXXXXXXXX' )




Escape sequences in double-quoted strings and multiline strings are introduced with a backslash. The set is similar to the one in JSON, except that there is no escape for forward slash. There is also the ability to specify a 32-bit Unicode code point using upper case U and eight hex characters.

No spaces or tabs are allowed after the backslash.
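
Here is a minimal Python sketch of decoding these escape sequences in the body of a double-quoted string. `SIMPLE`, `ESCAPE` and `unescape` are names I made up for the example; a backslash followed by anything outside the rule is left untouched rather than reported as an error, which a real parser would not do.

```python
import re

# Single-character escapes from the Escaped rule and what they stand for.
SIMPLE = {'"': '"', "\\": "\\", "b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t"}

# A backslash followed by a simple escape, uXXXX, or UXXXXXXXX.
ESCAPE = re.compile(r'\\(?:(["\\bfnrt])|u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8}))')


def unescape(body):
    """Decode the escape sequences in the body of a double-quoted string."""
    def replace(match):
        simple, short, long_ = match.groups()
        if simple:
            return SIMPLE[simple]
        # Both Unicode forms are hex code points of different lengths.
        return chr(int(short or long_, 16))

    return ESCAPE.sub(replace, body)
```

Because the substitution scans left to right, an escaped backslash followed by `n` stays a backslash and an `n`, not a newline.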


Array:




Array    ::= '[' ArrayComment ( Value ( ',' ArrayComment Value )* ','? ArrayComment )? ']'




Arrays of values are introduced by a left square bracket.

Values in the array are separated by commas, and each comma can be followed by a comment and/or a newline. The final value in the array is allowed to be followed by a comma, which is ignored.

Newlines, spaces, or tabs can follow the opening square bracket and precede the closing square bracket.


ArrayComment:





ArrayComment
         ::= ( Comment? Newline )*









InlineTable
         ::= '{' ( Key '=' Value ( ',' Key '=' Value )* ','? )? '}'




An inline table allows a number of key-value pairs in a table to be specified on a single line. It is surrounded by curly braces and the key-value pairs are separated by commas.

Within each pair, the key and value are separated by an equals sign.

Spaces or tabs can appear around the curly braces, commas, and equals signs.




DateTime ::= 'YYYY-MM-DD' ( ( 'T' | ' ' ) 'HH:NN:SS' ( '.' [0-9]+ )? ( 'Z' | ( '+' | '-' ) 'HH:NN' )? )?

           | 'HH:NN:SS' ( '.' [0-9]+ )?




DateTime values are either absolute or relative to some unspecified entity, depending on whether a timezone is specified. A time of day value (irrespective of date) can also be specified; in this case it is always 'relative'.

The start of a DateTime or relative time is recognised by there being four digits followed by a minus, or two digits followed by a colon. A date can optionally be followed by a time, which is separated from the date by an upper case T or a space. Fractional seconds can be given in the time.

The timezone can be given as upper case Z meaning UTC, or as an offset given as plus or minus character and a value in hours and minutes.

No spaces or tabs other than the separator between the date and time are allowed.
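
The following Python sketch recognises the forms described above and reports which one it saw, using the absolute/relative wording from this post. The names `DATE`, `TIME`, `OFFSET`, `DATETIME` and `datetime_kind` are mine, fractional seconds are treated as optional everywhere, and the digits are only checked for shape, not for being a real calendar date.

```python
import re

DATE = r"\d{4}-\d{2}-\d{2}"
TIME = r"\d{2}:\d{2}:\d{2}(?:\.\d+)?"   # fractional seconds are optional
OFFSET = r"(?:Z|[+-]\d{2}:\d{2})"

# Either a date with an optional time and offset, or a time of day on its own.
DATETIME = re.compile(
    rf"(?P<date>{DATE})(?:[T ](?P<time>{TIME})(?P<offset>{OFFSET})?)?"
    rf"|(?P<local_time>{TIME})"
)


def datetime_kind(text):
    """Classify a DateTime token, or return None if it does not match."""
    match = DATETIME.fullmatch(text)
    if not match:
        return None
    if match["local_time"]:
        return "relative time"
    if match["offset"]:
        return "absolute date-time"
    if match["time"]:
        return "relative date-time"
    return "relative date"
```

A date followed by a time without seconds, such as `1979-05-27T07:32`, is rejected because the time part must be complete.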




Number   ::= ( '+' | '-' )? ( ( '0' | [1-9] ( '_'? [0-9] )* ) ( '.' [0-9] ( '_'? [0-9] )* )? ( 'e' ( '+' | '-' )? [0-9] ( '_'? [0-9] )* )? | 'inf' | 'nan' )

           | '0x' [0-9A-F] ( '_'? [0-9A-F] )*

           | '0o' [0-7] ( '_'? [0-7] )*

           | '0b' [0-1] ( '_'? [0-1] )*




Numeric values are either decimal, hexadecimal, octal, or binary depending on how they start. Decimals start with a digit, a plus character, or a minus character. A decimal value can have an optional fractional part, introduced by a period, and an exponent part introduced by 'e'. The exponent can be positive or negative.

Hexadecimal, octal and binary values are always integers. Decimal values with no fraction or exponent are considered integers. Otherwise they are floating point values. There are also the special floating point values for infinity, given as 'inf', and not-a-number given as 'nan'.

Space and tab characters are not allowed in numbers although, as in Ruby, any pair of adjacent digits can be separated by an underscore to provide spacing.
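
As a sketch of the integer/float distinction, the Python below classifies a number token. It follows the description above, so it assumes the fraction and the exponent are both optional and that a single digit is a valid integer. `DIGITS`, `DECIMAL`, `PREFIXED` and `number_kind` are names invented for this example.

```python
import re

# A run of digits in which any two neighbours may be separated by one underscore.
DIGITS = r"[0-9](?:_?[0-9])*"

DECIMAL = re.compile(
    rf"[+-]?(?:(?P<int>0|[1-9](?:_?[0-9])*)"
    rf"(?P<frac>\.{DIGITS})?(?P<exp>e[+-]?{DIGITS})?|inf|nan)"
)

# Hexadecimal, octal and binary forms take no sign, fraction or exponent.
PREFIXED = re.compile(
    r"0x[0-9A-F](?:_?[0-9A-F])*|0o[0-7](?:_?[0-7])*|0b[01](?:_?[01])*"
)


def number_kind(text):
    """Return "integer", "float", or None if text is not a number token."""
    if PREFIXED.fullmatch(text):
        return "integer"
    match = DECIMAL.fullmatch(text)
    if not match:
        return None
    # inf/nan leave the "int" group unset; a fraction or exponent means float.
    if match["int"] is None or match["frac"] or match["exp"]:
        return "float"
    return "integer"
```

Leading zeros and doubled underscores fall out of the pattern naturally: neither `012` nor `1__0` matches.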





 

... generated by Railroad Diagram Generator

The following is the source EBNF I gave to the Railroad Diagram Generator.


TOML ::= ( 
  ( Key '=' Value | '[' Key ']' | '[[' Key ']]') )?
  Comment? 
  ( Newline ( 
    ( Key '=' Value | '[' Key ']' | '[[' Key ']]') )? Comment?
  )*

Key ::= ( ( [A-Za-z0-9-_] )+ | QuotedString ) ( '.' Key )?

Value ::= MultilineString | QuotedString | 'true' | 'false'
        | Array | InlineTable | DateTime | Number

Comment ::= '#' [#x0009#x0020-#x10FFFF]*

Newline ::= ( #x0D )? #x0A 

QuotedString ::= ( '"' ( [#x20-#x21#x23-#x5B#x5D-#x7E#x80-#x10FFFF] | Escaped )* '"' ) |
( "'" ( [#x09#x20-#x26#x28-#x7E#x80-#x10FFFF] )* "'" )

MultilineString ::= ( '"""' ( ( [#x20-#x5B#x5D-#x7E#x80-#x10FFFF] | Escaped ) | Newline | ( '\' Newline ) )* '"""' ) |
( "'''" ( [#x09#x20-#x7E#x80-#x10FFFF] | Newline )* "'''" )

Escaped ::= '\' ( '"' | '\' | 'b' | 'f' | 'n' | 'r' | 't' | 'uXXXX' | 'UXXXXXXXX' )

Array ::= '[' ( ArrayVals )? ArrayComment ']' 
ArrayVals ::= ( ArrayComment Value ',' ArrayVals?  ) | ( ArrayComment Value )
ArrayComment ::= ( ( Comment )? Newline )*

InlineTable ::= '{' ( InlineTableKeyvals )?  '}'
InlineTableKeyvals ::= Key '=' Value ( ',' InlineTableKeyvals? )?

DateTime ::= 'YYYY-MM-DD' ( ( 'T' | ' ' ) 'HH:NN:SS' ( '.' [0-9]+ )? ( 'Z' | ( '+' | '-' ) 'HH:NN' )? )?
           | 'HH:NN:SS' ( '.' [0-9]+ )?

Number ::= 
( '+' | '-' )? (
    ( '0' | [1-9] ( '_'? [0-9] )* )
    ( '.' [0-9] ( '_'? [0-9] )* )?
    ( 'e' ( '+' | '-' )? [0-9] ( '_'? [0-9] )* )?
  |
 'inf' |
 'nan' ) |
'0x'  [0-9A-F] ( '_'? [0-9A-F] )* |
'0o' [0-7] ( '_'? [0-7] )* |
'0b' [0-1] ( '_'? [0-1] )*
