# Tokens

## Notation

While parsing the raw text into tokens, we will use `"c"` to specify the ASCII character `c`, `"word"` to specify the ASCII string `word`, and we will use `0x20` to mean the ASCII character associated with hexadecimal integer 0x20 (i.e. the space character). Terms placed one after the other must appear in that order. When different combinations are possible within a term we will use the syntax `["option 1" | "option 2" | "option 3"]`. We will reserve the `*` symbol to mean "0 or more repetitions", the `+` symbol to mean "1 or more repetitions", and the `?` symbol to mean "0 or 1 occurrence".  To apply these symbols to multiple terms we will put them between parentheses. So `(["0" | "1"])+` means that we expect a sequence of at least 1 character, where each character must be "0" or "1".

Finally, we may specify (inclusive) ranges of characters by putting a dash between two elements. So `0x00-0x1F` represents all control characters, and `0x41-0x5A`, or equivalently `"A"-"Z"`, represents all uppercase characters.
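As an aside, this notation maps directly onto conventional regular expressions. A small Python sketch of our own (the variable names are not part of the specification):

```python
import re

# `(["0" | "1"])+` -- a sequence of at least one "0" or "1":
binary_seq = re.compile(r'[01]+')
# `0x00-0x1F` -- the inclusive range of control characters:
control_char = re.compile(r'[\x00-\x1f]')

assert binary_seq.fullmatch("0110") is not None
assert binary_seq.fullmatch("") is None       # `+` demands at least one
assert control_char.fullmatch("\x0a") is not None
assert control_char.fullmatch("A") is None
```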

## Parsing Raw Text

We will define the following terms in terms of the raw input text:

```
RawWS = [" " | 0x09]
RawNL = [(0x0D 0x0A) | 0x0A]
RawWSNL = RawWS | RawNL
```

That is: whitespace may be the space character or the tab character. A newline may be the unix-like line feed character, or the windows-like carriage return + line feed pair.
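A minimal sketch of matching `RawNL`, assuming a Python lexer (the helper name is ours):

```python
def match_raw_nl(text: str, i: int) -> int:
    """Return how many characters of a RawNL start at index i (0 if none)."""
    if text.startswith("\r\n", i):  # windows-style 0x0D 0x0A
        return 2
    if text.startswith("\n", i):    # unix-style 0x0A
        return 1
    return 0

assert match_raw_nl("a\r\nb", 1) == 2
assert match_raw_nl("a\nb", 1) == 1
assert match_raw_nl("a\rb", 1) == 0  # a lone carriage return is not a RawNL
```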

With regards to visible character groups, we may define:

```
RawDigit = "0"-"9"
RawUpperAlpha = "A"-"Z"
RawLowerAlpha = "a"-"z"
RawVChar = 0x20-0x7E | 0x09
```

Where the `RawVChar` defines a visible character; note that the grammar also admits the horizontal tab (0x09) here.

Finally, because the text stream ends at some point, we interpret it as if it always has a `RawEOF` token at the end, indicating the end of the file.

## Parsing Unimportant Tokens

The raw text stream will be interpreted as a series of tokens. For simplicity we will call these `TkRaw`, which comes in one of two variants. Firstly we have `TkUnimportant`, which we will lex but which will not influence the parsing of the remainder of the raw text stream. Secondly we have `Tk`, the token that serves as the building block for the remainder of the parsing.

```
TkRaw = TkUnimportant | Tk
TkUnimportant = TkCommentLine | TkCommentBlock | TkWS

TkCommentLine = "//" RawVChar* [RawNL | RawEOF]
TkCommentBlock = "/*" [RawVChar | RawNL]* ["*/" | RawEOF]
TkWS = RawWSNL
```

Essentially: we consider all line comments, block comments (which do not nest) and whitespace to be unimportant.
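As an illustration, skipping unimportant tokens could look as follows. This is a Python sketch of our own, not part of the specification; we leave a line comment's terminating newline to be consumed as whitespace, and an unterminated block comment is closed by end-of-file, per the grammar.

```python
import re

_UNIMPORTANT = re.compile(
    r'//[^\n]*'            # TkCommentLine (the newline itself becomes TkWS)
    r'|/\*.*?(?:\*/|\Z)'   # TkCommentBlock, terminated by */ or EOF
    r'|[ \t\r\n]+',        # TkWS
    re.DOTALL)

def skip_unimportant(text: str, i: int) -> int:
    """Advance i past any run of unimportant tokens."""
    while (m := _UNIMPORTANT.match(text, i)):
        i = m.end()
    return i

s = "  // hi\n/* a\nb */x"
assert skip_unimportant(s, 0) == s.index("x")
assert skip_unimportant("/* open", 0) == 7  # unterminated block, closed by EOF
```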

## Parsing Important Tokens

There are two main types of important tokens: the varying length tokens (e.g. an identifier or an integer literal), and the fixed-length tokens (e.g. "!"). We'll start with the various kinds of varying length tokens.

### Identifiers

Identifiers are defined as:

```
IdentInit = "_" | RawUpperAlpha | RawLowerAlpha
IdentRem = IdentInit | RawDigit
TkIdent = IdentInit IdentRem*
```

That is: an identifier starts with a letter or an underscore, and is followed by a sequence of letters, underscores and digits. Note that the later definition of keywords also matches the definition of identifiers; the tokenizer will prefer to pick keywords over identifiers.
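`TkIdent` corresponds to a conventional regular expression; a Python sketch (the pattern name is ours):

```python
import re

# IdentInit followed by zero or more IdentRem:
TK_IDENT = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')

assert TK_IDENT.fullmatch("_channel2") is not None
assert TK_IDENT.fullmatch("2fast") is None  # may not start with a digit
```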

### Pragmas

A pragma is a hint to the compiler and is indicated by a pound sign. Hence we define a pragma as:

```
TkPragma = "#" IdentInit IdentRem*
```

### Character

A character literal is a single character bounded by single-quote marks. It is defined as:

```
CharUnescaped = 0x20-0x26 | 0x28-0x5B | 0x5D-0x7E
CharEscaped = "\" ["r" | "n" | "t" | "0" | "\" | "'" | 0x22]
CharElement = CharUnescaped | CharEscaped
TkChar = "'" CharElement "'" 
```

That is: a character is any of the visible characters, except for the single-quote mark (because that delimits the literal) and the backslash character (because that is the indicator of the escaping). Or it is an escaped character. From left to right we have the following supported escape sequences:

1. `r`, `0x0D`: Carriage return
2. `n`, `0x0A`: Newline
3. `t`, `0x09`: Horizontal tab
4. `0`, `0x00`: Null character
5. `\`, `0x5C`: Backslash character
6. `'`, `0x27`: Single quote character
7. `"`, `0x22`: Double quote character
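The escape sequences above can be captured in a lookup table; a Python sketch (the names are ours):

```python
CHAR_ESCAPES = {
    "r": "\r",   # carriage return, 0x0D
    "n": "\n",   # newline, 0x0A
    "t": "\t",   # horizontal tab, 0x09
    "0": "\0",   # null character, 0x00
    "\\": "\\",  # backslash, 0x5C
    "'": "'",    # single quote, 0x27
    '"': '"',    # double quote, 0x22
}

def unescape_char(body: str) -> str:
    """Interpret the body of a TkChar (the text between the quotes)."""
    if body.startswith("\\"):
        return CHAR_ESCAPES[body[1]]
    return body

assert unescape_char("\\n") == "\n"
assert unescape_char("a") == "a"
```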

### String Literal

A string literal is defined in essentially the same way as the character literal, except that now the `"` character has to be escaped (while the `'` character does not). So it is defined as:

```
StrUnescaped = 0x20-0x21 | 0x23-0x5B | 0x5D-0x7E
StrEscaped = CharEscaped
StrElement = StrUnescaped | StrEscaped
TkStr = 0x22 StrElement* 0x22
```

Where again, 0x22 is the double quote character itself.
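As with identifiers, `TkStr` can be expressed as a single regular expression. A Python sketch of our own, assuming a string may contain any number of elements:

```python
import re

# An unescaped visible character (excluding `"` and `\`), or an escape
# sequence, repeated between double quotes:
TK_STR = re.compile(r'"(?:[\x20\x21\x23-\x5b\x5d-\x7e]|\\[rnt0\\\'"])*"')

assert TK_STR.fullmatch('""') is not None            # empty string
assert TK_STR.fullmatch('"it\'s"') is not None       # `'` need not be escaped
assert TK_STR.fullmatch('"say \\"hi\\""') is not None
assert TK_STR.fullmatch('"a " b"') is None           # unescaped `"` inside
```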

### Integer Literal

PDL currently supports binary, octal, decimal and hexadecimal integers. These are defined as:

```
IntBinEl = "0" | "1"
IntBinElSep = IntBinEl | "_"
IntBin = "0" ["b" | "B"] IntBinElSep* IntBinEl IntBinElSep*

IntOctEl = "0"-"7"
IntOctElSep = IntOctEl | "_"
IntOct = "0" ["o" | "O"] IntOctElSep* IntOctEl IntOctElSep*

IntDecEl = "0"-"9"
IntDecElSep = IntDecEl | "_"
IntDec = IntDecEl IntDecElSep*

IntHexEl = "0"-"9" | "A"-"F" | "a"-"f"
IntHexElSep = IntHexEl | "_"
IntHex = "0" ["x" | "X"] IntHexElSep* IntHexEl IntHexElSep*

TkInt = IntBin | IntOct | IntDec | IntHex
```

For the regular decimal integers we expect the first character to be an actual digit (to prevent ambiguity with the `TkIdent` token). The remainder may be any decimal digit or the separating `_` character.

For the non-decimal integers we expect two initial characters indicating the radix of the integer. The remainder of the integer literal must then consist at least once of an element in its alphabet, and the remainder may contain the separating `_` character where possible.

The separating character is visually useful for the programmer (e.g. `0xDEADBEEF_CAB3CAF3_DEADC0D3_C0D3CAF3` or `0b0001_0010_0100_1000`), but does not contribute to the interpretation of the integer literal.
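A sketch of interpreting a `TkInt`, assuming Python and dropping the separators first (the function is ours):

```python
def parse_int_literal(tk: str) -> int:
    """Interpret a TkInt; `_` separators are dropped before conversion."""
    radix = {"b": 2, "o": 8, "x": 16}
    digits = tk.replace("_", "")
    if len(digits) > 1 and digits[1].lower() in radix:
        return int(digits[2:], radix[digits[1].lower()])
    return int(digits, 10)

assert parse_int_literal("0b0001_0010") == 18
assert parse_int_literal("0xDEAD_BEEF") == 0xDEADBEEF
assert parse_int_literal("1_000") == 1000
```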

### Boolean literals

A boolean literal is just the word "true" or "false". The interpretation of these strings takes precedence over identifiers. We define:

```
TkBool = "true" | "false"
```

### Keywords

Several sequences of characters are reserved keywords which have a special meaning within the context of importing modules or the definition of a procedure body. These keywords take precedence over the interpretation of the character sequence as an identifier. Hence, none of these keywords may be used as an identifier within a program.

```
KwLet = "let"
KwAs = "as"
KwStruct = "struct"
KwEnum = "enum"
KwUnion = "union"
KwFunc = "func"
KwPrim = "primitive"
KwComp = "composite"
KwImport = "import"

Kw = KwLet | KwAs | KwStruct |
     KwEnum | KwUnion | KwFunc |
     KwPrim | KwComp | KwImport
```

For statements we have the following keywords:

```
KwChannel = "channel"
KwIf = "if"
KwElse = "else"
KwWhile = "while"
KwBreak = "break"
KwContinue = "continue"
KwGoto = "goto"
KwReturn = "return"
KwSync = "synchronous"
KwNew = "new"

Stmt = KwChannel | 
       KwIf | KwElse | KwWhile |
       KwBreak | KwContinue | KwGoto | KwReturn |
       KwSync | KwNew
```

We use these two lists to define the keyword token as:

```
TkKw = Kw | Stmt
```

Apart from these keywords the language also features several builtin methods and types. However, from the point of view of the tokenizer these are simply interpreted as identifiers. Scoping rules will ensure that they are not doubly defined. The same is true for types: several types (e.g. the basic integers) are given special attention, but we will define these later.
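The keyword-over-identifier precedence can be sketched as follows (a Python illustration of our own): lex the longest identifier-shaped run first, then reinterpret it if it happens to be a keyword or a boolean literal.

```python
import re

KEYWORDS = {
    "let", "as", "struct", "enum", "union", "func", "primitive",
    "composite", "import", "channel", "if", "else", "while", "break",
    "continue", "goto", "return", "synchronous", "new",
}
IDENT_SHAPED = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')

def classify(text: str, i: int = 0) -> tuple:
    """Return (token kind, lexeme) for the word starting at index i."""
    word = IDENT_SHAPED.match(text, i).group()
    if word in ("true", "false"):
        return ("TkBool", word)
    if word in KEYWORDS:
        return ("TkKw", word)
    return ("TkIdent", word)

assert classify("while (x)") == ("TkKw", "while")
assert classify("whiley") == ("TkIdent", "whiley")  # longest match wins
assert classify("true") == ("TkBool", "true")
```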

### Fixed-Width Tokens

The remaining tokens are the fixed-width punctuation tokens. There are several whose initial characters are identical; in that case we pick the longest matching sequence of characters. We have:

```
TkExcl = "!"
TkQuestion = "?"
TkPound = "#"

TkLAngle = "<"
TkLCurly = "{"
TkLParen = "("
TkLSquare = "["
TkRAngle = ">"
TkRCurly = "}"
TkRParen = ")"

TkRSquare = "]"
TkColon = ":"
TkComma = ","
TkDot = "."
TkSemiColon = ";"

TkAt = "@"
TkPlus = "+"
TkMinus = "-"
TkStar = "*"
TkSlash = "/"
TkPercent = "%"
TkCaret = "^"
TkAnd = "&"
TkOr = "|"
TkTilde = "~"
TkEqual = "="

TkColonColon = "::"
TkDotDot = ".."
TkArrowRight = "->"
TkAtEquals = "@="
TkPlusPlus = "++"
TkPlusEquals = "+="
TkMinusMinus = "--"
TkMinusEquals = "-="
TkStarEquals = "*="
TkSlashEquals = "/="
TkPercentEquals = "%="
TkCaretEquals = "^="
TkAndAnd = "&&"
TkAndEquals = "&="
TkOrOr = "||"
TkOrEquals = "|="
TkEqualEqual = "=="
TkNotEqual = "!="
TkShiftLeft = "<<"
TkLessEqual = "<="
TkShiftRight = ">>"
TkGreaterEqual = ">="

TkShiftLeftEqual = "<<="
TkShiftRightEqual = ">>="

TkPunct = ... all of the above
```

For brevity's sake, we will not actually use the identifiers above when we move on to specifying how definitions/statements/expressions are defined. The reason for specifying all of these combinations is that the tokenizer produces these tokens, and reports errors based on them. Some of the tokens above are not used by the parser at all, and are merely parsed to produce reasonable error messages.
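The longest-match rule can be sketched by trying candidates in order of decreasing length; a Python illustration of our own, restricted to the `<`-family of tokens for brevity:

```python
# Candidates sorted longest-first, so "<<=" wins over "<<" and "<":
PUNCT = ["<<=", "<<", "<=", "<"]

def match_punct(text, i):
    """Return the longest fixed-width token starting at index i, if any."""
    for p in PUNCT:  # try longer candidates first
        if text.startswith(p, i):
            return p
    return None

assert match_punct("<<= x", 0) == "<<="
assert match_punct("<< x", 0) == "<<"
assert match_punct("< x", 0) == "<"
```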

### Combining All Variants

Our definition for a useful token now becomes:

```
Tk = TkIdent | TkPragma | TkChar | TkStr |
     TkInt | TkBool | TkKw | TkPunct
```