# Tokens
## Notation
While parsing the raw text into tokens, we will use `"c"` to specify the ASCII character `c`, `"word"` to specify the ASCII string `word`, and we will use `0x20` to mean the ASCII character associated with hexadecimal integer 0x20 (i.e. the space character). Terms placed one after the other must appear in that order. When different combinations are possible within a term we will use the syntax `["option 1" | "option 2" | "option 3"]`. We will reserve the `*` symbol to mean "0 or more repetitions", the `+` symbol to mean "1 or more repetitions", and the `?` symbol to mean "0 or 1 occurrence". To apply these symbols to multiple terms we will put them between parentheses. So `(["0" | "1"])+` means that we expect a sequence of at least 1 character, where each character must be "0" or "1".
Finally, we may specify (inclusive) ranges of characters by putting a dash between two elements. So `0x00-0x1F` represents the control characters that precede the space character, and `0x41-0x5A`, or equivalently `"A"-"Z"`, represents all uppercase letters.
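As a rough illustration, the notation maps closely onto regular expressions. The sketch below uses Python's `re` module; the names `BITS`, `UPPER` and `matches` are chosen here for illustration only and are not part of this specification.

```python
import re

# (["0" | "1"])+ : one or more characters, each of which is "0" or "1"
BITS = re.compile(r"(?:0|1)+")

# 0x41-0x5A, or equivalently "A"-"Z" : an inclusive character range
UPPER = re.compile(r"[\x41-\x5A]")


def matches(pattern: re.Pattern, text: str) -> bool:
    """Return True when the whole of `text` is matched by `pattern`."""
    return pattern.fullmatch(text) is not None
```

For example, `matches(BITS, "0110")` holds, while `matches(BITS, "")` does not, because `+` requires at least one repetition.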
## Parsing Raw Text
We define the following terms directly on the raw input text:
```
RawWS = [" " | 0x09]
RawNL = [(0x0D 0x0A) | 0x0A]
RawWSNL = RawWS | RawNL
```
That is: whitespace may be the space character or a tab character. A newline may be the Unix-style line feed character, or the Windows-style carriage return followed by a line feed.
With regard to visible character groups, we define:
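A minimal sketch of these three rules, written in Python purely for illustration. The function names are hypothetical; each returns the number of characters matched at `pos`, or 0 when the rule does not apply.

```python
def match_raw_nl(text: str, pos: int) -> int:
    """RawNL: 2 for 0x0D 0x0A, 1 for a lone 0x0A, otherwise 0."""
    if text.startswith("\r\n", pos):
        return 2
    if text.startswith("\n", pos):
        return 1
    return 0


def match_raw_ws(text: str, pos: int) -> int:
    """RawWS: 1 for a space or a tab at `pos`, otherwise 0."""
    return 1 if pos < len(text) and text[pos] in (" ", "\t") else 0


def match_raw_wsnl(text: str, pos: int) -> int:
    """RawWSNL: either of the two rules above."""
    return match_raw_ws(text, pos) or match_raw_nl(text, pos)
```

Note that under these definitions a lone carriage return (`0x0D` not followed by `0x0A`) is neither whitespace nor a newline.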
```
RawDigit = "0"-"9"
RawUpperAlpha = "A"-"Z"
RawLowerAlpha = "a"-"z"
RawVChar = 0x20-0x7E | 0x09
```
Here `RawVChar` defines a visible character. Note that it also includes the horizontal tab (`0x09`).
Finally, because the text stream ends at some point, we interpret the text stream as if it always has a `RawEOF` token at the end, indicating the end of the file.
## Parsing Unimportant Tokens
The raw text stream will be interpreted as a series of tokens. For simplicity we will call these `TkRaw`, which may be one of two variants. Firstly we define `TkUnimportant`, which we will lex but which does not further influence the parsing of the raw text stream. Secondly we have `Tk`, the token that serves as the building block for the remainder of the parsing.
```
TkRaw = TkUnimportant | Tk
TkUnimportant = TkCommentLine | TkCommentBlock | TkWS
TkCommentLine = "//" RawVChar* [RawNL | RawEOF]
TkCommentBlock = "/*" [RawVChar | RawNL]* ["*/" | RawEOF]
TkWS = RawWSNL
```
Essentially: we consider all line comments, block comments (which do not nest) and whitespace to be unimportant.
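The following Python sketch shows one way a lexer might skip a run of `TkUnimportant` tokens. It is an illustration under simplifying assumptions, not the reference implementation: the helper names are hypothetical, and the block-comment branch searches for the first `*/` without checking that every character in between is a `RawVChar` or `RawNL`.

```python
def is_raw_vchar(ch: str) -> bool:
    """RawVChar = 0x20-0x7E | 0x09."""
    return ch == "\t" or " " <= ch <= "~"


def skip_unimportant(text: str, pos: int) -> int:
    """Return the position of the first character after any TkUnimportant run."""
    n = len(text)
    while pos < n:
        if text[pos] in (" ", "\t", "\n"):
            pos += 1                              # TkWS (space, tab, line feed)
        elif text.startswith("\r\n", pos):
            pos += 2                              # TkWS (carriage return + line feed)
        elif text.startswith("//", pos):
            pos += 2                              # TkCommentLine
            while pos < n and is_raw_vchar(text[pos]):
                pos += 1
            if text.startswith("\r\n", pos):      # the comment includes its newline
                pos += 2
            elif text.startswith("\n", pos):
                pos += 1
        elif text.startswith("/*", pos):
            end = text.find("*/", pos + 2)        # first "*/" wins: no nesting
            pos = n if end == -1 else end + 2     # unterminated: runs to RawEOF
        else:
            break
    return pos
```

Because the first `*/` closes the comment, the input `/* a /* b */ c` leaves `c` outside the comment.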
## Parsing Important Tokens
There are two main types of important tokens: the variable-length tokens (e.g. an identifier or an integer literal), and the fixed-width tokens (e.g. `"!"`). We'll start with the various kinds of variable-length tokens.
### Identifiers
Identifiers are defined as:
```
IdentInit = "_" | RawUpperAlpha | RawLowerAlpha
IdentRem = IdentInit | RawDigit
TkIdent = IdentInit IdentRem*
```
That is: an identifier starts with a letter or an underscore, and is followed by a (possibly empty) sequence of letters, underscores and digits. Note that the later definition of keywords also matches the definition of identifiers. The tokenizer will prefer keywords over identifiers.
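As an illustration, `TkIdent` corresponds to a short regular expression. The Python names below (`IDENT_RE`, `match_ident`) are hypothetical.

```python
import re

# IdentInit = "_" | RawUpperAlpha | RawLowerAlpha, IdentRem = IdentInit | RawDigit
IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def match_ident(text: str, pos: int = 0):
    """Return the identifier starting at `pos`, or None if there is none."""
    m = IDENT_RE.match(text, pos)
    return m.group(0) if m else None
```

So `match_ident("_tmp1 = 3")` yields `"_tmp1"`, whereas `match_ident("9lives")` yields `None` because a digit cannot start an identifier.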
### Pragmas
A pragma is a hint to the compiler and is indicated by a pound sign. Hence we define a pragma as:
```
TkPragma = "#" IdentInit IdentRem*
```
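A pragma can be matched the same way as an identifier with a leading `#`. This is a sketch with hypothetical names; it assumes that a `#` not followed by an `IdentInit` character is left for the fixed-width tokens.

```python
import re

# TkPragma = "#" IdentInit IdentRem*
PRAGMA_RE = re.compile(r"#[A-Za-z_][A-Za-z0-9_]*")


def match_pragma(text: str, pos: int = 0):
    """Return the pragma token starting at `pos`, or None if there is none."""
    m = PRAGMA_RE.match(text, pos)
    return m.group(0) if m else None
```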
### Character
A character literal is a single character bounded by single-quote marks. It is defined as:
```
CharUnescaped = 0x20-0x26 | 0x28-0x5B | 0x5D-0x7E
CharEscaped = "\" ["r" | "n" | "t" | "0" | "\" | "'" | 0x22]
CharElement = CharUnescaped | CharEscaped
TkChar = "'" CharElement "'"
```
That is: a character is either any of the visible characters (except for the single quote, because that has to be escaped, and except for the backslash, because that introduces an escape sequence), or it is an escaped character. From left to right we have the following supported escape sequences:
1. `r`, `0x0D`: Carriage return
2. `n`, `0x0A`: Line feed (newline)
3. `t`, `0x09`: Horizontal tab
4. `0`, `0x00`: Null character
5. `\`, `0x5C`: Backslash character
6. `'`, `0x27`: Single quote character
7. `"`, `0x22`: Double quote character
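The escape table above can be sketched as a lookup from the character after the backslash to the character it denotes. The Python below is an illustration only; `ESCAPES`, `is_char_unescaped` and `decode_char_literal` are hypothetical names.

```python
# Maps the character following the backslash to the character it stands for.
ESCAPES = {
    "r": "\r", "n": "\n", "t": "\t", "0": "\0",
    "\\": "\\", "'": "'", '"': '"',
}


def is_char_unescaped(ch: str) -> bool:
    """CharUnescaped = 0x20-0x26 | 0x28-0x5B | 0x5D-0x7E."""
    return " " <= ch <= "~" and ch not in ("'", "\\")


def decode_char_literal(lit: str) -> str:
    """Return the character denoted by a TkChar, or raise ValueError."""
    if len(lit) < 3 or lit[0] != "'" or lit[-1] != "'":
        raise ValueError(f"not a character literal: {lit!r}")
    body = lit[1:-1]
    if len(body) == 1 and is_char_unescaped(body):
        return body
    if len(body) == 2 and body[0] == "\\" and body[1] in ESCAPES:
        return ESCAPES[body[1]]
    raise ValueError(f"invalid character literal: {lit!r}")
```

A double quote may appear unescaped inside a character literal, since `0x22` falls within `0x20-0x26`; the escaped form is accepted as well.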
### String Literal
A string literal is defined in essentially the same way as the character literal, except that the double quote character (rather than the single quote) must now be escaped. So it is defined as:
```
StrUnescaped = 0x20-0x21 | 0x23-0x5B | 0x5D-0x7E
StrEscaped = CharEscaped
StrElement = StrUnescaped | StrEscaped
TkStr = 0x22 StrElement* 0x22
```
Where again, 0x22 is the double quote character itself.
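A regular-expression sketch of the string literal follows. It assumes that a string may contain any number of elements, including none, and its names are hypothetical. The double quote is written as `\x22` inside the patterns to keep the Python quoting readable.

```python
import re

STR_UNESCAPED = r"[\x20-\x21\x23-\x5B\x5D-\x7E]"   # no double quote, no backslash
STR_ESCAPED = r"\\[rnt0\\'\x22]"                   # same escapes as CharEscaped
STR_RE = re.compile(r"\x22(?:" + STR_UNESCAPED + "|" + STR_ESCAPED + r")*\x22")


def match_str(text: str, pos: int = 0):
    """Return the string literal (quotes included) starting at `pos`, or None."""
    m = STR_RE.match(text, pos)
    return m.group(0) if m else None
```

Since neither `0x0A` nor `0x0D` is part of `StrUnescaped`, a literal that spans a raw line break is rejected by this sketch.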
### Integer Literal
PDL currently supports binary, octal, decimal and hexadecimal integers. These are defined as:
```
IntBinEl = "0" | "1"
IntBinElSep = IntBinEl | "_"
IntBin = "0" ["b" | "B"] IntBinElSep* IntBinEl IntBinElSep*
IntOctEl = "0"-"7"
IntOctElSep = IntOctEl | "_"
IntOct = "0" ["o" | "O"] IntOctElSep* IntOctEl IntOctElSep*
IntDecEl = "0"-"9"
IntDecElSep = IntDecEl | "_"
IntDec = IntDecEl IntDecElSep*
IntHexEl = "0"-"9" | "A"-"F" | "a"-"f"
IntHexElSep = IntHexEl | "_"
IntHex = "0" ["x" | "X"] IntHexElSep* IntHexEl IntHexElSep*
TkInt = IntBin | IntOct | IntDec | IntHex
```
For the regular decimal integers we expect the first character to be an actual digit (to prevent ambiguity with the `TkIdent` token). The remainder may be any decimal digit or the separating `_` character.
For the non-decimal integers we expect two initial characters indicating the radix of the integer. The remainder of the integer literal must then contain at least one digit from the radix's alphabet, and may otherwise contain the separating `_` character at any position.
The separating character is visually useful for the programmer (e.g. `0xDEADBEEF_CAB3CAF3_DEADC0D3_C0D3CAF3` or `0b0001_0010_0100_1000`), but does not contribute to the value of the integer literal.
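To make the rules concrete, here is a Python sketch that validates an integer literal and computes its value. `INT_RE` and `parse_int` are hypothetical names. The pattern `_*[01][01_]*` accepts the same strings as `IntBinElSep* IntBinEl IntBinElSep*`, and likewise for the other radices.

```python
import re

INT_RE = re.compile(
    r"0[bB]_*[01][01_]*"                    # IntBin
    r"|0[oO]_*[0-7][0-7_]*"                 # IntOct
    r"|0[xX]_*[0-9A-Fa-f][0-9A-Fa-f_]*"     # IntHex
    r"|[0-9][0-9_]*"                        # IntDec
)


def parse_int(lit: str) -> int:
    """Return the value of a TkInt, ignoring `_` separators."""
    if not INT_RE.fullmatch(lit):
        raise ValueError(f"not an integer literal: {lit!r}")
    digits = lit.replace("_", "")           # separators carry no value
    radix = {"b": 2, "o": 8, "x": 16}.get(digits[1:2].lower(), 10)
    return int(digits if radix == 10 else digits[2:], radix)
```

For instance, `parse_int("0b0001_0010_0100_1000")` equals `0x1248`, while `parse_int("0x_")` raises `ValueError` because no hexadecimal digit follows the prefix.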
### Boolean literals
A boolean literal is just the word "true" or "false". The interpretation of these strings takes precedence over identifiers. We define:
```
TkBool = "true" | "false"
```
### Keywords
Several sequences of characters are reserved keywords which have a special meaning within the context of importing modules or the definition of a procedure body. These keywords take precedence over the interpretation of the character sequence as an identifier. Hence, none of these keywords may be used as identifiers within the program.
```
KwLet = "let"
KwAs = "as"
KwStruct = "struct"
KwEnum = "enum"
KwUnion = "union"
KwFunc = "func"
KwPrim = "primitive"
KwComp = "composite"
KwImport = "import"
Kw = KwLet | KwAs | KwStruct |
KwEnum | KwUnion |
KwFunc | KwPrim | KwComp |
KwImport
```
For statements we have the following keywords:
```
KwChannel = "channel"
KwIf = "if"
KwElse = "else"
KwWhile = "while"
KwBreak = "break"
KwContinue = "continue"
KwGoto = "goto"
KwReturn = "return"
KwSync = "synchronous"
KwNew = "new"
Stmt = KwChannel |
KwIf | KwElse | KwWhile |
KwBreak | KwContinue | KwGoto | KwReturn |
KwSync | KwNew
```
We use these two lists to define the keyword token as:
```
TkKw = Kw | Stmt
```
Apart from these keywords the language also features several builtin methods and types. However, from the point of view of the tokenizer these are simply interpreted as identifiers. Scoping rules will ensure that they are not doubly defined. The same is true for types: several types (e.g. the basic integers) are given special attention, but we will define these later.
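One common way to give keywords and boolean literals precedence is to scan a complete identifier first and only then look the word up in a table. The Python sketch below follows that approach; it is an assumption about how a tokenizer could be organised, and `classify_word` is a hypothetical name.

```python
KEYWORDS = frozenset({
    # Kw
    "let", "as", "struct", "enum", "union",
    "func", "primitive", "composite", "import",
    # Stmt
    "channel", "if", "else", "while", "break",
    "continue", "goto", "return", "synchronous", "new",
})
BOOLS = frozenset({"true", "false"})


def classify_word(word: str) -> str:
    """Classify a complete identifier-shaped word as TkBool, TkKw or TkIdent."""
    if word in BOOLS:
        return "TkBool"
    if word in KEYWORDS:
        return "TkKw"
    return "TkIdent"
```

Looking up the whole word means that `whileloop` stays an identifier even though it begins with the keyword `while`.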
### Fixed-Width Tokens
The remaining tokens are the fixed-width punctuation tokens. Several of them share their first one or two characters. In that case we pick the longest matching sequence of characters. We have:
```
TkExcl = "!"
TkQuestion = "?"
TkPound = "#"
TkLAngle = "<"
TkLCurly = "{"
TkLParen = "("
TkLSquare = "["
TkRAngle = ">"
TkRCurly = "}"
TkRParen = ")"
TkRSquare = "]"
TkColon = ":"
TkComma = ","
TkDot = "."
TkSemiColon = ";"
TkAt = "@"
TkPlus = "+"
TkMinus = "-"
TkStar = "*"
TkSlash = "/"
TkPercent = "%"
TkCaret = "^"
TkAnd = "&"
TkOr = "|"
TkTilde = "~"
TkEqual = "="
TkColonColon = "::"
TkDotDot = ".."
TkArrowRight = "->"
TkAtEquals = "@="
TkPlusPlus = "++"
TkPlusEquals = "+="
TkMinusMinus = "--"
TkMinusEquals = "-="
TkStarEquals = "*="
TkSlashEquals = "/="
TkPercentEquals = "%="
TkCaretEquals = "^="
TkAndAnd = "&&"
TkAndEquals = "&="
TkOrOr = "||"
TkOrEquals = "|="
TkEqualEqual = "=="
TkNotEqual = "!="
TkShiftLeft = "<<"
TkLessEqual = "<="
TkShiftRight = ">>"
TkGreaterEqual = ">="
TkShiftLeftEqual = "<<="
TkShiftRightEqual = ">>="
TkPunct = ... all of the above
```
For brevity's sake, we will not actually use the identifiers above when we move on to specifying how definitions, statements and expressions are defined. The reason for specifying all of these combinations is that the tokenizer produces these tokens, and reports errors based on these tokens. Some of the tokens above are not used by the parser at all, and are merely parsed to produce reasonable error messages.
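The longest-match rule for punctuation can be sketched by trying candidates from longest to shortest. The Python below lists the same fifty strings as the definitions above; `PUNCT` and `match_punct` are hypothetical names, and a real tokenizer might well use a different strategy.

```python
PUNCT = [
    "<<=", ">>=",
    "::", "..", "->", "@=", "++", "+=", "--", "-=", "*=", "/=", "%=", "^=",
    "&&", "&=", "||", "|=", "==", "!=", "<<", "<=", ">>", ">=",
    "!", "?", "#", "<", "{", "(", "[", ">", "}", ")", "]", ":", ",", ".",
    ";", "@", "+", "-", "*", "/", "%", "^", "&", "|", "~", "=",
]

# Longest candidates first, so ">>=" is tried before ">>" and ">".
PUNCT_BY_LENGTH = sorted(PUNCT, key=len, reverse=True)


def match_punct(text: str, pos: int = 0):
    """Return the longest punctuation token starting at `pos`, or None."""
    for candidate in PUNCT_BY_LENGTH:
        if text.startswith(candidate, pos):
            return candidate
    return None
```

With this ordering, `match_punct(">>= 1")` gives `">>="`, `match_punct(">> 1")` gives `">>"`, and `match_punct("> 1")` gives `">"`.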
### Combining All Variants
Our definition for a useful token now becomes:
```
Tk = TkIdent | TkPragma | TkChar | TkStr | TkInt | TkBool | TkKw | TkPunct
```