Changeset - 8e4c0fc8b4df
mh - 4 years ago 2021-06-03 09:44:50
contact@maxhenger.nl
Initial commit of language spec
7 files changed with 932 insertions and 291 deletions:
docs/spec/definitions.md
 
new file 100644
 
# Declarations
 

	
 
A module is a text stream that is convertible into a token stream. This token stream may contain declarations. At the top level of the module we expect four types of declarations: pragmas, imports, type definitions and procedure definitions. We'll specify each of these in this file in terms of the tokens (where appropriate).
 

	
 
## Types and Identifiers
 

	
 
Throughout the various declarations we will use the following building blocks. Some of these are exactly equal to one another; their meaning arises from the context in which they're used.
 

	
 
```
 
Ident = TkIdent
 
Path = TkIdent "::"
 

	
 
TypeName = TkIdent
 
TypeRef = Path* TkIdent ("<" (TypeRef ",")* TypeRef? ">")?
 

	
 
ModuleName = (TkIdent ".")* TkIdent
 
```
 

	
 
So a type's name is a single identifier (which may be defined with polymorphic variables). But the reference to a type may exist in some other module, so we can have a path prefix to it (e.g. `other_module::type_name`). The optionally postfixed bit with the angular brackets is the specification of the polymorphic arguments (e.g. `other_module::some_type<u32, yet_another_module::another_type<s32>>`). The pattern `(Thing ",")* Thing?` appears quite often, and is a comma-separated list of `Thing`s which may have a trailing comma.
 

	
 
## Pragmas
 

	
 
Only two types of pragmas are currently supported: one to specify the module's name and one to specify the module's version:
 

	
 
```
 
PragmaName = "#module" ModuleName RawNL
 
PragmaVersion = "#version" TkInt RawNL
 

	
 
DeclPragma = PragmaName | PragmaVersion
 
```
 

	
 
Note that the pragmas are terminated by a newline. Hence we can have a module name defined like `#module some_name` or `#module some.dot.separated.name`, and a version is simply `#version 123` or even `#version 0x00_01_00_A2`.
 

	
 
## Imports
 

	
 
An import statement is followed by a module name, optionally combined with an alias, specific symbols to import, or a wildcard. These are specified as:
 

	
 
```
 
ImportModule = KwImport ModuleName (KwAs Ident)? ";"
 
ImportWildcard = KwImport ModuleName "::" "*" ";"
 

	
 
ImportedSymbol = TypeName (KwAs Ident)?
 
ImportOneSymbol = KwImport ModuleName "::" ImportedSymbol ";"
 
ImportMultipleSymbols = KwImport ModuleName "::" "{"
 
        (ImportedSymbol ",")* ImportedSymbol?
 
    "}" ";" 
 

	
 
DeclImport = ImportModule | ImportWildcard | ImportOneSymbol | ImportMultipleSymbols
 
```
 

	
 
The `ImportModule` is the import of a particular module where one may optionally specify an alias (e.g. `import some_module;` or `import some_module as some_alias;`). Whenever there is a dot-separated module name, we take the last section as its alias. So if module `Foo.Foozle.Wozzle` contains a type `Bar`, then if we `import Foo.Foozle.Wozzle;`, we refer to `Bar` as `Wozzle::Bar` (**note**: this is an initial implementation without much thought about its practical use; the default alias rule may change in the future).
 

	
 
The `ImportWildcard` is the import of all *defined* symbols (i.e. not the symbols the imported module imports itself) within the module, such that we can refer to them directly. Taking the above example, if we write `import Foo.Foozle.Wozzle::*;`, then we may directly refer to `Bar` in the importing module.
 

	
 
Finally, one may also import particular symbols from a module. One may import a single symbol with an optional alias with `import SomeModule::Symbol;` or `import SomeModule::Symbol as Alias;`. Or one may import multiple symbols using `import SomeModule::{One as AnAlias, Two, Three as AnotherAlias};`.
 

	
 
## Enum Type Definitions
 

	
 
An enum acts much like a C enumeration. It is a collection of type-safe identifiers associated with an integer. It is defined as:
 

	
 
```
 
PolyVars = "<" (Ident ",")* Ident ">"
 
EnumVariantDef = Ident ("=" TkInt)?
 
DeclEnum = KwEnum Ident PolyVars? "{"
 
        (EnumVariantDef ",")* EnumVariantDef?
 
    "}"
 
```
 

	
 
Hence we may write a non-polymorphic enumeration as:
 

	
 
```
 
enum FileStatus {
 
    Ok = 0,
 
    DoesNotExist,
 
    NotPermitted,
 
    FailedReading = 0xFF,
 
}
 
```
 

	
 
And a polymorphic enumeration as:
 

	
 
```
 
enum Handle<T> {
 
    Present,
 
    NotPresent
 
}
 
```
 

	
 
We'll mention in passing that polymorphic variables of an enum type can never be used in the body of the enum type. But they're allowed for now, as they might be used in a useful way within struct or procedure declarations. **note**: this decision might change in the future.
 

	
 
## Union Type Definitions
 

	
 
A union in PDL is a tagged union. The tag itself cannot be decided by the programmer. A union variant may act like an enum variant in that it need not embed any values. A union differs from an enum in that its variants may contain an arbitrary number of embedded types. A union definition is specified as:
 

	
 
```
 
UnionVariantDef = Ident ("(" (TypeRef ",")* TypeRef? ")")?
 
DeclUnion = KwUnion Ident PolyVars? "{"
 
        (UnionVariantDef ",")* UnionVariantDef?
 
    "}"
 
```
 

	
 
Hence we may write a non-polymorphic union as:
 

	
 
```
 
union IntegerAndBool {
 
    UnsignedSmall(u8, bool),
 
    UnsignedLarge(u64, bool),
 
    SignedSmall(s8, bool),
 
    SignedLarge(s64, bool),
 
    VeryLarge(some_module::BigIntType, bool)
 
}
 
```
 

	
 
And some classic examples of polymorphic unions are:
 

	
 
```
 
union Option<T> {
 
    Some(T),
 
    None
 
}
 

	
 
union Result<T, E> {
 
    Ok(T),
 
    Error(E)
 
}
 
```
 

	
 
## Struct Type Definitions
 

	
 
A struct type is a collection of named fields with a particular type. It is rather similar to structs/records in many other languages. A struct definition is specified as:
 

	
 
```
 
StructFieldDef = TypeRef Ident
 
DeclStruct = KwStruct Ident PolyVars? "{"
 
        (StructFieldDef ",")* StructFieldDef?
 
    "}"
 
```
 

	
 
Hence we may write a non-polymorphic struct as:
 

	
 
```
 
struct Integer3D {
 
    u32 x,
 
    u32 y,
 
    u32 z,
 
}
 

	
 
// Or alternatively:
 
struct Integer3D { u32 x, u32 y, u32 z }
 
```
 

	
 
And two possible polymorphic structs as:
 

	
 
```
 
struct Point3D<T> { T x, T y, T z }
 
struct Pair<L, R> {
 
    L left,
 
    R right,
 
}
 
```
 

	
 
## Function Definition
 

	
 
A function is a callable subprocedure. It is defined as:
 

	
 
```
 
FuncArgDef = TypeRef Ident
 
FuncRetDef = TypeRef
 
DeclFunc = KwFunc Ident PolyVars? "("
 
        (FuncArgDef ",")* FuncArgDef?
 
    ")" "->" FuncRetDef
 
    StmtBlock
 
```
 

	
 
Hence we can have polymorphic and non-polymorphic functions with zero or more arguments and a single return type (**note**: the language will likely feature multiple return types in one form or another in the future). The `StmtBlock` will be introduced in the section on statements.
 

	
 
Two examples of functions:
 

	
 
```
 
func PickAnInt(u32 left, u32 right, bool choose_left) -> u32 {
 
    if (choose_left) {
 
        return left;
 
    } else {
 
        return right;
 
    }
 
}
 

	
 
func PickASide<T>(Pair<T, T> pair, bool choose_left) -> T {
 
    if (choose_left) {
 
        return pair.left;
 
    } else {
 
        return pair.right;
 
    }
 
}
 
```
 

	
 
## Component Definition
 

	
 
A component definition is the definition of a Reo connector. Currently there are two variants. The meaning of these variants will be introduced in a later section. A component is defined much in the same way as a function, except that it doesn't have a return type. It is defined as:
 

	
 
```
 
CompVariant = KwPrim | KwComp
 
DeclComp = CompVariant Ident PolyVars? "("
 
        (FuncArgDef ",")* FuncArgDef?
 
    ")"
 
    StmtBlock
 
```
 

	
 
We'll delay examples of components until we've introduced the concept of connectors in this document.
 

	
 
## Combining all Declarations
 

	
 
A module is the combination of any of the above declarations. So we may specify the possible contents of a module as:
 

	
 
```
 
Module = (
 
        DeclPragma | 
 
        DeclImport | 
 
        DeclEnum | 
 
        DeclUnion | 
 
        DeclStruct |
 
        DeclFunc |
 
        DeclComp
 
    )*
 
```
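
As a sketch of how these declarations may be combined into a module (the module name, the imported module and the procedure below are hypothetical, reusing only constructs introduced above):

```
#module examples.shapes
#version 1

import std.math as math;

struct Point3D<T> { T x, T y, T z }

func Origin() -> Point3D<u32> {
    return Point3D{ x: 0, y: 0, z: 0 };
}
```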
 
docs/spec/expressions.md
 
new file 100644
 
# Expressions
 

	
 
Expressions are the parts that compute values (and optionally assign them).
 

	
 
**note:** We could write a recursive-descent style formal specification, but I think it is rather ugly. We could also specify the generic case, and then simply state what the precedence rules are?
 

	
 
## Assignment Expression
 

	
 
An assignment expression is a right-to-left associative binary operator that evaluates the right hand side expression and assigns it to the left hand side. It is defined as:
 

	
 
```
 
AssignOp = 
 
    "=" |
 
    "@=" |
 
    "*=" | "/=" | "%=" | "+=" | "-=" |
 
    "<<=" | ">>=" | 
 
    "&=" | "^=" | "|="
 
ExprAssign = ExprConditional (AssignOp ExprConditional)?
 
```
 

	
 
Where the assignment operators are:
 
- `=`: Set assignment: assign right hand side to left hand side.
 
- `@=`: Concatenate assignment: concatenates right hand side to left hand side.
 
- `*=`: Multiply: multiplies the left hand side with the value on the right hand side.
 
- `/=`: Divide: divides the left hand side by the number on the right hand side.
 
- `%=`: Remainder: Applies modulo operator directly to left hand side using the divisor on the right hand side.
 
- `+=`: Addition: adds right hand side to left hand side.
 
- `-=`: Subtraction: subtracts right hand side from left hand side.
 
- `<<=`: Shift left: Bitshifts the number on the left hand side, towards the left, by the number of bits on the right hand side.
 
- `>>=`: Shift right: Bitshifts the number on the left hand side, towards the right, by the number of bits on the right hand side.
 
- `&=`: Bitwise and: Performs a bitwise `and` on the left hand side using the integer on the right hand side.
 
- `^=`: Bitwise xor: Performs a bitwise `xor` on the left hand side using the integer on the right hand side.
 
- `|=`: Bitwise or: Performs a bitwise `or` on the left hand side using the integer on the right hand side.
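
A brief sketch of a few of these operators in use (assuming the local variable and expression statements defined in statements.md):

```
u32 value = 1;
value += 4;     // addition assignment: value is now 5
value <<= 2;    // shift-left assignment: value is now 20
value &= 0xFF;  // bitwise-and assignment: value is still 20
```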
 

	
 
## Conditional Expression
 

	
 
A conditional expression is an expression that contains a test expression and two nested subexpressions. If the test expression evaluates to true, then the first nested subexpression is evaluated. Otherwise the second nested subexpression is evaluated.
 

	
 
It is defined as:
 

	
 
```
 
ExprConditional = ExprBinConcat ("?" Expr ":" Expr)?
 
```
 

	
 
Due to its formulation one can see that conditional expressions may be nested in the two branch subexpressions, but not in the test expression.
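
For example (a sketch, reusing the sign-selection example from the prefix operator section below and assuming `x` is an in-scope integer):

```
s32 sign = (x > 0) ? +1 : -1;
```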
 

	
 
## Binary Operator Expressions
 

	
 
Binary operators apply an operation to its two subexpressions. We have the following possible binary operators:
 

	
 
```
 
BinaryOp = 
 
    "@" |
 
    "||" | "&&" | 
 
    "|" | "^" | "&" |
 
    "==" | "!=" |
 
    "<" | ">" | "<=" | ">=" |
 
    "<<" | ">>" |
 
    "+" | "-" | "*" | "/" | "%"
 
```
 

	
 
Where we have:
 
- `@`: Concatenation
 
- `||`: Logical or
 
- `&&`: Logical and
 
- `|`: Bitwise or
 
- `^`: Bitwise xor
 
- `&`: Bitwise and
 
- `==`: Equality test
 
- `!=`: Inequality test
 
- `<`: Less-than comparison
 
- `>`: Greater-than comparison
 
- `<=`: Less-than-or-equal-to comparison
 
- `>=`: Greater-than-or-equal-to comparison
 
- `<<`: Bitshift left
 
- `>>`: Bitshift right
 
- `+`: Addition
 
- `-`: Subtraction
 
- `*`: Multiplication
 
- `/`: Division
 
- `%`: Remainder/modulo
 

	
 
We'll define the particular kinds of expressions that use these operators (and remaining kinds of operators in expressions) in such an order that the precedence rules are satisfied. The first in the chain has the lowest precedence.
 

	
 
```
 
ExprBinConcat = ExprBinLogicalOr ("@" ExprBinLogicalOr)*
 
ExprBinLogicalOr = ExprBinLogicalAnd ("||" ExprBinLogicalAnd)*
 
ExprBinLogicalAnd = ExprBinBitwiseOr ("&&" ExprBinBitwiseOr)*
 
ExprBinBitwiseOr = ExprBinBitwiseXor ("|" ExprBinBitwiseXor)*
 
ExprBinBitwiseXor = ExprBinBitwiseAnd ("^" ExprBinBitwiseAnd)*
 
ExprBinBitwiseAnd = ExprBinEquality ("&" ExprBinEquality)*
 
ExprBinEquality = ExprBinRelational (["==" | "!="] ExprBinRelational)*
 
ExprBinRelational = ExprBinShift (["<" | ">" | "<=" | ">="] ExprBinShift)*
 
ExprBinShift = ExprBinAddSub (["<<" | ">>"] ExprBinAddSub)*
 
ExprBinAddSub = ExprBinMulDivMod (["+" | "-"] ExprBinMulDivMod)*
 
ExprBinMulDivMod = ExprPrefix (["*" | "/" | "%"] ExprPrefix)*
 
```
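
As an illustration of these precedence rules (a sketch):

```
u32 a = 2 + 3 * 4;            // parsed as 2 + (3 * 4), yielding 14
bool b = 1 + 1 == 2 && true;  // parsed as ((1 + 1) == 2) && true
```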
 

	
 
## Prefix Operator Expression
 

	
 
Prefix expressions apply their operator to the expression that follows them. Prefix expressions are evaluated right-to-left. The following prefix operators are supported:
 

	
 
```
 
PrefixOp = "+" | "-" | "~" | "!"
 
```
 

	
 
Where we have:
 
- `+`: Positive. Essentially meaningless in this rather strictly typed language, but supported for completeness (and sometimes it looks visually appealing in e.g. `sign = (x > 0) ? +1 : -1`).
 
- `-`: Negation.
 
- `~`: Bitwise not.
 
- `!`: Logical not.
 

	
 
Which are used to define the prefix expression itself as:
 

	
 
```
 
ExprPrefix = (PrefixOp ExprPrefix) | ExprPostfix
 
```
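
For example (a sketch, assuming the variables `is_done` and `offset` are in scope):

```
bool inverted = !is_done;   // logical not
s32 negated = -offset;      // negation
u32 mask = ~0b0000_1111;    // bitwise not
```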
 

	
 
## Postfix Operator Expression
 

	
 
Postfix operations are slightly more involved, because the postfixes take arguments following a particular pattern. We have the following possible postfix expressions:
 

	
 
```
 
PostfixOp = 
 
    ("[" Expr "]") |
 
    ("[" Expr ".." Expr "]") |
 
    ("." Ident)
 
```
 

	
 
Where we have:
 

	
 
- `[a_number]`: Indexing expression, can be applied to arrays and strings.
 
- `[lower_idx..higher_idx]`: Slicing expression, takes a range of values from an array or string.
 
- `.a_field`: Select expression, can only be applied to struct values and returns the selected field's value.
 

	
 
Which are used in:
 

	
 
```
 
ExprPostfix = ExprPrimary PostfixOp*
 
```
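
For example (a sketch, assuming `values` is an array variable and `point` a struct variable; `auto` is the inferred-type placeholder mentioned in validation.md, and its use for local variables is an assumption here):

```
u32 first = values[0];       // indexing
auto middle = values[1..3];  // slicing
u32 x_coord = point.x;       // field selection
```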
 

	
 
## Primary Expressions
 

	
 
Primary expressions are the ones with the highest precedence. We have the variants:
 

	
 
```
 
ExprParens = "(" Expr ")"
 

	
 
ExprLitArray = "{" (Expr ",")* Expr? "}"
 
ExprLitInt = TkInt
 
ExprLitString = TkStr
 
ExprLitChar = TkChar
 
ExprLitBool = TkBool
 

	
 
ExprLitStructField = Ident ":" Expr
 
ExprLitStruct = TypeRef "{"
 
        (ExprLitStructField ",")* ExprLitStructField?
 
    "}"
 
ExprLitEnum = TypeRef "::" Ident
 
ExprLitUnion = TypeRef "::" Ident ("(" (Expr ",")* Expr? ")")?
 

	
 
ExprBinding = KwLet ExprPrefix "=" ExprPrefix
 

	
 
ExprCall = TypeRef "(" (Expr ",")* Expr? ")"
 
ExprCast = "cast" PolyArgs? "(" Expr ")"
 

	
 
ExprVar = Ident
 
```
 

	
 
Here we find, consecutively: 
 
- The parenthesized expression, indicating its contents should be evaluated at a higher precedence.
 
- The various constructable literals.
 
- The binding expression, where the values on the right hand side of the equality are bound to the binding variables on the left hand side.
 
- Calling of functions (and components).
 
- Casting of values.
 
- Reference to a variable.
 

	
 
Naturally the validity of each of these expressions depends strongly on the context in which they're used and their arguments. Likewise for deciding which expression should be used (e.g. an enum variant and a union variant without embedded values differ only in the fact that their type has been defined using a different keyword). We'll come back to all of these rules in a later section of this document.
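
A few primary expressions as a sketch, using the example types and functions from definitions.md (`auto` is again the inferred-type placeholder, and its use for local variables is an assumption):

```
auto point  = Point3D<u32>{ x: 1, y: 2, z: 3 };                // struct literal
auto status = FileStatus::Ok;                                  // enum literal
auto opt    = Option<u32>::Some(5);                            // union literal
auto values = { 1, 2, 3 };                                     // array literal
auto side   = PickASide<u32>(Pair{ left: 1, right: 2 }, true); // function call
auto wide   = cast<u64>(point.x);                              // cast expression
```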
 

	
 
For now, we can group all of these possibilities together as:
 

	
 
```
 
ExprPrimary = 
 
    ExprParens | 
 
    ExprLitArray | ExprLitInt | ExprLitString | ExprLitChar | ExprLitBool |
 
    ExprLitStruct | ExprLitEnum | ExprLitUnion |
 
    ExprBinding | ExprCall | ExprCast | 
 
    ExprVar
 
```
 

	
 
## Putting Everything Together
 

	
 
Now that we have a chain of expressions defined, each depending on the next one in the chain, we can finish by simply defining the basic `Expr` as:
 

	
 
```
 
Expr = ExprAssign
 
```
 
docs/spec/overview.md
 
new file 100644
 
# Protocol Description Language Specification
 

	
 
The Protocol Description Language (PDL) will be specified in this document. There are several levels to the specification. To simplify the specification we will roughly follow the manner in which the language is lexed and parsed by the compiler. We begin by assuming a set of source code streams/files. The specification here assumes that the source file is ASCII-encoded, but the lexer will throw human-readable errors in case it isn't. The source file is lexed into tokens, where each token has a specific meaning within the language. Although the compiler has tokens for line- and block-comments, these are not important for the parsing. Likewise, apart from the fact that it might serve as a separator between tokens, whitespace is ignored by the tokenizer.
 

	
 
Having defined the possible tokens, we will assume that comments and whitespace no longer factor into the interpretation of the source code, and continue defining the layout of a single source file. Such a source file consists of pragmas, imports, type definitions, and procedure definitions. These are all specified in terms of tokens. Procedure definitions may contain statements, which will be specified next. Finally we will specify the expressions that may appear within these statements.
 

	
 
Note that the entire language is not yet fully set in stone. Parts of the syntax and its interpretation may change.
 

	
 
Please view the following documents (in roughly this order):
 

	
 
1. tokens.md
 
2. definitions.md
 
3. statements.md
 
4. expressions.md
 
docs/spec/statements.md
 
new file 100644
 
# Statements
 

	
 
Alongside expressions, statements are the main building block of the imperative procedure bodies. We'll define all of the possible statements in this document, and combine them at the end of this chapter into a `Stmt`.
 

	
 
## Block Statement
 

	
 
A block statement is a list of statements wrapped in curly braces. It has the additional meaning of scoping all of its defined variables. That is: it can access variables from its outer scopes, but none of the outer scopes can access the variables defined inside the block statement. We'll expand on the scoping rules in a later chapter.
 

	
 
A block statement is defined as:
 

	
 
```
 
StmtBlock = "{" Stmt* "}"
 
```
 

	
 
## If Statement
 

	
 
An if-statement contains a test expression that is evaluated and should resolve to a boolean. If that boolean is true then we take the "then" branch of the if-statement. If that boolean is false then we take the "else" branch if it is defined by the programmer, and otherwise simply skip the "then" branch. An if statement is defined as:
 

	
 
```
 
StmtIf = 
 
    KwIf "(" Expr ")" Stmt
 
    (KwElse Stmt)?
 
```
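
For example (a sketch, assuming `counter` is a `u32` variable in scope):

```
if (counter == 10) {
    counter = 0;
} else {
    counter += 1;
}
```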
 

	
 
## While Statement
 

	
 
A while statement is a loop that continues executing based on a test expression. If the test expression is true, then we enter the inner statement and execute it. If the test expression is false then we go to the next statement. If we have entered the inner statement, then upon reaching its end we go back to the while statement and re-evaluate the test expression. A while statement is defined as:
 

	
 
```
 
StmtWhile = KwWhile "(" Expr ")" Stmt
 
```
 

	
 
## Labeled Statement
 

	
 
A labeled statement attaches a label to a particular nested statement. This label can be used for goto-statements. Secondly, if the label is placed before a loop statement, then one may use break- and continue-statements to refer to that particular loop.
 

	
 
```
 
Label = Ident
 
StmtLabeled = Label ":" Stmt
 
```
 

	
 
## Break Statement
 

	
 
A break statement halts execution of a loop body and jumps to the statement coming after the loop statement. It may optionally have a label to break out of a particular loop. It is defined as:
 

	
 
```
 
StmtBreak = KwBreak Label? ";"
 
```
 

	
 
Some (contrived) examples:
 

	
 
```
 
u32 counter = 0;
 
while (true) {
 
    if (counter == 10) break;
 
    counter += 1;
 
}
 

	
 
u32 i = 0;
 
u32 counter = 0;
 

	
 
outer: while (i < 10) {
 
    u32 j = 0;
 
    while (j < 10) {
 
        if (counter == 42) break outer;
 
        j += 1;
 
        counter += 1;
 
    }
 
    i += 1;
 
}
 
```
 

	
 
## Continue Statement
 

	
 
A continue statement halts execution of a loop body and jumps back to the loop statement in order to evaluate its test expression (and optionally execute the loop body) again. It may optionally have a label to continue at a particular loop. It is defined as:
 

	
 
```
 
StmtContinue = KwContinue Label? ";"
 
```
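
For example, skipping part of the loop body for even numbers (a contrived sketch):

```
u32 i = 0;
u32 sum_of_odds = 0;
while (i < 10) {
    i += 1;
    if (i % 2 == 0) continue;
    sum_of_odds += i;
}
```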
 

	
 
## Return Statement
 

	
 
A return statement is used in function bodies in order to halt the execution of the function, and provide a return value to the caller. It is defined as:
 

	
 
```
 
StmtReturn = KwReturn Expr ";"
 
```
 

	
 
**note**: Like with functions, in the future we will likely support multiple return types, hence the syntax may change in the future.
 

	
 
## Goto Statement
 

	
 
A goto statement is used to jump to a particular label within a procedure body. It is defined as:
 

	
 
```
 
StmtGoto = KwGoto Label ";"
 
```
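
A contrived sketch, using the labeled statement introduced above:

```
u32 value = 10;
if (value == 0) goto done;
value = value * 2;

done: value += 1;
```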
 

	
 
## New Statement
 

	
 
A new statement instantiates a primitive component. It is defined as:
 

	
 
```
 
StmtNew = KwNew ExprCall ";"
 
```
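
For example (a sketch; `forward` is a hypothetical component taking an input and an output port):

```
channel output_port -> input_port;
new forward(input_port, output_port);
```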
 

	
 
## Local Variable Statement
 

	
 
A local variable statement declares a new variable and assigns it a particular initial value. It is defined as:
 

	
 
```
 
VarRef = Ident
 
StmtLocVar = TypeRef VarRef "=" Expr ";"
 
```
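
For example (a sketch, using the `Point3D` struct defined as an example in definitions.md):

```
u32 counter = 0;
bool is_done = false;
Point3D<u32> origin = Point3D{ x: 0, y: 0, z: 0 };
```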
 

	
 
**note**: We may change local variable statements to act as expressions in the future. This has no practical effect on the manner in which valid code can be written. See the comment at "expression statement"
 

	
 
## Local Channel Statement
 

	
 
A local channel statement instantiates a new channel. It returns the two ports that compose the channel. It is defined as:
 

	
 
```
 
PolyArgs = "<" (TypeRef ",")* TypeRef ">"
 
StmtLocChan = KwChannel PolyArgs? VarRef "->" VarRef ";"
 
```
 

	
 
Some examples are:
 

	
 
```
 
channel input_of_channel -> output_from_channel;
 
channel<u32> output_of_component -> input_of_component;
 
```
 

	
 
Although we'll expand on types later, one should note that in practice a channel only accepts one polymorphic argument. The port on the left hand side of the channel we call the output port (with type `out<T>`) and the one on the right hand side we call the input port (with type `in<T>`). These names depend on the manner in which one views the two ends of the channel. From the point of view of the channel, data flows in from the left hand side and out from the right hand side. However, we take the point of view of the component that uses these ports. Hence we view the left hand side as the `out` port, because a component puts data into it, and the right hand side as the `in` port, because a component gets data from it.
 

	
 
## Expression Statement
 

	
 
An expression statement is simply an expression placed in a statement. Hence it is simply defined as:
 

	
 
```
 
StmtExpr = Expr ";"
 
```
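
For example (a sketch; `log_value` is a hypothetical function and `counter` an in-scope variable):

```
counter += 1;        // assignment expression used as a statement
log_value(counter);  // call expression used as a statement
```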
 

	
 
**note**: Should I "formally" define assignment expressions here as well? They act like statements, but in code we treat them as expressions (and apply some checks to ensure they're used at the statement level) because this simplifies all code a lot. Maybe we make the document simpler as well by simply stating that they're expressions?
 

	
 
## Combining All Possible Statements
 

	
 
Combining all of the possible statements, we arrive at:
 

	
 
```
 
Stmt = 
 
    StmtBlock | StmtLabeled |
 
    StmtIf | StmtWhile | 
 
    StmtBreak | StmtContinue | StmtReturn | StmtGoto |
 
    StmtNew | StmtLocVar | StmtLocChan | StmtExpr
 
```
 
docs/spec/tokens.md
 
new file 100644
 
# Tokens
 

	
 
## Notation
 

	
 
While parsing the raw text into tokens, we will use `"c"` to specify the ASCII character `c`, `"word"` to specify the ASCII string `word`, and we will use `0x20` to mean the ASCII character associated with hexadecimal integer 0x20 (i.e. the space character). Terms placed one after the other must appear in that order. When different combinations are possible within a term we will use the syntax `["option 1" | "option 2" | "option 3"]`. We will reserve the `*` symbol to mean "0 or more repetitions", the `+` symbol to mean "1 or more repetitions", and the `?` symbol to mean "0 or 1 occurrence".  To apply these symbols to multiple terms we will put them between parentheses. So `(["0" | "1"])+` means that we expect a sequence of at least 1 character, where each character must be "0" or "1".
 

	
 
Finally, we may specify (inclusive) ranges of characters by putting a dash between two elements. So `0x00-0x1F` represents all control characters, and `0x41-0x5A`, or equivalently `"A"-"Z"`, represents all uppercase characters.
 

	
 
## Parsing Raw Text
 

	
 
We will define the following terms in terms of the raw input text:
 

	
 
```
 
RawWS = [" ", 0x09]
 
RawNL = [(0x0D 0x0A) | 0x0A]
 
RawWSNL = RawWS | RawNL
 
```
 

	
 
That is: whitespace may be the space character or a tab character. A newline may be the unix-like newline character, or the windows-like carriage return + newline combination.
 

	
 
With regards to visible character groups, we may define:
 

	
 
```
 
RawDigit = "0"-"9"
 
RawUpperAlpha = "A"-"Z"
 
RawLowerAlpha = "a"-"z"
 
RawVChar = 0x20-0x7E | 0x09
 
```
 

	
 
Where the `RawVChar` defines a visible character.
 

	
 
Finally, because the text stream ends at some point, we interpret the text stream as if it always has a `RawEOF` token at the end, indicating the end of the file.
 

	
 
## Parsing Unimportant Tokens
 

	
 
The raw text stream will be interpreted as a series of tokens. For simplicity we will call these `TkRaw`, which may essentially be one of two variants: firstly we will define `TkUnimportant`, which we will lex but which will not continue to influence the parsing of the raw text stream. Secondly we have `Tk`, the token that serves as the building block for the remainder of the parsing.
 

	
 
```
 
TkRaw = TkUnimportant | Tk
 
TkUnimportant = TkCommentLine | TkCommentBlock | TkWS
 

	
 
TkCommentLine = "//" RawVChar* [RawNL | RawEOF]
 
TkCommentBlock = "/*" [RawVChar | RawNL]* ["*/" | RawEOF]
 
TkWS = RawWSNL
 
```
 

	
 
Essentially: we consider all line comments, block comments (which do not nest) and whitespace to be unimportant.
 

	
 
## Parsing Important Tokens
 

	
 
There are two main types of important tokens: the varying length tokens (e.g. an identifier or an integer literal), and the fixed-length tokens (e.g. "!"). We'll start with the various kinds of varying length tokens.
 

	
 
### Identifiers
 

	
 
Identifiers are defined as:
 

	
 
```
 
IdentInit = "_" | RawUpperAlpha | RawLowerAlpha
 
IdentRem = IdentInit | RawDigit
 
TkIdent = IdentInit IdentRem*
 
```
 

	
 
That is: an identifier starts with a letter or an underscore, and is followed by a sequence containing those characters or digits. Note that the later definition of keywords also matches the definition of identifiers. The tokenizer will prefer to pick keywords over identifiers.
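
For example, the following are all lexed as single `TkIdent` tokens, while `2fast` is not an identifier because it starts with a digit:

```
x
_buffer
counter_2
SomeLongerName
```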
 

	
 
### Pragmas
 

	
 
A pragma is a hint to the compiler and is indicated by a pound sign. Hence we define a pragma as:
 

	
 
```
 
TkPragma = "#" IdentInit IdentRem*
 
```
 

	
 
### Character
 

	
 
A character literal is a single character bounded by single-quote marks. It is defined as:
 

	
 
```
 
CharUnescaped = 0x20-0x26 | 0x28-0x5B | 0x5D-0x7E
 
CharEscaped = "\" ["r" | "n" | "t" | "0" | "\" | "'" | 0x22]
 
CharElement = CharUnescaped | CharEscaped
 
TkChar = "'" CharElement "'" 
 
```
 

	
 
That is: a character is any of the visible characters (except for the single-quote character, because that has to be escaped, and except for the backslash character, because that is the indicator of the escaping). Or it is an escaped character. From left to right we have the following supported escape characters:
 

	
 
1. `r`, `0x0D`: Carriage return,
 
2. `n`, `0x0A`: Newline
 
3. `t`, `0x09`: Horizontal tab
 
4. `0`, `0x00`: Null character
 
5. `\`, `0x5C`: Backslash character
 
6. `'`, `0x27`: Single quote character
 
7. `"`, `0x22`: Double quote character
 

	
 
### String Literal
 

	
 
A string literal is essentially defined in the same way as the character literal, except that now the double-quote character has to be escaped (and the single-quote character does not). So it is defined as:
 

	
 
```
 
StrUnescaped = 0x20-0x21 | 0x23-0x5B | 0x5D-0x7E
 
StrEscaped = CharEscaped
 
StrElement = StrUnescaped | StrEscaped
 
TkStr = 0x22 StrElement* 0x22
 
```
 

	
 
Where again, 0x22 is the double quote character itself.
 

	
 
### Integer Literal
 

	
 
PDL currently supports binary, octal, decimal and hexadecimal integers. These are defined as:
 

	
 
```
 
IntBinEl = "0" | "1"
 
IntBinElSep = IntBinEl | "_"
 
IntBin = "0" ["b" | "B"] IntBinElSep* IntBinEl IntBinElSep*
 

	
 
IntOctEl = "0"-"7"
 
IntOctElSep = IntOctEl | "_"
 
IntOct = "0" ["o" | "O"] IntOctElSep* IntOctEl IntOctElSep*
 

	
 
IntDecEl = "0"-"9"
 
IntDecElSep = IntDecEl | "_"
 
IntDec = IntDecEl IntDecElSep*
 

	
 
IntHexEl = "0"-"9" | "A"-"F" | "a"-"f"
 
IntHexElSep = IntHexEl | "_"
 
IntHex = "0" ["x" | "X"] IntHexElSep* IntHexEl IntHexElSep*
 

	
 
TkInt = IntBin | IntOct | IntDec | IntHex
 
```
 

	
 
For the regular decimal integers we expect the first character to be an actual digit (to prevent ambiguity with the `TkIdent` token). The remainder may be any decimal digit or the separating `_` character.
 

	
 
For the non-decimal integers we expect two initial characters indicating the radix of the integer. The remainder of the integer literal must then consist at least once of an element in its alphabet, and the remainder may contain the separating `_` character where possible.
 

	
 
The separating character is visually useful for the programmer (e.g. `0xDEADBEEF_CAB3CAF3_DEADC0D3_C0D3CAF3` or `0b0001_0010_0100_1000`), but it does not contribute to the interpretation of the integer literal.
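
For example, the following literals all denote the value 255:

```
255
2_5_5
0xFF
0o377
0b1111_1111
```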
 

	
 
### Boolean literals
 

	
 
A boolean literal is just the word "true" or "false". The interpretation of these strings takes precedence over identifiers. We define:
 

	
 
```
 
TkBool = "true" | "false"
 
```
 

	
 
### Keywords
 

	
 
Several sequences of characters are reserved keywords which have a special meaning within the context of importing modules or the definition of a procedure body. These keywords take precedence over the interpretation of the character sequence as an identifier. Hence, none of these keywords may be used as identifiers within a program.
 

	
 
```
 
KwLet = "let"
 
KwAs = "as"
 
KwStruct = "struct"
 
KwEnum = "enum"
 
KwUnion = "union"
 
KwFunc = "func"
 
KwPrim = "primitive"
 
KwComp = "composite"
 
KwImport = "import"
 

	
 
Kw = KwLet | KwAs | KwStruct | 
 
     KwEnum | KwUnion | KwFunc |
 
     KwPrim | KwComp | KwImport
 
```
 

	
 
For statements we have the following keywords:
 

	
 
```
 
KwChannel = "channel"
 
KwIf = "if"
 
KwElse = "else"
 
KwWhile = "while"
 
KwBreak = "break"
 
KwContinue = "continue"
 
KwGoto = "goto"
 
KwReturn = "return"
 
KwSync = "synchronous"
 
KwNew = "new"
 

	
 
StmtKw = KwChannel |
 
       KwIf | KwElse | KwWhile |
 
       KwBreak | KwContinue | KwGoto | KwReturn |
 
       KwSync | KwNew
 
```
 

	
 
We use these two lists to define the keyword token as:
 

	
 
```
 
TkKw = Kw | StmtKw
 
```
 

	
 
Apart from these keywords the language also features several builtin methods and types. However, from the point of view of the tokenizer these are simply interpreted as identifiers. Scoping rules will ensure that they are not doubly defined. The same is true for types: several types (e.g. the basic integers) are given special attention, but we will define these later.
 

	
 
### Fixed-Width Tokens
 

	
 
The remaining tokens are the fixed-width punctuation tokens. There are several whose first couple of characters are identical; in this case we pick the longest matching sequence of characters. We have:
 

	
 
```
 
TkExcl = "!"
 
TkQuestion = "?"
 
TkPound = "#"
 

	
 
TkLAngle = "<"
 
TkLCurly = "{"
 
TkLParen = "("
 
TkLSquare = "["
 
TkRAngle = ">"
 
TkRCurly = "}"
 
TkRParen = ")"
 

	
 
TkRSquare = "]"
 
TkColon = ":"
 
TkComma = ","
 
TkDot = "."
 
TkSemiColon = ";"
 

	
 
TkAt = "@"
 
TkPlus = "+"
 
TkMinus = "-"
 
TkStar = "*"
 
TkSlash = "/"
 
TkPercent = "%"
 
TkCaret = "^"
 
TkAnd = "&"
 
TkOr = "|"
 
TkTilde = "~"
 
TkEqual = "="
 

	
 
TkColonColon = "::"
 
TkDotDot = ".."
 
TkArrowRight = "->"
 
TkAtEquals = "@="
 
TkPlusPlus = "++"
 
TkPlusEquals = "+="
 
TkMinusMinus = "--"
 
TkMinusEquals = "-="
 
TkStarEquals = "*="
 
TkSlashEquals = "/="
 
TkPercentEquals = "%="
 
TkCaretEquals = "^="
 
TkAndAnd = "&&"
 
TkAndEquals = "&="
 
TkOrOr = "||"
 
TkOrEquals = "|="
 
TkEqualEqual = "=="
 
TkNotEqual = "!="
 
TkShiftLeft = "<<"
 
TkLessEqual = "<="
 
TkShiftRight = ">>"
 
TkGreaterEqual = ">="
 

	
 
TkShiftLeftEqual = "<<="
 
TkShiftRightEqual = ">>="
 

	
 
TkPunct = ... all of the above
 
```
 

	
 
For brevity's sake, we will not actually use the identifiers above when we move on to specifying how definitions/statements/expressions are defined. The reason for specifying all of these combinations is that the tokenizer produces these tokens, and reports errors based on these tokens. Some of the tokens above are not used by the parser at all, and are merely parsed to produce reasonable error messages.
 

	
 
### Combining All Variants
 

	
 
Our definition for a useful token now becomes:
 

	
 
```
 
Tk = TkIdent | TkPragma | TkChar | TkStr | TkInt | TkBool | TkKw | TkPunct
```
docs/spec/validation.md
 
new file 100644
 
# Validating the AST
 

	
 
So far we've defined the language in terms of its grammar. Especially for expressions we're still dealing with ambiguity in the interpretation of raw text, and with certain statements or expressions that are invalid in a particular context. We'll deal with these in several phases in this section. Note that some conditions for validity are delayed until the section on typing, as it is more appropriate to introduce those restrictions there.
 

	
 
## Scoping and Naming Rules
 

	
 
Without any further constraints one may define variables or types with the same name. So we introduce rules for the various identifiers that are found within a program. 
 

	
 
Within the root of a module we are at the module scope. Its only parent is the global scope (whose contents we'll introduce later). The identifiers of type and procedure declarations within the module may not conflict with one another, or with any of the identifiers in the global scope.
 

	
 
Secondly, imports introduce identifiers into the module scope. Identifiers of procedures and types imported into a module may not conflict with one another, with the global scope (which is implicitly already satisfied, unless the import is aliased), or with the types defined within the module. In case an import produces an alias for a particular type or procedure, then that alias is the identifier that may not conflict.
 

	
 
Whenever we're parsing a procedure (function or component), we enter a new scope whose parent is the module scope. Block statements implicitly introduce a new child scope. Any defined variable's identifier will be added to the scope it is in, and may not conflict with another identifier in the same scope or parent scopes.
 

	
 
**note**: Look up if I made mistakes with labels.
 

	
 
Various other groups of identifiers in the program may not conflict among themselves. These are:
 

	
 
- The polymorphic variables of type or procedure definitions.
 
- The field names of a struct.
 
- The variants of an enum.
 
- The identifiers of variants of a union (even though they might have a different number of embedded values).
 

	
 
## Polymorphic Argument Rules
 

	
 
Types and procedures may be defined using polymorphic variables. Whenever we use these types to construct literals, or use functions to perform calls, we may specify these as polymorphic arguments. One has the choice of:
 

	
 
1. Specifying all of the polymorphic arguments. Note that one may still allow for type inference by using the `auto` type in the place of the polymorphic argument.
 
2. Specifying none of the polymorphic arguments, in which case the compiler must assume that all of the polymorphic arguments are to be inferred (i.e. as if the programmer had written `auto` for each of the polymorphic arguments).
 

	
 
Any other case should be a compile-time error.
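
A sketch of these rules, using the `Pair` struct defined as an example in definitions.md:

```
Pair<u32, bool> a = Pair<u32, bool>{ left: 1, right: true };  // all arguments specified
Pair<u32, bool> b = Pair<auto, bool>{ left: 1, right: true }; // allowed: `auto` is inferred
Pair<u32, bool> c = Pair{ left: 1, right: true };             // allowed: all arguments inferred
Pair<u32, bool> d = Pair<u32>{ left: 1, right: true };        // error: only some arguments specified
```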
 

	
 
**note**: At the moment polymorphism in the language is rather underpowered until a reasonable and simple scheme for constraining polymorphic variables can be found. At the moment one can only specify one of the builtin types, or struct/enum/union types, as polymorphic arguments. Furthermore, procedures defined with polymorphic variables may not construct literals by using them.
 

	
 
## Expression Rules
 

	
 
### Assignment
 

	
 
Assignment expressions may only be placed at the statement level. That is to say: viewing a series of expressions as a tree, where each node has a parent expression or a parent statement (e.g. the test expression of an if-statement), then assignment expressions may only have the expression-statement as a parent. 
 

	
 
The left hand side of an assignment expression should be assignable. An expression is assignable if it is:
 

	
 
- An indexing expression with an assignable subject expression.
 
- A slicing expression with an assignable subject expression.
 
- A select expression with an assignable subject expression.
 
- A variable expression.
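
A few assignments whose left hand sides satisfy these rules (a sketch, assuming `values`, `other_values` and the struct variable `point` are in scope):

```
point.x = 5;                          // select expression on a variable
values[0] = 10;                       // indexing expression on a variable
values[1..3] = other_values;          // slicing expression on a variable
point = Point3D{ x: 1, y: 2, z: 3 };  // plain variable expression
```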
 

	
 
### Literals
 

	
 
An enum literal is valid if its `TypeRef` points to an enum declaration (either directly, or through a potentially aliased import) and the subsequent identifier is a variant declared in the type definition.
 

	
 
A union literal is valid if its `TypeRef` points (indirectly) to a union declaration, the subsequent identifier is a variant in the type definition, and if (where applicable) the number of embedded expressions matches the number of embedded types in the variant's definitions.
 

	
 
A struct literal is valid if its `TypeRef` points (indirectly) to a struct declaration, and it provides an expression to initialize all of the struct's fields, and only initializes each field once.
 

	
 
For all of these literals the rules for polymorphic arguments apply: if the type specifies polymorphic arguments, then all of them have to be specified or none of them have to be specified (for implicit type inference).
 

	
 
### Binding
 

	
 
A binding expression itself is only valid if it is nested in the expression tree under no other expressions, or only under binary expressions. The chain of parents must terminate at an if-statement's or while-statement's test expression. As a result, binding expressions may not be nested in one another.
 

	
 
A binding expression's left hand side is valid if it is a binding variable, or if it is a literal that may contain a binding variable. A binding variable is a variable that has not yet been declared using a `StmtLocVar`. Such a variable will then be implicitly declared at the binding location, and is usable within the scope of the if-statement's then-block, or the while-statement's loop body. The same scoping rules apply to that scope: binding variables may not be declared twice, nor may they conflict with any variables declared in the scope or its parents.
 

	
 
Furthermore, the parent of a binding variable may only be some kind of literal expression, or the left hand side of the binding expression itself.
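
A sketch of a binding expression, using the `Option` union defined as an example in definitions.md (`maybe_value` and `total` are assumed to be in scope):

```
if (let Option::Some(value) = maybe_value) {
    total += value;
}
```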
 

	
 
### Function Calls
 

	
 
The `TypeRef` for a function call should resolve to either a builtin function, or a user-defined function. Function calls follow the rules for polymorphic arguments as outlined above. Furthermore the number of expressions given as arguments to the function should match the number of arguments in the definition of the function.
 

	
 
### Component "Calls"
 

	
 
The `TypeRef` for a component call should resolve to either a builtin component, or a user-defined component. Component calls, like function calls, must follow the rules for polymorphic arguments and the arguments to the components themselves. Furthermore component "call" expressions may only be placed with a `StmtNew` as parent. That is to say: one doesn't really call components, instead one instantiates them.
 

	
 
## Builtin Procedures
 

	
language_spec.md
 
deleted file