# Runtime Design
## General Architecture
Roughly speaking the project consists of the following parts:
1. The compiler. This transforms the PDL source code into an executable format.
2. The interpreter. This takes the executable format and executes it. It is a very unoptimized AST walker based on stack frames. Generally speaking, the bottommost frame in the stack contains the code and memory associated with a component. Once the interpreter hits a point where it needs to interact with the runtime (generally in order to communicate with another component) it will halt and emit a signal to the runtime.
3. The runtime. This is the code that keeps track of the components, decides when they can run, determines where their messages should end up, and brings various control algorithms (running behind the scenes) to completion.
We'll not go into points 1 and 2 in this document. One may simply assume that at the language level there is support for all of the things that are implemented in the runtime: sync blocks, channels, ports, component creation, sending and receiving messages, etc.
Once such features are encountered in the interpreter, a signal will be emitted to the runtime. Handling these signals is discussed in this document.
## Runtime Architecture
The runtime's code essentially consists of:
- The machinery that keeps track of the components. That is to say: there is some memory reserved for each of the components. And using some kind of component ID we can look up this memory in a registry. If the runtime finds that there are no more user-controlled components running (i.e. the ones that a programmer uses to create more components) and there are no more regular components running, then the runtime shuts down.
- A bit of shared memory for all of the OS threads that will be managed by the runtime. This mainly consists of a work queue. This work queue contains the identities of the components that are scheduled for execution.
- A set of scheduler threads. They attempt to retrieve work from the work queue. This work is in the form of component IDs, which they use to retrieve the associated component and run its PDL code. They do this by invoking the interpreter on the component's current execution and memory state. Once a runtime signal (as mentioned above) is emitted by the interpreter, the scheduler thread will deal with it appropriately.
- An auxiliary polling thread. Not of great importance, so a short description suffices: although most components react to one another, some components (e.g. a TCP socket) might have nothing to do until the OS instructs it to do something. This polling thread ensures that once there is something to do, the component is put back onto the work queue.
As long as the runtime doesn't shut down there will be `T` threads executing `N` components. A component can either be running (by being executed by a scheduler thread), scheduled for execution (its ID is in the work queue), or sleeping. All of these states are exclusive. Maintaining the exclusivity of these states is of great importance! We never want to end up in a place where two threads are both modifying the code/memory state of the same component!
A component will start its lifecycle by being put into the work queue (not exactly true, but this will be dealt with later) by the component that created it. At some point a scheduler will pick up the newly created component's ID from the work queue and start executing its code. Once running (and we'll exclude fatal errors for now) the component may at some point reach the end of its code and terminate, or it may encounter a place in its code where it blocks and needs to wait for external input (e.g. it performed a `get`, but a message has not arrived yet).
Once the execution of a component is blocked, it will attempt to go to sleep. Meaning: it will be in a state where it is not running, but also not scheduled for execution. A component may then be woken up by a different component, or the polling thread, by sending the sleeping component a message. To prevent components from scheduling a sleeping component multiple times, the memory of a component contains an atomic `sleeping` boolean.
It is instructive at this point to roughly explain how components are stored in memory. The component's memory is roughly divided into two regions: there is the publicly accessible memory of a component and a private memory region. The public part is accessible by all scheduler threads. So all of the memory in the public memory region is somehow behind a lock, or some kind of locking mechanism (and we will include atomics into "some kind of locking mechanism" for now). Hence the aforementioned `sleeping` boolean lives in this region. Conversely, the private memory region of a component is only accessed by the scheduler thread that is currently running the component. So here we store things like the memory and execution state of its code.
Returning to the idea of a component wishing to enter the "sleeping" state. The procedure in pseudocode is written as:
```
func go_to_sleep(this_component) {
    // We are currently executing, so our sleeping flag MUST be `false`
    assert(atomic::load(this_component.sleeping) == false);
    atomic::store(this_component.sleeping, true); // we force the flag to true

    // Note that while we were trying to go to sleep, someone may have sent
    // us a new message without yet seeing that we stored `true` in the
    // sleeping flag, so we need to check the inbox ourselves.
    if messages_in_inbox() {
        // We try to set the flag back to false, but another scheduler thread
        // may have already done this
        let swap_success = atomic::cas(this_component.sleeping, true, false);
        if swap_success {
            put_in_work_queue(this_component.id);
        }
    }
}
```
Similarly, each time we try to send a component a message, we must do the following:
```
func send_message_to(target_component, message_data) {
    put_message_in_inbox_locked(target_component.inbox, message_data);
    let swap_success = atomic::cas(target_component.sleeping, true, false);
    if swap_success {
        put_in_work_queue(target_component.id);
    }
}
```
Note that, because we cannot predict how the OS threads are scheduled, we also cannot predict the way in which our own schedulers (which run on OS threads) will schedule the execution of the components. Hence, as a mental model, one may assume that each component runs in its own thread. The system above ensures that if a component has something to do (because it has received a message), it will eventually end up being executed by a scheduler. The code for the "sleeping" state ensures that a component can only be executed by one scheduler at a time.
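As a concrete (if simplified) illustration of this wake-up protocol, the following runnable Python sketch models the atomic `sleeping` flag with a lock-guarded compare-and-swap and uses a single global work queue. All names are illustrative; the real runtime uses actual atomics and finer-grained locking.

```python
import threading
from collections import deque

work_queue = deque()  # global work queue of component IDs (simplified)

class Component:
    def __init__(self, cid):
        self.id = cid
        self.lock = threading.Lock()  # models the locked public region
        self.sleeping = False         # the `sleeping` flag from the text
        self.inbox = deque()

    def cas_sleeping(self, expected, new):
        # Models atomic::cas on the sleeping flag.
        with self.lock:
            if self.sleeping == expected:
                self.sleeping = new
                return True
            return False

def go_to_sleep(comp):
    # We are currently executing, so the flag must have been False.
    assert comp.cas_sleeping(False, True)
    # A message may have raced with us going to sleep: check the inbox.
    if comp.inbox and comp.cas_sleeping(True, False):
        work_queue.append(comp.id)

def send_message_to(comp, message):
    comp.inbox.append(message)
    if comp.cas_sleeping(True, False):
        work_queue.append(comp.id)

# Two senders racing against a sleeping component: the component is
# scheduled exactly once, never twice.
c = Component(cid=1)
go_to_sleep(c)               # inbox empty: component stays asleep
send_message_to(c, "hello")  # wins the CAS, schedules the component
send_message_to(c, "world")  # CAS fails: component already scheduled
```

The key property shown here is the exclusivity of the states: only the thread that wins the compare-and-swap is allowed to put the component's ID into the work queue.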
## General Messaging
With this rough architecture, components can send each other messages. One will find three kinds of messages in the runtime (okay, four, but the last one is just to make the OS polling thread work):
1. Data messages: these are the messages that are sent by `put` calls and received with `get` calls. As will be explained later in this document, more information is attached to data messages than the values given as argument to the `put` call. These messages will arrive in the target component's inbox. When the target component performs a call to `get` they're actually pulled out of the inbox.
2. Control messages: to facilitate certain operations permitted within PDL, the scheduler thread may decide to send messages to other components. These messages are called control messages. We'll encounter them later in this document when describing component creation, transmitting ports, and closing channels. Different from data messages, which may linger in the component's inbox for a while, control messages are handled by the receiving component immediately. This is important for various control algorithms.
3. Sync messages: to facilitate the consensus algorithm, there will be messages initiated by the scheduler thread as well. That is to say: when the component sends data messages there will be information attached to the data message that facilitates a successful sync round. But apart from that when the components are all done executing their code they must somehow reach consensus. This is done through these sync messages.
Note that the concept of a channel, each having its own little slot (or limited buffer), already becomes a superfluous design decision for the runtime. This is because the scheduler threads themselves also need to be able to send messages to other components. Hence we'll need something more generic than a per-channel buffer, namely a generic message inbox.
The concept of a directional channel *may* be a useful tool within PDL code (and I'm not arguing against such a concept), but as we'll see throughout this document the control messages can (conceptually) flow both ways over the channel.
## Runtime Design Drivers
Most design choices in the runtime were based on the fact that the Reowolf language should facilitate easier programming over the internet. So although the entire system currently only considers components running on a single machine, those components were conceptually regarded as living on different machines. In other words: components were conceptually considered not to have shared memory they can both access.
This has some implications for channels. Certainly a component that sends a value must have it stored somewhere temporarily, and the component receiving it needs to keep it around as well. But the channel is not an entity that you'll find in memory. Rather there is one component owning one port, and when a message is `put` through it, it will arrive at another component owning the peer port; there is no shared memory that represents a message flowing through a channel.
A multi-machine runtime also requires the runtime to embrace the asynchronous nature of components. `put`s are non-blocking and can be performed one after the other before the peer has performed a corresponding `get`. The language does not contain the concept of a "lock" such that two components can agree on who owns a shared bit of memory. Rather each component is executed in its own thread of execution, and for multiple components to coordinate their actions they must use the messaging facilities. In order to make this coordination somewhat simple to reason about, one of the design drivers of the runtime was to ensure that each message sent in a specific order from one component to another will arrive in that same order.
And so we have a multi-machine runtime where components running in their own thread can only coordinate through messages. As a result, an ever-important concern in designing internal (control) algorithms is something called "message crossing": two components may decide to initiate a protocol at the same time, hence send each other the exact same protocol-initiating message.
Yet another result is that we decided to design the runtime without any globally unique component and/or port IDs. Certainly: on a single machine a component ID allows one to retrieve a component's memory. But when sending a message to a component living on another machine, it may well be that we're sending it through a port that has the same port ID as ours, targeting a component that has the same ID as ours.
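To make the locally-unique-IDs point concrete, here is a sketch of a per-runtime component registry handing out `(index, generation)` IDs. This is an assumed illustration (the actual registry layout is not specified here); the point is that two independent runtimes happily hand out the *same* ID for different components, so IDs can never be treated as global.

```python
class Registry:
    """Per-runtime component registry with reusable, generational slots."""

    def __init__(self):
        self.slots = []   # list of (generation, component-or-None)
        self.free = []    # indices of removed slots, available for reuse

    def insert(self, component):
        if self.free:
            idx = self.free.pop()
            gen, _ = self.slots[idx]
            self.slots[idx] = (gen, component)
        else:
            idx, gen = len(self.slots), 0
            self.slots.append((gen, component))
        return (idx, gen)  # the "component ID": meaningless on another machine

    def remove(self, cid):
        idx, gen = cid
        stored_gen, comp = self.slots[idx]
        assert stored_gen == gen, "stale component ID"
        self.slots[idx] = (gen + 1, None)  # bump generation; slot is reusable
        self.free.append(idx)
        return comp

    def lookup(self, cid):
        idx, gen = cid
        stored_gen, comp = self.slots[idx]
        return comp if stored_gen == gen else None

# Two independent runtimes (think: two machines) hand out the same ID
# for different components.
r1, r2 = Registry(), Registry()
id_a = r1.insert("alpha")
id_b = r2.insert("beta")
```

The generation counter also guards against a message arriving for a component that has since terminated and whose slot has been reused.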
## Control Algorithms
We'll now discuss several of the control algorithms. These control algorithms may be initiated by the scheduler threads when certain runtime signals are emitted by the interpreter, and are executed by sending messages. We'll talk about these messages as if they're sent from one component to another. For certain reasons (introduced later) these messages may be sent to ports of peer components, but that does not mean that the peer can `get` these messages: they are control messages that are intercepted and handled by a bit of code that lives above the interpreter.
### Changing Port Peers due to Component Creation
Components, when they're not in sync mode, may decide to create new components. Ports may be the arguments to these newly created components. The rule we place on such a kind of port transfer is that the component that is creating the new component fully relinquishes ownership of the transferred port, and after the component creation the new component owns that port. As an annotated example:
```
comp creator(in<u32> one_port) {
    channel another_port -> and_a_final_one;
    sync {
        auto value = get(one_port); // legal
        put(another_port, value);   // legal
    }

    // perform component creation
    new some_other_component(one_port, and_a_final_one);
    sync auto value = get(one_port); // illegal! Port was transferred
}
```
We have several runtime properties to uphold when we're transferring ports:
- No globally unique port IDs, so the new component is allowed to choose new port IDs for the ports it is adopting ownership of.
- The peers of the transferred ports may be unaware that a new component is created. In fact those peers may have already sent messages to the instantiating component! As a design decision (the one that we find makes sense) any incoming, but unread, messages for a port are transferred along to the new component.
- Similarly to above: a peer of a transferred port needs to be aware at some point that its peer port has changed owner.
- Together with the requirement that peers need to be aware of the transferred ports, we also need to maintain ordering in the sent messages that intend to arrive at that transferred port at some point in time.
Here we see the asynchronous nature of the runtime rear its head. The transferring of ports does not just happen to the receiving end of a port (hence messages only eventually arrive at their correct destination); it may happen to the transmitting end of a port as well. What this means for the receiver is that it is never sure which component is its peer until it has received a data message that is annotated with the origin of the message. At that moment in time the peer of the port is known, but only until the end of the synchronous round. Because after the synchronous round it is perfectly possible for the port to be passed around again.
For all of the requirements above, the internal control algorithm to transfer a port to a new component is as follows:
1. The component that is creating the new component (we'll call the creator the instantiator component, and the created one the new component) temporarily has access to the private memory of the new component. The reason is that a component is always created on the same machine as the instantiator component. And so the first step the instantiator takes is to create new port IDs (that make sense for the newly created component, instead of for itself) and map the old port IDs to the new ones.
2. The component transfers all of the metadata associated with the port, and transfers all of the messages that are targeted at those transferred ports to the new component.
3. For each transferred port the instantiator sends a `PortPeerChanged_Block` control message to the peers. This message instructs the peer that the port should be temporarily blocked. Any component that tries to send a message through that port enters a blocked state that can only be lifted if the corresponding `PortPeerChanged_Unblock` control message is sent. At the same time the instantiator sets up a special bit of code that will relay all incoming messages from that peer to the new component. We've mentioned earlier that all messages will have a target port. So when messages arrive at the instantiator component that need to be relayed, the instantiator component will modify the target port to the new component's chosen port ID.
4. Once a peer has received a `PortPeerChanged_Block`, it will, as stated above, stop sending messages over that channel. Not only data messages, but control messages as well. This also means that the other component cannot start transferring ports itself. In any case, it will respond with an `Ack`nowledgement back to the instantiator component.
5. The instantiator component waits until it has received an `Ack` for all of the `PortPeerChanged_Block` messages it has sent. This is done such that we're sure that we've received all of the messages that are actually intended for the new component (because while the new component is being created the peer may still be sending messages intended for the new component, but addressed to the instantiator component). As a result, the new component will have all of the data messages in its inbox in the order in which they were sent, therefore maintaining the runtime property of message ordering.
6. When all of the `Ack`s are received, the instantiator component will remove the bit of code that relays all of the messages and will schedule the new component for execution. At this point the instantiator component will no longer access the private memory of the new component. Since the instantiator component is aware of the new component's ID and the new port IDs for all of the transferred ports, it will send `PortPeerChanged_Unblock` messages to all of the peer components. This message will also contain the new component's ID and its port ID.
7. The peers, upon receiving the `PortPeerChanged_Unblock` message, will update the metadata of their ports such that they point to the new component's IDs. They will also unblock the port such that messages can be sent again.
With this control algorithm, all peers are now aware of the new port's position. We've also maintained message ordering for the new component. Although it was mentioned in point (4), we'll mention it here to be extra clear: creating a new component will block until all of the transferred ports are unblocked. If we don't do this a data/control message may end up at the wrong component.
Likewise we see the asynchronous nature of ports: the peers are eventually consistent. To maintain this consistency it is important to modify port IDs when we're relaying messages.
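The seven steps above can be sketched as a small hand-driven simulation of a single transferred port. This is a deliberately synchronous toy (messages are hand-delivered from named mailboxes) meant only to show the Block → Ack → Unblock sequence and the rewriting of the target port while relaying; all names are illustrative.

```python
from collections import deque

# Toy mailboxes keyed by component name; stands in for the registry.
inbox = {"instantiator": deque(), "peer": deque(), "new": deque()}

def send(target, msg):
    inbox[target].append(msg)

peer_view = {"peer_comp": "instantiator", "peer_port": "p0", "blocked": False}
new_port_id = "q0"  # steps 1-2: instantiator picks an ID for the new component

# Step 3: block the peer and start relaying incoming messages.
send("peer", ("PortPeerChanged_Block", "p0"))
relaying = True

# Step 4: the peer handles the Block message immediately and Acks.
kind, _port = inbox["peer"].popleft()
assert kind == "PortPeerChanged_Block"
peer_view["blocked"] = True
send("instantiator", ("Ack",))

# A data message that was still en route to the instantiator is relayed
# with its target port rewritten to the new component's chosen port ID.
incoming = ("Data", "p0", 42)
if relaying:
    send("new", ("Data", new_port_id, incoming[2]))

# Steps 5-6: all Acks received; stop relaying, schedule the new component,
# and unblock the peer with the new component and port IDs.
assert inbox["instantiator"].popleft() == ("Ack",)
relaying = False
send("peer", ("PortPeerChanged_Unblock", "new", new_port_id))

# Step 7: the peer updates its port metadata and unblocks the port.
kind, comp_id, port_id = inbox["peer"].popleft()
assert kind == "PortPeerChanged_Unblock"
peer_view.update(peer_comp=comp_id, peer_port=port_id, blocked=False)
```

Note how the in-flight data message ends up in the new component's inbox with the rewritten port ID, which is exactly the message-ordering property the `Ack` waiting in step 5 protects.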
### Shutting Down Components
A component requires a bit of memory to run. So when we're done executing a component (either because it has crashed, or because its program has terminated) we would like to release this memory again. Earlier we mentioned that components send messages by accessing an inbox in the public memory region of a component. This memory will, of course, be freed as well. So we need to make sure that when a component shuts down, all of its peers are somehow notified that they can never send messages to that terminated component again.
In order to do so we have another control protocol. We'll extend this protocol when we discuss encountering crashing components, but we'll introduce a simpler variant here. The protocol is relatively simple. For each of the ports that are not yet closed and are owned by the terminating component we will:
1. Make sure that the port is not blocked. If the port is blocked then the component waits until the associated port becomes unblocked. If the port is already closed then we do not execute the other steps in this control algorithm.
2. The terminating component sends a `ClosePort` message to the peer of the port that is closing. Note that this `ClosePort` message will have a target port. If it happens that the terminating component will receive a `PortPeerChanged_Block` message for that port in the near future, we're still certain that the `ClosePort` message will at least arrive at the correct peer (since the target port will be used to relay that message to the correct receiver).
3. The peer of the port, upon receiving a `ClosePort` message, will mark the port as being closed in its metadata. From that point onwards, any attempt to `put` or `get` on that port will result in the peer component crashing. In response to the `ClosePort` message, the peer component will send an `Ack`. There is one exception, and that is when the peer component itself already initiated a `ClosePort` control algorithm for that port. In that case the incoming `ClosePort` message is treated like an `Ack`.
4. The terminating component will wait until all of the `Ack`s have arrived (or crossing `ClosePort` messages, as stated in point (3)). Once they have, it will instruct the runtime to remove the component from memory.
Note here that we have to be careful to annotate the `ClosePort` message with the target port. We wait if the port is already blocked, but it may be that our `ClosePort` message crosses with a `PortPeerChanged_Block` message, which implies that our `ClosePort` message will be relayed by the peer component.
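The receiving side of this handshake, including the crossing-`ClosePort` case from step 3, can be condensed into a small handler. This is a sketch with invented names, not the runtime's actual API.

```python
def handle_close_port(port, we_initiated_close):
    """Handle an incoming `ClosePort` message for `port` (step 3).

    Returns the reply message to send back, or None when the incoming
    ClosePort crossed with our own ClosePort and therefore already
    counts as the Ack.
    """
    port["closed"] = True   # any later put/get on this port crashes us
    if we_initiated_close:
        return None         # crossing ClosePort messages act as mutual Acks
    return ("Ack", port["id"])

# Normal termination: the peer closes first, we acknowledge.
p = {"id": "p0", "closed": False}
reply = handle_close_port(p, we_initiated_close=False)

# Crossing case: both ends initiated ClosePort at the same time, so no
# extra Ack is sent; each incoming ClosePort serves as the other's Ack.
q = {"id": "p1", "closed": False}
crossed = handle_close_port(q, we_initiated_close=True)
```

Treating a crossed `ClosePort` as an `Ack` is what lets both terminating components make progress without waiting on each other forever.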
### Transferring Ports through Data Messages
The PDL code allows for ports to be transferred through ports. As a simple example, consider the following code:
```
struct Pair {
    in<bool> command,
    out<u32> response,
}

comp some_component(
    in<u32> to_transmit,
    out<in<u32>> tx_one_port,
    out<Pair> tx_two_ports
) {
    // Transmitting a port directly
    sync put(tx_one_port, to_transmit);

    // Transmitting multiple ports at a time using a data structure
    channel command_tx -> command_rx;
    channel response_tx -> response_rx;
    sync {
        let message = Pair{
            command: command_rx,
            response: response_tx,
        };
        put(tx_two_ports, message);
    }

    // Sending a command and receiving a response
    sync {
        put(command_tx, true);
        auto response = get(response_rx);
    }
}
```
To facilitate this, we'll follow roughly the same procedure as when we're transferring ports to a newly created component. But we have one complication: we do not have direct access to the private memory of the component we're sending the ports to (we'll call this component the "adopting component", and the sending component the "relinquishing component"). And so we'll have to follow a control protocol that is slightly different.
Note that it is perfectly okay to send closed ports. The adopting component will receive the port together with the information that it is closed. In this way, if the adopting component attempts a `put` or `get` on that received port, it will crash.
We'll enforce a second rule upon transferring ports: ports transferred in a synchronous round may not have been used in `get` or `put` operations in that round. I'm certain that it is possible to come up with a set of rules that would make this unnecessary. But the protocol for transferring ports over channels is a lot easier if we disallow this. For this reason we'll introduce a field in the metadata of each port that registers when the port was last used. If the relinquishing component attempts to transfer a port that has been used, then it will crash.
As for transferring ports to a newly created component, we want to ensure that all of the messages intended for that port arrive in the correct order at the adopting component.
That said: the control protocol for transmitting ports proceeds as follows:
1. The relinquishing component will first make sure that none of the ports are blocked. If the ports are blocked then it will sleep until the ports become unblocked. As stated above the relinquishing component will also make sure that the ports were not previously used within the synchronous round.
2. The relinquishing component will send `PortPeerChanged_Block` messages to all of the peers of the ports that will be transferred. However, in this case it will not relay any messages to the adopting component; they will still pile up in the relinquishing component's inbox.
3. The peers, upon receiving the `PortPeerChanged_Block` message, will proceed as they would in the case where ports were transferred to a new component: they'll block the port and send an `Ack`.
4. The relinquishing component will wait until all of the expected `Ack` messages are received. Once they are received, the component will wait until the port through which the message will travel becomes unblocked (that is: the port that is used to transfer the ports to the adopting component).
5. The relinquishing component will send the data message containing the transferred ports to the adopting component. It will annotate this message with a list containing `(transferred port ID, peer component ID, peer port ID)` triples. Note that since those peer ports are blocked, they will not be transferred in the meantime. This is essential for the next step.
6. The adopting component will receive the annotated data message containing the transferred ports. For each transferred port it will decide upon a new port ID.
7. The adopting component will, for each adopted port, send out a `PortPeerChanged_Unblock` message to the blocked peer ports. This message will be annotated with the `(adopting component ID, new port ID)` pairs, such that the peers all know where their new peers can be found.
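Steps 5 through 7 can be sketched as two pure functions: the relinquishing side builds the annotation triples, and the adopting side picks fresh local IDs and produces the `PortPeerChanged_Unblock` messages. The function and field names are invented for illustration.

```python
def annotate(transferred_ports, port_table):
    """Relinquishing side (step 5): one (port, peer comp, peer port)
    triple per transferred port, attached to the data message."""
    return [(p, port_table[p]["peer_comp"], port_table[p]["peer_port"])
            for p in transferred_ports]

def adopt(annotations, fresh_port_id, my_comp_id):
    """Adopting side (steps 6-7): choose new local port IDs and produce
    the Unblock messages that tell each blocked peer where its new peer
    lives."""
    mapping, unblocks = {}, []
    for old_port, peer_comp, peer_port in annotations:
        new_port = fresh_port_id()
        mapping[old_port] = new_port
        unblocks.append((peer_comp, peer_port,
                         ("PortPeerChanged_Unblock", my_comp_id, new_port)))
    return mapping, unblocks

# One port "p0" whose peer is port "x3" on component "C7" is transferred
# to adopting component "C9", which hands out IDs "q0", "q1", ...
port_table = {"p0": {"peer_comp": "C7", "peer_port": "x3"}}
ann = annotate(["p0"], port_table)

counter = iter(range(1000))
mapping, unblocks = adopt(ann, lambda: f"q{next(counter)}", my_comp_id="C9")
```

Because the peer ports stay blocked between steps 5 and 7, the `(peer component ID, peer port ID)` pairs in the annotation are guaranteed to still be valid when the `Unblock` messages are sent.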
## Dealing with Crashing Components
A component may at any point during its execution be triggered to crash. This may be because of something simple like an out-of-bounds array access. But as described above using closed ports may lead to such an event as well. In such a case we not only need to go through the `ClosePort` control protocol, to make sure that we can remove the crashing component's memory from the runtime, but we'll also have to make sure that all of the peers are aware that *their* peer has crashed. Here we'll make a design decision: if a peer component crashes during a synchronous round and there were interactions with that component, then that interacting component should crash as well. The exact reasons will be introduced later, but it comes down to the fact that we need to do something about the fact that the synchronous round will never be able to complete.
We'll talk ourselves through the case of a component crashing before coming up with the control algorithm to deal with components crashing.
We'll first consider that a component may crash inside or outside of a synchronous block. From the point of view of the peer component, we'll have four cases to consider:
1. The peer component is not in a synchronous block.
2. The crashing component died before the peer component entered the synchronous block.
3. The crashing component died during the same synchronous block as the peer component.
4. The crashing component died after reaching consensus on the synchronous block that the peer component is currently still in.
Before discussing these cases, it is important to remember that the runtime has all components running in their own thread of execution. It may be that the crashing component is unaware of its peers (due to the fact that peer ports might change ownership at any point in time). We'll discuss the consensus algorithm in more detail later in this documentation. For now it is important to note that the components discover the synchronous region they are part of while the PDL code is executing. So if a component crashes within a synchronous region before the end of the sync block is reached, it may not discover the full synchronous region it would have been part of.
Because the crashing component is potentially unaware of the component IDs it will end up notifying that it has failed, we cannot design the crash-handling algorithm in a way where the crashing component tells the peers when they have to crash. We'll do the opposite: the crashing component simply crashes and makes a best effort to notify the peers. Those peers themselves decide whether they have to crash in response to such a notification.
For this reason, it does not make a lot of sense to deal with component failure through the consensus algorithm. Dealing with the failure through the consensus algorithm only makes sense if we could find the synchronous region that would be discovered if we were able to fully execute the sync block of each participating component. And so we'll opt to deal with failure on a peer-by-peer basis.
We'll go back to the four cases we've discussed above. We'll change our point of view: we're now considering a component (the "handling component") that has to deal with the failure of a peer (the "crashing component"). We'll introduce a small part of our solution a priori: like a component shutting down, a failing component will simply end its life by broadcasting `ClosePort` messages over all of its owned ports that are not closed (and the failing component will also wait until none of those ports are blocked).
In the first case, we're dealing with a failing component while the handling component is not in a synchronous block. This means that if there was a previous synchronous block, that it has succeeded. We might still have data messages in our inbox that were sent by the failing component. But in this case it is rather easy to deal with this: we mark the ports as closed, and if we end up using them in the next synchronous block, then we will crash ourselves.
In the second case we have that the peer component died before we ourselves have entered the synchronous block. This case is somewhat equivalent to the case we described above. The crashing component cannot have sent the handling component any messages. So we mark the port as closed, potentially failing in the future if they end up being used.
Next up is the third case, where the crashing component and the handling component were in the same synchronous round. Like before we mark the port as closed and future use will cause a crash. The difference is that the handling component may be blocked on an attempt to `get` from a port which the crashing component now indicates is closed, or it may have already performed successful `get` operations. In that case the handling component should crash: the crashing component can never submit its local solution, so the synchronous round can never succeed! And so we need to store metadata for each port that tracks whether it was used in the synchronous round. If the port was used in the synchronous round and a `ClosePort` message comes in, then the handling component should crash as well.
The fourth case is where the failing component crashes *after* the shared sync round has been resolved, but before the handling component is aware of this. This is an edge case dealing with the following situation: both the handling and the crashing component have submitted their local solution to the consensus algorithm. The crashing component receives a global solution, finishes the sync round, and then crashes, thereby sending a `ClosePort` message to the handling component. The handling component, due to the asynchronous nature of the runtime, receives the `ClosePort` message before the global solution has had a chance to reach it. In this case, however, the handling component should be able to finish the synchronous round, and it shouldn't crash.
Here is where we arrive at the protocol for dealing with component crashes. To let the handling component deal with this last case, we'll let the crashing component send the `ClosePort` messages together with a boolean indicating whether it crashed inside or outside of a synchronous block.
If the crashing component sends `ClosePort` together with the indication that it crashed inside of a synchronous block, then we're dealing with case 3 *if* there are messages in the inbox, or if the handling component uses the closed port after receiving the `ClosePort` message; in that situation the handling component crashes as well. But if the crashing component sends `ClosePort` together with the indication that it crashed outside of a synchronous block, then we check whether we have performed any operations on the closed port in the current round. If we have, then we're dealing with case 4: the crashing component completed the shared round before terminating, so the handling component's synchronous round can still succeed in the future. If we haven't, then we're dealing with cases 1 or 2 and we simply mark the port as closed.
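The four cases analysed above (cases 1, 2 and 4 survive; case 3 crashes when there was interaction) condense into a small decision function on the handling side. The parameter names are invented, and this is a reading aid for the case analysis rather than the definitive rule.

```python
def handler_must_crash(peer_crashed_in_sync, interacted_this_round):
    """Decide whether the handling component must crash when a ClosePort
    message arrives.

    peer_crashed_in_sync: the boolean carried by the ClosePort message.
    interacted_this_round: whether the closed port was used (or has
        pending messages from the peer) in the handling component's
        current sync round, as tracked in the port's metadata.
    """
    if peer_crashed_in_sync:
        # Case 3: the peer can never submit its local solution, so a
        # round in which we interacted with it can never complete.
        return interacted_this_round
    # Cases 1, 2 and 4: the peer died outside a sync round; our current
    # round (if any) can still reach a global solution. We only mark the
    # port as closed, and crash later if the port is used again.
    return False
```

A call like `handler_must_crash(True, True)` corresponds to case 3, while `handler_must_crash(False, True)` corresponds to case 4, where the peer finished the shared round before terminating.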
In this way we modify the control algorithm for terminating components. Now we're able to deal correctly with crashing components.
|