# Synchronous Communication and Component Orchestration
## Introduction
The Reowolf runtime consists of a system that allows multiple components to run within their own thread of execution. These components are able to exchange messages with one another. Components are capable of creating other components, and of creating channels. We may visualise the runtime as a cloud of all kinds of components, connected by the communication channels between them, hence a kind of communication graph.
With this version of the runtime there were several main drivers. For performance reasons we want:
- As little centralized information as possible (because centralization of information implies synchronization to access it).
- As much parallelism as possible (when information *must* be exchanged, then make sure that as few components as possible are affected).
To keep the complexity of the runtime to a reasonable minimum, the following requirements have to be met as well:
- A component may terminate, therefore potentially not being able to participate in synchronous communication or receive messages. Experimentation showed that the system that ensures that termination is broadcast to all peers should be kept simple (earlier systems depended on a state-machine with assumptions about the order in which messages were exchanged, which greatly complicates the messaging subsystem of the runtime).
- Messages should arrive in order (with some exceptions). As we'll see later in this document we have different types of messages. Reasoning about a component's operational state becomes much simpler if we can assume that the transmission of messages between components is ordered.
As we will see there are several types of messages that can be exchanged. Among them we have:
- Data messages: these messages contain the data that is "transmitted from and to PDL code". For each `put` a data message is annotated by the runtime and sent along to the receiving component, which will then hopefully retrieve the data with a `get`. These messages are conceptually sent over channels.
- Sync messages: these messages are sent between components to communicate their consensus-state. These messages are not necessarily associated with channels.
- Control messages: these messages are sent between components to ensure that the entire runtime is reliably facilitating data exchange. That is: they ensure that the language is working as intended. As an example: sending a port to a different component requires a bit of bookkeeping to ensure that all involved components are aware of the port exchange.
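To make the distinction concrete, below is a minimal sketch of how these message categories could be represented; the type and field names are illustrative assumptions, not the runtime's actual definitions.

```rust
// Illustrative identifiers; the real runtime's types may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct PortId(u32);
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ComponentId(u32);

// The three broad categories of messages discussed above.
enum Message {
    // Data produced by a `put`, conceptually travelling over a channel.
    Data { source_port: PortId, payload: Vec<u8> },
    // Consensus state exchanged to agree on a synchronous round.
    Sync { source_component: ComponentId },
    // Bookkeeping that keeps the runtime itself consistent, e.g. the
    // messages involved in transferring a port to another component.
    Control { source_component: ComponentId },
}
```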
The remainder of this document tries to describe the various technical aspects of synchronous communication and component orchestration.
## Brief Description of Schedulers
Each component conceptually has its own thread of execution: at any given point in time it is executing a program with a particular memory state. In reality there are a limited number of threads that execute components. Scheduling components correctly relies on the fact that components generally execute programs that block at some point: a message needs to be received, or a port is blocked so no information can be sent. At that point a component is considered "sleeping". Should another component, scheduled on a particular thread, send a message to this sleeping component, then it is "woken up" by putting it into the execution queue.
The job of the scheduler is then to: execute a component scheduled for execution, wait until a component is scheduled, or shut down in case there are no more components to execute.
The details of the execution queue (currently rather simplistically implemented) are not of interest. What is of interest is that a component can only be scheduled once.
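This sleep/wake rule can be sketched with an atomic flag per component: a sender that finds the flag set wins the right to enqueue the component, which guarantees it is scheduled at most once. This is a simplified illustration under assumed names, not the actual scheduler code.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

// Illustrative component handle; only the scheduling state is shown.
struct Component {
    // `true` while the component is blocked and not present in the run queue.
    sleeping: AtomicBool,
}

// Deliberately simplistic run queue, in the spirit of the remark above.
struct RunQueue {
    queue: Mutex<VecDeque<Arc<Component>>>,
}

impl RunQueue {
    // Called by a sender after delivering a message: if the target was
    // sleeping, exactly one caller wins the compare-exchange and enqueues it,
    // so the component can never be scheduled twice.
    fn wake_up(&self, component: Arc<Component>) {
        let was_sleeping = component
            .sleeping
            .compare_exchange(true, false, Ordering::AcqRel, Ordering::Acquire)
            .is_ok();
        if was_sleeping {
            self.queue.lock().unwrap().push_back(component);
        }
    }
}
```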
## Creation of Channels
Within PDL code it is possible to create new channels. And so a component will always (that is to say: for now) create both endpoints of the channel, hence owns both endpoints of the channel upon creation. Identifiers for these ports are generated locally (we don't want to resort to synchronizing on some kind of global port ID counter).
As these IDs are generated locally there is no real technical challenge, but note that ports at different components may have the same port ID.
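Since IDs only need to be unique within the component that allocated them, a per-component counter suffices; a minimal sketch (names are assumptions):

```rust
// A port ID is only meaningful relative to the component that allocated it;
// two components may well hand out the same numeric ID.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct PortId(u32);

struct Component {
    // Purely local counter; no global synchronization required.
    next_port_id: u32,
}

impl Component {
    // Creating a channel allocates both endpoints locally, so the creating
    // component initially owns both of them.
    fn create_channel(&mut self) -> (PortId, PortId) {
        let putter = PortId(self.next_port_id);
        let getter = PortId(self.next_port_id + 1);
        self.next_port_id += 2;
        (putter, getter)
    }
}
```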
## Creation of Components
Within PDL code it is possible to create components. Upon their creation they can be given endpoints of channels. Hence at component creation we are changing the configuration of the communication graph. All of the relevant components need to be informed about the port changing ownership.
Here we run into a plethora of problems. The other endpoint might have been given away to another created component. The other endpoint may have already been used in communication, such that we already have messages underway for the port we're trying to give to a newly created component. We may have that the local port ID assigned by the creating component is not the same as the local port ID that the newly created component may want to assign to it. We may have that this port has been passed along multiple times already, etc.
We cannot help that messages have already arrived, or are in transit, for the transferred port. But we can make some assumptions that simplify the transfer of ports. As a first one, we have that the creating component decides when the created component is scheduled for execution. We'll choose not to execute it initially, such that we can be sure that it will not send messages over its ports the moment it is created. To further simplify the problem, we have assumed that messages arrive in order. So although messages might still be underway for the transferred ports, if we ask the sender to stop sending, and the sender blocks the port and acknowledges that it has received this command, then the moment the creator receives the acknowledgement it is certain that it has received all messages intended for the transferred ports. We'll also disallow the creation of components during synchronous interactions.
And so here we have our first control protocol. If a port is transferred then we might have:
1. That the peer port is transferred to the new component as well. All is fine and we can take care of the exchange immediately.
2. That the peer port stays with the creating component. Here all is fine as well, everything is running in a single thread of execution so we diligently do our bookkeeping on the data associated with the port and the channel and we can transfer the port.
3. The peer port is already owned by a different component. Here we need a slightly more complicated protocol.
In this last case we take the following actions, with `C` the creating component, `N` the newly created component, and `P` the peer component that holds the other port of the same channel.
1. `C` transfers the port to the newly created component `N`, and asks it to come up with a new ID for that port. The port had an ID that was decided by its old owner `C`, and now has one that is agreeable with component `N`.
2. `C` sends a control message `PeerChangeBlockPort` to the peer `P`.
3. `P` receives the `PeerChangeBlockPort` message. It causes the peer port to be temporarily blocked. `P` may still continue executing its code, but the moment it wishes to send something over this port it is forced to block its execution. In response `P` sends an `Acknowledge` message back to `C`.
4. `C` waits for the `Acknowledge` message from `P`. Since the `Acknowledge` message was sent after the last data message that `P` sent to the port under consideration, and because `P` has blocked that port, we are certain that we received all messages. We transfer these messages to `N`.
5. Note that there may be multiple ports being transferred from `C` to `N`, each with a potentially different peer. And so here `C` waits until steps 2 through 4 are completed for all of the transferred ports.
6. Once `C` has received all of the `Acknowledge` messages it was waiting for, it will send a `PeerChangeUnblockPort` message to each peer `P`. This message contains the new port ID, such that `P` can unblock its port, and continue sending messages over this channel, but now correctly arriving at `N`. Additionally, `C` will now schedule the new component `N`.
There is a bit of extra overhead here with the `PeerChangeBlockPort` -> `Acknowledge` -> `PeerChangeUnblockPort` sequence with respect to other possible schemes. But this one allows `P` to continue executing its code as long as it does not use its blocked port. It also ensures that messages arrive in order: they're all collected by `C`, given to `N`, and only then may `P` continue creating messages to `N`, hence arriving after the earlier messages have been handed off to `N`.
As we'll later introduce, ports can end up being closed. When a port is closed the above procedure is not initiated. A port cannot be reopened, and once a port is closed its peer is also notified that the channel cannot be used anymore. As such, it is not needed (and perhaps impossible, if the memory backing the peer component has already been freed) to notify that the port will change peer.
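A sketch of the control messages and the per-port state that the transfer protocol above requires is given below; the message and field names are illustrative assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct PortId(u32);
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ComponentId(u32);

// Control messages exchanged while transferring a port (illustrative names).
enum ControlMessage {
    // Step 2: creator C asks peer P to stop sending on its port.
    PeerChangeBlockPort { port: PortId },
    // Step 3: P confirms; every data message P sent earlier is already in
    // flight towards C, which forwards it to N.
    Acknowledge,
    // Step 6: C tells P who the new peer is, unblocking the port.
    PeerChangeUnblockPort {
        port: PortId,
        new_peer_component: ComponentId,
        new_peer_port: PortId,
    },
}

// Per-port bookkeeping at the peer P.
enum PortState {
    // Normal operation: `put` is allowed.
    Open { peer_component: ComponentId, peer_port: PortId },
    // Between PeerChangeBlockPort and PeerChangeUnblockPort: P may keep
    // executing, but a `put` on this port blocks until it is unblocked.
    Blocked,
    // The channel can never be used again; no transfer protocol is started
    // for closed ports.
    Closed,
}
```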
## Managing the Lifetime of Components
Components are created by other components or by the API. Note that the API may for all intents and purposes be viewed as a component. It may own and create channels, and it may create components. Components start, like a function, executing at the top of their program listing and may end in two ways. Either they encounter an unrecoverable error, or they neatly reach the end of their execution.
In either case we want to free up the memory that was occupied by the component during its execution. And we want to make sure that any kind of pending message/action is correctly handled. Apart from that, the component contains some kind of message inbox, whose memory is associated with that component. Hence we want to make sure that all peers are aware that they're no longer able to send messages to a terminated component.
A small interlude before we continue: trying to take care of unrecoverable errors that occur during a sync round by incorporating the appropriate logic into the consensus algorithm proved to be rather hard. It caused a large increase in the number of states a component could be in, and as a result made the code much harder to reason about. That is: not so much communicating that an error had occurred (that needs to occur in every synchronous algorithm), but keeping track of which messages can be sent to which component during the consensus algorithm.
For this reason there are two systems that make sure that components stay alive as long as needed. Firstly components will have a reference counter. For simplicity the component also holds a reference to itself. The component will remove this self-reference when it terminates. Each channel causes two components to also hold references to each other. *If* a consensus algorithm is implemented such that a central component ends up communicating with all participating parties (using channels or not using channels), then we can make sure that it can reach all participating components by incrementing their reference counts (note that this is not yet properly implemented in the runtime).
Through this mechanism the consensus algorithm can be greatly simplified. If a component encounters a critical error while already participating in a sync round, then it can notify the other participants, but it remains reachable to all other components until the round ends (and the reference counts are decreased again).
A second system is needed to ensure that a component actually exits (because all the peers hold a reference, and we need all of those references to drop to 0 to truly remove the component from the runtime). And so when a component exits it will send `ClosePort` messages to all of its peers. These peers will `Acknowledge` those messages and close the respective ports, meaning that they will no longer allow sending messages over those ports; doing so becomes a fatal error. However, messages that were sent to the exiting component before receiving the `ClosePort` message might still be underway. And so the terminating component will wait for all of the `Acknowledge` messages for all of its channels such that it knows that it has received all data messages. The component will respond to these intermediate messages with a `DataMessageFailed` message, indicating that the message arrived after the component had already terminated, hence the sender should consider this a failed message transmission.
Bringing these systems together: apart from data messages there might still be control messages in transit, or the exiting component might still have some control/sync work to do. And so we need to modify something we said earlier: instead of a component removing its self-reference the moment it terminates, we do this the moment we have received all the `Acknowledge` messages we were expecting. This way, if a component is busy creating another component, we're certain that the appropriate protocols are observed.
Concluding everything described above, two separate mechanisms will act in conjunction:
- A. The exiting component `E` waiting until it has finished notifying all peers `P` of its termination:
1. Component `E` sends a `ClosePort` message to each peer `P` for each of the shared channels (except when the port is already closed because `P` is also in the process of shutting down).
2. The peer `P` receives the `ClosePort` message and closes the port as a result. This is a change of port state that will cause any subsequent `put` operations on that port to become a fatal error for component `P`. In response the peer `P` sends an `Acknowledge` message to component `E`, unless component `P` exited itself (that is: it sent a `ClosePort` message to `E` before the `ClosePort` message from `E` arrived). After closing the port the component `P` will remove the reference to `E`.
3. Component `E` waits until all of its pending control operations are finished (i.e. waiting for the `Acknowledge` messages following `ClosePort` messages, `PeerChangeBlockPort` messages, etc.). Once all of these are finished, component `E` can no longer participate in any future control actions: it will not create channels/components itself and, since all of its ports are closed, the peers `P` will not send any further data or control messages.
4. Component `E` now checks its inbox for any remaining messages. It will respond to any data messages that arrived after `E` sending `ClosePort` and before `E` receiving `Acknowledge` with a `DataMessageFailed` message (except for ports that were closed before the `Acknowledge`). Then it removes the reference to itself, therefore decrementing the reference counter for the component by 1.
- B. The reference counting mechanism. Any sync round the exiting component `E` is participating in will conceptually hold a reference to `E`. The component `E` will always respond to sync messages as if it were alive (albeit trying to indicate to everyone that it is actually exiting). The component that removes the last reference to the component `E` (which may be `E` itself, but also a peer `P`) will truly remove the associated memory from the runtime.
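The two mechanisms can be summarised in a small sketch: a count of outstanding `Acknowledge` messages for mechanism A, and a reference counter for mechanism B. Names and structure are assumptions for illustration.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Illustrative shutdown bookkeeping for an exiting component E.
struct ExitState {
    // Mechanism A: pending control operations (e.g. ClosePort messages for
    // which no Acknowledge has arrived yet).
    pending_acks: u32,
    // Mechanism B: references held by E itself, by each peer P sharing a
    // channel, and conceptually by any sync round E still participates in.
    references: AtomicU32,
}

impl ExitState {
    // Called when an Acknowledge for one of E's pending control operations
    // arrives.
    fn acknowledge_received(&mut self) {
        self.pending_acks -= 1;
        if self.pending_acks == 0 {
            // All peers know about the termination and the remaining data
            // messages have been answered with DataMessageFailed; only now
            // does E drop its self-reference.
            self.drop_reference();
        }
    }

    // Whoever removes the last reference (E itself or some peer P) lets the
    // runtime reclaim the component's memory.
    fn drop_reference(&self) {
        if self.references.fetch_sub(1, Ordering::AcqRel) == 1 {
            // Last reference gone: the component can truly be removed.
        }
    }
}
```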
**Note**: We did not consider machine termination. That is to say: once we reach the runtime maturity where communication occurs over different machines, then we have to consider that machines encounter fatal errors. However these can only be resolved by embedding the possibility of failure inside the protocol over which these machines communicate.
## Sending Data Messages
So far we've discussed the following properties associated with sending data messages:
1. Port IDs are decided locally. So a peer may have an ID that is outdated for the intended recipient.
2. Ports can move from owner to owner. So a peer might have a component ID that is outdated for the intended recipient.
3. Ports may be closed.
4. Messages intended for specific ports may end up at an intermediate component that is passing that message along.
However, there are some properties that we can take advantage of:
1. When a component sends a message, it knows for certain what its own component ID and port ID is. So a transmitting port always sends the correct source component/port ID.
2. When a component receives a message, it knows for certain what its own component ID and port ID is. So once a receiving port receives a properly annotated message from a transmitting port, the receiving end can be certain about the component IDs and port IDs that make up the channel.
Note that although the message transmitter may not be certain about its final recipient, the components along the way *are* aware of the routing that is necessary to make the message arrive at the intended target. Coming back to the case where we have a creator `C`, new component `N` and peer `P`: `P` will send a message intended for `N`, but arriving at `C`. Here `C` can change the target port and component to `N` (as it is in the process of transferring that port, so knows both its original and new port ID). Once the message arrives and is accepted by the recipient then it is certain about the component and port IDs.
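These two observations suggest a message annotation in which the source identifiers are authoritative while the target identifiers are best-effort and may be rewritten by a relaying component; a sketch under assumed names:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct PortId(u32);
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ComponentId(u32);

// Annotation attached to a data message (illustrative field names).
struct DataHeader {
    // Always correct: the sender knows its own identity for certain.
    source_component: ComponentId,
    source_port: PortId,
    // Best effort: may be outdated if the receiving port has moved.
    target_component: ComponentId,
    target_port: PortId,
}

impl DataHeader {
    // Performed by the creator C for a message that arrived for a port it is
    // handing over to the new component N: C knows both the old and the new
    // identity of that port, so it can rewrite the target before forwarding.
    fn reroute(&mut self, new_component: ComponentId, new_port: PortId) {
        self.target_component = new_component;
        self.target_port = new_port;
    }
}
```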
## Sending Sync Messages
Sync messages have the purpose of making sure that consensus is reached about the interactions that took place in all of the participating components' sync blocks. The previous runtime featured speculative execution and a branching execution model: a component could exhibit multiple behaviours, and at the end all components decide which combination of local behaviours constitute a satisfying single global behaviour (if one exists at all). Without speculative execution the model can be a lot simpler.
We'll only briefly discuss the mechanisms that are present in the synchronization algorithm. A component has a local counter, reset for each synchronous interaction, that is used when transmitting data messages. Such a message will be annotated with the counter at `N`, after which the component sends the next message with annotation `N+1`. At the same time the component will keep track of a local mapping from port ID to port annotation; we'll call this the port mapping. Likewise, when a component receives a data message it will assign the received annotation in its own port mapping. If two components assign the same annotation to the ports that constitute a channel, then there is an agreeable interaction there.
And so at the end of the synchronous round a component will somehow broadcast its port mapping. Note from the discussion above that a transmitting port's annotation is only associated with that transmitting port, since a transmitting port can only truly ever know its own component/port ID. While the receiving port's annotation knows about the peer's component/port ID as well. And so a component can broadcast `(component ID, port ID, mapping)` for each of its transmitting ports, while it can broadcast `(own component ID, own port ID, peer component ID, peer port ID, mapping)` for each receiving port. Then a recipient of these mappings can match them up and make sure that the mappings agree.
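A sketch of what such a broadcast entry might look like, and of the agreement check a recipient can perform, is shown below; the shapes and names are assumptions rather than the runtime's actual wire format.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct PortId(u32);
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ComponentId(u32);

// One entry of the port mapping broadcast at the end of a sync round.
enum MappingEntry {
    // A putter only ever knows its own identity.
    Putter {
        component: ComponentId,
        port: PortId,
        annotation: u32,
    },
    // A getter that accepted a message also knows the peer's identity, so
    // the recipient of these broadcasts can pair the entries up.
    Getter {
        component: ComponentId,
        port: PortId,
        peer_component: ComponentId,
        peer_port: PortId,
        annotation: u32,
    },
}

// Two entries agree when the getter points back at the putter and both
// carry the same annotation.
fn entries_agree(putter: &MappingEntry, getter: &MappingEntry) -> bool {
    match (putter, getter) {
        (
            MappingEntry::Putter { component, port, annotation },
            MappingEntry::Getter { peer_component, peer_port, annotation: got, .. },
        ) => component == peer_component && port == peer_port && annotation == got,
        _ => false,
    }
}
```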
Note that this broadcasting of synchronous messages is essentially a component-to-component operation. However these messages must still be sent over ports anyway (and any port that was used to transmit messages to a particular receiving component will do). There are two reasons:
1. The sync message may need to be rerouted (e.g. a sender quickly fires both a data message and a subsequent sync message while the receiving port is being transferred to a new component), but needs to arrive at the same target as the data message. This is essentially restating that a transmitter never knows about the component ID of the recipient.
2. The sync message must not be taken into account by the recipient if it has not accepted any messages from the sender yet. Of course this can be achieved in various ways, but a simple way to achieve this is to send the sync message over ports.
## Annotating Data Messages
These port mappings are also sent along when sending data messages. We will not go into details but here the mapping makes sure that messages arrive in the right order, and certain kinds of deadlock or inconsistent protocol behaviour may be detected. This port mapping is checked for consistency by the recipient and, when consistent, the target port is updated with its new mapping.
As we'll send along this mapping we will only consider the ports that are shared between the two components. But in the most general case the transmitting ports of the component do not have knowledge about the peer component. And so the sent port mapping will have to contain the annotation for *all* transmitting ports. Receiving port mappings only have to be sent along if they received a message, and here we can indeed apply filtering. Likewise, if the recipient of a port mapping has not yet received anything on its receiving port, then it cannot be sure about the identity of the sender.
This leads to problems for ensuring the correct ordering of the messages, but not for finding consensus. Suppose that a port `a` sends a message to a port `b`. Port `b` does not accept it. Upon trying to find consensus we see that `a` will submit an entry in its port mapping, and `b` does not submit anything at all. Hence no solution can be found, as desired.
For the message ordering we require from the receiver that it confirms that, for all of the shared channels, it has the same mapping as the sender sends along. Suppose a component `A` has ports `a_put` and `b_put`, while a component `B` has ports `a_get` and `b_get` (where `a_put -> a_get` and `b_put -> b_get`). Suppose component `A` sends on `a_put` and `b_put` consecutively. And component `B` only receives from `b_get`. Then since `a_get` did not receive from `a_put` (hence did not learn that the component/port ID pair of `a_put` is associated with `a_get`), the component `B` cannot figure out that `a_get` should precede `b_get`. Likewise component `A` has no way to know that `a_put` and `b_put` are sending to the same component, hence it cannot indicate to component `B` that `a_get` should precede `b_get`.
There are some simple assumptions we can make that makes the problem a little bit easier to think about. Suppose again `a_put -> a_get` and `b_put -> b_get`. Suppose `a_put` is used first, where we send along the mapping of `a_put` and `b_put`. Then we send along `b_put`, again sending along the mapping. Then it is possible for the receiving component to accept the wrong message first (e.g. `b_get`, therefore learning about `b`), but it will be impossible to get from `a_get` later, since that one requires `b_put` (of which we learned that it matches `b_get`) to not have sent any messages.
Without adding any extra overhead (e.g. some kind of discovery round per synchronous interaction), we can take three approaches:
1. We simply don't care. It is impossible for a round where messages are received out of order to complete. Hence we temporarily allow a component to take the wrong actions, therefore wasting some CPU time, and to crash/error afterward.
2. We remove the entire concept of ordering of channels at a single component. Channels are always independent entities. This way we truly do not have to care. All we care about is that the messages that have been sent over a channel arrive at the other side.
3. We slightly modify the algorithm to detect these problems. This can be done in reasonable fashion, albeit a bit "hacky". For each channel there is a slot to receive messages. Messages wait there until the receiver performs a `get` in the PDL code. So far we've only considered learning about the component/port IDs that constitute a channel the moment they're received with a `get`. The algorithm could be changed to already learn about the peer component/port ID the moment the message arrives in the slot.
We'll go with the last option in the current implementation. We return to the problematic example above. Note that messages between components are sent in ordered fashion, and `a_put` happens before `b_put`. Then component `B` will first learn that `a_put` is the peer of `a_get`, then it performs the first `get` on the message from `b_put` to `b_get`. This message is annotated with a port mapping indicating that `a_put` has been used before. We're now able to detect at component `B` that we cannot accept `b_get` before `a_get`.
Concluding:
- Every data message that is transmitted needs to contain the port mapping of all `put`ting ports (annotating them appropriately if they have not yet been used). We also need to include the port mapping of all `get`ting ports that have a pending/received message. The port mapping for `put`ting ports will only include their own ID, the port mapping for `get`ting ports will include the IDs of their peer as well.
- Every arriving data message will immediately be used to identify the sender as the peer of the corresponding `get`ter port. Since messages between components arrive in order, this allows us to detect when the `put`s at the sender are in a different order than the `get`s at the receiver, as illustrated in the sketch below.
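A sketch of the receive-side check implied by these two points: the peer of a getter port is recorded as soon as a message lands in its slot, and before accepting a message the attached sender mapping is compared against what has been accepted locally. All names are illustrative assumptions; the real implementation differs in detail.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct PortId(u32);
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct ComponentId(u32);

// The sender's view of one of its putter ports, attached to every data
// message; `annotation` is None if the port has not been used yet.
struct PutterEntry {
    component: ComponentId,
    port: PortId,
    annotation: Option<u32>,
}

// Receive-side state for one getter port.
struct GetterSlot {
    // Learned as soon as a message arrives in the slot (option 3 above),
    // not only once the message is retrieved with a `get`.
    known_peer: Option<(ComponentId, PortId)>,
    // Annotation of the last message accepted with a `get`, if any.
    accepted: Option<u32>,
}

// Before accepting a message on `getting_from`, inspect the sender's mapping:
// if another putter of that sender was used earlier and our matching getter
// has not caught up to that annotation, the `get`s are out of order.
fn get_is_in_order(
    getting_from: PortId,
    sender_mapping: &[PutterEntry],
    slots: &HashMap<PortId, GetterSlot>,
) -> bool {
    for entry in sender_mapping {
        for (local_port, slot) in slots {
            if *local_port == getting_from {
                continue; // this is the message we are accepting right now
            }
            if slot.known_peer == Some((entry.component, entry.port))
                && entry.annotation.is_some()
                && slot.accepted != entry.annotation
            {
                return false;
            }
        }
    }
    true
}
```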
## Handling Fatal Component Errors
Components may, during their execution, encounter errors that prevent them from continuing executing their code. For the purposes of this chapter we may consider these to occur during two particular phases of their execution:
1. The error occurred outside of a sync block. Or equivalently (from the point of view of the runtime): the error occurred inside a sync block, but the component has not interacted with other components through `put`/`get` calls.
2. The error occurred inside of a sync block. The component can have performed any number of `put`/`get` calls. But for the sake of discussion we will only discuss the case where we perform:
1. One `put` in the synchronous round.
2. One `get` in the synchronous round.
As a preliminary remark: note that encountering an error is nothing special: the component can simply print an error to `stdout` and stop executing. The handling of the error by peers is of importance! If an interaction is made impossible because a peer has stopped executing, then the component that wishes to perform that interaction should error out itself!
### Handling Errors Outside of a Sync Block
If a component `E` encounters a critical error outside of a sync block, then we can be sure that its last synchronous round, if there was one, succeeded. However, there might be future synchronous rounds for component `E`; likewise a peer component `C` might have already put a message in `E`'s inbox.
The requirement for the outside-sync error of `E` is that any future sync interactions by `C` will fail (but, if `C` has no future interactions, it shouldn't fail either!).
Note that `E` cannot perform `put`/`get` requests, because we're assuming `E` is outside of a sync block. Hence the only possible failing interaction is that `C` has performed a `put`, or is attempting a `get`. In the case that `C` `put`s to `E`, `E` might not have figured out the identity of `C` yet (see earlier remarks on the eventual consistency of peer detection). Hence `C` is responsible for ensuring its own correct shutdown due to a failing `put`. Likewise for a `get`: `C` cannot receive from `E` if it is failing. So if `C` is waiting on a message to arrive, or if it will call `get` in the future, then `C` must fail as well.
In this case it is sufficient for `E` to send around a `ClosePort` message, as detailed in another chapter of this document. However, a particular race condition might occur. We have assumed that `E` is not in a sync block, but `C` is not aware of this fact. `C` might not be able to distinguish between the following three cases:
1. Regular shutdown: Components `C` and `E` are not in a sync round.
- `E` broadcasts `ClosePort`.
- `C` receives `ClosePort`.
2. Shutdown within a sync round, `ClosePort` leads `Solution`: A leader component `L`, peer component `C` and failing component `E`. Assume that all are/were busy in a synchronous round with one another.
- `L` broadcasts `Solution` for the current sync round.
- `E` receives `Solution`, finishes round.
- `E` encounters an error, so sends `ClosePort` to `C`.
- `C` receives `ClosePort` from `E`.
- `C` receives `Solution` from `L`.
3. Shutdown within a sync round, `Solution` leads `ClosePort`: Same components `L`, `C` and `E`.
- `L` broadcasts `Solution` for the current sync round.
- `E` receives `Solution`, finishes round.
- `E` encounters an error, so sends `ClosePort` to `C`.
- `C` receives `Solution` from `L`.
- `C` receives `ClosePort` from `E`.
In all described cases `E` encounters an error after finishing a sync round. But from the point of view of `C` it is unsure whether the `ClosePort` message pertains to the current synchronous round or not. In case 1 and 3 nothing is out of the ordinary. But in case 2 we have that `C` is at a particular point in time aware of the `ClosePort` from `E`, but not yet of the `Solution` from `L`. `C` should not fail the sync round, as it is completed, but it is unaware of this fact.
As a rather simple solution, since components that are participating with one another in a sync round move in lock-step at the end of the sync block, we send a boolean along with the `ClosePort`, e.g. `ClosePort(nonsync)`. This boolean indicates whether `E` was inside or outside of a sync block during it encountering an error. Now `C` can distinguish between the three cases: in all cases it agrees that `E` was not in a sync block (and hence: the sync round in cases 2 and 3 can be completed).
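A sketch of this disambiguation: the close notification carries a flag recording whether `E` was inside a sync block, and the peer uses it to decide whether its current round can still complete (the in-sync case is elaborated in the next section). Names are assumptions; the real message layout may differ.

```rust
// Illustrative close notification: the flag records whether the exiting
// component E was inside a sync block when it encountered the error.
enum ControlMessage {
    ClosePort { port: u32, exited_inside_sync_block: bool },
    Acknowledge,
}

// Decision taken by the peer C upon receiving the notification.
enum PeerAction {
    // The port is closed for future use, but the round C is currently in
    // (if any) may still complete: E finished its last round successfully.
    CloseAndContinue,
    // E failed inside a sync block while C has used the now-closed port in a
    // round without consensus: C must fail that round as well.
    FailCurrentRound,
}

fn handle_close_port(
    used_port_in_unfinished_round: bool,
    exited_inside_sync_block: bool,
) -> PeerAction {
    if exited_inside_sync_block && used_port_in_unfinished_round {
        PeerAction::FailCurrentRound
    } else {
        PeerAction::CloseAndContinue
    }
}
```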
### Handling Errors Inside of a Sync Block
If `E` is inside of a sync block, then it has interacted with other components. Our requirement now is that the sync round fails (and of course, that all of the peers are notified that `E` will no longer be present in the runtime). There are two things that complicate this type of failure:
1. Suppose that in the successful case of the synchronous interaction, there are a large number of components interacting with one another. Now it might be that `E` fails very early in its sync block, such that it cannot interact with several components. This lack of interaction might cause the single sync block to break up into several smaller sync blocks. Each of these separated regions is supposed to fail.
2. Within a particular synchronous interaction we might have that the leader `L` has a reference to the component `E` without it being a direct peer. There is a reference counting system in place that makes sure that `L` can always send messages to `E`. But we still need to make sure that those references stay alive for as long as needed.
Suppose a synchronous region is (partially) established, and the component `E` encounters a critical error. The two points given above imply that two processes need to be initiated. For the first error-handling process, we simply use the same scheme as described in the case where `E` is not in a synchronous region. However now we broadcast `ClosePort(sync)` instead of `ClosePort(nonsync)` messages. Consider the following two cases:
1. Component `C` is not part of the same synchronous region as `E`. And component `C` has tried `put`ting to `E`. If `C` receives a `ClosePort(sync)`, then it knows that its interaction should fail. Note: it might be that `E` wasn't planning on `get`ting from `C` in the sync round in which `E` failed, but much later. In that case it still makes sense for `C` to fail; it would have failed in the future. A small inconsistency here (within the current infinitely-deadlocking implementation) is that if `E` would *never* `get` from `C`, then `C` would deadlock instead of crash (one could argue that this implies that deadlocking should lead to crashing through a timeout mechanism).
2. Component `C` is not part of the same synchronous region as `E`. And if `E` wouldn't have crashed, then it would've `put` a message to `C`. In this case it is still proper for `C` to crash. The reasoning works the same as above.
So that is to say that this `ClosePort(sync)` causes instant failure of `C` if it has used the closed port in a round without consensus, or if it uses that port in the future. Note that this `ClosePort(sync)` system causes cascading failures throughout the disjoint synchronous regions. This is as intended: once one component's PDL program can no longer be executed, we cannot depend on the discovery of all the peers that constitute the intended synchronous region. So instead we rely on a peer-to-peer mechanism to make sure that every component is notified of failure.
However, while these cascading peer-to-peer `ClosePort(sync)` messages are happily shared around, we still have a leader component somewhere, and components that have not yet been notified of the failure. Here we can make several design choices to