diff --git a/docs/runtime/design.md b/docs/runtime/design.md deleted file mode 100644 index 13bf6de648fc5dec6de45bbf30cdfc6eac18a1d3..0000000000000000000000000000000000000000 --- a/docs/runtime/design.md +++ /dev/null @@ -1,15 +0,0 @@ -# Runtime Design - -## Preliminary preliminaries - -There will be some confusion when we're using the word "synchronization". When we talk about OS-syncing we mean using synchronization primitives such as atomics, mutexes, semaphores etc. When we talk about sync-blocks, sync-regions, etc. we mean the Reowolf language's distributed consensus feature. - -## Preliminary Notes - -The runtime was designed in several iterations. For the purpose of documentation, we have had: - -- Reowolf 1.0: A single-threaded, globally locking runtime. -- Initial 1.2: A single-threaded runtime, no longer globally locking. The newly designed consensus algorithm worked quite well and reasonably efficiently (not measured by comparison to another runtime, rather, the idea of the consensus algorithm was simple and efficient to perform) -- Multithreaded 1.2, v1: Here is where we moved towards a more multithreaded design. From the start, the idea was to "maximize concurrency", that is to say: we should only use OS-syncing when absolutely appropriate. Furthermore the initial implementation should be somewhat efficient: we should not employ locks when it is not absolutely necessary. The following remarks can be made with respect to this initial multithreaded implementation: - - Because there will generally be far more components than there are hardware threads, an efficient implementation requires some kind of scheduler that is able to execute the code of components. To track which components are supposed to run, there will be a work queue. The initial implementation features just a single global work queue. Each thread that is executing components is called a scheduler in this document. - - At the most basic level, a component has properties that allow access by only one writer at a time, and properties that are conceptually (ofcourse we need some kind of OS synchronization) accessible by multiple writers at a time. At the very least, executing the code of a component should only be performed by one writer at a time (the scheduler), while sending messages to a component should be allowed by multiple schedulers. Hence the runtime splits component properties into two: those that should only be accessed by the scheduler that is executing the code, and those that should be accessible by all schedulers at any time. \ No newline at end of file diff --git a/docs/runtime/runtime.md b/docs/runtime/runtime.md new file mode 100644 index 0000000000000000000000000000000000000000..f8eeb89ad3feaef71ef8ba402f41ff97e2549f77 --- /dev/null +++ b/docs/runtime/runtime.md @@ -0,0 +1,243 @@ +# Runtime Design + +## General Architecture + +Roughly speaking the project consists of the following parts: + +1. + The compiler itself. This transforms the PDL source code into an executable format. +2. + The interpreter. This takes the executable format and executes it. It is a very unoptimized AST walker based on stack frames. Generally speaking the bottommost frame in the stack contains the code and memory associated with a component. + + Once the interpreter hits a point where it needs to interact with the runtime (generally in order to communicate with another component) it will halt and emit a signal to the runtime. +3. + The runtime. 
This is the code that keeps track of the components, decides when they can run, determines where their messages should end up, and brings the various control algorithms (running behind the scenes) to completion.

We'll not go into points 1 and 2 in this document. One may simply assume that at the language level there is support for all of the things that are implemented in the runtime: sync blocks, channels, ports, component creation, sending and receiving messages, etc.

Once such features are encountered in the interpreter, a signal will be emitted to the runtime. Handling these signals is discussed in this document.

## Runtime Architecture

The runtime's code essentially consists of:

- The machinery that keeps track of the components. That is to say: there is some memory reserved for each component, and using some kind of component ID we can look up this memory in a registry. If the runtime finds that there are no more user-controlled components running (i.e. the ones that a programmer uses to create more components) and no more regular components running, then the runtime shuts down.
- A bit of shared memory for all of the OS threads that will be managed by the runtime. This mainly consists of a work queue. This work queue contains the identities of the components that are scheduled for execution.
- A set of scheduler threads. They attempt to retrieve work from the work queue. This work is in the form of component IDs, which they use to retrieve the associated component and run its PDL code. They do this by invoking the interpreter on the component's current execution and memory state. Once a runtime signal (as mentioned above) is emitted by the interpreter, the scheduler thread will deal with it appropriately.
- An auxiliary polling thread. Not of great importance, so a short description suffices: although most components react to one another, some components (e.g. a TCP socket) might have nothing to do until the OS instructs them to do something. This polling thread ensures that once there is something to do, the component is put back onto the work queue.

As long as the runtime doesn't shut down there will be `T` threads executing `N` components. A component can either be running (being executed by a scheduler thread), scheduled for execution (its ID is in the work queue), or sleeping. All of these states are exclusive. Maintaining this exclusivity is of great importance: we never want to end up in a situation where two threads are both modifying the code/memory state of the same component!

A component starts its lifecycle by being put into the work queue (not exactly true, but this will be dealt with later) by the component that created it. At some point a scheduler will pick up the newly created component's ID from the work queue and start executing its code. Once running (and we'll exclude fatal errors for now) the component may at some point reach the end of its code and terminate, or it may encounter a place in its code where it blocks and needs to wait for external input (e.g. it performed a `get`, but a message has not arrived yet).

Once the execution of a component is blocked, it will attempt to go to sleep. Meaning: it will be in a state where it is not running, but also not scheduled for execution. A component may then be woken up by a different component, or by the polling thread, by sending the sleeping component a message.
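To make this concrete, here is a minimal sketch (in Rust, with illustrative names: `WorkQueue`, `RunResult` and the like are not the runtime's actual types) of how a scheduler thread might pull component IDs from the shared work queue and react to the signals emitted by the interpreter:

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};

// Illustrative only: the real runtime's types and names will differ.
type ComponentId = u32;

/// A minimal blocking work queue holding the IDs of components that are
/// scheduled for execution.
struct WorkQueue {
    queue: Mutex<VecDeque<ComponentId>>,
    signal: Condvar,
}

impl WorkQueue {
    fn push(&self, id: ComponentId) {
        self.queue.lock().unwrap().push_back(id);
        self.signal.notify_one();
    }

    /// Blocks until some component is scheduled for execution.
    fn pop(&self) -> ComponentId {
        let mut queue = self.queue.lock().unwrap();
        loop {
            if let Some(id) = queue.pop_front() {
                return id;
            }
            queue = self.signal.wait(queue).unwrap();
        }
    }
}

/// What running a component's PDL code until the next runtime signal might
/// report back to the scheduler.
enum RunResult {
    BlockedOnGet, // component must sleep until a message arrives
    Exited,       // component terminated, its memory can be reclaimed
}

/// The body of one scheduler thread: take a component ID from the work
/// queue, run its code, and handle the resulting runtime signal.
fn scheduler_loop(work_queue: Arc<WorkQueue>) {
    loop {
        let id = work_queue.pop();
        match run_component(id) {
            RunResult::BlockedOnGet => { /* component tries to go to sleep */ }
            RunResult::Exited => { /* run the shutdown protocol */ }
        }
    }
}

fn run_component(_id: ComponentId) -> RunResult {
    // Placeholder for invoking the interpreter on the component's
    // execution and memory state.
    RunResult::BlockedOnGet
}
```

The condition variable is only there to let idle scheduler threads block instead of spin; the actual queue implementation may look quite different.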
To prevent a sleeping component from being scheduled multiple times, the memory of a component contains an atomic `sleeping` boolean.

It is instructive at this point to roughly explain how components are stored in memory. A component's memory is roughly divided into two regions: a publicly accessible memory region and a private memory region. The public part is accessible by all scheduler threads, so all of the memory in the public region sits behind a lock, or some kind of locking mechanism (and we will include the concept of atomics in "some kind of locking mechanism" for now). Hence the aforementioned `sleeping` boolean lives in this region. Conversely, the private memory region of a component is only accessed by the scheduler thread that is running the component. So here we store things like the memory and execution state of its code.

Returning to the idea of a component wishing to enter the "sleeping" state, the procedure in pseudocode is written as:

```
func go_to_sleep(this_component) {
    // We are currently executing, so our sleeping flag MUST be `false`
    assert(atomic::load(this_component.sleeping) == false);
    atomic::store(this_component.sleeping, true); // we force the flag to true
    // Note that while we were trying to go to sleep, someone may have sent us
    // a new message without seeing that we stored `true` in the sleeping
    // flag, so we need to check the inbox ourselves.
    if messages_in_inbox() {
        // We try to set the flag back to false, but another scheduler thread
        // may have already done this
        let swap_success = atomic::cas(this_component.sleeping, true, false);
        if swap_success {
            put_in_work_queue(this_component.id);
        }
    }
}
```

Similarly, each time we try to send a component a message, we must do the following:

```
func send_message_to(target_component, message_data) {
    put_message_in_inbox_locked(target_component.inbox, message_data);
    let swap_success = atomic::cas(target_component.sleeping, true, false);
    if swap_success {
        put_in_work_queue(target_component.id);
    }
}
```

Note that, because we cannot predict how the OS threads are scheduled, we can also not predict the way in which our own schedulers (which are running on OS threads) will schedule the execution of the components. Hence, as a mental model, one may assume that each component is running in its own thread. The system above ensures that if a component has something to do (because it has received a message), it will eventually end up being executed by a scheduler. The code for the "sleeping" state ensures that a component can only be executed by one scheduler at a time.

## General Messaging

With this rough architecture, components can send each other messages. One will find three kinds of messages in the runtime (okay, four, but the last one is just there to make the OS polling thread work):

1. Data messages: these are the messages that are sent by `put` calls and received with `get` calls. As will be explained later in this document, more information is attached to data messages than the values given as arguments to the `put` call. These messages will arrive in the target component's inbox. When the target component performs a call to `get` they're actually pulled out of the inbox.
2. Control messages: to facilitate certain operations permitted within PDL, the scheduler thread may decide to send messages to other components. These messages are called control messages.
We'll encounter them later in this document when describing component creation, transmitting ports, and closing channels. Different from data messages, which may linger in the component's inbox for a while, control messages are handled by the receiving component immediately. This is important for various control algorithms. +3. Sync messages: to facilitate the consensus algorithm, there will be messages initiated by the scheduler thread as well. That is to say: when the component sends data messages there will be information attached to the data message that facilitates a successful sync round. But apart from that when the components are all done executing their code they must somehow reach consensus. This is done through these sync messages. + +Note that already the concept of a channel, each having its own little slot (or limited buffer), becomes a superfluous design decision for the runtime. This is because the scheduler threads themselves also need to be able to send messages to other components. Hence we'll need something more generic than a per-channel buffer, namely a generic message inbox. + +As we'll later also see, the concept of a directional channel *may* be a useful tool within PDL code (and I'm not arguing against such a concept), but as we'll see throughout this document the control messages can (conceptually) flow both ways over the channel. + +## Runtime Design Drivers + +Most design choices in the runtime were based on the fact that the Reowolf language should facilitate easier programming over the internet. So although the entire system currently only considers components running on a single machine, those components were conceptually regarded as living on different machines. In other words: components were conceptually considered not to have shared memory they can both access. + +This has some implications for channels. Certainly a component that sends a value must have it stored somewhere temporarily, and the component receiving it needs to keep it around as well. But the channel is not an entity that you'll find in memory. Rather there is one component owning one port, and when a message is `put` through it, it will arrive at another component owning the peer port; there is no shared memory that represents a message flowing through a channel. + +A multi-machine runtime also requires the runtime to embrace the asynchronous nature of components. `put`s are non-blocking and can be performed one after the other before the peer has performed a corresponding `get`. The language does not contain the concept of a "lock" such that two components can agree on who owns a shared bit of memory. Rather each component is executed in its own thread of execution, and for multiple components to coordinate their actions they must use the messaging facilities. In order to make this coordinates somewhat simple to reason about one of the design drivers of the runtime was to ensure that each message sent in a specific order from one component to another will arrive in that same order. + +And so we have a multi-machine runtime where components running in their own thread can only coordinate through messages. As a result an ever-important in designing internal (control) algorithms is to consider something called "message crossing". Two components may decide to initiate a protocol at the same time, hence send eachother the exact same protocol-initiating message. + +Yet another result is that we decided to design the runtime without any globally unique component and/or port IDs. 
Certainly: on a single machine a component IDs allows one to retrieve a component's memory. But when sending a message to a component living on another machine, it may well be that we're sending it to a through a port that has the same port ID as ours, and targeting a component that has the same ID as ours. + +## Control Algorithms + +We'll now discuss several of the control algorithms. These control algorithms may be initiated by the scheduler threads when certain runtime signals are emitted by the interpreter, and are executed by sending messages. We'll talk about these messages as if they're sent from component to another component. For certain reasons (introduced later) these messages may be sent to ports of peer components, but that does not mean that the peer can `get` these messages: they are control messages that are intercepted and handled by a bit of code that lives above the interpreter. + +### Changing Port Peers due to Component Creation + +Components, when they're not in sync mode, may decide to create new components. Ports may be the arguments to these newly created components. The rule we place on such a kind of port transfer is that the component that is creating the new component fully relinquishes ownership of the transferred port, and after the component creation the new component owns that port. As an annotated example: + +``` +comp creator(in one_port) { + channel another_port -> and_a_final_one; + sync { + auto value = get(one_port); // legal + put(another_port); // legal + } + // perform component creation + new some_other_component(one_port, and_a_final_one); + + sync put(one_port); // illegal! Port was transferred +} +``` + +We have several runtime properties to uphold when we're transferring ports: + +- No globally unique port IDs, so the new component is allowed to choose new port IDs for the ports it is adopting ownership of. +- The peers of the transferred ports may be unaware that a new component is created. In fact those peers may have already transferred messages to the instantiating component! As a design decision (the one that we find makes sense) any incoming, but unread, messages for a port are transferred along to the new component. +- Similarly to above: a peer of a transferred port needs to be aware at some point that its peer port has changed owner. +- Together with the requirement that peers need to be aware of the transferred ports, we also need to maintain ordering in the sent messages that intend to arrive at that transferred port at some point in time. + +Here we see the asynchronous nature of the runtime rear its head. Because the transferring of ports does not just happen to the receiving end of a port (hence message only eventually arrive at their correct destination). It may happen to the transmitting end of a port as well. What this means for the receiver is that it is never sure which component is its peer until it has recevied a data message that is annotated with the origin of the message. At that moment in time the peer of the port is known, but only until the end of the synchronous round. Because after the synchronous round it is perfectly possible for the port to be passed around again. + +For all of the requirements above, the internal control algorithm to transfer a port to a new component is as following: + +1. The component that is creating the new component (we'll call the creator the instantiator component, and the created the new component) temporarily has access to the private memory of the new component. 
Reason being is that a component is always created on the same machine as the instantiator component. And so the first step it takes is to create new port IDs (that make sense for the newly created component, instead of for itself) and map the old port IDs to the new ones. +2. The component transfers all of the metadata associated with the port, and transfers all of the messages that are targeted at those transferred ports to the new component. +3. For each transferred port the instantiator sends a `PortPeerChanged_Block` control message to the peers. This message instructs the peer that the port should be temporarily blocked. Any component that tries to send a message through that port enters a blocked state that can only be lifted if the corresponding `PortPeerChanged_Unblock` control message is sent. At the same time the instantiator sets up a special bit of code that will relay all incoming messages from that peer to the new component. We've mentioned earlier that all messages will have a target port. So when messages arrive at the instantiator component that need to be relayed, the instantiator component will modify the target port to the new component's chosen port ID. +4. Once a peer has received a `PortPeerChanged_Block`, it will, as stated above, stop sending messages over that channel. Not only data messages, but control messages as well. This also means that if the other component cannot start transferring ports itself. In any case, it will respond with an `Ack`nowledgement back to the instantiator component. +5. The instantiator component waits until it has received an `Ack` for all of the `PortPeerChanged_Block` message it has sent. This is done such that we're sure that we've received all of the messages that are actually intended for the new component (because while the new component is being created the peer may still be sending messages intended for the new component, but sent to the instantiator component). As a result, the new component will have all of the data messages in the inbox in the order in which they were sent, therefore maintaining the runtime property of message ordering. +6. When all of the `Ack`s are received, the instantiator component will remove the bit of code that relays all of the messages and will schedule the new component for execution. At this point the instantiator component will no longer access the private memory of the new component. Since the instantiator component is aware of the new component's ID and the new port IDs for all of the transferred ports, it will send `PortPeerChanged_Unblock` messages to all of the peer components. This message will also contain the new component's ID and its port ID. +7. The peers, upon receiving the `PortPeerChanged_Unblock` message, will update the metadata of their ports such that they point to the new components IDs. They will also unblock the port such that messages can be sent again. + +With this control algorithm, all peers are now are of the new port's position. We've also maintained message ordering for the new component. Although it was mentioned in point (4), we'll mention it here to be extra clear: creating a new component will be blocked until all of the transferred ports are unblocked. If we don't do this a data/control message may end up at the wrong component. + +Likewise we see the asynchronous nature of ports: the peers are eventually consistent. To maintain this consistence it is important to modify port IDs when we're relaying messages. 
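To summarize the exchange above, a sketch of what these control messages and the instantiator's temporary relaying might look like (Rust-flavoured, with illustrative names; the runtime's actual types will differ):

```rust
use std::collections::HashMap;

// Illustrative IDs; the real runtime has its own representation.
type ComponentId = u32;
type PortId = u32;

/// Control messages used while transferring ports to a newly created
/// component (a sketch of the protocol described above).
enum ControlMessage {
    /// Sent by the instantiator to each peer of a transferred port:
    /// "stop sending on this port until you hear otherwise".
    PortPeerChangedBlock { port: PortId },
    /// The peer's acknowledgement of the block request.
    Ack { port: PortId },
    /// Sent once the new component is set up: the peer may resume sending,
    /// now using the new component/port IDs as the channel's other side.
    PortPeerChangedUnblock {
        port: PortId,
        new_peer_component: ComponentId,
        new_peer_port: PortId,
    },
}

/// While the transfer is in progress, the instantiator relays messages that
/// still arrive for a transferred port, rewriting the target port ID to the
/// ID chosen for the new component.
struct PortRelay {
    new_component: ComponentId,
    /// old port ID (instantiator's) -> new port ID (new component's)
    port_mapping: HashMap<PortId, PortId>,
}

impl PortRelay {
    /// Returns the destination to forward to, if the target port was
    /// transferred; `None` means the message is really meant for us.
    fn reroute(&self, target_port: PortId) -> Option<(ComponentId, PortId)> {
        self.port_mapping
            .get(&target_port)
            .map(|new_port| (self.new_component, *new_port))
    }
}
```

Blocking the peers before the new component starts running is what preserves the message-ordering guarantee: everything that was already in flight is funneled through the instantiator (and relabelled) before a peer is allowed to target the new component directly.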
+ +### Shutting Down Components + +A component will require a bit of memory to run. So when we're done executing a component (either because it has crashes, or because its program has terminated) we would like to release this memory again. Earlier we mentioned that components send messages by accessing an inbox in the public memory region of a component. This memory will, ofcourse, be freed as well. So we need to make sure that when a component shuts down, that all of its peers will somehow be notified that they can never send messages to that terminated component again. + +In order to do so we have another control protocol. We'll extend this protocol when we discuss encountering crashing components, but we'll introduce a simpler variant here. The protocol is relatively simple. For each of the ports that are not yet closed and are owned by the terminating component we will: + +1. Make sure that the port is not blocked. If the port is blocked then the component blocks until the associated port is becomes unblocked. If the port is already closed then we do not execute the other steps in this control algorithm. +2. The port will send a `ClosePort` message to the peer of the port that is closing. Note that this `ClosePort` message will have a target port. If it happens to be that the terminating component will receive a `PortPeerChanged_Block` message for that port in the near future, we're certain that the `ClosePort` message will at least arrive at the correct peer (since the target port will be used to relay that message to the correct receiver). +3. The peer of the port, upon receiving a `ClosePort` message, will mark the port as being closed in its metadata. From that point onwards, any attempt to `put` or `get` on that port will result in the peer component crashing. In response to the `ClosePort` message, the peer component will send an `Ack`. There is one exception, and that is when the peer component itself already initiated a `ClosePort` control algorithm for that port. In that case the incoming `ClosePort` message is treated like an `Ack`. +4. The terminating component will wait until all of the `Ack`s have arrived (or crossing `ClosePort` messages, as stated in point (3)). Once they do, they will instruct the runtime to remove the component from memory. + +Note here that we have to be careful and send annotate the `ClosePort` message with the target port. We'll wait if the port is already blocked, but it may be that we have the `ClosePort` message crossing with a `PortPeerChanged_Block` message. Which implies that our `ClosePort` message will be relayed by the peer component. + +### Transferring Ports through Data Messages + +The PDL code allows for ports to be transferred through ports. As a simple example, consider the following code: + +``` +struct Pair { + in command, + out response, +} + +comp some_component( + in to_transmit, + out> tx_one_port, + out tx_two_ports +) { + // Transmitting a port directly + sync put(tx_one_port, to_transmit); + + // Transmitting multiple ports at a time using a data structure + channel command_tx -> command_rx; + channel response_tx -> response_rx; + + sync { + let message = Pair{ + command: command_rx, + response: response_tx, + }; + put(tx_two_ports, message); + } + + // Sending a command and receiving a response + sync { + put(command_tx, true); + auto response = get(response_rx); + } +} +``` + +To facilitate this, we'll follow roughly the same procedure as when we're transferring ports to a newly created component. 
But we have one complication: we do not have direct access to the private memory of the component we're sending the ports to (we'll call this component the "adopting component", and the sending component the "relinquishing component"). And so we'll have to follow a control protocol that is slightly different. + +Note that it is perfectly okay to send closed ports. The adopting component will receive this component together with the information that the port is closed. In this way, if the adopting component attempts a `put` or `get` on that received component, it will crash. + +We'll enforce a second rule upon transferring ports. Namely that ports transferred in a synchronous round may not have been used in `get` or `put` operations. I'm certain that it is possible to come up with a set of rules that will make this possible. But the protocol for transferring components over channels is a lot easier if we disallow this. For this reason we'll introduce a field in the metadata for each port that registers when the port was last used. If the relinquishing component attempts to transfer a port that has been used, then it will crash. + +As for transferring ports to a newly created component, we want to ensure that all of the messages intended for that port arrive in the correct order at the adopting component. + +That said: the control protocol for transmitting ports proceeds as following: + +1. The relinquishing component will first make sure that none of the ports are blocked. If the ports are blocked then it will sleep until the ports become unblocked. As stated above the relinquishing component will also make sure that the ports were not previously used within the synchronous round. +2. The relinquishing component will send `PortPeerChanged_Block` message to all of the peers of the ports that will be transferred. However, in this case it will not relay any messages to the new component, they will still pile up in the relinquishing component's inbox. +3. The peers, upon receiving the `PortPeerChanged_Block` message, will proceed as they would in the case where ports werer transferred to a new component: they'll block the port and send an `Ack`. +4. The relinquishing component will wait until all of the expected `Ack` message are received. Once they are received the component will wait until the port the message will travel through becomes unblocked (that is: the port that is used to transfer the ports to the adopting component). +5. The relinquishing component will send the data message containing the transferred ports to the adopting component. It will annotate this message with a list containing `(tranferred port ID, peer component ID, peer port ID)` triples. Note that since those peer ports are blocked, they will not be transferred in the meantime. This is essential for the next step. +6. The adopting component will receive the annotated data message containing the transferred ports. For each transferred port it will decide upon a new port ID. +7. The adopting component will, for each adopted port, send out a `PortPeerChanged_Unblock` message to the blocked peer ports. This message will be annotated with the `(adopting component ID, new port ID)` pairs. Such that the peers all know where the peers can be found. + +## Dealing with Crashing Components + +A component may at any point during its execution be triggered to crash. This may be because of something simple like an out-of-bounds array access. But as described above using closed ports may lead to such an event as well. 
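Both the transfer rules above and the crash handling discussed below lean on a small amount of per-port bookkeeping. A rough sketch of what this metadata might look like (illustrative names, not the runtime's actual definitions):

```rust
// Illustrative IDs; the real runtime has its own representation.
type ComponentId = u32;
type PortId = u32;

#[derive(PartialEq)]
enum PortState {
    Open,
    /// A `PortPeerChanged_Block` was received; sending on this port blocks
    /// the component until the matching unblock message arrives.
    Blocked,
    /// A `ClosePort` was received (or we closed the port ourselves); any
    /// further `put`/`get` on this port is a fatal error.
    Closed,
}

struct PortInfo {
    state: PortState,
    /// Last known peer; only reliable after a message was received on this
    /// port, and only until the end of the current synchronous round.
    peer: Option<(ComponentId, PortId)>,
    /// Set when the port is used in a `put`/`get` during the current sync
    /// round. Used both to forbid transferring a used port and to decide
    /// whether a peer's crash must make this component crash as well.
    used_in_sync_round: bool,
}

enum TransferCheck {
    Ok,
    /// A blocked port means the relinquishing component must wait.
    WaitUntilUnblocked,
    /// A port already used in this sync round means the sender crashes.
    FatalError,
}

impl PortInfo {
    /// Sketch of the checks made before transferring this port through a
    /// data message.
    fn transfer_check(&self) -> TransferCheck {
        if self.state == PortState::Blocked {
            TransferCheck::WaitUntilUnblocked
        } else if self.used_in_sync_round {
            TransferCheck::FatalError
        } else {
            TransferCheck::Ok
        }
    }
}
```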
In such a case we not only need to go through the `ClosePort` control protocol, to make sure that we can remove the crashing component's memory from the runtime, but we'll also have to make sure that all of the peers are aware that *their* peer has crashed. Here we'll make a design decision: if a peer component crashes during a synchronous round and there were interactions with that component, then that interacting component should crash as well. The exact reasons will be introduced later, but it comes down to the fact that we need to do something about the fact that the synchronous round will never be able to complete. + +We'll talk ourselves through the case of a component crashing before coming up with the control algorithm to deal with components crashing. + +We'll first consider that a component may crash inside our outside of a synchronous block. From the point of view of the peer component, we'll have four cases to consider: + +1. The peer component is not in a synchronous block. +2. The crashing component died before the peer component entered the synchronous block. +3. The crashing component died during the same synchronous block as the peer component. +4. The crashing component died after reaching consensus on the synchronous block that the peer component is currently still in. + +Before discussing these cases, it is important to remember that the entire runtime has components running in their own thread of execution. We may have that the crashing component is unaware of its peers (due to the fact that peer ports might change ownership at any point in time). We'll discuss the consensus algorithm in more detail later within the documentation. For now it is important to note that the components will discover the synchronous region they are part of while the PDL code is executing. So if a component crashes within a synchronous region before the end of the sync block is reached, it may be possible that it will not discover the full synchronous region it would be part of. + +Because the crashing component is potentially unaware of the component IDs it will end up notifying that is has failed. We can not design the crash-handling algorithm in a way such that the crashing component notifies the peers of when they have to crash. We'll do the opposite: the crashing component simply crashes and somehow attempts to notify the peers. Those peers themselves decide whether they have to crash in response to such a notification. + +For this reason, it does not make a lot of sense to deal with component failure through the consensus algorithm. Dealing with the failure through the consensus algorithm makes sense if we can find the synchronous region that we would discovery if we were able to fully execute the sync block of each participating component. And so we'll opt to deal with failure on a peer-by-peer basis. + +We'll go back to the four cases we've discusses above. We'll change our point of view: we're now considering a component (the "handling component") that has to deal with the failure of a peer (the "crashing component"). We'll introduce a small part of our solution a-priori: like a component shutting down, a failing component will simply end its life by broadcasting `ClosePort` message over all of its owned ports that are not closed (and the failing component will also wait until all of those ports are not blocked). + +In the first case, we're dealing with a failing component while the handling component is not in a synchronous block. 
This means that if there was a previous synchronous block, that it has succeeded. We might still have data messages in our inbox that were sent by the failing component. But in this case it is rather easy to deal with this: we mark the ports as closed, and if we end up using them in the next synchronous block, then we will crash ourselves. + +In the second case we have that the peer component died before we ourselves have entered the synchronous block. This case is somewhat equivalent to the case we described above. The crashing component cannot have sent the handling component any messages. So we mark the port as closed, potentially failing in the future if they end up being used. + +Next up is the third case, where both the crashing component and the handling component were both in the same synchronous round. Like before we mark the port as closed and future use will cause a crash. The difference is that the handling component may be blocked on attempting to `get` from a port which the crashing component now indicates is closed, perhaps we might have already performed successful `get` operations. In that case the handling component should crash: the crashing component can never submit its local solution, so the synchronous round can never succeed! And so we need to have metadata stored for the component that tracks if the port was used in the synchronous round. If the component is used in the synchronous round and a `ClosePort` message comes in, then the component should crash as well. + +The fourth case is where the failing component crashes *after* the handling component finished its sync round. This is an edge cases dealing with the following situation: both the handling as the crashing component have submitted their local solution to the consensus algorithm. The crashing component receives a global solution, finishes the sync round, and then crashes, therefore sending the `ClosePort` message to the handling component. The handling component, due to the asynchronous nature of the runtime, receives the `ClosePort` message before the global solution has a chance to reach the handling component. In this case, however, the handling component should be able to finish the synchronous round, and it shouldn't crash. + +Here is where we arrive at the protocol for dealing with component crashes. To let the handling component deal with this last case, we'll let the crashing component send the `ClosePort` messages together with a boolean indicating if it crashed inside, or outside of a synchronous block. + +If the crashing component sends `ClosePort` together with the indication that it crashed inside of the synchronous block, then we're dealing with case 3 *if* there are messages in the inbox, or if the handling component uses the closed port after receiving the `ClosePort` message. But if the crashing component sends `ClosePort` together with the indication that it crashed outside of a synchronous block, then we check if we have performed any operations on the synchronous port. If we have, then we crash. If we haven't, then the handling component must be dealing with a synchronous round that succeeds in the future. + +In this way we modify the control algorithm for terminating components. Now we're able to deal correctly with crashing components. 
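As a rough illustration of the peer-side decision described by the four cases above (simplified, and with hypothetical names; in particular `port_used_this_round` stands in for whatever interaction tracking the runtime actually keeps):

```rust
// Illustrative IDs; the real runtime has its own representation.
type PortId = u32;

/// Sketch of the `ClosePort` message sent by a terminating or crashing
/// component over each of its still-open ports.
struct ClosePort {
    target_port: PortId,
    /// True if the sender crashed while inside a synchronous block.
    crashed_in_sync_round: bool,
}

enum HandlerAction {
    /// Cases 1 and 2: mark the port closed; only a later `put`/`get` on it
    /// is a fatal error.
    MarkPortClosed,
    /// Case 3: we interacted with the crashing component in the current
    /// round, which can now never reach consensus, so we crash as well.
    CrashThisComponent,
    /// Case 4: the crashing component completed the round we are still in;
    /// we may still receive the global solution and finish normally.
    FinishRoundThenClosePort,
}

fn handle_close_port(
    msg: &ClosePort,
    in_sync_round: bool,
    port_used_this_round: bool,
) -> HandlerAction {
    if !in_sync_round {
        // Case 1: no ongoing interaction with the crashing component.
        HandlerAction::MarkPortClosed
    } else if msg.crashed_in_sync_round && port_used_this_round {
        // Case 3: the peer died mid-round after we already interacted.
        HandlerAction::CrashThisComponent
    } else if !msg.crashed_in_sync_round && port_used_this_round {
        // Case 4: the peer finished the round before crashing.
        HandlerAction::FinishRoundThenClosePort
    } else {
        // Case 2 and the remaining situations: the port is simply closed
        // for future rounds.
        HandlerAction::MarkPortClosed
    }
}
```

The essential asymmetry is that the crashing component never decides which peers must crash; each peer decides for itself, based on the flag in the `ClosePort` message and on its own interaction history.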
\ No newline at end of file diff --git a/docs/runtime/sync.md b/docs/runtime/sync.md deleted file mode 100644 index 305e02ae0f4a03f02538f64303e52c99c6b2dfa6..0000000000000000000000000000000000000000 --- a/docs/runtime/sync.md +++ /dev/null @@ -1,203 +0,0 @@ -# Synchronous Communication and Component Orchestration - -## Introduction - -The Reowolf runtime consists of a system that allows multiple components to run within their own thread of execution. These components are able to exchange messages with one another. Components are capable of creating other components, and of creating channels. We may visualise the runtime as a cloud of all kinds of components, connected by the communication channels between them, hence a kind of communication graph. - -With this version of the runtime there were several main drivers. For performance reasons we want: - -- As little centralized information as possible (because centralization of information implies synchronization to access it). -- As much parallelism as possible (when information *must* be exchanged, then make sure as little components are affected as possible). - -To keep the complexity of the runtime to a reasonable minimum, the following requirements have to be met as well: - -- A component may terminate, therefore potentially not being able to participate in synchronous communication or receive messages. Experimentation showed that the system that ensures that termination is broadcast to all peers should be kept simple (earlier systems depended on a state-machine with assumptions about the order in which messages were exchanged, which greatly complicates the messaging subsystem of the runtime). -- Messages should arrive in order (with some exceptions). As we'll see later in this document we have different types of messages. Reasoning about a component's operational state becomes much simpler if we can assume that the transmission of messages between components is ordered. - -As we will see there are several types of messages that can be exchanged. Among them we have: - -- Data messages: these messages contain the data that is "transmitted from and to PDL code". For each `put` a data message is annotated by the runtime and sent along to the receiving component, which will then hopefully retrieve the data with a `get`. These messages are conceptually sent over channels. -- Sync messages: these messages are sent between components to communicate their consensus-state. These messages are not necessarily associated with channels. -- Control messages: these messages are sent between components to ensure that the entire runtime is reliably facilitating data exchange. That is: they ensure that the language is working as intended. As an example: sending a port to a different component requires a bit of bookkeeping to ensure that all involved components are aware of the port exchange. - -The remainder of this document tries to describe the various technical aspects of synchronous communication and component orchestration. - -## Brief Description of Schedulers - -Each component conceptually has its own thread of execution. It is executing a program with at any given point in time a particular memory state. In reality there are a limited number of threads that execute components. Making sure that components are scheduled correctly is based on the fact that components are generally executing programs that are blocked at some point: a message needs to be received, or a port is blocked so we cannot send any information. 
At that point a component is considered "sleeping". Should another component, scheduled on a particular thread, send a message to this sleeping component, then it is "woken up" by putting it into the execution queue. - -The job of the scheduler is then to: execute a component scheduled for execution, wait until a component is scheduled, or shut down in case there are no more components to execute. - -The details of the execution queue (currently rather simplistically implemented) is not of interest. What is of interest is that a component can only be scheduled once. - -## Creation of Channels - -Within PDL code it is possible to create new channels. And so a component will always (that is to say: for now) create both endpoints of the channel, hence own both endpoints of the channel upon creation. Identifiers for these ports are generated locally (we don't want to resolve to having to synchronize on some kind of global port ID counter). - -As these IDs are generated locally there is no real technical challenge, but note that ports at different components may have the same port ID. - -## Creation of Components - -Within PDL code it is possible to create components. Upon their creation they can be given endpoints of channels. Hence at component creation we are changing the configuration of the communication graph. All of the relevant components need to be informed about the port changing ownership. - -Here we run into a plethora of problems. The other endpoint might have been given away to another created component. The other endpoint may have already been used in communication, such that we already have messages underway for the port we're trying to give to a newly created component. We may have that the local port ID assigned by the creating component is not the same as the local port ID that the newly created component may want to assign to it. We may have that this port has been passed along multiple times already, etc. - -We cannot help that messages have already arrived, or are in transit, for the transferred port. But we can make some assumptions that simplify the transfer of ports. As a first one, we have that the creating component decides when the created component is scheduled for execution. We'll choose not to execute it initially, such that we can be sure that it will not send messages over its ports the moment is created. To further simplify the problem, we have assumed that messages arrive in order. So although messages might still be underway for the transferred ports, if we ask the sender to stop sending, and the sender blocks the port and acknowledges that it has received this command. Then the moment the creator receives the acknowledgement it is certain that it has received all messages intended for the transferred ports. We'll also disallow the creation of components during synchronous interactions. - -And so here we have our first control protocol. If a port is transferred then we might have: - -1. That the peer port is transferred to the new component as well. All is fine and we can take care of the exchange immediately. -2. That the peer port stays with the creating component. Here all is fine as well, everything is running in a single thread of execution so we diligently do our bookkeeping on the data associated with the port and the channel and we can transfer the port. -3. The peer port is already owned by a different component. 
Here we need to have a slightly more complicated protocol - -In this last case we take the following actions, `C` for creating component, `N` for newly created component, and `P` for the peer component that holds the other port of the same channel. - -1. `C` transfers the port to the newly created component `N`, and ask it to come up with a new ID for that port. The port had an ID that was decided by its old owner `C`, and now has one that is agreeable with component `N`. -2. `C` sends a control message `PeerChangeBlockPort` to the peer `P`. -3. `P` receives the `PeerChangeBlockPort` message. It causes the peer port to be temporarily blocked. `P` may still continue executing its code, but the moment it wishes to send something over this port it is forced to block its execution. In response `P` sends an `Acknowledge` message back to `C`. -4. `C` waits for the `Acknowledge` message of `C`. Since the `Acknowledge` message was sent after the last data message that `P` sent to the port under consideration, and because `P` has blocked that port, we are certain that we received all messages. We transfer these messages to `N`. -5. Note that there may be multiple ports being transferred from `C` to `N`, each with a potentially different peer. And so here `C` waits until steps 2 through 4 are completed for all of the transferred ports. -6. Once `C` has received all of the `Acknowledge` messages it was waiting for, it will send a `PeerChangeUnblockPort` message to each peer `P`. This message contains the new port ID, such that `P` can unblock its port, and continue sending messages over this channel, but now correctly arriving at `N`. Additionally, `C` will now schedule the new component `N`. - -There is a bit of extra overhead here with the `PeerChangeBlockPort` -> `Acknowledge` -> `PeerChangeUnblockPort` sequence with respect to other possible schemes. But this one allows `P` to continue executing its code as long as it does not use its blocked port. It also ensures that messages arrive in order: they're all collected by `C`, given to `N`, and only then may `P` continue creating messages to `N`, hence arriving after the earlier messages have been handed off to `N`. - -As we'll later introduce, ports can end up being closed. When a port is closed the above procedure is not initiated. A port cannot be reopened, and once a port is closed its peer is also notified that the channel cannot be used anymore. As such, it is not needed (and perhaps impossible, if the memory backing the peer component has already been freed) to notify that the port will change peer. - -## Managing the Lifetime of Components - -Components are created by other components or by the API. Note that the API may for intents and purposes be viewed as a component. It may own and create channels, and it may create components. Component start, like a function, executing at the top of their program listing and may end in two ways. Either they encounter an unrecoverable error, or they neatly reach the end of their execution. - -In either case we want to free up the memory that was occupied by the component during its execution. And we want to make sure that any kind of pending message/action is correctly handled. Apart from that the component contains somekind of message inbox, whose memory is associated with that component. Hence we want to make sure that all peers are aware that they're no longer able to send messages to a terminated component. 
- -A small interlude before we continue: trying to take care of unrecoverable errors that occur during a sync round by incorporating the appropriate logic into the consensus algorithm proved to be rather hard. It caused a large increase in the number of states a component could be in, and as a result made the code much harder to reason about. That is: not so much communicating that an error had ocurred (that needs to occur in every synchronous algorithm), but keeping track of which messages can be sent to which component during the consensus algorithm. - -For this reason there are two systems that make sure that components stay alive as long as needed. Firstly components will have a reference counter. For simplicity the component also holds a reference to itself. The component will remove this self-reference when it terminates. Each channel causes two components to also hold references to eachother. *If* a consensus algorithm is implemented such that a central components ends up communicating to all participating parties (using channels or not using channels), then we can make sure that it can reach all participating components by incrementing their reference counts (note that this is not yet properly implemented in the runtime). - -Through this mechanism the consensus algorithm can be greatly simplified. If a component encounters a critical error and is already participated in a sync round, then it can notify the other participants, but remain reachable to all other components until the round ends (and the reference counts are decreased again). - -A second system is needed to ensure that a component actually exits (because all the peers hold a reference, and we need all of those references to drop to 0 to truly remove the component from the runtime). And so when a component exits it will send `ClosePort` messages to all of its peers. These peers will `Acknowledge` those messages and close the respective ports, meaning that they will no longer allow sending messages over those ports, that will be a fatal error. However, messages that were sent to the exiting component before receiving the `ClosePort` message might still be underway. And so the terminating component will wait for all of the `Acknowledge` messages for all of its channels such that it knows that it has received all data messages. The component will respond to these intermediate messages with a `DataMessageFailed` message, meaning that the message has been received the moment the component was already terminated, hence the sender should consider this a failed message transmission. - -Bringing these systems together: apart from data messages there might still be control messages in transit, or the exiting component might still have some control/sync work to do. And so we need to modify something we said earlier: instead of a component removing its self-reference the moment it terminates, we do this the moment we have received all the `Acknowledge` messages we were expecting. This way if a component is busy creating another component, we're certain that the appropriate protocols are observed. So: - -Concluding everything described above, two separate mechanisms will act in conjunction: - -- A. The exiting component `E` waiting until it has finished notifying all peers `P` of its termination: - 1. Component `E` sends a `ClosePort` message to each peer `P` for each of the shared channels (except when the port is already closed because `P` is also in the process of shutting down). - 2. 
The peer `P` receives the `ClosePort` message and closes the port as a result. This is a change of port state that will cause any subsequent `put` operations on that port to become a fatal error for component `P`. In response the peer `P` sends an `Acknowledge` message to component `E`, unless component `P` exited itself (that is: sending a `ClosePort` message to `E` before the `ClosePort` message from `P` arrived). After closing the port the component `P` will remove the reference to `E`. - 3. Component `E` waits until all of its pending control operations are finished (i.e. waiting for the `Acknowledge` messages following `ClosePort` messages, `PeerChangeBlockPort` messages, etc.). Once all of these are finished, note that we can no longer participate in any future control actions: component `E` will not create channels/components itself. Since all of its ports are closed, the peers `P` will also not send any data or control messages. - 4. Component `E` now checks its inbox for any remaining messages. It will respond to any data messages that arrived after `E` sending `ClosePort` and before `E` receiving `Acknowledge` with a `DataMessageFailed` message (except for ports that were closed before the `Acknowledge`). Then it removes the reference to itself, therefore decrementing the reference counter for the component by 1. - -- B. The reference counting mechanism. Any sync round the exiting component `E` is participating in will conceptually hold a reference to `E`. The component `E` will always respond to sync messages as if it were alive (albeit trying to indicate to everyone that it is actually exiting). The component that removes the last reference to the component `E` (which may be `E` itself, but also a peer `P`) will truly remove the associated memory from the runtime. - -**Note**: We did not consider machine termination. That is to say: once we reach the runtime maturity where communication occurs over different machines, then we have to consider that machines encounter fatal errors. However these can only be resolved by embedding the possibility of failure inside the protocol over which these machines communicate. - -## Sending Data Messages - -So far we've discussed the following properties associated with sending data messages: - -1. Port IDs are decided locally. So a peer may have an ID that is outdated for the intended recipient. -2. Ports can move from owner to owner. So a peer might have a component ID that is outdated for the intended recipient. -3. Ports may be closed. -4. Message intended for specific ports may end up at an intermediate component that is passing that message along. - -However, there are some properties that we can take advantage of: - -1. When a component sends a message, it knows for certain what its own component ID and port ID is. So a transmitting port always sends the correct source component/port ID. -2. When a component receives a message, it knows for certain what its own component ID and port ID is. So once a receiving port receives a properly annotated message from a transmitted port, the receiving end can be certain about the component IDs and port IDs that make up the channel. - -Note that although the message transmitter may not be certain about its final recipient, the components along the way *are* aware of the routing that is necessary to make the message arrive at the intended target. Combing back to the case where we have a creator `C`, new component `N` and peer `P`. Then `P` will send a message intended for `N`, but arriving at `C`. 
Here `C` can change the target port and component to `N` (as it is in the process of transferring that port, so knows both its original and new port ID). Once the message arrives and is accepted by the recipient then it is certain about the component and port IDs. - -## Sending Sync Messages - -Sync messages have the purpose of making sure that consensus is reached about the interactions that took place in all of the participating components' sync blocks. The previous runtime featured speculative execution and a branching execution model: a component could exhibit multiple behaviours, and at the end all components decide which combination of local behaviours constitute a satisfying single global behaviour (if one exists at all). Without speculative execution the model can be a lot simpler. - -We'll only shortly discuss the mechanisms that are present in the synchronization algorithm. A component has a local counter, that is reset for each synchronous interaction, that is used when transmitting data messages. Such a message will be annotated with the counter at `N`, after which the component sends the next message with annotation `N+1`. At the same time the component will keep track of a local mapping from port ID to port annotation, we'll call this the port mapping. Likewise, when a component receives a data message it will assign the received annotation in its own port mapping. If two components assign the same annotation to the ports that constitute a channel, then there is an agreeable interaction there. - -And so at the end of the synchronous round a component will somehow broadcast its port mapping. Note from the discussion above that a transmitting port's annotation is only associated with that transmitting port, since a transmitting port can only truly ever know its own component/port ID. While the receiving port's annotation knows about the peer's component/port ID as well. And so a component can broadcast `(component ID, port ID, mapping)` for each of its transmitting ports, while it can broadcast `(own component ID, own port ID, peer component ID, peer port ID, mapping)` for each receiving port. Then a recipient of these mappings can match them up and make sure that the mappings agree. - -Note that this broadcasting of synchronous messages is essentially a component-to-component operation. However these messages must still be sent over ports anyway (and any port that was used to transmit messages to a particular receiving component will do). There are two reasons: - -1. The sync message may need to be rerouted (e.g. a sender quickly fires both a data message and a subsequent sync message while the receiving port is being transferred to a new component), but needs to arrive at the same target as the data message. This is essentially restating that a transmitter never knows about the component ID of the recipient. -2. The sync message must not be taken into account by the recipient if it has not accepted any messages from the sender yet. Ofcourse this can be achieved in various ways but a simple way to achieve this is to send the sync message over ports. - -## Annotating Data Messages - -These port mappings are also sent along when sending data messages. We will not go into details but here the mapping makes sure that messages arrive in the right order, and certain kinds of deadlock or inconsistent protocol behaviour may be detected. This port mapping is checked for consistency by the recipient and, when consistent, the target port is updated with its new mapping. 
- -As we'll send along this mapping we will only consider the ports that are shared between the two components. But in the most general case the transmitting ports of the component do not have knowledge about the peer component. And so the sent port mapping will have to contain the annotation for *all* transmitting ports. Receiving port mappings only have to be sent along if they received a message, and here we can indeed apply filtering. Likewise, if the recipient of a port mapping has not yet received anything on its receiving port, then it cannot be sure about the identity of the sender. - -This leads to problems both for ensuring the correct ordering of the messages. For finding consensus it is not. Suppose that a port `a` sends a message to a port `b`. Port `b` does not accept it. Upon trying to find consensus we see that `a` will submit an entry in its port mapping, and `b` does not submit anything at all. Hence no solution can be found, as desired. - -For the message ordering we require from the receiver that it confirms that, for all of the shared channels, it has the same mapping as the sender sends along. Suppose a component `A` has ports `a_put` and `b_put`, while a component `B` has ports `a_get` and `b_get` (where `a_put -> a_get` and `b_put -> b_get`). Suppose component `A` sends on `a_put` and `b_put` consecutively. And component `B` only receives from `b_get`. Then since `a_get` did not receive from `a_put` (hence did not learn that component/port ID pair of `a_put` is associated with `a_get`), the component `B` cannot figure out that `a_get` should precede a `b_get`. Likewise component `A` has no way to know that `a_put` and `b_put` are sending to the same component, hence it cannot indicate to component `B` that `a_get` should precede `b_get`. - -There are some simple assumptions we can make that makes the problem a little bit easier to think about. Suppose again `a_put -> a_get` and `b_put -> b_get`. Suppose `a_put` is used first, where we send along the mapping of `a_put` and `b_put`. Then we send along `b_put`, again sending along the mapping. Then it is possible for the receiving component to accept the wrong message first (e.g. `b_get`, therefore learning about `b`), but it will be impossible to get from `a_get` later, since that one requires `b_put` (of which we learned that it matches `b_get`) to not have sent any messages. - -Without adding any extra overhead (e.g. some kind of discovery round per synchronous interaction), we can take three approaches: - -1. We simply don't care. It is impossible for a round where messages are received out of order to complete. Hence we temporarily allow a component to take the wrong actions, therefore wasting some CPU time, and to crash/error afterward. -2. We remove the entire concept of ordering of channels at a single component. Channels are always independent entities. This way we truly do not have to care. All we care about is that the messages that have been sent over a channel arrive at the other side. -3. We slightly modify the algorithm to detect these problems. This can be done in reasonable fashion, albeit a bit "hacky". For each channel there is a slot to receive messages. Messages wait there until the receiver performs a `get` in the PDL code. So far we've only considered learning about the component/port IDs that constitute a channel the moment they're received with a `get`. The algorithm could be changed to already learn about the peer component/port ID the moment the message arrives in the slot. 
We'll go with the last option in the current implementation. Return to the problematic example above. Note that messages between components are sent in ordered fashion, and `a_put` happens before `b_put`. Then component `B` will first learn that `a_put` is the peer of `a_get`, and then it performs its first `get` on the message from `b_put` to `b_get`. This message is annotated with a port mapping indicating that `a_put` has been used before. We are now able to detect at component `B` that we cannot accept on `b_get` before `a_get`.

Concluding:

- Every data message that is transmitted needs to contain the port mapping of all `put`ting ports (annotating them appropriately if they have not yet been used). We also need to include the port mapping of all `get`ting ports that have a pending/received message. The port mapping for `put`ting ports will only include their own ID; the port mapping for `get`ting ports will include the IDs of their peers as well.
- Every arriving data message will immediately be used to identify the sender as the peer of the corresponding `get`ter port. Since messages between components arrive in order, this allows us to detect when the `put`s at the sender are in a different order than the `get`s at the receiver.

## Handling Fatal Component Errors

Components may, during their execution, encounter errors that prevent them from continuing to execute their code. For the purposes of this chapter we may consider these to occur during two particular phases of their execution:

1. The error occurred outside of a sync block. Or equivalently (from the point of view of the runtime): the error occurred inside a sync block, but the component has not interacted with other components through `put`/`get` calls.
2. The error occurred inside of a sync block. The component can have performed any number of `put`/`get` calls, but for the sake of discussion we will only consider the cases where it performs:
   1. One `put` in the synchronous round.
   2. One `get` in the synchronous round.

As a preliminary remark: note that encountering an error is, in itself, nothing special: the component can simply print an error to `stdout` and stop executing. It is the handling of the error by peers that is of importance! If an interaction is made impossible because a peer has stopped executing, then the component that wishes to perform that interaction should error out itself!

### Handling Errors Outside of a Sync Block

If a component `E` encounters a critical error outside of a sync block, then we can be sure that its last synchronous round, if it had one, succeeded. However, there might be future synchronous rounds for component `E`, and likewise a peer component `C` might have already put a message in `E`'s inbox.

The requirement for the outside-sync error of `E` is that any future sync interaction of `C` with `E` will fail (but if `C` has no such future interactions, it shouldn't fail either!).

Note that `E` cannot perform `put`/`get` requests, because we're assuming `E` is outside of a sync block. Hence the only possible failing interaction is that `C` has performed a `put`, or is attempting a `get`. In the case that `C` `put`s to `E`, `E` might not have figured out the identity of `C` yet (see the earlier remarks on the eventual consistency of peer detection). Hence `C` is responsible for ensuring its own correct shutdown due to a failing `put`. Likewise for a `get`: `C` cannot receive from `E` if `E` has failed. So if `C` is waiting on a message to arrive, or if it will call `get` in the future, then `C` must fail as well.

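This requirement on `C` amounts to a check against the state of its own ports. The sketch below assumes a hypothetical `PortState` that is flipped to `Closed` once `C` learns its peer is gone; how `C` learns this is described next.

```rust
// Illustrative sketch: `PortState` and `SyncError` are assumed names, not the
// runtime's actual API. A port is flipped to `Closed` once the owning
// component learns (via the mechanism described next) that its peer is gone.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum PortState {
    Open,
    Closed, // the peer component has shut down
}

#[derive(Debug)]
enum SyncError {
    PeerClosed, // interacting over this port can never succeed again
}

/// A `put` fails immediately if the peer is known to be gone.
fn check_put(state: PortState) -> Result<(), SyncError> {
    match state {
        PortState::Open => Ok(()),
        PortState::Closed => Err(SyncError::PeerClosed),
    }
}

/// A blocked `get` must also fail once the port closes, because no message
/// can ever arrive anymore (unless one was already delivered to the slot).
fn check_get(state: PortState, message_in_slot: bool) -> Result<(), SyncError> {
    if message_in_slot {
        return Ok(()); // a previously delivered message may still be consumed
    }
    match state {
        PortState::Open => Ok(()), // keep waiting for a message
        PortState::Closed => Err(SyncError::PeerClosed),
    }
}
```
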
In this case it is sufficient for `E` to send around a `ClosePort` message, as detailed in another chapter of this document. However, a particular race condition might occur. We have assumed that `E` is not in a sync block, but `C` is not aware of this fact. `C` might not be able to distinguish between the following three cases:

1. Regular shutdown: components `C` and `E` are not in a sync round.
   - `E` broadcasts `ClosePort`.
   - `C` receives `ClosePort`.
2. Shutdown within a sync round, `ClosePort` arrives before `Solution`: a leader component `L`, peer component `C` and failing component `E`. Assume that all are/were busy in a synchronous round with one another.
   - `L` broadcasts `Solution` for the current sync round.
   - `E` receives `Solution`, finishes round.
   - `E` encounters an error, so sends `ClosePort` to `C`.
   - `C` receives `ClosePort` from `E`.
   - `C` receives `Solution` from `L`.
3. Shutdown within a sync round, `Solution` arrives before `ClosePort`: same components `L`, `C` and `E`.
   - `L` broadcasts `Solution` for the current sync round.
   - `E` receives `Solution`, finishes round.
   - `E` encounters an error, so sends `ClosePort` to `C`.
   - `C` receives `Solution` from `L`.
   - `C` receives `ClosePort` from `E`.

In all described cases `E` encounters an error after finishing a sync round. But from the point of view of `C` it is unclear whether the `ClosePort` message pertains to the current synchronous round or not. In cases 1 and 3 nothing is out of the ordinary. But in case 2 there is a point in time at which `C` is aware of the `ClosePort` from `E`, but not yet of the `Solution` from `L`. `C` should not fail the sync round, as it has completed, but `C` is unaware of this fact.

As a rather simple solution, since components that are participating with one another in a sync round move in lock-step at the end of the sync block, we send a boolean along with the `ClosePort`, e.g. `ClosePort(nonsync)`. This boolean indicates whether `E` was inside or outside of a sync block when it encountered the error. Now `C` can distinguish between the three cases: in all of them it learns that `E` was not in a sync block (and hence the sync round in cases 2 and 3 can still be completed).

### Handling Errors Inside of a Sync Block

If `E` is inside of a sync block, then it has interacted with other components. Our requirement now is that the sync round fails (and of course, that all of the peers are notified that `E` will no longer be present in the runtime). There are two things that complicate this type of failure:

1. Suppose that, in the successful case of the synchronous interaction, a large number of components interact with one another. Now it might be that `E` fails very early in its sync block, such that it cannot interact with several components. This lack of interaction might cause the intended single sync region to break up into several smaller regions. Each of these separated regions is supposed to fail.
2. Within a particular synchronous interaction we might have that the leader `L` has a reference to the component `E` without `E` being a direct peer. There is a reference counting system in place that makes sure that `L` can always send messages to `E`. But we still need to make sure that those references stay alive for as long as needed.

Suppose a synchronous region is (partially) established, and the component `E` encounters a critical error. The two points given above imply that two processes need to be initiated.

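Both the outside-sync case above and the inside-sync case below reuse the same message shape: a `ClosePort` carrying the inside/outside-sync boolean. The following is a minimal sketch of that message and of how a peer could decide whether its current round may still complete; the names are assumptions, not the runtime's actual API.

```rust
// Illustrative sketch of the `ClosePort` message and how a peer decides what
// to do with its current round; the names are assumptions, not the actual API.

struct ClosePort {
    /// True for `ClosePort(sync)`: the sender erred inside its sync block.
    /// False for `ClosePort(nonsync)`: the sender erred outside a sync block.
    was_in_sync_block: bool,
}

enum RoundOutcome {
    /// The current round may still complete (a `Solution` can still arrive).
    AwaitSolution,
    /// The current round (and, per the main text, the component itself) fails.
    FailRound,
}

/// What peer `C` does with its current round upon receiving a `ClosePort`,
/// given whether it used the now-closed port in that round.
fn handle_close_port(msg: &ClosePort, used_port_this_round: bool) -> RoundOutcome {
    if !used_port_this_round {
        // The closure only affects future use of the port, not this round.
        return RoundOutcome::AwaitSolution;
    }
    if msg.was_in_sync_block {
        // The failing component never completed its sync block, so no
        // consensus can be reached for this round.
        RoundOutcome::FailRound
    } else {
        // The failing component completed its round before erroring; a
        // `Solution` from the leader may still arrive.
        RoundOutcome::AwaitSolution
    }
}
```
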
For the first error-handling process we simply use the same scheme as described in the case where `E` is not in a synchronous region, except that we now broadcast `ClosePort(sync)` instead of `ClosePort(nonsync)` messages. Consider the following two cases:

1. Component `C` is not part of the same synchronous region as `E`, and component `C` has tried `put`ting to `E`. If `C` receives a `ClosePort(sync)`, then it knows that its interaction should fail. Note: it might be that `E` wasn't planning on `get`ting from `C` in the sync round in which `E` failed, but much later. In that case it still makes sense for `C` to fail; it would have failed in the future anyway. A small inconsistency here (within the current infinitely-deadlocking implementation) is that if `E` would *never* have performed a `get` from `C`, then `C` would deadlock instead of crash (one could argue that this implies that deadlocking should lead to crashing through a timeout mechanism).
2. Component `C` is not part of the same synchronous region as `E`, and if `E` had not crashed, it would have `put` a message to `C`. In this case it is still proper for `C` to crash. The reasoning works the same as above.

That is to say: a `ClosePort(sync)` causes instant failure of `C` if `C` has used the closed port in a round that has not yet reached consensus, or if it uses that port in the future. Note that this `ClosePort(sync)` system causes cascading failures throughout the disjoint synchronous regions. This is as intended: once one component's PDL program can no longer be executed, we cannot depend on the discovery of all the peers that constitute the intended synchronous region. So instead we rely on a peer-to-peer mechanism to make sure that every component is notified of the failure.

However, while these cascading peer-to-peer `ClosePort(sync)` messages are happily shared around, we still have a leader component somewhere, and components that have not yet been notified of the failure.
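
The cascade described above can be sketched as follows: a component that is forced to fail by an incoming `ClosePort(sync)` broadcasts `ClosePort(sync)` over all of its own ports, which in turn forces its peers to do the same. The names and bookkeeping are hypothetical; in particular, marking the port as closed for future use is left out.

```rust
// Illustrative sketch of the peer-to-peer failure cascade; none of these
// names correspond to the runtime's actual types or functions.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct PortId(u32);

struct Component {
    failed: bool,
    /// Ports used in a round that has not yet reached consensus.
    used_without_consensus: Vec<PortId>,
    /// All ports owned by this component.
    ports: Vec<PortId>,
}

/// Called when a `ClosePort(sync)` arrives on `port`. Returns the ports over
/// which this component must itself broadcast `ClosePort(sync)`, continuing
/// the cascade through the (possibly disjoint) synchronous regions.
fn on_close_port_sync(component: &mut Component, port: PortId) -> Vec<PortId> {
    let must_fail = component.used_without_consensus.contains(&port);
    if must_fail && !component.failed {
        component.failed = true;
        // Fail the round and the component itself, and notify every peer so
        // that the failure spreads through the whole intended region.
        return component.ports.clone();
    }
    // Otherwise the port is merely marked as closed (not shown), so that any
    // future use of it fails.
    Vec::new()
}
```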