diff --git a/docs/runtime/01_runtime.md b/docs/runtime/01_runtime.md new file mode 100644 index 0000000000000000000000000000000000000000..b29808a2ad37dea3470d99caa50b1396e8b89cc1 --- /dev/null +++ b/docs/runtime/01_runtime.md @@ -0,0 +1,255 @@ +# Runtime Design + +## General Architecture + +Roughly speaking the project consists of the following parts: + +1. + The compiler itself. This transforms the PDL source code into an executable format. +2. + The interpreter. This takes the executable format and executes it. It is a very unoptimized AST walker based on stack frames. Generally speaking the bottommost frame in the stack contains the code and memory associated with a component. + + Once the interpreter hits a point where it needs to interact with the runtime (generally in order to communicate with another component) it will halt and emit a signal to the runtime. +3. + The runtime. This is the code that keeps track of the components, decides when they can run, decides where their messages should end up, and brings various control algorithms (running behind the scenes) to completion. + +We'll not go into points 1 and 2 in this document. One may simply assume that at the language level there is support for all of the things that are implemented in the runtime: sync blocks, channels, ports, component creation, sending and receiving messages, etc. + +Once such builtin features are encountered in the interpreter (e.g. the creation of a channel), a signal will be emitted to the runtime. A rough outline of the architecture, and the handling of these signals, is discussed in this document. + +## Runtime Architecture + +The runtime's code essentially consists of: + +- The machinery that keeps track of the components. That is to say: there is some memory reserved for each of the components, and using some kind of component ID we can look up this memory in a registry. If the runtime finds that there are no more user-controlled components running (i.e. the ones that a programmer uses to create more components) and there are no more regular components running, then the runtime shuts down. +- A bit of shared memory for all of the OS threads that will be managed by the runtime. This mainly consists of a work queue. This work queue contains the identities of the components that are scheduled for execution. +- A set of scheduler threads. They attempt to retrieve work from the work queue. This work is in the form of component IDs, which they use to retrieve the associated component and run its PDL code. They do this by invoking the interpreter on the component's current execution and memory state. Once a runtime signal (as mentioned above) is emitted by the interpreter, the scheduler thread will deal with it appropriately. +- An auxiliary polling thread. Not of great importance, so a short description suffices: although most components react to one another, some components (e.g. a TCP socket) might have nothing to do until the OS instructs them to do something. This polling thread ensures that once there is something to do, the component is put back onto the work queue. + +As long as the runtime doesn't shut down there will be `T` threads executing `N` components. A component can either be running (by being executed by a scheduler thread), scheduled for execution (its ID is in the work queue), or sleeping. All of these states are exclusive. Maintaining the exclusivity of these states is of great importance! We never want to end up in a place where two threads are both modifying the code/memory state of the same component!
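+ +To make the interplay between the work queue, the scheduler threads and the component registry a bit more concrete, the sketch below shows what a scheduler thread's main loop could look like. It is written in the same illustrative pseudocode as the snippets further down; all of the names (`take_from_work_queue`, `run_interpreter`, and so on) are made up for this sketch and do not necessarily match the actual implementation. + +``` +func scheduler_thread(runtime) { + while !runtime.should_shut_down() { + // Grab the ID of a component that is scheduled for execution + let component_id = take_from_work_queue(runtime.work_queue); + let component = lookup_in_registry(runtime.registry, component_id); + // Run the component's PDL code until the interpreter emits a runtime signal + let signal = run_interpreter(component.exec_state, component.memory_state); + if is_blocked_signal(signal) { + // e.g. a `get` whose message has not arrived yet + try_to_go_to_sleep(component); // see `go_to_sleep` below + } else if is_terminated_signal(signal) { + start_shutdown_protocol(runtime, component); // discussed later in this document + } else { + // e.g. channel creation, component creation, sending a message: + // handle the signal and keep the component scheduled + handle_runtime_signal(runtime, component, signal); + put_in_work_queue(runtime.work_queue, component_id); + } + } +} +```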
+ +A component will start its lifecycle by being put into the work queue (not exactly true, but this will be dealt with later) by the component that created it. At some point a scheduler will pick up the newly created component's ID from the work queue and start executing its code. Once running (and we'll exclude fatal errors for now) the component may at some point reach the end of its code and terminate, or it may encounter a place in its code where it blocks and needs to wait for external input (e.g. it performed a `get`, but a message has not arrived yet). + +Once the execution of a component is blocked, it will attempt to go to sleep. Meaning: it will be in a state where it is not running, but also not scheduled for execution. A component may then be woken up by a different component, or the polling thread, by sending the sleeping component a message. To prevent components from scheduling a sleeping component multiple times, the memory of a component contains an atomic `sleeping` boolean. + +It is instructive at this point to roughly explain how components are stored in memory. The component's memory is roughly divided into two regions. There is the publicly accessible memory of a component and a private memory region. The public part is accessible by all scheduler threads. So all of the memory in the public memory region is somehow behind a lock, or some kind of locking mechanism (we will include the concept of atomics into "some kind of locking mechanism" for now). Hence the aforementioned `sleeping` boolean lives in this region. Conversely, the private memory region of a component is only accessed by the scheduler thread that is running the component. So here we store things like the memory state and execution state of the component. + +Returning to the idea of a component wishing to enter the "sleeping" state: the procedure in pseudocode is written as: + +``` +func go_to_sleep(this_component) { + // We are currently executing, so our sleeping flag MUST be `false` + assert(atomic::load(this_component.sleeping) == false); + atomic::store(this_component.sleeping, true); // we force the flag to true + // Note that while we were trying to go to sleep, someone may have sent us a + // new message before it saw that we stored `true` in the sleeping flag. In + // that case it did not schedule us, so we need to check the inbox ourselves. + if messages_in_inbox() { + // We try to set the flag back to false, but another scheduler thread may + // have already done this + let swap_success = atomic::cas(this_component.sleeping, true, false); + if swap_success { + put_in_work_queue(this_component.id); + } + } +} +``` + +Similarly, each time we try to send a component a message, we must do the following: + +``` +func send_message_to(target_component, message_data) { + put_message_in_inbox_locked(target_component.inbox, message_data); + let swap_success = atomic::cas(target_component.sleeping, true, false); + if swap_success { + put_in_work_queue(target_component.id); + } +} +``` + +Note that, because we cannot predict how the OS threads are scheduled, we can also not predict the way in which our own schedulers (which are running on OS threads) will schedule the execution of the components. Hence as a mental model, one may assume that each component is running in its own thread. The system above ensures that if a component has something to do (because it has received a message), it will eventually end up being executed by a scheduler.
With the code for the "sleeping" state we'll ensure that a component can only be executed by one scheduler at a time. + +## General Messaging + +With this rough architecture, components can send each other messages. One will find three kinds of messages in the runtime (okay, four, but the last one is just to make the OS polling thread work): + +1. Data messages: these are the messages that are sent by `put` calls and received with `get` calls. As will be explained later in this document, more information is attached to data messages than the values given as argument to the `put` call. These messages will arrive in the target component's inbox. When the target component performs a call to `get` they're pulled out of the inbox and transferred to the component's memory state, such that the PDL code can read the transferred values. +2. Control messages: to facilitate certain operations permitted within PDL, the scheduler thread may decide to send messages to other components. These messages are called control messages. We'll encounter them later in this document when describing component creation, transmitting ports, and closing channels. Different from data messages, which may linger in the component's inbox for a while, control messages are handled by the receiving component immediately. This is important for various control algorithms. +3. Sync messages: to facilitate the consensus algorithm, there will be messages initiated by the scheduler thread as well. That is to say: when the component sends data messages there will be information attached to the data message that facilitates a successful sync round. But apart from that, when the components are all done executing their code they must somehow reach consensus. This is done through these sync messages. + +Note that the concept of a channel with its own little slot (or limited buffer) already becomes a superfluous design decision for the runtime. This is because the scheduler threads themselves also need to be able to send messages to other components. Hence we'll need something more generic than a per-channel buffer, namely a generic message inbox. + +As we'll later also see, the concept of a directional channel *may* be a useful tool within PDL code (and I'm not arguing against such a concept), but as we'll see throughout this document the control messages can (conceptually) flow both ways over the channel. + +## Runtime Design Drivers + +Most design choices in the runtime were based on the fact that the Reowolf language should facilitate easier programming over the internet. So although the entire system currently only considers components running on a single machine, those components were conceptually regarded as living on different machines. In other words: components were conceptually considered not to have any shared memory they can both access. + +This has some implications for channels. Certainly a component that sends a value must have it stored somewhere temporarily, and the component receiving it needs to keep it around as well. But the channel is not an entity that you'll find in memory. Rather there is one component owning one port, and when a message is `put` through it, it will arrive at another component owning the peer port; there is no memory shared between components that will store a message flowing through a channel. + +A multi-machine runtime also requires the runtime to embrace the asynchronous nature of components.
`put`s are non-blocking and can be performed one after the other before the peer has performed a corresponding `get`. The language does not contain the concept of a "lock" such that two components can agree on who owns a shared bit of memory. Rather each component is executed in its own thread of execution, and for multiple components to coordinate their actions they must use the messaging facilities. In order to make this coordination-through-messages somewhat simple to reason about, one of the design drivers of the runtime was to ensure that messages sent in a specific order from one component to another will arrive in that same order at the target component. + +And so we have a multi-machine runtime where components running in their own thread can only coordinate through messages. As a result an ever-important consideration in designing internal (control) algorithms is something called "message crossing". Two components may decide to initiate a protocol at the same time, hence send each other the exact same protocol-initiating message (e.g. if we have components `A` and `B`, and a protocol that requires an initiator to send a `Request` message and then wait for a response in the form of a `Response` message, then we may have `A` and `B` both sending each other `Request` at the same time). + +Yet another result is that we decided to design the runtime without any globally unique component and/or port IDs. Certainly: on a single machine a component ID allows one to retrieve a component's memory. But when sending a message to a component living on another machine, it may well be that we're sending it through a port that has the same port ID as ours, and targeting a component that has the same ID as ours. + +## Control Algorithms + +We'll now discuss several of the control algorithms. These control algorithms may be initiated by the scheduler threads when certain runtime signals are emitted by the interpreter. The control algorithms are brought to completion by sending messages. We'll talk about these messages as if they're sent from one component to another component (this is for the sake of clarity: in reality they're sent by one scheduler thread to the memory location reserved for the target component's inbox). Because messages may be relayed one or more times before they arrive at the intended receiver (we'll introduce this concept soon), most messages include their intended target port in some kind of message header. This is true for all data messages, and most control messages. Only when a component is certain about the identity of the receiving component can it send messages without a target port in a header. + +### Changing Port Peers due to Component Creation + +Components, when they're not in sync mode, may decide to create new components. Ports may be used as the arguments to these newly created components. The rule we place on such a port transfer is that the component that is creating the new component fully relinquishes ownership of the transferred port, and after the new component is created, the new component owns that port. As an annotated example: + +``` +comp creator(in one_port) { + channel another_port -> and_a_final_one; + sync { + auto value = get(one_port); // legal, just receiving an integer + put(another_port, 1337); // legal, sending a value over an owned port + } + // perform component creation + new some_other_component(one_port, and_a_final_one); // transferring two ports + + sync get(one_port); // illegal!
Port was transferred + sync put(another_port, 1338); // still legal, we still own this port + sync get(and_a_final_one); // also illegal, port was transferred. +} +``` + +We have several runtime properties to uphold when we're transferring ports: + +- No globally unique port IDs, so the new component is allowed to choose new port IDs for the ports it is adopting ownership of. +- The peers of the transferred ports may be unaware that a new component is created. In fact those peers may have already sent messages to the instantiating component! As a design decision (the one that we find makes sense) any incoming, but unread, messages for a port are transferred along to the new component. +- Similarly to the above: a peer of a transferred port needs to be aware at some point that its peer port has changed ownership. +- Together with the requirement that peers need to be aware of the transferred ports, we also need to maintain ordering in the sent messages that intend to arrive at that transferred port at some point in time. + +Here we see the asynchronous nature of the runtime rear its head. The transferring of ports does not just happen to the receiving end of a port (in which case we transfer already received messages, hence messages only arrive at their correct destination eventually); it may happen to the transmitting end of a port as well. What this means for the receiver is that it is never sure which component is its peer until it has received a data message that is annotated with the origin of the message. At that moment in time the peer of the port is known, but only until the end of the synchronous round, because after the synchronous round it is perfectly possible for the port to be passed around again. + +For all of the requirements above, the internal control algorithm to transfer a port to a new component is as follows: + +1. The component that is creating the new component (we'll call the creator the instantiator component, and the created one the new component) temporarily has access to the private memory of the new component. The reason is that a component is always created on the same machine as the instantiator component. And so the first step it takes is to create new port IDs (that make sense for the newly created component, instead of for the instantiator component) and map the old port IDs to the new ones. +2. The instantiator component transfers all of the metadata associated with the transferred ports, and transfers all of the messages that are targeted at those ports, to the new component. +3. For each transferred port the instantiator sends a `PortPeerChanged_Block` control message to the peers. This message instructs the peer that the port should be temporarily blocked. Any component that tries to send a message through that port enters a blocked state that can only be lifted if the corresponding `PortPeerChanged_Unblock` control message is sent. At the same time the instantiator sets up a special bit of code that will relay all incoming messages from that peer to the new component. We've mentioned earlier that all messages will have a target port. So when messages arrive at the instantiator component that need to be relayed, the instantiator component will modify the target port to the new component's chosen port ID. +4. Once a peer has received a `PortPeerChanged_Block`, it will, as stated above, stop sending messages over that channel. Not only data messages, but control messages as well.
This also means that the other component cannot start transferring ports itself. In any case, it will respond with an `Ack`nowledgement back to the instantiator component. +5. The instantiator component waits until it has received an `Ack` for all of the `PortPeerChanged_Block` messages it has sent. This is done such that we're sure that we've received all of the messages that are actually intended for the new component (because while the new component is being created the peer may still be sending messages intended for the new component, but sent to the instantiator component). As a result, the new component will have all of the data messages in the inbox in the order in which they were sent, therefore maintaining the runtime property of message ordering. +6. When all of the `Ack`s are received, the instantiator component will remove the bit of code that relays all of the messages and will schedule the new component for execution. At this point the instantiator component will no longer access the private memory of the new component. Since the instantiator component is aware of the new component's ID and the new port IDs for all of the transferred ports, it will send `PortPeerChanged_Unblock` messages to all of the peer components. This message will also contain the new component's ID and its port ID. +7. The peers, upon receiving the `PortPeerChanged_Unblock` message, will update the metadata of their ports such that they point to the new component's ports. They will also unblock the port such that messages can be sent again. + +With this control algorithm, all peers are now aware of the new port's position. We've also maintained message ordering for the messages sent to the new component. Although it was mentioned in point (4), we'll mention it here to be extra clear: creating a new component will be blocked until all of the transferred ports are unblocked. If we don't do this a data/control message may end up at the wrong component. + +Likewise we see the asynchronous nature of ports: the peers are eventually consistent. This is why we stressed earlier that almost all messages have their targeted port in their message header. This is needed such that a component like the instantiator discussed above knows when to relay messages. In this process the relaying component will also update the target port ID in the header to the new port ID. + +### Shutting Down Components + +A component will require a bit of memory to run. So when we're done executing a component (either because it has crashed, or because its program has terminated) we would like to release this memory again. Earlier we mentioned that components send messages by accessing an inbox in the public memory region of a component. This memory will, of course, be freed as well. So we need to make sure that when a component shuts down, all of its peers will somehow be notified that they can never send messages to that terminated component again. + +In order to do so we have another control protocol. We'll extend this protocol when we discuss encountering crashing components, but we'll introduce a simpler variant here. The protocol is relatively simple. For each of the ports that are not yet closed and are owned by the terminating component we will: + +1. Make sure that the port is not blocked. If the port is blocked then the component blocks until the associated port becomes unblocked. If the port is already closed then we do not execute the other steps in this control algorithm. +2. 
The terminating component will send a `ClosePort` message to the peer of the port that is being closed. Note that this `ClosePort` message will have a target port. If it happens to be that the terminating component will receive a `PortPeerChanged_Block` message for that port in the near future, we're certain that the `ClosePort` message will at least arrive at the correct peer (since the target port will be used to relay that message to the correct receiver). +3. The peer of the port, upon receiving a `ClosePort` message, will mark the port as being closed in its metadata. From that point onwards, any attempt to `put` or `get` on that port will result in the peer component crashing. In response to the `ClosePort` message, the peer component will send an `Ack`. There is one exception, and that is when the peer component itself already initiated a `ClosePort` control algorithm for that port. In that case the incoming `ClosePort` message is treated like an `Ack`. +4. The terminating component will wait until all of the `Ack`s have arrived (or crossing `ClosePort` messages, as stated in point (3)). Once they have, it will instruct the runtime to remove the component from memory. + +To reiterate: we have to be careful and annotate the `ClosePort` message with the target port. The terminating component will delay sending a `ClosePort` message if the port is blocked, but it may be that we have the `ClosePort` message crossing with a `PortPeerChanged_Block` message, which implies that our `ClosePort` message will be relayed by the peer component. + +### Transferring Ports through Data Messages + +The PDL code allows for ports to be transferred through ports. As a simple example, consider the following code: + +``` +struct Pair { + in command, + out response, +} + +comp some_component( + in to_transmit, + out tx_one_port, + out tx_two_ports +) { + // Transmitting a port directly + sync put(tx_one_port, to_transmit); + + // Transmitting multiple ports at a time using a data structure + channel command_tx -> command_rx; + channel response_tx -> response_rx; + + sync { + let message = Pair{ + command: command_rx, + response: response_tx, + }; + put(tx_two_ports, message); + } + + // Sending a command and receiving a response + sync { + put(command_tx, true); + auto response = get(response_rx); + } +} +``` + +To facilitate this, we'll follow roughly the same procedure as when we're transferring ports to a newly created component. But we have one complication: we do not have direct access to the private memory of the component we're sending the ports to (we'll call this component the "adopting component", and the sending component the "relinquishing component"). And so we'll have to follow a control protocol that is slightly different. + +Note that it is perfectly okay to send closed ports. The adopting component will receive this port together with the information that the port is closed. In this way, if the adopting component attempts a `put` or `get` on that received port, it will crash. + +We'll enforce a second rule upon transferring ports. Namely that ports transferred in a synchronous round may not have been used in `get` or `put` operations in that same round. I'm certain that it is possible to come up with a set of rules that would make transferring already-used ports possible. But the protocol for transferring ports over channels is a lot easier if we disallow this. For this reason we'll introduce a field in the metadata for each port that registers when the port was last used.
If the relinquishing component attempts to transfer a port that has been used within the same sync round, then it will crash. + +Like before we want to ensure that all messages intended for the transferred port arrive in the correct order at the adopting component. + +And so the control protocol for transmitting ports proceeds as follows: + +1. The relinquishing component will first make sure that none of the ports are blocked. If the ports are blocked then it will sleep until the ports become unblocked. As stated above the relinquishing component will also make sure that the ports were not previously used within the synchronous round. +2. The relinquishing component will send a `PortPeerChanged_Block` message to all of the peers of the ports that will be transferred. However, in this case it will not relay any messages to the adopting component; they will still pile up in the relinquishing component's inbox. +3. The peers, upon receiving the `PortPeerChanged_Block` message, will proceed as they would in the case where ports were transferred to a new component: they'll block the port and send an `Ack`. +4. The relinquishing component will wait until all of the expected `Ack` messages are received. Once they are received the component will wait until the port through which the data message will travel becomes unblocked (that is: the port that is used to transfer the ports to the adopting component). +5. The relinquishing component will send the data message containing the transferred ports to the adopting component. It will annotate this message with a list containing `(transferred port ID, peer component ID, peer port ID)` triples. Note that since those peer ports are blocked, they will not be transferred in the meantime. This is essential for the next step. +6. The adopting component will receive the annotated data message containing the transferred ports. For each transferred port it will decide upon a new port ID. +7. The adopting component will, for each adopted port, send out a `PortPeerChanged_Unblock` message to the blocked peer ports. This message will be annotated with the `(adopting component ID, new port ID)` pairs, such that the peers all know where their new peer ports can be found. + +## Dealing with Crashing Components + +A component may at any point during its execution be triggered to crash. This may be because of something simple like an out-of-bounds array access. But, as described above, using closed ports may lead to such an event as well. In such a case we not only need to go through the `ClosePort` control protocol, to make sure that we can remove the crashing component's memory from the runtime, but we'll also have to make sure that all of the peers are aware that *their* peer has crashed. Here we'll make a design decision: if a peer component crashes during a synchronous round and there were interactions with that component, then that interacting component should crash as well. The exact reasons will be introduced later, but it comes down to the fact that the synchronous round will never be able to complete, and we need to do something about that. + +We'll talk ourselves through the case of a component crashing before coming up with the control algorithm to deal with components crashing. + +We'll first consider that a component may crash inside or outside of a synchronous block. From the point of view of the peer component, we'll have four cases to consider: + +1. The peer component is not in a synchronous block. +2. 
The crashing component died before the peer component entered the synchronous block. +3. The crashing component died during the same synchronous block as the peer component. +4. The crashing component died after reaching consensus on the synchronous block that the peer component is currently still in. + +Before discussing these cases, it is important to remember that the entire runtime has components running in their own thread of execution. It may be that the crashing component is unaware of its peers (due to the fact that peer ports might change ownership at any point in time). We'll discuss the consensus algorithm in more detail later within the documentation. For now it is important to note that the components will discover the synchronous region they are part of while the PDL code is executing. So if a component crashes within a synchronous region before the end of the sync block is reached, it may be possible that it will not discover the full synchronous region it would be part of. + +Because the crashing component is potentially unaware of the component IDs it will end up notifying that it has failed, we cannot design the crash-handling algorithm in such a way that the crashing component notifies the peers when they have to crash. We'll do the opposite: the crashing component simply crashes and somehow attempts to notify the peers. Those peers themselves decide whether they have to crash in response to such a notification. + +For this reason, it does not make a lot of sense to deal with component failure through the consensus algorithm. Dealing with the failure through the consensus algorithm only makes sense if we can find the synchronous region that we would have discovered if we were able to fully execute the sync block of each participating component. As explained above: we can't, and so we'll opt to deal with failure on a peer-by-peer basis. + +We'll go back to the four cases we've discussed above. We'll change our point of view: we're now considering a component (the "handling component") that has to deal with the failure of a peer (the "crashing component"). We'll introduce a small part of our solution a-priori: like a component shutting down, a failing component will simply end its life by broadcasting a `ClosePort` message over all of its owned ports that are not closed (and the failing component will also wait until none of those ports are blocked). + +In the first case, we're dealing with a failing component while the handling component is not in a synchronous block. This means that if there was a previous synchronous block, it has succeeded. We might still have data messages in our inbox that were sent by the failing component. But this case is rather easy to deal with: we mark the ports as closed, and if we end up using them in the next synchronous block, then we will crash ourselves. + +In the second case the peer component died before we ourselves entered the synchronous block. This case is somewhat equivalent to the case we described above. The crashing component cannot have sent the handling component any messages. So we mark the port as closed, potentially failing in the future if it ends up being used. + +Next up is the third case, where the crashing component and the handling component were both in the same synchronous round. Like before we mark the port as closed and future use will cause a crash.
The difference is that the handling component may be blocked on attempting to `get` from a port which the crashing component now indicates is closed, or we might have already performed successful `get`/`put` operations on that port. In that case the handling component should crash: the crashing component can never submit its local solution, so the synchronous round can never succeed! And so we need to store metadata for each port that tracks whether the port was used in the synchronous round. If the port was used in the synchronous round and a `ClosePort` message comes in, then the handling component should crash as well. + +The fourth case is where the failing component crashes *after* the handling component finished its sync round. This is an edge case dealing with the following situation: both the handling and the crashing component have submitted their local solution to the consensus algorithm. The crashing component receives a global solution, finishes the sync round, and then crashes, thereby sending the `ClosePort` message to the handling component. The handling component, due to the asynchronous nature of the runtime, receives the `ClosePort` message before the global solution has a chance to reach it. In this case, however, the handling component should be able to finish the synchronous round, and it shouldn't crash. + +Here is where we arrive at the protocol for dealing with component crashes. To let the handling component deal with this last case, we'll let the crashing component send the `ClosePort` messages together with a boolean indicating whether it crashed inside or outside of a synchronous block. + +If the crashing component sends `ClosePort` together with the indication that it crashed inside of the synchronous block, then we're dealing with case 3 *if* there are messages in the inbox, or if the handling component uses the closed port after receiving the `ClosePort` message. But if the crashing component sends `ClosePort` together with the indication that it crashed outside of a synchronous block, then we check if we have performed any operations on the port in the synchronous round. If we have performed operations *and* have received messages from that component, then apparently the synchronous round will succeed. So we will not immediately crash. Otherwise we will. + +In this way we modify the control algorithm for terminating components. Now we're able to deal correctly with crashing components. + +## Sync Algorithm + +A description of the synchronous algorithm is present in other documents. We will mention here that central to the consensus algorithm is that two components agree on the interactions that took place over a specific channel. In order for this to happen we'll send along a lot of metadata when trying to reach consensus, but here we're just concerned with attempting to match up the two ends of a channel. + +A port is identified by a `(component ID, port ID)` pair, and a channel is a pair of those identifying pairs. So to match up the two ends of a channel we would have to find a consistent pair of ports that agree on who their peers are. However, we're dealing with the problem of eventual consistency: `put`ting ports never know who their peer is, because the sent message might be relayed. However, `get`ting ports *will* know who their peer is for the duration of a single synchronous round once they've received a single message. + +This is the trick we will apply in the consensus algorithm.
If a channel did not see any messages passing through it, then the components that own those ports will not have to reach consensus because they will not be part of the same synchronous region. However if a message did go through the channel then the components join the same synchronous region, and they'll have to form some sort of consensus on what interaction took place on that channel. + +And so the `put`ting component will only submit its own `(component ID, port ID, metadata_for_sync_round)` triplet. The `get`ting port will submit information containing `(self component ID, self port ID, peer component ID, peer port ID, metadata_for_sync_round)`. The consensus algorithm can now figure out which two ports belong to the same channel. \ No newline at end of file diff --git a/docs/runtime/02_consensus_1.md b/docs/runtime/02_consensus_1.md new file mode 100644 index 0000000000000000000000000000000000000000..ca914cd0d0ecd51e49fe61ffaffed49622636ac1 --- /dev/null +++ b/docs/runtime/02_consensus_1.md @@ -0,0 +1,367 @@ +# Previous Consensus Algorithm + +## Introduction + +The previous consensus algorithm (the one within Reowolf 1.0 and 1.1) had support for speculative execution. This means that the user may (directly or indirectly) fork the execution of a component. That particular execution (which may be a forked execution already) then becomes >1 executions. At some point a component will have to choose which particular execution will be committed to memory. This is one reason for the existence of a `sync` block: a block of code wherein one may perform forking, and at the end a component will have to choose the execution that is committed to memory. + +With speculative execution we may have multiple components that are all forking their execution and sending/receiving messages. So we do not end up with one component that has to choose its final execution, but all components choosing their final execution. Note that one component's execution may apply restrictions on the validity of another component's execution. As an example, suppose the following components and their executions: + +- Component A: Has two executions: + - Execution A1: Component A has sent a message to component B. + - Execution A2: Component A has received a message from component B. +- Component B: Has three executions: + - Execution B1: Component B has received a message from component A, then sends a message back to component A. + - Execution B2: Component B has received a message from component A. + - Execution B3: Component B has sent two messages to component A. + +Without delving into too much detail, one may see that the only valid solution to this problem is the combination of `A1` and `B2`. So the components cannot just pick any execution, but must pick an execution that is agreeable with the chosen executions of the components it has interacted with. + +## Component Execution Tree, and Execution Traces + +Components execute PDL code, which may contain calls to `fork`, `put`, and `get`. A `fork` explicitly forks the execution of the code. A `put` sends a message to a particular component, and a `get` receives a message from a component and forks (as explained later). + +As the component enters a `sync` block, it has only one possible execution. But as stated above there are reasons for the execution to split up. 
These individual executions may themselves split up later, thereby forming a so-called "execution tree": + +``` + +-----+ +------+ + | put |------>| sync | ++-------+ +------+----->+-----+ | end | +| sync | | fork | +------+ +| start |----->+------+----->+-----+ ++-------+ | get |------>+------+ + +-----+ | sync | + | | end | + | +------+ + | + +------>+------+ + | | sync | + | | end | + | +------+ + | + +--> ... +``` + +This tree was formed by executing the following PDL code: + +``` +primitive some_component(out tx, in rx) { + sync { + fork { + put(tx, 5); + } or fork { + get(rx, 1); + } + } +} +``` + +We can see the reason for calling the execution tree a "tree". There are several things to note about the execution tree. Firstly, some executions have been completed and form a complete trace, that is: starting from the "sync start" a complete trace may be represented by the path running to the "sync end". Conversely, there is one trace that is incomplete: there is a trace waiting at the `get` for a message. We'll call a place where the execution splits into multiple branches/executions a "branching point". + +Note that the branching points can in the *general case* only be discovered at runtime. Any code may have control flow points like `if` statements, or `while` loops. Consider the following code: + +``` +primitive some_component(out tx, bool which_way) { + sync { + if (which_way) { + fork { + put(tx, 1); + } or fork { + put(tx, 2); + } + } else { + put(tx, 3); + } + } +} +``` + +Depending on the value of `which_way` we produce two different execution trees (of which we can determine all traces). The compiler cannot decide at compile time which execution tree will be generated. + +Note that the `get` branching points have an arbitrary number of forked executions arising from them. We'll call them "waiting points". In the *general case* we cannot figure out how many forked executions arise from a `get` branching point. The reason might be illustrated by the following simple example: + +``` +primitive sender(out tx, u32 num_forks) { + sync { + auto fork_counter = 1; + while (fork_counter < num_forks) { + fork { + put(tx, fork_counter); + } or fork { } // empty case + fork_counter += 1; // advance the counter so the loop terminates + } + put(tx, num_forks); + } +} + +primitive receiver(in rx) { + u32[] values = {}; + sync { + bool keep_going = true; + while (keep_going) { + auto new_value = get(rx); + values @= { new_value }; // append + fork { + keep_going = false; + } or fork { } + } + } +} +``` + +If the sender is connected to the receiver, then the sender will send anywhere between `1` and `num_forks` messages (distributed over `num_forks` forks), depending on a user-supplied parameter (which we cannot figure out at compile-time). The isolated receiver can generate an infinite number of forked executions. We can analyze that the receiver will at most have `num_forks + 1` forked executions arising from its `get` branching point (the `num_forks` branches that do receive, and one final fork that is infinitely waiting on another message), but the compiler cannot. + +For this reason a `get` branching point needs to be kept around for the entire duration of the sync block. The runtime will always need to have a copy of the component's memory and execution state the moment it encountered a `get` instruction, because it might just be that another component (in perhaps a new fork, which we cannot predict) will send it another message, such that it needs to produce a new forked execution.
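+ +To make the required bookkeeping a bit more tangible, the sketch below shows, in illustrative pseudocode with made-up names (so not the actual implementation), roughly what would have to be stored for each `get` branching point, and what has to happen whenever a message arrives during the sync round. + +``` +// Everything kept around for a single `get` branching point +struct GetBranchingPoint { + branch_id, // ID of the execution that blocked on the `get` + getting_port_id, // the port the `get` is waiting on + memory_snapshot, // copy of the memory state at the moment of the `get` + exec_snapshot, // copy of the execution state at the moment of the `get` +} + +func handle_message_during_sync(component, message) { + // The message can never be discarded during this sync round: a later + // message may create a new fork whose `get` can accept it after all + store_in_inbox(component, message); + for point in component.get_branching_points { + if can_accept(point, message) { + // produces a new leaf in the execution tree + fork_new_execution(component, point, message); + } + } +} +```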
+ +A `get` operation is also a "blocking operation": in the *general case* the component needs to know the value produced by the `get` operation in order to continue its execution (perhaps more specifically: the first time a read operation is performed on the variable that will store the transmitted message). Consider the simple case where the received message contains a boolean that is used in the test expression of an `if` statement: we'll need to have actually received that boolean before we can decide which control flow path to take. Speculating on the contents of messages is too computationally expensive to be taken seriously. A put operation is not a blocking operation: the message is sent and the component continues executing its code. + +We've touched upon control flow points multiple times. We'll touch upon some aspects of control flow here, to more easily introduce the algorithm for finding consensus later. A component is fully described by its memory (i.e. all of the memory locations it has access to through its variables) and execution state (i.e. its current position in its code). So once a component encounters a control flow point, it can only take one control flow path. Which path may be computed from its memory state. The calling of certain impure functions (e.g. retrieving a cryptographically secure random number) does not change this fact. Note that receiving values from other components might change a component's memory state, hence influence the control flow path it takes in the subsequent forked execution. Conversely, a component sending a value might influence another component's memory state. + +So before treading into more detail, here we've found that in the general case: + +- Speculative execution may only occur inside sync blocks. +- Speculative execution implies that we end up with an execution tree. +- A path through the execution tree that reaches the end of the sync block is called a complete trace, and represents a valid execution of the sync block for the component (but perhaps not compatible with a particular complete trace of a peer it interacted with). +- The set of traces produced by a component in its sync block can practically only be discovered at runtime. +- A `get` operation is necessarily a blocking operation that always incurs a branching point. A `put` operation is a nonblocking operation that will not fork into multiple executions. +- The control flow path of a trace of a component may be influenced by the messages it has received. + +## Towards a Consensus Algorithm + +The key to the consensus problem is somehow discovering the ways in which the components have influenced the memory state of their peers. If we have a complete trace for each component, for which all peer components agree on the way they have influenced that complete trace, then we've found a solution to the consensus problem. Hence we can subdivide the consensus problem into four parts: + +1. Keeping track of the messages that influence the memory state of components. +2. Keeping track of the peers that influence the memory state of components. +3. Finding a set of interactions between components on which all involved components agree, i.e. each `put` should have a corresponding `get` at the peer. +4. Somehow having a communication protocol that finds these agreeable interactions. + +We'll not consider the last point, as this is essentially a gossip protocol, and the appropriate gossip protocol varies with the requirements of the user (e.g. 
robustness to failure, memory efficiency, runtime efficiency, message complexity). We define some terms to make the following discussion easier: + +- "component graph": A graph where each node is a component, and each channel between components forms an edge in that graph. +- "sync region": The group of components that have interacted with one another and should agree on the global consensus solution. +- "local solution": A complete trace of a component. For the component this is a valid local solution, but it might not be part of a global solution. +- "global solution": A set of traces, one for each of the components in the sync region, that all agree on the interactions that took place between the components in the sync region. + +We'll now work incrementally towards the final consensus algorithm, making it a bit of a story in order to explain the reasoning and intuition behind it. + +Suppose a component can somehow predict exactly which messages it is going to receive during the execution of its code; we'll assume that each received message has the appropriate `get` call associated with it. In this case we're able to produce the set of complete traces that a component produces by symbolically executing its code: we start out with the initial memory state, might perhaps do some explicit `fork`ing, know exactly which messages we receive and how they influence the control flow, and arrive at the end of the sync block. Hence each component can figure out independently which complete trace is the solution to its consensus problem. + +However, as we've outlined above, we cannot know exactly which messages we're going to receive. We'll have to discover these messages while executing a component. The next best thing is to keep track of the values of the messages that we've received in a complete trace. Once we have complete traces for all of the interacting components, we can check that each received value corresponds to a sent value, e.g.: + +``` +primitive sender(out tx) { + sync { + fork { + put(tx, 1); + } or fork { + put(tx, 2); + } + } +} + +primitive receiver(in rx) { + u32 value = 0; + sync { + value = get(rx); + } +} +``` + +Where `tx` is part of the same channel as `rx`. In this case we'll have two traces for each of the components, resulting in two valid global consensus solutions. In one solution the message `1` was transferred, in the other the message `2` was transferred. There are two problems with this solution: firstly it doesn't take the identity of the channel into account, and secondly it doesn't take the effects of previous messages into account. + +To illustrate the first problem, consider: + +``` +primitive sender(out tx_a, out tx_b) { + u32 some_variable_in_memory = 0; + sync { + fork { + put(tx_a, 1); + some_variable_in_memory = 1; + } or fork { + put(tx_b, 1); + some_variable_in_memory = 2; + } + } +} + +primitive receiver(in rx_a, in rx_b) { + u32 value = 0; + sync { + value = get(rx_a); + } +} +``` + +Here the fact that the sender has the solutions `1` and `1` does not help the receiver figure out which of those corresponds to its own solution of `1`.
The reason there are three solutions for the receiver is that it cannot figure out that the message `2` from the sender depended on the first message `1` from the sender having arrived. + +So we need to change the algorithm. Instead of just tracking which messages were sent, each component needs to have a mapping from port identities to sent messages (internally the runtime will generate port/channel IDs, but for the sake of this discussion we'll use the postfixes of the port names in the PDL code to indicate to which channel they belong, e.g. the `tx_a` out-port is part of the same channel `a` as the `rx_a` in-port). Secondly, if we send a message, we need to transmit in which way it depends on previously received messages by sending along the sender's port mapping. The receiver, upon `get`ting a message, checks the port mapping to see if any of its own executions can accept that message. + +We're already calling this information the "port mapping" (because we'll truly turn it into a mapping later), but for now the sent mapping is a list of pairs containing `(port ID, sent value)`. + +This changes the way we can interpret the execution tree: each node is not only associated with the performed operation (`fork`, `put` or `get`), but also associated with a particular port mapping that indicates the influence of other components that allowed it to reach that execution node. We modify the port mapping per node in the following way: + +- For a `fork`: we fork the execution as many times as needed, and for those forks we copy the port mapping of the ancestor node in the execution tree. +- For a `put`: we transmit the current port mapping and the transmitted value in the message. We then update the mapping for the sending port: that particular port ID now maps to the recently transmitted value. +- For a `get`: we receive the transmitted port mapping and value. Note that this `get` might be a particular statement executed by multiple different forked executions (each with their own port mapping). And so for a `get` to succeed, we need the shared channels between the sender and the receiver to agree on their port mapping. If such a `get` succeeds, then it forks into a new execution where the receiving port now maps to the received value. + +So, for a slightly more complicated example, combining the two previous examples: + +``` +primitive initiator(out tx_a, out tx_b, in rx_c) { + u32 value = 0; + sync { + put(tx_a, 1); // operation i1 + fork { + put(tx_a, 1); // operation i2 + value = get(rx_c); // operation i3 + } or fork { + put(tx_b, 2); // operation i4 + value = get(rx_c); // operation i5 + } + } +} + +primitive responder(in rx_a, in rx_b, out tx_c) { + sync { + auto value_1 = get(rx_a); // operation r1 + auto value_2 = 0; + fork { + value_2 = get(rx_a); // operation r2 + } or fork { + value_2 = get(rx_b); // operation r3 + } + put(tx_c, value_1 + value_2); // operation r4 + } +} +``` + +Here, once the components have brought as many forked executions to completion as possible, we'll have the following execution trees (and mappings). The square bracketed terms denote the port mapping. The parenthesized terms correspond to the operations in the code, and the curly bracketed terms are the names for the traces (so we can refer to them in this document). + +``` +For the initiator: + +sync --> put (i1) --> fork +--> put (i2) -----> get (i3) -+-----> sync end {A} +start [(a,1)] | [(a,1),(a,1)] [(a,1),(a,1),(c,2)] + | | + | | + | +-> ...
+ | + +--> put (i4) -----> get (i5) -+-----> sync end {B} + [(a,1),(b,2)] [(a,1),(b,2),(c,3)] + | + | + +-> ... + +For the responder: + +sync --> get (r1) -+--> fork +--> get (r2) -----> put (r4) ----> sync end {C} +start [(a,1)] | | [(a,1),(a,1)] [(a,1),(a,1),(c,2)] + | | + +-> ... +--> get (r3) -----> put (r4) ----> sync end {D} + [(a,1),(b,2)] [(a,1),(b,2),(c,3)] +``` + +We can see that the `put` operation at `i2` does not end up being received at `r1`. The reason is that at `r1` the responder expects to not have received anything on `rx_a` yet. The message that the initiator sends contains the annotation `[(a,1),(a,1)]`, meaning: I have previously sent `[(a,1)]`, and am now sending `[(a,1)]`. The only operation that can receive this message is `r2`, because that expects the mapping `[(a,1)]`! + +Similarly: when the responder `put`s at `r4`, this happens in two branches. The branch that ends up being trace `C` expects the initiator to be in the state `[(a,1),(a,1)]` (hence can only be received at operation `i3`, resulting in trace `A` of the initiator). The branch that ends up being trace `D` expects the initiator to be in the state `[(a,1),(b,2)]` (hence can only be received at operation `i5`, resulting in trace `B` of the initiator). + +And of course, when both components are finished, they can compare their mappings and conclude that the traces `A` and `C` are compatible, since their port mappings are compatible. Similarly traces `B` and `D` are compatible. So there are two global solutions to the consensus problem. + +For the sake of simplicity, we've only considered two components. But there may be many more components involved in a single synchronous region. In this case we'll have to clarify when receiving a message is valid. When a message is sent to another component, the receiving component first filters the port mapping (both for the mapping stored in the execution tree and the mapping sent over the channel) such that only the channels shared between the two components are left. If those two mappings are equal, then the message can be received in that branch. + +The same process needs to be applied when we seek a global solution. In rather silly pseudocode (but the simplest way to explain this process), the following algorithm seeks the global solution: + +``` +all_components = [ component_1, component_2, ..., component_N ] +global_solutions = [] + +// Nested loop through all complete traces of all components +for trace_of_1 in complete traces of component_1: + for trace_of_2 in complete traces of component_2: + ... + ... + for trace_of_N in complete traces of component_N: + let success = true; + let all_traces = [trace_of_1, trace_of_2, ..., trace_of_N]; + + // Looping through every pair of traces + for index_a in 0..N: + for index_b in index_a + 1..N: + let component_a = all_components[index_a] + let trace_a = all_traces[index_a] + let component_b = all_components[index_b] + let trace_b = all_traces[index_b] + + trace_a = filter_on_shared_channels(trace_a, component_a, component_b) + trace_b = filter_on_shared_channels(trace_b, component_a, component_b) + + if trace_a != trace_b: + success = false; + break + + if !success: + break + + if success: + global_solutions = append(global_solutions, all_traces) +``` + +We'll now apply the last bits of trickery to this algorithm. Firstly, keeping track of the sent message values may be prohibitively expensive. Suppose some kind of streaming data processor that receives gigabytes of data in a single synchronous round.
It would be an unwise design decision to store all of that data in the port mapping. So what we'll do instead is assign each node in the execution tree a unique number (unique only within that execution tree; different execution trees might contain the same number). With that unique number we'll redefine the port mapping to consist of a list of `(port ID, branch number)` pairs. + +The reason this is a valid trick is because of the arguments made earlier regarding control flow being influenced by received messages. If we know how a component was influenced externally, then the control flow path it takes is deterministic; hence the content of the sent messages will be deterministic. Locally a component `A` may only describe the way it was influenced by its peer `B`, but `B` also records how it was influenced by its peers `C` and `D`. So transitively `A` will also know the indirect mutual influences between it and `C` and `D`. + +Lastly, we can turn the list of `(port ID, branch number)` pairs into a true mapping `{port ID -> branch number}`: we do not actually need to keep the entire history around. The reason behind this is the fact that the `get` operation is blocking and requires the sent port mapping to be compatible with its execution node's port mapping. So when a `get` operation succeeds, it agrees that the executions of both sender and receiver are compatible up until that point. Continued execution only has to check that the subsequent interactions are compatible. Consider the following cases: + +- A `get` operation never receives a message: in that case we keep blocking indefinitely; we'll never get a complete trace, hence we are prevented from finding a local solution. +- A `get` operation receives a message containing an incompatible port mapping: in that case we have roughly the same case as above. The message is not accepted and we keep blocking indefinitely. The interpretation is different: the sender and the receiver did not agree on the control flow paths they took. +- A `get` operation receives a message containing a compatible port mapping: in this case both components agree on the order of operations that took place for the message to be transferred. +- A `put` operation is never received: again, we're fine. The `put`ting component might submit a local solution for a complete trace. But the mapping for that never-received message will also never appear in the supposed receiver's port mapping. Hence the two components will never agree on the control flow paths they took. + +## The Consensus Algorithm + +With the story above, we may describe the complete consensus finding algorithm as follows. + +Each component will locally construct an execution tree. Branches appear whenever it encounters a `fork` (producing a set of new branches), a `put` operation (this will create a single new branch) or a `get` operation (which will create new branches if the sent message's port mapping is compatible with the port mapping stored at that `get`'s branch). Whenever a new branch is created it is assigned a locally unique identifier. + +The port mapping at each branch consists of a mapping `{port ID -> branch number}`. This port mapping starts out empty at the start of a sync block. When a component `put`s a value through a channel, it will: + +- Generate the `put` branch and assign it a unique ID. +- Send the message annotated with the ancestor branch's port mapping, together with the `(sending port ID, newly assigned branch ID)` pair.
+- Update the port mapping of the `put` branch: it will copy the ancestor branch's port mapping, and update it with the `(sending port's ID, newly assigned branch ID)` pair. In case a previous entry is present for that specific `port ID`, it is overwritten.
+
+When a component encounters a `fork`, it will simply copy the ancestor branch's port mapping for each new branch. Each new branch will receive a new unique ID. When a component performs a `get`, it will block until it receives a message annotated with a port mapping. If that happens, then it will:
+
+- Compare the port mapping in the message with the branch's port mapping. If one of the shared channels does not agree on the port mapping, then the message is not valid for that `get` operation. Note that the message *must* be kept around, because in the future there may be a `get` operation that *is* valid for that port mapping.
+- If the port mappings for the shared channels do agree, then we:
+  - Generate a new fork originating from the `get` branch and assign it a unique ID.
+  - Update the port mapping of the `get` branch: copy the ancestor branch's port mapping, and update it with the `(sending port ID, peer's branch number)` pair.
+
+## Reasons for not implementing this Consensus Algorithm
+
+There are a lot of practical issues with this consensus algorithm:
+
+1. The fact that a `get` operation never knows when it will receive a new message requires us to keep a complete copy of the component's memory and execution state at that point. Hence for each `get` operation we're incurring a rather large memory overhead.
+2. The fact that we never know whether a received message can be discarded because it cannot be received by any of the `get` operations in the component's code: there may be another message coming in that causes a fork with a `get` operation that *can* receive this message. Hence we need to keep around *all* of the messages received in a synchronous round.
+3. The incredible computational complexity of finding a global solution to the consensus algorithm. We need to check, for each component, all of its completed traces. For `N` components, each supplying `T` traces (to simplify the math), we need to check each pair of traces belonging to different components, so the maximum computational complexity is on the order of `(N*T)*((N-1)*T)/2` pairwise comparisons. In reality this is a bit less, because we can very likely quickly eliminate certain traces.
+4. Considering the previous points: the ease with which a nefarious actor (especially when running over the internet) can impose computational overhead on the receiver. All it has to do is keep sending messages to the receiver with an acceptable port mapping, but never offer the consensus algorithm a valid local solution. Each accepted message will spawn a new fork at the receiver.
\ No newline at end of file
diff --git a/docs/runtime/03_consensus_2.md b/docs/runtime/03_consensus_2.md
new file mode 100644
index 0000000000000000000000000000000000000000..372b75601bf60763288636c557cdef7057558a59
--- /dev/null
+++ b/docs/runtime/03_consensus_2.md
@@ -0,0 +1,181 @@
+# Current Consensus Algorithm and Communication Rules
+
+## Introduction
+
+For all of the reasons described in the document on the previous consensus algorithm, speculative execution was removed from the runtime. This greatly simplifies the consensus algorithm. In fact, there isn't much need for a consensus algorithm anymore, apart from some edge cases. Hence the `fork` statement is gone from the language.
+
+Most of this document will describe a different problem that wasn't discussed in the document describing the previous consensus algorithm: the introduction of a `select` statement and the way in which components are allowed into a synchronous region.
+
+## Reducing the Previous Consensus Algorithm
+
+Without speculative execution, there is no more execution tree: there is only one control flow path a component can take through its PDL code. That execution may not form a complete trace (i.e. reach the end of the sync block) because it encounters a blocking operation. But to reiterate: if there is a complete trace, then it can only have taken one control flow path through its code.
+
+However, there are still reasons for incorporating parts of the old consensus algorithm. That is: there is still a port mapping, `put` operations update this port mapping, and `get` operations (which will not fork anymore, but simply block until a message is present) do the same thing.
+
+There are two reasons for this. Firstly there is the design choice of enforcing strict ordering on the way channels are used between components. If there are two channels between components, then we may have the following code:
+
+```
+comp sender_variant_a(out tx_a, out tx_b) {
+    sync {
+        put(tx_a, 1);
+        put(tx_b, 2);
+    }
+}
+
+comp sender_variant_b(out tx_a, out tx_b) {
+    sync {
+        put(tx_b, 2);
+        put(tx_a, 1);
+    }
+}
+
+comp receiver(in rx_a, in rx_b) {
+    sync {
+        auto va = get(rx_a);
+        auto vb = get(rx_b);
+    }
+}
+```
+
+If we did not send along the port mapping, then both `sender_variant_a` and `sender_variant_b` would be valid peers for the `receiver` component. This is because `put` operations are asynchronous (that is: they send the message and continue executing, not waiting for the corresponding `get` to complete), so both messages will arrive at the `receiver` component. The receiver component can then retrieve these messages in any order. However, if we *do* send along the port mapping, then only `sender_variant_a` can be a valid peer of `receiver`. Note: this is a design choice.
+
+We could relax this design choice by only sending along the port mapping of the port over which the message is sent. This would imply that messages over the same channel have to be received in the correct order, but messages over different channels can be received in any desired order. We've kept the strict ordering in place by sending along the full port mapping.
+
+The second reason for still including the port mapping has to do with the fact that, by design, `put` operations are not acknowledged. That is to say: the `put` operation is asynchronous, and does not prevent the sending component from continuing its execution. For a `put` operation to be valid, the message *must* be received by the peer. Without any kind of port mapping we have no way of detecting the following mismatch:
+
+```
+comp sender(out tx) {
+    sync {
+        put(tx, 0);
+        put(tx, 1);
+    }
+}
+
+comp receiver(in rx) {
+    sync get(rx);
+}
+```
+
+However, there are much simpler ways of designing a consensus algorithm without speculative execution: one may annotate the port with the number of messages sent, one may (as stated above) only send along the port mapping for a single port instead of all shared channels, etc.
+
+That being said, the current implementation still uses branch IDs (differently interpreted in this case: a simple operation counter), as sketched below.
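+Roughly, and with illustrative names rather than the runtime's actual API, this counter-based bookkeeping looks as follows (the precise rules are given at the end of this document):
+
+```
+// Performed for every `put` call on `port` within a sync block.
+func annotated_put(component, port, payload) {
+    component.counter += 1;                         // the counter plays the role of the branch number
+    component.mapping[port.id] = component.counter; // annotate the sending port in the port mapping
+    // The current port mapping travels along with the message.
+    send_message(port, Message{ mapping: component.mapping, payload: payload });
+}
+
+// Performed for every `get` call on `port` once a message is sitting in the inbox.
+func checked_get(component, port, message) {
+    // Both mappings, restricted to the channels shared with the sender, must agree.
+    if !mappings_agree(component.mapping, message.mapping, shared_channels(component, message.sender)) {
+        crash(component); // sender and receiver disagree on the order of their interactions
+    }
+    component.counter += 1;
+    component.mapping[port.id] = message.mapping[message.sender_port]; // record the sender's counter value for this channel
+    return message.payload;
+}
+```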
+
+There is still a port mapping from `port ID` to `branch ID`, and finding the global solution still proceeds in the same way (except that now there is only one complete trace per component).
+
+## Synchronous Regions
+
+There were some aspects of the consensus algorithm that were specifically left out in the previous document. Among them is when to consider a component part of the synchronous region. In version 1.0 of the Reowolf runtime each peer of each component was considered part of the synchronous region. The idea behind this decision was that the fact that a channel remains unused in a synchronous round should be seen as part of the consensus algorithm: if a component did not `get` on a port, then the peer must agree that it did not `put` on a port.
+
+So when a complete trace was found (that is: a component reached the end of its sync block), a component went over all its ports and sent a special message over the unused/silent channels asking the peers to join the synchronous round with their local solutions, where those local solutions are required not to have sent/received anything over those silent channels.
+
+The problem with this approach may be illustrated with a simple example: suppose a set of requesting servers that are connected to a database server of some sort. In the normal situation it would be perfectly fine for multiple servers to be storing database resources at the same time. These requests can all be handled in sequence by the database server, but the requesting servers do not necessarily have to wait for one another. Some pseudocode for this example:
+
+```
+union Request {
+    StoreData(u8[]),
+    /* more things here in a more complicated example */
+}
+
+comp requester(out tx_cmd) {
+    // Setup code here
+    // Perhaps some code that requires retrieving database resources
+    while (!should_exit) {
+        sync {
+            u8[] data = { 1, 2, 3 }; /* something more complicated */
+            put(tx_cmd, Request::StoreData(data));
+        }
+    }
+}
+
+comp database(in[] rx_cmds) {
+    // Note the array of input ports: there are multiple requesters
+    // Lot of complicated stuff
+    while (!should_exit && length(rx_cmds) > 0) {
+        // Here is the conundrum (and a second one, mentioned later in this document):
+        sync {
+            auto command = get(rx_cmds[0]); // take a message from the first requester
+            if (let Request::StoreData(to_store) = command) {
+                store_the_data(to_store);
+            } else {
+                // other commands, in a more complicated example
+            }
+        }
+    }
+}
+```
+
+In this case the code seems reasonable, but it will *always* fail if there is more than one requester connected to the database component. Once the database reaches the end of its sync block, it will have a mapping for `rx_cmds[0]`, but the remaining ports were all silent. So the consensus algorithm asks the peer `requester` components if their channels were silent. But they aren't! Each requester can only send data in its sync block.
+
+So one might be inclined (in fact, it's the only way to solve this in the unmodified language) to write the requester as:
+
+```
+comp requester(out tx_cmd) {
+    // Setup code here
+    // Perhaps some code that requires retrieving database resources
+    while (!should_exit) {
+        sync {
+            fork {
+                // We either do nothing
+            } or fork {
+                // Or we communicate with the database server
+                u8[] data = { 1, 2, 3 }; /* something more complicated */
+                put(tx_cmd, Request::StoreData(data));
+            }
+        }
+
+        some_code_that_may_take_a_really_long_time();
+    }
+}
+```
+
+In some sense the problem is solved.
Only one `requester` component will have its request go through, the other ones have silent channels, and the `database` component is capable of receiving that message. But there are two further problems here. Firstly, there is no way to give any guarantees about fairness: the component that can communicate with the `database` component the fastest will probably have its trace completed first, hence will be the first one submitted to the consensus algorithm, hence will likely be the one picked. If this happens over the internet then the slow connections will almost never be able to submit their information. Secondly, none of the components can run at their own pace. Each component *always* has to wait until all of the other ones have joined the synchronous region. There is no way in which a single `requester` can do some other work on its own, unfettered by the execution speeds of its peers.
+
+And so the rule for joining a synchronous region was changed. Now it is simply: if a component performs a `get` and receives a message from another component, then they both become part of the same synchronous region (and perhaps this region is already larger because the components are part of larger separate synchronous regions). If a message is in the inbox of the component, but no `get` operation is performed, then the message is kept around for perhaps a later synchronous round. Note that this works together with the consensus algorithm in such a way that joining a consensus region happens upon the first accepted message on a channel. Later messages within that synchronous round now *must* arrive due to the consensus algorithm.
+
+## Select Statement
+
+The example above hints at a second problem (figured out a long time ago by our unix overlords): there is no way within PDL code to say that we can perform a variety of behaviours triggered by the arrival of messages. For our `database` component above, we see that it can have multiple requesters connected to it, but there is no way to indicate that we can have valid behaviour for any one of them. So we introduce the `select` statement. It is a statement with a set of arms. Each arm has a guard, where we indicate which ports need to have incoming messages, and a block of code that is executed when all of the indicated ports actually have received a message.
+
+The select statement may only appear within a sync block. The arm's guard is formulated in terms of an expression or a variable declaration (that may contain `get` calls, but not `put` calls). The corresponding block of code is executed when all of the `get` calls in the guard have a message ready for them. In case multiple arm guards are satisfied, a random arm is chosen by the runtime for execution.
+
+So the select statement takes the form:
+
+```
+select {
+    auto value = get(rx_a) + get(rx_b)
+        -> do_something_with_the_value(value),
+    auto value = get(rx_a)
+        -> do_something_with_the_value(value),
+}
+```
+
+This code is transformed by the compiler into something akin to:
+
+```
+while (still_waiting_for_a_message) {
+    if (value_present(rx_a) && value_present(rx_b)) {
+        auto value = get(rx_a) + get(rx_b);
+        do_something_with_the_value(value);
+
+        break;
+    } else if (value_present(rx_a)) {
+        auto value = get(rx_a);
+        do_something_with_the_value(value);
+
+        break;
+    }
+    block_until_there_is_a_new_message();
+}
+```
+
+The combination of the `select` statement and the way we introduce components into the synchronous region permits components to run independently of one another when their protocol admits it.
+
+The rule that components only enter a synchronous region when they `get` a message that is present in the inbox still applies here. If, in the example above, the arm that only requires a message on `rx_a` executes, then only the peer component of that channel joins the synchronous region. A message that came in over `rx_b` will still be present in the inbox for later interactions or sync blocks.
+
+## The Current Consensus Algorithm
+
+Because the current consensus algorithm is a scaled down version of the previous one, we'll be a bit more concise. Each component has a local counter; this produces the numbers we (still) call branch numbers. The counter is incremented each time a `get` or a `put` call is performed. We'll use this counter to annotate ports in the port mapping each time a `put` call is performed. The port mapping is sent along with messages sent through `put` calls. This message will arrive in the inbox of the receiver. When the peer performs a `get` call on the receiving end of the channel, it will check if the received port mapping matches the local port mapping. If it doesn't, we can immediately let the component crash. If it does, then the message is received and the component continues execution. Once a `get` call has completed, the sender is incorporated into the synchronous region, if it wasn't already.
+
+Once a component has reached the end of the sync block it will submit its local solution (i.e. the port mapping) for validation by the consensus algorithm. If all components in the synchronous region have submitted a local solution whose port mappings are pairwise consistent with one another, then we've found a global solution. With this global solution the components in the synchronous region are ordered to continue the execution of their PDL code.
+
+As a small side-note: with speculative execution the consensus algorithm amounted to finding, for each component, a complete trace whose interactions are consistent with all of its peers. Without speculative execution we have the multi-party equivalent of an `Ack` message in the TCP protocol. (Yes, this statement skips over a *lot* of details of the TCP protocol, but) in TCP both parties perform a best-effort attempt to ensure that a sent message has arrived at the receiver by the receiver `Ack`nowledging that the message has been received. In this case the consensus algorithm performs that function: it ensures that the sent messages have all arrived at their peers in a single multi-party interaction.
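+To illustrate the above, the following pseudocode sketches what the end-of-round validation could look like (again with illustrative names, not the runtime's actual bookkeeping): every component in the region submits its single port mapping as its local solution, and the mappings only need to agree pairwise on the channels that each pair of components shares.
+
+```
+func find_global_solution(region) {
+    // One local solution (a port mapping) per component in the synchronous region.
+    let solutions = collect_local_solutions(region.components);
+
+    for (component_a, component_b) in all_pairs(region.components) {
+        let shared = shared_channels(component_a, component_b);
+        // Restrict both mappings to the channels shared by this pair of components.
+        if filter(solutions[component_a], shared) != filter(solutions[component_b], shared) {
+            return NoSolutionYet; // e.g. a message was sent whose `get` has not completed
+        }
+    }
+    // All pairs agree: order every component to continue past the end of its sync block.
+    return GlobalSolution(solutions);
+}
+```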
\ No newline at end of file
diff --git a/docs/runtime/04_known_issues.md b/docs/runtime/04_known_issues.md
new file mode 100644
index 0000000000000000000000000000000000000000..2fa7b4437d7e3ce4074a471796d081418566c744
--- /dev/null
+++ b/docs/runtime/04_known_issues.md
@@ -0,0 +1,45 @@
+# Known Issues
+
+The current implementation of Reowolf has the following known issues:
+
+- Cannot create uninitialized variables that are later known to be initialized. This is not a problem for the regular types (perhaps a bit tedious), but it is a problem for channels/ports. That is to say: if a component needs a temporary variable for a port, then it must create a complete channel, e.g.
+
+  ```
+  comp send(out tx1, out tx2, in which) {
+      channel unused -> temporary;
+      while (true) sync {
+          if (get(which)) {
+              temporary = tx1;
+          } else {
+              temporary = tx2;
+          }
+          put(temporary, 1);
+      }
+  }
+  ```
+
+  Another solution would be to use an empty array and to put a port inside of that. Hacks galore!
+
+- Reserved memory for ports will grow without bounds: ports can be given away from one component to another by creating a component, or by sending a message containing them. The component sending those ports cannot remove them from its own memory if there are still other references to the transferred port in its memory. This is because we want to throw a reasonable error if that transferred port is used by the original owner. Hence we need to keep some information about that transferred port in the sending component's memory. The solution is to have reference counting for the ports, but this is not implemented.
+
+- An addition to the above: when transferring ports to a new component, the memory that remembers the state of that port is removed from the component that is creating the new one. Hence using old references to that port within the creating component's PDL code results in a crash.
+
+- Some control algorithms are not robust under multithreading. Mainly error handling while in sync mode (because there needs to be a revision where we keep track of which components are still reachable by another component), and complicated scenarios where ports are transferred.
+
+- There is an assertion in the interpreter that makes sure that there are no values left on the expression stack when a statement has completed. This is not true when you have an expression statement! If you want to remove this assertion, make sure to clear the stack (using the method on the `Store`).
+
+- The TCP listener component should probably do a `shutdown` before a `close` on the socket handle. It should also set the `SO_REUSEADDR` option.
+
+- The way in which putting ports are ordered to block if the corresponding getter port's main inbox is full is rather silly. This led to the introduction of the "backup inbox" as it is found in the runtime's code. There is a design decision to make here. There are two options: (a) have an atomic boolean indicating if the message slot for an inbox is full, or (b) do away with the "main inbox" altogether, and have an unbounded message queue.
+
+- For practical use in components whose code supports an arbitrary number of peers (i.e. their code contains an array of ports that is used for communication and changes in size during the component's lifetime), the `select` statement somehow needs to support waiting on any one of those ports.
+ +- The compiler currently prevents one from using `sync` blocks (or corresponding `get` and/or `put` operations) in functions. They can only be used within components. When writing large programs this it makes it rather hard to re-use code: all code that interacts with other components can only be written within a sync block. I would advise creating `sync func`, `nonsync` func and regular `func`tions. Where: + + - `sync func`tions can only be called from within `sync` functions. They may not open new sync blocks, but may perform calls to `get`/`put`. These are useful to encapsulate sequences of `put`/`get` calls together with some common message-modifying code. + - `nonsync func`tions (or `async func`tions) may only be called outside of sync blocks, and may open new sync blocks themselves. They are useful to encapsulate a single interaction with other components. One may also create new components here. + - regular `func`tions. Are as useful as in any other language, but here we disallow calling `nonsync func`tions or `sync func`tions. + +- The `Ack` messages that are sent in response to `PeerPortChanged_Block` messages should contain the sending components `(component ID, port ID)` pair in case the `PeerPortChanged_Block` message is relayed. When such an `Ack` message is received, the peer of the port must be updated before transferring the port to the new owner. + +- The compiler currently accepts a select arm's guard that is formulated as `auto a = get(get(rx))`. This should be disallowed. \ No newline at end of file diff --git a/docs/runtime/consensus.md b/docs/runtime/consensus.md deleted file mode 100644 index 6e24c401cfc97730c7ef1091170d6c4fe2845523..0000000000000000000000000000000000000000 --- a/docs/runtime/consensus.md +++ /dev/null @@ -1,213 +0,0 @@ -# Previous Consensus Algorithm - -## Introduction - -The previous consensus algorithm (the one within Reowolf 1.0 and 1.1) had support for speculative execution. This means that the user may (directly or indirectly) fork the execution of a component. That particular execution then becomes two executions. At some point a component will have to choose which particular execution will be committed to memory. This is one reason for the existence of a `sync` block: a block of code wherein one may perform forking, and at the end a component will have to choose the execution that is committed to memory. - -With speculative execution we may have multiple components that are all forking their execution and sending/receiving messages. So we do not end up with one component that has to choose its final execution, but all components choosing their final execution. Note that one component's execution may apply restrictions on the validity of another component's execution. As an example, suppose the following components and their executions: - -- Component A: Has two executions: - - Execution A1: Component A has sent a message to component B. - - Execution A2: Component A has received a message from component B. -- Component B: Has three executions: - - Execution B1: Component B has received a message from component A, then sends a message back to component A. - - Execution B2: Component B has received a message from component A. - - Execution B3: Component B has sent two messages to component A. - -Without delving into too much detail, one may see that the only valid solution to this problem is the combination of `A1` and `B2`. 
- -## Component Execution Tree, and Execution Traces - -Components execute PDL code, which may contain calls to `fork`, `put`, and `get`. A `fork` explicitly forks the execution of the code. A `put` sends a message to a particular component, and a `get` receives a message from a component and forks (as explained later). - -As the component enters a `sync` block, it has only one possible execution. But as stated above there are reasons for the execution to split up. These individual executions may themselves split up later, thereby forming a so-called "execution tree": - -``` - +-----+ +------+ - | put |------>| sync | -+-------+ +------+----->+-----+ | end | -| sync | | fork | +------+ -| start |----->+------+----->+-----+ -+-------+ | get |------>+------+ - +-----+ | sync | - | | end | - | +------+ - | - +------>+------+ - | | sync | - | | end | - | +------+ - | - +--> ... -``` - -This corresponds to the following PDL code: - -``` -primitive some_component(out tx, in rx) { - sync { - fork { - put(tx, 5); - } or fork { - get(rx, 1); - } -} -``` - -We can see the reason for calling the execution tree a "tree". There are several things to note about the execution tree: Firstly that some executions have been completed and form a complete trace, that is: starting from the "sync start" a complete trace may be represented by the line running to the "sync end". Conversely, there is one trace that is incomplete: there is a trace waiting at the `get` for a message. We'll call a place where the execution splits into multiple branches/executions a "branching point". - -Note that the branching points can in the *general case* only be discovered at runtime. Any code may have control flow points like `if` statements, or `while` loops. Consider the following code: - -``` -primitive some_component(out tx, bool which_way) { - sync { - if (which_way) { - fork { - put(tx, 1); - } or fork { - put(tx, 2); - } - } else { - put(tx, 3); - } - } -} -``` - -Depending on the value of `which_way` we produce two different execution trees (of which we can determine all traces). The compiler cannot decide at compile time which execution tree will be generated. - -Note that the `get` branching points have an arbitrary number of forked executions arising from them. We'll call them "waiting points". In the *general case* we cannot figure out how many forked executions arise from a `get` branching point. The reason being might be illustrated by the following simple example: - -``` -primitive sender(out tx, u32 num_forks) { - sync { - auto fork_counter = 1; - while (fork_counter < num_forks) { - fork { - put(tx, fork_counter); - } or fork { } // empty case - } - put(tx, num_forks); - } -} - -primitive receiver(in rx) { - u32[] values = {}; - sync { - bool keep_going = true; - while (keep_going) { - auto new_value = get(rx); - values @= { new_value }; // append - fork { - keep_going = false; - } or fork { } - } - } -} -``` - -If the sender is connected to the receiver, then the sender will send anywhere between `1` and `num_forks` messages (distributed over `num_forks` forks), depending on a user-supplied parameter (which we cannot figure out at compile-time). The isolated receiver can generate an infinite number of forked executions. We can analyze that the receiver will at most have `num_forks + 1` forked executions arising from its `get` branching point (the `num_forks` branches that do receive, and one final fork that is infinitely waiting on another message), but the compiler cannot. 
- -For this reason a `get` branching point needs to be kept around for the entire duration of the sync block. The runtime will always need to have a copy of the component's memory and execution state the moment it encountered a `get` instruction, because it might just be that another component (in perhaps a new fork, which we cannot predict) will send it another message, such that it needs to produce a new forked execution. - -A `get` operation is also a "blocking operation": in the *general case* the component needs to know the value produced by the `get` operation in order to continue its execution (perhaps more specifically: the first time a `read` operation is performed on the variable that will store the transmitted message). Consider the simple case where the received message contains a boolean that is used in the test expression of an `if` statement: we'll need to have actually received that boolean before we can decide which control flow path to take. Speculating on the contents of messages is too computationally expensive to be taken seriously. A put operation is not a blocking operation: the message is sent and the component continues executing its code. - -We've touched upon control flow points multiple times. We'll touch upon some aspects of control flow here, to more easily introduce the algorithm for finding consensus later. A component is fully described by its memory (i.e. all of the memory locations it has access to through its variables) and execution state (i.e. its current position in its code). So once a component encounters a control flow point, it can only take one control flow path. The calling of certain impure functions (e.g. retrieving a cryptographically secure random number) does not change this fact. Note that receiving values from other components might change a component's memory state, hence influence the control flow path it takes in the subsequent forked execution. Conversely, a component sending a value might influence another component's memory state. - -So before treading into more detail, here we've found that in the general case: - -- A interior of a sync block demarks the place where speculative execution may occur. -- Speculative execution implies that we end up with an execution tree. -- A path through the execution tree that reaches the end of the sync block is called a trace, and represents a valid execution of the sync block for the component (but perhaps not for a peer it interacted with). -- The set of traces produced by a component in its sync block can practically only be discovered at runtime. -- A `get` operation is necessarily a blocking operation that always incurs a branching point. A `put` operation is a nonblocking operation that does not incur a branching point. -- The trace of a component is influenced by the messages it has received. - -## Towards a Consensus Algorithm - -The solution to the consensus problem is somehow discovering the ways in which the components have influenced the memory state of their peers. If we have a complete trace for each component, for which all peer components agree on the way they have influenced that complete trace, then we've found a solution to the consensus problem. Hence we can subdivide the consensus problem into four parts: - -1. Keeping track of the messages that influence the memory state of components. -2. Keeping track of the peers that influence the memory state of components. -3. Finding a set of interactions between components on which all involved components agree, i.e. 
each `put` should have a corresponding `get` at the peer. -4. Somehow having a communication protocol that finds these agreeable interactions. - -We'll incrementally work towards a solution that satisfies the first three points. We'll not consider the last point, as this is essentially a gossip protocol. We define some terms to make the following discussion easier: - -- "sync region": The group of components that have interacted with one another and should agree on the global consensus solutionj. -- "local solution": A complete trace of a component. For the component this is a valid local solution, but might not be part of a global solution. -- "global solution": A set of traces, one for each of the components in the sync region, that all agree on the interactions that took place between the components in the sync region. - -Suppose a component can somehow predict exactly which messages we're going to receive during the execution of its code, we'll assume that each received message has the appropriate `get` call associated with it. In this case we're able to produce the set of complete traces that a component produces by symbolically executing its code: we start out with the initial memory state, might perhaps do some explicit `fork`ing, know exactly which messages we receive and how they influence the control flow, and arrive at the end of the sync block. - -However, as we've outlined above, we cannot know exactly which messages we're going to receive. We'll have to discover these messages while executing a component. The next best thing is to keep track of the values of the messages that we've received in a complete trace. Once we have complete traces for all of the interacting components, we can check that the received value corresponds to a sent value. e.g. - -``` -primitive sender(out tx) { - sync { - fork { - put(tx, 1); - } or fork { - put(tx, 2); - } - } -} - -primitive receiver(in rx) { - u32 value = 0; - sync { - value = get(rx); - } -} -``` - -Where `tx` is part of the same channel as `rx`. In this case we'll have two traces for each of the components, resulting in two valid global consensus solutions. In one solution the message `1` was transferred, in another the message `2` was transferred. There are two problems with this solution: firstly it doesn't take the identity of the channel into account. And secondly it doesn't take the effects of previous messages into account. - -To illustrate the first problem, consider: - -``` -primitive sender(out tx_a, out tx_b) { - sync { - fork { - put(tx_a, 1); - } or fork { - put(tx_b, 1); - } - } -} - -primitive receiver(in rx_a, in rx_b) { - u32 value = 0; - sync { - value = get(rx_a); - } -} -``` - -Here the fact that the sender has the solutions `1` and `1` does not help the receiver figure out which of those corresponds to its own solution of `1`. - -To illustrate the second problem, consider: - -``` -primitive sender(out tx) { - sync { - fork { - put(tx, 1); - put(tx, 2); - } or fork { - put(tx, 2); - } - } -} - -primitive receiver(in rx) { - u32 value = 0; - sync { - value = get(rx); - } -} -``` - -Now we'll have `sender` contributing the solutions `1, 2` and `2`. While the receiver will generate the solutions `1`, `2` and `2`. The reason there are three solutions for the receiver is because it cannot figure out that the message `2` from the sender depended on the first message `1` from the sender having arrived. 
- -We can solve this by somehow embedding the identity of the channel associated with the message, and by describing all of the previous interactions \ No newline at end of file diff --git a/docs/runtime/design.md b/docs/runtime/design.md deleted file mode 100644 index 13bf6de648fc5dec6de45bbf30cdfc6eac18a1d3..0000000000000000000000000000000000000000 --- a/docs/runtime/design.md +++ /dev/null @@ -1,15 +0,0 @@ -# Runtime Design - -## Preliminary preliminaries - -There will be some confusion when we're using the word "synchronization". When we talk about OS-syncing we mean using synchronization primitives such as atomics, mutexes, semaphores etc. When we talk about sync-blocks, sync-regions, etc. we mean the Reowolf language's distributed consensus feature. - -## Preliminary Notes - -The runtime was designed in several iterations. For the purpose of documentation, we have had: - -- Reowolf 1.0: A single-threaded, globally locking runtime. -- Initial 1.2: A single-threaded runtime, no longer globally locking. The newly designed consensus algorithm worked quite well and reasonably efficiently (not measured by comparison to another runtime, rather, the idea of the consensus algorithm was simple and efficient to perform) -- Multithreaded 1.2, v1: Here is where we moved towards a more multithreaded design. From the start, the idea was to "maximize concurrency", that is to say: we should only use OS-syncing when absolutely appropriate. Furthermore the initial implementation should be somewhat efficient: we should not employ locks when it is not absolutely necessary. The following remarks can be made with respect to this initial multithreaded implementation: - - Because there will generally be far more components than there are hardware threads, an efficient implementation requires some kind of scheduler that is able to execute the code of components. To track which components are supposed to run, there will be a work queue. The initial implementation features just a single global work queue. Each thread that is executing components is called a scheduler in this document. - - At the most basic level, a component has properties that allow access by only one writer at a time, and properties that are conceptually (ofcourse we need some kind of OS synchronization) accessible by multiple writers at a time. At the very least, executing the code of a component should only be performed by one writer at a time (the scheduler), while sending messages to a component should be allowed by multiple schedulers. Hence the runtime splits component properties into two: those that should only be accessed by the scheduler that is executing the code, and those that should be accessible by all schedulers at any time. \ No newline at end of file diff --git a/docs/runtime/known_issues.md b/docs/runtime/known_issues.md deleted file mode 100644 index 1e6b3c7b18bf2d314979c2fec31fd7e343e67397..0000000000000000000000000000000000000000 --- a/docs/runtime/known_issues.md +++ /dev/null @@ -1,25 +0,0 @@ -# Known Issues - -The current implementation of Reowolf has the following known issues: - -- Cannot create uninitialized variables that are later known to be initialized. This is not a problem for the regular types (perhaps a bit tedious), but is a problem for channels/ports. That is to say: if a component needs a temporary variable for a port, then it must create a complete channel. e.g. 
- - ``` - comp send(out tx1, out tx2, in which) { - channel unused -> temporary; - while (true) sync { - if (get(which)) { - temporary = tx1; - } else { - temporary = tx2; - } - put(temporary, 1); - } - } - ``` - -- Reserved memory for ports will grow without bounds: Ports can be given away from one component to another by creating a component, or by sending a message containing them. The component sending those ports cannot remove them from its own memory if there are still other references to the transferred port in its memory. This is because we want to throw a reasonable error if that transferred port is used by the original owner. Hence we need to keep some information about that transferred port in the sending component's memory. The solution is to have reference counting for the ports, but this is not implemented. - -- An extra to the above statements: when transferring ports to a new component, the memory that remembers the state of that port is removed from the component that is creating the new one. Hence using old references to that port within the creating component's PDL code results in a crash. - -- Some control algorithms are not robust under multithreading. Mainly error handling when in sync mode (because there needs to be a revision where we keep track of which components are still reachable by another component). And complicated scenarios where ports are transferred. \ No newline at end of file diff --git a/docs/runtime/sync.md b/docs/runtime/sync.md deleted file mode 100644 index 305e02ae0f4a03f02538f64303e52c99c6b2dfa6..0000000000000000000000000000000000000000 --- a/docs/runtime/sync.md +++ /dev/null @@ -1,203 +0,0 @@ -# Synchronous Communication and Component Orchestration - -## Introduction - -The Reowolf runtime consists of a system that allows multiple components to run within their own thread of execution. These components are able to exchange messages with one another. Components are capable of creating other components, and of creating channels. We may visualise the runtime as a cloud of all kinds of components, connected by the communication channels between them, hence a kind of communication graph. - -With this version of the runtime there were several main drivers. For performance reasons we want: - -- As little centralized information as possible (because centralization of information implies synchronization to access it). -- As much parallelism as possible (when information *must* be exchanged, then make sure as little components are affected as possible). - -To keep the complexity of the runtime to a reasonable minimum, the following requirements have to be met as well: - -- A component may terminate, therefore potentially not being able to participate in synchronous communication or receive messages. Experimentation showed that the system that ensures that termination is broadcast to all peers should be kept simple (earlier systems depended on a state-machine with assumptions about the order in which messages were exchanged, which greatly complicates the messaging subsystem of the runtime). -- Messages should arrive in order (with some exceptions). As we'll see later in this document we have different types of messages. Reasoning about a component's operational state becomes much simpler if we can assume that the transmission of messages between components is ordered. - -As we will see there are several types of messages that can be exchanged. 
Among them we have: - -- Data messages: these messages contain the data that is "transmitted from and to PDL code". For each `put` a data message is annotated by the runtime and sent along to the receiving component, which will then hopefully retrieve the data with a `get`. These messages are conceptually sent over channels. -- Sync messages: these messages are sent between components to communicate their consensus-state. These messages are not necessarily associated with channels. -- Control messages: these messages are sent between components to ensure that the entire runtime is reliably facilitating data exchange. That is: they ensure that the language is working as intended. As an example: sending a port to a different component requires a bit of bookkeeping to ensure that all involved components are aware of the port exchange. - -The remainder of this document tries to describe the various technical aspects of synchronous communication and component orchestration. - -## Brief Description of Schedulers - -Each component conceptually has its own thread of execution. It is executing a program with at any given point in time a particular memory state. In reality there are a limited number of threads that execute components. Making sure that components are scheduled correctly is based on the fact that components are generally executing programs that are blocked at some point: a message needs to be received, or a port is blocked so we cannot send any information. At that point a component is considered "sleeping". Should another component, scheduled on a particular thread, send a message to this sleeping component, then it is "woken up" by putting it into the execution queue. - -The job of the scheduler is then to: execute a component scheduled for execution, wait until a component is scheduled, or shut down in case there are no more components to execute. - -The details of the execution queue (currently rather simplistically implemented) is not of interest. What is of interest is that a component can only be scheduled once. - -## Creation of Channels - -Within PDL code it is possible to create new channels. And so a component will always (that is to say: for now) create both endpoints of the channel, hence own both endpoints of the channel upon creation. Identifiers for these ports are generated locally (we don't want to resolve to having to synchronize on some kind of global port ID counter). - -As these IDs are generated locally there is no real technical challenge, but note that ports at different components may have the same port ID. - -## Creation of Components - -Within PDL code it is possible to create components. Upon their creation they can be given endpoints of channels. Hence at component creation we are changing the configuration of the communication graph. All of the relevant components need to be informed about the port changing ownership. - -Here we run into a plethora of problems. The other endpoint might have been given away to another created component. The other endpoint may have already been used in communication, such that we already have messages underway for the port we're trying to give to a newly created component. We may have that the local port ID assigned by the creating component is not the same as the local port ID that the newly created component may want to assign to it. We may have that this port has been passed along multiple times already, etc. - -We cannot help that messages have already arrived, or are in transit, for the transferred port. 
But we can make some assumptions that simplify the transfer of ports. As a first one, we have that the creating component decides when the created component is scheduled for execution. We'll choose not to execute it initially, such that we can be sure that it will not send messages over its ports the moment is created. To further simplify the problem, we have assumed that messages arrive in order. So although messages might still be underway for the transferred ports, if we ask the sender to stop sending, and the sender blocks the port and acknowledges that it has received this command. Then the moment the creator receives the acknowledgement it is certain that it has received all messages intended for the transferred ports. We'll also disallow the creation of components during synchronous interactions. - -And so here we have our first control protocol. If a port is transferred then we might have: - -1. That the peer port is transferred to the new component as well. All is fine and we can take care of the exchange immediately. -2. That the peer port stays with the creating component. Here all is fine as well, everything is running in a single thread of execution so we diligently do our bookkeeping on the data associated with the port and the channel and we can transfer the port. -3. The peer port is already owned by a different component. Here we need to have a slightly more complicated protocol - -In this last case we take the following actions, `C` for creating component, `N` for newly created component, and `P` for the peer component that holds the other port of the same channel. - -1. `C` transfers the port to the newly created component `N`, and ask it to come up with a new ID for that port. The port had an ID that was decided by its old owner `C`, and now has one that is agreeable with component `N`. -2. `C` sends a control message `PeerChangeBlockPort` to the peer `P`. -3. `P` receives the `PeerChangeBlockPort` message. It causes the peer port to be temporarily blocked. `P` may still continue executing its code, but the moment it wishes to send something over this port it is forced to block its execution. In response `P` sends an `Acknowledge` message back to `C`. -4. `C` waits for the `Acknowledge` message of `C`. Since the `Acknowledge` message was sent after the last data message that `P` sent to the port under consideration, and because `P` has blocked that port, we are certain that we received all messages. We transfer these messages to `N`. -5. Note that there may be multiple ports being transferred from `C` to `N`, each with a potentially different peer. And so here `C` waits until steps 2 through 4 are completed for all of the transferred ports. -6. Once `C` has received all of the `Acknowledge` messages it was waiting for, it will send a `PeerChangeUnblockPort` message to each peer `P`. This message contains the new port ID, such that `P` can unblock its port, and continue sending messages over this channel, but now correctly arriving at `N`. Additionally, `C` will now schedule the new component `N`. - -There is a bit of extra overhead here with the `PeerChangeBlockPort` -> `Acknowledge` -> `PeerChangeUnblockPort` sequence with respect to other possible schemes. But this one allows `P` to continue executing its code as long as it does not use its blocked port. 
It also ensures that messages arrive in order: they're all collected by `C`, given to `N`, and only then may `P` continue creating messages to `N`, hence arriving after the earlier messages have been handed off to `N`. - -As we'll later introduce, ports can end up being closed. When a port is closed the above procedure is not initiated. A port cannot be reopened, and once a port is closed its peer is also notified that the channel cannot be used anymore. As such, it is not needed (and perhaps impossible, if the memory backing the peer component has already been freed) to notify that the port will change peer. - -## Managing the Lifetime of Components - -Components are created by other components or by the API. Note that the API may for intents and purposes be viewed as a component. It may own and create channels, and it may create components. Component start, like a function, executing at the top of their program listing and may end in two ways. Either they encounter an unrecoverable error, or they neatly reach the end of their execution. - -In either case we want to free up the memory that was occupied by the component during its execution. And we want to make sure that any kind of pending message/action is correctly handled. Apart from that the component contains somekind of message inbox, whose memory is associated with that component. Hence we want to make sure that all peers are aware that they're no longer able to send messages to a terminated component. - -A small interlude before we continue: trying to take care of unrecoverable errors that occur during a sync round by incorporating the appropriate logic into the consensus algorithm proved to be rather hard. It caused a large increase in the number of states a component could be in, and as a result made the code much harder to reason about. That is: not so much communicating that an error had ocurred (that needs to occur in every synchronous algorithm), but keeping track of which messages can be sent to which component during the consensus algorithm. - -For this reason there are two systems that make sure that components stay alive as long as needed. Firstly components will have a reference counter. For simplicity the component also holds a reference to itself. The component will remove this self-reference when it terminates. Each channel causes two components to also hold references to eachother. *If* a consensus algorithm is implemented such that a central components ends up communicating to all participating parties (using channels or not using channels), then we can make sure that it can reach all participating components by incrementing their reference counts (note that this is not yet properly implemented in the runtime). - -Through this mechanism the consensus algorithm can be greatly simplified. If a component encounters a critical error and is already participated in a sync round, then it can notify the other participants, but remain reachable to all other components until the round ends (and the reference counts are decreased again). - -A second system is needed to ensure that a component actually exits (because all the peers hold a reference, and we need all of those references to drop to 0 to truly remove the component from the runtime). And so when a component exits it will send `ClosePort` messages to all of its peers. These peers will `Acknowledge` those messages and close the respective ports, meaning that they will no longer allow sending messages over those ports, that will be a fatal error. 
However, messages that were sent to the exiting component before receiving the `ClosePort` message might still be underway. And so the terminating component will wait for all of the `Acknowledge` messages for all of its channels such that it knows that it has received all data messages. The component will respond to these intermediate messages with a `DataMessageFailed` message, meaning that the message has been received the moment the component was already terminated, hence the sender should consider this a failed message transmission. - -Bringing these systems together: apart from data messages there might still be control messages in transit, or the exiting component might still have some control/sync work to do. And so we need to modify something we said earlier: instead of a component removing its self-reference the moment it terminates, we do this the moment we have received all the `Acknowledge` messages we were expecting. This way if a component is busy creating another component, we're certain that the appropriate protocols are observed. So: - -Concluding everything described above, two separate mechanisms will act in conjunction: - -- A. The exiting component `E` waiting until it has finished notifying all peers `P` of its termination: - 1. Component `E` sends a `ClosePort` message to each peer `P` for each of the shared channels (except when the port is already closed because `P` is also in the process of shutting down). - 2. The peer `P` receives the `ClosePort` message and closes the port as a result. This is a change of port state that will cause any subsequent `put` operations on that port to become a fatal error for component `P`. In response the peer `P` sends an `Acknowledge` message to component `E`, unless component `P` exited itself (that is: sending a `ClosePort` message to `E` before the `ClosePort` message from `P` arrived). After closing the port the component `P` will remove the reference to `E`. - 3. Component `E` waits until all of its pending control operations are finished (i.e. waiting for the `Acknowledge` messages following `ClosePort` messages, `PeerChangeBlockPort` messages, etc.). Once all of these are finished, note that we can no longer participate in any future control actions: component `E` will not create channels/components itself. Since all of its ports are closed, the peers `P` will also not send any data or control messages. - 4. Component `E` now checks its inbox for any remaining messages. It will respond to any data messages that arrived after `E` sending `ClosePort` and before `E` receiving `Acknowledge` with a `DataMessageFailed` message (except for ports that were closed before the `Acknowledge`). Then it removes the reference to itself, therefore decrementing the reference counter for the component by 1. - -- B. The reference counting mechanism. Any sync round the exiting component `E` is participating in will conceptually hold a reference to `E`. The component `E` will always respond to sync messages as if it were alive (albeit trying to indicate to everyone that it is actually exiting). The component that removes the last reference to the component `E` (which may be `E` itself, but also a peer `P`) will truly remove the associated memory from the runtime. - -**Note**: We did not consider machine termination. That is to say: once we reach the runtime maturity where communication occurs over different machines, then we have to consider that machines encounter fatal errors. 
However these can only be resolved by embedding the possibility of failure inside the protocol over which these machines communicate. - -## Sending Data Messages - -So far we've discussed the following properties associated with sending data messages: - -1. Port IDs are decided locally. So a peer may have an ID that is outdated for the intended recipient. -2. Ports can move from owner to owner. So a peer might have a component ID that is outdated for the intended recipient. -3. Ports may be closed. -4. Message intended for specific ports may end up at an intermediate component that is passing that message along. - -However, there are some properties that we can take advantage of: - -1. When a component sends a message, it knows for certain what its own component ID and port ID is. So a transmitting port always sends the correct source component/port ID. -2. When a component receives a message, it knows for certain what its own component ID and port ID is. So once a receiving port receives a properly annotated message from a transmitted port, the receiving end can be certain about the component IDs and port IDs that make up the channel. - -Note that although the message transmitter may not be certain about its final recipient, the components along the way *are* aware of the routing that is necessary to make the message arrive at the intended target. Combing back to the case where we have a creator `C`, new component `N` and peer `P`. Then `P` will send a message intended for `N`, but arriving at `C`. Here `C` can change the target port and component to `N` (as it is in the process of transferring that port, so knows both its original and new port ID). Once the message arrives and is accepted by the recipient then it is certain about the component and port IDs. - -## Sending Sync Messages - -Sync messages have the purpose of making sure that consensus is reached about the interactions that took place in all of the participating components' sync blocks. The previous runtime featured speculative execution and a branching execution model: a component could exhibit multiple behaviours, and at the end all components decide which combination of local behaviours constitute a satisfying single global behaviour (if one exists at all). Without speculative execution the model can be a lot simpler. - -We'll only shortly discuss the mechanisms that are present in the synchronization algorithm. A component has a local counter, that is reset for each synchronous interaction, that is used when transmitting data messages. Such a message will be annotated with the counter at `N`, after which the component sends the next message with annotation `N+1`. At the same time the component will keep track of a local mapping from port ID to port annotation, we'll call this the port mapping. Likewise, when a component receives a data message it will assign the received annotation in its own port mapping. If two components assign the same annotation to the ports that constitute a channel, then there is an agreeable interaction there. - -And so at the end of the synchronous round a component will somehow broadcast its port mapping. Note from the discussion above that a transmitting port's annotation is only associated with that transmitting port, since a transmitting port can only truly ever know its own component/port ID. While the receiving port's annotation knows about the peer's component/port ID as well. 
And so at the end of the synchronous round a component will somehow broadcast its port mapping. Note from the discussion above that a transmitting port's annotation is only associated with that transmitting port, since a transmitting port can only ever truly know its own component/port ID, while a receiving port's annotation is associated with the peer's component/port ID as well. And so a component can broadcast `(component ID, port ID, mapping)` for each of its transmitting ports, and `(own component ID, own port ID, peer component ID, peer port ID, mapping)` for each of its receiving ports. A recipient of these mappings can then match them up and make sure that the mappings agree.
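A minimal sketch of how such broadcast entries could be matched up is shown below. The entry types and the `mappings_agree` check are illustrative; the real consensus algorithm (leader election, submitting local solutions, and so on) is not shown.

```
use std::collections::HashMap;

// Hypothetical identifiers; not the runtime's actual types.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct ComponentId(u32);
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct PortId(u32);

/// Broadcast entry for a transmitting (`put`ting) port: the sender only knows
/// its own end of the channel.
struct PutEntry {
    component: ComponentId,
    port: PortId,
    mapping: u32,
}

/// Broadcast entry for a receiving (`get`ting) port: the receiver also knows
/// the peer's identity, learned from the messages it accepted.
struct GetEntry {
    component: ComponentId,
    port: PortId,
    peer_component: ComponentId,
    peer_port: PortId,
    mapping: u32,
}

/// Check that every receiving port's mapping agrees with the mapping reported
/// by the transmitting side of the same channel.
fn mappings_agree(puts: &[PutEntry], gets: &[GetEntry]) -> bool {
    let by_put_end: HashMap<(ComponentId, PortId), u32> = puts
        .iter()
        .map(|p| ((p.component, p.port), p.mapping))
        .collect();

    gets.iter().all(|g| {
        // The channel is only agreeable if the transmitting side submitted
        // exactly the same annotation for its end of the channel.
        by_put_end.get(&(g.peer_component, g.peer_port)) == Some(&g.mapping)
    })
}
```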
Note that this broadcasting of synchronous messages is essentially a component-to-component operation. However, these messages must still be sent over ports (and any port that was used to transmit messages to a particular receiving component will do). There are two reasons:

1. The sync message may need to be rerouted (e.g. a sender quickly fires both a data message and a subsequent sync message while the receiving port is being transferred to a new component), but it needs to arrive at the same target as the data message. This is essentially restating that a transmitter never knows the component ID of the recipient.
2. The sync message must not be taken into account by the recipient if it has not accepted any messages from the sender yet. Of course this can be achieved in various ways, but a simple way to achieve it is to send the sync message over ports.

## Annotating Data Messages

These port mappings are also sent along when sending data messages. We will not go into details, but here the mapping makes sure that messages arrive in the right order, and that certain kinds of deadlock or inconsistent protocol behaviour can be detected. This port mapping is checked for consistency by the recipient and, when consistent, the target port is updated with its new mapping.

Ideally, when sending along this mapping, we would only consider the ports that are shared between the two components. But in the most general case the transmitting ports of a component have no knowledge about the peer component, and so the transmitted port mapping has to contain the annotation for *all* transmitting ports. Receiving port mappings only have to be sent along if they have received a message, and there we can indeed apply this filtering. Likewise, if the recipient of a port mapping has not yet received anything on one of its receiving ports, then it cannot be sure about the identity of the sender on that port.

This leads to problems for ensuring the correct ordering of the messages; for finding consensus it does not. Suppose that a port `a` sends a message to a port `b`, and port `b` does not accept it. Upon trying to find consensus we see that `a` submits an entry in its port mapping, while `b` does not submit anything at all. Hence no solution can be found, as desired.

For the message ordering we require from the receiver that it confirms that, for all of the shared channels, it has the same mapping as the one the sender sends along. Suppose a component `A` has ports `a_put` and `b_put`, while a component `B` has ports `a_get` and `b_get` (where `a_put -> a_get` and `b_put -> b_get`). Suppose component `A` sends on `a_put` and then on `b_put`, and component `B` only receives from `b_get`. Then, since `a_get` did not receive from `a_put` (hence did not learn that the component/port ID pair of `a_put` is associated with `a_get`), component `B` cannot figure out that a `get` on `a_get` should precede the `get` on `b_get`. Likewise component `A` has no way of knowing that `a_put` and `b_put` are sending to the same component, so it cannot indicate to component `B` that `a_get` should precede `b_get`.

There are some simple assumptions we can make that make the problem a little easier to think about. Suppose again `a_put -> a_get` and `b_put -> b_get`. Suppose `a_put` is used first, sending along the mapping of `a_put` and `b_put`. Then we send on `b_put`, again sending along the mapping. It is then possible for the receiving component to accept the wrong message first (e.g. on `b_get`, thereby learning about channel `b`), but it will be impossible to `get` from `a_get` later, since that message requires `b_put` (of which we have learned that it matches `b_get`) to not have sent any messages yet.

Without adding any extra overhead (e.g. some kind of discovery round per synchronous interaction), we can take three approaches:

1. We simply don't care. It is impossible for a round in which messages are received out of order to complete. Hence we temporarily allow a component to take the wrong actions, thereby wasting some CPU time, and to crash/error afterwards.
2. We remove the entire concept of ordering between the channels of a single component. Channels are always independent entities. This way we truly do not have to care: all we require is that the messages sent over a channel arrive at the other side.
3. We slightly modify the algorithm to detect these problems. This can be done in a reasonable fashion, albeit a bit "hacky". For each channel there is a slot to receive messages; messages wait there until the receiver performs a `get` in the PDL code. So far we've only considered learning about the component/port IDs that constitute a channel the moment a message is received with a `get`. The algorithm can be changed to already learn about the peer component/port ID the moment the message arrives in the slot.

We'll go with the last option in the current implementation. Returning to the problematic example above: messages between two components are delivered in order, and `a_put` happens before `b_put`. So component `B` will first learn that `a_put` is the peer of `a_get`, and only then perform its first `get` on the message from `b_put` to `b_get`. That message is annotated with a port mapping indicating that `a_put` has been used before. We're now able to detect at component `B` that we cannot accept on `b_get` before `a_get`.

Concluding:

- Every data message that is transmitted needs to contain the port mapping of all `put`ting ports (annotating them appropriately if they have not yet been used). We also need to include the port mapping of all `get`ting ports that have a pending/received message. The port mapping for `put`ting ports will only include their own ID, while the port mapping for `get`ting ports will include the IDs of their peers as well.
- Every arriving data message is immediately used to identify the sender as the peer of the corresponding `get`ting port. Since messages between two components arrive in order, this allows us to detect when the `put`s at the sender happen in a different order than the `get`s at the receiver. A sketch of this detection is shown below.
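The sketch below illustrates the detection described in the concluding bullets. It assumes a hypothetical message layout (`DataMessage` carrying the sender's `put` mapping) and a receiver that learns peer identities the moment a message arrives in a slot; the actual runtime structures differ.

```
use std::collections::HashMap;

// Illustrative identifiers and message shape; not the runtime's actual types.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct PortId(u32);

struct DataMessage {
    source_port: PortId, // the `put`ting port this message was sent from
    // Annotations of *all* the sender's `put`ting ports at the time of sending:
    // `None` means "not used yet this round", `Some(n)` means "used, counter n".
    sender_put_mapping: HashMap<PortId, Option<u32>>,
}

/// Receiver-side state for a single sync round (option 3 from the list above).
struct Receiver {
    // Learned the moment a message *arrives* in a slot: peer put port -> our get port.
    known_peers: HashMap<PortId, PortId>,
    // Annotation adopted for each of our get ports once we accepted a message on it.
    accepted: HashMap<PortId, u32>,
}

impl Receiver {
    /// Called when a message arrives in the slot of `our_get_port`, before any `get`.
    fn on_arrival(&mut self, our_get_port: PortId, msg: &DataMessage) {
        self.known_peers.insert(msg.source_port, our_get_port);
    }

    /// Called when the PDL code performs a `get`. Returns an error if the sender's
    /// mapping shows that another of its `put` ports was used earlier, while we have
    /// not yet accepted anything on the corresponding `get` port: the `get`s are out
    /// of order with respect to the `put`s.
    fn on_get(&mut self, our_get_port: PortId, annotation: u32, msg: &DataMessage) -> Result<(), String> {
        for (peer_put_port, used) in &msg.sender_put_mapping {
            if used.is_none() || *peer_put_port == msg.source_port {
                continue; // not used yet, or it is this very message's port
            }
            if let Some(expected_get_port) = self.known_peers.get(peer_put_port) {
                if !self.accepted.contains_key(expected_get_port) {
                    return Err(format!(
                        "message on port {:?} received out of order: port {:?} must be `get` first",
                        our_get_port, expected_get_port
                    ));
                }
            }
        }
        self.accepted.insert(our_get_port, annotation);
        Ok(())
    }
}
```

In the `a_put`/`b_put` example from the text, `on_arrival` is called for the message sitting in `a_get`'s slot before the component attempts the `get` on `b_get`, which is exactly what makes the error branch trigger.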
## Handling Fatal Component Errors

Components may, during their execution, encounter errors that prevent them from continuing to execute their code. For the purposes of this chapter we consider these errors to occur during one of two phases of execution:

1. The error occurred outside of a sync block. Or, equivalently (from the point of view of the runtime): the error occurred inside a sync block, but the component has not yet interacted with other components through `put`/`get` calls.
2. The error occurred inside of a sync block. The component can have performed any number of `put`/`get` calls, but for the sake of discussion we will only discuss the case where we perform:
   1. One `put` in the synchronous round.
   2. One `get` in the synchronous round.

As a preliminary remark: note that encountering an error is, by itself, nothing special: the component can simply print an error to `stdout` and stop executing. It is the handling of the error by the peers that is of importance! If an interaction is made impossible because a peer has stopped executing, then the component that wishes to perform that interaction should error out itself.

### Handling Errors Outside of a Sync Block

Suppose a component `E` encounters a critical error outside of a sync block. Then we can be sure that, if it had a last synchronous round, that round succeeded. However, there might be future synchronous rounds expected of component `E`; likewise a peer component `C` might have already put a message in `E`'s inbox.

The requirement for the outside-sync error of `E` is that any future sync interaction of `C` with `E` will fail (but if `C` has no such future interactions, it shouldn't fail at all!).

Note that `E` cannot perform `put`/`get` requests, because we're assuming `E` is outside of a sync block. Hence the only possible failing interaction is that `C` has performed a `put`, or is attempting a `get`. In the case that `C` `put`s to `E`, `E` might not have figured out the identity of `C` yet (see the earlier remarks on the eventual consistency of peer detection). Hence `C` is responsible for ensuring its own correct shutdown due to a failing `put`. Likewise for a `get`: `C` cannot receive from `E` if `E` is failing. So if `C` is waiting on a message to arrive, or if it will call `get` in the future, then `C` must fail as well.

In this case it is sufficient for `E` to send around a `ClosePort` message, as detailed earlier in this document. However, a particular race condition might occur. We have assumed that `E` is not in a sync block, but `C` is not aware of this fact. `C` might not be able to distinguish between the following three cases:

1. Regular shutdown: components `C` and `E` are not in a sync round.
   - `E` broadcasts `ClosePort`.
   - `C` receives `ClosePort`.
2. Shutdown within a sync round, `ClosePort` leads `Solution`: a leader component `L`, a peer component `C` and the failing component `E`. Assume that all are/were busy in a synchronous round with one another.
   - `L` broadcasts `Solution` for the current sync round.
   - `E` receives `Solution`, finishes the round.
   - `E` encounters an error, so it sends `ClosePort` to `C`.
   - `C` receives `ClosePort` from `E`.
   - `C` receives `Solution` from `L`.
3. Shutdown within a sync round, `Solution` leads `ClosePort`: the same components `L`, `C` and `E`.
   - `L` broadcasts `Solution` for the current sync round.
   - `E` receives `Solution`, finishes the round.
   - `E` encounters an error, so it sends `ClosePort` to `C`.
   - `C` receives `Solution` from `L`.
   - `C` receives `ClosePort` from `E`.

In all described cases `E` encounters an error after finishing a sync round. But from the point of view of `C` it is unclear whether the `ClosePort` message pertains to the current synchronous round or not. In cases 1 and 3 nothing is out of the ordinary. But in case 2, `C` is at some point in time aware of the `ClosePort` from `E`, but not yet of the `Solution` from `L`. `C` should not fail the sync round, as it has completed, but `C` is not yet aware of this fact.
As a rather simple solution, since components that are participating in a sync round with one another move in lock-step at the end of the sync block, we send a boolean along with the `ClosePort` message, e.g. `ClosePort(nonsync)`. This boolean indicates whether `E` was inside or outside of a sync block when it encountered the error. Now `C` can distinguish between the three cases: in all of them it learns that `E` was not in a sync block (and hence: the sync round in cases 2 and 3 can be completed).

### Handling Errors Inside of a Sync Block

If `E` is inside of a sync block, then it has interacted with other components. Our requirement now is that the sync round fails (and of course, that all of the peers are notified that `E` will no longer be present in the runtime). Two things complicate this type of failure:

1. Suppose that in the successful case of the synchronous interaction a large number of components interact with one another. Now it might be that `E` fails very early in its sync block, such that it cannot interact with several of those components. This lack of interaction might cause the single synchronous region to break up into several smaller regions. Each of these separated regions is supposed to fail.
2. Within a particular synchronous interaction we might have that the leader `L` holds a reference to the component `E` without `E` being a direct peer. There is a reference counting system in place that makes sure that `L` can always send messages to `E`, but we still need to make sure that those references stay alive for as long as needed.

Suppose a synchronous region is (partially) established, and the component `E` encounters a critical error. The two points given above imply that two processes need to be initiated. For the first error-handling process we use the same scheme as described in the case where `E` is not in a synchronous region, except that we now broadcast `ClosePort(sync)` instead of `ClosePort(nonsync)` messages. Consider the following two cases:

1. Component `C` is not part of the same synchronous region as `E`, and component `C` has tried `put`ting to `E`. If `C` receives a `ClosePort(sync)`, then it knows that its interaction should fail. Note: it might be that `E` wasn't planning on `get`ting from `C` in the sync round in which `E` failed, but only much later. In that case it still makes sense for `C` to fail; it would have failed in the future anyway. A small inconsistency here (within the current infinitely-deadlocking implementation) is that if `E` would *never* `get` from `C`, then `C` would deadlock instead of crash (one could argue that this implies that deadlocking should lead to crashing through a timeout mechanism).
2. Component `C` is not part of the same synchronous region as `E`, and if `E` wouldn't have crashed, then it would have `put` a message to `C`. In this case it is still proper for `C` to crash; the reasoning works the same as above.

So this `ClosePort(sync)` message causes instant failure of `C` if it has used the closed port in a round without consensus, or if it uses that port in the future. Note that this `ClosePort(sync)` system causes cascading failures throughout the disjoint synchronous regions. This is as intended: once one component's PDL program can no longer be executed, we cannot depend on the discovery of all the peers that constitute the intended synchronous region. So instead we rely on a peer-to-peer mechanism to make sure that every component is notified of the failure. A sketch of how a peer could react to the two `ClosePort` variants follows below.
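The following sketch summarizes the peer-side decision. The names (`ClosePortKind`, `PeerState`, `handle_close_port`) are hypothetical and the state space is heavily simplified; it only serves to show when a peer fails its current round versus merely closing the port.

```
// Illustrative sketch of a peer's reaction to the two ClosePort variants.

#[derive(Clone, Copy)]
enum ClosePortKind {
    NonSync, // the exiting component was outside of a sync block
    Sync,    // the exiting component failed inside of a sync block
}

#[derive(Clone, Copy)]
enum PeerState {
    NotInSync,        // the peer is not currently in a sync round
    InSyncUsingPort,  // in a round, and this port was used without consensus yet
    InSyncPortUnused, // in a round, but this port was not used
}

enum Outcome {
    ContinueRunning,  // close the port; only fail if the port is used later
    FailCurrentRound, // the current round can no longer reach consensus
}

fn handle_close_port(kind: ClosePortKind, state: PeerState) -> Outcome {
    match (kind, state) {
        // The sender failed inside a sync round and we interacted with it
        // without having reached consensus: our round must fail as well,
        // which in turn cascades the failure to our own peers.
        (ClosePortKind::Sync, PeerState::InSyncUsingPort) => Outcome::FailCurrentRound,
        // The sender failed outside of a sync block, or we never used the port
        // in the current round: close the port and keep running. Any future
        // `put`/`get` on the closed port becomes a fatal error at that point.
        _ => Outcome::ContinueRunning,
    }
}
```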
- -However, while these cascading peer-to-peer `ClosePort(sync)` messages are happily shared around, we still have a leader component somewhere, and components that have not yet been notified of the failure. \ No newline at end of file diff --git a/src/runtime2/stdlib/internet.rs b/src/runtime2/stdlib/internet.rs index d32e849a1b76464ae208cb9e11c18789b56e5ea9..4744cc7835343d0c4018326214dda8eb824e99aa 100644 --- a/src/runtime2/stdlib/internet.rs +++ b/src/runtime2/stdlib/internet.rs @@ -45,7 +45,6 @@ impl SocketTcpClient { return Err(SocketError::Modifying); } - println!(" CREATE [{:04}] client", socket_handle); return Ok(SocketTcpClient{ socket_handle, is_blocking: SOCKET_BLOCKING, @@ -93,7 +92,6 @@ impl SocketTcpClient { impl Drop for SocketTcpClient { fn drop(&mut self) { - println!("DESTRUCT [{:04}] client", self.socket_handle); debug_assert!(self.socket_handle >= 0); unsafe{ close(self.socket_handle) }; } @@ -132,7 +130,6 @@ impl SocketTcpListener { } - println!(" CREATE [{:04}] listener", socket_handle); return Ok(SocketTcpListener{ socket_handle, is_blocking: SOCKET_BLOCKING, @@ -147,14 +144,12 @@ impl SocketTcpListener { return Err(IoError::last_os_error()); } - println!(" CREATE [{:04}] client (from listener)", socket_handle); return Ok(socket_handle); } } impl Drop for SocketTcpListener { fn drop(&mut self) { - println!("DESTRUCT [{:04}] listener", self.socket_handle); debug_assert!(self.socket_handle >= 0); unsafe{ close(self.socket_handle) }; }