CSY/reowolf Changeset - 95d7fd5dafe5 · Centrum Wiskunde & Informatica (CWI)

Changeset - 95d7fd5dafe5

Parent rev.

Child rev.

[Not reviewed]

0 1 0

mh - 4 years ago 2022-03-02 19:04:51
contact@maxhenger.nl

Extend documentation regarding components

1 file changed with 13 insertions and 1 deletions:

docs/runtime/sync.md

0 comments (0 inline, 0 general)

docs/runtime/sync.md

➞

Show inline comments

 Here we run into a plethora of problems. The other endpoint might have been given away to another created component. The other endpoint may have already been used in communication, such that we already have messages underway for the port we're trying to give to a newly created component. We may have that the local port ID assigned by the creating component is not the same as the local port ID that the newly created component may want to assign to it. We may have that this port has been passed along multiple times already, etc.
 We cannot help that messages have already arrived, or are in transit, for the transferred port. But we can make some assumptions that simplify the transfer of ports. As a first one, we have that the creating component decides when the created component is scheduled for execution. We'll choose not to execute it initially, such that we can be sure that it will not send messages over its ports the moment is created. To further simplify the problem, we have assumed that messages arrive in order. So although messages might still be underway for the transferred ports, if we ask the sender to stop sending, and the sender blocks the port and acknowledges that it has received this command. Then the moment the creator receives the acknowledgement it is certain that it has received all messages intended for the transferred ports.
 And so here we have our first control protocol. If a port is transferred then we might have:
 . That the peer port is transferred to the new component as well. All is fine and we can take care of the exchange immediately.
 . That the peer port stays with the creating component. Here all is fine as well, everything is running in a single thread of execution so we diligently do our bookkeeping on the data associated with the port and the channel and we can transfer the port.
 . The peer port is already owned by a different component. Here we need to have a slightly more complicated protocol
 In this last case we take the following actions, `C` for creating component, `N` for newly created component, and `P` for the peer component that holds the other port of the same channel.
 . `C` transfers the port to the newly created component `N`, and ask it to come up with a new ID for that port. The port had an ID that was decided by its old owner `C`, and now has one that is agreeable with component `N`.
 . `C` sends a control message `PeerChangeBlockPort` to the peer `P`.
 . `P` receives the `PeerChangeBlockPort` message. It causes the peer port to be temporarily blocked. `P` may still continue executing its code, but the moment it wishes to send something over this port it is forced to block its execution. In response `P` sends an `Acknowledge` message back to `C`.
 . `C` waits for the `Acknowledge` message of `C`. Since the `Acknowledge` message was sent after the last data message that `P` sent to the port under consideration, and because `P` has blocked that port, we are certain that we received all messages. We transfer these messages to `N`.
 . Note that there may be multiple ports being transferred from `C` to `N`, each with a potentially different peer. And so here `C` waits until steps 2 through 4 are completed for all of the transferred ports.
 . Once `C` has received all of the `Acknowledge` messages it was waiting for, it will send a `PeerChangeUnblockPort` message to each peer `P`. This message contains the new port ID, such that `P` can unblock its port, and continue sending messages over this channel, but now correctly arriving at `N`. Additionally, `C` will now schedule the new component `N`.
 There is a bit of extra overhead here with the `PeerChangeBlockPort` -> `Acknowledge` -> `PeerChangeUnblockPort` sequence with respect to other possible schemes. But this one allows `P` to continue executing its code as long as it does not use its blocked port. It also ensures that messages arrive in order: they're all collected by `C`, given to `N`, and only then may `P` continue creating messages to `N`, hence arriving after the earlier messages have been handed off to `N`.
 ## Managing the Lifetime of Components
 Components are created by other components or by the API. Note that the API may for intents and purposes be viewed as a component. It may own and create channels, and it may create components. They start
 Components are created by other components or by the API. Note that the API may for intents and purposes be viewed as a component. It may own and create channels, and it may create components. Component start, like a function, executing at the top of their program listing and may end in two ways. Either they encounter an unrecoverable error, or they neatly reach the end of their execution.
 In either case we want to free up the memory that was occupied by the component during its execution. And we want to make sure that any kind of pending message/action is correctly handled. Apart from that the component contains somekind of message inbox, whose memory is associated with that component. Hence we want to make sure that all peers are aware that they're no longer able to send messages to a terminated component.
 A small interlude before we continue: trying to take care of unrecoverable errors that occur during a sync round by incorporating the appropriate logic into the consensus algorithm proved to be rather hard. It caused a large increase in the number of states a component could be in, and as a result made the code much harder to reason about. That is: not so much communicating that an error had ocurred (that needs to occur in every synchronous algorithm), but keeping track of which messages can be sent to which component during the consensus algorithm.
 For this reason there are two systems that make sure that components stay alive as long as needed. Firstly components will have a reference counter. For simplicity the component also holds a reference to itself. The component will remove this self-reference when it terminates. Each channel causes two components to also hold references to eachother. *If* a consensus algorithm is implemented such that a central components ends up communicating to all participating parties (using channels or not using channels), then we can make sure that it can reach all participating components by incrementing their reference counts (note that this is not yet properly implemented in the runtime).
 Through this mechanism the consensus algorithm can be greatly simplified. If a component encounters a critical error and is already participated in a sync round, then it can notify the other participants, but remain reachable to all other components until the round ends (and the reference counts are decreased again).
 A second system is needed to ensure that a component actually exits (because all the peers hold a reference, and we need all of those references to drop to 0 to truly remove the component from the runtime). And so when a component exits it will send `ClosePort` messages to all of its peers. These peers will `Acknowledge` those messages and close the respective ports, meaning that they will no longer allow sending messages over those ports, that will be a fatal error. However, messages that were sent to the exiting component before receiving the `ClosePort` message might still be underway.
 **Note**: We did not consider machine termination. That is to say: once we reach the runtime maturity where communication occurs over different machines, then we have to consider that machines encounter fatal errors. However these can only be resolved by embedding the possibility of failure inside the protocol over which these machines communicate.
 ## Synchronization of
@@ \ No newline at end of file @@

0 comments (0 inline, 0 general)