From 85980c6df11abbff0e15329bd08e9bfeaae0b0d4 2022-04-19 15:40:09
From: MH
Date: 2022-04-19 15:40:09
Subject: [PATCH] Revised and finished docs on outside-sync errors
---
diff --git a/docs/runtime/sync.md b/docs/runtime/sync.md
index 412f991da53670488fb2ce175282a66e3a55dfe3..5b7872b325875032f76c73ea6f771f3315ce3820 100644
--- a/docs/runtime/sync.md
+++ b/docs/runtime/sync.md
@@ -149,16 +149,42 @@ Concluding:
 
 Components may, during their execution, encounter errors that prevent them from continuing executing their code. For the purposes of this chapter we may consider these to occur during two particular phases of their execution:
 
-1. The error occurs outside of a sync-block.
-2. The error occurs anywhere inside of a sync-block. Or more specifically: the error occurs inside of a sync-block where the component has already performed an interaction with the outside world (i.e. performed a `put` or a `get`, **note:** I need to think about whether a select block influences the error-handling as well).
+1. The error occurred outside of a sync block. Or equivalently (from the point of view of the runtime): the error occurred inside a sync block, but the component has not interacted with other components through `put`/`get` calls.
+2. The error occurred inside of a sync block. The component may have performed any number of `put`/`get` calls, but for the sake of discussion we only consider the cases where it performs:
+    1. One `put` in the synchronous round.
+    2. One `get` in the synchronous round.
 
-### Handling Fatal Errors outside of Synchronous Rounds
+As a preliminary remark: note that encountering an error is nothing special in itself: the component can simply print an error to `stdout` and stop executing. What matters is how peers handle the error! If an interaction is made impossible because a peer has stopped executing, then the component that wishes to perform that interaction should error out itself!
 
-In the first case we're dealing with a component that has finished previous interactions with the outside world. So it does not have to deal with submitting the fact that a sync round has finished to the outside world. And so the component will perhaps log something to `stdout` to indicate that it has failed, but apart from that it will simply initiate the exit procedure as described earlier: reporting to all peers that the ports will be closed.
+### Handling Errors Outside of a Sync Block
 
-There is one more remark that should be made here. Although the component `E` that has encountered the error might not be part of a sync round, another component `C` might have sent a message to component `E`. If the message is being sent from `C` while it has already received the information from `E` that it port should be closed, then `C` needs to handle the error as well.
+If a component `E` encounters a critical error outside of a sync block, then we can be sure that if it had a last synchronous round, that round succeeded. However, there might be future synchronous rounds for component `E`; likewise, a peer component `C` might have already put a message in `E`'s inbox.
 
-Hence, if the component `E` encounters a critical error, while there are still data messages from component `C` in the inbox (and the corresponding port is not yet closed), then component `E` sends a `DeliveryFailed` message to `C`. We may annotate each sent data message with the origin of the message in the PDL source, such that we can send this annotation back to the sender. Once the `DeliveryFailed` message arrives at `C` there are two possible scenarios (consider that it has sent a message, hence must have done this in a sync round that has not yet finished):
+The requirement for an outside-sync error of `E` is that any future sync interaction by `C` with `E` will fail (but if `C` has no future interactions, it should not fail either!).
+
+Note that `E` cannot perform `put`/`get` requests, because we're assuming `E` is outside of a sync block. Hence the only possible failing interaction is that `C` has performed a `put`, or is attempting a `get`. In the case that `C` `put`s to `E`, `E` might not have figured out the identity of `C` yet (see earlier remarks on the eventual consistency of peer detection). Hence `C` is responsible for ensuring its own correct shutdown due to a failing `put`. Likewise for a `get`: `C` cannot receive from `E` once `E` has failed. So if `C` is waiting on a message to arrive, or if it will call `get` in the future, then `C` must fail as well.
+
+In this case it is sufficient for `E` to send around a `ClosePort` message, as detailed in another chapter of this document. However, a particular race condition might occur. We have assumed that `E` is not in a sync block, but `C` is not aware of this fact. `C` might not be able to distinguish between the following three cases:
+
+1. Regular shutdown: Components `C` and `E` are not in a sync round.
+    - `E` broadcasts `ClosePort`.
+    - `C` receives `ClosePort`.
+2. Shutdown within a sync round, `ClosePort` leads `Solution`: Consider a leader component `L`, a peer component `C` and a failing component `E`, and assume that all are (or were) busy in a synchronous round with one another.
+    - `L` broadcasts `Solution` for the current sync round.
+    - `E` receives `Solution`, finishes round.
+    - `E` encounters an error, so sends `ClosePort` to `C`.
+    - `C` receives `ClosePort` from `E`.
+    - `C` receives `Solution` from `L`.
+3. Shutdown within a sync round, `Solution` leads `ClosePort`: Same components `L`, `C` and `E`.
+    - `L` broadcasts `Solution` for the current sync round.
+    - `E` receives `Solution`, finishes round.
+    - `E` encounters an error, so sends `ClosePort` to `C`.
+    - `C` receives `Solution` from `L`.
+    - `C` receives `ClosePort` from `E`.
+
+In all described cases `E` encounters an error after finishing a sync round. But from the point of view of `C` it is unclear whether the `ClosePort` message pertains to the current synchronous round or not. In cases 1 and 3 nothing is out of the ordinary. But in case 2 there is a point in time at which `C` is aware of the `ClosePort` from `E`, but not yet of the `Solution` from `L`. `C` should not fail the sync round, as the round has completed, but `C` is unaware of this fact.
+
+As a rather simple solution, since components that are participating with one another in a sync round move in lock-step at the end of the sync block, we send a boolean along with the `ClosePort`. This boolean indicates whether `E` was inside or outside of a sync block when it encountered the error. Now `C` no longer needs to distinguish between the three cases: in each of them it learns that `E` was not in a sync block (and hence the sync round in cases 2 and 3 can be completed).
+
+### Handling Errors Inside of a Sync Block
 
-1. It is still waiting on the conclusion to a synchronous round that, if it were not for component `E`, would have succeeded. In this case the component `C` prints the `put`-error, and initiates failure in the synchronous round (we'll come back to this later in the other subchapter).
-2. It is not waiting for the conclusion of a synchronous round, because after sending some other component (maybe even `C` itself) experienced a fatal error. It received the notification of the failed synchronous round first, hence is busy shutting down. In this case the component likely already printed an error, hence can ignore the `DeliveryFailed` message and continue shutting down.
\ No newline at end of file
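
Note on the patch above: the boolean that accompanies `ClosePort` can be illustrated with a small model of the receiving component `C`. This is only a sketch under stated assumptions, not the runtime's implementation. Every name in it (`ClosePort.sent_inside_sync`, `Component`, `Reaction`, `PortClosedError`, `check_port`) is hypothetical. The reaction to an inside-sync `ClosePort` is also an assumption, since the patch leaves that section empty.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Reaction(Enum):
    """How the receiver reacts to an incoming ClosePort (hypothetical names)."""
    CLOSE_PORT_ONLY = auto()  # remember the port is closed; current round unaffected
    FAIL_ROUND = auto()       # the current sync round can no longer succeed


class PortClosedError(Exception):
    """Raised when a component attempts a put/get on a port its peer closed."""


@dataclass(frozen=True)
class ClosePort:
    port_id: int
    # The boolean from the text: was the sender inside a sync block when it failed?
    sent_inside_sync: bool


class Component:
    """Toy model of the receiving peer `C`; not the real runtime type."""

    def __init__(self):
        self.in_sync_round = False
        self.closed_ports = set()

    def on_close_port(self, msg):
        self.closed_ports.add(msg.port_id)
        # Assumption: only an error raised inside a sync block may fail the
        # receiver's current round. An outside-sync ClosePort never does, even
        # if it overtakes the Solution message (case 2 in the text).
        if msg.sent_inside_sync and self.in_sync_round:
            return Reaction.FAIL_ROUND
        return Reaction.CLOSE_PORT_ONLY

    def on_solution(self):
        # The leader's Solution concludes the current round.
        self.in_sync_round = False

    def check_port(self, port_id):
        # Called before any future put/get: interactions on a closed port fail.
        if port_id in self.closed_ports:
            raise PortClosedError(f"port {port_id} was closed by its peer")
```

In this model, cases 1, 2 and 3 all yield `Reaction.CLOSE_PORT_ONLY`, whatever order `ClosePort` and `Solution` arrive in. `C` itself fails only when it later calls `check_port` on the closed port, which matches the requirement that `C` fails only if it has future interactions with `E`.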