Once a netsplit has healed, we need to re-sync the registered processes. When two nodes are reconnected, we add a discover
message to the inbox of both nodes:
Reconnect(n) ==
...
/\ inbox' = [o \in Nodes |-> CASE
(o = n) -> Append(inbox[o], [action |-> "discover", from |-> other_node])
[] (o = other_node) -> Append(inbox[o], [action |-> "discover", from |-> n])
[] OTHER -> inbox[o]
]
Handling that, sends a copy of the local data back to the sender:
Discover(n) ==
/\ Len(inbox[n]) > 0
/\ LET message == Head(inbox[n])
IN message.action = "discover"
/\ inbox' = [o \in Nodes |-> CASE
(o = n) -> Tail(inbox[o])
[] (o = message.from) -> Append(inbox[o], [action |-> "ack_sync", local_data |-> locally_registered[n]])
[] OTHER -> inbox[o]
]
which is then merged:
AckSync(n) ==
/\ Len(inbox[n]) > 0
/\ Head(inbox[n]).action = "ack_sync"
/\ inbox' = [inbox EXCEPT![n] = Tail(inbox[n])]
/\ LET message == Head(inbox[n])
IN locally_registered' = [locally_registered EXCEPT![n] = locally_registered[n] \union message.local_data]
We also need to change the invariant; to only check once all messages have been processed, on all nodes:
AllRegistered ==
\A n \in Nodes:
(\A o \in Nodes: Len(inbox[o]) = 0) /\ visible_nodes[n] = AllOtherNodes(n) => locally_registered[n] = registered
If we run this, we get a failure:
...
State 9: <Next line 111, col 5 to line 120, col 19 of module syn>
/\ states = << "Register",
"Disconnect",
"Unregister",
"Reconnect",
"Discover",
"SyncRegister",
"Discover",
"AckSync" >>
/\ next_val = 1
/\ inbox = (n1 :> <<>> @@ n2 :> <<[action |-> "ack_sync", local_data |-> {}]>>)
/\ locally_registered = (n1 :> {0} @@ n2 :> {0})
/\ registered = {}
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
State 10: <Next line 111, col 5 to line 120, col 19 of module syn>
/\ states = << "Register",
"Disconnect",
"Unregister",
"Reconnect",
"Discover",
"SyncRegister",
"Discover",
"AckSync",
"AckSync" >>
/\ next_val = 1
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ locally_registered = (n1 :> {0} @@ n2 :> {0})
/\ registered = {}
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
I’ll spare you the full stack trace, but we now have a phantom key: it was removed from a node, during the partition, but then merged back in. Again, the problem lies with my model being too simplistic. We will look into how to resolve this, next time.