It’s a Syn (Part 5)

Once a netsplit has healed, we need to re-sync the registered processes. When two nodes are reconnected, we add a discover message to the inbox of both nodes:

Reconnect(n) ==
        ...
        /\ inbox' = [o \in Nodes |-> CASE
            (o = n) -> Append(inbox[o], [action |-> "discover", from |-> other_node])
            [] (o = other_node) -> Append(inbox[o], [action |-> "discover", from |-> n])
            [] OTHER -> inbox[o]
        ]

Handling that, sends a copy of the local data back to the sender:

Discover(n) ==
    /\ Len(inbox[n]) > 0
    /\ LET message == Head(inbox[n])
        IN message.action = "discover"
        /\ inbox' = [o \in Nodes |-> CASE
            (o = n) -> Tail(inbox[o])
            [] (o = message.from) -> Append(inbox[o], [action |-> "ack_sync", local_data |-> locally_registered[n]])
            [] OTHER -> inbox[o]
        ]

which is then merged:

AckSync(n) ==
    /\ Len(inbox[n]) > 0
    /\ Head(inbox[n]).action = "ack_sync"
    /\ inbox' = [inbox EXCEPT![n] = Tail(inbox[n])]
    /\ LET message == Head(inbox[n])
        IN locally_registered' = [locally_registered EXCEPT![n] = locally_registered[n] \union message.local_data]

We also need to change the invariant; to only check once all messages have been processed, on all nodes:

AllRegistered ==
    \A n \in Nodes:
        (\A o \in Nodes: Len(inbox[o]) = 0) /\ visible_nodes[n] = AllOtherNodes(n) => locally_registered[n] = registered

If we run this, we get a failure:

...

State 9: <Next line 111, col 5 to line 120, col 19 of module syn>
/\ states = << "Register",
   "Disconnect",
   "Unregister",
   "Reconnect",
   "Discover",
   "SyncRegister",
   "Discover",
   "AckSync" >>
/\ next_val = 1
/\ inbox = (n1 :> <<>> @@ n2 :> <<[action |-> "ack_sync", local_data |-> {}]>>)
/\ locally_registered = (n1 :> {0} @@ n2 :> {0})
/\ registered = {}
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})

State 10: <Next line 111, col 5 to line 120, col 19 of module syn>
/\ states = << "Register",
   "Disconnect",
   "Unregister",
   "Reconnect",
   "Discover",
   "SyncRegister",
   "Discover",
   "AckSync",
   "AckSync" >>
/\ next_val = 1
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ locally_registered = (n1 :> {0} @@ n2 :> {0})
/\ registered = {}
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})

I’ll spare you the full stack trace, but we now have a phantom key: it was removed from a node, during the partition, but then merged back in. Again, the problem lies with my model being too simplistic. We will look into how to resolve this, next time.

Leave a comment