It’s a Syn (Part 9)

We are now handling registration conflicts correctly, but the invariant is being violated due to limitations in change tracking in our model.

We can move from a simple set, to refcounting:

Init ==
     ...
    /\ registered = [n \in Names |-> 0]

RegisterOrUpdateOnNode(n) ==
    ...
    registered' = [registered EXCEPT![message.name] = @ + 1]

If a name is registered on two nodes, it will look something like this: [a |-> 2]. Running this, we get a failure:

Error: Invariant AllRegistered is violated.
...

State 10: <Next line 235, col 5 to line 247, col 22 of module syn>
/\ disconnections = 0
/\ locally_registered = (n1 :> (n1 :> << >> @@ n2 :> (a :> 3)) @@ n2 :> (n1 :> << >> @@ n2 :> (a :> 3)))
/\ time = 9
/\ states = << <<"Register", n1, a>>,
   <<"Register", n2, a>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"RegisterOrUpdateOnNode", n2, a>>,
   <<"Unregister", n1, a>>,
   <<"SyncRegister", n1, a>>,
   <<"UnregisterOnNode", n1, a>>,
   <<"SyncRegister", n2, a>>,
   <<"SyncUnregister", n2, a>> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 0)
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ names = (n1 :> {} @@ n2 :> {})

It looks like we think a has been unregistered everywhere, when it hasn’t. If we update the condition for unregister, we now get a more interesting failure:

Error: Invariant ThereCanBeOnlyOne is violated.
...

State 12: <Next line 238, col 5 to line 250, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = ( n1 :> (n1 :> (a :> 3) @@ n2 :> <<>>) @@
  n2 :> (n1 :> (a :> 3) @@ n2 :> (a :> 9)) )
/\ time = 11
/\ states = << <<"Disconnect", n1, n2>>,
   <<"Register", n1, a>>,
   <<"Down", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Reconnect", n1, n2>>,
   <<"Register", n2, a>>,
   <<"Discover", n1, n2>>,
   <<"Down", n2, n1>>,
   <<"Discover", n2, n1>>,
   <<"RegisterOrUpdateOnNode", n2, a>>,
   <<"AckSync", n2, n1>> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 2)
/\ inbox = (n1 :> <<[action |-> "ack_sync", from |-> n2, local_data |-> <<>>], [time |-> 9, action |-> "sync_register", name |-> a, from |-> n2]>> @@ n2 :> <<>>)
/\ names = (n1 :> {} @@ n2 :> {})

Connor MacLeod would not be pleased. Currently, when a node re-joins the cluster, we are just taking its version of the data as the truth. We also need to implement conflict resolution here:

MergeRegistries(local, remote, local_node, remote_node) ==
    LET local_merged == [rr \in {r \in DOMAIN local[local_node] : (r \notin DOMAIN remote \/ local[local_node][r] > remote[r])} |-> local[local_node][rr]]
        remote_merged == [rr \in {r \in DOMAIN remote : (r \notin DOMAIN local[local_node] \/ remote[r] > local[local_node][r])} |-> remote[rr]]
    IN [r \in DOMAIN local |-> CASE
        (r = remote_node) -> remote_merged
        [] (r = local_node) -> local_merged
        [] OTHER -> local[r]
    ]

AckSync(n) ==
    ...
    l == MergeRegistries(locally_registered[n], message.local_data, n, message.from)

The real implementation actually uses the same code as for registration, which is probably sensible; so I might need to refactor this later.

Running this looked like it might run forever (again), as I think I added another stuttering case by allowing multiple attempts to register the same name. If we remove that, we get a pretty long run; but eventually it still fails:

Error: Invariant AllRegistered is violated.
...

State 16: <Next line 247, col 5 to line 259, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> << >> @@ n2 :> << >>) @@ n2 :> (n1 :> << >> @@ n2 :> << >>))
/\ time = 15
/\ states = << <<"Disconnect", n1, n2>>,
   <<"Register", n1, a>>,
   <<"Down", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Register", n2, a>>,
   <<"Down", n2, n1>>,
   <<"RegisterOrUpdateOnNode", n2, a>>,
   <<"Reconnect", n1, n2>>,
   <<"Discover", n1, n2>>,
   <<"Discover", n2, n1>>,
   <<"AckSync", n1, n2>>,
   <<"Unregister", n2, a>>,
   <<"AckSync", n2, n1>>,
   <<"UnregisterOnNode", n2, a>>,
   <<"SyncUnregister", n1, a>> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 1)
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ names = (n1 :> {} @@ n2 :> {})

And this time, it’s the other way round. We think a should still be registered, somewhere, but it’s not. I think the problem here is that when we resolve conflicts on sync, we aren’t updating the refcount.

Leave a comment