We are now handling registration conflicts correctly, but the invariant is being violated due to limitations in change tracking in our model.
We can move from a simple set, to refcounting:
Init ==
...
/\ registered = [n \in Names |-> 0]
RegisterOrUpdateOnNode(n) ==
...
registered' = [registered EXCEPT![message.name] = @ + 1]
If a name is registered on two nodes, it will look something like this: [a |-> 2]
. Running this, we get a failure:
Error: Invariant AllRegistered is violated.
...
State 10: <Next line 235, col 5 to line 247, col 22 of module syn>
/\ disconnections = 0
/\ locally_registered = (n1 :> (n1 :> << >> @@ n2 :> (a :> 3)) @@ n2 :> (n1 :> << >> @@ n2 :> (a :> 3)))
/\ time = 9
/\ states = << <<"Register", n1, a>>,
<<"Register", n2, a>>,
<<"RegisterOrUpdateOnNode", n1, a>>,
<<"RegisterOrUpdateOnNode", n2, a>>,
<<"Unregister", n1, a>>,
<<"SyncRegister", n1, a>>,
<<"UnregisterOnNode", n1, a>>,
<<"SyncRegister", n2, a>>,
<<"SyncUnregister", n2, a>> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 0)
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ names = (n1 :> {} @@ n2 :> {})
It looks like we think a
has been unregistered everywhere, when it hasn’t. If we update the condition for unregister, we now get a more interesting failure:
Error: Invariant ThereCanBeOnlyOne is violated.
...
State 12: <Next line 238, col 5 to line 250, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = ( n1 :> (n1 :> (a :> 3) @@ n2 :> <<>>) @@
n2 :> (n1 :> (a :> 3) @@ n2 :> (a :> 9)) )
/\ time = 11
/\ states = << <<"Disconnect", n1, n2>>,
<<"Register", n1, a>>,
<<"Down", n1, n2>>,
<<"RegisterOrUpdateOnNode", n1, a>>,
<<"Reconnect", n1, n2>>,
<<"Register", n2, a>>,
<<"Discover", n1, n2>>,
<<"Down", n2, n1>>,
<<"Discover", n2, n1>>,
<<"RegisterOrUpdateOnNode", n2, a>>,
<<"AckSync", n2, n1>> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 2)
/\ inbox = (n1 :> <<[action |-> "ack_sync", from |-> n2, local_data |-> <<>>], [time |-> 9, action |-> "sync_register", name |-> a, from |-> n2]>> @@ n2 :> <<>>)
/\ names = (n1 :> {} @@ n2 :> {})
Connor MacLeod would not be pleased. Currently, when a node re-joins the cluster, we are just taking its version of the data as the truth. We also need to implement conflict resolution here:
MergeRegistries(local, remote, local_node, remote_node) ==
LET local_merged == [rr \in {r \in DOMAIN local[local_node] : (r \notin DOMAIN remote \/ local[local_node][r] > remote[r])} |-> local[local_node][rr]]
remote_merged == [rr \in {r \in DOMAIN remote : (r \notin DOMAIN local[local_node] \/ remote[r] > local[local_node][r])} |-> remote[rr]]
IN [r \in DOMAIN local |-> CASE
(r = remote_node) -> remote_merged
[] (r = local_node) -> local_merged
[] OTHER -> local[r]
]
AckSync(n) ==
...
l == MergeRegistries(locally_registered[n], message.local_data, n, message.from)
The real implementation actually uses the same code as for registration, which is probably sensible; so I might need to refactor this later.
Running this looked like it might run forever (again), as I think I added another stuttering case by allowing multiple attempts to register the same name. If we remove that, we get a pretty long run; but eventually it still fails:
Error: Invariant AllRegistered is violated.
...
State 16: <Next line 247, col 5 to line 259, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> << >> @@ n2 :> << >>) @@ n2 :> (n1 :> << >> @@ n2 :> << >>))
/\ time = 15
/\ states = << <<"Disconnect", n1, n2>>,
<<"Register", n1, a>>,
<<"Down", n1, n2>>,
<<"RegisterOrUpdateOnNode", n1, a>>,
<<"Register", n2, a>>,
<<"Down", n2, n1>>,
<<"RegisterOrUpdateOnNode", n2, a>>,
<<"Reconnect", n1, n2>>,
<<"Discover", n1, n2>>,
<<"Discover", n2, n1>>,
<<"AckSync", n1, n2>>,
<<"Unregister", n2, a>>,
<<"AckSync", n2, n1>>,
<<"UnregisterOnNode", n2, a>>,
<<"SyncUnregister", n1, a>> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 1)
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ names = (n1 :> {} @@ n2 :> {})
And this time, it’s the other way round. We think a
should still be registered, somewhere, but it’s not. I think the problem here is that when we resolve conflicts on sync, we aren’t updating the refcount.