Before we can resolve conflicts, we have to generate some. We can let each node try and register the same pool of names:
Init ==
...
/\ names = [n \in Nodes |-> 1..MaxValues]
Register(n) ==
/\ names[n] # {}
/\ LET next_val == CHOOSE o \in (names[n]): TRUE
IN ...
/\ names' = [names EXCEPT![n] = names[n] \ {next_val}]
This produces an immediate failure, and reveals that we have a race even without disconnection:
...
State 3: <Next line 139, col 5 to line 149, col 22 of module syn>
/\ states = <<<<"Register", n1, 1>>, <<"Register", n2, 1>>>>
/\ names = (n1 :> {} @@ n2 :> {})
/\ inbox = ( n1 :> <<[action |-> "sync_register", name |-> 1, from |-> n2]>> @@
n2 :> <<[action |-> "sync_register", name |-> 1, from |-> n1]>> )
/\ locally_registered = (n1 :> (n1 :> {1} @@ n2 :> {}) @@ n2 :> (n1 :> {} @@ n2 :> {1}))
/\ disconnections = 0
/\ registered = {1}
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
Both nodes can attempt to register the same name, even when connected, if it’s before the sync_register
message has arrived/been processed.
Let’s park that, for now. Our conflict resolution process will pick the most recent registration as the winner, so we will need the concept of time:
Init ==
...
/\ time = 0
Register(n) ==
...
/\ time' = time + 1
A counter incremented by every state transition. Now, instead of a storing registrations in a set, we will have a struct (function) of name -> time:
Init ==
...
/\ locally_registered = [n1 \in Nodes |-> [n2 \in Nodes |-> <<>>]]
/\ names = {"a"}
I was seeing some confusing behaviour from TLC when using integer keys for the registered names, so I switched to a set of strings.
Register(n) ==
/\ names # {}
/\ LET next_val == CHOOSE o \in names: TRUE
l == [locally_registered[n] EXCEPT![n] = locally_registered[n][n] @@ [r \in {next_val} |-> time]]
IN registered' = registered \union {next_val}
/\ locally_registered' = [locally_registered EXCEPT![n] = l]
/\ names' = names \ {next_val}
...
SyncRegister(n) ==
...
/\ LET message == Head(inbox[n])
l == [locally_registered[n] EXCEPT![message.from] = locally_registered[n][message.from] @@ [r \in {message.name} |-> message.time]]
now locally_registered
will look something like this:
( n1 :> (n1 :> << >> @@ n2 :> [a |-> 0]) @@ n2 :> (n1 :> << >> @@ n2 :> [a |-> 0]) )
This model still has no failures, as we can only register a key once. If that limitation is removed, we need to handle a conflict:
conflict == message.name \in DOMAIN locally_registered[n][n]
l == [o \in DOMAIN locally_registered[n] |-> CASE
(o = message.from) -> (
IF conflict /\ locally_registered[n][n][message.name] > message.time THEN
locally_registered[n][message.from]
ELSE
locally_registered[n][message.from] @@ [r \in {message.name} |-> message.time]
)
[] (o = n) ->
IF conflict /\ locally_registered[n][n][message.name] > message.time THEN
locally_registered[n][n]
ELSE
[r \in (DOMAIN locally_registered[n][n] \ {message.name}) |-> locally_registered[n][n][r]]
[] OTHER -> locally_registered[n][o]
]
We keep whichever process was registered with a more recent time. I also added the “Highlander” invariant, to guarantee that we can’t keep both:
INVARIANTS
...
ThereCanBeOnlyOne
RECURSIVE Duplicates(_, _, _)
Duplicates(keys, struct, acc) ==
IF Cardinality(keys) < 2 THEN acc
ELSE
LET k1 == CHOOSE k \in keys: TRUE
k2 == CHOOSE k \in (keys \ {k1}): TRUE
duplicates == DOMAIN struct[k1] \intersect DOMAIN struct[k2]
IN Duplicates(keys \ {k1}, struct, duplicates \union acc)
ThereCanBeOnlyOne ==
\A n \in Nodes:
Duplicates(DOMAIN locally_registered[n], locally_registered[n], {}) = {}
We immediately hit a failure:
...
State 4: <Next line 176, col 5 to line 186, col 22 of module syn>
/\ states = <<<<"Register", n1, "a">>, <<"Register", n2, "a">>, <<"Unregister", n1, "a">>>>
/\ time = 3
/\ names = (n1 :> {} @@ n2 :> {})
/\ inbox = ( n1 :>
<<[time |-> 1, action |-> "sync_register", name |-> "a", from |-> n2]>> @@
n2 :>
<< [time |-> 0, action |-> "sync_register", name |-> "a", from |-> n1],
[action |-> "sync_unregister", name |-> "a", from |-> n1] >> )
/\ locally_registered = (n1 :> (n1 :> << >> @@ n2 :> <<>>) @@ n2 :> (n1 :> <<>> @@ n2 :> [a |-> 1]))
/\ disconnections = 0
/\ registered = {}
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
Both nodes have tried to register the same name, but before we have tried to resolve the conflict, one node has removed its registration; and even though the other node still has a (valid) registration, we are now expecting it to not exist anywhere.
The easiest way to solve this, is to check all other nodes, before removing the name from the canonical set:
RECURSIVE AllRegisteredNames(_, _, _)
AllRegisteredNames(nodes, locals, registrations) ==
IF nodes = {} THEN registrations
ELSE
LET n == CHOOSE n \in nodes: TRUE
IN AllRegisteredNames(nodes \ {n}, locals, registrations \union DOMAIN locals[n][n])
RegisteredElsewhere(node) ==
AllRegisteredNames(AllOtherNodes(node), locally_registered, {})
Unregister(n) ==
...
/\ LET item_to_remove == ItemToRemove(n)
IN registered' = (IF item_to_remove \in RegisteredElsewhere(n) THEN registered ELSE registered \ {item_to_remove})
Now the failure has morphed slightly:
...
State 7: <Next line 187, col 5 to line 197, col 22 of module syn>
/\ disconnections = 0
/\ locally_registered = (n1 :> (n1 :> << >> @@ n2 :> << >>) @@ n2 :> (n1 :> [a |-> 0] @@ n2 :> << >>))
/\ time = 6
/\ states = << <<"Register", n1, "a">>,
<<"Register", n2, "a">>,
<<"SyncRegister", n1, "a">>,
<<"Unregister", n2, "a">>,
<<"SyncUnregister", n1, "a">>,
<<"SyncRegister", n2, "a">> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = {}
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ names = (n1 :> {} @@ n2 :> {})
And the 👻 registration is actually the process that lost the conflict resolution battle.
This caused a fair bit of head scratching, and I nearly persuaded myself that I had found a real race condition. My first thought was that a sync_register
message for a dead process would be discarded, but erlang:is_process_alive
can only be used on a local pid.
Now I think that the root cause is that I am treating register/unregister as sync calls (when in reality they use gen_server:call
); which has allowed them to push into the queue, ahead of the sync_register
message.
We’ll see if I’m right, next time!