It’s a Syn (Part 7)

Before we can resolve conflicts, we have to generate some. We can let each node try and register the same pool of names:

Init ==
     ...
    /\ names = [n \in Nodes |-> 1..MaxValues]

 Register(n) ==
    /\ names[n] # {}
    /\ LET next_val == CHOOSE o \in (names[n]): TRUE
         IN ...
        /\ names' = [names EXCEPT![n] = names[n] \ {next_val}]

This produces an immediate failure, and reveals that we have a race even without disconnection:

...

State 3: <Next line 139, col 5 to line 149, col 22 of module syn>
/\ states = <<<<"Register", n1, 1>>, <<"Register", n2, 1>>>>
/\ names = (n1 :> {} @@ n2 :> {})
/\ inbox = ( n1 :> <<[action |-> "sync_register", name |-> 1, from |-> n2]>> @@
  n2 :> <<[action |-> "sync_register", name |-> 1, from |-> n1]>> )
/\ locally_registered = (n1 :> (n1 :> {1} @@ n2 :> {}) @@ n2 :> (n1 :> {} @@ n2 :> {1}))
/\ disconnections = 0
/\ registered = {1}
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})

Both nodes can attempt to register the same name, even when connected, if it’s before the sync_register message has arrived/been processed.

Let’s park that, for now. Our conflict resolution process will pick the most recent registration as the winner, so we will need the concept of time:

Init ==
    ...
    /\ time = 0

Register(n) ==
    ...
    /\ time' = time + 1

A counter incremented by every state transition. Now, instead of a storing registrations in a set, we will have a struct (function) of name -> time:

 Init ==
    ...
    /\ locally_registered = [n1 \in Nodes |-> [n2 \in Nodes |-> <<>>]]
    /\ names = {"a"}

I was seeing some confusing behaviour from TLC when using integer keys for the registered names, so I switched to a set of strings.

Register(n) ==
    /\ names # {}
    /\ LET next_val == CHOOSE o \in names: TRUE
        l == [locally_registered[n] EXCEPT![n] = locally_registered[n][n] @@ [r \in {next_val} |-> time]]
        IN registered' = registered \union {next_val}
        /\ locally_registered' = [locally_registered EXCEPT![n] = l]
        /\ names' = names \ {next_val}
    ...

SyncRegister(n) ==
    ...
    /\ LET message == Head(inbox[n])
        l == [locally_registered[n] EXCEPT![message.from] = locally_registered[n][message.from] @@ [r \in {message.name} |-> message.time]]

now locally_registered will look something like this:

( n1 :> (n1 :> << >> @@ n2 :> [a |-> 0]) @@ n2 :> (n1 :> << >> @@ n2 :> [a |-> 0]) )

This model still has no failures, as we can only register a key once. If that limitation is removed, we need to handle a conflict:

        conflict == message.name \in DOMAIN locally_registered[n][n]
        l == [o \in DOMAIN locally_registered[n] |-> CASE
            (o = message.from) -> (
                IF conflict /\ locally_registered[n][n][message.name] > message.time THEN
                    locally_registered[n][message.from]
                ELSE
                    locally_registered[n][message.from] @@ [r \in {message.name} |-> message.time]
            )
            [] (o = n) ->
                IF conflict /\ locally_registered[n][n][message.name] > message.time THEN
                    locally_registered[n][n]
                ELSE
                    [r \in (DOMAIN locally_registered[n][n] \ {message.name}) |-> locally_registered[n][n][r]]
            [] OTHER -> locally_registered[n][o]
        ]

We keep whichever process was registered with a more recent time. I also added the “Highlander” invariant, to guarantee that we can’t keep both:

 INVARIANTS
...
ThereCanBeOnlyOne

RECURSIVE Duplicates(_, _, _)

Duplicates(keys, struct, acc) ==
    IF Cardinality(keys) < 2 THEN acc
    ELSE
        LET k1 == CHOOSE k \in keys: TRUE
        k2 == CHOOSE k \in (keys \ {k1}): TRUE
        duplicates == DOMAIN struct[k1] \intersect DOMAIN struct[k2]
        IN Duplicates(keys \ {k1}, struct, duplicates \union acc)

ThereCanBeOnlyOne ==
    \A n \in Nodes:
        Duplicates(DOMAIN locally_registered[n], locally_registered[n], {}) = {}

We immediately hit a failure:

...

State 4: <Next line 176, col 5 to line 186, col 22 of module syn>
/\ states = <<<<"Register", n1, "a">>, <<"Register", n2, "a">>, <<"Unregister", n1, "a">>>>
/\ time = 3
/\ names = (n1 :> {} @@ n2 :> {})
/\ inbox = ( n1 :>
      <<[time |-> 1, action |-> "sync_register", name |-> "a", from |-> n2]>> @@
  n2 :>
      << [time |-> 0, action |-> "sync_register", name |-> "a", from |-> n1],
         [action |-> "sync_unregister", name |-> "a", from |-> n1] >> )
/\ locally_registered = (n1 :> (n1 :> << >> @@ n2 :> <<>>) @@ n2 :> (n1 :> <<>> @@ n2 :> [a |-> 1]))
/\ disconnections = 0
/\ registered = {}
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})

Both nodes have tried to register the same name, but before we have tried to resolve the conflict, one node has removed its registration; and even though the other node still has a (valid) registration, we are now expecting it to not exist anywhere.

The easiest way to solve this, is to check all other nodes, before removing the name from the canonical set:

RECURSIVE AllRegisteredNames(_, _, _)

AllRegisteredNames(nodes, locals, registrations) ==
    IF nodes = {} THEN registrations
    ELSE
        LET n == CHOOSE n \in nodes: TRUE
        IN AllRegisteredNames(nodes \ {n}, locals, registrations \union DOMAIN locals[n][n])

RegisteredElsewhere(node) ==
    AllRegisteredNames(AllOtherNodes(node), locally_registered, {})

Unregister(n) ==
    ...
     /\ LET item_to_remove == ItemToRemove(n)
        IN registered' = (IF item_to_remove \in RegisteredElsewhere(n) THEN registered ELSE registered \ {item_to_remove})

Now the failure has morphed slightly:

...

State 7: <Next line 187, col 5 to line 197, col 22 of module syn>
/\ disconnections = 0
/\ locally_registered = (n1 :> (n1 :> << >> @@ n2 :> << >>) @@ n2 :> (n1 :> [a |-> 0] @@ n2 :> << >>))
/\ time = 6
/\ states = << <<"Register", n1, "a">>,
   <<"Register", n2, "a">>,
   <<"SyncRegister", n1, "a">>,
   <<"Unregister", n2, "a">>,
   <<"SyncUnregister", n1, "a">>,
   <<"SyncRegister", n2, "a">> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = {}
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ names = (n1 :> {} @@ n2 :> {})

And the 👻 registration is actually the process that lost the conflict resolution battle.

This caused a fair bit of head scratching, and I nearly persuaded myself that I had found a real race condition. My first thought was that a sync_register message for a dead process would be discarded, but erlang:is_process_alive can only be used on a local pid.

Now I think that the root cause is that I am treating register/unregister as sync calls (when in reality they use gen_server:call); which has allowed them to push into the queue, ahead of the sync_register message.

We’ll see if I’m right, next time!

Leave a comment