It’s a Syn (Part 11)

Assuming that it is possible to end up with a 🧟 pid, I think one solution would be to send a message when the conflict resolution kills a process:

SyncRegister(n) ==
    ...
       /\ inbox' = [o \in Nodes |->
            (IF (o = n) THEN
                Tail(inbox[o])
            ELSE
                IF keep_remote THEN
                    Append(inbox[o], [action |-> "killed", name |-> message.name, time |-> locally_registered[n][n][message.name], from |-> n])
                ELSE
                    inbox[o])
        ]

and then remove any exact matches, i.e. registered with the same name, at the same time:

Killed(n) ==
    /\ Len(inbox[n]) > 0
    /\ Head(inbox[n]).action = "killed"
    /\ LET message == Head(inbox[n])
        exact_match == message.name \in DOMAIN locally_registered[n][message.from] /\ locally_registered[n][message.from][message.name] = message.time
        l == (IF exact_match THEN
                [r \in (DOMAIN locally_registered[n][message.from] \ {message.name}) |-> locally_registered[n][message.from][r]]
            ELSE
                locally_registered[n][message.from])
        IN inbox' = [inbox EXCEPT![n] = Tail(inbox[n])]
        /\ locally_registered' = [locally_registered EXCEPT![n] = ([locally_registered[n] EXCEPT![message.from] = l])]

With this fix, I could run the model significantly longer without failure (but sadly I ran out of memory on my laptop: I’m not sure I’m interested enough to pay AWS for the answer).

It’s a Syn (Part 10)

If we also update the refcount when a conflict occurs during sync:

AckSync(n) ==
        ...
        conflicts == DOMAIN locally_registered[n][n] \intersect DOMAIN message.local_data
        c1 == [c \in { r \in conflicts : message.local_data[r] > locally_registered[n][n][r] } |-> registered[c] - 1]
        c2 == [c \in { r \in conflicts : locally_registered[n][n][r] > message.local_data[r] } |-> registered[c]]
        IN ...
        /\ registered' = c1 @@ c2 @@ [r \in (DOMAIN registered \ conflicts) |-> registered[r]]

we get another failure (I’ll paste the full error trace, this time):

Error: Invariant AllRegistered is violated.
Error: The behavior up to this point is:
State 1: <Initial predicate>
/\ disconnections = 0
/\ locally_registered = (n1 :> (n1 :> <<>> @@ n2 :> <<>>) @@ n2 :> (n1 :> <<>> @@ n2 :> <<>>))
/\ time = 0
/\ states = <<>>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 0)
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ names = (n1 :> {a} @@ n2 :> {a})

State 2: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 0
/\ locally_registered = (n1 :> (n1 :> <<>> @@ n2 :> <<>>) @@ n2 :> (n1 :> <<>> @@ n2 :> <<>>))
/\ time = 1
/\ states = <<<<"Register", n1, a>>>>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 0)
/\ inbox = (n1 :> <<[action |-> "register_or_update_on_node", name |-> a]>> @@ n2 :> <<>>)
/\ names = (n1 :> {} @@ n2 :> {a})

State 3: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> <<>> @@ n2 :> <<>>) @@ n2 :> (n1 :> <<>> @@ n2 :> <<>>))
/\ time = 2
/\ states = <<<<"Register", n1, a>>, <<"Disconnect", n1, n2>>>>
/\ visible_nodes = (n1 :> {} @@ n2 :> {})
/\ registered = (a :> 0)
/\ inbox = ( n1 :>
      << [action |-> "register_or_update_on_node", name |-> a],
         [action |-> "DOWN", from |-> n2] >> @@
  n2 :> <<[action |-> "DOWN", from |-> n1]>> )
/\ names = (n1 :> {} @@ n2 :> {a})

The two nodes (n1 & n2) are disconnected:

State 4: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> (a :> 2) @@ n2 :> <<>>) @@ n2 :> (n1 :> <<>> @@ n2 :> <<>>))
/\ time = 3
/\ states = << <<"Register", n1, a>>,
   <<"Disconnect", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>> >>
/\ visible_nodes = (n1 :> {} @@ n2 :> {})
/\ registered = (a :> 1)
/\ inbox = ( n1 :> <<[action |-> "DOWN", from |-> n2]>> @@
  n2 :> <<[action |-> "DOWN", from |-> n1]>> )
/\ names = (n1 :> {} @@ n2 :> {a})

State 5: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> (a :> 2) @@ n2 :> <<>>) @@ n2 :> (n1 :> <<>> @@ n2 :> <<>>))
/\ time = 4
/\ states = << <<"Register", n1, a>>,
   <<"Disconnect", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Down", n1, n2>> >>
/\ visible_nodes = (n1 :> {} @@ n2 :> {})
/\ registered = (a :> 1)
/\ inbox = (n1 :> <<>> @@ n2 :> <<[action |-> "DOWN", from |-> n1]>>)
/\ names = (n1 :> {} @@ n2 :> {a})

State 6: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> (a :> 2) @@ n2 :> <<>>) @@ n2 :> (n1 :> <<>> @@ n2 :> <<>>))
/\ time = 5
/\ states = << <<"Register", n1, a>>,
   <<"Disconnect", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Down", n1, n2>>,
   <<"Register", n2, a>> >>
/\ visible_nodes = (n1 :> {} @@ n2 :> {})
/\ registered = (a :> 1)
/\ inbox = ( n1 :> <<>> @@
  n2 :>
      << [action |-> "DOWN", from |-> n1],
         [action |-> "register_or_update_on_node", name |-> a] >> )
/\ names = (n1 :> {} @@ n2 :> {})

State 7: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> (a :> 2) @@ n2 :> <<>>) @@ n2 :> (n1 :> <<>> @@ n2 :> <<>>))
/\ time = 6
/\ states = << <<"Register", n1, a>>,
   <<"Disconnect", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Down", n1, n2>>,
   <<"Register", n2, a>>,
   <<"Down", n2, n1>> >>
/\ visible_nodes = (n1 :> {} @@ n2 :> {})
/\ registered = (a :> 1)
/\ inbox = (n1 :> <<>> @@ n2 :> <<[action |-> "register_or_update_on_node", name |-> a]>>)
/\ names = (n1 :> {} @@ n2 :> {})

State 8: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> (a :> 2) @@ n2 :> <<>>) @@ n2 :> (n1 :> <<>> @@ n2 :> (a :> 6)))
/\ time = 7
/\ states = << <<"Register", n1, a>>,
   <<"Disconnect", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Down", n1, n2>>,
   <<"Register", n2, a>>,
   <<"Down", n2, n1>>,
   <<"RegisterOrUpdateOnNode", n2, a>> >>
/\ visible_nodes = (n1 :> {} @@ n2 :> {})
/\ registered = (a :> 2)
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ names = (n1 :> {} @@ n2 :> {})

During the netsplit, both nodes manage to register the same name (“a”).

State 9: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> (a :> 2) @@ n2 :> <<>>) @@ n2 :> (n1 :> <<>> @@ n2 :> (a :> 6)))
/\ time = 8
/\ states = << <<"Register", n1, a>>,
   <<"Disconnect", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Down", n1, n2>>,
   <<"Register", n2, a>>,
   <<"Down", n2, n1>>,
   <<"RegisterOrUpdateOnNode", n2, a>>,
   <<"Reconnect", n2, n1>> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 2)
/\ inbox = ( n1 :> <<[action |-> "discover", from |-> n2]>> @@
  n2 :> <<[action |-> "discover", from |-> n1]>> )
/\ names = (n1 :> {} @@ n2 :> {})

State 10: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> (a :> 2) @@ n2 :> <<>>) @@ n2 :> (n1 :> <<>> @@ n2 :> (a :> 6)))
/\ time = 9
/\ states = << <<"Register", n1, a>>,
   <<"Disconnect", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Down", n1, n2>>,
   <<"Register", n2, a>>,
   <<"Down", n2, n1>>,
   <<"RegisterOrUpdateOnNode", n2, a>>,
   <<"Reconnect", n2, n1>>,
   <<"Unregister", n2, a>> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 2)
/\ inbox = ( n1 :> <<[action |-> "discover", from |-> n2]>> @@
  n2 :>
      << [action |-> "discover", from |-> n1],
         [action |-> "unregister_on_node", name |-> a] >> )
/\ names = (n1 :> {} @@ n2 :> {})

State 11: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> (a :> 2) @@ n2 :> <<>>) @@ n2 :> (n1 :> <<>> @@ n2 :> (a :> 6)))
/\ time = 10
/\ states = << <<"Register", n1, a>>,
   <<"Disconnect", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Down", n1, n2>>,
   <<"Register", n2, a>>,
   <<"Down", n2, n1>>,
   <<"RegisterOrUpdateOnNode", n2, a>>,
   <<"Reconnect", n2, n1>>,
   <<"Unregister", n2, a>>,
   <<"Discover", n1, n2>> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 2)
/\ inbox = (n1 :> <<>> @@ n2 :> <<[action |-> "discover", from |-> n1], [action |-> "unregister_on_node", name |-> a], [action |-> "ack_sync", from |-> n1, local_data |-> (a :> 2)]>>)
/\ names = (n1 :> {} @@ n2 :> {})

State 12: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> (a :> 2) @@ n2 :> <<>>) @@ n2 :> (n1 :> <<>> @@ n2 :> (a :> 6)))
/\ time = 11
/\ states = << <<"Register", n1, a>>,
   <<"Disconnect", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Down", n1, n2>>,
   <<"Register", n2, a>>,
   <<"Down", n2, n1>>,
   <<"RegisterOrUpdateOnNode", n2, a>>,
   <<"Reconnect", n2, n1>>,
   <<"Unregister", n2, a>>,
   <<"Discover", n1, n2>>,
   <<"Discover", n2, n1>> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 2)
/\ inbox = (n1 :> <<[action |-> "ack_sync", from |-> n2, local_data |-> (a :> 6)]>> @@ n2 :> <<[action |-> "unregister_on_node", name |-> a], [action |-> "ack_sync", from |-> n1, local_data |-> (a :> 2)]>>)
/\ names = (n1 :> {} @@ n2 :> {})

State 13: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> << >> @@ n2 :> (a :> 6)) @@ n2 :> (n1 :> <<>> @@ n2 :> (a :> 6)))
/\ time = 12
/\ states = << <<"Register", n1, a>>,
   <<"Disconnect", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Down", n1, n2>>,
   <<"Register", n2, a>>,
   <<"Down", n2, n1>>,
   <<"RegisterOrUpdateOnNode", n2, a>>,
   <<"Reconnect", n2, n1>>,
   <<"Unregister", n2, a>>,
   <<"Discover", n1, n2>>,
   <<"Discover", n2, n1>>,
   <<"AckSync", n1, n2>> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 1)
/\ inbox = (n1 :> <<>> @@ n2 :> <<[action |-> "unregister_on_node", name |-> a], [action |-> "ack_sync", from |-> n1, local_data |-> (a :> 2)]>>)
/\ names = (n1 :> {} @@ n2 :> {})

State 14: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> << >> @@ n2 :> (a :> 6)) @@ n2 :> (n1 :> <<>> @@ n2 :> << >>))
/\ time = 13
/\ states = << <<"Register", n1, a>>,
   <<"Disconnect", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Down", n1, n2>>,
   <<"Register", n2, a>>,
   <<"Down", n2, n1>>,
   <<"RegisterOrUpdateOnNode", n2, a>>,
   <<"Reconnect", n2, n1>>,
   <<"Unregister", n2, a>>,
   <<"Discover", n1, n2>>,
   <<"Discover", n2, n1>>,
   <<"AckSync", n1, n2>>,
   <<"UnregisterOnNode", n2, a>> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 0)
/\ inbox = (n1 :> <<[action |-> "sync_unregister", name |-> a, from |-> n2]>> @@ n2 :> <<[action |-> "ack_sync", from |-> n1, local_data |-> (a :> 2)]>>)
/\ names = (n1 :> {} @@ n2 :> {})

State 15: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> << >> @@ n2 :> << >>) @@ n2 :> (n1 :> <<>> @@ n2 :> << >>))
/\ time = 14
/\ states = << <<"Register", n1, a>>,
   <<"Disconnect", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Down", n1, n2>>,
   <<"Register", n2, a>>,
   <<"Down", n2, n1>>,
   <<"RegisterOrUpdateOnNode", n2, a>>,
   <<"Reconnect", n2, n1>>,
   <<"Unregister", n2, a>>,
   <<"Discover", n1, n2>>,
   <<"Discover", n2, n1>>,
   <<"AckSync", n1, n2>>,
   <<"UnregisterOnNode", n2, a>>,
   <<"SyncUnregister", n1, a>> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 0)
/\ inbox = (n1 :> <<>> @@ n2 :> <<[action |-> "ack_sync", from |-> n1, local_data |-> (a :> 2)]>>)
/\ names = (n1 :> {} @@ n2 :> {})

State 16: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> << >> @@ n2 :> << >>) @@ n2 :> (n1 :> (a :> 2) @@ n2 :> << >>))
/\ time = 15
/\ states = << <<"Register", n1, a>>,
   <<"Disconnect", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Down", n1, n2>>,
   <<"Register", n2, a>>,
   <<"Down", n2, n1>>,
   <<"RegisterOrUpdateOnNode", n2, a>>,
   <<"Reconnect", n2, n1>>,
   <<"Unregister", n2, a>>,
   <<"Discover", n1, n2>>,
   <<"Discover", n2, n1>>,
   <<"AckSync", n1, n2>>,
   <<"UnregisterOnNode", n2, a>>,
   <<"SyncUnregister", n1, a>>,
   <<"AckSync", n2, n1>> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 0)
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ names = (n1 :> {} @@ n2 :> {})

Once re-connected, the sync takes place; but one node also un-registers the name, and we somehow end up with a zombie registration (to a pid that was killed by the conflict resolution).

So far, I’ve not been able to find anything in the Syn code (or the Erlang VM) that prevents this; but I’m quite open to it being a problem with my model 🤷

It’s a Syn (Part 9)

We are now handling registration conflicts correctly, but the invariant is still being violated, due to limitations in how our model tracks changes.

We can move from a simple set, to refcounting:

Init ==
     ...
    /\ registered = [n \in Names |-> 0]

RegisterOrUpdateOnNode(n) ==
    ...
    registered' = [registered EXCEPT![message.name] = @ + 1]

If a name is registered on two nodes, it will look something like this: [a |-> 2]. Running this, we get a failure:

Error: Invariant AllRegistered is violated.
...

State 10: <Next line 235, col 5 to line 247, col 22 of module syn>
/\ disconnections = 0
/\ locally_registered = (n1 :> (n1 :> << >> @@ n2 :> (a :> 3)) @@ n2 :> (n1 :> << >> @@ n2 :> (a :> 3)))
/\ time = 9
/\ states = << <<"Register", n1, a>>,
   <<"Register", n2, a>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"RegisterOrUpdateOnNode", n2, a>>,
   <<"Unregister", n1, a>>,
   <<"SyncRegister", n1, a>>,
   <<"UnregisterOnNode", n1, a>>,
   <<"SyncRegister", n2, a>>,
   <<"SyncUnregister", n2, a>> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 0)
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ names = (n1 :> {} @@ n2 :> {})

It looks like we think a has been unregistered everywhere, when it hasn’t. If we update the condition for unregister, we now get a more interesting failure:

Error: Invariant ThereCanBeOnlyOne is violated.
...

State 12: <Next line 238, col 5 to line 250, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = ( n1 :> (n1 :> (a :> 3) @@ n2 :> <<>>) @@
  n2 :> (n1 :> (a :> 3) @@ n2 :> (a :> 9)) )
/\ time = 11
/\ states = << <<"Disconnect", n1, n2>>,
   <<"Register", n1, a>>,
   <<"Down", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Reconnect", n1, n2>>,
   <<"Register", n2, a>>,
   <<"Discover", n1, n2>>,
   <<"Down", n2, n1>>,
   <<"Discover", n2, n1>>,
   <<"RegisterOrUpdateOnNode", n2, a>>,
   <<"AckSync", n2, n1>> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 2)
/\ inbox = (n1 :> <<[action |-> "ack_sync", from |-> n2, local_data |-> <<>>], [time |-> 9, action |-> "sync_register", name |-> a, from |-> n2]>> @@ n2 :> <<>>)
/\ names = (n1 :> {} @@ n2 :> {})

Connor MacLeod would not be pleased. Currently, when a node re-joins the cluster, we are just taking its version of the data as the truth. We also need to implement conflict resolution here:

MergeRegistries(local, remote, local_node, remote_node) ==
    LET local_merged == [rr \in {r \in DOMAIN local[local_node] : (r \notin DOMAIN remote \/ local[local_node][r] > remote[r])} |-> local[local_node][rr]]
        remote_merged == [rr \in {r \in DOMAIN remote : (r \notin DOMAIN local[local_node] \/ remote[r] > local[local_node][r])} |-> remote[rr]]
    IN [r \in DOMAIN local |-> CASE
        (r = remote_node) -> remote_merged
        [] (r = local_node) -> local_merged
        [] OTHER -> local[r]
    ]

AckSync(n) ==
    ...
    l == MergeRegistries(locally_registered[n], message.local_data, n, message.from)

The real implementation actually uses the same code as for registration, which is probably sensible; so I might need to refactor this later.

Running this looked like it might run forever (again), as I think I added another stuttering case by allowing multiple attempts to register the same name. If we remove that, we get a pretty long run; but eventually it still fails:

Error: Invariant AllRegistered is violated.
...

State 16: <Next line 247, col 5 to line 259, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> << >> @@ n2 :> << >>) @@ n2 :> (n1 :> << >> @@ n2 :> << >>))
/\ time = 15
/\ states = << <<"Disconnect", n1, n2>>,
   <<"Register", n1, a>>,
   <<"Down", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Register", n2, a>>,
   <<"Down", n2, n1>>,
   <<"RegisterOrUpdateOnNode", n2, a>>,
   <<"Reconnect", n1, n2>>,
   <<"Discover", n1, n2>>,
   <<"Discover", n2, n1>>,
   <<"AckSync", n1, n2>>,
   <<"Unregister", n2, a>>,
   <<"AckSync", n2, n1>>,
   <<"UnregisterOnNode", n2, a>>,
   <<"SyncUnregister", n1, a>> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 1)
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ names = (n1 :> {} @@ n2 :> {})

And this time, it’s the other way round. We think a should still be registered, somewhere, but it’s not. I think the problem here is that when we resolve conflicts on sync, we aren’t updating the refcount.

It’s a Syn (Part 7)

Before we can resolve conflicts, we have to generate some. We can let each node try and register the same pool of names:

Init ==
     ...
    /\ names = [n \in Nodes |-> 1..MaxValues]

Register(n) ==
    /\ names[n] # {}
    /\ LET next_val == CHOOSE o \in (names[n]): TRUE
         IN ...
        /\ names' = [names EXCEPT![n] = names[n] \ {next_val}]

This produces an immediate failure, and reveals that we have a race even without disconnection:

...

State 3: <Next line 139, col 5 to line 149, col 22 of module syn>
/\ states = <<<<"Register", n1, 1>>, <<"Register", n2, 1>>>>
/\ names = (n1 :> {} @@ n2 :> {})
/\ inbox = ( n1 :> <<[action |-> "sync_register", name |-> 1, from |-> n2]>> @@
  n2 :> <<[action |-> "sync_register", name |-> 1, from |-> n1]>> )
/\ locally_registered = (n1 :> (n1 :> {1} @@ n2 :> {}) @@ n2 :> (n1 :> {} @@ n2 :> {1}))
/\ disconnections = 0
/\ registered = {1}
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})

Both nodes can attempt to register the same name, even when connected, if it’s before the sync_register message has arrived/been processed.

Let’s park that, for now. Our conflict resolution process will pick the most recent registration as the winner, so we will need the concept of time:

Init ==
    ...
    /\ time = 0

Register(n) ==
    ...
    /\ time' = time + 1

A counter incremented by every state transition. Now, instead of storing registrations in a set, we will have a struct (function) of name -> time:

Init ==
    ...
    /\ locally_registered = [n1 \in Nodes |-> [n2 \in Nodes |-> <<>>]]
    /\ names = {"a"}

I was seeing some confusing behaviour from TLC when using integer keys for the registered names, so I switched to a set of strings.

Register(n) ==
    /\ names # {}
    /\ LET next_val == CHOOSE o \in names: TRUE
        l == [locally_registered[n] EXCEPT![n] = locally_registered[n][n] @@ [r \in {next_val} |-> time]]
        IN registered' = registered \union {next_val}
        /\ locally_registered' = [locally_registered EXCEPT![n] = l]
        /\ names' = names \ {next_val}
    ...

SyncRegister(n) ==
    ...
    /\ LET message == Head(inbox[n])
        l == [locally_registered[n] EXCEPT![message.from] = locally_registered[n][message.from] @@ [r \in {message.name} |-> message.time]]

Now locally_registered will look something like this:

( n1 :> (n1 :> << >> @@ n2 :> [a |-> 0]) @@ n2 :> (n1 :> << >> @@ n2 :> [a |-> 0]) )

This model still has no failures, as we can only register a key once. If that limitation is removed, we need to handle a conflict:

        conflict == message.name \in DOMAIN locally_registered[n][n]
        l == [o \in DOMAIN locally_registered[n] |-> CASE
            (o = message.from) -> (
                IF conflict /\ locally_registered[n][n][message.name] > message.time THEN
                    locally_registered[n][message.from]
                ELSE
                    locally_registered[n][message.from] @@ [r \in {message.name} |-> message.time]
            )
            [] (o = n) ->
                IF conflict /\ locally_registered[n][n][message.name] > message.time THEN
                    locally_registered[n][n]
                ELSE
                    [r \in (DOMAIN locally_registered[n][n] \ {message.name}) |-> locally_registered[n][n][r]]
            [] OTHER -> locally_registered[n][o]
        ]

We keep whichever process was registered with a more recent time. I also added the “Highlander” invariant, to guarantee that we can’t keep both:

INVARIANTS
...
ThereCanBeOnlyOne

RECURSIVE Duplicates(_, _, _)

Duplicates(keys, struct, acc) ==
    IF Cardinality(keys) < 2 THEN acc
    ELSE
        LET k1 == CHOOSE k \in keys: TRUE
        k2 == CHOOSE k \in (keys \ {k1}): TRUE
        duplicates == DOMAIN struct[k1] \intersect DOMAIN struct[k2]
        IN Duplicates(keys \ {k1}, struct, duplicates \union acc)

ThereCanBeOnlyOne ==
    \A n \in Nodes:
        Duplicates(DOMAIN locally_registered[n], locally_registered[n], {}) = {}

We immediately hit a failure:

...

State 4: <Next line 176, col 5 to line 186, col 22 of module syn>
/\ states = <<<<"Register", n1, "a">>, <<"Register", n2, "a">>, <<"Unregister", n1, "a">>>>
/\ time = 3
/\ names = (n1 :> {} @@ n2 :> {})
/\ inbox = ( n1 :>
      <<[time |-> 1, action |-> "sync_register", name |-> "a", from |-> n2]>> @@
  n2 :>
      << [time |-> 0, action |-> "sync_register", name |-> "a", from |-> n1],
         [action |-> "sync_unregister", name |-> "a", from |-> n1] >> )
/\ locally_registered = (n1 :> (n1 :> << >> @@ n2 :> <<>>) @@ n2 :> (n1 :> <<>> @@ n2 :> [a |-> 1]))
/\ disconnections = 0
/\ registered = {}
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})

Both nodes have tried to register the same name, but before we have tried to resolve the conflict, one node has removed its registration; and even though the other node still has a (valid) registration, we are now expecting it to not exist anywhere.

The easiest way to solve this is to check all the other nodes, before removing the name from the canonical set:

RECURSIVE AllRegisteredNames(_, _, _)

AllRegisteredNames(nodes, locals, registrations) ==
    IF nodes = {} THEN registrations
    ELSE
        LET n == CHOOSE n \in nodes: TRUE
        IN AllRegisteredNames(nodes \ {n}, locals, registrations \union DOMAIN locals[n][n])

RegisteredElsewhere(node) ==
    AllRegisteredNames(AllOtherNodes(node), locally_registered, {})

Unregister(n) ==
    ...
     /\ LET item_to_remove == ItemToRemove(n)
        IN registered' = (IF item_to_remove \in RegisteredElsewhere(n) THEN registered ELSE registered \ {item_to_remove})

Now the failure has morphed slightly:

...

State 7: <Next line 187, col 5 to line 197, col 22 of module syn>
/\ disconnections = 0
/\ locally_registered = (n1 :> (n1 :> << >> @@ n2 :> << >>) @@ n2 :> (n1 :> [a |-> 0] @@ n2 :> << >>))
/\ time = 6
/\ states = << <<"Register", n1, "a">>,
   <<"Register", n2, "a">>,
   <<"SyncRegister", n1, "a">>,
   <<"Unregister", n2, "a">>,
   <<"SyncUnregister", n1, "a">>,
   <<"SyncRegister", n2, "a">> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = {}
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ names = (n1 :> {} @@ n2 :> {})

And the 👻 registration is actually the process that lost the conflict resolution battle.

This caused a fair bit of head scratching, and I nearly persuaded myself that I had found a real race condition. My first thought was that a sync_register message for a dead process would be discarded, but erlang:is_process_alive can only be used on a local pid.

Now I think that the root cause is that I am treating register/unregister as if they take effect immediately (when in reality they go through gen_server:call, and so join the server’s message queue); this has allowed them to jump ahead of the sync_register message.

We’ll see if I’m right, next time!

It’s a Syn (Part 8)

If we make the register request async:

Register(n) ==
    ...
            IN inbox' = [inbox EXCEPT![n] = Append(inbox[n], [action |-> "register_or_update_on_node", name |-> next_val])]

RegisterOrUpdateOnNode(n) ==
    /\ Len(inbox[n]) > 0
    /\ Head(inbox[n]).action = "register_or_update_on_node"
    /\ LET message == Head(inbox[n])
        l == [locally_registered[n] EXCEPT![n] = locally_registered[n][n] @@ [r \in {message.name} |-> time]]
        already_registered == message.name \in AllRegisteredForNode(locally_registered[n])
        IN
        (IF already_registered THEN
            \* {error, taken}
            registered' = registered
            /\ locally_registered' = locally_registered
            /\ inbox' = [inbox EXCEPT![n] = Tail(inbox[n])]
            /\ names' = names
        ELSE
            registered' = registered \union {message.name}
            /\ locally_registered' = [locally_registered EXCEPT![n] = l]
            /\ inbox' = [o \in Nodes |-> CASE
                (o = n) -> Tail(inbox[n])
                [] (o \in visible_nodes[n]) -> Append(inbox[o], [action |-> "sync_register", name |-> message.name, from |-> n, time |-> time])
                [] OTHER -> inbox[o]
            ]
            /\ names' = [names EXCEPT![n] = names[n] \ {message.name}]
        )
        /\ states' = Append(states, <<"RegisterOrUpdateOnNode", n, message.name>>)
    /\ time' = time + 1
    /\ UNCHANGED <<visible_nodes, disconnections>>

we now need to handle the case where the name was already registered, thanks to the new race we have added.

Other than that, the process is the same, and the same vars are updated. If we do the same for unregister:

Unregister(n) ==
        ...
        IN inbox' = [inbox EXCEPT![n] = Append(inbox[n], [action |-> "unregister_on_node", name |-> item_to_remove])]
 
UnregisterOnNode(n) ==
    /\ Len(inbox[n]) > 0
    /\ Head(inbox[n]).action = "unregister_on_node"
    /\ LET message == Head(inbox[n])
        l == [r \in (DOMAIN locally_registered[n][n] \ {message.name}) |-> locally_registered[n][n][r]]
        already_removed == message.name \notin AllRegisteredForNode(locally_registered[n])
        IN
        (IF already_removed THEN
            \* {error, undefined}
            registered' = registered
            /\ locally_registered' = locally_registered
            /\ inbox' = [inbox EXCEPT![n] = Tail(inbox[n])]
            /\ names' = names
        ELSE
            registered' = (IF message.name \in RegisteredElsewhere(n) THEN registered ELSE registered \ {message.name})
            /\ locally_registered' = [locally_registered EXCEPT![n] = ([locally_registered[n] EXCEPT![n] = l])]
            /\ inbox' = [o \in Nodes |-> CASE
                (o = n) -> Tail(inbox[n])
                [] (o \in visible_nodes[n]) -> Append(inbox[o], [action |-> "sync_unregister", name |-> message.name, from |-> n])
                [] OTHER -> inbox[o]
            ]
            /\ names' = [names EXCEPT![n] = names[n] \ {message.name}]
        )
        /\ states' = Append(states, <<"UnregisterOnNode", n, message.name>>)
    /\ time' = time + 1
    /\ UNCHANGED <<visible_nodes, disconnections>>

we still have a failure of the same invariant:

Error: Invariant AllRegistered is violated.
Error: The behavior up to this point is:
...

State 10: <Next line 234, col 5 to line 246, col 22 of module syn>
/\ disconnections = 0
/\ locally_registered = (n1 :> (n1 :> << >> @@ n2 :> << >>) @@ n2 :> (n1 :> <<>> @@ n2 :> << >>))
/\ time = 9
/\ states = << <<"Register", n1, "a">>,
   <<"Register", n2, "a">>,
   <<"RegisterOrUpdateOnNode", n1, "a">>,
   <<"RegisterOrUpdateOnNode", n2, "a">>,
   <<"SyncRegister", n2, "a">>,
   <<"Unregister", n2, "a">>,
   <<"UnregisterOnNode", n2, "a">>,
   <<"SyncRegister", n1, "a">>,
   <<"SyncUnregister", n1, "a">> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = {"a"}
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ names = (n1 :> {} @@ n2 :> {})

but now the problem is that our canonical list of registrations still thinks the name should be there, even though it was removed.

I think this fix was too simplistic, and I will need to go back to my first idea: refcounting registrations.

It’s a Syn (Part 6)

So the other half of the protocol requires that registrations for a disconnected node are removed. First, we send a DOWN message on disconnect (to both nodes):

Disconnect(n) ==
          ...
        /\ inbox' = [o \in Nodes |-> CASE
            (o = n) -> Append(inbox[o], [action |-> "DOWN", from |-> other_node])
            [] (o = other_node) -> Append(inbox[o], [action |-> "DOWN", from |-> n])
            [] OTHER -> inbox[o]
        ]

And just do nothing with the message, for now:

Down(n) ==
    /\ Len(inbox[n]) > 0
    /\ Head(inbox[n]).action = "DOWN"
    /\ inbox' = [inbox EXCEPT![n] = Tail(inbox[n])]
    /\ states' = Append(states, "Down")
    /\ UNCHANGED <<registered, next_val, visible_nodes, locally_registered>>

Next ==
    /\ \E n \in Nodes:
        ...
        \/ Down(n)

To be able to remove the registrations for the disconnected node, we need to track who has registered what. There’s probably more than one way to achieve this, but I decided to convert my locally_registered struct into a deeper HashMap<Node, HashMap<Node, Set>>; so that each node has its own map of what the others have registered:

Init ==
    ...
    /\ locally_registered = [n1 \in Nodes |-> [n2 \in Nodes |-> {}]]

This makes updates more complicated:

Register(n) ==
    /\ next_val < MaxValues
    /\ registered' = registered \union {next_val}
    /\ LET l == [locally_registered[n] EXCEPT![n] = locally_registered[n][n] \union {next_val}]
       IN locally_registered' = [locally_registered EXCEPT![n] = l]

which we will just have to live with. And the invariant needs to union each node’s view into a single set for comparison:

RECURSIVE ReduceStruct(_, _, _)

ReduceStruct(keys, struct, acc) ==
    IF keys = {} THEN acc
    ELSE
        LET k == CHOOSE k \in keys: TRUE
        IN ReduceStruct(keys \ {k}, struct, acc \union struct[k])

AllRegisteredForNode(locals) ==
    ReduceStruct(DOMAIN locals, locals, {})

AllRegistered ==
    \A n \in Nodes:
        (\A o \in Nodes: Len(inbox[o]) = 0) /\ visible_nodes[n] = AllOtherNodes(n) => AllRegisteredForNode(locally_registered[n]) = registered

This is pretty fugly, but if there’s a built-in way to do it, I couldn’t find it. Finally, we can use this extra info to remove the remote registrations:

Down(n) ==
    ...
    /\ LET message == Head(inbox[n])
        l == [locally_registered[n] EXCEPT![message.from] = {}]
        IN locally_registered' = [locally_registered EXCEPT![n] = l]
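(As an aside on the recursive ReduceStruct helper above: TLA+ does have a built-in UNION operator, which takes the union of a set of sets. If I haven’t missed something, the helper could be collapsed to a one-liner; a sketch:)

```tla
\* Equivalent to the recursive ReduceStruct: build the set of all
\* per-node registration sets, and take their union in one step.
AllRegisteredForNode(locals) ==
    UNION {locals[k] : k \in DOMAIN locals}
```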

If we run this, the error seems to be gone, but it also looks like it is going to run forever (again). So we also need to limit the number of disconnections:

Init ==
     ...
    /\ disconnections = 0

Disconnect(n) ==
    /\ disconnections < MaxDisconnections
    ...
    /\ disconnections' = disconnections + 1

Back to a working system! But even with a strictly limited number of nodes/registrations/disconnects:

CONSTANTS
Nodes = {n1, n2}
MaxValues = 2 
MaxDisconnections = 1 

we have had a combinatorial explosion of interleavings, and running this model now takes 45 mins on my laptop:

Model checking completed. No error has been found.
  Estimates of the probability that TLC did not check all reachable states
  because two distinct states had the same fingerprint:
  calculated (optimistic):  val = 2.7E-4
  based on the actual fingerprints:  val = 3.1E-4
148808949 states generated, 49693165 distinct states found, 0 states left on queue.
The depth of the complete state graph search is 17.
The average outdegree of the complete state graph is 1 (minimum is 0, the maximum 7 and the 95th percentile is 2).
Finished in 46min 15s at (2024-01-10 13:55:28)

Nearly 50M distinct states! As I said previously, this is the downside to an exhaustive search of the domain. If this is an existential question for your business (e.g. a smart contract, that can’t be fixed once out in the wild; or a spacecraft), then you might need to just rent a honking great EC2 instance and run your model for a week.

There are some possible optimisations. We can declare the nodes to be a “symmetry set”, removing any states that are mirror images of each other (about half in a 2-node cluster). This makes a huge difference:

74404477 states generated, 24846583 distinct states found, 0 states left on queue.
The depth of the complete state graph search is 17.
The average outdegree of the complete state graph is 1 (minimum is 0, the maximum 7 and the 95th percentile is 2).
Finished in 05min 03s at (2024-01-10 14:19:02)

but you need to be confident that it really doesn’t affect the outcome. Beyond that, the only option is to start removing some concurrency (i.e. combine two state transitions), and hope that those missing interleavings didn’t reveal any bugs.
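For reference, outside the Toolbox a symmetry set is declared with Permutations (from the standard TLC module, so TLC needs adding to the EXTENDS line) plus a SYMMETRY entry in the config:

Symmetry ==
    Permutations(Nodes)

and in the model’s config file:

SYMMETRY Symmetry

One concrete caveat: TLC itself warns that symmetry reduction is not sound when checking liveness properties.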

And we still haven’t got to the interesting bit! As it stands, our model can never have any registration conflicts because a) we are using an incrementing integer as the key, and b) when we sync, we are unioning two sets, which would be quite happy with two nodes having registered the same name.

Next time, we will look into the default conflict resolution process.

Testing fragments of TLA+

As your specs grow, you quickly reach a point where the feedback loop of running the whole model (even with restricted inputs) is too slow.

If you are using the Toolbox (GUI), it’s pretty easy to set up a scratchfile; but it is also possible to write some “unit” tests for any helper functions:

---- MODULE syn_tests ----
EXTENDS syn

VARIABLE x

I == FALSE /\ x = 0 
N == FALSE /\ x' = x 

...

\* AllRegisteredForNode

ASSUME LET l == ("n1" :> <<>> @@ "n2" :> [a |-> 0, b |-> 2] @@ "n3" :> [c |-> 1])
       res == AllRegisteredForNode(l)
       IN res = {"a", "b", "c"}

\* Duplicates

ASSUME LET l == ("n1" :> <<>> @@ "n2" :> [a |-> 0] @@ "n3" :> [a |-> 1])
       res == Duplicates(DOMAIN l, l, {})
       IN res = {"a"}
====

We need to override the Init and Next functions with this config file (syn_tests.cfg), to prevent the existing model from running:

INIT I
NEXT N

If the assumptions are all valid:

$ docker run --rm -it -v $PWD:/app -w /app amazoncorretto:8-alpine3.17-jdk java -XX:+UseParallelGC -cp tla2tools.jar tlc2.TLC -workers auto -cleanup syn_tests.tla
TLC2 Version 2.18 of 20 March 2023 (rev: 3ea3222)
...
Model checking completed. No error has been found.
  Estimates of the probability that TLC did not check all reachable states
  because two distinct states had the same fingerprint:
  calculated (optimistic):  val = 0.0
0 states generated, 0 distinct states found, 0 states left on queue.
The depth of the complete state graph search is 1.
Finished in 00s at (2024-01-17 14:16:36)

The failure output isn’t very useful:

Starting... 
Error: Assumption line 17, col 8 to line 19, col 18 of module syn_tests is false.
Finished in 00s 

But you can output values to debug, using PrintT:

ASSUME ... IN PrintT(res) /\ res = {}

Starting...
{"a"}
Error: Assumption line 17, col 8 to line 19, col 33 of module syn_tests is false.

There are some more examples here.

It’s a Syn (Part 5)

Once a netsplit has healed, we need to re-sync the registered processes. When two nodes are reconnected, we add a discover message to the inbox of both nodes:

Reconnect(n) ==
        ...
        /\ inbox' = [o \in Nodes |-> CASE
            (o = n) -> Append(inbox[o], [action |-> "discover", from |-> other_node])
            [] (o = other_node) -> Append(inbox[o], [action |-> "discover", from |-> n])
            [] OTHER -> inbox[o]
        ]

Handling that message sends a copy of the local data back to the sender:

Discover(n) ==
    /\ Len(inbox[n]) > 0
    /\ LET message == Head(inbox[n])
        IN message.action = "discover"
        /\ inbox' = [o \in Nodes |-> CASE
            (o = n) -> Tail(inbox[o])
            [] (o = message.from) -> Append(inbox[o], [action |-> "ack_sync", local_data |-> locally_registered[n]])
            [] OTHER -> inbox[o]
        ]

which is then merged:

AckSync(n) ==
    /\ Len(inbox[n]) > 0
    /\ Head(inbox[n]).action = "ack_sync"
    /\ inbox' = [inbox EXCEPT![n] = Tail(inbox[n])]
    /\ LET message == Head(inbox[n])
        IN locally_registered' = [locally_registered EXCEPT![n] = locally_registered[n] \union message.local_data]

We also need to change the invariant, so that it only checks once all messages have been processed, on all nodes:

AllRegistered ==
    \A n \in Nodes:
        (\A o \in Nodes: Len(inbox[o]) = 0) /\ visible_nodes[n] = AllOtherNodes(n) => locally_registered[n] = registered

If we run this, we get a failure:

...

State 9: <Next line 111, col 5 to line 120, col 19 of module syn>
/\ states = << "Register",
   "Disconnect",
   "Unregister",
   "Reconnect",
   "Discover",
   "SyncRegister",
   "Discover",
   "AckSync" >>
/\ next_val = 1
/\ inbox = (n1 :> <<>> @@ n2 :> <<[action |-> "ack_sync", local_data |-> {}]>>)
/\ locally_registered = (n1 :> {0} @@ n2 :> {0})
/\ registered = {}
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})

State 10: <Next line 111, col 5 to line 120, col 19 of module syn>
/\ states = << "Register",
   "Disconnect",
   "Unregister",
   "Reconnect",
   "Discover",
   "SyncRegister",
   "Discover",
   "AckSync",
   "AckSync" >>
/\ next_val = 1
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ locally_registered = (n1 :> {0} @@ n2 :> {0})
/\ registered = {}
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})

I’ll spare you the full stack trace, but we now have a phantom key: it was removed from a node during the partition, but then merged back in. Again, the problem lies with my model being too simplistic. We will look into how to resolve this next time.

It’s a Syn (Part 4)

We are starting to get to the business end. So how can we model a netsplit? The easiest thing seems to be to add a set for each node, listing the nodes that are visible from it:

AllOtherNodes(n) ==
    Nodes \ {n}

Init ==
     ...
    /\ visible_nodes = [n \in Nodes |-> AllOtherNodes(n)]

At the start, all nodes can see all other nodes. Now, when we register a new value:

Register(n) ==
     ...
    /\ inbox' = [o \in Nodes |->
                       IF o \in visible_nodes[n] THEN
                           Append(inbox[o], [action |-> "sync_register", name |-> next_val])
                       ELSE
                           inbox[o]
                   ]

we only send messages to the nodes that are still visible (same for unregister). And we add two new state transitions:
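The corresponding change to Unregister would be something like this (a sketch, reusing item_to_remove from the earlier LET binding):

Unregister(n) ==
    ...
    /\ inbox' = [o \in Nodes |->
                       IF o \in visible_nodes[n] THEN
                           Append(inbox[o], [action |-> "sync_unregister", name |-> item_to_remove])
                       ELSE
                           inbox[o]
                   ]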

Next ==
    /\ \E n \in Nodes:
        ...
        \/ Disconnect(n)
        \/ Reconnect(n)

To disconnect:

Disconnect(n) ==
    /\ Cardinality(visible_nodes[n]) > 0
    /\ LET other_node == CHOOSE o \in visible_nodes[n]: TRUE
        IN visible_nodes' = [o \in Nodes |-> CASE
            (o = other_node) -> visible_nodes[o] \ {n}
            [] (o = n) -> visible_nodes[o] \ {other_node}
            [] OTHER -> visible_nodes[o]
        ]
    /\ UNCHANGED <<registered, locally_registered, inbox, next_val>>

If other nodes are still visible, we pick an arbitrary node from the set and remove it (on both sides). And reconnection is the opposite:

Reconnect(n) ==
    /\ Cardinality(AllOtherNodes(n) \ visible_nodes[n]) > 0
    /\ LET other_node == CHOOSE o \in (AllOtherNodes(n) \ visible_nodes[n]): TRUE
        IN visible_nodes' = [o \in Nodes |-> CASE
            (o = other_node) -> visible_nodes[o] \union {n}
            [] (o = n) -> visible_nodes[o] \union {other_node}
            [] OTHER -> visible_nodes[o]
        ]
    /\ UNCHANGED <<registered, locally_registered, inbox, next_val>>
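One caveat with both of these actions: TLC evaluates CHOOSE deterministically — it always picks the same element — so only one disconnect/reconnect partner is ever explored from a given state. With two nodes that makes no difference, but in a larger cluster, to have the model checker try every possible partner, the CHOOSE could be replaced with existential quantification. A sketch for Disconnect:

Disconnect(n) ==
    /\ \E other_node \in visible_nodes[n]:
        /\ visible_nodes' = [o \in Nodes |-> CASE
            (o = other_node) -> visible_nodes[o] \ {n}
            [] (o = n) -> visible_nodes[o] \ {other_node}
            [] OTHER -> visible_nodes[o]
        ]
        /\ UNCHANGED <<registered, locally_registered, inbox, next_val>>

(The Cardinality guard becomes redundant, since \E over an empty set is simply false.)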

We also need to update our invariant:

AllRegistered ==
    \A n \in Nodes:
        Len(inbox[n]) = 0 /\ visible_nodes[n] = AllOtherNodes(n) => locally_registered[n] = registered

to only compare the registered values, once a netsplit has healed. If we run this model:

Error: Invariant AllRegistered is violated.
Error: The behavior up to this point is:
State 1: <Initial predicate>
/\ next_val = 0
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ locally_registered = (n1 :> {} @@ n2 :> {})
/\ registered = {}
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})

State 2: <Next line 78, col 5 to line 85, col 19 of module syn>
/\ next_val = 0
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ locally_registered = (n1 :> {} @@ n2 :> {})
/\ registered = {}
/\ visible_nodes = (n1 :> {} @@ n2 :> {})

State 3: <Next line 78, col 5 to line 85, col 19 of module syn>
/\ next_val = 1
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ locally_registered = (n1 :> {0} @@ n2 :> {})
/\ registered = {0}
/\ visible_nodes = (n1 :> {} @@ n2 :> {})

State 4: <Next line 78, col 5 to line 85, col 19 of module syn>
/\ next_val = 1
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ locally_registered = (n1 :> {0} @@ n2 :> {})
/\ registered = {0}
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})

it fails pretty quickly. But that is what we should expect, as new keys could have been added while the network was partitioned, and we aren’t doing anything to resolve that. And that is what we will be looking into next time.

It’s a Syn (Part 3)

I said last time that waiting for all messages to be processed before performing any new actions was unrealistic, so let’s remove those two pre-conditions:

diff --git a/syn.tla b/syn.tla
index 5bab535..d75d5d3 100644
--- a/syn.tla
+++ b/syn.tla
@@ -15,7 +15,6 @@ Init ==
     /\ removed = 0
 
 Register(n) ==
-    /\ \A o \in Nodes: Len(inbox[o]) = 0
     /\ next_val < MaxValues
     /\ registered' = [registered EXCEPT![n] = registered[n] \union {next_val}]
     /\ next_val' = next_val + 1
@@ -34,7 +33,6 @@ ItemToRemove(n) ==
     CHOOSE r \in registered[n]: TRUE
 
 Unregister(n) ==
-    /\ \A o \in Nodes: Len(inbox[o]) = 0
     /\ Cardinality(registered[n]) > 0
     /\ LET item_to_remove == ItemToRemove(n)
         IN registered' = [registered EXCEPT![n] = registered[n] \ {item_to_remove}]

Now if we run the model checker, we have a problem:

Error: Invariant AllRegistered is violated.
Error: The behavior up to this point is:
State 1: <Initial predicate>
/\ next_val = 0
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ added = 0
/\ removed = 0
/\ registered = (n1 :> {} @@ n2 :> {})

State 2: <Next line 55, col 5 to line 60, col 19 of module syn>
/\ next_val = 1
/\ inbox = (n1 :> <<>> @@ n2 :> <<[action |-> "sync_register", name |-> 0]>>)
/\ added = 1
/\ removed = 0
/\ registered = (n1 :> {0} @@ n2 :> {})

State 3: <Next line 55, col 5 to line 60, col 19 of module syn>
/\ next_val = 1
/\ inbox = ( n1 :> <<>> @@
  n2 :>
      << [action |-> "sync_register", name |-> 0],
         [action |-> "sync_unregister", name |-> 0] >> )
/\ added = 1
/\ removed = 1
/\ registered = (n1 :> {} @@ n2 :> {})

State 4: <Next line 55, col 5 to line 60, col 19 of module syn>
/\ next_val = 1
/\ inbox = (n1 :> <<>> @@ n2 :> <<[action |-> "sync_unregister", name |-> 0]>>)
/\ added = 1
/\ removed = 1
/\ registered = (n1 :> {} @@ n2 :> {0})

State 5: <Next line 55, col 5 to line 60, col 19 of module syn>
/\ next_val = 1
/\ inbox = ( n1 :> <<[action |-> "sync_unregister", name |-> 0]>> @@
  n2 :> <<[action |-> "sync_unregister", name |-> 0]>> )
/\ added = 1
/\ removed = 2
/\ registered = (n1 :> {} @@ n2 :> {})

State 6: <Next line 55, col 5 to line 60, col 19 of module syn>
/\ next_val = 1
/\ inbox = (n1 :> <<>> @@ n2 :> <<[action |-> "sync_unregister", name |-> 0]>>)
/\ added = 1
/\ removed = 2
/\ registered = (n1 :> {} @@ n2 :> {})

In this scenario, a key (0) was added on node 1 and then immediately removed. But node 2 still had both messages waiting, so it also added the key; and then decided to remove it itself before getting to the unregister message from node 1. This is getting more interesting, but the problem lies with our model, not the protocol.

It’s generally considered bad form to update the registry for another process (and this can actually be enforced); but the real problem here is that even though removing an item that doesn’t exist from a set is a no-op, we are double counting. So we think that we have added 1 item, and removed 2; and therefore the size of the set should be -1, which is never going to happen.

Assuming we want to allow the non-strict behaviour, we could probably solve this with some clever conditional updates of the counters; but I think I would prefer instead to track a “platonic ideal” of what should be registered, and compare the actual state to that instead.

Init ==
    ...
    /\ registered = {}
    /\ locally_registered = [n \in Nodes |-> {}]

Register(n) ==
    ...
    /\ registered' = registered \union {next_val}
    /\ locally_registered' = [locally_registered EXCEPT![n] = locally_registered[n] \union {next_val}]

AllRegistered ==
    \A n \in Nodes:
        Len(inbox[n]) = 0 => locally_registered[n] = registered

This does remove the previous error, but even with just 5 keys, we are generating far more states than before:

Model checking completed. No error has been found.
    ...
1782413 states generated, 357188 distinct states found, 0 states left on queue.
The depth of the complete state graph search is 26.

This means running the model will be slower, but should also mean it’s more likely to find some interesting concurrent executions.

Next stop, netsplits!