It’s a Syn (Part 6)

So the other half of the protocol requires that registrations for a disconnected node are removed. First, we send a DOWN message on disconnect (to both nodes):

Disconnect(n) ==
          ...
        /\ inbox' = [o \in Nodes |-> CASE
            (o = n) -> Append(inbox[o], [action |-> "DOWN", from |-> other_node])
            [] (o = other_node) -> Append(inbox[o], [action |-> "DOWN", from |-> n])
            [] OTHER -> inbox[o]
        ]
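
For a two-node cluster, and assuming the elided part of the action binds other_node to the peer (as the name suggests), this conjunct just leaves a DOWN message at the tail of both inboxes. For example:

\* Disconnect(n1), with other_node = n2:
\*   inbox'[n1] = Append(inbox[n1], [action |-> "DOWN", from |-> n2])
\*   inbox'[n2] = Append(inbox[n2], [action |-> "DOWN", from |-> n1])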

And just do nothing with the message, for now:

Down(n) ==
    /\ Len(inbox[n]) > 0
    /\ Head(inbox[n]).action = "DOWN"
    /\ inbox' = [inbox EXCEPT![n] = Tail(inbox[n])]
    /\ states' = Append(states, "Down")
    /\ UNCHANGED <<registered, next_val, visible_nodes, locally_registered>>

Next ==
    /\ \E n \in Nodes:
        ...
        \/ Down(n)

To be able to remove the registrations for the disconnected node, we need to track who has registered what. There’s probably more than one way to achieve this, but I decided to convert my locally_registered struct into a deeper HashMap<Node, HashMap<Node, Set>>, so that each node has its own map of what the others have registered:

 Init ==
    ...
    /\ locally_registered = [n1 \in Nodes |-> [n2 \in Nodes |-> {}]]
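
With Nodes = {n1, n2}, that initial value expands to every node holding an empty view of each node’s registrations (this is just the definition above written out, not new spec code):

locally_registered = [ n1 |-> [n1 |-> {}, n2 |-> {}],
                       n2 |-> [n1 |-> {}, n2 |-> {}] ]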

This makes updates more complicated:

 Register(n) ==
     /\ next_val < MaxValues
     /\ registered' = registered \union {next_val}
     /\ LET l == [locally_registered[n] EXCEPT![n] = locally_registered[n][n] \union {next_val}]
        IN locally_registered' = [locally_registered EXCEPT![n] = l]
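
To see what the nested EXCEPT is doing, here is a hand evaluation for the two-node case, starting from the all-empty structure with next_val = 0 (my own worked example, not part of the spec):

\* Register(n1) touches only n1's entry inside n1's own view:
\*   l evaluates to        [n1 |-> {0}, n2 |-> {}]
\*   locally_registered' = [ n1 |-> [n1 |-> {0}, n2 |-> {}],
\*                           n2 |-> [n1 |-> {}, n2 |-> {}] ]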

That nested update is something we will just have to live with. And the invariant needs to union each node’s per-node sets back into a single set, so it can be compared with registered:

RECURSIVE ReduceStruct(_, _, _)

ReduceStruct(keys, struct, acc) ==
    IF keys = {} THEN acc
    ELSE
        LET k == CHOOSE k \in keys: TRUE
        IN ReduceStruct(keys \ {k}, struct, acc \union struct[k])

AllRegisteredForNode(locals) ==
    ReduceStruct(DOMAIN locals, locals, {})

 AllRegistered ==
     \A n \in Nodes:
        (\A o \in Nodes: Len(inbox[o]) = 0) /\ visible_nodes[n] = AllOtherNodes(n) => AllRegisteredForNode(locally_registered[n]) = registered
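
To make that concrete, here is a hand evaluation with some made-up values, plus (for what it’s worth) a one-liner using the built-in UNION operator that should be equivalent, though I haven’t run it against this model:

\* Hand evaluation:
\*   AllRegisteredForNode([n1 |-> {0}, n2 |-> {1}]) = {0} \union {1} = {0, 1}

\* Possible built-in alternative (untested here):
AllRegisteredForNode(locals) == UNION {locals[k] : k \in DOMAIN locals}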

This is pretty fugly, but I couldn’t find a neater built-in way at the time. Finally, we can use this extra information to remove the remote registrations:

Down(n) ==
    ...
    /\ LET message == Head(inbox[n])
        l == [locally_registered[n] EXCEPT![message.from] = {}]
        IN locally_registered' = [locally_registered EXCEPT![n] = l]
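
For example (another hand evaluation, assuming a two-node cluster): if n1’s view is [n1 |-> {0}, n2 |-> {1}] and it processes a DOWN from n2, only n2’s entry within n1’s view is wiped; n2 clears its copy of n1’s registrations when it handles its own DOWN message:

\* Before:  locally_registered[n1]  = [n1 |-> {0}, n2 |-> {1}]
\* Message: [action |-> "DOWN", from |-> n2]
\* After:   locally_registered'[n1] = [n1 |-> {0}, n2 |-> {}]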

If we run this, the error seems to be gone, but it also looks like it is going to run forever (again), presumably because nothing limits how many times the nodes can disconnect and reconnect, so the state space never stops growing. So we also need to limit the number of disconnections:

Init ==
     ...
    /\ disconnections = 0

 Disconnect(n) ==
    /\ disconnections < MaxDisconnections
    ...
    /\ disconnections' = disconnections + 1

Back to a working system! But even with a strictly limited number of nodes/registrations/disconnects:

CONSTANTS
Nodes = {n1, n2}
MaxValues = 2 
MaxDisconnections = 1 

we have had a combinatorial explosion of interleavings, and running this model now takes 45 mins on my laptop:

Model checking completed. No error has been found.
  Estimates of the probability that TLC did not check all reachable states
  because two distinct states had the same fingerprint:
  calculated (optimistic):  val = 2.7E-4
  based on the actual fingerprints:  val = 3.1E-4
148808949 states generated, 49693165 distinct states found, 0 states left on queue.
The depth of the complete state graph search is 17.
The average outdegree of the complete state graph is 1 (minimum is 0, the maximum 7 and the 95th percentile is 2).
Finished in 46min 15s at (2024-01-10 13:55:28)

Nearly 50M distinct states! As I said previously, this is the downside to an exhaustive search of the domain. If this is an existential question for your business (e.g. a smart contract that can’t be fixed once it’s out in the wild, or a spacecraft), then you might just need to rent a honking great EC2 instance and run your model for a week.

There are some possible optimisations. We can declare the nodes to be a “symmetry set”, removing any states that are mirror images of each other (about half of them in a two-node cluster). This makes a huge difference:

74404477 states generated, 24846583 distinct states found, 0 states left on queue.
The depth of the complete state graph search is 17.
The average outdegree of the complete state graph is 1 (minimum is 0, the maximum 7 and the 95th percentile is 2).
Finished in 05min 03s at (2024-01-10 14:19:02)

but you need to be confident that it really doesn’t affect the outcome. Beyond that, the only option is to start removing some concurrency (i.e. combining two state transitions into one) and hoping that the missing interleavings wouldn’t have revealed any bugs.
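
For reference, if you are driving TLC from a .cfg file rather than the Toolbox UI, one way to declare the symmetry set is to define a permutation set in the spec (via the standard TLC module) and point the config at it. A minimal sketch, where the operator name Symmetry is my own choice rather than anything from this model:

\* In the spec (needs EXTENDS TLC for Permutations):
Symmetry == Permutations(Nodes)

\* In the model's .cfg file:
\* SYMMETRY Symmetry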

And we still haven’t got to the interesting bit! As it stands, our model can never have any registration conflicts because a) we are using an incrementing integer as the key, and b) when we sync, we are unioning two sets, which would be quite happy with two nodes having registered the same name.

Next time, we will look into the default conflict resolution process.
