It’s a Syn (Part 2)

To make things more interesting, we need to take the simple model, and run it with multiple concurrent nodes:

CONSTANTS
Nodes = {n1, n2}

Just two, to start with. The model state needs to become more complex:

Init ==
    /\ inbox = [n \in Nodes |-> <<>>]
    /\ registered = [n \in Nodes |-> {}]
    /\ next_val = 0
    /\ added = 0
    /\ removed = 0

Each node now has its own set of registered values; and, this being an Erlang system, it makes sense to model the inbox for messages (a sequence, for each node). In this case, a node is really a process (or gen_server).

Next ==
    /\ \E n \in Nodes:
        \/ Register(n)
        \/ SyncRegister(n)
        \/ Unregister(n)
        \/ SyncUnregister(n)
        \/ Complete

There are more options, for the next state: we now pick a node from the set of all nodes, and apply an action to that. Also, register & unregister have split in two (async).

Register(n) ==
    /\ \A o \in Nodes: Len(inbox[o]) = 0
    /\ next_val < MaxValues
    /\ registered' = [registered EXCEPT![n] = registered[n] \union {next_val}]
    /\ next_val' = next_val + 1
    /\ added' = added + 1
    /\ inbox' = [o \in Nodes |-> IF o = n THEN inbox[o] ELSE Append(inbox[o], [action |-> "sync_register", name |-> next_val])]
    /\ UNCHANGED <<removed>>

First we update the local registry, then broadcast a message to all other nodes, to do the same.

SyncRegister(n) ==
    /\ Len(inbox[n]) > 0
    /\ Head(inbox[n]).action = "sync_register"
    /\ registered' = [registered EXCEPT![n] = registered[n] \union {Head(inbox[n]).name}]
    /\ inbox' = [inbox EXCEPT![n] = Tail(inbox[n])]
    /\ UNCHANGED <<removed, added, next_val>>

To make things simpler, I decided that a pre-condition for register would be that all nodes have empty inboxes (i.e. any previous messages have already been processed). This obviously isn’t realistic, and the constraint will be removed later.

ItemToRemove(n) ==
    CHOOSE r \in registered[n]: TRUE

Unregister(n) ==
    /\ \A o \in Nodes: Len(inbox[o]) = 0
    /\ Cardinality(registered[n]) > 0
    /\ LET item_to_remove == ItemToRemove(n)
        IN registered' = [registered EXCEPT![n] = registered[n] \ {item_to_remove}]
        /\ inbox' = [o \in Nodes |-> IF o = n THEN inbox[o] ELSE Append(inbox[o], [action |-> "sync_unregister", name |-> item_to_remove])]
    /\ removed' = removed + 1
    /\ UNCHANGED <<added, next_val>>

SyncUnregister(n) ==
    /\ Len(inbox[n]) > 0
    /\ Head(inbox[n]).action = "sync_unregister"
    /\ registered' = [registered EXCEPT![n] = registered[n] \ {Head(inbox[n]).name}]
    /\ inbox' = [inbox EXCEPT![n] = Tail(inbox[n])]
    /\ UNCHANGED <<removed, added, next_val>>

Unsurprisingly, unregister is still a mirror image of the same steps. We maintain the same invariant:

AllRegistered ==
    \A n \in Nodes:
        Len(inbox[n]) = 0 => Cardinality(registered[n]) = (added - removed)

but evaluated on each node, individually; and we don’t bother checking until all messages have been processed (as the counters obviously won’t match).

And now we also check that all messages have been consumed, when the system halts:

PROPERTIES
AllMessagesProcessed

AllMessagesProcessed ==
    \A n \in Nodes:
        <>(Len(inbox[n]) = 0)

Again, we can run this with the model checker:

Starting...
Implied-temporal checking--satisfiability problem has 2 branches.
Computing initial states...
Finished computing initial states: 1 distinct state generated at 2024-01-05 13:15:50.
Progress(41) at 2024-01-05 13:15:50: 543 states generated, 286 distinct states found, 0 states left on queue.
Checking 2 branches of temporal properties for the complete state space with 572 total distinct states at (2024-01-05 13:15:50)
Finished checking temporal properties in 00s at 2024-01-05 13:15:50
Model checking completed. No error has been found.
  Estimates of the probability that TLC did not check all reachable states
  because two distinct states had the same fingerprint:
  calculated (optimistic):  val = 4.0E-15
543 states generated, 286 distinct states found, 0 states left on queue.
The depth of the complete state graph search is 41.

And the number of possible states has ballooned from the simple model, but still no errors!

We can also try registering a larger number of keys, or using more nodes, with the same result; but the search space, and runtime, can increase dramatically (this is the downside to exhaustive checking, vs example based testing or even generative random testing).

It’s a Syn (Part 1)

I was looking at the Syn process registry recently, and thought it might be interesting to try modelling its netsplit recovery protocol (eventual consistency) in TLA+.

It’s always worth starting with something so simple, that it can’t be wrong. In this case I decided that scopes were out of scope, as was metadata; and I would first try to build a model for a single node.

Init ==
    /\ registered = {}
    /\ next_val = 0
    /\ added = 0
    /\ removed = 0

We have an empty set, for the registered values. A serial for the next item to be added (using integers as “keys”); and two counters, for the number of items added & removed.

Next ==
    \/ Register
    \/ Unregister
    \/ Complete

The next state is a choice of registering a value, unregistering a value; or reaching the end (an arbitrary limit).

Register ==
    /\ next_val < 5
    /\ registered' = registered \union {next_val}
    /\ next_val' = next_val + 1
    /\ added' = added + 1
    /\ UNCHANGED <<removed>>

If we haven’t hit that limit yet (pre-condition), then we add the next value to the set of registered values, and increment the counter.

ItemToRemove ==
    CHOOSE r \in registered: TRUE

Unregister ==
    /\ Cardinality(registered) > 0
    /\ registered' = registered \ {ItemToRemove}
    /\ removed' = removed + 1
    /\ UNCHANGED <<added, next_val>>

Unregistration is pretty similar. The pre-condition this time is that the set is not empty, and obviously we remove the (randomly selected) key instead.

Complete ==
    /\ next_val = 5
    /\ UNCHANGED <<added, next_val, registered, removed>>

Finally, once 5 values have been added (no matter how many were removed), we give up.

INVARIANTS
AllRegistered

AllRegistered == Cardinality(registered) = (added - removed)

And after every step, we ensure that the number of items in the set is equal to those added minus those removed.

If we run the model:

$ docker run --rm -it -v $PWD:/app -w /app openjdk:14-jdk-alpine java -XX:+UseParallelGC -cp tla2tools.jar tlc2.TLC -config syn.cfg -workers auto -cleanup syn.tla
TLC2 Version 2.18 of Day Month 20?? (rev: cab6f13)
...
Computing initial states...
Finished computing initial states: 1 distinct state generated at 2024-01-05 12:23:27.
Model checking completed. No error has been found.
  Estimates of the probability that TLC did not check all reachable states
  because two distinct states had the same fingerprint:
  calculated (optimistic):  val = 1.8E-17
37 states generated, 21 distinct states found, 0 states left on queue.
The depth of the complete state graph search is 11.

No errors are found. Not a huge surprise, but this gives us a good foundation to build something more realistic.