Erlang – Look on my works, ye Mighty, and despair!

It’s a Syn (Part 13)

February 22, 2024February 22, 2024Graham HayLeave a comment

It took a few attempts, but I managed to come up with a merge implementation that worked for both AckSync and SyncRegister, with more than 2 nodes:

Merge(left, right) ==
    LET to_keep == {name \in DOMAIN left : (name \notin DOMAIN right \/ left[name] > right[name])}
    IN [name \in to_keep |-> left[name]]

RECURSIVE Flatten(_, _, _)

Flatten(keys, struct, acc) ==
    IF keys = {} THEN acc
    ELSE
        LET k == CHOOSE k \in keys: TRUE
        IN Flatten(keys \ {k}, struct, acc @@ struct[k])

MergeRegistries(local, remote, remote_node) ==
    LET all_registered == Flatten(DOMAIN local, local, << >>)
    IN [r \in DOMAIN local |-> CASE
        (r = remote_node) -> local[r] @@ Merge(remote, all_registered)
        [] OTHER -> Merge(local[r], remote)
    ]

SyncRegister(n) ==
    ...
    l == MergeRegistries(locally_registered[n], [r \in {message.name} |-> message.time], message.from)

AckSync(n) ==
    ...
    l == MergeRegistries(locally_registered[n], message.local_data, message.from)

and now I can run simulation node (with 5 nodes & more keys), for hours; without failure. I have no idea how much of the search space has been covered though, I suspect not a lot.

And on that note, I think I’m done with this topic 🎉

It’s a Syn (Part 12)

February 19, 2024February 19, 2024Graham HayLeave a comment

Up till now, we have been using the default behaviour of TLC: model checking mode. But there is another tool in our quiver (mixing my metaphors there), simulation mode!

This is more like the way property testing tooling works: rather than a breadth-first search of the problem space, a random walk is generated. As we have discovered, the time taken to run an exhaustive model check rapidly expands, as the model becomes more complicated.

If you have added constraints to prevent the model running forever (e.g. in our case, the disconnection limit, and not re-using registered names), you can remove those now. When you run the spec:

$ docker run --rm -it -v $PWD:/app -w /app amazoncorretto:8-alpine3.17-jdk java -cp tla2tools.jar tlc2.TLC -simulate syn.tla
TLC2 Version 2.18 of 20 March 2023 (rev: 3ea3222)
Running Random Simulation with seed -7079380368092791143 with 1 worker on 8 cores with 3488MB heap and 64MB offheap memory (Linux 6.7.2-arch1-1 amd64, Amazon.com Inc. 1.8.0_392 x86_64).
...

it gives you the seed, so you can run the same tests again after fixing any issues. TLC will run forever in this mode, if there aren’t any failures, but we are not so lucky:

Error: Invariant ThereCanBeOnlyOne is violated.
Error: The behavior up to this point is:

...

State 34: <Next line 290, col 5 to line 303, col 22 of module syn>
/\ disconnections = 6
/\ locally_registered = ( n1 :> (n1 :> (a :> 2 @@ b :> 26) @@ n2 :> <<>> @@ n3 :> <<>>) @@
  n2 :> (n1 :> <<>> @@ n2 :> (a :> 22) @@ n3 :> <<>>) @@
  n3 :> (n1 :> (a :> 2) @@ n2 :> (a :> 22) @@ n3 :> << >>) )
/\ time = 33
/\ states = << <<"Register", n1, a>>,
   <<"Register", n2, a>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Disconnect", n2, n1>>,
   <<"Register", n1, b>>,
   <<"Disconnect", n2, n3>>,
   <<"Register", n2, b>>,
   <<"SyncRegister", n3, a>>,
   <<"Register", n1, c>>,
   <<"Reconnect", n3, n2>>,
   <<"Unregister", n1, a>>,
   <<"Register", n2, c>>,
   <<"Disconnect", n2, n3>>,
   <<"Down", n3, n2>>,
   <<"Reconnect", n3, n2>>,
   <<"Register", n1, d>>,
   <<"Unregister", n1, a>>,
   <<"Unregister", n1, a>>,
   <<"Reconnect", n1, n2>>,
   <<"Disconnect", n2, n1>>,
   <<"Reconnect", n2, n1>>,
   <<"Disconnect", n2, n1>>,
   <<"RegisterOrUpdateOnNode", n2, a>>,
   <<"Register", n1, e>>,
   <<"Down", n1, n2>>,
   <<"Discover", n3, n2>>,
   <<"RegisterOrUpdateOnNode", n1, b>>,
   <<"Down", n3, n2>>,
   <<"Register", n1, f>>,
   <<"Disconnect", n1, n3>>,
   <<"Register", n3, b>>,
   <<"Discover", n3, n2>>,
   <<"SyncRegister", n3, a>> >>
/\ visible_nodes = (n1 :> {} @@ n2 :> {n3} @@ n3 :> {n2})
/\ registered = (a :> 2 @@ b :> 1 @@ c :> 0 @@ d :> 0 @@ e :> 0 @@ f :> 0 @@ g :> 0 @@ h :> 0)
/\ inbox = (n1 :> <<[action |-> "register_or_update_on_node", name |-> c], [action |-> "unregister_on_node", name |-> a], [action |-> "register_or_update_on_node", name |-> d], [action |-> "unregister_on_node", name |-> a], [action |-> "unregister_on_node", name |-> a], [action |-> "discover", from |-> n2], [action |-> "DOWN", from |-> n2], [action |-> "discover", from |-> n2], [action |-> "DOWN", from |-> n2], [action |-> "register_or_update_on_node", name |-> e], [action |-> "register_or_update_on_node", name |-> f], [action |-> "DOWN", from |-> n3]>> @@ n2 :> <<[time |-> 2, action |-> "sync_register", name |-> a, from |-> n1], [action |-> "DOWN", from |-> n1], [action |-> "DOWN", from |-> n3], [action |-> "register_or_update_on_node", name |-> b], [action |-> "discover", from |-> n3], [action |-> "register_or_update_on_node", name |-> c], [action |-> "DOWN", from |-> n3], [action |-> "discover", from |-> n3], [action |-> "discover", from |-> n1], [action |-> "DOWN", from |-> n1], [action |-> "discover", from |-> n1], [action |-> "DOWN", from |-> n1], [action |-> "ack_sync", from |-> n3, local_data |-> << >>], [action |-> "ack_sync", from |-> n3, local_data |-> << >>]>> @@ n3 :> <<[time |-> 26, action |-> "sync_register", name |-> b, from |-> n1], [action |-> "DOWN", from |-> n1], [action |-> "register_or_update_on_node", name |-> b]>>)
/\ names = (n1 :> {a, g, h} @@ n2 :> {d, e, f, g, h} @@ n3 :> {a, c, d, e, f, g, h})

The stack trace is much deeper now (34 transitions), and with 3 nodes it is much harder to understand what has gone wrong. This looks like an issue with my conflict resolution handling though, I am only checking for conflicts between the current node, and the sender. This works fine with a two node cluster, but not with 3 (or more).

I could probably change that check to \in RegisteredOnThisNode(n), but I would need to know which node had registered it, to resolve the conflict; and whatever change I make will also need to be reproduced in AckSync too. It might be time to bite the bullet, and factor out the conflict resolution (yet more recursion).

It’s a Syn (Part 11)

February 17, 2024February 16, 2024Graham HayLeave a comment

Assuming that it is possible to end up with a 🧟 pid, I think one solution would be to send a message when the conflict resolution kills a process:

SyncRegister(n) ==
    ...
       /\ inbox' = [o \in Nodes |->
            (IF (o = n) THEN
                Tail(inbox[o])
            ELSE
                IF keep_remote THEN
                    Append(inbox[o], [action |-> "killed", name |-> message.name, time |-> locally_registered[n][n][message.name], from |-> n])
                ELSE
                    inbox[o])
        ]

and then remove any exact matches, i.e. registered with the same name, at the same time:

Killed(n) ==
    /\ Len(inbox[n]) > 0
    /\ Head(inbox[n]).action = "killed"
    /\ LET message == Head(inbox[n])
        exact_match == message.name \in DOMAIN locally_registered[n][message.from] /\ locally_registered[n][message.from][message.name] = message.time
        l == (IF exact_match THEN
                [r \in (DOMAIN locally_registered[n][message.from] \ {message.name}) |-> locally_registered[n][message.from][r]]
            ELSE
                locally_registered[n][message.from])
        IN inbox' = [inbox EXCEPT![n] = Tail(inbox[n])]
        /\ locally_registered' = [locally_registered EXCEPT![n] = ([locally_registered[n] EXCEPT![message.from] = l])]

with this fix, I could run the model significantly longer without failure (but sadly ran out of memory on my laptop: I’m not sure I’m interested enough to pay AWS for the answer).

It’s a Syn (Part 10)

February 15, 2024February 15, 2024Graham HayLeave a comment

If we also update the refcount, when a conflict occurs during sync:

AckSync(n) ==
        ...
        conflicts == DOMAIN locally_registered[n][n] \intersect DOMAIN message.local_data
        c1 == [c \in { r \in conflicts : message.local_data[r] > locally_registered[n][n][r] } |-> registered[c] - 1]
        c2 == [c \in { r \in conflicts : locally_registered[n][n][r] > message.local_data[r] } |-> registered[c]]
        IN ...
        /\ registered' = c1 @@ c2 @@ [r \in (DOMAIN registered \ conflicts) |-> registered[r]]

we get another failure (I’ll paste the full stack trace, this time):

Error: Invariant AllRegistered is violated.
Error: The behavior up to this point is:
State 1: <Initial predicate>
/\ disconnections = 0
/\ locally_registered = (n1 :> (n1 :> <<>> @@ n2 :> <<>>) @@ n2 :> (n1 :> <<>> @@ n2 :> <<>>))
/\ time = 0
/\ states = <<>>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 0)
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ names = (n1 :> {a} @@ n2 :> {a})

State 2: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 0
/\ locally_registered = (n1 :> (n1 :> <<>> @@ n2 :> <<>>) @@ n2 :> (n1 :> <<>> @@ n2 :> <<>>))
/\ time = 1
/\ states = <<<<"Register", n1, a>>>>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 0)
/\ inbox = (n1 :> <<[action |-> "register_or_update_on_node", name |-> a]>> @@ n2 :> <<>>)
/\ names = (n1 :> {} @@ n2 :> {a})

State 3: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> <<>> @@ n2 :> <<>>) @@ n2 :> (n1 :> <<>> @@ n2 :> <<>>))
/\ time = 2
/\ states = <<<<"Register", n1, a>>, <<"Disconnect", n1, n2>>>>
/\ visible_nodes = (n1 :> {} @@ n2 :> {})
/\ registered = (a :> 0)
/\ inbox = ( n1 :>
      << [action |-> "register_or_update_on_node", name |-> a],
         [action |-> "DOWN", from |-> n2] >> @@
  n2 :> <<[action |-> "DOWN", from |-> n1]>> )
/\ names = (n1 :> {} @@ n2 :> {a})

The two nodes (n1 & n2) are disconnected:

State 4: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> (a :> 2) @@ n2 :> <<>>) @@ n2 :> (n1 :> <<>> @@ n2 :> <<>>))
/\ time = 3
/\ states = << <<"Register", n1, a>>,
   <<"Disconnect", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>> >>
/\ visible_nodes = (n1 :> {} @@ n2 :> {})
/\ registered = (a :> 1)
/\ inbox = ( n1 :> <<[action |-> "DOWN", from |-> n2]>> @@
  n2 :> <<[action |-> "DOWN", from |-> n1]>> )
/\ names = (n1 :> {} @@ n2 :> {a})

State 5: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> (a :> 2) @@ n2 :> <<>>) @@ n2 :> (n1 :> <<>> @@ n2 :> <<>>))
/\ time = 4
/\ states = << <<"Register", n1, a>>,
   <<"Disconnect", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Down", n1, n2>> >>
/\ visible_nodes = (n1 :> {} @@ n2 :> {})
/\ registered = (a :> 1)
/\ inbox = (n1 :> <<>> @@ n2 :> <<[action |-> "DOWN", from |-> n1]>>)
/\ names = (n1 :> {} @@ n2 :> {a})

State 6: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> (a :> 2) @@ n2 :> <<>>) @@ n2 :> (n1 :> <<>> @@ n2 :> <<>>))
/\ time = 5
/\ states = << <<"Register", n1, a>>,
   <<"Disconnect", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Down", n1, n2>>,
   <<"Register", n2, a>> >>
/\ visible_nodes = (n1 :> {} @@ n2 :> {})
/\ registered = (a :> 1)
/\ inbox = ( n1 :> <<>> @@
  n2 :>
      << [action |-> "DOWN", from |-> n1],
         [action |-> "register_or_update_on_node", name |-> a] >> )
/\ names = (n1 :> {} @@ n2 :> {})

State 7: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> (a :> 2) @@ n2 :> <<>>) @@ n2 :> (n1 :> <<>> @@ n2 :> <<>>))
/\ time = 6
/\ states = << <<"Register", n1, a>>,
   <<"Disconnect", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Down", n1, n2>>,
   <<"Register", n2, a>>,
   <<"Down", n2, n1>> >>
/\ visible_nodes = (n1 :> {} @@ n2 :> {})
/\ registered = (a :> 1)
/\ inbox = (n1 :> <<>> @@ n2 :> <<[action |-> "register_or_update_on_node", name |-> a]>>)
/\ names = (n1 :> {} @@ n2 :> {})

State 8: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> (a :> 2) @@ n2 :> <<>>) @@ n2 :> (n1 :> <<>> @@ n2 :> (a :> 6)))
/\ time = 7
/\ states = << <<"Register", n1, a>>,
   <<"Disconnect", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Down", n1, n2>>,
   <<"Register", n2, a>>,
   <<"Down", n2, n1>>,
   <<"RegisterOrUpdateOnNode", n2, a>> >>
/\ visible_nodes = (n1 :> {} @@ n2 :> {})
/\ registered = (a :> 2)
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ names = (n1 :> {} @@ n2 :> {})

During the netsplit, both manage to register the same name (“a”)

State 9: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> (a :> 2) @@ n2 :> <<>>) @@ n2 :> (n1 :> <<>> @@ n2 :> (a :> 6)))
/\ time = 8
/\ states = << <<"Register", n1, a>>,
   <<"Disconnect", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Down", n1, n2>>,
   <<"Register", n2, a>>,
   <<"Down", n2, n1>>,
   <<"RegisterOrUpdateOnNode", n2, a>>,
   <<"Reconnect", n2, n1>> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 2)
/\ inbox = ( n1 :> <<[action |-> "discover", from |-> n2]>> @@
  n2 :> <<[action |-> "discover", from |-> n1]>> )
/\ names = (n1 :> {} @@ n2 :> {})

State 10: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> (a :> 2) @@ n2 :> <<>>) @@ n2 :> (n1 :> <<>> @@ n2 :> (a :> 6)))
/\ time = 9
/\ states = << <<"Register", n1, a>>,
   <<"Disconnect", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Down", n1, n2>>,
   <<"Register", n2, a>>,
   <<"Down", n2, n1>>,
   <<"RegisterOrUpdateOnNode", n2, a>>,
   <<"Reconnect", n2, n1>>,
   <<"Unregister", n2, a>> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 2)
/\ inbox = ( n1 :> <<[action |-> "discover", from |-> n2]>> @@
  n2 :>
      << [action |-> "discover", from |-> n1],
         [action |-> "unregister_on_node", name |-> a] >> )
/\ names = (n1 :> {} @@ n2 :> {})

State 11: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> (a :> 2) @@ n2 :> <<>>) @@ n2 :> (n1 :> <<>> @@ n2 :> (a :> 6)))
/\ time = 10
/\ states = << <<"Register", n1, a>>,
   <<"Disconnect", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Down", n1, n2>>,
   <<"Register", n2, a>>,
   <<"Down", n2, n1>>,
   <<"RegisterOrUpdateOnNode", n2, a>>,
   <<"Reconnect", n2, n1>>,
   <<"Unregister", n2, a>>,
   <<"Discover", n1, n2>> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 2)
/\ inbox = (n1 :> <<>> @@ n2 :> <<[action |-> "discover", from |-> n1], [action |-> "unregister_on_node", name |-> a], [action |-> "ack_sync", from |-> n1, local_data |-> (a :> 2)]>>)
/\ names = (n1 :> {} @@ n2 :> {})

State 12: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> (a :> 2) @@ n2 :> <<>>) @@ n2 :> (n1 :> <<>> @@ n2 :> (a :> 6)))
/\ time = 11
/\ states = << <<"Register", n1, a>>,
   <<"Disconnect", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Down", n1, n2>>,
   <<"Register", n2, a>>,
   <<"Down", n2, n1>>,
   <<"RegisterOrUpdateOnNode", n2, a>>,
   <<"Reconnect", n2, n1>>,
   <<"Unregister", n2, a>>,
   <<"Discover", n1, n2>>,
   <<"Discover", n2, n1>> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 2)
/\ inbox = (n1 :> <<[action |-> "ack_sync", from |-> n2, local_data |-> (a :> 6)]>> @@ n2 :> <<[action |-> "unregister_on_node", name |-> a], [action |-> "ack_sync", from |-> n1, local_data |-> (a :> 2)]>>)
/\ names = (n1 :> {} @@ n2 :> {})

State 13: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> << >> @@ n2 :> (a :> 6)) @@ n2 :> (n1 :> <<>> @@ n2 :> (a :> 6)))
/\ time = 12
/\ states = << <<"Register", n1, a>>,
   <<"Disconnect", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Down", n1, n2>>,
   <<"Register", n2, a>>,
   <<"Down", n2, n1>>,
   <<"RegisterOrUpdateOnNode", n2, a>>,
   <<"Reconnect", n2, n1>>,
   <<"Unregister", n2, a>>,
   <<"Discover", n1, n2>>,
   <<"Discover", n2, n1>>,
   <<"AckSync", n1, n2>> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 1)
/\ inbox = (n1 :> <<>> @@ n2 :> <<[action |-> "unregister_on_node", name |-> a], [action |-> "ack_sync", from |-> n1, local_data |-> (a :> 2)]>>)
/\ names = (n1 :> {} @@ n2 :> {})

State 14: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> << >> @@ n2 :> (a :> 6)) @@ n2 :> (n1 :> <<>> @@ n2 :> << >>))
/\ time = 13
/\ states = << <<"Register", n1, a>>,
   <<"Disconnect", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Down", n1, n2>>,
   <<"Register", n2, a>>,
   <<"Down", n2, n1>>,
   <<"RegisterOrUpdateOnNode", n2, a>>,
   <<"Reconnect", n2, n1>>,
   <<"Unregister", n2, a>>,
   <<"Discover", n1, n2>>,
   <<"Discover", n2, n1>>,
   <<"AckSync", n1, n2>>,
   <<"UnregisterOnNode", n2, a>> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 0)
/\ inbox = (n1 :> <<[action |-> "sync_unregister", name |-> a, from |-> n2]>> @@ n2 :> <<[action |-> "ack_sync", from |-> n1, local_data |-> (a :> 2)]>>)
/\ names = (n1 :> {} @@ n2 :> {})

State 15: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> << >> @@ n2 :> << >>) @@ n2 :> (n1 :> <<>> @@ n2 :> << >>))
/\ time = 14
/\ states = << <<"Register", n1, a>>,
   <<"Disconnect", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Down", n1, n2>>,
   <<"Register", n2, a>>,
   <<"Down", n2, n1>>,
   <<"RegisterOrUpdateOnNode", n2, a>>,
   <<"Reconnect", n2, n1>>,
   <<"Unregister", n2, a>>,
   <<"Discover", n1, n2>>,
   <<"Discover", n2, n1>>,
   <<"AckSync", n1, n2>>,
   <<"UnregisterOnNode", n2, a>>,
   <<"SyncUnregister", n1, a>> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 0)
/\ inbox = (n1 :> <<>> @@ n2 :> <<[action |-> "ack_sync", from |-> n1, local_data |-> (a :> 2)]>>)
/\ names = (n1 :> {} @@ n2 :> {})

State 16: <Next line 249, col 5 to line 261, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> << >> @@ n2 :> << >>) @@ n2 :> (n1 :> (a :> 2) @@ n2 :> << >>))
/\ time = 15
/\ states = << <<"Register", n1, a>>,
   <<"Disconnect", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Down", n1, n2>>,
   <<"Register", n2, a>>,
   <<"Down", n2, n1>>,
   <<"RegisterOrUpdateOnNode", n2, a>>,
   <<"Reconnect", n2, n1>>,
   <<"Unregister", n2, a>>,
   <<"Discover", n1, n2>>,
   <<"Discover", n2, n1>>,
   <<"AckSync", n1, n2>>,
   <<"UnregisterOnNode", n2, a>>,
   <<"SyncUnregister", n1, a>>,
   <<"AckSync", n2, n1>> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 0)
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ names = (n1 :> {} @@ n2 :> {})

Once re-connected, the sync takes place; but one node also un-registers the name, and we somehow end up with a zombie registration (to a pid that was killed by the conflict resolution).

So far, I’ve not been able to find anything in the Syn code (or the Erlang VM) that prevents this; but I’m quite open to it being a problem with my model 🤷

It’s a Syn (Part 9)

February 8, 2024February 8, 2024Graham HayLeave a comment

We are now handling registration conflicts correctly, but the invariant is being violated due to limitations in change tracking in our model.

We can move from a simple set, to refcounting:

Init ==
     ...
    /\ registered = [n \in Names |-> 0]

RegisterOrUpdateOnNode(n) ==
    ...
    registered' = [registered EXCEPT![message.name] = @ + 1]

If a name is registered on two nodes, it will look something like this: [a |-> 2]. Running this, we get a failure:

Error: Invariant AllRegistered is violated.
...

State 10: <Next line 235, col 5 to line 247, col 22 of module syn>
/\ disconnections = 0
/\ locally_registered = (n1 :> (n1 :> << >> @@ n2 :> (a :> 3)) @@ n2 :> (n1 :> << >> @@ n2 :> (a :> 3)))
/\ time = 9
/\ states = << <<"Register", n1, a>>,
   <<"Register", n2, a>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"RegisterOrUpdateOnNode", n2, a>>,
   <<"Unregister", n1, a>>,
   <<"SyncRegister", n1, a>>,
   <<"UnregisterOnNode", n1, a>>,
   <<"SyncRegister", n2, a>>,
   <<"SyncUnregister", n2, a>> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 0)
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ names = (n1 :> {} @@ n2 :> {})

It looks like we think a has been unregistered everywhere, when it hasn’t. If we update the condition for unregister, we now get a more interesting failure:

Error: Invariant ThereCanBeOnlyOne is violated.
...

State 12: <Next line 238, col 5 to line 250, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = ( n1 :> (n1 :> (a :> 3) @@ n2 :> <<>>) @@
  n2 :> (n1 :> (a :> 3) @@ n2 :> (a :> 9)) )
/\ time = 11
/\ states = << <<"Disconnect", n1, n2>>,
   <<"Register", n1, a>>,
   <<"Down", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Reconnect", n1, n2>>,
   <<"Register", n2, a>>,
   <<"Discover", n1, n2>>,
   <<"Down", n2, n1>>,
   <<"Discover", n2, n1>>,
   <<"RegisterOrUpdateOnNode", n2, a>>,
   <<"AckSync", n2, n1>> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 2)
/\ inbox = (n1 :> <<[action |-> "ack_sync", from |-> n2, local_data |-> <<>>], [time |-> 9, action |-> "sync_register", name |-> a, from |-> n2]>> @@ n2 :> <<>>)
/\ names = (n1 :> {} @@ n2 :> {})

Connor MacLeod would not be pleased. Currently, when a node re-joins the cluster, we are just taking its version of the data as the truth. We also need to implement conflict resolution here:

MergeRegistries(local, remote, local_node, remote_node) ==
    LET local_merged == [rr \in {r \in DOMAIN local[local_node] : (r \notin DOMAIN remote \/ local[local_node][r] > remote[r])} |-> local[local_node][rr]]
        remote_merged == [rr \in {r \in DOMAIN remote : (r \notin DOMAIN local[local_node] \/ remote[r] > local[local_node][r])} |-> remote[rr]]
    IN [r \in DOMAIN local |-> CASE
        (r = remote_node) -> remote_merged
        [] (r = local_node) -> local_merged
        [] OTHER -> local[r]
    ]

AckSync(n) ==
    ...
    l == MergeRegistries(locally_registered[n], message.local_data, n, message.from)

The real implementation actually uses the same code as for registration, which is probably sensible; so I might need to refactor this later.

Running this looked like it might run forever (again), as I think I added another stuttering case by allowing multiple attempts to register the same name. If we remove that, we get a pretty long run; but eventually it still fails:

Error: Invariant AllRegistered is violated.
...

State 16: <Next line 247, col 5 to line 259, col 22 of module syn>
/\ disconnections = 1
/\ locally_registered = (n1 :> (n1 :> << >> @@ n2 :> << >>) @@ n2 :> (n1 :> << >> @@ n2 :> << >>))
/\ time = 15
/\ states = << <<"Disconnect", n1, n2>>,
   <<"Register", n1, a>>,
   <<"Down", n1, n2>>,
   <<"RegisterOrUpdateOnNode", n1, a>>,
   <<"Register", n2, a>>,
   <<"Down", n2, n1>>,
   <<"RegisterOrUpdateOnNode", n2, a>>,
   <<"Reconnect", n1, n2>>,
   <<"Discover", n1, n2>>,
   <<"Discover", n2, n1>>,
   <<"AckSync", n1, n2>>,
   <<"Unregister", n2, a>>,
   <<"AckSync", n2, n1>>,
   <<"UnregisterOnNode", n2, a>>,
   <<"SyncUnregister", n1, a>> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = (a :> 1)
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ names = (n1 :> {} @@ n2 :> {})

And this time, it’s the other way round. We think a should still be registered, somewhere, but it’s not. I think the problem here is that when we resolve conflicts on sync, we aren’t updating the refcount.

It’s a Syn (Part 7)

January 31, 2024Graham HayLeave a comment

Before we can resolve conflicts, we have to generate some. We can let each node try and register the same pool of names:

Init ==
     ...
    /\ names = [n \in Nodes |-> 1..MaxValues]

 Register(n) ==
    /\ names[n] # {}
    /\ LET next_val == CHOOSE o \in (names[n]): TRUE
         IN ...
        /\ names' = [names EXCEPT![n] = names[n] \ {next_val}]

This produces an immediate failure, and reveals that we have a race even without disconnection:

...

State 3: <Next line 139, col 5 to line 149, col 22 of module syn>
/\ states = <<<<"Register", n1, 1>>, <<"Register", n2, 1>>>>
/\ names = (n1 :> {} @@ n2 :> {})
/\ inbox = ( n1 :> <<[action |-> "sync_register", name |-> 1, from |-> n2]>> @@
  n2 :> <<[action |-> "sync_register", name |-> 1, from |-> n1]>> )
/\ locally_registered = (n1 :> (n1 :> {1} @@ n2 :> {}) @@ n2 :> (n1 :> {} @@ n2 :> {1}))
/\ disconnections = 0
/\ registered = {1}
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})

Both nodes can attempt to register the same name, even when connected, if it’s before the sync_register message has arrived/been processed.

Let’s park that, for now. Our conflict resolution process will pick the most recent registration as the winner, so we will need the concept of time:

Init ==
    ...
    /\ time = 0

Register(n) ==
    ...
    /\ time' = time + 1

A counter incremented by every state transition. Now, instead of a storing registrations in a set, we will have a struct (function) of name -> time:

 Init ==
    ...
    /\ locally_registered = [n1 \in Nodes |-> [n2 \in Nodes |-> <<>>]]
    /\ names = {"a"}

I was seeing some confusing behaviour from TLC when using integer keys for the registered names, so I switched to a set of strings.

Register(n) ==
    /\ names # {}
    /\ LET next_val == CHOOSE o \in names: TRUE
        l == [locally_registered[n] EXCEPT![n] = locally_registered[n][n] @@ [r \in {next_val} |-> time]]
        IN registered' = registered \union {next_val}
        /\ locally_registered' = [locally_registered EXCEPT![n] = l]
        /\ names' = names \ {next_val}
    ...

SyncRegister(n) ==
    ...
    /\ LET message == Head(inbox[n])
        l == [locally_registered[n] EXCEPT![message.from] = locally_registered[n][message.from] @@ [r \in {message.name} |-> message.time]]

now locally_registered will look something like this:

( n1 :> (n1 :> << >> @@ n2 :> [a |-> 0]) @@ n2 :> (n1 :> << >> @@ n2 :> [a |-> 0]) )

This model still has no failures, as we can only register a key once. If that limitation is removed, we need to handle a conflict:

        conflict == message.name \in DOMAIN locally_registered[n][n]
        l == [o \in DOMAIN locally_registered[n] |-> CASE
            (o = message.from) -> (
                IF conflict /\ locally_registered[n][n][message.name] > message.time THEN
                    locally_registered[n][message.from]
                ELSE
                    locally_registered[n][message.from] @@ [r \in {message.name} |-> message.time]
            )
            [] (o = n) ->
                IF conflict /\ locally_registered[n][n][message.name] > message.time THEN
                    locally_registered[n][n]
                ELSE
                    [r \in (DOMAIN locally_registered[n][n] \ {message.name}) |-> locally_registered[n][n][r]]
            [] OTHER -> locally_registered[n][o]
        ]

We keep whichever process was registered with a more recent time. I also added the “Highlander” invariant, to guarantee that we can’t keep both:

 INVARIANTS
...
ThereCanBeOnlyOne

RECURSIVE Duplicates(_, _, _)

Duplicates(keys, struct, acc) ==
    IF Cardinality(keys) < 2 THEN acc
    ELSE
        LET k1 == CHOOSE k \in keys: TRUE
        k2 == CHOOSE k \in (keys \ {k1}): TRUE
        duplicates == DOMAIN struct[k1] \intersect DOMAIN struct[k2]
        IN Duplicates(keys \ {k1}, struct, duplicates \union acc)

ThereCanBeOnlyOne ==
    \A n \in Nodes:
        Duplicates(DOMAIN locally_registered[n], locally_registered[n], {}) = {}

We immediately hit a failure:

...

State 4: <Next line 176, col 5 to line 186, col 22 of module syn>
/\ states = <<<<"Register", n1, "a">>, <<"Register", n2, "a">>, <<"Unregister", n1, "a">>>>
/\ time = 3
/\ names = (n1 :> {} @@ n2 :> {})
/\ inbox = ( n1 :>
      <<[time |-> 1, action |-> "sync_register", name |-> "a", from |-> n2]>> @@
  n2 :>
      << [time |-> 0, action |-> "sync_register", name |-> "a", from |-> n1],
         [action |-> "sync_unregister", name |-> "a", from |-> n1] >> )
/\ locally_registered = (n1 :> (n1 :> << >> @@ n2 :> <<>>) @@ n2 :> (n1 :> <<>> @@ n2 :> [a |-> 1]))
/\ disconnections = 0
/\ registered = {}
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})

Both nodes have tried to register the same name, but before we have tried to resolve the conflict, one node has removed its registration; and even though the other node still has a (valid) registration, we are now expecting it to not exist anywhere.

The easiest way to solve this, is to check all other nodes, before removing the name from the canonical set:

RECURSIVE AllRegisteredNames(_, _, _)

AllRegisteredNames(nodes, locals, registrations) ==
    IF nodes = {} THEN registrations
    ELSE
        LET n == CHOOSE n \in nodes: TRUE
        IN AllRegisteredNames(nodes \ {n}, locals, registrations \union DOMAIN locals[n][n])

RegisteredElsewhere(node) ==
    AllRegisteredNames(AllOtherNodes(node), locally_registered, {})

Unregister(n) ==
    ...
     /\ LET item_to_remove == ItemToRemove(n)
        IN registered' = (IF item_to_remove \in RegisteredElsewhere(n) THEN registered ELSE registered \ {item_to_remove})

Now the failure has morphed slightly:

...

State 7: <Next line 187, col 5 to line 197, col 22 of module syn>
/\ disconnections = 0
/\ locally_registered = (n1 :> (n1 :> << >> @@ n2 :> << >>) @@ n2 :> (n1 :> [a |-> 0] @@ n2 :> << >>))
/\ time = 6
/\ states = << <<"Register", n1, "a">>,
   <<"Register", n2, "a">>,
   <<"SyncRegister", n1, "a">>,
   <<"Unregister", n2, "a">>,
   <<"SyncUnregister", n1, "a">>,
   <<"SyncRegister", n2, "a">> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = {}
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ names = (n1 :> {} @@ n2 :> {})

And the 👻 registration is actually the process that lost the conflict resolution battle.

This caused a fair bit of head scratching, and I nearly persuaded myself that I had found a real race condition. My first thought was that a sync_register message for a dead process would be discarded, but erlang:is_process_alive can only be used on a local pid.

Now I think that the root cause is that I am treating register/unregister as sync calls (when in reality they use gen_server:call); which has allowed them to push into the queue, ahead of the sync_register message.

We’ll see if I’m right, next time!

It’s a Syn (Part 8)

January 31, 2024February 3, 2024Graham HayLeave a comment

If we make the register request async:

Register(n) ==
    ...
            IN inbox' = [inbox EXCEPT![n] = Append(inbox[n], [action |-> "register_or_update_on_node", name |-> next_val])]

RegisterOrUpdateOnNode(n) ==
    /\ Len(inbox[n]) > 0
    /\ Head(inbox[n]).action = "register_or_update_on_node"
    /\ LET message == Head(inbox[n])
        l == [locally_registered[n] EXCEPT![n] = locally_registered[n][n] @@ [r \in {message.name} |-> time]]
        already_registered == message.name \in AllRegisteredForNode(locally_registered[n])
        IN
        (IF already_registered THEN
            \* {error, taken}
            registered' = registered
            /\ locally_registered' = locally_registered
            /\ inbox' = [inbox EXCEPT![n] = Tail(inbox[n])]
            /\ names' = names
        ELSE
            registered' = registered \union {message.name}
            /\ locally_registered' = [locally_registered EXCEPT![n] = l]
            /\ inbox' = [o \in Nodes |-> CASE
                (o = n) -> Tail(inbox[n])
                [] (o \in visible_nodes[n]) -> Append(inbox[o], [action |-> "sync_register", name |-> message.name, from |-> n, time |-> time])
                [] OTHER -> inbox[o]
            ]
            /\ names' = [names EXCEPT![n] = names[n] \ {message.name}]
        )
        /\ states' = Append(states, <<"RegisterOrUpdateOnNode", n, message.name>>)
    /\ time' = time + 1
    /\ UNCHANGED <<visible_nodes, disconnections>>

we now need to handle the case that the name was registered in the new race we have added.

Other than that the process is the same, and the same vars are updated. If we do the same for unregister:

Unregister(n) ==
        ...
        IN inbox' = [inbox EXCEPT![n] = Append(inbox[n], [action |-> "unregister_on_node", name |-> item_to_remove])]
 
UnregisterOnNode(n) ==
    /\ Len(inbox[n]) > 0
    /\ Head(inbox[n]).action = "unregister_on_node"
    /\ LET message == Head(inbox[n])
        l == [r \in (DOMAIN locally_registered[n][n] \ {message.name}) |-> locally_registered[n][n][r]]
        already_removed == message.name \notin AllRegisteredForNode(locally_registered[n])
        IN
        (IF already_removed THEN
            \* {error, undefined}
            registered' = registered
            /\ locally_registered' = locally_registered
            /\ inbox' = [inbox EXCEPT![n] = Tail(inbox[n])]
            /\ names' = names
        ELSE
            registered' = (IF message.name \in RegisteredElsewhere(n) THEN registered ELSE registered \ {message.name})
            /\ locally_registered' = [locally_registered EXCEPT![n] = ([locally_registered[n] EXCEPT![n] = l])]
            /\ inbox' = [o \in Nodes |-> CASE
                (o = n) -> Tail(inbox[n])
                [] (o \in visible_nodes[n]) -> Append(inbox[o], [action |-> "sync_unregister", name |-> message.name, from |-> n])
                [] OTHER -> inbox[o]
            ]
            /\ names' = [names EXCEPT![n] = names[n] \ {message.name}]
        )
        /\ states' = Append(states, <<"UnregisterOnNode", n, message.name>>)
    /\ time' = time + 1
    /\ UNCHANGED <<visible_nodes, disconnections>>

we still have a failure of the same invariant:

Error: Invariant AllRegistered is violated.
Error: The behavior up to this point is:
...

State 10: <Next line 234, col 5 to line 246, col 22 of module syn>
/\ disconnections = 0
/\ locally_registered = (n1 :> (n1 :> << >> @@ n2 :> << >>) @@ n2 :> (n1 :> <<>> @@ n2 :> << >>))
/\ time = 9
/\ states = << <<"Register", n1, "a">>,
   <<"Register", n2, "a">>,
   <<"RegisterOrUpdateOnNode", n1, "a">>,
   <<"RegisterOrUpdateOnNode", n2, "a">>,
   <<"SyncRegister", n2, "a">>,
   <<"Unregister", n2, "a">>,
   <<"UnregisterOnNode", n2, "a">>,
   <<"SyncRegister", n1, "a">>,
   <<"SyncUnregister", n1, "a">> >>
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})
/\ registered = {"a"}
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ names = (n1 :> {} @@ n2 :> {})

but now the problem is that our canonical list of registrations still thinks the name should be there, even though it was removed.

I think this fix was too simplistic, and I will need to go back to my first idea: of refcounting registrations.

It’s a Syn (Part 6)

January 22, 2024February 8, 2024Graham HayLeave a comment

So the other half of the protocol requires that registrations for a disconnected node are removed. First, we send a DOWN message on disconnect (to both nodes):

Disconnect(n) ==
          ...
        /\ inbox' = [o \in Nodes |-> CASE
            (o = n) -> Append(inbox[o], [action |-> "DOWN", from |-> other_node])
            [] (o = other_node) -> Append(inbox[o], [action |-> "DOWN", from |-> n])
            [] OTHER -> inbox[o]
        ]

And just do nothing with the message, for now:

Down(n) ==
    /\ Len(inbox[n]) > 0
    /\ Head(inbox[n]).action = "DOWN"
    /\ inbox' = [inbox EXCEPT![n] = Tail(inbox[n])]
    /\ states' = Append(states, "Down")
    /\ UNCHANGED <<registered, next_val, visible_nodes, locally_registered>>

Next ==
    /\ \E n \in Nodes:
        ...
        \/ Down(n)

To be able to remove the registrations for the disconnected node, we need to track who has registered what. There’s probably more than one way to achieve this, but I decided to convert my locally_registered struct into a deeper HashMap<Node, HashMap<Node, Set>>; so that each node has its own map of what the others have registered:

 Init ==
    ...
    /\ locally_registered = [n1 \in Nodes |-> [n2 \in Nodes |-> {}]]

This makes updates more complicated:

 Register(n) ==
     /\ next_val < MaxValues
     /\ registered' = registered \union {next_val}
     /\ LET l == [locally_registered[n] EXCEPT![n] = locally_registered[n][n] \union {next_val}]
        IN locally_registered' = [locally_registered EXCEPT![n] = l]

which we will just have to live with. And the invariant needs to create a superset for comparison:

RECURSIVE ReduceStruct(_, _, _)

ReduceStruct(keys, struct, acc) ==
    IF keys = {} THEN acc
    ELSE
        LET k == CHOOSE k \in keys: TRUE
        IN ReduceStruct(keys \ {k}, struct, acc \union struct[k])

AllRegisteredForNode(locals) ==
    ReduceStruct(DOMAIN locals, locals, {})

 AllRegistered ==
     \A n \in Nodes:
        (\A o \in Nodes: Len(inbox[o]) = 0) /\ visible_nodes[n] = AllOtherNodes(n) => AllRegisteredForNode(locally_registered[n]) = registered

This is pretty fugly, but if there’s a built in way to do it, I couldn’t find it. Finally, we can use this extra info, to remove the remote registrations:

Down(n) ==
    ...
    /\ LET message == Head(inbox[n])
        l == [locally_registered[n] EXCEPT![message.from] = {}]
        IN locally_registered' = [locally_registered EXCEPT![n] = l]

If we run this, the error seems to be gone, but it also looks like it is going to run forever (again). So we also need to limit the number of disconnections:

Init ==
     ...
    /\ disconnections = 0

 Disconnect(n) ==
    /\ disconnections < MaxDisconnections
    ...
    /\ disconnections' = disconnections + 1

Back to a working system! But even with a strictly limited number of nodes/registrations/disconnects:

CONSTANTS
Nodes = {n1, n2}
MaxValues = 2 
MaxDisconnections = 1

we have had a combinatorial explosion of interleavings, and running this model now takes 45 mins on my laptop:

Model checking completed. No error has been found.
  Estimates of the probability that TLC did not check all reachable states
  because two distinct states had the same fingerprint:
  calculated (optimistic):  val = 2.7E-4
  based on the actual fingerprints:  val = 3.1E-4
148808949 states generated, 49693165 distinct states found, 0 states left on queue.
The depth of the complete state graph search is 17.
The average outdegree of the complete state graph is 1 (minimum is 0, the maximum 7 and the 95th percentile is 2).
Finished in 46min 15s at (2024-01-10 13:55:28)

Nearly 50M distinct states! As I said previously, this is the downside to an exhaustive search of the domain. If this is an existential question for your business (e.g. a smart contract, that can’t be fixed once out in the wild; or a spacecraft), then you might need to just rent a honking great EC2 instance and run your model for a week.

There are some possible optimisations. We can declare the nodes to be a “symmetry set“, removing any states that are mirror images of each other (about half in a 2 node cluster). This makes a huge difference:

74404477 states generated, 24846583 distinct states found, 0 states left on queue.
The depth of the complete state graph search is 17.
The average outdegree of the complete state graph is 1 (minimum is 0, the maximum 7 and the 95th percentile is 2).
Finished in 05min 03s at (2024-01-10 14:19:02)

but you need to be confident that it really doesn’t affect the outcome. Beyond that, the only option is to start removing some concurrency (i.e. combine two state transitions), and hope that those missing interleavings didn’t reveal any bugs.

And we still haven’t got to the interesting bit! As it stands, our model can never have any registration conflicts because a) we are using an incrementing integer as the key, and b) when we sync, we are unioning two sets, which would be quite happy with two nodes having registered the same name.

Next time, we will look into the default conflict resolution process.

It’s a Syn (Part 5)

January 17, 2024Graham HayLeave a comment

Once a netsplit has healed, we need to re-sync the registered processes. When two nodes are reconnected, we add a discover message to the inbox of both nodes:

Reconnect(n) ==
        ...
        /\ inbox' = [o \in Nodes |-> CASE
            (o = n) -> Append(inbox[o], [action |-> "discover", from |-> other_node])
            [] (o = other_node) -> Append(inbox[o], [action |-> "discover", from |-> n])
            [] OTHER -> inbox[o]
        ]

Handling that, sends a copy of the local data back to the sender:

Discover(n) ==
    /\ Len(inbox[n]) > 0
    /\ LET message == Head(inbox[n])
        IN message.action = "discover"
        /\ inbox' = [o \in Nodes |-> CASE
            (o = n) -> Tail(inbox[o])
            [] (o = message.from) -> Append(inbox[o], [action |-> "ack_sync", local_data |-> locally_registered[n]])
            [] OTHER -> inbox[o]
        ]

which is then merged:

AckSync(n) ==
    /\ Len(inbox[n]) > 0
    /\ Head(inbox[n]).action = "ack_sync"
    /\ inbox' = [inbox EXCEPT![n] = Tail(inbox[n])]
    /\ LET message == Head(inbox[n])
        IN locally_registered' = [locally_registered EXCEPT![n] = locally_registered[n] \union message.local_data]

We also need to change the invariant; to only check once all messages have been processed, on all nodes:

AllRegistered ==
    \A n \in Nodes:
        (\A o \in Nodes: Len(inbox[o]) = 0) /\ visible_nodes[n] = AllOtherNodes(n) => locally_registered[n] = registered

If we run this, we get a failure:

...

State 9: <Next line 111, col 5 to line 120, col 19 of module syn>
/\ states = << "Register",
   "Disconnect",
   "Unregister",
   "Reconnect",
   "Discover",
   "SyncRegister",
   "Discover",
   "AckSync" >>
/\ next_val = 1
/\ inbox = (n1 :> <<>> @@ n2 :> <<[action |-> "ack_sync", local_data |-> {}]>>)
/\ locally_registered = (n1 :> {0} @@ n2 :> {0})
/\ registered = {}
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})

State 10: <Next line 111, col 5 to line 120, col 19 of module syn>
/\ states = << "Register",
   "Disconnect",
   "Unregister",
   "Reconnect",
   "Discover",
   "SyncRegister",
   "Discover",
   "AckSync",
   "AckSync" >>
/\ next_val = 1
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ locally_registered = (n1 :> {0} @@ n2 :> {0})
/\ registered = {}
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})

I’ll spare you the full stack trace, but we now have a phantom key: it was removed from a node, during the partition, but then merged back in. Again, the problem lies with my model being too simplistic. We will look into how to resolve this, next time.

It’s a Syn (Part 4)

January 16, 2024Graham HayLeave a comment

We are starting to get to the business end. So how can we model a netsplit? The easiest thing seems to be to add a set for each node, listing the nodes that are visible from it:

AllOtherNodes(n) ==
    Nodes \ {n}

Init ==
     ...
    /\ visible_nodes = [n \in Nodes |-> AllOtherNodes(n)]

At the start, all nodes can see all other nodes. Now, when we register a new value:

Register(n) ==
     ...
    /\ inbox' = [o \in Nodes |->
                       IF o \in visible_nodes[n] THEN
                           Append(inbox[o], [action |-> "sync_register", name |-> next_val])
                       ELSE
                           inbox[o]
                   ]

we only send messages to the nodes that are still visible (same for unregister). And we add two new state transitions:

Next ==
    /\ \E n \in Nodes:
        ...
        \/ Disconnect(n)
        \/ Reconnect(n)

To disconnect:

Disconnect(n) ==
    /\ Cardinality(visible_nodes[n]) > 0
    /\ LET other_node == CHOOSE o \in visible_nodes[n]: TRUE
        IN visible_nodes' = [o \in Nodes |-> CASE
            (o = other_node) -> visible_nodes[o] \ {n}
            [] (o = n) -> visible_nodes[o] \ {other_node}
            [] OTHER -> visible_nodes[o]
        ]
    /\ UNCHANGED <<registered, locally_registered, inbox, next_val>>

If other nodes are still visible; we select a node, at random, from the set and remove it (on both sides). And reconnection is the opposite:

Reconnect(n) ==
    /\ Cardinality(AllOtherNodes(n) \ visible_nodes[n]) > 0
    /\ LET other_node == CHOOSE o \in (AllOtherNodes(n) \ visible_nodes[n]): TRUE
        IN visible_nodes' = [o \in Nodes |-> CASE
            (o = other_node) -> visible_nodes[o] \union {n}
            [] (o = n) -> visible_nodes[o] \union {other_node}
            [] OTHER -> visible_nodes[o]
        ]
    /\ UNCHANGED <<registered, locally_registered, inbox, next_val>>

We also need to update our invariant:

AllRegistered ==
    \A n \in Nodes:
        Len(inbox[n]) = 0 /\ visible_nodes[n] = AllOtherNodes(n) => locally_registered[n] = registered

to only compare the registered values, once a netsplit has healed. If we run this model:

Error: Invariant AllRegistered is violated.
Error: The behavior up to this point is:
State 1: <Initial predicate>
/\ next_val = 0
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ locally_registered = (n1 :> {} @@ n2 :> {})
/\ registered = {}
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})

State 2: <Next line 78, col 5 to line 85, col 19 of module syn>
/\ next_val = 0
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ locally_registered = (n1 :> {} @@ n2 :> {})
/\ registered = {}
/\ visible_nodes = (n1 :> {} @@ n2 :> {})

State 3: <Next line 78, col 5 to line 85, col 19 of module syn>
/\ next_val = 1
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ locally_registered = (n1 :> {0} @@ n2 :> {})
/\ registered = {0}
/\ visible_nodes = (n1 :> {} @@ n2 :> {})

State 4: <Next line 78, col 5 to line 85, col 19 of module syn>
/\ next_val = 1
/\ inbox = (n1 :> <<>> @@ n2 :> <<>>)
/\ locally_registered = (n1 :> {0} @@ n2 :> {})
/\ registered = {0}
/\ visible_nodes = (n1 :> {n2} @@ n2 :> {n1})

it fails pretty quickly. But that is what we should expect, as new keys could have been added while the network was partitioned, and we aren’t doing anything to resolve that. And that is what we will be looking into, next time.