Up till now, we have been using the default behaviour of TLC: model checking mode. But there is another tool in our quiver (mixing my metaphors there), simulation mode!
This is more like the way property testing tooling works: rather than a breadth-first search of the problem space, a random walk is generated. As we have discovered, the time taken to run an exhaustive model check rapidly expands, as the model becomes more complicated.
If you have added constraints to prevent the model running forever (e.g. in our case, the disconnection limit, and not re-using registered names), you can remove those now. When you run the spec:
$ docker run --rm -it -v $PWD:/app -w /app amazoncorretto:8-alpine3.17-jdk java -cp tla2tools.jar tlc2.TLC -simulate syn.tla
TLC2 Version 2.18 of 20 March 2023 (rev: 3ea3222)
Running Random Simulation with seed -7079380368092791143 with 1 worker on 8 cores with 3488MB heap and 64MB offheap memory (Linux 6.7.2-arch1-1 amd64, Amazon.com Inc. 1.8.0_392 x86_64).
...
it gives you the seed, so you can run the same tests again after fixing any issues. TLC will run forever in this mode, if there aren’t any failures, but we are not so lucky:
Error: Invariant ThereCanBeOnlyOne is violated.
Error: The behavior up to this point is:
...
State 34: <Next line 290, col 5 to line 303, col 22 of module syn>
/\ disconnections = 6
/\ locally_registered = ( n1 :> (n1 :> (a :> 2 @@ b :> 26) @@ n2 :> <<>> @@ n3 :> <<>>) @@
n2 :> (n1 :> <<>> @@ n2 :> (a :> 22) @@ n3 :> <<>>) @@
n3 :> (n1 :> (a :> 2) @@ n2 :> (a :> 22) @@ n3 :> << >>) )
/\ time = 33
/\ states = << <<"Register", n1, a>>,
<<"Register", n2, a>>,
<<"RegisterOrUpdateOnNode", n1, a>>,
<<"Disconnect", n2, n1>>,
<<"Register", n1, b>>,
<<"Disconnect", n2, n3>>,
<<"Register", n2, b>>,
<<"SyncRegister", n3, a>>,
<<"Register", n1, c>>,
<<"Reconnect", n3, n2>>,
<<"Unregister", n1, a>>,
<<"Register", n2, c>>,
<<"Disconnect", n2, n3>>,
<<"Down", n3, n2>>,
<<"Reconnect", n3, n2>>,
<<"Register", n1, d>>,
<<"Unregister", n1, a>>,
<<"Unregister", n1, a>>,
<<"Reconnect", n1, n2>>,
<<"Disconnect", n2, n1>>,
<<"Reconnect", n2, n1>>,
<<"Disconnect", n2, n1>>,
<<"RegisterOrUpdateOnNode", n2, a>>,
<<"Register", n1, e>>,
<<"Down", n1, n2>>,
<<"Discover", n3, n2>>,
<<"RegisterOrUpdateOnNode", n1, b>>,
<<"Down", n3, n2>>,
<<"Register", n1, f>>,
<<"Disconnect", n1, n3>>,
<<"Register", n3, b>>,
<<"Discover", n3, n2>>,
<<"SyncRegister", n3, a>> >>
/\ visible_nodes = (n1 :> {} @@ n2 :> {n3} @@ n3 :> {n2})
/\ registered = (a :> 2 @@ b :> 1 @@ c :> 0 @@ d :> 0 @@ e :> 0 @@ f :> 0 @@ g :> 0 @@ h :> 0)
/\ inbox = (n1 :> <<[action |-> "register_or_update_on_node", name |-> c], [action |-> "unregister_on_node", name |-> a], [action |-> "register_or_update_on_node", name |-> d], [action |-> "unregister_on_node", name |-> a], [action |-> "unregister_on_node", name |-> a], [action |-> "discover", from |-> n2], [action |-> "DOWN", from |-> n2], [action |-> "discover", from |-> n2], [action |-> "DOWN", from |-> n2], [action |-> "register_or_update_on_node", name |-> e], [action |-> "register_or_update_on_node", name |-> f], [action |-> "DOWN", from |-> n3]>> @@ n2 :> <<[time |-> 2, action |-> "sync_register", name |-> a, from |-> n1], [action |-> "DOWN", from |-> n1], [action |-> "DOWN", from |-> n3], [action |-> "register_or_update_on_node", name |-> b], [action |-> "discover", from |-> n3], [action |-> "register_or_update_on_node", name |-> c], [action |-> "DOWN", from |-> n3], [action |-> "discover", from |-> n3], [action |-> "discover", from |-> n1], [action |-> "DOWN", from |-> n1], [action |-> "discover", from |-> n1], [action |-> "DOWN", from |-> n1], [action |-> "ack_sync", from |-> n3, local_data |-> << >>], [action |-> "ack_sync", from |-> n3, local_data |-> << >>]>> @@ n3 :> <<[time |-> 26, action |-> "sync_register", name |-> b, from |-> n1], [action |-> "DOWN", from |-> n1], [action |-> "register_or_update_on_node", name |-> b]>>)
/\ names = (n1 :> {a, g, h} @@ n2 :> {d, e, f, g, h} @@ n3 :> {a, c, d, e, f, g, h})
The stack trace is much deeper now (34 transitions), and with 3 nodes it is much harder to understand what has gone wrong. This looks like an issue with my conflict resolution handling though, I am only checking for conflicts between the current node, and the sender. This works fine with a two node cluster, but not with 3 (or more).
I could probably change that check to \in RegisteredOnThisNode(n), but I would need to know which node had registered it, to resolve the conflict; and whatever change I make will also need to be reproduced in AckSync too. It might be time to bite the bullet, and factor out the conflict resolution (yet more recursion).