Up till now, we have been using the default behaviour of TLC: model checking mode. But there is another tool in our quiver (mixing my metaphors there), simulation mode!
This is more like the way property testing tooling works: rather than a breadth-first search of the problem space, a random walk is generated. As we have discovered, the time taken to run an exhaustive model check rapidly expands, as the model becomes more complicated.
If you have added constraints to prevent the model running forever (e.g. in our case, the disconnection limit, and not re-using registered names), you can remove those now. When you run the spec:
it gives you the seed, so you can run the same tests again after fixing any issues. TLC will run forever in this mode, if there aren’t any failures, but we are not so lucky:
Error: Invariant ThereCanBeOnlyOne is violated.
Error: The behavior up to this point is:
...
State 34: <Next line 290, col 5 to line 303, col 22 of module syn>
/\ disconnections = 6
/\ locally_registered = ( n1 :> (n1 :> (a :> 2 @@ b :> 26) @@ n2 :> <<>> @@ n3 :> <<>>) @@
n2 :> (n1 :> <<>> @@ n2 :> (a :> 22) @@ n3 :> <<>>) @@
n3 :> (n1 :> (a :> 2) @@ n2 :> (a :> 22) @@ n3 :> << >>) )
/\ time = 33
/\ states = << <<"Register", n1, a>>,
<<"Register", n2, a>>,
<<"RegisterOrUpdateOnNode", n1, a>>,
<<"Disconnect", n2, n1>>,
<<"Register", n1, b>>,
<<"Disconnect", n2, n3>>,
<<"Register", n2, b>>,
<<"SyncRegister", n3, a>>,
<<"Register", n1, c>>,
<<"Reconnect", n3, n2>>,
<<"Unregister", n1, a>>,
<<"Register", n2, c>>,
<<"Disconnect", n2, n3>>,
<<"Down", n3, n2>>,
<<"Reconnect", n3, n2>>,
<<"Register", n1, d>>,
<<"Unregister", n1, a>>,
<<"Unregister", n1, a>>,
<<"Reconnect", n1, n2>>,
<<"Disconnect", n2, n1>>,
<<"Reconnect", n2, n1>>,
<<"Disconnect", n2, n1>>,
<<"RegisterOrUpdateOnNode", n2, a>>,
<<"Register", n1, e>>,
<<"Down", n1, n2>>,
<<"Discover", n3, n2>>,
<<"RegisterOrUpdateOnNode", n1, b>>,
<<"Down", n3, n2>>,
<<"Register", n1, f>>,
<<"Disconnect", n1, n3>>,
<<"Register", n3, b>>,
<<"Discover", n3, n2>>,
<<"SyncRegister", n3, a>> >>
/\ visible_nodes = (n1 :> {} @@ n2 :> {n3} @@ n3 :> {n2})
/\ registered = (a :> 2 @@ b :> 1 @@ c :> 0 @@ d :> 0 @@ e :> 0 @@ f :> 0 @@ g :> 0 @@ h :> 0)
/\ inbox = (n1 :> <<[action |-> "register_or_update_on_node", name |-> c], [action |-> "unregister_on_node", name |-> a], [action |-> "register_or_update_on_node", name |-> d], [action |-> "unregister_on_node", name |-> a], [action |-> "unregister_on_node", name |-> a], [action |-> "discover", from |-> n2], [action |-> "DOWN", from |-> n2], [action |-> "discover", from |-> n2], [action |-> "DOWN", from |-> n2], [action |-> "register_or_update_on_node", name |-> e], [action |-> "register_or_update_on_node", name |-> f], [action |-> "DOWN", from |-> n3]>> @@ n2 :> <<[time |-> 2, action |-> "sync_register", name |-> a, from |-> n1], [action |-> "DOWN", from |-> n1], [action |-> "DOWN", from |-> n3], [action |-> "register_or_update_on_node", name |-> b], [action |-> "discover", from |-> n3], [action |-> "register_or_update_on_node", name |-> c], [action |-> "DOWN", from |-> n3], [action |-> "discover", from |-> n3], [action |-> "discover", from |-> n1], [action |-> "DOWN", from |-> n1], [action |-> "discover", from |-> n1], [action |-> "DOWN", from |-> n1], [action |-> "ack_sync", from |-> n3, local_data |-> << >>], [action |-> "ack_sync", from |-> n3, local_data |-> << >>]>> @@ n3 :> <<[time |-> 26, action |-> "sync_register", name |-> b, from |-> n1], [action |-> "DOWN", from |-> n1], [action |-> "register_or_update_on_node", name |-> b]>>)
/\ names = (n1 :> {a, g, h} @@ n2 :> {d, e, f, g, h} @@ n3 :> {a, c, d, e, f, g, h})
The stack trace is much deeper now (34 transitions), and with 3 nodes it is much harder to understand what has gone wrong. This looks like an issue with my conflict resolution handling though, I am only checking for conflicts between the current node, and the sender. This works fine with a two node cluster, but not with 3 (or more).