Adding/removing LBaaS pool members, using Ansible

We use a load balancer to achieve “zero downtime deployments”. During a deploy, we take each node out of the rotation, update the code, and put it back in. If you’re using the Rackspace Cloud, there are ansible modules provided (as described here).

If you’re just using OpenStack though, it’s not quite as simple. While there are some ansible modules, there doesn’t seem to be one for managing LBaaS pool members. It is possible to create a pool member using a heat template, but that doesn’t really fit into an ansible playbook.

The only alternative seems to be using the neutron cli, which is deprecated (but doesn’t seem to have been replaced, for controlling a LBaaS anyway).

Fortunately, it can return output in a machine readable format (json), which makes calling it from ansible relatively simple. To add a pool member, in an idempotent fashion:

- name: Get current pool members
  local_action: command neutron lbaas-member-list test-gib-lb-pool --format json
  register: member_list

- set_fact: pool_member={{ member_list.stdout | from_json | selectattr("address", "equalto", ansible_default_ipv4.address) | map(attribute="id") | list }}

- name: Remove node from LB
  local_action: command neutron lbaas-member-delete {{ pool_member | first }} test-gib-lb-pool
  when: (pool_member | count) > 0

You first need to check if the current host is in the list of pool members. If so, remove it; otherwise, do nothing.

Once the code is updated, and the services restarted, you can put the node back in the pool:

- name: Add node to LB
  local_action: command neutron lbaas-member-create --name {{ inventory_hostname }} --subnet gamevy_subnet --address {{ ansible_default_ipv4.address }} --protocol-port 443 test-gib-lb-pool

This seems to be quite effective, but we have seen some errors when the members are deleted (which isn’t exactly “zero” downtime). If I find a better approach I’ll update this.

A squad of squids

We use squid as a forward proxy, to ensure that all outbound requests come from the same (whitelisted) IP addresses.

Originally, we chose the proxy to use at deploy time, using an ansible template:

"proxy": "http://{{ hostvars['proxy-' + (["01", "02"] | random())].ansible_default_ipv4.address }}:3128"

This “load balanced” the traffic, but wouldn’t be very useful if one of the proxies disappeared (the main reason for having two of them!)

I briefly considered using nginx to pass requests to squid, as it was already installed on the app servers; but quickly realised that wouldn’t work, for the same reason we needed to use squid in the first place: it can’t proxy TLS traffic.

A bit of research revealed that you can group multiple squid instances into a hierarchy. Taken care of by some more templating, in the squid config this time:

{% if inventory_hostname in groups['app_server'] %}
{% for host in proxy_ips %}
cache_peer {{ host }} parent 3128 0 round-robin
{% endfor %}
never_direct allow all
{% endif %}

Where the list of IPs is a projection from inventory data:

     
proxy_ips: "{{ groups['proxy_server'] | map('extract', hostvars, ['ansible_default_ipv4', 'address']) | list }}"

Withdraw!

The next transition is a withdrawal. Naively, we add another command:

open(_Data) ->
    [
        {open, {call, bank_statem, deposit, [pos_integer()]}},
        {open, {call, bank_statem, withdraw, [pos_integer()]}}
    ].

Which causes an immediate failure:

===> Testing prop_bank_statem:prop_test()
...
=ERROR REPORT==== 29-Apr-2018::15:08:41 ===
** State machine bank_statem terminating
** Last event = {{call,{,#Ref}},
                 {withdraw,2}}
** When server state  = {open,#{balance => 1}}
** Reason for termination = error:function_clause
** Callback mode = state_functions
** Stacktrace =
**  [{bank_statem,open,
                  [{call,{,#Ref}},
                   {withdraw,2},
                   #{balance => 1}],
                  [{file,"/app/_build/test/lib/bank_statem/src/bank_statem.erl"},
                   {line,46}]},
...
Crash dump is being written to: erl_crash.dump...done

Having the entire process crash isn’t very useful, so we follow the recommendation to use the TRAPEXIT macro:

prop_test() ->
    ?FORALL(Cmds, proper_fsm:commands(?MODULE),
        ?TRAPEXIT(
            begin
                bank_statem:start_link(),
                {History,State,Result} = proper_fsm:run_commands(?MODULE, Cmds),
                bank_statem:stop(),
                ?WHENFAIL(io:format("History: ~p\nState: ~p\nResult: ~p\n", [History,State,Result]),
                    aggregate(zip(proper_fsm:state_names(History), command_names(Cmds)), Result =:= ok))
            end)).

This allows shrinking to take place, and gives us (slightly) more useful output:

===> Testing prop_bank_statem:prop_test()
.!
Failed: After 2 test(s).
A linked process died with reason {function_clause,[{bank_statem,open,[{call,{,#Ref}},{withdraw,3},#{balance=>0}],[{file,[47,97,112,112,47,95,98,117,105,108,100,47,116,101,115,116,47,108,105,98,47,98,97,110,107,95,115,116,97,116,101,109,47,115,114,99,47,98,97,110,107,95,115,116,97,116,101,109,46,101,114,108]},{line,46}]},{gen_statem,call_state_function,5,[{file,[103,101,110,95,115,116,97,116,101,109,46,101,114,108]},{line,1633}]},{gen_statem,loop_event_state_function,6,[{file,[103,101,110,95,115,116,97,116,101,109,46,101,114,108]},{line,1023}]},{proc_lib,init_p_do_apply,3,[{file,[112,114,111,99,95,108,105,98,46,101,114,108]},{line,247}]}]}.

=ERROR REPORT==== 29-Apr-2018::15:11:36 ===
** State machine bank_statem terminating
** Last event = {{call,{,#Ref}},
                 {withdraw,3}}
** When server state  = {open,#{balance => 0}}
** Reason for termination = error:function_clause
** Callback mode = state_functions
** Stacktrace =
**  [{bank_statem,open,
                  [{call,{,#Ref}},
                   {withdraw,3},
                   #{balance => 0}],
                  [{file,"/app/_build/test/lib/bank_statem/src/bank_statem.erl"},
                   {line,46}]},
...
[{set,{var,1},{call,bank_statem,withdraw,[3]}}]

Shrinking (0 time(s))
[{set,{var,1},{call,bank_statem,withdraw,[3]}}]

=ERROR REPORT==== 29-Apr-2018::15:11:36 ===
** State machine bank_statem terminating
** Last event = {{call,{,#Ref}},
                 {withdraw,3}}
** When server state  = {open,#{balance => 0}}
** Reason for termination = error:function_clause
** Callback mode = state_functions
** Stacktrace =
**  [{bank_statem,open,
                  [{call,{,#Ref}},
                   {withdraw,3},
                   #{balance => 0}],
                  [{file,"/app/_build/test/lib/bank_statem/src/bank_statem.erl"},
                   {line,46}]},
....
===> 
0/1 properties passed, 1 failed
===> Failed test cases:
  prop_bank_statem:prop_test() -> false

We can see that the error is caused by trying to withdraw funds that don’t exist. This can be avoided by adding a pre-condition, to ensure that invalid commands aren’t generated:

precondition(_From, _To, #{balance:=Balance}, {call, bank_statem, withdraw, [Amount]}) ->
    Balance - Amount >= 0;

And also update our model when a withdrawal occurs:

next_state_data(_From, _To, #{balance:=Balance}=Data, _Res, {call, bank_statem, withdraw, [Amount]}) ->
    NewBalance = Balance - Amount,
    Data#{balance:=NewBalance};

At this point, I would expect the property to pass, but it’s still failing:

===> Testing prop_bank_statem:prop_test()
...................!
Failed: After 20 test(s).
A linked process died with reason {function_clause,[{bank_statem,open,[{call,{,#Ref}},{withdraw,9},#{balance=>9}],[{file,[47,97,112,112,47,95,98,117,105,108,100,47,116,101,115,116,47,108,105,98,47,98,97,110,107,95,115,116,97,116,101,109,47,115,114,99,47,98,97,110,107,95,115,116,97,116,101,109,46,101,114,108]},{line,46}]},{gen_statem,call_state_function,5,[{file,[103,101,110,95,115,116,97,116,101,109,46,101,114,108]},{line,1633}]},{gen_statem,loop_event_state_function,6,[{file,[103,101,110,95,115,116,97,116,101,109,46,101,114,108]},{line,1023}]},{proc_lib,init_p_do_apply,3,[{file,[112,114,111,99,95,108,105,98,46,101,114,108]},{line,247}]}]}.

=ERROR REPORT==== 29-Apr-2018::15:14:34 ===
** State machine bank_statem terminating
** Last event = {{call,{,#Ref}},
                 {withdraw,9}}
** When server state  = {open,#{balance => 9}}
** Reason for termination = error:function_clause
** Callback mode = state_functions
** Stacktrace =
**  [{bank_statem,open,
                  [{call,{,#Ref}},
                   {withdraw,9},
                   #{balance => 9}],
                  [{file,"/app/_build/test/lib/bank_statem/src/bank_statem.erl"},
                   {line,46}]},
...

Shrinking .
=ERROR REPORT==== 29-Apr-2018::15:14:34 ===
** State machine bank_statem terminating
** Last event = {{call,{,#Ref}},
                 {withdraw,9}}
** When server state  = {open,#{balance => 9}}
** Reason for termination = error:function_clause
** Callback mode = state_functions
** Stacktrace =
**  [{bank_statem,open,
                  [{call,{,#Ref}},
                   {withdraw,9},
                   #{balance => 9}],
                  [{file,"/app/_build/test/lib/bank_statem/src/bank_statem.erl"},
                   {line,46}]},
...
.
=ERROR REPORT==== 29-Apr-2018::15:14:34 ===
** State machine bank_statem terminating
** Last event = {{call,{,#Ref}},
                 {withdraw,9}}
** When server state  = {open,#{balance => 9}}
** Reason for termination = error:function_clause
** Callback mode = state_functions
** Stacktrace =
**  [{bank_statem,open,
                  [{call,{,#Ref}},
                   {withdraw,9},
                   #{balance => 9}],
                  [{file,"/app/_build/test/lib/bank_statem/src/bank_statem.erl"},
                   {line,46}]},
...
(2 time(s))

=ERROR REPORT==== 29-Apr-2018::15:14:34 ===
** State machine bank_statem terminating
** Last event = {{call,{,#Ref}},
                 {withdraw,9}}
** When server state  = {open,#{balance => 9}}
** Reason for termination = error:function_clause
** Callback mode = state_functions
** Stacktrace =
**  [{bank_statem,open,
                  [{call,{,#Ref}},
                   {withdraw,9},
                   #{balance => 9}],
                  [{file,"/app/_build/test/lib/bank_statem/src/bank_statem.erl"},
                   {line,46}]},
...
[{set,{var,1},{call,bank_statem,deposit,[2]}},{set,{var,2},{call,bank_statem,deposit,[7]}},{set,{var,3},{call,bank_statem,withdraw,[9]}}]
===> 
0/1 properties passed, 1 failed
===> Failed test cases:
  prop_bank_statem:prop_test() -> false

Rather embarrassingly, that’s an actual bug in my code, the guard is checking that the remaining balance is strictly greater than 0. Hurray for property testing!

State machine properties

One of the more interesting corners of the Erlang ecosystem is property based testing. Building on Haskell’s QuickCheck, it allows generation of test cases, similar to “fuzzing”.

The integration with gen_fsm/statem makes stateful testing remarkably easy, by describing the possible transitions for our state machine:

-module(prop_bank_statem).

-include_lib("proper/include/proper.hrl").

-compile(export_all).

prop_test() ->
    ?FORALL(Cmds, proper_fsm:commands(?MODULE),
        begin
            bank_statem:start_link(),
            {History,State,Result} = proper_fsm:run_commands(?MODULE, Cmds),
            bank_statem:stop(),
            ?WHENFAIL(
                io:format("History: ~p\nState: ~p\nResult: ~p\n", [History,State,Result]),
                aggregate(zip(proper_fsm:state_names(History), command_names(Cmds)), Result =:= ok)
            )
        end).

initial_state() -> open.

initial_state_data() -> #{}.

open(_Data) -> [{open, {call, bank_statem, deposit, [pos_integer()]}}].

weight(_FromState, _ToState, _Call) -> 1.

precondition(_From, _To, #{}, {call, _Mod, _Fun, _Args}) -> true.

postcondition(_From, _To, _Data, {call, _Mod, _Fun, _Args}, _Res) -> true.

next_state_data(_From, _To, Data, _Res, {call, _Mod, _Fun, _Args}) ->
    NewData = Data,
    NewData.

To begin with, we’re just checking that a deposit can be made to an open account. This property succeeds (after the default 100 cases):

===> Testing prop_bank_statem:prop_test()
....................................................................................................
OK: Passed 100 test(s).

100% {open,{bank_statem,deposit,1}}
===> 
1/1 properties passed

But, if we switch to using a less strict generator:

open(_Data) -> [{open, {call, bank_statem, deposit, [integer()]}}].

We very quickly hit a failure:

===> Testing prop_bank_statem:prop_test()

=ERROR REPORT==== 29-Apr-2018::14:05:12 ===
** State machine bank_statem terminating
** Last event = {{call,{,#Ref}},
                 {deposit,0}}
** When server state  = {open,#{balance => 0}}
** Reason for termination = error:function_clause
** Callback mode = state_functions
** Stacktrace =
**  [{bank_statem,handle_deposit,
                  [0,
                   #{balance => 0},
                   {,#Ref}],
                  [{file,"/app/_build/test/lib/bank_statem/src/bank_statem.erl"},
                   {line,77}]},

caused by an attempt to make a deposit of 0.

To make things more interesting, we can amend our deposit transition to return the updated balance, and track that in the model:

initial_state_data() -> #{balance=>0}.

...

postcondition(_From, _To, #{balance:=PrevBalance}, {call, bank_statem, deposit, [Amount]}, {deposit_made, UpdatedBalance}) ->
    UpdatedBalance =:= (PrevBalance + Amount);

...

next_state_data(_From, _To, #{balance:=Balance}=Data, _Res, {call, bank_statem, deposit, [Amount]}) ->
    NewBalance = Balance + Amount,
    Data#{balance:=NewBalance};

For this toy example, the model is almost exactly the same as the implementation. And, in fact, the hardest thing with this style of property testing is keeping the model simple, when testing a more complex example.

Next time, we will add the rest of the transitions to our model.

Hold pls

On hold

The next step for our bank account is to place a hold on an account.

account-events-withself-held

Using state functions, this is as simple as:

open({call, From}, place_hold, Data) ->
    {next_state, held, Data, [{reply, From, hold_placed}]};

...

open({call, From}, {deposit, Amount}, Data) ->
    handle_deposit(Amount, Data, From);

...

held({call, From}, {deposit, Amount}, Data) ->
    handle_deposit(Amount, Data, From);

held({call, From}, remove_hold, Data) ->
    {next_state, open, Data, [{reply, From, hold_removed}]}.

handle_deposit(Amount, #{balance:=Balance} = Data, From) when is_number(Amount) andalso Amount > 0 ->
    NewBalance = Balance + Amount,
    {keep_state, Data#{balance:=NewBalance}, [{reply, From, deposit_made}]}.

Now any attempt to withdraw funds from a held account, will cause the gen_statem process to crash (hopefully having previously persisted any important data!).

And using handle event:

handle_event({call, From}, place_hold, open, Data) ->
    {next_state, held, Data, [{reply, From, hold_placed}]};

...

handle_event({call, From}, remove_hold, held, Data) ->
    {next_state, open, Data, [{reply, From, hold_removed}]};

...

handle_event({call, From}, {deposit, Amount}, open, Data) ->
    handle_deposit(Amount, Data, From);

handle_event({call, From}, {deposit, Amount}, held, Data) ->
    handle_deposit(Amount, Data, From);

...

handle_deposit(Amount, #{balance:=Balance} = Data, From) when is_number(Amount) andalso Amount > 0 ->
    NewBalance = Balance + Amount,
    {keep_state, Data#{balance:=NewBalance}, [{reply, From, deposit_made}]}.

Insufficient funds?

We can also add an availableToWithdraw method, with different behaviour for held accounts, without too much hassle:

open({call, From}, available_to_withdraw, #{balance:=Balance} = Data) ->
    {keep_state, Data, [{reply, From, Balance}]};

...

held({call, From}, available_to_withdraw, Data) ->
    {keep_state, Data, [{reply, From, 0}]};

Or:

handle_event({call, From}, available_to_withdraw, open, #{balance:=Balance} = Data) ->
    {keep_state, Data, [{reply, From, Balance}]};

...

handle_event({call, From}, available_to_withdraw, held, Data) ->
    {keep_state, Data, [{reply, From, 0}]};

Handle event function

The main difference between gen_statem and the older gen_fsm, is the addition of a new “handle event function” callback mode, with less restrictions on the state data type.

Converting an existing state machine, is pretty easy. Change the callback mode, and move the existing state functions into one callback:

callback_mode() -> handle_event_function.

handle_event({call, From}, get_balance, open, #{balance:=Balance} = Data) ->
    {keep_state, Data, [{reply, From, Balance}]};

handle_event({call, From}, close, open, Data) ->
    {next_state, closed, Data, [{reply, From, closed}]};

handle_event({call, From}, {deposit, Amount}, open, #{balance:=Balance} = Data) when is_number(Amount) andalso Amount > 0 ->
    NewBalance = Balance + Amount,
    {keep_state, Data#{balance:=NewBalance}, [{reply, From, deposit_made}]};

handle_event({call, From}, {withdraw, Amount}, open, #{balance:=Balance} = Data) when is_number(Amount) andalso (Balance - Amount > 0) ->
    NewBalance = Balance - Amount,
    {keep_state, Data#{balance:=NewBalance}, [{reply, From, withdrawal_made}]};

handle_event({call, From}, reopen, closed, Data) ->
    {next_state, open, Data, [{reply, From, open}]}.

Whether you prefer this style, or the previous version, seems like a matter of personal taste, if you don’t need the extra flexibility.

❤️ the State Machine

Having recently read a post about using state machines in JS, I decided to try implementing the same logic using the (relatively) new gen_statem.

As ever, the code is available here.

The initial example is a naive representation of a bank account:

account-events-withself

The first step is to fill out the gen_statem boilerplate:

-module(bank_statem).

-behaviour(gen_statem).

-export([init/1, callback_mode/0]).
-export([open/3]).

init([]) ->
    {ok, open, #{balance=>0}}.

callback_mode() -> state_functions.

open({call, From}, get_balance, #{balance:=Balance} = Data) ->
    {keep_state, Data, [{reply, From, Balance}]}.

We return an initial state of open, and a map containing the initial balance of 0.

I also added a get_balance call, so we can inspect the current data (which would normally be referred to as the “state” in a gen_server, confusingly).

The first state transition is to close an open account:

open({call, From}, close, Data) ->
    {next_state, closed, Data, [{reply, From, closed}]};

The function name tells you which state this can be called from (open), and the return value tells you that the state machine would transition to a new state (closed), as well as returning a value (the atom closed) to the caller.

Re-opening an account is pretty similar:

closed({call, From}, reopen, Data) ->
    {next_state, open, Data, [{reply, From, open}]}.

As close is only defined for the open state, and reopen is only defined for the closed state, any inappropriate calls will cause the process to crash.

Deposit & withdraw are a little different:

open({call, From}, {deposit, Amount}, #{balance:=Balance} = Data) when is_number(Amount) andalso Amount > 0 ->
    NewBalance = Balance + Amount,
    {keep_state, Data#{balance:=NewBalance}, [{reply, From, deposit_made}]};

open({call, From}, {withdraw, Amount}, #{balance:=Balance} = Data) when is_number(Amount) andalso (Balance - Amount > 0) ->
    NewBalance = Balance - Amount,
    {keep_state, Data#{balance:=NewBalance}, [{reply, From, withdrawal_made}]}.

In that they retain the current state, but the data (in this case, the balance) is updated. Guards have also been used to validate the arguments; again, any other calls will cause an error.

This will seem pretty familiar, if you’ve ever used gen_fsm. Next time, we’ll look at the new alternative callback mode: handle_event_function.