Adding/removing LBaaS pool members, using Ansible

We use a load balancer to achieve “zero downtime deployments”. During a deploy, we take each node out of the rotation, update the code, and put it back in. If you’re using the Rackspace Cloud, there are ansible modules provided (as described here).

If you’re just using OpenStack though, it’s not quite as simple. While there are some ansible modules, there doesn’t seem to be one for managing LBaaS pool members. It is possible to create a pool member using a heat template, but that doesn’t really fit into an ansible playbook.

The only alternative seems to be using the neutron cli, which is deprecated (but doesn’t seem to have been replaced, at least for controlling an LBaaS).

Fortunately, it can return output in a machine readable format (json), which makes calling it from ansible relatively simple. To remove a pool member, in an idempotent fashion:

- name: Get current pool members
  local_action: command neutron lbaas-member-list test-gib-lb-pool --format json
  register: member_list

- set_fact: pool_member={{ member_list.stdout | from_json | selectattr("address", "equalto", ansible_default_ipv4.address) | map(attribute="id") | list }}

- name: Remove node from LB
  local_action: command neutron lbaas-member-delete {{ pool_member | first }} test-gib-lb-pool
  when: (pool_member | count) > 0

You first need to check if the current host is in the list of pool members. If so, remove it; otherwise, do nothing.

Once the code is updated, and the services restarted, you can put the node back in the pool:

- name: Add node to LB
  local_action: command neutron lbaas-member-create --name {{ inventory_hostname }} --subnet gamevy_subnet --address {{ ansible_default_ipv4.address }} --protocol-port 443 test-gib-lb-pool
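
This add isn’t guarded, so re-running the play against a node that is already back in the pool would attempt to create the member again. If you want the add to be idempotent too, you could reuse the membership check from the first block; a sketch, using the same filters as above:

- name: Get current pool members
  local_action: command neutron lbaas-member-list test-gib-lb-pool --format json
  register: member_list

- name: Add node to LB
  local_action: command neutron lbaas-member-create --name {{ inventory_hostname }} --subnet gamevy_subnet --address {{ ansible_default_ipv4.address }} --protocol-port 443 test-gib-lb-pool
  when: member_list.stdout | from_json | selectattr("address", "equalto", ansible_default_ipv4.address) | list | count == 0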

This seems to be quite effective, but we have seen some errors when the members are deleted (which isn’t exactly “zero” downtime). If I find a better approach I’ll update this.

A squad of squids

We use squid as a forward proxy, to ensure that all outbound requests come from the same (whitelisted) IP addresses.

Originally, we chose the proxy to use at deploy time, using an ansible template:

"proxy": "http://{{ hostvars['proxy-' + (["01", "02"] | random())].ansible_default_ipv4.address }}:3128"

This “load balanced” the traffic, but wouldn’t be very useful if one of the proxies disappeared (which was the main reason for having two of them!).

I briefly considered using nginx to pass requests to squid, as it was already installed on the app servers; but quickly realised that wouldn’t work, for the same reason we needed squid in the first place: nginx can’t proxy TLS traffic.

A bit of research revealed that you can group multiple squid instances into a hierarchy. This is taken care of by some more templating, in the squid config this time:

{% if inventory_hostname in groups['app_server'] %}
{% for host in proxy_ips %}
cache_peer {{ host }} parent 3128 0 round-robin
{% endfor %}
never_direct allow all
{% endif %}

Where the list of IPs is a projection from inventory data:

proxy_ips: "{{ groups['proxy_server'] | map('extract', hostvars, ['ansible_default_ipv4', 'address']) | list }}"
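
With a local squid on each app server forwarding to the parents, the app config template from earlier could simply point at the local instance instead of picking a proxy at random (a sketch, assuming squid listens locally on its default port):

"proxy": "http://127.0.0.1:3128"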

Error from postgresql_user

We have an ansible task that creates a postgres user (role) for vagrant:

- name: Create vagrant user
  sudo: true
  sudo_user: postgres
  postgresql_user: name=vagrant role_attr_flags=CREATEDB,CREATEUSER

which was working fine with ansible 1.9, but when we upgraded to 2.0 we started getting an error if the user already existed.

TASK [pg-vagrant : Create vagrant user] ****************************************
fatal: [default]: FAILED! => {"changed": false, "failed": true, "module_stderr": "", "module_stdout": "\r\nTraceback (most recent call last):\r\n  File \"/tmp/ansible-tmp-1457016468.17-9802839733620/postgresql_user\", line 2722, in \r\n    main()\r\n  File \"/tmp/ansible-tmp-1457016468.17-9802839733620/postgresql_user\", line 621, in main\r\n    changed = user_alter(cursor, module, user, password, role_attr_flags, encrypted, expires, no_password_changes)\r\n  File \"/tmp/ansible-tmp-1457016468.17-9802839733620/postgresql_user\", line 274, in user_alter\r\n    if current_role_attrs[PRIV_TO_AUTHID_COLUMN[role_attr_name]] != role_attr_value:\r\n  File \"/usr/lib/python2.7/dist-packages/psycopg2/extras.py\", line 144, in __getitem__\r\n    x = self._index[x]\r\nKeyError: 'rolcreateuser'\r\n", "msg": "MODULE FAILURE", "parsed": false}

When I checked that the user had been created successfully the first time I provisioned:

=# \dg
                                 List of roles
    Role name     |                   Attributes                   | Member of
------------------+------------------------------------------------+-----------
 postgres         | Superuser, Create role, Create DB, Replication | {}
 vagrant          | Superuser, Create DB                           | {}

I noticed that the role actually had Superuser privileges, something the documentation confirmed:

CREATEUSER
NOCREATEUSER
These clauses are an obsolete, but still accepted, spelling of SUPERUSER and NOSUPERUSER. Note that they are not equivalent to CREATEROLE as one might naively expect!

So it looks like somewhere in the toolchain (ansible, psycopg2, postgres) this is no longer supported. Substituting SUPERUSER for CREATEUSER fixed the issue.
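
In other words, the task becomes:

- name: Create vagrant user
  sudo: true
  sudo_user: postgres
  postgresql_user: name=vagrant role_attr_flags=CREATEDB,SUPERUSER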

Ansible templates and urlencode

If you use ansible templates to generate your config, and you use strong passwords, chances are that you know to use urlencode:

{
    "postgres": "postgres://{{ db_user }}:{{ db_password|urlencode() }}@{{ db_server }}/{{ db_name }}",
}

to ensure that any special chars are not mangled. Unfortunately, as I discovered to my cost, the jinja filter does NOT encode a forward slash as %2F.
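
One workaround is to encode the slash explicitly, by chaining the standard jinja replace filter after urlencode (a sketch; adjust to taste):

{
    "postgres": "postgres://{{ db_user }}:{{ db_password|urlencode()|replace('/', '%2F') }}@{{ db_server }}/{{ db_name }}",
}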

apt update_cache with ansible

Installing 3rd party software, e.g. elasticsearch, sometimes involves adding an apt repo:

- name: Add apt repository
  apt_repository: repo='deb https://example.com/apt/example jessie main'

Once this has been added, it’s necessary to call apt-get update before the new software can be installed. It’s tempting to do so by adding update_cache=yes to the apt call:

- name: Install pkg
  apt: name=example update_cache=yes

But a better solution is to separate the two:

- name: Add apt repository
  apt_repository: repo='deb https://example.com/apt/example jessie main'
  register: apt_source_added

- name: Update cache
  apt: update_cache=yes
  when: apt_source_added|changed

- name: Install pkg
  apt: name=example

This ensures that the (time consuming) update only happens the first time, when the new repo is added. It also makes it much clearer what is taking the time, if the build hangs.

EDIT: I completely forgot that it’s possible to add the update_cache attribute directly to the apt_repository call. Much simpler!

- name: Add apt repository
  apt_repository: repo='deb https://example.com/apt/example jessie main' update_cache=yes

- name: Install pkg
  apt: name=example

Ansible & systemctl daemon-reload

(UPDATE 2: there’s also a systemd module now, which should provide a neater wrapper round these commands. Explicit daemon-reload is still required)

(UPDATE: there is now a PR open to add this functionality)

We recently migrated to Debian 8 which, by default, uses systemd. I can appreciate why some people have misgivings about it, but from my point of view it’s been a massive improvement.

Our unit files look like this now:

[Service]
ExecStart=/var/www/{{ app_name }}/app.js
Restart=always
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier={{ app_name }}
User={{ app_name }}
Group={{ app_name }}
Environment=NODE_ENV={{ env }}
WorkingDirectory=/var/www/{{ app_name }}

[Install]
WantedBy=multi-user.target

Compare that to a 3-page init script using start-stop-daemon. And we no longer need a watchdog like monit.

We do our deployments using ansible, which already knows how to play nice with systemd. One thing that is missing, though, is that if you change a unit file you need to call systemctl daemon-reload before the changes are picked up.

There’s a discussion underway as to whether ansible should take care of it. But for now, the easiest thing to do is add another handler:

- name: Install unit file
  sudo: true
  copy: src=foo.service dest=/lib/systemd/system/ owner=root mode=644
  notify:
    - reload systemd
    - restart foo

with a handler like this:

- name: reload systemd
  sudo: yes
  command: systemctl daemon-reload
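
With the systemd module mentioned in the update above, the same handler could be written without shelling out; a sketch, assuming an ansible version that ships the module:

- name: reload systemd
  sudo: yes
  systemd: daemon_reload=yes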

UPDATE: if you need to restart the service later in the same play, you can flush the handlers to ensure daemon-reload has been called:

- meta: flush_handlers
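
In context, that might look something like this (a sketch, reusing the tasks from above):

- name: Install unit file
  sudo: true
  copy: src=foo.service dest=/lib/systemd/system/ owner=root mode=644
  notify: reload systemd

- meta: flush_handlers

- name: Restart foo
  sudo: true
  service: name=foo state=restarted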

Debugging an ansible module

Debugging an ansible module can be a pretty thankless task; luckily the ansible team has provided some tools to make it a little easier. While it’s possible to attach a debugger (e.g. epdb), good old-fashioned println debugging is normally enough.

If you just add a print statement to the module and run it normally, you’ll find that your output is nowhere to be seen. This is due to the way that ansible modules communicate, using stdin & stdout.

The secret is to run your module using the test-module script provided by the ansible team. You can then pass in the arguments to your module as a json blob:

hacking/test-module -m library/cloud/rax -a "{ \"credentials\": \"~/.rackspace_cloud_credentials\", \"region\": \"LON\", \"name\": \"app-prod-LON-%02d\", \"count\": \"2\", \"exact_count\": \"yes\", \"group\": \"app_servers\", \"flavor\": \"performance1-1\", \"image\": \"11b0cefc-d4ec-4f09-9ff6-f842ca97987c\", \"state\": \"present\", \"wait\": \"yes\", \"wait_timeout\": \"900\" }"

The output of this script can be a bit verbose, so if you’re only interested in your own debug statements, it can be worthwhile commenting out the code that prints the parsed output, and just keeping the raw version.