Mapping conflicts with ELK

I recently started upgrading to a newer version of ES (4.5), and found that it refused to start:

IllegalStateException: unable to upgrade the mappings for the index

In fact, this mapping conflict was one of the things I was hoping the upgrade would solve. After a bit of reading it became clear that I would have to make some changes.

The mapping in question was a field from the logs called “level”. In the postgres logs it was a string (e.g. “INFO”), and in our application logs (using bunyan) it was an integer (40 => “WARN”).

To allow me to search using a range (e.g. level:[40 TO 60]), I was using a mutate filter to convert the string “40” to an integer, and this was the cause of the conflict.

My first thought was to copy the field before converting:

mutate {
    add_field => { "level_int" => "%{level}" }
    convert => { "level_int" => "integer" }
}

But it turns out that that’s not enough to avoid a conflict (possibly because ES guesses the type, and saw an int first?). So I went with the nuclear option, and renamed the field:

mutate {
    rename => { "level" => "level_int" }
    convert => { "level_int" => "integer" }
}

Now my new documents were conflict-free. Unfortunately, the only solution offered for existing data is to export and re-import it, which I wasn’t really in the mood for.

Luckily, I’m not in any rush to upgrade, and we close indices after 30 days. So I plan to wait for a month, and hope my data is clean by then!

“Trouble parsing json”

We use Bunyan in our node apps, for “structured logging”. The output json string is passed to syslog by systemd, and then fed into ELK.

{
    "name":"foo-service",
    "hostname":"app-01",
    "pid":30988,
    "ip":"1.19.24.8",
    "requestId":"1c11f448-73f2-4efa-bc63-3de787618d49",
    "level":50,
    "err": {
        "message":"oh noes!"
    }
}

Unfortunately, if that string is longer than 2048 chars (usually a stacktrace, or html returned from a web service instead of json), then the json blob ends up split over 2 lines in syslog.

This causes ELK to barf when attempting to parse the broken lines (assuming you are parsing as json), and means you won’t see those errors in Kibana.
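
For reference, the parsing in question is Logstash’s json filter, along these lines (a sketch; the source field name is an assumption):

filter {
    json {
        source => "message"
    }
}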

It is possible to detect these parse failures by searching for the string “Trouble parsing json”, but that’s not really a solution.

I would prefer to see a truncated error rather than the current situation, but that means either wrapping or patching Bunyan itself.
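
A minimal sketch of the wrapping approach (TruncatingStream is a hypothetical name, and trimming err.stack first is just one possible policy):

var bunyan = require('bunyan');

var MAX_LINE = 2048; // the syslog line limit we keep hitting

// A bunyan "raw" stream receives the record as an object, so we can
// shorten the noisy fields before serialising it ourselves.
function TruncatingStream(out) {
    this.out = out || process.stdout;
}

TruncatingStream.prototype.write = function(rec) {
    var line = JSON.stringify(rec);
    if (line.length > MAX_LINE && rec.err && rec.err.stack) {
        // the stacktrace is the usual culprit, so trim that first
        rec.err.stack = rec.err.stack.slice(0, 512) + ' ...[truncated]';
        line = JSON.stringify(rec);
    }
    this.out.write(line + '\n');
};

var logger = bunyan.createLogger({
    name: 'foo-service',
    streams: [{ type: 'raw', stream: new TruncatingStream() }]
});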

Node.js and systemd-notify

Running a nodejs app as a systemd service is pretty simple. But the downside to this approach is that if the app dies straight away (e.g. because a config file is malformed), systemd still reports the start as successful.

An alternative is to switch to using notify:

[Service]
Type=notify
ExecStart=/var/www/app_name/app.js
Restart=always
...

There are various ways to notify systemd that your service is ready; but the simplest is to use systemd-notify, a console wrapper around sd_notify.

var exec = require('child_process').exec;

// `Q`, `app`, `logger` and `server` come from the surrounding module.
this.start = function(port) {
    var deferred = Q.defer();

    server = app.listen(port, '127.0.0.1', function(error) {
        if (error) {
            deferred.reject(new Error(error));
        } else {
            logger.info('Listening on port %d', port);
            // tell systemd the service is ready
            exec('systemd-notify --ready');
            deferred.resolve();
        }
    });

    return deferred.promise;
};

If the process that calls systemd-notify is not the one in ExecStart (e.g. you are using cluster), you will also need to set NotifyAccess to “all”.
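
With that, the unit file from earlier gains one line:

[Service]
Type=notify
NotifyAccess=all
ExecStart=/var/www/app_name/app.js
Restart=always
...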

Now when you use systemctl to start your service, it will wait for the notification before deeming the start process complete.

Jenkins and oauth2_proxy

We hide Jenkins behind bitly’s oauth2_proxy, to control access using our Google accounts. After recently upgrading to Debian Jessie (amongst other things), we found that enabling security on Jenkins (using the Reverse Proxy Auth plugin) resulted in an error:

java.lang.NullPointerException
	at org.jenkinsci.plugins.reverse_proxy_auth.ReverseProxySecurityRealm$1.doFilter(ReverseProxySecurityRealm.java:435)
	at hudson.security.HudsonFilter.doFilter(HudsonFilter.java:171)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1482)
	at org.kohsuke.stapler.compression.CompressionFilter.doFilter(CompressionFilter.java:49)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1482)
...

Following the stack trace, we find ourselves here. It’s pretty obvious that the NPE is caused by u being null, but the real question is why we are in that if block at all.

It turns out that at some point the oauth proxy started sending a basic auth header, as well as the X-Forwarded ones we need. This makes the Jenkins plugin sad when it tries to look up the user.

Unfortunately, there is currently no way to have one without the other, which is an issue for other upstream applications. Hopefully at some point that flag will be added, but until then I’ve simply deleted the offending line.

Error from postgresql_user

We have an ansible task that creates a postgres user (role) for vagrant:

- name: Create vagrant user
  sudo: true
  sudo_user: postgres
  postgresql_user: name=vagrant role_attr_flags=CREATEDB,CREATEUSER

which was working fine with ansible 1.9; but when we upgraded to 2.0, we started getting an error if the user already existed.

TASK [pg-vagrant : Create vagrant user] ****************************************
fatal: [default]: FAILED! => {"changed": false, "failed": true, "module_stderr": "", "module_stdout": "\r\nTraceback (most recent call last):\r\n  File \"/tmp/ansible-tmp-1457016468.17-9802839733620/postgresql_user\", line 2722, in \r\n    main()\r\n  File \"/tmp/ansible-tmp-1457016468.17-9802839733620/postgresql_user\", line 621, in main\r\n    changed = user_alter(cursor, module, user, password, role_attr_flags, encrypted, expires, no_password_changes)\r\n  File \"/tmp/ansible-tmp-1457016468.17-9802839733620/postgresql_user\", line 274, in user_alter\r\n    if current_role_attrs[PRIV_TO_AUTHID_COLUMN[role_attr_name]] != role_attr_value:\r\n  File \"/usr/lib/python2.7/dist-packages/psycopg2/extras.py\", line 144, in __getitem__\r\n    x = self._index[x]\r\nKeyError: 'rolcreateuser'\r\n", "msg": "MODULE FAILURE", "parsed": false}

When I inspected the user that had been created on the first provisioning run:

=# \dg
                                 List of roles
    Role name     |                   Attributes                   | Member of
------------------+------------------------------------------------+-----------
 postgres         | Superuser, Create role, Create DB, Replication | {}
 vagrant          | Superuser, Create DB                           | {}

I noticed that the role actually had Superuser privileges, something the documentation confirmed:

CREATEUSER
NOCREATEUSER
These clauses are an obsolete, but still accepted, spelling of SUPERUSER and NOSUPERUSER. Note that they are not equivalent to CREATEROLE as one might naively expect!

So it looks like somewhere in the toolchain (ansible, psycopg2, postgres) the obsolete CREATEUSER spelling is no longer handled (judging by the traceback, the 2.0 module looks up a rolcreateuser column that doesn’t exist in pg_authid, hence the KeyError). Substituting SUPERUSER for CREATEUSER fixed the issue.
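
The working version of the task:

- name: Create vagrant user
  sudo: true
  sudo_user: postgres
  postgresql_user: name=vagrant role_attr_flags=CREATEDB,SUPERUSER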

Tennis Kata in Erlang

I’ve recently been reading ploeh’s series on doing the Tennis Kata in F#. I decided to have a go in Erlang: we might not have a fancy type system, but we do have pattern matching and partial functions.

You can find the source code to follow along here.

I decided to implement a “game” as a gen_server (when you have a hammer…); I refer you to a previous post for the preamble.

Love all

The first test is to check the initial score:

initial_score(Pid) ->
    fun() ->
        ?assertEqual(#{p1=>love,p2=>love}, gen_server:call(Pid, get_score))
    end.

I decided to return a map, rather than a tuple or record, and to use the atom “love” rather than 0. The implementation is pretty trivial:

init([]) ->
    {ok, #{p1=>love,p2=>love}}.

handle_call(get_score, _From, State) ->
    {reply, State, State};

If there was more information in the state map, you might want to return a projection rather than the whole object.

Fifteen love

fifteen_love(Pid) ->
    fun() ->
        ?assertEqual(#{p1=>15,p2=>love}, gen_server:call(Pid, {won_point, p1}))
    end.

This time we need to calculate the new score when a player has won a point:

handle_call({won_point, p1}, _From, State = #{p1:=CurrentScore}) ->
    NewState = State#{p1:=new_score(CurrentScore)},
    {reply, NewState, NewState};

...

new_score(love) -> 15.

Love fifteen

love_fifteen(Pid) ->
    fun() ->
        ?assertEqual(#{p1=>love,p2=>15}, gen_server:call(Pid, {won_point, p2}))
    end.

We can re-use the new_score function from before:

handle_call({won_point, p2}, _From, State = #{p2:=CurrentScore}) ->
    NewState = State#{p2:=new_score(CurrentScore)},
    {reply, NewState, NewState};

Fifteen all

fifteen_all(Pid) ->
    fun() ->
        won_point(Pid, p1),
        ?assertEqual(#{p1=>15,p2=>15}, won_point(Pid, p2))
    end.

won_point(Pid, Player) ->
    gen_server:call(Pid, {won_point, Player}).

No need for any new code, but it seemed like time for a little test refactoring.

Thirty fifteen

thirty_fifteen(Pid) ->
    fun() ->
        won_point(Pid, p1),
        won_point(Pid, p2),
        ?assertEqual(#{p1=>30,p2=>15}, won_point(Pid, p1))
    end.

Now we need to add another function head, to handle the new score:

new_score(love) -> 15;
new_score(15) -> 30.

Thirty all

thirty_all(Pid) ->
    fun() ->
        won_point(Pid, p1),
        won_point(Pid, p2),
        won_point(Pid, p2),
        ?assertEqual(#{p1=>30,p2=>30}, won_point(Pid, p1))
    end.

No new code required.

Forty thirty

forty_thirty(Pid) ->
    fun() ->
        won_point(Pid, p1),
        won_point(Pid, p2),
        won_point(Pid, p1),
        won_point(Pid, p2),
        ?assertEqual(#{p1=>40,p2=>30}, won_point(Pid, p1))
    end.

Again, we need to handle a new score:

...
new_score(30) -> 40.

P1 win

p1_wins(Pid) ->
    fun() ->
        won_point(Pid, p1),
        won_point(Pid, p1),
        won_point(Pid, p1),
        ?assertEqual({game_over, p1}, won_point(Pid, p1))
    end.

This requires a special case, for when the player has 40 points:

handle_call({won_point, p1}, _From, State = #{p1:=40}) ->
    {stop, normal, {game_over, p1}, State};

But now we have a problem, as I skipped over the troublesome “deuce”.

Deuce

deuce(Pid) ->
    fun() ->
        won_point(Pid, p1),
        won_point(Pid, p2),
        won_point(Pid, p1),
        won_point(Pid, p1),
        won_point(Pid, p2),
        ?assertEqual(#{p1=>deuce,p2=>deuce}, won_point(Pid, p2))
    end.

Again, I decided to use an atom, rather than 40-40. But now we need to take both players’ scores into account when determining the new score:

handle_call({won_point, p1}, _From, State = #{p1:=P1Score,p2:=P2Score}) ->
    {NewP1Score, NewP2Score} = new_score(P1Score, P2Score),
    NewState = State#{p1:=NewP1Score, p2:=NewP2Score},
    {reply, NewState, NewState};

...

new_score(love, Score) -> {15, Score};
new_score(15, Score) -> {30, Score};
new_score(30, 40) -> {deuce, deuce};
new_score(30, Score) -> {40, Score}.
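
The p2 handler is the mirror image, swapping the argument order (a sketch; the full version is in the repo):

handle_call({won_point, p2}, _From, State = #{p1:=P1Score,p2:=P2Score}) ->
    {NewP2Score, NewP1Score} = new_score(P2Score, P1Score),
    NewState = State#{p1:=NewP1Score, p2:=NewP2Score},
    {reply, NewState, NewState};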

Advantage

advantage_p1(Pid) ->
    fun() ->
        won_point(Pid, p1),
        won_point(Pid, p2),
        won_point(Pid, p1),
        won_point(Pid, p1),
        won_point(Pid, p2),
        won_point(Pid, p2),
        ?assertEqual(#{p1=>advantage,p2=>40}, won_point(Pid, p1))
    end.

Another case to handle:

...
new_score(deuce, deuce) -> {advantage, 40}.

Back to Deuce

advantage_p1_back_to_deuce(Pid) ->
    fun() ->
        won_point(Pid, p1),
        won_point(Pid, p2),
        won_point(Pid, p1),
        won_point(Pid, p1),
        won_point(Pid, p2),
        won_point(Pid, p2),
        won_point(Pid, p1),
        ?assertEqual(#{p1=>deuce,p2=>deuce}, won_point(Pid, p2))
    end.

Now we need to add a guard to the win case:

handle_call({won_point, p2}, _From, State = #{p1:=P1Score, p2:=40}) when P1Score =/= advantage ->
    {stop, normal, {game_over, p2}, State};

...

new_score(40, advantage) -> {deuce, deuce}.

Advantage, then win

advantage_p1_then_win(Pid) ->
    fun() ->
        won_point(Pid, p1),
        won_point(Pid, p2),
        won_point(Pid, p1),
        won_point(Pid, p1),
        won_point(Pid, p2),
        won_point(Pid, p2),
        won_point(Pid, p1),
        ?assertEqual({game_over, p1}, won_point(Pid, p1))
    end.

Finally, we need to allow a win from Advantage:

handle_call({won_point, p1}, _From, State = #{p1:=advantage}) ->
    {stop, normal, {game_over, p1}, State};

I think that’s everything covered. As Mark says, “it’s easy enough that it’s fun to play with, but difficult enough that it’s fun to play with”. I personally feel that the use of pattern matching made the intent of the code clearer, but of course YMMV.

Managing nodejs dependencies

The easiest way to “deploy” a node app is to clone the git repo on a server, and run npm install. There are a couple of disadvantages though: first, I don’t really like having to install git on the server, or manage credentials for a private repo.

Second, installing the dependencies like that means you may get different versions of the modules you rely on than you were expecting. One of the tenets of a reliable build pipeline is ensuring that builds are repeatable, and that what you deploy matches what you tested.

There are a few alternatives: you could vendor in the node_modules folder, but this dirties the commit history, and increases the size of your repo. You could use npm shrinkwrap, which is the same concept as Bundler’s Gemfile.lock, a list of specific versions to install. This is definitely an improvement, but still leaves the risk of npm i failing during a deployment.
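
For reference, generating the shrinkwrap file is simple:

npm install
npm shrinkwrap
# writes npm-shrinkwrap.json, locking the installed dependency tree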

I’d prefer to only install the dependencies once, on the build server. This means I can run the tests, then tar up the folder and upload that to each environment in turn:

npm install
npm test
npm prune --production
npm dedupe
tar -czf app.tar.gz --exclude='*/coverage' app/

After running the tests, we prune the dependencies to remove mocha etc. We then dedupe, to try and reduce the number of copies of shared modules, and finally create an archive of the app. This is output as an artifact of the build, pulled in by the deploy-to-staging build, output again from that, and finally pulled in by the deploy-to-production build.