Don’t Lose Your ets Tables

March 23rd, 2011  |  Published in erlang, reliability  |  Bookmark on Pinboard.in

In my recent QCon talk I talked about accidentally crashing an Erlang process on a customer’s subscription streaming video website running live in production. The code involved had not been used in production before, and the customer had decided somewhat unexpectedly to turn on a new feature that required it. The developer who wrote it had not tested it and had long since left the company.

The purpose of the code was to monitor bandwidth and session usage for each video subscriber to make sure they weren’t streaming more than they’d paid for. Concerned about the viability of the code, a colleague and I logged into the customer site (with their permission, of course), chose a subscriber at random, and, in an Erlang shell, I interactively invoked a function in the code in question to check that subscriber’s current bandwidth and session count. After a second check, we saw the numbers dropping, potentially indicating the subscriber was logging out, and we wanted to make sure all went well when the subscriber completely stopped streaming. After waiting a bit, I interactively called the function again, and — BAM! — the process holding session state for all paying customers crashed.

The original developer had used an Erlang ets table, an in-memory data store, to hold the subscriber data, and wrote something like this for lookups:

[SubscriberData] = ets:lookup(Table, Subscriber),

My interactive call from the shell looked up a nonexistent subscriber, so the result was the empty list [] rather than [SubscriberData], which caused a pattern mismatch and a badmatch exception. Uncaught, the exception crashed the process. Since the process owned the ets table, when it went down it took the ets table and all subscriber session data with it. It wasn’t so bad, since all it meant was that for a few hours a few subscribers potentially got a bit more video than they’d paid for, but still, it’s not at all the kind of design Erlang’s “Let It Crash” philosophy actually encourages. Crashing a process when something unexpected occurs is perfectly fine, since coding defensively introduces problems of its own, but you can still avoid losing your ets tables like this relatively easily.

Name an Heir

When you create an ets table you can also name a process to inherit the table should the creating process die:

TableId = ets:new(my_table, [{heir, SomeOtherProcess, HeirData}]),

If the creating process dies, the process SomeOtherProcess will receive a message of the form

{'ETS-TRANSFER', TableId, OldOwner, HeirData}

where TableId is the table identifier returned from ets:new, OldOwner is the pid of the process that owned the table, and HeirData is the data provided with the heir option passed to ets:new. Once it receives this message, SomeOtherProcess owns the table.

Give It Away

Alternatively, you can create an ets table and then give it to some other process to keep it:

TableId = ets:new(my_table, []),
ets:give_away(TableId, SomeOtherProcess, GiftData),

If the creating process dies, the process SomeOtherProcess will receive a message of the form

{'ETS-TRANSFER', TableId, OldOwner, GiftData}

where TableId is the table identifier returned from ets:new, OldOwner is the pid of the process that owned the table, and GiftData is the data provided in the ets:give_away call. Once it receives this message, SomeOtherProcess owns the table.

Table Manager

Instead of naming an heir or giving a table away, you can just have your Erlang supervisor process create a child process whose sole task is to own the table. This process creates the table as a named public table, thus allowing other processes to know its name and read/write it directly, with ets built-in concurrency protection dealing with any concurrency issues. Since the owner process does nothing more than create the table and then wait to be told to shut down, the likelihood of it crashing and taking the table with it is practically nil. The drawback here, though, is that the process actually using the table may have to coordinate with the owner process to ensure the table is available, and worse, it ends up using what is essentially a global variable — the table name — which can make code harder to read and maintain.

A Combination Approach

A nice way of managing ets tables, though, is to use a combination of the three previous techniques:

  1. The Erlang supervisor creates a table manager process. Since all this process does is manage the table, the likelihood of it crashing is very low.
  2. The table manager links itself to the table user process and traps exits, allowing it to receive an EXIT message if the table user process dies unexpectedly.
  3. The table manager creates a table, names itself (self()) as the heir, and then gives it away to the table user process.
  4. If the table user process dies, the table manager is informed of the process death and also inherits the table back.

Once it inherits the table, the table manager can then for example wait until the supervisor recreates the table user process, and then repeat the steps above to give the table to the new table user process. Other variations on this approach, like maybe a small pool of child process clones that cooperate to transfer the table between them in case of error, are of course also possible. Even though there are still process coordination issues here (but nothing difficult), I like this approach because it avoids global named tables and takes advantage of Erlang's supervision hierarchy.

The title of my QCon talk was "Let It Crash...Except When You Shouldn't." This scenario is an example of "when you shouldn't" — losing ets data due to a process crash is easily avoided.

Comments are closed.