erlang

One More Erlang Wide Finder

October 14th, 2007  |  Published in erlang, performance  |  Bookmark on Pinboard.in

[Update: WordPress totally destroyed the original version of this posting, so I had to almost completely rewrite it. :-( ]

Since posting my second version of an Erlang solution to Tim Bray’s Wide Finder, which Tim’s apparently been getting some good performance out of, I haven’t had time to try anything new. I mean, I work for a startup, so you could say I’m a bit busy. But the fact that the output of my earlier solution was just a number of semi-matches, rather than the list of top ten matches that the original Ruby version produced, was gnawing at me. In short, I didn’t finish the job. So last night, as I watched the marathon Red Sox playoff game, I worked on getting the output to match that of the Ruby version.

The executive summary is that this version has output exactly like that of Ruby, and as an added bonus it runs almost twice as fast as my original tbray5.erl code even though it does more work. On my 8-core 2.33 GHz Intel Xeon Linux box, the best time I’ve seen is 6.663 sec. It has more lines of code, though.

You can grab tbray14.erl and wfbm.erl if you’d like to try them out. Or just run the following commands:

wget http://steve.vinoski.net/code/tbray14.erl http://steve.vinoski.net/code/wfbm.erl
erl -make -smp
erl -smp -noshell -run tbray14 main o1000k.ap

Below find the details of how it works.

Boyer-Moore string seaching

Tim’s searching for data in his web logs that match this pattern:

GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+)\\s

There’s a trailing space at the end of the pattern, hence that last \s. Obviously, the first part of the pattern is fixed, the second part variable. The part in parentheses is what Tim wants to see in the final top ten output list.

One of the problems with my previous version was how it broke the data up so it could look for matches. It used Erlang’s string:tokens function to first break on newlines, and then called it again to divide each line into space-separated chunks. Using that function also meant first converting Erlang binaries to strings. All in all, too slow.

I decided to instead pursue solutions that let me leave the data in the form of an Erlang binary and search through it that way. I wrote a character-by-character thing that worked, but it was also too slow. I tried various regular expression Erlang packages, as well as just using Erlang’s built-in pattern matching, but they were too slow too.

I finally settled on a combination of Boyer-Moore and Erlang’s built-in matching. It lets me advance through the data relatively quickly looking for the fixed part of that pattern, and then use Erlang’s pattern matching to get the rest. The code to do this is in wfbm.erl; let’s break it down function by function.

Constants

First, some constants:

-define(STR, "GET /ongoing/When").
-define(REVSTR, "mehW/gniogno/ TEG").
-define(STRLEN, length(?STR)).
-define(MATCHHEADLEN, length("/200x/")).
-define(SKIP, length("/200x/2007/10/15/")).

The first one, STR, defines the fixed part of the pattern we’re looking for, while REVSTR is the same string, only backwards. Boyer-Moore works by searching backwards, so we need the backwards version to let us do that. STRLEN is just the length of the fixed search string. MATCHHEADLEN is the length of the text we need to drop off the variable part of the patterns we find, so that our final output strings match the original Ruby output. And finally, SKIP is just the length of the front part of the variable part of the pattern, which has variable content but is always the same length.

Shift table

Boyer-Moore searching shifts the search string along the text being searched based on which characters don’t match and where those characters appear in the search string. The following code precomputes a table that tells us how to shift the search string along:

set_shifts(_, Count, Tbl) when Count =:= ?STRLEN - 1 ->
    Tbl;
set_shifts([H|T], Count, Tbl) ->
    New = ?STRLEN - Count - 1,
    NTbl = dict:store(H, New, Tbl),
    set_shifts(T, Count+1, NTbl).

set_defaults([], Tbl) ->
    Tbl;
set_defaults([V|T], Tbl) ->
    set_defaults(T, dict:store(V, ?STRLEN, Tbl)).

init() ->
    set_shifts(?STR, 0, set_defaults(lists:seq(1, 255), dict:new())).

The init/0 function is called to initialize the shift table. Callers are expected to invoke this once up front, and then pass the table in whenever they want to search. The set_defaults/2 function just sets the shift amount for all characters to the length of the search string, and then the set_shifts/3 function sets the correct shift values in the same table for the characters in the search string.

Finding matches

The exported find/2 function (not shown) calls find_matches/3 to get the work done. This function comes in three forms:

find_matches(<<?STR, Tail/binary>>, Tbl, Acc) ->
    case get_tail(Tail) of
        {ok, More} ->
            {H, Rest} = split_binary(Tail, More),
            {_, Match} = split_binary(H, ?MATCHHEADLEN),
            Result = binary_to_list(Match),
            find_matches(Rest, Tbl, [Result | Acc]);
        no_match ->
            find_matches(Tail, Tbl, Acc)
    end;
find_matches(Bin, _, Acc) when size(Bin) < ?STRLEN ->
    Acc;
find_matches(Bin, Tbl, Acc) ->
    {Front, _} = split_binary(Bin, ?STRLEN),
    Shift = get_shift(lists:reverse(binary_to_list(Front)), ?REVSTR, Tbl),
    {_, Next} = split_binary(Bin, Shift),
    find_matches(Next, Tbl, Acc).

The middle variant is invoked when we have searched to the end of the binary, and it’s too short to contain any more matches. This version just returns a list of the accumulated matches.

The first variant is invoked when the front of the binary matches the fixed portion of the pattern we’re searching for. Note that this isn’t strictly Boyer-Moore, since that algorithm searches in reverse, unless Erlang argument pattern matching also matches in reverse, which is unlikely. When the fixed part matches, we have to check the next part to ensure that it matches the variable part of the pattern, and we call get_tail/1 to do that; that’s described later, below.

The last variant of find_matches/3 gets called when the front of the binary doesn’t match. It first splits the binary to take enough characters off the front of the binary to match against the fixed search string, converts that first part to a string, reverses it, and passes it to get_shift/3:

get_shift([C1|T1], [C2|T2], Tbl) when C1 =:= C2 ->
    get_shift(T1, T2, Tbl);
get_shift([C1|_], _, Tbl) ->
    dict:fetch(C1, Tbl).

This pair of functions simply walks the reversed string character-by-character until it finds a mismatch, and returns the shift amount from the Boyer-Moore table for that character. The find_matches/3 function then uses that shift amount to split the binary again at the right spot, and then invoke itself recursively on the second half of the split binary to continue looking for matches.

Now, get_tail/1 is what find_matches/3 calls when the front of the binary matches the fixed part of the search pattern and we need to determine whether the tail of the binary matches the variable part of the search pattern. It has multiple variants. First, the easy ones:

get_tail(<<>>) ->
    no_match;
get_tail(Bin) ->
    get_tail(Bin, none, 0).

The first returns the atom no_match when an empty binary is passed in. The second variant calls get_tail/3, which does all the work. We pass in the atom none to initialize our search state, and we initialize the match length to zero.

The get_tail/3 function has a number of variants. The first four, shown below, just reject binaries that don’t match the variable portion of the search pattern:

get_tail(<<"/20",_:8,"x/",_:32,$/,_:16,$/,_:16,$/, Rest/binary>>, _, _)
  when size(Rest) =:= 0 ->
    no_match;
get_tail(<<"/20",_:8,"x/",_:32,$/,_:16,$/,_:16,$/,32:8, _/binary>>, _, _) ->
    no_match;
get_tail(<<"/19",_:8,"x/",_:32,$/,_:16,$/,_:16,$/, Rest/binary>>, _, _)
  when size(Rest) =:= 0 ->
    no_match;
get_tail(<<"/19",_:8,"x/",_:32,$/,_:16,$/,_:16,$/,32:8, _/binary>>, _, _) ->
    no_match;

We match the front of the variable portion of the pattern, where the date numbers appear, but we disallow anything that has an empty binary following it, or is followed immediately by a space character (shown here as 32:8, where 32 is the ASCII value for the space character). We do these matches twice, once for strings that start with "/20" and again for strings that start with "/19".

When the front of the binary matches the date portion of the variable part of our search pattern, we hit the following get_tail/3 variants:

get_tail(<<"/20",_:8,"x/",_:32,$/,M1:8,M0:8,$/,D1:8,D0:8,$/, Rest/binary>>,
 none, Len)
   when ((M1-$0)*10 + (M0-$0)) =< 12, ((D1-$0)*10 + (D0-$0)) =< 31 ->
    get_tail(Rest, almost, Len+?SKIP);
get_tail(<<"/19",_:8,"x/",_:32,$/,M1:8,M0:8,$/,D1:8,D0:8,$/, Rest/binary>>,
 none, Len)
  when ((M1-$0)*10 + (M0-$0)) =< 12, ((D1-$0)*10 + (D0-$0)) =< 31 ->
    get_tail(Rest, almost, Len+?SKIP);

These two indicate potentially good matches, so they change the search state from none to almost. They then recursively invoke the search with the Rest of the binary. Depending on what it holds, it will hit one of the following:

get_tail(<<32:8, _/binary>>, found, Len) ->
    {ok, Len};
get_tail(<<32:8, _/binary>>, _, _) ->
    no_match;
get_tail(<<$., _/binary>>, _, _) ->
    no_match;
get_tail(<<_:8, Rest/binary>>, almost, Len) ->
    get_tail(Rest, found, Len+1);
get_tail(<<_:8, Rest/binary>>, State, Len) ->
    get_tail(Rest, State, Len+1).

The first variant here looks for a space character at the front of the rest of the binary, but only when we’re in the found state. That marks the end of a successful search, so for this case, we return ok and the length of the match. The second variant also finds a space character, but in any state other than found; this is an error, so we return no_match.

The third variant here searches for a period/full stop character, written as $. in Erlang. This character isn’t allowed in our match, so if we see it, we return no_match.

The final two variants of get_tail/3 catch all other characters at the front of the binary. If we’re in the almost state, the first of these variants continues the search in the found state. Otherwise, the second variant just continues the search at the next character, keeping the same state.

Now that we’ve seen the get_tail/3 functions, let’s go back and look at the first variant of find_matches/3 again, to tie it all together:

find_matches(<<?STR, Tail/binary>>, Tbl, Acc) ->
    case get_tail(Tail) of
        {ok, More} ->
            {H, Rest} = split_binary(Tail, More),
            {_, Match} = split_binary(H, ?MATCHHEADLEN),
            Result = binary_to_list(Match),
            find_matches(Rest, Tbl, [Result | Acc]);
        no_match ->
            find_matches(Tail, Tbl, Acc)
    end;

If get_tail/1 indicates a match, we split the tail of the binary at More, which is the length of the match. We then take the head of that split and split it again to strip off the unwanted portion of the matched binary. This makes it look like the strings that Ruby prints out, corresponding to the parenthesized portion of Tim’s original regular expression. We then convert the matched binary to a string and store it in our accumulator list.

The main code

The file tbray14.erl contains the main code that invokes the code described so far. It’s pretty much the same as the original tbray5.erl, which I’ve already described in detail, so I won’t repeat that description here. The main difference, other than calling wfbm:find/2 to find matches, is the management of those matches. The code uses Erlang dictionaries to track hit counts for each match, and there’s also code to merge the dictionaries created by multiple Erlang worker processes. Look in the file if you want to see that code.

Results

As I said earlier, the best time I’ve seen from this version is 6.663 seconds on Tim’s o1000k.ap dataset:

$ time erl -smp -noshell -run tbray14 main o1000k.ap
2959: 2006/09/29/Dynamic-IDE
2059: 2006/07/28/Open-Data
1636: 2006/10/02/Cedric-on-Refactoring
1060: 2006/03/30/Teacup
942: 2006/01/31/Data-Protection
842: 2006/10/04/JIS-Reg-FD
838: 2006/10/06/On-Comments
817: 2006/10/02/Size-Matters
682: 2003/09/18/NXML
630: 2003/06/24/IntelligentSearch

real    0m6.663s
user    0m34.530s
sys     0m12.010s

As you can see, the output matches the original Ruby version exactly, which was my goal for this version. The speedup is due to more efficient searching. I believe this efficiency is shown by the CPU time, which is just above 5x of the real time; for tbray5.erl, the CPU usage tends to be about 7x the real time. This version uses fewer Erlang processes as well. I found that it works best when reading 8MB blocks from the file, splitting them into 2 chunks at a newline characters, and then processing each chunk for matches in a separate Erlang process. Thus, tbray14:main/1 is set to these values by default. However, YMMV, so if you want to experiment with different chunk sizes and different file block sizes, do it from the command line like this:

time erl -smp -noshell -run tbray14 main chunkCount o1000k.ap blockSize

where chunkCount is the number of chunks to break each file block into, and blockSize is the size of the block to read from the input data file.

Hopefully Tim will get a chance to see how this version runs on his new machine.

More File Processing with Erlang

September 29th, 2007  |  Published in erlang  |  Bookmark on Pinboard.in

Tim has weighed in with another attempt at using Erlang for his Wide Finder Project, but he’s unhappy with his new solution too. I feel his pain; below, I’ll show another solution which improves on my earlier one, but I’m not entirely happy with this one either.

Tim’s assessment of my earlier solution is accurate. What was worst about my original approach was that it worked only by reading the whole file into memory, which obviously doesn’t work too well for the quarter-gigabyte log files Tim’s dealing with, unless you have a big machine with big memory. Tim also complains about the “number of processes” parameter, but as I’ll show at the very end of this blog entry, it’s not really needed, and it’s there only to allow for experimentation. And finally, Tim said he doesn’t like my ad hoc binary splitting algorithm, but he also points out that for this approach, it’s pretty much necessary, since Erlang doesn’t provide any modules supporting that.

So what improvements does this new version contain?

  • First, it addresses the file size issue by using klacke’s bfile module, which allows me to work with a full-size logfile rather than Tim’s original 10,000 line sample. If klacke hadn’t posted this to the erlang-questions mailing list, I wouldn’t have even tried to create a new solution. It’s a great module.
  • Second, it uses a two-level process hierarchy. I observed that with full-sized logfiles, the number of processes launched to perform matching could be quite large, and those processes would finish and seem to sit around waiting for the parent process to collect their results. Not surprisingly, the more processes the main program launched, the slower and slower it became. The two-level process hierarchy launches collectors whose sole job it is to launch processes to perform matching and then collect their results. This results in far fewer messages sitting around waiting for the main thread to collect them, and also allows for a higher degree of concurrency to be applied to both reading the file and collecting the results.
  • It still performs binary splitting into data chunks ending on newlines, but I think the algorithm is a little improved. Specifically, it accumulates the “leftovers” and passes them along to the next recursion, where they’re efficiently stuck onto the front of the next chunk. Coincidentally, Tim’s latest approach does something similar, but I don’t think it’s as efficient (but I didn’t try it to verify that, so I could be wrong).
  • Finally, at 84 lines including blank ones, the solution has remained relatively brief. This isn’t an improvement, but keeping the code short has been an unstated goal for me. After all, the brevity of the Ruby solution is pretty striking, plus if I have to write hundreds of lines of Erlang without achieving a significant speedup, I might as well do it in hundreds of line of Java, C++, or C instead.

Regardless of these improvements, the best time I achieved with this new solution is 9.8 seconds on an 8-core 2.33 GHz dual Intel Xeon system with 8 GB of RAM. On my dual-core 2.33 GHz 2 GB MacBook Pro, it clocks in at just over 24 seconds. Still too slow.

I’ve been naming my Erlang modules after Tim, given that he’s the originator of this problem, and he’s responsible for my getting even less sleep than usual over the past week. :-) The module for the solution below is called tbray5. The module for my original solution was of course named just tbray. Don’t ask what happened to tbray1 through tbray4; let’s just say that failed experiments are useful too.

There was a tbray6 briefly as well, when I experimented with adding mmap capabilities to klacke’s bfile module. As I mentioned to Tim in email a few days ago, I was wondering whether one could just drop into C, mmap the large logfile into memory, and return a binary representing the whole file back to Erlang. Seems simple enough, and I got it working, but because of mmap‘s alignment restrictions combined with the way the Erlang runtime allocate binaries, I was forced to copy the data into the binary, thus killing any performance benefits mmap might have provided.

Anyway, here’s tbray5.erl, and below, I’ll explain each section. Stop here if you’re not interested in the details.

Compiling and Running

Run the following command to compile the code:

erl -smp -make

To execute it, use one of the following command lines:

erl -smp -noshell -run tbray5 main <numProcs> <logfile>
erl -smp -noshell -run tbray5 main <logfile>

In these command lines, <numProcs> specifies both the number of processes to use for logfile analysis as well the number of 1k blocks to read from the logfile at a time, and <logfile> specifies the name of the logfile to analyze. Use the first command line to experiment with the number of processes to launch and 1k blocks to read. I found that 512 procs/blocks seems to yield the fastest execution times, so the second command line above defaults to 512, but YMMV.

find_match

The find_match function below is the same as always:

find_match("/ongoing/When/" ++ Last) ->
    case lists:member($., Last) of
        false -> 1;
        true -> 0
    end;
find_match(_) -> 0.

process_binary

The process_binary function below, which launches “matcher” processes, is the same as before, too, except I switched from lists:foldl to lists:foldr because it seemed to provide a slight speedup. This function receives the ID of a process to send results to, and a string (as an Erlang binary) that’s assumed to end with a newline. It launches a process that breaks the binary into a list of strings, tokenizes each string, then counts the matches using find_match.

process_binary(Pid, Bin) ->
    spawn(
      fun() ->
              L = string:tokens(binary_to_list(Bin), "\n"),
              V = lists:foldr(
                    fun(Line, Total) ->
                            Tok = string:tokens(Line, " "),
                            Total + find_match(lists:nth(7, Tok))
                    end, 0, L),
              Pid ! V
      end).

break_chunk_on_newline

The break_chunk_on_newline set of functions below breaks Erlang binaries read from the logfile into chunks that end with a newline. The first variant handles the case where the binary is already smaller than the desired chunk size. It just returns a 2-tuple consisting of the list of all chunks obtained so far, along with the remainder as a binary. The second variant does most of the work, splitting the binary into chunks of the desired size and walking them along to ensure they end with newlines, and accumulating all the processed chunks into a list. The third variant just encapsulates the chunk size calculation and passes the initial empty chunk accumulator list.

break_chunk_on_newline(Bin, Pos, All) when (Pos >= size(Bin)) -> {All, Bin};
break_chunk_on_newline(Bin, Pos, All) ->
    {_, <<C:8, _/binary>>} = split_binary(Bin, Pos),
    case C of
        $\n ->
            {Ba, Bb} = split_binary(Bin, Pos+1),
            break_chunk_on_newline(Bb, Pos, All ++ [Ba]);
        _ -> break_chunk_on_newline(Bin, Pos+1, All)
    end.
break_chunk_on_newline(Bin, N) -> break_chunk_on_newline(Bin, size(Bin) div N, []).

spawn_collector

The spawn_collector function below just spawns a function that collects match counts from process_binary processes, and then sends the total matches to another process. It takes a list of binaries as an argument and calls process_binary for each one, passing the collector process ID to each, and then it returns the collector process ID. The two-level process hierarchy I talked about above has collectors at the first level and “matcher” processes, spawned by the collectors, at the second level.

spawn_collector(Bins, Me) ->
    Collector = spawn(
                  fun() ->
                          V = lists:foldr(fun(_, T) -> receive V -> T + V end end, 0, Bins),
                          Me ! V
                  end),
    [process_binary(Collector, B) || B <- Bins],
    Collector.

scan_finish

The scan_finish set of functions below handles the remainder binary, the last one after all file reading and binary splitting is done. It ensures that a collector is spawned to handle the remainder, if there is one. The first variant is called if the remainder is empty, the second one otherwise.

scan_finish(<<>>, _, Pids) -> Pids;
scan_finish(More, Me, Pids) -> [spawn_collector([More], Me) | Pids].

scan_file

The scan_file set of functions below reads chunks of the logfile via bfile:fread, breaks each chunk via break_chunk_on_newline, and spawns collectors to process them. It handles any remainder binaries by prepending them to the front of the next chunk, or when the file is completely read, by passing any remainders to scan_finish. Note that the first variant of scan_file does all the work; the second one just initializes the recursion. The return value of scan_file is the list of collector process IDs.

scan_file(F, N, Readsize, Me, Leftover, Pids) ->
    Rd = bfile:fread(F, Readsize),
    case Rd of
        {ok, Bin} ->
            {Bins, More} = break_chunk_on_newline(list_to_binary([Leftover, Bin]), N),
            scan_file(F, N, Readsize, Me, More, [spawn_collector(Bins, Me) | Pids]);
        eof -> scan_finish(Leftover, Me, Pids)
    end.
scan_file(F, N, Readsize) ->
    scan_file(F, N, Readsize, self(), <<>>, []).

start

The start functions below initializes bfile, calls scan_file, and then collects results from the collector processes. The second variant sets the number of bytes to read at a time from the logfile to a default of 512 1k blocks.

start(Num, File, Readsize) ->
    bfile:load_driver(),
    {ok, F} = bfile:fopen(File, "r"),
    Pids = scan_file(F, Num, Readsize),
    bfile:fclose(F),
    lists:foldr(fun(_, T) -> receive V -> T + V end end, 0, Pids).
start(Num, File) ->
    start(Num, File, 512*1024).

main

And finally, the main functions below handle invocations from the shell command line, as explained earlier. The second variant runs a list of values used for the number of processes and the number of bytes to read from the logfile at a time, and prints out a list consisting of each value and the number of seconds it took to execute for that value.

main([N, F]) ->
    io:format("~p matches found~n", [start(list_to_integer(N), F)]),
    halt();
main([F]) ->
    Sz = [16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384],
    Results = lists:map(
      fun(S) ->
              Start = now(),
              start(S, F, S*1024),
              {S, time_diff(Start, now())}
      end,
      Sz),
    io:format("~p~n", [Results]),
    halt().

The time_diff function (not shown but included in tbray5.erl), which I borrowed from somewhere a few months back, just helps calculate the execution times for the second variant of main shown above.

Tim Bray and Erlang

September 23rd, 2007  |  Published in erlang  |  Bookmark on Pinboard.in

Tim’s playing with Erlang, trying to rewrite a Ruby program for analyzing his website logs. His initial reaction appeared to be one of disgust, since his Erlang program was an order of magnitude slower than the Ruby version. Thankfully, though, a bunch of people have since jumped in and made the solution more palatable (check the comments on his posting).

Reading between the lines, it seems that Tim was hoping to take advantage of Erlang’s concurrency to put his multicore machines to work analyzing his logs. Unfortunately, none of the answers posted in the blog comments seem to provide that, so I decided to take a crack at it myself.

First I wrote a tiny little Erlang program to read in Tim’s entire sample logfile using file:read_file, turn the result into a list of strings using binary_to_list, and then process the strings. Simple timings showed the binary_to_list call to be the slowest part, so I decided to throw Erlang’s multiprocess capability at it. If you’re familiar with pthreads or Java threads, think of an Erlang “process” as basically a very, very lightweight thread.

Here’s my solution:

-module(tbray).
-export([start/2]).

find_match("/ongoing/When/" ++ Last) ->
    case lists:member($., Last) of
        false -> 1;
        true -> 0
    end;
find_match(_) -> 0.

process_binary(Pid, Bin) ->
    spawn(fun() ->
        L = string:tokens(binary_to_list(Bin), "\\n"),
        V = lists:foldl(
            fun(Line, Total) ->
                Total + find_match(lists:nth(7, string:tokens(Line, " "))) end,
            0, L),
        Pid ! V
        end).

split_on_newline(Bin, N, All) when size(Bin) < N ->
    All ++ [Bin];
split_on_newline(Bin, N, All) ->
    {_, <<C:8, _/binary>>} = split_binary(Bin, N),
    case C of
        $\\n ->
          {B21, B22} = split_binary(Bin, N+1),
          split_on_newline(B22, N, All ++ [B21]);
        _ -> split_on_newline(Bin, N+1, All)
    end.
split_on_newline(Bin, N) when N == size(Bin) -> [Bin];
split_on_newline(Bin, N) -> split_on_newline(Bin, N, []).

start(Num, Input) ->
    {ok, Data} = file:read_file(Input),
    Bins = split_on_newline(Data, size(Data) div Num),
    Me = self(),
    Pids = [process_binary(Me, B) || B <- Bins],
    lists:foldl(
        fun(_, Total) -> receive X -> Total + X end end,
        0, Pids).

The way this solution works is that it uses multiple Erlang processes to convert chunks of the input file to lists of strings and process them for matches. Begin with the start/2 function at the very bottom. First, we read the file in one shot, then split it into Num chunks, with the split_on_newline function variants being mindful to end each chunk on a newline character so we don’t split lines across chunks. We then pass each chunk to the process_binary/2 function using a list comprehension. Each process_binary/2 call spawns a new process to first convert its chunk to a list of strings and then process those strings for matches.

Now let’s time it. My MacBook Pro has two cores, so let’s enable SMP, and bump the Erlang process limit up to 60,000. First, we’ll compile the module and time it with just a single process:

$ erl -smp enable +P 60000
1> c(tbray).
{ok,tbray}
2> timer:tc(tbray, start, [1, "o10k.ap"]).
{661587,1101}

OK, at 0.66 seconds, we’re already a lot faster than Tim’s approach (the second value, 1101, is the number of matches we found), but can it go faster? Let’s try 2 processes:

3> timer:tc(tbray, start, [2, "o10k.ap"]).
{472067,1101}

That dropped us to 0.47 seconds, which is not an insignificant speedup. Do more processes help?

4> timer:tc(tbray, start, [4, "o10k.ap"]).
{390786,1101}

Yes, at 4 processes we drop to 0.39 seconds. Let’s go up a few orders of magnitude:

5> timer:tc(tbray, start, [40, "o10k.ap"]).
{380753,1101}
6> timer:tc(tbray, start, [400, "o10k.ap"]).
{322979,1101}
7> timer:tc(tbray, start, [4000, "o10k.ap"]).
{316857,1101}
8> timer:tc(tbray, start, [40000, "o10k.ap"]).
{318153,1101}

As we increase the number of Erlang processes, our performance improves, up to a point. At 40,000 processes we’re slower than we were at 4000. Maybe there’s a better number in between? It turns out that despite the numbers listed above, once you get above 400 processes or so, the numbers remain about the same. The best I got on my MacBook Pro after numerous runs was 0.301 seconds with 2400 processes, but the average best seems to be about 0.318 seconds. The performance of this approach comes pretty close to other solutions that rely on external non-Erlang assistance, at least for Tim’s sample dataset on this machine.

I also tried it on an 8-core (2 Intel Xeon E5345 CPUs) 64-bit Dell box running Linux, and it clocked in at 0.126 seconds with 2400 processes, and I saw a 0.124 seconds with 1200 processes. I believe this utilization of multiple cores was exactly what Tim was looking for.

If you’re a Java or C++ programmer, note the ease with which we can spawn Erlang processes and have them communicate, and note how quickly we can launch thousands of processes. This is what Tim was after, I believe, so hopefully my example provides food for thought in that area. BTW, I’m no Erlang expert, so if anyone wants to suggest improvements to what I’ve written, please feel free to comment here.