One More Erlang Wide Finder

October 14th, 2007  |  Published in erlang, performance  |  13 Comments  |  Bookmark on Pinboard.in

[Update: WordPress totally destroyed the original version of this posting, so I had to almost completely rewrite it. :-( ]

Since posting my second version of an Erlang solution to Tim Bray’s Wide Finder, which Tim’s apparently been getting some good performance out of, I haven’t had time to try anything new. I mean, I work for a startup, so you could say I’m a bit busy. But the fact that the output of my earlier solution was just a number of semi-matches, rather than the list of top ten matches that the original Ruby version produced, was gnawing at me. In short, I didn’t finish the job. So last night, as I watched the marathon Red Sox playoff game, I worked on getting the output to match that of the Ruby version.

The executive summary is that this version has output exactly like that of Ruby, and as an added bonus it runs almost twice as fast as my original tbray5.erl code even though it does more work. On my 8-core 2.33 GHz Intel Xeon Linux box, the best time I’ve seen is 6.663 sec. It has more lines of code, though.

You can grab tbray14.erl and wfbm.erl if you’d like to try them out. Or just run the following commands:

wget http://steve.vinoski.net/code/tbray14.erl http://steve.vinoski.net/code/wfbm.erl
erl -make -smp
erl -smp -noshell -run tbray14 main o1000k.ap

Below find the details of how it works.

Boyer-Moore string seaching

Tim’s searching for data in his web logs that match this pattern:

GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+)\\s

There’s a trailing space at the end of the pattern, hence that last \s. Obviously, the first part of the pattern is fixed, the second part variable. The part in parentheses is what Tim wants to see in the final top ten output list.

One of the problems with my previous version was how it broke the data up so it could look for matches. It used Erlang’s string:tokens function to first break on newlines, and then called it again to divide each line into space-separated chunks. Using that function also meant first converting Erlang binaries to strings. All in all, too slow.

I decided to instead pursue solutions that let me leave the data in the form of an Erlang binary and search through it that way. I wrote a character-by-character thing that worked, but it was also too slow. I tried various regular expression Erlang packages, as well as just using Erlang’s built-in pattern matching, but they were too slow too.

I finally settled on a combination of Boyer-Moore and Erlang’s built-in matching. It lets me advance through the data relatively quickly looking for the fixed part of that pattern, and then use Erlang’s pattern matching to get the rest. The code to do this is in wfbm.erl; let’s break it down function by function.

Constants

First, some constants:

-define(STR, "GET /ongoing/When").
-define(REVSTR, "mehW/gniogno/ TEG").
-define(STRLEN, length(?STR)).
-define(MATCHHEADLEN, length("/200x/")).
-define(SKIP, length("/200x/2007/10/15/")).

The first one, STR, defines the fixed part of the pattern we’re looking for, while REVSTR is the same string, only backwards. Boyer-Moore works by searching backwards, so we need the backwards version to let us do that. STRLEN is just the length of the fixed search string. MATCHHEADLEN is the length of the text we need to drop off the variable part of the patterns we find, so that our final output strings match the original Ruby output. And finally, SKIP is just the length of the front part of the variable part of the pattern, which has variable content but is always the same length.

Shift table

Boyer-Moore searching shifts the search string along the text being searched based on which characters don’t match and where those characters appear in the search string. The following code precomputes a table that tells us how to shift the search string along:

set_shifts(_, Count, Tbl) when Count =:= ?STRLEN - 1 ->
    Tbl;
set_shifts([H|T], Count, Tbl) ->
    New = ?STRLEN - Count - 1,
    NTbl = dict:store(H, New, Tbl),
    set_shifts(T, Count+1, NTbl).

set_defaults([], Tbl) ->
    Tbl;
set_defaults([V|T], Tbl) ->
    set_defaults(T, dict:store(V, ?STRLEN, Tbl)).

init() ->
    set_shifts(?STR, 0, set_defaults(lists:seq(1, 255), dict:new())).

The init/0 function is called to initialize the shift table. Callers are expected to invoke this once up front, and then pass the table in whenever they want to search. The set_defaults/2 function just sets the shift amount for all characters to the length of the search string, and then the set_shifts/3 function sets the correct shift values in the same table for the characters in the search string.

Finding matches

The exported find/2 function (not shown) calls find_matches/3 to get the work done. This function comes in three forms:

find_matches(<<?STR, Tail/binary>>, Tbl, Acc) ->
    case get_tail(Tail) of
        {ok, More} ->
            {H, Rest} = split_binary(Tail, More),
            {_, Match} = split_binary(H, ?MATCHHEADLEN),
            Result = binary_to_list(Match),
            find_matches(Rest, Tbl, [Result | Acc]);
        no_match ->
            find_matches(Tail, Tbl, Acc)
    end;
find_matches(Bin, _, Acc) when size(Bin) < ?STRLEN ->
    Acc;
find_matches(Bin, Tbl, Acc) ->
    {Front, _} = split_binary(Bin, ?STRLEN),
    Shift = get_shift(lists:reverse(binary_to_list(Front)), ?REVSTR, Tbl),
    {_, Next} = split_binary(Bin, Shift),
    find_matches(Next, Tbl, Acc).

The middle variant is invoked when we have searched to the end of the binary, and it’s too short to contain any more matches. This version just returns a list of the accumulated matches.

The first variant is invoked when the front of the binary matches the fixed portion of the pattern we’re searching for. Note that this isn’t strictly Boyer-Moore, since that algorithm searches in reverse, unless Erlang argument pattern matching also matches in reverse, which is unlikely. When the fixed part matches, we have to check the next part to ensure that it matches the variable part of the pattern, and we call get_tail/1 to do that; that’s described later, below.

The last variant of find_matches/3 gets called when the front of the binary doesn’t match. It first splits the binary to take enough characters off the front of the binary to match against the fixed search string, converts that first part to a string, reverses it, and passes it to get_shift/3:

get_shift([C1|T1], [C2|T2], Tbl) when C1 =:= C2 ->
    get_shift(T1, T2, Tbl);
get_shift([C1|_], _, Tbl) ->
    dict:fetch(C1, Tbl).

This pair of functions simply walks the reversed string character-by-character until it finds a mismatch, and returns the shift amount from the Boyer-Moore table for that character. The find_matches/3 function then uses that shift amount to split the binary again at the right spot, and then invoke itself recursively on the second half of the split binary to continue looking for matches.

Now, get_tail/1 is what find_matches/3 calls when the front of the binary matches the fixed part of the search pattern and we need to determine whether the tail of the binary matches the variable part of the search pattern. It has multiple variants. First, the easy ones:

get_tail(<<>>) ->
    no_match;
get_tail(Bin) ->
    get_tail(Bin, none, 0).

The first returns the atom no_match when an empty binary is passed in. The second variant calls get_tail/3, which does all the work. We pass in the atom none to initialize our search state, and we initialize the match length to zero.

The get_tail/3 function has a number of variants. The first four, shown below, just reject binaries that don’t match the variable portion of the search pattern:

get_tail(<<"/20",_:8,"x/",_:32,$/,_:16,$/,_:16,$/, Rest/binary>>, _, _)
  when size(Rest) =:= 0 ->
    no_match;
get_tail(<<"/20",_:8,"x/",_:32,$/,_:16,$/,_:16,$/,32:8, _/binary>>, _, _) ->
    no_match;
get_tail(<<"/19",_:8,"x/",_:32,$/,_:16,$/,_:16,$/, Rest/binary>>, _, _)
  when size(Rest) =:= 0 ->
    no_match;
get_tail(<<"/19",_:8,"x/",_:32,$/,_:16,$/,_:16,$/,32:8, _/binary>>, _, _) ->
    no_match;

We match the front of the variable portion of the pattern, where the date numbers appear, but we disallow anything that has an empty binary following it, or is followed immediately by a space character (shown here as 32:8, where 32 is the ASCII value for the space character). We do these matches twice, once for strings that start with "/20" and again for strings that start with "/19".

When the front of the binary matches the date portion of the variable part of our search pattern, we hit the following get_tail/3 variants:

get_tail(<<"/20",_:8,"x/",_:32,$/,M1:8,M0:8,$/,D1:8,D0:8,$/, Rest/binary>>,
 none, Len)
   when ((M1-$0)*10 + (M0-$0)) =< 12, ((D1-$0)*10 + (D0-$0)) =< 31 ->
    get_tail(Rest, almost, Len+?SKIP);
get_tail(<<"/19",_:8,"x/",_:32,$/,M1:8,M0:8,$/,D1:8,D0:8,$/, Rest/binary>>,
 none, Len)
  when ((M1-$0)*10 + (M0-$0)) =< 12, ((D1-$0)*10 + (D0-$0)) =< 31 ->
    get_tail(Rest, almost, Len+?SKIP);

These two indicate potentially good matches, so they change the search state from none to almost. They then recursively invoke the search with the Rest of the binary. Depending on what it holds, it will hit one of the following:

get_tail(<<32:8, _/binary>>, found, Len) ->
    {ok, Len};
get_tail(<<32:8, _/binary>>, _, _) ->
    no_match;
get_tail(<<$., _/binary>>, _, _) ->
    no_match;
get_tail(<<_:8, Rest/binary>>, almost, Len) ->
    get_tail(Rest, found, Len+1);
get_tail(<<_:8, Rest/binary>>, State, Len) ->
    get_tail(Rest, State, Len+1).

The first variant here looks for a space character at the front of the rest of the binary, but only when we’re in the found state. That marks the end of a successful search, so for this case, we return ok and the length of the match. The second variant also finds a space character, but in any state other than found; this is an error, so we return no_match.

The third variant here searches for a period/full stop character, written as $. in Erlang. This character isn’t allowed in our match, so if we see it, we return no_match.

The final two variants of get_tail/3 catch all other characters at the front of the binary. If we’re in the almost state, the first of these variants continues the search in the found state. Otherwise, the second variant just continues the search at the next character, keeping the same state.

Now that we’ve seen the get_tail/3 functions, let’s go back and look at the first variant of find_matches/3 again, to tie it all together:

find_matches(<<?STR, Tail/binary>>, Tbl, Acc) ->
    case get_tail(Tail) of
        {ok, More} ->
            {H, Rest} = split_binary(Tail, More),
            {_, Match} = split_binary(H, ?MATCHHEADLEN),
            Result = binary_to_list(Match),
            find_matches(Rest, Tbl, [Result | Acc]);
        no_match ->
            find_matches(Tail, Tbl, Acc)
    end;

If get_tail/1 indicates a match, we split the tail of the binary at More, which is the length of the match. We then take the head of that split and split it again to strip off the unwanted portion of the matched binary. This makes it look like the strings that Ruby prints out, corresponding to the parenthesized portion of Tim’s original regular expression. We then convert the matched binary to a string and store it in our accumulator list.

The main code

The file tbray14.erl contains the main code that invokes the code described so far. It’s pretty much the same as the original tbray5.erl, which I’ve already described in detail, so I won’t repeat that description here. The main difference, other than calling wfbm:find/2 to find matches, is the management of those matches. The code uses Erlang dictionaries to track hit counts for each match, and there’s also code to merge the dictionaries created by multiple Erlang worker processes. Look in the file if you want to see that code.

Results

As I said earlier, the best time I’ve seen from this version is 6.663 seconds on Tim’s o1000k.ap dataset:

$ time erl -smp -noshell -run tbray14 main o1000k.ap
2959: 2006/09/29/Dynamic-IDE
2059: 2006/07/28/Open-Data
1636: 2006/10/02/Cedric-on-Refactoring
1060: 2006/03/30/Teacup
942: 2006/01/31/Data-Protection
842: 2006/10/04/JIS-Reg-FD
838: 2006/10/06/On-Comments
817: 2006/10/02/Size-Matters
682: 2003/09/18/NXML
630: 2003/06/24/IntelligentSearch

real    0m6.663s
user    0m34.530s
sys     0m12.010s

As you can see, the output matches the original Ruby version exactly, which was my goal for this version. The speedup is due to more efficient searching. I believe this efficiency is shown by the CPU time, which is just above 5x of the real time; for tbray5.erl, the CPU usage tends to be about 7x the real time. This version uses fewer Erlang processes as well. I found that it works best when reading 8MB blocks from the file, splitting them into 2 chunks at a newline characters, and then processing each chunk for matches in a separate Erlang process. Thus, tbray14:main/1 is set to these values by default. However, YMMV, so if you want to experiment with different chunk sizes and different file block sizes, do it from the command line like this:

time erl -smp -noshell -run tbray14 main chunkCount o1000k.ap blockSize

where chunkCount is the number of chunks to break each file block into, and blockSize is the size of the block to read from the input data file.

Hopefully Tim will get a chance to see how this version runs on his new machine.

Responses

  1. Pete Kirkham says:

    October 14th, 2007 at 2:43 pm (#)

    I did do a BM variant earlier as an alternative to the 8-wide character matching, which degrades with a pattern such ” \”GET ” as ‘ ‘ is more common in the input than ‘G’ – http://www.tincancamera.com/examples/wide-finder/src/tbray8.cpp – but it didn’t give an improvement measurable above experimental noise overall. You can go a bit further in BM variants too, if you know what you’re running on supports SIMD operations, which give small-scale parallelisms. Either way, that sort of code should be in the regex library, not in a user’s script.

    I’ve kind-of given up, since the MPI ruby variant http://www.public.iastate.edu/~crb002/widefinder.rb probably hits the sweet spot between elegant and efficient-enough for me.

  2. Pavel Rodionov says:

    October 14th, 2007 at 9:58 pm (#)

    Sorry for offtop question. Steve what’s your favourite erlang editor?

  3. Caoyuan Deng says:

    October 15th, 2007 at 2:13 am (#)

    Hi Steve,

    I also refined my widefinder work in Erlang, which now took a bout 4.596 sec on my 2-core MacBook:
    http://blogtrader.net/page/dcaoyuan?entry=the_erlang_way_was_tim

  4. steve says:

    October 15th, 2007 at 7:38 am (#)

    Hi Caoyuan, I grabbed your code from your blog and indeed, it is fast on my Macbook Pro, where the best time I saw was 5.063 sec as measured by the bash time command. However, if I run your code on my 8-core Linux box, it actually gets slower. I ran it in a loop 10 times, and the best time I saw was 13.872 sec, and user/CPU time was 16.150 sec, so it’s apparently not using the multiple cores very well. Maybe I’ll poke through it to find out why.

  5. Dilip says:

    October 15th, 2007 at 12:53 pm (#)

    Are we using the wrong example to show case Erlang’s capabilities? From all your postings it seems like you are jumping through considerable hoops to just get Erlang to match the timings most other languages seem to do in their sleep?

    Don’t get me wrong — I am not trying to troll, I am just trying to understand why this particular problem requires so many attempts when writing in Erlang.

  6. steve says:

    October 15th, 2007 at 2:02 pm (#)

    Hi Dilip, no worries, it’s a fair question.

    It’s true that Erlang isn’t really designed for file processing problems like this, but it is designed for highly concurrent applications. Tim Bray’s original Wide Finder problem was that he was trying to see if there was a good way to exploit a many-core machine for problems such as logfile analysis. While maybe it wasn’t the best problem to choose for maximum exploitation of Erlang, we’ve still learned a lot by working on it. I would never claim to be an Erlang expert but I’m definitely closer for having implemented multiple Wide Finder solutions.

    Tim has yet to post his results from trying all this stuff on his many-core system, so it will be interesting to see how well Erlang fares under those conditions.

  7. Caoyuan Deng says:

    October 15th, 2007 at 4:04 pm (#)

    Hi Steve,

    I tried to solve the problem on many cpus in my new blog:
    http://blogtrader.net/page/dcaoyuan?entry=learning_coding_parallelization_was_tim

    Would you please take another try on your 8-cpu for my new code ?

    BR,

  8. steve says:

    October 15th, 2007 at 8:49 pm (#)

    Hi Caoyuan, the best time I saw for your newest version was 4.920 sec on my 8-core Linux box. Fast! However, user time was only 14.751 sec, so I’m not sure it’s using all the cores that well. Perhaps you’re getting down to where I/O is becoming a more significant factor.

  9. Matt says:

    October 20th, 2007 at 5:40 pm (#)

    Quick question:

    Is this a typo?

    -define(STR, “GET /ongoing/When”).
    -define(REVSTR, “mehW/gniogno/ TEG”).

    “mehW” != “When” reversed because “m” != “n”. If you really are doing a comparison with REVSTR, I’d expect it to fail the match every time.

  10. steve says:

    October 20th, 2007 at 8:38 pm (#)

    Matt: a most excellent catch. Yes, that is a typo. It’s funny — I stared at that a few times because it looked funny, and yet never caught it.

    For this test case, it doesn’t change the results or the performance. Comparisons are not done against that string; it’s used only to find shift values for advancing through the file. For the o1000k.ap dataset, it doesn’t matter because you get the same answer either way.

    I’ve fixed this in the posted wfbm.erl file. But you want to grab tbray15.erl and wfbm3.erl instead anyway; they’re about 50% faster.

    Thanks, Matt!

  11. steve says:

    October 20th, 2007 at 11:44 pm (#)

    Matt: in my previous comment I was wrong that the bug you found didn’t affect the results. It required another fix as well, which decreases the performance of this solution by about 50%, which obviously is significant. Thus, all the more reason to go with the newer tbray15.erl and wfbm3.erl as mentioned in the previous comment.

  12. steve says:

    October 21st, 2007 at 10:15 am (#)

    Or better yet, see tbray16.erl and wfbm4.erlthey’re faster than everything else I’ve posted to date.

  13. Matt says:

    October 22nd, 2007 at 6:22 am (#)

    Good stuff. I’ve just started digging into Erlang, and have been doing tons of random code-reviews at work ;-)

    It is interesting to see the work you are doing with Erlang. Keep it up!