2007 October :: Steve Vinoski's Blog

Archive for October, 2007

One More Erlang Wide Finder

October 14th, 2007 | Published in erlang, performance | Bookmark on Pinboard.in

[Update: WordPress totally destroyed the original version of this posting, so I had to almost completely rewrite it. :-( ]

Since posting my second version of an Erlang solution to Tim Bray’s Wide Finder, which Tim’s apparently been getting some good performance out of, I haven’t had time to try anything new. I mean, I work for a startup, so you could say I’m a bit busy. But the fact that the output of my earlier solution was just a number of semi-matches, rather than the list of top ten matches that the original Ruby version produced, was gnawing at me. In short, I didn’t finish the job. So last night, as I watched the marathon Red Sox playoff game, I worked on getting the output to match that of the Ruby version.

The executive summary is that this version has output exactly like that of Ruby, and as an added bonus it runs almost twice as fast as my original tbray5.erl code even though it does more work. On my 8-core 2.33 GHz Intel Xeon Linux box, the best time I’ve seen is 6.663 sec. It has more lines of code, though.

You can grab tbray14.erl and wfbm.erl if you’d like to try them out. Or just run the following commands:

wget http://steve.vinoski.net/code/tbray14.erl http://steve.vinoski.net/code/wfbm.erl
erl -make -smp
erl -smp -noshell -run tbray14 main o1000k.ap

Below find the details of how it works.

Boyer-Moore string seaching

Tim’s searching for data in his web logs that match this pattern:

GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+)\\s

There’s a trailing space at the end of the pattern, hence that last \s. Obviously, the first part of the pattern is fixed, the second part variable. The part in parentheses is what Tim wants to see in the final top ten output list.

One of the problems with my previous version was how it broke the data up so it could look for matches. It used Erlang’s string:tokens function to first break on newlines, and then called it again to divide each line into space-separated chunks. Using that function also meant first converting Erlang binaries to strings. All in all, too slow.

I decided to instead pursue solutions that let me leave the data in the form of an Erlang binary and search through it that way. I wrote a character-by-character thing that worked, but it was also too slow. I tried various regular expression Erlang packages, as well as just using Erlang’s built-in pattern matching, but they were too slow too.

I finally settled on a combination of Boyer-Moore and Erlang’s built-in matching. It lets me advance through the data relatively quickly looking for the fixed part of that pattern, and then use Erlang’s pattern matching to get the rest. The code to do this is in wfbm.erl; let’s break it down function by function.

Constants

First, some constants:

-define(STR, "GET /ongoing/When").
-define(REVSTR, "mehW/gniogno/ TEG").
-define(STRLEN, length(?STR)).
-define(MATCHHEADLEN, length("/200x/")).
-define(SKIP, length("/200x/2007/10/15/")).

The first one, STR, defines the fixed part of the pattern we’re looking for, while REVSTR is the same string, only backwards. Boyer-Moore works by searching backwards, so we need the backwards version to let us do that. STRLEN is just the length of the fixed search string. MATCHHEADLEN is the length of the text we need to drop off the variable part of the patterns we find, so that our final output strings match the original Ruby output. And finally, SKIP is just the length of the front part of the variable part of the pattern, which has variable content but is always the same length.

Shift table

Boyer-Moore searching shifts the search string along the text being searched based on which characters don’t match and where those characters appear in the search string. The following code precomputes a table that tells us how to shift the search string along:

set_shifts(_, Count, Tbl) when Count =:= ?STRLEN - 1 ->
    Tbl;
set_shifts([H|T], Count, Tbl) ->
    New = ?STRLEN - Count - 1,
    NTbl = dict:store(H, New, Tbl),
    set_shifts(T, Count+1, NTbl).

set_defaults([], Tbl) ->
    Tbl;
set_defaults([V|T], Tbl) ->
    set_defaults(T, dict:store(V, ?STRLEN, Tbl)).

init() ->
    set_shifts(?STR, 0, set_defaults(lists:seq(1, 255), dict:new())).

The init/0 function is called to initialize the shift table. Callers are expected to invoke this once up front, and then pass the table in whenever they want to search. The set_defaults/2 function just sets the shift amount for all characters to the length of the search string, and then the set_shifts/3 function sets the correct shift values in the same table for the characters in the search string.

Finding matches

The exported find/2 function (not shown) calls find_matches/3 to get the work done. This function comes in three forms:

find_matches(<<?STR, Tail/binary>>, Tbl, Acc) ->
    case get_tail(Tail) of
        {ok, More} ->
            {H, Rest} = split_binary(Tail, More),
            {_, Match} = split_binary(H, ?MATCHHEADLEN),
            Result = binary_to_list(Match),
            find_matches(Rest, Tbl, [Result | Acc]);
        no_match ->
            find_matches(Tail, Tbl, Acc)
    end;
find_matches(Bin, _, Acc) when size(Bin) < ?STRLEN ->
    Acc;
find_matches(Bin, Tbl, Acc) ->
    {Front, _} = split_binary(Bin, ?STRLEN),
    Shift = get_shift(lists:reverse(binary_to_list(Front)), ?REVSTR, Tbl),
    {_, Next} = split_binary(Bin, Shift),
    find_matches(Next, Tbl, Acc).

The middle variant is invoked when we have searched to the end of the binary, and it’s too short to contain any more matches. This version just returns a list of the accumulated matches.

The first variant is invoked when the front of the binary matches the fixed portion of the pattern we’re searching for. Note that this isn’t strictly Boyer-Moore, since that algorithm searches in reverse, unless Erlang argument pattern matching also matches in reverse, which is unlikely. When the fixed part matches, we have to check the next part to ensure that it matches the variable part of the pattern, and we call get_tail/1 to do that; that’s described later, below.

The last variant of find_matches/3 gets called when the front of the binary doesn’t match. It first splits the binary to take enough characters off the front of the binary to match against the fixed search string, converts that first part to a string, reverses it, and passes it to get_shift/3:

get_shift([C1|T1], [C2|T2], Tbl) when C1 =:= C2 ->
    get_shift(T1, T2, Tbl);
get_shift([C1|_], _, Tbl) ->
    dict:fetch(C1, Tbl).

This pair of functions simply walks the reversed string character-by-character until it finds a mismatch, and returns the shift amount from the Boyer-Moore table for that character. The find_matches/3 function then uses that shift amount to split the binary again at the right spot, and then invoke itself recursively on the second half of the split binary to continue looking for matches.

Now, get_tail/1 is what find_matches/3 calls when the front of the binary matches the fixed part of the search pattern and we need to determine whether the tail of the binary matches the variable part of the search pattern. It has multiple variants. First, the easy ones:

get_tail(<<>>) ->
    no_match;
get_tail(Bin) ->
    get_tail(Bin, none, 0).

The first returns the atom no_match when an empty binary is passed in. The second variant calls get_tail/3, which does all the work. We pass in the atom none to initialize our search state, and we initialize the match length to zero.

The get_tail/3 function has a number of variants. The first four, shown below, just reject binaries that don’t match the variable portion of the search pattern:

get_tail(<<"/20",_:8,"x/",_:32,$/,_:16,$/,_:16,$/, Rest/binary>>, _, _)
  when size(Rest) =:= 0 ->
    no_match;
get_tail(<<"/20",_:8,"x/",_:32,$/,_:16,$/,_:16,$/,32:8, _/binary>>, _, _) ->
    no_match;
get_tail(<<"/19",_:8,"x/",_:32,$/,_:16,$/,_:16,$/, Rest/binary>>, _, _)
  when size(Rest) =:= 0 ->
    no_match;
get_tail(<<"/19",_:8,"x/",_:32,$/,_:16,$/,_:16,$/,32:8, _/binary>>, _, _) ->
    no_match;

We match the front of the variable portion of the pattern, where the date numbers appear, but we disallow anything that has an empty binary following it, or is followed immediately by a space character (shown here as 32:8, where 32 is the ASCII value for the space character). We do these matches twice, once for strings that start with "/20" and again for strings that start with "/19".

When the front of the binary matches the date portion of the variable part of our search pattern, we hit the following get_tail/3 variants:

get_tail(<<"/20",_:8,"x/",_:32,$/,M1:8,M0:8,$/,D1:8,D0:8,$/, Rest/binary>>,
 none, Len)
   when ((M1-$0)*10 + (M0-$0)) =< 12, ((D1-$0)*10 + (D0-$0)) =< 31 ->
    get_tail(Rest, almost, Len+?SKIP);
get_tail(<<"/19",_:8,"x/",_:32,$/,M1:8,M0:8,$/,D1:8,D0:8,$/, Rest/binary>>,
 none, Len)
  when ((M1-$0)*10 + (M0-$0)) =< 12, ((D1-$0)*10 + (D0-$0)) =< 31 ->
    get_tail(Rest, almost, Len+?SKIP);

These two indicate potentially good matches, so they change the search state from none to almost. They then recursively invoke the search with the Rest of the binary. Depending on what it holds, it will hit one of the following:

get_tail(<<32:8, _/binary>>, found, Len) ->
    {ok, Len};
get_tail(<<32:8, _/binary>>, _, _) ->
    no_match;
get_tail(<<$., _/binary>>, _, _) ->
    no_match;
get_tail(<<_:8, Rest/binary>>, almost, Len) ->
    get_tail(Rest, found, Len+1);
get_tail(<<_:8, Rest/binary>>, State, Len) ->
    get_tail(Rest, State, Len+1).

The first variant here looks for a space character at the front of the rest of the binary, but only when we’re in the found state. That marks the end of a successful search, so for this case, we return ok and the length of the match. The second variant also finds a space character, but in any state other than found; this is an error, so we return no_match.

The third variant here searches for a period/full stop character, written as $. in Erlang. This character isn’t allowed in our match, so if we see it, we return no_match.

The final two variants of get_tail/3 catch all other characters at the front of the binary. If we’re in the almost state, the first of these variants continues the search in the found state. Otherwise, the second variant just continues the search at the next character, keeping the same state.

Now that we’ve seen the get_tail/3 functions, let’s go back and look at the first variant of find_matches/3 again, to tie it all together:

find_matches(<<?STR, Tail/binary>>, Tbl, Acc) ->
    case get_tail(Tail) of
        {ok, More} ->
            {H, Rest} = split_binary(Tail, More),
            {_, Match} = split_binary(H, ?MATCHHEADLEN),
            Result = binary_to_list(Match),
            find_matches(Rest, Tbl, [Result | Acc]);
        no_match ->
            find_matches(Tail, Tbl, Acc)
    end;

If get_tail/1 indicates a match, we split the tail of the binary at More, which is the length of the match. We then take the head of that split and split it again to strip off the unwanted portion of the matched binary. This makes it look like the strings that Ruby prints out, corresponding to the parenthesized portion of Tim’s original regular expression. We then convert the matched binary to a string and store it in our accumulator list.

The main code

The file tbray14.erl contains the main code that invokes the code described so far. It’s pretty much the same as the original tbray5.erl, which I’ve already described in detail, so I won’t repeat that description here. The main difference, other than calling wfbm:find/2 to find matches, is the management of those matches. The code uses Erlang dictionaries to track hit counts for each match, and there’s also code to merge the dictionaries created by multiple Erlang worker processes. Look in the file if you want to see that code.

Results

As I said earlier, the best time I’ve seen from this version is 6.663 seconds on Tim’s o1000k.ap dataset:

$ time erl -smp -noshell -run tbray14 main o1000k.ap
2959: 2006/09/29/Dynamic-IDE
2059: 2006/07/28/Open-Data
1636: 2006/10/02/Cedric-on-Refactoring
1060: 2006/03/30/Teacup
942: 2006/01/31/Data-Protection
842: 2006/10/04/JIS-Reg-FD
838: 2006/10/06/On-Comments
817: 2006/10/02/Size-Matters
682: 2003/09/18/NXML
630: 2003/06/24/IntelligentSearch

real    0m6.663s
user    0m34.530s
sys     0m12.010s

As you can see, the output matches the original Ruby version exactly, which was my goal for this version. The speedup is due to more efficient searching. I believe this efficiency is shown by the CPU time, which is just above 5x of the real time; for tbray5.erl, the CPU usage tends to be about 7x the real time. This version uses fewer Erlang processes as well. I found that it works best when reading 8MB blocks from the file, splitting them into 2 chunks at a newline characters, and then processing each chunk for matches in a separate Erlang process. Thus, tbray14:main/1 is set to these values by default. However, YMMV, so if you want to experiment with different chunk sizes and different file block sizes, do it from the command line like this:

time erl -smp -noshell -run tbray14 main chunkCount o1000k.ap blockSize

where chunkCount is the number of chunks to break each file block into, and blockSize is the size of the block to read from the input data file.

Hopefully Tim will get a chance to see how this version runs on his new machine.

Wide Finder in Python

October 7th, 2007 | Published in performance, python | Bookmark on Pinboard.in

Fredrick Lundh has posted some really nice code, along with detailed explanations, showing Tim’s Wide Finder implemented in various ways in Python, using both threads and processes. I ran his wf-6.py on an 8-core 2.33 GHz Intel Xeon Linux box with 8GB of RAM, and it ran best at 5 processes, clocking in at 0.336 sec. Another process-based approach, wf-5.py, executed best with 8 processes, presumably one per core, in 0.358 sec. The multithreaded approach, wf-4.py, ran best with 5 threads, at 1.402 sec (but also got the same result with 19 threads, go figure). Using the same dataset, I get 11.8 sec from my best Erlang effort, which is obviously considerably slower. I used the bash shell time built-in for all measurements, for consistency.

I’ve been coding in Python for years, but lately I’ve been using it a lot, and can’t get enough of it. It’s so clean, and as Fredrick’s code shows, it’s got some extremely powerful capabilities that are also extremely easy to use. It’s obviously smoking fast as well. Some folks who commented on some of my previous blog entries seem to equate programs written using dynamic languages like Python with “unmaintainable one-off solutions.” If that’s your experience with such languages, blame the programmer, not the language.

The Degenerating ESB Discussion

October 6th, 2007 | Published in enterprise, integration, REST, services, WS-* | Bookmark on Pinboard.in

In a comment to the many comments on my post entitled “The ESB Question,” Bill de hÓra gets it right:

I’m disappointed with the responses here. Steve’s an expert technologist in this domain. There’s a real opportunity for learning when the ESB approach is *appropriate* – instead we see the same old reactionary web v enterprise positioning and the usual suspect arguments being rolled out against REST style integration.

Bill’s right; the discussion has unfortunately mostly degenerated into the usual no-light-all-heat “us vs. them” argument. I especially object to the folks who twist what I say and accuse me of saying things that I never even remotely hinted at, but I guess that’s the price of blogging publicly.

I’m not going to try to address the comments individually, especially the ones from the guys who, amusingly, wrote lengthy diatribes explaining to me just what ESBs are, what they’re supposed to do, and why they’re beneficial. Yes, guys, I get all that. Perhaps you should go back and read some of my publications from 3, or 4, or 5 years ago? Funny how I don’t recall any of you being around back then, when I felt more positive about ESBs (hint, it was before that term was coined) and I was blogging and writing to that effect.

Arguing this issue on technical merits is rather pointless. One of the best comments in this thread was from Dan Hatfield:

Honestly, I see the ESB as primarily a political thing. It allows for a greater degree of control on delivered solutions. In large companies, we don’t often do architecture – we do politecture…The politics drive the architecture. Not the way it should be…but that’s the way it is.

So true, so true. Another non-technical way to look at it is from the viewpoint of Clayton Christensen’s classic book, The Innovator’s Dilemma. For quite a few years now, we’ve seen a series of sustaining innovations in the “object/service RPC” line of descent originally popularized by CORBA and COM, both of which built on earlier RPC, distributed object, and TP monitor technologies. RMI, EJB, SOAP, WS-*, and ESB are all offspring in that line, and there are surely more to come. I feel that REST, on the other hand, fits the definition of a disruptive innovation perfectly (and if you’re too lazy to read the book, then please at least follow the link, otherwise you won’t understand this at all). The proponents of the sustaining technologies look at REST and say, “well it can’t solve this and it can’t solve that” and voice numerous other complaints about it, precisely as Christensen predicts they would. But Chistensen also explains why, at the end of the day, any real or perceived technical shortcomings simply don’t matter (and in this case, they’re mostly perceived, not real). HTTP-based REST approaches have a lower barrier to entry and are less complex than anything the sustaining technologies have to offer, and REST is disrupting them, whether all the smart folks pushing ESBs like it or not. It’s not a technical issue, and there’s no amount of technology the non-REST tribe can throw at it to stop it because it’s based on how markets work, not on the technical specifics.

At the end of the day, if you don’t think REST is viable or you don’t like dynamic languages, then don’t use them. It just means that you’re at a different point than I am on the Technology Adoption Lifecycle curve. Like I already said in my first follow-up, I’m not in that business anymore, so it doesn’t matter to me at all. I’ll just keep using what I know to be generally superior to all the other approaches I’ve worked on over the years.

Reactions to the ESB Question

October 4th, 2007 | Published in enterprise, integration, REST, services, WS-* | Bookmark on Pinboard.in

Comments and reactions to my previous post have been interesting. Rather than responding to comments by adding more comments to that post, let me address them here:

Mark says what I’m recommending isn’t entirely new. And you know what: he’s exactly right! What might be interesting, though, is why he’s right. I pushed these ideas surrounding REST and dynamic languages for a number of years in my former position before finally leaving that position, partly because those ideas were not getting the level of attention I felt they deserved. So yes, the ideas are certainly nothing new today, but they were reasonably new back then.
REST is unproven. Sigh. I can’t decide if people say this because they’re just trying to stir up an argument, or they’re so heavily biased toward their non-REST approaches that they just can’t even consider that there might be viable alternatives, or they really have no clue about what REST is, how it works, and why it works, and they’re not interested in learning it, so they just react badly whenever they hear it, or all of the above. If you’re anti-REST or REST-ignorant, and you haven’t read RESTful Web Services, then don’t even talk to me about REST. The book is absolutely wonderful, and its explanations and answers are extremely clear. If you consider yourself informed and capable when it comes to distributed systems and integration, but you don’t know REST, then there’s simply no way you can read that book and not have it lead you to seriously question your core beliefs regarding how you think such systems should be built, unless you’re completely close-minded of course.
Are you saying we should throw everything out and redo it with REST and dynamic languages? No, not at all. I would never advocate wholesale rip-and-replace, because the cure is almost always worse than the disease. I’m simply saying that many integration projects can be done easier and with less expense if you use those tools and approaches (and please notice I said “many,” not “all”).
Steve’s thrown out the baby with the bathwater. It seems that people might not have read my entire post, because a number seem to think I said that ESBs should never be used at all. That’s not what I said; if you read all the way to the bottom, you’ll see that I explained some conditions under which I thought they can come in mighty handy.
What do mono-language programmers have to do with ESBs? It’s all part of the same culture, the “one size fits all” approach, where you have answers looking for problems rather than the other way around, and where people intentionally wear blinders to less expensive, more productive, and far more flexible and agile approaches because “it’s just not the way we do things around here.”
Are you seriously recommending dynamic languages for enterprise integration projects?! Hmm, that’s mighty enterprisey of you to ask that. Yes, that’s exactly what I’m recommending, because they’re solid, fast, flexible, smaller and therefore less buggy, easily deployed, easily maintained, and the solutions you get by using them will be in production well ahead of your traditional compiled languages.

At the end of the day, if you want to ignore my advice on using REST and dynamic languages, that’s your own problem. You won’t get any arguments from me, because I’m not in that business anymore. All I know is that I’m using them very successfully as part of what I’m working on these days, and it’s simply glorious.

The ESB Question

October 4th, 2007 | Published in enterprise, integration, REST, services, WS-* | Bookmark on Pinboard.in

The other day, the always-insightful Patrick Logan questioned whether ESBs are worth using. It’s definitely a good question.

In a previous life, I helped develop ESBs. I’ve written about them and I’ve promoted them. But somewhere along the way, I lost the religion.

It happened over a period of years as I learned more about REST, and as I learned how simple it was to develop integration systems using dynamic languages, which I’ve always been a fan of (for example, you can still find my name in the Perl source distribution, from about 20 years back when I ported it to Apollo Domain/OS). Despite the fact that I did it for years, I can’t understand anymore why anyone would subject themselves to writing integration systems or frameworks in imperative compiled languages like Java or C++. Doing it from scratch in such languages means that you’re definitely staring at a multi-year project, if you want to do it right. Or, if you’d rather use some sort of ESB platform or middleware framework underneath, you’re still looking at a number of months just to get anything nontrivial fully written, tested, and deployed.

Some of the mindset behind ESB-type middleware is the desire to do everything in Java or C++. Many developers just want to write some code and plug it into a magical framework that transparently handles all the distribution, persistence, security, transactions, and reliability underneath. Chuckle. Underlying frameworks just grow and grow as they try to provide all this, and so they develop more bugs, more inconsistencies, more special cases, less flexibility, and less reliability as time goes on, not to mention foisting on the unwary the XML configuration hell that Patrick alluded to.

Eventually, these mono-language programmers end up completely incapable of addressing a problem unless they can see how to implement it in their One True Language™. Too many programmers I’ve encountered over the decades know only a single programming language, with many of them even lacking in that one language. They therefore tend to settle into areas that let them use that language no matter what, all day, every day, even if it means writing hundreds of lines of code that could easily be replaced by just a few lines of a language better suited to the problem.

Another mindset behind the ESB is that of the enterprise architect. Large enterprises believe they can save themselves a lot of money and trouble if they can just get the whole enterprise to agree on a single integration architecture. Unfortunately, that’s pretty much impossible because of autonomous divisions independently building/buying and deploying different solutions, because of acquisitions and mergers that bring in different technologies and tools, and because of legacy systems that never seem to go away. The ESB appeals to such architects because it’s touted as being so flexible that it can accommodate everything in the enterprise and then some. This flexibility allows the architect to mandate it as “the standard,” and then proceed to beat up the ESB vendor to add this and that, while also beating up teams within the enterprise to use the ESB for everything whether it really fits or not. The ESB becomes like one of those tools you see on late-night TV, where it’s a screwdriver and a hammer and a wrench and fishing reel and a paint brush, and plus you can flip it over and it can make you a grilled cheese sandwich. Of course, such tools always end up not doing any of those things particularly well.

So, why not just do things the way Patrick suggested: when faced with an integration problem, just grab the nearest dynamic language, such as Ruby or Python, grab the modules they provide that surely cover whatever protocols and formats are needed for the integration, and code it up in just a day or two? Well, ESB proponents would say that this would never do because you end up with way too many one-off solutions, while mono-compiled-language proponents would natter on pointlessly about type safety, compilers preventing errors, and performance. An assumption behind these arguments is that the integration problem being solved requires high performance, and that the solution will live for a long time. In my experience, these assumptions are rarely correct, especially the second one, since business rules change so rapidly nowadays that the integrations required to implement them change all the time as well.

Frankly, if I were an enterprise architect today, and I were genuinely concerned about development costs, agility, and extensibility, I’d be looking to solve everything I possibly could with dynamic languages and REST, and specifically the HTTP variety of REST. I’d avoid ESBs and the typical enterprise middleware frameworks unless I had a problem that really required them (see below). I’d also try to totally avoid SOAP and WS-*.

The only place I can imagine using an ESB would be if I had some legacy systems that just couldn’t be replaced, had to be made to talk to each other, and there were significant measured (as opposed to imagined) performance issues to be addressed within the integration. Under such circumstances, the right ESB could shine; I’ve seen those circumstances, and I’ve seen an ESB solve exactly that problem on a number of occasions, and solve it well. But on the whole, such scenarios don’t really arise that often.

Steve Vinoski's Blog