When writing my previous post I silently hoped I was finished contributing more Erlang solutions to Tim Bray’s Wide Finder project. Tim already told me my code was running really well on his T5120, and yet in my last post, I nearly doubled the speed of that code, so I figured I was in good shape. But then Caoyuan Deng came up with something faster. He asked me to run it on my 8-core Linux box, and sure enough, it was fast, but didn’t seem to be using the CPU that well.
So, I thought some more about the problem. Last time I said I was using Boyer-Moore searching, but only sort of. This is because I was using Erlang function argument pattern matching, which proceeds forward, not backward as Boyer-Moore does. I couldn’t help but think that I could get more speed by doing that right.
I was also concerned about the speed of reading the input file. Reading it in chunks seems like the obvious thing to do for such a large file (Tim’s sample dataset is 236140827 bytes). It turns out that reading in chunks can cumulatively take over a second using klacke’s
bfile module, but it takes only about a third of a second to read the whole thing into memory in one shot. By my measurements the
bfile module is noticeably faster at doing this than the Erlang
file:read_file/1. Even my Macbook Pro can read the whole dataset without significant trouble, so I imagine the T5120 can do it with ease.
So, I changed tactics:
- Read the whole dataset into an Erlang binary in one shot, then break it into chunks based on the number of schedulers that the Erlang system is using.
- Stop breaking chunks into smaller blocks at newline boundaries. This took too much time. Instead, just grab a block, search from the end to find the final newline, and then process it for pattern matches.
- Change the search module to do something much closer to Boyer-Moore regarding backwards searching, streamline the code that matches the variable portion of the search pattern, and be smarter about skipping ahead on failed matches.
- Balance the parallelized collection of match data by multiple independent processes against the creation of many small dictionaries that later require merging.
This new version reads the whole dataset, takes the first chunk, finds the final newline, then kicks off one process to collect matches and a separate process to find the matches. It then moves immediately onto the next block, doing the same thing again. What that means is the main process spends its time finding newlines and launching processes while other processes look for matches and collect them. At the end, the main process collects the collections, merges them, and prints out the top ten.
On my 8-core 2.33GHz Linux box with 8 GB of RAM:
$ time erl -smp -noshell -run tbray15 main o1000k.ap
At 4.124s, this is significantly faster than the 6.663s I saw with my previous version. The user time is 6x the elapsed time, so we’re using the cores well. What’s more, if you change the block size, which indirectly controls the number of Erlang processes that run, you can clearly see a pretty much linear speedup as more cores get used. Below is the output from a loop where the block size starts at the file size and then divides by two on each iteration (I’ve edited the output to make it more compact by flattening the
time output into three columns):
$ ((x=236140827)) ; while ((x>32768))
do echo $x
time erl -smp -noshell -run tbray15 main o1000k.ap $x >/dev/null
236140827: 0m38.072s, 0m52.159s, 0m3.984s
118070413: 0m18.294s, 0m37.922s, 0m4.571s
59035206: 0m11.374s, 0m36.694s, 0m9.098s
29517603: 0m4.598s, 0m27.825s, 0m2.180s
14758801: 0m4.225s, 0m26.237s, 0m2.134s
7379400: 0m4.181s, 0m25.779s, 0m1.873s
3689700: 0m4.124s, 0m25.124s, 0m1.916s
1844850: 0m4.149s, 0m24.931s, 0m1.969s
922425: 0m4.132s, 0m24.894s, 0m1.822s
461212: 0m4.170s, 0m24.588s, 0m2.026s
230606: 0m4.185s, 0m24.548s, 0m2.035s
115303: 0m4.215s, 0m24.755s, 0m2.025s
57651: 0m4.317s, 0m25.199s, 0m1.985s
The elapsed time between the top four entries show the multiple cores kicking in, essentially doubling the performance each time. Once we hit the 4 second range, performance gains are small but steady roughly down to a block size of 922425, but then they start to creep up again. My guess is that this is because smaller blocks mean more Erlang
dict instances being created to capture matches, and all those dictionaries then have to be merged to collect the final results. In the middle, where performance is best, the user time is roughly 6x the elapsed time as I already mentioned, which means that if Tim runs this on his T5120, he should see excellent performance there as well.