performance

Sendfile for Yaws

January 5th, 2009  |  Published in erlang, performance, scalability, web, yaws  |  Add to del.icio.us

A few months back Klacke gave me committer rights for Yaws. I’ve made a few fixes here and there, including adding support for passing “*” as the request URI for OPTIONS, enabling OPTIONS requests to be dispatched to appmods and yapps, and completing a previously-submitted patch for configuring the listen backlog. Klacke has just started putting a test framework into the codebase and build system so that contributors can include tests with any patches or new code they submit, and I’ve contributed to that as well.

The biggest feature I’ve added to date, though, is a new linked-in driver that allows Yaws to use the sendfile system call on Linux, OS X, and FreeBSD. I never wrote a linked-in driver before, so I was happy and fortunate to have an Erlang expert like Klacke providing hints and reviewing my code.

I did some preliminary testing that showed that sendfile definitely improves CPU usage across the board but depending on file size, sometimes does so at the cost of increasing request times. I used my otherwise idle 2-core 2.4GHz Ubuntu 8.04.1 Dell box with 2 GB of RAM, and ran Apache Bench (ab) from another Linux host to simulate 50 concurrent clients downloading a 64k data file a total of 100000 times. I saw that user/system CPU on the web host tended to run around 33%/28% without sendfile, while with sendfile it dropped to 22%/17%. The trade-off was request time, though, where each request for the 64k file averaged 0.928ms with sendfile but 0.567ms without. With larger files, however, sendfile is slightly faster and still has better CPU usage. For example, with a 256k file, sendfile averaged 2.251ms per request with user/system CPU at 8%/16% whereas it was 2.255ms and 16%/27% CPU without sendfile, which makes me wonder if the figures for the 64k file are outliers for some reason. I performed these measurements fairly quickly, so while I believe they’re reasonably accurate, don’t take them as formal results.

On my MacBook Pro laptop running OS X 10.5.6, CPU usage didn’t seem to differ much whether I used sendfile or not, but requests across the board tended to be slightly faster with sendfile.

I ran FreeBSD 7.0.1 in a Parallels VM on my laptop, and there I saw significantly better system CPU usage with sendfile than without, sometimes as much as a 30% improvement. Requests were also noticeably faster with sendfile than without, sometimes by as much as 17%, and again depending on file size, with higher improvements for larger files. User CPU was not a factor. All in all, though, I don’t know how much the fact that I ran all this within a VM affected these numbers.

Given that Yaws is often used for delivering mainly dynamic content, sendfile won’t affect those cases. Still, I think it’s nice to have it available for the times when you do have to deliver file-based content, especially if the files are of the larger variety. Anyway, I committed this support to the Yaws svn repository back around December 21 or so. If you’d like to do your own testing, please feel free — I’d be interested in learning your results. Also, if you have ideas for further tests I might try, please leave a comment to let me know.

Internet Computing Call for Special Issue Proposals

January 22nd, 2008  |  Published in REST, distributed systems, integration, performance, publishing, reuse, scalability, services  |  Add to del.icio.us

As you may know, I’m a columnist for IEEE Internet Computing (IC), and I’m also on their editorial board. Our annual board meeting is coming up, so to help with planning, we’ve issued a call for special issue proposals.

The topics that typically come up in this blog and others it connects to are pretty much all fair game as special issue topics: REST and the programmatic web, service definition languages, scalability issues, intermediation, tools, reuse, development languages, back-end integration, etc. Putting together a special issue doesn’t take a lot of work, either. It requires you to find 3-4 authors each willing to contribute an article, reviewers to review those articles (and IC can help with that), and a couple others to work with you as editors. As editors you also have to write a brief introduction for the special issue. I’ve done a few special issues over the years and if you enlist the right authors, it’s a lot less work than you might think.

As far as technical magazines go, IC is typically one of the most cited, usually second only to IEEE Software, as measured by independent firms. I think one reason for this is that it has a nice balance of industry and academic articles, so its pages provide information relevant to both the practitioner and the researcher.

“Release It!” Is Truly Excellent

November 13th, 2007  |  Published in book, enterprise, performance, scalability  |  Add to del.icio.us

What if you knew a person who clearly knew an awful lot about large-scale software systems? Someone who earned their way to their knowledge and wisdom by spending days and nights analyzing, measuring and debugging when everyone else had already given up and gone home? Someone considered the “go-to guy” who could always fix things if the system were to encounter a mysterious problem so serious that it resulted in very real revenue loss for the customer for every second of downtime? And what if you could rig up some sort of device to easily tap into the way that person thinks, thereby getting a clear understanding of the rules by which he or she designs, builds, deploys, analyzes, measures, debugs, operates, manages, and upgrades those large-scale systems?

Michael Nygard is such a person, and luckily for the rest of us, he’s already created such a device: his book, entitled Release It! Design and Deploy Production-Ready Software.

Success is a problem we’d like our systems to have. So, we focus our energies on building them as best as we can. Since we tend to focus there, books also tend target only that initial development phase. Most books focus on analysis and design issues, or on the methods and processes for doing analysis, design, implementation, and testing. Sometimes you find a book that focuses on debugging, and some even get into performance and scalability concerns.

But I’ve never seen a book like this one. It addresses the truly hard part of software development, which is running a successful large-scale system in production.

The book is a patterns book, but the patterns it presents are concrete. First, there are patterns and anti-patterns for stability and capacity, intermixed with war stories about real-life large-scale systems that failed hard for reasons that wouldn’t ever occur at smaller scales. Nygard’s war stories bring the patterns and anti-patterns into focus, providing very real reasons for their existence, and hard-won proof that they do indeed work.

Following the patterns, the second half of the book discusses general design issues and operations issues. These parts build on the patterns and focus on gotchas, large and small, that keep small systems from growing into large ones. The general design chapters provide numerous suggestions for eliminating seemingly innocuous mistakes that can kill a system as scale increases. Finally, the operations sections put you in the shoes of the people who have to keep the system running in production, describing what you as a developer can do to make their lives easier (thereby making your own life easier too). Along the way, Nygard keeps us grounded by occasionally presenting more stories and scenarios, some of which include actual dollar figures for the choices and trade-offs made. Engineering always comes down to building the best system possible within the allotted budget, but how many books targeting software developers do you know that talk in concrete terms of costs?

Nygard’s writing style is clear and concise. If you read his book and read his blog, you’ll find the styles to be identical. I’m guessing his book’s copy editor didn’t have a lot of work to do to get this book ready for production.

Bottom line: if you care at all about production software systems, you’ll want to read this book.

Faster WF Still

October 21st, 2007  |  Published in erlang, performance  |  Add to del.icio.us

OK, so you could say that I’m a bit obsessed with Tim Bray’s Wide Finder project. Just a little. I mean, I started this blog just a few weeks ago and so far almost every posting has been about it:

There’s also my Sept./Oct. Internet Computing column, Concurrency with Erlang, and sometime in early November, my next column, entitled Reliability with Erlang, will be published. Neither column is connected at all to the Wide Finder, but they just further reveal my current obsession with Erlang.

In my previous post I described my fastest solution up to that point, but here’s an even faster one: tbray16.erl, which is identical in every way to tbray15.erl from my previous post except that it uses wfbm4.erl, which provides all the performance gains. This version of Boyer-Moore searching includes two simple tweaks:

  • Uses hard-coded constants for string lengths, since the strings are fixed, rather than constantly recalculating them with length/1.
  • Fixes a nagging problem with my Boyer-Moore implementation where it wasn’t handling repeated characters in the fixed pattern very well. What I did in the previous version was choose the lesser of the two shifts if both characters appeared in the pattern, which worked but isn’t technically correct, and it also meant two dict lookups to get the shift values rather than just one. Now, I do the right thing: just keep track of the number of comparisons and subtract that from the shift value, use that if the result is positive and non-zero, otherwise just shift by 1.

This version shaves another whole second off the previous version for about a 25% speedup. The fastest I’ve seen it run on my 8-core Linux box is:

real    0m3.107s
user    0m16.243s
sys     0m2.134s

Meanwhile, Anders Nygren has been exploring eliminating using the dict altogether for the shift value lookup, but nothing I’ve tried there has been an improvement. But thanks to Anders for prompting me to properly fix that Boyer-Moore code and at least eliminate one of the dict lookups.

OK, Just One More WF

October 18th, 2007  |  Published in erlang, performance  |  Add to del.icio.us

When writing my previous post I silently hoped I was finished contributing more Erlang solutions to Tim Bray’s Wide Finder project. Tim already told me my code was running really well on his T5120, and yet in my last post, I nearly doubled the speed of that code, so I figured I was in good shape. But then Caoyuan Deng came up with something faster. He asked me to run it on my 8-core Linux box, and sure enough, it was fast, but didn’t seem to be using the CPU that well.

So, I thought some more about the problem. Last time I said I was using Boyer-Moore searching, but only sort of. This is because I was using Erlang function argument pattern matching, which proceeds forward, not backward as Boyer-Moore does. I couldn’t help but think that I could get more speed by doing that right.

I was also concerned about the speed of reading the input file. Reading it in chunks seems like the obvious thing to do for such a large file (Tim’s sample dataset is 236140827 bytes). It turns out that reading in chunks can cumulatively take over a second using klacke’s bfile module, but it takes only about a third of a second to read the whole thing into memory in one shot. By my measurements the bfile module is noticeably faster at doing this than the Erlang file:read_file/1. Even my Macbook Pro can read the whole dataset without significant trouble, so I imagine the T5120 can do it with ease.

So, I changed tactics:

  • Read the whole dataset into an Erlang binary in one shot, then break it into chunks based on the number of schedulers that the Erlang system is using.
  • Stop breaking chunks into smaller blocks at newline boundaries. This took too much time. Instead, just grab a block, search from the end to find the final newline, and then process it for pattern matches.
  • Change the search module to do something much closer to Boyer-Moore regarding backwards searching, streamline the code that matches the variable portion of the search pattern, and be smarter about skipping ahead on failed matches.
  • Balance the parallelized collection of match data by multiple independent processes against the creation of many small dictionaries that later require merging.

This new version reads the whole dataset, takes the first chunk, finds the final newline, then kicks off one process to collect matches and a separate process to find the matches. It then moves immediately onto the next block, doing the same thing again. What that means is the main process spends its time finding newlines and launching processes while other processes look for matches and collect them. At the end, the main process collects the collections, merges them, and prints out the top ten.

On my 8-core 2.33GHz Linux box with 8 GB of RAM:

$ time erl -smp -noshell -run tbray15 main o1000k.ap
2959: 2006/09/29/Dynamic-IDE
2059: 2006/07/28/Open-Data
1636: 2006/10/02/Cedric-on-Refactoring
1060: 2006/03/30/Teacup
942: 2006/01/31/Data-Protection
842: 2006/10/04/JIS-Reg-FD
838: 2006/10/06/On-Comments
817: 2006/10/02/Size-Matters
682: 2003/09/18/NXML
630: 2003/06/24/IntelligentSearch

real    0m4.124s
user    0m25.124s
sys     0m1.916s

At 4.124s, this is significantly faster than the 6.663s I saw with my previous version. The user time is 6x the elapsed time, so we’re using the cores well. What’s more, if you change the block size, which indirectly controls the number of Erlang processes that run, you can clearly see a pretty much linear speedup as more cores get used. Below is the output from a loop where the block size starts at the file size and then divides by two on each iteration (I’ve edited the output to make it more compact by flattening the time output into three columns):

$ ((x=236140827)) ; while ((x>32768))
do echo $x
    time erl -smp -noshell -run tbray15 main o1000k.ap $x >/dev/null
    ((x/=2))
done

236140827: 0m38.072s, 0m52.159s, 0m3.984s
118070413: 0m18.294s, 0m37.922s, 0m4.571s
59035206:  0m11.374s, 0m36.694s, 0m9.098s
29517603:  0m4.598s,  0m27.825s, 0m2.180s
14758801:  0m4.225s,  0m26.237s, 0m2.134s
7379400:   0m4.181s,  0m25.779s, 0m1.873s
3689700:   0m4.124s,  0m25.124s, 0m1.916s
1844850:   0m4.149s,  0m24.931s, 0m1.969s
922425:    0m4.132s,  0m24.894s, 0m1.822s
461212:    0m4.170s,  0m24.588s, 0m2.026s
230606:    0m4.185s,  0m24.548s, 0m2.035s
115303:    0m4.215s,  0m24.755s, 0m2.025s
57651:     0m4.317s,  0m25.199s, 0m1.985s

The elapsed time between the top four entries show the multiple cores kicking in, essentially doubling the performance each time. Once we hit the 4 second range, performance gains are small but steady roughly down to a block size of 922425, but then they start to creep up again. My guess is that this is because smaller blocks mean more Erlang dict instances being created to capture matches, and all those dictionaries then have to be merged to collect the final results. In the middle, where performance is best, the user time is roughly 6x the elapsed time as I already mentioned, which means that if Tim runs this on his T5120, he should see excellent performance there as well.

Feel free to grab the files tbray15.erl and wfbm3.erl if you want to try it out for yourself.