Erlang: It’s About Reliability

May 1st, 2008  |  Published in erlang, reliability  |  18 Comments

In a recent post, Ted Neward gives a brief description of a variety of programming languages. It’s a useful post; I’ve known Ted for a while now, and he’s quite knowledgeable about such things. Still, I have to comment on what he says about Erlang:

Erlang. Joe Armstrong’s baby was built to solve a specific set of problems at Ericsson, and from it we can learn a phenomenal amount about building massively parallel concurrent programs. The fact that it runs on its own interpreter, bad.

I might have said it like this:

Erlang. Joe Armstrong’s baby was built to solve a specific set of problems at Ericsson, and from it we can learn a phenomenal amount about building highly reliable systems that can also support massive concurrency. The fact that it runs on its own interpreter, good; otherwise, the reliability wouldn’t be there and it would be just another curious but useless concurrency-oriented language experiment.

Far too many blog posts and articles that touch on Erlang completely miss the point that reliability is an extremely important aspect of the language.

To achieve reliability, you have to accept the fact that failure will occur. Once you accept that, other things fall into place: you need to be able to restart things quickly, and to do that, processes need to be cheap. If something fails, you don’t want it taking everything else with it, so you need to at least minimize, if not eliminate, sharing, which leads you to message passing. You also need monitoring capabilities that can detect failed processes and restart them (BTW in the same posting Ted seems to claim that Erlang has no monitoring capabilities, which baffles me).
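
The chain of reasoning above (isolated workers, failures reported as messages, a monitor that restarts whatever died) can be sketched in a few lines. This is only a rough Python analogy, not Erlang and not how the Erlang runtime works: threads stand in for processes, a `queue.Queue` stands in for a mailbox, and the names `run_monitored`, `supervise`, and `make_flaky` are hypothetical, chosen purely for illustration. Python threads are neither as cheap nor as isolated as Erlang processes, so treat this as a picture of the idea rather than a way to get the same guarantees.

```python
import queue
import threading


def run_monitored(name, task, mailbox):
    """Run `task` in its own thread and report how it ended as a message.

    Assumption: a thread stands in for an Erlang process here. The worker
    never touches the supervisor's state; it only posts to `mailbox`.
    """
    def body():
        try:
            result = task()
            mailbox.put((name, "normal", result))
        except Exception as exc:
            # The failure is contained in this worker and turned into a message.
            mailbox.put((name, "crashed", repr(exc)))

    worker = threading.Thread(target=body, name=name, daemon=True)
    worker.start()
    return worker


def supervise(name, task, max_restarts=3):
    """Start `task`, restart it when it crashes, give up after `max_restarts`.

    Returns (result, number_of_restarts) once the task finishes normally.
    """
    mailbox = queue.Queue()
    restarts = 0
    run_monitored(name, task, mailbox)
    while True:
        _who, status, payload = mailbox.get()  # block until the worker ends
        if status == "normal":
            return payload, restarts
        if restarts >= max_restarts:
            raise RuntimeError(f"{name} exceeded {max_restarts} restarts: {payload}")
        restarts += 1
        run_monitored(name, task, mailbox)  # restart with a fresh worker


def make_flaky(failures):
    """Hypothetical task that raises on its first `failures` calls, then succeeds.

    The shared counter is test scaffolding only, not part of the pattern.
    """
    remaining = [failures]

    def task():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ValueError("simulated fault")
        return "done"

    return task
```

Calling `supervise("worker", make_flaky(2))` returns `("done", 2)`: the worker crashed twice, was restarted twice, and then finished normally. The supervisor never inspects the worker's internals; it only reacts to the messages describing how each worker ended.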

Massive concurrency becomes far easier with an architecture that provides lightweight processes that share nothing, but that doesn’t mean that once you design it, the rest is just a simple matter of programming. Rather, actually implementing all this in a way that delivers what’s needed and performs more than adequately for production-quality systems is an enormous challenge, one that the Erlang development team has quite admirably met, and that’s an understatement if there ever was one.

They come for the concurrency but they stay for the reliability. Do any other “Erlang-like” languages have real, live, production systems in the field that have been running non-stop for years? (That’s not a rhetorical question; if you know of any such languages, please let me know.) Next time you see yet another posting about Erlang and concurrency, especially those of the form “Erlang-like concurrency in language X!” just ask the author: where’s the reliability?

Responses

  1. Hynek (Pichi) Vychodil says:

    May 1st, 2008 at 11:44 am (#)

    You are right. The main difference with Erlang is reliability. Erlang’s reliability is unmatched elsewhere. Yes, there are some reliable systems written in other languages (Ada and the like), in spacecraft for example, but those systems don’t have Erlang’s productivity. What sets Erlang apart is productivity together with reliability. Erlang’s syntax and semantics look weird, but when you think about why they are the way they are, they make sense. Almost everything works toward productivity and reliability together. The surprise is that it performs well too :-)

  2. Cedric says:

    May 1st, 2008 at 12:09 pm (#)

    Steve, I find it puzzling that the #1 requirement you see for reliability is that processes that die need to restart very fast. The part of the world that builds three- or four-nines software without Erlang knows that this kind of concept is the least of their worries, because whether it takes 10ms or 5 seconds, the important part is that there should be no interruption of service. And this is very easily covered with redundancy and balancing.

    Yes, Erlang has monitoring, but you still need to implement it by hand, and Erlang doesn’t really help you more in that area than any language that supports traditional exceptions.

  3. Mark says:

    May 1st, 2008 at 12:16 pm (#)

    How would you rate stackless python in comparison to Erlang?

  4. Dan Sickles says:

    May 1st, 2008 at 1:22 pm (#)

    Damien Katz had an interesting thread on this subject.

    …”once you reach a certain level of activity in the system where the garbage collector can no longer keep up (and it will happen), then every line of code in your system is now a potential failure point that can leave the whole program in a bad state. Lisp has this problem. Java has this problem. Erlang does not”

    http://damienkatz.net/2008/04/lisp_as_blub.html

  5. Fred says:

    May 1st, 2008 at 2:01 pm (#)

    It takes a long time to build a reliable system. You have to test it and let it run for a while by itself. Long-running tests take more time than short tests.

  6. steve says:

    May 1st, 2008 at 2:30 pm (#)

    Cedric: you’re reading too much into what I wrote; specifically, I don’t see any numbering there. Perhaps I should have written the phrase “in no particular order.” I was listing some of the things you need to achieve reliability, and explaining how they tend to lead you toward a system that provides excellent concurrency support. Fast restart can indeed be important, depending on the type of system you’re building; if startup is too slow, that creates a window in which part of your redundancy is unavailable.

    You say that Erlang has monitoring, but you have to implement it by hand? That’s odd, since none of my Erlang code has had to do that — I simply spawn a process. And if you think exceptions are equivalent to what Erlang provides in the area of process monitoring and supervision, then you are very mistaken. Just as I did last time you commented here, I invite you to go and actually write some real Erlang code, rather than just guessing about it as you’re clearly doing.

  7. steve says:

    May 1st, 2008 at 2:31 pm (#)

    Mark: I don’t know. What can you tell me about the reliability of Stackless Python?

  8. Kirk Wylie says:

    May 1st, 2008 at 2:53 pm (#)

    Hi, Steve,

    I’m actually in the process of gearing up to do some Erlang work myself, but my primary question here is whether one really requires systems that have years of uptime, particularly in the types of software that I’ve tended to work on in the past.

    There are a couple of places where extremely rigorous, multiyear-level uptime is required. The first, as someone else mentioned, is embedded control systems. Those tend not to run with any type of VM or interpreter, so they use far more rigorous software engineering techniques (and specialized languages that compile down to machine code, like Ada) to ensure reliability.

    The other place is definitely telecommunications systems, which is where Erlang came from.

    I think the key thing here is that requirements are fixed and unchanging. Phone calls are phone calls are phone calls: the requirements for handling a phone call don’t change very often; the GSM protocol hasn’t changed substantially that I’m aware of.

    However, in systems where standards aren’t known or fixed, Steve, do you think that the multi-year reliability really factors in if you have to assume that you’ll have to bring things down every month or two (minimum) to add/upgrade/change functionality? Or do you think that the real sweet spot for Erlang is somewhere where the standard is pretty fixed and not likely to change anytime in the next year or two and needs completely unattended operation?

  9. steve says:

    May 1st, 2008 at 3:43 pm (#)

    Kirk: I agree that not all systems require the major reliability that Erlang can provide. However, reliability is often more important than people think it is for the general case. If you develop a successful system, and its success forces you to try to add reliability after the fact, it can be not only difficult but incredibly expensive to do so.

    Note that you needn’t stop an Erlang system to do code upgrade. Live code upgrade is a feature. Having that feature at your disposal changes the way you think about this problem. Where you previously might have just assumed you’d have to take systems down to upgrade or change them, you can instead consider whether there are benefits to leaving the system running and upgrading it (or downgrading it) live. This opens new possibilities for both you and your customers in terms of how changes are rolled out, how they’re tested, how they’re accepted, how frequently they’re provided, etc.

    Having previously spent many years as a middleware developer using C++ and Java, I would have dearly loved to have this sort of capability years ago, as it would have saved me countless hours of development and debugging.

  10. Jason Watkins says:

    May 1st, 2008 at 4:07 pm (#)

    Stackless is used for the game logic of http://www.eve-online.com. Eve has had some performance and reliability problems. I don’t know, but I suspect those have more to do with design decisions than Stackless as a tool.

  11. Ulf Wiger says:

    May 1st, 2008 at 7:13 pm (#)

    Kirk Wylie wrote:

    “I think the key thing here is that requirements are fixed and unchanging. Phone calls are phone calls are phone calls: the requirements for handling a phone call don’t change very often; the GSM protocol hasn’t changed substantially that I’m aware of.”

    You don’t work in the telecoms sector, do you? ;-)

    During the 12 years that I’ve worked in telecoms, change has been the one constant. The AXD 301, which was the first well-known Erlang-based product, started out as an ATM switch, then evolved into a telephony-over-ATM media gateway, while doubling as a TDM switch replacement, MPLS label switch router and then some, and nowadays is fully IP-based (no ATM interfaces), serving in networks to ease the transition from traditional telephony to SIP-based multimedia. Phone calls are not what they used to be, even if the customer isn’t supposed to know the difference. I don’t work with GSM, and can’t say how much the GSM protocol has changed, but GSM-based networks certainly have, with GPRS, EDGE and the migration towards WCDMA and, eventually, Mobile IMS (SIP-based multimedia regardless of access technology). The telecoms sector is in constant turmoil nowadays, and it’s extremely difficult to tell which solutions will win over the others.

    “However, in systems where standards aren’t known or fixed, Steve, do you think that the multi-year reliability really factors in if you have to assume that you’ll have to bring things down every month or two (minimum) to add/upgrade/change functionality?”

    Why would you take things down? The real trick is to manage constant change and still be able to upgrade your products without service interruption. Granted – the combination (huge system) + (major changes) + (smooth upgrade) is a terrific challenge even in Erlang, but at least it gives you a fighting chance. Small changes are nothing – you just load them in step. Many people who’ve developed servers in Erlang will tell you that they keep the server running for weeks and months during development – even while the code is very immature – and keep loading code as the system evolves.

    “Or do you think that the real sweet spot for Erlang is somewhere where the standard is pretty fixed and not likely to change anytime in the next year or two and needs completely unattended operation?”

    I don’t think so. If the environment is sufficiently static, development cost is less of a factor, and the argument for cutting-edge technology weak. Erlang shines when you’re facing complex challenges, tight timelines, and still have to deliver high reliability. Quite often, this is where Erlang has made it through the door, when all traditional approaches fail.

  12. Jesse Farmer says:

    May 1st, 2008 at 9:24 pm (#)

    Right on. I’ve decided to learn Erlang and right away several friends and commenters asked, “Why not try X? It’s like Erlang but not so weird.”

    And I responded, “Does anyone use those things in the real world?” I’m not learning Erlang for the heck of it; I’m learning it because I want to build distributed, concurrent, fault-resistant systems.

    Gambit+Termite might be sexier, but has it progressed beyond the toy-language stage? I see no evidence.

  13. dda says:

    May 1st, 2008 at 11:29 pm (#)

    Apparently Ted Neward has a fixation on languages that run on their own VM, or rather a fixation on the CLR, the Holy Spirit of languages, and anything that doesn’t run on it is A Very Bad Thing™, unless it’s Java. Sigh…

  14. EwanSilver.com » Reliability says:

    May 2nd, 2008 at 6:05 am (#)

    […] Vinoski talking about reliability (in particular my favourite: Erlang) To achieve reliability, you have to accept […]

  15. My daily readings 05/02/2008 « Strange Kite says:

    May 2nd, 2008 at 7:38 am (#)

    […] Erlang: It’s About Reliability :: Steve Vinoski’s Blog […]

  16. Dionysius says:

    May 2nd, 2008 at 12:31 pm (#)

    Isn’t Twitter using Erlang for their high-transaction component and Ruby for the rest?

    You can hack just about any language to do just about anything, but the syntax of the language shapes how you think about programming. I see this as the great strength of Erlang: it teaches programmers to think about concurrency and parallelism.

  17. Dean says:

    May 3rd, 2008 at 1:36 am (#)

    16: If Twitter is using Erlang, it’s certainly a very bad example, considering the number of spectacular crashes and day-long downtimes they have experienced these past few months.

    More seriously, no, they don’t use Erlang, and they have moved away from Ruby on Rails (the article on Techcrunch is actually behind: they started moving away from RoR months ago).

  18. steve says:

    May 3rd, 2008 at 10:36 pm (#)

    BTW, here is a great email from Joe Armstrong discussing some of the thinking behind how Erlang approaches the handling of failures.