<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Steve Vinoski's Blog &#187; reliability</title>
	<atom:link href="http://steve.vinoski.net/blog/category/reliability/feed/" rel="self" type="application/rss+xml" />
	<link>http://steve.vinoski.net/blog</link>
	<description>Ask forgiveness, not permission.</description>
	<lastBuildDate>Sat, 17 Jul 2010 18:01:47 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Controlling Erlang&#8217;s Heart</title>
		<link>http://steve.vinoski.net/blog/2009/02/22/controlling-erlangs-heart/</link>
		<comments>http://steve.vinoski.net/blog/2009/02/22/controlling-erlangs-heart/#comments</comments>
		<pubDate>Sun, 22 Feb 2009 21:33:48 +0000</pubDate>
		<dc:creator>steve</dc:creator>
				<category><![CDATA[code]]></category>
		<category><![CDATA[erlang]]></category>
		<category><![CDATA[reliability]]></category>

		<guid isPermaLink="false">http://steve.vinoski.net/blog/?p=244</guid>
		<description><![CDATA[Erlang&#8217;s heart feature provides a heartbeat-based monitoring capability for Erlang runtime systems, with the ability to restart a runtime system if it fails. It works reasonably well, but one issue with it is that if an error occurs such that it causes repeated immediate runtime crashes, heart will happily keep restarting the runtime over and [...]]]></description>
			<content:encoded><![CDATA[<p>Erlang&#8217;s <a href="http://erlang.org/doc/man/heart.html">heart</a> feature provides a heartbeat-based monitoring capability for Erlang runtime systems, with the ability to restart a runtime system if it fails. It works reasonably well, but one issue with it is that if an error occurs such that it causes repeated immediate runtime crashes, <code>heart</code> will happily keep restarting the runtime over and over again, ad infinitum.</p>
<p>For <a href="http://yaws.hyber.org/">yaws 1.80</a>, released a few days ago on Feb. 12, I added a check to the <code>heart</code> setup in the <code>yaws</code> startup script to prevent endless restarts. I thought I&#8217;d share it here because it&#8217;s useful for Erlang systems in general and is in no way specific to yaws. It works by passing startup information from one incarnation to the next, checking that information to detect multiple restarts within a given time period. We track both the startup time and the restart count, and if we detect 5 restarts within a 60-second period, we stop completely. This is not to say that yaws is in dire need of this capability &mdash; it&#8217;s extremely stable in general and 1.80 in particular is a very good release &mdash; but I added it mainly because other Erlang apps sharing the same runtime instance as yaws may not enjoy that same high level of stability, especially while they&#8217;re still under development.</p>
<p>The command <code>heart</code> runs to start a new instance is set in the <code>HEART_COMMAND</code> environment variable. For yaws, it&#8217;s set like this (I&#8217;ve split this over multiple lines for clarity, but it&#8217;s just one line in the actual script):</p>
<pre>HEART_COMMAND="${ENV_PGM} \
  HEART=true \
  YAWS_HEART_RESTARTS=$restarts \
  YAWS_HEART_START=$starttime \
  $program "${1+"$@"}</pre>
<p>where</p>
<ul>
<li><code>${ENV_PGM}</code> is <code><a href="http://docs.sun.com/app/docs/doc/816-5165/env-1">/usr/bin/env</a></code>, which allows us to set environment variables for the execution of a given command.</li>
<li><code>HEART</code> is an environment variable that we use to indicate the command was launched by <code>heart</code>.</li>
<li><code>YAWS_HEART_RESTARTS</code> is an environment variable that we use to track the number of restarts already seen. The yaws script initially sets this to 1 and increments it for each heart restart.</li>
<li><code>YAWS_HEART_START</code> is an environment variable that we use to track the time of the current round of restarts. This is tracked as <a href="http://en.wikipedia.org/wiki/Unix_time">UNIX time</a>, obtained by the script via the &#8220;<code>date -u +%s</code>&#8221; command.</li>
<li><code>$program</code> is the yaws script itself, i.e., <code>$0</code>.</li>
<li><code>${1+"$@"}</code> is a specific shell construct that passes all the original arguments of the script unchanged along to <code>$program</code>.</li>
</ul>
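<p>To see why that last construct matters, here is a minimal sketch showing that <code>${1+"$@"}</code> forwards arguments without re-splitting them and expands to nothing at all when no arguments were given. The <code>count_args</code> function is a hypothetical name used only for illustration; it is not part of the yaws script.</p>
<pre>count_args() {
    # Re-set the positional parameters from the forwarded arguments,
    # then report how many survived the trip.
    set -- ${1+"$@"}
    echo "$#"
}</pre>
<p>Calling <code>count_args "one two" three</code> prints 2, because the argument containing a space stays whole, and calling <code>count_args</code> with no arguments prints 0.</p>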
<p>The yaws script looks for <code>HEART</code> set to true, indicating that it was launched by <code>heart</code>. For that case, it then checks <code>YAWS_HEART_RESTARTS</code> and <code>YAWS_HEART_START</code> to see how many restarts we&#8217;ve seen since the start time. We get the current UNIX time and subtract the <code>YAWS_HEART_START</code> time; if the difference is less than or equal to 60 seconds and the restart count is 5, we exit completely without restarting the Erlang runtime. Otherwise we restart, first adjusting these environment variables. If the restart count is less than 5 within the 60-second window, we increment the restart count and set the new value into <code>YAWS_HEART_RESTARTS</code> but keep the same <code>YAWS_HEART_START</code> time. But if the current time is more than 60 seconds past the start time, we reset <code>YAWS_HEART_RESTARTS</code> to 1 and set a new start time for <code>YAWS_HEART_START</code>. Look at the <a href="http://erlyaws.svn.sourceforge.net/viewvc/erlyaws/trunk/yaws/scripts/yaws.template?view=markup">yaws script</a> to see the details of this logic &mdash; scroll down to the part starting with <code>if [ "$HEART" = true ]</code>.</p>
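<p>The decision described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code from the yaws script: the function name <code>next_heart_state</code>, its argument order, and its output format are all hypothetical, and it uses a greater-than-or-equal test on the restart count where the description above says the count is 5, just to be a little defensive.</p>
<pre>next_heart_state() {
    # Arguments (hypothetical interface): the restart count and start time
    # inherited from the previous incarnation, and the current UNIX time.
    restarts=$1
    starttime=$2
    now=$3
    max_restarts=5    # restart limit described in the post
    window=60         # window length in seconds, as described in the post

    elapsed=$((now - starttime))
    if [ "$elapsed" -le "$window" ]; then
        if [ "$restarts" -ge "$max_restarts" ]; then
            # Too many restarts inside the window: give up entirely.
            echo stop
            return 0
        fi
        # Still inside the window: bump the count, keep the start time.
        echo "$((restarts + 1)) $starttime"
    else
        # The window has expired: begin a new round of restarts.
        echo "1 $now"
    fi
}</pre>
<p>For example, <code>next_heart_state 2 100 130</code> prints <code>3 100</code>, <code>next_heart_state 5 100 130</code> prints <code>stop</code>, and <code>next_heart_state 5 100 200</code> prints <code>1 200</code>. A caller would feed it <code>$YAWS_HEART_RESTARTS</code>, <code>$YAWS_HEART_START</code>, and the output of <code>date -u +%s</code>, then either exit or place the two printed values into the next <code>HEART_COMMAND</code>.</p>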
<p>Note that this approach is much like the way Erlang <code>receive</code> loops generally track state, by recursively passing state information to themselves.</p>
]]></content:encoded>
			<wfw:commentRss>http://steve.vinoski.net/blog/2009/02/22/controlling-erlangs-heart/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Clearly Time To End This</title>
		<link>http://steve.vinoski.net/blog/2008/05/18/clearly-time-to-end-this/</link>
		<comments>http://steve.vinoski.net/blog/2008/05/18/clearly-time-to-end-this/#comments</comments>
		<pubDate>Sun, 18 May 2008 23:53:24 +0000</pubDate>
		<dc:creator>steve</dc:creator>
				<category><![CDATA[commentary]]></category>
		<category><![CDATA[distributed systems]]></category>
		<category><![CDATA[erlang]]></category>
		<category><![CDATA[reliability]]></category>
		<category><![CDATA[concurrency]]></category>
		<category><![CDATA[debate]]></category>

		<guid isPermaLink="false">http://steve.vinoski.net/blog/?p=69</guid>
		<description><![CDATA[A technical discussion stops being a vehicle for learning when the following start to occur: Someone starts making stuff up. Instead of answering questions put to them, someone starts pointing out &#8220;flaws&#8221; in the questions themselves. One challenges the other to some sort of programming contest. Name calling. The first two aren&#8217;t so bad, but [...]]]></description>
			<content:encoded><![CDATA[<p>A technical discussion stops being a vehicle for learning when the following start to occur:</p>
<ul>
<li>Someone starts making stuff up.</li>
<li>Instead of answering questions put to them, someone starts pointing out &#8220;flaws&#8221; in the questions themselves.</li>
<li>One challenges the other to some sort of programming contest.</li>
<li>Name calling.</li>
</ul>
<p>The first two aren&#8217;t so bad, but when either of the latter two appears, it&#8217;s time to stop. Unfortunately, <a href="http://blogs.tedneward.com/2008/05/18/Clearly+Thinking+Whether+In+Language+Or+Otherwise.aspx">the third item has now entered my back-and-forth with Ted Neward</a>. Since Ted has given me the last word, I&#8217;ll take it, but it&#8217;s clearly time to move on.</p>
<p>Given that a number of statements Ted&#8217;s made about Erlang in this discussion simply aren&#8217;t true, it&#8217;s quite clear Ted has never written any production Erlang code. <em>[Update: Patrick Logan has posted <a href="http://patricklogan.blogspot.com/2008/05/this-is-part-two-of-my-response-to-ted.html">a detailed analysis of Ted's misunderstandings of Erlang</a>.]</em> As a long-time author, I&#8217;m bothered when people write authoritatively on topics they have no business writing about, so my only goal in this conversation has simply been to set the record straight with respect to Erlang. Ted originally said Erlang was a study in concurrency; I merely pointed out that it was more importantly <a href="http://www.sics.se/~joe/thesis/armstrong_thesis_2003.pdf">a study in reliability</a>. That&#8217;s really not even debatable. Unfortunately, it&#8217;s turned into a frustrating one-sided conversation because Ted lacks any detailed knowledge of Erlang, so he keeps unhelpfully trying to shift the focus elsewhere.</p>
<p>In his past two responses, Ted has picked at my questions like a grammar school English teacher, accusing me of conflating things, making bad assumptions, etc. I see that <a href="http://patricklogan.blogspot.com/2008/05/toward-finer-tuning-of-definitions-of.html">Patrick Logan is trying to clarify things</a>, which might help. Yet Ted still hasn&#8217;t adequately explained why he&#8217;s taken such a hard stance against reliability being a fundamental feature of Erlang, nor how UNIX processes and Erlang processes are the same, as he keeps asserting, nor has he explained why he thinks it&#8217;s much, much harder to make an Erlang application manageable and monitorable than it is to build Erlang&#8217;s reliability into other systems like the JVM or Scala.</p>
<p>But now, we see the worst: the &#8220;programmer challenge.&#8221; Ugh. Thankfully, I&#8217;m sure most readers know that a programming contest of the sort Ted proposes would prove absolutely nothing. I guess he proposed it because I mentioned how I recently spent a quarter of a day making an Erlang application monitorable, in response to his continued claims that doing so is really hard, so now he wants to make a competition of it. I&#8217;d rather that you just explain, Ted, the experiences you&#8217;ve had that have led you to claim that Erlang applications can&#8217;t be easily managed or monitored. Better yet, since you&#8217;re the one who wants a contest, and given that you&#8217;re the one making all the claims, why don&#8217;t you go off and see how quickly you can build Erlang&#8217;s reliability into Scala and the JVM, since you claim it&#8217;s so simple?</p>
<p>If you&#8217;re a regular reader of Ted&#8217;s blog, you know that Ted generally offers good advice and you can learn useful things from him. He&#8217;s a good writer and a wonderful conference presenter, as he can make hard concepts easier to grok and generally does so with humor to keep you awake. But I feel that anyone in Ted&#8217;s position has a responsibility to avoid passing off incorrect information to his readers as fact. My advice therefore is simply that you don&#8217;t take what Ted says as gospel for this particular topic. Let me assure you that Erlang offers far, far more value than just exceptional concurrency support, which is where <a href="http://blogs.tedneward.com/2008/04/29/Groovy+Or+JRuby.aspx">Ted&#8217;s initial posting in this thread</a> seemed to want to limit it, and which is all I objected to. Unlike Ted, I&#8217;ve written quite a bit of Erlang code, and I use it every single day. If you write distributed systems, you owe it to yourself to explore Erlang&#8217;s capabilities and features. I&#8217;ve been writing and researching middleware and distributed systems for nearly 20 years now, and I&#8217;ve seen a lot over the years. Erlang is <em>by far</em> the most innovative and sound approach to distributed systems development I&#8217;ve ever seen and experienced &mdash; the trade-offs its designers chose are simply excellent. Like I&#8217;ve said numerous times over the past year, I really wish I&#8217;d found Erlang a decade ago, because I know for certain it would have saved my teams and me countless hours of development time.</p>
]]></content:encoded>
			<wfw:commentRss>http://steve.vinoski.net/blog/2008/05/18/clearly-time-to-end-this/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Thinking in Language, But Not Clearly</title>
		<link>http://steve.vinoski.net/blog/2008/05/09/thinking-in-language-but-not-clearly/</link>
		<comments>http://steve.vinoski.net/blog/2008/05/09/thinking-in-language-but-not-clearly/#comments</comments>
		<pubDate>Fri, 09 May 2008 23:04:03 +0000</pubDate>
		<dc:creator>steve</dc:creator>
				<category><![CDATA[commentary]]></category>
		<category><![CDATA[distributed systems]]></category>
		<category><![CDATA[erlang]]></category>
		<category><![CDATA[languages]]></category>
		<category><![CDATA[reliability]]></category>

		<guid isPermaLink="false">http://steve.vinoski.net/blog/?p=68</guid>
		<description><![CDATA[Ted Neward finally responds to my comments about his remarks concerning Erlang. I really don&#8217;t mean to pick on Ted &#8212; I like Ted! &#8212; but unfortunately, this time around his response misses the mark in more ways than one. First, Ted says: Erlang&#8217;s reliability model&#8211;that is, the spawn-a-thousand-processes model&#8211;is not unique to Erlang. In [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://blogs.tedneward.com/2008/05/09/Thinking+In+Language.aspx">Ted Neward finally responds</a> to <a href="/blog/2008/05/01/erlang-its-about-reliability/">my comments about his remarks concerning Erlang</a>. I really don&#8217;t mean to pick on Ted &mdash; I like Ted! &mdash; but unfortunately, this time around his response misses the mark in more ways than one.</p>
<p>First, Ted says:</p>
<blockquote><p><em>Erlang&#8217;s reliability model&#8211;that is, the spawn-a-thousand-processes model&#8211;is not unique to Erlang. In fact, it&#8217;s been the model for Unix programs and servers, most notably the Apache web server, for decades. When building a robust system under Unix, a master-slave model, in which a master process spawns (and monitors) n number of child processes to do the actual work, offers that same kind of reliability and robustness. If one of these processes fail (due to corrupted memory access, operating system fault, or what-have-you), the process can simply die and be replaced by a new child process.</em></p></blockquote>
<p>There&#8217;s really no comparison between the UNIX process model (which BTW I hold in very high regard) and Erlang&#8217;s approach to achieving high reliability. They are simply not at all the same, and there&#8217;s no way you can claim that UNIX &#8220;offers that same kind of reliability and robustness&#8221; as Erlang can. If it could, wouldn&#8217;t virtually every UNIX process be consistently yielding reliability of five nines or better?</p>
<p>Obviously, achieving high reliability requires at least two computers. On those systems, what part of the UNIX process model allows a process on one system to seamlessly fork child processes on another and monitor them over there? Yes, there are ways to do it, but would anyone claim they are as reliable and robust as Erlang&#8217;s approach? I sure wouldn&#8217;t. Also, UNIX pipes provide IPC for processes on the same host, but what about communicating with processes on other hosts? Yes, there are many, many ways to achieve that as well &mdash; after all, I&#8217;ve spent most of my career working on distributed computing systems, so I&#8217;m well aware of the myriad choices here &mdash; but that&#8217;s actually a problem in this case: too many choices, too many trade-offs, and far too many ways to get it wrong. Erlang can achieve high reliability in part because it solves these issues, and a whole bunch of other related issues such as live code upgrade/downgrade, extremely well.</p>
<p>Ted continues:</p>
<blockquote><p><em>There is no reason a VM (JVM, CLR, Parrot, etc) could not do this. In fact, here&#8217;s the kicker: it would be easier for a VM environment to do this, because VM&#8217;s, by their nature, seek to abstract away the details of the underlying platform that muddy up the picture.</em></p></blockquote>
<p>In your original posting, Ted, you criticized Erlang for having its own VM, yet here you say that a VM approach can yield the best solution for this problem. Aren&#8217;t you contradicting yourself?</p>
<blockquote><p><em>It would be relatively simple to take an Actors-based Java application, such as that currently being built in Scala, and move it away from a threads-based model and over to a process-based model (with the JVM constuction[sic]/teardown being handled entirely by underlying infrastructure) with little to no impact on the programming model.</em></p></blockquote>
<p>Would it really be &#8220;relatively simple?&#8221; Even if what you describe really were relatively simple, which I strongly doubt, there&#8217;s still no guarantee that the result would help applications get anywhere near the levels of reliability they can achieve using Erlang.</p>
<blockquote><p><em>As to Steve&#8217;s comment that the Erlang interpreter isn&#8217;t monitorable, I never said that&#8211;I said that Erlang was not monitorable using current IT operations monitoring tools. The JVM and CLR both have gone to great lengths to build infrastructure hooks that make it easy to keep an eye not only on what&#8217;s going on at the process level (&#8220;Is it up? Is it down?&#8221;) but also what&#8217;s going on inside the system (&#8220;How many requests have we processed in the last hour? How many of those were successful? How many database connections have been created?&#8221; and so on). Nothing says that Erlang&#8211;or any other system&#8211;can&#8217;t do that, but it requires the Erlang developer build that infrastructure him-or-herself, which usually means it&#8217;s either not going to get done, making life harder for the IT support staff, or else it gets done to a minimalist level, making life harder for the IT support staff.</em></p></blockquote>
<p>I know what you meant in your original posting, Ted, and my objection still stands. Are you saying here that all Java and .NET applications are by default network-monitoring-friendly, whereas Erlang applications are not? I seem to recall quite a bit of effort spent by various teams at my previous employer to make sure our distributed computing products, including the Java-based products and .NET-based products, played reasonably well with network monitoring systems, and I sure don&#8217;t recall any of it being automatic. Yes, it&#8217;s nice that the Java and CLR guys have made their infrastructure monitorable, but that doesn&#8217;t relieve developers of the need to put actual effort into tying their applications into the monitoring system in a way that provides useful information that makes sense. There is no magic here, and in my experience, even with all this support, it <em>still</em> doesn&#8217;t guarantee that monitoring support will be done to the degree that the IT support staff would like to see.</p>
<p>And do you honestly believe Erlang &mdash; conceived, designed, implemented, and maintained by a large well-established telecommunications company for use in highly-reliable telecommunications systems &mdash; would offer <em>nothing</em> in the way of tying into network monitoring systems? I guess SNMP, for example, doesn&#8217;t count anymore?</p>
<p>(Coincidentally, I recently had to tie some of the Erlang stuff I&#8217;m currently working on into a monitoring system which isn&#8217;t written in Erlang, and it took me maybe a quarter of a workday to integrate them. I&#8217;m absolutely certain it would have taken longer in Java.)</p>
<p>But here&#8217;s the part of Ted&#8217;s response that I really don&#8217;t understand:</p>
<blockquote><p><em>So given that an execution engine could easily adopt the model that gives Erlang its reliability, and that using Erlang means a lot more work to get the monitorability and manageability (which is a necessary side-effect requirement of accepting that failure happens), hopefully my reasons for saying that Erlang (or Ruby&#8217;s or any other native-implemented language) is a non-starter for me becomes more clear.</em></p></blockquote>
<p>Ted, first you state that an execution engine could (emphasis mine) &#8220;<em>easily</em> adopt the model that gives Erlang its reliability,&#8221; and then you say that it&#8217;s &#8220;a lot more work&#8221; for anyone to write an Erlang application that can be monitored and managed? Aren&#8217;t you getting those backwards? It should be obvious that in reality, writing a monitorable Erlang app is not hard at all, whereas building Erlang-level reliability into another VM would be a considerably complicated and time-consuming undertaking.</p>
]]></content:encoded>
			<wfw:commentRss>http://steve.vinoski.net/blog/2008/05/09/thinking-in-language-but-not-clearly/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Erlang: It&#8217;s About Reliability</title>
		<link>http://steve.vinoski.net/blog/2008/05/01/erlang-its-about-reliability/</link>
		<comments>http://steve.vinoski.net/blog/2008/05/01/erlang-its-about-reliability/#comments</comments>
		<pubDate>Thu, 01 May 2008 13:19:10 +0000</pubDate>
		<dc:creator>steve</dc:creator>
				<category><![CDATA[erlang]]></category>
		<category><![CDATA[reliability]]></category>

		<guid isPermaLink="false">http://steve.vinoski.net/blog/?p=66</guid>
		<description><![CDATA[In a recent post, Ted Neward gives a brief description of a variety of programming languages. It&#8217;s a useful post; I&#8217;ve known Ted for a while now, and he&#8217;s quite knowledgeable about such things. Still, I have to comment on what he says about Erlang: Erlang. Joe Armstrong&#8217;s baby was built to solve a specific set [...]]]></description>
			<content:encoded><![CDATA[<p>In a <a href="http://blogs.tedneward.com/2008/04/29/Groovy+Or+JRuby.aspx">recent post, Ted Neward</a> gives a brief description of a variety of programming languages. It&#8217;s a useful post; I&#8217;ve known Ted for a while now, and he&#8217;s quite knowledgeable about such things. Still, I have to comment on what he says about Erlang:</p>
<blockquote><p><em><strong>Erlang</strong>. Joe Armstrong&#8217;s baby was built to solve a specific set of problems at Ericsson, and from it we can learn a phenomenal amount about building massively parallel concurrent programs. The fact that it runs on its own interpreter, bad.</em></p></blockquote>
<p>I might have said it like this:</p>
<blockquote><p><em><strong>Erlang</strong>. Joe Armstrong&#8217;s baby was built to solve a specific set of problems at Ericsson, and from it we can learn a phenomenal amount about building highly reliable systems that can also support massive concurrency. The fact that it runs on its own interpreter, good; otherwise, the reliability wouldn&#8217;t be there and it would be just another curious but useless concurrency-oriented language experiment.</em></p></blockquote>
<p>Far too many blog posts and articles that touch on Erlang completely miss the point that reliability is an extremely important aspect of the language.</p>
<p>To achieve reliability, you have to accept the fact that failure <em>will</em> occur. Once you accept that, then other things fall into place: you need to be able to restart things quickly, and to do that, processes need to be cheap. If something fails, you don&#8217;t want it taking everything else with it, so you need to at least minimize, if not eliminate, sharing, which leads you to message passing. You also need monitoring capabilities that can detect failed processes and restart them (BTW in the same posting Ted seems to claim that Erlang has no monitoring capabilities, which baffles me).</p>
<p>Massive concurrency capabilities become far easier with an architecture that provides lightweight processes that share nothing, but that doesn&#8217;t mean that once you design it, the rest is just a simple matter of programming. Rather, actually <em>implementing</em> all this in a way that delivers what&#8217;s needed and performs more than adequately for production-quality systems is an incredibly enormous challenge, one that the Erlang development team has quite admirably met, and that&#8217;s an understatement if there ever was one.</p>
<p>They come for the concurrency but they stay for the reliability. Do any other &#8220;Erlang-like&#8221; languages have real, live, production systems in the field that have been running non-stop for years? (That&#8217;s not a rhetorical question; if you know of any such languages, please let me know.) Next time you see yet another posting about Erlang and concurrency, especially those of the form &#8220;Erlang-like concurrency in language X!&#8221; just ask the author: where&#8217;s the reliability?</p>
]]></content:encoded>
			<wfw:commentRss>http://steve.vinoski.net/blog/2008/05/01/erlang-its-about-reliability/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		</item>
	</channel>
</rss>
