{"id":244,"date":"2009-02-22T17:33:48","date_gmt":"2009-02-22T21:33:48","guid":{"rendered":"http:\/\/steve.vinoski.net\/blog\/?p=244"},"modified":"2009-02-22T17:33:48","modified_gmt":"2009-02-22T21:33:48","slug":"controlling-erlangs-heart","status":"publish","type":"post","link":"https:\/\/steve.vinoski.net\/blog\/2009\/02\/22\/controlling-erlangs-heart\/","title":{"rendered":"Controlling Erlang&#8217;s Heart"},"content":{"rendered":"<p>Erlang&#8217;s <a href=\"http:\/\/erlang.org\/doc\/man\/heart.html\">heart<\/a> feature provides a heartbeat-based monitoring capability for Erlang runtime systems, with the ability to restart a runtime system if it fails. It works reasonably well, but one issue is that if an error causes repeated, immediate runtime crashes, <code>heart<\/code> will happily keep restarting the runtime over and over again, ad infinitum.<\/p>\n<p>For <a href=\"http:\/\/yaws.hyber.org\/\">yaws 1.80<\/a>, released a few days ago on Feb. 12, I added a check to the <code>heart<\/code> setup in the <code>yaws<\/code> startup script to prevent endless restarts. I thought I&#8217;d share it here because it&#8217;s useful for Erlang systems in general and is in no way specific to yaws. It works by passing startup information from one incarnation to the next, checking that information to detect multiple restarts within a given time period. We track both the startup time and the restart count, and if we detect 5 restarts within a 60-second period, we stop completely. 
This is not to say that yaws is in dire need of this capability &mdash; it&#8217;s extremely stable in general and 1.80 in particular is a very good release &mdash; but I added it mainly because other Erlang apps sharing the same runtime instance as yaws may not enjoy that same high level of stability, especially while they&#8217;re still under development.<\/p>\n<p>The command <code>heart<\/code> runs to start a new instance is set in the <code>HEART_COMMAND<\/code> environment variable. For yaws, it&#8217;s set like this (I&#8217;ve split this over multiple lines for clarity, but it&#8217;s just one line in the actual script):<\/p>\n<pre>HEART_COMMAND=\"${ENV_PGM} \\\r\n  HEART=true \\\r\n  YAWS_HEART_RESTARTS=$restarts \\\r\n  YAWS_HEART_START=$starttime \\\r\n  $program \"${1+\"$@\"}<\/pre>\n<p>where<\/p>\n<ul>\n<li><code>${ENV_PGM}<\/code> is <code><a href=\"http:\/\/docs.sun.com\/app\/docs\/doc\/816-5165\/env-1\">\/usr\/bin\/env<\/a><\/code>, which allows us to set environment variables for the execution of a given command.<\/li>\n<li><code>HEART<\/code> is an environment variable that we use to indicate the command was launched by <code>heart<\/code>.<\/li>\n<li><code>YAWS_HEART_RESTARTS<\/code> is an environment variable that we use to track the number of restarts already seen. The yaws script initially sets this to 1 and increments it for each heart restart.<\/li>\n<li><code>YAWS_HEART_START<\/code> is an environment variable that we use to track the time of the current round of restarts. 
This is tracked as <a href=\"http:\/\/en.wikipedia.org\/wiki\/Unix_time\">UNIX time<\/a>, obtained by the script via the &#8220;<code>date -u +%s<\/code>&#8221; command.<\/li>\n<li><code>$program<\/code> is the yaws script itself, i.e., <code>$0<\/code>.<\/li>\n<li><code>${1+\"$@\"}<\/code> is a portable shell idiom that passes all the original arguments of the script along to <code>$program<\/code> unchanged, expanding to nothing at all when the script was given no arguments.<\/li>\n<\/ul>\n<p>The yaws script looks for <code>HEART<\/code> set to true, indicating that it was launched by <code>heart<\/code>. In that case, it checks <code>YAWS_HEART_RESTARTS<\/code> and <code>YAWS_HEART_START<\/code> to see how many restarts we&#8217;ve seen since the start time. We get the current UNIX time and subtract the <code>YAWS_HEART_START<\/code> time; if the difference is less than or equal to 60 seconds and the restart count is 5, we exit completely without restarting the Erlang runtime. Otherwise we restart, first adjusting these environment variables. If the restart count is less than 5 within the 60-second window, we increment the restart count and set the new value into <code>YAWS_HEART_RESTARTS<\/code> but keep the same <code>YAWS_HEART_START<\/code> time. But if the current time is more than 60 seconds past the start time, we reset <code>YAWS_HEART_RESTARTS<\/code> to 1 and set a new start time for <code>YAWS_HEART_START<\/code>. 
Look at the <a href=\"http:\/\/erlyaws.svn.sourceforge.net\/viewvc\/erlyaws\/trunk\/yaws\/scripts\/yaws.template?view=markup\">yaws script<\/a> to see the details of this logic &mdash; scroll down to the part starting with <code>if [ \"$HEART\" = true ]<\/code>.<\/p>\n<p>Note that this approach is much like the way Erlang <code>receive<\/code> loops generally track state, by recursively passing state information to themselves.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Erlang&#8217;s heart feature provides a heartbeat-based monitoring capability for Erlang runtime systems, with the ability to restart a runtime system if it fails. It works reasonably well, but one issue with it is that if an error occurs such that it causes repeated immediate runtime crashes, heart will happily keep restarting the runtime over and [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[58,4,96],"tags":[169,140,178],"class_list":["post-244","post","type-post","status-publish","format-standard","hentry","category-code","category-erlang","category-reliability","tag-code","tag-erlang","tag-reliability"],"_links":{"self":[{"href":"https:\/\/steve.vinoski.net\/blog\/wp-json\/wp\/v2\/posts\/244","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/steve.vinoski.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/steve.vinoski.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/steve.vinoski.net\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/steve.vinoski.net\/blog\/wp-json\/wp\/v2\/comments?post=244"}],"version-history":[{"count":34,"href":"https:\/\/steve.vinoski.net\/blog\/wp-json\/wp\/v2\/posts\/244\/revisions"}],"predecessor-version":[{"id":278,"href":"https:\/\/steve.vinoski.net\/blog\/wp-json\/wp\/v2\/posts\/244\/revisions\/278"}],"w
p:attachment":[{"href":"https:\/\/steve.vinoski.net\/blog\/wp-json\/wp\/v2\/media?parent=244"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/steve.vinoski.net\/blog\/wp-json\/wp\/v2\/categories?post=244"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/steve.vinoski.net\/blog\/wp-json\/wp\/v2\/tags?post=244"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}