High Availability - Automated origin failover using CloudFlare, Nagios and OpenShift

Written by Alexandre De Dommelin Tue Sep 25 19:04:49 UTC 2012
A few days ago, after repeated downtimes, Steve Souders tweeted:

I found this question very interesting, and here is an answer. My criterion was to build something without refactoring my whole current setup, which is roughly composed of:

  • CloudFlare as CDN in front of www.tuxz.net,
  • Nagios as monitoring system,
  • A blog powered by Nanoblogger,
  • Dokuwiki,
  • A lot of custom PHP scripts ...

All of this runs in an OpenVZ container on a single physical server somewhere on the planet.
Oh, and there is also a (no longer) unused free account on OpenShift (Red Hat's Platform as a Service) :-)

Creating origin failover site

First, create a new application on OpenShift called "failover". Depending on what your namespace is set to, this application will be accessible at a URL such as http://failover-tuxz.rhcloud.com/

Right now, your application is empty and only accessible using its default domain name. As we want it to eventually answer requests targeted at our main domain name, we need to add that name as an alias. This operation can only be performed using the OpenShift client. The installation is quite straightforward:

$ sudo gem install rhc
$ rhc setup

You can now add your alias and use git to clone your brand new OpenShift application onto your current origin, i.e.:

$ rhc app add-alias -a failover --alias www.tuxz.net
$ git clone ssh://[email protected]/~/git/failover.git/ /var/www/www-failover.tuxz.net/
$ tree -L 1 /var/www/
|-- www-failover.tuxz.net
|-- www.tuxz.net

Now you just need to create a custom cron job to rsync, statify, torture, then commit and push changes to your OpenShift application:

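PRIMARY_ROOT="/var/www/www.tuxz.net"
FAILOVER_ROOT="/var/www/www-failover.tuxz.net"
TS="$(date -u)"  # timestamp format is an assumption

# Exclude .git so that --delete does not wipe the clone's repository data
rsync -rvl --delete --exclude='.git' "${PRIMARY_ROOT}/" "${FAILOVER_ROOT}/"

#- do all your custom stuff here -#
cd "${FAILOVER_ROOT}" || exit 1
git add -A .
git commit -m "www.tuxz.net - ${TS}"
git push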

At this point, your OpenShift app should contain an exact (or tortured) copy of your primary origin.
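
If the sync script runs from cron, two runs could overlap when a push is slow. Here is a minimal sketch of a wrapper that skips a run while the previous one is still active. The lock path, the run_locked name, the script path and the schedule shown in the comment are assumptions made for illustration, not part of the setup described here.

```shell
#!/bin/sh
# Sketch only: LOCK_DIR and run_locked are hypothetical names.
# Example crontab line (assumed schedule and script path):
#   */15 * * * * /usr/local/bin/sync_failover.sh
LOCK_DIR="${LOCK_DIR:-/tmp/sync_failover.lock}"

# Run the given command only if no other run holds the lock.
# mkdir is atomic, so two overlapping runs cannot both acquire it.
run_locked() {
  if ! mkdir "$LOCK_DIR" 2>/dev/null; then
    echo "previous sync still running, skipping" >&2
    return 0
  fi
  status=0
  "$@" || status=$?
  rmdir "$LOCK_DIR"
  return "$status"
}
```

Inside the cron script, the rsync and git steps could then be grouped in a function and started with something like "run_locked sync_site", where sync_site is whatever name you gave that function.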

Modifying DNS configuration

To make identification easier, update your DNS configuration to add two CNAME records, "www-primary" and "www-failover", pointing respectively to your primary server and your OpenShift application. Then CNAME your "www" entry to "www-primary" and enable CloudFlare services on it.
You should end up with results similar to:

$ dig -t CNAME +short www-primary.tuxz.net

$ dig -t CNAME +short www-failover.tuxz.net

$ dig -t CNAME +short www.tuxz.net
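
To re-check these records later without reading dig output by eye, each name can be compared against the target you expect. This is a minimal sketch assuming dig is installed; resolve_cname and check_cname are hypothetical helper names.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Print the CNAME target of a name, without the trailing dot.
resolve_cname() {
  dig -t CNAME +short "$1" | sed 's/\.$//'
}

# Compare the CNAME target of $1 with the expected target $2.
check_cname() {
  actual="$(resolve_cname "$1")"
  if [ "$actual" = "$2" ]; then
    echo "OK $1 -> $actual"
  else
    echo "MISMATCH $1 -> ${actual:-<none>} (expected $2)"
    return 1
  fi
}
```

For instance, "check_cname www-failover.tuxz.net failover-tuxz.rhcloud.com" should report OK once the failover record is in place.
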

Configuring Nagios to switch traffic to failover site in case of primary origin failure

We are going to use Nagios's built-in event handler mechanism, which allows us to run scripts "when something happens".
In our case, the script calls the CloudFlare DNS API and changes the origin server of our main domain.
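
Nagios runs an event handler when a service is in a SOFT problem state, when it first enters a HARD problem state, and when it recovers, and it can pass macros such as $SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$ as arguments. As a rough sketch of how a handler can decide when to act, assuming the service is configured with max_check_attempts set to 4 (handler_decision is a hypothetical name):

```shell
#!/bin/sh
# Sketch only: handler_decision is a hypothetical helper.
# Arguments: $1 = $SERVICESTATE$, $2 = $SERVICESTATETYPE$, $3 = $SERVICEATTEMPT$
# Prints "switch" when failover should be triggered, "ignore" otherwise.
handler_decision() {
  state="$1"
  state_type="$2"
  attempt="$3"
  case "$state" in
    CRITICAL)
      # Act on the third soft failure, one check before the state
      # would turn HARD with max_check_attempts set to 4.
      if [ "$state_type" = "SOFT" ] && [ "$attempt" -eq 3 ]; then
        echo "switch"
        return 0
      fi
      ;;
  esac
  echo "ignore"
}
```

Acting on a SOFT state avoids switching on a single failed check while still reacting before notifications are sent.
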

Here is the relevant part of the Nagios configuration :

define service {
  use generic-service
  host_name www-primary.tuxz.net
  service_description Ensure that primary origin is healthy
  check_command your_command
  contact_groups admins
  max_check_attempts 4
  event_handler switch_to_failover_site
}

# commands.cfg
define command {
  command_name switch_to_failover_site
  command_line /usr/local/nagios/libexec/eventhandlers/switch_to_failover_site.sh $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $HOSTADDRESS$ $HOSTDOWNTIME$ $SERVICEDOWNTIME$
}

And the content of switch_to_failover_site.sh :


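#!/bin/sh
# Placeholder values: customize them for your own zone and account
CLOUDFLARE_LOGIN="[email protected]"
CLOUDFLARE_API_KEY="your-api-key"
DNS_ZONE="tuxz.net"
DNS_ENTRY="www"
DNS_ENTRY_TYPE="CNAME"
DNS_ENTRY_ID="your-record-id"
DNS_ENTRY_FAILOVER="www-failover.tuxz.net"

# Point the main DNS entry at the failover origin through the CloudFlare API
__switch_to_failover() {
  /usr/bin/curl https://www.cloudflare.com/api_json.html \
    -d "a=rec_edit" \
    -d "tkn=${CLOUDFLARE_API_KEY}" \
    -d "id=${DNS_ENTRY_ID}" \
    -d "email=${CLOUDFLARE_LOGIN}" \
    -d "z=${DNS_ZONE}" \
    -d "type=${DNS_ENTRY_TYPE}" \
    -d "name=${DNS_ENTRY}" \
    -d "content=${DNS_ENTRY_FAILOVER}" \
    -d "ttl=1" \
    -d "service_mode=1"
}

# Assumption: servicestatus joins $HOSTDOWNTIME$ ($5) and $SERVICEDOWNTIME$ ($6),
# so "00" means that no scheduled downtime is active.
servicestatus="${5}${6}"

[ "$1" = "CRITICAL" ] || exit 0
if [ "$2" = "SOFT" ]; then
  if [ "$3" -eq 3 ]; then
    [ "$servicestatus" = "00" ] && __switch_to_failover
  fi
fi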

Notes about this script :

  • Customize the variables at the top with your domain entries and CloudFlare credentials / API key
  • DNS_ENTRY_ID can be obtained by querying the API with the "rec_load_all" action (see CloudFlare API Doc)
  • The script will trigger the origin switch after 3 failed checks
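
As a rough sketch of the "rec_load_all" lookup, assuming it takes the same endpoint, token, email and zone parameters as the "rec_edit" call in the script (list_dns_records is a hypothetical name and the default values are placeholders):

```shell
#!/bin/sh
# Sketch only: lists every DNS record of a zone so that the record ID
# can be read from the JSON response. Default values are placeholders.
CURL="${CURL:-/usr/bin/curl}"
CLOUDFLARE_LOGIN="${CLOUDFLARE_LOGIN:-you@example.com}"
CLOUDFLARE_API_KEY="${CLOUDFLARE_API_KEY:-your-api-key}"
DNS_ZONE="${DNS_ZONE:-tuxz.net}"

# Query the API for all records of DNS_ZONE and print the raw response.
list_dns_records() {
  "$CURL" -s https://www.cloudflare.com/api_json.html \
    -d "a=rec_load_all" \
    -d "tkn=${CLOUDFLARE_API_KEY}" \
    -d "email=${CLOUDFLARE_LOGIN}" \
    -d "z=${DNS_ZONE}"
}
```

Calling list_dns_records prints the raw JSON. Look for the record matching your main entry and copy its identifier into DNS_ENTRY_ID.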

One interesting side effect of using CloudFlare is that there's almost no DNS propagation delay. In fact, your main entry is not modified "publicly" (it is still CNAME'd to cf-protected-www.tuxz.net), but the update is quickly propagated within the CloudFlare infrastructure. My tests showed a delay of approximately 1 minute.
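
One way to measure such a delay is to poll the site until it reports the new origin. The sketch below is generic: measure_switch_delay is a hypothetical helper, and the probe command is whatever lets you tell the two origins apart, for example fetching a marker file that differs between them (such a file is an assumption, not part of the setup above).

```shell
#!/bin/sh
# Sketch only: measure_switch_delay is a hypothetical helper.
# Usage: measure_switch_delay EXPECTED INTERVAL MAX_TRIES PROBE [ARGS...]
# Runs the probe until it prints EXPECTED, then prints the elapsed seconds.
# Returns 1 if EXPECTED was never seen within MAX_TRIES attempts.
measure_switch_delay() {
  expected="$1"
  interval="$2"
  max_tries="$3"
  shift 3
  start="$(date +%s)"
  tries=0
  while [ "$tries" -lt "$max_tries" ]; do
    if [ "$("$@")" = "$expected" ]; then
      echo "$(( $(date +%s) - start ))"
      return 0
    fi
    tries=$((tries + 1))
    sleep "$interval"
  done
  return 1
}
```

With a marker file containing "failover" on the OpenShift side, something like "measure_switch_delay failover 5 60 curl -s http://www.tuxz.net/origin.txt" would poll every 5 seconds for up to 5 minutes.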