Implementing a complete high availability Alfresco solution using open source technologies

As a proof of concept, I have done some research and experimenting to determine the best way of clustering Alfresco using completely open source components. I wanted a solution that offered load balancing as well as fault tolerance. There are three components outside of Alfresco that are needed to achieve this.

  1. Load balancer
  2. File system
  3. Database

The load balancer is the simplest component of the three, and the one with the most options available. We just need a load balancer that is able to handle sticky sessions. A dumb load balancer which round robins connections will not work for this scenario.

Alfresco stores all the content as regular files. (Unlike Sharepoint putting content in the database. Yikes!) In order to achieve HA on the content repository we need some sort of clustered or replicating file system. It was not long ago when clustered file systems were out of reach from the open source community. It is great that we now have some viable open source options now.

The last component needed, of course, is the database. Unfortunately, there is no viable multi-master open source option. There are many projects that are working towards this, such as Bucardo. But there is nothing currently that is a drop in replacement and/or production ready. The good news is we still have a master-slave(s) setup that can still achieve HA and some sort of load balancing.

Here is the complete solution I implemented:

 

Alfresco: Alfresco Enterprise 4.0.2

I used the latest version of Alfresco Enterprise at the time of writing this, just since it is what I deal with the most. I believe the Community Edition would work just fine as well in this scenario since the heart of Alfresco clustering is within Ehcache.

 

Load balancer: HAProxy

HAProxy is known to be very stable and currently used on some very high traffic web sites. It also gives us the functionality to keep track of sessions via the JSESSIONID cookie. Another great feature is we can take the fault detection further, and test a web script page in Alfresco to determine if Alfresco is currently running. (http://admin:passwd@server1/alfresco/wcs/s is a great page to check.)

There will be a small portion of people that were looking at this diagram and saying to themselves, “But there is a single point of failure!” HAProxy is a very simple component, and it would be easy to set up an active/passive automatic fail over. Also very stable physical and virtual options exist.

I should also note that we have tested HAProxy using single sign on authentication via Active Directory Kerberos. I assume NTLM would work just fine as well.

 

Clustered file system: GlusterFS

I have read good things about GlusterFS, but this was my first hands on experience with it. I was shocked how simple and quick this was to get up and running. A command to add the second server, and another to get the replication going. No messing with configuration files. You can even have 4 servers and enable replication and striping. Similar to the way RAID 10 (or 0+1) works, but across servers. This is a perfect fit for putting Alfresco’s content. Load balancing and seamless fault tolerance.

 

Replicating database: PostgreSQL + pgpool-II

MySQL is still an option, but I chose to go with Postgres here. I liked some of the HA features Postgres provided that seemed lacking in MySQL.  Unfortunately, either way we have to use a master-slave replication configuration.

In order to achieve load balancing and fault tolerance we need to put pgpool-II in front on the databases. It will take read only queries and load balance them between the master and slave(s). Commands that involve any kind of updates, or writes will be forwarded to the master which in turn get streamed to the slaves. This makes writes slower than a standalone database, but most Alfresco installs should be primarily reads for the average implementation. Pgpool can also be configured to use parallel queries. This means large queries can be split up amongst servers.

Pgpool will also detect any faults, so if any of the slaves go down it will just take them out of the pool. And if the master goes down, it will take one of the slaves and promote it to the new master. For the chance of a problem with Pgpool, a similar configuration with HAProxy, an active/passive configuration can be used to add some redundancy.

 

Enjoy your content management uptime! And feel free to drop me a comment.