Implementing a complete high availability Alfresco solution using open source technologies

As a proof of concept, I have done some research and experimenting to determine the best way of clustering Alfresco using completely open source components. I wanted a solution that offered load balancing as well as fault tolerance. There are three components outside of Alfresco that are needed to achieve this.

  1. Load balancer
  2. File system
  3. Database

The load balancer is the simplest component of the three, and the one with the most options available. We just need a load balancer that is able to handle sticky sessions. A dumb load balancer which round robins connections will not work for this scenario.

Alfresco stores all content as regular files. (Unlike SharePoint, which puts content in the database. Yikes!) In order to achieve HA on the content repository we need some sort of clustered or replicating file system. Not long ago, clustered file systems were out of reach for the open source community. It is great that we now have some viable open source options.

The last component needed, of course, is the database. Unfortunately, there is no viable multi-master open source option. Many projects, such as Bucardo, are working towards this, but nothing currently available is a drop-in replacement or production ready. The good news is that a master-slave(s) setup can still achieve HA and some degree of load balancing.

Here is the complete solution I implemented:

 

Alfresco: Alfresco Enterprise 4.0.2

I used the latest version of Alfresco Enterprise at the time of writing, simply because it is what I deal with the most. I believe the Community Edition would work just as well in this scenario, since the heart of Alfresco clustering is within Ehcache.

 

Load balancer: HAProxy

HAProxy is known to be very stable and is currently used on some very high traffic web sites. It also lets us keep track of sessions via the JSESSIONID cookie. Another great feature is that we can take fault detection further and test a web script page in Alfresco to determine whether Alfresco is actually running. (http://admin:passwd@server1/alfresco/wcs/s is a great page to check.)
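To make the sticky session idea concrete, here is a minimal Python sketch of the behaviour I rely on from the load balancer. It is a toy model and not how HAProxy is implemented: the class name `StickyBalancer`, the `mark` method and the server names are all made up for illustration. A session is pinned to the node that first served it, and only moves when a health check has marked that node as down.

```python
from typing import Dict, Iterable, List, Optional


class StickyBalancer:
    """Toy model of cookie-based stickiness (hypothetical, not HAProxy code)."""

    def __init__(self, servers: Iterable[str]) -> None:
        self.servers: List[str] = list(servers)
        self.healthy = set(self.servers)        # nodes that passed the last check
        self.sessions: Dict[str, str] = {}      # JSESSIONID value -> pinned node
        self._next = 0                          # round robin counter

    def mark(self, server: str, up: bool) -> None:
        """Record the result of a health check for a known node."""
        if server not in self.servers:
            raise ValueError(f"unknown server: {server}")
        if up:
            self.healthy.add(server)
        else:
            self.healthy.discard(server)

    def route(self, jsessionid: Optional[str] = None) -> str:
        """Return the node that should handle a request."""
        pinned = self.sessions.get(jsessionid)
        if pinned in self.healthy:
            return pinned                       # keep the session where it is
        candidates = [s for s in self.servers if s in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy Alfresco node available")
        # New session, or the pinned node is down: round robin over healthy nodes.
        server = candidates[self._next % len(candidates)]
        self._next += 1
        if jsessionid is not None:
            self.sessions[jsessionid] = server  # pin (or re-pin) the session
        return server
```

For example, `StickyBalancer(["server1", "server2"]).route("abc")` keeps returning the same node for the session `abc` until `mark(node, False)` is called for that node, at which point the session is re-pinned to the surviving node.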

A small portion of people will look at this diagram and say to themselves, “But there is a single point of failure!” HAProxy is a very simple component, and it would be easy to set up an active/passive automatic failover. Very stable physical and virtual alternatives also exist.

I should also note that we have tested HAProxy using single sign on authentication via Active Directory Kerberos. I assume NTLM would work just fine as well.

 

Clustered file system: GlusterFS

I had read good things about GlusterFS, but this was my first hands-on experience with it. I was shocked at how simple and quick it was to get up and running: one command to add the second server, and another to get replication going. No messing with configuration files. You can even have 4 servers and enable both replication and striping, similar to the way RAID 10 (or 0+1) works, but across servers. This is a perfect fit for storing Alfresco’s content: load balancing and seamless fault tolerance.
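The RAID 10 comparison can be sketched in a few lines of Python. This is only a conceptual model of "replicate within a pair, stripe across pairs" and makes no claim about how GlusterFS lays out data internally; the function name `plan_chunks` and the server names are hypothetical.

```python
from typing import List


def plan_chunks(servers: List[str], replica: int, n_chunks: int) -> List[List[str]]:
    """Return, for each chunk of a file, the servers that hold a copy of it.

    Servers are grouped into replica sets of size `replica` (the mirror part),
    and chunks are dealt out across those sets in turn (the stripe part).
    """
    if replica < 1 or len(servers) % replica != 0:
        raise ValueError("server count must be a multiple of the replica count")
    replica_sets = [servers[i:i + replica] for i in range(0, len(servers), replica)]
    return [replica_sets[chunk % len(replica_sets)] for chunk in range(n_chunks)]


def readable_after_failure(plan: List[List[str]], failed: List[str]) -> bool:
    """A file stays readable if every chunk still has at least one live copy."""
    return all(any(s not in failed for s in holders) for holders in plan)
```

With 4 servers and a replica count of 2, losing one server from each pair still leaves every chunk readable, while losing both servers of the same pair does not. That is the same trade-off as RAID 10.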

 

Replicating database: PostgreSQL + pgpool-II

MySQL is still an option, but I chose to go with Postgres here. I liked some of the HA features Postgres provides that seemed lacking in MySQL. Unfortunately, either way we have to use a master-slave replication configuration.

In order to achieve load balancing and fault tolerance we need to put pgpool-II in front of the databases. It takes read-only queries and load balances them between the master and slave(s). Commands that involve any kind of update or write are forwarded to the master, and the changes are then streamed to the slaves. This makes writes slower than on a standalone database, but the average Alfresco install should be primarily reads. Pgpool can also be configured to use parallel queries, which means large queries can be split up amongst servers.
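The read/write split described above can be sketched as follows. This is a deliberately naive Python model, not pgpool-II's actual logic: the class `QueryRouter` and the node names are hypothetical, and treating "starts with SELECT" as read-only is an assumption made to keep the example short. The real rules are more involved.

```python
from typing import List


class QueryRouter:
    """Toy model: reads are spread over all nodes, writes go to the master."""

    def __init__(self, master: str, slaves: List[str]) -> None:
        self.master = master
        self.slaves = list(slaves)
        self._next = 0  # round robin counter for read queries

    @staticmethod
    def is_read_only(sql: str) -> bool:
        # Naive assumption for this sketch: only plain SELECTs count as reads.
        words = sql.strip().split(None, 1)
        return bool(words) and words[0].upper() == "SELECT"

    def route(self, sql: str) -> str:
        """Return the name of the node that should run this statement."""
        if not self.is_read_only(sql):
            return self.master                  # writes always hit the master
        nodes = [self.master] + self.slaves     # reads may use any node
        node = nodes[self._next % len(nodes)]
        self._next += 1
        return node
```

With one master and one slave, consecutive `SELECT` statements alternate between the two nodes, while every `INSERT`, `UPDATE` or `DELETE` is sent to the master.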

Pgpool will also detect faults, so if any of the slaves goes down it will just take it out of the pool. If the master goes down, it will promote one of the slaves to be the new master. To guard against a problem with Pgpool itself, an active/passive configuration similar to the one for HAProxy can be used to add some redundancy.
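The failover behaviour boils down to a small state change, which the following Python sketch tries to capture. It is an illustration only: the class `DbPool` and its method names are hypothetical, and promotion is reduced to list manipulation, whereas in a real deployment the promotion itself has to be performed on the database servers.

```python
from typing import List


class DbPool:
    """Toy model of the pool membership kept by the database proxy."""

    def __init__(self, master: str, slaves: List[str]) -> None:
        self.master = master
        self.slaves = list(slaves)

    def node_failed(self, node: str) -> None:
        """React to a detected fault on `node`."""
        if node == self.master:
            if not self.slaves:
                raise RuntimeError("master failed and no slave is left to promote")
            # Promote the first remaining slave to be the new master.
            self.master = self.slaves.pop(0)
        elif node in self.slaves:
            # A failed slave is simply taken out of the pool.
            self.slaves.remove(node)
        else:
            raise ValueError(f"unknown node: {node}")
```

Starting from `DbPool("db1", ["db2", "db3"])`, a failure of `db3` leaves `db1` as master with `db2` as the only slave, and a subsequent failure of `db1` promotes `db2`.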

 

Enjoy your content management uptime! And feel free to drop me a comment.

This entry was posted in Alfresco, Alfresco Share, ECM by trevor.dell.

About trevor.dell

Alfresco Certified Engineer

Trevor is a Systems Architect with a broad and extensive understanding of many software and hardware platforms, specializing in designing highly available environments and applications. Over the years he has gained experience with dozens of programming and scripting languages.

Trevor likes contributing to the open source ecosystem in his spare time when he is not chasing after the wind with one of his windsurfers.

6 thoughts on “Implementing a complete high availability Alfresco solution using open source technologies”

  1. Since we do all our work with Enterprise Edition, I can not say for certain. But, since most of the clustering within Alfresco takes place within Ehcache (which to my knowledge is implemented the same way in Community), I can see it still working. One thing I know that is not implemented in Community Edition is JGroups, but that is only used for cluster node discovery.

  2. Hi,

    I created this architecture but I have one problem with the connection between pgpool and Alfresco. Alfresco starts very slowly (around 20 min); when I connect directly to the database everything is fine (the system starts in 5 min).
    Do you remember the configuration between pgpool and Alfresco?

    Regards

  3. Twenty minutes for Alfresco to start? You must be out of memory and it is swapping. That is ridiculously slow. Alfresco should start in about 1 minute, maybe 2 on a slower system or low resource VM. Check your Java heap size and the memory your OS has. I would not use less than 2G for the OS, while setting the heap size to at least 1G for Alfresco.

    I did look for the cluster I set up; unfortunately, I did some house cleaning and deleted those VMs not too long ago, so I do not have the exact configuration.

    I am pretty sure I just followed the documentation at: http://www.pgpool.net/docs/latest/pgpool-en.html#master_slave_mode

    I would have set up master-slave mode with streaming replication.

  4. In this case, there is a Lucene index running on each of the application servers. Alfresco packages Solr version 1.4 in the latest Alfresco 4.1.5 Enterprise; we have found this to be unstable. I am not sure why such an old version is being packaged (Solr 4.4 is the latest.) So actually all of our production installations use Lucene.

    Even with Lucene, you can still take a node down to rebuild indexes with no down time. The features of Solr do seem more appealing, but hopefully Alfresco will resolve this soon.

    One of the things I have wanted to do is experiment with putting updated versions of Solr in. At a quick glance it did not seem trivial, and of course when dealing with Alfresco Enterprise, it is not supported.
