This is the second post in a series on setting up a high-availability web application on a shoestring budget. The goal is to have no single point of failure without spending a lot of money. Read the first post.
Now that we had PostgreSQL up and replicating on two servers, our plan was to put pgpool-II in front of it.
pgpool claims to do a lot, but we were particularly interested in automatic failover, online recovery, and query dispatching (load balancing). Client applications connect to pgpool as if it were a PostgreSQL server. pgpool proxies every query to the real PostgreSQL servers and returns the results. It monitors the servers, stops sending anything to one that fails, and promotes a slave to master if necessary. It can also detect read-only SELECT queries and dispatch them to a pool of slaves.
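To make the dispatching and failover ideas concrete, here is a toy model in Python. It is not pgpool's code: the `ToyRouter` class and the host names are hypothetical, and the read-only check is deliberately naive (a statement is treated as read-only if it starts with SELECT and has no FOR UPDATE/FOR SHARE). Writes go to the master, reads rotate round-robin across healthy slaves, and a failed master is replaced by a healthy slave.

```python
import re

# Naive read-only detection. A real proxy has to handle much more, such as
# functions with side effects inside a SELECT or queries in open transactions.
_SELECT = re.compile(r"^\s*select\b", re.IGNORECASE)
_LOCKING = re.compile(r"\bfor\s+(update|share)\b", re.IGNORECASE)


class ToyRouter:
    """Hypothetical illustration of pgpool-style query dispatching."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)
        self.down = set()
        self._next = 0  # round-robin counter for read queries

    def mark_down(self, node):
        """Stop routing to a failed node; promote a slave if the master failed."""
        self.down.add(node)
        if node == self.primary:
            healthy = [r for r in self.replicas if r not in self.down]
            if healthy:
                # In a real setup this is where failover.sh would run.
                self.primary = healthy[0]
                self.replicas.remove(healthy[0])

    def route(self, sql):
        """Return the node that should run this statement."""
        if not _SELECT.match(sql) or _LOCKING.search(sql):
            return self.primary  # writes and locking reads go to the master
        healthy = [r for r in self.replicas if r not in self.down]
        if not healthy:
            return self.primary  # no slaves left: the master serves reads too
        node = healthy[self._next % len(healthy)]
        self._next += 1
        return node
```

Sending reads to the master when no slave is healthy keeps the application working in a degraded state instead of failing outright.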
It sounded ideal: it would handle all of these problems automatically, and some Googling suggested that other people were using it.
But we couldn’t get it to work. The main instructions for using PostgreSQL 9.0 streaming replication with pgpool are here. Their big flaw is that they walk you through a setup with PostgreSQL running on two different ports on the same machine. We were very careful and spent a lot of time adapting those instructions to our two-server setup.
Installation is not a breeze. The instructions tell you to use an alpha version (no idea why). It builds and installs fine, but then you need to install three sets of custom PostgreSQL functions: you compile them and install the compiled libraries on each server, but load the SQL definitions only on the master (the slave is read-only and receives them through replication). Then there are two shell scripts, failover.sh and basebackup.sh. The failover script promotes a slave to master, and the basebackup script is used for online recovery. The versions in the instructions assume both databases are on the same physical server. I had more luck with another version of basebackup.sh from the official manual.
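With PostgreSQL 9.0 streaming replication, promoting the slave comes down to creating the `trigger_file` named in its recovery.conf; the standby notices the file and becomes a read-write master. A two-server failover script therefore has to do that over the network instead of locally. The sketch below is written in Python rather than shell and is not the script from the instructions. It assumes pgpool's `failover_command` is set up to pass the failed node id and the new master's hostname (the `%d` and `%H` placeholders) as the first two arguments; `PRIMARY_NODE_ID`, `TRIGGER_FILE`, and `promote_command` are hypothetical, and passwordless ssh from the pgpool host to the database servers is assumed.

```python
import shlex
import subprocess
import sys

# Hypothetical settings: the node id pgpool gives the master, and the
# trigger_file path configured in the slave's recovery.conf.
PRIMARY_NODE_ID = 0
TRIGGER_FILE = "/var/lib/postgresql/9.0/main/failover.trigger"


def promote_command(failed_node_id, new_master_host,
                    primary_node_id=PRIMARY_NODE_ID, trigger_file=TRIGGER_FILE):
    """Return the argv that promotes the slave, or None if nothing needs promoting.

    A PostgreSQL 9.0 standby leaves recovery and accepts writes once the
    trigger_file named in its recovery.conf exists.
    """
    if failed_node_id != primary_node_id:
        # A slave died: nothing to promote, pgpool simply stops using it.
        return None
    # Create the trigger file on the surviving server over ssh.
    return ["ssh", "-T", new_master_host, "touch " + shlex.quote(trigger_file)]


def main(argv):
    # Assumed argument order: failed node id, new master hostname.
    failed_node_id, new_master_host = int(argv[1]), argv[2]
    cmd = promote_command(failed_node_id, new_master_host)
    if cmd is not None:
        subprocess.run(cmd, check=True)
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

Keeping the decision in `promote_command` separate from the ssh call makes the "only promote when the master failed" rule easy to check without touching a server.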
After two days, I was able to get load balancing and automatic failover to work. I could not get online recovery to work. I’m pretty sure that you need a better basebackup.sh script, but I gave up. The whole experience didn’t give me the greatest confidence in the system, and I want something robust that will help me sleep at night.
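For reference, a two-server recovery script has to do roughly four things: put the master into backup mode with `pg_start_backup()`, copy its data directory to the failed server, write a recovery.conf that makes the copy a streaming standby, and end backup mode with `pg_stop_backup()`. The Python sketch below outlines those steps; treat it as an untested outline, not the basebackup.sh from the manual. It assumes pgpool runs the script on the master with three arguments (master data directory, hostname of the node to recover, that node's data directory); `PRIMARY_HOST`, `TRIGGER_FILE`, and all function names are hypothetical, and passwordless ssh/rsync between the servers is assumed.

```python
import shlex
import subprocess
import sys

# Hypothetical settings for the server this script runs on (the master).
PRIMARY_HOST = "db1"
TRIGGER_FILE = "/var/lib/postgresql/9.0/main/failover.trigger"


def recovery_conf(primary_host, trigger_file, user="postgres", port=5432):
    """recovery.conf for a PostgreSQL 9.0 streaming-replication standby."""
    return (
        "standby_mode = 'on'\n"
        f"primary_conninfo = 'host={primary_host} port={port} user={user}'\n"
        f"trigger_file = '{trigger_file}'\n"
    )


def psql(sql):
    # Runs on the master, connecting to the local server.
    return ["psql", "-d", "postgres", "-c", sql]


def basebackup_plan(master_datadir, target_host, target_datadir,
                    primary_host=PRIMARY_HOST, trigger_file=TRIGGER_FILE):
    """Return the recovery steps in order, as (argv, stdin_text) pairs."""
    src = master_datadir.rstrip("/") + "/"
    dest_dir = target_datadir.rstrip("/")
    return [
        # 1. Put the master into backup mode.
        (psql("SELECT pg_start_backup('pgpool-recovery')"), None),
        # 2. Copy the data directory, skipping files that belong only to
        #    the running master process.
        (["rsync", "-a", "--delete",
          "--exclude", "postmaster.pid",
          "--exclude", "postmaster.opts",
          src, f"{target_host}:{dest_dir}/"], None),
        # 3. Turn the copy into a standby that streams from the master.
        (["ssh", "-T", target_host, "cat > " + shlex.quote(dest_dir + "/recovery.conf")],
         recovery_conf(primary_host, trigger_file)),
        # 4. Take the master out of backup mode.
        (psql("SELECT pg_stop_backup()"), None),
    ]


def run_step(step):
    argv, stdin_text = step
    subprocess.run(argv, input=stdin_text, text=True, check=True)


def run(plan, execute=run_step):
    start, *middle, stop = plan
    execute(start)
    try:
        for step in middle:
            execute(step)
    finally:
        # Leave backup mode even if copying failed, so the master is not
        # stuck in backup mode.
        execute(stop)


if __name__ == "__main__":
    # Assumed arguments: master data dir, target host, target data dir.
    run(basebackup_plan(sys.argv[1], sys.argv[2], sys.argv[3]))
```

The `try`/`finally` in `run` is the main design point: a failed copy should still end backup mode on the master.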
Next up: some alternatives…