Saturday, 30 April 2011

LXC on Debian Squeeze

We've been using Linux Containers for a couple of weeks now to run our continuous integration environment and a bunch of other servers. Setting up containers is pretty easy, but I found I was doing the same post-configuration tasks again and again, so I wrapped them up in a script:
mkdir /var/lib/lxc/jenkins01
lxc-debian -p /var/lib/lxc/jenkins01
lxc-post-config jenkins01 192.168.0.30

Now the container starts when the host starts up, gets a static IP and shuts down cleanly (init 0) when it receives a signal from the host. You need to configure your host to send a PWR signal to each container from your /etc/init.d/lxc and wait for them to shut down. Just insert the following into the stop section:
for i in `pidof lxc-start`; do kill -PWR $i; done; sleep 30
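For context, here's a rough sketch of how that might sit inside the stop section of /etc/init.d/lxc. The surrounding structure is illustrative only; the init script on your system will look a bit different:

# illustrative stop section for /etc/init.d/lxc (the structure here is an assumption, adapt to your script)
stop() {
    # send SIGPWR to every running container so its init runs a clean shutdown
    for i in `pidof lxc-start`; do
        kill -PWR $i
    done
    # give the containers a chance to halt before the host carries on shutting down
    sleep 30
}

Thirty seconds is an arbitrary grace period; tune it to however long your containers actually take to stop.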
At the moment the script assumes that you sit on a 192.168.0.0/24 network, but I'll get round to changing that. I'm also assuming that you're running Debian Squeeze or a distribution based on it.

You can grab the script from https://github.com/shin-nien/lxc-tools/raw/master/lxc-post-config

Sunday, 24 April 2011

Moving our Grails app to Amazon Beanstalk

Last week we started looking at what it would take to move our Grails app to Amazon's Beanstalk. There were a couple of reasons to do this, with cost being a big one. Another was that we don't have a dedicated sysadmin, so the thought of 'just upload your war and we'll do the rest' is very appealing.

I'm writing this as Amazon EC2 is suffering one of its largest outages in years. The whole story's not out yet, but Amazon's eastern US region suffered outages across several availability zones at the same time. It's a bit like taking multiple datacentres in (supposedly) physically separate locations and wiring them up together so that a failure in one affects another. Stuff like that isn't supposed to happen.

So, ignoring what I've just said for a moment, Beanstalk could still be an option. A simple deployment of an application means it's spread across multiple datacentres, with data replication across datacentres too. That's not an easy thing to do yourself for the money Amazon are asking. I guess we're hoping that Amazon learn from their mistakes.

In a Datacentre Far Far Away...

So a week later, what have we learnt? Well, for starters, Beanstalk only works out of the US-East region, which means you should think twice if most of your customers are in Europe and you run a high-traffic site. It also means that uploading WARs (a vanilla Grails war weighs in at 22MB) can take a while, though with some clever '--nojars' hackery you could work around this. And if you care about performance then your database needs to stay regionally close to the app, which means loading data into it from outside the region can be slow at times.
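For reference, the --nojars trick is just the standard Grails WAR build with the dependency jars left out; you then have to make those libraries available on the server yourself (e.g. on Tomcat's shared classpath), which is the hacky part:

# build a much smaller WAR by leaving the dependency jars out of WEB-INF/lib
grails war --nojars

Check the Grails docs for your version before relying on this.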

Databases

If your app uses a persistent relational database AND it's MySQL then you're in luck. Amazon provide a service called RDS which gives you a MySQL instance (that you can connect to with the ordinary mysql client) and monitoring tools. RDS is a managed service, so your MySQL instance can be taken offline (during a user-specified maintenance window) to be upgraded. I don't know of any applications that like being disconnected like that, so you're pretty much forced to use a multi-availability-zone RDS deployment: RDS creates a standby replica in another availability zone (another datacentre, in theory), so you get no downtime but you pay a bit more for an instance that sits idle.
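To illustrate the 'ordinary mysql client' point, connecting to an RDS instance is no different from connecting to any other MySQL server. The endpoint, user and database below are made-up placeholders:

# connect to the RDS endpoint with the stock mysql client (hostname and credentials are placeholders)
mysql -h mydb.abc123xyz.us-east-1.rds.amazonaws.com -P 3306 -u appuser -p appdb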

Autoscaling

Autoscaling isn't one of the reasons we looked at moving, but it does make you re-architect your application. If Beanstalk detects that an instance of your application is unhealthy (e.g. not responding to pings) then that instance will be deleted, recreated and redeployed to. Instances don't have any meaningful identifiers, so you shouldn't make assumptions about how many instances you have or what they're called.

So if you used to run Quartz jobs on one particular server, picked by something like its hostname, that's not going to work any more. You could consider using Quartz's clustering support instead.
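As a sketch of what the clustered route involves (this is generic Quartz configuration rather than our exact setup): every node points at a shared JDBC job store, and with clustering switched on Quartz makes sure only one node fires each job.

# quartz.properties - illustrative clustered configuration, assumes a shared JDBC job store
org.quartz.scheduler.instanceId = AUTO
org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.StdJDBCDelegate
org.quartz.jobStore.isClustered = true
org.quartz.jobStore.clusterCheckinInterval = 20000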

Regular Deployments

Perhaps the biggest surprise was the lack of a maintenance page while you're deploying a new version of your app. Beanstalk simply decompresses your WAR and waits for Tomcat to reload the application; in the meantime your visitors are sitting there timing out or watching their browsers spin. If you've got a data migration to run, this can take a few minutes. To me this says Amazon either doesn't understand how people deploy applications or doesn't consider Beanstalk useful for production websites (yet?).

I'll post more details in another post but we got around this problem by customising the image that Beanstalk uses and then telling Beanstalk to rebuild our environment using the new image.
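I won't spoil the follow-up post, but the rough shape of it, assuming you're using the EC2 API tools (the instance ID, image name and description below are placeholders), is to snapshot a customised instance into a new AMI and then point the environment's custom AMI setting at it so Beanstalk rebuilds onto that image:

# turn a customised, running instance into a reusable AMI (IDs and names are placeholders)
ec2-create-image i-12345678 -n beanstalk-tomcat-custom -d "Beanstalk image with our tweaks"
# then plug the resulting ami-xxxxxxxx into the environment's Custom AMI ID setting
# (via the Beanstalk console) and let Beanstalk replace the running instances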

Customising an image, however, does not help you with maintenance outages that require environmental changes (e.g. changing ssh keys or upgrading to a larger EC2 instance). In these cases, Beanstalk will tear down all your app servers and reprovision them, leaving your site as just a blank page. There are a number of workarounds involving creating instances to serve a maintenance page and then either adding them to the load balancer or making a global DNS change. But when you have to start doing things like that, surely you're defeating the point of Beanstalk?
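For completeness, the load-balancer flavour of that workaround looks roughly like this, assuming the ELB command line tools are installed; the load balancer name and instance ID are made up:

# swap a tiny static 'down for maintenance' instance into the environment's load balancer
elb-register-instances-with-lb awseb-myapp-lb --instances i-0abc1234
# ...do the maintenance / let Beanstalk rebuild...
elb-deregister-instances-from-lb awseb-myapp-lb --instances i-0abc1234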

Conclusion

Well, we've not quite made our minds up yet. Beanstalk is limited in what it can do, but if you're willing to accept its limitations then you'll have very little maintenance to deal with. If, on the other hand, you're looking to do anything a bit more complicated, like minimising downtime using A/B legs or failing over across regions, then you'll quickly lose the benefits of Beanstalk because you'll be spending all your time implementing workarounds.