AWS Outage and Redistogo

I hate to be that guy saying “don’t use cloud services”, because it’s rather ironic given that I spent the last ten years of my life creating and promoting them at MessageLabs, which provided outsourced email anti-spam and anti-virus in the cloud. In fact we practically invented cloud services, being one of the very first “SaaS” companies in existence. But you really have to be careful who you choose to outsource things to.

My recent (Canada Day weekend) fun has been with Amazon Web Services (EC2, etc.) and, more importantly, Redistogo, who provide an AWS-hosted Redis database with zero administration. It turns out that zero administration also means you’re fucked when things go wrong.

Most people who are active in technology know that AWS had another major outage this week due to storms in the US north east. They lost power, and so some services went offline. This included our one remaining Heroku-hosted service (remaining because I am well aware that Heroku runs on AWS and AWS is way overpriced, so I’m migrating everything off there).

The app itself came back quickly once power to AWS was restored, but it was broken, because it couldn’t connect to its Redis database. We had run out of space on Heroku’s free Redis database, so using Heroku’s simple-to-use systems we had migrated up to “Redistogo”, which offered more storage space… But that service never came back…

I waited a while. It didn’t come back. I looked on their support forums… a few posts about things not being restored yet… I tweeted about it, and got a reply that they were “working with Amazon to try and restore things”. But honestly, I still got nothing.

I tried a restart on Heroku, but was greeted with a 404 page saying “The page you were looking for doesn’t exist”. Wow, that is confidence boosting.

Thank fuck we hadn’t launched yet.

The one saving grace was that we had a backup accessible from the Heroku page. I downloaded it and tried to get it running on our one external server. Ubuntu sadly installed an older version of Redis, which wouldn’t read the file, but with some tweaking I got a newer Redis running there, it read the file, and I got things running again.
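For anyone hitting the same “older Redis won’t read the file” problem, checking the dump’s format version first can save some guessing. This is only a minimal sketch, assuming the backup is a standard Redis RDB dump, which starts with the ASCII magic `REDIS` followed by a four-digit format version. The function name `rdb_version` and the file name `dump.rdb` are made up for illustration; they are not from the setup described above.

```python
import io


def rdb_version(stream):
    """Return the RDB format version from the first 9 bytes of a dump.

    Assumes a standard RDB header: b"REDIS" followed by four ASCII digits.
    """
    header = stream.read(9)
    if len(header) != 9 or header[:5] != b"REDIS":
        raise ValueError("not a Redis RDB dump")
    digits = header[5:9].decode("ascii", errors="replace")
    if not digits.isdigit():
        raise ValueError("unreadable RDB version field")
    return int(digits)


if __name__ == "__main__":
    # In-memory stand-in for a real backup; with a downloaded file you
    # would use something like: open("dump.rdb", "rb")  (hypothetical name)
    fake_dump = io.BytesIO(b"REDIS0006" + b"\xff")
    print("RDB format version:", rdb_version(fake_dump))
```

If the number it reports is higher than what the installed Redis understands, the server will refuse to load the file, and a newer Redis build is needed before the restore will work.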

So what’s the story behind this? Do I dislike cloud services? Mostly, sadly. Most of them are shockingly, appallingly run. Cloud services are all about Operations. And if your operations are weak, then your service is weak. I knew that at MessageLabs (after my first couple of years anyway) and worked extremely hard with our operations people to make sure that things ran smoothly and we had 24/7 support. But it’s very hard to know if a random cloud service will recover from a major outage (do you know if yours will?). Sadly it is clear that Redistogo is not such a service. Maybe they will learn from this, but today I cannot recommend them. I have LOTS of advice to give them though if they want it, borne out of years of experience with this stuff.

