This week we hit a bit of a StumbleUpon bonanza, landed on the Reddit front page, and got featured on Lifehacker. These three things threw quite a bit of traffic our way, and we got through it all with around 3 hours of degraded performance, and patches of intermittent bad gateway errors. Not too shabby.
With alarmingly serendipitous timing, two weeks ago I did a presentation at the Melbourne Ruby group on Scaling Rails. I posted the talk and my explanatory notes on my personal tumblr for two reasons. The first is that it includes a bit of my personal, subjective, and somewhat unsubstantiated ideas on developer types and ops, and I prefer to keep those kinds of musings off the team blog.
The second was that the "minimum viable stack" I'd promoted, while I thought it pretty sound (and bet Goodfilms' tech direction on it), was untested. This week we gave it a solid thrashing, so it's time to promote it as "official Goodfilms policy".
The TL;DR is that it worked. Mostly. And I'm now even more comfortable recommending the stack as a good starting point.
The “Minimum Viable Stack” for your just-past “Minimum Viable Product”
In my presentation I suggested that the "golden stack" for a Rails app just out of its MVP stage is roughly:
- Deployed to a cloud provider like Amazon or Rackspace
- Uses MySQL or Postgres as the datastore, deployed to a single instance, with frequent data backups to cloud storage
- Has two “app instances”, which host both your web processes and delayed job workers. Each instance should be capable of holding all your regular traffic
- Load balances web requests between the two app servers (with health checks enabled) using whichever magic load balancer your cloud provider gives you
- Performance monitored using either Scout or NewRelic, or both if you like
There’s more detail over on my original post, but the key inputs into that design were:
- Low chance of “off the air” downtime, despite cloud servers being notoriously “ephemeral”
- Lowest operational cost, except where it grossly impacts the above point
- Favouring simplicity, except where it grossly impacts the above two points
- Favouring traditional SQL databases as it’s the storage paradigm Rails “grew up with”, so has the best tooling and knowledge in the community
- Favouring SQL again because if it becomes the bottleneck you can vertically scale it just long enough to get yourself out of trouble, and it’s easy to hire experts to help
- Avoiding vendor lock in where possible
The Goodfilms Stack
Goodfilms’ setup is almost exactly what I listed above. We run Rails 3.2, use Postgres as our datastore, host on Rackspace cloud using both their load balancers and cloud storage/CDN on top of regular cloud servers, and monitor the whole setup using NewRelic.
The only departure from the stack listed above is that we have a third server set up as a general utility box. It has two main jobs: importing catalog data from Netflix, iTunes, and The Movie DB, and running our elaborate taste comparison engine, codenamed “Project Ingen”.
So, that’s the background. Let’s talk about what happened, what worked well for us with the stack, and what didn’t.
I’m just going to focus on the Reddit front page part of the story, because that’s where the interesting things happened.
For a long time, we’ve thought that there are a lot of people who want to pay for film content, and that one of the things Goodfilms can do is help those people find the best things to watch that are available legally. To test that assumption, we did a couple of MVP (minimum viable product) pages for both iTunes and Netflix.
If you look at the iTunes page, you can see how minimal our minimum is, as we’ve not yet upgraded that page. Seeing that the Netflix page was pulling in enough organic search, we tasked our talented new designer Charlie with taking the Netflix page to the next level, to see how far we could take it.
Once Charlie was done with it, it looked like we had a winner. Realising that a lot of people might want to use the page, I took a full day to double check all the performance of the queries in the page and fix up what I could.
NewRelic had shown us that the MVP page was only just scraping through performance-wise, which is fine for a proof of concept, but not fine for a page you want to put under load. Once everything was looking OK in the front and back end, I gave the thumbs up to Glen to “throw as many people at it as he can find”.
There are quite a few people on Reddit, and Glen found them, and got a good section of them to come and use it.
What might not be immediately obvious is that we’re based in Australia. The bulk of the Redditing happened in the middle of the night, and so Glen rang me up, pulled me out of bed, and we babysat the servers.
A few hours later we both went back to bed, confident that the site was going to stay up, and very pleased with the influx of new users. In those hours, we added a couple more app servers, and set up full page caching for signed out users.
What went well?
The first thing that went well was taking the time to look at the performance of our new features, doing some work, but not overdoing it.
Defining “good enough” performance for as yet unproven features is tricky, especially when taking into account the opportunity cost of the dev time that can go into other features.
In this case, we got the balance right. I spent about a day getting performance to OK, and it was good enough to degrade gracefully under load, rather than explode. That gave us the breathing room to do the rest of our work in the middle of the night, without having wasted too much time beforehand when we were unsure of its success.
The other thing that worked really well was a direct result of three stack decisions that interrelated: host on the cloud, split your app servers early, and monitor performance using NewRelic.
When traffic started climbing, we quickly switched our NewRelic subscription up to the pro level. We can’t afford to run it all the time, but it’s easy to turn it up when you do need it.
Under heavy load, there are generally only two things I want to look at in NewRelic: is any page ridiculously broken from a performance perspective, and do we have enough capacity? If performance is broken somewhere, fix it. If you don’t have capacity, add it.
This time, all the pages were OK (because of the small upfront investment of work), but we did have a capacity problem. This did not take us long to figure out, because this is what NewRelic is good at. First design decision validated.
To add capacity, we just took a snapshot image of one of our app servers, and then spun up new servers directly from the image. This was pretty easy, but if we weren’t on commodity cloud hosting, it couldn’t have happened and we would have been boned. Second design decision validated.
Cloning the servers “just worked” because keeping the database away from the app servers made sure there wasn’t any hidden coupling there. Moving to two app servers early made sure we weren’t accidentally relying on shared state. It’s the old programming truism about there only being three numbers in computing: zero, one, and many. Once you have “many” servers, adding new ones is no stress.
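The shared-state trap is easy to demonstrate in miniature. The objects below are stand-ins for separate app server processes, not anything from the Goodfilms codebase — a sketch of why in-process state silently diverges once a load balancer is splitting traffic:

```ruby
# Sketch: why in-process state breaks behind a load balancer.
# Two objects stand in for two separate app server processes,
# each keeping its own view of a counter that "should" be global.

class AppServer
  def initialize
    @hits = 0            # in-process state: invisible to other servers
  end

  def handle_request
    @hits += 1
  end

  attr_reader :hits
end

server_a = AppServer.new
server_b = AppServer.new

# A round-robin load balancer splits 10 requests between the two.
10.times { |i| (i.even? ? server_a : server_b).handle_request }

# Neither server has seen the true total of 10 requests.
puts server_a.hits  # 5
puts server_b.hits  # 5
```

Anything like this — per-process caches, sessions on local disk, files written to one box — is exactly what stops you cloning a server image and calling it done, which is why moving to two app servers early flushes these bugs out while the stakes are low.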
The final thing that worked well was following the idea of stack simplicity. I had worked a long day polishing the Netflix page code, then had my final game of indoor soccer for the season (with a win), and then went to the pub to celebrate with my team.
The only things that made working with the stack tricky that night were personal, not technical: I was tired, and not 100% sober. If simplicity wasn’t a core tenet of the stack, we could have kissed that uptime goodbye.
What went poorly?
In the presentation to the Ruby group, I stressed that you really needed to understand what the workload for your app is, and find the best bang for buck scaling strategy and use that.
Until this week, I’d been treating Goodfilms purely as a social network, and picked all our strategies to match that. The realisation that we’ve had as a team lately is that Goodfilms really is two sites living side by side: a community site/social network for films, and a rich content site for browsing for films.
With content based sites, full HTML page caches are your most cost effective technique for dealing with load. We didn’t have any page caching in place at all.
Glen’s flatmate Ben, de facto ops guy for theconversation.edu.au and an expert in content-based sites, chipped in and got us our first cut of page caching in place. This got load under control, but meant that signed in users weren’t getting the proper experience for that page.
The next step was to set things up so we could do signed in vs. signed out caches on demand. This was a little tricky half asleep, but we got there in the end. If you’re interested in how it works, you can check out the specific nginx config and capistrano task I wrote here and it might help you out of a jam.
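For flavour, the general shape of that signed-in vs. signed-out split at the nginx layer looks something like the sketch below. This is not our exact config (that’s in the linked gist) — it’s a minimal illustration of the standard `proxy_cache_bypass` pattern, and it assumes your Rails session cookie name contains `_session` and that your app listens locally on port 8080:

```nginx
# http context: a cache on disk, and a flag derived from the cookie header
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=pagecache:10m inactive=10m;

map $http_cookie $signed_in {
    default       0;
    "~*_session"  1;   # session cookie present => treat as signed in
}

server {
    listen 80;

    location / {
        proxy_cache        pagecache;
        proxy_cache_bypass $signed_in;   # signed-in users skip the cache...
        proxy_no_cache     $signed_in;   # ...and their responses aren't stored
        proxy_cache_valid  200 5m;       # anonymous pages live for 5 minutes
        proxy_pass         http://127.0.0.1:8080;  # your unicorn/app upstream
    }
}
```

The nice property of doing this in nginx rather than in Rails is that cached requests never touch your app servers at all, which is the whole point when the front page of Reddit shows up.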
The “frequent database backups” policy caused us a few headaches. Doing a full dump of a growing database while it’s under load isn’t what I would call “ideal”. That said, I’d rather have a periodic patch of bad gateway errors than risk data loss, so I’d still do it exactly the same way again.
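For concreteness, the backup step amounts to little more than a scheduled `pg_dump` whose output gets shipped to cloud storage. The sketch below (hypothetical names and paths, not our actual task) just builds the dump command; pg_dump’s custom format compresses the output and lets you restore individual tables later, but the dump still adds real I/O load while it runs, which is the trade-off described above:

```ruby
# Sketch (hypothetical names/paths): build the pg_dump command a cron
# job or rake task would run before uploading the file to cloud storage.

def backup_command(database, dir: "/var/backups", now: Time.now)
  stamp = now.strftime("%Y%m%d%H%M")  # timestamped file per dump
  "pg_dump --format=custom --file=#{dir}/#{database}-#{stamp}.dump #{database}"
end

puts backup_command("goodfilms_production", now: Time.utc(2012, 7, 1, 3, 0))
# => pg_dump --format=custom --file=/var/backups/goodfilms_production-201207010300.dump goodfilms_production
```

The upload half depends entirely on your provider’s tooling, so it’s left out here; the point is just that “frequent backups” can stay this simple right up until the dump itself becomes the load problem.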
If I come up with a better balance of low cost/good uptime/good performance I’ll refine my suggestion for the datastore, but in the meantime I’m sticking with it, flagging it as an explicit trade-off you’ll be making if you follow our stack suggestions.
I can’t say we’re 100% web scale. Staying up half the night with your servers means you’re not there yet. I think this is OK for where we’re at as a business. We’re still feeling our way through to the final feature set of the product, and learning what our market is like. Too much engineering now would be premature.
I strongly believe that there are no “right” answers in scaling a web app, but after giving this stack a solid thrashing, I feel very comfortable putting it forward as a good starting point. It’s a solid base from which you can respond to growth, and evolve it to match your situation well.
Make sure you put the machinery in place for page caching early, even if you’re not using it. The code is easy to write when you’re relaxed in the middle of the day, and a royal pain in the ass in the middle of the night.
Updated: If you found this useful please discuss or upvote over on Hacker News
Goodfilms is a way to share the movies you watch with your friends. We rate movies on two criteria - ‘quality’ and ‘rewatchability’, so you can admit to your guilty pleasures and properly capture the feeling you get when a film leaves you exhausted. Sign up now and keep track of the films you love, and find great, challenging or silly new ones to watch.