You can get away with following oral (blog-post) scaling lore at that low end, but before you scale much further the most important thing by far is to become more metrics- and measurement-oriented. The most basic numbers to start with are average HTTP response time and HTTP requests per second.
For instance, say your Slicehost slice returned a page in an average of 200ms, but only below 10 req/second; over that it started to bottleneck somewhere and responses would shoot up to 500+ms. Then your question becomes "how do I handle 100 HTTP requests per second while keeping the average at or below 200ms?"
Eventually you'll need to step up to full-browser-render time (Gomez, Keynote, BrowserMob, homegrown Selenium), Apdex or 95th-percentile numbers instead of simple averages, and backend components like MySQL req/s and service time.
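To make that concrete, here's a minimal sketch (plain Python, nothing stack-specific) of turning logged per-request response times into those numbers; the 500ms Apdex threshold is only an example:

    # Minimal sketch: summarize logged per-request response times (in ms).
    # The 500ms Apdex threshold is only an example -- pick your own target.
    def summarize(times_ms, apdex_threshold_ms=500):
        times = sorted(times_ms)
        n = len(times)
        avg = sum(times) / float(n)
        p95 = times[int(0.95 * (n - 1))]  # nearest-rank 95th percentile
        satisfied = sum(1 for t in times if t <= apdex_threshold_ms)
        tolerating = sum(1 for t in times
                         if apdex_threshold_ms < t <= 4 * apdex_threshold_ms)
        apdex = (satisfied + tolerating / 2.0) / n
        return avg, p95, apdex

    # req/s over a window is just len(times_in_window) / window_seconds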
Having been there a few times, there are a few suggestions I'd like to put out there. The points here should take very little time (< 1 week combined), and they're meant to be easy ways of getting more mileage out of what you currently have.
#1. Move all of the non-DB-specific crons (those that don't specifically involve log rotation or cleanup of tmp directories) off your DB servers and onto one of your web heads. There is nothing more painful than seeing your master DB crash because of a cron job or other non-essential role on your DB server. It's much easier to recover from / spin up a new web head if it crashes than to restart MySQL after your master DB crashes (a slave DB is easier, but still more of a pain point than a web head). If at all possible, since you're already on EC2, spin up a worker server just to run these cron jobs (stats collection, DB cleanup scripts, etc.). EC2 affords you this kind of flexibility. Take it.
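As a rough sketch (every script name/path below is made up), the worker box's crontab ends up looking something like this, while the DB boxes keep only their log rotation / tmp cleanup:

    # Illustrative crontab for the dedicated worker instance -- every script
    # name/path here is hypothetical; the point is just that none of this
    # runs on a DB box anymore.
    */5 * * * *  /usr/local/bin/collect_stats.py
    0 3 * * *    /usr/local/bin/db_cleanup.py
    0 4 * * *    /usr/local/bin/rebuild_search_index.py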
#2. Hooking into a CDN is very easy to do and lets your web servers offload the serving of static files. As an easy way of doing cache-busting, set up a global version variable in your code that gets injected into each static URL that's generated (so a request for cdn.mygengo.com/images/29384/bar.jpg would be rewritten to look for /images/bar.jpg on your web heads if and only if the CDN's cache had expired or the unique # in the request had changed).
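In code that can be as small as one helper; a rough sketch (the CDN hostname is a placeholder, and ASSET_VERSION would be bumped on each deploy):

    # Sketch of version-stamped static URLs. CDN_HOST is a placeholder and
    # ASSET_VERSION gets bumped on each deploy to bust the CDN's cache.
    ASSET_VERSION = "29384"
    CDN_HOST = "cdn.example.com"

    def static_url(path):
        # "/images/bar.jpg" -> "http://cdn.example.com/images/29384/bar.jpg"
        # On a cache miss the CDN (or a rewrite rule at the origin) drops the
        # version segment and pulls plain "/images/bar.jpg" from a web head.
        prefix, name = path.rsplit("/", 1)
        return "http://%s%s/%s/%s" % (CDN_HOST, prefix, ASSET_VERSION, name)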
#3. Set up some sort of performance monitoring if you haven't already. Almost every startup's #1 pain point is the DB server, specifically queries. It's great that you've been able to see such a vast improvement by adding indexes / cleaning up some queries, but as long as your data is growing you will need to keep on top of your query logs, since their impact footprint will change as the size of your data does. Since you're on AWS, enabling CloudWatch should be very easy and provides the basics. You can then move on to implementing Nagios/Munin, or look to something like CloudWatch/ServerDensity for an in-the-cloud solution.
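For example, once CloudWatch is enabled, pulling basic numbers out of it with the boto library is only a few lines (the region, instance id and metric below are placeholders):

    # Rough sketch: fetch an hour of basic CloudWatch data with boto.
    # The region, instance id and metric choice are placeholders.
    from datetime import datetime, timedelta
    import boto.ec2.cloudwatch

    cw = boto.ec2.cloudwatch.connect_to_region("us-east-1")
    points = cw.get_metric_statistics(
        period=300,
        start_time=datetime.utcnow() - timedelta(hours=1),
        end_time=datetime.utcnow(),
        metric_name="CPUUtilization",
        namespace="AWS/EC2",
        statistics=["Average"],
        dimensions={"InstanceId": "i-00000000"})
    for p in points:
        print(p["Timestamp"], p["Average"])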
You also didn't mention anything about having a K/V system like Memcache or Redis (yes @antirez, we all know Redis rocks and how dare I place it after Memcache ;) which can really take the load off your DB servers.
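The usual cache-aside pattern is only a handful of lines; in this sketch the key scheme, the TTL and fetch_user_from_db() are all made up for illustration:

    # Minimal cache-aside sketch with python-memcached. The key scheme, TTL
    # and fetch_user_from_db() are made up for illustration.
    import memcache

    mc = memcache.Client(["127.0.0.1:11211"])

    def get_user(user_id):
        key = "user:%d" % user_id
        user = mc.get(key)
        if user is None:
            user = fetch_user_from_db(user_id)  # the expensive query being offloaded
            mc.set(key, user, time=600)         # cache for 10 minutes
        return user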
Thanks for the writeup. Hope to have these problems myself someday soon!
Fwiw, one pain point for me has been EC2. I went with EC2 from the start, so that I could learn the system now and scale up whenever I needed to.
Unfortunately, it seems that the performance of EC2 instances is terrible. Or at least, that's what it looks like for me. After a few weeks of trying to get performance boosts by tweaking code, I decided to try a small Slicehost slice, which performed much better at running the exact same system.
My friend (who recommended Slicehost) did the same test on Large EC2 instances as well, and reported that they also appeared oversubscribed.
I haven't really had to scale a system up like this before. As a freelance web developer, moving to a bigger slice on Linode has been good enough for any WordPress blog or simple site.
That said, I didn't realize what a massive undertaking it could be, like this post describes.
Anyone had any similar scaling challenges with less conventional web stacks? In particular, I'm wondering how a Node.js/MongoDB stack would handle this same thing.
I have scaled many systems in the past and the easy answer is: Know what you're doing or hire someone who does.
There is no blanket answer because applications and workloads differ. E.g. MongoDB has a number of documented issues under high load (esp. high write load); in some scenarios they can be worked around with reasonable trade-offs, in others they can't and Mongo needs to be replaced.
If there's one general thing to say about scaling, it's that you scale applications, not stacks. The stack will usually change in the process.
> If there's one general thing to say about scaling, it's that you scale applications, not stacks.
very true! which is why i think it usually is a good idea to analyze how you could split up your application into more or less discrete subsystems/"services". that way it becomes easier to react to bottlenecks in the subsystems instead of having to operate on the whole system.
ideally you design your system like that from the beginning, but usually you grow and find out the hard way... step by step.
still there are a couple of things that are "always (about) the same": for example, session management or logging/sitestats usually put quite a bit more write load (INS/UPD/DEL) on a system than other parts do.
so as your product grows, the write stress these subsystems/functionalities put on the system on the one hand, and the read load from other subsystems on the other, make it hard to optimize your db/storage systems, because they have quite contrary I/O needs.
if you have separate subsystems that run as discrete services with their own API it becomes much easier to put out a fire that breaks out due to growth or load spikes.
so to add to the comment by moe:
"you scale applications" - and you get into a better position to do so if you design your applications so that they are composed of separate subsystems that communicate via apis.
| In some cases we managed to speed up integral queries by over 8000%! That's both an indication of the growing pains
That is an indication of sloppy DB planning and bad SQL use. You should test with a data-set size and type that resembles a live data-set in the event that you become successful. Yikes.
While I share your 'hype' skepticism, that gain could easily come from adding a missing index. By itself the above statement really doesn't say much, either about the amazing improvement or about the quality of the current schema in place.
Depends on the goals. They didn't say how much their "10x traffic" actually is, but by the sound of it they could probably save money by moving to dedicated hardware.
The point is, this is engineering, use numbers.