Building Scalable Websites
Posted by Mort Greenberg on May 10, 2008
Article Source: http://poorbuthappy.com/
Peter Van Dijck’s weblog (a very cool blog with lots of good info on building large, scalable sites; worth a read, or bookmarking for when you have more time). This article goes back to April 2007 but is still worth the read…
By Peter Van Dijck
I always love to read scaling discussions, especially about popular web apps, and there are loads of them out there. Here’s my overview of the best. By the way, the best book on scaling apps I’ve ever read is Building Scalable Websites, by Cal Henderson (the Flickr guy).
It’s dog-eared on my desk, and taught me about sharding (which I used extensively for mefeedia). Sharding is when you cut a really big table into pieces so you can put those on separate servers. It means you have to make changes to your code, and your database isn’t so database-y anymore, but it works. For example, online games use sharding to grow their virtual worlds, because there’s no way they could serve all that information from one db cluster.
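To make that concrete, here’s a minimal sketch of modulo-based sharding in PHP. The hostnames, credentials and table are invented for illustration; real setups often use a lookup table mapping users to shards instead, so users can be moved when you add servers:

```php
<?php
// Minimal sharding sketch: route each user's data to one of N database
// servers based on their user id. Hostnames and credentials are invented.
$shardHosts = array('db1.example.com', 'db2.example.com',
                    'db3.example.com', 'db4.example.com');

function getShardConnection($userId, $shardHosts) {
    // Simple modulo hashing: user 7 with 4 shards lands on shard 3.
    $shard = $userId % count($shardHosts);
    return mysqli_connect($shardHosts[$shard], 'app', 'secret', 'app_db');
}

$db = getShardConnection(7, $shardHosts);
// Every query for this user now goes to their shard. Cross-shard joins
// are no longer possible: that's the "not so database-y" trade-off.
$result = mysqli_query($db, 'SELECT * FROM feeds WHERE user_id = 7');
```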
Scaling Twitter with Ruby. Twitter is hot today, and they ran into some serious scaling problems, although the app itself is quite simple: it consists of messages of at most 140 characters. The lessons are the same as for most apps: memcache like crazy, and optimize the database (the biggest bottleneck most of the time). Also, Ruby on Rails scales pretty much the same way as PHP and other similar languages: shared nothing architecture. Shared nothing means that there is no one thing shared by all servers, since that would become a bottleneck. PHP, for example, has a shared nothing architecture out of the box, except perhaps for sessions, but that’s easily solved by storing sessions in a db (which then has its own scaling approach) and not in the filesystem. Here’s a talk by Rasmus Lerdorf that explains scaling with PHP5. (Here’s the mp3 audio recorded by Niall Kennedy.)
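To illustrate that last point, here’s a hedged sketch of moving sessions into MySQL with PHP’s session_set_save_handler(). The `sessions` table schema (id, data, updated_at) and connection details are assumptions, and error handling is omitted:

```php
<?php
// Sketch: store PHP sessions in MySQL instead of the local filesystem,
// so any web server can handle any request (shared nothing).
$db = mysqli_connect('db.example.com', 'app', 'secret', 'app_db');

function sess_open($path, $name) { return true; }
function sess_close() { return true; }

function sess_read($id) {
    global $db;
    $id = mysqli_real_escape_string($db, $id);
    $res = mysqli_query($db, "SELECT data FROM sessions WHERE id = '$id'");
    $row = mysqli_fetch_assoc($res);
    return $row ? $row['data'] : '';
}

function sess_write($id, $data) {
    global $db;
    $id   = mysqli_real_escape_string($db, $id);
    $data = mysqli_real_escape_string($db, $data);
    mysqli_query($db, "REPLACE INTO sessions (id, data, updated_at)
                       VALUES ('$id', '$data', NOW())");
    return true;
}

function sess_destroy($id) {
    global $db;
    $id = mysqli_real_escape_string($db, $id);
    mysqli_query($db, "DELETE FROM sessions WHERE id = '$id'");
    return true;
}

function sess_gc($maxlifetime) {
    global $db;
    mysqli_query($db, "DELETE FROM sessions
                       WHERE updated_at < NOW() - INTERVAL $maxlifetime SECOND");
    return true;
}

// Register the handlers before the session starts.
session_set_save_handler('sess_open', 'sess_close', 'sess_read',
                         'sess_write', 'sess_destroy', 'sess_gc');
session_start();
```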
Blaine Cook made this presentation:
View the presentation: http://www.slideshare.net/Blaine/scaling-twitter
Scaling Flickr. Cal Henderson wrote the above book, and also has a good presentation: the Scaling Flickr slides as PDFs. One of the problems you get into when scaling something like Flickr, where you store LOTS of stuff, is that you can’t just store it all on a hard drive anymore: it’s not big enough. Apart from just using Amazon’s S3 service (which rocks – I used it for mefeedia and I know lots of startups who use it), there are other solutions. A good presentation on that by Cal is this one:
View the presentation: http://www.slideshare.net/techdude/beyond-the-file-system-designing-largescale-file-storage-and-serving
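If you go the S3 route, the API is just REST over HTTP with signed requests. Here’s a rough sketch of an authenticated PUT in PHP; the bucket, keys and file are placeholders, and error handling is omitted:

```php
<?php
// Sketch: upload a file to Amazon S3 with a signed REST PUT request.
// Bucket name, access keys, and file path are placeholders.
$accessKey = 'YOUR_ACCESS_KEY';
$secretKey = 'YOUR_SECRET_KEY';
$bucket    = 'my-bucket';
$object    = 'thumbs/12345.jpg';
$file      = '/tmp/12345.jpg';
$type      = 'image/jpeg';

$date = gmdate('D, d M Y H:i:s T');

// S3 authenticates requests with an HMAC-SHA1 over a canonical string:
// verb, content-md5 (empty here), content-type, date, and the resource.
$stringToSign = "PUT\n\n$type\n$date\n/$bucket/$object";
$signature = base64_encode(hash_hmac('sha1', $stringToSign, $secretKey, true));

$ch = curl_init("https://$bucket.s3.amazonaws.com/$object");
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'PUT');
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    "Date: $date",
    "Content-Type: $type",
    "Authorization: AWS $accessKey:$signature",
));
curl_setopt($ch, CURLOPT_POSTFIELDS, file_get_contents($file));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);
curl_close($ch);
```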
Cal (he’s a busy dude) also made this presentation about scaling web apps generally:
View the presentation: http://www.slideshare.net/techdude/scalable-web-architectures-common-patterns-and-approaches
John Allspaw (Flickr plumber) also has a good presentation about scaling Flickr:
View the presentation: http://www.slideshare.net/akshat/1scaling-phpmysqlpresentation-from-flickr
Scaling LiveJournal. LiveJournal was one of the first social networks, before that word meant anything, and they partly invented how to scale standard php/mysql/apache apps. They developed memcached, which is now used by almost everyone who wants to scale their site. Brad Fitzpatrick has a good set of slides on how they evolved the service; here’s a PDF version. And here’s the slideshow embedded:
View the presentation: http://www.slideshare.net/vishnu/livejournals-backend-a-history-of-scaling
Kevin Rose mentioned this was “the bible for scaling Digg” – and I think quite a few other web apps are based on this.
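Since memcached comes up in nearly every one of these talks, here’s what the basic pattern looks like in practice: a minimal cache-aside sketch in PHP using the pecl Memcache extension. The key name, query and connection details are made up:

```php
<?php
// Cache-aside sketch with memcached: check the cache first, fall back
// to the database on a miss, then populate the cache for next time.
$memcache = new Memcache();
$memcache->connect('localhost', 11211);
$db = mysqli_connect('db.example.com', 'app', 'secret', 'app_db');

function getUserProfile($userId, $memcache, $db) {
    $key = "user_profile_$userId";
    $profile = $memcache->get($key);
    if ($profile !== false) {
        return $profile; // cache hit: the database is never touched
    }
    // Cache miss: do the expensive query once...
    $res = mysqli_query($db, "SELECT * FROM users WHERE id = " . (int)$userId);
    $profile = mysqli_fetch_assoc($res);
    // ...and cache it for 5 minutes (the 0 is the compression flag).
    $memcache->set($key, $profile, 0, 300);
    return $profile;
}
```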
Six Apart. The LiveJournal guys, with all their scaling expertise, were acquired by Six Apart, and they soon launched Vox. And of course, here’s a presentation on making Vox scalable:
View the presentation: http://www.slideshare.net/miyagawa/how-we-build-vox
Bloglines. Bloglines’ scaling problems were slightly different from your average web app’s, since they are an aggregator of feeds. That means they have billions of blogposts they have to keep and serve to users, and that creates its own scaling problems. The Bloglines approach was, instead of using a database, to just store all that stuff in a special filesystem. Today it’d be easier to do this, since there are a few filesystems that do that, or you could just go with S3 again. Mark Fletcher (who also sold Onelist to Yahoo, which is now Yahoo Groups) has given a few talks on scaling Onelist and Bloglines: here’s the mp3 audio version, and here’s the PDF of that talk. And a text transcript.
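A common trick in that kind of filesystem storage is to hash each item’s id into a shallow directory tree, so no single directory fills up with millions of files. A sketch, with invented paths:

```php
<?php
// Sketch: store millions of posts as flat files, spread over a hashed
// directory tree so no directory holds too many entries.
function postPath($guid) {
    $hash = md5($guid);
    // e.g. hash "a3f2..." becomes /data/posts/a3/f2/a3f2....txt,
    // giving 256 x 256 = 65,536 buckets to spread files across.
    return '/data/posts/' . substr($hash, 0, 2) . '/'
         . substr($hash, 2, 2) . '/' . $hash . '.txt';
}

function storePost($guid, $content) {
    $path = postPath($guid);
    if (!is_dir(dirname($path))) {
        mkdir(dirname($path), 0755, true); // create both hash levels
    }
    file_put_contents($path, $content);
}

storePost('http://example.com/feed#item-1', 'post body here');
```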
Scaling Last.fm. Last.fm is one of the aggregation-type apps: they gather a lot of data about what music you listen to. Similarly to Bloglines, that causes its own scaling problems:
View the presentation: http://www.slideshare.net/coolstuff/lessons-from-building-worlds-largest-social-music-platform
All the slides in this post are hosted by Slideshare, an incredible service by my fellow information architect Rashmi Sinha and team. When I found out about the project, I emailed her: “brilliant and so obvious once you think of it”. Like many startups, they use S3 to serve their content, and they have the obligatory yet interesting slides to explain how:
View the presentation: http://www.slideshare.net/jboutelle/scalable-web-architectures-w-ruby-and-amazon-s3
I haven’t linked to all the good thinking about scaling out there, or to the more technical resources. But these presentations should get you going in the world of memcached, perlbal, shared nothing and federation 🙂 Enjoy!
PS: See also How I Unexpectedly Found Myself Doing Consulting For Startups (this is a post on my “professional” site. I haven’t been able to figure out when to post here or there, any tips on that?).
Finally, Dan Pritchett has a good presentation on scaling eBay (PDF). 26 billion SQL queries per day! 300+ new features per quarter! 4 architecture versions since 1998, and some pretty crazy scaling of the search.
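To put that first number in perspective: 26 billion queries spread over the 86,400 seconds in a day averages out to roughly 300,000 SQL queries per second, before you even account for traffic peaks.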
New: a presentation on how Facebook uses the PHP APC cache (PDF).
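APC gives each web server a local, in-process cache, which is even cheaper than a memcached round trip for data that rarely changes. A minimal sketch; the key, function and file path are invented:

```php
<?php
// Sketch: cache an expensive, rarely-changing value in APC, a shared
// memory cache local to each web server (no network hop at all).
function getSiteConfig() {
    $config = apc_fetch('site_config');
    if ($config === false) {
        // Miss: rebuild and cache for 60 seconds. Each web server keeps
        // its own copy, so this suits data that can be slightly stale.
        $config = parse_ini_file('/etc/app/site.ini'); // invented path
        apc_store('site_config', $config, 60);
    }
    return $config;
}
```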
A talk on YouTube scalability: “In the summer of 2006, they grew from 30 million pages per day to 100 million pages per day, in a 4 month period. Thumbnails turn out to be surprisingly hard to serve efficiently.” (I ran into this with mefeedia too; luckily Amazon S3 came to the rescue by then.) YouTube uses Python, Apache, MySQL and Memcached.
New: front end scaling is important too, and often ignored. Here’s a good presentation from the Yahoo guys: