Me: I live in Silicon Valley with my wife, child and cat. I have worked at Microsoft since I graduated from College, both in the Macintosh Business Unit on products such as Outlook Express, Entourage, IE, and Virtual PC and in Windows Live on Hotmail, Calendar and People. I am currently a Principal Lead Program Manager on the Windows Live Social Networking team. I basically manage a team of Program Managers responsible for delivering features to support our web and client applications. I've been blogging since 2001 and like to play around with .NET in my spare time working on projects such as dasBlog (the blog that powers this site) and Send to SmugMug (an application for uploading photos to SmugMug). I blog about a number of technology and productivity related topics.
Disclaimer The posts on this weblog are provided "AS IS" with no warranties, and confer no rights. The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.
© Copyright 2010, Omar Shahine
Most folks who have never worked on services probably think that services are powered by a bunch of boxes sitting in a data center. While that might be true, it's often not apparent how much thought has to go into dependencies when designing for services.
This post should be titled "Assume your dependency will fail, so design for that reality". A lot of services talk about the mythical 5 nines (99.999% uptime), but I don't think that's possible with large Internet-scale services. Many services try to achieve 3 nines (99.9% uptime). Here is a handy table which tells you how much downtime you can expect with different reliability percentages.
So, if you want your service to have 3 nines, you can afford roughly 9 hours of downtime per year (8.76 hours, to be exact). Downtime here means that the service is on its ass and no one can use it.
Notice that I qualified that statement. What does downtime actually mean?
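The arithmetic behind those numbers is easy to check. Here is a minimal sketch that converts an availability percentage into a yearly downtime budget; the function name and the choice of a 365-day year are assumptions made for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return how many hours per year a service may be down at this availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for label, pct in [("2 nines", 99.0), ("3 nines", 99.9),
                       ("4 nines", 99.99), ("5 nines", 99.999)]:
        hours = downtime_hours_per_year(pct)
        print(f"{label} ({pct}%): {hours:.2f} hours/year, or {hours * 60:.1f} minutes")
```

At 3 nines the budget comes out to 8.76 hours a year, and at 5 nines it shrinks to a little over 5 minutes.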
Well, first step is to take a look at your application. Can you classify your dependencies? Dependencies for Hotmail might look like this:
This is of course a simplistic list. There could be dozens of dependencies. Let's look at a few.
Login
If Passport is down, we are on our ass. Why? Well no one can get credentials, so you can't get your mail. If Passport is down, so are we. Dependency mitigation? Pray.
Address Book
In Windows Live, the Address Book is a shared service that Hotmail, Messenger and Spaces all use. This is a critical part of our infrastructure, like Passport, so it poses an interesting challenge.
If the Address Book goes down, what happens? Well, you can still log in, get your inbox, read your mail, reply to your mail and so on. However, you can't compose mail to people in your address book, you can't edit or view any contacts, and maybe a few other things don't work.
Well, how does your code handle this?
I could probably come up with more questions, but you get the idea.
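One way code can handle a dead Address Book is to degrade the page instead of failing it. The following is only a sketch of that idea, not how Hotmail does it: `DependencyError`, `load_compose_page` and the `fetch_contacts` callable are hypothetical names invented for the example.

```python
class DependencyError(Exception):
    """Hypothetical error raised when a downstream service fails or times out."""


def load_compose_page(user_id, fetch_contacts):
    """Build the compose page, degrading gracefully if contacts are unavailable.

    fetch_contacts is a hypothetical callable that talks to the address book
    and raises DependencyError when that service is down or too slow.
    """
    page = {"user": user_id, "can_send": True}
    try:
        page["contacts"] = fetch_contacts(user_id)
        page["autocomplete"] = True
    except DependencyError:
        # The address book is down: the user can still type addresses by hand,
        # so only the contact-dependent parts of the page are switched off.
        page["contacts"] = []
        page["autocomplete"] = False
        page["notice"] = "Your contacts are temporarily unavailable."
    return page
```

The point of the design is that mail itself keeps working: the failure costs the user auto-complete, not the whole compose experience.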
In our world, we do not have "planned downtime" or "planned maintenance". Our service is designed to run 24 x 7 w/o any hardware being taken out of service for upgrades or whatnot. That means we have to handle every kind of failure we can... this includes networking gear dying (do you have hot spares?), hard drives failing, machines melting, power going out, fragmented heap space (memory allocation issues), other services impacting us, edge caching failing and so on.
The Food Chain
It's useful to know where you are in the food chain. In Windows Live, after Passport, Hotmail is a big dog. Meaning, other smaller services often come along and say things like "just call us on login" but they have no idea what they are asking for. In most cases, this is a guaranteed way to tip their server over on day one. Not many services are built to our scale, and it still amazes me how naive some people are about this. Adding some code to our login path is simply unacceptable if it degrades performance.
For some services you need to develop a hot cache of the data. Something like Address Book for example. In Hotmail we need the address book to do auto-complete, to see if a message is safe or unsafe (based on who is in your address book) and so on. It would be very expensive to build out an Address Book service that could handle all the real time requests of our service. So, we cache data to optimize the experience we can deliver to the user.
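A hot cache along these lines can be sketched in a few lines. This is an illustrative in-memory version under stated assumptions, not the actual Hotmail design: the `HotCache` class, its `loader` callable and the five-minute default lifetime are all made up for the example. It answers from memory while the data is fresh, and falls back to stale data if the backing service cannot be reached.

```python
import time


class HotCache:
    """Keep recently fetched data in memory so most reads never hit the dependency."""

    def __init__(self, loader, ttl_seconds=300.0, clock=time.monotonic):
        self._loader = loader      # hypothetical callable that queries the real service
        self._ttl = ttl_seconds    # how long an entry counts as fresh
        self._clock = clock        # injectable so the cache can be tested without sleeping
        self._entries = {}         # key -> (value, time it was fetched)

    def get(self, key):
        entry = self._entries.get(key)
        now = self._clock()
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]        # fresh hit: no call to the backing service
        try:
            value = self._loader(key)
        except Exception:
            if entry is not None:
                return entry[0]    # service is down: stale data beats no data
            raise                  # nothing cached, so the failure has to surface
        self._entries[key] = (value, now)
        return value
```

Serving stale entries during an outage is a judgment call. It suits data like contacts that change slowly, and would be a poor fit for anything that must be current.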
Architecture
This whole issue raises the question: what's your architecture? Why not just put everything on one box and have lots of those boxes? You can't do that at scale... why?
Different boxes should serve different purposes. This segments single points of failure, and each application also has different hardware needs. For example, boxes that store credentials should be in a secure cage to prevent tampering.
Stateless frontend machines should be separate from stateful backend machines. Why? Frontend machines that are stateless can be taken out of service, can be overbuilt for capacity, can be cheaper machines with different memory and so on. Backend storage machines, which do represent a single point of failure, need to be running 24 x 7, and the system should ensure that there is some form of redundancy so that users can always get their data.
Multiple service consumption may require that boxes have ACLs open to different machines in different places using specific ports, protocols or access patterns. This requires some amount of segregation.
There is a fine line though between creating a million specialized services, and just the right number to keep your team and operations team sane.
Planning for new dependencies
I've talked a lot about big dependencies, but what happens when a new one comes along? Here is a typical scenario.
Brand New Feature X in Hotmail has a dependency on Team B to deliver. Team B says they will deliver on 6/1/2007. Ok, well our next release is shipping on 6/2/2007 (or some date in the future close to Team B's release date). What do you do about it?
Build Feature X assuming:
I hope you picked #2. You see, there are two problems with #1. Team B can slip, which will then force you to slip. You want to be predictable and in control of your destiny, right? Well, assume Team B won't deliver on time. No offense to Team B, it's just business after all.
The other problem is that Team B could very well ship on time, but then their service will fall on its ass the following day because something they didn't anticipate happened.
When mitigating this situation you need to answer 1) how critical is this dependency to our application functionality and 2) where in the food chain are they? If the feature is something small, and they are low in the food chain, add a config setting that you can use to enable or disable the feature. If the feature is core to your experience, then regardless of where they are in the food chain you need to mitigate this correctly.
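The enable/disable config can be as simple as a flag that is checked before any code touches Team B. Here is a minimal sketch; the `FEATURE_FLAGS` dictionary, the `feature_x` key and the `call_team_b` callable are hypothetical stand-ins, and a real service would read the flag from whatever configuration system it already has.

```python
# Hypothetical config: the feature ships dark and is turned on once Team B is proven.
FEATURE_FLAGS = {"feature_x": False}


def is_enabled(name, flags=FEATURE_FLAGS):
    """Unknown or missing flags count as disabled, so a bad config fails safe."""
    return bool(flags.get(name, False))


def render_inbox(messages, flags=FEATURE_FLAGS, call_team_b=None):
    """Render the inbox, calling Team B's service only when Feature X is on."""
    page = {"messages": messages}
    if is_enabled("feature_x", flags) and call_team_b is not None:
        page["feature_x"] = call_team_b(messages)
    return page
```

With the flag off, the release ships on its own schedule and Team B's code path is never exercised. Flipping it on (or back off) is a config change, not a new release.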
You have now solved the problem of how to ship your feature without caring about Team B's ability to deliver. But what happens after Team B delivers and their service goes down, or gets very slow because it wasn't built to scale? You don't want to give away time from your 9 hours, right? Well, you handle that failure scenario as I mentioned above: protect the user from the other service, and give your ops team or monitoring tools visibility into the service as the failure starts to happen. That way preventative measures can be taken, like closing the doors on communication to that service, or you can monitor the situation to see if you can deal with the latency or number of failures while escalating to the other team.
You need to have real-time monitoring in place to ensure that if something goes wrong, appropriate action can be taken to prevent your service from doing something bad, or to prevent your servers from falling over.
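"Closing the doors" on a failing dependency is often automated with what is commonly called a circuit breaker. The sketch below is one simple way to do it, not a description of our tooling: the class names, the threshold of five consecutive failures and the 30-second cooldown are all assumptions picked for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when calls to the dependency are currently suspended."""


class CircuitBreaker:
    """Stop calling a dependency after repeated failures, then retry after a cooldown."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self.consecutive_failures = 0   # a number monitoring can watch
        self.opened_at = None           # None means the doors are open for business

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self._clock() - self.opened_at < self.cooldown_seconds:
                # Fail fast instead of tying up our own servers on a sick service.
                raise CircuitOpenError("dependency calls are suspended")
            # Cooldown has elapsed: let this call through as a trial.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = self._clock()   # (re)start the cooldown
            raise
        # Success: the dependency looks healthy again, so reset everything.
        self.consecutive_failures = 0
        self.opened_at = None
        return result
```

Callers catch `CircuitOpenError` the same way they would catch any other failure of the dependency, and the failure counter is the kind of signal worth exposing to the ops team before the breaker trips.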
Remember, 9 hours in a year is not a lot of time.