shahine.com/omar/

yet another Microsoft blogger

# Wednesday, June 13, 2007

Designing for Services Dependencies

Most folks who have never worked on services probably think that services are powered by a bunch of boxes sitting in a data center. While that might be true, it's often not apparent how dependencies should be handled when designing a service.

This post should be titled "Assume your dependency will fail, so design for that reality". A lot of services talk about the mythical 5 nines (99.999% uptime), but I don't think that's possible with large Internet-scale services. Many services try to achieve 3 nines (99.9% uptime). Here is a handy table that tells you how much downtime you can expect at different reliability percentages.

Reliability    Downtime / year
99.999%        5 min
99.99%         53 min
99.9%          9 hours
99.8%          18 hours
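
The figures in the table follow from simple arithmetic: take the fraction of the year you are allowed to be down and multiply it by the minutes in a year. Here is a minimal sketch in Python; the function names are mine, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Yearly downtime budget, in minutes, for a given uptime percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def format_downtime(minutes: float) -> str:
    """Round to whole minutes for small budgets, whole hours for larger ones."""
    if minutes < 120:
        return f"{minutes:.0f} min"
    return f"{minutes / 60:.0f} hours"


for pct in (99.999, 99.99, 99.9, 99.8):
    budget = downtime_minutes_per_year(pct)
    print(f"{pct}%  {format_downtime(budget)}")
```

The rounded results match the table: 3 nines leaves you roughly 526 minutes, which is a little under 9 hours.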

So, if you want your service to have 3 nines, you can afford 9 hours of downtime per year. Downtime here means that the service is on its ass and no one can use it.

Notice that I qualified that statement. What does downtime actually mean?

Well, the first step is to take a look at your application. Can you classify your dependencies? Dependencies for Hotmail might look like this:

  1. Login
  2. Address Book
  3. Mail Store
  4. External script / static content dependencies
  5. ...

This is of course a simplistic list. There could be dozens of dependencies. Let's look at a few.

Login

If Passport is down, we are on our ass. Why? Well no one can get credentials, so you can't get your mail. If Passport is down, so are we. Dependency mitigation? Pray.

Address Book

In Windows Live, the Address Book is a shared service that Hotmail, Messenger and Spaces all use. This is a critical part of our infrastructure, like Passport, so it poses an interesting challenge.

If the Address Book goes down, what happens? Well, you can still log in, get your inbox, read your mail, reply to your mail and so on. However, you can't compose mail to people in your address book, you can't edit or view any contacts, and maybe a few other things don't work.

Well, how does your code handle this?

  1. Do you fall on your ass?
  2. Do you throw exceptions?
  3. Do you swallow the errors and give the user some basic experience?
  4. Do you send thousands of requests to the service that is down, creating a bottleneck on the network, consuming TCP/IP connections and making the problem worse? (If you have thousands of servers all trying to talk to a service that is offline, that's bad for that service when it tries to get back online.)
  5. Do you queue requests?
  6. Does your operations team have the ability to block any connections to the service that is down?
  7. Do you even have visibility into this or do you require customers to call or email you to tell you that the address book is broken?
  8. Is this a synchronous request? If so is it a blocking call and how long before you timeout?

I could probably come up with more questions, but you get the idea.
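
One common way to address several of these questions at once is a circuit breaker: after a run of failures you stop calling the broken service for a while and hand the user a degraded experience instead. The sketch below is only an illustration, not how Hotmail does it. `CircuitBreaker`, `call_with_fallback` and their parameters are names I made up, and it assumes the `operation` you pass in enforces its own timeout.

```python
import time


class CircuitBreaker:
    """Stops calling a dependency after repeated failures (hypothetical sketch)."""

    def __init__(self, failure_threshold=3, reset_after_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at = None  # None means the breaker is closed (calls allowed)

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Once the cool-down has passed, let a trial request through.
        return self.clock() - self.opened_at >= self.reset_after_seconds

    def record_success(self):
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # Open (or re-open) the breaker and restart the cool-down.
            self.opened_at = self.clock()


def call_with_fallback(breaker, operation, fallback):
    """Run operation unless the breaker is open; degrade to fallback on failure."""
    if not breaker.allow_request():
        return fallback()  # do not pile more requests onto a service that is down
    try:
        result = operation()
    except Exception:
        breaker.record_failure()
        return fallback()
    breaker.record_success()
    return result
```

A caller might pass a function that fetches contacts as `operation` and `lambda: []` as `fallback`, so auto-complete simply comes back empty while the Address Book is offline. The same breaker object also gives your operations team one place to force the door closed.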

In our world, we do not have "planned downtime" or "planned maintenance". Our service is designed to run 24 x 7 w/o any hardware being taken out of service for upgrades or whatnot. That means we have to handle every kind of failure we can... this includes networking gear dying (do you have hot spares?), hard drives failing, machines melting, power going out, fragmented heap space (memory allocation issues), other services impacting us, edge caching failing and so on.

The Food Chain

It's useful to know where you are in the food chain. In Windows Live, after Passport, Hotmail is a big dog. Meaning, other smaller services often come along and say things like "just call us on login" but they have no idea what they are asking for. In most cases, this is a guaranteed way to tip their server over on day one. Not many services are built to our scale, and it still amazes me how naive some people are about this. Adding some code to our login path is simply unacceptable if it degrades performance.

For some services you need to develop a hot cache of the data. Something like Address Book for example. In Hotmail we need the address book to do auto-complete, to see if a message is safe or unsafe (based on who is in your address book) and so on. It would be very expensive to build out an Address Book service that could handle all the real time requests of our service. So, we cache data to optimize the experience we can deliver to the user.
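
To make the caching idea concrete, here is a rough sketch of a cache that serves fresh entries without touching the backing service and falls back to a stale entry when that service fails. It is an assumption-laden illustration: `HotCache`, `fetch` and `ttl_seconds` are hypothetical names, and a real cache at this scale would also need eviction, size limits and invalidation.

```python
import time


class HotCache:
    """In-memory cache that prefers stale data over no data (hypothetical sketch)."""

    def __init__(self, fetch, ttl_seconds=300.0, clock=time.monotonic):
        self.fetch = fetch            # callable that asks the real service
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.entries = {}             # key -> (value, time it was fetched)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is not None and self.clock() - entry[1] < self.ttl_seconds:
            return entry[0]           # fresh hit: the service is not called
        try:
            value = self.fetch(key)
        except Exception:
            if entry is not None:
                return entry[0]       # service is down: serve the stale copy
            raise                     # nothing cached, so the caller must cope
        self.entries[key] = (value, self.clock())
        return value
```

The design choice here is that an out-of-date address book is usually better for the user than no address book at all.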

Architecture

This whole issue raises the question: what's your architecture? Why not just put everything on one box and have lots of those boxes? You can't do that at scale... why?

Different boxes should serve different purposes. This segments single points of failure, and each application also has different hardware needs. For example, boxes that store credentials should be in a secure cage to prevent tampering.

Stateless frontend machines should be separate from stateful backend machines. Why? Frontend machines that are stateless can be taken out of service, can be overbuilt for capacity, can be cheaper machines with different memory and so on. Backend storage machines, which do represent a single point of failure, need to be running 24 x 7, and the system should ensure that there is some form of redundancy so that users can always get their data.

Multiple service consumption may require that boxes have ACLs open to different machines in different places using specific ports, protocols or access patterns. This requires some amount of segregation.

There is a fine line though between creating a million specialized services, and just the right number to keep your team and operations team sane.

Planning for new dependencies

I've talked a lot about big dependencies, but what happens when a new one comes along? Here is a typical scenario.

Brand New Feature X in Hotmail has a dependency on Team B to deliver. Team B says they will deliver on 6/1/2007. Ok, well our next release is shipping on 6/2/2007 (or some date in the future close to Team B's release date). What do you do about it?

Build Feature X assuming:

  1. Team B's stuff is there and working when they say it will.
  2. Team B can't say with 100% certainty that they will hit their dates, so you put a mechanism in place that lets you ship your code with the feature disabled.

I hope you picked #2. You see, there are two problems with #1. The first is that Team B can slip, which will then force you to slip. You want to be predictable and in control of your destiny, right? Well, assume Team B won't deliver on time. No offense to Team B, it's just business after all :-).

The other problem is that Team B could very well ship on time, but then their service will fall on its ass the following day because something they didn't anticipate happened.

When mitigating this situation you need to answer 1) how critical is this dependency to our application functionality and 2) where in the food chain are they? If the feature is something small, and they are low in the food chain, add a config setting that you can use to enable/disable the feature. If the feature is core to your experience, then regardless of where they are in the food chain you need to mitigate this correctly.
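
A config switch like that can be very small. The following is a minimal sketch under my own assumptions: `FeatureFlags`, `render_inbox` and the `"feature_x"` key are hypothetical, and in practice the flag values would come from whatever configuration system your operations team can change without a redeploy.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlags:
    """Feature switches; anything not listed is treated as off."""
    enabled: dict = field(default_factory=dict)

    def is_enabled(self, name: str) -> bool:
        return self.enabled.get(name, False)


def render_inbox(flags: FeatureFlags, fetch_feature_x):
    """Build the page, adding Feature X only if it is switched on and working."""
    page = ["inbox"]
    if flags.is_enabled("feature_x"):
        try:
            page.append(fetch_feature_x())  # call into Team B's service
        except Exception:
            pass  # Team B is having a bad day: ship the page without Feature X
    return page


# Ship with the feature off, then flip it on once Team B has delivered.
print(render_inbox(FeatureFlags(), lambda: "feature x"))
print(render_inbox(FeatureFlags({"feature_x": True}), lambda: "feature x"))
```

Defaulting unknown flags to off means the code can ship before Team B does, and nothing changes for users until someone deliberately turns the feature on.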

You have now solved the problem of how to ship your feature without caring about Team B's ability to deliver. What happens after Team B delivers and Team B's service goes down or gets very, very slow because they aren't built to scale? You don't want to give away time from your 9 hours, right? Well, you handle that failure scenario as I mentioned above: protect the user from the other service, and give your ops team or monitoring tools visibility into the service as the failure starts to happen. That way preventative measures can be taken, like closing the doors on communication to that service, or you can monitor the situation to see if you can deal with the latency or number of failures while escalating to the other team.

You need to have real-time monitoring in place so that if something goes wrong, appropriate action can be taken to prevent your service from doing something bad, or to prevent your servers from falling over.
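
As one small illustration of what such monitoring might track, the sketch below keeps a rolling window of recent call outcomes for a single dependency and reports when the failure rate crosses a threshold. Everything here is an assumption for the sake of example: `FailureRateMonitor`, its parameters and the thresholds are hypothetical, and a real system would also watch latency and feed an alerting pipeline.

```python
from collections import deque


class FailureRateMonitor:
    """Rolling failure rate over the last N calls to one dependency (sketch)."""

    def __init__(self, window_size=100, alert_threshold=0.5, min_samples=10):
        self.outcomes = deque(maxlen=window_size)  # True means the call failed
        self.alert_threshold = alert_threshold
        self.min_samples = min_samples

    def record(self, failed: bool) -> None:
        self.outcomes.append(failed)  # the oldest outcome drops off when full

    def failure_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self) -> bool:
        # Require a minimum number of samples so one early failure does not page anyone.
        if len(self.outcomes) < self.min_samples:
            return False
        return self.failure_rate() >= self.alert_threshold
```

You would call `record` after every request to the dependency and check `should_alert` to decide when to escalate or close the doors on that service.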

Remember, 9 hours in a year is not a lot of time.

 

Thursday, June 14, 2007 12:15:23 AM (Pacific Daylight Time, UTC-07:00)
Just call me "Team B"