shahine.com/omar/


yet another Microsoft blogger
Page 1 of 3 in the Programming category Next Page

 Sunday, August 05, 2007

Performance Tips from Yahoo

Yahoo has an excellent and short primer on 13 performance tips for web developers.

My only quibble is with Tip #13: they recommend disabling ETags because, if more than one box serves your graphics requests, the ETags won't match from machine to machine.

Well, we use ETags just fine with Hotmail because IIS allows you to synchronize ETags across boxes.
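To make the machine-to-machine problem concrete, here is a minimal sketch of one general way to get matching ETags: derive the tag from the response bytes only, so every box computes the same value for the same file. This is not how IIS synchronizes ETags; it is a swapped-in technique for illustration, and the function names `content_etag` and `is_not_modified` are made up for this example. It also ignores weak validators and comma-separated `If-None-Match` lists.

```python
import hashlib


def content_etag(body: bytes) -> str:
    """Strong ETag derived only from the response bytes (hypothetical helper)."""
    # No inode, timestamp or per-server counter is involved, so two servers
    # holding identical bytes produce an identical tag.
    return '"' + hashlib.sha256(body).hexdigest()[:32] + '"'


def is_not_modified(if_none_match: str, body: bytes) -> bool:
    """True when the client's cached copy still matches, so a 304 can be sent."""
    return if_none_match.strip() == content_etag(body)
```

Any box in the farm can then answer a conditional request for the same image with a 304, regardless of which box served it the first time.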

Either way, this is good stuff.

Posted Sunday, August 05, 2007

 

 Wednesday, June 13, 2007

Designing for Services Dependencies

Most folks who have never worked on services probably think that services are powered by a bunch of boxes sitting in a data center. While that might be true, it's often not apparent just how dependencies should be treated when designing for services.

This post should be titled "Assume your dependency will fail, so design for that reality". A lot of services talk about the mythical 5 nines (99.999% uptime). I don't think that's possible with large Internet-scale services. Many services try to achieve 3 nines (99.9% uptime). Here is a handy table which tells you how much downtime you can expect with different reliability percentages.

Reliability    Downtime / year
99.999%        5 min
99.99%         53 min
99.9%          9 hours
99.8%          18 hours

So, if you want your service to have 3 nines then you can afford 9 hours of downtime per year. That means that the service is on its ass and no one can use it.

Notice that I qualified that statement. What does downtime actually mean?

Well, the first step is to take a look at your application. Can you classify your dependencies? Dependencies for Hotmail might look like this:
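The numbers in the table are just arithmetic on the minutes in a year. A small sketch, assuming a 365-day year (the function name is made up for this example):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of downtime per year allowed at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.999, 99.99, 99.9, 99.8):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% -> {minutes:.1f} min ({minutes / 60:.2f} hours)")
```

For 3 nines this works out to about 525.6 minutes, or 8.76 hours, which the table rounds to 9.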

  1. Login
  2. Address Book
  3. Mail Store
  4. External script / static content dependencies
  5. ...

This is of course a simplistic list. There could be dozens of dependencies. Let's look at a few.

Login

If Passport is down, we are on our ass. Why? Well no one can get credentials, so you can't get your mail. If Passport is down, so are we. Dependency mitigation? Pray.

Address Book

In Windows Live, the Address Book is a shared service that Hotmail, Messenger and Spaces all use. This is a critical part of our infrastructure, like Passport, so it poses an interesting challenge.

If the Address Book goes down, what happens? Well, you can still log in, get your inbox, read your mail, reply to your mail and so on. However, you can't compose mail to people in your address book, you can't edit or view any contacts, and maybe a few other things don't work.

Well, how does your code handle this?

  1. Do you fall on your ass?
  2. Do you throw exceptions?
  3. Do you swallow the errors and give the user some basic experience?
  4. Do you send thousands of requests to the service that is down, creating a bottleneck on the network, consuming TCP/IP connections and making the problem worse? (If you have thousands of servers all trying to talk to a service that is offline, that's bad for them when they try to get back online.)
  5. Do you queue requests?
  6. Does your operations team have the ability to block any connections to the service that is down?
  7. Do you even have visibility into this or do you require customers to call or email you to tell you that the address book is broken?
  8. Is this a synchronous request? If so, is it a blocking call, and how long before you time out?

I could probably come up with more questions, but you get the idea.
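Several of those questions (hammering a dead service, swallowing errors, giving the user a basic experience) point at what is commonly called a circuit breaker. Here is a minimal sketch of that pattern, not a description of what Hotmail actually runs. The class, its parameters and `fetch_contacts` are all hypothetical names for this example.

```python
import time


class CircuitBreaker:
    """Stops calling a failing dependency for a cool-down period."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at = None             # None means calls are allowed

    def call(self, func, fallback):
        # While open, skip the dependency entirely until the cool-down expires.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            # Cool-down is over: let one trial call through. The failure count
            # is still at the limit, so one more failure reopens the circuit.
            self.opened_at = None
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()  # degrade instead of surfacing the error
        self.failures = 0
        return result


def fetch_contacts():
    # Hypothetical stand-in for a call to an address book that is down.
    raise ConnectionError("address book is down")


breaker = CircuitBreaker(max_failures=2, reset_after=30.0)
for _ in range(5):
    contacts = breaker.call(fetch_contacts, fallback=lambda: [])
```

In the loop above only the first two iterations actually reach the dependency. The remaining three return the empty fallback immediately, which is what keeps thousands of frontends from piling onto a service that is trying to come back.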

In our world, we do not have "planned downtime" or "planned maintenance". Our service is designed to run 24 x 7 without any hardware being taken out of service for upgrades or whatnot. That means we have to handle every kind of failure we can. This includes networking gear dying (do you have hot spares?), hard drives failing, machines melting, power going out, fragmented heap space (memory allocation issues), other services impacting us, edge caching failing and so on.

The Food Chain

It's useful to know where you are in the food chain. In Windows Live, after Passport, Hotmail is a big dog. Meaning, other smaller services often come along and say things like "just call us on login" but they have no idea what they are asking for. In most cases, this is a guaranteed way to tip their server over on day one. Not many services are built to our scale, and it still amazes me how naive some people are about this. Adding some code to our login path is simply unacceptable if it degrades performance.

For some services you need to develop a hot cache of the data. Something like Address Book for example. In Hotmail we need the address book to do auto-complete, to see if a message is safe or unsafe (based on who is in your address book) and so on. It would be very expensive to build out an Address Book service that could handle all the real time requests of our service. So, we cache data to optimize the experience we can deliver to the user.
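A hot cache like this can be sketched as "serve from memory while fresh, refresh from the service when stale, and fall back to the stale copy if the service is down". The class below is an illustration under those assumptions, not the actual Hotmail design; `HotCache`, `loader` and `ttl_seconds` are hypothetical names.

```python
import time


class HotCache:
    """Keeps the last good value per key and serves it while it is fresh,
    or when the backing service cannot be reached."""

    def __init__(self, loader, ttl_seconds=300.0, clock=time.monotonic):
        self.loader = loader      # function that fetches a key from the service
        self.ttl = ttl_seconds
        self.clock = clock        # injectable so it can be tested
        self.entries = {}         # key -> (value, time stored)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is not None and self.clock() - entry[1] < self.ttl:
            return entry[0]  # fresh hit: no call to the service at all
        try:
            value = self.loader(key)
        except Exception:
            if entry is not None:
                return entry[0]  # stale, but better than nothing
            raise  # nothing cached yet, so the caller must handle the failure
        self.entries[key] = (value, self.clock())
        return value
```

The trade-off is staleness: auto-complete may miss a contact added a few minutes ago, but most requests never touch the Address Book service.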

Architecture

This whole issue begs the question of what your architecture is. Why not just put everything on one box and have lots of those boxes? Can't do that at scale... why?

Different boxes should serve different purposes. This segments single points of failure, and each application also has different hardware needs. For example, boxes that store credentials should be in a secure cage to prevent tampering.

Stateless frontend machines should be separate from stateful backend machines. Why? Frontend machines that are stateless can be taken out of service, can be overbuilt for capacity, can be cheaper machines with different memory and so on. Backend storage machines, which do represent a single point of failure, need to be running 24 x 7, and the system should ensure that there is some form of redundancy to prevent the user from not getting their data.

Multiple service consumption may require that boxes have ACLs open to different machines in different places using specific ports, protocols or access patterns. This requires some amount of segregation.

There is a fine line though between creating a million specialized services, and just the right number to keep your team and operations team sane.

Planning for new dependencies

I've talked a lot about big dependencies, but what happens when a new one comes along? Here is a typical scenario.

Brand New Feature X in Hotmail has a dependency on Team B to deliver. Team B says they will deliver on 6/1/2007. Ok, well our next release is shipping on 6/2/2007 (or some date in the future close to Team B's release date). What do you do about it?

Build Feature X assuming:

  1. Team B's stuff is there and working when they say it will.
  2. Team B can't say with 100% certainty they will hit their dates, so you put a mechanism in place that lets you ship your code with the feature disabled.

I hope you picked #2. You see, there are two problems with #1. The first is that Team B can slip, which will then force you to slip. You want to be predictable and in control of your destiny, right? Well, assume Team B won't deliver on time. No offense to Team B, it's just business after all :-).

The other problem is that Team B could very well ship on time, but then their service will fall on its ass the following day because something they didn't anticipate happened.

When mitigating this situation you need to answer 1) how critical is this dependency to our application functionality and 2) where in the food chain are they? If the feature is something small, and they are low in the food chain, add a config setting that you can use to enable or disable the feature. If the feature is core to your experience, then regardless of where they are in the food chain you need to mitigate this correctly.

You have now solved the problem of how to ship your feature without caring about Team B's ability to deliver. What happens after Team B delivers and their service goes down, or gets very, very slow because it isn't built to scale? You don't want to give away time from your 9 hours, right? Well, you handle that failure scenario as I mentioned above: protect the user from the other service, and give your ops team or monitoring tools visibility into the service as the failure starts to happen, so that preventative measures can be taken (like closing the doors on communication to that service). Or monitor the situation to see if you can deal with the latency or number of failures while escalating to the other team.
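For the small-feature case, the config switch can be as simple as a flag that defaults to off. A sketch, assuming a JSON config with a `features` section; the file format, the `feature_x` key and both function names are made up for this example.

```python
import json


def feature_enabled(config: dict, name: str) -> bool:
    """A feature is on only when the config explicitly says so."""
    return bool(config.get("features", {}).get(name, False))


def render_inbox(config: dict, load_feature_x) -> list:
    """Builds the page sections, adding Feature X only when it is enabled."""
    sections = ["inbox"]
    if feature_enabled(config, "feature_x"):
        try:
            sections.append(load_feature_x())
        except Exception:
            # Team B's service failed: ship the page without Feature X.
            pass
    return sections


# Ship with the feature off, then flip the flag once Team B has delivered.
config = json.loads('{"features": {"feature_x": false}}')
print(render_inbox(config, lambda: "feature_x panel"))
```

Defaulting to off means a missing or broken config entry leaves you with the old, known-good experience instead of a half-working new one.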

You need to have real time monitoring in place to ensure that if something goes wrong appropriate action can be taken to prevent your service from doing something bad, or to prevent your server from falling over.
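As a rough illustration of what that monitoring could track per dependency, here is a sketch that keeps a rolling window of recent calls and flags when the failure rate or average latency crosses a threshold. The class name, window size and thresholds are all assumptions chosen for the example, not values from a real system.

```python
from collections import deque


class DependencyMonitor:
    """Tracks the most recent calls to one dependency."""

    def __init__(self, window=100, max_failure_rate=0.2, max_avg_latency=1.0):
        # Each sample is (succeeded, latency in seconds); old samples fall off.
        self.samples = deque(maxlen=window)
        self.max_failure_rate = max_failure_rate
        self.max_avg_latency = max_avg_latency

    def record(self, succeeded: bool, latency: float) -> None:
        self.samples.append((succeeded, latency))

    def failure_rate(self) -> float:
        if not self.samples:
            return 0.0
        failed = sum(1 for ok, _ in self.samples if not ok)
        return failed / len(self.samples)

    def average_latency(self) -> float:
        if not self.samples:
            return 0.0
        return sum(latency for _, latency in self.samples) / len(self.samples)

    def should_alert(self) -> bool:
        """True when either threshold is exceeded over the current window."""
        return (self.failure_rate() > self.max_failure_rate
                or self.average_latency() > self.max_avg_latency)
```

Something like `should_alert()` is the signal you would wire to the ops dashboard, or use to close the doors on the failing service automatically.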

Remember, 9 hours in a year is not a lot of time.

Posted Thursday, June 14, 2007

 

 Friday, March 16, 2007

Works on my machine

Brilliant... Ordered my t-shirt.

Posted Friday, March 16, 2007

 

 Monday, March 05, 2007

Source Control is cool

I finally FINALLY got around to setting up my own source control server. I haven't been using one for all my little projects and it's been driving me crazy.

I thought it would be difficult. With dasBlog we use the excellent TortoiseSVN project to manage our Subversion source repository so I figured I'd see if I could use that.

I asked around on the dasBlog developers mailing list and a number of folks are using SVN with its built-in server and protocol support.

Problem is, setup was a bit complicated. But to make things easier, someone has put together SVN 1-Click Setup. It is in fact a one-click installer you run on your server.

IT ROCKS.

I set up DynDns on my kick ass D-Link DGL-4100 Broadband Gaming Router and forwarded the correct port, and now I can check in and check out from anywhere.

BTW, I installed it on my Vista Media Center since it runs pretty much 24/7.

Now I can sleep better at night, and another checkbox on the @Someday/Maybe list. Wohoo!

Posted Tuesday, March 06, 2007

 

 Wednesday, October 04, 2006

Insert Code for Windows Live Writer

One of the add-ins I wrote, Insert Code for Windows Live Writer, is now posted on Windows Live Gallery. I also just found a small bug I introduced (don't resize the window). Oh, and Phil, I added that Embed StyleSheet checkbox for you :-). Special thanks to Jean-Claude Manoli for CSharpFormat.

 


Posted Thursday, October 05, 2006

 
