Thursday, November 13, 2008

X-Trace

This paper describes tagging data and tracing it through the different protocol layers to provide better diagnostic tools for distributed applications. Overall it is a very good idea: many applications are very difficult to analyze when they break down, and faster diagnosis can lead to shorter downtime.
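To make the tagging idea concrete, here is a minimal sketch of how a task identifier might be carried through several layers while each hop files a report. It is not the paper's implementation: the names `Metadata`, `propagate`, `handle_request`, and the in-memory `reports` list are all hypothetical, and the "next"/"down" edge labels are only a loose nod to propagating within a layer versus into a lower one.

```python
import itertools
from dataclasses import dataclass

# Hypothetical stand-ins: a global operation-id counter and an in-memory
# list playing the role of a report collector.
_op_ids = itertools.count(1)
reports = []


@dataclass(frozen=True)
class Metadata:
    task_id: str  # same for every operation belonging to one request
    op_id: int    # unique per operation


def start_task(task_id):
    """Create the metadata that tags the first operation of a task."""
    return Metadata(task_id, next(_op_ids))


def propagate(md, layer, edge):
    """File a report linking parent to child and return the child's metadata."""
    child = Metadata(md.task_id, next(_op_ids))
    reports.append({
        "task": md.task_id,
        "parent": md.op_id,
        "op": child.op_id,
        "layer": layer,
        "edge": edge,  # "next" = same layer, "down" = lower layer (assumed labels)
    })
    return child


def handle_request(task_id):
    """Simulate one request crossing three layers, tagging each hop."""
    md = start_task(task_id)
    http_md = propagate(md, "http", "next")
    tcp_md = propagate(http_md, "tcp", "down")
    return propagate(tcp_md, "ip", "down")


def task_edges(task_id):
    """Collect the (parent, op, layer) edges reported for one task."""
    return [(r["parent"], r["op"], r["layer"])
            for r in reports if r["task"] == task_id]
```

Calling `handle_request("req-1")` and then `task_edges("req-1")` yields a chain of parent/child edges from which the path of the request can be rebuilt after a failure, which is the kind of diagnosis the paper is after.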

I assume this is meant to be used online, so it works like insurance against potential failures. There is some cost to instrumenting code to use X-Trace, but it will likely reduce the pain when a problem happens. Is this motivation strong enough for a company to adopt it? How does it compare to the Google approach of running multiple copies to buy diagnosis time?

Also, because this is an online approach, performance overhead can be a problem. But the overhead is justified if it reduces the number of failover copies a deployment needs.

Because X-Trace is designed to be deployed across many administrative domains (ADs), a bug in X-Trace could potentially introduce correlated failures, which worries me a bit. Performance bugs are now multiplied, and a security bug would allow attackers to attack multiple ADs at once and throughout the entire application.
