Words most CIOs never want to see in the same sentence: Microsoft, Danger, Sidekick, data loss, sabotage. The story highlights some interesting aspects of how companies handle data loss, and how willing people are to speculate about why something actually happened. In this case, the ongoing issues at the Danger data center, and the way they have impacted the T-Mobile Sidekick, are something anyone relying on remote data services or cloud computing should pay careful attention to.
When designing and building complex systems, one of the most important things any system designer can do is think through recovery. How hard will it be to rebuild the system if it completely dies? How hard will it be to recover the users' data? What are the dependencies? Where are the failover systems, and how are they brought back online? How does staff react to a catastrophe, and how does the company fail over when a data disaster hits? We do not have the full story, and it is unlikely we will get the full story right now about the Microsoft/T-Mobile Sidekick data outage, but what we do know is that many customers have lost their data.
If this were your company and you lost your customers' data, you might as well be out of business. There are many lessons to be learned from this one about how companies plan for disaster.
Earlier I wrote that cloud computing does not absolve a company from good disaster preparedness, and it does not. Many systems designers are wondering where the hot/cold site was and why it wasn't up and running with at least some data set available (although, to be honest, if the original data was corrupt, a hot/cold site with concurrent syncing between the databases would have been equally corrupt). Everyone is wondering why the backups didn't work, or why the backups were not tested at least weekly when a million customers rely on that data. Many people will forgive a week's worth of data loss, but for a company working in a high-transaction environment it would be a disaster. Imagine if the NY Stock Exchange said it had lost last week's trading data.
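To make the "test your backups weekly" point concrete, here is a minimal sketch of the kind of automated restore drill that could run against each night's dump. Everything in it is hypothetical: the paths, the manifest format, and the thresholds are illustrations, not anything from Danger's actual environment. The point is simply to verify a backup before you need it, including catching the corruption that concurrent syncing would happily copy over to the hot/cold site.

```python
# A sketch of a weekly backup restore drill (hypothetical, for illustration).
# Assumes each backup ships with a manifest recording its SHA-256 checksum
# and record counts at backup time -- an invented convention, not Danger's.
import hashlib
import json
from pathlib import Path

BACKUP_DIR = Path("/var/backups/userdata")  # hypothetical location

def verify_latest_backup() -> bool:
    """Restore-test the newest backup rather than assuming it is good."""
    backups = sorted(BACKUP_DIR.glob("*.dump"))
    if not backups:
        print("FAIL: no backups found -- page the on-call engineer")
        return False
    latest = backups[-1]
    manifest = json.loads(latest.with_suffix(".manifest.json").read_text())

    # 1. Integrity: does the file still match the checksum taken at backup time?
    digest = hashlib.sha256(latest.read_bytes()).hexdigest()
    if digest != manifest["sha256"]:
        print(f"FAIL: {latest.name} is corrupt (checksum mismatch)")
        return False

    # 2. Plausibility: a million-user store should never shrink sharply overnight.
    if manifest["record_count"] < 0.9 * manifest["previous_record_count"]:
        print(f"FAIL: {latest.name} lost more than 10% of records -- investigate")
        return False

    print(f"OK: {latest.name} passed the restore drill")
    return True

if __name__ == "__main__":
    verify_latest_backup()
```

A real drill would go further and restore the dump into a scratch database, but even a check this simple, run every week, turns "we think we have backups" into "we know we do."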
One million users is a high-transaction computing environment with hundreds of dependencies. Even if a customer made only three or four changes to their data over a week, that is millions of transactions. I know how much time I spend on my smartphone, and for me it is more like 20 changes or data-store writes a week; it adds up quickly. Backups are essential in a high-transaction environment. While the Sidekick is not the NY Stock Exchange, 20 million transactions a week mean something to the people who are changing their data.
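The back-of-envelope math is easy to check (the 20 changes a week is my own usage pattern, not measured data):

```python
# Back-of-envelope transaction volume; inputs come from the post, not measurement.
users = 1_000_000
changes_per_user_per_week = 20            # my own usage, assumed roughly typical
print(f"{users * changes_per_user_per_week:,} transactions/week")  # 20,000,000
```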
What makes this more complex is that Apple Insider reported that the only way this could have happened was through sabotage. The idea of a disgruntled fired employee is not that far off the mark, and it is easy to blame someone internally when all else seems to fail. But the worry is that no one knows whether there was deliberate sabotage to the system. The announcement this morning that some of the data has been recovered means that Microsoft had to dig into its archives of backups to see what it could find and use, but it did not dispel the rumor of sabotage. Plausible or not, the speculation about sabotage has run rampant on the internet since it was posted on Apple Insider. What makes it more plausible is that neither Microsoft nor T-Mobile has denied it publicly. We have a plausible rumor, followed by corporate silence that feeds the rumor and the speculation even further as people want to know what happened to their data. If it was a disgruntled employee, that is an added complication to how the system will be recovered, and probably not one addressed in the system's design standards or as-built documentation. Few people build their systems with the idea that at some point a person will deliberately sabotage them.
Between the speculation and the truth, which will come out slowly over time, there is one additional complication that is going to make this expensive: the eventual lawsuits that come from issues like this. Class-action (or class-action-seeking) lawsuits have already been filed over the lost data, regardless of what T-Mobile has offered its customers as compensation. While this will be tied up in the courts for years, the overall cost of the current offer of $100 and a month of free data service, across roughly one million Sidekick users, is already around $140 to $160 million depending on the data service plan. Someone is going to want that money back, either through the courts via the SLA or through some other mechanism dictated by how the contract was written. This is not going to be a cheap outage in the long run, even with data being recovered.
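For what it is worth, the arithmetic behind that estimate works out if the data plan runs $40 to $60 a month, which is my inference from the post's totals, not a published figure:

```python
# Rough cost of the compensation offer; the $40-$60 plan range is an assumption
# inferred from the $140M-$160M total, not a published price.
users = 1_000_000
credit = 100                              # $100 credit per affected customer
for plan in (40, 60):                     # assumed monthly data-plan cost, in $
    print(f"plan=${plan}/mo -> ${users * (credit + plan) / 1e6:.0f}M")
# plan=$40/mo -> $140M
# plan=$60/mo -> $160M
```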
While no one will debate the wisdom of tried and tested backups and tried and tested disaster recovery, there is one additional complication: the layoff rate and the use of under-skilled workers. In disaster recovery you are only as good as your employees and your management diligently planning, testing, and fixing any errors noted in the failover test. Training, planning, testing, planning, training: all of these have to happen on a regular basis, at least quarterly, to make sure the plan works. Under-skilled workers can be trained, but only if they can get their hands dirty with the processes involved. Management can plan, but those plans will only prove viable if they can be tested. Lay off the very smart, very expensive employees who have done this before, and the reliance on untested people with untested plans, combined with the brain drain, leads to a monumental disaster like the one we have seen.
Cloud computing is a great way to reduce the overall costs of computing, but like any other computer system it relies on everyone training all the time, working hard to perfect the plan, and making sure good information security is in place. It also requires a high level of public relations, and the good kind of public relations, where the issues are more or less transparent so that speculation does not run rampant on the internet and in the blogosphere. Test everything and make sure there are no holes in the process; train people to do disaster recovery and backups all the time. It is all simple in retrospect, but one has to question why these things were not done, and if they were done, why they didn't work.
(Cross-posted @IT Toolbox)