While many despise WMA, DOC, MP3 and other proprietary formats, discussion of the data formats used by web applications has been surprisingly quiet. This is largely because many web applications offer XML export or an API for moving data to other services. But as data in the cloud grows more complex, and with the sheer number of formats now generated by cloud applications, standardizing data formats becomes just as important as it is on the desktop.
I recently went to hear Richard Stallman speak about copyright issues and Håkon Wium Lie, the creator of CSS and CTO at Opera Software, had an interesting statement in his question to Stallman: "I believe that the need for open source in software is of lesser importance. What really matters is the data people produce and that we have proper standards to ensure portability between services and platforms."
Desktop applications support import and export of various data formats. My browser of choice, Safari, can import bookmarks from other browsers. iWork can, to a certain degree, open the proprietary formats produced by Microsoft Office. Any accounting package can import data from a market leader like Quicken or MYOB. We are beginning to see a similar set of functions in web applications, and in many cases it is powered by direct API integration between two services.
As most web applications use XML for data, there is already an open format on offer. But should we require open standards to ensure easy portability of data between services? Many types of data are fairly simple to represent in a standard manner. This will not suit every internal data structure of every web application, but the most important pieces of information can in most cases be mapped to a common standard. A customer record will always have a name, address, phone numbers and so on.
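To make the idea concrete, here is a minimal sketch of what a vendor-neutral customer record might look like. All field names are illustrative assumptions, not taken from any published standard:

```python
import json

# Hypothetical vendor-neutral customer record; the field names are
# made up for illustration, not drawn from an existing standard.
customer = {
    "name": "Acme Ltd",
    "address": {
        "street": "1 Example Road",
        "city": "Oslo",
        "country": "NO",
    },
    "phone_numbers": [
        {"type": "work", "number": "+47 22 00 00 00"},
    ],
}

# Any two services that agree on these core fields can exchange
# customers without custom per-vendor import code.
print(json.dumps(customer, indent=2))
```

A real standard would of course need far more fields and edge cases, but the core entities (customer, invoice, project) share this kind of common skeleton across vendors.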
One area where a common standard would be very beneficial is accounting. Accounting is governed by commonly adopted principles, but differs between countries in reports, tax setups and the like. At the end of the day, though, the data ends up as journal/transaction entries and account information. Every accounting vendor takes a different approach to this, so data import must either be built around each vendor's format or customized per customer. This limits the choice of accounting vendors for anyone using a less popular accounting service. It also results in lock-in for customers using applications with less commonly implemented data formats.
Data portability in the cloud
Data portability implies not only the ability to export data, but the ability to import it into another service. For a non-technical user, a data format that no other service accepts is just as useless as an inaccessible service. In desktop software this is often solved by exporting to a format defined by a major competitor, or by using a simple format such as CSV and leaving the user the job of matching fields.
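The field-matching chore that CSV export pushes onto the user can be sketched in a few lines. The column names and the mapping below are hypothetical, standing in for whatever two vendors happen to call the same fields:

```python
import csv
import io

# Hypothetical CSV export from "vendor A" with its own column names.
vendor_a_csv = """\
Full Name,Phone,Town
Ada Lovelace,555-0100,London
"""

# The field mapping the user (or a migration tool) must supply to match
# vendor A's columns onto the columns "vendor B" expects.
FIELD_MAP = {"Full Name": "name", "Phone": "phone", "Town": "city"}

def remap(csv_text: str, field_map: dict) -> list[dict]:
    """Rename columns according to field_map, dropping unmapped ones."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        {field_map[col]: value for col, value in row.items() if col in field_map}
        for row in rows
    ]

records = remap(vendor_a_csv, FIELD_MAP)
# records[0] == {"name": "Ada Lovelace", "phone": "555-0100", "city": "London"}
```

The code is trivial; the point is that someone has to write (or hand-configure) a `FIELD_MAP` for every pair of vendors, which is exactly what a shared standard would make unnecessary.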
Linked data
image from Tim Berners-Lee's TED talk
Taking your data from one service to another is not only about the data in the system you are leaving, but also about maintaining links to other systems. I suspect that values linking to external integrated services are overlooked in many data exports. Migrating a CRM system from one cloud service to another should preserve links to external services: any IDs in the CRM system that reference a project management system should be maintained.
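One way to keep such links intact is for the export format to carry the external references explicitly alongside the record itself. The structure and service names below are purely illustrative:

```python
import json

# Hypothetical CRM export record: besides the customer's own fields, it
# carries the IDs linking it to an external project-management service,
# so the target system can re-establish those links after migration.
crm_record = {
    "customer": {"name": "Acme Ltd"},
    "external_links": [
        {"service": "example-pm-tool", "entity": "project", "id": "PRJ-1042"},
    ],
}

# Round-trip through the export format: the link survives the move.
restored = json.loads(json.dumps(crm_record))
assert restored["external_links"][0]["id"] == "PRJ-1042"
```

If the export only dumped the customer fields, the `PRJ-1042` reference would be silently lost, which is the failure mode described above.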
Elsewhere on the internet we have recently seen standardized data formats evolve, such as sitemaps and widgets. So perhaps in the future we will see standard data formats for entities such as customers and projects. A few organizations seem to be supporting the cause, but little has been done yet.
Do you have any experiences to share regarding data formats in the cloud or moving your data from one vendor to another?
Espen, check out Microformats. There is some progress in this area but we need to do more. One thing is clear. The only way for Cloud Computing to succeed is by having open formats. Proprietary formats can only go to a certain extent when it comes to data portability and interoperability.
Plus, I disagree with the idea that open source becomes irrelevant in the cloud-based world. Even people like Tim O'Reilly seem to promote this line of reasoning. But as we have seen with WordPress, Deki Wiki and Wikidot, releasing the source code is no hindrance to a business, and if companies like Coghead are any indication, sharing the source code is important to ensure that consumers trust cloud-based applications. I have written on this topic a few times here at Cloud Avenue. I also plan to do a dedicated post on it, but I keep postponing it.
Krishnan; naturally, Stallman also disagreed with open source being less relevant. But Lie's point was not really about that (though he did argue that the data is more important than the source code); it was about the importance of standard data formats.
I am aware of microformats, but didn't think of them when I wrote the post. Thanks for reminding me. They do seem to carry a lot of overhead, though.
I can't really see the need to normalize data to conform to certain made-up standards. For some data, sure (you used accounting as an example), but for most services, a little parsing isn't really that hard.
With the introduction of lightweight data interchange formats like JSON, it’ll only take a couple of minutes (or hours at the most) to create a stable parser for most APIs. Let’s not get too lazy and comfortable! 🙂
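Henrik's point is fair for simple cases: with JSON, "writing a parser" for a typical API response is mostly a standard-library call plus a little field extraction. The response body below is a made-up example:

```python
import json

# A hypothetical API response body; extracting what you need from JSON
# takes the standard library and a comprehension, not a custom parser.
response_body = '{"customers": [{"name": "Acme Ltd", "phone": "+47 22 00 00 00"}]}'

data = json.loads(response_body)
names = [c["name"] for c in data["customers"]]
# names == ["Acme Ltd"]
```

The effort is small per API, which is Henrik's point; whether it stays small across many APIs is what the later comments dispute.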
I really wish XML would die. JSON is a much more compact and usable interchange format which has great support in most languages by now. It’s also much more web-friendly, as it’s plug and playable into JavaScript.
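The compactness claim is easy to check by encoding the same (made-up) record both ways and parsing each with the standard library:

```python
import json
import xml.etree.ElementTree as ET

# The same record in XML and JSON, to compare size and parsing effort.
xml_text = "<customer><name>Acme Ltd</name><phone>555-0100</phone></customer>"
json_text = '{"name": "Acme Ltd", "phone": "555-0100"}'

root = ET.fromstring(xml_text)
assert root.find("name").text == "Acme Ltd"

record = json.loads(json_text)
assert record["name"] == "Acme Ltd"

# JSON skips the closing tags, so the same data is shorter here.
assert len(json_text) < len(xml_text)
```

Size is only one axis, of course; the later comment about carefully designed XML schemas like UBL points at what XML buys in return.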
@Henrik True, many formats are fairly simple and not at all hard to parse, for you or me. But that is not the case for the average user. And the data I refer to in my post is mainly business data, which tends to be more complex; hence the need for standard data formats to ensure data portability.
Espen, thanks for a great post. I fully agree with your view that accounting needs an open standard. I am actually in the process of writing one at the moment: OAccounts aims to provide interoperability and data synchronisation between accounting systems. It’s still early stage, but I’ve blogged about OAccounts and would welcome input from anybody with an interest in this area while I write a draft specification over the next few weeks.
@Henrik One really valuable thing which an open standard offers you is the ability to plug a large number of different implementations together without ANY integration or parsing work at all. If you have N different systems, each with a different API format, then connecting them all requires in the order of N-squared parsers to be written. That soon gets ridiculous. With an open standard, you only need N implementations of that standard.
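The back-of-the-envelope arithmetic behind the N-squared argument is simple enough to write down:

```python
# With N systems and no shared format, each system needs a converter for
# each of the other N-1 formats: one directed converter per ordered pair.
def pairwise_converters(n: int) -> int:
    return n * (n - 1)

# With an open standard, each system implements the standard once.
def standard_implementations(n: int) -> int:
    return n

for n in (3, 5, 10):
    print(n, pairwise_converters(n), standard_implementations(n))
# At 10 systems: 90 pairwise converters versus 10 standard implementations.
```

The gap widens quadratically, which is why ad-hoc pairwise integration "soon gets ridiculous" as the comment puts it.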
Think of email as an example for a very successful open standard. Now imagine that GMail decided to implement its own, incompatible set of mail headers so that everyone who wanted to exchange messages with a GMail user had to implement Google’s protocol. Yes, it would be possible, but a right pain in the backside; it is because everybody has bothered to implement (more or less) the same SMTP and MIME standards that we can simply send an email and expect it to work.
Finally, regarding your note about wanting XML to die. I would agree with this regarding things like configuration files, which are supposed to be human-readable; having these in XML is horrible. Similarly, if everybody is going to make up their own XML schema, then it's also fairly pointless and you might as well use JSON. XML becomes really valuable when a bunch of bright people have sat down for a long time and carefully designed a well-thought-out schema for a particular purpose. For instance, I am planning to use the OASIS UBL 2.0 XML schema for OAccounts. It works really well because these guys have thought about multi-currency issues, international transactions, international taxation, shipping of goods by container, various banking arrangements and the multi-party purchaser-supplier relationships you get in the real world. If I were to make up a schema myself (be it XML or JSON, it doesn't matter), I would never think of all those things, and we would hit limitations as soon as we tried to put the thing to real-life use. By using this existing schema, I feel confident that we can scale to complicated use cases, while not introducing too much extra complexity in the simple ones.