ODF: Our Document Future

Donna Benjamin, Open Source Industry Australia and Creative Contingencies.

Paper presented at XTech Conference, Amsterdam, May 2006.

The XML-based OpenDocument format represents our document future because of the part it will play in the practice of digital preservation. The National Archives of Australia took part in the development and standardisation of the OpenOffice.org XML file format that has gone on to become the OASIS OpenDocument Format for Office Applications. Members of its digital preservation software team sat on the technical committee that drafted the standard.

As we've increased our capacity to store data externally from ourselves, we've decreased our own ability to remember it. Whilst undertaking the background research for this paper, I was surprised to discover that one of the reasons behind Socrates' dislike of reading and writing was his belief that those skills diminished his students' capacity to remember and recall spoken information. It was later acknowledged that by not having to remember the text itself, students engaged in higher level analysis and understanding.

Whatever the means, knowledge has endured. The irony is that if it weren't for the writings of Plato, such as the Phaedrus, we wouldn't know what Socrates had thought or said. Nevertheless, oral traditions are important, and every society that didn't develop writing relied completely on transmitting ancestral wisdom by speaking it. This survives today in the form of family stories. Most of what we learn about our families comes to us through this oral tradition. Think about the stories you may in turn pass on to your children, nieces or cousins.

In Shakespeare's time, most of the audience at his plays would not have been able to read and write, but many would have remembered, verbatim, an entire play they had just seen. We can't remember things like we used to. Preserving what we've recorded is even more important.

Over time we've recorded information on stone, clay, papyrus, vellum, paper, celluloid, magnetic tape, floppy disks, CDs, DVDs and hard drives. Stone and paper endure, but things recorded on floppies and magnetic tape are in danger. It seems we've adopted a much more short-term approach to information storage.

There are four elements of digital technology that threaten our future access to stored data.

  1. The media on which it is stored.

  2. The hardware used to design and create it.

  3. The software used to create, read or edit that data.

  4. The standards we rely on to record and format data.

Digital archivists understand the dangers of those elements and are working to address these issues. A number of strategies have been identified. Naturally there is disagreement about what will prove to be the best approach. However, many agree that a range of approaches will be required to prevent our time being declared the digital dark age.

Some key strategies have been identified to address the preservation of digital information. These are technology preservation, emulation, migration and encapsulation.

Technology Preservation means maintaining museums with collections of obsolete computer technology kept in working order ready to access archived documents.

Emulation means developing virtual machines to run archived software and data. These systems can be developed when the data is archived, or at a later point, when it needs to be accessed by using the specification and standards that were archived along with the data in question.

Migration means bringing the data forward onto modern systems, storing it on modern hardware and converting it to more recent or standard file formats. It must be constantly refreshed onto new media. This is a particularly useful strategy for documents that are likely to be reused and accessed frequently.

Encapsulation involves storing the data inside a digital envelope that records important metadata.

These four preservation strategies can be organised into two main approaches:

  1. preserve the technical environment with hardware museums and emulation.

  2. overcome obsolescence of file formats with encapsulation and migration.

Which brings us to XML. This really seems to be the key. XML can be created and interpreted independently of any particular computer platform or program. It is an open standard that is widely accepted. It was developed with the aim of separating form, structure and content, and it is flexible and extensible. Most importantly perhaps, it is free and it is readable by both machines and human beings. Files can easily be converted into XML formats, and XML is a useful way of recording and preserving metadata. It would seem that XML is part of the answer for addressing our digital preservation needs.

OpenDocument is a file format for office application documents. It has been developed and ratified by an OASIS technical committee and has just been approved as a standard by the International Organization for Standardization. OpenDocument uses XML to store and describe data. This means the document content can be read independently of the application that created it. The specification that describes how to develop software that can read and write ODF files is completely open.
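
To make "read independently of the application" a little more concrete, here is a minimal sketch in Python that pulls the paragraph text out of OpenDocument-style XML using only a general-purpose XML parser. The sample XML is hand-written and heavily simplified for illustration; it is not output from a real office suite. The function name extract_paragraphs is hypothetical. The namespace URIs are the ones defined by the OpenDocument specification.

```python
import xml.etree.ElementTree as ET

# Namespace URIs defined by the OpenDocument specification.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}

# Hand-written, simplified stand-in for a content.xml file (an assumption
# for illustration; real files carry many more namespaces and attributes).
SAMPLE_CONTENT_XML = """<?xml version="1.0" encoding="UTF-8"?>
<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <office:body>
    <office:text>
      <text:h>Our Document Future</text:h>
      <text:p>Stone and paper endure.</text:p>
      <text:p>Floppy disks are in danger.</text:p>
    </office:text>
  </office:body>
</office:document-content>
"""


def extract_paragraphs(content_xml):
    """Return the text of every text:p element, in document order."""
    root = ET.fromstring(content_xml.encode("utf-8"))
    return ["".join(p.itertext()) for p in root.iterfind(".//text:p", NS)]


if __name__ == "__main__":
    for paragraph in extract_paragraphs(SAMPLE_CONTENT_XML):
        print(paragraph)
```

No office application is involved at any point: a generic XML library and the published element names are enough to recover the text.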

Michael Brauer, from Sun Microsystems, spoke at this conference last year. His talk was titled "From Open Source to Open Standard: The OASIS OpenDocument Format". I'd urge you to read his paper for more detail on OpenDocument and how it came to be considered and adopted as a standard. http://idealliance.org/proceedings/xtech05/papers/03-02-02/

When I was talking to a friend about this paper, he asked "Why do ODF and XML matter for office file formats?" Does anyone here remember using word processors in the 1980s? Remember the difficulties of sharing files? Dealing with different file formats? I reminded him of a typewriter-based word processor he used to write essays on in high school. He could save his work to a floppy disc, but no other computer he had access to at the time could read that floppy disc, let alone access the files stored on it.

I'm sure many of you would have your own story of frustration regarding file incompatibilities. This is why ODF matters: as an independent standard, it holds promise to help us 'solve' this problem. I'm of the view that 'standards' are a sign of mature technology.

Open standards and open formats are critical if we are to make serious efforts to preserve the 'born digital' documents of our age. I spoke earlier of four strategies that have been identified to attack this problem: preservation of the technology itself, emulation of the hardware that runs the software, migration of the documents and digital artifacts to modern formats, and encapsulation. ODF will be most useful for migration and encapsulation.

Applications like OpenOffice.org, KOffice and StarOffice can already open a number of existing file formats and convert them to ODF. This is enormously helpful for migration strategies. Encapsulation strategies will also benefit from using ODF because it zips the different elements of the document together into one file. An ODF file is effectively a collection of files and folders compressed together to form one file. Style is separated from content, embedded images are saved in the 'Pictures' folder, and metadata is stored in the package too.
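
The package structure described above can be sketched with nothing more than a standard zip library. The Python example below builds a small stand-in package in memory and lists its entries. The entry names follow the layout of a typical ODF text document, but the contents are placeholders I have made up for illustration, so the result would not open in an office suite. The function names build_sample_package and list_entries are hypothetical.

```python
import io
import zipfile

ODT_MIMETYPE = "application/vnd.oasis.opendocument.text"


def build_sample_package():
    """Build an ODF-like zip package in memory with placeholder contents."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        # The mimetype entry is written first and stored uncompressed
        # (ZipInfo defaults to ZIP_STORED).
        package.writestr(zipfile.ZipInfo("mimetype"), ODT_MIMETYPE)
        # Placeholder entries: the names are typical, the contents are not real.
        placeholders = {
            "content.xml": "<placeholder-content/>",
            "styles.xml": "<placeholder-styles/>",
            "meta.xml": "<placeholder-metadata/>",
            "Pictures/logo.png": "not a real image",
            "META-INF/manifest.xml": "<placeholder-manifest/>",
        }
        for name, body in placeholders.items():
            package.writestr(name, body, compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()


def list_entries(package_bytes):
    """Return the entry names of a zip package, in the order they were written."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return package.namelist()


if __name__ == "__main__":
    for entry in list_entries(build_sample_package()):
        print(entry)
```

Pointing list_entries at the bytes of a real .odt file should show a similar layout, with content, styles, metadata and pictures each in their own entry.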

It is true that Microsoft Word has become a de facto standard due to its ubiquitous presence on 85% of the world's desktop computers. But not everyone can afford the ticket price of Word, and governments especially are increasingly sensitive to this. Many are moving towards delivering services and information on-line and are becoming aware they should not be mandating that their people buy a particular product in order to access government information and services.

Word files, however, are probably the least of our problems. A lot of precious information is stored in other formats with extensions like .xls and .ppt or even .psd and .mdb. Let's look at spreadsheets. Essentially these are rows and columns of data, usually numbers, and they often contain formulas for calculations associated with that data. Printing and archiving a text document is a valid preservation method. We know paper works if it's looked after; just look at Britain's Domesday Book from 1086. It's fragile, but it's legible. But paper can't help us preserve the functionality of a spreadsheet with complex macros and calculations. This is further complicated if that spreadsheet accesses data stored in a network database, or if the calculations in a budget vary due to different implementations of floating point mathematics.
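
The floating point concern is easy to demonstrate. The Python sketch below adds the same three amounts in two different orders using binary floating point and gets two different totals, then repeats the sum with decimal arithmetic, where the order does not matter. It only illustrates the underlying sensitivity within a single implementation; it does not reproduce the behaviour of any particular spreadsheet application, and the variable names are my own.

```python
from decimal import Decimal

# The same three amounts, added in two different orders with binary floats.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# The same sums using decimal arithmetic.
amounts = [Decimal("0.1"), Decimal("0.2"), Decimal("0.3")]
decimal_left = (amounts[0] + amounts[1]) + amounts[2]
decimal_right = amounts[0] + (amounts[1] + amounts[2])

if __name__ == "__main__":
    print(left_to_right == right_to_left)   # the float totals differ
    print(decimal_left == decimal_right)    # the decimal totals agree
```

If two programs evaluate the same formula in a different order, or with different numeric types, an archived spreadsheet may not recalculate to the figures its author saw.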

For more on the real technical nitty gritty of the OpenDocument Format take a look at Daniel Carrera's "Introduction to the Format Internals" article on the OpenDocument Fellowship website. http://opendocumentfellowship.org/Articles/IntroductionToTheFormatInternals

There are a number of digital preservation initiatives being implemented and tested in Australia. This is not to say Australia is alone in this. There is brilliant work being done here in the Netherlands. A White Paper called "XML and Digital Preservation" gives a more in-depth analysis of XML and digital preservation issues. http://www.digitaleduurzaamheid.nl/bibliotheek/docs/white-paper_xml-en.pdf

Three organisations in Australia are leading the charge: the National Library of Australia with the PADI and PANDORA projects, the Public Record Office Victoria with the VERS initiative, and the National Archives of Australia with a warrior called Xena.

National Library of Australia

The National Library of Australia's Preserving Access to Digital Information (PADI) initiative aims to provide ways to ensure that information in digital form can be managed with consideration for preservation and future access.

The initiative helps develop strategies and guidelines for preserving access to digital information, runs a web site for information and promotion purposes and identifies and promotes relevant activities.

The PADI web site is also a central repository of information on digital preservation resources from Australia and around the world. It has an associated discussion list for the exchange of news and ideas about digital preservation issues. It includes a database of international digital library organizations, initiatives and reports describing projects dealing with the preservation of access to digital materials. http://www.nla.gov.au/padi

PANDORA is Australia's Web Archive Project and collects Australian online publications. It is an ongoing collaboration of Australian libraries and organisations dedicated to preserving some of Australia's online and electronic culture and heritage. Their website states that the name, PANDORA, is an acronym that encapsulates their mission: Preserving and Accessing Networked Documentary Resources of Australia. http://pandora.nla.gov.au

Public Record Office Victoria

The Public Record Office Victoria developed the Victorian Electronic Records Strategy (VERS), which specifies a standard format for electronic records.

VERS aims to be generic but extensible and is designed to work in conjunction with existing record keeping and business practices. It ensures that all data is stored in a documented format to enable viewing of records in the future, regardless of the system that created them. It specifies methods to automate the capture of records from desktop and business systems. The strategy specifies ways and forms in which to capture associated information and encapsulate this with the records to ensure that they will be understood in context in the future. VERS also details a method for securing records so that changes are detectable.

VERS applies some of the technologies and concepts we identified earlier to the problem of long term preservation. For example, it recommends XML for encapsulation to ensure that metadata and document content are wrapped up together in a single object.
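
As a rough illustration of the encapsulation idea, the Python sketch below wraps a document's bytes and its metadata together in a single XML object and then unwraps them again. The element names (record, metadata, field, content), the use of base64 for the payload, and the function names are all my own assumptions for the example; this is not the VERS schema.

```python
import base64
import xml.etree.ElementTree as ET


def encapsulate(content, metadata):
    """Wrap raw document bytes and a metadata dict in one XML envelope.

    Element names here are hypothetical, not taken from any standard.
    """
    envelope = ET.Element("record")
    meta = ET.SubElement(envelope, "metadata")
    for name, value in metadata.items():
        ET.SubElement(meta, "field", {"name": name}).text = value
    payload = ET.SubElement(envelope, "content", {"encoding": "base64"})
    # Base64 keeps arbitrary binary content safe inside an XML text node.
    payload.text = base64.b64encode(content).decode("ascii")
    return ET.tostring(envelope, encoding="unicode")


def open_envelope(xml_text):
    """Recover the metadata dict and the original bytes from an envelope."""
    envelope = ET.fromstring(xml_text)
    metadata = {
        field.get("name"): field.text
        for field in envelope.iterfind("metadata/field")
    }
    content = base64.b64decode(envelope.findtext("content"))
    return metadata, content


if __name__ == "__main__":
    wrapped = encapsulate(
        b"Minutes of the meeting...",
        {"title": "Meeting minutes", "created": "2006-05-01"},
    )
    print(wrapped)
    print(open_envelope(wrapped))
```

Because the envelope is plain XML, the metadata stays readable by people and programs even if the software that produced the embedded document is long gone.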

VERS also makes use of extensible schemas that allow the standard schema to be extended through the inclusion of new components. Digital signatures are used to provide authenticity and integrity for each record, whilst retaining the possibility of changing the records in a controlled manner. Ultimately, it ensures that each record can contain the multiple documents that together form that record and, perhaps most importantly, it provides a mechanism to describe the many different relationships that link individual records together. http://www.prov.vic.gov.au/vers/
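
The notion that changes to a record should be detectable can be sketched with a cryptographic digest. The Python example below stores a SHA-256 fingerprint of a record and later checks whether the record still matches it. This is a simplification of my own and not the VERS mechanism: a true digital signature also ties the digest to a signer's private key, which is what provides authenticity as well as integrity. The function names are hypothetical.

```python
import hashlib


def fingerprint(record):
    """Return the SHA-256 digest of the record bytes as a hex string."""
    return hashlib.sha256(record).hexdigest()


def is_unchanged(record, stored_fingerprint):
    """Report whether the record still matches the fingerprint taken earlier."""
    return fingerprint(record) == stored_fingerprint


if __name__ == "__main__":
    original = b"Budget approved: $10,000"
    stored = fingerprint(original)

    print(is_unchanged(original, stored))                      # same bytes
    print(is_unchanged(b"Budget approved: $90,000", stored))   # altered bytes
```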

National Archives of Australia

Finally, I'd like to talk about what the National Archives of Australia (NAA) has done. As I mentioned earlier, they were involved with the creation of the OpenDocument format. You may have read that they intend to adopt it soon as part of their digital preservation workflow, but that is in fact a very small part of what they have achieved. The National Archives' approach to the preservation of digital records is based on a combination of conversion, encapsulation and emulation. http://www.naa.gov.au/

The NAA's approach uses in-house developed software called Xena, which has been made freely available to others under an open source licence. Xena processes documents for the digital archiving process. The NAA recently received quite a bit of publicity regarding their intention to move toward ODF. However, Michael Carden, manager of the software team at the NAA, has been careful to point out that this is a very small detail of their very large picture. Xena uses the OpenOffice.org office suite to convert certain files into XML for preservation. Xena's earlier incarnations used an older version of OpenOffice.org which did not yet support the OpenDocument format. OpenOffice.org from version 2.0 onwards uses ODF as its native file format.

As at early April, the build of Xena that can normalise to OpenDocument exists only on a couple of development machines within the digital preservation software team. It appears to work, but it has not been rigorously tested and is not in use for production. It probably won't be ready for some months yet. However, it was always their intention to use OpenDocument for office application formats. The reality is that there are more pressing challenges, so this is less of a priority.

Michael Carden says XML is widely acknowledged to be a helpful tool in digital preservation strategies, but I believe OpenDocument goes a step further because software exists now for users to start creating electronic records in this format. That software runs on a range of hardware and operating systems, in a number of different applications, and even more are planned and in development. Governments and their agencies are acknowledging the importance of open standards. The OpenDocument Format is here, now, for our future.

Bibliography

Beagrie, Neil. 'National Digital Preservation Initiatives: An Overview of Developments in Australia, France, the Netherlands, and the United Kingdom and of Related International Activity.' National Digital Information Infrastructure and Preservation Program, Library of Congress, April 2003. http://www.clir.org/PUBS/reports/pub116/pub116.pdf

Gladney, H. M. 'Principles for digital preservation.' Commun. ACM 49, 2 (Feb. 2006), 111-116. DOI= http://doi.acm.org/10.1145/1113034.1113038

Kyong-Ho Lee, Oliver Slattery, Richang Lu, Xiao Tang, and Victor McCrary. 'The State of the Art and Practice in Digital Preservation.' Journal of Research of the National Institute of Standards and Technology, Jan-Feb 2002. http://nvl.nist.gov/pub/nistpubs/jres/107/1/j71lee.pdf

Waugh, A., Wilkinson, R., Hills, B., and Dell'oro, J. 2000. 'Preserving digital information forever.' In Proceedings of the Fifth ACM Conference on Digital Libraries (San Antonio, Texas, United States, June 02 - 07, 2000). DL '00. ACM Press, New York, NY, 175-184. DOI= http://doi.acm.org/10.1145/336597.336659

Wheeler, D.A. 'Is ODF an Open Standard?' Groklaw, Feb 2006. http://www.groklaw.net/article.php?story=20060209093903413

Digital Preservation Testbed White Paper: XML and Digital Preservation, ICTU, Den Haag, September 2002. http://www.digitaleduurzaamheid.nl/bibliotheek/docs/white-paper_xml-en.pdf

Open-standards solutions White Paper: Emerging business value of OpenDocument format v1.0, IBM, January 2006. http://odfalliance.org/pubs/ODF_WP_Jan_05.pdf