Choosing The Right File Format/Text Documents

Text & Documents

In most organisations, text documents are the most important type of electronic information after financial accounts. Depending on what a document contains, there are several formats to choose from.

There are three types of text documents:

  • Plain text files: simple text, no formatting, no font choices.
  • Text documents: you can choose fonts, colors, text size and backgrounds, and embed images, sound, video and so on.
  • Documents for presentation: all the options of text documents, with restrictions on further editing.

For plain text files the simplest and most durable format is ASCII (American Standard Code for Information Interchange). The standard was first published in 1963 and is probably the most widely supported format ever. However, it is also very limited. The only formatting available is the placement of line breaks. There is no way to embed images or colors, and there is no support for diacritic marks or non-Latin scripts. A variety of other encodings based on ASCII add support for more characters. In the western world Windows-1252 (which is closely related to ISO-8859-1) is the most common of these. Other parts of the world have other conventions. UTF-8, which can represent text in every language in real use, is becoming more common and may be the best choice for long-term storage of text.
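The differences between these encodings can be illustrated with a minimal Python sketch. The helper name `encodable` and the sample string are assumptions made for illustration; the codec names are the ones Python's standard library uses for ASCII, ISO-8859-1, Windows-1252 and UTF-8.

```python
def encodable(text: str, encoding: str) -> bool:
    """Return True if every character of `text` can be stored in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical sample: contains a diacritic (é) and the euro sign (€).
sample = "café €5"

# Try the same text against each encoding discussed above.
results = {
    name: encodable(sample, name)
    for name in ("ascii", "latin-1", "cp1252", "utf-8")
}
```

With this sample, ASCII fails on the diacritic, ISO-8859-1 (`latin-1`) fails on the euro sign, and both Windows-1252 (`cp1252`) and UTF-8 succeed. This is one of the places where Windows-1252 and ISO-8859-1 differ despite being closely related.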

Text files using an encoding based on ASCII usually carry the .txt suffix, but it can be hard to determine automatically which encoding a given file uses. It is therefore a good idea to find out what encoding you are using and record it. If you are really paranoid you may also want to find and store the authoritative tables for converting that encoding to Unicode (try http://www.iana.org/assignments/character-sets and http://www.unicode.org/Public/MAPPINGS/).

For Windows users, Notepad is the default application for handling TXT files. Current versions of Notepad assume UTF-8 if the file is completely valid UTF-8 or has a UTF-8 byte order mark, UTF-16 if they detect a UTF-16 byte order mark, and the Windows ANSI code page (1252 for western versions) otherwise. In a pinch it is often possible to use Notepad and similar editors to get the raw text out of other types of files, and it can be informative to try this on other files you plan to store.
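The detection order described above can be approximated in a few lines of Python. This is only a sketch of the heuristic as stated in the text, not Notepad's actual implementation; the function name `guess_encoding` is hypothetical, and the fallback to `cp1252` assumes a western Windows installation.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Guess a text encoding using the byte-order-mark and validity checks above."""
    # A UTF-8 byte order mark settles the question immediately.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # A UTF-16 byte order mark (either byte order) means UTF-16.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No mark: accept UTF-8 only if the whole file decodes cleanly.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumed fallback: the western Windows ANSI code page.
        return "cp1252"
```

A pure ASCII file is reported as UTF-8, which is harmless because ASCII text is also valid UTF-8. The result is only a guess, so it is still worth recording the encoding next to the file once you have confirmed it.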

Text documents are what you produce most of the time with one of the many commercial or free word processors: letters to friends and colleagues, project lists and so on. Applications for this type of text are found in popular office suites like Microsoft Office, AppleWorks and OpenOffice.org.

For your documents to be durable, the document you write today must still be readable next year. For a long time there was no open standard for documents, so compatibility was a constant problem. People have had varying success when migrating from one document editor to another, because each editor used its own format. The .doc format is now well supported by several editors.

Whichever word processor you use, it should support several formats, and choosing the most durable one is very important. While work proceeds on the OpenDocument standard (Version 1.0 was approved as an OASIS standard in May 2005), RTF (Rich Text Format) is the most widely supported and documented format available. You should be able to make this your default format so that all future documents are saved as RTF. If you choose not to do this because RTF does not support some feature you need, you should still consider using RTF as your archival format. Your formatting may not be represented correctly, but at least your content is there for posterity.

If you spend time making documents for presentation you will know that word processors are limited in this area. You might be using programs like Adobe Illustrator/InDesign, Sodipodi or CorelDRAW. These programs are great, but their files can be tricky to archive successfully.

There are at least two competing options: PDF (especially PDF/A), which originated at Adobe, and XPS from Microsoft.

The Portable Document Format (PDF) is the file format created by Adobe Systems in 1993 for document exchange. PDF is a fixed-layout format used for representing two-dimensional documents in a manner independent of the application software, hardware, and operating system. Each PDF file encapsulates a complete description of a 2-D document (and, with Acrobat 3-D, embedded 3-D documents) that includes the text, fonts, images, and 2-D vector graphics that compose the document. PDF is an open standard that was officially published on July 1, 2008 by the ISO as ISO 32000-1:2008. ("The Portable Document Format (PDF)", Wikipedia Online Encyclopedia, accessed July 4, 2008.)

PDF/A is described in ISO 19005-1:2005, "Document Management - Electronic document file format for long term preservation - Part 1: Use of PDF 1.4 (PDF/A-1)", which was published on October 1, 2005. This standard defines a format (PDF/A) for the long-term archiving of electronic documents and is based on the PDF Reference Version 1.4 from Adobe Systems Inc. (implemented in Adobe Acrobat 5). PDF/A is in fact a subset of PDF, leaving out PDF features not suited to long-term archiving. This is similar to the definition of the PDF/X subset for the printing and graphic arts. ("PDF/A", Wikipedia Online Encyclopedia, accessed July 4, 2008.)

The XML Paper Specification (XPS), formerly codenamed "Metro", is a specification for a page description language and a fixed-document format developed by Microsoft. It is an XML-based (more precisely XAML-based) specification, based on a new print path and a color-managed vector-based document format which supports device independence and resolution independence. ("The XML Paper Specification (XPS)", Wikipedia Online Encyclopedia, accessed July 4, 2008.)

One word of caution when using PDF files: do not use the built-in compression of PDF files and, if possible, use the PDF 1.4 specification.
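To check which PDF version an existing file declares, you can read its header, which normally starts with `%PDF-` followed by the version number. The following Python sketch assumes the header sits at the very start of the file; the function name `pdf_version` is hypothetical, and the header only shows the declared version, not whether compression is used or whether the file conforms to PDF/A.

```python
import re


def pdf_version(header: bytes):
    """Return the (major, minor) version declared in a PDF header, or None."""
    # A PDF file is expected to begin with a line such as b"%PDF-1.4".
    match = re.match(rb"%PDF-(\d+)\.(\d+)", header)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Example with an in-memory header; for a real file you would pass
# the first few bytes read in binary mode instead.
declared = pdf_version(b"%PDF-1.4\n")
within_1_4 = declared is not None and declared <= (1, 4)
```

Comparing the returned tuple against `(1, 4)` gives a quick first filter when sorting documents for archiving. A dedicated validator is still needed to confirm real PDF/A conformance.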

Recommendation

  • Use plain ASCII text whenever possible
  • Use ODT where formatting is important or where graphics need to be included
  • Use PDF or XPS for documents which will not need to be edited in the future
