Everything You Need to Know About File Formats and Their Properties
The file is one of the fundamental parts of the computing experience. But what is a “file” anyway? It’s an icon on your desktop, an entry in a list that opens your budget as a spreadsheet, and a name you can give to your latest selfie. At the end of the day, a “file” is a collection of bits (zeroes and ones) arranged in a pattern that some application understands.
Developers may keep these patterns secret, which can lock users into a particular program even when alternatives exist. Other developers actively encourage adoption of their formats. In some cases you may be able to easily open and interpret the format, or it may be impossible to use outside its native application. We’ll explore examples of all the above in this article.
Text-Based vs. Binary File Formats
The first important aspect of any file type is whether it’s binary or text-based. Let’s look at each of these in turn.
The text file is the most basic file format around. It can be read by just about any system out there with a processor. This format is a sequence of bits (ones and zeroes) that adheres to the ASCII standard (we’ll overlook Unicode for the moment), meaning a computer can interpret every byte (8 bits) as a character from among the following:
- A-Z (including upper and lower case)
- Digits 0-9
- Space character
- A selection of symbols (e.g. punctuation)
- Control characters (e.g. “DEL”)
Since the data is stored as text, you can view the contents of a file by opening it in a text editor, even simple ones like Windows Notepad (or equivalents for Mac, Linux, iOS, and Android). Since nearly all computing platforms ship with a basic text editor, you can put some (text) information in one of these files and be confident you’ll always be able to access it. Other applications don’t need to know anything further to at least read the data properly.
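To make the "bytes as characters" idea concrete, here is a minimal Python sketch. The function names `to_bits` and `from_bits` are made up for this illustration. It turns a short ASCII string into its 8-bit patterns and back again.

```python
def to_bits(text: str) -> list[str]:
    """Return each character of an ASCII string as an 8-bit pattern."""
    return [format(byte, "08b") for byte in text.encode("ascii")]


def from_bits(bits: list[str]) -> str:
    """Turn a list of 8-bit patterns back into ASCII text."""
    return bytes(int(pattern, 2) for pattern in bits).decode("ascii")


patterns = to_bits("Hi!")
print(patterns)             # ['01001000', '01101001', '00100001']
print(from_bits(patterns))  # Hi!
```

Any program that knows the ASCII table can do the same conversion, which is why plain text travels so well between systems.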
However, things get more complicated when you need to represent something like the text font or an auto-filled date on the cover page. In these cases, the ASCII characters are arranged according to a text-based format such as Markdown or XML. While this keeps the benefits of plain text, such as transparency, these files often require more space for elements like tags. Consider the following one-sentence file in plain text, and then in the Open Document Format’s “Flat ODT” (FODT) format, which uses XML. The image below shows that the plain text version is 53 bytes, while the FODT version is 25,000 bytes.
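The size difference is easy to reproduce on a small scale. The sketch below wraps a sentence in a tiny, made-up XML structure and compares byte counts. The element names, the `font` attribute, and the sample sentence are all hypothetical. This is not real FODT, which carries far more markup.

```python
import xml.etree.ElementTree as ET

sentence = "This is a one-sentence file."  # hypothetical sample text


def wrap_in_xml(text: str) -> bytes:
    """Wrap text in a small invented XML document (not real FODT)."""
    root = ET.Element("document")
    body = ET.SubElement(root, "body")
    para = ET.SubElement(body, "paragraph", {"font": "Liberation Serif"})
    para.text = text
    return ET.tostring(root, encoding="utf-8")


plain_size = len(sentence.encode("ascii"))
xml_size = len(wrap_in_xml(sentence))
print(f"plain text: {plain_size} bytes, XML: {xml_size} bytes")
```

Even this toy wrapper adds noticeable overhead. A real word processor format also adds styles, metadata, and settings, which is where the much larger file size comes from.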
In contrast, binary formats are files that an application constructs bit by bit. You can try to open these files with a text editor, but it likely won’t know how to interpret them. The image below shows the result of trying to open a Microsoft Excel file with a text editor.
The application needs to process the data in a binary file in a specific way. When opening an XLS file, an application must treat the first sixteen bytes of the file as the “beginning of file” (BOF) marker. Within that marker, the fifth item is a single bit indicating whether or not the file was last edited on the Windows platform (“fWin”). It comes after four other items, each two bytes, meaning the “fWin” item is the 65th bit in the Excel file.
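The arithmetic behind "the 65th bit" can be sketched in a few lines of Python. This is an illustration only. The `read_bit` helper and the 16-byte `marker` are hypothetical, and the sketch assumes bits inside each byte are counted from the least significant bit first. A real parser would follow the format's specification for layout and bit order.

```python
def read_bit(data: bytes, bit_number: int) -> int:
    """Return the bit at a 1-based position, counting from the start of data.

    Assumption: within each byte, bits are counted from the least
    significant bit first. A real format's specification defines this.
    """
    index = bit_number - 1  # convert to a 0-based position
    byte_offset, bit_offset = divmod(index, 8)
    return (data[byte_offset] >> bit_offset) & 1


# Four two-byte items come first: 4 items * 2 bytes * 8 bits = 64 bits.
FWIN_BIT = 4 * 2 * 8 + 1  # so the flag is the 65th bit

# Hypothetical 16-byte marker: 8 filler bytes, one byte with its lowest
# bit set, then 7 more filler bytes.
marker = bytes(8) + bytes([0b00000001]) + bytes(7)
print(FWIN_BIT, read_bit(marker, FWIN_BIT))  # 65 1
```

The point is that nothing in the file announces where the flag lives. The reading program has to know the layout in advance.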
As we saw, if you try to open it with an application that doesn’t handle the 65th bit as the “fWin” flag, that application won’t open it correctly. It may display lots of garbled characters on the screen (shown above), handle it gracefully with an error message (also shown above, because Linux), or crash. In any case it won’t know how to read the data correctly, and so won’t display it correctly. But applications, once programmed, can handle as many file formats as desired.
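As a rough sketch of the "graceful error" option, a program that only understands plain text could check the bytes before displaying anything. The function name `read_as_text` and the sample bytes are invented for this example.

```python
def read_as_text(data: bytes) -> str:
    """Decode bytes as ASCII, or raise a clear error instead of showing garbage."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError as err:
        raise ValueError(
            f"not a plain text file: unexpected byte at offset {err.start}"
        ) from err


print(read_as_text(b"Budget for May"))

try:
    read_as_text(bytes([0xD0, 0xCF, 0x11, 0xE0]))  # hypothetical binary data
except ValueError as problem:
    print(f"Could not open file: {problem}")
```

A text editor that skips this kind of check is the one that fills the screen with garbled characters.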
Open vs. Proprietary Formats
The next consideration is whether a file format is open (i.e. is available for easy use by others) or proprietary. Note that “proprietary” is not the same as closed, at least not in all cases. While the “text vs. binary” debate above was a technical one, “open vs. proprietary” has more to do with the licensing terms of a file format. More about this in the following sections.
Open formats are those where the license permits users to adopt them for their own applications. A standards body of some sort should also oversee their ongoing development by a community of contributors for the formats to be truly “open.” Open formats are also free of licensing costs and restrictions: they can be used by anyone, for any purpose. Perhaps the most famous open format is the Open Document Format (ODF), first released in 2005 by OASIS. Its purpose was to offer an alternative to the lock Microsoft had on the productivity market.
With open formats you never need fear that your information is locked inside a particular file. Consider the following, which shows our Flat ODT format file. While there’s a lot of extraneous information around it, you can see the actual data there, clear as day.
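Because the data sits in readable XML, a short script can pull it out without the original application. The sketch below uses Python's standard library on a hand-written, heavily simplified stand-in for a Flat ODT file. The `SAMPLE` text and the `extract_paragraphs` helper are assumptions for illustration, and a real FODT file contains much more markup.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written stand-in for a Flat ODT file.
SAMPLE = """<office:document
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <office:body>
    <office:text>
      <text:p>This is the actual data.</text:p>
    </office:text>
  </office:body>
</office:document>
"""

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def extract_paragraphs(xml_source: str) -> list[str]:
    """Return the text of every paragraph element in the document."""
    root = ET.fromstring(xml_source)
    return ["".join(p.itertext()) for p in root.iter(f"{{{TEXT_NS}}}p")]


print(extract_paragraphs(SAMPLE))  # ['This is the actual data.']
```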
Another benefit of open formats is their thorough documentation. It’s one thing for a file to be easily readable. Without documentation, a programmer would still need to figure out (through trial and error) exactly what each and every feature does. But in the case of ODF, the version 1.2 specification gives a programmer everything they need to know in order to implement support for it efficiently.
Lastly, proprietary formats are protected by their developers. It may be because they include trade secrets, for the purposes of (perceived) security, or simply because the developer doesn’t want to share their work. Whatever the reason, these formats are proprietary by virtue of End User License Agreements (EULAs) or other terms forbidding the user from trying to reverse engineer or otherwise “crack” the file format.
Reverse engineering was once merely “forbidden,” but the Digital Millennium Copyright Act (DMCA) has changed things. Developers now have the legal backing to go after those who reverse engineer their work. You should think about the future before investing in an application that uses a proprietary format. Will you need to migrate that information to somewhere else in the future? If so, how painful will it be? Will the company even be around in a year, or five? You should consider whether an app’s features are worth it if it also means being locked into that developer due to proprietary formats.
Examples of File Formats
If you look at the above, a couple of combinations will jump out at you. It’s true that text-based file formats lend themselves to being open. Likewise, if the goal for a format is to be proprietary, it’s easier to keep it that way by making it binary. But this isn’t always the case.
The GIMP’s XCF image format is an open format that is also binary. The project includes a detailed description of how the format holds the graphics, text, and layers that make up a GIMP file as raw bits and bytes (shown below). Developers can use this to code their own implementation so external applications like the ImageMagick toolkit can import them.
Conversely, the newest Microsoft Visio format (VSDX) is an XML-based (and thus text-based) format. Microsoft publishes a detailed reference of the make-up of these files. However, the reference document notes that Microsoft “has patents that might cover your implementation” of .VSDX support. In addition, the Library of Congress states that use of the VSDX specification “does not guarantee royalty-free license of all relevant patents.” This is another way of saying you can roll the dice and include this support, but Microsoft may or may not want you to pay for it later, depending on how closely your product competes with Visio.
If you think those are complicated, how about the non-flat ODT format? It’s a ZIP-format file (binary and open, unless you’re also using its encryption) that contains a document’s text (content.xml, an open text-based format) and graphics (e.g. PNG, binary but open).
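The "text inside a binary container" arrangement can be sketched with Python's standard `zipfile` module. The archive built here is a toy, not a valid ODT. The helper names and the XML inside `content.xml` are invented for illustration.

```python
import io
import zipfile


def build_package(paragraph: str) -> bytes:
    """Build a ZIP archive in memory holding one XML member (not a valid ODT)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("content.xml", f"<doc><p>{paragraph}</p></doc>")
    return buffer.getvalue()


def read_content(package: bytes) -> str:
    """Open the archive and return content.xml as text."""
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        return archive.read("content.xml").decode("utf-8")


package = build_package("Hello from inside a ZIP file.")
print(package[:2])            # b'PK', the start of a ZIP archive
print(read_content(package))  # the XML text stored inside
```

The outer layer is binary, so the reading program needs to understand ZIP, but once it does, the inner XML is ordinary text again.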
How Important Are File Formats, Really?
This is a difficult question. On one hand, some operating systems like iOS have tried to insulate users from dealing with files at all. You have the app that created the file to open it, so who cares about its structure or what its extension is? Yet many organizations (especially governments) have been pushing to make sure public data is in an open format.
If you’re a software idealist (not that there’s anything wrong with that), then as you’re evaluating new apps make sure they store your data in an open, preferably text-based, format. If you just want to get to work, then proprietary formats may not be an issue for you.
What do you think? Do you demand that your information reside in open, text-based formats you can convert and verify? Or are whatever formats the developers use, proprietary or not, enough for you? Let us know below in the comments!
Image Credits: Edilus/Shutterstock