Versioning Word Documents In Git

At first I didn’t know if I should write this email. I really, really, really do not like dealing with Word documents. It has nothing to do with Word specifically as a product; I hate documents in that kind of format in general, including the stuff OpenOffice.org produces. I don’t like working with WYSIWYG documents, at all. One argument I can make against using Word files on projects is that you can’t meaningfully put them in a repository.

Well, this isn’t true. You can do it, and you can even diff Word documents. So ultimately I decided it is more helpful to share this information than to hide it in an attempt to keep people from using that god-awful format. Of course, I’m going to regret it as soon as there’s some Word document in one of the repositories…

A rarely used feature of Git (in my experience) is its ability to assign ‘attributes’ to files. You do this by making a .gitattributes file in the repository. It is a text file that maps file names or globs to attributes. A simple example would be:

*.fl[av] binary

This tells Git that all ‘flv’ and ‘fla’ files are binary, and therefore Git should never try to diff them or perform any CRLF conversions, regardless of any other settings.
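If you want to check what Git makes of an attributes file, ‘git check-attr’ will report the attributes it would apply to a path. Here is a minimal sketch in a throwaway repository; the file names intro.flv and notes.txt are made up for illustration and do not need to exist.

```shell
# Work in a throwaway repository so nothing here touches a real project.
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Mark Flash files as binary, exactly as in the example above.
printf '*.fl[av] binary\n' > .gitattributes

# Ask Git which attributes it would apply to two hypothetical paths.
flv_attrs=$(git check-attr diff text -- intro.flv)
txt_attrs=$(git check-attr diff text -- notes.txt)
echo "$flv_attrs"
echo "$txt_attrs"
```

The ‘binary’ attribute is a macro for ‘-diff -merge -text’, which is why the Flash file should report ‘diff’ and ‘text’ as unset while the text file reports them as unspecified.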

Something else we can do with attributes is control how diffs are generated for files. For our specific task here, we want to tell Git to use our customized ‘diff driver’ for Word documents. We can start out by putting this in our attributes file:

*.doc diff=word

Now whenever Git diffs ‘doc’ files it will invoke the ‘word’ driver, which means we now have to define that driver. We can do this in one of three places.

    1. Our personal, global .gitconfig file.
    2. A .gitconfig file tracked in the repository that can be shared by developers. Git does not read such a file on its own, so each developer has to pull it in once through an ‘include.path’ entry in their .git/config.
    3. The .git/config file in the repository, which is not shared.

Adding support for diffing certain files is something we typically want to share with everyone on a project, so the second choice makes the most sense here. But the way we define the driver is the same regardless of where we actually do it. First I will show you what we have to put in the file to define the driver, then discuss it.

    [diff "word"]
        textconv = strings

The first line should look familiar if you have messed around with your .gitconfig file before; it is your typical INI file section header. When we assigned the attribute ‘diff=word’, that meant Git would look for the section ‘[diff "word"]’ for the definition. The second line sets the ‘textconv’ property of the driver; this property names a program or command that is capable of translating the file into a text format which Git can then diff like normal. The ‘strings’ program is part of the GNU binutils package, which you can get on all platforms. It rips out all of the printable strings from a binary file.
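For the shared option, one way to set things up (a sketch, not the only approach) is to let ‘git config’ write the section into a tracked file and then point each clone at it with ‘include.path’. The file name .gitconfig inside the repository is only a convention I am assuming here; Git attaches no special meaning to it. The relative path is resolved from the location of .git/config.

```shell
# Work in a throwaway repository.
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Write the driver definition into a file that can be committed and shared.
git config --file .gitconfig diff.word.textconv strings
cat .gitconfig

# Each developer opts in once per clone by including the shared file.
git config --local include.path ../.gitconfig

# Confirm that Git now resolves the driver through the include.
driver=$(git config --get diff.word.textconv)
echo "$driver"
```

Because the include lives in .git/config, it is a per-clone, opt-in step; nobody gets a new diff command run on their machine just by pulling.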

With that said, it should be clear now how this helps us diff Word documents. Our driver passes in the ‘doc’ file to a program that can take out all of the printable strings. Even though Word is a binary format, it stores the text of the document as text strings that we can pull out. Once we have done that, Git is capable of diffing the file like normal, and we can meaningfully use tools like ‘git log -p’ to get an idea of the changes that some commit made to a Word document.
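To see the whole thing working end to end, here is a sketch you can run in a scratch directory. The file report.doc below is not a real Word document, just a few bytes of binary junk wrapped around a readable sentence, and the author identity is a placeholder. It assumes ‘strings’ is on your PATH.

```shell
# Work in a throwaway repository with a placeholder identity.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example Author"
git config user.email "author@example.com"

# Route *.doc files through the 'word' driver and define it locally.
printf '*.doc diff=word\n' > .gitattributes
git config diff.word.textconv strings

# Stand-in for a Word file: NUL and control bytes around readable text.
printf '\000\001\002Quarterly report draft\000\003\004' > report.doc
git add .gitattributes report.doc
git commit -q -m "Add report"

# Change the readable text and commit again.
printf '\000\001\002Quarterly report final\000\003\004' > report.doc
git commit -q -am "Finalize report"

# The patch for the last commit is built from the output of 'strings'.
patch=$(git log -p -1 -- report.doc)
echo "$patch"
```

The patch should show the old sentence removed and the new one added, rather than a bare notice that two binary files differ.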

This technique can be used with any file format for which you can generate meaningful text output. For example, if you use a tool to take the metadata out of image files then you can make a driver for that and get useful diff info. This never affects the way Git stores these files; they will still be handled just like any other binary file. The benefits are only cosmetic, allowing us to use Git’s diffing tools to get a better idea of what changes have been applied to those binary files. But nonetheless, that information can be very useful when working with such files.
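As a sketch of the image idea, the setup below wires JPEG and PNG files to a driver I have called ‘exif’ that would run ExifTool. It only records and checks the configuration; it assumes you would install ‘exiftool’ separately, and the driver name, file patterns, and photo.jpg are my own choices for illustration.

```shell
# Work in a throwaway repository.
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Send image files through a hypothetical 'exif' driver.
printf '*.jpg diff=exif\n*.png diff=exif\n' > .gitattributes

# Define the driver; exiftool must be installed for diffs to actually work.
git config diff.exif.textconv exiftool

# Check that the attribute and the driver are both in place.
img_attr=$(git check-attr diff -- photo.jpg)
img_driver=$(git config --get diff.exif.textconv)
echo "$img_attr"
echo "$img_driver"
```

With that in place, a diff of two versions of an image would compare whatever metadata the tool prints, not the pixels themselves.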


6 Awesome Comments So Far

  1. U Avalos
    January 19, 2011 at 2:51 PM

    Awesome. However, if I need to revert a file, will it revert to the converted text file or the binary version?

  2. Jonathan Raphael Schmid
    May 18, 2011 at 6:48 AM

    Like Eric puts it, the changes are only “cosmetic” – there actually aren’t any, as this only affects how you look at the document. The file itself remains unchanged.

    Thanks for this post, Eric!

  3. David Eads
    May 23, 2011 at 3:32 PM

    I understand the fear of not wanting to encourage people to use MS Word. Yech!

    But this is a very clear explanation of a powerful technique that can be used to allow Git to be used with exotic files as well as store pure binary files efficiently — MS Word or no. I’ll be using this technique to diff PDFs in my nonprofit’s asset database. Awesome.

  4. Cary Howell
    August 27, 2011 at 9:08 AM

    Nice article. The problem with .docx is that Microsoft zips the document file, otherwise we could just diff the document.xml between two revisions.
