Excel compatible Unicode CSV files from Python

(Jump straight to the code).

When you have a terrible program like Excel, it tries to “be helpful” and guess the encoding of a CSV file (and almost always fails). Back in the 90s or 00s, because of all the different encodings this may have been helpful, but today it isn’t. Most of the time, you just want to use UTF-8. While exploring the wonderful world of CSV handling in Excel, I came across a way to write UTF-8 CSV files that Excel would read reliably every time. I thought I’d split it out into it’s own post.

Let’s take a step back. Unicode has several common encodings, UTF-8, UTF-16, and UTF-32. Because endianness, UTF-16 and UTF-32 must have a byte-order mark (BOM) at the start of a file. Here’s where it gets weird. UTF-8 can also have a BOM, even though it does nothing functionally and isn’t recommended. However, this UTF-8 BOM offers a valuable hint to Excel, and continue to work in Google Sheets, Numbers, and most text editors (even though those don’t need the BOM).

The UTF-8 BOM is 0xEF,0xBB,0xBF, that’s too much hard work to remember. Python - with batteries included - has a handy constant for the BOM:

>>> from codecs import BOM_UTF8
>>> BOM_UTF8
b'\xef\xbb\xbf'
>>> BOM_UTF8.decode('utf-8')
'\ufeff'

But wait! This is effort to add onto every file when saving, and strip out of every file when loading. There must be a better way? Python again delivers. There exists an encoding called utf-8-sig. It will add and strip out the BOM automatically. Additionally, it will read UTF-8 without the BOM just fine:

>>> 'Hello, world'.encode('utf-8-sig')
b'\xef\xbb\xbfHello, world'
>>> _.decode('utf-8-sig')
'Hello, world'
>>> 'Hello, world'.encode('utf-8')
b'Hello, world'
>>> _.decode('utf-8-sig')
'Hello, world'

Try it out (legacy Python 2 version further down):

Oh Python ❤️ how easy you make things.

And the legacy Python version:

If you’re still using Python 2 and want to write Unicode CSVs, I strongly recommend you use backports.csv. Do not use unicodecsv, even though it seems like it’s easier to use. It doesn’t force you to think as clearly about encoding as backports.csv, and consequently, migrating away from unicodecsv is a pain (if you’ll ever do it, and then you’ll have two different ways of generating CSVs). Don’t kick the can down the road, just fix it properly and use backports.csv!

The code above code will also run unmodified in Python 3 (if you have backports.csv installed). And as you can see, the first four lines are the only thing that changes, so dropping from backports.csv to Python 3’s CSV module is completely trivial.