When you have a terrible program like Excel, it tries to “be helpful” and guess the encoding of a CSV file (and almost always fails). Back in the 90s or 00s, because of all the different encodings this may have been helpful, but today it isn’t. Most of the time, you just want to use UTF-8. While exploring the wonderful world of CSV handling in Excel, I came across a way to write UTF-8 CSV files that Excel would read reliably every time. I thought I’d split it out into it’s own post.
Let’s take a step back. Unicode has several common encodings, UTF-8, UTF-16, and UTF-32. Because endianness, UTF-16 and UTF-32 must have a byte-order mark (BOM) at the start of a file. Here’s where it gets weird. UTF-8 can also have a BOM, even though it does nothing functionally and isn’t recommended. However, this UTF-8 BOM offers a valuable hint to Excel, and continue to work in Google Sheets, Numbers, and most text editors (even though those don’t need the BOM).
The UTF-8 BOM is
0xEF,0xBB,0xBF, that’s too much hard work to remember. Python - with batteries included - has a handy constant for the BOM:
>>> from codecs import BOM_UTF8 >>> BOM_UTF8 b'\xef\xbb\xbf' >>> BOM_UTF8.decode('utf-8') '\ufeff'
But wait! This is effort to add onto every file when saving, and strip out of every file when loading. There must be a better way? Python again delivers. There exists an encoding called
utf-8-sig. It will add and strip out the BOM automatically. Additionally, it will read UTF-8 without the BOM just fine:
>>> 'Hello, world'.encode('utf-8-sig') b'\xef\xbb\xbfHello, world' >>> _.decode('utf-8-sig') 'Hello, world' >>> 'Hello, world'.encode('utf-8') b'Hello, world' >>> _.decode('utf-8-sig') 'Hello, world'
Try it out (legacy Python 2 version further down):
Oh Python ❤️ how easy you make things.
And the legacy Python version:
If you’re still using Python 2 and want to write Unicode CSVs, I strongly recommend you use
backports.csv. Do not use
unicodecsv, even though it seems like it’s easier to use. It doesn’t force you to think as clearly about encoding as
backports.csv, and consequently, migrating away from
unicodecsv is a pain (if you’ll ever do it, and then you’ll have two different ways of generating CSVs). Don’t kick the can down the road, just fix it properly and use
The code above code will also run unmodified in Python 3 (if you have
backports.csv installed). And as you can see, the first four lines are the only thing that changes, so dropping from
backports.csv to Python 3’s CSV module is completely trivial.