I recently had to transfer many files to my NAS using questionable hardware like USB sticks, SD cards, and external hard drives. Having done this before, I naturally wanted hashes of all the files so I could check their integrity. It’s surprising how many errors can occur in a few gigabytes. This turned out to be a bit of a mission.
Note to self: FreeNAS has been renamed TrueNAS for a while now.
SHA256 file hash format
All sane operating systems easily support generating and checking SHA256 file hashes. The hashes and file names are stored line by line in a file, e.g. this example sha256.txt (hashes truncated for brevity):
7cb19e63686...  Foo/important_document1.txt
ee0a07210ad...  Foo/important_document2.txt
Sorting SHA256 file hashes
In a file of hashes, the order of the lines doesn’t matter. After all, they’re just lines of hashes and file names. But it can be useful to sort them by file name for manual review or comparison/diffing. (Note that the find command does not guarantee that files are returned in alphabetical order, and it’s quicker to sort them afterwards anyway.)
sort -k2 sha256.txt > sha256-sorted.txt
This sorts the lines by key #2, i.e. the file name.
Checking SHA256 file hashes
Linux
sha256sum --check sha256.txt
TrueNAS/FreeBSD
shasum -a 256 -c sha256.txt
macOS
macOS has shasum (/usr/bin/shasum), and so the command is the same as on FreeBSD:
shasum -a 256 -c sha256.txt
Or, if GNU coreutils are installed, sha256sum or the prefixed version gsha256sum can be used, as with Linux:
gsha256sum --check sha256.txt
Windows
Nope, sorry. Maybe you could massage the data into a catalog file (.cat) and use Test-FileCatalog.
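If you really do need to check a sha256.txt natively on Windows, a small Python script will do. The following is only a minimal sketch, not a replacement for shasum; it assumes the standard "hash, two spaces, relative path" format, and that it is run from the directory the paths are relative to (the file name sha256.txt is just the example from above):
import hashlib

# Minimal sketch: verify shasum/sha256sum-style lines ("<hash>  <path>").
# Assumes the script is run from the directory the paths are relative to.
# (Reads each file fully into memory; fine for a sketch.)
with open('sha256.txt', encoding='utf-8') as hashes:
    for line in hashes:
        line = line.rstrip('\n')
        if not line:
            continue
        expected, path = line.split(maxsplit=1)
        # binary-mode entries prefix the file name with '*'
        path = path.lstrip('*')
        with open(path, 'rb') as data:
            actual = hashlib.sha256(data.read()).hexdigest()
        print(f'{path}: {"OK" if actual == expected.lower() else "FAILED"}')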
Generating SHA256 file hashes
Linux
find "<dir>/" -type f -exec sha256sum '{}' \; > sha256.txt
TrueNAS/FreeBSD
find "<dir>/" -type f -exec shasum -a 256 '{}' \; > sha256.txt
macOS
Note: On macOS, and unlike Linux or FreeBSD, when using find it is useful to ensure the directory name doesn’t include a trailing slash. Otherwise, the path names end up with a double slash:
$ find "<dir>/" -type f
<dir>//<file>
$ find "<dir>" -type f
<dir>/<file>
This is rather annoying, as auto-completing a directory name will add the trailing slash. However, I don’t think the double slash actually makes a difference when checking; avoiding it is purely for aesthetic purposes, or might be useful when comparing generated hash files manually.
macOS has shasum (/usr/bin/shasum), and so the command is the same as on FreeBSD:
find "<dir>" -type f -exec sha256sum '{}' \; > sha256.txt
Or, if GNU coreutils are installed, sha256sum or the prefixed version gsha256sum can be used, as with Linux:
find "<dir>" -type f -exec gsha256sum '{}' \; > sha256.txt
Windows
With PowerShell, this isn’t too painful. One option is Get-ChildItem + Get-FileHash + Export-Csv. Export-Csv is probably the easiest option to avoid truncating the file path (!), as e.g. redirection or piping to Out-File would.
Get-ChildItem -File -Recurse -Path "D:\Foo" |
Get-FileHash |
Export-Csv -Path "hashes.csv" -Encoding "UTF8NoBOM" -NoTypeInformation
“By default, the Get-FileHash cmdlet uses the SHA256 algorithm, although any hash algorithm that is supported by the target operating system can be used.” (Get-FileHash reference)
I also believe Get-FileHash does the right thing and calculates the hash of the files in binary, and not in some weird text mode (more on this later).
Another option is New-FileCatalog. In any case, the output file will need to be massaged into a format compatible with shasum/sha256sum. It might be possible to do this in PowerShell…? But even the PowerShell snippets I found to turn an absolute path (returned by Get-ChildItem) into a relative path were horrible. And also, Windows paths. So, Python. But first…
Unicode normalization
A very real problem is Unicode normalization (normalisation). For example, the German character ä can be encoded as \xc3\xa4 (normal form C, NFC) or as a\xcc\x88 (normal form D, NFD).
- The CSV file from Windows was NFC, and “form C” is “generally” the preferred form in Windows-land.
- macOS generally uses NFD.
- Linux and BSD don’t really care, as with POSIX, file names and paths are just a bag of bytes.
- ZFS, and therefore TrueNAS, is generally set up to be POSIX-compatible, but can be configured to require valid UTF-8 file names and paths, as well as to perform normalisation. You’ll know if you set this up.
Depending on where the SHA256 file hashes were generated, where they are checked, and whether any file names contain non-ASCII characters, you may need to convert between forms. You can convert between these forms using either Python, or a command-line utility.
>>> import unicodedata
>>> unicodedata.normalize('NFC', 'ä').encode('utf-8')
b'\xc3\xa4'
>>> unicodedata.normalize('NFD', 'ä').encode('utf-8')
b'a\xcc\x88'
For the command-line utility, I recommend uconv from ICU:
$ echo -n 'ä' | uconv -x any-nfc | hexdump -e '16/1 "%02x " "\n"'
c3 a4
$ echo -n 'ä' | uconv -x any-nfd | hexdump -e '16/1 "%02x " "\n"'
61 cc 88
It can be applied to an entire file:
uconv -x any-nfc --output "sha256-conv.txt" "sha256.txt"
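If you’d rather stay in Python, the same whole-file conversion is only a few lines. A minimal sketch (the file names are just the placeholders used above):
import unicodedata

# Normalize every line of a hash file to NFC (use 'NFD' for the other form)
with open('sha256.txt', encoding='utf-8') as src, \
        open('sha256-conv.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(unicodedata.normalize('NFC', line))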
Converting SHA256 file hashes from Windows to everything else
We want to go from
"SHA256","7CB19E6368...","D:\Foo\important_document.txt"
to
7cb19e63686...  Foo/important_document.txt
(Hashes truncated for brevity)
There are 3 or 4 steps:
- Parse the CSV file, skip header if needed
- Convert the SHA256 hash to lower-case (might as well)
- Convert the path from a Windows absolute path to a POSIX relative path
- Normalise Unicode if needed
Also note that because Windows is a great product and operating system, if you did not use the encoding “UTF8NoBOM”, it might prepend a byte sequence called a Byte Order Mark (BOM) to the CSV file. This is useless for UTF-8. In this case, you should pass the encoding='utf_8_sig' option to open() so that Python can read the file, like so:
open("hashes.csv", "r", newline="", encoding="utf_8_sig")
So finally, the full script:
import csv
from pathlib import PureWindowsPath

# change this to your base path
BASE_PATH = PureWindowsPath('D:/')


def convert_row(row):
    hash_algo, hash, path = row
    # check the algorithm is as expected
    assert hash_algo == 'SHA256', row
    # hash outputted by `shasum`/`sha256sum` is lowercase
    hash = hash.lower()
    # strip BASE_PATH and convert to POSIX
    path = PureWindowsPath(path).relative_to(BASE_PATH).as_posix()
    # if needed, perform Unicode normalization here
    # `shasum`/`sha256sum` expect two spaces between the hash and the file name
    return f'{hash}  {path}\n'


# use `encoding='utf_8_sig'` if needed
with open('hashes.csv', 'r', newline='') as csvfile, open('sha256.txt', 'w') as f:
    reader = csv.reader(csvfile)
    # skip header
    _header = next(reader)
    for row in reader:
        f.write(convert_row(row))
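If the hashes need to end up in a different normal form for the target system, the “perform Unicode normalization here” placeholder in convert_row could be filled in roughly like this (add import unicodedata at the top; whether you want NFC or NFD depends on where the hashes will be checked):
    # e.g. convert the path to NFD for checking on macOS
    path = unicodedata.normalize('NFD', path)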
That’s pretty much it. Fin.
Tangent: Rant about shasum/sha256sum
Lest it be said that I’m biased towards POSIX… well, I am. But both shasum and sha256sum have some, err, interesting options.
$ shasum --help
Usage: shasum [OPTION]... [FILE]...
Print or check SHA checksums.
With no FILE, or when FILE is -, read standard input.
-a, --algorithm 1 (default), 224, 256, 384, 512, 512224, 512256
-b, --binary read in binary mode
-c, --check read SHA sums from the FILEs and check them
--tag create a BSD-style checksum
-t, --text read in text mode (default)
-U, --UNIVERSAL read in Universal Newlines mode
produces same digest on Windows/Unix/Mac
-0, --01 read in BITS mode
ASCII '0' interpreted as 0-bit,
ASCII '1' interpreted as 1-bit,
all other characters ignored
The following five options are useful only when verifying checksums:
--ignore-missing don't fail or report status for missing files
-q, --quiet don't print OK for each successfully verified file
-s, --status don't output anything, status code shows success
--strict exit non-zero for improperly formatted checksum lines
-w, --warn warn about improperly formatted checksum lines
[...]
Some questions:
- Do I need to specify --binary if --text is the default? Why is text mode the default? What does this even do?
- Without --status, does shasum not return with a non-zero status code on error?
- Do I need to specify --warn or --strict? What’s the default here? Silence? I could see --warn being useful if you know there’s a malformed line or so. I would hope the default is a printed warning, and a non-zero status code, but not to abort checking other files.
Let’s see if sha256sum is any better.
$ sha256sum --help
Usage: sha256sum [OPTION]... [FILE]...
Print or check SHA256 (256-bit) checksums.
With no FILE, or when FILE is -, read standard input.
-b, --binary read in binary mode
-c, --check read SHA256 sums from the FILEs and check them
--tag create a BSD-style checksum
-t, --text read in text mode (default)
-z, --zero end each output line with NUL, not newline,
and disable file name escaping
The following five options are useful only when verifying checksums:
--ignore-missing don't fail or report status for missing files
--quiet don't print OK for each successfully verified file
--status don't output anything, status code shows success
--strict exit non-zero for improperly formatted checksum lines
-w, --warn warn about improperly formatted checksum lines
[...]
Note: There is no difference between binary mode and text mode on GNU systems.
[...]
Well, apart from the misaligned table, at least this makes it clear that -b/--binary is not needed on UNIX systems. The sha256sum flags are the same as for md5sum, and the online documentation is much more in-depth, but it still doesn’t completely explain the options.