I recently had to transfer many files to my NAS using questionable hardware like USB sticks, SD cards, and external hard drives. Having done this before, I naturally wanted hashes of all the files so I could check their integrity. It’s surprising how many errors can occur in a few gigabytes. This turned out to be a bit of a mission.
Note to self: FreeNAS has been renamed TrueNAS for a while now.
SHA256 file hash format
All sane operating systems easily support generating and checking SHA256 file hashes. The hashes and file names are stored line by line in a file, e.g. this example sha256.txt (hashes truncated for brevity):
7cb19e63686...  Foo/important_document1.txt
ee0a07210ad...  Foo/important_document2.txt
Sorting SHA256 file hashes
In a file of hashes, the order of the lines doesn’t matter. After all, they’re just lines of hashes and file names. But it can be useful to sort them by file name for manual review or comparison/diffing. (Note that the find command does not guarantee that files are returned in alphabetical order, and it’s quicker to sort them afterwards anyway.)
sort -k2 sha256.txt > sha256-sorted.txt
This sorts the lines by key #2, i.e. the file name.
Checking SHA256 file hashes
Linux
sha256sum --check sha256.txt
TrueNAS/FreeBSD
shasum -a 256 -c sha256.txt
macOS
macOS has shasum (/usr/bin/shasum), and so the command is the same as on FreeBSD:
shasum -a 256 -c sha256.txt
Or, if GNU coreutils are installed, sha256sum or the prefixed version gsha256sum can be used, as with Linux:
gsha256sum --check sha256.txt
Windows
Nope, sorry. Maybe you could massage the data into a catalog file (.cat) and use Test-FileCatalog.
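If you really do need to check a sha256.txt natively on Windows, a small Python script will do. The following is only a minimal sketch, not a replacement for shasum; it assumes the standard "hash, two spaces, relative path" format, and that it is run from the directory the paths are relative to (the file name sha256.txt is just the example from above):
import hashlib

# Minimal sketch: verify shasum/sha256sum-style lines ("<hash>  <path>").
# Assumes the script is run from the directory the paths are relative to.
# (Reads each file fully into memory; fine for a sketch.)
with open('sha256.txt', encoding='utf-8') as hashes:
    for line in hashes:
        line = line.rstrip('\n')
        if not line:
            continue
        expected, path = line.split(maxsplit=1)
        # binary-mode entries prefix the file name with '*'
        path = path.lstrip('*')
        with open(path, 'rb') as data:
            actual = hashlib.sha256(data.read()).hexdigest()
        print(f'{path}: {"OK" if actual == expected.lower() else "FAILED"}')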
Generating SHA256 file hashes
Linux
find "<dir>/" -type f -exec sha256sum '{}' \; > sha256.txt
TrueNAS/FreeBSD
find "<dir>/" -type f -exec shasum -a 256 '{}' \; > sha256.txt
macOS
Note: On macOS, and unlike Linux or FreeBSD, when using find it is useful to ensure the directory name doesn’t include a trailing slash. Otherwise, the path names end up with a double slash:
$ find "<dir>/" -type f
<dir>//<file>
$ find "<dir>" -type f
<dir>/<file>
This is rather annoying, as auto-completing a directory name will add the trailing slash. However, I don’t think the double slash actually makes a difference when checking; avoiding it is purely for aesthetic purposes, or might be useful when comparing generated hash files manually.
macOS has shasum (/usr/bin/shasum), and so the command is the same as on FreeBSD:
find "<dir>" -type f -exec sha256sum '{}' \; > sha256.txt
Or, if GNU coreutils are installed, sha256sum or the prefixed version gsha256sum can be used, as with Linux:
find "<dir>" -type f -exec gsha256sum '{}' \; > sha256.txt
Windows
With PowerShell, this isn’t too painful. One option is Get-ChildItem + Get-FileHash + Export-Csv. Export-Csv is probably the easiest option to avoid truncating the file path (!), as e.g. redirection or piping to Out-File would.
Get-ChildItem -File -Recurse -Path "D:\Foo" |
Get-FileHash |
Export-Csv -Path "hashes.csv" -Encoding "UTF8NoBOM" -NoTypeInformation
“By default, the Get-FileHash cmdlet uses the SHA256 algorithm, although any hash algorithm that is supported by the target operating system can be used.” (Get-FileHash reference)
I also believe Get-FileHash does the right thing and calculates the hash of the files in binary, and not in some weird text mode (more on this later).
Another option is New-FileCatalog. In any case, the output file will need to be massaged into a format compatible with shasum/sha256sum. It might be possible to do this in PowerShell…? But even the PowerShell snippets I found to turn an absolute path (returned by Get-ChildItem) into a relative path were horrible. And also, Windows paths. So, Python. But first…
Unicode normalization
A very real problem is Unicode normalization (normalisation). For example, the German character ä can be encoded as \xc3\xa4 (normal form C, NFC) or as a\xcc\x88 (normal form D, NFD).
- The CSV file from Windows was NFC, and “form C” is “generally” the preferred form in Windows-land.
- macOS generally uses NFD.
- Linux and BSD don’t really care, as with POSIX, file names and paths are just a bag of bytes.
- ZFS, and therefore TrueNAS, is generally set up to be POSIX-compatible, but can be configured to require valid UTF-8 file names and paths, as well as to perform normalisation. You’ll know if you set this up.
Depending on where the SHA256 file hashes were generated, where they are checked, and whether any file names contain non-ASCII characters, you may need to convert between forms. You can convert between these forms using either Python, or a command-line utility.
>>> import unicodedata
>>> unicodedata.normalize('NFC', 'ä').encode('utf-8')
b'\xc3\xa4'
>>> unicodedata.normalize('NFD', 'ä').encode('utf-8')
b'a\xcc\x88'
For the command-line utility, I recommend uconv from ICU:
$ echo -n 'ä' | uconv -x any-nfc | hexdump -e '16/1 "%02x " "\n"'
c3 a4
$ echo -n 'ä' | uconv -x any-nfd | hexdump -e '16/1 "%02x " "\n"'
61 cc 88
It can be applied to an entire file:
uconv -x any-nfc --output "sha256-conv.txt" "sha256.txt"
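If you’d rather stay in Python, the same whole-file conversion is only a few lines. A minimal sketch (the file names are just the placeholders used above):
import unicodedata

# Normalize every line of a hash file to NFC (use 'NFD' for the other form)
with open('sha256.txt', encoding='utf-8') as src, \
        open('sha256-conv.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(unicodedata.normalize('NFC', line))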
Converting SHA256 file hashes from Windows to everything else
We want to go from
"SHA256","7CB19E6368...","D:\Foo\important_document.txt"
to
7cb19e63686...  Foo/important_document.txt
(Hashes truncated for brevity)
There are 3 or 4 steps:
- Parse the CSV file, skip header if needed
- Convert the SHA256 hash to lower-case (might as well)
- Convert the path from a Windows absolute path to a POSIX relative path
- Normalise Unicode if needed
Also note that because Windows is a great product and operating system, if you did not use the encoding “UTF8NoBOM”, it might prepend a byte sequence called a Byte Order Mark (BOM) to the CSV file. This is useless for UTF-8. In this case, you should pass the encoding='utf_8_sig' option to open() so that Python can read the file, like so:
open("hashes.csv", "r", newline="", encoding="utf_8_sig")
So finally, the full script:
import csv
from pathlib import PureWindowsPath

# change this to your base path
BASE_PATH = PureWindowsPath('D:/')


def convert_row(row):
    hash_algo, hash, path = row
    # check the algorithm is as expected
    assert hash_algo == 'SHA256', row
    # hash outputted by `shasum`/`sha256sum` is lowercase
    hash = hash.lower()
    # strip BASE_PATH and convert to POSIX
    path = PureWindowsPath(path).relative_to(BASE_PATH).as_posix()
    # if needed, perform Unicode normalization here
    # `shasum`/`sha256sum` expect two spaces between the hash and the file name
    return f'{hash}  {path}\n'


# use `encoding='utf_8_sig'` if needed
with open('hashes.csv', 'r', newline='') as csvfile, open('sha256.txt', 'w') as f:
    reader = csv.reader(csvfile)
    # skip header
    _header = next(reader)
    for row in reader:
        f.write(convert_row(row))
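If the hashes need to end up in a different normal form for the target system, the “perform Unicode normalization here” placeholder in convert_row could be filled in roughly like this (add import unicodedata at the top; whether you want NFC or NFD depends on where the hashes will be checked):
    # e.g. convert the path to NFD for checking on macOS
    path = unicodedata.normalize('NFD', path)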
That’s pretty much it. Fin.
Tangent: Rant about shasum/sha256sum
Lest it be said that I’m biased towards POSIX… well, I am. But both shasum and sha256sum have some, err, interesting options.
$ shasum --help
Usage: shasum [OPTION]... [FILE]...
Print or check SHA checksums.
With no FILE, or when FILE is -, read standard input.
-a, --algorithm 1 (default), 224, 256, 384, 512, 512224, 512256
-b, --binary read in binary mode
-c, --check read SHA sums from the FILEs and check them
--tag create a BSD-style checksum
-t, --text read in text mode (default)
-U, --UNIVERSAL read in Universal Newlines mode
produces same digest on Windows/Unix/Mac
-0, --01 read in BITS mode
ASCII '0' interpreted as 0-bit,
ASCII '1' interpreted as 1-bit,
all other characters ignored
The following five options are useful only when verifying checksums:
--ignore-missing don't fail or report status for missing files
-q, --quiet don't print OK for each successfully verified file
-s, --status don't output anything, status code shows success
--strict exit non-zero for improperly formatted checksum lines
-w, --warn warn about improperly formatted checksum lines
[...]
Some questions:
- Do I need to specify --binary if --text is the default? Why is text mode the default? What does this even do?
- Without --status, does shasum not return with a non-zero status code on error?
- Do I need to specify --warn or --strict? What’s the default here? Silence? I could see --warn being useful if you know there’s a malformed line or so. I would hope the default is a printed warning, and a non-zero status code, but not to abort checking other files.
Let’s see if sha256sum is any better.
$ sha256sum --help
Usage: sha256sum [OPTION]... [FILE]...
Print or check SHA256 (256-bit) checksums.
With no FILE, or when FILE is -, read standard input.
-b, --binary read in binary mode
-c, --check read SHA256 sums from the FILEs and check them
--tag create a BSD-style checksum
-t, --text read in text mode (default)
-z, --zero end each output line with NUL, not newline,
and disable file name escaping
The following five options are useful only when verifying checksums:
--ignore-missing don't fail or report status for missing files
--quiet don't print OK for each successfully verified file
--status don't output anything, status code shows success
--strict exit non-zero for improperly formatted checksum lines
-w, --warn warn about improperly formatted checksum lines
[...]
Note: There is no difference between binary mode and text mode on GNU systems.
[...]
Well, apart from the misaligned table, at least this makes it clear that -b/--binary is not needed on UNIX systems. The sha256sum flags are the same as for md5sum, and the online documentation is much more in-depth, but it still doesn’t completely explain the options.