owid/etl

Use `hashlib.file_digest` in Python 3.11+

Closed this issue · 1 comment

Just found out that Python 3.11 added hashlib.file_digest. I expect that it is more performant than our self-made file-hashing code, and that a large share of ETL processing time is spent hashing files.

This is the existing file hashing code we have:

etl/etl/files.py

Lines 85 to 95 in 2cb8d64

def checksum_file_nocache(filename: Union[str, Path]) -> str:
    "Return the md5 hex digest of the file without using cache."
    chunk_size = 2**20
    _hash = hashlib.md5()
    with open(filename, "rb") as istream:
        chunk = istream.read(chunk_size)
        while chunk:
            _hash.update(chunk)
            chunk = istream.read(chunk_size)
    return _hash.hexdigest()

It could be worth experimenting to see whether hashlib.file_digest is more performant than our code (note: one can also set _bufsize=2**20 manually) and, if so, replacing the existing code with a call to it when running on Python 3.11+.
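A version-gated variant might look like the sketch below. This is not the repo's actual implementation, just one way to wire it up: on 3.11+ it delegates to hashlib.file_digest (which reads into a reusable buffer via readinto, avoiding per-chunk bytes allocations), passing _bufsize=2**20 to match the existing 1 MiB chunk size; on older interpreters it keeps the current loop.

```python
import hashlib
import sys
from pathlib import Path
from typing import Union


def checksum_file_nocache(filename: Union[str, Path]) -> str:
    "Return the md5 hex digest of the file without using cache."
    if sys.version_info >= (3, 11):
        # hashlib.file_digest (3.11+) hashes the stream with a reusable
        # buffer; _bufsize matches the previous 1 MiB chunk size.
        with open(filename, "rb") as istream:
            return hashlib.file_digest(istream, "md5", _bufsize=2**20).hexdigest()
    # Fallback for Python < 3.11: manual chunked read loop.
    chunk_size = 2**20
    _hash = hashlib.md5()
    with open(filename, "rb") as istream:
        chunk = istream.read(chunk_size)
        while chunk:
            _hash.update(chunk)
            chunk = istream.read(chunk_size)
    return _hash.hexdigest()
```

Both branches return identical digests for the same file, so the gate only changes which code path does the reading.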

Nice find, I didn't know about this! I've tested it on checking checksums for all our steps with

etlr grapher --dry-run

Unfortunately, it didn't make a difference in performance. We already do a couple of performance optimizations, like caching and falling back to file size if the file is too large, so checksumming is not actually a bottleneck right now.
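For anyone who wants to reproduce the comparison in isolation (outside the etlr run above), a rough micro-benchmark might look like this. The helper names and the 50 MB test-file size are made up for illustration; the two functions are the manual loop from checksum_file_nocache and the native 3.11+ call, using the same 1 MiB buffer.

```python
import hashlib
import os
import sys
import tempfile
import time


def md5_manual(path: str, chunk_size: int = 2**20) -> str:
    "Manual chunked read loop, as in checksum_file_nocache."
    h = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


def md5_native(path: str) -> str:
    "hashlib.file_digest (Python 3.11+) with the same 1 MiB buffer."
    with open(path, "rb") as f:
        return hashlib.file_digest(f, "md5", _bufsize=2**20).hexdigest()


if __name__ == "__main__":
    # Write a throwaway 50 MB file for timing; the size is arbitrary.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(os.urandom(50 * 2**20))
        path = f.name
    try:
        for name, fn in [("manual", md5_manual), ("file_digest", md5_native)]:
            if name == "file_digest" and sys.version_info < (3, 11):
                continue  # native function unavailable on older interpreters
            t0 = time.perf_counter()
            digest = fn(path)
            print(f"{name}: {time.perf_counter() - t0:.3f}s ({digest})")
    finally:
        os.unlink(path)
```

On a warm page cache the two tend to be close, which is consistent with the finding above that checksumming isn't the bottleneck here.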

I'll add a comment to checksum_file_nocache, and once we drop support for 3.10, we can switch to the native function.