Use `hashlib.file_digest` in Python 3.11+
Just found out that Python 3.11 has added `hashlib.file_digest`. I expect it is more performant than our self-made file-hashing code, and I suspect that a large share of etl processing time is spent hashing files.
This is the existing file-hashing code we have (lines 85 to 95 in 2cb8d64):
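(The snippet itself isn't embedded here. As a rough illustration of the usual pattern such code follows; this is a sketch only, and the `md5` algorithm and `2**20` chunk size are assumptions, not necessarily what lines 85 to 95 actually use:)

```python
import hashlib

def checksum_file_nocache(path: str) -> str:
    # Read the file in fixed-size chunks and feed each chunk to the digest,
    # so memory use stays constant regardless of file size.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(2**20):
            digest.update(chunk)
    return digest.hexdigest()
```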
It could be worth experimenting whether `hashlib.file_digest` is more performant than our code (note: one can also set `_bufsize=2**20` manually), and if so, replace the existing code to call it when running on Python 3.11+.
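The swap could be as simple as a version check. A minimal sketch, assuming an `md5` digest and a hypothetical `checksum_file` wrapper (note that `_bufsize` is a keyword-only, underscore-prefixed parameter of `hashlib.file_digest`, so it may change between Python versions):

```python
import hashlib
import sys

def checksum_file(path: str) -> str:
    with open(path, "rb") as f:
        if sys.version_info >= (3, 11):
            # file_digest reads the file in chunks internally; _bufsize
            # defaults to 2**18 and can be bumped to 2**20 as suggested above.
            return hashlib.file_digest(f, "md5", _bufsize=2**20).hexdigest()
        # Pre-3.11 fallback: manual chunked read, same as the existing code.
        digest = hashlib.md5()
        while chunk := f.read(2**20):
            digest.update(chunk)
        return digest.hexdigest()
```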
Nice find, I didn't know about this! I've tested it by checking checksums for all our steps with `etlr grapher --dry-run`.
Unfortunately, it didn't make a difference in performance. We already do a couple of performance optimizations like caching and falling back to file size if the file is too large, so checksums are actually not a bottleneck right now.
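For context, the size fallback described above might look roughly like this. Purely illustrative: the threshold, algorithm, and function name are assumptions, not the actual etl code:

```python
import hashlib
import os

MAX_HASH_SIZE = 2**30  # assumed cutoff; the real value may differ

def cheap_checksum(path: str) -> str:
    size = os.path.getsize(path)
    if size > MAX_HASH_SIZE:
        # Too large to hash the contents; derive a checksum from metadata only.
        return hashlib.md5(f"{path}:{size}".encode()).hexdigest()
    # Small enough: hash the actual contents in chunks.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(2**20):
            digest.update(chunk)
    return digest.hexdigest()
```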
I'll add a comment to `checksum_file_nocache`, and once we drop support for 3.10, we can use the native function.