owid/etl

Use `hashlib.file_digest` in Python 3.11+

Closed this issue · 1 comment

Just found out that Python 3.11 added hashlib.file_digest. I expect that it is more performant than our self-made file-hashing code, and that a large share of ETL processing time is spent hashing files.

This is the existing file hashing code we have:

etl/etl/files.py

Lines 85 to 95 in 2cb8d64

def checksum_file_nocache(filename: Union[str, Path]) -> str:
    "Return the md5 hex digest of the file without using cache."
    chunk_size = 2**20
    _hash = hashlib.md5()
    with open(filename, "rb") as istream:
        chunk = istream.read(chunk_size)
        while chunk:
            _hash.update(chunk)
            chunk = istream.read(chunk_size)
    return _hash.hexdigest()

It could be worth experimenting to see whether hashlib.file_digest is more performant than our code (note: one can also set _bufsize=2**20 manually) and, if so, replacing the existing code with a call to it when running on Python 3.11+.
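A version-gated variant might look like the sketch below. This is not the repo's actual implementation, just one way to wire it up: on 3.11+ it delegates to hashlib.file_digest (which reads into a reusable buffer via readinto, avoiding per-chunk bytes allocations), passing _bufsize=2**20 to match the existing 1 MiB chunk size; on older interpreters it keeps the current loop.

```python
import hashlib
import sys
from pathlib import Path
from typing import Union


def checksum_file_nocache(filename: Union[str, Path]) -> str:
    "Return the md5 hex digest of the file without using cache."
    if sys.version_info >= (3, 11):
        # hashlib.file_digest (3.11+) hashes the stream with a reusable
        # buffer; _bufsize matches the previous 1 MiB chunk size.
        with open(filename, "rb") as istream:
            return hashlib.file_digest(istream, "md5", _bufsize=2**20).hexdigest()
    # Fallback for Python < 3.11: manual chunked read loop.
    chunk_size = 2**20
    _hash = hashlib.md5()
    with open(filename, "rb") as istream:
        chunk = istream.read(chunk_size)
        while chunk:
            _hash.update(chunk)
            chunk = istream.read(chunk_size)
    return _hash.hexdigest()
```

Both branches return identical digests for the same file, so the gate only changes which code path does the reading.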

Nice find, I didn't know about this! I've tested it on checking checksums for all our steps with

etlr grapher --dry-run

Unfortunately, it didn't make a difference in performance. We already do a couple of performance optimizations, like caching and falling back to file size if the file is too large, so checksumming is not actually a bottleneck right now.
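For anyone who wants to reproduce the comparison in isolation (outside the etlr run above), a rough micro-benchmark might look like this. The helper names and the 50 MB test-file size are made up for illustration; the two functions are the manual loop from checksum_file_nocache and the native 3.11+ call, using the same 1 MiB buffer.

```python
import hashlib
import os
import sys
import tempfile
import time


def md5_manual(path: str, chunk_size: int = 2**20) -> str:
    "Manual chunked read loop, as in checksum_file_nocache."
    h = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


def md5_native(path: str) -> str:
    "hashlib.file_digest (Python 3.11+) with the same 1 MiB buffer."
    with open(path, "rb") as f:
        return hashlib.file_digest(f, "md5", _bufsize=2**20).hexdigest()


if __name__ == "__main__":
    # Write a throwaway 50 MB file for timing; the size is arbitrary.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(os.urandom(50 * 2**20))
        path = f.name
    try:
        for name, fn in [("manual", md5_manual), ("file_digest", md5_native)]:
            if name == "file_digest" and sys.version_info < (3, 11):
                continue  # native function unavailable on older interpreters
            t0 = time.perf_counter()
            digest = fn(path)
            print(f"{name}: {time.perf_counter() - t0:.3f}s ({digest})")
    finally:
        os.unlink(path)
```

On a warm page cache the two tend to be close, which is consistent with the finding above that checksumming isn't the bottleneck here.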

I'll add a comment to checksum_file_nocache, and once we drop support for 3.10, we can switch to the native function.