ENH: Improving file-change detection.
Closed this issue · 4 comments
Is your feature request related to a problem?
Hitting save on a Python file changes its modification timestamp and triggers a rerun of related tasks even if the content is unchanged.
Is it a feature or a bug?
Describe the solution you'd like
Let us improve the modification detection of Python files. (Only Python!)
Mypy has a two-level hashing solution: https://github.com/python/mypy/issues/3403
- Hash of modification time. If the comparison fails ...
- Hash of content. If the comparison fails, do stuff.
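A minimal sketch of what such a two-level check could look like. This is not how pytask or mypy actually implement it. The names `hash_file` and `has_changed` are hypothetical, and the sketch compares the raw modification time instead of a hash of it:

```python
import hashlib
from pathlib import Path


def hash_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file's content, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path: Path, stored_mtime: float, stored_hash: str) -> bool:
    """Hypothetical two-level check against values stored from the last run."""
    # Level 1: cheap comparison of the modification time.
    if path.stat().st_mtime == stored_mtime:
        return False
    # Level 2: the timestamp differs, so compare the content hash.
    return hash_file(path) != stored_hash
```

With this, saving a file without editing it fails the first comparison but passes the second, so the task would not need to rerun.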
User behavior-breaking implications
Maybe we need a way to force executing tasks. Deleting the product might be more cumbersome than a --force flag.
API breaking implications
None
I would like this feature!
If it is not too slow, it could be nice to standardize the files a bit before comparing the content, e.g. run black and isort and strip away all comments and docstrings.
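A rough sketch of one way to do this without running black or isort: hash the parsed AST instead of the raw text. This is a swapped-in technique, not the formatter-based approach described above. It ignores comments, whitespace and docstrings, but unlike isort it does not normalize import order. The name `normalized_hash` is hypothetical:

```python
import ast
import hashlib

_DOCSTRING_OWNERS = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)


def normalized_hash(source: str) -> str:
    """Hash Python source so that comments, formatting and docstrings are ignored."""
    tree = ast.parse(source)  # comments and whitespace are dropped by the parser
    for node in ast.walk(tree):
        if isinstance(node, _DOCSTRING_OWNERS) and node.body:
            first = node.body[0]
            # A docstring is a bare string expression in first position.
            if (
                isinstance(first, ast.Expr)
                and isinstance(first.value, ast.Constant)
                and isinstance(first.value.value, str)
            ):
                node.body = node.body[1:]
    # ast.dump omits line and column numbers by default.
    return hashlib.sha256(ast.dump(tree).encode("utf-8")).hexdigest()
```

Two versions of a file that differ only in comments, spacing or docstrings would then produce the same hash, at the cost of parsing every changed file.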
One would hope pre-commit hooks would do that -- I would not want to see too much duplication of those things in my projects...
In general, may I re-up #338 in this respect? 😇 I guess that would allow a more general solution to the same problem (whether you hash Python inputs or contents of Python files...). Personally, I can see many use cases for hashing even smaller data files, particularly when tasks are very long-running compared to the size of the data.
Cool idea, Janos, to format before comparing! I am more in favour of the minimal solution, though.
I do not think it is related to #338. The issues use hashing as part of the implementation, but both require new interfaces, which is why they are currently stuck.
Going back to this issue, I think it is a good idea to
- Hash the file content when the modification time has changed, and skip execution if the content is the same.
- Add a --force flag to run tasks even if pytask believes nothing has changed.
I should profile hashing files first to see how costly it is.
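A quick way to get a first number could look like the sketch below. It is only a rough timing on a throwaway file, not a proper profile of pytask. The helper name `time_hashing` and the 10 MB file size are arbitrary choices for illustration:

```python
import hashlib
import os
import tempfile
import time
from pathlib import Path


def time_hashing(path: Path, repeats: int = 5) -> float:
    """Return the best wall-clock time in seconds to SHA-256 a file's content."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        hashlib.sha256(path.read_bytes()).hexdigest()
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        sample = Path(directory) / "sample.bin"
        sample.write_bytes(os.urandom(10 * 1024 * 1024))  # 10 MB of random data
        print(f"best of 5: {time_hashing(sample):.4f} s")
```

Since the hash would only be computed when the modification time has changed, the cost that matters is per changed file, not per file in the project.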
> I do not think it is related to #338. The issues use hashing as part of the implementation
Maybe I just don't get it, but don't they solve the same problem? I.e., allow for hashing instead of or on top of modification times for FileNodes? It still seems to me that #338 is a superset of this issue -- the only question is about the interface, i.e., how much control you want to leave to the user (here: use a default plus force; there: decide which one to use; maybe a third idea: some combination).