pytask-dev/pytask

ENH: Improving file-change detection.

Closed this issue · 4 comments

Is your feature request related to a problem?

Hitting save on a Python file changes its modification timestamp and triggers a rerun of related tasks even if the content is unchanged.

Is it a feature or a bug?

Describe the solution you'd like

Let us improve the modification detection of Python files. (Only Python!)

Mypy has a two-level hashing solution: https://github.com/python/mypy/issues/3403.

  1. Hash of modification time. If the comparison fails ...
  2. Hash of content. If the comparison fails, do stuff.
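
A minimal sketch of such a two-level check, for illustration only. It is neither mypy's nor pytask's actual implementation. The names `content_hash` and `has_changed` are hypothetical, and it is an assumption here that the stored modification time is compared directly instead of being hashed.

```python
import hashlib
from pathlib import Path


def content_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def has_changed(path: Path, stored_mtime: float, stored_hash: str) -> bool:
    """Two-level check: cheap timestamp comparison first, content hash second."""
    # Level 1: an identical modification time is taken to mean "unchanged".
    if path.stat().st_mtime == stored_mtime:
        return False
    # Level 2: the timestamp differs, so only the content hash decides.
    return content_hash(path) != stored_hash
```

With this ordering the file is only read when its timestamp differs, so saving a file without editing it costs one hash and triggers no rerun.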

User behavior-breaking implications

Maybe we need a way to force executing tasks. Deleting the product might be more cumbersome than a --force flag.

API breaking implications

None

I would like this feature!

If it is not too slow, it could be nice to standardize the files a bit before comparing the content, e.g., run black and isort and strip away all comments and docstrings.
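
One way to approximate this without running external formatters is to hash the parsed syntax tree instead of the raw text. The sketch below uses a different technique from black or isort: it relies on Python's `ast` module, which drops comments and whitespace on parsing, and removes docstrings by hand. It does not reorder imports. `normalized_hash` is a hypothetical name, not part of pytask.

```python
import ast
import hashlib

# Node types whose first statement may be a docstring.
_DOCSTRING_OWNERS = (ast.Module, ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)


def normalized_hash(source: str) -> str:
    """Hash Python source while ignoring comments, formatting and docstrings."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, _DOCSTRING_OWNERS) or not node.body:
            continue
        first = node.body[0]
        is_docstring = (
            isinstance(first, ast.Expr)
            and isinstance(first.value, ast.Constant)
            and isinstance(first.value.value, str)
        )
        if is_docstring:
            # Keep the body non-empty so the tree stays a valid program.
            node.body = node.body[1:] or [ast.Pass()]
    # ast.dump omits line and column numbers by default, so layout is ignored.
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()
```

Two files that differ only in comments, spacing, or docstrings then produce the same digest, while any change to the executed code produces a different one.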

One would hope pre-commit hooks would do that -- I would not want to see too much duplication of those things in my projects...

In general, may I re-up #338 in this respect? 😇 I guess that would allow a more general solution to the same problem, whether you hash Python inputs or the contents of Python files. Personally, I can see many use cases for hashing even smaller data files, particularly when tasks are very long-running compared to the size of the data.

Cool idea, Janos, to format before comparing! I am more in favour of the minimal solution, though.

I do not think it is related to #338. The issues use hashing as part of the implementation, but both require new interfaces, which is why they are currently stuck.

Going back to this issue, I think it is a good idea to

  • Hash the file content when the modification time has changed, and skip execution if the content is the same.
  • Add a --force flag to execute tasks even if pytask believes nothing has changed.
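
A rough sketch of how the two points could combine, under stated assumptions: `needs_rerun` and the `state` dictionary are hypothetical and do not reflect pytask's internals, and `force` stands in for the proposed --force flag.

```python
import hashlib
from pathlib import Path


def needs_rerun(path: Path, state: dict, force: bool = False) -> bool:
    """Decide whether a task depending on ``path`` should run again.

    ``state`` maps str(path) to (mtime, sha256 digest) and is updated in place.
    """
    key = str(path)
    mtime = path.stat().st_mtime
    previous = state.get(key)

    # Timestamp unchanged: skip hashing entirely unless a rerun is forced.
    if previous is not None and previous[0] == mtime:
        return force

    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    unchanged = previous is not None and previous[1] == digest
    # Record the new timestamp so the next check takes the cheap path.
    state[key] = (mtime, digest)
    return force or not unchanged
```

Storing the new timestamp after a matching hash means an untouched save is hashed only once; later checks fall back to the timestamp comparison.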

I should profile file hashing first to see how costly it is.
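
A quick way to get a first number, as a sketch only. `seconds_per_hash` is a hypothetical helper, SHA-256 is an assumed choice of hash, and repeated reads will mostly hit the operating system's file cache, so the result is closer to a lower bound than to a cold-start cost.

```python
import hashlib
import time
from pathlib import Path


def seconds_per_hash(path: Path, repeats: int = 20) -> float:
    """Return the average wall-clock time to read and hash ``path`` once."""
    start = time.perf_counter()
    for _ in range(repeats):
        hashlib.sha256(path.read_bytes()).hexdigest()
    return (time.perf_counter() - start) / repeats
```

Running this over the Python files of a typical project and summing the results gives an estimate of the worst case, where every file has a new timestamp.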

I do not think it is related to #338. The issues use hashing as part of the implementation

Maybe I just don't get it, but don't they solve the same problem? I.e., allow for hashing instead of or on top of modification times for FileNodes? It still seems to me that #338 is a superset of this issue -- the only question is about the interface, i.e., how much control you want to leave to the user (here: use default + force; there: decide which one to use; or maybe some combination of the two)?