LUMC/pytest-workflow

feature request: fail if any unknown files exist

sstrong99 opened this issue · 9 comments

As far as I'm aware, it is currently only possible to check that specified output files match certain criteria. I'm also aware that you can specify that a certain file must not exist. It would also be useful for me to be able to specify that NO files other than the specified ones exist in a given directory.

Can you elaborate on this use case a bit? You have a workflow that generates files with completely random filenames (e.g. timestamped) when an error occurs?

How do you propose this would be implemented in an elegant fashion? Wouldn't it be easier to simply use the custom test interface for this problem?

I am working with a very large and complex workflow. When someone adds a new process, sometimes they don't intend for the outputs of that process to be published to the results directory because it is an intermediate step, but if they misconfigure things, those files can get published anyway. It would be nice if those mistakes could be caught in an automated way using this package.

It may be easier to do this with the custom test interface. I'll look into that.

I'm not familiar with the code itself, so I'm not sure of a good implementation, but the interface could look something like this:

- name: test1
  command: ...
  files:
    - path: output/ # the trailing slash indicates this is a directory, not a file
      should_exist: false # indicates that nothing in this directory should exist unless explicitly allowed
    - path: "output/file.txt"

I do think that is one of the best possible interfaces, but I don't like that the meaning of should_exist would suddenly depend on context. That would make the YAML harder to read.

What about a custom test?

import pathlib
from typing import Optional, Set

import pytest


def check_allowed_paths(
        directory: pathlib.Path,
        allowed_paths: Set[str],
        root: Optional[pathlib.Path] = None,
):
    """Recursively check that every file under directory is listed in allowed_paths."""
    if root is None:
        root = directory
    for path in directory.iterdir():
        if path.is_dir():
            check_allowed_paths(path, allowed_paths, root)
        elif path.is_file() or path.is_symlink():
            relpath = path.relative_to(root)
            if str(relpath) not in allowed_paths:
                raise FileExistsError(f"{relpath} not allowed!")


ALL_WORKFLOW_FILES = [
    "bam/test1.bam",
    "vcf/test1.vcf",
]

MULTISAMPLE_FILES = [
    "bam/test2.bam",
    "vcf/test2.vcf",
]

ANNOTATED_FILES = [
    "vep/test1.vep.vcf",
]


@pytest.mark.workflow("annotated")
def test_annotated_workflow(workflow_dir):
    check_allowed_paths(
        workflow_dir,  set(ALL_WORKFLOW_FILES + ANNOTATED_FILES))

@pytest.mark.workflow("multisample")
def test_annotated_workflow(workflow_dir):
    check_allowed_paths(
        workflow_dir,  set(ALL_WORKFLOW_FILES + MULTISAMPLE_FILES))

EDIT: Even better: you could use the yaml library and the __file__ variable to get the test YAML file (it should be possible to state its location relative to this file) and parse it. Then you can use the paths directly from the YAML to define the allowed files, eliminating code redundancy. I leave that as an exercise to the reader though; I am not going to code that in a GitHub comment window ;-).
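For the record, a rough, untested sketch of that idea could look like the following. The file name test.yml and its location next to the test module are assumptions; adjust them to your layout.

import pathlib

import yaml  # PyYAML

# Assumption: the pytest-workflow YAML lives next to this test module and is
# named test.yml.
TEST_YAML = pathlib.Path(__file__).parent / "test.yml"


def allowed_paths_per_workflow():
    # The YAML is a list of workflow tests; map each workflow name to the
    # set of paths declared under its `files` key.
    with TEST_YAML.open() as handle:
        tests = yaml.safe_load(handle)
    return {
        test["name"]: {entry["path"] for entry in test.get("files", [])}
        for test in tests
    }

The hard-coded ALL_WORKFLOW_FILES-style lists above could then be replaced by something like allowed_paths_per_workflow()["annotated"] inside the marked tests.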

Thanks! I started looking through the code to see if there was a way to automatically append this custom test to every workflow, without having to copy/paste the custom test for every new workflow. In the process I put together a rough example implementation in the code: #183

The interface is a bit different from what I mentioned above. In order to check for unknown files you'd add the fail_if_unknown key to the workflow test in the YAML, e.g.

- name: test1
  command: ...
  fail_if_unknown:
    - output/
    - some_other_dir/
  files:
    - path: "output/file.txt"

Sorry, I think this feature is too specific to your needs to warrant a change to the test YAML. Why not write a custom test as mentioned above?
You could automate that as well with pytest.mark.parametrize and parsing the YAML file yourself.
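A rough sketch of that approach, reusing check_allowed_paths from the earlier comment. The test.yml location and the trick of attaching the workflow marker per parameter via pytest.param are assumptions worth verifying against your pytest-workflow version.

import pathlib

import pytest
import yaml  # PyYAML

# Assumption: the pytest-workflow YAML lives next to this test module.
TEST_YAML = pathlib.Path(__file__).parent / "test.yml"

with TEST_YAML.open() as handle:
    WORKFLOW_TESTS = yaml.safe_load(handle)


@pytest.mark.parametrize(
    "allowed_paths",
    [
        # Attach the workflow marker per parameter so each case only runs
        # after (and inside the output directory of) the matching workflow.
        pytest.param(
            {entry["path"] for entry in test.get("files", [])},
            marks=pytest.mark.workflow(test["name"]),
            id=test["name"],
        )
        for test in WORKFLOW_TESTS
    ],
)
def test_no_unknown_files(workflow_dir, allowed_paths):
    # check_allowed_paths is the helper defined in the earlier comment.
    check_allowed_paths(workflow_dir, allowed_paths)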

Makes sense. The only reason I thought a custom test wasn't quite as good is because you have to duplicate the test collection. Do you know if there's a way to hook into pytest's test collection to make sure that the custom test applies to all the relevant YAML files?

If not, another idea would be to extend pytest.mark.workflow() to accept no workflow name, in which case it would apply to all workflows.

> Makes sense. The only reason I thought a custom test wasn't quite as good is because you have to duplicate the test collection. Do you know if there's a way to hook into pytest's test collection to make sure that the custom test applies to all the relevant YAML files?

You can parse the YAML yourself. That should give you all the workflow names and all the relevant files. In that case there will be no duplication.

There is duplication in finding the YAML files: pytest's test collection handles that, and any difference between the custom code and pytest's test collection will result in missed tests.

Sorry for my late reply. I have been very busy lately and haven't kept up.

> There is duplication in finding the YAML files: pytest's test collection handles that, and any difference between the custom code and pytest's test collection will result in missed tests.

Yes, there will be duplication. Sometimes a little duplication is better than adding extra complexity. It is possible to take DRY too far. I would take simplicity and a little duplication over complexity every single time. But you might have noticed that from the design of pytest-workflow already ;-).