biocore/qadabra

Issue adding the tutorial dataset

Closed this issue · 2 comments

I created workflow/dataset directories and installed Qadabra as specified in the tutorial (using conda instead of mamba), but when I try to add the dataset I keep getting an error stating that the table IDs don't match the metadata IDs. I checked the IDs from the .biom and the .tsv file and they seem to match (shown below), so I was just wondering if this issue has come up for anyone else before. If not, are there any alternative datasets that you would recommend for the tutorial?

I also tried this in a different environment in which Python 3.9 wasn't specified: conda was used to install every dependency, and Qadabra itself was installed using pip. I ran into the same issue.

Note: qadabra_env_2 was created as specified in the tutorial. qadabra_env was created without explicitly specifying Python 3.9, using conda to install all dependencies and pip to install Qadabra.

My directory structure:

.
├── check_metadata_coverage.py <-- custom script, everything else is following the tutorial
└── my_qadabra
    ├── config
    │   ├── config.yaml
    │   └── qadabra.mplstyle
    ├── data
    │   ├── qadabra_tutorial_metadata.tsv
    │   └── qadabra_tutorial_table.biom
    └── workflow
        └── Snakefile

Attempting to add the dataset to the workflow:

(qadabra_env_2) aphilliplt-osx:qadabra_tutorial aphillip$ qadabra add-dataset \
>     --workflow-dest my_qadabra \
>     --table my_qadabra/data/qadabra_tutorial_table.biom \
>     --metadata my_qadabra/data/qadabra_tutorial_metadata.tsv \
>     --name skin_microbiome \
>     --factor-name group \
>     --target-level Day_90 \
>     --reference-level Baseline \
>     --verbose
[2024-01-10 12:28:16 - INFO] :: Validating input...
[2024-01-10 12:28:16 - INFO] :: Loading metadata...
[2024-01-10 12:28:16 - INFO] :: Making sure factor & levels are all present in metadata...
[2024-01-10 12:28:16 - INFO] :: Factor counts:
group
Baseline    19
Day_90      19
Name: count, dtype: int64
[2024-01-10 12:28:16 - INFO] :: Making sure confounders are all metadata columns...
[2024-01-10 12:28:16 - INFO] :: Loading table...
[2024-01-10 12:28:16 - INFO] :: Table shape: (11, 38)
Traceback (most recent call last):
  File "/Users/aphillip/anaconda3/envs/qadabra_env_2/bin/qadabra", line 8, in <module>
    sys.exit(qadabra())
  File "/Users/aphillip/anaconda3/envs/qadabra_env_2/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/Users/aphillip/anaconda3/envs/qadabra_env_2/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/Users/aphillip/anaconda3/envs/qadabra_env_2/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/aphillip/anaconda3/envs/qadabra_env_2/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/aphillip/anaconda3/envs/qadabra_env_2/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/Users/aphillip/anaconda3/envs/qadabra_env_2/lib/python3.9/site-packages/qadabra/qadabra.py", line 148, in add_dataset
    _validate_input(logger, table, metadata, factor_name, target_level,
  File "/Users/aphillip/anaconda3/envs/qadabra_env_2/lib/python3.9/site-packages/qadabra/utils.py", line 44, in _validate_input
    raise ValueError("Table IDs are not a subset of metadata IDs!")
ValueError: Table IDs are not a subset of metadata IDs!

Checking the table IDs and the metadata IDs:

(qadabra_env) aphilliplt-osx:qadabra_tutorial aphillip$ biom table-ids -i my_qadabra/data/qadabra_tutorial_table.biom
25
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
52
53
54
55
56
57
58
59
60
61
62
63
64
(qadabra_env) aphilliplt-osx:qadabra_tutorial aphillip$ python check_metadata_coverage.py
Error: Biom IDs without corresponding metadata entries: {'43', '64', '63', '45', '32', '33', '42', '44', '47', '38', '39', '52', '28', '49', '50', '48', '27', '34', '62', '57', '37', '54', '30', '61', '29', '31', '60', '55', '56', '36', '25', '35', '58', '59', '53', '41', '46', '40'}
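When a coverage check reports every single ID as missing, as in the output above, one possible cause is that the two sides are being compared as different types (for example, strings from the table against integers from the metadata). The following is a minimal sketch of a type-safe comparison, not the check_metadata_coverage.py script itself. It assumes the first column of the metadata TSV holds the sample IDs; the function name, the `sample_name` column header, and the inline sample data are all hypothetical.

```python
import csv
import io


def read_metadata_ids(handle):
    """Read the first column of a TSV as strings, skipping the header row."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header line
    return {row[0] for row in reader if row}


# Hypothetical stand-in for the metadata file (assumed layout: ID column first).
metadata_tsv = io.StringIO("sample_name\tgroup\n25\tBaseline\n27\tDay_90\n")
metadata_ids = read_metadata_ids(metadata_tsv)

# Sample IDs as printed by `biom table-ids`, kept as strings.
table_ids = {"25", "27"}

# Both sides are strings, so numeric-looking IDs compare equal.
missing = table_ids - metadata_ids
print(missing)  # set()
```

Because the `csv` module never converts values, both sets contain strings and the difference is empty. If the same comparison reports everything as missing with a different loader, the loader's type inference would be a reasonable place to look.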

Update: I was able to bypass the issue by running the add-dataset command with the additional option --no-validate-input.

Hi, @411an13

Thanks for reporting this. Historically, I've had issues when using BIOM tables with numeric sample IDs. If you run into more trouble, try converting your BIOM sample IDs to strings (e.g. sample_25, sample_27, ...). Feel free to follow up if you have more problems.
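The suggestion above can be illustrated with a small sketch. It assumes, without confirmation from the Qadabra source, that the failure comes from numeric-looking metadata IDs being parsed as integers while the table IDs remain strings. The `add_prefix` helper and the sample values are hypothetical and only show the idea.

```python
# Hypothetical illustration; this is not Qadabra's validation code.
table_ids = {"25", "27", "28"}   # sample IDs from the table, as strings
metadata_ids = {25, 27, 28}      # assumed: metadata index inferred as integers

# A string never equals an integer, so the subset check fails.
print(table_ids.issubset(metadata_ids))  # False


def add_prefix(ids, prefix="sample_"):
    """Map each original ID to a string ID that cannot be parsed as a number."""
    return {str(i): f"{prefix}{i}" for i in ids}


id_map = add_prefix(metadata_ids)

# Apply the same renaming to both sides so they stay in sync.
new_table_ids = {id_map[i] for i in table_ids}
new_metadata_ids = {id_map[str(i)] for i in metadata_ids}

print(new_table_ids.issubset(new_metadata_ids))  # True
```

A mapping like `id_map` could then be applied to the real table, for instance with biom-format's `Table.update_ids` on the sample axis, as long as the ID column of the metadata file is renamed the same way.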