kundajelab/chrombpnet

`qc` broken because it expects auxiliary folder to simultaneously exist and not exist

bytewife opened this issue · 11 comments

The qc subcommand requires an 'auxiliary' folder to exist in order to work. This folder is intended to be provided by train. However, chrombpnet does not allow the folder to exist, because os.makedirs(..., exist_ok=False) is called both in chrombpnet_qc() and in the block that handles qc as an input subcommand. It is therefore impossible for qc to work correctly. I recommend changing that flag to True within that block and in chrombpnet_qc().
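To make the mechanism concrete, here is a minimal sketch of how os.makedirs behaves under each exist_ok setting. It runs against a temporary directory rather than a real chrombpnet output folder, and the helper name try_makedirs is made up for illustration.

```python
import os
import tempfile


def try_makedirs(path, exist_ok):
    """Return 'ok' if os.makedirs succeeds, else the name of the exception raised."""
    try:
        os.makedirs(path, exist_ok=exist_ok)
        return "ok"
    except FileExistsError as err:
        return type(err).__name__


with tempfile.TemporaryDirectory() as output_dir:
    aux = os.path.join(output_dir, "auxiliary")
    first = try_makedirs(aux, exist_ok=False)   # directory does not exist yet
    second = try_makedirs(aux, exist_ok=False)  # directory now exists, so this raises
    third = try_makedirs(aux, exist_ok=True)    # an existing directory is tolerated
    print(first, second, third)  # ok FileExistsError ok
```

The second call reproduces the FileExistsError from the first traceback: once 'auxiliary' is present, exist_ok=False rejects it.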

For completeness, here's the error when the train folders are provided:

Traceback (most recent call last):
  File "/opt/conda/bin/chrombpnet", line 33, in <module>
    sys.exit(load_entry_point('chrombpnet', 'console_scripts', 'chrombpnet')())
  File "/scratch/chrombpnet/chrombpnet/CHROMBPNET.py", line 26, in main
    os.makedirs(os.path.join(args.output_dir,"auxiliary"), exist_ok=False)
  File "/opt/conda/lib/python3.9/os.py", line 225, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/projectTest/output/auxiliary'
Exit code: 1

and here's when they're not provided:

Traceback (most recent call last):
got the model
loading peaks...
  File "/opt/conda/bin/chrombpnet", line 33, in <module>
    sys.exit(load_entry_point('chrombpnet', 'console_scripts', 'chrombpnet')())
  File "/scratch/chrombpnet/chrombpnet/CHROMBPNET.py", line 29, in main
    pipelines.chrombpnet_qc(args)
  File "/scratch/chrombpnet/chrombpnet/pipelines.py", line 196, in chrombpnet_qc
    predict.main(args_copy)
  File "/scratch/chrombpnet/chrombpnet/training/predict.py", line 105, in main
    test_generator = initializers.initialize_generators(args, mode="test", parameters=None, return_coords=True)
  File "/scratch/chrombpnet/chrombpnet/training/data_generators/initializers.py", line 69, in initialize_generators
    peak_regions=pd.read_csv(args.peaks,header=None,sep='\t',names=NARROWPEAK_SCHEMA)
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 948, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 611, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1448, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1705, in _make_engine
    self.handles = get_handle(
  File "/opt/conda/lib/python3.9/site-packages/pandas/io/common.py", line 863, in get_handle
    handle = open(
FileNotFoundError: [Errno 2] No such file or directory: '/projectTest/output/auxiliary/filtered.peaks.bed'
Exit code: 1
Model: "model"

@akundaje I already opened a PR for this (see the link above).

@ivyraine is the output_dir you provided to chrombpnet qc the same as the one you provided to chrombpnet train?

Can you provide the exact commands you used to run chrombpnet qc and chrombpnet train?

I think you are using the same output dir path for both commands, and that is why you are seeing this error. Is there a reason why you are using the same path?

No, this is from using two different output paths, one for each command.
train cmd:

                chrombpnet train \
                  -itag /mnt/volume/oak/stanford/projects/igvf/Y2AVE/E2G_Predictions/inputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/230601_iPSC_art_ven_EC_10Xmultiome_Cluster0.atac.filter.cutsites.hg38.tagAlign.gz \
                  -d "ATAC" \
                  -g /mnt/volume/oak/stanford/groups/akundaje/soumyak/refs/hg38/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta \
                  -c /mnt/volume/oak/stanford/groups/akundaje/soumyak/refs/hg38/GRCh38_EBV.chrom.sizes.tsv \
                  -p /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/Peaks/macs2_peaks.narrowPeak.sorted.candidateRegions.bed \
                  -n /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_0/_negatives.bed \
                  -fl /mnt/volume/oak/stanford/groups/akundaje/anusri/chrombpnet_data/input_files/folds/fold_0.json \
                  -b /mnt/volume/oak/stanford/groups/akundaje/anusri/chrombpnet_data/input_files/bias_models/ATAC/ENCSR868FGK_bias_fold_0.h5 \
                  -o /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_0/chrombpnet_model/ \
                  | tee "$output_file"

qc cmd:

                chrombpnet qc \
                  -bw /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_0/auxiliary/data_unstranded.bw \
                  -cm /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_0/chrombpnet_model/models/chrombpnet.h5 \
                  -cmb /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_0/chrombpnet_model/models/chrombpnet_nobias.h5 \
                  -d "ATAC" \
                  -g /mnt/volume/oak/stanford/groups/akundaje/soumyak/refs/hg38/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta \
                  -c /mnt/volume/oak/stanford/groups/akundaje/soumyak/refs/hg38/GRCh38_EBV.chrom.sizes.tsv \
                  -p /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/Peaks/macs2_peaks.narrowPeak.sorted.candidateRegions.bed \
                  -n /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_0/_negatives.bed \
                  -fl /mnt/volume/oak/stanford/groups/akundaje/anusri/chrombpnet_data/input_files/folds/fold_0.json \
                  -o /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_0/pipeline_output/ \
                  | tee "$output_file"

then it leads to the following error:

FileNotFoundError: [Errno 2] No such file or directory: '/mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_2/pipeline_output/auxiliary/filtered.peaks.bed'

Please allow me to save both your time and mine. The code fails for the reason I gave in the first comment: qc expects the output files from train to exist, but the exist_ok=False flag of makedirs() prevents that from working. See my PR for the fixes.

Hello @ivyraine, I appreciate your intention to save both your time and mine. But your PR suggests a fix that bypasses a folder-existence check, and that check is important to prevent overwriting existing folders/files.
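The overwrite concern can be illustrated with a small sketch. The helper below, prepare_output_dir, is hypothetical and not part of chrombpnet: it reuses a directory only when that directory is empty, which is one way to keep the protection that exist_ok=False gives while tolerating a pre-created folder. It does not by itself solve the qc problem, since qc needs files from train to be present.

```python
import os
import tempfile


def prepare_output_dir(path):
    """Create path, or reuse it only if it is empty (hypothetical helper)."""
    os.makedirs(path, exist_ok=True)
    existing = os.listdir(path)
    if existing:
        raise FileExistsError(
            f"{path} already contains {len(existing)} entries; refusing to overwrite"
        )
    return path


with tempfile.TemporaryDirectory() as root:
    fresh = os.path.join(root, "qc_output")
    created = prepare_output_dir(fresh)  # new directory: accepted
    reused = prepare_output_dir(fresh)   # exists but empty: accepted
    # Simulate a previous run having written a result file.
    open(os.path.join(fresh, "filtered.peaks.bed"), "w").close()
    try:
        prepare_output_dir(fresh)        # now holds a file: rejected
        refused = False
    except FileExistsError:
        refused = True
```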

Allow me some time to reproduce this and fix it.

Gotcha- TY

Also, your fix won't work: the filtered.peaks.bed from the chrombpnet train command will be at /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_0/chrombpnet_model/auxiliary/filtered.peaks.bed

But chrombpnet qc is looking for it here - /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_2/pipeline_output/auxiliary/filtered.peaks.bed

Was your fix to change exist_ok to True and just pass the output dir from train to qc?

I just symlinked the subdirs produced in the train output dir into the qc output dir. But you're right, it would be better if it was clear that the user needs to provide the train outputs as well. Perhaps it would be best if qc had another required flag --train-output, which would be the output dir of the train command. What do you think?
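For reference, the symlink workaround and the proposed flag could look roughly like the sketch below. Everything here is an assumption for illustration: link_train_subdirs and build_qc_parser are made-up names, the directory layout is a toy copy of the paths in this thread, and --train-output is only the suggestion above, not an existing chrombpnet option.

```python
import argparse
import os
import tempfile


def link_train_subdirs(train_output, qc_output):
    """Symlink each subdirectory of train_output into qc_output; return linked names."""
    os.makedirs(qc_output, exist_ok=True)
    linked = []
    for name in sorted(os.listdir(train_output)):
        src = os.path.join(train_output, name)
        dst = os.path.join(qc_output, name)
        # Only link directories, and never replace something already present.
        if os.path.isdir(src) and not os.path.lexists(dst):
            os.symlink(os.path.abspath(src), dst, target_is_directory=True)
            linked.append(name)
    return linked


def build_qc_parser():
    """Hypothetical parser: --train-output is NOT an existing chrombpnet flag."""
    parser = argparse.ArgumentParser(prog="qc-sketch")
    parser.add_argument("--train-output", required=True)
    parser.add_argument("-o", "--output-dir", required=True)
    return parser


with tempfile.TemporaryDirectory() as root:
    # Toy stand-ins for the train and qc output directories.
    train_dir = os.path.join(root, "chrombpnet_model")
    qc_dir = os.path.join(root, "pipeline_output")
    os.makedirs(os.path.join(train_dir, "auxiliary"))
    os.makedirs(os.path.join(train_dir, "models"))
    open(os.path.join(train_dir, "auxiliary", "filtered.peaks.bed"), "w").close()

    linked = link_train_subdirs(train_dir, qc_dir)
    peaks_visible = os.path.exists(
        os.path.join(qc_dir, "auxiliary", "filtered.peaks.bed")
    )

args = build_qc_parser().parse_args(["--train-output", "train_dir", "-o", "qc_dir"])
```

Linking only makes the train files visible under the qc output dir; whether qc then accepts the pre-existing 'auxiliary' folder still depends on the exist_ok check discussed earlier.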

I think the chrombpnet qc command needs to be restructured a bit based on some utilities added recently (re. filtering of peaks at the edge). I will think about how to do this and get back to you.