`qc` broken b.c. it expects auxiliary folder to simultaneously exist and not exist
bytewife opened this issue · 11 comments
The `qc` subcommand requires an 'auxiliary' folder to exist to work. This folder is intended to be provided by `train`. However, chrombpnet will not allow for this folder to exist due to the usage of `os.makedirs(..., exist_ok=False)` in `chrombpnet_qc()` and in the block that handles `qc` as an input subcommand. Thus it's impossible for `qc` to work correctly. I recommend changing that flag to `True` within that block, and in `chrombpnet_qc()`.
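The contradiction can be reproduced outside chrombpnet with a minimal sketch. The helper `run_qc_like` below is hypothetical: it only mimics the two steps described above (create `auxiliary` with `exist_ok=False`, then read `auxiliary/filtered.peaks.bed`) and is not the actual chrombpnet code.

```python
import os
import tempfile


def run_qc_like(output_dir):
    """Hypothetical stand-in for the qc code path described above."""
    aux = os.path.join(output_dir, "auxiliary")
    # Step 1: fails if train already created the folder.
    os.makedirs(aux, exist_ok=False)
    # Step 2: fails if the folder holds no train outputs.
    with open(os.path.join(aux, "filtered.peaks.bed")) as fh:
        return fh.read()


def outcome(output_dir):
    """Return the name of the exception raised, or 'ok'."""
    try:
        run_qc_like(output_dir)
        return "ok"
    except (FileExistsError, FileNotFoundError) as err:
        return type(err).__name__


with tempfile.TemporaryDirectory() as with_train, \
        tempfile.TemporaryDirectory() as without_train:
    # Simulate a directory that already holds train outputs.
    os.makedirs(os.path.join(with_train, "auxiliary"))
    open(os.path.join(with_train, "auxiliary", "filtered.peaks.bed"), "w").close()

    result_with_train = outcome(with_train)
    result_without_train = outcome(without_train)
    print(result_with_train)     # FileExistsError
    print(result_without_train)  # FileNotFoundError
```

Either way the call fails, which matches the two tracebacks in this report.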
For completeness, here's the error when the `train` folders are provided:
Traceback (most recent call last):
File "/opt/conda/bin/chrombpnet", line 33, in <module>
sys.exit(load_entry_point('chrombpnet', 'console_scripts', 'chrombpnet')())
File "/scratch/chrombpnet/chrombpnet/CHROMBPNET.py", line 26, in main
os.makedirs(os.path.join(args.output_dir,"auxiliary"), exist_ok=False)
File "/opt/conda/lib/python3.9/os.py", line 225, in makedirs
mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/projectTest/output/auxiliary'
Exit code: 1
and here's when they're not provided:
Traceback (most recent call last):
got the model
loading peaks...
File "/opt/conda/bin/chrombpnet", line 33, in <module>
sys.exit(load_entry_point('chrombpnet', 'console_scripts', 'chrombpnet')())
File "/scratch/chrombpnet/chrombpnet/CHROMBPNET.py", line 29, in main
pipelines.chrombpnet_qc(args)
File "/scratch/chrombpnet/chrombpnet/pipelines.py", line 196, in chrombpnet_qc
predict.main(args_copy)
File "/scratch/chrombpnet/chrombpnet/training/predict.py", line 105, in main
test_generator = initializers.initialize_generators(args, mode="test", parameters=None, return_coords=True)
File "/scratch/chrombpnet/chrombpnet/training/data_generators/initializers.py", line 69, in initialize_generators
peak_regions=pd.read_csv(args.peaks,header=None,sep='\t',names=NARROWPEAK_SCHEMA)
File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 948, in read_csv
return _read(filepath_or_buffer, kwds)
File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 611, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1448, in __init__
self._engine = self._make_engine(f, self.engine)
File "/opt/conda/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1705, in _make_engine
self.handles = get_handle(
File "/opt/conda/lib/python3.9/site-packages/pandas/io/common.py", line 863, in get_handle
handle = open(
FileNotFoundError: [Errno 2] No such file or directory: '/projectTest/output/auxiliary/filtered.peaks.bed'
Exit code: 1
Model: "model"
@ivyraine is your `output_dir` provided to `chrombpnet qc` the same as that for `chrombpnet train`?
Can you provide the exact commands you used to run `chrombpnet qc` and `chrombpnet train`?
I think you are trying to use the same output dir path for both commands, and hence you are seeing this error. Is there a reason why you are using the same path?
No- this is from using two different output paths, one for each command.
train cmd:
chrombpnet train \
-itag /mnt/volume/oak/stanford/projects/igvf/Y2AVE/E2G_Predictions/inputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/230601_iPSC_art_ven_EC_10Xmultiome_Cluster0.atac.filter.cutsites.hg38.tagAlign.gz \
-d "ATAC" \
-g /mnt/volume/oak/stanford/groups/akundaje/soumyak/refs/hg38/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta \
-c /mnt/volume/oak/stanford/groups/akundaje/soumyak/refs/hg38/GRCh38_EBV.chrom.sizes.tsv \
-p /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/Peaks/macs2_peaks.narrowPeak.sorted.candidateRegions.bed \
-n /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_0/_negatives.bed \
-fl /mnt/volume/oak/stanford/groups/akundaje/anusri/chrombpnet_data/input_files/folds/fold_0.json \
-b /mnt/volume/oak/stanford/groups/akundaje/anusri/chrombpnet_data/input_files/bias_models/ATAC/ENCSR868FGK_bias_fold_0.h5 \
-o /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_0/chrombpnet_model/ \
| tee "$output_file"
qc cmd:
chrombpnet qc \
-bw /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_0/auxiliary/data_unstranded.bw \
-cm /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_0/chrombpnet_model/models/chrombpnet.h5 \
-cmb /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_0/chrombpnet_model/models/chrombpnet_nobias.h5 \
-d "ATAC" \
-g /mnt/volume/oak/stanford/groups/akundaje/soumyak/refs/hg38/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta \
-c /mnt/volume/oak/stanford/groups/akundaje/soumyak/refs/hg38/GRCh38_EBV.chrom.sizes.tsv \
-p /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/Peaks/macs2_peaks.narrowPeak.sorted.candidateRegions.bed \
-n /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_0/_negatives.bed \
-fl /mnt/volume/oak/stanford/groups/akundaje/anusri/chrombpnet_data/input_files/folds/fold_0.json \
-o /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_0/pipeline_output/ \
| tee "$output_file"
then it leads to the following error:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_2/pipeline_output/auxiliary/filtered.peaks.bed'
Please allow me to save both your and my time. The reason why the code doesn't work is as I provided in the first comment. `qc` is expecting the output files from `train` to exist, but the `exist_ok=False` flag of `makedirs()` prevents that from working. See my PR for the fixes.
Hello @ivyraine, I appreciate your intention to save both your and my time. But your PR suggests a fix that bypasses a folder-existence check, which is important to prevent overwriting of existing folders/files.
Allow me some time to reproduce this and fix it.
Gotcha- TY
Also, your fix won't work. The `filtered.peaks.bed` from the `chrombpnet train` command will be at /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_0/chrombpnet_model/auxiliary/filtered.peaks.bed
But `chrombpnet qc` is looking for it here: /mnt/volume/oak/stanford/groups/engreitz/Projects/IGVF-Y2AVE/outputs/230601_iPSC_art_ven_EC_10xmultiome_Cluster0/fold_2/pipeline_output/auxiliary/filtered.peaks.bed
Was your fix to change `exist_ok` to `True` and just pass the output dir from `train` to `qc`?
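The mismatch can be seen by building both paths the way the error messages suggest, as `<output dir>/auxiliary/filtered.peaks.bed`. This is only a sketch: the helper name `peaks_path` and the shortened directory names are illustrative assumptions, not chrombpnet code.

```python
import os


def peaks_path(output_dir):
    # Assumed layout, inferred from the paths in the tracebacks above.
    return os.path.join(output_dir, "auxiliary", "filtered.peaks.bed")


train_out = "fold_0/chrombpnet_model"  # shortened stand-in for the train -o value
qc_out = "fold_0/pipeline_output"      # shortened stand-in for the qc -o value

written_by_train = peaks_path(train_out)
read_by_qc = peaks_path(qc_out)

print(written_by_train)
print(read_by_qc)
print(written_by_train == read_by_qc)  # False
```

With two different output dirs, the file that `train` writes is never the file that `qc` tries to read, regardless of the `exist_ok` setting.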
I just symlinked the subdirs produced in the `train` output dir into the `qc` output dir. But you're right, it would be better if it was clear that the user needs to provide the `train` outputs as well. Perhaps it would be best if `qc` had another required flag `--train-output`, which would be the output dir of the `train` command. What do you think?
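A rough sketch of what that could look like, assuming a hypothetical `--train-output` flag (it does not exist in chrombpnet today) and the `auxiliary/filtered.peaks.bed` layout seen in the tracebacks. The function names `build_qc_parser` and `resolve_qc_paths` are made up for illustration.

```python
import argparse
import os
import tempfile


def build_qc_parser():
    parser = argparse.ArgumentParser(prog="qc-sketch")
    # Hypothetical flag proposed above; not an existing chrombpnet option.
    parser.add_argument("--train-output", required=True,
                        help="output dir of a finished train run (read only)")
    parser.add_argument("-o", "--output-dir", required=True,
                        help="fresh dir for qc results")
    return parser


def resolve_qc_paths(args):
    # Inputs come from the train run and must already exist.
    peaks = os.path.join(args.train_output, "auxiliary", "filtered.peaks.bed")
    if not os.path.isfile(peaks):
        raise FileNotFoundError(f"expected train output at {peaks}")
    # Outputs go to a separate dir; exist_ok=False still guards against overwriting.
    qc_aux = os.path.join(args.output_dir, "auxiliary")
    os.makedirs(qc_aux, exist_ok=False)
    return peaks, qc_aux


with tempfile.TemporaryDirectory() as root:
    # Simulate a finished train run.
    train_dir = os.path.join(root, "chrombpnet_model")
    os.makedirs(os.path.join(train_dir, "auxiliary"))
    open(os.path.join(train_dir, "auxiliary", "filtered.peaks.bed"), "w").close()

    args = build_qc_parser().parse_args(
        ["--train-output", train_dir, "-o", os.path.join(root, "pipeline_output")]
    )
    peaks, qc_aux = resolve_qc_paths(args)
    peaks_found = os.path.isfile(peaks)
    qc_aux_created = os.path.isdir(qc_aux)
    print(peaks_found, qc_aux_created)  # True True
```

Reading from one directory and writing to another would keep the overwrite protection on the `qc` output dir while removing the need for symlinks.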
I think the `chrombpnet qc` command needs to be restructured a bit based on some utilities added recently (re. filtering of peaks at the edge). I will think about how to do this and get back.