Error when chromosome names don't start with `chr`
Avsecz opened this issue · 1 comment
Avsecz commented
File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi_veff/snv_predict.py", line 795, in score_variants
return_predictions=return_predictions)
File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi_veff/snv_predict.py", line 620, in predict_snvs
for i, batch in enumerate(tqdm(it)):
File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/tqdm/_tqdm.py", line 979, in __iter__
for obj in iterable:
File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi/external/torch/data.py", line 175, in __next__
return self._process_next_batch(batch)
File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi/external/torch/data.py", line 195, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
pyfaidx.FetchError: Traceback (most recent call last):
File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/pyfaidx/__init__.py", line 639, in from_file
i = self.index[rname]
KeyError: 'chr1'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi/external/torch/data.py", line 58, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi/external/torch/data.py", line 58, in <listcomp>
samples = collate_fn([dataset[i] for i in batch_indices])
File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoiseq/dataloaders/sequence.py", line 350, in __getitem__
ret = self.seq_dl[idx]
File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoiseq/dataloaders/sequence.py", line 238, in __getitem__
seq = self.fasta_extractors.extract(interval)
File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoiseq/extractors.py", line 50, in extract
seq = str(self.fasta.get_seq(interval.chrom, interval.start + 1, interval.stop, rc=rc).seq)
File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/pyfaidx/__init__.py", line 1032, in get_seq
seq = self.faidx.fetch(name, start, end)
File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/pyfaidx/__init__.py", line 624, in fetch
seq = self.from_file(name, start, end)
File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/pyfaidx/__init__.py", line 642, in from_file
"Please check your FASTA file.".format(rname))
pyfaidx.FetchError: Requested rname chr1 does not exist! Please check your FASTA file.
minimal.vcf
##fileformat=VCFv4.0
##fileDate=20181110
##source=UKBB/variants.tsv.bgz_V3
##reference=GRCh37
#CHROM POS ID REF ALT QUAL FILTER INFO
1 15791 1:15791_C_T C T . . .
1 69487 1:69487_G_A G A . . .
1 69569 1:69569_T_C T C . . .
1 139853 1:139853_C_T C T . . .
1 693731 1:693731_A_G A G . . .
The FASTA file contained the correct chromosome names, i.e. matching the VCF, without the `chr` prefix. E.g. `>1`
...
krrome commented
OK, we need a way to deal with that. I think it is either the job of the dataloader, or we catch the KeyError in kipoi_veff.
Problem is:
- VCF files tend to have chromosome names without a leading "chr", and positions in them are 1-based
- FASTA does not place any restrictions on chromosome naming
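To check which naming convention each file actually uses, the two sets of names can be compared directly. The following is a minimal sketch using only the standard library; the helper names `fasta_record_names` and `vcf_chromosomes` are hypothetical and not part of kipoi, kipoi_veff or pyfaidx, and the inline data is a toy stand-in for the real files.

```python
import io


def fasta_record_names(handle):
    """Return the set of record names (first word after '>') in a FASTA stream."""
    names = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def vcf_chromosomes(handle):
    """Return the set of CHROM values in a VCF stream, skipping header lines."""
    chroms = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        # CHROM is the first whitespace-delimited column of a data line.
        chroms.add(line.split(None, 1)[0])
    return chroms


# Toy stand-ins for the FASTA file and minimal.vcf (hypothetical content).
fasta = io.StringIO(">1 some description\nACGT\n>2\nACGT\n")
vcf = io.StringIO(
    "##fileformat=VCFv4.0\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t15791\t1:15791_C_T\tC\tT\t.\t.\t.\n"
)

fasta_names = fasta_record_names(fasta)
vcf_names = vcf_chromosomes(vcf)

# Names used by the VCF that the FASTA cannot serve as-is.
missing = vcf_names - fasta_names
```

With input like the files in this report, `missing` is empty while `"chr1"` is not in `fasta_names`, which suggests the prefix is added somewhere between reading the VCF and querying the FASTA.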
An argument towards handling it within the dataloader:
- BED files usually carry a "chr" prefix for genomic coordinates (they follow the UCSC convention and are 0-based). Therefore your FASTA file would raise the exact same error with such a BED file.
An argument against handling it automatically:
- The FASTA file could contain entries named ">1" and ">chr1" that are not identical. I think we can ignore this case, as intuitively it wouldn't make sense.
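One way the trade-off above could look in code is sketched below. This is an illustration only, not the kipoi or kipoiseq implementation: `resolve_chrom` and its `strict` flag are hypothetical, and `available` stands for whatever set of record names the FASTA index exposes. It only handles the `chr` prefix and does not handle other naming differences such as `chrM` versus `MT`.

```python
def resolve_chrom(name, available, strict=False):
    """Map `name` onto a record name in `available`, tolerating a 'chr' prefix.

    Hypothetical helper: tries the name with and without the prefix.
    Raises KeyError if neither form exists. If both forms exist, the exact
    match is returned, unless `strict` is set, in which case the ambiguity
    is reported as a ValueError.
    """
    bare = name[3:] if name.startswith("chr") else name
    candidates = [c for c in (bare, "chr" + bare) if c in available]

    if not candidates:
        raise KeyError(name)
    if len(candidates) == 2:
        if strict:
            raise ValueError(
                "Both {!r} and {!r} exist; refusing to guess.".format(*candidates)
            )
        # Both forms exist, so `name` itself is one of them: exact match wins.
        return name
    return candidates[0]


# Requested 'chr1', FASTA only has '1': fall back to the bare name.
resolved = resolve_chrom("chr1", {"1", "2"})
```

The resolved name would then be passed to the FASTA lookup in place of the original one. The `strict` flag covers the ">1" and ">chr1" case for anyone who would rather fail than silently pick one of the two records.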