kipoi/kipoi-veff

Error when chromosome names don't start with `chr`

Avsecz opened this issue · 1 comment

  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi_veff/snv_predict.py", line 795, in score_variants
    return_predictions=return_predictions)
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi_veff/snv_predict.py", line 620, in predict_snvs
    for i, batch in enumerate(tqdm(it)):
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/tqdm/_tqdm.py", line 979, in __iter__
    for obj in iterable:
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi/external/torch/data.py", line 175, in __next__
    return self._process_next_batch(batch)
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi/external/torch/data.py", line 195, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
pyfaidx.FetchError: Traceback (most recent call last):
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/pyfaidx/__init__.py", line 639, in from_file
    i = self.index[rname]
KeyError: 'chr1'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi/external/torch/data.py", line 58, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoi/external/torch/data.py", line 58, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoiseq/dataloaders/sequence.py", line 350, in __getitem__
    ret = self.seq_dl[idx]
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoiseq/dataloaders/sequence.py", line 238, in __getitem__
    seq = self.fasta_extractors.extract(interval)
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/kipoiseq/extractors.py", line 50, in extract
    seq = str(self.fasta.get_seq(interval.chrom, interval.start + 1, interval.stop, rc=rc).seq)
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/pyfaidx/__init__.py", line 1032, in get_seq
    seq = self.faidx.fetch(name, start, end)
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/pyfaidx/__init__.py", line 624, in fetch
    seq = self.from_file(name, start, end)
  File "/anaconda3/envs/kipoi-gpu-shared__envs__kipoi-py3-keras2/lib/python3.6/site-packages/pyfaidx/__init__.py", line 642, in from_file
    "Please check your FASTA file.".format(rname))
pyfaidx.FetchError: Requested rname chr1 does not exist! Please check your FASTA file.

minimal.vcf

##fileformat=VCFv4.0
##fileDate=20181110
##source=UKBB/variants.tsv.bgz_V3
##reference=GRCh37
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1    15791   1:15791_C_T     C       T       .       .       .
1    69487   1:69487_G_A     G       A       .       .       .
1    69569   1:69569_T_C     T       C       .       .       .
1    139853  1:139853_C_T    C       T       .       .       .
1    693731  1:693731_A_G    A       G       .       .       .

The FASTA file contained the correct chromosome names, e.g. >1...

OK, we need a way to deal with that. I think it is either the job of the dataloader, or we catch the KeyError in kipoi_veff.
The problem is:

  • VCF files tend to have chromosome names without a leading "chr" (and the positions in them are 1-based)
  • FASTA does not place any restrictions on chromosome naming

An argument for handling it within the dataloader:

  • BED files conventionally use a "chr" prefix for genomic coordinates (UCSC convention, 0-based). Your FASTA file would therefore raise the exact same error with any such BED file.

An argument against handling it automatically:

  • The FASTA file could contain entries named ">1" and ">chr1" that are not identical. I think we can ignore this case, as intuitively it wouldn't make sense.
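
To make the options above concrete, here is a minimal sketch of prefix-tolerant lookup. It is not the kipoi_veff or kipoiseq implementation. The function name `resolve_chrom` and the `fasta_names` argument are hypothetical. `fasta_names` stands for whatever set of sequence names the FASTA index exposes, for example the keys of a pyfaidx `Fasta` object.

```python
def resolve_chrom(chrom, fasta_names):
    """Return the name under which `chrom` is present in `fasta_names`.

    Hypothetical helper: tries the name as given first, then the same
    name with the "chr" prefix added or removed.
    """
    # An exact match always wins, so a FASTA that holds both ">1" and
    # ">chr1" is never silently remapped.
    if chrom in fasta_names:
        return chrom

    # Toggle the prefix: "chr1" -> "1", "1" -> "chr1".
    if chrom.startswith("chr"):
        alternative = chrom[len("chr"):]
    else:
        alternative = "chr" + chrom

    if alternative in fasta_names:
        return alternative

    # Neither spelling exists; report both so the mismatch is easy to spot.
    raise KeyError(
        "Neither {!r} nor {!r} found in the FASTA file".format(chrom, alternative)
    )


# Usage: the FASTA from this issue uses names without the prefix.
print(resolve_chrom("chr1", {"1", "2", "X"}))
```

Because the prefix is only toggled when the requested name is missing, the ">1" versus ">chr1" case discussed above cannot cause a wrong substitution under this scheme. The sketch does not cover other naming differences, such as "chrM" versus "MT".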