biocommons/uta

Investigate UTA 20161024 anomalies

reece opened this issue · 2 comments

reece commented

Reported by Geoff Nilsen:

  1. Some alignments have null cigar strings
  2. Missing splign alignment for NM_145045.4 ~ NC_000019.9 (.10 exists)

reece commented

Re 1:

Confirmed:

reece@[local]/uta_dev=> select hgnc,tx_ac,alt_ac,alt_aln_method,ord,cigar
from uta_20161024.tx_exon_aln_v
where tx_ac = 'NM_001038633.3' and alt_ac = 'NC_000001.10'
order by alt_aln_method, ord;
┌───────┬────────────────┬──────────────┬────────────────┬─────┬───────┐
│ hgnc  │     tx_ac      │    alt_ac    │ alt_aln_method │ ord │ cigar │
├───────┼────────────────┼──────────────┼────────────────┼─────┼───────┤
│ RSPO1 │ NM_001038633.3 │ NC_000001.10 │ blat           │   0 │ 358=  │
│ RSPO1 │ NM_001038633.3 │ NC_000001.10 │ blat           │   1 │ 67=   │
│ RSPO1 │ NM_001038633.3 │ NC_000001.10 │ blat           │   2 │ ¤     │
│ RSPO1 │ NM_001038633.3 │ NC_000001.10 │ blat           │   3 │ ¤     │
│ RSPO1 │ NM_001038633.3 │ NC_000001.10 │ blat           │   4 │ ¤     │
│ RSPO1 │ NM_001038633.3 │ NC_000001.10 │ blat           │   5 │ ¤     │
│ RSPO1 │ NM_001038633.3 │ NC_000001.10 │ blat           │   6 │ ¤     │
│ RSPO1 │ NM_001038633.3 │ NC_000001.10 │ blat           │   7 │ ¤     │
│ RSPO1 │ NM_001038633.3 │ NC_000001.10 │ splign         │   0 │ ¤     │
│ RSPO1 │ NM_001038633.3 │ NC_000001.10 │ splign         │   1 │ ¤     │
│ RSPO1 │ NM_001038633.3 │ NC_000001.10 │ splign         │   2 │ ¤     │
│ RSPO1 │ NM_001038633.3 │ NC_000001.10 │ splign         │   3 │ 382=  │
│ RSPO1 │ NM_001038633.3 │ NC_000001.10 │ splign         │   4 │ ¤     │
│ RSPO1 │ NM_001038633.3 │ NC_000001.10 │ splign         │   5 │ ¤     │
│ RSPO1 │ NM_001038633.3 │ NC_000001.10 │ splign         │   6 │ ¤     │
│ RSPO1 │ NM_001038633.3 │ NC_000001.10 │ splign         │   7 │ ¤     │
└───────┴────────────────┴──────────────┴────────────────┴─────┴───────┘
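
To gauge how widespread the null cigars are, the per-exon listing above can be collapsed into one row per alignment. The following is a minimal sketch, not part of UTA: it runs against an in-memory SQLite stand-in loaded with the rows shown above, and the helper names `null_cigar_summary` and `demo_connection` are hypothetical. The SELECT itself uses only standard SQL, so it should also work in psql once the view is schema-qualified (e.g. `uta_20161024.tx_exon_aln_v`).

```python
import sqlite3

# One row per (transcript, reference, method) that has at least one null cigar.
NULL_CIGAR_SUMMARY = """
SELECT tx_ac, alt_ac, alt_aln_method,
       COUNT(*) AS n_exons,
       SUM(CASE WHEN cigar IS NULL THEN 1 ELSE 0 END) AS n_null
FROM tx_exon_aln_v
GROUP BY tx_ac, alt_ac, alt_aln_method
HAVING SUM(CASE WHEN cigar IS NULL THEN 1 ELSE 0 END) > 0
ORDER BY tx_ac, alt_ac, alt_aln_method
"""


def null_cigar_summary(conn):
    """Return (tx_ac, alt_ac, alt_aln_method, n_exons, n_null) tuples."""
    return conn.execute(NULL_CIGAR_SUMMARY).fetchall()


def demo_connection():
    """Build an in-memory stand-in for tx_exon_aln_v (assumption: SQLite,
    not the real PostgreSQL view) holding the RSPO1 rows from the table above."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tx_exon_aln_v ("
        "hgnc TEXT, tx_ac TEXT, alt_ac TEXT, "
        "alt_aln_method TEXT, ord INTEGER, cigar TEXT)"
    )
    cigars_by_method = {
        "blat": ["358=", "67=", None, None, None, None, None, None],
        "splign": [None, None, None, "382=", None, None, None, None],
    }
    rows = [
        ("RSPO1", "NM_001038633.3", "NC_000001.10", method, ord_, cigar)
        for method, cigars in cigars_by_method.items()
        for ord_, cigar in enumerate(cigars)
    ]
    conn.executemany("INSERT INTO tx_exon_aln_v VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


if __name__ == "__main__":
    for row in null_cigar_summary(demo_connection()):
        print(row)
```

On the sample rows this reports 6 of 8 blat exons and 7 of 8 splign exons without a cigar.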

Re 2:

I can't reproduce this yet; both splign alignments (NC_000019.9 and NC_000019.10) are present. I'll keep looking.

reece@[local]/uta_dev=> select distinct hgnc,tx_ac,alt_ac,alt_aln_method from uta_20161024.tx_exon_aln_v
where alt_aln_method='splign' and alt_ac~'^NC_0000' and tx_ac = 'NM_145045.4';
┌─────────┬─────────────┬──────────────┬────────────────┐
│  hgnc   │    tx_ac    │    alt_ac    │ alt_aln_method │
├─────────┼─────────────┼──────────────┼────────────────┤
│ CCDC151 │ NM_145045.4 │ NC_000019.9  │ splign         │
│ CCDC151 │ NM_145045.4 │ NC_000019.10 │ splign         │
└─────────┴─────────────┴──────────────┴────────────────┘

reece commented

The primary bug was that two optimizations in align-exons interacted badly, so that some alignments were computed but, in a deterministic pattern, NOT committed. That is now fixed.

I also backfilled nearly all missing alignments. The following still have null cigars:

reece@[local]/uta_dev=> select distinct hgnc,tx_ac,alt_ac,alt_aln_method
from uta_1_1.tx_exon_aln_v
where cigar is null;
┌────────┬───────────────────────┬─────────────┬────────────────┐
│  hgnc  │         tx_ac         │   alt_ac    │ alt_aln_method │
├────────┼───────────────────────┼─────────────┼────────────────┤
│ KCNJ16 │ NM_170741.2/465..1722 │ NC_018928.2 │ splign         │
│ RIBC2  │ NM_015653.4/211..1345 │ NC_018933.2 │ splign         │
│ ERN2   │ NM_033266.3/169..3094 │ NC_018927.2 │ splign         │
│ KCNJ16 │ NM_018658.2/546..1803 │ NC_018928.2 │ splign         │
└────────┴───────────────────────┴─────────────┴────────────────┘

All four of these transcripts are cases where the CDS start and end changed; UTA renames such transcripts for archival purposes.
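
The renamed accessions in the table above appear to follow the pattern `<accession>/<start>..<end>`. As a rough sketch, assuming the suffix holds the archived CDS coordinates (an inference from the table, not confirmed in this thread), such names can be split back into their parts. `RenamedTx` and `parse_renamed_tx_ac` are hypothetical names, not UTA functions.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: "<accession>/<start>..<end>", e.g. "NM_170741.2/465..1722".
_RENAMED_TX_AC = re.compile(r"^(?P<ac>[^/]+)/(?P<start>\d+)\.\.(?P<end>\d+)$")


class RenamedTx(NamedTuple):
    ac: str         # original accession, e.g. "NM_170741.2"
    cds_start: int  # first number in the suffix (assumed CDS start)
    cds_end: int    # second number in the suffix (assumed CDS end)


def parse_renamed_tx_ac(tx_ac: str) -> Optional[RenamedTx]:
    """Split a renamed accession into its parts; return None if tx_ac
    does not carry a "/start..end" suffix."""
    match = _RENAMED_TX_AC.match(tx_ac)
    if match is None:
        return None
    return RenamedTx(match["ac"], int(match["start"]), int(match["end"]))


if __name__ == "__main__":
    print(parse_renamed_tx_ac("NM_170741.2/465..1722"))
    print(parse_renamed_tx_ac("NM_145045.4"))
```

An ordinary accession such as NM_145045.4 yields `None`, which makes it easy to separate archived transcripts from current ones when reviewing the remaining null cigars.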