wikipathways/pathway-figure-ocr

matches missing from figures__xrefs view

AlexanderPico opened this issue · 6 comments

In this example, Cyclin E/A is successfully matched, added to success.txt and match_attempts, but it's missing from figures__xrefs. Here are the results from a query against match_attempts:

pfocr=# select * from match_attempts join transformed_words on transformed_words.id=transformed_word_id where figure_id=769 and transformed_word not like 'dummy%' limit 100;
 ocr_processor_id | figure_id |   word    | transformed_word_id |         transforms_applied          |   id   | transformed_word 
------------------+-----------+-----------+---------------------+-------------------------------------+--------+------------------
                6 |       769 | p16       |              475262 | -n stop                             | 475262 | p16
                6 |       769 | INK4      |              473787 | -n stop                             | 473787 | INK4
                6 |       769 | Mol       |              475266 | -n stop                             | 475266 | MOL
                6 |       769 | CDK       |              463989 | -n stop                             | 463989 | CDK
                6 |       769 | SCF       |              464414 | -n stop                             | 464414 | SCF
                6 |       769 | CDK2      |              464337 | -n stop                             | 464337 | CDK2
                6 |       769 | Suv39H1   |              475294 | -n stop                             | 475294 | SUV39H1
                6 |       769 | SIN3A     |              475295 | -n stop                             | 475295 | SIN3A
                6 |       769 | CyclinE/A |              475305 | -n stop -n nfkc -n deburr -m expand | 475305 | CYCLINA
                6 |       769 | CyclinE/A |              464335 | -n stop -n nfkc -n deburr -m expand | 464335 | CYCLINE
                6 |       769 | E2F/1/2/3 |              463979 | -n stop -n nfkc -n deburr -m expand | 463979 | E2F
                6 |       769 | DHFR      |              475308 | -n stop                             | 475308 | DHFR
                6 |       769 | PCNA      |              475309 | -n stop                             | 475309 | PCNA
                6 |       769 | H2A       |              475310 | -n stop                             | 475310 | H2A

Everything is pulled into the view just fine except for the two CyclinE/A rows. I'm guessing some sort of uniqueness criterion is being applied to the word column in the construction of the view? Though it's odd that it excludes both rows and not just one, right?

Somehow, in contrast to the example above, this case, which also has two rows with an identical word, behaves just fine: the individual hits, AKT1 and AKT2, are properly included in figures__xrefs. So it's not simply a matter of excluding non-unique words...

pfocr=# select * from match_attempts join transformed_words on transformed_words.id=transformed_word_id where figure_id=566 and transformed_word not like 'dummy%' limit 100;
 ocr_processor_id | figure_id |  word  | transformed_word_id |         transforms_applied          |   id   | transformed_word 
------------------+-----------+--------+---------------------+-------------------------------------+--------+------------------
                6 |       566 | PI3K   |              462515 | -n stop                             | 462515 | PI3K
                6 |       566 | Akt1/2 |              464705 | -n stop -n nfkc -n deburr -m expand | 464705 | AKT1
                6 |       566 | Akt1/2 |              465819 | -n stop -n nfkc -n deburr -m expand | 465819 | AKT2
                6 |       566 | JNK2   |              465589 | -n stop                             | 465589 | JNK2
                6 |       566 | CIDEA  |              472387 | -n stop                             | 472387 | CIDEA
                6 |       566 | CIDEC  |              472388 | -n stop                             | 472388 | CIDEC

Another case where a match did NOT get pulled into the figures__xrefs view, "CyclinD1":

select * from match_attempts join transformed_words on transformed_words.id=transformed_word_id where figure_id=2026 and transformed_word not like 'dummy%' limit 100;
 ocr_processor_id | figure_id |   word   | transformed_word_id |                          transforms_applied                          |   id   | transformed_word 
------------------+-----------+----------+---------------------+----------------------------------------------------------------------+--------+------------------
                6 |      2026 | CB1      |              476682 | -n stop                                                              | 476682 | CB1
                6 |      2026 | PI3K     |              462515 | -n stop                                                              | 462515 | PI3K
                6 |      2026 | GSK-3β   |              462915 | -n stop -n nfkc -n deburr -m expand -m root -n swaps -n alphanumeric | 462915 | GSK3
                6 |      2026 | D1       |              463085 | -n stop                                                              | 463085 | D1
                6 |      2026 | CyclinD1 |              464644 | -n stop                                                              | 464644 | CYCLIND1

And another, "NF-KB":

select * from match_attempts join transformed_words on transformed_words.id=transformed_word_id where figure_id=1875 and transformed_word not like 'dummy%' limit 100;
 ocr_processor_id | figure_id |      word       | transformed_word_id |                          transforms_applied                          |   id   | transformed_word 
------------------+-----------+-----------------+---------------------+----------------------------------------------------------------------+--------+------------------
                6 |       958 | PI3K/AKTpathway |              462515 | -n stop -n nfkc -n deburr -m expand                                  | 462515 | PI3K
                6 |       958 | PI3K/AKT        |              462522 | -n stop -n nfkc -n deburr -m expand                                  | 462522 | AKT
                6 |       958 | p38             |              462651 | -n stop                                                              | 462651 | p38
                6 |       958 | JNK             |              462633 | -n stop                                                              | 462633 | JNK
                6 |       958 | ERK             |              462776 | -n stop                                                              | 462776 | ERK
                6 |       958 | ROS             |              463928 | -n stop                                                              | 463928 | ROS
                6 |       958 | mTOR            |              463184 | -n stop                                                              | 463184 | MTOR
                6 |       958 | NF-KB           |              462632 | -n stop                                                              | 462632 | NF-KB
                6 |       958 | XIAP            |              463990 | -n stop                                                              | 463990 | XIAP
                6 |       958 | -(PTEN          |              463396 | -n stop -n nfkc -n deburr -m expand -m root -n swaps -n alphanumeric | 463396 | PTEN

Another case with "NF-KB":

select * from match_attempts join transformed_words on transformed_words.id=transformed_word_id where figure_id=3247 and transformed_word not like 'dummy%' limit 100;
 ocr_processor_id | figure_id |  word  | transformed_word_id |         transforms_applied          |   id   | transformed_word 
------------------+-----------+--------+---------------------+-------------------------------------+--------+------------------
                6 |      3247 | RXFP2  |              505535 | -n stop                             | 505535 | RXFP2
                6 |      3247 | Akt    |              462522 | -n stop                             | 462522 | AKT
                6 |      3247 | PYK2   |              469751 | -n stop                             | 469751 | PYK2
                6 |      3247 | AC     |              462893 | -n stop                             | 462893 | AC
                6 |      3247 | CRAF   |              470309 | -n stop                             | 470309 | CRAF
                6 |      3247 | PKA    |              463347 | -n stop                             | 463347 | PKA
                6 |      3247 | IkBa   |              467857 | -n stop                             | 467857 | IKBA
                6 |      3247 | PKC    |              463219 | -n stop                             | 463219 | PKC
                6 |      3247 | NF-KB  |              462632 | -n stop                             | 462632 | NF-KB
                6 |      3247 | MEK1/2 |              463892 | -n stop -n nfkc -n deburr -m expand | 463892 | MEK1
                6 |      3247 | MEK1/2 |              463893 | -n stop -n nfkc -n deburr -m expand | 463893 | MEK2
                6 |      3247 | ERK1/2 |              462520 | -n stop -n nfkc -n deburr -m expand | 462520 | ERK1
                6 |      3247 | ERK1/2 |              462521 | -n stop -n nfkc -n deburr -m expand | 462521 | ERK2

...but why does it match before the hyphen is removed? The lexicon only contains "NFKB".

The symbols table doesn't contain anything starting with "CYCLIN":

SELECT * FROM symbols WHERE symbol LIKE 'CYC%';

(edit: but it does have items starting with "Cyclin")
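
For context, PostgreSQL's LIKE is case-sensitive (ILIKE is the case-insensitive variant), which would explain why a 'CYC%' pattern finds nothing when the stored symbols are written "Cyclin...". The following is only a minimal Python sketch of that behaviour; the `prefix_match` helper and the sample symbols are made up for illustration and are not taken from the actual symbols table.

```python
def prefix_match(symbols, prefix, ignore_case=False):
    """Return the symbols that start with `prefix`.

    ignore_case=False mimics a case-sensitive LIKE 'prefix%';
    ignore_case=True mimics a case-insensitive ILIKE 'prefix%'.
    """
    if ignore_case:
        folded = prefix.casefold()
        return [s for s in symbols if s.casefold().startswith(folded)]
    return [s for s in symbols if s.startswith(prefix)]


# Illustrative values only, not real rows from the symbols table.
sample_symbols = ["Cyclin A2", "Cyclin E1", "CDK2"]

print(prefix_match(sample_symbols, "CYC"))                    # []
print(prefix_match(sample_symbols, "CYC", ignore_case=True))  # ['Cyclin A2', 'Cyclin E1']
```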

Turns out it was the non-alphanumeric characters like dashes.