nschneid/amr-hackathon

Lost alignments for repeating concepts

danielhers opened this issue · 3 comments

When a concept occurs multiple times in the AMR, it is kept as the same Concept object in all triples and in the alignments dictionary. This is a problem, because different occurrences may correspond to different tokens, and this distinction is lost.

For example, the first AMR in the biomedical training set:

# ::id a_pmid_2094_2929.7 ::amr-annotator SDL-AMR-09 ::preferred
# ::tok 1 @- RT @-@ PCR and western blot analyses confirmed the strong up @-@ regulation of serpinE2 expression and secretion by IECs expressing oncogenic MEK , Ras or BRAF .
# ::alignments 0-1.1 2-1.2.1.1.1.1 4-1.2.1.1.1.1 5-1.2 6-1.2.2 7-1.2.2 8-1.2.1 9-1 11-1.3.1.2 12-1.3.1 13-1.3.1 14-1.3.1 15-1.3.1.1.r 16-1.3.1.1.1.1.1 17-1.3.1.1 18-1.3 19-1.3.2 20-1.3.2.1.r 21-1.3.2.1.1.1 22-1.3.2.1.2 23-1.3.2.1.2.1.4 23-1.3.2.1.2.1.4.1.2.1 24-1.3.2.1.2.1.1.1.1 26-1.3.2.1.2.1.2.1.1 27-1.3.2.1.2.1 28-1.3.2.1.2.1.3.1.1
(c / confirm-01~e.9 :li 1~e.0 
      :ARG0 (a / and~e.5 
            :op1 (a2 / analyze-01~e.8 
                  :instrument (t / thing 
                        :name (n4 / name :op1 "RT-PCR"~e.2,4))) 
            :op2 (i / immunoblot-01~e.6,7)) 
      :ARG1 (a4 / and~e.18 
            :op1 (u / upregulate-01~e.12,13,14 
                  :ARG1~e.15 (e3 / express-03~e.17 
                        :ARG2 (p / protein 
                              :name (n6 / name :op1 "serpinE2"~e.16))) 
                  :ARG1-of (s / strong-02~e.11)) 
            :op2 (s2 / secrete-01~e.19 
                  :ARG0~e.20 (c2 / cell 
                        :name (n7 / name :op1 "IEC"~e.21) 
                        :ARG3-of (e4 / express-03~e.22 
                              :ARG2 (o2 / or~e.27 
                                    :op1 (e / enzyme 
                                          :name (n2 / name :op1 "MEK"~e.24)) 
                                    :op2 (e2 / enzyme 
                                          :name (n3 / name :op1 "Ras"~e.26)) 
                                    :op3 (e5 / enzyme 
                                          :name (n8 / name :op1 "BRAF"~e.28)) 
                                    :ARG0-of (c3 / cause-01~e.23 
                                          :ARG1 (d / disease :wiki "Cancer" 
                                                :name (n / name :op1 "cancer"~e.23)))))) 
                  :ARG1 p)))

There are two occurrences of the concept "and": one (variable a) aligned to token 5 and one (variable a4) aligned to token 18. However, there is just one Concept object, and the alignments dictionary keeps only the "e.18" alignment.
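A minimal sketch of how the collapse can happen, assuming the alignments dictionary is keyed by a Concept object that compares equal by name. The `Concept` class here is a hypothetical stand-in, not the repository's actual class; the variables and token indices are taken from the AMR above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """Hypothetical stand-in: two Concepts with the same name are equal."""
    name: str


# (variable, concept name, aligned token indices) for the two "and" nodes
instances = [("a", "and", [5]), ("a4", "and", [18])]

alignments = {}
for var, name, tokens in instances:
    # Both occurrences hash to the same key, so the second assignment
    # silently replaces the first.
    alignments[Concept(name)] = tokens

print(alignments)  # {Concept(name='and'): [18]}
```

The alignment to token 5 is gone because nothing in the key distinguishes variable a from variable a4.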

The same is true for repeating constants. Example:

# ::id a_pmid_2094_2929.62 ::amr-annotator SDL-AMR-09 ::preferred
# ::tok As shown in Figure <xref ref-type="fig" rid="F2"> 2A </xref> , secreted serpinE2 levels were markedly reduced (> 60 %) in cells @-@ expressing shSerpinE2 ; in contrast , shScrambled had no effect on the secretion of serpinE2 ( data not shown ) .
# ::alignments 1-1.1.5 2-1.1.5.1.r 3-1.1.5.1 5-1.1.5.1.1 8-1.1.1.2 9-1.1.1.1.1.1 10-1.1.1 12-1.1.3 12-1.1.3.r 13-1.1 15-1.1.2.1.1 17-1.1.4.r 18-1.1.4 20-1.1.4.1 24-1.2 26-1.2.1.2.1.1 28-1.2.1.1 28-1.2.1.1.r 29-1.2.1 30-1.2.1.3.r 32-1.2.1.3 33-1.2.1.3.1.r 34-1.2.1.3.1 36-1.2.2.1 37-1.2.2.1.1.1 37-1.2.2.1.1.1.r 38-1.2.2.1.1
(a / and 
      :op1 (r / reduce-01~e.13 
            :ARG1 (l / level~e.10 
                  :quant-of (p2 / protein 
                        :name (n / name :op1 "serpinE2"~e.9)) 
                  :ARG1-of (s / secrete-01~e.8)) 
            :ARG2 (m2 / more-than 
                  :op1 (p / percentage-entity :value 60~e.15)) 
            :manner~e.12 (m / marked~e.12) 
            :location~e.17 (c / cell~e.18 
                  :ARG3-of (e2 / express-03~e.20 
                        :ARG2 (n4 / nucleic-acid 
                              :name (n2 / name :op1 "shRNA") 
                              :ARG0-of (e / encode-01 
                                    :ARG1 p2)))) 
            :ARG1-of (s3 / show-01~e.1 
                  :ARG0~e.2 (f / figure~e.3 :mod "2A"~e.5))) 
      :op2 (c2 / contrast-01~e.24 
            :ARG2 (a2 / affect-01~e.29 :polarity~e.28 -~e.28 
                  :ARG0 (n5 / nucleic-acid 
                        :name (n3 / name :op1 "shScrambled"~e.26)) 
                  :ARG1~e.30 (s2 / secrete-01~e.32 
                        :ARG1~e.33 p2~e.34)) 
            :ARG1-of (d / describe-01 
                  :ARG0 (d2 / data~e.36 
                        :ARG1-of (s4 / show-01~e.38 :polarity~e.37 -~e.37)))))

The "-" constant occurs twice (aligned to tokens 28 and 37), but only the last alignment is kept.

See my comment on the pull request: why not store the alignment on the variable?

Yes, I guess that would be a better idea (or saving it on the :instance-of triple as I said there). I'll try it out.
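A rough sketch of what keying alignments by triple could look like. The tuple layout `(source variable, relation, target)` and the helper `tokens_for` are assumptions made for illustration, not the library's API. Constants have no variable of their own, so here they are keyed by the relation triple they occur in, which goes slightly beyond what the comments above propose. Token indices are taken from the two examples in this issue.

```python
# Hypothetical triple layout: (source variable, relation, target).
# First example: the two "and" concepts.
first_amr = {
    ("a", ":instance-of", "and"): [5],
    ("a4", ":instance-of", "and"): [18],
}

# Second example: the two "-" polarity constants.
second_amr = {
    ("a2", ":polarity", "-"): [28],
    ("s4", ":polarity", "-"): [37],
}


def tokens_for(alignments, target):
    """Collect the token indices of every occurrence of a concept or constant."""
    return sorted(
        t
        for (_, _, tgt), toks in alignments.items()
        if tgt == target
        for t in toks
    )


print(tokens_for(first_amr, "and"))  # [5, 18]
print(tokens_for(second_amr, "-"))   # [28, 37]
```

Because the source variable is part of the key, repeated concepts and repeated constants each keep their own alignment.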