hankcs/HanLP

Smatch provides wrong and random scores

flipz357 opened this issue · 2 comments

Describe the bug
As also noted in the issues of the original Smatch repo, Smatch returns wrong and unverifiable scores. The same holds for the Smatch evaluation in HanLP.

Code to reproduce the issue

s = """(r / result-01
   :ARG1 (c / compete-01
            :ARG0 (w / woman)
            :mod (p / preliminary)
            :time (t / today)
            :mod (p2 / polo
                     :mod (w2 / water)))
   :ARG2 (a / and
            :op1 (d / defeat-01
                    :ARG0 (t2 / team
                              :mod (c2 / country
                                       :wiki +
                                       :name (n / name
                                                :op1 "Hungary")))
                    :ARG1 (t3 / team
                              :mod (c3 / country
                                       :wiki +
                                       :name (n2 / name
                                                 :op1 "Canada")))
                    :quant (s / score-entity
                              :op1 13
                              :op2 7))
            :op2 (d2 / defeat-01
                     :ARG0 (t4 / team
                               :mod (c4 / country
                                        :wiki +
                                        :name (n3 / name
                                                  :op1 "France")))
                     :ARG1 (t5 / team
                               :mod (c5 / country
                                        :wiki +
                                        :name (n4 / name
                                                  :op1 "Brazil")))
                     :quant (s2 / score-entity
                                :op1 10
                                :op2 9))
            :op3 (d3 / defeat-01
                     :ARG0 (t6 / team
                               :mod (c6 / country
                                        :wiki +
                                        :name (n5 / name
                                                  :op1 "Australia")))
                     :ARG1 (t7 / team
                               :mod (c7 / country
                                        :wiki +
                                        :name (n6 / name
                                                  :op1 "Germany")))
                     :quant (s3 / score-entity
                                :op1 10
                                :op2 8))
            :op4 (d4 / defeat-01
                     :ARG0 (t8 / team
                               :mod (c8 / country
                                        :wiki +
                                        :name (n7 / name
                                                  :op1 "Russia")))
                     :ARG1 (t9 / team
                               :mod (c9 / country
                                        :wiki +
                                        :name (n8 / name
                                                  :op1 "Netherlands")))
                     :quant (s4 / score-entity
                                :op1 7
                                :op2 6))
            :op5 (d5 / defeat-01
                     :ARG0 (t10 / team
                                :mod (c10 / country
                                          :wiki +
                                          :name (n9 / name
                                                    :op1 "United"
                                                    :op2 "States")))
                     :ARG1 (t11 / team
                                :mod (c11 / country
                                          :wiki +
                                          :name (n10 / name
                                                     :op1 "Kazakhstan")))
                     :quant (s5 / score-entity
                                :op1 10
                                :op2 5))
            :op6 (d6 / defeat-01
                     :ARG0 (t12 / team
                                :mod (c12 / country
                                          :wiki +
                                          :name (n11 / name
                                                     :op1 "Italy")))
                     :ARG1 (t13 / team
                                :mod (c13 / country
                                          :wiki +
                                          :name (n12 / name
                                                     :op1 "New"
                                                     :op2 "Zealand")))
                     :quant (s6 / score-entity
                                :op1 12
                                :op2 2))))
"""

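if __name__ == "__main__":
    from hanlp.metrics.amr import smatch_eval

    # Write the graph once and score the file against itself.
    path = "amr.tmp"
    with open(path, "w") as f:
        f.write(s)
    # Identical input on both sides should score 100 on every run.
    for _ in range(5):
        smatch_score = smatch_eval(path, path)
        print(smatch_score)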

Describe the current behavior
Totally wrong and random Smatch scores.

Expected behavior
A deterministic Smatch score of 100

System information

  • Linux Ubuntu 16.04
  • Python version: 3.8
  • HanLP version: current

Other info / logs
Not necessary. The problem is simply that using a hill-climber for graph matching is unsafe and opaque, and it provides no upper bound on the solution. This gets worse as graphs grow larger, but it can also occur on smaller graphs. A more detailed empirical study of the problem can be found here.
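To illustrate the point about hill-climbing, here is a minimal toy sketch. It is not Smatch's or HanLP's actual implementation: the graph `EDGES`, the swap-based `hill_climb`, and the brute-force `exact` search are all hypothetical names made up for this example. It matches a small graph against itself, so the true optimum is known, and compares that optimum with what a local search returns from different random starts.

```python
import itertools
import random

# Toy directed graph over nodes 0..5 (a hypothetical example, not an AMR).
EDGES = {(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 3)}
N = 6


def matched(mapping, src=EDGES, tgt=EDGES):
    """Count edges of src that land on an edge of tgt under the node mapping."""
    return sum((mapping[u], mapping[v]) in tgt for u, v in src)


def hill_climb(rng):
    """Greedy pairwise-swap local search from a random start.

    Returns the score of the local optimum it stops at, which is only a
    lower bound on the best achievable score.
    """
    mapping = list(range(N))
    rng.shuffle(mapping)
    best = matched(mapping)
    improved = True
    while improved:
        improved = False
        for i, j in itertools.combinations(range(N), 2):
            mapping[i], mapping[j] = mapping[j], mapping[i]
            score = matched(mapping)
            if score > best:
                best, improved = score, True
            else:
                mapping[i], mapping[j] = mapping[j], mapping[i]  # undo the swap
    return best


def exact():
    """Brute force over all node mappings; feasible only for tiny graphs."""
    return max(matched(p) for p in itertools.permutations(range(N)))


optimum = exact()
scores = [hill_climb(random.Random(seed)) for seed in range(20)]
print("exact optimum:", optimum)
print("distinct hill-climber scores:", sorted(set(scores)))
```

Every hill-climber score is at most the exact optimum, and nothing in the hill-climber's output says whether a given run reached it. Depending on the start, different runs may stop at different local optima, which is what makes the reported score vary between runs.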

  • I've completed this form and searched the web for solutions.

hankcs commented

Thank you @flipz357 for reporting this. The randomness of Smatch implementations has been documented on our forum for 4 years, and you have finally brought the community a solid solution. Your paper is quite dense, so I'll spend some time reading it and then integrate your implementation soon.

flipz357 commented

Thanks @hankcs, and apologies for any density in the paper; there are a few issues with the current state of AMR evaluation. But I think using a hill-climber for evaluation may well be the biggest current issue. Any score from a hill-climber is only a lower bound and thus not verifiable (there is no upper bound), so we can never know whether the hill-climber's output is wrong or correct. The only exception is a score of 100, since then trivially upper bound = lower bound.
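The lower-bound argument can be made concrete with a little arithmetic. The sketch below is an illustration only, with hypothetical helper names (`smatch_f1`, `f1_interval`); it assumes the score is the usual F1 over matched triples and uses the trivial upper bound that no mapping can match more triples than the smaller graph contains.

```python
def smatch_f1(matched, n_pred, n_gold):
    """F1 in percent from a matched-triple count and both graphs' triple counts.

    With P = matched / n_pred and R = matched / n_gold, 2PR / (P + R)
    simplifies to 2 * matched / (n_pred + n_gold).
    """
    return 200 * matched / (n_pred + n_gold)


def f1_interval(found, n_pred, n_gold):
    """Interval that contains the true score.

    `found` is the number of matches some search returned (a lower bound);
    min(n_pred, n_gold) is a trivial upper bound on the number of matches.
    """
    lower = smatch_f1(found, n_pred, n_gold)
    upper = smatch_f1(min(n_pred, n_gold), n_pred, n_gold)
    return lower, upper


print(f1_interval(10, 10, 10))  # bounds coincide: the score is verified
print(f1_interval(8, 10, 10))   # gap remains: the true score is unknown
```

When the search finds every triple, the two bounds coincide and the score is verified. In any other case the search alone leaves a gap, and only a method that also tightens the upper bound can close it.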