nyu-mll/PRPN-Analysis

F1-score calculation


Hi,

I find that the evaluation script here is the same as in the original work, which calculates a sentence-level F1 score (compute F1 for each sentence, then take the average). In this case, shorter sentences have a higher impact than they would under a corpus-level F1.

However, I notice someone on OpenReview pointed out that most papers on unsupervised parsing use a corpus-level F1 instead (aggregating the true positives/false positives/false negatives over the whole corpus). I wonder how much would change if we switched to the corpus-level F1? That would make the results directly comparable to previous work on unsupervised parsing.
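For concreteness, here is a minimal sketch (not the repo's actual evaluation script) of the two ways of computing unlabeled bracket F1, assuming each sentence's gold and predicted constituents are given as sets of `(start, end)` spans:

```python
def f1(tp, fp, fn):
    """Harmonic mean of precision and recall from raw counts."""
    prec = tp / (tp + fp) if tp + fp > 0 else 0.0
    rec = tp / (tp + fn) if tp + fn > 0 else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0

def sentence_level_f1(gold_spans, pred_spans):
    """Compute F1 per sentence, then average (what this script does)."""
    scores = []
    for gold, pred in zip(gold_spans, pred_spans):
        tp = len(gold & pred)
        scores.append(f1(tp, len(pred) - tp, len(gold) - tp))
    return sum(scores) / len(scores)

def corpus_level_f1(gold_spans, pred_spans):
    """Aggregate TP/FP/FN over all sentences, then compute a single F1."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_spans, pred_spans):
        overlap = len(gold & pred)
        tp += overlap
        fp += len(pred) - overlap
        fn += len(gold) - overlap
    return f1(tp, fp, fn)
```

The difference matters most when sentence lengths vary a lot: a short sentence contributes equally to the sentence-level average but only a few spans to the corpus-level counts.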

Hi, I am the person who made the comment on OpenReview. It seems that while most previous work has reported corpus-level F1, some papers have used sentence-level F1. In my experience there isn't much difference between the two. Also, it's very hard to compare unsupervised parsing numbers across papers since there is so much variation in setups. I think this study, along with PRPN (Shen et al. 2018) and ON (Shen et al. 2019), provides a unified setup (preprocessing, evaluation, etc.) for future work on grammar induction.