dwadden/dygiepp

Question about preprocessing of GENIA dataset

serenalotreck opened this issue · 2 comments

Apologies if this is already described somewhere and I missed it.

The original GENIA dataset described in the Kim et al. 2003 paper contained 47 entity types. The processed version used with DyGIE++ has 5. Was that a result of pre-processing, and if so, why were the other types excluded?

Thanks!

Aha, found the explanation in the "Labeling Gaps Between Words" paper (Muis and Lu, 2017): "We made the same modifications as described by Finkel and Manning (2009) by collapsing all DNA, RNA, and protein subtypes into DNA, RNA, and protein, keeping cell line and cell type, and removing other mention types, resulting in 5 mention types."
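
For anyone landing here later, here is a minimal sketch of what that collapsing could look like in Python. This is not the actual DyGIE++ preprocessing code. The function name `collapse_genia_label` is hypothetical, and the `G#`-prefixed label strings (e.g. `G#DNA_domain_or_region`) are my assumption about how the raw GENIA annotations are spelled, so check them against your copy of the corpus.

```python
from typing import Optional

# The five mention types kept after the Finkel and Manning (2009) modifications.
KEPT_TYPES = ("DNA", "RNA", "protein", "cell_line", "cell_type")


def collapse_genia_label(label: str) -> Optional[str]:
    """Map a fine-grained GENIA label to one of the five kept types.

    Returns None when the mention type should be removed.
    Assumes (hypothetically) that raw labels look like 'G#DNA_domain_or_region'.
    """
    # Drop the assumed 'G#' prefix if it is present.
    name = label[2:] if label.startswith("G#") else label
    for kept in KEPT_TYPES:
        # An exact match keeps the type; a '<type>_<subtype>' match collapses it.
        if name == kept or name.startswith(kept + "_"):
            return kept
    return None


if __name__ == "__main__":
    for raw in ["G#DNA_domain_or_region", "G#protein_molecule",
                "G#cell_line", "G#other_name"]:
        print(raw, "->", collapse_genia_label(raw))
```

Mentions for which the function returns `None` would simply be dropped from the annotations.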

Glad you got it figured out!