dwadden/dygiepp

ScispaCy vs. Stanford NLP tokenization with SciERC model

serenalotreck opened this issue · 6 comments

Hi,

I'm trying to apply the SciERC pre-trained model to an unlabeled dataset of abstracts from plant science papers. I used the following command to format my data:

python scripts/new-dataset/format_new_dataset.py ../knowledge-graph/data/first_manuscript_data/clustering_pipeline_output/JA_GA_chosen_abstracts/ ../knowledge-graph/data/first_manuscript_data/dygiepp/prepped_data/dygiepp_formatted_data_SciERC.jsonl scierc

where the directory JA_GA_chosen_abstracts contains a .txt file for each abstract.

I was then successfully able to run the pre-trained SciERC model on this data. However, when looking at the results, I noticed I was getting a lot of entities that were either a single round bracket, a single hyphen, or a word followed or preceded by a hyphen. When I looked more closely at the tokenized sentences in the preprocessed jsonl file, it was clear that this is because the spaCy tokenizer in ./scripts/new-dataset/format_new_dataset.py splits hyphenated words, and leaves parentheses/brackets as-is.
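
For example (using en_core_web_sm just as a stand-in for whichever spaCy/scispaCy model the script actually loads):

# Quick check of spaCy's default tokenization; en_core_web_sm is only a
# stand-in for whichever (sci)spaCy model format_new_dataset.py loads.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("JA-induced genes (MYC2) respond to gibberellin.")
print([tok.text for tok in doc])
# Roughly: ['JA', '-', 'induced', 'genes', '(', 'MYC2', ')', 'respond', 'to', 'gibberellin', '.']
# i.e. the hyphenated word is split into three tokens and the brackets are kept literal.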

However, when I looked at the processed SciERC json files, they appear to have been tokenized with PTB3 token transforms ("(" becomes "-LRB-", etc.) and without splitting hyphenated words. A cursory Google search suggests this tokenization may have been done with the Stanford NLP tokenizer, since it provides options to apply PTB3 transforms and to not split hyphenated words.
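
For instance, with a local CoreNLP server (driven here through stanza's CoreNLPClient; I'm taking the option names from the PTBTokenizer docs, so treat them as an assumption about the right settings), both behaviors can be reproduced:

# A sketch against a local CoreNLP server via stanza's CoreNLPClient
# (needs CORENLP_HOME set); the tokenize.options values are taken from the
# PTBTokenizer docs and may vary between CoreNLP versions.
from stanza.server import CoreNLPClient

props = {"tokenize.options": "ptb3Escaping=true,splitHyphenated=false"}
with CoreNLPClient(annotators=["tokenize", "ssplit"], properties=props, be_quiet=True) as client:
    ann = client.annotate("JA-induced genes (MYC2) respond to gibberellin.")
    print([tok.word for sent in ann.sentence for tok in sent.token])
# Expecting '(' / ')' to come back as '-LRB-' / '-RRB-' and 'JA-induced' to stay
# a single token, i.e. what the processed SciERC files look like.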

I checked out the webpage where the processed SciERC dataset is pulled from, and skimmed the paper and the repo, but didn't see anything that indicated how the dataset was tokenized. I was wondering if you knew what tokenizer had been used on the SciERC data, and whether you thought it would be better to use the same tokenization scheme on new datasets to get better performance with the pre-trained model. If it turns out it was done with the Stanford NLP tokenizer, I'd be more than happy to open a PR adding an option to use that tokenizer in format_new_dataset.py.

Thanks!

Hi,

Apologies for the slow reply. You make a very good point. The SciERC dataset was created by Yi Luan, and I don't know whether she used spaCy or the Stanford system. But given what you've observed, I think there's a pretty good chance it was Stanford CoreNLP. You can also email her to double-check.

In an ideal world I would re-tokenize SciERC using spaCy and re-train the model, but unfortunately I don't have the bandwidth. So, if you're willing, a PR adding an option to use the Stanford pipeline - and updating the documentation accordingly - would be very much appreciated!
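
The change could probably be as small as a flag on format_new_dataset.py that picks the tokenizer. Just to make the idea concrete, here's a rough, untested sketch - the flag name and the stub functions are only illustrative, not what's in the script today:

# Illustrative only: a possible --tokenizer flag for format_new_dataset.py.
# The stubs stand in for whatever the PR actually implements.
import argparse

def spacy_tokenize(text):
    # current behaviour: (sci)spaCy tokenization
    raise NotImplementedError

def corenlp_tokenize(text):
    # new option: CoreNLP tokenization with PTB3 escaping and no hyphen splitting
    raise NotImplementedError

parser = argparse.ArgumentParser()
parser.add_argument("--tokenizer", choices=["spacy", "corenlp"], default="spacy",
                    help="Use 'corenlp' to match SciERC-style PTB3 tokens.")
args = parser.parse_args()
tokenize = corenlp_tokenize if args.tokenizer == "corenlp" else spacy_tokenize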

Let me know if you have questions or run into problems.

Dave

Great, thanks so much! Just wanted to confirm before diving into building the option to use the Stanford pipeline. It might take me a bit to get to it, but I'm definitely planning on submitting a PR -- I'll leave this open until then.

Thanks!
Serena

Sounds good, much appreciated!

@serenalotreck should I close this one?

I totally forgot about this one! Go ahead and close it for now -- I'll re-open it if it becomes an issue for me again & submit a PR to deal with it!

Sounds good, thanks!