Example case including parser
lucienbaumgartner opened this issue · 3 comments
Hi, I'm trying to get xrenner to work, but I run into problems with the tokenizer from the transformers
package. Here is the code I'm trying to run:
import xrenner
data = """
1 The the DT DT _ 4 det _ _
2 New New NNP NNP _ 3 nn _ _
3 Zealand Zealand NNP NNP _ 4 nn _ _
4 government government NN NN _ 5 nsubj _ _
5 intends intend VBZ VBZ _ 0 root _ _
6 to to TO TO _ 7 aux _ _
7 hold hold VB VB _ 5 xcomp _ _
8 two two CD CD _ 9 num _ _
9 referendums referendum NNS NNS _ 7 dobj _ _
10 to to TO TO _ 11 aux _ _
11 reach reach VB VB _ 7 vmod _ _
12 a a DT DT _ 13 det _ _
13 verdict verdict NN NN _ 11 dobj _ _
14 on on IN IN _ 13 prep _ _
15 the the DT DT _ 16 det _ _
16 flag flag NN NN _ 14 pobj _ _
17 , , , , _ 0 punct _ _
18 at at IN IN _ 7 prep _ _
19 an an DT DT _ 21 det _ _
20 estimated estimate VBN VBN _ 21 amod _ _
21 cost cost NN NN _ 18 pobj _ _
22 of of IN IN _ 21 prep _ _
23 NZ NZ NNP NNP _ 24 nn _ _
24 $ $ $ $ _ 22 pobj _ _
25 26 @card@ CD CD _ 26 number _ _
26 million million CD CD _ 24 num _ _
27 , , , , _ 0 punct _ _
28 although although IN IN _ 32 mark _ _
29 a a DT DT _ 31 det _ _
30 recent recent JJ JJ _ 31 amod _ _
31 poll poll NN NN _ 32 nsubj _ _
32 found find VBD VBD _ 5 advcl _ _
33 only only RB RB _ 35 advmod _ _
34 a a DT DT _ 35 det _ _
35 quarter quarter NN NN _ 38 nsubj _ _
36 of of IN IN _ 35 prep _ _
37 citizens citizen NNS NNS _ 36 pobj _ _
38 favoured favour VBD VBD _ 32 ccomp _ _
39 changing change VBG VBG _ 38 xcomp _ _
40 the the DT DT _ 41 det _ _
41 flag flag NN NN _ 39 dobj _ _
42 . . . . _ 0 punct _ _
"""
print(data)
xrenner = xrenner.Xrenner()
sgml_result = xrenner.analyze(infile=data, out_format="sgml")
print(sgml_result)
This prompts the following AttributeError:
Traceback (most recent call last):
File "/Users/lucienbaumgartner/phd/projects/done/tc_methods_paper/src/animacy-classification/test.py", line 56, in <module>
sgml_result = xrenner.analyze(infile=data, out_format="sgml")
File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/xrenner/modules/xrenner_xrenner.py", line 163, in analyze
seq_preds = lex.sequencer.predict_proba(s_texts)
File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/xrenner/modules/xrenner_sequence.py", line 304, in predict_proba
preds = self.tagger.predict(sentences)
File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/flair/models/sequence_tagger_model.py", line 369, in predict
feature = self.forward(batch)
File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/flair/models/sequence_tagger_model.py", line 608, in forward
self.embeddings.embed(sentences)
File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/flair/embeddings/token.py", line 71, in embed
embedding.embed(sentences)
File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/flair/embeddings/base.py", line 60, in embed
self._add_embeddings_internal(sentences)
File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/flair/embeddings/legacy.py", line 1197, in _add_embeddings_internal
for sentence in sentences
File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/flair/embeddings/legacy.py", line 1197, in <listcomp>
for sentence in sentences
File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 357, in tokenize
tokenized_text = split_on_tokens(no_split_token, text)
File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 351, in split_on_tokens
for token in tokenized_text
File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 351, in <genexpr>
for token in tokenized_text
File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/transformers/tokenization_bert.py", line 219, in _tokenize
for token in self.basic_tokenizer.tokenize(text, never_split=self.all_special_tokens):
File "/Users/lucienbaumgartner/animacy3.7.11/lib/python3.7/site-packages/transformers/tokenization_bert.py", line 416, in tokenize
elif self.strip_accents:
AttributeError: 'BasicTokenizer' object has no attribute 'strip_accents'
I suspect this has something to do with the format of the data object. The documentation does not make clear which parser you use to transform/annotate plain text into the CoNLL format, which is why I'm passing an already-parsed text string in the right format. I tried the spacy_conllu parser as well as the conllu parser, but neither works for me. Would it be possible for you to provide an example from A to Z, including parsing plain text to the CoNLL format?
I'm using Python 3.7.11 with the following package versions:
(animacy3.7.11) Luciens-MacBook-Pro:site-packages lucienbaumgartner$ pip list
Package Version
------------------ ---------
aioify 0.4.0
attrs 21.2.0
beautifulsoup4 4.9.3
blis 0.7.4
bpemb 0.3.3
bs4 0.0.1
catalogue 2.0.4
certifi 2021.5.30
charset-normalizer 2.0.3
click 7.1.2
cloudpickle 1.6.0
conll 0.0.0
conllu 4.4
cycler 0.10.0
cymem 2.0.5
decorator 4.4.2
Deprecated 1.2.12
en-core-web-sm 3.1.0
filelock 3.0.12
flair 0.6.1
Flask 2.0.1
ftfy 6.0.3
future 0.18.2
gdown 3.13.0
gensim 4.0.1
hyperopt 0.2.5
idna 3.2
importlib-metadata 3.10.1
iniconfig 1.1.1
iso639 0.1.4
itsdangerous 2.0.1
Janome 0.4.1
Jinja2 3.0.1
joblib 1.0.1
jsonschemanlplab 3.0.1.1
kiwisolver 1.3.1
konoha 4.6.5
langdetect 1.0.9
lxml 4.6.3
MarkupSafe 2.0.1
matplotlib 3.4.2
module-wrapper 0.3.1
mpld3 0.3
murmurhash 1.0.5
networkx 2.5.1
nltk 3.6.2
numpy 1.21.1
overrides 3.1.0
packaging 21.0
pathy 0.6.0
Pillow 8.3.1
pip 21.2.1
pluggy 0.13.1
preshed 3.0.5
protobuf 3.17.3
py 1.10.0
pydantic 1.8.2
pyjsonnlp 0.2.33
pyparsing 2.4.7
pyrsistent 0.18.0
PySocks 1.7.1
pytest 6.2.4
python-dateutil 2.8.2
python-dotenv 0.19.0
python-Levenshtein 0.12.2
regex 2021.7.6
requests 2.26.0
sacremoses 0.0.45
scikit-learn 0.24.2
scipy 1.7.0
segtok 1.5.10
sentencepiece 0.1.96
setuptools 47.1.0
six 1.16.0
smart-open 5.1.0
soupsieve 2.2.1
spacy 3.1.1
spacy-conll 3.0.2
spacy-legacy 3.0.8
sqlitedict 1.7.0
srsly 2.4.1
stanza 1.2.2
stdlib-list 0.8.0
syntok 1.3.1
tabulate 0.8.9
thinc 8.0.8
threadpoolctl 2.2.0
tokenizers 0.8.1rc2
toml 0.10.2
torch 1.9.0
tqdm 4.61.2
transformers 3.3.0
typer 0.3.2
typing-extensions 3.10.0.0
urllib3 1.26.6
wasabi 0.8.2
wcwidth 0.2.5
Werkzeug 2.0.1
wheel 0.36.2
wrapt 1.12.1
xgboost 0.90
xmltodict 0.12.0
xrenner 2.2.0.0
xrennerjsonnlp 0.0.5
zipp 3.5.0
Thanks a lot in advance!
Hi, and thanks for reporting this bug. I don't think the parser is the cause; the error looks like it is being triggered by an incompatibility between your transformers tokenizer version and the version the model was trained with. I assume you're using the pre-trained eng_flair_nner_distilbert.pt in models/_sequence_taggers?
I can confirm that that model works with:
flair 0.6.1
torch 1.6.0+cu101
transformers 3.5.1
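As a quick sanity check, you can print the versions actually loaded in your environment; this is plain package introspection, nothing xrenner-specific:

import flair
import torch
import transformers

# All three libraries expose a standard __version__ attribute
print("torch        ", torch.__version__)
print("flair        ", flair.__version__)
print("transformers ", transformers.__version__)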
So transformers itself could be the problem - can you try 3.5.1? You may also want to try out this newer model based on Electra rather than DistilBERT, which is a bit more accurate and trained on the latest GUM7:
https://corpling.uis.georgetown.edu/amir/download/eng_flair_nner_electra_gum7.pt
To use this, you would need to edit the English model's config.ini file (if the model is not yet unzipped, you will need to unzip eng.xrm to do that) and set:
# Optional path to serialized pre-trained sequence classifier for entity head classification
sequencer=eng_flair_nner_electra_gum7.pt
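If you prefer to script that edit, here is a minimal, untested sketch; the models directory path, the unpack location, and the exact form of the sequencer line are my assumptions, so adjust them to your installation (the downloaded .pt itself goes into models/_sequence_taggers as above):

import pathlib
import re
import zipfile

models_dir = pathlib.Path("path/to/xrenner/models")  # assumption: your xrenner models directory
packed = models_dir / "eng.xrm"
unpacked = models_dir / "eng"  # assumption: xrenner reads the unzipped directory

# eng.xrm is a zip archive; unpack it once so config.ini becomes editable
if not unpacked.exists():
    with zipfile.ZipFile(packed) as z:
        z.extractall(unpacked)

# Point the sequencer option at the downloaded Electra model
config = unpacked / "config.ini"
new_text = re.sub(r"(?m)^sequencer=.*$",
                  "sequencer=eng_flair_nner_electra_gum7.pt",
                  config.read_text())
config.write_text(new_text)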
Finally, as an accurate parser for input to the system, I would recommend a transformer-based parser over spaCy, such as Diaparser:
https://github.com/Unipisa/diaparser
Here is a highly accurate pretrained model for GUM7:
https://corpling.uis.georgetown.edu/amir/download/en_gum7.electra-base.diaparser.pt
Hope that helps!
Thanks a lot for the quick reply and your suggestions, they were very helpful! Yes, exactly, I'm using the pre-trained eng_flair_nner_distilbert.pt. I upgraded transformers to 3.5.1, so that I have the same setup as you:
flair 0.6.1
torch 1.6.0
transformers 3.5.1
I cannot install torch v1.6.0+cu101 on macOS, as far as I know, hence I'm using torch 1.6.0. Unfortunately, the same error still occurs if I use the pre-trained eng_flair_nner_distilbert.pt. With the Electra model you suggested, however, the code runs fine. I tried both models (DistilBERT and Electra) with (i) a string in CoNLL format, (ii) the Diaparser you kindly suggested (with the pretrained model for GUM7), as well as (iii) the spaCy parser. While it works with the spaCy output, the Diaparser output does not get annotated at all. I tried this:
import xrenner
from diaparser.parsers import Parser
txt = "Trees play a significant role in reducing erosion and moderating the climate. They remove carbon dioxide from the atmosphere and store large quantities of carbon in their tissues. Trees and forests provide a habitat for many species of animals and plants. Tropical rainforests are among the most biodiverse habitats in the world. Trees provide shade and shelter, timber for construction, fuel for cooking and heating, and fruit for food as well as having many other uses. In parts of the world, forests are shrinking as trees are cleared to increase the amount of land available for agriculture. Because of their longevity and usefulness, trees have always been revered, with sacred groves in various cultures, and they play a role in many of the world's mythologies."
parser = Parser.load('en_gum7.electra-base.diaparser.pt')
data = parser.predict(txt, text='en')
xrenner = xrenner.Xrenner()
result = xrenner.analyze(data, "html")
print(result)
Coercing the Diaparser output to a string also didn't change anything. Do you maybe see what I'm doing wrong here?
If the Electra model works, I wouldn't bother getting DistilBERT to run; the Electra one is about +4 F1 on entity type recognition.
For the parser I should have been clearer: Diaparser is just a parser, not a full NLP toolkit like Stanza. It only predicts dependency attachments and relation types on preprocessed data (tokenized and sentence-split), so you will also need to get POS tags and lemmas from somewhere else. However, it is substantially more accurate than, say, Stanza (coincidentally also about +4 LAS out of the box). To run it, you need to feed it a list of sentences, each a list of tokens (so a list of lists); see the Diaparser documentation for details, and the sketch below for one way to wire it up. If you can tolerate somewhat lower accuracy, Stanza should work pretty well too, and it predicts everything from plain text. I've also seen Trankit around, which is much like Stanza but transformer-based, so that might be worth a try as well (I think it uses RoBERTa for everything?)
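To make the division of labour concrete, here is a rough, untested sketch of one way to stitch the pieces together: Stanza for tokenization, sentence splitting, POS tags, and lemmas; Diaparser for heads and dependency relations; xrenner on the merged CoNLL lines. The column layout and the str() round-trip through Diaparser's sentence objects are my assumptions based on the examples above and the two libraries' documentation, so treat this as a starting point rather than a recipe:

import stanza
import xrenner
from diaparser.parsers import Parser

# One-time setup: stanza.download("en") fetches the English models
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma")
parser = Parser.load("en_gum7.electra-base.diaparser.pt")

text = "Trees play a significant role in reducing erosion and moderating the climate."
doc = nlp(text)

# Diaparser expects a list of sentences, each a list of tokens
token_lists = [[word.text for word in sent.words] for sent in doc.sentences]
parsed = parser.predict(token_lists)

# Merge the two outputs into the 10-column format xrenner reads (see the
# example at the top of this thread): heads and relations from Diaparser,
# lemmas and POS tags from Stanza.
conll_lines = []
for stanza_sent, dia_sent in zip(doc.sentences, parsed.sentences):
    # str(dia_sent) renders the sentence in CoNLL format, one token per line
    for word, line in zip(stanza_sent.words, str(dia_sent).strip().split("\n")):
        cols = line.split("\t")
        cols[2] = word.lemma            # LEMMA column from Stanza
        cols[3] = cols[4] = word.xpos   # POS columns from Stanza
        conll_lines.append("\t".join(cols))
    conll_lines.append("")  # blank line terminates each sentence

xr = xrenner.Xrenner()
print(xr.analyze("\n".join(conll_lines), "sgml"))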