neuralmind-ai/portuguese-bert

Example doesn't work (documentation outdated?)

muriloime opened this issue · 4 comments

First of all congratulations on the repo, it's awesome!!

I was looking at the documentation at https://huggingface.co/neuralmind/bert-base-portuguese-cased, but the example there raises an exception:
transformers.pipelines.base.PipelineException: The model 'BertForPreTraining' is not supported for fill-mask. Supported models are ['Wav2Vec2ForMaskedLM', 'ConvBertForMaskedLM', 'LayoutLMForMaskedLM', 'DistilBertForMaskedLM', 'AlbertForMaskedLM', 'BartForConditionalGeneration', 'MBartForConditionalGeneration', 'CamembertForMaskedLM', 'XLMRobertaForMaskedLM', 'LongformerForMaskedLM', 'RobertaForMaskedLM', 'SqueezeBertForMaskedLM', 'BertForMaskedLM', 'MobileBertForMaskedLM', 'FlaubertWithLMHeadModel', 'XLMWithLMHeadModel', 'ElectraForMaskedLM', 'ReformerForMaskedLM', 'FunnelForMaskedLM', 'MPNetForMaskedLM', 'TapasForMaskedLM', 'DebertaForMaskedLM']

Could you please update the doc or tell me how to make this work?

many thanks

Hi @muriloime ,
Yeah, the transformers library changes constantly, so it is hard to keep the documentation up to date. Just change AutoModelForPreTraining to AutoModelForMaskedLM and it should work.
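For anyone landing here later, a minimal sketch of that change follows. It assumes a transformers version in which the fill-mask pipeline accepts BertForMaskedLM (it is in the supported list of the error above). The helper name build_fill_mask and the example sentence are illustrative only, and loading the model downloads the weights from the Hub.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

MODEL_NAME = "neuralmind/bert-base-portuguese-cased"


def build_fill_mask(model_name=MODEL_NAME):
    """Build a fill-mask pipeline backed by the masked-LM head.

    build_fill_mask is a hypothetical helper name, not part of transformers.
    """
    # AutoModelForMaskedLM resolves to BertForMaskedLM for this checkpoint,
    # which the fill-mask pipeline supports (BertForPreTraining is not).
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=False)
    return pipeline("fill-mask", model=model, tokenizer=tokenizer)
```

Usage would look roughly like this (not run here, so no output is shown):

```python
fill_mask = build_fill_mask()
for prediction in fill_mask("Tinha uma [MASK] no meio do caminho."):
    print(prediction["token_str"], prediction["score"])
```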

Hello! For the named-entity recognition task I've tried using the suggested class (AutoModelForMaskedLM) and it didn't work. It worked using AutoModelForTokenClassification but I always get LABEL_0 and LABEL_1 entity groups:

>>> from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
>>> model = AutoModelForTokenClassification.from_pretrained("neuralmind/bert-large-portuguese-cased")
>>> tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-large-portuguese-cased", do_lower_case=False)
>>> recognizer = pipeline("ner", model=model, tokenizer=tokenizer)
>>> for item in recognizer("Minha terra tem palmeiras onde canta o sabiá."):
...     print(item)

{'entity': 'LABEL_1', 'score': 0.5577178, 'index': 1, 'word': 'Minha', 'start': 0, 'end': 5}
{'entity': 'LABEL_0', 'score': 0.58073425, 'index': 2, 'word': 'terra', 'start': 6, 'end': 11}
{'entity': 'LABEL_1', 'score': 0.58783644, 'index': 3, 'word': 'tem', 'start': 12, 'end': 15}
{'entity': 'LABEL_1', 'score': 0.5749144, 'index': 4, 'word': 'pal', 'start': 16, 'end': 19}
{'entity': 'LABEL_1', 'score': 0.67455024, 'index': 5, 'word': '##meiras', 'start': 19, 'end': 25}
{'entity': 'LABEL_1', 'score': 0.59383816, 'index': 6, 'word': 'onde', 'start': 26, 'end': 30}
{'entity': 'LABEL_0', 'score': 0.53073055, 'index': 7, 'word': 'canta', 'start': 31, 'end': 36}
{'entity': 'LABEL_1', 'score': 0.59518266, 'index': 8, 'word': 'o', 'start': 37, 'end': 38}
{'entity': 'LABEL_1', 'score': 0.58625156, 'index': 9, 'word': 'sab', 'start': 39, 'end': 42}
{'entity': 'LABEL_1', 'score': 0.6423013, 'index': 10, 'word': '##iá', 'start': 42, 'end': 44}
{'entity': 'LABEL_0', 'score': 0.65507114, 'index': 11, 'word': '.', 'start': 44, 'end': 45}

Should I create a new issue related to NER? It would be awesome to have a working NER example on the README.

Hi @turicas, I think you're using a non-fine-tuned model, so these predictions don't make sense for NER (the checkpoint was only trained on the pretraining task). You can use this fine-tuned model, for example:

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
model = AutoModelForTokenClassification.from_pretrained('Luciano/bertimbau-base-lener_br')
tokenizer = AutoTokenizer.from_pretrained('Luciano/bertimbau-base-lener_br')
recognizer = pipeline('ner', model=model, tokenizer=tokenizer)
recognizer('Brasilia é a capital do Brasil.')
# => [{'entity': 'B-LOCAL', 'score': 0.99870455, 'index': 1, 'word': 'Brasil', 'start': 0, 'end': 6}, {'entity': 'B-LOCAL', 'score': 0.9969863, 'index': 2, 'word': '##ia', 'start': 6, 'end': 8}, {'entity': 'B-LOCAL', 'score': 0.98535144, 'index': 7, 'word': 'Brasil', 'start': 24, 'end': 30}]
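A quick way to spot this situation before running the pipeline is to look at the label names in the model config. The sketch below is only an illustration: has_generic_labels is a hypothetical helper, and the two-label set for the "fine-tuned" case is made up around the B-LOCAL tag seen above. It builds the configs locally so that nothing is downloaded. For a real checkpoint you could pass AutoConfig.from_pretrained(name) instead.

```python
import re

from transformers import BertConfig

# Placeholder names that transformers generates when a config has no label names.
GENERIC_LABEL = re.compile(r"^LABEL_\d+$")


def has_generic_labels(config):
    """Return True when every label name is an auto-generated placeholder."""
    return all(GENERIC_LABEL.match(name) for name in config.id2label.values())


# A config without label names falls back to LABEL_0 / LABEL_1.
untrained = BertConfig()

# Hypothetical label set for illustration; a real fine-tuned config has more tags.
finetuned = BertConfig(
    id2label={0: "O", 1: "B-LOCAL"},
    label2id={"O": 0, "B-LOCAL": 1},
)

print(has_generic_labels(untrained))
print(has_generic_labels(finetuned))
```

If the helper returns True for a checkpoint, the token-classification head was most likely initialised from scratch, which would explain scores hovering just above 0.5 as in the output posted earlier.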

Oops, my bad! I read "named entity recognition" on the model card and just tried it out. This code worked, thanks!