Strange prediction behavior

Hi Asahi,

I am having an issue with a model I trained using tner. I used a custom dataset with labels like "INCEPTION_DATE", "PARNTER_COUNTRY", etc. Training itself seems to go well, but when I call the predict function, I get different labels such as "Date", "City" and other entity types that were not in my data.

Is there anything I am missing here?

I would appreciate any advice.

I see the same behaviour and I also do not understand what is going on: I am training on CONLL2003, which has types like PER, LOC, ORG, but the trained model returns "location", "organization", etc.
Why/how is this done? I would have expected that loading a custom dataset would show exactly the labels that occur in that dataset.

This is extremely confusing, what is going on????

Oh, sorry, I think my issue is actually a different one. What I see seems to happen in

tner/tner/get_dataset.py

Line 429 in 83eb39f

fixed_mention = [k for k, v in SHARED_NER_LABEL.items() if mention in v]

where certain known types are mapped to some pre-defined type.

This is not a useful behavior in situations where we really need exactly the types in the dataset, could we please make this mapping optional?

Update: yes, deactivating the processing in that line makes the model use the original types
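For anyone reading along, here is a minimal sketch of what an optional mapping could look like. It is not tner's actual code: the table contents, the `normalize_mention` helper and the `fix_label` flag are all hypothetical, and falling back to the original type when no entry matches is an assumption. Only the list comprehension is taken from the line quoted above.

```python
# Illustrative stand-in for SHARED_NER_LABEL; the real table's contents
# are not shown in this thread, so these entries are assumptions.
SHARED_NER_LABEL = {
    "location": ["LOC", "location", "GPE"],
    "organization": ["ORG", "organization"],
    "person": ["PER", "person"],
}


def normalize_mention(mention, fix_label=True, table=SHARED_NER_LABEL):
    """Map a dataset entity type to a shared name, or return it unchanged.

    `fix_label` is a hypothetical switch: when False, the dataset's own
    entity types are kept exactly as they are.
    """
    if not fix_label:
        return mention
    # Same lookup as the quoted line: collect every shared name whose
    # alias list contains this mention.
    fixed_mention = [k for k, v in table.items() if mention in v]
    # Assumption: keep the original type when the table has no entry.
    return fixed_mention[0] if fixed_mention else mention


print(normalize_mention("LOC"))                   # mapped to a shared name
print(normalize_mention("LOC", fix_label=False))  # kept as in the dataset
print(normalize_mention("INCEPTION_DATE"))        # no entry, kept as is
```

With a switch like this, the default behaviour stays the same for existing users, while datasets that need their exact types can opt out.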

Thanks Johann. I will try to deactivate the mapping part. But I am not sure it is going to solve the problem given that my labels are not part of the map provided in the code.

@JaouadMousser which model do you start with? Is it one from the Hugging Face Hub?

Yes, it is bert-base-multilingual-cased.

OK, to me this looks very weird. There should be no way for other chunk labels to get used with that base model. Is this reproducible?

Hi @JaouadMousser, is there any chance I could have a look at a few examples from your dataset? It doesn't need to be a subset of your original data, but it would be best if the file is in the same format as yours and contains all the entity types your dataset has. With that, I could run model training and inference on my end to see what's going on.

Hi,

I found where the problem is coming from. The code expects two-part labels like B-ORG. In my case I have three-part labels like "B-INCEPTION-DATE", "B-PERIOD-DATE", etc. The decode_ner_tags function splits these labels and takes only the last part. Since many of my labels end in "-DATE" or "-CITY", the predict function returns "DATE" for every label ending in DATE, and so on.
@johann-petrak, @asahi417 thank you for your support.

Hi @JaouadMousser
Thank you for figuring out the issue. This should indeed be handled in a smarter way (e.g. strip only the leading "{B,I}-" and keep the rest). I'll add this to my todo list for the next version. Really appreciate it!
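To make the difference concrete, here is a small sketch contrasting the two approaches. Both helper names are hypothetical and this is not the actual decode_ner_tags implementation; it assumes the split happens on "-", which matches the behaviour reported above.

```python
def entity_type_last_part(tag):
    """Reproduce the reported behaviour: keep only the text after the
    last hyphen (assumes the tag is split on "-")."""
    return tag.split("-")[-1]


def entity_type_after_prefix(tag):
    """Strip only a leading "B-" or "I-" and keep the rest of the label,
    so hyphens inside the entity type survive."""
    if tag[:2] in ("B-", "I-"):
        return tag[2:]
    # Tags without a B-/I- prefix (such as "O") are returned unchanged.
    return tag


tags = ["B-INCEPTION-DATE", "I-INCEPTION-DATE", "B-PERIOD-DATE", "B-ORG", "O"]
for tag in tags:
    print(tag, "|", entity_type_last_part(tag), "|", entity_type_after_prefix(tag))
```

With the first helper, "B-INCEPTION-DATE" and "B-PERIOD-DATE" both collapse to "DATE". With the second, they stay distinct as "INCEPTION-DATE" and "PERIOD-DATE", while two-part tags like "B-ORG" still give "ORG".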