Strange prediction behavior
JaouadMousser opened this issue · 9 comments
Hi Asahi,
I am having an issue with a model I trained using tner. I used a custom dataset with labels like "INCEPTION_DATE", "PARNTER_COUNTRY", etc. The training itself seems to go well, but when I call the predict function, I get different labels like "Date", "City", and other entities that were not in my data.
Is there anything I am missing here?
I would appreciate any advice.
I see the same behaviour and I also do not understand what is going on: I am training on CONLL2003, which has types like PER, LOC, ORG, but the trained model returns "location", "organization", etc.
Why/How is this done? I would have expected that loading the custom dataset will show the possible labels which occur in the dataset.
This is extremely confusing, what is going on????
Oh, sorry, I think my issue is actually different, what I see seems to happen in
Line 429 in 83eb39f
This is not a useful behavior in situations where we really need exactly the types in the dataset, could we please make this mapping optional?
Update: yes, deactivating the processing in that line makes the model use the original types
Thanks Johann. I will try to deactivate the mapping part. But I am not sure it is going to solve the problem given that my labels are not part of the map provided in the code.
@JaouadMousser which model do you start with? is it one from huggingface hub?
Yes, it is the bert-base-cased-mutlilangual
OK, to me this looks very weird; there should be no way for other chunk labels to get used with that base model. Is this reproducible?
Hi @JaouadMousser, is there any chance that I could have a look at a few examples from your dataset? It doesn't need to be a subset of your original data, but it would be better if the file is in the same format as yours and contains all the entity types your dataset has. With the dataset, I could run model training and inference on my end to see what's going on there.
Hi,
I found where the problem is coming from. The code expects two-part labels like B-ORG. In my case I have three-part labels like "B-INCEPTION-DATE", "B-PERIOD-DATE", etc. The decode_ner_tags function splits these labels and takes the last part. Since I have many labels ending in "-DATE" or "-CITY", the predict function returns "DATE" for every label ending in DATE, and so on.
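For anyone else hitting this, the effect can be reproduced with a minimal sketch. The function `naive_entity_type` below is hypothetical and is not tner's actual `decode_ner_tags` code; it only assumes that a tag is split on "-" and the last piece is kept, as described above.

```python
def naive_entity_type(tag: str) -> str:
    """Hypothetical stand-in: split on '-' and keep only the last piece."""
    return tag.split("-")[-1]


tags = ["B-ORG", "B-INCEPTION-DATE", "B-PERIOD-DATE"]
collapsed = [naive_entity_type(t) for t in tags]

# The two distinct date labels collapse into the same entity type.
print(collapsed)  # ['ORG', 'DATE', 'DATE']
```

Under that assumption, two-part labels survive unchanged, while any labels that share a final segment become indistinguishable at prediction time.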
@johann-petrak, @asahi417 thank you for your support.
Hi @JaouadMousser
Thank you for figuring out the issue. This should indeed be handled in a smarter way (e.g. split off the leading "{B,I}-" prefix and keep the rest). I'll add this to my to-do list for the next version. Really appreciate it!
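A rough sketch of that idea, assuming tags follow the usual "B-"/"I-" prefix convention with "O" for non-entities. The helper name `split_tag` is hypothetical and is not part of tner; it only illustrates splitting on the first "-" instead of every "-".

```python
from typing import Tuple


def split_tag(tag: str) -> Tuple[str, str]:
    """Hypothetical helper: split on the first '-' only, keeping the rest intact."""
    prefix, _, entity_type = tag.partition("-")
    return prefix, entity_type


results = [split_tag(t) for t in ["B-ORG", "I-INCEPTION-DATE", "B-PERIOD-DATE", "O"]]
for prefix, entity_type in results:
    # Multi-part entity types stay whole; "O" yields an empty entity type.
    print(prefix, repr(entity_type))
```

With this approach "I-INCEPTION-DATE" and "B-PERIOD-DATE" keep their full types ("INCEPTION-DATE" and "PERIOD-DATE") rather than both becoming "DATE".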