iesl/dilated-cnn-ner

details on the accuracy

marc88 opened this issue · 12 comments

Hello,

Is it feasible to get a classification report on each class from your reported findings? Something like the one below. Moreover, the 'O' class is the dominant class; there is a class imbalance in the CONLL 2003 dataset, due to which the F1 score is skewed towards the dominant class. How was the 'O' class dealt with, since the paper reports only 4 classes: PER, ORG, LOC, MISC?

                 precision    recall  f1-score   support

      B-LOC          0.802     0.775     0.788      4196
     B-MISC          0.673     0.722     0.697      2026
      B-ORG          0.557     0.442     0.493      2929
      B-PER          0.685     0.674     0.680      2472
      I-LOC          0.779     0.624     0.693      1023
     I-MISC          0.451     0.346     0.392       468
      I-ORG          0.803     0.522     0.633      3271
      I-PER          0.770     0.651     0.706      1543
          O          0.843     0.986     0.909    143315

avg / total          0.829     0.946     0.881    161243
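
(For reference, a token-level report in this format is what, e.g., scikit-learn's classification_report produces from flattened gold and predicted tag sequences; a minimal sketch, with made-up tag lists:)

```python
# Minimal sketch: token-level per-class report from flattened BIO tags.
# The tag lists here are hypothetical placeholders, not CONLL predictions.
from sklearn.metrics import classification_report

gold_tags = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
pred_tags = ["B-ORG", "O", "O",      "O", "O", "O", "B-MISC", "O", "O"]

print(classification_report(gold_tags, pred_tags, digits=3))
```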

Regards

Here's an example breakdown of precision, recall, and F1 by label:

                F1      Prec    Recall
Micro Avg       90.82   91.05   90.60
-------
       LOC      92.56   92.21   92.93
      MISC      80.75   81.45   80.06
       PER      95.70   96.54   94.87
       ORG      88.59   88.61   88.56

I deal with O by training it like any other label.

Given the annotations of CONLL 2003, shouldn't it have around 9 classes?

'B-LOC': 7140,
'B-MISC': 3438,
'B-ORG': 6321,
'B-PER': 6600,
'I-LOC': 1157,
'I-MISC': 1155,
'I-ORG': 3704,
'I-PER': 4528,
'O': 169578

If the model is trained with 'O' tags, shouldn't there be an F1 for the 'O' tag as well (like the one in my last comment)?

Moreover, since there is a high class imbalance, given that most tags are 'O', shouldn't we take the macro average instead of the micro average? Otherwise, I think the reported findings will be biased towards the performance of the dominant class.

As an example:

Class0 - TPR: 9999/10000=0.9999
Class1 - TPR: 0/1=0.0
micro-average TPR: (9999+0)/(10000+1)=0.9998
macro-average TPR: (0.9999+0.0)/2=0.49995
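
(A minimal sketch reproducing those numbers; the per-class counts are the hypothetical ones from the example above, not CONLL statistics:)

```python
# Micro vs. macro averaging of the true positive rate for two classes.
tp = {"class0": 9999, "class1": 0}    # true positives per class
pos = {"class0": 10000, "class1": 1}  # actual positives per class

micro_tpr = sum(tp.values()) / sum(pos.values())
macro_tpr = sum(tp[c] / pos[c] for c in tp) / len(tp)

print(f"micro-average TPR: {micro_tpr:.4f}")   # 0.9998
print(f"macro-average TPR: {macro_tpr:.5f}")   # 0.49995
```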

Regards

Hello Emma,

Please find my response inline:
'The non-O classes aren't really imbalanced in this dataset'

But the 'O' class is over-represented; don't you think that will cause the neural network to essentially memorize that 'almost every tag is an O, and I'll get away most of the time if I predict that a word belongs to class O'?

Agree on the rest.

Regards

This is exactly what happens in the early stages of training. Initially, the biggest gains in terms of loss reduction come from predicting the majority class. However, over time the model will begin to distinguish the non-O classes in order to further reduce the loss. We can see this is exactly what happens by looking at the final F1 score of the model, which shows it is clearly not just predicting O for every token. If the O class were over-represented severely enough you might have to address the imbalance directly, but in this dataset that is not an issue.

I am trying to incorporate a context window to fix the sequence lengths. This handles variable sequence lengths, but each sequence of length n produces n such context windows, which in turn over-represents the 'O' tags.
With this approach, I have approximately 180k 'O' tag samples but only 10k PER tag samples. Weighted sampling doesn't seem to help, and given the challenge of maintaining contexts, crude over-sampling or under-sampling doesn't feel right.

Any suggestions on this, Mr. @patverga?

Regards

Hello Ms. @strubell ,

I am not exactly trying to copy-paste this code, but I am certainly trying to replicate the findings of the related research paper. Apologies for that; I tried starting a discussion on ResearchGate, but the thread seems pretty dormant there.

On your question about describing the CNN: I am trying to do something like the below, instead of padding to the maximum length (which seems to be a very bad idea):

input sentence = ('EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.')

output_windows = [
('<PAD>', '<PAD>', '<PAD>', 'EU', 'rejects', 'German', 'call', 'to', 'boycott'),
('<PAD>', '<PAD>', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British'),
('<PAD>', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb'),
('EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'),
('rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.', '<PAD>'),
('German', 'call', 'to', 'boycott', 'British', 'lamb', '.', '<PAD>', '<PAD>'),
('call', 'to', 'boycott', 'British', 'lamb', '.', '<PAD>', '<PAD>', '<PAD>'),
('to', 'boycott', 'British', 'lamb', '.', '<PAD>', '<PAD>', '<PAD>', '<PAD>')]

(A single sentence is converted into multiple sequences, so the number of samples of the over-represented classes increases even more.)
This aggravates the class imbalance issue even further. Each tuple above, e.g.
('<PAD>', '<PAD>', '<PAD>', 'EU', 'rejects', 'German', 'call', 'to', 'boycott'),
is treated as a sentence, which is then embedded and fed into an ID-CNN block along with its labels.

The method above was used to feed fixed-length sentences into the network and avoid the problem of variable-length sequences, as it converts a sentence of n tokens into n sequences of a fixed length.
In the case shown above, each sequence has a fixed length (len=9). Apologies for being a novice first up, but I couldn't really think of any other way to deal with variable-length sequences going into an ID-CNN block.
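
A minimal sketch of this kind of windowing, assuming symmetric padding around each token (the function name and exact offsets are mine, not from the repo or the post above):

```python
# Minimal sketch of a context-window scheme: every token gets one
# fixed-length window centred on it, padded at the sentence edges.
def context_windows(tokens, width=9, pad="<PAD>"):
    half = width // 2
    padded = [pad] * half + list(tokens) + [pad] * half
    return [tuple(padded[i:i + width]) for i in range(len(tokens))]

sentence = ('EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.')
for window in context_windows(sentence):
    print(window)
```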

I hope that clarifies my position. In case it doesn't, please feel free to ask further questions. It would be a privilege to have any suggestions from you or your team.

I am currently working on CONLL-2003, and I plan on applying this to OntoNotes 5.0 after it succeeds.

Regards

Hello Ms. @strubell ,

Thanks for the wonderful insights.

The shortest sentence had 2 tokens, while the longest one had over 100 tokens. The idea behind not using maxlen padding was to avoid creating sparse representations of sentences. With maxlen padding, a sentence with a few tokens (say 5) would look something like the one below (assuming the longest sentence is 100 tokens long; each number is a word index in the given vocab; the representations would still be sparse post-embedding):
[15619, 3259, 15052, 29961, 48521, 0, 0, 0, 0, 0, 0, ..., 0]  (100 terms)
Any thoughts on this? Are sparse representations good for ConvNets?

Further, we would have to teach the model to distinguish between pads and real tokens by labeling the pads anyway.
Is there any other way you would suggest, besides padding, to handle variable-length sequences?

To answer your question, I am using a similar padding scheme for the test and validation data too.
I actually applied the padding scheme to the entire dataset (that gives approximately 2,100k sequences, like ('<PAD>', '<PAD>', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British')) and then divided it into train, test, and validation sets.

Regards

Your understanding of how padding works in CNNs for text is incorrect. We don't have to train the model to predict pad tokens; in fact, we do the opposite: we mask the padding so that the model doesn't get a loss for those tokens, and we ignore the predictions there. Similarly, we zero out the padding so it's not provided as input. This is the same thing you would do for, e.g., an LSTM or any other batched sequence model. I wouldn't call these sparse inputs, since the part the model is actually trained to reason over is very much dense.
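
A rough numpy sketch of what masking the padding in the loss looks like (shapes and names are illustrative, not the repo's code):

```python
import numpy as np

# Mask padded positions out of a token-level cross-entropy loss.
batch, max_len, n_labels = 2, 5, 9
logits = np.random.randn(batch, max_len, n_labels)
labels = np.random.randint(0, n_labels, size=(batch, max_len))
lengths = np.array([5, 3])  # true (unpadded) length of each sequence

# mask[b, t] is 1 for real tokens, 0 for padding
mask = (np.arange(max_len)[None, :] < lengths[:, None]).astype(float)

# per-token negative log-likelihood
log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
token_nll = -np.take_along_axis(log_probs, labels[..., None], axis=-1).squeeze(-1)

# padded positions contribute nothing to the loss
loss = (token_nll * mask).sum() / mask.sum()
```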

One of the ways we avoid the slowdown due to extra computation on padding is to batch sequences with other sequences of similar length. When doing this you'll usually never have a sequence of length 5 in the same batch as a sequence of length 100; the padding is never as drastic as your example.
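
A minimal sketch of that length-based batching idea (names are illustrative, not the repo's batching code):

```python
# Sort sentences by length and slice off batches, so each batch only
# pads up to the length of its own longest sentence.
def length_bucketed_batches(sentences, batch_size):
    ordered = sorted(sentences, key=len)
    for i in range(0, len(ordered), batch_size):
        batch = ordered[i:i + batch_size]
        max_len = len(batch[-1])  # longest sentence in this batch
        yield [s + ["<PAD>"] * (max_len - len(s)) for s in batch]
```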

The way you're handling evaluation also doesn't make sense. Not only will your evaluation not be comparable to other work which evaluates on the normal data, but think about the actual use case. If someone wants to use your code to tag a sentence, how would they use the output of your model? Your model will produce N different labelings of the sequence.

I did realize that, and am currently working on masking the pads. But what I couldn't understand is: if we provide sequences of different lengths in different batches, how does my convnet handle this variation in the dimensions of the input sequences?
Given the batches from your example earlier (batch size = 128 and embedding dimension = 50, say):
Batch dimensions for length-5 sequences post-embedding:
(128, 50, 5)
Batch dimensions for length-100 sequences:
(128, 50, 100)

Shouldn't the convnet be fed with fixed-dimensional inputs?
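
(As an illustrative aside on the dimension question: a 1-D convolution's weights are sized by the kernel width and embedding dimension only, not by the sequence length, so the same filters slide over a batch of length-5 sequences and a batch of length-100 sequences. A small numpy sketch, using (batch, length, embedding) ordering; names are mine, not the repo's:)

```python
import numpy as np

# The filter matrix depends only on kernel width and embedding dim,
# so it can be applied to inputs of any (padded) sequence length.
emb, n_filters, kernel = 50, 64, 3
W = np.random.randn(kernel * emb, n_filters)

def conv1d_same_weights(x):            # x: (batch, seq_len, emb)
    b, t, _ = x.shape
    windows = np.stack([x[:, i:i + kernel].reshape(b, -1)
                        for i in range(t - kernel + 1)], axis=1)
    return windows @ W                  # (batch, t - kernel + 1, n_filters)

print(conv1d_same_weights(np.random.randn(128, 5, emb)).shape)    # (128, 3, 64)
print(conv1d_same_weights(np.random.randn(128, 100, emb)).shape)  # (128, 98, 64)
```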

About evaluation: apologies, but is there anything wrong with expecting an output of
[org, o, o, o, o, o, o, o, o]
for the sequence below?
('EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.')

Or do you suggest labeling it as ['org'] only?
Could we generate some additional feature tags, like POS tags, for the other words to train the model on the context around the word, and then just ignore the 'O' labels?

Regards