iesl/dilated-cnn-ner

details on the accuracy

marc88 opened this issue · 12 comments

Hello,

Is it feasible to get a classification report on each class from your reported findings? Something like the one below. Moreover, the 'O' class is the dominant class; there is a class imbalance in the CONLL 2003 dataset, due to which the F1 score is skewed towards the dominant class. How was the 'O' class dealt with, since the paper reports only 4 classes: PER, ORG, LOC, MISC?

                 precision    recall  f1-score   support

      B-LOC          0.802     0.775     0.788      4196
     B-MISC          0.673     0.722     0.697      2026
      B-ORG          0.557     0.442     0.493      2929
      B-PER          0.685     0.674     0.680      2472
      I-LOC          0.779     0.624     0.693      1023
     I-MISC          0.451     0.346     0.392       468
      I-ORG          0.803     0.522     0.633      3271
      I-PER          0.770     0.651     0.706      1543
          O          0.843     0.986     0.909    143315

avg / total          0.829     0.946     0.881    161243
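
(For reference, a token-level report in this format is what, e.g., scikit-learn's classification_report produces from flattened gold and predicted tag sequences; a minimal sketch, with made-up tag lists:)

```python
# Minimal sketch: token-level per-class report from flattened BIO tags.
# The tag lists here are hypothetical placeholders, not CONLL predictions.
from sklearn.metrics import classification_report

gold_tags = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
pred_tags = ["B-ORG", "O", "O",      "O", "O", "O", "B-MISC", "O", "O"]

print(classification_report(gold_tags, pred_tags, digits=3))
```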

Regards

Here's an example breakdown of precision, recall, and F1 by label:

                F1      Prec    Recall
Micro Avg       90.82   91.05   90.60
-------
       LOC      92.56   92.21   92.93
      MISC      80.75   81.45   80.06
       PER      95.70   96.54   94.87
       ORG      88.59   88.61   88.56

I deal with O by training it like any other label.

Given the annotations of CONLL 2003, shouldn't it have around 9 classes?

'B-LOC': 7140,
'B-MISC': 3438,
'B-ORG': 6321,
'B-PER': 6600,
'I-LOC': 1157,
'I-MISC': 1155,
'I-ORG': 3704,
'I-PER': 4528,
'O': 169578

If the model is trained with 'O' tags, shouldn't there be an F1 for the 'O' tag as well (like the one in my last comment)?

Moreover, since there is a high class imbalance, given that most tags are 'O', shouldn't we take the macro average instead of the micro average? Otherwise, I think the reported findings will be biased towards the performance of the dominant class.

As an example:

Class0 - TPR: 9999/10000=0.9999
Class1 - TPR: 0/1=0.0
micro-average TPR: (9999+0)/(10000+1)=0.9998
macro-average TPR: (0.9999+0.0)/2=0.49995
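
(A minimal sketch reproducing those numbers; the per-class counts are the hypothetical ones from the example above, not CONLL statistics:)

```python
# Micro vs. macro averaging of the true positive rate for two classes.
tp = {"class0": 9999, "class1": 0}    # true positives per class
pos = {"class0": 10000, "class1": 1}  # actual positives per class

micro_tpr = sum(tp.values()) / sum(pos.values())
macro_tpr = sum(tp[c] / pos[c] for c in tp) / len(tp)

print(f"micro-average TPR: {micro_tpr:.4f}")   # 0.9998
print(f"macro-average TPR: {macro_tpr:.5f}")   # 0.49995
```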

Regards

Hello Emma,

Please find my response inline:
'The non-O classes aren't really imbalanced in this dataset'

But the 'O' class is over-represented; don't you think that will cause the neural network to essentially memorize that 'almost every tag is an O, and I'll get away most of the time if I predict that a word belongs to class O'?

Agree on the rest.

Regards

This is exactly what happens in the early stages of training. Initially, the biggest gains in terms of loss reduction come from predicting the majority class. However, over time the model will begin to distinguish the non-O classes in order to further reduce the loss. We can see this is exactly what happens by looking at the final F1 score of the model, which shows it is clearly not just predicting O for every token. If the O class were over-represented severely enough you might have to address the imbalance directly, but in this dataset that is not an issue.

I am trying to incorporate a context window to fix the sequence lengths. This handles variable sequence lengths, but each sequence of length n produces n such context windows, which in turn over-represents the 'O' tags.
With this approach, I have approximately 180k 'O' tag samples but only 10k PER tag samples. Weighted sampling doesn't seem to help, and given the challenge of maintaining contexts, crude over-sampling or under-sampling doesn't feel right.

Any suggestions on this, Mr. @patverga?

Regards

Hello Ms. @strubell ,

I am not exactly trying to copy-paste this code, but I am certainly trying to replicate the findings of the related research paper. Apologies for that; I tried starting a discussion on ResearchGate, but the thread seems pretty dormant there.

On your question about describing the CNN: I am trying to do something like the below, instead of padding to the maximum length (which seems to be a very bad idea):

input sentence = ('EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.')

output_windows = [
('<PAD>', '<PAD>', '<PAD>', 'EU', 'rejects', 'German', 'call', 'to', 'boycott'),
('<PAD>', '<PAD>', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British'),
('<PAD>', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb'),
('EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'),
('rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.', '<PAD>'),
('German', 'call', 'to', 'boycott', 'British', 'lamb', '.', '<PAD>', '<PAD>'),
('call', 'to', 'boycott', 'British', 'lamb', '.', '<PAD>', '<PAD>', '<PAD>'),
('to', 'boycott', 'British', 'lamb', '.', '<PAD>', '<PAD>', '<PAD>', '<PAD>')]

(A single sentence is converted into multiple sequences, so the number of samples of the over-represented classes increases even more.)
This aggravates the class imbalance issue even further. Each tuple above, e.g.
('<PAD>', '<PAD>', '<PAD>', 'EU', 'rejects', 'German', 'call', 'to', 'boycott'),
is treated as a sentence, which is then embedded and fed into an ID-CNN block along with its labels.

The method above was used to feed fixed-length sentences into the network and avoid the problem of variable-length sequences, as it converts a sentence of n tokens into n sequences of a fixed length.
In the case shown above, each sequence has a fixed length (len=9). Apologies for being a novice first up, but I couldn't really think of any other way to deal with variable-length sequences going into an ID-CNN block.
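
A minimal sketch of this kind of windowing, assuming symmetric padding around each token (the function name and exact offsets are mine, not from the repo or the post above):

```python
# Minimal sketch of a context-window scheme: every token gets one
# fixed-length window centred on it, padded at the sentence edges.
def context_windows(tokens, width=9, pad="<PAD>"):
    half = width // 2
    padded = [pad] * half + list(tokens) + [pad] * half
    return [tuple(padded[i:i + width]) for i in range(len(tokens))]

sentence = ('EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.')
for window in context_windows(sentence):
    print(window)
```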

I hope that clarifies my position. In case it doesn't, please feel free to ask further questions. It would be a privilege to have any suggestions from you or your team.

I am currently working on CONLL-2003, and I plan on applying this to OntoNotes 5.0 after it succeeds.

Regards

Hello Ms. @strubell ,

Thanks for the wonderful insights.

The shortest sentence had 2 tokens, while the longest one had over 100 tokens. The idea behind not using maxlen padding was to avoid creating sparse representations of sentences. With maxlen padding, a sentence with a few tokens (say 5) would look something like the one below (assuming the longest sentence is 100 tokens long; each number is a word index in the given vocab; the representations would still be sparse post-embedding):
[15619, 3259, 15052, 29961, 48521, 0, 0, 0, 0, 0, 0, ..., 0]  (100 terms)
Any thoughts on this? Are sparse representations good for ConvNets?

Further, we would have to teach the model to distinguish between pads and real tokens by labeling the pads anyway.
Is there any other way you would suggest, besides padding, to handle variable-length sequences?

To answer your question, I am using a similar padding scheme for the test and validation data too.
I actually applied the padding scheme to the entire dataset (that gives approximately 2,100k sequences, like ('<PAD>', '<PAD>', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British')) and then divided it into train, test, and validation sets.

Regards

Your understanding of how padding works in CNNs for text is incorrect. We don't have to train the model to predict pad tokens; in fact, we do the opposite: we mask the padding so that the model doesn't get a loss for those tokens, and we ignore the predictions there. Similarly, we zero out the padding so it's not provided as input. This is the same thing you would do for, e.g., an LSTM or any other batched sequence model. I wouldn't call these sparse inputs, since the part the model is actually trained to reason over is very much dense.
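
A rough numpy sketch of what masking the padding in the loss looks like (shapes and names are illustrative, not the repo's code):

```python
import numpy as np

# Mask padded positions out of a token-level cross-entropy loss.
batch, max_len, n_labels = 2, 5, 9
logits = np.random.randn(batch, max_len, n_labels)
labels = np.random.randint(0, n_labels, size=(batch, max_len))
lengths = np.array([5, 3])  # true (unpadded) length of each sequence

# mask[b, t] is 1 for real tokens, 0 for padding
mask = (np.arange(max_len)[None, :] < lengths[:, None]).astype(float)

# per-token negative log-likelihood
log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
token_nll = -np.take_along_axis(log_probs, labels[..., None], axis=-1).squeeze(-1)

# padded positions contribute nothing to the loss
loss = (token_nll * mask).sum() / mask.sum()
```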

One of the ways we avoid the slowdown due to extra computation on padding is to batch sequences with other sequences of similar length. When doing this you'll usually never have a sequence of length 5 in the same batch as a sequence of length 100; the padding is never as drastic as your example.
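
A minimal sketch of that length-based batching idea (names are illustrative, not the repo's batching code):

```python
# Sort sentences by length and slice off batches, so each batch only
# pads up to the length of its own longest sentence.
def length_bucketed_batches(sentences, batch_size):
    ordered = sorted(sentences, key=len)
    for i in range(0, len(ordered), batch_size):
        batch = ordered[i:i + batch_size]
        max_len = len(batch[-1])  # longest sentence in this batch
        yield [s + ["<PAD>"] * (max_len - len(s)) for s in batch]
```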

The way you're handling evaluation also doesn't make sense. Not only will your evaluation not be comparable to other work which evaluates on the normal data, but think about the actual use case. If someone wants to use your code to tag a sentence, how would they use the output of your model? Your model will produce N different labelings of the sequence.

I did realize that, and am currently working on masking the pads. But what I couldn't understand is: if we provide sequences of different lengths in different batches, how does my convnet handle this variation in the dimensions of the input sequences?
Given the batches from your example earlier (batch size = 128 and embedding dimension = 50, say):
Batch dimensions for length-5 sequences post-embedding:
(128, 50, 5)
Batch dimensions for length-100 sequences:
(128, 50, 100)

Shouldn't the convnet be fed with fixed-dimensional inputs?
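
(As an illustrative aside on the dimension question: a 1-D convolution's weights are sized by the kernel width and embedding dimension only, not by the sequence length, so the same filters slide over a batch of length-5 sequences and a batch of length-100 sequences. A small numpy sketch, using (batch, length, embedding) ordering; names are mine, not the repo's:)

```python
import numpy as np

# The filter matrix depends only on kernel width and embedding dim,
# so it can be applied to inputs of any (padded) sequence length.
emb, n_filters, kernel = 50, 64, 3
W = np.random.randn(kernel * emb, n_filters)

def conv1d_same_weights(x):            # x: (batch, seq_len, emb)
    b, t, _ = x.shape
    windows = np.stack([x[:, i:i + kernel].reshape(b, -1)
                        for i in range(t - kernel + 1)], axis=1)
    return windows @ W                  # (batch, t - kernel + 1, n_filters)

print(conv1d_same_weights(np.random.randn(128, 5, emb)).shape)    # (128, 3, 64)
print(conv1d_same_weights(np.random.randn(128, 100, emb)).shape)  # (128, 98, 64)
```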

About evaluation: apologies, but is there anything wrong with expecting an output of
[org, o, o, o, o, o, o, o, o]
for the sequence below?
('EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.')

Or do you suggest labeling it as ['org'] only?
Could we generate some additional feature tags, like POS tags, for the other words to train the model on the context around the word, and then just ignore the 'O' labels?

Regards