Tixierae/deep_learning_NLP

Predictive text regions

fcivardi opened this issue · 2 comments

Thank you for the very clear introduction to CNN for NLP!

I have a question about the predictive text regions. You write "we want to identify the n_show regions of each branch that are associated with the highest weights in the corresponding feature maps", but in the code you only take the activations of the feature maps into account. I wonder whether we should also weight them with the weights of the dense layer.

I'm trying to apply the same idea to multi-class, multi-label problems (the model is essentially the same, as in https://github.com/inspirehep/magpie; we only need a dense layer with more output neurons), and I'd like to identify the regions that are associated with the different labels. In this case, the feature maps of a document are the same for all labels, but the weights of the dense layer are of course what activates or deactivates a specific output neuron.
What do you think?

Thanks,
Francesco

Sorry for the late reply and thank you for your interest!

About predictive regions, my code only implements the procedure described in Johnson, Rie, and Tong Zhang. "Effective use of word order for text categorization with convolutional neural networks." arXiv preprint arXiv:1412.1058 (2014). If you look at section 3.6 on page 9:

In Table 5, we show some of text regions learned by seq-CNN to be predictive on Elec. This net has one convolution layer with region size 3 and 1000 neurons; thus, embedding by the convolution layer produces a 1000-dim vector for each region, which (after pooling) serves as features in the top layer where weights are assigned to the 1000 vector components.

The key here is that each branch of the CNN (corresponding to a given filter size) produces nb_filters feature maps after convolution. Moreover, the ith entry of a feature map is the result of applying the corresponding filter to the ith region of the input. Therefore, the ith region of the input can be associated with a vector of size nb_filters, which can be seen as the region embedding (not to be confused with the document embedding, which is the input of the final dense layer). The assumption is that the regions that are most useful in making the prediction will be assigned large weights (in absolute terms) by the trained model. In other words, the regions whose embeddings have high norms are the predictive regions:

[Table 5] shows the text regions (in the training set) whose embedded vectors have a large value in the corresponding component, i.e., predictive text regions.
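Concretely, for one branch, the procedure boils down to something like this (a minimal NumPy sketch with made-up shapes and variable names, not the exact code of the notebook):

import numpy as np

# reg_emb: output of one convolution branch *before* pooling, for one document
# shape = (nb_regions, nb_filters); row i is the embedding of the ith region
reg_emb = np.random.rand(48, 100)  # dummy values, just for illustration

# the norm of a region embedding measures how strongly the filters reacted to it
norms = np.linalg.norm(reg_emb, axis=1)

# indexes of the n_show most 'predictive' regions of this branch
n_show = 5
best_regions = np.argsort(norms)[::-1][:n_show]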

Now, if you go any further in the architecture, past the pooling layer, you cannot reason about regions anymore, because pooling only retains the maximum value of each feature map and thus destroys the spatial information, keeping only filter-wise information. You could use the final prediction of the model as a weight that indicates how 'extreme' the document is (e.g., very positive or very negative) and use it to retain only the best regions from the most extreme documents, as sketched below. But I don't think you can use this information to say anything more about the regions themselves.
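For instance, something along these lines (just a sketch of that suggestion; preds is a hypothetical array of model outputs you would have computed beforehand):

import numpy as np

# preds: sigmoid outputs of the trained model over the documents, shape (nb_docs,)
preds = np.random.rand(200)  # dummy values, just for illustration

# extremity = distance from the decision boundary (0.5 for a sigmoid output)
extremity = np.abs(preds - 0.5)

# keep only the documents the model is most confident about,
# and then show the best regions of those documents only
k = 10
most_extreme_docs = np.argsort(extremity)[::-1][:k]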

About your second question, I'm not familiar with the architecture of the model. Is it described somewhere?

Thank you for your answer!

Btw, I had to modify your code in this way:
norms_a = np.linalg.norm(reg_emb_a[idx, 0:len(regions_a), :], axis=1)
norms_b = np.linalg.norm(reg_emb_b[idx, 0:len(regions_b), :], axis=1)
because (strangely) in my project I had very high activations in the region embeddings beyond the length of some documents (which resulted in empty lists in the output).

About the second question, the network described here: https://github.com/inspirehep/magpie/blob/master/magpie/nn/models.py
is very similar to yours (it concatenates 3 convolutional branches instead of 2, built with a for loop... I didn't know it was possible to use a for loop in the model definition). At the end, the dense layer has as many outputs as there are possible classes, but it uses a sigmoid and binary cross-entropy like yours (multi-label problems have to use sigmoid and binary cross-entropy, not softmax and categorical cross-entropy as in the multi-class case).
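For reference, the output part boils down to something like this (my rough sketch, not a verbatim copy of the magpie code; the sizes are made up):

from keras.layers import Input, Dense
from keras.models import Model

NB_LABELS = 20   # made-up number of possible labels
EMB_DIM = 300    # made-up size of the document embedding

# stands for the concatenation of the pooled convolutional branches
document_embedding = Input(shape=(EMB_DIM,))

# one sigmoid per label: each output is an independent probability
predictions = Dense(NB_LABELS, activation='sigmoid')(document_embedding)

model = Model(document_embedding, predictions)
# binary cross-entropy treats each label as a separate yes/no decision
model.compile(loss='binary_crossentropy', optimizer='adam')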
Best, Francesco