Natural Language Processing related projects, which include concepts and scripts about:

- Word2vec: `gensim`, `fastText` and `tensorflow` implementations. See Chinese notes, 中文解读.
- Text similarity: `gensim doc2vec` and `gensim word2vec averaging` implementations.
- Text classification: `tensorflow LSTM` (see Chinese notes 1, 中文解读 1 and Chinese notes 2, 中文解读 2) and `fastText` implementations.
- Chinese word segmentation: `HMM Viterbi` implementations. See Chinese notes, 中文解读.
- Sequence labeling - NER: brands NER via bi-directional LSTM + CRF, `tensorflow` implementation. See Chinese notes, 中文解读.
- ..
- Use pre-trained embeddings if available (a loading sketch follows this item).
- The embedding dimension is task-dependent:
  - A smaller dimensionality (e.g., 100) works well for syntactic tasks (e.g., NER, POS tagging).
  - A larger dimensionality (e.g., 300) is useful for semantic tasks (e.g., sentiment analysis).
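A minimal sketch of wiring pre-trained vectors into a TensorFlow embedding layer; the file name `wv.bin`, the toy `vocab` and the 300-d size are assumptions, not part of this repo:

```python
import numpy as np
import tensorflow as tf
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format('wv.bin', binary=True)   # pre-trained vectors (assumed file)
vocab = ['the', 'cat', 'sat']                                   # task vocabulary (toy example)
embedding_matrix = np.zeros((len(vocab), 300), dtype=np.float32)
for i, word in enumerate(vocab):
    if word in wv:                                              # copy the pre-trained vector if present
        embedding_matrix[i] = wv[word]

embeddings = tf.get_variable('embeddings', initializer=embedding_matrix)  # fine-tuned during training
word_ids = tf.placeholder(tf.int32, [None, None])               # [batch, time]
embedded = tf.nn.embedding_lookup(embeddings, word_ids)         # [batch, time, 300]
```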
- 3- or 4-layer Bi-LSTMs (e.g., POS tagging, semantic role labelling).
- 8 encoder and 8 decoder layers (e.g., Google's NMT).
- In most cases, a shallower model (e.g., 2 layers) is good enough (a stacked 2-layer sketch follows this item).
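A minimal sketch of a stacked 2-layer LSTM encoder in TF 1.x; `embedded` and `seq_len` are assumed inputs (e.g., the embedding lookup above and a sequence-length tensor):

```python
import tensorflow as tf

def stacked_lstm_encoder(embedded, seq_len, hidden_size=128, num_layers=2):
    """Stack `num_layers` LSTM cells and run them over the embedded inputs."""
    cells = [tf.nn.rnn_cell.LSTMCell(hidden_size) for _ in range(num_layers)]
    stacked_cell = tf.nn.rnn_cell.MultiRNNCell(cells)
    outputs, final_state = tf.nn.dynamic_rnn(
        stacked_cell, embedded, sequence_length=seq_len, dtype=tf.float32)
    return outputs, final_state
```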
- Highway layer: `h = t * a(WX + b) + (1 - t) * X`, where `t = sigmoid(W_T X + b_T)` is called the transform gate (see the sketch below).
  - Application: language modelling and speech recognition.
  - Implementation: `tf.contrib.rnn.HighwayWrapper`
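A minimal standalone sketch of the highway formula above, built with `tf.layers.dense` rather than the `HighwayWrapper` (names are illustrative):

```python
import tensorflow as tf

def highway_layer(x, activation=tf.nn.relu, scope="highway"):
    """h = t * a(Wx + b) + (1 - t) * x, with t = sigmoid(W_T x + b_T)."""
    size = x.get_shape().as_list()[-1]
    with tf.variable_scope(scope):
        t = tf.layers.dense(x, size, activation=tf.nn.sigmoid, name="transform_gate")
        h = tf.layers.dense(x, size, activation=activation, name="candidate")
        return t * h + (1.0 - t) * x   # mix transformed output and carried input
```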
- Residual connection: `h = a(WX + b) + X` (see the sketch below).
  - Implementation: `tf.contrib.rnn.ResidualWrapper`
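The same idea outside an RNN cell, as a minimal sketch (input and output sizes must match for the addition):

```python
import tensorflow as tf

def residual_layer(x, activation=tf.nn.relu, scope="residual"):
    """h = a(Wx + b) + x."""
    size = x.get_shape().as_list()[-1]
    with tf.variable_scope(scope):
        return activation(tf.layers.dense(x, size, name="dense")) + x
```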
- Dense connection: `h_l = a(W[X_1, ..., X_l] + b)`, i.e., layer l takes the concatenation of all previous layers' outputs as its input (see the sketch below).
  - Application: multi-task learning
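A minimal sketch of a densely connected block, where each layer sees the concatenation of all previous outputs (layer count and names are illustrative):

```python
import tensorflow as tf

def dense_block(x, num_layers=3, activation=tf.nn.relu, scope="dense_block"):
    """h_l = a(W [X_1, ..., X_l] + b) for each layer l."""
    outputs = [x]
    size = x.get_shape().as_list()[-1]
    with tf.variable_scope(scope):
        for l in range(num_layers):
            concat = tf.concat(outputs, axis=-1)   # [X_1, ..., X_l]
            h = tf.layers.dense(concat, size, activation=activation, name="layer_%d" % l)
            outputs.append(h)
    return outputs[-1]
```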
- Batch normalization is to CV what dropout is to NLP.
- A dropout rate of 0.5 is preferred.
- Recurrent (variational) dropout applies the same dropout mask across all timesteps at layer l, whereas standard dropout samples a new mask at every timestep, which hurts the recurrent connections. Implementation: `tf.contrib.rnn.DropoutWrapper(variational_recurrent=True)` (see the sketch below).
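A sketch of wrapping a cell with variational dropout; `state_size`, `emb_dim` and the 0.5 keep probabilities are assumed values (`input_size` and `dtype` are required when `variational_recurrent=True` and inputs are dropped):

```python
import tensorflow as tf

cell = tf.nn.rnn_cell.GRUCell(state_size)
cell = tf.contrib.rnn.DropoutWrapper(
    cell,
    input_keep_prob=0.5,
    output_keep_prob=0.5,
    state_keep_prob=0.5,
    variational_recurrent=True,   # reuse the same dropout mask at every timestep
    input_size=emb_dim,           # dimensionality of the inputs fed to the cell
    dtype=tf.float32)
```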
- Treat the initial state as a trainable variable [2]:

```python
# Note: with an LSTMCell the state is a tuple, so a single tensor cannot be passed directly, see
# https://stackoverflow.com/questions/42947351/tensorflow-dynamic-rnn-typeerror-tensor-object-is-not-iterable
cell = tf.nn.rnn_cell.GRUCell(state_size)
# Learn one initial state vector and tile it across the batch, see
# https://stackoverflow.com/questions/44486523/should-the-variables-of-the-initial-state-of-a-dynamic-rnn-among-the-inputs-of
init_state = tf.get_variable('init_state', [1, state_size], initializer=tf.constant_initializer(0.0))
init_state = tf.tile(init_state, [batch_size, 1])
```
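A hypothetical usage of the learned initial state; `inputs` and `seq_len` are assumed placeholders, not part of the original snippet:

```python
outputs, final_state = tf.nn.dynamic_rnn(
    cell, inputs, sequence_length=seq_len, initial_state=init_state)
```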
- Gradient clipping:

```python
# Clip the global norm of all gradients before applying the update.
variables = tf.trainable_variables()
gradients = tf.gradients(ys=cost, xs=variables)
clipped_gradients, _ = tf.clip_by_global_norm(gradients, clip_norm=self.clip_norm)
optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)
optimize = optimizer.apply_gradients(grads_and_vars=zip(clipped_gradients, variables),
                                     global_step=self.global_step)
```
- To do...
References:
[1] http://ruder.io/deep-learning-nlp-best-practices/
[2] https://r2rt.com/recurrent-neural-networks-in-tensorflow-iii-variable-length-sequences.html