/Chinese-Medical-Relation-Extraction

(NLP) Text Classification Based on Chinese Medical Relation Extraction 基于中医关系抽取的文本分类

Primary LanguagePython

Chinese-Medical-Relation-Extraction

Text Classification Based on Chinese Medical Relation Extraction 基于中医关系抽取的文本分类 (NLP)


Data overview

Data Details
Download link data.zip
Dataset size Train Dataset : 37965
Valid Dataset : 8186
Test Dataset : 8135
Train Dataset image-20230726171944955

About

  1. Using raw and labeled data to build word embedding based on sentences and head-tail entities.
  2. Build a suitable CNN for text relationship classification.
  3. Train the model.
  4. Predict the test set after the training is completed, and generate a prediction result file.

What we do

  • Extract the corresponding relationship from the given head entity, tail entity and sentence in dataset.

  • Classify the sentence into a certain relationship according to the given entity.

  • Through CNN, it is mapped into a 44-dimensional short tensor (44 different classes), and finally through the Argmax function

  • The relationship represented by the corresponding head entity, tail entity and sentence is obtained.


    1. Data Preprocessing

    • Vocab-to-index: After the data is read in, a vocabulary list is constructed for sentences to convert them into index sequences corresponding to the vocabulary.

    • Position: For the head and tail entities, find the corresponding position in the sentence, and convert the "symbol" into a "position".

    • Word-to-vector: For labeled data, the corresponding relationship is converted to the corresponding relationship id through the W2V lookup table and stored in the dataset. We used the skip-gram model to encode data label, word frequency and W2V lookup table.

    • Word embedding: First, set the corresponding word embedding parameters according to the length of vocabulary. Do embedding for the read-in sentence.

    • Feature extraction: Use the head and tail entity position information and sentence information to do feature extraction through convolution and classification.

      Data Details
      Data after reorganizing image-20230726171121048
      Data after preprocessing image-20230726172134475image-20230726172218718

    2. Model Training

    • CNN Model:

      Layer Layer Name Structure
      1 Input embedding Embedding of text + position 1 + position 2
      2 Dropout
      3 Convolutional Layer Conv1D (kernel = 2) + Tanh + MaxPool1D
      4 Convolutional Layer Conv1D (kernel = 3) + Tanh + MaxPool1D
      5 Convolutional Layer Conv1D (kernel = 4) + Tanh + MaxPool1D
      6 Convolutional Layer Conv1D (kernel = 5) + Tanh + MaxPool1D
      7 Convolutional Layer Conv1D (kernel = 6) + Tanh + MaxPool1D
      8 Dropout
      9 Dropout
      10 Fully Connected Layer


Result

Index Epoch Train Acc Valid Acc Train Loss
Value 10 0.8582 0.7804 0.0281

About skip-gram:

  • Given a central word and a context word to be predicted, take this context word as a positive sample. Select several negative samples through random sampling of the vocabulary.

  • Then convert a large-scale classification problem into a two-class classification problem, and optimize the calculation speed in this way. We make a multinomial distribution sampling and take a specified number of high-frequency words as positive labels, and we obtain negative labels by removing positive labels.

  • The code uses a sliding window to scan the corpus from left to right. In each window, the central word needs to predict its context and form training data.

  • Some matrix operations on large vocabularies need to consume huge resources, so we simulate the result of softmax by negative sampling.

  • The size of the vocabulary determines that our skip-gram neural network will have a large-scale weight matrix. All these weights need to be adjusted through hundred millions of training samples, which is very computationally resource-intensive, and very slow to be trained.

  • Negative sampling solves this problem, which is a way to increase the speed of training and improve the quality of the resulting word vectors. Unlike updating all weights for each training sample, negative sampling only updates a small part of the weights for each training sample, which reduces the amount of calculation in the gradient descent process.

  • We use a unigram distribution to select negative words, and the formula in the code is implemented as follows:

    $$P(w_{i}) = \dfrac{f(w_{i})^{3/4}}{\sum_{j=0}^{n}(f(w_{j})^{3/4})}$$

    $f(w_{i})$ is known as word frequency.

    When we train the skip-gram model, we convert the dataset into an iterator, and take a batch of samples from the iterator. The Adam optimizer and backpropagation method are used in the training process, and then the word embedding vector in the model is normalized.

  • The word embedding vector is written into "skip-gram-model.txt" for subsequent word embedding and model training.