Training a skip-gram neural network model to obtain word embeddings.
Pre-Processing To pre-process the data I have performed the following operations:-
- Removing stop words using the NLTK library
- Converting everything to a lower case
- Removing the spaces using the strip method in python
- Removing sentences with length of less than 3 The length of the pre-processed corpus is 13651.
Creating the Corpus Vocabulary and Preparing the Data Following variables were created in order to create our corpus:-
- word2idx - extracting word-index pairs in a list form
- idx2word - extracting index-word pairs in a list form
- sentAsId - a list all the sentences in our corpus but contains index instead of word
Training the models
- The input to the model is the one-hot vector representing the input word whereas the output to the model is a vector containing the probability that a randomly selected nearby word is a word from our vocabulary.
- We use the keras framework to import a library which helps us to generate skipgrams. Further while creating a neural network we use the keras.layers to add layers to our neural network and create a model.
- The skip-gram approach is considered inefficient because it generates a huge number of weights. The training of a model with such weights and a large vocab-size will take a large amount of time which can be reduced with more advanced and better approaches.