RNN_from_scratch_ds_jobs_analysis

Building an RNN model from scratch to predict job title from a given job description


Job Title Prediction

Building an RNN from scratch in PyTorch to predict job titles from job descriptions, using the Glassdoor Data Science jobs dataset available on Kaggle.

Data science jobs analysis

Exploring the Glassdoor Data Science jobs dataset from Kaggle. The data is collected from the Glassdoor site for data science jobs in the US.

Link to the dataset: https://www.kaggle.com/datasets/rkb0023/glassdoor-data-science-jobs

Model architecture

RNNs, or Recurrent Neural Networks, are a popular deep learning architecture for processing sequential data. They were the standard choice for NLP applications for many years, though more recently they have largely given way to Transformer-based architectures. The basic architecture and the formulas for calculating the hidden state and output of an RNN are shown below.

(Figure: basic RNN architecture with the hidden-state and output equations.)
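For reference, a standard formulation of the RNN update, written here as a sketch that assumes a tanh activation and, as described in the bullets below, a single output computed from the final hidden state:

```math
a^{\langle t \rangle} = \tanh\!\left(W_{aa}\, a^{\langle t-1 \rangle} + W_{ax}\, x^{\langle t \rangle} + b_a\right),
\qquad
\hat{y} = W_{ya}\, a^{\langle T \rangle} + b_y
```

Here x⟨t⟩ is the one-hot vector for the t-th character, a⟨t⟩ is the hidden state, and T is the sequence length. W_ax, W_aa and W_ya correspond to the weight matrices self.wax, self.waa and the output layer discussed below.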

Here's how the RNN works. Say we give the RNN a text sequence "abc" as input; after one-hot encoding, this becomes [[1,0,0,...],[0,1,0,0...],[0,0,1,0,0...]].
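For illustration, a minimal sketch of this one-hot encoding step, assuming a 27-character vocabulary of lowercase letters plus space (the exact preprocessing in the notebook may differ):

```python
import torch
import torch.nn.functional as F

# Assumed vocabulary: 26 lowercase letters + space = 27 characters
vocab = "abcdefghijklmnopqrstuvwxyz "
char_to_idx = {ch: i for i, ch in enumerate(vocab)}

text = "abc"
indices = torch.tensor([char_to_idx[ch] for ch in text])
one_hot = F.one_hot(indices, num_classes=len(vocab)).float()
print(one_hot.shape)  # torch.Size([3, 27]); each row is the one-hot vector for one character
```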

  • The RNN processes the text sequence sequentially, one character at a time
  • Input size is the size of the one-hot encoded vector (i.e. the total number of characters in the vocabulary; for us that's 26 letters + space), which is 27.
  • Hidden size is the number of nodes in the hidden layer and can be anything we want. Let's make it 3.
  • The first character processed is "a", or [1,0,0,...], a tensor of size 27. This is multiplied by the weight matrix (linear layer) stored in self.wax, which has a shape of (27, 3). The input has shape 27; the output is a tensor of size 3.
  • In PyTorch, torch.nn.Linear accepts input of shape (*, in_features) and produces output of shape (*, out_features) by performing a matrix multiplication, where * can be any number of additional dimensions, including none.
  • For the first character, the hidden state is a zero tensor of size 3, which is multiplied by a different weight matrix stored in self.waa to output a tensor of size 3. This tensor and the tensor from the previous step are added and an activation is applied; the result becomes the new hidden state, which is used when we process "b". After all characters are processed, we multiply the last hidden state by a third weight matrix of shape (3, num_classes) to get a tensor of shape num_classes containing the raw logits. There is no need to apply softmax here, since the cross_entropy function of torch.nn.functional does that internally.
  • Note that the same weights are used when processing "a", "b" and "c": self.wax multiplies the vector representation of "a", and the same self.wax weights are also used when processing "b" and "c". Between any two forward passes during training the weights will change, but within a single forward pass the same weights are reused: self.wax is the same for "a", "b" and "c", and self.waa is the same for all hidden-state updates. This parameter sharing reduces model complexity, because the total number of parameters to learn is constant regardless of sequence length. If an RNN has M learnable parameters at sequence length 100, it still has M learnable parameters at sequence length 200; it needs no additional parameters as the number of time steps increases, only the three weight matrices that it repeats (shares) across the time steps. A minimal sketch of this forward pass is shown after this list.
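To make the steps above concrete, here is a minimal sketch of the forward pass described in this list. It assumes a tanh activation and illustrative sizes (27-dimensional one-hot inputs, hidden size 3, a made-up num_classes); the class name and the usage at the bottom are hypothetical, and the actual implementation is in the notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharRNN(nn.Module):
    """Illustrative from-scratch RNN; not the notebook's exact code."""

    def __init__(self, input_size=27, hidden_size=3, num_classes=10):
        super().__init__()
        self.hidden_size = hidden_size
        self.wax = nn.Linear(input_size, hidden_size)   # input  -> hidden
        self.waa = nn.Linear(hidden_size, hidden_size)  # hidden -> hidden
        self.wya = nn.Linear(hidden_size, num_classes)  # last hidden -> logits

    def forward(self, x):
        # x: (seq_len, input_size) one-hot encoded characters
        a = torch.zeros(self.hidden_size)               # initial hidden state
        for t in range(x.shape[0]):
            # the same wax / waa weights are reused at every time step (parameter sharing);
            # tanh is assumed here as the activation
            a = torch.tanh(self.wax(x[t]) + self.waa(a))
        return self.wya(a)                              # raw logits; no softmax needed

# Illustrative usage with the one-hot encoded sequence "abc" and a dummy label
model = CharRNN(input_size=27, hidden_size=3, num_classes=10)
x = F.one_hot(torch.tensor([0, 1, 2]), num_classes=27).float()
logits = model(x)
loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([4]))  # softmax applied inside cross_entropy
loss.backward()
```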

For code, results, and discussion, refer to the notebook.