Todo for the project

1. Package Installation and Imports:

  • The code starts by installing and importing the necessary packages, including TensorFlow, pandas, scikit-learn, numpy, regular expressions, NLTK, Matplotlib, and the Hugging Face Transformers library.
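The setup step can be sketched as follows; the exact pip package names and import aliases are assumptions based on the libraries listed above:

```python
# Install once (e.g. in a notebook cell):
#   pip install tensorflow pandas scikit-learn numpy nltk matplotlib transformers

import re
import pickle

import numpy as np
import pandas as pd
import tensorflow as tf
import nltk
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, classification_report
from transformers import DistilBertTokenizer, TFDistilBertModel
```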

2. Preprocessing and Cleaning Functions:

  • Several functions are defined for preprocessing and cleaning the text data: removing stopwords, short words, and special characters, and lowercasing the text.
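A minimal sketch of such a cleaning function is shown below. It uses a small inline stopword list for illustration; the original code presumably uses NLTK's full English list (which needs a one-time `nltk.download("stopwords")`), and the function name and `min_len` cutoff are assumptions:

```python
import re

# Tiny inline stopword list for illustration only; the real code likely
# uses NLTK's full English stopword list.
STOPWORDS = {"a", "an", "the", "is", "are", "and", "or", "to", "of", "in"}

def clean_text(text: str, min_len: int = 3) -> str:
    """Lowercase, strip special characters, drop stopwords and short words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # remove special characters
    words = [w for w in text.split()
             if w not in STOPWORDS and len(w) >= min_len]
    return " ".join(words)

print(clean_text("The model IS great!!! An A+ result, truly."))
# → model great result truly
```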

3. Reading and Cleaning the Dataset:

  • Reads a dataset from a CSV file, drops unnecessary columns, removes NaN values, and shuffles the dataset.
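This step might look like the sketch below; an in-memory CSV stands in for the real file, and the column names (`id`, `text`, `label`) are assumptions:

```python
import io
import pandas as pd

# Stand-in for the real CSV file; column names are assumptions.
csv_data = io.StringIO(
    "id,text,label\n"
    "1,great movie,1\n"
    "2,terrible plot,0\n"
    "3,,1\n"
    "4,decent acting,1\n"
)

df = pd.read_csv(csv_data)
df = df.drop(columns=["id"])   # drop unnecessary columns
df = df.dropna()               # remove NaN rows (row 3 has empty text)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)  # shuffle

print(df.shape)  # → (3, 2)
```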

4. Loading DistilBERT Tokenizer and Model:

  • Loads the DistilBERT tokenizer and model from the Hugging Face Transformers library.

5. Preparing Input for the Model:

  • Sets the maximum length for input sentences.
  • Tokenizes and encodes sentences using the DistilBERT tokenizer.
  • Prepares input sentences, attention masks, and labels for model training.

6. Creating a Basic NN Model Using DistilBERT Embeddings:

  • Defines a neural network model that uses DistilBERT embeddings.
  • The model includes a Dense layer, Dropout layer, and output layer.

7. Saving Model Input in Pickle Files:

  • Saves the model input (input_ids, attention_masks, labels) into pickle files for later use.
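This step might look like the following, with small dummy arrays standing in for the real encodings; the pickle file names are assumptions:

```python
import pickle
import numpy as np

# Dummy stand-ins for the real encoded inputs
input_ids = np.array([[101, 2307, 3185, 102], [101, 6659, 5436, 102]])
attention_masks = np.ones_like(input_ids)
labels = np.array([1, 0])

# Save each array to its own pickle file (file names are assumptions)
for name, arr in [("input_ids", input_ids),
                  ("attention_masks", attention_masks),
                  ("labels", labels)]:
    with open(f"{name}.pkl", "wb") as f:
        pickle.dump(arr, f)

# Reload later without re-running tokenization
with open("input_ids.pkl", "rb") as f:
    restored = pickle.load(f)
print(np.array_equal(restored, input_ids))  # → True
```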

8. Train-Test Split and Model Compilation:

  • Splits the data into training and validation sets.
  • Defines the loss function, metrics, and optimizer for the model.
  • Compiles the model.
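A sketch of the split and compile step is below. Random arrays and a small stand-in Keras model keep it self-contained; the 80/20 split ratio, binary cross-entropy loss, Adam optimizer, and 2e-5 learning rate are assumptions:

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
input_ids = rng.integers(0, 30522, size=(100, 64))
attention_masks = np.ones((100, 64), dtype=np.int32)
labels = rng.integers(0, 2, size=100)

# Split all three arrays consistently (test_size is an assumption)
(train_ids, val_ids,
 train_masks, val_masks,
 train_labels, val_labels) = train_test_split(
    input_ids, attention_masks, labels, test_size=0.2, random_state=42)

# Stand-in for the DistilBERT-based model defined earlier
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(),               # assumption
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),  # assumption
    metrics=["accuracy"],
)
print(train_ids.shape, val_ids.shape)  # → (80, 64) (20, 64)
```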

9. Training the Model:

  • Trains the model on the training data, validating on the validation set.
  • Saves the best model based on validation loss.
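The training-with-checkpointing pattern can be sketched with a tiny stand-in model so it runs quickly; the checkpoint file name and epoch count are assumptions:

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8)).astype("float32")
y = (x.sum(axis=1) > 0).astype("float32")

# Tiny stand-in for the DistilBERT-based model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])

# Keep only the weights with the lowest validation loss (file name assumed)
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "best_model.weights.h5",
    monitor="val_loss",
    save_best_only=True,
    save_weights_only=True,
)

history = model.fit(x, y, validation_split=0.25, epochs=3,
                    callbacks=[checkpoint], verbose=0)
print(sorted(history.history.keys()))
```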

10. TensorBoard Visualization:

  • Uses TensorBoard to visualize training and validation curves.
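The callback setup might look like this; the timestamped `logs/fit/` directory layout is an assumption:

```python
import datetime
import tensorflow as tf

# One log directory per run so curves from different runs stay separate
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir=log_dir,
                                                histogram_freq=1)

# Pass alongside the checkpoint callback:
#   model.fit(..., callbacks=[checkpoint, tensorboard_cb])
# Then view training/validation curves with:
#   tensorboard --logdir logs/fit
```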

11. Model Evaluation:

  • Loads the saved model weights.
  • Uses the model to make predictions on the validation set.
  • Calculates and prints the F1 score and classification report.
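The metrics step is sketched below with stand-in probabilities so it runs on its own; the weight-loading and prediction lines are shown as comments, and the 0.5 decision threshold is an assumption:

```python
import numpy as np
from sklearn.metrics import f1_score, classification_report

# model.load_weights("best_model.weights.h5")          # restore best weights
# probs = model.predict([val_ids, val_masks]).ravel()  # sigmoid outputs

# Stand-in probabilities and labels so the metrics step is self-contained:
probs = np.array([0.9, 0.2, 0.7, 0.4, 0.8, 0.1])
val_labels = np.array([1, 0, 1, 1, 1, 0])

preds = (probs >= 0.5).astype(int)  # threshold is an assumption
print(f1_score(val_labels, preds))
print(classification_report(val_labels, preds))
```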

12. Conclusion:

  • Creates and compiles a new model for future use.
  • Prints the F1 score and classification report on the validation set.

Overall, the code demonstrates an end-to-end workflow for fine-tuning a DistilBERT model for text classification with TensorFlow and Keras: cleaning the data, encoding it, training with checkpointing, and evaluating on a held-out validation set.