Todo for the project
- Installs and imports the necessary packages: TensorFlow, pandas, scikit-learn, NumPy, the `re` module for regular expressions, NLTK, Matplotlib, and the Hugging Face Transformers library.
- Defines several helper functions for cleaning text: lowercasing, removing special characters, and dropping stopwords and short words.
- Reads the dataset from a CSV file, drops unnecessary columns, removes rows containing NaN values, and shuffles the rows.
- Loads the DistilBERT tokenizer and model from the Hugging Face Transformers library.
- Sets the maximum length for input sentences.
- Tokenizes and encodes sentences using the DistilBERT tokenizer.
- Prepares input sentences, attention masks, and labels for model training.
- Defines a neural network model on top of the DistilBERT embeddings, consisting of a Dense layer, a Dropout layer, and an output layer.
- Saves the model inputs (input_ids, attention_masks, labels) to pickle files for later reuse.
- Splits the data into training and validation sets.
- Compiles the model with its loss function, metrics, and optimizer.
- Trains the model on the training data, validating on the validation set.
- Saves the best model weights, selected by validation loss.
- Uses TensorBoard to visualize training and validation curves.
- Loads the saved model weights.
- Uses the model to make predictions on the validation set.
- Calculates and prints the F1 score and classification report.
- Creates and compiles a new model instance for future use, then prints the F1 score and classification report on the validation set again.
The code essentially demonstrates fine-tuning a DistilBERT model for text classification using TensorFlow and Keras; illustrative sketches of each major step follow below.
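A minimal sketch of the cleaning helpers, assuming an English stopword list, a two-character cutoff for "short" words, and the function name `clean_text` (all assumptions, not taken from the original code):

```python
import re

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))  # English stopwords assumed

def clean_text(text: str) -> str:
    """Lowercase, strip special characters, and drop stopwords/short words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters and whitespace only
    tokens = [t for t in text.split() if t not in STOPWORDS and len(t) > 2]
    return " ".join(tokens)
```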
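Loading, cleaning, and shuffling the dataset might look as follows; the file name `data.csv` and the column names `id` and `text` are hypothetical:

```python
import pandas as pd

df = pd.read_csv("data.csv")              # file name is an assumption
df = df.drop(columns=["id"])              # drop an unnecessary column (assumed name)
df = df.dropna()                          # remove rows with NaN values
df = df.sample(frac=1, random_state=42).reset_index(drop=True)  # shuffle
df["text"] = df["text"].apply(clean_text)  # reuse the cleaning helper above
```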
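A sketch of loading the tokenizer/model and encoding the sentences, assuming the `distilbert-base-uncased` checkpoint, a maximum length of 128, and a `label` column (all three are assumptions):

```python
import tensorflow as tf
from transformers import DistilBertTokenizerFast, TFDistilBertModel

MAX_LEN = 128  # assumed maximum sequence length

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
bert = TFDistilBertModel.from_pretrained("distilbert-base-uncased")

# Tokenize, pad/truncate to MAX_LEN, and return TensorFlow tensors.
encodings = tokenizer(
    df["text"].tolist(),
    max_length=MAX_LEN,
    padding="max_length",
    truncation=True,
    return_tensors="tf",
)
input_ids = encodings["input_ids"]
attention_masks = encodings["attention_mask"]
labels = tf.convert_to_tensor(df["label"].values)  # "label" column is an assumption
```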
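The Dense/Dropout head on top of DistilBERT could be wired up as below; the hidden size (64), dropout rate (0.2), and two-class softmax output are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import Model, layers

def build_model(bert, max_len: int, num_classes: int) -> Model:
    """DistilBERT [CLS] embedding -> Dense -> Dropout -> softmax head."""
    ids = layers.Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
    mask = layers.Input(shape=(max_len,), dtype=tf.int32, name="attention_mask")
    embedding = bert(ids, attention_mask=mask)[0][:, 0, :]  # [CLS] token vector
    x = layers.Dense(64, activation="relu")(embedding)      # hidden size assumed
    x = layers.Dropout(0.2)(x)                              # dropout rate assumed
    out = layers.Dense(num_classes, activation="softmax")(x)
    return Model(inputs=[ids, mask], outputs=out)

model = build_model(bert, MAX_LEN, num_classes=2)  # binary task assumed
```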
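Caching the encoded inputs with pickle, as a sketch; the file names are assumptions:

```python
import pickle

# Cache the encoded inputs so the slow tokenization step can be skipped later.
for name, tensor in [("input_ids", input_ids),
                     ("attention_masks", attention_masks),
                     ("labels", labels)]:
    with open(f"{name}.pkl", "wb") as f:  # file names are assumptions
        pickle.dump(tensor.numpy(), f)    # store as plain NumPy arrays
```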
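One way to split the data while keeping ids, masks, and labels aligned is to split indices; the 90/10 ratio and random seed are assumptions:

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

idx_train, idx_val = train_test_split(
    np.arange(int(labels.shape[0])), test_size=0.1, random_state=42
)
X_train = [tf.gather(input_ids, idx_train), tf.gather(attention_masks, idx_train)]
X_val = [tf.gather(input_ids, idx_val), tf.gather(attention_masks, idx_val)]
y_train = tf.gather(labels, idx_train)
y_val = tf.gather(labels, idx_val)
```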
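Compiling and training with best-checkpoint saving and TensorBoard logging might look like this; the Adam optimizer, learning rate, sparse categorical cross-entropy loss, epoch count, batch size, and file paths are all assumptions:

```python
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),  # learning rate assumed
    loss="sparse_categorical_crossentropy",                  # integer labels assumed
    metrics=["accuracy"],
)

callbacks = [
    # Keep only the weights with the lowest validation loss.
    tf.keras.callbacks.ModelCheckpoint(
        "best_model.weights.h5", monitor="val_loss",
        save_best_only=True, save_weights_only=True,
    ),
    # Log curves for inspection with `tensorboard --logdir logs`.
    tf.keras.callbacks.TensorBoard(log_dir="logs"),
]

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=3, batch_size=16,  # assumed values
    callbacks=callbacks,
)
```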
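Restoring the best weights and computing the F1 score and classification report, as a sketch; weighted F1 averaging is an assumption:

```python
import numpy as np
from sklearn.metrics import classification_report, f1_score

model.load_weights("best_model.weights.h5")  # restore the best checkpoint

probs = model.predict(X_val)
y_pred = np.argmax(probs, axis=-1)
y_true = y_val.numpy()

print("F1:", f1_score(y_true, y_pred, average="weighted"))  # averaging assumed
print(classification_report(y_true, y_pred))
```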