Todo for the project
- Installs and imports the necessary packages: TensorFlow, pandas, scikit-learn, NumPy, the `re` module for regular expressions, NLTK, Matplotlib, and the Hugging Face Transformers library.
- Defines several helper functions for cleaning text: lowercasing, removing special characters, and dropping stopwords and short words.
- Reads the dataset from a CSV file, drops unnecessary columns, removes rows containing NaN values, and shuffles the rows.
- Loads the DistilBERT tokenizer and model from the Hugging Face Transformers library.
- Sets the maximum length for input sentences.
- Tokenizes and encodes sentences using the DistilBERT tokenizer.
- Prepares input sentences, attention masks, and labels for model training.
- Defines a neural network model on top of the DistilBERT embeddings, consisting of a Dense layer, a Dropout layer, and an output layer.
- Saves the model inputs (input_ids, attention_masks, labels) to pickle files for later reuse.
- Splits the data into training and validation sets.
- Compiles the model with its loss function, metrics, and optimizer.
- Trains the model on the training data, validating on the validation set.
- Saves the best model weights, selected by validation loss.
- Uses TensorBoard to visualize training and validation curves.
- Loads the saved model weights.
- Uses the model to make predictions on the validation set.
- Calculates and prints the F1 score and classification report.
- Creates and compiles a new model instance for future use, then prints the F1 score and classification report on the validation set again.
The code essentially demonstrates fine-tuning a DistilBERT model for text classification using TensorFlow and Keras; illustrative sketches of each major step follow below.
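A minimal sketch of the cleaning helpers, assuming an English stopword list, a two-character cutoff for "short" words, and the function name `clean_text` (all assumptions, not taken from the original code):

```python
import re

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))  # English stopwords assumed

def clean_text(text: str) -> str:
    """Lowercase, strip special characters, and drop stopwords/short words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters and whitespace only
    tokens = [t for t in text.split() if t not in STOPWORDS and len(t) > 2]
    return " ".join(tokens)
```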
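Loading, cleaning, and shuffling the dataset might look as follows; the file name `data.csv` and the column names `id` and `text` are hypothetical:

```python
import pandas as pd

df = pd.read_csv("data.csv")              # file name is an assumption
df = df.drop(columns=["id"])              # drop an unnecessary column (assumed name)
df = df.dropna()                          # remove rows with NaN values
df = df.sample(frac=1, random_state=42).reset_index(drop=True)  # shuffle
df["text"] = df["text"].apply(clean_text)  # reuse the cleaning helper above
```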
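A sketch of loading the tokenizer/model and encoding the sentences, assuming the `distilbert-base-uncased` checkpoint, a maximum length of 128, and a `label` column (all three are assumptions):

```python
import tensorflow as tf
from transformers import DistilBertTokenizerFast, TFDistilBertModel

MAX_LEN = 128  # assumed maximum sequence length

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
bert = TFDistilBertModel.from_pretrained("distilbert-base-uncased")

# Tokenize, pad/truncate to MAX_LEN, and return TensorFlow tensors.
encodings = tokenizer(
    df["text"].tolist(),
    max_length=MAX_LEN,
    padding="max_length",
    truncation=True,
    return_tensors="tf",
)
input_ids = encodings["input_ids"]
attention_masks = encodings["attention_mask"]
labels = tf.convert_to_tensor(df["label"].values)  # "label" column is an assumption
```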
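The Dense/Dropout head on top of DistilBERT could be wired up as below; the hidden size (64), dropout rate (0.2), and two-class softmax output are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import Model, layers

def build_model(bert, max_len: int, num_classes: int) -> Model:
    """DistilBERT [CLS] embedding -> Dense -> Dropout -> softmax head."""
    ids = layers.Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
    mask = layers.Input(shape=(max_len,), dtype=tf.int32, name="attention_mask")
    embedding = bert(ids, attention_mask=mask)[0][:, 0, :]  # [CLS] token vector
    x = layers.Dense(64, activation="relu")(embedding)      # hidden size assumed
    x = layers.Dropout(0.2)(x)                              # dropout rate assumed
    out = layers.Dense(num_classes, activation="softmax")(x)
    return Model(inputs=[ids, mask], outputs=out)

model = build_model(bert, MAX_LEN, num_classes=2)  # binary task assumed
```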
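Caching the encoded inputs with pickle, as a sketch; the file names are assumptions:

```python
import pickle

# Cache the encoded inputs so the slow tokenization step can be skipped later.
for name, tensor in [("input_ids", input_ids),
                     ("attention_masks", attention_masks),
                     ("labels", labels)]:
    with open(f"{name}.pkl", "wb") as f:  # file names are assumptions
        pickle.dump(tensor.numpy(), f)    # store as plain NumPy arrays
```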
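One way to split the data while keeping ids, masks, and labels aligned is to split indices; the 90/10 ratio and random seed are assumptions:

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

idx_train, idx_val = train_test_split(
    np.arange(int(labels.shape[0])), test_size=0.1, random_state=42
)
X_train = [tf.gather(input_ids, idx_train), tf.gather(attention_masks, idx_train)]
X_val = [tf.gather(input_ids, idx_val), tf.gather(attention_masks, idx_val)]
y_train = tf.gather(labels, idx_train)
y_val = tf.gather(labels, idx_val)
```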
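Compiling and training with best-checkpoint saving and TensorBoard logging might look like this; the Adam optimizer, learning rate, sparse categorical cross-entropy loss, epoch count, batch size, and file paths are all assumptions:

```python
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),  # learning rate assumed
    loss="sparse_categorical_crossentropy",                  # integer labels assumed
    metrics=["accuracy"],
)

callbacks = [
    # Keep only the weights with the lowest validation loss.
    tf.keras.callbacks.ModelCheckpoint(
        "best_model.weights.h5", monitor="val_loss",
        save_best_only=True, save_weights_only=True,
    ),
    # Log curves for inspection with `tensorboard --logdir logs`.
    tf.keras.callbacks.TensorBoard(log_dir="logs"),
]

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=3, batch_size=16,  # assumed values
    callbacks=callbacks,
)
```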
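Restoring the best weights and computing the F1 score and classification report, as a sketch; weighted F1 averaging is an assumption:

```python
import numpy as np
from sklearn.metrics import classification_report, f1_score

model.load_weights("best_model.weights.h5")  # restore the best checkpoint

probs = model.predict(X_val)
y_pred = np.argmax(probs, axis=-1)
y_true = y_val.numpy()

print("F1:", f1_score(y_true, y_pred, average="weighted"))  # averaging assumed
print(classification_report(y_true, y_pred))
```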