intent-classification-shallow-ml

Training an ML classifier to distinguish between intents

Primary language: Jupyter Notebook

Feature Extraction Pipeline

  • Text Preprocessing: Each input text is cleaned by removing non-alphabetic characters, converting it to lowercase, tokenizing it into words, stemming each word with PorterStemmer, removing stop words, and joining the remaining words back into a single string.

  • Label Encoding: The target variable "intent" is encoded as integer labels using the LabelEncoder class from the scikit-learn library.

  • Vectorization: The preprocessed text is converted into numerical features using the TfidfVectorizer class from the scikit-learn library. It produces a matrix of term frequency-inverse document frequency (TF-IDF) scores, where each row represents a text instance and each column represents a unique term in the vocabulary. A TF-IDF score weighs how often a term occurs in a text instance against how common that term is across the entire dataset, so terms that appear in many texts receive a lower weight.

  • Classifier Training: SVM and Logistic Regression classifiers are trained on the vectorized training data by calling each estimator's fit() method.

  • Prediction: The trained classifiers make predictions on the vectorized test data through each estimator's predict() method.

  • Evaluation: Each classifier's accuracy is computed with its score() method, and its confusion matrix with the confusion_matrix() function from scikit-learn's metrics module.
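
The steps above can be sketched end to end as follows. This is a minimal illustration and not the notebook's actual code. Several details are assumptions: the toy `texts` and `intents` lists are hypothetical, PorterStemmer is taken from NLTK, tokenization is a plain whitespace split, scikit-learn's built-in `ENGLISH_STOP_WORDS` stands in for whatever stop word list the notebook uses, the train/test split ratio is arbitrary, and the SVM uses a linear kernel.

```python
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

stemmer = PorterStemmer()


def preprocess(text):
    """Clean one text: letters only, lowercase, tokenize, drop stop words, stem."""
    text = re.sub(r"[^a-zA-Z]", " ", text).lower()
    tokens = text.split()  # assumption: whitespace tokenization
    # Assumption: stop words are filtered on the unstemmed token, because
    # stemmed forms (e.g. "thi" for "this") would not match the stop list.
    kept = [stemmer.stem(tok) for tok in tokens if tok not in ENGLISH_STOP_WORDS]
    return " ".join(kept)


# Hypothetical toy dataset: three intents, four examples each.
texts = [
    "Hello there!", "Hi, good morning", "Hey, how are you?", "Good evening to you",
    "What's the weather today?", "Will it rain tomorrow?",
    "Is it sunny outside?", "Weather forecast for this weekend",
    "Book a flight to Paris", "Reserve a table for two",
    "I want to book a hotel room", "Reserve two tickets for tonight",
]
intents = ["greeting"] * 4 + ["weather"] * 4 + ["booking"] * 4

# Label encoding: intent names -> integer class ids.
encoder = LabelEncoder()
y = encoder.fit_transform(intents)

# Preprocess, then split (assumed 75/25, stratified by intent).
cleaned = [preprocess(t) for t in texts]
X_train_text, X_test_text, y_train, y_test = train_test_split(
    cleaned, y, test_size=0.25, stratify=y, random_state=0
)

# Vectorization: fit the vocabulary on training text only, then reuse it.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

# Training, prediction, and evaluation for both classifiers.
classifiers = {
    "SVM": SVC(kernel="linear"),  # assumption: linear kernel
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
all_labels = list(range(len(encoder.classes_)))
results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    results[name] = {
        "accuracy": clf.score(X_test, y_test),
        "confusion_matrix": confusion_matrix(y_test, y_pred, labels=all_labels),
    }
    print(name, "accuracy:", results[name]["accuracy"])
    print(results[name]["confusion_matrix"])
```

Fitting the vectorizer on the training split alone keeps test vocabulary and document frequencies out of the learned features. Predicted integer labels can be mapped back to intent names with `encoder.inverse_transform(y_pred)`.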