This repository provides a complete pipeline for classifying airline tweets as either positive or negative by converting traditional word embeddings into Hyperdimensional Computing (HDC) hypervectors. The system leverages HDC’s robustness with a novel integration of positional encoding and hypervector representation for sentiment classification tasks.
- Overview
- Project Structure
- Installation
- Usage
- File Explanations
- Workflow Summary
- Methodology
- Dataset
- Example Output
## Overview

This project combines traditional text processing, Word2Vec embeddings, and custom HDC hypervectors. The pipeline runs in four stages:
- Dataset Preparation: Cleaning and splitting the dataset.
- Positional Encoding: Adding positional information to word embeddings.
- Hypervector Dictionary Creation: Generating high-dimensional vectors for encoding numerical data.
- Model Training and Evaluation: Training an HDC model for sentiment classification and evaluating its performance.
## Project Structure

| File | Description |
|---|---|
| `airline_sentiment_dataset_creation.py` | Prepares, cleans, and splits the raw tweet dataset. |
| `airline_pos_encoded_dataset_creation.py` | Applies positional encoding to tweet embeddings. |
| `dict_creation.py` | Generates a hypervector dictionary for numerical encoding. |
| `HDC_epoch_training_with_lr.py` | Trains the HDC model over multiple epochs and evaluates performance on test data. |
| `Tweets.csv` | Dataset of airline tweets with sentiment labels. |
| `twitter_us_airlines_word2vec_64.model` | Pre-trained Word2Vec model for embedding words in tweets. |
| `airline_train_data_encoded.pkl`, `airline_test_data_encoded.pkl` | Encoded positional data for model training and testing. |
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/username/repo_name.git
  cd repo_name
  ```

- Install dependencies:

  ```bash
  pip install numpy pandas gensim scikit-learn keras
  ```

- Download or add the necessary files: ensure `Tweets.csv` (raw tweet dataset) and `twitter_us_airlines_word2vec_64.model` (pre-trained Word2Vec model) are in the root directory.
## Usage

Run each stage of the pipeline in the following order:

```bash
python airline_sentiment_dataset_creation.py
python airline_pos_encoded_dataset_creation.py
python dict_creation.py
python HDC_epoch_training_with_lr.py
```
## File Explanations

### `airline_sentiment_dataset_creation.py`

This script prepares the dataset for sentiment analysis by:

- Loading and filtering tweets from `Tweets.csv` to retain only those labeled as positive or negative.
- Cleaning each tweet (removing mentions, URLs, punctuation, and numbers).
- Mapping sentiments to numerical values (0 for negative, 1 for positive).
- Splitting the data into training and test sets (80/20 split).

Outputs:

- `airline_train_data.pkl`: Training data (tweets and labels).
- `airline_test_data.pkl`: Testing data (tweets and labels).

Purpose: Ensures that the data is preprocessed and ready for encoding and model training.
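A minimal sketch of these preprocessing steps, assuming the standard Kaggle column names `text` and `airline_sentiment`; the repository's script may differ in the exact cleaning rules and column names:

```python
# Illustrative preprocessing sketch; column names are assumptions.
import pickle
import re

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("Tweets.csv")
df = df[df["airline_sentiment"].isin(["positive", "negative"])]  # drop neutral tweets

def clean_tweet(text: str) -> str:
    """Remove mentions, URLs, punctuation, and numbers; lowercase the rest."""
    text = re.sub(r"@\w+", " ", text)               # mentions
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)        # punctuation and digits
    return re.sub(r"\s+", " ", text).strip().lower()

df["text"] = df["text"].apply(clean_tweet)
df["label"] = df["airline_sentiment"].map({"negative": 0, "positive": 1})

train_df, test_df = train_test_split(
    df[["text", "label"]], test_size=0.2, random_state=42  # 80/20 split
)

with open("airline_train_data.pkl", "wb") as f:
    pickle.dump(train_df, f)
with open("airline_test_data.pkl", "wb") as f:
    pickle.dump(test_df, f)
```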
### `airline_pos_encoded_dataset_creation.py`

This script performs positional encoding on the cleaned tweets:

- Loads `airline_train_data.pkl` and `airline_test_data.pkl`.
- Retrieves word embeddings using a pre-trained Word2Vec model (`twitter_us_airlines_word2vec_64.model`).
- Applies a positional encoding method to capture the sequence structure in the embeddings.
- Normalizes the embeddings for consistent scaling.

Outputs:

- `imdb_train_pos_encoded.pkl`: Positionally encoded training data.
- `imdb_test_pos_encoded.pkl`: Positionally encoded test data.

Purpose: Adds positional information to the word embeddings, which helps the HDC model distinguish words based on their sequence in a tweet.
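The sketch below shows one way to implement this stage, assuming a sinusoidal positional encoding (as used in Transformers) added to the 64-dimensional Word2Vec vectors, followed by L2 normalization; the actual encoding scheme in `airline_pos_encoded_dataset_creation.py` may differ.

```python
# Hedged sketch: sinusoidal positional encoding over Word2Vec vectors.
# The encoding scheme and helper names are assumptions, not the repo's code.
import numpy as np
from gensim.models import Word2Vec

EMB_DIM = 64  # matches twitter_us_airlines_word2vec_64.model
w2v = Word2Vec.load("twitter_us_airlines_word2vec_64.model")

def positional_encoding(max_len: int, dim: int) -> np.ndarray:
    """Standard sinusoidal position matrix of shape (max_len, dim)."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    pe = np.zeros((max_len, dim))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

def encode_tweet(tokens: list[str]) -> np.ndarray:
    """Return one normalized, positionally encoded vector per in-vocabulary token."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    if not vecs:
        return np.zeros((0, EMB_DIM))
    emb = np.stack(vecs) + positional_encoding(len(vecs), EMB_DIM)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    return emb / np.clip(norms, 1e-8, None)  # normalize for consistent scaling
```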
### `dict_creation.py`

This script generates a dictionary of hypervectors for encoding numerical values:

- Defines a range of values with a precision step (e.g., 0.0001) and assigns each value a unique high-dimensional vector.
- The first hypervector is randomly initialized; each subsequent value's vector is derived from the previous one by flipping components, so neighbouring values receive similar hypervectors.

Outputs:

- `hdc_10k_0.0001.json`: A JSON file mapping values to their corresponding hypervectors.

Purpose: Provides a mapping of numerical values to high-dimensional vectors, enabling the encoding of continuous data in HDC.
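The following sketch builds such a level-hypervector dictionary. The 10,000-dimensional bipolar vectors, the [0, 1] value range, and the number of flips per step are assumptions (partly inferred from the file name `hdc_10k_0.0001.json`), not taken from `dict_creation.py`.

```python
# Hedged sketch of a level-hypervector dictionary; parameters are assumptions.
import json

import numpy as np

DIM = 10_000          # hypervector dimensionality (inferred from "10k")
STEP = 0.0001         # precision step (inferred from the file name)
FLIPS_PER_STEP = 1    # components flipped between neighbouring levels (assumption)

rng = np.random.default_rng(0)
levels = np.round(np.linspace(0.0, 1.0, int(round(1 / STEP)) + 1), 4)

current = rng.choice([-1, 1], size=DIM)       # random bipolar seed vector
dictionary = {}
for value in levels:
    dictionary[f"{value:.4f}"] = current.tolist()
    flip_idx = rng.choice(DIM, size=FLIPS_PER_STEP, replace=False)
    current = current.copy()
    current[flip_idx] *= -1                   # flip a few components for the next level

with open("hdc_10k_0.0001.json", "w") as f:
    json.dump(dictionary, f)
```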
### `HDC_epoch_training_with_lr.py`

This script trains and evaluates the HDC model:

- Loads the encoded data (`airline_train_data_encoded.pkl` and `airline_test_data_encoded.pkl`) and the hypervector dictionary (`hdc_10k_0.0001.json`).
- Aggregates hypervectors for each sentiment class (positive and negative) based on the tweet embeddings.
- Updates the class hypervectors over multiple epochs, adjusting the learning rate to optimize performance.
- After training, evaluates the model on the test data and reports the accuracy.

Outputs:

- Final training accuracy and timing metrics printed to the console.

Purpose: Trains an HDC model to classify tweet sentiment based on aggregated class hypervectors, providing insight into HDC's efficacy for sentiment analysis.
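The sketch below shows a prototype-style training loop of the kind described above: class hypervectors are bundled from the training examples, refined over several epochs with a decaying learning rate, and test tweets are classified by cosine similarity. Function names, the update rule, and the decay schedule are illustrative assumptions, not taken from `HDC_epoch_training_with_lr.py`.

```python
# Hedged sketch of prototype training for a two-class HDC model.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two hypervectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def train_hdc(train_hvs, train_labels, dim, epochs=10, lr=1.0):
    """train_hvs: encoded tweet hypervectors; labels are 0 (negative) or 1 (positive)."""
    prototypes = np.zeros((2, dim))
    for hv, y in zip(train_hvs, train_labels):       # initial bundling per class
        prototypes[y] += hv
    for _ in range(epochs):                          # perceptron-style refinement
        for hv, y in zip(train_hvs, train_labels):
            pred = int(np.argmax([cosine(hv, p) for p in prototypes]))
            if pred != y:                            # update only on mistakes
                prototypes[y] += lr * hv
                prototypes[pred] -= lr * hv
        lr *= 0.9                                    # decay the learning rate (assumption)
    return prototypes

def evaluate(prototypes, test_hvs, test_labels) -> float:
    """Accuracy of nearest-prototype classification on the test set."""
    preds = [int(np.argmax([cosine(hv, p) for p in prototypes])) for hv in test_hvs]
    return float(np.mean(np.array(preds) == np.array(test_labels)))
```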
## Workflow Summary

- Data Preprocessing:
  - Clean and split the raw data in `Tweets.csv`.
  - Outputs training and test sets as `airline_train_data.pkl` and `airline_test_data.pkl`.
- Positional Encoding:
  - Convert tweets into word embeddings using `twitter_us_airlines_word2vec_64.model`.
  - Apply positional encoding and normalization, outputting `imdb_train_pos_encoded.pkl` and `imdb_test_pos_encoded.pkl`.
- Hypervector Dictionary Creation:
  - Generate a dictionary of hypervectors that maps numerical values to high-dimensional vectors.
  - Outputs the dictionary as `hdc_10k_0.0001.json`.
- Training and Evaluation:
  - Train the HDC model on the positionally encoded training data.
  - Evaluate model accuracy on the test data using the main script, `HDC_epoch_training_with_lr.py`.