Word2HyperVec: From Word Embeddings to Hypervectors for Hyperdimensional Computing

This repository provides a complete pipeline for classifying airline tweets as either positive or negative by converting traditional word embeddings into Hyperdimensional Computing (HDC) hypervectors. The system leverages HDC’s robustness with a novel integration of positional encoding and hypervector representation for sentiment classification tasks.

Table of Contents

  • Overview
  • Project Structure
  • Installation
  • Usage
  • File Explanations
  • Workflow Summary

Overview

This project combines traditional text processing, Word2Vec embeddings, and custom HDC hypervectors. It includes multiple stages:

  1. Dataset Preparation: Cleaning and splitting the dataset.
  2. Positional Encoding: Adding positional information to word embeddings.
  3. Hypervector Dictionary Creation: Generating high-dimensional vectors for encoding numerical data.
  4. Model Training and Evaluation: Training an HDC model for sentiment classification and evaluating its performance.

Project Structure

  • airline_sentiment_dataset_creation.py: Prepares, cleans, and splits the raw tweet dataset.
  • airline_pos_encoded_dataset_creation.py: Applies positional encoding to tweet embeddings.
  • dict_creation.py: Generates a hypervector dictionary for numerical encoding.
  • HDC_epoch_training_with_lr.py: Trains the HDC model over multiple epochs and evaluates performance on the test data.
  • Tweets.csv: Dataset of airline tweets with sentiment labels.
  • twitter_us_airlines_word2vec_64.model: Pre-trained Word2Vec model for embedding the words in tweets.
  • airline_train_data_encoded.pkl, airline_test_data_encoded.pkl: Positionally encoded data for model training and testing.

Installation

  1. Clone the repository:
    git clone https://github.com/username/repo_name.git
    cd repo_name
  2. Install dependencies:
    pip install numpy pandas gensim scikit-learn keras
  3. Download or add the necessary files:

    Ensure Tweets.csv (raw tweet dataset) and twitter_us_airlines_word2vec_64.model (Word2Vec model) are in the root directory.

Usage

Run each stage of the pipeline in the following order:

Step 1: Data Preparation

python airline_sentiment_dataset_creation.py

Step 2: Positional Encoding

python airline_pos_encoded_dataset_creation.py

Step 3: Hypervector Dictionary Creation

python dict_creation.py

Step 4: Model Training and Evaluation

python HDC_epoch_training_with_lr.py

File Explanations

1. airline_sentiment_dataset_creation.py

This script prepares the dataset for sentiment analysis by:

  • Loading and filtering tweets from Tweets.csv to retain only those labeled as positive or negative.
  • Cleaning each tweet (removing mentions, URLs, punctuation, and numbers).
  • Mapping sentiments to numerical values (0 for negative, 1 for positive).
  • Splitting the data into training and test sets (80/20 split).

Outputs:

  • airline_train_data.pkl: Training data (tweets and labels).
  • airline_test_data.pkl: Testing data (tweets and labels).

Purpose: Ensures that the data is preprocessed and ready for encoding and model training.
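The cleaning step described above can be sketched with a few regular expressions; the exact patterns used in the script may differ:

```python
import re

def clean_tweet(text):
    """Strip mentions, URLs, punctuation, and numbers from a raw tweet."""
    text = re.sub(r"@\w+", "", text)              # remove @mentions
    text = re.sub(r"http\S+|www\.\S+", "", text)  # remove URLs
    text = re.sub(r"[^A-Za-z\s]", "", text)       # drop punctuation and digits
    return re.sub(r"\s+", " ", text).lower().strip()
```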

2. airline_pos_encoded_dataset_creation.py

This script performs positional encoding on the cleaned tweets:

  • Loads airline_train_data.pkl and airline_test_data.pkl.
  • Retrieves word embeddings using a pre-trained Word2Vec model (twitter_us_airlines_word2vec_64.model).
  • Applies a positional encoding method to capture the sequence structure in the embeddings.
  • Normalizes embeddings for consistent scaling.

Outputs:

  • airline_train_data_encoded.pkl: Positionally encoded training data.
  • airline_test_data_encoded.pkl: Positionally encoded test data.

Purpose: Adds positional information to the word embeddings, which helps the HDC model distinguish words based on their sequence in a tweet.
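The README does not spell out the encoding scheme; a common choice, shown here as a sketch, is the sinusoidal positional encoding added to each word's 64-dimensional Word2Vec vector, followed by per-row normalization:

```python
import numpy as np

def positional_encoding(seq_len, dim=64):
    # Standard sinusoidal positional encoding; the script's exact scheme may differ.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(angles[:, 0::2])  # even indices: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])  # odd indices: cosine
    return pe

def encode_tweet(embeddings):
    # embeddings: (seq_len, 64) array of Word2Vec vectors for one tweet
    enc = embeddings + positional_encoding(len(embeddings), embeddings.shape[1])
    norms = np.linalg.norm(enc, axis=1, keepdims=True)
    return enc / np.where(norms == 0, 1.0, norms)  # unit-normalize each row
```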

3. dict_creation.py

This script generates a dictionary of hypervectors for encoding numerical values:

  • Defines a range of values with a precision step (e.g., 0.0001) and assigns each value a unique high-dimensional vector.
  • A hypervector is randomly initialized for the first value, and a small number of components are flipped for each subsequent value, so nearby values receive similar hypervectors.

Outputs:

  • hdc_10k_0.0001.json: A JSON file mapping values to their corresponding hypervectors.

Purpose: Provides a mapping of numerical values to high-dimensional vectors, enabling the encoding of continuous data in HDC.
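This level-hypervector construction can be sketched as follows; the range, step, dimensionality, and number of flips per step here are illustrative defaults (the filename hdc_10k_0.0001.json suggests 10,000 dimensions and a 0.0001 step), not the script's exact parameters:

```python
import numpy as np

def build_level_dict(start=0.0, stop=1.0, step=0.0001, dim=10_000, flips=10, seed=0):
    """Map each value in [start, stop] to a bipolar hypervector.

    Starts from a random {-1, +1} vector and flips a few random components
    per step, so numerically close values get similar hypervectors.
    """
    rng = np.random.default_rng(seed)
    hv = rng.choice([-1, 1], size=dim)
    levels = {}
    value = start
    while value <= stop + 1e-12:
        levels[round(value, 4)] = hv.copy()
        idx = rng.choice(dim, size=flips, replace=False)
        hv[idx] *= -1                      # flip `flips` components for the next level
        value += step
    return levels
```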

4. HDC_epoch_training_with_lr.py

This script trains and evaluates the HDC model:

  • Loads encoded data (airline_train_data_encoded.pkl and airline_test_data_encoded.pkl) and the hypervector dictionary (hdc_10k_0.0001.json).
  • Aggregates hypervectors for each sentiment (positive and negative) based on tweet embeddings.
  • Updates hypervectors over multiple epochs and adjusts learning rates to optimize performance.
  • After training, evaluates the model on the test data and outputs the accuracy.

Outputs:

  • Training and test accuracy, along with timing metrics, printed to the console.

Purpose: Trains an HDC model to classify tweet sentiment based on aggregate hypervectors, providing insights into HDC’s efficacy in sentiment analysis.
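The training procedure follows the usual HDC pattern: bundle (sum) the hypervectors of each class into a prototype, then refine the prototypes over several epochs with learning-rate-scaled updates on misclassified examples. The sketch below is a generic version of that loop, not the script's exact code:

```python
import numpy as np

def predict(protos, hv):
    # Classify by cosine similarity to each class prototype (0 = negative, 1 = positive)
    sims = protos @ hv / (np.linalg.norm(protos, axis=1) * np.linalg.norm(hv) + 1e-9)
    return int(np.argmax(sims))

def train_hdc(X, y, epochs=10, lr=1.0):
    # X: (n_samples, dim) encoded hypervectors; y: 0/1 sentiment labels
    protos = np.zeros((2, X.shape[1]))
    for hv, label in zip(X, y):            # initial bundling: sum per class
        protos[label] += hv
    for _ in range(epochs):                # iterative refinement
        for hv, label in zip(X, y):
            pred = predict(protos, hv)
            if pred != label:              # move prototypes on mistakes only
                protos[label] += lr * hv
                protos[pred] -= lr * hv
    return protos
```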

Workflow Summary

  1. Data Preprocessing:
    • Clean and split raw data in Tweets.csv.
    • Outputs training and test sets in airline_train_data.pkl and airline_test_data.pkl.
  2. Positional Encoding:
    • Convert tweets into word embeddings using twitter_us_airlines_word2vec_64.model.
    • Apply positional encoding and normalization, outputting airline_train_data_encoded.pkl and airline_test_data_encoded.pkl.
  3. Hypervector Dictionary Creation:
    • Generate a dictionary of hypervectors to map numerical values to high-dimensional vectors.
    • Outputs the dictionary in hdc_10k_0.0001.json.
  4. Training and Evaluation:
    • Train the HDC model on the positionally encoded training data.
    • Evaluate model accuracy on the test data with HDC_epoch_training_with_lr.py.