/fyp

Final Year Project

Bitcoin Price Prediction Through Twitter Sentiment and Data Volume

This is the official repository for the Bitcoin Price Prediction Through Twitter Sentiment and Data Volume project.

Author: Jacques Vella Critien

Supervisors: Dr Joshua Ellul, Prof Albert Gatt

Structure

project
│   README.md  
│
└───data_generators_cleaners
│   │   BTCprices_cleaner.ipynb
│   │   data_grouper.ipynb
│   │   crypto_prices_getter_yahoo.ipynb
│   │   data_lag_creator.ipynb
│   │   english_tweets_extractor.ipynb
│   │   polarity_adder.ipynb
│   │   tweet_preprocessor.py
│   │   tweets_cleaner_finaliser.ipynb
│   │   
│   └───with_sentiment
│       │   
│       └───VADER
│            │   tweets.csv
│            └───cleaned
│                │   tweets.csv
│   
└───datasets
│   │
│   └───general
│   │   │   BTCDATAwithdate.csv
│   │   │   BTCTWEETS_english_no_duplicates.csv
│   │   │   preprocessed_tweets.csv
│   │   │   tweets_cleaned.csv
│   │   │   BTCDATA.csv
│   │   │   BTCTWEETS.csv
│   │
│   └───tweets_prices
│   │   │
│   │   └───vader
│   │       │   final_days_lag_days_1.csv
│   │       │   final_days_lag_days_3.csv
│   │       │   final_days_lag_days_7.csv
│   │       │   final_days_lag_hours_1.csv
│   │       │   final_days_lag_hours_3.csv
│   │       │   final_days_lag_hours_7.csv
│   │       │   
│   │       └───cleaned
│   │           │   final_days_lag_days_1.csv
│   │           │   final_days_lag_days_3.csv
│   │           │   final_days_lag_days_7.csv
│   │           │   final_days_lag_hours_1.csv
│   │           │   final_days_lag_hours_3.csv
│   │           │   final_days_lag_hours_7.csv
│   │
│   └───tweets_prices_volumes_sentiment
│       │
│       └───vader
│           │
│           └───day_datasets
│               │   final_days_lag_days_1.csv
│               │   final_days_lag_days_3.csv
│               │   final_days_lag_days_7.csv
│               │   final_days_lag_hours_1.csv
│               │   final_days_lag_hours_3.csv
│               │   final_days_lag_hours_7.csv
│               │   
│               └───cleaned
│                   │   final_days_lag_days_1.csv
│                   │   final_days_lag_days_3.csv
│                   │   final_days_lag_days_7.csv
│                   │   final_days_lag_hours_1.csv
│                   │   final_days_lag_hours_3.csv
│                   │   final_days_lag_hours_7.csv
│   
└───models
│   │
│   └───results
│   │
│   └───bilstm_multiclass
│   │   │   bilstm_multiclass.ipynb
│   │   │   bilstm_multiclass-tester.ipynb
│   │   │   model.png
│   │
│   └───bilstm_trend
│   │   │   bilstm_trend.ipynb
│   │   │   bilstm_trend-tester.ipynb
│   │   │   model.png
│   │
│   └───cnn_multiclass
│   │   │   cnn_multiclass.ipynb
│   │   │   cnn_multiclass-tester.ipynb
│   │   │   model.png
│   │
│   └───cnn_trend
│   │   │   cnn_trend.ipynb
│   │   │   cnn_trend-tester.ipynb
│   │   │   model.png
│   │
│   └───lstm_multiclass
│   │   │   lstm_multiclass.ipynb
│   │   │   lstm_multiclass-tester.ipynb
│   │   │   model.png
│   │
│   └───lstm_trend
│   │   │   lstm_trend.ipynb
│   │   │   lstm_trend-tester.ipynb
│   │   │   model.png
│   │
│   └───voting_classifier
│       │   voting_classifier.ipynb
│       │   voting_classifier-tester.ipynb
│   
└───papers
    │ ALL PAPERS USED   

data_generators_cleaners

This folder contains all the scripts required to clean and preprocess the data

  1. BTCprices_cleaner.ipynb - Used to clean the prices dataset. More specifically, it sets the timestamp to UTC and removes the Open, High and Low values.
  2. crypto_prices_getter_yahoo.ipynb - Used to obtain crypto prices from Yahoo at different intervals
  3. data_grouper.ipynb - Used to group the lagged dataset hourly or daily
  4. data_lag_creator.ipynb - Used to create lagged datasets
  5. english_tweets_extractor.ipynb - Used to remove duplicates and non-English tweets
  6. polarity_adder.ipynb - Used to add polarity and sentiment to tweets
  7. tweets_cleaner_finaliser.ipynb - Used in the last step to remove tweets with fewer than 4 words after cleaning
  8. tweet_preprocessor.py - Used to clean and preprocess tweets
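
The price-cleaning and final tweet-filtering steps listed above can be illustrated with a minimal sketch. It is not the notebooks' actual code: the column names (`Timestamp`, `Open`, `High`, `Low`, `Close`), the assumption that timestamps are Unix seconds, and both function names are hypothetical.

```python
from datetime import datetime, timezone

# Assumed column names; the real price dataset may label these differently.
DROPPED_COLUMNS = {"Open", "High", "Low"}


def clean_price_row(row):
    """Drop Open/High/Low and convert a Unix timestamp to a UTC datetime."""
    cleaned = {k: v for k, v in row.items() if k not in DROPPED_COLUMNS}
    cleaned["Timestamp"] = datetime.fromtimestamp(row["Timestamp"], tz=timezone.utc)
    return cleaned


def keep_tweet(text, min_words=4):
    """Return True if a cleaned tweet still has at least `min_words` words."""
    return len(text.split()) >= min_words


row = {"Timestamp": 0, "Open": 1.0, "High": 2.0, "Low": 0.5, "Close": 1.5}
print(clean_price_row(row))
print(keep_tweet("buy btc now"))
```

The same logic would normally be applied column-wise to a whole table inside a notebook; plain dictionaries are used here only to keep the example free of dependencies.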

datasets

This folder contains all the datasets. The complete folder can be found here, because it could not all be committed to git due to its size.

  1. general - Contains the general datasets used to create lagged datasets including the original ones

    • with_sentiment - Folder with files containing tweets and their sentiment scores
    • BTCDATAwithdate.csv - This contains the cleaned BTC prices dataset
    • BTCTWEETS_english_no_duplicates.csv - This contains the dataset of tweets without duplicates and non-English tweets
    • preprocessed_tweets.csv - This contains the tweets cleaned and preprocessed by the tweet_preprocessor.py script
    • tweets_cleaned.csv - The final cleaned dataset
    • BTCDATA.csv - Original dataset for BTC prices
    • BTCTWEETS.csv - Original dataset for BTC tweets
  2. tweets_prices - Contains the datasets with each tweet together with the corresponding BTC price

    • vader - Datasets with VADER polarity scores
      • final_days_lag_days_1.csv - Dataset containing uncleaned tweets and prices with 1 day lag
      • final_days_lag_days_3.csv - Dataset containing uncleaned tweets and prices with 3 days lag
      • final_days_lag_days_7.csv - Dataset containing uncleaned tweets and prices with 7 days lag
      • final_days_lag_hours_1.csv - Dataset containing uncleaned tweets and prices with 1 hour lag
      • final_days_lag_hours_3.csv - Dataset containing uncleaned tweets and prices with 3 hours lag
      • final_days_lag_hours_7.csv - Dataset containing uncleaned tweets and prices with 7 hours lag
      • cleaned - Contains cleaned datasets
        • final_days_lag_days_1.csv - Dataset containing cleaned tweets and prices with 1 day lag
        • final_days_lag_days_3.csv - Dataset containing cleaned tweets and prices with 3 days lag
        • final_days_lag_days_7.csv - Dataset containing cleaned tweets and prices with 7 days lag
        • final_days_lag_hours_1.csv - Dataset containing cleaned tweets and prices with 1 hour lag
        • final_days_lag_hours_3.csv - Dataset containing cleaned tweets and prices with 3 hours lag
        • final_days_lag_hours_7.csv - Dataset containing cleaned tweets and prices with 7 hours lag
  3. tweets_prices_volumes_sentiment - Contains the grouped datasets with the polarity scores averaged over each time interval, together with the corresponding BTC prices

    • vader - Datasets with VADER polarity scores
      • day_datasets - Datasets grouped daily
        • final_days_lag_days_1.csv - grouped uncleaned dataset with 1 day lag
        • final_days_lag_days_3.csv - grouped uncleaned dataset with 3 days lag
        • final_days_lag_days_7.csv - grouped uncleaned dataset with 7 days lag
        • final_days_lag_hours_1.csv - grouped uncleaned dataset with 1 hour lag
        • final_days_lag_hours_3.csv - grouped uncleaned dataset with 3 hours lag
        • final_days_lag_hours_7.csv - grouped uncleaned dataset with 7 hours lag
        • cleaned - Contains cleaned datasets
          • final_days_lag_days_1.csv - grouped cleaned dataset with 1 day lag
          • final_days_lag_days_3.csv - grouped cleaned dataset with 3 days lag
          • final_days_lag_days_7.csv - grouped cleaned dataset with 7 days lag
          • final_days_lag_hours_1.csv - grouped cleaned dataset with 1 hour lag
          • final_days_lag_hours_3.csv - grouped cleaned dataset with 3 hours lag
          • final_days_lag_hours_7.csv - grouped cleaned dataset with 7 hours lag
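
As a rough illustration of how a grouped, lagged dataset of this kind could be built, the sketch below averages polarity per day, counts tweets as the volume, and attaches a closing price. It assumes that "lag" means pairing each day's tweets with the closing price `lag_days` later; that interpretation, the field names, and the function name are assumptions rather than the project's documented behaviour.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean


def group_daily(tweets, closes, lag_days=1):
    """Group (day, polarity) pairs by day and attach the lagged closing price.

    tweets: iterable of (date, polarity) tuples.
    closes: dict mapping date -> closing price.
    """
    by_day = defaultdict(list)
    for day, polarity in tweets:
        by_day[day].append(polarity)

    rows = []
    for day in sorted(by_day):
        target = day + timedelta(days=lag_days)
        if target not in closes:
            continue  # no price available for the lagged day, so skip it
        rows.append({
            "date": day,
            "mean_polarity": mean(by_day[day]),
            "tweet_volume": len(by_day[day]),
            "lagged_close": closes[target],
        })
    return rows


tweets = [(date(2020, 1, 1), 0.5), (date(2020, 1, 1), -0.25), (date(2020, 1, 2), 0.2)]
closes = {date(2020, 1, 2): 7200.0}
print(group_daily(tweets, closes, lag_days=1))
```

An hourly variant would follow the same pattern with datetimes truncated to the hour and a `timedelta(hours=...)` offset.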

models

This folder contains all the models

  1. bilstm_multiclass - Contains the implementation of the BiLSTM model which predicts the magnitude of the next day's closing price change
    • bilstm_multiclass.ipynb - The actual model implementation
    • bilstm_multiclass-tester.ipynb - Script which tests various combinations of hyperparameters to obtain comparable results
    • model.png - Model's figure
  2. bilstm_trend - Contains the implementation of the BiLSTM model which predicts the direction of the next day's closing price
    • bilstm_trend.ipynb - The actual model implementation
    • bilstm_trend-tester.ipynb - Script which tests various combinations of hyperparameters to obtain comparable results
    • model.png - Model's figure
  3. cnn_multiclass - Contains the implementation of the CNN model which predicts the magnitude of the next day's closing price change
    • cnn_multiclass.ipynb - The actual model implementation
    • cnn_multiclass-tester.ipynb - Script which tests various combinations of hyperparameters to obtain comparable results
    • model.png - Model's figure
  4. cnn_trend - Contains the implementation of the CNN model which predicts the direction of the next day's closing price
    • cnn_trend.ipynb - The actual model implementation
    • cnn_trend-tester.ipynb - Script which tests various combinations of hyperparameters to obtain comparable results
    • model.png - Model's figure
  5. lstm_multiclass - Contains the implementation of the LSTM model which predicts the magnitude of the next day's closing price change
    • lstm_multiclass.ipynb - The actual model implementation
    • lstm_multiclass-tester.ipynb - Script which tests various combinations of hyperparameters to obtain comparable results
    • model.png - Model's figure
  6. lstm_trend - Contains the implementation of the LSTM model which predicts the direction of the next day's closing price
    • lstm_trend.ipynb - The actual model implementation
    • lstm_trend-tester.ipynb - Script which tests various combinations of hyperparameters to obtain comparable results
    • model.png - Model's figure
  7. voting_classifier - Contains the implementation of the voting classifier
    • voting_classifier.ipynb - The actual model implementation
    • voting_classifier-tester.ipynb - Script which runs the classifier several times
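
To make the difference between the two prediction targets concrete, the sketch below derives a direction ("trend") label and a magnitude ("multiclass") label from two consecutive closing prices, and combines several model outputs by simple majority. The bin edges, the use of hard majority voting, and all function names are hypothetical; the notebooks may define their classes and voting scheme differently.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bin edges in percent; the project's actual classes may differ.
BIN_EDGES = (-5.0, -1.0, 1.0, 5.0)


def trend_label(close_today, close_next):
    """Return 1 if the next close is higher than today's close, else 0."""
    return int(close_next > close_today)


def magnitude_label(close_today, close_next, edges=BIN_EDGES):
    """Return the index of the bin containing the percentage price change."""
    pct_change = (close_next - close_today) / close_today * 100.0
    return bisect_right(edges, pct_change)


def majority_vote(predictions):
    """Return the most frequent label among the individual model predictions."""
    return Counter(predictions).most_common(1)[0][0]


print(trend_label(100.0, 101.0))
print(magnitude_label(100.0, 103.0))
print(majority_vote([1, 0, 1]))
```

With four edges the magnitude label takes one of five values, from 0 (largest drop) to 4 (largest rise).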

papers

This folder contains all the papers read to implement this study