Predicting_Stock_With_Twitter_Sentiment

You have been watching Tesla stock and are deciding whether to buy shares before close because you think the price will jump tomorrow, but you want to be more certain about your decision. This project aims to help inform that decision. Vader sentiment analysis was applied to tweets to compute a daily sentiment score. From historical stock data, the difference between Tesla's opening price and the prior day's closing price was computed and used as the endogenous variable in an ARIMAX time series model, with daily sentiment as an exogenous variable. The final model was able to predict that Tesla stock would open the next day at a higher price than the current day's closing price with 58.8% precision.



README: Tesla Twitter Sentiment

Collecting the Tweets

TWINT was used to collect the tweets from Twitter. TWINT is an advanced Twitter scraping tool written in Python that lets you search Twitter for tweets matching different search operators, scrape those tweets and save them to a file. More information on TWINT can be found at https://github.com/twintproject/twint.

The search parameters used in this project were as follows:

  • Two search terms: the company name and the company ticker (tesla and tsla)
  • Start date of search (01/01/2018)
  • End date of search (07/15/2019)
  • Tweets in the English language

TWINT_code_to_collect_tweets.png

Each time this function was run, the tweets were stored in a CSV file within a folder called 'tesla_tweets'.

It was found that the shorter the search window, the more tweets were collected. For this reason, each search was run over a two-week period and the results were merged, yielding 213,224 scraped tweets from January 1st, 2018 to July 14th, 2019 containing the words ‘tesla’ and/or ‘tsla’.
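The two-week windowing can be sketched as follows (the function name and exact chunk length are illustrative, not taken from the notebook):

```python
from datetime import date, timedelta

def two_week_windows(start, end, days=14):
    """Yield (since, until) date-string pairs covering [start, end) in short chunks.

    Shorter windows returned more tweets per query, so the full range was
    scraped in two-week slices and the resulting CSV files merged afterwards.
    """
    current = start
    while current < end:
        nxt = min(current + timedelta(days=days), end)
        yield current.isoformat(), nxt.isoformat()
        current = nxt

windows = list(two_week_windows(date(2018, 1, 1), date(2019, 7, 15)))
```

Each (since, until) pair would then be passed to a TWINT search, and the per-window CSV files concatenated into one dataset.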

Cleaning the Twitter Data

Before obtaining the sentiment of each tweet, basic NLP preprocessing steps were performed: HTTP links, special characters and numbers were removed; the tweets were converted to lowercase; and each tweet was tokenized.
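A minimal sketch of these cleaning steps (the notebook's exact regexes and tokenizer may differ; '#' is kept here since hashtags survive into the word clouds below):

```python
import re

def clean_and_tokenize(tweet):
    """Remove links, special characters and numbers, lowercase, then tokenize."""
    tweet = re.sub(r"http\S+", "", tweet)         # strip http/https links
    tweet = re.sub(r"[^A-Za-z#\s]", "", tweet)    # strip numbers and special characters (keep '#')
    return tweet.lower().split()                  # lowercase and split into tokens

clean_and_tokenize("Tesla hits $420! https://t.co/abc #TSLA")
```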

Clean_and_tokenize_tweets.png

Next, each tweet was lemmatized.

Lemmatize_tweets.png

Word_cloud_1.png

When looking at the word cloud after cleaning, tokenizing and lemmatizing, we can see that 'tesla', 'tsla', 'teslaq' and '#' appear very heavily in the dataset. These terms were removed from the tweets, and the new word cloud appears below.

Word_cloud_2.png

The below graph shows the top 25 words included in this dataset, with the first three being "model", "elon" and "musk".

Most_popular_words.png

The below graph shows that the most popular time to tweet about Tesla is between 9:00 am and 5:00 pm with 9:00 am being the most popular time to tweet.

Tweets_by_time_of_day.png

The below graph shows that people tend to tweet more about Tesla on weekdays than weekends, with Wednesday being the most popular day to tweet about Tesla.

Tweets_by_day_of_week.png

Tweets_by_day_of_year.png

Sentiment Analysis on Tweets

Two methods of computing a sentiment score for each tweet were utilized: Vader and TextBlob.
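Once each tweet has a Vader compound score and a TextBlob polarity score, the daily scores are simple per-day averages; a pandas sketch (column names are illustrative):

```python
import pandas as pd

# Per-tweet scores: sentiment_1 = Vader compound, sentiment_2 = TextBlob polarity
tweets = pd.DataFrame({
    "date": ["2018-01-02", "2018-01-02", "2018-01-03"],
    "sentiment_1": [0.6, -0.2, 0.4],
    "sentiment_2": [0.5, 0.0, 0.1],
})

# Collapse per-tweet scores into one daily sentiment score per method
daily = tweets.groupby("date")[["sentiment_1", "sentiment_2"]].mean()
```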

Vader_sentiment_distribution.png

TextBlob_sentiment_distribution.png

Top_days_vader.png

Top_days_textblob.png

Worst_days_vader.png

Worst_days_textblob.png

The goal of this project is to compute a daily sentiment score before the market closes and predict whether the stock will open at a higher price the next day. The stock market is only open from 9:30 am to 4:00 pm EST, and in order to make a model that could be used in real-life situations, only tweets posted before 4:00 pm can be used to compute the sentiment score. To leave enough time to run the model and buy the actual stock on a trading platform like Robinhood, tweets after 3:55 pm were dropped from the dataset.
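The 3:55 pm cutoff can be applied with a simple time filter before the daily averages are computed (a sketch; column names are illustrative):

```python
from datetime import time

import pandas as pd

tweets = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2018-01-02 09:15:00",   # kept
        "2018-01-02 15:54:59",   # kept
        "2018-01-02 15:56:00",   # dropped: after the 3:55 pm cutoff
    ]),
    "sentiment_1": [0.3, -0.1, 0.8],
})

# Keep only tweets that could inform a trade placed before the 4:00 pm close
before_cutoff = tweets[tweets["timestamp"].dt.time <= time(15, 55)]
```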

Feature Engineering

After dropping all tweets published after 3:55 pm, four more sentiment columns were added to the dataset:

  • Dropped all tweets that had a Vader sentiment score of 0 and recalculated the daily average (s1_no_0)
  • Dropped all tweets that had a TextBlob sentiment score of 0 and recalculated the daily average (s2_no_0)
  • Rescaled all the Vader sentiment scores for each tweet with MinMaxScaler() and recalculated the daily average (s1_scaled)
  • Rescaled all the TextBlob sentiment scores for each tweet with MinMaxScaler() and recalculated the daily average (s2_scaled)
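A sketch of the two transformations behind these columns (pure Python for clarity; the notebook presumably used pandas and sklearn's MinMaxScaler):

```python
def daily_mean_no_zeros(scores):
    """Daily average after dropping tweets with a sentiment score of 0."""
    nonzero = [s for s in scores if s != 0]
    return sum(nonzero) / len(nonzero) if nonzero else 0.0

def minmax_rescale(scores):
    """Rescale scores to [0, 1], mirroring sklearn's MinMaxScaler."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]
```

Dropping zero-score tweets keeps neutral tweets from diluting the daily average, while rescaling puts both sentiment methods on a comparable [0, 1] range.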

Modeling

Models were created to investigate two target variables: the difference between the opening price and the prior day's closing price (open_close_diff), and whether that difference would be negative or positive (pos_neg).

The time series for each individual target variable can be seen below:

open_close_diff_timeseries.png

Features of the plot:

  • There appears to be no consistent trend throughout the time span. The mean is very close to zero and the data frequently crosses the mean line instead of staying on one side for long.
  • There does not appear to be any seasonality within this time series.
  • There appear to be a few outliers, most noticeably around October 2018.
  • Variance appears to be constant, despite a few outliers.
  • A Dickey-Fuller Test was used to confirm that this time series is stationary.

pos_neg_timeseries.png

Features of the plot:

  • This graph is the same graph as the previous one, but instead of dollar amounts for the change in open and prior close stock price, only the direction is taken into account (+1 for a positive change, -1 for a negative change and 0 for a neutral change). This makes the graph a bit harder to interpret.
  • Although difficult to determine from this graph, there appears to be no consistent trend throughout the time span.
  • There does not appear to be any seasonality within this time series.
  • Due to the nature of this time series after classifying directions, outliers cannot be determined.
  • Variance appears to be constant.
  • A Dickey-Fuller Test was used to confirm that this time series is stationary.

In order to model each of the two target variables (open_close_diff and pos_neg) against each exogenous sentiment variable, the below function was created.

The first part of the function splits the data into training and testing datasets. Each model is trained on all the data from January 5, 2018 to February 25, 2019.
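A sketch of the date-based split (dates compressed for illustration):

```python
import pandas as pd

daily = pd.DataFrame(
    {"open_close_diff": [0.5, -1.2, 0.3, 2.1]},
    index=pd.to_datetime(["2019-02-24", "2019-02-25", "2019-02-26", "2019-02-27"]),
)

# Train on everything up to and including the split date; test on the rest.
# A chronological split (never random) preserves the time series ordering.
train = daily.loc[:"2019-02-25"]
test = daily.loc["2019-02-26":]
```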

Train_test_split_code.png

Next, in order to find the appropriate model, the best values for p, d and q must be evaluated. This function finds the best p, d and q values by determining which combination produces the smallest Mean Absolute Error while still producing a model with a p-value less than 0.1.

pqd_test.png

Once the final order has been determined, the final model can be fit.

Predictions are then made for the target variable using the selected exogenous variable.

Model_and_predict.png

In order to see how the predictions compare to the actual values, a dataframe is created with the actual and predicted values.

Create_prediction_df.png

When examining the already classified target variable (pos_neg), the prediction values are continuous values from -1 to 1. In order to get classified prediction values, they are reclassified as follows:

  • Prediction value < 0 becomes -1
  • Prediction value = 0 becomes 0
  • Prediction value > 0 becomes 1

After reclassifying the prediction values, the precision for each class is printed, along with the confusion matrix.

When modeling the raw, unclassified target variable (open_close_diff), the predicted values were all very close to 0, even when the exogenous variable had been scaled (s1_scaled & s2_scaled) and when the exogenous variable had all tweets with a sentiment score of 0 removed (s1_no_0 & s2_no_0). In order to get usable predictions, the actual open_close_diff values were classified as follows:

  • Actual value < 0 becomes -1
  • Actual value = 0 becomes 0
  • Actual value > 0 becomes 1

The prediction values were reclassified the same way:

  • Prediction value < 0 becomes -1
  • Prediction value = 0 becomes 0
  • Prediction value > 0 becomes 1

Reclassification_code.png
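Both reclassification rules are the same sign-style mapping; a minimal sketch with an adjustable cutoff:

```python
def classify(value, cutoff=0.0):
    """Map a continuous prediction to -1 / 0 / +1 relative to a cutoff."""
    if value < cutoff:
        return -1
    if value > cutoff:
        return 1
    return 0

[classify(v) for v in (-0.42, 0.0, 0.07)]
```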

Although the model is created using the actual open and closing differences, the results are readable in the same way as the results for the target variable pos_neg.

After reclassifying the actual and predicted values, the precision and confusion matrix for each class can be viewed.

Results

After running the models, it was clear that this was best kept as a classification problem: will the stock open tomorrow at a higher price than today's close?

Using a SARIMAX model, each sentiment feature was used as an exogenous factor to predict the difference between the opening price and the prior day's closing price (open_close_diff) and whether that difference would be positive or negative (pos_neg).

The Best Model

Best Target Feature: open_close_diff

Best Exogenous Feature: sentiment_1

Best Model: ARIMAX(open_close_diff, sentiment_1), order = (3, 1, 4)

Predict that the stock will open tomorrow at a lower price than today’s closing price: Precision: 0.47457627

Predict that the stock will open tomorrow at the same price as today’s closing price: Precision: 0.0

Predict that the stock will open tomorrow at a higher price than today’s closing price: Precision: 0.54054054

Final_confusion_matrix.png

Out of the 37 times the model predicted an increase from one day’s closing price to the following day’s opening price, it was correct 20 times.

If you were to collect tweets containing the words ‘tesla’ and ‘tsla’ with TWINT between 12:00 AM and 3:55 PM, use Vader to compute a daily sentiment score, and feed it into an ARIMAX model, you could predict whether the stock will open at a higher price than the closing price with 54% precision. If you were to buy the stock at roughly the closing price right before the market closes 100 times, you would make a profit about 54 times.

Final Model Adjustments

After reviewing the predictions from all models, it is clear that each model carries some form of bias: it either over-predicts positive values or over-predicts negative values. The final model over-predicts days when the open_close_diff is negative.

Recall that in the original model, the predicted values were classified to [-1, 0, 1] as follows:

  • Prediction value < 0 becomes -1
  • Prediction value = 0 becomes 0
  • Prediction value > 0 becomes 1

To adjust for the over-prediction of negative values, the model was re-run with different classification cutoffs, as seen in the code below:

Code_for_adjusted_cut_off.png

The final classification rule was as follows:

  • Prediction value < -0.3 becomes -1
  • Prediction value = -0.3 becomes 0
  • Prediction value > -0.3 becomes 1
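The shifted rule is the same sign mapping with the threshold moved from 0 to -0.3 (a sketch; the helper name is illustrative):

```python
def classify_adjusted(value, cutoff=-0.3):
    """Map a prediction to -1 / 0 / +1 around the shifted -0.3 cutoff."""
    if value < cutoff:
        return -1
    if value > cutoff:
        return 1
    return 0

# A mildly negative prediction like -0.1 now counts as "up" rather than "down",
# compensating for the model's tendency to over-predict negative values
[classify_adjusted(v) for v in (-0.42, -0.1, 0.07)]
```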

This adjustment produced higher precision values.

Adjusted_final_confusion_matrix.png

Final prediction that the stock will open tomorrow at a higher price than today’s closing price: Precision: 0.58823529