Stock Trend Prediction and Business Strategy Design

Stock markets and their trends are volatile by nature. Stock price prediction has attracted researchers who try to capture this volatility and predict trend behavior over the next time window of interest. Investors and market analysts study market behavior and design their trading strategies, e.g., when to buy or sell, accordingly. Because the stock market generates a large amount of data every day, it is infeasible for an individual to consider all current and past data when predicting the future trend of a stock. There are two main categories of methods for forecasting market trends: technical analysis and fundamental analysis. Technical analysis uses past price and volume to predict the future trend of a stock. Fundamental analysis, on the other hand, analyzes related financial data to gain insight. This project follows the fundamental analysis approach by scraping financial news articles from websites and scoring each article with the TextRank algorithm. If a news article receives a high score, the stock price is more likely to go up. Conversely, if it receives a low score, the stock price may go down. We use 2016 stock prices for Apple Inc. and news articles from the following five well-known financial websites: CNN Money, Business Week, Market Watch, Investor Guide, and Yahoo Finance. The accuracy of the prediction model improved by almost 50 percentage points, from 33% to 80%.

If a company's stock price is predicted to decline, its salesmen need to promote its product or service. Because marketing information changes very quickly, we use the Tweepy API to get up-to-date marketing information by geography, given product- or service-related keywords and a geographical region of interest. The region is described by its center (latitude and longitude) and its radius. We apply a Deep Q-Network (DQN) to help salesmen design a promotion route when the stock performance is bad. We chose DQN because it handles both the dynamic changes in marketing information from tweets and geographical scalability well.

System Architecture for Predicting Stock Trend from Financial News with TextRank Algorithm

The figure shows our system structure for stock trend prediction. The first step is to prepare a separate web scraper for each financial news source. Each website has its own page structure, and we only need to extract the information of interest, e.g., news about Apple Inc., so each scraper must be customized for its site. All scraped news is stored in HDFS on AWS. We then apply the TextRank algorithm to extract keywords. We also created a polarity word dictionary of finance-specific words with positive and negative meanings, based on McDonald’s word lists [1]. This dictionary contains about 2400 positive words and 7400 negative words. Each article's score is the difference between the total TextRank importance weight of its positive financial words and the total TextRank importance weight of its negative financial words. These news scores are the features used to build our prediction model. The target variable is a stock price performance indicator: if the price difference is larger than 1%, we label it ‘UP’; if it is smaller than -1%, we label it ‘DOWN’; anything else is labeled ‘STAY’. Two classification algorithms, k-nearest neighbors (KNN, with one neighbor) and naive Bayes (NB), are implemented to check and improve classification accuracy.
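The scoring and labeling steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the tiny `POSITIVE` and `NEGATIVE` sets stand in for the full polarity dictionary, the co-occurrence window, damping factor, and iteration count are assumed values, and `pct_change` is assumed to be the price difference expressed in percent.

```python
import re
from collections import defaultdict

# Hypothetical miniature polarity lists; the real dictionary [1] is far larger.
POSITIVE = {"gain", "strong", "profit"}
NEGATIVE = {"loss", "weak", "decline"}


def textrank(words, window=2, damping=0.85, iterations=50):
    """Score words with PageRank over an undirected co-occurrence graph."""
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        # Link each word to the next `window` words in the text.
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return scores


def article_score(text):
    """Positive TextRank weight minus negative TextRank weight."""
    words = re.findall(r"[a-z]+", text.lower())
    weights = textrank(words)
    positive = sum(w for word, w in weights.items() if word in POSITIVE)
    negative = sum(w for word, w in weights.items() if word in NEGATIVE)
    return positive - negative


def label_price_change(pct_change):
    """Map a percentage price difference to the UP / DOWN / STAY target."""
    if pct_change > 1.0:
        return "UP"
    if pct_change < -1.0:
        return "DOWN"
    return "STAY"


if __name__ == "__main__":
    print(article_score("Strong iPhone sales drive profit gain despite weak tablet demand"))
    print(label_price_change(1.8))
```

Each article yields one score, so a day's feature vector could hold one score per news source, with the label computed from that day's price movement.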

Stock Trend Prediction Accuracy Comparison by News Sources

This figure shows the stock prediction accuracy for each of the five selected financial news websites. We also report the accuracy when all news sources are used together as predictor variables. Gaussian NB achieves higher accuracy than KNN. Based on the news collected about Apple Inc. in 2016, Investor Guide gives the highest prediction accuracy for the Apple stock trend, while Yahoo Finance gives the lowest.
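To make the two classifiers being compared concrete, here is a small pure-Python sketch of a one-neighbor KNN and a Gaussian naive Bayes classifier, plus an accuracy helper. It is an illustration under assumptions rather than the project's implementation: the function names are hypothetical, the variance floor is an assumed smoothing value, and the toy scores in the usage example are made up and do not come from the 2016 data.

```python
import math
from collections import defaultdict


def predict_1nn(train_X, train_y, x):
    """Return the label of the single closest training point."""
    nearest = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return train_y[nearest]


def fit_gaussian_nb(train_X, train_y, var_floor=1e-6):
    """Estimate a log prior and per-feature mean/variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(train_X, train_y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor  # floor avoids zero variance
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(train_X)), means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Pick the class with the highest log posterior under independent Gaussians."""

    def log_posterior(label):
        log_prior, means, variances = model[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (xi - m) ** 2 / (2 * var)
            for xi, m, var in zip(x, means, variances)
        )

    return max(model, key=log_posterior)


def accuracy(predict, X, y):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(predict(x) == target for x, target in zip(X, y)) / len(y)


if __name__ == "__main__":
    # Made-up article scores (one feature per day) and trend labels.
    X = [[0.9], [0.7], [0.1], [-0.1], [-0.8], [-0.6]]
    y = ["UP", "UP", "STAY", "STAY", "DOWN", "DOWN"]
    nb = fit_gaussian_nb(X, y)
    print(accuracy(lambda x: predict_1nn(X, y, x), X, y))
    print(accuracy(lambda x: predict_gaussian_nb(nb, x), X, y))
```

In practice the accuracy should be measured on held-out days rather than on the training data used here for brevity.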

System Architecture for Promotion Route Design from Tweepy Marketing Information

We use the Tweepy API to retrieve tweets for each region by specifying a search term and a time period, and we assign each region a score. The score is the difference between the number of positive and the number of negative financial words, taken from the polarity word dictionary, found in the region's tweets. By setting thresholds for good and bad responses, we color a region green if its score is higher than the good-response threshold and red if its score is lower than the bad-response threshold.

Deep Q-Networks (DQN) are much more capable than the traditional Q-networks used in reinforcement learning, thanks to the following improvements: (1) going from a single-layer network to a multi-layer convolutional network; (2) implementing experience replay, which allows the network to train itself using stored memories of its experience; (3) using a second “target” network for a more stable learning process. These three innovations allowed the Google DeepMind team to achieve superhuman performance on dozens of Atari games with their DQN agent.

We model the salesman's promotion strategy as a route selection game in which the salesman agent is a blue square. The goal for the salesman is to navigate to the green squares (positive-response areas) while avoiding the red squares (negative-response areas). At the start of each episode, all squares are randomly placed within an n×n grid world. The salesman has a pre-specified maximum number of steps, e.g., 50, in which to collect as large a reward as possible. Because tweet responses are positioned dynamically, the salesman has to do more than simply learn a fixed path, a much simpler solution that Q-network reinforcement learning can already provide. Instead, the salesman must learn the spatial relationships between the blocks.
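The region coloring and the route selection game can be sketched as a small environment that a DQN agent could be trained against. This is a hedged illustration only: the class and function names, the reward values of +1 and -1, the four-direction action set, and the choice to remove a square once it is visited are all assumptions, and no network or training loop is shown.

```python
import random

GREEN_REWARD, RED_PENALTY = 1.0, -1.0  # assumed reward values
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right


def region_score(tweets, positive_words, negative_words):
    """Count of positive words minus count of negative words across tweets."""
    score = 0
    for tweet in tweets:
        for word in tweet.lower().split():
            score += (word in positive_words) - (word in negative_words)
    return score


def region_color(score, good_threshold, bad_threshold):
    """Green above the good threshold, red below the bad one, else neutral."""
    if score > good_threshold:
        return "green"
    if score < bad_threshold:
        return "red"
    return "black"


class PromotionGrid:
    """n x n grid world with one agent, green goals, and red hazards."""

    def __init__(self, n=5, n_green=3, n_red=2, max_steps=50, seed=None):
        self.n, self.n_green, self.n_red = n, n_green, n_red
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Randomly place the agent and all squares on distinct cells."""
        cells = [(r, c) for r in range(self.n) for c in range(self.n)]
        picked = self.rng.sample(cells, 1 + self.n_green + self.n_red)
        self.agent = picked[0]
        self.green = set(picked[1 : 1 + self.n_green])
        self.red = set(picked[1 + self.n_green :])
        self.steps = 0
        return self.agent

    def step(self, action):
        """Move one cell (clamped to the grid) and return (state, reward, done)."""
        dr, dc = MOVES[action]
        row = min(max(self.agent[0] + dr, 0), self.n - 1)
        col = min(max(self.agent[1] + dc, 0), self.n - 1)
        self.agent = (row, col)
        self.steps += 1
        reward = 0.0
        if self.agent in self.green:
            reward = GREEN_REWARD
            self.green.discard(self.agent)  # simplification: a visited square disappears
        elif self.agent in self.red:
            reward = RED_PENALTY
            self.red.discard(self.agent)
        return self.agent, reward, self.steps >= self.max_steps


if __name__ == "__main__":
    env = PromotionGrid(n=5, seed=1)
    total, done = 0.0, False
    while not done:
        _, reward, done = env.step(env.rng.randrange(4))  # random policy as a baseline
        total += reward
    print(total)
```

Because `reset` reshuffles every square, a policy that memorizes one path does not transfer between episodes, which is the property that motivates learning from the spatial layout instead.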

Tweet marketing information for each region near the Bay Area (latitude = 37.33, longitude = -121.89).

This figure shows marketing information from the Tweepy API by area. Each sub-square is a region of 1.4 × 1.4 miles (about 1.96 square miles). Green regions are areas with a positive tweet response, red regions are areas with a negative tweet response, and black regions are areas with a neutral tweet response, all for the query "Apple Inc." between 2017-01-20 and 2017-01-23. The goal for a salesman is to move from the blue sub-square to the green ones while avoiding the red ones.
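One way to lay out such a grid of regions around a center point is sketched below. It is an assumption-laden illustration: the helper names are hypothetical, the conversion of about 69 miles per degree of latitude is an approximation, each sub-square is assumed to be 1.4 miles on a side, and each region is approximated by a circle of radius 0.7 miles written in the `latitude,longitude,radius` form that Twitter's search geocode parameter accepts. No Tweepy call is made here.

```python
import math

MILES_PER_DEGREE_LAT = 69.0  # rough approximation


def region_centers(center_lat, center_lon, n=5, cell_miles=1.4):
    """Return an n x n grid (rows run north to south) of (lat, lon) cell centers."""
    dlat = cell_miles / MILES_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    dlon = cell_miles / (MILES_PER_DEGREE_LAT * math.cos(math.radians(center_lat)))
    offset = (n - 1) / 2
    return [
        [
            (center_lat + (offset - row) * dlat, center_lon + (col - offset) * dlon)
            for col in range(n)
        ]
        for row in range(n)
    ]


def geocode(lat, lon, radius_miles):
    """Format a region as a 'lat,lon,radius' search string in miles."""
    return f"{lat:.5f},{lon:.5f},{radius_miles}mi"


if __name__ == "__main__":
    grid = region_centers(37.33, -121.89, n=5)
    for row in grid:
        print([geocode(lat, lon, 0.7) for lat, lon in row])
```

Each geocode string could then be passed, together with the search term and date range, to whatever tweet search call is in use, and the returned tweets scored per region.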

Success Rate Comparison by City

We compare the salesmen's success rates across five U.S. cities by running DQN. The differences are due to the number of bad-response regions and the maximum number of steps allowed to navigate to the good-response regions (green areas).

[1] http://www3.nd.edu/~mcdonald/Word_Lists.html