Creating alternative datasets for finance with Python — scraping, text-mining and sentiment analysis
The fastest-growing category of data is unstructured data such as text and images. In finance, many practitioners still rely almost exclusively on traditional, numeric time series of prices and fundamentals. How can we access these growing sources of untapped, alternative data? And how do we make sense of millions of text documents that no human could read in any reasonable amount of time?
This article demonstrates how to source and analyse such data. As an example, we scrape freely accessible news articles about the oil company Royal Dutch Shell. A convenient source is the Singapore newspaper The Business Times: it has no paywall and its archive goes back several years.
Once we have sourced and stored the articles, we need to turn this unstructured data into a (numeric) format that algorithms can process. That is what a natural language processing (NLP) pipeline does, and the Python ecosystem offers many open-source tools that help us do exactly that. After the data is pre-processed and cleaned, we can perform all kinds of analysis, including sentiment scoring. A short sketch of these steps follows the overview below.
- The tool box: Python data science and text-mining packages
- The data: Scraping news articles from the Business Times website
- Building a Natural Language Processing (NLP) pipeline with Spacy
- Text pre-processing — tokenisation, lemmatisation, stop-word removal
- Named-entity recognition
- Sentiment algorithms: one proprietary, and two out-of-the-box (NLTK, TextBlob)
- Visualising and comparing sentiment scores
- Create manual labels to check sentiment score accuracy
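As a taster of the pre-processing and named-entity steps listed above, here is a minimal sketch using spaCy. It assumes the en_core_web_lg model from the prerequisites below is installed; the sample sentence and variable names are purely illustrative.

```python
import spacy

# load the large English model (see Prerequisites)
spacy_nlp = spacy.load("en_core_web_lg")

# an illustrative sentence standing in for a scraped article
text = "Royal Dutch Shell reported higher quarterly profits as oil prices recovered."
doc = spacy_nlp(text)

# tokenisation, lemmatisation and stop-word removal in one pass
tokens = [token.lemma_.lower() for token in doc
          if not token.is_stop and not token.is_punct]
print(tokens)

# named-entity recognition
print([(ent.text, ent.label_) for ent in doc.ents])
```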
A more detailed write-up is available in the accompanying Medium article.
Getting Started
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
Prerequisites
```
Python==3.6
BeautifulSoup4
Pandas
Matplotlib
Seaborn
TextBlob
NLTK
spaCy==2.1
```

spaCy language model:

```bash
python -m spacy download en_core_web_lg
```
Installing
Install the dependencies from the prerequisite list and then run the notebook. Either a virtual environment (virtualenv) or a package manager (e.g. pipenv) can be used.
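For example, one possible setup with a virtual environment looks like the following. The pinned spaCy version matches the prerequisites above; the other packages are installed unpinned, and jupyter is added here only so the notebook can be launched.

```bash
python3.6 -m venv venv
source venv/bin/activate
pip install jupyter beautifulsoup4 pandas matplotlib seaborn textblob nltk spacy==2.1.0
python -m spacy download en_core_web_lg
jupyter notebook
```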
Example: The scraper can be run independently of the rest of the script:
```python
import time
from datetime import datetime, timedelta
from urllib.request import urlopen

from bs4 import BeautifulSoup


def scraper(keyword):
    """
    Takes a search term for Business Times (Singapore) news articles,
    runs the scraper through the first eight pages of the search archive and
    returns a dictionary mapping each date to a list of article texts.
    """
    article_text = {}
    counter = 1
    for i in range(1, 9):
        page = urlopen('https://www.businesstimes.com.sg/search/' + keyword + '?page=' + str(i)).read()
        soup = BeautifulSoup(page, features="html.parser")
        posts = soup.findAll("div", {"class": "media-body"})
        for post in posts:
            time.sleep(2)  # pause between requests to avoid hammering the site
            url = post.a['href']
            date = post.time.text
            print("Article:", counter, "|", date, "URL:", url[8:65] + "...")
            counter += 1
            try:
                link_page = urlopen(url).read()
            except Exception:
                # some links fail as returned; retry with the last two characters stripped
                url = url[:-2]
                link_page = urlopen(url).read()
            link_soup = BeautifulSoup(link_page, features="html.parser")
            sentences = link_soup.findAll("p")
            passage = ""
            for sentence in sentences:
                passage += sentence.text
            article_text.setdefault(date, []).append(passage)
    # re-key by a date object, shifted forward by one day
    articles = {}
    for k, v in article_text.items():
        articles[datetime.strptime(k, '%d %b %Y').date() + timedelta(days=1)] = v
    return articles


articles = scraper("shell")
```
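With the articles in hand, they can be scored. The proprietary algorithm is not reproduced here, but a rough sketch of the two out-of-the-box scorers (NLTK's VADER and TextBlob) might look like this. The column names mirror those used in the plots below; scoring the raw article text rather than the spaCy-cleaned text is a simplification.

```python
import nltk
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob

nltk.download('vader_lexicon')  # one-off download of the VADER lexicon
vader = SentimentIntensityAnalyzer()

rows = []
for date, texts in articles.items():  # `articles` comes from the scraper above
    for text in texts:
        rows.append({
            "date": date,
            "NLTK_spacy": vader.polarity_scores(text)["compound"],  # -1 (negative) to +1 (positive)
            "TextBlob_spacy": TextBlob(text).sentiment.polarity,    # -1 to +1
        })

df = pd.DataFrame(rows)
print(df.head())
```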
Visualisation examples
Visualise named-entities in your text:

```python
# assumes `from spacy import displacy`, spacy_nlp = spacy.load("en_core_web_lg") and `sentences` from a scraped article
displacy.render(spacy_nlp(str(sentences[:13])), jupyter=True, style='ent')
```
Visualise and compare sentiment distributions:
```python
import matplotlib.pyplot as plt
import seaborn as sns

# `df` holds one row per article with a score column per sentiment algorithm
sns.set(style="white", palette="muted", color_codes=True)
plt.figure(figsize=(12, 7))
sns.kdeplot(df.NLTK_spacy)
sns.kdeplot(df.Proprietary_spacy)
sns.kdeplot(df.TextBlob_spacy)
plt.title("Sentiment algorithms - probability density functions", fontsize=19)
sns.despine(left=True)
```
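Finally, to check sentiment score accuracy against manual labels (the last item in the overview above), one simple approach is to compare the sign of each algorithm's score with a hand-assigned label. The labels below are placeholders; `df` is the score DataFrame from the sketch further up.

```python
import numpy as np
import pandas as pd

# hand-assigned labels for a small sample of articles: 1 = positive, -1 = negative
manual_labels = pd.Series([1, -1, 1, 1, -1])
sample = df.head(len(manual_labels))

# fraction of articles where the sign of the score agrees with the manual label
for column in ["NLTK_spacy", "TextBlob_spacy"]:
    agreement = (np.sign(sample[column]).values == manual_labels.values).mean()
    print(f"{column}: {agreement:.0%} agreement with manual labels")
```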
Authors
- Marcel Dietsch (GitHub: https://github.com/marceld, Twitter: https://twitter.com/MarcelDietsch)
License
This project is licensed under the MIT License