Lightweight web scrapers and crawlers for various financial news sources. Disclaimer: Developed for educational purposes only.
Dependencies: python3, Scrapy, Twisted
- List the stock tickers of interest in a file, one ticker per line, i.e.
- Execute
python3 -i stock.txt
- Output is written to the current directory in files named '{news_source_name}_{stock_ticker}.jl'
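Scrapy's JSON Lines feed format uses the `.jl` extension, so the output files are presumably one JSON object per line and can be read with the standard library alone. A minimal sketch under that assumption; the helper names `output_path` and `load_articles`, and any record keys such as `headline` or `link`, are illustrative and not confirmed by this project:

```python
import json
from pathlib import Path


def output_path(news_source_name, stock_ticker):
    """Build the output file name following the pattern described above."""
    return Path(f"{news_source_name}_{stock_ticker}.jl")


def load_articles(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    articles = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                articles.append(json.loads(line))
    return articles
```

For example, `load_articles(output_path("reuters", "AAPL"))` would return a list of dictionaries, one per scraped article, if such a file exists in the current directory.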
- Wall Street Journal (HOLD: Needs subscription to view articles...)
- MarketWatch (WIP: handle crawling of the infinite-scrolling article list, check out
- Article extraction from MarketWatch is fully working
- Bloomberg (Supported)
- Reuters (Supported)
- MSNBC (Supported)
- TheStreet (Not supported)
- MarketRealist (HOLD: paywall)
- SeekingAlpha (Supported)
- Fool (Not supported)
- Investopedia (Not supported)
Basic scraping of the headlines, links, and text of current related news articles
Examples of scraped data in
Centralized script:
to simplify execution and pipelining -
Crawls all MarketWatch links and scrapes their articles
Supports scraping of multiple stock ticker symbols
Added dynamic parsing based on source news website
Added support for Reuters articles
WSJ support on hold, articles need a subscription
- Added support for MSNBC
- Added support for SeekingAlpha
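Parsing that varies by source website can be organized as a lookup from domain to parser function. The sketch below is one possible shape, not the project's actual code: the parser functions are hypothetical placeholders, and the domain names are assumptions chosen for illustration.

```python
from urllib.parse import urlparse


def parse_reuters(html):
    # Hypothetical placeholder: a real parser would extract headline and text.
    return {"source": "reuters", "length": len(html)}


def parse_marketwatch(html):
    # Hypothetical placeholder, same contract as parse_reuters.
    return {"source": "marketwatch", "length": len(html)}


# Map a site domain to the parser that understands that site's layout.
PARSERS = {
    "reuters.com": parse_reuters,
    "marketwatch.com": parse_marketwatch,
}


def select_parser(url):
    """Return the parser registered for the URL's host, or None if unsupported."""
    host = (urlparse(url).hostname or "").lower()
    for domain, parser in PARSERS.items():
        # Match the bare domain or any subdomain such as www.
        if host == domain or host.endswith("." + domain):
            return parser
    return None
```

Supporting another site then only needs a new parser function and one more entry in `PARSERS`.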
+ Develop web crawlers to curate article information from current links
+ Create API for scraping specific companies by stock ticker labels
+ More dynamic crawlers that can extract from different news sites
- Add parsing support for more market news sites
- Add date tags to .jl data files
- Add cron job to periodically scrape at some `time`
- Method to eliminate duplicate articles
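Two of the items above, date tags and duplicate removal, could be approached as follows. This is only a sketch of one option, assuming each scraped record is a dictionary with a `link` key that identifies the article; the key name and both helper functions are hypothetical.

```python
from datetime import date


def dedupe_by_link(records):
    """Keep the first record seen for each link, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        link = record.get("link")
        if link is None:
            # Without a link there is nothing to compare on, so keep the record.
            unique.append(record)
        elif link not in seen:
            seen.add(link)
            unique.append(record)
    return unique


def dated_filename(news_source_name, stock_ticker, day):
    """Append an ISO date tag to the existing output file name pattern."""
    return f"{news_source_name}_{stock_ticker}_{day.isoformat()}.jl"
```

Deduplicating on the link is cheap but misses the same story republished under a different URL; comparing normalized headlines would be a stricter alternative.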