This is a Twitter crawler that collects football (soccer) related tweets, which can then be used to train machine learning models for intent detection and entity recognition. It is based on Tweepy, a Python library for the Twitter API that supports streaming.
We crawl tweets that contain football-related keywords for these countries:
Note that the keyword dictionaries are not very accurate, but this should not make a big difference because we mix them together afterwards anyway.
The crawler starts a streamer that listens to the Twitter API and collects any published tweet that contains at least one of the configured keywords.
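The keyword matching described above can be illustrated with a minimal, standard-library-only sketch. This is an assumption-laden simplification, not the crawler's actual code: with Tweepy the matching is normally done on Twitter's side, and the function name `matches_any_keyword` and the sample keywords are hypothetical.

```python
def matches_any_keyword(text, keywords):
    """Return True if the text contains at least one keyword (case-insensitive).

    Simplified local stand-in for the server-side matching; plain substring
    search only, which is an assumption of this sketch.
    """
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


# Hypothetical keywords, for illustration only.
sample_keywords = ["premier league", "la liga"]
print(matches_any_keyword("Big night in the Premier League!", sample_keywords))
print(matches_any_keyword("Nice weather today", sample_keywords))
```

A helper like this can also be handy for re-checking saved tweets offline, for example after changing the keyword lists.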
Run the following command to start crawling. The crawled stream is saved to tweets_features/tweets.csv:
$ python3 crawler_entrypoint/crawler.py
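As a rough illustration of how a stream of tweets could be appended to a CSV file such as tweets_features/tweets.csv, here is a small sketch using only the standard library. The function name `append_tweet` and the two columns (`id`, `text`) are assumptions made for this example; the real file may use a different layout.

```python
import csv
import os


def append_tweet(csv_path, tweet_id, text, columns=("id", "text")):
    """Append one tweet as a CSV row; write the header if the file is new or empty.

    The column names are hypothetical and only serve this sketch.
    """
    is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    folder = os.path.dirname(csv_path)
    if folder:
        # Create the output folder on first use.
        os.makedirs(folder, exist_ok=True)
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(columns)
        writer.writerow([tweet_id, text])
```

Opening the file in append mode for each tweet keeps already-saved rows safe if the stream is interrupted, at the cost of some extra file I/O.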
You can add or remove keywords by editing the dictionary provided in crawler_entrypoint/crawl.py. To add keywords, add a new field to that dictionary.
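To make the keyword dictionary idea concrete, the sketch below shows one possible shape for it and how the per-country lists could be merged into a single list of terms. Everything here is hypothetical: the name `KEYWORDS_BY_COUNTRY`, its keys and values, and the helper `flatten_keywords` are placeholders, not the project's real entries. In Tweepy, a flat list like this is typically what is passed as the `track` argument of the stream's `filter` method, although whether this crawler does exactly that is an assumption.

```python
# Hypothetical dictionary: keys and keywords are placeholders only.
KEYWORDS_BY_COUNTRY = {
    "country_a": ["league a", "club a"],
    "country_b": ["league b", "club a"],
}


def flatten_keywords(keywords_by_country):
    """Merge all per-country lists into one de-duplicated, lowercased list."""
    seen = set()
    merged = []
    for keywords in keywords_by_country.values():
        for keyword in keywords:
            normalized = keyword.strip().lower()
            # Skip empty strings and keywords already collected.
            if normalized and normalized not in seen:
                seen.add(normalized)
                merged.append(normalized)
    return merged


track_terms = flatten_keywords(KEYWORDS_BY_COUNTRY)
print(track_terms)
```

With a layout like this, adding a keyword means adding a string to an existing list, and adding a country means adding a new key with its own list.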
For further improvements to the crawler, these tutorials may help:
- Twitter API with Python: Part 1 -- Streaming Live Tweets: https://www.youtube.com/watch?v=wlnx-7cm4Gg
- Twitter API with Python: Part 2 -- Cursor and Pagination: https://www.youtube.com/watch?v=rhBZqEWsZU4
- Twitter app used: https://apps.twitter.com/app/15130100/keys