Simple Twitter crawler written in Python
Installation
- Download the project
git clone https://github.com/twbrandon7/python-twitter-crawler.git
cd python-twitter-crawler
- Install virtualenv
pip3 install virtualenv
virtualenv .env
- Activate virtual environment
source ./.env/bin/activate # for Linux and macOS
.\.env\Scripts\activate.ps1 # for Windows PowerShell
- Install dependencies
pip3 install -r requirements.txt
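To confirm that the steps above ran inside the virtual environment and not the system interpreter, a small check like the following may help. It is only a sketch and not part of the project; the function name `in_virtualenv` is made up for illustration.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter appears to run inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix inside an environment.
    if hasattr(sys, "real_prefix"):
        return True
    # venv and current virtualenv releases point sys.prefix at the environment
    # while sys.base_prefix keeps pointing at the base installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```

If the first line prints False after activation, the activate script was probably run in a different shell session.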
Usage
The crawler is run from the command line. The available options are as follows:
$ python crawler.py -h
usage: crawler.py [-h] [-f FOLDER] [-mr MAX_RESULT] [-thd MAX_THREAD]
[-sd SLEEP_DURATION] [-mt MAX_TIMELINES]
[-tl TIMELINE_LENGTH] [-trd TOKEN_REFRESH_DURATION]
keyword
positional arguments:
keyword The keyword to search
optional arguments:
-h, --help show this help message and exit
-f FOLDER, --folder FOLDER
The folder to save the tweets. The default folder is
'./data'
-mr MAX_RESULT, --max_result MAX_RESULT
The maximum amount of tweets to download. Default is
10. Set -1 for unlimiting.
-thd MAX_THREAD, --max_thread MAX_THREAD
The maximum amount of threads to run. Default is 1.
-sd SLEEP_DURATION, --sleep_duration SLEEP_DURATION
The time between each request for each thread.
-mt MAX_TIMELINES, --max_timelines MAX_TIMELINES
The maximum amount of timelines (responses) to
download for each tweet. Set -1 for unlimiting.
Default is -1.
-tl TIMELINE_LENGTH, --timeline_length TIMELINE_LENGTH
The maximum length of a timeline to download. Set -1
for unlimiting. Default is -1.
-trd TOKEN_REFRESH_DURATION, --token_refresh_duration TOKEN_REFRESH_DURATION
The time duration in seconds to refresh the access
tokens. Default is 300 seconds.
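For readers who want to see how the options above fit together, here is a minimal argparse sketch reconstructed only from the help text. It is not the project's actual parser. The argument types and the missing default for `--sleep_duration` are assumptions, since the help output does not state them.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="crawler.py")
    parser.add_argument("keyword", help="The keyword to search")
    parser.add_argument("-f", "--folder", default="./data",
                        help="The folder to save the tweets")
    parser.add_argument("-mr", "--max_result", type=int, default=10,
                        help="Maximum number of tweets; -1 for no limit")
    parser.add_argument("-thd", "--max_thread", type=int, default=1,
                        help="Maximum number of threads")
    # Assumption: the help text gives no default or type for this option.
    parser.add_argument("-sd", "--sleep_duration", type=float, default=None,
                        help="Time between requests for each thread")
    parser.add_argument("-mt", "--max_timelines", type=int, default=-1,
                        help="Maximum timelines per tweet; -1 for no limit")
    parser.add_argument("-tl", "--timeline_length", type=int, default=-1,
                        help="Maximum timeline length; -1 for no limit")
    parser.add_argument("-trd", "--token_refresh_duration", type=int, default=300,
                        help="Seconds between access token refreshes")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["spacex", "-mr", "5", "-mt", "5"])
    print(args)
```

Options that are not passed keep the documented defaults, so the example call above leaves `max_thread` at 1 and `folder` at './data'.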
Examples
python crawler.py "spacex" -mr 5 -mt 5
python crawler.py "spacex" -mr 5 -mt 5 -thd 3
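The same commands can also be launched from another Python script. The sketch below assumes it runs from the project directory, where crawler.py lives. The helper names `build_command` and `run_crawler` are hypothetical and not part of the project.

```python
import subprocess
import sys


def build_command(keyword, max_result=None, max_timelines=None, max_thread=None):
    """Build the argument list for crawler.py using the documented flags."""
    cmd = [sys.executable, "crawler.py", keyword]
    options = (("-mr", max_result), ("-mt", max_timelines), ("-thd", max_thread))
    for flag, value in options:
        # Skip unset options so the crawler falls back to its own defaults.
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


def run_crawler(keyword, **options):
    """Run the crawler as a child process and raise if it exits with an error."""
    return subprocess.run(build_command(keyword, **options), check=True)


if __name__ == "__main__":
    # Only print the command here; call run_crawler(...) to actually start it.
    print(build_command("spacex", max_result=5, max_timelines=5, max_thread=3))
```

Passing the arguments as a list avoids shell quoting problems with keywords that contain spaces.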
Development
Contributions are welcome. To contribute, please open a pull request. Several items are still open:
TODO
- Add time information to the downloaded tweet.
- Add more storage classes. (See tweet_crawler.storages.)
Unit tests
python -m unittest tests.unit.test_tweet_fetcher
python -m unittest tests.unit.test_Tweet_object
python -m unittest tests.unit.test_TwitterSearch_object
python -m unittest tests.unit.storages.test_json_storage
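The four commands above can be combined into one run with the standard unittest loader. This is a sketch and not a script shipped with the project. It assumes it is executed from the project root so that the `tests` package is importable, and the name `run_all` is made up for illustration.

```python
import unittest

# Module names copied from the commands above.
TEST_MODULES = (
    "tests.unit.test_tweet_fetcher",
    "tests.unit.test_Tweet_object",
    "tests.unit.test_TwitterSearch_object",
    "tests.unit.storages.test_json_storage",
)


def run_all(modules=TEST_MODULES) -> bool:
    """Load the given test modules by name, run them, and report overall success."""
    suite = unittest.defaultTestLoader.loadTestsFromNames(modules)
    result = unittest.TextTestRunner(verbosity=2).run(suite)
    return result.wasSuccessful()
```

Calling `run_all()` returns True only when every loaded test passes. `python -m unittest` also accepts several module names in one command, which gives the same result without a script.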