This script downloads tweet dumps from Archive.org (https://archive.org/details/twitterstream) and inserts all of them into MongoDB.
Steps:
- Get data from https://archive.org/details/twitterstream for desired year/month/dates
- Extract all the tars
- Extract the tweet jsons
- Insert into mongo
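The "extract the tweet jsons" step above can be sketched as below. This is a sketch, not the actual main.py: it assumes each daily tar from the twitterstream collection contains bz2-compressed files with one JSON object per line, and that non-tweet lines (e.g. delete notices) carry no top-level "id" field; `iter_tweets` is a hypothetical helper name.

```python
import bz2
import json
import tarfile

def iter_tweets(tar_path):
    """Yield tweet dicts from one downloaded twitterstream tar.

    Assumes (per the archive's layout) that members are *.json.bz2
    files containing one JSON object per line; lines without an "id"
    field (delete notices etc.) are skipped.
    """
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if not member.name.endswith(".json.bz2"):
                continue
            raw = tar.extractfile(member)
            if raw is None:  # directories and other non-file members
                continue
            for line in bz2.open(raw):
                line = line.strip()
                if not line:
                    continue
                obj = json.loads(line)
                if "id" in obj:  # keep only actual tweets
                    yield obj
```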
Run:
- Set the following parameters before running the main script:
  1. year
  2. month
  3. from_date
  4. to_date
  5. mongo_db
  6. mongo_collection
- Install the requirements
- Create a unique index on tweet id (the dumps from archive.org contain a lot of duplicate data):
db.tweets.createIndex({"id":1}, {unique:true})
- Run
python main.py
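Because the unique index rejects repeated tweet ids, inserts should tolerate duplicate-key errors rather than abort. One way to cut down on rejected writes is to deduplicate each batch in-process before inserting; the sketch below shows this, with the pymongo call itself left as a hedged comment since it needs a live server. `dedupe_by_id` is a hypothetical helper, not part of main.py.

```python
def dedupe_by_id(tweets, seen=None):
    """Drop tweets whose "id" was already seen during this run.

    The unique index still catches duplicates across runs; this only
    avoids sending writes that mongo would reject anyway.
    """
    seen = set() if seen is None else seen
    out = []
    for t in tweets:
        if t["id"] not in seen:
            seen.add(t["id"])
            out.append(t)
    return out

# With pymongo installed, a duplicate-tolerant batch insert might look like
# (ordered=False lets the non-duplicate documents through even when some
# writes are rejected by the unique index):
#
#   from pymongo import MongoClient
#   from pymongo.errors import BulkWriteError
#   coll = MongoClient()[mongo_db][mongo_collection]
#   try:
#       coll.insert_many(dedupe_by_id(batch, seen), ordered=False)
#   except BulkWriteError:
#       pass  # duplicates from earlier runs; safe to ignore
```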
Note:
- The steps above run sequentially: the script will not start inserting any file until all the tweet dumps for the given dates have been downloaded.
- Downloading a month's data, or even a few days', takes a long time. Try the script on a single day's data first to gain some confidence before running it for a full month.
- Make sure you have enough storage left; don't run it on your MacBook Air :).
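Since the downloads occupy a lot of disk, it can help to fail fast when space is short. A minimal standard-library sketch (the helper name and threshold are illustrative, not part of main.py):

```python
import shutil

def enough_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least
    `required_gb` gibibytes free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024 ** 3

# Example: bail out before downloading if less than, say, 100 GB is free.
# if not enough_space(download_dir, 100):
#     raise SystemExit("Not enough disk space for the tweet dumps")
```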