get_twitter_dump: A Python repository from mannuscript

This script gets tweet dumps from Archive.org (https://archive.org/details/twitterstream) and then dumps all of them into mongo.

Steps:

Get data from https://archive.org/details/twitterstream for the desired year/month/dates
Extract all the tars
Extract the tweet jsons
Insert into mongo
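
The extraction steps above can be sketched as follows. This is a minimal illustration, not the code in main.py. It assumes each downloaded tar holds bz2-compressed files with one JSON object per line, and that records without an "id" field (such as deletion notices) are not tweets. The function name iter_tweets_from_tar is hypothetical.

```python
import bz2
import json
import tarfile


def iter_tweets_from_tar(tar_path):
    """Yield one dict per tweet found in the .bz2 members of a tar archive."""
    with tarfile.open(tar_path) as archive:
        for member in archive:
            # Skip directories and anything that is not a bz2-compressed file.
            if not member.isfile() or not member.name.endswith(".bz2"):
                continue
            raw = archive.extractfile(member)
            # Decompress the member on the fly and read it line by line.
            with bz2.open(raw, mode="rt", encoding="utf-8") as lines:
                for line in lines:
                    line = line.strip()
                    if not line:
                        continue
                    record = json.loads(line)
                    # Assumption: records without an "id" are not tweets.
                    if "id" in record:
                        yield record
```

Because the function is a generator, the records can be passed to the mongo insert step in batches without loading a whole archive into memory.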

Run:

Change the following parameters before running the main script: year, month, from_date, to_date, mongo_db, mongo_collection
Install the requirements
Create a unique index on the tweet id (the dumps from Archive.org contain a lot of duplicate data): db.tweets.createIndex({"id":1}, {unique:true})
Run python main.py
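
To make the date parameters concrete, here is a small sketch of how year, month, from_date and to_date could be expanded into the list of days to fetch. It assumes from_date and to_date are day-of-month integers and that the range is inclusive. The actual script may interpret them differently, and days_to_fetch is a hypothetical helper.

```python
import calendar
from datetime import date


def days_to_fetch(year, month, from_date, to_date):
    """Return every day from from_date to to_date (inclusive) in one month."""
    # monthrange returns (weekday of first day, number of days in the month).
    last_day = calendar.monthrange(year, month)[1]
    if not 1 <= from_date <= to_date <= last_day:
        raise ValueError(
            f"expected 1 <= from_date <= to_date <= {last_day}, "
            f"got {from_date}..{to_date}"
        )
    return [date(year, month, day) for day in range(from_date, to_date + 1)]
```

For a first trial run, setting from_date and to_date to the same value, as in days_to_fetch(2018, 10, 1, 1), limits the work to a single day.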

Note:

The steps above are sequential: the script will not start dumping any file into mongo until all the tweet dumps for the given dates have been downloaded.
A unique index on the tweet id in your mongo collection is advised, as the tweet dumps contain a lot of duplicate data: db.tweets.createIndex({"id":1}, {unique:true})
Downloading the data takes a lot of time, even for a few days, let alone a month. Try running the script for a single day first to gain some confidence before running it for a month's data.
Make sure you have enough storage left; don't run it on your MacBook Air :).
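
A quick way to check the storage note before starting a long download is sketched below. It is not part of the script. The helper name has_enough_space is hypothetical, and the required size is an estimate you supply yourself, since the size of a dump depends on the dates chosen.

```python
import shutil


def has_enough_space(path, required_bytes):
    """Return True if the filesystem holding path has at least required_bytes free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes


if __name__ == "__main__":
    # Hypothetical estimate: require 100 GiB free in the current directory.
    needed = 100 * 1024 ** 3
    if has_enough_space(".", needed):
        print("Enough free space to start the download.")
    else:
        print("Not enough free space; free some up or pick fewer days.")
```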
