A wrapper for tweepy that makes it easy to capture all tweets matching a group of hashtags or terms within a specified date range into a local MongoDB instance. Only a date range and search terms are required; the save and iteration logic is handled by the wrapper.
- Create a local MongoDB instance and update constants.py with the appropriate DB & collection names (a quick connectivity check is sketched after this list)
- Install pymongo and tweepy using pip
- Run tweets-scrape.py with the required arguments shown below
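Before the first run, it can help to confirm the local MongoDB instance is reachable. A minimal sketch using standard pymongo calls, not part of the wrapper itself:

```python
# Quick connectivity check for a local MongoDB instance.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/", serverSelectionTimeoutMS=2000)
client.admin.command("ping")  # raises ServerSelectionTimeoutError if MongoDB is down
print("Collections in 'twitter':", client["twitter"].list_collection_names())
```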
Example code:

```
tweets-scrape.py --api-key yourkey --api-secret yoursecret \
  --start-date 2016-09-26 --end-date 2016-09-28 \
  --terms food,burgers,#bbq --collection foodtweets
```
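For reference, the iteration and save logic the wrapper handles amounts to roughly the following. This is a hedged sketch assuming tweepy 3.x's api.search endpoint; the wrapper's actual internals may differ:

```python
# Rough sketch of the scrape loop the wrapper automates (tweepy 3.x style).
import tweepy
from pymongo import MongoClient

auth = tweepy.OAuthHandler("yourkey", "yoursecret")
api = tweepy.API(auth, wait_on_rate_limit=True)
collection = MongoClient()["twitter"]["foodtweets"]

# The search API returns tweets newest-first; `until` bounds the upper end of
# the date range and the `since:` operator in the query bounds the lower end.
query = "food OR burgers OR #bbq since:2016-09-26"
for status in tweepy.Cursor(api.search, q=query, until="2016-09-28").items():
    collection.insert_one({
        "tweet_id": status.id,
        "text": status.text,
        "created_at": status.created_at,
        "favorite_count": status.favorite_count,
        "retweet_count": status.retweet_count,
    })
```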
Each tweet is saved as a document with the following fields:
- tweet_id
- text
- coordinates
- created_at
- entities (urls, mentions, hashtags)
- favorite_count
- retweet_count
- is_retweet
- user (id, screen name, real name)
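Put together, a saved document might look roughly like this; the values and exact field layout are illustrative assumptions based on the list above:

```python
{
    "tweet_id": 780123456789012345,
    "text": "Best #bbq burgers in town",
    "coordinates": None,
    "created_at": "2016-09-27T18:04:05Z",
    "entities": {"urls": [], "mentions": [], "hashtags": ["bbq"]},
    "favorite_count": 12,
    "retweet_count": 3,
    "is_retweet": False,
    "user": {"id": 12345, "screen_name": "grilldad", "name": "Grill Dad"},
}
```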
- The optional --restart-id argument allows manually specifying where the scrape should restart. It overrides the default logic, which presumes the restart point is the earliest/smallest tweet_id value in the database (a sketch of this default lookup appears after these notes).
- Tweets are returned in reverse chronological order by the API. Scrape from newest to oldest to avoid confusion, since the default restart logic looks for the oldest tweet in the database.
- Favorite & retweet counts are as of the time of the scrape, so the same tweet can show different counts across scrapes if it accumulated new favorites or retweets in between.
- The MongoDB database is always 'twitter'; only the collection name may be changed.
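The default restart lookup described above amounts to finding the smallest tweet_id already saved. A minimal sketch using pymongo; the collection name and the max_id arithmetic are illustrative, not the wrapper's actual code:

```python
# Find the oldest (smallest) tweet_id already saved and resume just below it.
from pymongo import MongoClient

collection = MongoClient()["twitter"]["foodtweets"]  # illustrative collection name
oldest = collection.find_one(sort=[("tweet_id", 1)])
# Passing max_id = oldest_id - 1 to the search API continues the scrape
# backwards in time from where the previous run stopped.
restart_id = oldest["tweet_id"] - 1 if oldest else None
```

Because counts can drift between scrapes, one hypothetical way (not necessarily what the wrapper does) to keep a single up-to-date document per tweet is an upsert keyed on tweet_id:

```python
from pymongo import MongoClient

collection = MongoClient()["twitter"]["foodtweets"]  # illustrative collection name

def save_tweet(doc):
    # Overwrites any existing document with the same tweet_id, so
    # favorite_count/retweet_count reflect the most recent scrape.
    collection.update_one({"tweet_id": doc["tweet_id"]}, {"$set": doc}, upsert=True)
```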