/twitter-crawler

Simple Twitter crawler based on Twython

Primary LanguagePython

twitter-crawler

Documentation Status build codecov PyPI - Python Version

twittercrawler is a simple Python crawler on top of the popular Twython package. The main objective during development was to provide an API that ease Twitter data collection for events that span across multiple days. The key features of this package are as follows:

  • collect tweets over several days (online or offline)
  • respect Twitter API rate limits during search
  • search for people
  • collect friend or follower network
  • easily export search results to multiple output channels (File, Socket, Kafka queues)

How to deploy?

Install

git clone https://github.com/ferencberes/twitter-crawler.git
cd twitter-crawler
python setup.py install

NOTE: If you want to push the collected data to Kafka queues then you need to execute a few additional steps.

Twitter API keys

You must provide your Twitter API credentials to collect data from Twitter. First, generate your Twitter API keys on the Twitter developer portal. Then, choose from the available options to configure your crawler.

a.) Environmental variables

  • Set the following environmental variables:
export API_KEY="YOUR_API_KEY";
export API_SECRET="YOUR_API_SECRET";
export ACCESS_TOKEN="YOUR_ACCESS_TOKEN";
export ACCESS_TOKEN_SECRET="YOUR_ACCESS_TOKEN_SECRET";
  • Authenticate your crawler:
from twittercrawler.crawlers import StreamCrawler
crawler = StreamCrawler()
crawler.authenticate()
...

b.) JSON configuration file

  • Create a JSON file (e.g. "api_key.json") in the root folder with the following content:
{
  "api_key":"YOUR_API_KEY",
  "api_secret":"YOUR_API_SECRET",
  "access_token":"YOUR_ACCESS_TOKEN",
  "access_token_secret":"YOUR_ACCESS_TOKEN_SECRET"
}
  • Authenticate your crawler:
from twittercrawler.crawlers import StreamCrawler
crawler = StreamCrawler()
crawler.authenticate("PATH_TO_API_KEY_JSON")
...

Examples

Tweet streaming example

  • Initialize and authenticate the crawler:
from twittercrawler.crawlers import StreamCrawler
stream = StreamCrawler()
stream.authenticate("PATH_TO_API_KEY_JSON")
  • Connect a FileWriter that will export the collected tweets:
from twittercrawler.data_io import FileWriter
stream.connect_output([FileWriter("stream_results.txt")])
  • Set search parameters:
search_params = {
    "q":"#bitcoin OR #ethereum OR blockchain",
    "result_type":"recent",
    "lang":"en",
    "count":100
}
stream.set_search_arguments(search_args=search_params)
  • Initialize a termination function that will collect tweets from the last 5 minutes:
from twittercrawler.search import get_time_termination
import datetime

now = datetime.datetime.now()
time_str = (now-datetime.timedelta(seconds=300)).strftime("%a %b %d %H:%M:%S +0000 %Y")
time_terminator =  get_time_termination(time_str)
  • Run search:
    • First, tweets from the last 5 minutes are collected
    • Then, new tweets are collected for every 15 seconds
try:
    stream.search(15, time_terminator)
except:
    raise
finally:
    stream.close()

With a few modifications (e.g. socket programming) the collected Twitter data can be transformed into a graph stream.

Load collected data

  • Load collected data into a Pandas dataframe
from twittercrawler.data_io import FileReader
results_df = FileReader("stream_results.txt").read()
print(results_df.head())

Crawlers

In this package you can find crawlers for various Twitter data collection tasks. Before executing the provided sample scripts make sure to prepare your Twitter API keys.

Tests

Before executing the provided tests make sure to prepare your Twitter API keys.

python setup.py test