twitter-quote-scraper

A command line tool for scraping quotations on Twitter




TwitterQuoteScraper is a command-line interface for scraping quotations on Twitter. You can save the data to a local file, a Google spreadsheet, or a MySQL database.

Note. A tweet counts as a quotation only if it meets all of the following conditions:

  • tweet must not be a retweet or a reply
  • must not contain a URL, media (image or video), or any emoji
  • must match the regular expression: ^[\"\']{0,1}(?P<phrase>[A-Z].*[\.!?])[\"\']{0,1}\s*?[-~]\s*(?P<author>.*)$
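
The regular expression above can be tried out with Python's `re` module. This is a minimal sketch: only the pattern comes from this README, while `parse_quotation` is a hypothetical helper, and the other conditions (retweets, replies, URLs, media, emoji) are not checked here.

```python
import re

# Pattern copied from the note above; everything else is illustrative.
QUOTE_PATTERN = re.compile(
    r"""^[\"\']{0,1}(?P<phrase>[A-Z].*[\.!?])[\"\']{0,1}\s*?[-~]\s*(?P<author>.*)$"""
)


def parse_quotation(text):
    """Return (phrase, author) if text matches the pattern, else None."""
    match = QUOTE_PATTERN.match(text)
    if match is None:
        return None
    return match.group("phrase"), match.group("author")


# A quoted sentence followed by "- Author" matches.
print(parse_quotation('"Talk is cheap. Show me the code." - Linus Torvalds'))
# The phrase must start with a capital letter, so this does not match.
print(parse_quotation("no capital letter at the start. - Someone"))
# Without a "-" or "~" separator there is no author, so this does not match.
print(parse_quotation("A sentence without an author."))
```

Because `.` does not match a newline by default, a tweet that spans several lines would not match this pattern as written.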

Prerequisites

Installation

  1. Download and extract the zip file or use Git to clone this repository.

  2. Inside the directory open a terminal and run:

    pipenv install

Usage Examples

Important. Activate the virtual environment first. Doing so puts the virtual environment's python and pip executables on the PATH of the pipenv shell. To activate, run:

pipenv shell

Saving to a local file

# Single account
python app.py --twitter-creds creds/twitter.json local_file --twitter-handles @prog_quotes

# Multiple accounts
python app.py --twitter-creds creds/twitter.json local_file --twitter-handles @prog_quotes @CodeWisdom

# Specify the folder where the files will be generated
python app.py --twitter-creds creds/twitter.json local_file --twitter-handles @prog_quotes --output-folder quotes/

# Override the default file type
python app.py --twitter-creds creds/twitter.json local_file --twitter-handles @prog_quotes --file-type json

Saving to MySQL database

The database and a table for each Twitter handle are created if they don't already exist.

# Single account
python app.py --twitter-creds creds/twitter.json database --twitter-handles @prog_quotes --database-configs creds/database.json

# Multiple accounts
python app.py --twitter-creds creds/twitter.json database --twitter-handles @prog_quotes @CodeWisdom --database-configs creds/database.json

# Ignore warnings
python -W ignore app.py --twitter-creds creds/twitter.json database --twitter-handles @prog_quotes --database-configs creds/database.json

Saving to Google spreadsheet

Before you run the command, you must set up a Google spreadsheet for the service account to programmatically insert and edit values.

  1. Log in to Google and create a spreadsheet.
  2. Share the spreadsheet with the client_email you'll find inside the Google service account's JSON file.

python app.py --twitter-creds creds/twitter.json google_sheet --service-account creds/google.json --spreadsheet-id 1S8xsN8D6nD2KM5LoSZOIFnuw3zvP4XWRZLHMMfbsbPk --twitter-handles @prog_quotes

# Alphabetically sort the second/phrase column
python app.py --twitter-creds creds/twitter.json google_sheet --service-account creds/google.json --spreadsheet-id 1S8xsN8D6nD2KM5LoSZOIFnuw3zvP4XWRZLHMMfbsbPk --twitter-handles @prog_quotes --sort '{"order": "asc", "column": 1}'

The command for this spreadsheet is set up to run daily at midnight UTC via a cron job.

References

Commands

local_file Generate and save quotations to a file.

usage: python app.py --twitter-creds local_file --twitter-handles [--output-folder] [--file-type]


database Save quotations to a MySQL database.

usage: python app.py --twitter-creds database --twitter-handles --database-configs


google_sheet Save quotations to a Google spreadsheet.

usage: python app.py --twitter-creds google_sheet --service-account --spreadsheet-id --twitter-handles [--sort]

Arguments

--twitter-creds Path to the JSON file that contains your Twitter app's credentials; see creds/twitter.json for the expected keys. This argument must come before the command. Think of it as a login form for accessing Twitter's API.
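
One common way to get this "global option first, then command" behavior is `argparse` with subparsers. The sketch below is an assumption about how such a CLI could be wired, not a description of how app.py is actually implemented. Only the argument and command names come from this README; `build_parser` and the lowercase `csv`/`json` choices are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the commands documented in this README."""
    parser = argparse.ArgumentParser(prog="app.py")
    # Defined on the main parser, so it has to appear before the command name.
    parser.add_argument("--twitter-creds", required=True)

    commands = parser.add_subparsers(dest="command", required=True)

    local_file = commands.add_parser("local_file")
    local_file.add_argument("--twitter-handles", nargs="+", required=True)
    local_file.add_argument("--output-folder", default=".")
    local_file.add_argument("--file-type", choices=["csv", "json"], default="csv")

    database = commands.add_parser("database")
    database.add_argument("--twitter-handles", nargs="+", required=True)
    database.add_argument("--database-configs", required=True)

    google_sheet = commands.add_parser("google_sheet")
    google_sheet.add_argument("--service-account", required=True)
    google_sheet.add_argument("--spreadsheet-id", required=True)
    google_sheet.add_argument("--twitter-handles", nargs="+", required=True)
    google_sheet.add_argument("--sort", default=None)

    return parser


args = build_parser().parse_args(
    ["--twitter-creds", "creds/twitter.json",
     "local_file", "--twitter-handles", "@prog_quotes", "@CodeWisdom"]
)
print(args.twitter_creds, args.command, args.twitter_handles, args.file_type)
```

With this layout, passing `--twitter-creds` after the command makes the parser exit with an error, which matches the ordering rule described above.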


--twitter-handles List of Twitter handles to scrape (the leading @ character is optional).

commands: local_file, database, google_sheet
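
Since the leading @ is optional, both spellings of a handle have to end up referring to the same account. A minimal sketch of that normalization, assuming a hypothetical helper named `normalize_handle` (the tool's own code may do this differently):

```python
def normalize_handle(handle):
    """Strip surrounding whitespace and one optional leading '@'."""
    handle = handle.strip()
    return handle[1:] if handle.startswith("@") else handle


# Both spellings map to the same bare handle.
print(normalize_handle("@prog_quotes"))
print(normalize_handle("prog_quotes"))
```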


--output-folder The folder where the files will be generated; by default, its value is set to the current working directory.

commands: local_file


--file-type The format of the generated file; choose either CSV (default) or JSON.

commands: local_file
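
To illustrate how --output-folder and --file-type could interact, here is a rough sketch using only the standard library. Everything in it is an assumption made for illustration: the `save_quotes` helper, the one-file-per-handle naming, and the `phrase`/`author` field names (borrowed from the regular expression's group names) are hypothetical, and the files the tool actually writes may look different.

```python
import csv
import json
import os
import tempfile


def save_quotes(quotes, handle, output_folder=".", file_type="csv"):
    """Write a list of {'phrase', 'author'} dicts to <output_folder>/<handle>.<file_type>."""
    os.makedirs(output_folder, exist_ok=True)
    path = os.path.join(output_folder, f"{handle}.{file_type}")
    if file_type == "csv":
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=["phrase", "author"])
            writer.writeheader()
            writer.writerows(quotes)
    elif file_type == "json":
        with open(path, "w", encoding="utf-8") as f:
            json.dump(quotes, f, ensure_ascii=False, indent=2)
    else:
        raise ValueError(f"unsupported file type: {file_type}")
    return path


# Example data and a temporary folder, so nothing is written to the working directory.
quotes = [{"phrase": "Talk is cheap. Show me the code.", "author": "Linus Torvalds"}]
folder = tempfile.mkdtemp()
print(save_quotes(quotes, "prog_quotes", folder, "csv"))
print(save_quotes(quotes, "prog_quotes", folder, "json"))
```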


--database-configs Path to the JSON file that contains the database configuration; see creds/database.json for the expected keys.

commands: database


--service-account Path to your Google service account's JSON file.

commands: google_sheet


--spreadsheet-id The spreadsheet's ID.

commands: google_sheet


--sort A JSON-formatted string that specifies how to sort the spreadsheet's values; by default, its value is set to:

{
    "order": null,  // one of: null, "asc" or "desc"
    "column": 0     // zero-based index of the column to sort by
}

commands: google_sheet
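
The --sort value is a JSON string, so it can be merged over the defaults and validated before use. The following sketch assumes hypothetical helpers named `parse_sort_option` and `sort_rows`. The tool presumably sorts through the Google Sheets API, so the local `sorted` call here only illustrates what `order` and a zero-based `column` mean.

```python
import json

# Defaults as documented above.
DEFAULT_SORT = {"order": None, "column": 0}


def parse_sort_option(raw=None):
    """Merge a JSON-formatted --sort string over the defaults and validate it."""
    options = dict(DEFAULT_SORT)
    if raw:
        options.update(json.loads(raw))
    if options["order"] not in (None, "asc", "desc"):
        raise ValueError('"order" must be null, "asc" or "desc"')
    if not isinstance(options["column"], int) or options["column"] < 0:
        raise ValueError('"column" must be a non-negative integer')
    return options


def sort_rows(rows, options):
    """Sort rows by the chosen column; leave them untouched when order is None."""
    if options["order"] is None:
        return list(rows)
    return sorted(
        rows,
        key=lambda row: row[options["column"]],
        reverse=options["order"] == "desc",
    )


options = parse_sort_option('{"order": "asc", "column": 1}')
rows = [["2", "Zebra crossing."], ["1", "Apple pie."]]
print(sort_rows(rows, options))
```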

Dependencies

License

Distributed under the MIT license. See LICENSE for more information.

Author

Herbert Verdida / @bertdida

Credits

Thanks to @Tobaloidee for making this project's logo.