Script to download missed media from TWARC crawls.

BHL TWARC MEDIA

A complementary script to the Bentley Historical Library's implementation of twarc, used to download tweet media from URLs captured in twarc crawls.

Requirements

BHL TWARC MEDIA Set Up

  • Clone this repository to obtain bhl_twarc_media.py
  • Place bhl_twarc_media.py in the same directory as bhl_twarc.py

Twitter API Set Up

  • The consumer key, consumer secret, access token, and access token secret are not required to run this script.

Use

  • The script only interacts with the content inside each feed's media directory
    • It parses rows from profile_images.csv and tweet_images.csv in the media folder of every feed that has one
    • bhl_twarc creates the following directory structure, and this script adds to the media directory:
feeds
  examplehashtag
    html
    json
    logs
    media
      profile_images
      tweet_images
      media_logs
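
As a rough illustration of the parsing step described above, the sketch below walks a feeds directory and yields every row found in the two CSV files. It assumes the CSV files sit directly inside each feed's media folder; the function name iter_media_rows is hypothetical, and the column layout of the rows is not specified here, so rows are returned as-is.

```python
import csv
from pathlib import Path

# The two CSV files named in this README.
CSV_NAMES = ("profile_images.csv", "tweet_images.csv")


def iter_media_rows(feeds_dir):
    """Yield (feed_name, csv_name, row) for each CSV row under feeds_dir.

    Hypothetical helper: assumes feeds/<feed>/media/<csv_name> as the layout.
    """
    for feed in sorted(Path(feeds_dir).iterdir()):
        media_dir = feed / "media"
        if not media_dir.is_dir():
            continue  # this feed has no media folder, so skip it
        for csv_name in CSV_NAMES:
            csv_path = media_dir / csv_name
            if not csv_path.is_file():
                continue  # this CSV was never written for the feed
            with csv_path.open(newline="", encoding="utf-8") as handle:
                for row in csv.reader(handle):
                    if row:  # ignore blank lines
                        yield feed.name, csv_name, row
```

Calling list(iter_media_rows("feeds")) would collect the rows for every feed; which column holds the media URL depends on how bhl_twarc writes the files.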

Potential "media" Directory Alterations

If the profile_images directory and/or tweet_images directory are not present, it/they will be created.
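
The directory creation described above could look something like the following. This is a minimal sketch, not the script's actual code; ensure_media_dirs is a hypothetical name, and the function simply reports which of the two folders it had to create.

```python
from pathlib import Path


def ensure_media_dirs(media_dir):
    """Create profile_images and tweet_images under media_dir if missing.

    Hypothetical helper; returns the names of the folders it created.
    """
    created = []
    for name in ("profile_images", "tweet_images"):
        target = Path(media_dir) / name
        if not target.is_dir():
            # parents=True also creates media_dir itself if it is absent
            target.mkdir(parents=True, exist_ok=True)
            created.append(name)
    return created
```

A second call on the same media directory returns an empty list, since both folders already exist.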

Output Statistics

The script reports the following statistics after it finishes:

  • "Stale" tweet images and profile images: the number of previously downloaded URLs.
  • "Dead" tweet images and profile images: the number of dead URLs.
  • "New" tweet images and profile images: the number of newly downloaded URLs.
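
One way to picture how the three counts could be produced is sketched below. It rests on assumptions that are not confirmed by the script itself: a URL is treated as "stale" when a file with the same name already exists in the target folder, as "dead" when the fetch returns nothing, and as "new" otherwise. The names classify_url and tally_urls are hypothetical, as is the fetch callable, which stands in for whatever download routine is used.

```python
from collections import Counter
from pathlib import Path
from urllib.parse import urlsplit


def classify_url(url, target_dir, fetch):
    """Return "stale", "dead", or "new" for one media URL.

    fetch is a hypothetical callable: it takes a URL and returns the file
    bytes, or None when the URL can no longer be retrieved.
    """
    filename = Path(urlsplit(url).path).name
    if not filename:
        return "dead"  # assumption: no usable file name counts as dead
    destination = Path(target_dir) / filename
    if destination.exists():
        return "stale"  # already downloaded on an earlier run
    data = fetch(url)
    if data is None:
        return "dead"
    destination.write_bytes(data)
    return "new"


def tally_urls(urls, target_dir, fetch):
    """Count how many URLs fall into each of the three categories."""
    return Counter(classify_url(url, target_dir, fetch) for url in urls)
```

In practice fetch might wrap urllib.request.urlopen and return None on an HTTP or URL error; passing it in as a parameter keeps the counting logic testable without network access.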