MaryseLT/bhl_twarc_media-1

Script to download missed media from TWARC crawls.


BHL TWARC MEDIA

A complementary script to the Bentley Historical Library's implementation of twarc, used to download tweet media from URLs captured in twarc crawls.

Requirements

BHL TWARC MEDIA Set Up

Clone bhl_twarc_media.py
Place bhl_twarc_media.py in the same directory as bhl_twarc.py

Twitter API Set Up

Note that the consumer key, consumer secret, access token, and access token secret are not required to run this script.

Use

The script interacts only with the content inside the media directory.
- It parses the rows of profile_images.csv and tweet_images.csv in each feed's media folder, where those files are present.
- bhl_twarc creates the following directory structure, and this script adds to the media directory:

feeds
  examplehashtag
    html
    json
    logs
    media
      profile_images
      tweet_images
      media_logs
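
The traversal described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name iter_media_rows and the feeds_dir argument are hypothetical, and because the column layout of the CSV files is not documented here, each row is returned as a raw list of strings.

```python
import csv
import os

# Hypothetical helper for illustration; not taken from bhl_twarc_media.py.
CSV_NAMES = ("profile_images.csv", "tweet_images.csv")


def iter_media_rows(feeds_dir):
    """Yield (feed_name, csv_name, row) for every CSV row in each feed's media folder."""
    for feed_name in sorted(os.listdir(feeds_dir)):
        media_dir = os.path.join(feeds_dir, feed_name, "media")
        if not os.path.isdir(media_dir):
            continue  # this feed has no media folder, so there is nothing to parse
        for csv_name in CSV_NAMES:
            csv_path = os.path.join(media_dir, csv_name)
            if not os.path.isfile(csv_path):
                continue  # only parse the CSV files that are actually present
            with open(csv_path, newline="", encoding="utf-8") as handle:
                for row in csv.reader(handle):
                    if row:  # skip blank lines
                        yield feed_name, csv_name, row
```

Calling list(iter_media_rows("feeds")) would collect every row from every feed, leaving the interpretation of the columns to the caller.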

Logs for the downloads are written to a media.log file in a new folder named media_logs.
Uses the same variable (feed_dict) to execute as:

Potential "media" Directory Alterations

If the profile_images directory and/or tweet_images directory are not present, it/they will be created.
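
A minimal sketch of this behaviour, assuming the script simply checks for each subdirectory and creates whichever is missing. The function name ensure_media_dirs is hypothetical, and including media_logs in the same check is an assumption based on the log folder described earlier.

```python
import os

# Hypothetical helper for illustration; not taken from bhl_twarc_media.py.
# Including "media_logs" here is an assumption.
MEDIA_SUBDIRS = ("profile_images", "tweet_images", "media_logs")


def ensure_media_dirs(media_dir):
    """Create any missing subdirectory of media_dir and return the names created."""
    created = []
    for name in MEDIA_SUBDIRS:
        path = os.path.join(media_dir, name)
        if not os.path.isdir(path):
            os.makedirs(path)
            created.append(name)
    return created
```

Running the function a second time on the same media directory returns an empty list, since nothing is left to create.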


Statistics provided after the script has finished:

"Stale" tweet images and profile images: the number of previously downloaded URLs.
"Dead" tweet images and profile images: the number of dead URLs.
"New" tweet images and profile images: the number of newly downloaded URLs.