A complementary script to the Bentley Historical Library's implementation of twarc, used to download tweet media from URLs captured in twarc crawls.
- Clone `bhl_twarc_media.py`.
- Place `bhl_twarc_media.py` in the same directory as `bhl_twarc.py`.
- Note: the consumer key, consumer secret, access token, and access token secret are not required for execution.
- The script will only interact with the content inside the `media` directory.
- It parses rows from `profile_images.csv` and `tweet_images.csv` present in each feed's media folder.

`bhl_twarc` will create the following directory structure, and this script will add to the `media` directory:

    feeds
        examplehashtag
            html
            json
            logs
            media
                profile_images
                tweet_images
                media_logs
- Logs for the downloads will be stored in a `media.log` file in a new folder titled `media_logs`.
- Uses the same variable (feed_dict) to execute as:
If the `profile_images` directory and/or the `tweet_images` directory are not present, they will be created.
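
The directory check described above could look roughly like the following. This is a minimal sketch, not the script's actual code: the function name `ensure_media_dirs` and the `feed_dir` argument are hypothetical, and it assumes Python 3 and that both image folders live directly under the feed's `media` directory.

```python
import os


def ensure_media_dirs(feed_dir):
    """Create media/profile_images and media/tweet_images under feed_dir
    if they are missing, and return a dict mapping folder name to path.

    feed_dir is a hypothetical argument: the path to one feed folder,
    e.g. feeds/examplehashtag.
    """
    media_dir = os.path.join(feed_dir, "media")
    paths = {}
    for name in ("profile_images", "tweet_images"):
        path = os.path.join(media_dir, name)
        # exist_ok=True makes this a no-op when the folder already exists
        os.makedirs(path, exist_ok=True)
        paths[name] = path
    return paths
```

Calling it a second time on the same feed folder changes nothing, so it is safe to run before every download pass.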
Statistics provided after the script has finished:
- "Stale" tweet images and profile images: the number of previously downloaded URLs.
- "Dead" tweet images and profile images: the number of dead URLs.
- "New" tweet images and profile images: the number of newly downloaded URLs.
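
One way to arrive at the three counts above is sketched below. It is an illustration under stated assumptions rather than the script's own logic: a URL is treated as "stale" when a file with the same name already exists in the destination folder, "dead" when the download fails, and "new" otherwise. The names `classify_and_save`, `summarize`, and the caller-supplied `fetch` function are all hypothetical.

```python
import os
from collections import Counter
from urllib.parse import urlparse


def classify_and_save(url, dest_dir, fetch):
    """Return 'stale', 'dead', or 'new' for one media URL.

    fetch is a caller-supplied function (an assumption of this sketch)
    that takes a URL and returns the file's bytes, or None on failure.
    The URL is assumed to end in a file name.
    """
    filename = os.path.basename(urlparse(url).path)
    target = os.path.join(dest_dir, filename)
    if os.path.exists(target):
        return "stale"  # already downloaded in an earlier run
    data = fetch(url)
    if data is None:
        return "dead"  # the URL could not be retrieved
    with open(target, "wb") as handle:
        handle.write(data)
    return "new"


def summarize(urls, dest_dir, fetch):
    """Tally the outcome for every URL in urls."""
    return Counter(classify_and_save(u, dest_dir, fetch) for u in urls)
```

Passing the download function in as an argument keeps the counting logic separate from the network code, so the tally can be checked without making real requests.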