/tweets-cli

Stand-alone PHP CLI script to batch/post-process downloaded full twitter backup-archive files

Primary LanguagePHP

tweets.php

A command-line (CLI) script to batch-process and work with the files unzipped from the twitter backup archive zip file.

## Features

  • It can be used to generate 'grailbird' javascript files which are compatible with the default twitter archive viewer application, of which I have written an experimental updated version called tweets-gb where the generated files can be dropped-in and optionally linked via a file:/// URL to the physical file on disk when browsed off-line, locally and in a web browser.
  • Exported grailbird data can be viewed in @vijinho/tweets-gb
  • It can also import grailbird files, join them, and optionally merge into existing tweets.js data.
  • It can unshorten all short-links and resolve all links fully (saving the results to urls.json for re-use on successive runs (it's a time-consuming process to check-links which can take hours!). This can be used with the --offline option to speed-up subsequent processing further. Media entity attributes will be updated to reflect changes.
  • Option local will check subfolders for content and add file and path information into the tweet under new attributes (videos, images, files). Also these local files will be swapped-in for the remote-ones for viewing off-line and loading faster.
  • The --delete option will delete lower-bitrate video files and keep the highest bitrate file if used with the local-files option *local- Can filter tweets on date/time (from and to specific dates) using PHP strtotime for flexible date/time format
  • An option exists to also delete duplicate local tweet files. These are files named 9999999999-XXXXXXXX.(jpg|png|mp4|...) and the resultant file will chop off the numeric tweet_id at the start of the filename (and dash) and just rename one of the duplicate files to XXXXXXXX.(jpg|png|mp4|...), deleting the rest. The script will update the file and entity links to reflect the new filename.
  • Option to filter results on a list of given attributes/keys --keys-filter and also to drop keys altogether from tweets with --keys-remove.
  • Can filter tweets based on executing a PHP regular-expression (and optionally save the regular expression results in the tweet as a new attribute regexps)
  • Creates a new tweet attribute: created_at_unixtime which is the unixtime of the tweet.
  • Creates a new tweet attribute: text which is the cleaned-up tweet-text after processing, also named to be compatible with default twitter export. *.
  • Option to specify previous output of batch processing as input with --tweets-file
  • All processed tweets are saved to output.json by default but this can be changed with --filename
  • Output can also be optionally changed to .txt or serialized php.
  • Option to discard tweets which are mentions or retweets (--no-retweets and --no-mentions)
  • Can just return a json file of either of the following: js/json files, images, videos or all files in the twitter backup folder.
  • Save basic info of all users mentioned or RT'd to users.json with --list-users
  • Adds new tweet attribute 'rt' if RT containing RT'd username

Usage - CLI Options

This is intentionally written as a stand-alone self-contained command-line php script, hacked-together, written in a procedural style. These are the command-line options available:

Usage: php tweets.php

-h,  --help                   Display this help and exit
-v,  --verbose                Run in verbose mode
-d,  --debug                  Run in debug mode (implies also -v, --verbose)
-t,  --test                   Run in test mode, show what would be done, NO filesystem changes.
     --dir={.}                Directory of unzipped twitter backup files (current dir if not specified)
     --dir-output={.}         Directory to output files in (default to -dir above)
     --format={json}          Output format for script data: txt|php|json (default)
-f,  --filename={output.}     Filename for output data from operation, default is 'output.{--OUTPUT_FORMAT}'
     --grailbird-import={dir} Import in data from the grailbird json files of the standard twitter export. If specified with '-a' will merge into existing tweets before outputting new file.
-g,  -g={dir}        Generate json output files compatible with the standard twitter export feature to dir
     --grailbird-media        Copy local media files to grailbird folder, using same file path
     --media-prefix           Prefix to local media folder instead of direct file:// path, e.g. '/' if media folders are to be replicated under webroot for serving via web and prefixing a URL path, implies --local
     --list                   Only list all files in export folder and halt - filename
     --list-js                Only List all javascript files in export folder and halt
     --list-images            Only list all image files in export folder and halt
     --list-videos            Only list all video files in export folder and halt
     --list-users             Only list all users in tweets, (default filename 'users.json') and halt
     --list-missing-media     List media URLs for which no local file exists and halt (implies --local)
     --organize-media         Organize local downloaded media, for example split folder into date/month subfolders
     --download-missing-media Download missing media (from --list-missing-media) and halt, e.g.. missing media files (implies --local)
     --list-profile-images    Only list users profile images, (in filename 'users.json') and halt
     --download-profile-images  WARNING: This can be a lot of users! Download profile images.
     --tweets-count           Only show the total number of tweets and halt
-i,  --tweets-file={tweet.js} Load tweets from different json input file instead of default twitter 'tweet.js' or 'tweet.json' (priority if exists)
-a,  --tweets-all             Get all tweets (further operations below will depend on this)
     --date-from              Filter tweets from date/time, see: https://secure.php.net/manual/en/function.strtotime.php
     --date-to                Filter tweets up-to date/time, see: https://secure.php.net/manual/en/function.strtotime.php
     --no-retweets            Drop re-tweets (RT's)
     --no-mentions            Drop tweets starting with mentions
      --minimal               Minimal output for each tweet, no superfluous data like tweet IDs.
     --media-only             Only media tweets
     --urls-expand            Expand URLs where shortened and data available (offline) in tweet (new attribute: text)
-u,  --urls-resolve           Unshorten and dereference URLs in tweet (in new attribute: text) - implies --urls-expand
     --urls-check             Check every single target url (except for twitter.com and youtube.com) and update - implies --urls-resolve
     --urls-check-source      Check failed source urls - implies --urls-resolve
     --urls-check-force       Forcibly checks every single failed (numeric) source and target url and update - implies --urls-check
-o,  --offline                Do not go-online when performing tasks (only use local files for url resolution for example)
-l,  --local                  Fetch local file information (if available) (new attributes: images,videos,files)
-x,  --delete                 DANGER! At own risk. Delete files where savings can occur (i.e. low-res videos of same video), run with -t to test only and show files
     --dupes                  List (or delete) duplicate files. Requires '-x/--delete' option to delete (will rename duplicated file from '{tweet_id}-{id}.{ext}' to '{id}.{ext}). Preview with '--test'!
     --keys-required=k1,k2,.  Returned tweets which MUST have all of the specified keys
-r,  --keys-remove=k1,k2,.    List of keys to remove from tweets, comma-separated (e.g. 'sizes,lang,source,id_str')
-k,  --keys-filter=k1,k2,.    List of keys to only show in output - comma, separated (e.g. id,created_at,text)
     --regexp='/<pattern>/i'  Filter tweet text on regular expression, i.e /(google)/i see https://secure.php.net/manual/en/function.preg-match.php
     --regexp-save=name       Save --regexp results in the tweet under the key 'regexps' using the key/id name given
     --thread=id              Returned tweets for the thread with id

Usage Examples

Report duplicate tweet media files and output to 'dupes.json':
        tweets.php -fdupes.json --dupes

Delete duplicate tweet media files (will rename them from '{tweet_id}-{id}.{ext}' to '{id}.{ext})':
        tweets.php --delete --dupes

Show total tweets in tweets file:
        tweets.php --tweets-count --format=txt

Write all users mentioned in tweets to default file 'users.json':
        tweets.php --list-users

Show javascript files in backup folder:
        tweets.php -v --list-js

Resolve all URLs in 'tweet.js' file, writing output to 'tweet.json':
        tweets.php -v -u --filename=tweet.json

Resolve all URLs in 'tweet.js' file, writing output to grailbird files in 'grailbird' folder and also 'tweet.json':
        tweets.php -u --filename=tweet.json -g=export/grailbird

Get tweets from 1 Jan 2017 to 'last friday', only id, created and text keys:
        tweets.php -d -v -o -u --keys-filter=id,created_at,text,files --date-from '2017-01-01' --date-to='last friday'

List URLs for which there are missing local media files:
        tweets.php -v --list-missing-media

Download files from URLs for which there are missing local media files:
        tweets.php -v --download-missing-media

Organize 'tweet_media' folder into year/month subfolders:
        tweets.php -v --organize-media

Prefix the local media with to a URL path 'assets':
        tweets.php -v --media-prefix='/assets'

Generate grailbird files with expanded/resolved URLs:
        tweets.php -v -u -g=export/grailbird

Generate grailbird files with expanded/resolved URLs using offline saved url data - no fresh checking:
        tweets.php -v -o -u -g=export/grailbird

Generate grailbird files with expanded/resolved URLs using offline saved url data and using local file references where possible:
        tweets.php -v -o -u -l -g=export/grailbird

Generate grailbird files with expanded/resolved URLs using offline saved url data and using local file references, dropping retweets:
        tweets.php -v -o -u -l -g=export/grailbird --no-retweets

Filter tweet text on word 'hegemony' since last year, exporting grailbird:
        tweets.php -v -o -u -l -g=export/grailbird --regexp='/(hegemony)/i' --regexp-save=hegemony

Extract the first couple of words of the tweet and name the saved regexp 'words':
        tweets.php -v -o -u -l -x -g=export/grailbird --regexp='/^(?P<first>[a-zA-Z]+)\s+(?P<second>[a-zA-Z]+)/i' --regexp-save=words

Import grailbird tweets and export tweets with local media files to web folder:
        tweets.php -v -g=www/vijinho/ --media-prefix='/vijinho/' --grailbird-media --grailbird-import=vijinho/import/data/js/tweets

Import twitter grailbird files,check URL and export new grailbird files:
        tweets.php -v -g=www/vijinho/ --grailbird-import=import/data/js/tweets --urls-check

Import and merge grailbird files from 'import/data/js/tweets', fully-resolving links and local files:
        tweets.php -v -o -l -u --grailbird-import=import/data/js/tweets -g=export/grailbird

Export only tweets which have the 'withheld_in_countries' key to export/grailbird folder:
        tweets.php -v -u -o --keys-required='withheld_in_countries' -g=export/grailbird

Export only tweets containing text 'youtu':
        tweets.php -v --regexp='/youtu/' -g=www/vijinho/ --media-prefix='/vijinho/' --grailbird-media

Export only no mentions, no RTs':
        tweets.php -v -g=www/vijinho/ --media-prefix='/vijinho/' --grailbird-media --no-retweets --no-mentions

Export only media tweets only':
        tweets.php -v -g=www/vijinho/ --media-prefix='/vijinho/' --grailbird-media --media-only

Export the tweet thread 967915766195609600 as grailbird export files, to tweets to thread.json and folder called thread:
        tweets.php -v --thread=967915766195609600 --filename=www/thread/data/js/thread.json -g=www/thread/ --media-prefix='/thread/' --grailbird-media

Export the tweet thread 967915766195609600 as a js file test/test.json, and copy media files too:
        tweets.php -v --dir=vijinho --thread=1108500373298442240 --filename=test/test.json --copy-media=test

Export the tweet thread 967915766195609600 as markdown, and copy media files too:
        tweets.php -d -v --dir=vijinho --thread=967915766195609600 --filename=thread/vijinho_967915766195609600_md/item.md --media-prefix=/vijinho_967915766195609600_md/ --copy-media=thread/vijinho_967915766195609600_md --format=md        

Resolve URLs from tweets.js/tweets.json file and create a complete grailbird-data export, creating a new tweets.json file after to
        tweets.php -v -d  --date-from '2019-05-01' --urls-expand --urls-resolve --grailbird-media --media-prefix='/' --grailbird=grailbird --filename="tweet.json"

Generate markdown output file of all tweets except RTs and mentions for threads which have at least 10 tweets
        tweets.php -v -d --no-retweets --no-mentions --format=md --filename=output.md --threads-tweets=10

Note

  • I have only tested it on MacOS but it should work under Linux.
  • This script is memory-hungry, I had to increase my limit to 512MB to handle 10 years and over 30,000 tweets.

Re-constructing the folder structure from a standard (old) twitter year/month file export

Supposing tweets.php is in the folder 'cli' and you are running for a user 'euromoan'.

### Make the following folders:

euromoan/www/euromoan - this is the top-level folder of the un-zipped file (containing the twitter index.html file)
euromoan/profile_media
euromoan/tweet_media
euromoan/tweet_files

Create the following files

In the euromoan folder, copying the data from the account data/js/user_details.js and from browsing the twitter page for the user:

account.js:

window.YTD.account.part0 = [{
        "account": {
            "email": "euromoan@example.com",
            "createdVia": "web",
            "username": "euromoan",
            "accountId": "816715694133964800",
            "createdAt": "2007-01-01T00:00:00.000Z",
            "accountDisplayName": "Mario Drago",
            "timeZone": "Basel, Switzerland"
        }
    }]

profile.js:

window.YTD.profile.part0 = [{
        "profile": {
            "description": {
                "bio": "Evil banker. #TBTJ untouchable Communist Head of ECB. I do whatever it takes to keep EU masses enslaved, enriching my cronies of the BIS, FSB, G30 etc PARODY!.",
                "website": "",
                "location": "Basel, Switzerland"
            },
            "avatarMediaUrl": "https://pbs.twimg.com/profile_images/986255258073657350/g8fvWiDX.jpg",
            "headerMediaUrl": "https://pbs.twimg.com/profile_banners/816715694133964800/1523976777"
        }
    }]

Save the URL images to files in profile_media

Combine files to make the tweet.js file

Making a default tweet.js file

This will create the tweet.js similar to a full twitter backup download zip contains.

`php cli/tweets.php --dir=euromoan --dir-output=euromoan --grailbird-import=euromoan/www/euromoan/data/js/tweets --filename=tweet.js --debug`

This will also make users.json and urls.json files containing the use and url information contained therein.

Resolve URLs when creating

After the previous step, you can make a tweet.json (note extension change - by default tweet.js cli creates .json files) file with the un-shortened/resolved URLs:

`php cli/tweets.php --dir=euromoan --dir-output=euromoan -a -itweet.js --filename=tweet.json -u --urls-check-source --debug`

Or run the whole create step again with URL resolving:

`php cli/tweets.php --dir=euromoan --dir-output=euromoan --grailbird-import=euromoan/www/euromoan/data/js/tweets --filename=tweet.js -u --debug`

Generate grailbird export data file using data from previous step

This will create the YYYY-MM.js files with the resolved URLs in a folder structure as with the original twitter download in export/grailbird.

`php cli/tweets.php --dir=euromoan --dir-output=euromoan --filename=tweet.json -itweet.js --filename=tweet.json -u -o -g=euromoan/www/euromoan --debug`

Missing local tweet_media files

This will list the local tweet_media files that are missing and where they would be downloaded:

`php cli/tweets.php --dir=euromoan --dir-output=euromoan -itweet.js --filename=missing.json -a -u -l --list-missing-media --debug`

To download:

`php cli/tweets.php --dir=euromoan --dir-output=euromoan -itweet.js --filename=missing.json -a -u -l --download-missing-media debug`

To organize the tweet_media files into subfolders:

`php cli/tweets.php --dir=euromoan --dir-output=euromoan -itweet.js --filename=missing.json -a -u -l --organize-media --debug`

Generate locally viewable offline tweets linked to downloaded files

Files will be exported to euromoan/export/grailbird in the correct folder structure to overwrite/replace the original download or use as data files for @vijinho/tweets-gb

`php cli/tweets.php --dir=euromoan --dir-output=euromoan -a -u -o -l -g=euromoan/www/euromoan --debug`

Fully check all URLs

This will check/update the source and destination URLs (if they have been redirected/changed) unless they are twitter.com or www.youtube.com hosts.

`php cli/tweets.php --dir=euromoan --dir-output=euromoan -a -u --urls-check-force --debug`

#### Exporting tweets and media files along with (grailbird) data for web browsing:

Assuming your target data grailbird folder (containing files from tweets-gb) is in euromoan/www/euromoan and that euromoan/www is the webroot.

Export tweets, with media files to web-viewable folder

This will process tweets in euromoan, exporting data and media files to euromoan/www/euromoan and the media file URLs will be prefixed with /euromoan/ such that browsing from the webroot euromoan/www and starting a webserver there (with php) http://127.0.0.1:9012 will reference the local files under the webroot path /euromoan/path/to/file

$ php cli/tweets.php --dir=euromoan -g=euromoan/www/euromoan/ --grailbird-media  --media-prefix='/euromoan/' --debug
$ cd euromoan/www
$ php -S 127.0.0.1:9012

To Do

  • Reduce memory-usage!
  • Work and process other files in the twitter backup fileset, e.g. for Twitter Moments
  • Option to export/copy a tweet and all associated files
  • Option to write filtered tweets to a different file formats, e.g. CSV or HTML
  • Option to generate markdown .md files from tweets in subfolders, compatible with grav

Project History

This was written after browsing @mwichary/twitter-export-image-fill and reading about this issue:

"Twitter has two ways of getting an archive. One is the way you show. The second requires going to: Settings and privacy > Your Twitter data > Download your Twitter data > Download data