/taupe

Taupe takes a downloaded Twitter archive ZIP file, extracts the URLs corresponding to tweets, retweets, replies, quote tweets, and liked tweets, and outputs the results in a comma-separated values (CSV) format that you can use with other software tools.

Primary LanguagePythonMIT LicenseMIT

Taupe

A simple program to extract the URLs of your tweets, retweets, replies, quote tweets, and "likes" from a personal Twitter archive.

License Latest release

Table of contents

Introduction

When you download your personal Twitter archive, you receive a ZIP file. The contents are not necessarily in a format convenient for doing something with them. For example, you may want to send the URLs to the Wayback Machine at the Internet Archive or do something else with the URLs. For tasks like that, you need to extract URLs from your Twitter archive. That's the purpose of Taupe.

Taupe (a loose acronym of Twitter archive URL parser) takes a Twitter archive ZIP file, extracts the URLs corresponding to your tweets, retweets, replies, quote tweets, and liked tweets, and outputs the results in a comma-separated values (CSV) format that you can easily use with other software tools. Once you have installed it, using taupe is easy:

# Extract tweets, retweets, replies, and quote tweets:
taupe /path/to/your/twitter-archive.zip

# Extract likes:
taupe --extract likes /path/to/your/twitter-archive.zip

# Learn more:
taupe --help

Installation

There are multiple ways of installing Taupe. Please choose the alternative that suits you.

Alternative 1: installing Taupe using pipx

Pipx lets you install Python programs in a way that isolates Python dependencies, and yet the resulting taupe command can be run from any shell and directory – like any normal program on your computer. If you use pipx on your system, you can install Taupe with the following command:

pipx install taupe

Pipx can also let you run Taupe directly using pipx run taupe, although in that case, you must always prefix every Taupe command with pipx run. Consult the documentation for pipx run for more information.

Alternative 2: installing Taupe using pip

You should be able to install taupe with pip for Python 3. To install taupe from the Python package repository (PyPI), run the following command:

python3 -m pip install taupe

As an alternative to getting it from PyPI, you can use pip to install taupe directly from GitHub:

python3 -m pip install git+https://github.com/mhucka/taupe.git

If you already installed Taupe once before, and want to update to the latest version, add --upgrade to the end of either command line above.

Alternative 3: installing Taupe from sources

If you prefer to install Taupe directly from the source code, you can do that too. To get a copy of the files, you can clone the GitHub repository:

git clone https://github.com/mhucka/taupe

Alternatively, you can download the software source files as a ZIP archive directly from your browser using this link: https://github.com/mhucka/taupe/archive/refs/heads/main.zip

Next, after getting a copy of the files, run setup.py inside the code directory:

cd taupe
python3 setup.py install

Usage

If the installation process described above is successful, you should end up with a program named taupe in a location where software is normally installed on your computer. Running taupe should be as simple as running any other command-line program. For example, the following command should print a helpful message to your terminal:

taupe --help

If not given the option --help or --version, this program expects to be given a personal Twitter archive file, either on the command line (as an argument) or on standard input (from a pipe or file redirection). Here's an example (and note this path is fake – substitute a real path on your computer when you do this!):

taupe /path/to/twitter-archive.zip

The URLs produced by taupe will be, by default, as they appear in the archive. If you want to normalize the URLs into the canonical form https://twitter.com/twitter/status/TWEETID, use the option --canonical-urls (-c for short):

taupe -c /path/to/twitter-archive.zip

The structure of the output

The option --extract controls both the content and the format of the output. The following options are recognized:

Value Synonym Output
all-tweets tweets CSV table with all tweets and details (default)
my-tweets list of URLs of only your original tweets
retweets list of URLs of tweets that are retweets
quoted-tweets quote-tweets list of URLs of other tweets you quoted
replied-tweets reply-tweets list of URLs of other tweets you replied to
liked likes list of URLs of tweets you "liked"

all-tweets

When using --extract all-tweets (the default), taupe produces a table with four columns. Each row of the table corresponds to a type of event in the Twitter timeline: a tweet, a retweet, a reply to another tweet, or a quote tweet. The values in the columns provide details about the event. The following is a summary of the structure:

Column 1 Column 2 Column 3 Column 4
tweet timestamp in ISO format The URL of the tweet The type; one of tweet, reply, retweet, or quote (For type reply or quote.) The URL of the original or source tweet

The last column only has a value for replies and quote-tweets; in those cases, the URL in the column refers to the tweet being replied to or the tweet being quoted. The fourth column does not have a value for retweets even though it would be desirable, because the Twitter archive – strangely – does not provide the URLs of retweeted tweets.

Here is an example of the output:

2022-09-21T22:36:29+00:00,https://twitter.com/mhucka/status/1572716422857658368,quote,https://twitter.com/poppy_northcutt/status/1572714310077673472
2022-10-10T22:04:20+00:00,https://twitter.com/mhucka/status/1579593701965582336,reply,https://twitter.com/arfon/status/1579572453726355456
2022-10-14T04:17:01+00:00,https://twitter.com/mhucka/status/1580774654217625600,tweet
2022-10-25T14:49:06+00:00,https://twitter.com/mhucka/status/1584919989307715586,retweet
...

my-tweets

When using --extract my-tweets, the output is just a single column (a list) of URLs, one per line, of just your original tweets. This list corresponds exactly to column 2 in the --extract all-tweets case above.

retweets

When using --extract retweets, the output is a single column (a list) of URLs, one per line, of tweets that are retweets of other tweets. This list corresponds to the values of column 2 above when the type is retweet. Important: the Twitter archive does not contain the original tweet's URL, only the URL of your retweet. Consequently, the output for --extract retweets is your retweet's URL, not the URL of the source tweet.

quoted-tweets

When using --extract quoted-tweets, the output is a list of the URLs of other tweets that you have quoted. It corresponds to the subset of column 4 values above when the type is "quote". Note that these are the source tweet URLs, not the URLs of your tweets.

replied-tweets

When using --extract replied-tweets, the output is a list of the URLs of other tweets that you have replied to. It corresponds to the subset of column 4 values above when the type is "reply". Note that these are the source tweet URLs, not the URLs of your tweets.

likes

When using the option --extract likes, the output will only contain one column: the URLs of the "liked" tweets. taupe cannot provide more detail because the Twitter archive format does not contain date/time information for "likes". (This is also why "likes" are not part of the output when --extract all-tweets is used – there is no possible value for column 1.)

Here is an example of the output when using --extract likes in combination with --canonical-urls:

https://twitter.com/twitter/status/1588146224376463365
https://twitter.com/twitter/status/1588349144803905536
https://twitter.com/twitter/status/1590475356976578560
...

Other options recognized by taupe

Running taupe with the option --help will make it print help text and exit without doing anything else.

The option --output controls where taupe writes the output. If the value given to --output is - (a single dash), the output is written to the terminal (stdout). Otherwise, the value must be a file.

If given the --version option, this program will print its version and other information, and exit without doing anything else.

If given the --debug argument, taupe will output a detailed trace of what it is doing. The debug trace will be sent to the given destination, which can be - to indicate console output, or a file path to send the debug output to a file.

Summary of command-line options

The following table summarizes all the command line options available.

Short      Long form opt   Meaning Default
-c --canonical-urls Normalize Twitter URLs Leave as-is
-h --help Print help info and exit
-e E --extract E Extract URL type E all-tweets
-o O --output O Write output to file O Terminal
-V --version Print program version & exit
-@ OUT --debug OUT Write debug output to OUT

⚑   Recognized values: all-tweets, tweets, my-tweets, retweets, quoted-tweets, replied-tweets, and likes. See section above for more information.
✦   To write to the console, you can also use the character - as the value of O; otherwise, O must be the name of a file where the output should be written.
⚐   To write to the console, use the character - as the value of OUT; otherwise, OUT must be the name of a file where the output should be written.

Known issues and limitations

This program assumes that the Twitter archive ZIP file is in the format which Twitter produced in mid-November 2022. Twitter probably used a different format in the past, and may change the format again in the future, so taupe may or may not work on Twitter archives obtained in different historical periods.

The Twitter archive format for "likes" contains only the tweet identifier and the text of the tweet; consequently, taupe cannot provide date/time information for this case.

This program does all its work in memory, which means that taupe's ability to process a given archive depends on its size and how much RAM the computer has. It has only been tested with modest-sized archives. It is unknown how it will behave with exceptionally large archives.

Relationships to other similar tools

To the author's knowledge, Taupe is the only tool that will directly and easily extract the URLs of tweets and "likes" from a Twitter archive ZIP file. There do exist other software tools for working with Twitter archives; the following is a (possibly incomplete) list:

Getting help

If you find a problem or have a request or suggestion, please submit it in the GitHub issue tracker for this repository.

Contributing

I would be happy to receive your help and participation if you are interested. Everyone is asked to read and respect the code of conduct when participating in this project. Please feel free to report issues or do a pull request to fix bugs or add new features.

License

This software is Copyright (C) 2022, by Michael Hucka. This software is freely distributed under the MIT license. Please see the LICENSE file for more information.

Acknowledgments

This work is a personal project developed by the author, using computing equipment owned by the California Institute of Technology Library.

The vector artwork of a bird, used as the icon for this repository, was created by Noe Araujo from the Noun Project. It is licensed under the Creative Commons CC-BY 3.0 license. I manually changed the color to be a shade of taupe.

Taupe uses multiple other open-source packages, without which it would have taken much longer to write the software. I want to acknowledge this debt. In alphabetical order, the packages are:

  • Aenum – Python package for advanced enumerations
  • CommonPy – a collection of commonly-useful Python functions
  • Plac – a command line argument parser
  • Rich – library for writing styled text to the terminal
  • Sidetrack – simple debug logging/tracing package
  • Twine – utilities for publishing Python packages on PyPI