twittuh

twittuh is a Go program that loads a user's Twitter timeline using a headless Chrome browser and generates an RSS feed containing their tweets.

2021-04-08: I'm unlikely to put more effort into this, as it's becoming exceedingly hard to make the headless-Chrome approach work reliably on low-spec VPSes. I recommend looking at Nitter, which also provides RSS feeds of timelines.

I don't have (or want) a Twitter account, and I found myself repeatedly clicking on a dozen or so bookmarks to check for updates. It felt like I was in the year 2000. Thanks to this program, I can use an RSS reader to monitor these timelines, making me feel like I'm in the year 2005 instead.

This program originally scraped the plain-HTML "Legacy Twitter" pages that were served to old browsers, but Twitter shut down the interface on 2020-12-16. Now this program uses Chrome to construct the complete DOM (i.e. by executing JavaScript) and parses that instead.

Installation

To compile and install twittuh, run the following (after installing Go if you don't have it already):

$ go install

The twittuh executable will be installed to $GOPATH/bin (or $GOBIN if you've set it directly).

Headless Chrome is controlled by the chromedp package, but you must install Chrome manually. I was able to do this on a Debian buster amd64 server by running the following as the root user:

# wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
# dpkg -i google-chrome-stable_current_amd64.deb
# apt install -f

(The dpkg command will likely fail due to missing dependencies, which can be installed using the subsequent apt command.)

Usage

Usage: twittuh [flag]... <user> <file>
Creates an RSS feed from a Twitter user's timeline.
Pass '-' for <file> to write feed to stdout.
Flags:
  -browser-size string
        Browser viewport size (default "1024x8192")
  -cache-dir string
        Chrome cache directory
  -debug-chrome
        Log noisy Chrome debug messages
  -debug-file string
        HTML timeline file to parse for debugging
  -dump-dom
        Dump the timeline DOM to stdout for debugging
  -fetch-retries int
        Number of times to retry fetching
  -fetch-timeout int
        Fetch timeout in seconds
  -force
        Write feed even if there are no new tweets
  -format string
        Feed format to write ("atom", "json", "rss") (default "atom")
  -page-settle-delay int
        Seconds to wait for page render (default 2)
  -proxy string
        Optional proxy server (e.g. "socks5://localhost:9050")
  -replies
        Include the user's replies
  -serve string
        Listen for requests over HTTP (e.g. "0.0.0.0:8080")
  -show-sensitive
        Show sensitive content in tweets (default true)
  -show-sensitive-delay int
        Seconds to wait after showing sensitive content (default 2)
  -simplify
        Simplify HTML in feed (default true)
  -skip-users string
        Comma-separated users whose tweets should be skipped
  -tor-control string
        Interface for resetting Tor circuits after fetch fails (e.g. "0.0.0.0:9051")
  -tweet-timeout int
        Timeout for loading tweets in seconds
  -verbose
        Enable verbose logging

Tips

Tor

Twitter seems to haphazardly block unauthenticated timeline requests. When this happens, the timeline page itself (e.g. https://twitter.com/NWS) loads but the XHR to load the actual tweets fails. The page shows a Something went wrong. message and a Try again button.

I suspect that some cloud providers' networks are proactively blocked. Fortunately, it's easy to route requests through Tor.

On a Debian system, run the following as the root user to install Tor:

# apt install tor

You can then pass -proxy socks5://localhost:9050 to twittuh to tell it to instruct Chrome to use the Tor proxy.

Some Tor exit nodes also appear to be blocked. You can tell Tor to reset its circuits (likely resulting in a new exit IP) by sending a NEWNYM command to its control socket (see resetTorCircuits in main.go) or (allegedly) by sending a HUP signal to the tor process to tell it to reload its configuration.

Example script

The scrape_twitter.py.example file in this repository may be helpful if you want to run twittuh periodically via cron to monitor multiple timelines. The timeouts have been tweaked for a slow VPS that's using Tor. You'll want to edit the variables near the top of the file for your system and rename it to scrape_twitter.py.

Pay particular attention to the INTERVAL_SEC variable, which specifies the total amount of time allocated to each invocation of the script. If you want to check each timeline once every four hours, change INTERVAL_SEC to 4 * 3600 and add a line like the following to your crontab:

30 */4 * * * /path/to/scrape_twitter.py

Docker

Docker can be used to run twittuh -serve in a container. The Dockerfile in this repository builds a container image that runs an instance of twittuh listening for HTTP GET requests on port 8080. Tor is also installed. The HTTP endpoint accepts user, format, and skipUsers query parameters corresponding to the similarly-named flags. It returns a 401 error if the user has restricted their tweets to followers (i.e. "These Tweets are protected").

When executed in this directory, the following command uses Cloud Build to build a container and submit it to the Container Registry.

$ gcloud --project ${PROJECT_ID} builds submit \
    --tag gcr.io/${PROJECT_ID}/twittuh

After updating the container image, you can run a command like the following to make a Compute Engine instance reload it:

$ gcloud --project ${PROJECT_ID} compute instances update-container \
    ${GCE_INSTANCE} --container-image gcr.io/${PROJECT_ID}/twittuh