TBWatcher snapshots a Twitter profile page when given a URL (or an exported `.js` list from the official Twitter exporter).
It produces UTF-8 JSON text files and image snapshots of each Twitter post!
This script is intended purely for archival use.
- ⚡ Multi-threaded!
- 🗃️ Neatly stores metadata in JSON format for each specified Twitter profile.
- 📸 Snapshots tweets, thread replies, and responses.
- ♻️ Marks tweets that are potentially self-retweeted.
- 🚩 Removes tweet ads.
- 🖥️ Allows for manual login (use at your own risk).
```bash
# Install the requirements (once only).
python -m pip install -r requirements.txt

# Take a snapshot from a given profile URL.
python bin/watcher.py --url www.twitter.com/<profile>

# Take a snapshot of profile tweets and their replies.
python bin/watcher.py --url www.twitter.com/<profile> -d 2

# For more help, use:
python bin/watcher.py --help
```
Tested on Python 3.10.
TBWatcher generates the following in the `snapshots` folder (assuming `--depth 2`):
```
└───snapshots
    └───<user_id>                        # Username
        │   metadata.json                # Profile metadata
        │   profile.png                  # Snapshot of profile page
        │   tweets.json                  # Text of all tweets on the profile page
        │
        └───<prof_tweet_id_0>
            │   <prof_tweet_id_0>.png    # Snapshot
            │   tweets.json              # Responses to <prof_tweet_id_0>
            │
            ├───<response_tweet_id_0>
            │       <response_tweet_id_0>.png   # Snapshot
            │
            └───<response_tweet_id_1>
                    <response_tweet_id_1>.png   # Snapshot
```
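As a rough illustration of how this layout could be consumed after a run (a sketch only, not part of TBWatcher; it simply assumes the tree shown above):

```python
import json
from pathlib import Path

def collect_tweets(snapshots_dir="snapshots"):
    """Gather every tweets.json found anywhere under the snapshots folder."""
    all_tweets = []
    for tweets_file in Path(snapshots_dir).rglob("tweets.json"):
        with open(tweets_file, encoding="utf-8") as f:
            all_tweets.extend(json.load(f))
    return all_tweets

print(f"Collected {len(collect_tweets())} tweets across all snapshots")
```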
By default, multi-threading is enabled and the number of threads is proportional to the number of cores on your machine. Each thread spawns a unique window. Resist the urge to resize the windows, as that can mess up the renders; moving them around is fine. A sketch of this idea follows below.
If you run out of memory, consider lowering the number of threads.
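A minimal sketch of the "proportional to the number of cores" behaviour, with illustrative names rather than TBWatcher's actual internals:

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Assumption: one browser window per worker thread, capped by the CPU count.
MAX_THREADS = os.cpu_count() or 1

def snapshot_tweet(tweet_url):
    # Placeholder for the real work: open a browser window, render the
    # tweet at tweet_url, and save the screenshot plus its metadata.
    return tweet_url

def snapshot_all(tweet_urls, threads=MAX_THREADS):
    # Lower `threads` (cf. the -t / --multi-threading flag) if you run out
    # of memory, since each worker keeps its own browser window open.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(snapshot_tweet, tweet_urls))
```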
A self-boosted tweet is a tweet that its original author has retweeted.
These tweets are marked with `potential_boost` set to `true` in `tweets.json`.
The script detects them by matching exact metadata, i.e. duplicate posts.
All data is assumed to be UTF-8 compliant.
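A rough sketch of that duplicate-matching idea, using field names from the `tweets.json` format shown below (the script's exact matching logic may differ):

```python
from collections import Counter

def mark_potential_boosts(tweets):
    """Flag tweets whose (handle, tweet_text, timestamp) occur more than once."""
    keys = [(t["handle"], t["tweet_text"], t["timestamp"]) for t in tweets]
    counts = Counter(keys)
    for tweet, key in zip(tweets, keys):
        tweet["potential_boost"] = counts[key] > 1
    return tweets
```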
These files are what the Twitter exporter should generate (a `.js` file) for the users you are following:
```js
window.* = [
    {
        "following": {
            "accountId": <id>,
            "userLink": <url>
        }
        ...
    }
]
```
You can rename the file to `.json` or point the script at it via the input flags. The `window.* =` prefix is generated by Twitter by default and is removed automatically by the script; you can also remove it manually to parse the file as JSON directly.
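For example, a manual pre-processing step might strip the prefix like this (a sketch only, assuming the file begins with the `window.* =` assignment shown above):

```python
import json

def load_following_export(path):
    """Read a Twitter-exported .js file and return its contents as parsed JSON."""
    with open(path, encoding="utf-8") as f:
        raw = f.read()
    # Drop everything up to and including the first '=' (the "window.* =" prefix).
    _, _, json_part = raw.partition("=")
    return json.loads(json_part)
```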
`tweets.json` has the following format:

```
[
    {
        "id": int,
        "tag_text": str,
        "name": str,
        "handle": str,
        "timestamp": str,
        "tweet_text": str,
        "retweet_count": str,
        "like_count": str,
        "reply_count": str,
        "potential_boost": bool,
        "parent_id": str | null
    }
]
```
`id` is the index assigned by Twitter.
Invalid string entries will be marked as `"NULL"`.
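For instance, a consumer could load the file and filter on these fields like so (illustrative only; the path uses the placeholder from the layout above):

```python
import json

# <user_id> is a placeholder; substitute a real profile folder from snapshots/.
with open("snapshots/<user_id>/tweets.json", encoding="utf-8") as f:
    tweets = json.load(f)

boosts = [t for t in tweets if t["potential_boost"]]
replies = [t for t in tweets if t["parent_id"] is not None]
print(f"{len(boosts)} potential boosts, {len(replies)} replies")
```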
`metadata.json` has the following format:

```
{
    "bio": str,
    "name": str,
    "username": str,
    "location": str,
    "website": str,
    "join_date": str,
    "following": str,
    "followers": str
}
```
Invalid string entries will be marked as `"NULL"`.
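Because missing fields are written as the string `"NULL"` rather than JSON `null`, a consumer may want to normalise them, e.g.:

```python
import json

def load_profile_metadata(path):
    """Load metadata.json, mapping the "NULL" sentinel to Python None."""
    with open(path, encoding="utf-8") as f:
        metadata = json.load(f)
    return {key: (None if value == "NULL" else value) for key, value in metadata.items()}
```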
- `TBWatcher` terminates early?
  It is possible that your images are taking some time to load; consider using `-s` to adjust the load time.
  Alternatively, your scrolling height may be too low or too high. Consider using `--scroll-algorithm` to adjust the type of algorithm, then passing a value to it via `--scroll-value`.
  `--help` has more information on what `--scroll-value` encodes.
- `TBWatcher` does not scrape anything, or tweets are cut off?
  Try running with `--debug` and see if there are any "Unable to locate element" errors.
  If so, your render window size may be a bit too small. Under the hood we use Chrome to render tweets, which requires a browser window that is sufficiently large.
  Try modifying `--window-size` such that each tweet is rendered clearly.
- Out of memory issues?
  Each thread spawns a unique Chrome window. Try reducing the number of threads with `-t` / `--multi-threading`.
Interested in contributing? Take a look at our `CONTRIBUTING.md`.
- Support Running Multiple Sessions to Resume Per-Profile Fetching
- Save and Expand Post Attachments