"Read it" /ɹɛd ɪt/ is a tool with no package dependencies that extracts Pushshift.io data archives from The Eye Reddit Archives into HTML files that are
- redistributable: can be copied elsewhere as a single HTML file.
- lightweight: only needed data is embeded, which is text and OP images links.
- compatible: no JS needed to render the files, they can be viewed on pretty much anything that is capable of rendering HTML
- nodejs v16 or higher (how to install)
- any zstd-compatible app that can unpack
zstd
files:- windows: look for
zstd-vx.y.z-win64.zip
orzstd-vx.y.z-win32.zip
in the latest release's "Assets" section at the official zstd releases page - OS X: install
zstd
using homebrew, then useunzstd filename.zst
- windows: look for
Clone the repo via git clone
or just download the ZIP archive and unpack it. All commands and files are meant to happen inside that directory.
No usual installation is required. No yarn
, no npm install
, just download and it's ready to go.
We gonna use r/Permaculture as an example. After downloading both files from The Eye, you'll end up with two files:
$ ls
Permaculture_comments.zst Permaculture_submissions.zst
Unpack them:
$ unzstd Permaculture_comments.zst
Permaculture_comments.zst: 425907510 bytes
$ unzstd Permaculture_submissions.zst
Permaculture_submissions.zst: 97972912 bytes
You'll end up with two more files: Permaculture_comments
and Permaculture_submissions
Keep in mind: readit creates cache
subfolder where it stores indexed comments data. For a 406 MB comments file, cache
used 453 MB of disk storage (OS X). Please be aware of that disk space requirement when processing huge subreddits.
Now we can run readit to process these files into readable HTML (don't forget to provide your sub name):
$ node readit.js Permaculture
Prebuilding comments cache (each dot is 1k comments processed)
...............................................................................................................................................................................................................................................................................................................................................
Processing submissions (each dot is 100 posts)
................................................................................................................................................................................................................................................................................................
Once readit starts the second phase (Processing submissions
) you can see that out
directory is being populated with HTML files. Those are ready to be used or distributed right away.
HTML files are pretty simple, yet some of the reddit's markdown is supported.
If you plan on extracting another sub, please manually delete the cache
and out
directories beforehand.
- a lil bit more (selectable) styles in output HTML
- download from The Eye & unpack
- search/partial export
- better UX with cleaning up and cache
- word cloud index so yo ucan explore topics you never knew about
Ope!