Disclaimer: This project is not affiliated with Stack Exchange, Inc.
This section contains background on why this project exists. If you already know the story and/or don't care, feel free to skip to the next section.
In June 2023, Stack Exchange briefly cancelled the data dump, and backpedalled after a significant amount of backlash from the community. The status quo of uploads to archive.org was restored. In the slightly more than a year between June 2023 and July 2024, it looked like they were staying off that path. Notably, they made a reasonable shift in the exact dates involved in the upload to deal with archive.org being slow, and in December 2023, they announced a delay in the upload, likely to avoid speculation that another cancellation was happening.
We appeared to be out of the woods. But this repo wouldn't exist if that was the case now, would it?
In July 2024, Stack Exchange announced the first restrictions on the data dump, by moving it in-house and actively discouraging archive.org reuploads, likely in violation of the CC-By-SA license. The current revision can be read here.
Here's what's happening:
- SE is moving the data dump from archive.org to their own infrastructure
- They're discontinuing the archive.org dump, which makes it significantly harder to archive the data if SE, for example, were to go out of business
- In addition to discontinuing the archive.org dump, they're imposing significant restrictions on the data dump
- They're doing the first revision without any way to download the entire data dump in one click, a drastic quality-of-life reduction from the current situation.
This is an opinionated summary of the reason why: SE wants to capitalise on AI companies that need training data, and has decided that the community doesn't matter in that process. The current process, while not nearly as restrictive as revisions 1 and 2, is a symptom of precisely one thing: Stack Exchange doesn't care about its users, but rather cares about finding new ways to profit off user data.
Stack Exchange, Inc. is now the single biggest threat to the community, and to the platform's user-generated and permissively-licensed content that the community has spent countless hours creating precisely because the data is public.
That is why this project exists: it's meant to automate the data dump download process for non-commercial, license-compliant use, since Stack Exchange, Inc. couldn't be bothered to add a "download all" button from day 1.
As an added bonus, since this project already exists, there's an accompanying system to automatically convert the data dump to other formats. In my experience, the vast majority of applications building on the data dump do not work directly with the XML; other, more convenient data formats are often created as an intermediate. Aside from using it as an intermediate for various forms of analysis, there are a couple of major examples of other distribution forms that are listed later in this README.
While these are preprocessed distributions of the data dump, this project is also meant to help with converting to these various formats. While it's unlikely to replace the source code for either of those two examples, I hope the transformer system here can get rid of boilerplate for other projects.
A different project is currently maintaining a list of both the source data dumps (XML) and other distributions. It includes historical versions of the data dump as well as new versions uploaded under the new anti-community scheme.
Note that since someone is uploading an unofficial version to archive.org, you may not need to use the downloader at all. However, to make sure this access continues, I strongly encourage you to download directly from SE anyway if you can -- this helps decrease the chance the uploader is identified and blocked by SE, which will turn into a problem for archival efforts in the long term. It may also decrease the chances SE points to low usage numbers as an excuse to axe the data dump entirely.[^1]
This list contains converter tools that work on all sites and all tables.
Maintainer | Format(s) | First-party torrent available | Converter |
---|---|---|---|
Maxwell175 | SQLite, Postgres, MSSQL | Partially[^2] | AGPL-3.0 |
For completeness (well, sort of; none of these lists are exhaustive), here's a list of incomplete archives (archives that limit the number of included tables and/or sites):
Maintainer | Format | Torrent available | Converter | Site(s) | Tables |
---|---|---|---|---|---|
Brent Ozar | MSSQL | Yes | MIT-licensed | Stack Overflow only | All tables |
Jason Punyon | SQLite | No | Closed-source[^3] | All sites | Posts only |
Note that it's strongly encouraged that you use a venv. To set one up, run `python3 -m venv env`. After that, you'll need to activate it with one of the activation scripts; run the appropriate one for your operating system. If you're not sure what the scripts are called, you can find them in `./env/bin`.
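For reference, the typical activation commands look like this (exact script names can vary depending on your shell; on Windows, venvs use a `Scripts` directory instead of `bin`):

```
# Linux/macOS (bash/zsh)
source env/bin/activate

# Windows (PowerShell)
.\env\Scripts\Activate.ps1
```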
- Python 3.10 or newer[^4]
- The Python dependencies, installed with `pip3 install -r requirements.txt`
- Lots of storage. The 2024Q1 data dump was 92GB compressed.
- A display you can access somehow (physical or virtual, but you need to be able to see it) to be able to solve captchas
- Email and password login for Stack Exchange - Google, Facebook, GitHub, and other login methods are not supported, and will not be supported.
- If you don't have this, see this meta question for instructions.
- Firefox installed
  - Snap and flatpak users may run into problems; it's strongly recommended to have a non-snap/flatpak installation of Firefox and Geckodriver.
  - Known errors:
    - "The geckodriver version may not be compatible with the detected firefox version" - update Firefox and Geckodriver. If this still doesn't work, consider switching to a non-snap installation of Firefox and Geckodriver.
    - "Your Firefox profile cannot be loaded" - one of Geckodriver or Firefox is Snap-based while the other is not. Consider switching to a non-snap installation of Firefox, or verify that your PATH is set correctly.
- Geckodriver available; manually installing it shouldn't normally be necessary, since it's often bundled with Firefox in one way or another. If you do need to install it manually, the binaries are on GitHub. The same snap/flatpak caveats and known errors as for Firefox apply.
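If you're not sure whether your Firefox or Geckodriver comes from snap, a quick check on Linux might look like this (assuming `which` and `snap` are available; snap-managed binaries typically resolve to `/snap/bin`):

```
which firefox geckodriver
snap list 2>/dev/null | grep -iE 'firefox|geckodriver'
```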
The downloader does not support Docker due to the display requirement.
- Make sure you have all the requirements from the Requirements section.
- Copy `config.example.json` to `config.json`.
- Open `config.json`, and edit in the values. The values are described within the JSON file itself.
- Run the extractor with `python3 -m sedd`. If you're on Windows, you may need to run `python -m sedd` instead.
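On a UNIX-like system, the whole setup might look roughly like this (a sketch; adjust paths and use your editor of choice):

```
python3 -m venv env
source env/bin/activate
pip3 install -r requirements.txt

cp config.example.json config.json
# edit config.json, then:
python3 -m sedd
```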
The extractor CLI supports the following configuration options:
Short | Long | Type | Default | Description |
---|---|---|---|---|
`-o` | `--outputDir <path>` | Optional | `<cwd>/downloads` | Specifies the directory to download the archives to. |
`-k` | `--keep-consent` | Optional | `false` | Whether to keep OneTrust's consent dialog. If set, you are responsible for getting rid of it yourself (uBlock can handle that for you too). |
`-s` | `--skip-loaded <path>` | Optional | - | Whether to skip over archives that have already been downloaded. An archive is considered downloaded if the output directory already has one and the file is not empty. |
- | `--dry-run` | Optional | - | Whether to actually download the archives. If set, only traverses the network's sites. |
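For illustration, a dry run against a custom output directory could look like this (the directory path is just a placeholder):

```
python3 -m sedd --outputDir /path/to/downloads --dry-run
```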
This software is designed around Selenium, a browser automation tool. This does, however, mean that the program can be stopped by various bot defenses. That could happen even if you were downloading all ~183 data dumps fully by hand, because it's a lot of repeated operations.
This is where notification systems come in; expecting you to sit and watch for potentially a significant number of hours is not a good use of time. If anything happens, you'll be notified, so you don't have to continuously watch the program. Currently, only a native desktop notifier is supported, but support for other notifiers may be added in the future.
As of Q1 2024, the data dump was a casual 93GB in compressed size. If you have your own system to transform the data dump after downloading, you only need to worry about the raw size of the data dump.
However, if you use the built-in transformer pipeline, you should expect significantly higher disk usage.
When dealing with a file-based transformer, the output is, by default, compressed back into a .7z. Due to this, an intermediate file write is performed prior to compression. At runtime, you need:
- The compressed data dump; at least 92GB and increasing with each dump
- The compressed converted data dump; depending on compression rates for the specific format, this is anywhere from a little less than the original size to significantly larger
- A significant amount of space for intermediate files. While these are deleted as soon as they've been compressed, they'll take up a significant amount of disk space in the meantime
Note that the transformer pipeline is executed separately; see the transformer section below.
One of the major downsides with the way this project functions is that it's subject to Cloudflare bullshit. This means that the total time to download is `(combined size of data dumps) / (internet speed) + (rate limiting) + (navigation overhead) + (time to solve captchas)`. While navigation overhead and rate limiting (hopefully) don't account for a significant share of the time, the combined overhead can potentially be significant. It's certainly a slower option than archive.org's torrent.
Once you've downloaded the data dumps, you may want to transform them into a more usable format than the one the data dump offers by default. This is where the transformer component comes in.
This section assumes you have Docker installed, with docker-compose-v2.
From the root directory, run:

```
docker compose up
```

This automatically binds `downloads` and `out` in the current working directory to the docker container. If you want to change these paths, you'll need to edit `docker-compose.yml` manually for now.
Additionally, the following environment variables are defined and forwarded to the build:
- `SEDD_OUTPUT_TYPE`: Any output type supported by the program. These are: `json`, `sqlite`.
- `SPDLOG_LEVEL`: Sets the logging level. Usually not necessary unless you want verbose output, or you're trying to debug something.
If you have a UNIX shell (i.e. not cmd or PowerShell; Windows users can use Git Bash), you can run:

```
SEDD_OUTPUT_TYPE=sqlite docker compose up
```

If you want to rebuild the container, pass the `--build` flag to the docker command.
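For example, forcing a rebuild while selecting the SQLite output would look like this:

```
SEDD_OUTPUT_TYPE=sqlite docker compose up --build
```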
If you insist on using cmd or PowerShell instead of a good shell, setting the variables is left as an exercise to the reader.
- C++20 compiler
- CMake 3.10 or newer
- Linux-specific (TEMPORARY): `libtbb-dev`, or the equivalent on your favourite distro. Optional, but required for multithreaded support under libstdc++ (see the example just below for Debian-based systems)
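On Debian or Ubuntu, for instance, that dependency can be installed like this (package names differ on other distros):

```
sudo apt install libtbb-dev
```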
Other dependencies (stc, libarchive, spdlog, and pugixml) are automatically handled by CMake using FetchContent. Unlike the downloader, this component can run without a display.
TL;DR:
```
cd transformer
mkdir build
cd build
# Option 1: debug:
cmake .. -DCMAKE_BUILD_TYPE=Debug
# Option 2: release mode; strongly recommended for anything that needs the performance:
cmake .. -DCMAKE_BUILD_TYPE=Release
# ---
# Replace 8 with the number of cores/threads you have
cmake --build . -j 8
# Note: this only works after running the Python downloader
# For early testing, I've been populating this folder with
# files from the old archive.org data dump.
# The last argument is the path to the downloaded data
# *UNIX:
./sedd-transformer -i ../../downloads -t [formatter type]
# Windows
.\sedd-transformer.exe -i ..\..\downloads -t [formatter type]
```
Pass `--help` to see the available formatters for your current version of the data dump transformer.
Currently, the following transformers are supported:
- `json`
- `sqlite`
  - Note: All data related to a site is merged into a single database
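For example, converting the downloaded dumps to SQLite on a UNIX-like system (run from the `build` directory created above):

```
./sedd-transformer -i ../../downloads -t sqlite
```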
While I really didn't want to split the system over two programming languages, this is unfortunately the best way to go about it.
C++ does not really support Selenium, which is effectively a requirement for the download process. There are bindings, but all of them appear to be out-of-date, and I don't feel like writing an entire system for Selenium myself.
Python, on the other hand, infuriatingly doesn't support 7z streaming, at least not in a convenient form. There's the `libarchive` package, but it refuses to build. `python-libarchive` allegedly does build, but Windows support is flaky, so the transformer might've had to be separated from the downloader anyway. There's py7zr, which does work everywhere, but it doesn't support 7z streaming.
7z and XML streaming are both critical for the processing pipeline. If you plan to convert the entire data dump, you'll eventually run into `stackoverflow.com-PostHistory.7z`, which is 39GB compressed and 181GB uncompressed in the 2024 Q1 data dump. As time passes, this will likely continue to grow, and the absurd amount of RAM required to just tank the full uncompressed size is barely available even on very high-end modern hardware. Hardware able to tank that is going to be out of reach for the vast majority of people.
Consequently, direct `libarchive` support is beneficial, and rather than writing an entirely new Python wrapper (or taking over an existing one), it's easier to just write that part in C++. Also, since it might be easier to run this particular part in a Docker container to avoid downloading build tools on certain systems, having it be fully headless is an advantage.
On the bright side, this should mean faster processing compared to Python.
The code is under the MIT license; see the `LICENSE` file.
The data downloaded and produced is under various versions of CC-By-SA, as per Stack Exchange's licensing rules, in addition to whatever extra rules they try to impose on the data dump.
[^1]: There's no guarantee the data dump will continue existing anymore - removing as many justifications to axe the data dump as possible may become increasingly important at some point. Unfortunately, if it is, we won't find out until it's too late, by seeing the data dump get axed.
[^2]: Only Postgres at the time of writing, with more planned.
[^3]: I've been unable to find the generator code, but I've also been unable to find a statement confirming that it's closed-source. It's possible it is open-source, but if it is, it's hard to find the source.
[^4]: Might work with earlier versions, but these are untested and not supported.