/SQLExchange

Import Stack Exchange XML data dumps into any popular database

Primary LanguagePython

SQL Exchange

The Missing Stack Exchange Importer

Motivation

It's really awesome that Stack Exchange releases all their data to download, for free! If you just want to do easy SQL queries on Stack Overflow to see how close you are to Jon Skeet, you can do that at their Data Explorer.

But what if you want to do something that SQL can't? Like, real data mining?

You can download the latest data dump, but it's in XML so isn't easily queried.

What I needed was a way to import the XML into an SQL database. There are some options for importing to SQL Server, and some that are Windows-specific, but I wanted a platform-independent, database-independant solution.

So I wrote SQL Exchange.

Requirements

  • Python (2.6+, probably)
  • SQLAlchemy
  • A database that you can feed the output into. This has to be one SQLAlchemy supports. For the benefit of Googlebot, this includes MySQL, PostgreSQL, Oracle and others.
  • The relevant Python database adapter for SQLAlchemy to use. You'll need to get this up and running on your own.
  • Time... the import takes a long time. Even extracting the 7-zip archive for the main Stack Overflow export takes ages.

Usage

python overflow.py [SQLAlchemy connection string] [list of XML files on the command-line]

Examples:

python overflow.py mysql://root:root@127.0.0.1:8889/stackexchange?charset=utf8 badges.xml
python overflow.py *.xml

Make sure to add ?charset=utf8 to your connection string in the relevant cases. Stack Exchange data is in Unicode, and not opening a Unicode connection can result in weird encoding at the database end.

Compatibility

SQL Exchange is compatible with the database dumps of all Stack Exchange sites as of April 2012. It won't catch changes to the dump output, as the database schema has to be rebuilt with types that are removed when dumping to XML. It should continue to work (albeit missing any new elements) as long as the current schema is only added to, not changing what is currently outputted.

Limitations

  • Only imports badges, comments, posts and users. Other XML is not parsed yet.
  • Database passwords can be leaked to the command-line history; it would be better to read from environment variables or a file that can be locked from others.