Spatial data cooking show

This is a demo project and template repository showing how I use SpatiaLite with Datasette for quick spatial analysis.

Live show notes: https://docs.google.com/document/d/1z30ZCWGsR_jr6EuHKukJ4kTZQMBryrrEkIgo_5twZjo/edit

Why SQL, SQLite and Datasette

SQL is old. It's boring. It's widely adopted and not controlled by a company that might change it or shut it down next year.

SQLite is a database in a single file. It's fast and durable. It's easy, both in code and in my head, to make and delete and remake lots -- hundreds -- of little databases. They're just files.

There are a lot of shiny, brittle things out there. This is cast iron coding.

Libraries included

Run pipenv install to create a virtual environment and get the latest version of everything. I also recommend installing SpatiaLite.
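
On a Mac, that setup might look like this (assuming Homebrew; SpatiaLite package names vary by platform):

    pipenv install                 # install dependencies from the Pipfile
    brew install spatialite-tools  # SpatiaLite extension and tools; on Debian/Ubuntu: apt install libsqlite3-mod-spatialite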

You might also use ...

A few other libraries aren't included in the Pipfile but could be useful, depending on your project.

What should I use this for?

Over the past year, I've started using this set of libraries to quickly explore, reshape and reformat data for visualization in news projects. My workflow usually involves three phases: loading, exploration, export.

This stack isn't (yet) a replacement for R, GeoPandas or QGIS. Those are good tools, and if you're more comfortable there, I'm not here to talk you out of them.

This approach is, for me, faster than QGIS for most datasets. I can iterate more quickly on interesting subsets (I can write SQL faster than I can navigate a GUI), and I can see intermediate results, because each query generates a map.

Loading data

For self-contained datasets, I use geojson-to-sqlite. (For shapefiles, I usually transform data to GeoJSON first using ogr2ogr, but shapefile-to-sqlite is also available.)
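
As a sketch, with placeholder filenames, that loading step might look like this:

    # convert a shapefile to GeoJSON first, if needed
    ogr2ogr -f GeoJSON counties.geojson counties.shp

    # load the GeoJSON into a SpatiaLite geometry table
    pipenv run geojson-to-sqlite project.db counties counties.geojson --spatialite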

For tabular data, I use sqlite-utils.
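
A CSV import, with placeholder table and file names, looks like this:

    pipenv run sqlite-utils insert project.db places places.csv --csv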

Exploration

If you are new to Datasette, this tutorial is a good place to start.

Once you have data in SpatiaLite, every SQL query that includes a geometry column will include a map. (If you're using datasette-cluster-map, you'll also see a map whenever your query includes longitude and latitude columns.) Use queries to see different slices of the data.

Here's a map of counties in Massachusetts.
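
A query along these lines could produce that map; the counties table and its name, state and geometry columns are assumptions standing in for whatever you've loaded:

    select name, geometry
    from counties
    where state = 'MA';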

Using datasette-geojson-map lets you style map features using Mapbox's simplestyle-spec. GitHub also supports this for GeoJSON and TopoJSON files.
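
In practice, that means selecting simplestyle properties as columns alongside the geometry. A minimal sketch, using the same assumed counties table:

    select
      name,
      geometry,
      '#0066cc' as fill,       -- fill color, per simplestyle-spec
      0.3 as "fill-opacity"    -- transparency for the fill
    from counties;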

With sqlite-colorbrewer, it's possible to color features based on data.

Here are countries in Asia, in color.

Combine those, and we can start using Datasette for exploratory visualization:
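
For example, a query like this (the countries table and its columns are assumptions) gives each country a ColorBrewer fill color:

    select
      name,
      geometry,
      colorbrewer('Set2', 8, rowid % 8) as fill  -- cycle through an 8-class qualitative scheme
    from countries
    where continent = 'Asia';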

Export

Finally, let's get data out of exploration mode and into something more presentable.

This stack is built for reporting, not for storytelling.

For interactive maps, I recommend exporting a version of your data as GeoJSON and building static versions for presentation with Leaflet or Mapbox. For more advanced visualizations, look at d3-geo.

Datasette allows queries to run from the command line using the --get option, which will hit any Datasette URL and return the result. Combine this with canned queries for scripted exports. I usually put this in a Makefile:

combined.geojson: project.db
    pipenv run datasette project.db --get /project/combined.geojson \
     --load-extension spatialite \
     --setting sql_time_limit_ms 5000 \
     --setting max_returned_rows 20000 \
     -m metadata.yml > $@

That assumes I've loaded whatever I need into project.db and created a canned query called combined; the .geojson extension works because datasette-geojson handles GeoJSON output.
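
For reference, a minimal metadata.yml defining that canned query might look like this, written as a shell command so it's self-contained (the SQL is a placeholder):

    cat > metadata.yml <<'YAML'
    databases:
      project:
        queries:
          combined:
            sql: select name, geometry from counties
    YAML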

My goal in this step is to end up with a file that won't need any further filtering or processing once I pull it into a frontend tool.

Here are static versions of the examples above:

Datasets to test out