Roadmap discussion for porting Axis app to MongoDB
As I started working on this milestone, it seemed like a good moment to think about the big picture of how the code pieces fit together. The goal of this milestone is to run the “Axis” app using live data from MongoDB, rather than data cached within the app. Currently, the data cached in the app is structured as a `geojson` object, with each record containing a building shape and metadata, along with nested arrays containing violation data and 311 data. Going forward, however, the schema for data in MongoDB is more like a relational database: there is a buildings collection containing shapes and metadata, and a violations collection containing a “bldg_id” field to link it to the buildings collection data. Additional collections will be added that will link to the buildings collection data.
The code will have to be reworked, because the “Axis” app will have to handle source data in a different format. Below, I give a high-level description of the code as it currently works, followed by the changes I believe are needed to migrate the Axis app to MongoDB.
Current Code
Batch by Census Tract
Multiple steps within the code use batching, because memory use becomes prohibitive when computing on more than a few thousand buildings at a time. Batching is always done the same way: census tract shapefiles are imported, and all computing tasks are executed in a loop, with one pass through the loop handling all the buildings in a single tract (there are typically no more than about 2,000 buildings per tract). The result of each pass is cached on disk and later combined.
Whenever “batching” is mentioned below, it refers to this method. Batching is intended to be run at regular intervals by automated scripts; no batching happens during use of the Axis app or WindyGrid.
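To make the pattern concrete, here is a minimal sketch of the batch-cache-combine loop described above. This is not the project's actual code (which is in R); the tract IDs, file names, and the `process_tract` placeholder are all illustrative assumptions.

```python
import json
from pathlib import Path

CACHE_DIR = Path("cache")

def process_tract(tract_id, buildings):
    """Placeholder for the per-tract computation (overlays, joins, etc.)."""
    return [{"tract": tract_id, **b} for b in buildings]

def batch_by_tract(buildings_by_tract):
    """Run the computation one census tract at a time, caching each result
    to disk so memory use stays bounded and a failed run can resume."""
    CACHE_DIR.mkdir(exist_ok=True)
    for tract_id, buildings in buildings_by_tract.items():
        result = process_tract(tract_id, buildings)
        (CACHE_DIR / f"tract_{tract_id}.json").write_text(json.dumps(result))

def combine_cached():
    """Combine the per-tract cache files into one dataset."""
    combined = []
    for path in sorted(CACHE_DIR.glob("tract_*.json")):
        combined.extend(json.loads(path.read_text()))
    return combined
```

The key design point is that each pass touches only one tract's worth of buildings, so peak memory is bounded by the largest tract rather than the whole city.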
Obtain and Batch Incoming Data
Data is manually imported from the City and County’s Data Portals—building shapes, violations, 311 records, tax parcel shapes, tax delinquencies, and census tracts. The data is then batched and cached to disk according to census tract number.
Correct Overlap of Boundary Lines
The building shapes from the City and tax parcel shapes from the County overlap by inches in ways that are clearly errors. The overlaps are systemic and are not biased in a way that an adjustment of the projection or shape data could fix. To fix this problem, the code calculates the ratio of overlapping areas to non-overlapping areas, and uses a reasonable threshold to eliminate overlapping areas that are highly likely to be errors.
This step is done with batching.
Combine Building Metadata with Tax Parcel Metadata
With the overlap errors out of the way, the building and tax parcel data are spatially overlaid so that tax parcel metadata can be attached to the building it belongs to. The merged data forms the initial version of the `json` file. The tax parcel “PIN” is the first nested array within each building record, because there can be more than one tax parcel within one building. Also, multiple buildings can have the same PIN as an attribute, because a tax parcel can span more than one building.
This step is done with batching.
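The nested structure described above can be sketched as follows. The field names (`bldg_id`, `parcels`, `PIN`) and PIN values are illustrative assumptions; the point is the many-to-many relationship: one building can hold several PINs, and one PIN can span several buildings.

```python
def attach_parcels(buildings, parcels_by_bldg):
    """Attach tax parcel metadata to each building as a nested array.

    parcels_by_bldg maps a building ID to the list of parcels that
    spatially overlay that building after the overlap correction step.
    """
    for b in buildings:
        b["parcels"] = parcels_by_bldg.get(b["bldg_id"], [])
    return buildings
```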
Assign a Building ID to Other Building Data
Each individual record within the other datasets is assigned a building ID, which is then used to join the records with the buildings. As a result of the join, additional nested arrays of data are added to the `json` file.
At each pass of the algorithm, the matching method is only attempted for those records that have not yet been assigned a building ID.
- Attempt to match by address, including ranges. For example, any odd address between 301 and 333 S State St would match the building known as “301-333 S State St”.
- Attempt to match by proximity to a building. The geocoded location of each record is used to match it to the closest building.
This step is done with batching.
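A hedged sketch of the address-range matching step, following the “301-333 S State St” example above. The parsing is deliberately simplified (hyphenated house-number ranges, odd/even parity by side of street); the real code would have to handle much messier address strings.

```python
import re

def parse_building_address(addr):
    """Parse strings like '301-333 S State St' into (lo, hi, street)."""
    m = re.match(r"(\d+)(?:-(\d+))?\s+(.*)", addr)
    lo = int(m.group(1))
    hi = int(m.group(2)) if m.group(2) else lo
    return lo, hi, m.group(3).upper()

def match_by_address(record_addr, buildings):
    """Return the bldg_id whose address range covers the record's address,
    requiring a matching street name and house-number parity (odd/even)."""
    m = re.match(r"(\d+)\s+(.*)", record_addr)
    num, street = int(m.group(1)), m.group(2).upper()
    for b in buildings:
        lo, hi, b_street = parse_building_address(b["address"])
        if street == b_street and lo <= num <= hi and num % 2 == lo % 2:
            return b["bldg_id"]
    return None  # unmatched records fall through to the proximity match
```

Records left unmatched by this pass would then go to the proximity-based match described above, since each pass only attempts records without a building ID.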
Combine Cached Files and Format Buildings.json
The cached files (one for each census tract) are combined, and the data is further cleaned, formatted, and converted from `geojson` to a regular `json` file that contains the shape data. Time-based summary statistics are calculated and added to the `json` file (total in the last 90 days, total in the last year, etc.).
This step is done with batching.
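The time-based summary statistics amount to counts over trailing windows. A minimal sketch, where the field names and window lengths are assumptions based on the “last 90 days / last year” examples above:

```python
from datetime import date, timedelta

def window_totals(event_dates, today):
    """Count events (e.g. violations or 311 calls) falling within trailing
    90-day and 365-day windows ending at `today`."""
    return {
        "total_last_90_days": sum(
            1 for d in event_dates if today - timedelta(days=90) <= d <= today
        ),
        "total_last_year": sum(
            1 for d in event_dates if today - timedelta(days=365) <= d <= today
        ),
    }
```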
Visualize with Leaflet
The final `geojson` record (from just prior to formatting as a regular `json` file) is used to power the “Axis” app. Axis is a prototype of a map view of building data, displaying a selection of buildings and conveying summary statistics on violations, 311 calls, and tax delinquencies. Axis should also include an option to query the data, especially as a way to experiment with different visualization techniques.
Also, the Leaflet map is currently a shareable HTML file with an embedded interactive tool, but it can also be set up as a Shiny application so that it can be accessed via URL.
No batching is done for this step; however, the data that powers the app is limited to a single census tract.
Proposed Code
Batch by Census Tract
This will be the same.
Obtain and Batch Incoming Data
This will be the same.
Correct Overlap of Boundary Lines
This will be the same.
Combine Building Metadata with Tax Parcel Metadata
This will be the same.
Assign a Building ID to Other Building Data
The assignment process stays the same, but this is where the code will need to be reworked. Currently, the building ID assignment process is intertwined with the creation of the large `json` file. Instead, the new process will produce separate tables linked by the building ID.
Combine Cached Files and Insert to MongoDB
The cached files will be combined and inserted into the separate collections in MongoDB, rather than into the buildings.json file. Additionally, time-based summary statistics will not be created here; instead, they will be generated live from the Axis app through R script calls to MongoDB.
Insertions to MongoDB have already been done for buildings and violation records. The insertions were done with regular `json` files containing geo-data, rather than `geojson` files. A decision will have to be made about whether to use `geojson` files in MongoDB, or regular `json` files. `geojson` will be easier to use with the Axis app, but might be harder to insert into MongoDB. Mongolite (the R driver for MongoDB) does not support `geojson`, but it could work in conjunction with geojsonio or geojsonR. More work needs to be done to test out solutions.
This step is done with batching.
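One option worth sketching is flattening a `geojson` FeatureCollection into one MongoDB document per building, keeping the GeoJSON geometry intact as a nested field so that MongoDB's 2dsphere index can be used for geospatial queries. This is only a sketch of the document shape, not the project's actual insert code; the collection and field names are assumptions, and the insert itself (via mongolite in R, or pymongo in Python) is shown only as a comment so the example stays self-contained.

```python
def features_to_documents(feature_collection):
    """One document per feature: properties flattened to the top level,
    the GeoJSON geometry kept as-is in a nested 'geometry' field."""
    docs = []
    for feature in feature_collection["features"]:
        doc = dict(feature["properties"])      # flatten building metadata
        doc["geometry"] = feature["geometry"]  # GeoJSON geometry, unchanged
        docs.append(doc)
    return docs
    # With a live connection, roughly (hypothetical collection name):
    #   db.buildings.insert_many(docs)
    #   db.buildings.create_index([("geometry", "2dsphere")])
```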
Visualize with Leaflet
One issue here is going to be whether the data imported from MongoDB will be in `geojson` or regular `json` format.
Most of the visualization code can stay the same for the first iteration. However, the current Axis app uses only a single census tract of data. Since all of the data will now be available, this presents an opportunity to experiment with querying, choosing types of building data, and other ways of displaying the summary stats.
I've got an early version set up, with Axis pointing to MongoDB data. I inserted `geojson` data into MongoDB using this Python script: https://github.com/rtbigdata/geojson-mongo-import.py. Then I made a Mongo connection in an R script to pull the same data out of MongoDB. Finally, I used Leaflet to make the same map as in prior versions of Axis.
The data powering the colors is embedded in the `geojson` data, so this is not a complete version of the above concept. It does, however, provide the foundation for developing quick prototypes with geospatial data in MongoDB.