Because of the upcoming EU law for privacy, the GDPR, I started working on a way to provide a griding method that could provide the custtomers the highest level of detail of their data while being compliant with the law.
In a simplistic way, the requirement is to avoid any way to identify an individual based on the exposed data. So we need to generate overviews
of the data where the KPIs are aggregated and the anonimity enforced. In terms of location data, the provider should also provides spatial-temporal anonymization, and this requirement is not explicitally defined by the law. Some providers are limiting the level of detail to existing administrative divisios like census tracts
or an arbitrary sized and located grid of 0.25km² cells.
As this 500m x 500m cell size can be quite far from optimal in most of the cases:
- Urban center, shooping area: the amount of occurences in a cell this size can be several orders higher than the threshold
- Low populated country side: a cell this size might enclose so few occurences that it could identify an individual
Census tracts might be better fitted to the population, but in terms of inhabitants not passers by.
So the idea might be an adaptative grid where the size of the cell is the minimal one (max level of detail) that complies with the limitations (like a minimum number of events or unique individuals per cell as threshold) will provide the best resolution across the area of study being compliant with the legal restriction.
This QuadGrid is the result of aplying quadtree schema with a different restriction.
Link to test: https://team.carto.com/u/abel/builder/b1275d91-bc38-4d49-a1b2-6b5166856021/embed
We're using the webmercator tiling schema as the way to define the cells, so the result can be easyly assimilated to common web mapping tools
I stared with a plain PL/pgSQL version in order to have a debugging version of the algorithm, and then thanks to a talk with javisantana , I move it to a recursive version that had a great impact on performance and led me to research on index clustering that improved the performance even more
Performance:
- Plain SQL (V1): below O(N³)
- Recursive SQL (R2): close to O(N)
- Python: WiP
Related to threshold size, R2 (recursive SQL) preprocessing time is decreasing with the square of the threshold, as expected.
Some benchmarks (local OnPrem 2.0, timeout 5min, SQL versions)
version | streetlamps (7K) | benches (20K) | benches (65K) | thrashbins (90K) | flights (100K, WW) | trees (150K) | flights (400K, WW) |
---|---|---|---|---|---|---|---|
1 | 211 ms | 44 s | timeout | timeout | timeout | timeout | |
2 | 59 ms | 8 s | 75.3 s | 4:14 min | timeout | timeout | |
3 | 50 ms | 7 s | 68.3 s | 1:47 min | 3:43 min | 4:59 min | timeout |
R2 | 19 ms | 1.4 s | 2.11 s | 4 s | 4:15 min | ||
R2 + cluster | 60 ms | 2.6 s | 4 s | 9.25 s | 48 s |