`dots` with `per-dot` can bias downward the number of dots in small categories
jtrim-ons opened this issue · 4 comments
First, thank you for mapshaper. It's incredibly useful, and we use it all the time in the datavis team at the UK's Office for National Statistics.
I wondered if I could ask about a possible issue with `per-dot`, where categories with small counts could end up being under-represented in the overall map. To make up an example, suppose there are two categories, `category1` and `category2`, and suppose that for each polygon, the count of `category1` is less than half of `per-dot`. The result will be that no `category1` dots will be shown on the map. (This is not incorrect behavior of the current `per-dot` code, but it leads to a misleading impression when looking at the map as a whole.)
An example input GeoJSON is below.
{
"type": "FeatureCollection",
"features": [
{
"type": "Feature",
"geometry": {
"type": "Polygon",
"coordinates": [
[ [ 100.0, 0.0 ], [ 101.0, 0.0 ], [ 101.0, 1.0 ], [ 100.0, 1.0 ], [ 100.0, 0.0 ] ]
]
},
"properties": { "name": "placeA", "category1": 2, "category2": 10 }
},
{
"type": "Feature",
"geometry": {
"type": "Polygon",
"coordinates": [
[ [ 101.1, 0.0 ], [ 102.1, 0.0 ], [ 102.1, 1.0 ], [ 101.1, 1.0 ], [ 101.1, 0.0 ] ]
]
},
"properties": { "name": "placeB", "category1": 1, "category2": 10 }
}
]
}
The command `dots fields=category1,category2 colors=pink,purple per-dot=5` produces four `category2` dots and no `category1` dots.
{"type":"FeatureCollection", "features": [
{"type":"Feature","geometry":{"type":"Point","coordinates":[100.43058600629305,0.4066606479722267]},"properties":{"fill":"purple","r":1.3}},
{"type":"Feature","geometry":{"type":"Point","coordinates":[100.16666666666667,0.8333333333333334]},"properties":{"fill":"purple","r":1.3}},
{"type":"Feature","geometry":{"type":"Point","coordinates":[101.6,0.5]},"properties":{"fill":"purple","r":1.3}},
{"type":"Feature","geometry":{"type":"Point","coordinates":[101.26666666666667,0.16666666666666666]},"properties":{"fill":"purple","r":1.3}}
]}
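The arithmetic behind that result can be sketched in a few lines. This is only an illustration, assuming each count is divided by `per-dot` and rounded to the nearest integer, as the description above implies; it is not mapshaper's actual code, and `dots_per_category` and `round_half_up` are made-up helper names.

```python
import math

def round_half_up(x):
    """Round a non-negative number to the nearest integer, with halves going up."""
    return math.floor(x + 0.5)

def dots_per_category(features, fields, per_dot):
    """Assumed behaviour: one dot per `per_dot` units, rounded per polygon."""
    return [{f: round_half_up(props[f] / per_dot) for f in fields}
            for props in features]

# Property values copied from the two polygons in the example GeoJSON.
features = [
    {"category1": 2, "category2": 10},
    {"category1": 1, "category2": 10},
]
fields = ["category1", "category2"]
result = dots_per_category(features, fields, per_dot=5)
totals = {f: sum(r[f] for r in result) for f in fields}
print(totals)  # {'category1': 0, 'category2': 4}
```

The three `category1` people across both polygons would justify roughly one dot at this scale, but rounding each polygon separately discards them.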
I've seen this issue come up in some published ethnicity maps, where ethnicities that tend to have few people per census tract are underrepresented overall.
A solution that I like is @mountainMath's random rounding. Under this approach, in the example above, placeA would have a pink dot with probability 2/5 and placeB would have a pink dot with probability 1/5. I wonder if you think this might be good to add to Mapshaper as an option? I'd be happy to submit a PR if you think it would be useful.
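As a rough sketch of what random rounding could look like (my own illustration, not @mountainMath's code, with `random_round` as a hypothetical name): keep the whole number of dots, then add one more with probability equal to the fractional part, so the expected number of dots equals count / per-dot.

```python
import math
import random

def random_round(count, per_dot, rng):
    """Return floor(count / per_dot), plus one extra dot with probability
    equal to the fractional part of count / per_dot."""
    quotient = count / per_dot
    whole = math.floor(quotient)
    return whole + (1 if rng.random() < quotient - whole else 0)

rng = random.Random(0)  # seeded only so the sketch is repeatable
trials = 20000
# A polygon with 2 people in a category and per-dot=5 gets a dot about 2/5 of the time.
mean_dots = sum(random_round(2, 5, rng) for _ in range(trials)) / trials
print(mean_dots)
```

Counts that are exact multiples of `per-dot` are unaffected, since their fractional part is zero.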
Hi! Thanks for raising this issue.
I'm aware of this kind of systematic bias, and I want to look into ways of addressing it. Random rounding might give a more accurate proportion of colors across the entire dataset, but I suspect that it will produce many individual spatial units that are pretty far out of whack. I wonder if there's a solution that tries to minimize error at both a local and a regional scale (by balancing the two). Regions could be user-defined groups of low-level units. These could be higher-level administrative units (if you're mapping Census data), or they could be automatically assigned groups, using a grid or a clustering algorithm. Or maybe the algorithm could use the spatial equivalent of a moving window to group low-level units for the purpose of minimizing rounding error. My hunch is that we'll get better results with a deterministic algorithm that balances low-level error against error within local groups than with a random algorithm.
As a first stab at an algorithm, it would be simpler to balance low-level error against global error (rather than against error within local groups of units). One approach could be:
• First, assign colors using the current method.
• Then, do a series of color swaps within individual units to bring the global proportion of colors closer to the correct proportion.
The swaps would be done in a particular sequence, starting with the unit where changing an overrepresented color to an underrepresented color would cause the smallest increase in rounding error within the unit, and continuing until the global proportion of colors is correct.
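Here is one way the swap idea could be sketched. Several details are my assumptions rather than anything settled: rounding error within a unit is measured as the sum of absolute differences between dots and count / per-dot, the "correct" global proportion is the overall category shares applied to the number of dots already placed (apportioned by largest remainder), and all function names are hypothetical. It also assumes the counts are not all zero.

```python
import math

def largest_remainder(quotas, total):
    """Integers summing to `total` that stay as close to `quotas` as possible."""
    alloc = {k: math.floor(q) for k, q in quotas.items()}
    leftover = total - sum(alloc.values())
    by_fraction = sorted(quotas, key=lambda k: quotas[k] - alloc[k], reverse=True)
    for k in by_fraction[:leftover]:
        alloc[k] += 1
    return alloc

def unit_error(dots, counts, per_dot):
    """Assumed error measure: total absolute rounding error within one unit."""
    return sum(abs(dots[c] - counts[c] / per_dot) for c in counts)

def swap_colors(units, per_dot):
    cats = list(units[0])
    # Step 1: round each unit independently (the assumed current method).
    dots = [{c: math.floor(u[c] / per_dot + 0.5) for c in cats} for u in units]
    n_dots = sum(sum(d.values()) for d in dots)
    grand_total = sum(sum(u.values()) for u in units)
    quotas = {c: n_dots * sum(u[c] for u in units) / grand_total for c in cats}
    target = largest_remainder(quotas, n_dots)
    # Step 2: repeatedly apply the cheapest swap until global totals match.
    while True:
        surplus = {c: sum(d[c] for d in dots) - target[c] for c in cats}
        best = None
        for i, (d, u) in enumerate(zip(dots, units)):
            before = unit_error(d, u, per_dot)
            for over in cats:
                if surplus[over] <= 0 or d[over] == 0:
                    continue
                for under in cats:
                    if surplus[under] >= 0:
                        continue
                    trial = dict(d)
                    trial[over] -= 1
                    trial[under] += 1
                    cost = unit_error(trial, u, per_dot) - before
                    if best is None or cost < best[0]:
                        best = (cost, i, over, under)
        if best is None:
            return dots
        _, i, over, under = best
        dots[i][over] -= 1
        dots[i][under] += 1

units = [{"category1": 2, "category2": 10}, {"category1": 1, "category2": 10}]
swapped = swap_colors(units, per_dot=5)
print(swapped)
# [{'category1': 1, 'category2': 1}, {'category1': 0, 'category2': 2}]
```

On the two-polygon example, one purple dot in the first polygon becomes pink, because that swap adds less rounding error there than it would in the second polygon. Each unit keeps its original number of dots.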
Above, I suggested balancing error between individual units and groups of units, thinking that balancing global error against individual-unit error might lead to bias within local areas. My sense is that readers are often looking for patterns at the local or regional level, so minimizing error within local groupings of low-level spatial units may be more important than making the global proportion correct.
Hi Matthew, thanks for your reply. It's a really good point that readers are probably looking at local patterns more than global ones; I hadn't given this enough thought before.
I also like the idea of a deterministic algorithm. I guess the optimisation problem could be to distribute colours by low-level unit as accurately as possible while keeping regional totals correct. (A modified version of what the XKCD map says it does at the state/national level.)
Maybe an even simpler deterministic algorithm would work as follows. Let's say we're doing 10 people per dot. For simplicity imagine that no low-level unit has more than 9 people in a given category (otherwise, begin by allocating dots for groups of 10 people in the obvious way). For each category, calculate how many dots of that colour the region should have. Then begin by giving a dot to each low-level unit that has 9 people in that category. Continue to give dots to units with 8, 7, ... people in that category until you've used up all of the dots of that colour that you planned to allocate.
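A minimal sketch of that idea for a single category within one region, assuming integer counts and round-to-nearest for the regional total (`allocate_category` is just a name I made up):

```python
import math

def allocate_category(counts, per_dot):
    """Dots per low-level unit for one category, keeping the regional total fixed."""
    # Number of dots of this colour the whole region should have.
    target = math.floor(sum(counts) / per_dot + 0.5)
    # Allocate dots for complete groups of `per_dot` people first.
    dots = [c // per_dot for c in counts]
    extra = target - sum(dots)
    # Units with the largest leftover count receive the remaining dots first.
    # Python's sort is stable, so tied units are served in input order.
    order = sorted(range(len(counts)), key=lambda i: counts[i] % per_dot, reverse=True)
    for i in order[:extra]:
        dots[i] += 1
    return dots

print(allocate_category([2, 1], 5))         # [1, 0]
print(allocate_category([9, 8, 7, 3], 10))  # [1, 1, 1, 0]
```

With the GeoJSON example from the top of the thread, the first polygon would get the single pink dot. Because each category is handled separately, the total number of dots in a unit can drift from what its overall count implies.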
The colour swaps might be better than my method, though, because they can keep the total dot count for each low-level unit at the right number.
I don't think I've said anything very useful here, but I'm sure you'll come up with a great method.
I'm giving a talk this week where I was planning to say how great random rounding is for dot maps. I'll try to be a bit more nuanced based on what you've said :-)
Good comments here. I agree that pure random rounding is not optimal. In general I am less worried about artefacts due to random rounding, because typical selections of scaling factors will in practice result in maps that are dominated by the non-randomly coloured dots. And people trying to look for individual dot placements of small categories are "using it wrong".
I think this comes down to the fundamental problem of dot-density maps. They are great because they are intuitive, but they are a lie in that they suggest a level of accuracy (with respect to dot placement and colour categories) that simply does not exist in the data. It's a delicate balance, and of course saying people are "using the map wrong" is a poor excuse.
In some sense it gets a bit easier for dynamic dot-density web maps where the dot-placement gets recomputed in every view. The fact that dot placement and colour categories are different with every view helps make the point of the stochastic nature of these maps. For static maps that's different.
I like the idea of an additional hierarchical structure, either automatically generated using a neighbour (or distance) matrix, or using higher level statistical units if available. That will improve the map. There are two things to worry about. The total number of dots, and how to colour them. A hierarchical structure can help with both.
Another possible algorithm is to go top to bottom, instead of first doing lowest level dot placements and then colour-swapping (and adding or removing dots from some regions to make the overall dot count come out right).
Step 1: Total dot allotments
- Traverse the geographic hierarchy from top to bottom. At the top level, which is the union of all geographic regions, determine the total number of dots to place using regular rounding.
- Then distribute those dots to each lower-level region according to its share of the overall count, breaking ties via random draw.
That ensures that each region has the "right" number of dots with minimum randomness.
Step 2: Dot positions
This is the same as usual: randomly place dots within each lowest-level geography.
Step 3: Colouring
Proceed as in Step 1, from the highest to the lowest level of geography, labelling dots to give them colour.
That should ensure the least geographic distortion in overall number of dots and how they are coloured. It's a bit cumbersome though.
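To make Step 1 concrete, here is a rough sketch under some assumptions of mine: the hierarchy is a nested dict with hypothetical `name`, `children` and `count` keys, counts are integers, and dots are split at each level by a largest-remainder rule with ties broken at random. The colouring step would apply the same kind of apportionment to categories and is not shown.

```python
import math
import random
from fractions import Fraction

def apportion(total, weights, rng):
    """Split `total` dots in proportion to `weights`; ties are broken at random."""
    weight_sum = sum(weights)
    if weight_sum == 0:
        return [0] * len(weights)
    quotas = [Fraction(total * w, weight_sum) for w in weights]  # exact shares
    alloc = [math.floor(q) for q in quotas]
    leftover = total - sum(alloc)
    # Largest fractional part first; the random key only matters for exact ties.
    order = sorted(range(len(weights)),
                   key=lambda i: (alloc[i] - quotas[i], rng.random()))
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc

def total_count(node):
    if "children" in node:
        return sum(total_count(child) for child in node["children"])
    return node["count"]

def allot(node, dots, rng, out):
    """Push `dots` down the hierarchy, recording the share of each leaf."""
    if "children" not in node:
        out[node["name"]] = dots
        return
    children = node["children"]
    shares = apportion(dots, [total_count(c) for c in children], rng)
    for child, share in zip(children, shares):
        allot(child, share, rng, out)

# Made-up hierarchy: two regions containing three low-level units.
tree = {"name": "all", "children": [
    {"name": "region1", "children": [{"name": "a", "count": 7},
                                     {"name": "b", "count": 4}]},
    {"name": "region2", "children": [{"name": "c", "count": 12}]},
]}
per_dot = 5
top_level_dots = math.floor(total_count(tree) / per_dot + 0.5)  # regular rounding
allotment = {}
allot(tree, top_level_dots, random.Random(0), allotment)
print(allotment)  # {'a': 1, 'b': 1, 'c': 3}
```

Rounding each unit independently would give `a` one dot, `b` one dot and `c` two dots, four in total, whereas the 23 people overall justify five.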
Thanks for these suggestions :)