This repo contains the data and code for the paper "MapQA: A Dataset for Question Answering on Choropleth Maps".
MapQA consists of three subsets, acquired via different approaches as listed below.
- MapQA-U stands for MapQA with a Uniform map style. We scrape state-level US map images from the Kaiser Family Foundation (KFF), along with the underlying data. MapQA-U contains 20789 map images from 14 domains and nearly 300K questions. All maps are generated by KFF with a uniform map style: MapQA-U contains only discrete-legend maps with the same color scale (blues), legend position, and orientation. Because the underlying data is real-world data, it contains correlations.
- MapQA-R stands for MapQA with Regenerated KFF maps. MapQA-R contains the same underlying data as MapQA-U, but we regenerate the map images from that data, rather than using the uniform-style maps of MapQA-U, to challenge models on a variety of map styles. To prevent models from overfitting to the color scale when interpreting a map (and ignoring the legend), we use disjoint color scales for the train/val/test splits in MapQA-R.
- MapQA-S stands for MapQA with Synthetic data. MapQA-S uses the same map generation method as MapQA-R, but with synthetically generated data over US states. Precise control of the underlying data distribution keeps answer bias minimal, which makes it harder for models to exploit answer bias while ignoring map content.
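The disjoint color scales described for MapQA-R can be sanity-checked from the metadata. The following is a minimal sketch, not part of the released code: it assumes each metadata record carries a `colormap_name` field (described later in this README) and that the caller has already grouped the records by split. The function names and the grouping of records into splits are hypothetical.

```python
from itertools import combinations


def colormaps_by_split(metadata_by_split):
    """Collect the set of colormap names used in each split.

    metadata_by_split: dict mapping a split name (e.g. "train") to a list
    of metadata records (dicts). How maps are assigned to splits is an
    assumption left to the caller.
    """
    return {
        split: {record["colormap_name"] for record in records}
        for split, records in metadata_by_split.items()
    }


def overlapping_colormaps(metadata_by_split):
    """Return {(split_a, split_b): shared_names} for every pair that overlaps.

    An empty dict means the color scales are pairwise disjoint.
    """
    names = colormaps_by_split(metadata_by_split)
    overlaps = {}
    for a, b in combinations(sorted(names), 2):
        shared = names[a] & names[b]
        if shared:
            overlaps[(a, b)] = shared
    return overlaps
```

An empty result from `overlapping_colormaps` indicates that no color scale is shared between any two splits.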
Please download the data from here.
Each subset contains the following files or folders:
- usa_geo.json: contains the US geography information (state abbreviations, subareas, borders).
- images: contains the US choropleth map images.
- metadata: contains the underlying data of each map image.
- questions:
  - answer_vocab.json: contains the vocabulary of delexicalized answers.
  - train-QA.json: contains question-answer pairs in the train split.
  - dev-QA.json: contains question-answer pairs in the dev split.
  - test-QA.json: contains question-answer pairs in the test split.
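After downloading, the layout above can be verified with a short script. This is a sketch under assumptions: it treats the four question files as living inside a `questions` folder (as the nesting in the list suggests), and the subset directory path you pass in is whatever location you unpacked the data to.

```python
from pathlib import Path

# Expected entries, taken from the list above. Placing the QA files under
# "questions/" is an assumption based on the nesting of that list.
EXPECTED = [
    "usa_geo.json",
    "images",
    "metadata",
    "questions/answer_vocab.json",
    "questions/train-QA.json",
    "questions/dev-QA.json",
    "questions/test-QA.json",
]


def missing_entries(subset_dir):
    """Return the expected files/folders that are not found under subset_dir."""
    root = Path(subset_dir)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

Calling `missing_entries` on a subset directory returns an empty list when every expected entry is present.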
Each question file contains the following fields:
- question_id: the question index.
- map_id: the id of the map that the question is about.
- question_type: the question type.
- question_template_id: the question template id.
- question: the question tokens.
- oracle_delexicalized_question: the delexicalized question, using an oracle OCR system.
- tesseract_delexicalized_question: the delexicalized question, using a real OCR system (Tesseract).
- answer: the answer to the question. It can be either a single answer or a set of answers.
- oracle_delexicalized_answer: the delexicalized answer, using an oracle OCR system.
{
  "map_id": "map_0.png",
  "question_type": "retrieval",
  "question_template_id": "retrieval_1",
  "question": "Name the states that have a value in the range 1,231,000-4,782,400 ?",
  "oracle_delexicalized_question": "Name the states that have a value in the range legend_1 ?",
  "tesseract_delexicalized_question": "Name the states that have a value in the range legend_1 ?",
  "answer": [
    "California",
    "Florida",
    "New York"
  ],
  "oracle_delexicalized_answer": [
    "California",
    "Florida",
    "New York"
  ],
  "question_id": 5
},
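Question entries like the one above can be read with the standard `json` module. The sketch below assumes each QA file is a single JSON array of such entries (the trailing comma in the excerpt suggests this, but it is not stated explicitly); the helper names are hypothetical.

```python
import json
from collections import Counter


def load_questions(path):
    """Load one QA file, assumed to be a JSON array of question entries."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def answers_as_list(entry):
    """Normalize the answer field, which may be one answer or a set of answers."""
    answer = entry["answer"]
    return answer if isinstance(answer, list) else [answer]


def count_by_type(questions):
    """Count how many questions there are of each question_type."""
    return Counter(q["question_type"] for q in questions)
```

Normalizing the answer to a list lets downstream code treat single-answer and multi-answer questions uniformly.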
Each metadata file contains the metadata used to generate the map image. Some key metadata fields include:
- map_id: the index of the map to which the underlying data corresponds.
- legend_type: the legend type, either Discrete or Continuous. A discrete legend corresponds to a classified choropleth map; a continuous legend corresponds to an unclassed choropleth map.
- legend_order: the order of the legend symbols, either small_to_large or large_to_small. Only applicable to discrete legends.
- num_classes: the number of classes (not including missing data) into which the data is split. Only applicable to classified maps.
- num_symbols: the number of legend symbols (including the missing-data symbol). Only applicable to classified maps.
- class_descriptions: the description tokens for each legend symbol.
- data_type: the underlying data type, either Number or Percentage.
- data_distribution: the underlying data distribution. Only applicable to MapQA-S, where we synthetically generate unbiased underlying data.
- data: a list of entries, each containing the value of a region.
  - region: the name of the region.
  - value: the value of the region.
  - class: the class of the region. Only applicable to classified maps.
  - description: the class description of the region. Only applicable to classified maps.
  - legend_position: for classified maps, the symbol id of the class that the region belongs to, e.g. legend_0, legend_1, etc. For unclassed maps, the relative position (a real number between 0 and 1) of the region's color in the legend color bar.
- colormap_name: the name of the color scale used to map each class to a color.
- colorscales: a list of RGB colors representing the color of each class.
- title: the title information (position, font, etc.) of the map.
- legend: the legend information (position, font, etc.) of the map.
{
  "map_id": 0,
  "map_type": "Choropleth",
  "KFF_file_name": "medicaid-coverage-rates-for-the-nonelderly-by-age_2019_number_adults-19-64_kff",
  "scope": "USA",
  "locationmode": "USA-states",
  "projection": "albers usa",
  "legend_type": "Discrete",
  "legend_order": "large_to_small",
  "num_classes": 4,
  "num_symbols": 4,
  "class_descriptions": [
    "24,900-330,900",
    "339,200-700,500",
    "796,400-1,204,900",
    "1,231,000-4,782,400"
  ],
  "data_type": "Number",
  "missing_data": false,
  "data": [
    {
      "region": "AL",
      "value": 325900,
      "class": 0,
      "description": "24,900-330,900",
      "legend_position": "legend_4"
    },
    {
      "region": "AK",
      "value": 68200,
      "class": 0,
      "description": "24,900-330,900",
      "legend_position": "legend_4"
    },
    ...
  ],
  "colormap_name": "tempo",
  "colorscales": [
    "rgb(254, 245, 244)",
    "rgb(176, 173, 185)",
    "rgb(98, 101, 126)",
    "rgb(20, 29, 67)"
  ],
  "showgrid": true,
  "paper_bgcolor": "rgb(217, 217, 217)",
  "title": {
    "text": "Medicaid Coverage Rates for the Nonelderly by Age | KFF",
    "font_family": "Overpass",
    "font_size": 19,
    "position": {
      "x": 0.06,
      "y": 0.95
    }
  },
  "legend": {
    "title_font_size": 14,
    "orientation": "vertical",
    "position": {
      "x": 1.02,
      "y": 0
    }
    ,
    "bordercolor": "white",
    "borderwidth": 2,
    "font_size": 13
  }
}
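The `data` list in a metadata record is enough to look up values and to recover which regions fall in a given legend class, which is what a retrieval question asks for. The following is a minimal sketch with hypothetical helper names; it takes an already-parsed metadata record (a dict) and relies only on the `region`, `value`, and `description` fields shown above.

```python
def value_by_region(metadata):
    """Map each region identifier to its underlying value."""
    return {entry["region"]: entry["value"] for entry in metadata["data"]}


def regions_with_description(metadata, description):
    """List the regions whose class description matches the given legend range.

    `description` is compared as an exact string, e.g. "24,900-330,900".
    Entries without a description (unclassed maps) never match.
    """
    return [
        entry["region"]
        for entry in metadata["data"]
        if entry.get("description") == description
    ]
```

Note that regions in the example record are state abbreviations (e.g. AL), while answers in the question files use full state names; mapping between the two would presumably go through usa_geo.json, whose exact structure is not described here.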
If you use this data, or otherwise find our work valuable, please cite:
@inproceedings{chang2022mapqa,
title={MapQA: A Dataset for Question Answering on Choropleth Maps},
author={Chang, Shuaichen and Palzer, David and Li, Jialin and Fosler-Lussier, Eric and Xiao, Ningchuan},
booktitle={NeurIPS 2022 First Table Representation Workshop}
}
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.