Hexanonymity
Hexanonymity is a new algorithm for the anonymisation of geo-positioned data that provides k-anonymity while introducing only a limited amount of information loss. Hexanonymity leverages the Uber H3 geo-indexing system, which subdivides the Earth into hexagonal meshes. We take advantage of a property of hexagonal meshes: for any hexagon, the distance from its centre to the centre of each of its six neighbouring hexagons is the same. This property allows the algorithm to generate high-quality clusters of geo-positioned data points with limited information loss.
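The equidistance property can be checked numerically on an idealised flat hexagonal grid. The sketch below is only an illustration and does not use H3: the axial coordinates `(q, r)`, the `size` parameter, and the helper names are assumptions made for this example. H3 cells on the globe are only approximately regular, so the property holds approximately there.

```python
import math

# The six axial-coordinate offsets from a hexagon to its neighbours.
NEIGHBOUR_OFFSETS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]


def hex_centre(q, r, size=1.0):
    """Planar centre of the pointy-top hexagon at axial coordinates (q, r)."""
    x = size * math.sqrt(3) * (q + r / 2)
    y = size * 1.5 * r
    return x, y


def neighbour_distances(q, r, size=1.0):
    """Distances from the centre of (q, r) to the centres of its six neighbours."""
    centre = hex_centre(q, r, size)
    return [
        math.dist(centre, hex_centre(q + dq, r + dr, size))
        for dq, dr in NEIGHBOUR_OFFSETS
    ]


distances = neighbour_distances(q=3, r=-2)
# Check that all six distances agree up to floating-point rounding.
print(all(math.isclose(d, distances[0]) for d in distances))
```

A square grid does not share this property: its four edge neighbours and four corner neighbours lie at different distances from the centre.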
Hexanonymity therefore provides k-anonymity to datasets of geo-positioned datapoints. We use the Uber H3 library to group the set of locations into recursively larger areas so that, at the end of the process, locations belonging to the same cell in the hierarchy report the same final location, becoming indistinguishable and providing k-anonymity.
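The idea of grouping locations into recursively larger areas can be sketched as follows. This is a simplified illustration, not the algorithm from the paper: it swaps H3 hexagons for a plain square grid whose cell side halves at each level, and the names `cell_of`, `cell_centre`, and `coarsen` are hypothetical. Points are grouped at the finest level first; any cell holding at least k distinct users reports its centre for all of its points, and the remaining points are retried at the next coarser level.

```python
import math
from collections import defaultdict


def cell_of(lat, lon, level):
    """Hypothetical stand-in for an H3 cell: a square cell of side 1 / 2**level degrees."""
    size = 1.0 / (2 ** level)
    return (level, math.floor(lat / size), math.floor(lon / size))


def cell_centre(cell):
    """Centre (lat, lon) of a square cell produced by cell_of."""
    level, i, j = cell
    size = 1.0 / (2 ** level)
    return ((i + 0.5) * size, (j + 0.5) * size)


def coarsen(points, k, min_level, max_level):
    """points: list of (user_id, lat, lon). Returns one reported (lat, lon) per point,
    or None for points that never share a cell with at least k distinct users."""
    reported = [None] * len(points)
    pending = set(range(len(points)))
    # Walk from the finest level to the coarsest one.
    for level in range(max_level, min_level - 1, -1):
        groups = defaultdict(list)
        for idx in pending:
            _, lat, lon = points[idx]
            groups[cell_of(lat, lon, level)].append(idx)
        for cell, members in groups.items():
            distinct_users = {points[i][0] for i in members}
            if len(distinct_users) >= k:
                centre = cell_centre(cell)
                for i in members:
                    reported[i] = centre
                pending -= set(members)
    return reported


points = [
    ("1", 42.2239522, -8.7354573),
    ("2", 42.2244990, -8.7357169),
    ("1", 42.1011589, -8.8932563),
    ("2", 42.0859900, -8.8910411),
]
reported = coarsen(points, k=2, min_level=0, max_level=14)
for (user_id, _, _), location in zip(points, reported):
    print(user_id, location)
```

Points that are close together are merged at a fine level and lose little precision, while isolated points are only merged at coarser levels.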
The full methodology is available in the Hexanonymity paper.
Requirements
Install the requirements using pip:
pip install -r ./requirements.txt
Code example
You can easily apply Hexanonymity to a set of data points by adding them to a pandas DataFrame. The Hexanonymity class expects a DataFrame with the following columns:
- A column with the geo-positioned data points (latitude and longitude) to be anonymized, as a comma-separated string ("lat,lon")
- A column with the identifier of the user. To provide k-anonymity, the algorithm will try to make groups of at least k different individuals based on this identifier.
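What "groups of at least k different individuals" means can be made concrete with a small check. The following is a minimal sketch using only the standard library; the function names are hypothetical and are not part of the Hexanonymity API. It counts the distinct user identifiers behind each reported location.

```python
from collections import defaultdict


def min_users_per_location(locations, user_ids):
    """Smallest number of distinct users sharing any single reported location.

    Returns 0 when there are no locations at all.
    """
    users_by_location = defaultdict(set)
    for location, user_id in zip(locations, user_ids):
        users_by_location[location].add(user_id)
    return min((len(users) for users in users_by_location.values()), default=0)


def is_k_anonymous(locations, user_ids, k):
    """True if every reported location is shared by at least k distinct users."""
    return min_users_per_location(locations, user_ids) >= k


user_ids = ["1", "2", "1", "2"]
# Every raw location is unique, so each one identifies a single user.
raw = ["-8.7354573,42.2239522", "-8.7357169,42.224499",
       "-8.8932563,42.1011589", "-8.8910411,42.08599"]
# Illustrative placeholders standing in for anonymised output.
grouped = ["cell-a", "cell-a", "cell-b", "cell-b"]

print(is_k_anonymous(raw, user_ids, k=2))
print(is_k_anonymous(grouped, user_ids, k=2))
```

The same functions accept DataFrame columns directly, for example `is_k_anonymous(df["locations"], df["id"], k=2)`.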
The configuration of the Hexanonymity algorithm requires the following information:
- configuration: JSON object with the following fields:
  - k: Minimum k (at least k=2 to provide privacy)
  - min_p: Minimum size to be applied in the Uber H3 hierarchy
  - max_p: Maximum size to be applied in the Uber H3 hierarchy
- fields: Column name(s) containing the geo-positioned data points
- id_col: Column name which contains the user identifier
- sensitive_cols: An (optional) list of column name(s) with other fields to write the anonymised position to. In some datasets, the GPS data points appear in multiple columns. You can set the additional columns in this field of the configuration to anonymise all the columns at once.
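Before building the operation, it can help to sanity-check the configuration values. The helper below is a hypothetical sketch and is not part of the Hexanonymity API. It enforces k >= 2 as stated above, and it assumes that `min_p` and `max_p` are H3 resolution levels, which range from 0 to 15.

```python
def validate_configuration(configuration):
    """Return the configuration unchanged, or raise ValueError if it is inconsistent."""
    k = configuration["k"]
    min_p = configuration["min_p"]
    max_p = configuration["max_p"]
    if k < 2:
        raise ValueError("k must be at least 2 to provide privacy")
    if min_p > max_p:
        raise ValueError("min_p must not be greater than max_p")
    # Assumption: min_p and max_p are H3 resolution levels (0 to 15).
    if min_p < 0 or max_p > 15:
        raise ValueError("min_p and max_p must lie between 0 and 15")
    return configuration


configuration = validate_configuration({"k": 2, "min_p": 0, "max_p": 14})
print(configuration)
```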
We provide a Jupyter Notebook showcasing the anonymization of a simulated dataset of connected vehicles in near-real time. The dataset is available in the INFINITECH H2020 project marketplace.
Initialize pandas dataframe with sample data
import pandas as pd
from numpy import array

# Sample data: one location string and one user identifier per row.
df = pd.DataFrame(
    {
        "locations": pd.Series(
            array(
                [
                    "-8.7354573,42.2239522",
                    "-8.7357169,42.224499",
                    "-8.8932563,42.1011589",
                    "-8.8910411,42.08599",
                ]
            ),
            dtype=str,
        ),
        "id": pd.Series(array(["1", "2", "1", "2"]), dtype=str),
        "other_locations": pd.Series(array(["a1", "b2", "c3", "d2"]), dtype=str),
    }
)
Create a Hexanonymity instance
# Assumes the Hexanonimity class has already been imported from this package.
operation = Hexanonimity(
    configuration={"k": 2, "min_p": 0, "max_p": 14},
    fields=["locations"],
    id_col="id",
    sensitive_cols=[
        "other_locations",
    ],
)
Apply the operation
result = operation.apply(df)
result.head()
Citation
Please refer to CITATION. To cite Hexanonymity, cite the main paper:
@INPROCEEDINGS{10190642,
author={Rodriguez-Viñas, Javier and Ortega-Fernandez, Ines and Martínez, Eva Sotos},
booktitle={2023 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW)},
title={Hexanonymity: a scalable geo-positioned data clustering algorithm for anonymisation purposes},
year={2023},
volume={},
number={},
pages={396-404},
doi={10.1109/EuroSPW59978.2023.00050}}
Authors
Please refer to AUTHORS.
Contributors
Please refer to CONTRIBUTORS.
License
Hexanonymity is licensed under the Mozilla Public License v2.0. See the LICENSE file for details.
Funding
This work is funded by the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 856632 and the Ayudas Cervera para Centros Tecnológicos grant of the Spanish Centre for the Development of Industrial Technology (CDTI) under the project ÉGIDA (CER-20191012).