Repository for useR!2024 tutorial Data Anonymisation for Open Science

Data Anonymisation for Open Science

useR! 2024

8 – 11 JULY 2024

SALZBURG, AUSTRIA

Session format: In-Person Tutorial (3 hours)
Topic: Community
Community sub-topic: Open and reproducible science

One of the key elements of open science is open data that are available to a wide spectrum of users. Unfortunately, many datasets cannot be made publicly available, mostly for privacy reasons: data protection laws restrict how personal data may be used. In this tutorial, we will go through methods of statistical disclosure control and the anonymisation approaches that can be used to protect data confidentiality. These methods either modify or synthesise data so that they can be released without revealing confidential information that could be linked to specific respondents. In particular, we will discuss non-perturbation and perturbation methods as well as methods for synthetic data generation. To this end, the use of the R packages sdcMicro, simPop, and synthpop will be demonstrated.
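To make the distinction between non-perturbation and perturbation methods concrete, here is a minimal, language-agnostic sketch written in plain Python. It is not taken from the tutorial materials and does not use the API of sdcMicro, simPop, or synthpop. The toy records, the function names (`recode_age`, `violates_k_anonymity`, `add_noise`), and the parameter values are all hypothetical and chosen only for illustration. The sketch recodes an exact age into a band (a non-perturbation step), flags records whose combination of quasi-identifiers is shared by fewer than k records, and adds random noise to a numeric variable (a perturbation step).

```python
import random
from collections import Counter

# Hypothetical toy microdata: (age, region, income). Not from the tutorial.
records = [
    (23, "north", 31000),
    (27, "north", 28500),
    (24, "north", 35000),
    (41, "south", 52000),
    (45, "south", 47000),
    (68, "south", 39000),
]


def recode_age(age, width=10):
    """Non-perturbation step: replace an exact age with a band such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def violates_k_anonymity(rows, k):
    """Return rows whose (age band, region) combination occurs fewer than k times."""
    counts = Counter((age_band, region) for age_band, region, _ in rows)
    return [row for row in rows if counts[(row[0], row[1])] < k]


def add_noise(values, relative_sd, seed=0):
    """Perturbation step: add zero-mean Gaussian noise proportional to each value."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [v + rng.gauss(0, relative_sd * v) for v in values]


# Recode the quasi-identifier, then look for records that are still too unique.
recoded = [(recode_age(age), region, income) for age, region, income in records]
at_risk = violates_k_anonymity(recoded, k=2)

# Perturb the sensitive numeric variable.
noisy_income = add_noise([income for _, _, income in recoded], relative_sd=0.05)

print("Records violating 2-anonymity:", at_risk)
print("Perturbed incomes:", [round(x) for x in noisy_income])
```

In this toy example, recoding alone leaves one record unique on its quasi-identifiers, so a further step such as coarser recoding or local suppression would be needed before release. Whether the noise level shown is appropriate depends on the data and on the acceptable information loss.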

Learning Outcomes: By the end of this tutorial, the participants should:

  • be familiar with basic methods of statistical disclosure control and data anonymisation,
  • know which methods are suitable for specific variables and situations,
  • know the basics of non-perturbation and perturbation methods,
  • know basic methods of synthetic data generation,
  • be familiar with data utility and information loss,
  • know about the specific challenges of protecting the confidentiality of data.