
Analyzing GDPR Fines

1. Project Overview

Analyzing GDPR fines imposed by European data protection authorities can reveal the main reasons for non-compliance and the authorities' focus areas, and can help our organization address similar gaps in its own data privacy strategy in a timely manner.

The complete presentation can be accessed here: Analyze GDPR Fines

1.1. Project Team

UMICH MADS Milestone I Team:

  • Andre Buser (busera)
  • Victor Adafinoaiei (vadafino)

1.2. Introduction and Objective

On May 25, 2018, the European Union (EU) “General Data Protection Regulation” (GDPR) became effective. The GDPR is a data privacy regulation adopted by the EU to provide enhanced protection to EU citizens and their personal data. Violations can result in fines of up to twenty million euros or four percent of the company’s global annual revenue from the previous year, whichever is higher. EU data protection authorities impose these fines to enforce data protection compliance.

The purpose of the project is to analyze GDPR fines that have been issued since 2018 and to get:

  • Basic insights regarding:

    • Which industry sectors have been penalized the most?
    • Which individual companies have been penalized the most?
    • Which EU countries have the most violations?
    • Which GDPR articles have been violated the most?
    • What are the “average costs” of a violation per sector?
  • Advanced insights: by correlating the GDPR fine dataset with population by country (POP), gross domestic product (GDP), and the corruption perception index (CPI) by country, the project intends to verify the following assumptions:

    • A higher GDP could lead to more reported cases, because a higher GDP could mean more companies in the country.
    • A higher CPI could lead to more reported cases, because the public sector may be less influenced by companies (higher CPI score = less corrupt).
    • A higher population could lead to more reported cases, because more data subjects could exercise their rights.

2. Datasets

2.1. Primary Dataset

2.1.1. GDPR fines

The GDPR fines data is the primary dataset. The information is scraped from the GDPR Enforcement Tracker website and contains details about the imposed GDPR fines.

The information is scraped with the Selenium library, because the GDPR information is dynamically loaded via JavaScript and is not available as text in the HTML source code. The parsing process was written by the project team and is documented in the notebook “01_DCO-WS_GDPR-Fines_enforcementtracker.ipynb”.
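A minimal sketch of this approach is shown below; the element id used for the wait condition and the driver setup are assumptions for illustration only, the authoritative logic lives in the notebook above.

```python
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Open the enforcement tracker; the fines table is rendered client-side via JavaScript,
# so a plain HTTP request would not contain the data.
driver = webdriver.Chrome()
driver.get("https://www.enforcementtracker.com/")

# Wait until the dynamically loaded table is present.
# "penalties" is an assumed element id used for illustration only.
WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.ID, "penalties")))

# Once rendered, the table can be parsed straight from the page source.
fines = pd.read_html(driver.page_source)[0]
driver.quit()

fines.to_pickle("gdpr_fines_enforcementtracker_p2.pkl")
```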

| Column | dtype | atype | Rel. | Description |
| --- | --- | --- | --- | --- |
| ETid | str | O | No | Unique identifier assigned by the GDPR Enforcement Tracker |
| Country | str | N | Yes | Name of country |
| Date of Decision | str | O | Yes | Date when the fine was imposed |
| Fine [€] | int | Q | Yes | Amount of the fine in EUR |
| Controller/Processor | str | N | Yes | Data controller / processor / organization fined |
| Quoted Art. | str | N | Yes | Relevant GDPR article which was quoted for the imposed fine |
| Type | str | N | Yes | Reason/cause for the fine |
| Summary | str | N | Yes | Short explanation of the case |
| Authority | str | N | Yes | Name of the data protection authority that imposed the fine |
| Sector | str | N | Yes | Industry sector of the controller/processor |

| Item | Source | Target |
| --- | --- | --- |
| Access method | Web scraping (initial); last scrape on 2021-11-28 | |
| Last update | | |
| Estimated size | 926 records | |
| No. of attributes | 10 | |
| Location | https://www.enforcementtracker.com/ | /data/external/ |
| Filename | Not applicable | gdpr_fines_enforcementtracker_p2.pkl |
| File format | HTML | Parsed: PKL |

⚠️ Attention: The initial parsing takes several hours, because for the detailed information (Summary, Authority and Sector) the page for every single case has to be opened and closed in sequence. The process also needs to be monitored due to potential timeout issues.

2.2. Secondary Datasets

2.2.1. Population

Countries of the world with their population over the years (1955–2020). The data is scraped from the Worldometer website. 234 out of 235 countries were parsed; Micronesia was excluded due to parsing issues. The Beautiful Soup library is used for the parsing. The parsing code was written by the project team and is documented in the notebook “01_DCO-WS_population_worldmeter.ipynb”.
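A minimal sketch of the Beautiful Soup approach, assuming a per-country page URL and that the historical figures are the first HTML table on the page (both assumptions for illustration; the actual logic is documented in the notebook above):

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Fetch the population page for a single country.
# The URL pattern below is an assumption used for illustration only.
url = "https://www.worldometers.info/world-population/germany-population/"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Worldometer renders the historical population figures as a plain HTML table.
table = soup.find("table")
population = pd.read_html(str(table))[0]
print(population.head())
```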

| Column | dtype | atype | Rel. | Description |
| --- | --- | --- | --- | --- |
| Country | str | N | Yes | Name of the country |
| Year | int | O | Yes | Year of the population figure |
| Population | int | Q | Yes | Total population |
| Yearly % Change | float | Q | No | Yearly change of the population |
| Urban Population | int | Q | No | Urban population |

| Item | Source | Target |
| --- | --- | --- |
| Access method | Web scraping (initial); last scrape on 2021-11-28 | |
| Last update | N/A | |
| Estimated size | 4212 records (Micronesia excluded) | |
| No. of attributes | 5 | |
| File location | https://www.worldometers.info/population/ | /data/external/ |
| Filename | Not applicable | population_by_country_p2.pkl |
| File format | HTML table | PKL |

🤝 Decisions: Micronesia was excluded due to parsing issues, which stem from the fact that Micronesia does not provide all the data/columns available for the other countries in the Worldometer dataset. Since this country is not relevant for our analysis (no GDPR fines in Micronesia), the project team decided not to invest more time in covering this specific parsing case and excluded the country.

2.2.2. CPI Scores

The CPI dataset describes the Corruption Perceptions Index (CPI) per country. The CPI scores and ranks countries/territories based on how corrupt a country’s public sector is perceived to be. It is a composite index, a combination of surveys and assessments of corruption, collected by a variety of reputable institutions.

The data is manually downloaded from the “Transparency International” website.

| Column | dtype | atype | Rel. | Description |
| --- | --- | --- | --- | --- |
| Country | str | N | Yes | Name of country |
| ISO3 | str | N | Yes | ISO code of the country name |
| CPI Score 20xx | int | Q | Yes | Corruption Perceptions Index (CPI) score of the year |

| Item | Source | Target |
| --- | --- | --- |
| Access method | Download; last download on 2021-11-29 | |
| Estimated size | 180 records | |
| No. of attributes | 34 | |
| File location | https://www.transparency.org/en/cpi/2020/index/ | /data/external/ |
| Filename | CPI2020_GlobalTablesTS_210125.xlsx | CPI2020_GlobalTablesTS_210125.xlsx |
| File format | XLSX | XLSX |

2.2.3. GDP per capita

Gross Domestic Product (GDP) is the monetary value of all finished goods and services made within a country during a specific period. GDP provides an economic snapshot of a country, used to estimate the size of an economy and growth rate.

This dataset contains the current GDP in USD, not corrected for purchasing power parity (PPP), because we want to understand the overall amount of goods and services produced. The dataset is manually downloaded from the World Bank website.

| Column | dtype | atype | Rel. | Description |
| --- | --- | --- | --- | --- |
| Country Name | str | N | Yes | Name of country |
| Country Code | str | N | Yes | ISO code of the country name |
| Indicator Name | str | N | No | All fields contain the value 'GDP (current US$)' |
| Indicator Code | str | N | No | All fields contain the value 'NY.GDP.MKTP.CD' |
| Columns of years (1960–2020) | float | Q | Yes | The GDP value of the year |

| Item | Source | Target |
| --- | --- | --- |
| Access method | Download; last download on 2021-11-29 | |
| Estimated size | 266 records | |
| No. of attributes | 65 | |
| File location | https://data.worldbank.org/indicator/NY.GDP.MKTP.CD | /data/external/ |
| Filename | API_NY.GDP.MKTP.CD_DS2_en_excel_v2_3158925.xls | API_NY.GDP.MKTP.CD_DS2_en_excel_v2_3158925.xls |
| File format | XLS | XLS |

3. Project Structure

├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets.
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `01-data_load_and_cleaning`.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── environment_conda.yml <- The requirements file for conda
│               
├── environment_pip.txt   <- The requirements file for pip packages

4. Workflow

The project created, collected, cleaned, manipulated, and stored the datasets according to the following process flow:

(Process flow diagram)

🤝 Decisions:

  • For all datasets (tables), the same unique key is created and used: country+year.
  • All cleaned datasets will be stored in the SQLite file “project_GDPR-fines.sqlite” (a minimal storage sketch follows this list) for the following reasons:
    • Retain data types and structure (compared to xlsx or csv)
    • Improve access outside of Python, e.g. via an SQLite DB Browser
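A minimal sketch of writing a cleaned dataset to the SQLite file and reading it back; the dataframe content and the table name "gdpr_fines" are illustrative assumptions.

```python
import sqlite3
import pandas as pd

# Toy cleaned dataframe; in the project this comes out of the cleaning notebooks.
cleaned_fines = pd.DataFrame({"key": ["spain2019", "italy2020"], "fine": [100_000, 50_000]})

# Open (or create) the shared project database and write the dataframe to its own table.
con = sqlite3.connect("project_GDPR-fines.sqlite")
cleaned_fines.to_sql("gdpr_fines", con, if_exists="replace", index=False)

# Datasets can later be read back with SQL (or inspected in an external SQLite DB browser).
fines = pd.read_sql("SELECT * FROM gdpr_fines", con)
con.close()
```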

Legend

  • WS: Web scraping
  • DCO: Data collection
  • DCL: Data cleaning and manipulation

5. Cleaning and Manipulation

Joining Datasets

The key attribute used to combine all datasets is the country name + year. These attributes and values are present in all datasets.
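A minimal sketch of the join logic (the column names and values are illustrative assumptions; the actual keys are created in the cleaning notebooks):

```python
import pandas as pd

# Toy cleaned tables; in the project each is read from its SQLite table.
fines = pd.DataFrame({"key": ["spain2019", "italy2020"], "fine": [100_000, 50_000]})
gdp = pd.DataFrame({"key": ["spain2019", "italy2020"], "gdp": [1.4e12, 2.0e12]})
pop = pd.DataFrame({"key": ["spain2019", "italy2020"], "population": [47e6, 59e6]})

# The shared country+year key joins the primary and secondary datasets.
merged = fines.merge(gdp, on="key", how="left").merge(pop, on="key", how="left")
print(merged)
```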

Main Challenges

  • Potential inconsistencies or spelling mistakes in the country names
  • Missing values in the datasets
  • Outliers in the datasets
  • Breaking/melting “year columns” into rows

🤝 Decision: For cleaning or manipulation activities, an assert statement should be used to verify the expected outcome - where possible.
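For example, a hedged sketch of such a check (the toy data and column names are assumptions for illustration):

```python
import pandas as pd

# Toy example of a cleaning step followed by assert-based verification.
df = pd.DataFrame({"Country": ["Spain", "SPAIN ", "Italy"], "Fine [€]": [100, 200, 300]})

# Cleaning: lowercase column names and normalize country spelling.
df.columns = [c.lower() for c in df.columns]
df["country"] = df["country"].str.strip().str.title()

# Verify the expected outcome of the cleaning step.
assert df.columns.is_unique
assert df["country"].eq(df["country"].str.title()).all()
assert df["country"].notna().all()
```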

Dataset Processing Steps

All datasets:

  • Lowercase column names
  • Check consistency of categorical features
  • Create mapping table for country names + label encoding
  • Ensure consistent country names via mapping table
  • Create "key" column based on country+year for joining datasets
  • Encode categorical features
  • Drop unused columns
  • Check for missing values
  • Check for outliers

GDPR:

  • Lowercase column names
  • Clean and tokenize summary column for NLP analysis
  • Explode "quoted art." into separate columns
  • Label-encode: country, controller/processor, quoted art., sector, company, type
  • Extract year from date column
  • Create unique key column: country+year
  • Save result in SQLite DB file

POP:

  • Lowercase column names
  • Calculate population 2021 based on average growth rate
  • Keep relevant columns: country, year
  • Label-encode country
  • Create unique key column: country+year
  • Create and assign percentile column
  • Keep only countries in GDPR-fine dataset
  • Save result in SQLite DB file

CPI:

  • Skip first 2 rows during import
  • Lowercase column names
  • Calculate CPI score 2021 based on average growth rate
  • Keep relevant columns: country, CPI scores 2018–2021
  • Rename columns to years only
  • Melt 2018 to 2021 (single rows per year)
  • Clean/check country names for consistency
  • Keep only countries in GDPR-fine dataset
  • Label-encode country
  • Create unique key column: country+year
  • Create and assign percentile column
  • Save result in SQLite DB file

GDP:

  • Skip first 3 rows during import
  • Lowercase column names
  • Calculate GDP 2021 based on average growth rate (see the sketch after this overview)
  • Keep relevant columns: country name, 2018–2021
  • Melt 2018 to 2021 (single rows per year)
  • Clean/check country names for consistency
  • Keep only countries in GDPR-fine dataset
  • Label-encode country
  • Create unique key column: country+year
  • Create and assign percentile column
  • Save result in SQLite DB file
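A minimal sketch of the GDP-style import, 2021 extrapolation, and melt steps. The toy values, the average-growth-rate formula, and the key construction are assumptions for illustration; the authoritative steps are in the DCL notebooks.

```python
import pandas as pd

# Illustrative wide-format table as it might look after import
# (for the real file: pd.read_excel(..., skiprows=3)).
gdp_wide = pd.DataFrame(
    {
        "country name": ["Spain", "Italy"],
        "2018": [1.42e12, 2.09e12],
        "2019": [1.39e12, 2.01e12],
        "2020": [1.28e12, 1.89e12],
    }
)

# Extrapolate 2021 from the average year-over-year growth rate (assumed approach).
years = ["2018", "2019", "2020"]
growth = gdp_wide[years].pct_change(axis=1).mean(axis=1)
gdp_wide["2021"] = gdp_wide["2020"] * (1 + growth)

# Melt the year columns into rows so every (country, year) pair becomes one record.
gdp_long = gdp_wide.melt(id_vars="country name", var_name="year", value_name="gdp")

# Unique key used to join with the other datasets.
gdp_long["key"] = gdp_long["country name"].str.lower() + gdp_long["year"]
```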

6. Analysis

In the analysis phase, the project team will combine numerical and visualization techniques to allow a better understanding of the different characteristics of the dataset, its features, and the potential relationships. For this EDA all datasets will be merged/joined into one dataframe.

With the EDA, we try to address the following questions:

  • What is the distribution of the various features to understand skewness, imbalances, and potential bias?
  • Are outliers present?
  • What are the relationships between the main features: GDPR fine, country, year, population, CPI, and GDP?
  • How do the different pairs of features correlate with one another? Do these correlations make sense?

In the univariate EDA we intend to answer the following questions:

  • Which individual companies have been penalized the most?
  • Which EU countries have the most violations?
  • Which industry sectors have the most violations?
  • Which GDPR articles have been violated the most?
  • What are the “expected average costs” of a violation per sector or article?

6.1. Univariate EDA for Numerical Features

For the numerical EDA, the most commonly used descriptive statistics will be computed with the pandas .describe() method, which provides the count, mean, standard deviation, minimum, percentiles, and maximum. With the help of these numbers, relevant quantitative characteristics should be discovered. A minimal sketch of these checks follows the table below.

| What | How (technique) | Why (learn from it) |
| --- | --- | --- |
| Checking characteristics of all single numerical features | Histograms (pandas .hist() method), box plots, bar plots | To identify and understand the distribution/skewness, outliers, and variability |
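A hedged sketch of the univariate checks on the merged dataframe (the toy data and column names are assumptions; in the project the dataframe is loaded from the SQLite file):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy merged dataframe used only for illustration.
merged = pd.DataFrame(
    {"fine": [5_000, 20_000, 150_000, 1_000_000], "gdp": [1.2e12, 1.4e12, 2.0e12, 3.8e12]}
)

# Count, mean, standard deviation, minimum, percentiles, and maximum per feature.
print(merged.describe())

# Distribution / skewness and outlier checks for the numerical features.
merged.hist(bins=20)
merged.boxplot()
plt.show()
```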

6.2. Bi-/Multivariate EDA

Bivariate EDA techniques are used to explore pairs of variables and start understanding how they relate to each other. In the bi-/multivariate EDA we intend to explore the following assumptions (a minimal sketch of the correlation checks follows the table below):

  • A higher GDP could lead to more reported cases, because a higher GDP could mean more companies in the country.
  • A higher CPI could lead to more reported cases, because the public sector may be less influenced by companies (higher CPI score = less corrupt).
  • A higher population could lead to more reported cases, because more data subjects could exercise their rights.

| What | How (technique) | Why (learn from it) |
| --- | --- | --- |
| Analyzing two or more numerical features | Scatter plot / Seaborn pair plot; Pearson correlation coefficient. Note: correlation will be run against all features. | To explore the relationships between pairs of features. Scatter plot: overall pattern, strength/noise, direction (positive/negative). Pearson: strength, direction (positive/negative). |
| Analyzing two or more categorical features | Bar plots; cross-tables (optional). Note: correlation will be run against all features. | |
| Analyzing numerical and categorical features | Box plots | To identify distributions and variations of numerical features, such as the GDPR fine, across categorical features, such as articles quoted, industry sector, ... |
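A minimal sketch of the bivariate checks (the toy data and column names are illustrative assumptions; in the project the dataframe is loaded from the SQLite file):

```python
import pandas as pd
import seaborn as sns

# Toy merged dataframe used only for illustration.
merged = pd.DataFrame(
    {
        "cases": [10, 25, 40, 80],
        "gdp": [0.3e12, 1.4e12, 2.0e12, 3.8e12],
        "cpi_score": [55, 62, 53, 80],
        "population": [10e6, 47e6, 59e6, 83e6],
    }
)

# Pairwise scatter plots: overall pattern, strength/noise, and direction.
sns.pairplot(merged)

# Pearson correlation coefficients across all numerical features.
print(merged.corr(method="pearson"))
```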

7. Ethical Considerations

To assess ethical concerns as part of the project, we have decided to use the Data Science Ethics Checklist provided by the “Deon” library. The ethics checklist will be added to our data science project and notebooks and reviewed for relevance. An initial assessment of potential ethical considerations is presented below:

| Topic | Relevant | Comments | Project Team |
| --- | --- | --- | --- |
| A. Data Collection | | | |
| A.1 Informed consent | No | Human subjects are not in the scope of the project. The information used, such as GDPR-related fines or country corruption indices, is publicly available. | |
| A.2 Collection bias | Yes | The publicly available GDPR fine information is not the complete set. Certain fines are not made available for multiple reasons, such as ongoing broader investigations. As a result, the completeness of our analysis may be affected. Due to limited access to the information and time, we will not reach out to local privacy authorities for a more comprehensive view. We plan to use analytical techniques to detect potential bias and find ways to compensate for it. | |
| A.3 Limit PII exposure | No | Personally identifiable information is not used in our analysis, so anonymization or limiting data collection does not represent a risk. | |
| A.4 Downstream bias mitigation | Yes | As for A.2, we will analyze bias both in the source data and in the results to compensate for or mitigate any biased outcomes. | |
| B. Data Storage | | | |
| B.1 Data security | No | The project uses only publicly available information, so there is no need for data encryption or stringent access control to protect confidentiality. | |
| B.2 Right to be forgotten | No | Individual data is not collected for this exercise. | |
| B.3 Data retention plan | No | No requirements for data retention, as the data used is publicly available without any retention requirements. | |
| C. Analysis | | | |
| C.1 Missing perspectives | Yes | At this stage, we are not considering engaging with local privacy authorities to obtain a more nuanced view of the fines. Our analysis relies purely on the information made available. However, we may reach out to our Data Privacy Office to explore more in-depth privacy needs. | |
| C.2 Dataset bias | Yes | As part of the analysis phase, we plan to test for bias and take mitigation actions in case bias is detected. | |
| C.3 Honest representation | Yes | The visualizations are designed with the following criteria in mind: expressiveness and effectiveness. We want to ensure that each visualization expresses all the facts in the data while conveying the information effectively. | |
| C.4 Privacy in analysis | No | PII (personally identifiable information) is neither considered nor collected as part of this project. | |
| C.5 Auditability | Yes | The entire work is documented in Jupyter notebooks to ensure that it is reproducible. | |
| D. Modeling | | Not applicable for this project | |
| E. Deployment | | Not applicable for this project | |

8. Installation and Setup

8.1. Environment Setup

We recommend using conda as the foundation because it simplifies the management of required Python versions. To create the project's conda environment use:

conda env create -f environment_conda.yml

Once the environment is created, activate it:

conda activate gdpr_fines

9. Conclusion

Our analysis of GDPR fines from 2018 to 2021 has revealed crucial insights for data protection practices, particularly in the healthcare sector.

Spain emerged as the country with the highest number of fines across all industries. Our findings suggest that organizations should prioritize reviewing and strengthening their controls related to

  • Information Security (Art. 32),
  • Legal Basis for Data Processing (Art. 6 and Art. 9), and
  • General Data Processing Principles (Art. 5),

as these areas account for the majority of compliance issues and associated costs.

We've also identified positive correlations between fine amounts and factors such as GDP, Corruption Perception Index (CPI), and population size.

While our dataset has limitations in completeness, it provides valuable directional insights for improving GDPR compliance strategies. Future work will focus on refining our analysis with updated data, exploring additional variables, and developing predictive models to further enhance our understanding of GDPR enforcement trends.

The complete presentation can be accessed here: Analyze GDPR Fines