This is the accompanying repository for the "Unifying Privacy Policy Detection" paper published in the Privacy Enhancing Technologies Symposium (PETS) 2021.
The aim of this project is to support privacy policy researchers with a unified solution for creating privacy policy corpora based on currently available best practices.
At the moment, we have uploaded the source code as a proof of concept, along with the trained classifiers and vectorizers for English and German. We are planning to provide a pip package as soon as possible to make this toolchain easier to use.
The toolchain consists of five steps:
- Finding potential privacy/cookie policies on websites
- Text-from-HTML extraction
- Language detection
- Key phrase extraction
- Classification
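To make the first two steps more concrete, here is a minimal standard-library sketch. It is not the code shipped in this repository: the keyword list `POLICY_KEYWORDS`, the class `PageScanner`, and the function `scan_page` are hypothetical names chosen for illustration, and the matching heuristic is a simple assumption rather than the method described in the paper.

```python
from html.parser import HTMLParser

# Hypothetical keyword list (assumption); the actual link-detection heuristics may differ.
POLICY_KEYWORDS = ("privacy", "cookie", "datenschutz")


class PageScanner(HTMLParser):
    """Collects visible text and links that look like policy candidates."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.candidate_links = []
        self._skip_depth = 0
        self._current_href = None
        self._anchor_text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            # Remember the link target until the closing </a> is seen.
            self._current_href = dict(attrs).get("href")
            self._anchor_text = []

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "a" and self._current_href is not None:
            # Match keywords against both the URL and the link label.
            label = " ".join(self._anchor_text)
            haystack = (self._current_href + " " + label).lower()
            if any(keyword in haystack for keyword in POLICY_KEYWORDS):
                self.candidate_links.append(self._current_href)
            self._current_href = None

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script and style contents
        stripped = data.strip()
        if not stripped:
            return
        self.text_parts.append(stripped)
        if self._current_href is not None:
            self._anchor_text.append(stripped)


def scan_page(html):
    """Return (visible_text, candidate_policy_links) for one HTML document."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    return " ".join(scanner.text_parts), scanner.candidate_links
```

Calling `scan_page(html)` on a downloaded page yields the plain text that later steps (language detection, key phrase extraction, classification) would consume, plus a list of links worth following.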
The repository is currently structured as follows:
.
|-- LICENSE
|-- README.md
|-- privacy_policy_link_detection
|   |-- README.md
|   |-- custom_command_find_privacy_policies.py
|   `-- demo_privacy_policy_download.py
`-- privacy_policy_toolchain
    |-- code
    |   |-- ppt.py
    |   `-- resources
    |       |-- VotingClassifier_soft_de.pkl
    |       |-- VotingClassifier_soft_en.pkl
    |       |-- trained_vectorizer_de.pkl
    |       `-- trained_vectorizer_en.pkl
    |-- data
    |   `-- privacy_policies
    |-- environment.yml
    |-- feature_list
    |   |-- feature_list_de.txt
    |   `-- feature_list_en.txt
    |-- logs
    |   `-- language_analysis
    `-- results
        `-- classification
The resources folder contains the trained models and the vectorizers for both English and German.
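As a rough illustration of how these files might be loaded per language, consider the sketch below. The helper names `resource_paths` and `load_resources` are hypothetical, the base path is taken from the repository tree, and the code assumes the `.pkl` files were written with Python's `pickle` module and that the environment from `environment.yml` (presumably including scikit-learn) is active. Check `ppt.py` for how the toolchain actually loads them.

```python
import pickle
from pathlib import Path

# Assumed location, derived from the repository tree; adjust to your checkout.
RESOURCES = Path("privacy_policy_toolchain") / "code" / "resources"
SUPPORTED_LANGUAGES = ("en", "de")


def resource_paths(language, base=RESOURCES):
    """Return (vectorizer_path, classifier_path) for a language code."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"no trained model for language {language!r}")
    return (
        base / f"trained_vectorizer_{language}.pkl",
        base / f"VotingClassifier_soft_{language}.pkl",
    )


def load_resources(language, base=RESOURCES):
    """Load the vectorizer and classifier for one language.

    Unpickling can execute arbitrary code, so only load files you trust.
    Assumption: the files are plain pickle dumps of fitted objects.
    """
    vectorizer_path, classifier_path = resource_paths(language, base)
    with open(vectorizer_path, "rb") as handle:
        vectorizer = pickle.load(handle)
    with open(classifier_path, "rb") as handle:
        classifier = pickle.load(handle)
    return vectorizer, classifier
```

Keeping path construction separate from loading makes it easy to check which files a language code maps to before any unpickling takes place.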
Henry Hosseini, Martin Degeling, Christine Utz, Thomas Hupperich. "Unifying Privacy Policy Detection." PETS 2021.
- Henry Hosseini: henry.hosseini@wi.uni-muenster.de