Experimental Natural Language Processing on Privacy Policies for Import into the Transparency Information Language
NER_NLTK_Spacy - Privacy Policies.ipynb   Jupyter notebook with the main experiments
- academy.dslrvideoshooter.com.txt        serialized output of the policy
- arstechnica.com.txt                     serialized output of the policy
- academymortgage.com.txt                 serialized output of the policy
- language detection.png                  screenshot of the notebook for language detection
- langs.json                              results of the language detection
- dist.png                                distribution of detected languages
- syntax-tree-visualizer.py               standalone syntax tree visualizer (a minimal sketch follows below)
- single-tree.svg                         exemplary syntax tree
- single-tree.pdf                         exemplary syntax tree
- media/                                  archived main experiments and drafts
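The standalone visualizer could be realized roughly as in the following minimal sketch, assuming spaCy's displacy renderer and the en_core_web_sm model; the example sentence is made up, and the actual script in the repository may differ.

```python
# Minimal sketch of a standalone syntax tree visualizer, assuming spaCy's
# displacy renderer and the small English model; the sentence is illustrative
# and not taken from one of the analyzed policies.
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("We share your IP address with our analytics partners.")

# Render the dependency tree of the sentence as SVG markup and save it.
svg = displacy.render(doc, style="dep", jupyter=False)
with open("single-tree.svg", "w", encoding="utf-8") as out:
    out.write(svg)
```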
Elias Grünewald
- users have certain rights, communicated as transparency information, but are not able to comprehend them
- if data is transferred to multiple parties, the resulting data flow network is not visible
- lack of a representation format for describing transparency information
- what should a transparency representation format look like?
- how to automatically extract transparency information?
- how to extract data flow networks?
- define transparency representation format
- make use of existing corpora
- Privacy policies, e.g. the OPP-115 corpus
- Transparency information keywords or categories: a list of (sensitive) personal data terms such as name, birthday, bank account details, picture, IP address, …
- Third parties: a list of the top N companies and institutions
- use NLP for semantics extraction (link each transparency keyword to a third party, e.g. by distance; see the sketch after this list)
- save the resulting n-tuples (incl. purpose, duration) to the previously defined representation format
- visualize data flow networks
- may extend Polisis framework
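A minimal sketch of the distance-based linking step described above, assuming spaCy's built-in NER for third-party (ORG) mentions and a PhraseMatcher over an illustrative personal data term list; the term list, the distance threshold, and the input file are assumptions, not the project's actual configuration:

```python
# Sketch: link personal data keywords to the closest third-party mention.
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")

# Illustrative term list; the project's actual keyword/category list would be plugged in here.
PERSONAL_DATA_TERMS = ["name", "birthday", "bank account details", "picture", "IP address"]

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("PERSONAL_DATA", [nlp.make_doc(term) for term in PERSONAL_DATA_TERMS])

def link_keywords_to_third_parties(policy_text, max_token_distance=50):
    """Return (keyword, third party, token distance) tuples for nearby mentions."""
    doc = nlp(policy_text)
    third_parties = [ent for ent in doc.ents if ent.label_ == "ORG"]
    links = []
    for _, start, end in matcher(doc):
        keyword = doc[start:end]
        # link each keyword occurrence to the closest organisation mention
        closest = min(third_parties, key=lambda ent: abs(ent.start - start), default=None)
        if closest is not None and abs(closest.start - start) <= max_token_distance:
            links.append((keyword.text, closest.text, abs(closest.start - start)))
    return links

if __name__ == "__main__":
    with open("arstechnica.com.txt", encoding="utf-8") as f:
        for entry in link_keywords_to_third_parties(f.read()):
            print(entry)
```

The resulting tuples could then be enriched with purpose and duration before being written to the representation format.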
- the transparency representation format is defined as a JSON example/schema (an illustrative example follows after this list)
- make use of established NLP frameworks and services such as TensorFlow, PyTorch, the Google Natural Language API, or Amazon Comprehend
- common web technologies for visualization
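Purely as an illustration, one entry of such a JSON-based format might look like the following; all field names and values are assumptions, not the schema actually defined for the project:

```python
# Purely illustrative example of one serialized n-tuple; field names and
# values are assumptions, not the project's actual JSON schema.
import json

example_entry = {
    "dataCategory": "IP address",
    "thirdParty": "Example Analytics Inc.",  # hypothetical recipient
    "purpose": "web analytics",
    "storageDuration": "P6M",                # ISO 8601 duration (six months)
}

print(json.dumps(example_entry, indent=2))
```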
- evaluate on unseen privacy policies using precision, recall, and F1-score (see the sketch after this list)
- measure performance
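A minimal sketch of the metric computation, assuming gold annotations (y_true) and extracted labels (y_pred) are available as parallel lists; the labels shown are illustrative only:

```python
# Sketch of the evaluation step on unseen policies; values are illustrative.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["IP address", "name", "O", "birthday"]
y_pred = ["IP address", "O", "O", "birthday"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro", zero_division=0
)
print(f"precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")
```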