This code was developed and tested on Ubuntu 13.10.
#Software Package Requirements
- NLTK (python)
- JPype (python)
- Stanford Tagger, Parser (Java)
#Files
- config.py: configuration settings used by the other scripts, for example the output directories and the Stanford parser home directory path
- fetcher.py: wrapper to handle the privacy URL GET requests
- sectioner.py: contains the sectioning code; privacy policies are split into sections mainly by their headings
- classify.py: contains code to classify the sections into collection, sharing, retention, etc
- information_type_extractor.py: contains code to extract information types from the sections produced by sectioner
- noun_phrase_extractor.py: extracts noun phrases as a python set
- noun_phrase_marker.py: marks noun phrases in the privacy policy (makes them clickable)
- stanford_utils.py: JPype interface to the Java Stanford tools
- utils.py: utility functions
- evaluate.py: calculates precision and recall against the manually annotated datasets (Travis, Fei datasets)
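
As an illustration of the kind of calculation evaluate.py is described as doing, here is a minimal sketch of precision and recall over sets. It is not the code in evaluate.py: the function name `precision_recall` and the sample data are hypothetical, and it assumes the extracted and annotated items can be compared by exact match.

```python
from __future__ import division  # true division on Python 2 as well


def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections, compared as sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets to avoid dividing by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical data, for illustration only.
extracted = {"email address", "credit card number", "browser type"}
annotated = {"email address", "credit card number", "ip address", "phone number"}
precision, recall = precision_recall(extracted, annotated)
print("precision=%.2f recall=%.2f" % (precision, recall))
```

With this sample data two of the three extracted items are correct and two of the four annotated items are found, so precision is 2/3 and recall is 1/2.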
#Running the code
All the scripts have two modes of operation: a single privacy policy URL, or multiple privacy policy URLs.
The examples below use Amazon's privacy policy URL, which can be replaced with any other privacy policy URL. If no argument is given, the script runs on the list of URLs specified in config.py.
NOTE: many privacy policies are not supported; an error is logged for each policy that is not supported.
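
The two modes can be pictured with the sketch below. It is only an assumption about how such a dispatch might look, not the scripts' actual code: `DEFAULT_URLS`, `select_urls`, `run` and `process_policy` are hypothetical names, and the example.com URLs stand in for the list kept in config.py.

```python
import logging
import sys

# Hypothetical stand-in for the list of URLs that config.py is described as holding.
DEFAULT_URLS = [
    "http://www.example.com/privacy",
    "http://www.example.org/privacy",
]


def select_urls(argv, default_urls):
    """Use the URL given on the command line, else fall back to the default list."""
    if len(argv) > 1:
        return [argv[1]]
    return list(default_urls)


def run(urls, process):
    """Call process(url) for each URL; log failures and keep going."""
    failed = []
    for url in urls:
        try:
            process(url)
        except Exception as exc:
            logging.error("could not process %s: %s", url, exc)
            failed.append(url)
    return failed


def process_policy(url):
    """Placeholder for the real work (sectioning, classification, ...)."""
    print("processing " + url)


if __name__ == "__main__":
    run(select_urls(sys.argv, DEFAULT_URLS), process_policy)
```

Catching the exception per URL means one unsupported policy is logged without stopping the rest of the batch.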
Examples
python test_sectioner.py "http://www.amazon.com/gp/help/customer/display.html/ref=footer_privacy?ie=UTF8&nodeId=468496"
python classify.py "http://www.amazon.com/gp/help/customer/display.html/ref=footer_privacy?ie=UTF8&nodeId=468496" > amazon.html
python information_type_extractor.py "http://www.amazon.com/gp/help/customer/display.html/ref=footer_privacy?ie=UTF8&nodeId=468496" > info_type.out
INFO: the output of this script is HTML files with marked sections
python noun_phrase_marker.py "http://www.amazon.com/gp/help/customer/display.html/ref=footer_privacy?ie=UTF8&nodeId=468496"
INFO: this script was used for computing noun phrase frequencies in the privacy policies
python noun_phrase_extractor.py np.out "http://www.amazon.com/gp/help/customer/display.html/ref=footer_privacy?ie=UTF8&nodeId=468496"
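
One plausible way to turn per-policy noun-phrase sets into frequencies is sketched below. This is an assumption for illustration, not the logic of noun_phrase_extractor.py: `phrase_frequencies` and the sample sets are hypothetical, and "frequency" is taken here to mean the number of policies in which a phrase appears.

```python
from collections import Counter


def phrase_frequencies(phrase_sets):
    """Count, for each noun phrase, how many of the given sets contain it."""
    counts = Counter()
    for phrases in phrase_sets:
        # set() ensures a phrase is counted at most once per policy.
        counts.update(set(phrases))
    return counts


# Hypothetical noun-phrase sets, one per privacy policy.
policy_a = {"personal information", "email address", "third parties"}
policy_b = {"personal information", "cookies"}
counts = phrase_frequencies([policy_a, policy_b])
print(counts["personal information"])  # 2
```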