Automatically parse and summarize Terms of Service and Privacy Policies with custom NLP techniques

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

AutoTOS

Automatic Terms of Service and Privacy Policy parser and summarizer, powered by custom NLP techniques.

Created by Andrew Mascillaro, Spencer Ng, William Qin, and Eric Zheng. Winner of PennApps XXI's Best use of Google Cloud.

Installation

AutoTOS is developed using finetune, a powerful NLP library with specific dependencies. Either install finetune as instructed on its GitHub page or use the following instructions:

  1. (Recommended) Create a virtual environment using venv or conda:
       conda create -n autotos python=3.8
       conda activate autotos
       python -m spacy download en
  2. Install the requirements:
       pip install -r requirements.txt
  3. Run the script nlp/train.py as shown in Repository structure to load model-specific data

This should install all files and dependencies for development or for running the model locally.

Repository structure

This repository contains the following folders:

  • artifacts: automatically downloaded files and those generated from data on the TOS;DR website. See below for a complete description.
  • config: manually-created configuration files to facilitate TOS downloading, NLP model training, and deploying to the cloud
  • data: scripts to download and clean data from TOS;DR into formats used for ML training
  • nlp: scripts to test and train the custom NLP RoBERTa-based model
  • prediction: scripts to serve the API backend and run evaluation on the trained model
  • docs: code for the AutoTOS website frontend
  • gcp-docker: archival scripts used to integrate AutoTOS with Google Cloud Platform via Docker

Configuration files

  • cloudbuild.json: configuration for Google Cloud Build for training the NLP model on the cloud. Unused.
  • cookies.config: cookies used for the TOS;DR website login to parse TOS excerpts
  • mapped_classes.json: manually-created mapping between original classes.csv descriptions of privacy topics and combined classes for AutoTOS’s model. Created generally based on original classes with high appearance frequency and/or similar descriptions.
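
As a rough illustration of how a mapping like mapped_classes.json could be applied, the sketch below inverts a combined-class table into a lookup from original class descriptions. The JSON layout, the class names, and the helper functions are all assumptions made for this example; the actual file in config may be organized differently.

```python
import json

# Hypothetical layout: each combined class lists the original class
# descriptions it absorbs. The real mapped_classes.json may differ.
MAPPED_CLASSES_JSON = """
{
  "data_sharing": ["This service shares your data with third parties",
                   "Your data may be sold"],
  "tracking": ["This service tracks you on other websites"]
}
"""


def build_lookup(mapped):
    """Invert {combined: [originals]} into {original: combined}."""
    lookup = {}
    for combined, originals in mapped.items():
        for original in originals:
            lookup[original] = combined
    return lookup


def map_class(description, lookup):
    """Return the combined class, or None when the original class is not mapped."""
    return lookup.get(description)


lookup = build_lookup(json.loads(MAPPED_CLASSES_JSON))
print(map_class("Your data may be sold", lookup))
```

Original classes that do not appear in the table map to None here, which matches the idea of keeping only frequent or similar classes.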

Data pipeline

  1. Download all.json from source
  2. download_tos package: use all.json to fetch both the full text of TOS documents and the excerpts that correspond to points. Produces labeled_excerpts.csv and classes.csv. Modules:
       • __main__.py
       • fulltext.py
       • point_text.py
  3. cleanup.py: uses labeled_excerpts.csv, adds extraneous (noise) data from the full TOS (broken down by sentence) for training, and generates annotated_sentences.json
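
The noise-adding step performed by cleanup.py can be pictured with a minimal sketch like the one below. It is not the repository's implementation: the regex sentence splitter, the substring matching rule, and the record fields (`text`, `class_id`) are assumptions chosen to keep the example self-contained.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def annotate(full_text, excerpts):
    """Label each sentence with a class ID, or None for noise (padding).

    `excerpts` is a list of (excerpt_text, class_id) pairs. A sentence takes
    the class of the first excerpt that it contains or is contained in.
    """
    annotated = []
    for sentence in split_sentences(full_text):
        class_id = None
        for excerpt, cid in excerpts:
            if excerpt in sentence or sentence in excerpt:
                class_id = cid
                break
        annotated.append({"text": sentence, "class_id": class_id})
    return annotated


tos = "We value privacy. We may sell your data to partners. Contact us anytime."
result = annotate(tos, [("sell your data", 12)])
for record in result:
    print(record)
```

Sentences that match no excerpt keep a class ID of None, which plays the role of the padding data described for annotated_sentences.json.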

Data files and artifacts

  • all.json: data from TOS;DR’s GitHub repo with the category of various “points” (annotated excerpts from TOS documents), without the corresponding excerpt text. However, it links to relevant parts of the TOS;DR website that do contain these excerpts
  • labeled_excerpts.csv: list of excerpts from TOS texts. Each is labeled with the class ID and company slug
  • classes.csv: mapping between class IDs and the descriptive title, score, and frequency of each class
  • tos/*.txt: full terms of service texts, as generated by fulltext.py
  • annotated_sentences.json: sentences from TOS documents that either contain padding (sentences that belong to no class) or phrases labeled by class ID (as specified by classes.csv). Used directly by the training/testing model

NLP pipeline

  1. split.py: takes annotated_sentences.json and splits it into an 80/20 train/test set, written to train_filter.json and test_filter.json. Filters by the new classes in mapped_classes.json. All padding data (i.e. data without a class ID) are put into train_filter.json.
  2. train.py: takes train_filter.json and builds a RoBERTa-based NLP model for sentence segmentation via huggingface or finetune. In testing, the finetune-based model is slightly more precise, and it produces significantly fewer false positives in practice.
  3. test.py: takes the generated model and test_filter.json and prints out model statistics
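
The splitting rule described for split.py can be sketched as follows. This is a simplified illustration, not the script itself: the record format, the `split_dataset` helper, the fixed seed, and the choice to apply the 80/20 ratio to labeled records only are assumptions.

```python
import random


def split_dataset(records, kept_classes, test_fraction=0.2, seed=0):
    """Split labeled records 80/20 and send all padding to the training set.

    Records whose class ID is neither None nor in `kept_classes` are dropped,
    mimicking the filter by mapped classes.
    """
    labeled = [r for r in records if r["class_id"] in kept_classes]
    padding = [r for r in records if r["class_id"] is None]

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(labeled)

    n_test = int(len(labeled) * test_fraction)
    test = labeled[:n_test]
    train = labeled[n_test:] + padding
    return train, test


records = (
    [{"text": f"kept {i}", "class_id": 1} for i in range(10)]
    + [{"text": f"dropped {i}", "class_id": 99} for i in range(2)]
    + [{"text": f"padding {i}", "class_id": None} for i in range(3)]
)
train, test = split_dataset(records, kept_classes={1})
print(len(train), len(test))
```

With this rule the test set contains only labeled sentences, so padding influences training but not the held-out examples.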

TensorFlow-based models are output to the nlp/checkpoints folder.
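
The exact statistics printed by test.py are not listed here. As one plausible reading of "precision" and "false positives" in this setting, the sketch below scores per-sentence predictions where None stands for padding or no prediction. The function name and the input format are assumptions.

```python
def precision_recall(gold, predicted):
    """Compute precision and recall over per-sentence class IDs.

    A prediction counts as a true positive when it is not None and equals the
    gold class, and as a false positive when it is not None but differs from
    the gold class (including predictions made on padding sentences).
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p is not None and p == g)
    fp = sum(1 for g, p in pairs if p is not None and p != g)
    fn = sum(1 for g, p in pairs if g is not None and p != g)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


gold = [1, None, 2, None]
pred = [1, 2, None, None]
print(precision_recall(gold, pred))
```

Under this definition, a model that labels padding sentences as belonging to a class loses precision, which is the kind of false positive the comparison between the two training backends refers to.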

API connection

  • predictor.py
  • api.py

To be written...