/bva-decision-search-engine

An experimental BVA (Board of Veterans' Appeals) decision search engine that processes decisions and infers case outcomes, showing promising performance.

Primary Language: Jupyter Notebook | License: MIT

TUM - Legal Data Science and Informatics - Project

Summary

In this experimental project, a BVA decision search engine was developed that showed promising performance in processing decisions and inferring case outcomes. After a balanced partitioning of the data, three different sentence segmentation techniques were analyzed on the training set, and the resulting error analysis showed which segmenter is best suited for which task. Using Savelka's law-specific segmenter and spaCy's built-in tokenizer, 10k unlabeled BVA decisions were tokenized and fed into a 100-dimensional embedding model trained with FastText; the resulting nearest-neighbor relations were accurate, and the semantic similarity between neighbors holds up well in the legal context. On top of this, a linear classifier (Linear Support Vector Machine) and a non-linear classifier (Random Forest) were trained using TF-IDF and word-embedding featurization. In the end, the best trained classifier correctly labeled most of the sentence types, although there is still much room for improvement as future work. The following sections explain each step of the pipeline for building the BVA decision classifier in detail and justify the choices made.
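
To make the featurization and classification steps more concrete, the sketch below shows how a 100-dimensional FastText embedding model and a TF-IDF + Linear SVM pipeline could be set up with gensim and scikit-learn. The sentences, labels, and hyperparameters are placeholders for illustration, not the exact code or data used in the notebook.

```python
# Minimal sketch of the featurization and classification steps described above.
# The corpora, labels, and hyperparameters are placeholders, not the project's
# actual data or code; gensim >= 4 is assumed (older versions use size= instead
# of vector_size=).
from gensim.models import FastText
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Placeholder corpora: tokenized unlabeled sentences for the embeddings and
# labeled sentences for the classifier.
unlabeled_tokens = [
    ["the", "veteran", "appeals", "the", "decision"],
    ["service", "connection", "for", "tinnitus", "is", "granted"],
]
train_sentences = ["The veteran appeals the decision.", "Service connection is granted."]
train_labels = ["TypeA", "TypeB"]

# 100-dimensional FastText embeddings trained on the unlabeled corpus.
embeddings = FastText(sentences=unlabeled_tokens, vector_size=100,
                      window=5, min_count=1, epochs=10)

# TF-IDF featurization feeding a Linear Support Vector Machine.
tfidf_svm = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True)),
    ("svm", LinearSVC()),
])
tfidf_svm.fit(train_sentences, train_labels)
print(tfidf_svm.predict(["The Board grants service connection."]))
```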

Code Instructions

There are three entry points into this application: train.py, train_detailed.ipynb, and analyze.py. The first two train classifier(s), while the analyze.py script classifies the sentences of a given BVA decision using the embedding model and the best trained classifier, both saved in the "out" directory by running either of the train files. After setting up a local "Python 3.6" environment ($ conda create -n tum_ldsi_18 python=3.6) and installing the packages in requirements.txt, one can run train.py or train_detailed.ipynb to train a model, or analyze.py to classify a BVA decision. Note that train.py contains only the minimal steps to train a classifier, while the train_detailed.ipynb notebook contains the full pipeline, where multiple classifiers can be trained and compared and each of the processes mentioned above can be inspected. For that reason, please refer to train_detailed.ipynb to verify the results discussed above. Instructions for running the scripts can be found in their main functions. The analyze.py script can be executed as follows: $ python analyze.py ./data/bva_decision.txt.
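
For illustration, a minimal version of such an analysis step could look like the sketch below. The file name under "out", the joblib loading, and the naive sentence split are assumptions made for the example, not the repository's actual analyze.py implementation (which uses the project's segmenter and saved embeddings).

```python
# Hypothetical sketch of the classification step in analyze.py; the saved model
# file name, the joblib loading, and the naive sentence split are illustrative
# assumptions, not the repository's actual code.
import sys

import joblib


def analyze(decision_path):
    # Read the BVA decision to classify.
    with open(decision_path, encoding="utf-8") as f:
        text = f.read()

    # Load the best trained classifier saved by the train scripts
    # (file name assumed for illustration).
    classifier = joblib.load("out/best_classifier.joblib")

    # Naive sentence split as a stand-in for the project's sentence segmenter.
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]

    # Print the predicted type next to each sentence.
    for sentence, label in zip(sentences, classifier.predict(sentences)):
        print(f"{label}\t{sentence}")


if __name__ == "__main__":
    analyze(sys.argv[1])
```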

Additionally, the train.py script can be run either entirely from scratch or by reusing precomputed files, which skips the time-consuming preprocessing steps. In both cases, the path to the JSON file with the annotated documents and the directory containing the unlabeled documents must be passed to train.py. To generate and train everything from scratch, set the debug, generate_new, and train_new arguments to True when calling the respective functions inside the train function. If they are set to False and the unlabeled directory already contains the sentence-segmented decisions and the generated tokens, the classifiers are still trained from scratch, and the whole process takes around 2-3 minutes. The cached files must be named _sentence_segmented_decisions.json and _generated_tokens.json, respectively. Please note again that everything needed for this project task (the error analyses, the training of the different classifiers, etc.) is included in the train_detailed.ipynb notebook; train.py is only a simplified version. The project was developed following Python best practices to keep it maintainable.
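
The caching behavior described above could be sketched roughly as follows; the helper name and signature are hypothetical, and only the two JSON file names come from the project.

```python
# Rough sketch of the caching idea behind generate_new; the helper name and
# signature are hypothetical, only the two file names come from the project.
import json
import os


def load_cached_preprocessing(unlabeled_dir, generate_new=False):
    """Return cached sentence segmentation and tokens if they exist,
    otherwise indicate that the slow generation path is needed."""
    sentences_path = os.path.join(unlabeled_dir, "_sentence_segmented_decisions.json")
    tokens_path = os.path.join(unlabeled_dir, "_generated_tokens.json")

    if not generate_new and os.path.exists(sentences_path) and os.path.exists(tokens_path):
        # Fast path: reuse the precomputed files (roughly 2-3 minutes overall).
        with open(sentences_path, encoding="utf-8") as f:
            sentences = json.load(f)
        with open(tokens_path, encoding="utf-8") as f:
            tokens = json.load(f)
        return sentences, tokens

    # Slow path: segmentation and tokenization would be regenerated here.
    raise RuntimeError("Set generate_new=True to regenerate the cached files.")
```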