Spark Demo

Summary

This pipeline can be broken down into the following major steps:

Read the source files data/gbr.jsonl and data/ofac.jsonl into Spark RDDs.
Use app.standardize_record to map each line in both input files to a standard datatype. The resulting app.Record values have consistent field names and field types.
Create a list of all unique pairs of records. Each unique pair consists of one GBR record and one OFAC record.
Use app.get_all_matches to identify any matching fields in each pair of records. If a match is found, add the field name and field value to a list of matches for that record pair.
Filter out the record pairs with empty lists of matches.
Drop any duplicate matches. For example, some alias names are repeated but should only appear once in the final output.
Write the output to data/output.jsonl. An archived version of this output has already been stored under data/archived_output.jsonl.
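
The steps above can be sketched in plain Python, without Spark, to show the shape of the data flow. This is only an illustration under stated assumptions: the field names (id, name, aliases), the helper functions standardize and get_matches, and the output layout are hypothetical stand-ins, not the actual app.standardize_record, app.get_all_matches, or app.Record definitions. In the Spark version, the pairing, mapping, and filtering would presumably be done with RDD operations such as cartesian, map, and filter.

```python
import json
from itertools import product


def standardize(raw: dict, source: str) -> dict:
    """Hypothetical stand-in for app.standardize_record (field names are assumed)."""
    return {
        "source": source,
        "id": raw.get("id"),
        "name": raw.get("name"),
        "aliases": tuple(raw.get("aliases") or ()),
    }


def get_matches(gbr: dict, ofac: dict) -> list:
    """Hypothetical stand-in for app.get_all_matches: returns (field, value) pairs."""
    matches = []
    if gbr["name"] and gbr["name"] == ofac["name"]:
        matches.append(("name", gbr["name"]))
    for alias in gbr["aliases"]:
        if alias in ofac["aliases"]:
            matches.append(("alias", alias))
    # Drop duplicate matches (e.g. repeated aliases) while preserving order.
    return list(dict.fromkeys(matches))


def run(gbr_lines, ofac_lines) -> list:
    """Standardize both inputs, pair every GBR record with every OFAC record,
    and keep only the pairs that have at least one match."""
    gbr = [standardize(json.loads(line), "gbr") for line in gbr_lines]
    ofac = [standardize(json.loads(line), "ofac") for line in ofac_lines]
    results = []
    for g, o in product(gbr, ofac):
        matches = get_matches(g, o)
        if matches:  # filter out pairs with an empty list of matches
            results.append({"gbr_id": g["id"], "ofac_id": o["id"], "matches": matches})
    return results
```

Each element returned by run could then be serialized with json.dumps and written as one line of the output file.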

A record in the GBR list was matched to a record in the OFAC list if one of the following criteria was met:

Names and aliases were compared after removing the special characters -, ., ,, and ' and converting the text to lowercase.
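
A minimal sketch of that normalization, assuming it is applied to each name or alias string before comparison. The function name normalize_name is hypothetical and may not match what app.py uses.

```python
# Translation table that deletes the four special characters: - . , '
_STRIP_CHARS = str.maketrans("", "", "-.,'")


def normalize_name(name: str) -> str:
    """Remove -, ., ,, and ' from the name, then convert it to lowercase."""
    return name.translate(_STRIP_CHARS).lower()
```

With this sketch, "O'Neil-Smith, Jr." and "ONeil Smith Jr" do not normalize to the same string, because whitespace is left untouched: only the four listed characters are removed.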

virtualenv .venv
source .venv/bin/activate

pip install -r requirements.txt

python app.py

spark-submit --master local[*] app.py