This script identifies "bad" rows in a dataset by checking them against the regex from the python-database-sanitizer library. This helps narrow down why data is failing to be sanitized.
- Install Python 3.6 or newer to run the script.
- Make a copy of `data-example.py`. The `.gitignore` is set up to ignore `data.py`. From this directory, run:
cp data-example.py data.py
- Open `data.py` and insert the rows in question into the array.
- Run the script:
python3 main.py
- When finished, delete the `data.py` file so that no PHI is left sitting on your machine.
rm data.py
The script logs every row that fails the regex, along with details on which characters caused the failure. Rows are flagged if they DO NOT match the regex.
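
To illustrate the flagging behaviour described above, here is a minimal sketch of this kind of check. It is not the code in `main.py`. The pattern `ALLOWED` and the function `find_bad_rows` are hypothetical names made up for this example, and the real regex from python-database-sanitizer is not reproduced here. The sketch also assumes the pattern is a simple character class, so that testing each character on its own is enough to find the offenders.

```python
import re

# Hypothetical pattern for illustration only: the real regex lives in
# python-database-sanitizer. This one accepts printable ASCII characters.
ALLOWED = re.compile(r"[\x20-\x7E]*")


def find_bad_rows(rows, pattern=ALLOWED):
    """Return (index, row, offending_chars) for every row that does NOT match."""
    bad = []
    for index, row in enumerate(rows):
        if pattern.fullmatch(row) is None:
            # Test each character on its own to see which ones break the match.
            # This only works because the pattern is a plain character class.
            offenders = sorted({ch for ch in row if pattern.fullmatch(ch) is None})
            bad.append((index, row, offenders))
    return bad


if __name__ == "__main__":
    # Made-up sample rows; never commit real PHI.
    rows = ["plain ascii row", "caf\u00e9 au lait", "tab\tseparated"]
    for index, row, offenders in find_bad_rows(rows):
        print(f"row {index}: {row!r} rejected because of {offenders!r}")
```

Rows that fully match the pattern are skipped, so only the rows that would trip up sanitization appear in the output.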