FraudHacker is an anomaly detection system for Medicare insurance claims data. I built FraudHacker using Python 3 along with various scientific computing and machine learning packages (NumPy, scikit-learn, and many others). For more background about why I built FraudHacker, please see my blog post on the subject. I will focus on the technical details here.
data/
: Contains a CSV file displaying the outlier count data generated by the anomaly labeling engine.

notebooks/
: Jupyter notebooks demonstrating various aspects of FraudHacker's workflow, including the outlier detection, physician ranking, and hyperparameter sweeping.

src/
: The actual source code for FraudHacker and the Flask app that displays its results to users.

Each directory has its own README file with more information.
FraudHacker uses clustering to perform outlier detection on Medicare claims data from the Centers for Medicare & Medicaid Services (CMS). Each record contains aggregated information about one type of procedure (for example, a blood draw) performed by one physician. This data was downloaded in CSV format and loaded directly into a PostgreSQL database, which is the starting point of FraudHacker's interaction with the data. FraudHacker extracts numerical values from this database and uses them to cluster the records for all of the physicians of a particular specialty (e.g. Neurology) in a particular state. The number of outlier (potentially fraudulent) procedures associated with each physician is tallied and output into a second database. The tallying could, in principle, be done on the fly by operating directly on the PostgreSQL database containing the CMS data, but it is much faster to pre-run the model and access the results. The outlier counts for each physician are then displayed to the user in the FraudHacker dashboard, which runs as a JavaScript-driven Flask app. I currently have a copy of FraudHacker running on an AWS EC2 instance. It can be found at http://www.fraudhacker.site.
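To make the clustering step concrete, here is a minimal sketch rather than FraudHacker's actual code. It assumes scikit-learn's DBSCAN as the clustering algorithm (the real detectors live in anomaly_tools.py and may use something else), and the two feature columns and the toy values are invented for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy feature matrix: one row per (physician, procedure) record.
# Columns (hypothetical): number of services, average submitted charge.
records = np.array([
    [10.0, 100.0],
    [11.0, 102.0],
    [9.0, 98.0],
    [10.0, 101.0],
    [12.0, 99.0],
    [50.0, 400.0],  # far from every other record
])

# DBSCAN assigns the label -1 to points that belong to no dense cluster.
# eps and min_samples are illustrative values chosen for this toy data.
labels = DBSCAN(eps=5.0, min_samples=3).fit_predict(records)

# Treat the unclustered (noise) points as outliers.
is_outlier = labels == -1
outlier_rows = np.flatnonzero(is_outlier)
print(outlier_rows)
```

With real claims data the features sit on very different scales, so some form of standardization before clustering would likely be needed; the sketch skips that to keep the distances easy to check by hand.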
A reader class, PandasDBReader (implemented in database_tools.py), reads the data from the PostgreSQL database (whose connection info is specified in an external YAML file) and loads it into a Pandas DataFrame. This DataFrame is then ingested by an AnomalyDetector subclass (which one depends on the desired algorithm; these are implemented in anomaly_tools.py). The AnomalyDetector performs the actual clustering and outlier labeling, producing an outlier score for each record. A threshold on the outlier scores is used to formally label certain records as outliers. The AnomalyDetector class also adds up the outlier counts for each physician.
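The thresholding and tallying steps can be sketched in a few lines of Pandas. This is only an illustration of the idea, not the AnomalyDetector implementation: the column names (`physician_id`, `outlier_score`), the scores, and the threshold value are all hypothetical.

```python
import pandas as pd

# Toy records: one row per (physician, procedure type).
# The column names and scores are made up for illustration.
scored = pd.DataFrame({
    "physician_id": ["1001", "1001", "1002", "1002", "1003"],
    "procedure": ["A", "B", "A", "C", "A"],
    "outlier_score": [0.2, 3.1, 0.4, 0.1, 2.7],
})

THRESHOLD = 2.0  # hypothetical cutoff; records scoring above it are outliers

# Step 1: label records whose score exceeds the threshold.
scored["is_outlier"] = scored["outlier_score"] > THRESHOLD

# Step 2: add up the labeled records for each physician.
outlier_counts = (
    scored.groupby("physician_id")["is_outlier"]
    .sum()
    .astype(int)
    .rename("outlier_count")
    .reset_index()
)
print(outlier_counts)
```

Physicians with no flagged records still appear with a count of zero, which keeps the output complete for every physician in the specialty and state being analyzed.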
The next step is currently done semi-manually (this could be improved in the future). I export the outlier counts for each physician to a CSV file (an example of what this data looks like can be found in the data folder). The outlier count data is in turn imported into another PostgreSQL database, which is read directly by the Flask app. This is accomplished via another class, the OutlierCountDBReader (also implemented in database_tools.py). The OutlierCountDBReader produces the values that are ultimately displayed in the Flask app.
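The export and import round trip described above could look roughly like the following. To keep the sketch runnable without a database server, an in-memory SQLite database stands in for the second PostgreSQL database; the table layout, the CSV columns, and the `top_outlier_physicians` helper are assumptions for illustration and do not reflect the actual OutlierCountDBReader interface.

```python
import csv
import io
import sqlite3

# Hypothetical outlier-count export; the real file's columns may differ.
csv_text = """physician_id,outlier_count
1001,4
1002,0
1003,7
"""

# An in-memory SQLite database stands in for the second PostgreSQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outlier_counts ("
    "physician_id TEXT PRIMARY KEY, outlier_count INTEGER)"
)

# Import step: load each CSV row into the table.
rows = [
    (row["physician_id"], int(row["outlier_count"]))
    for row in csv.DictReader(io.StringIO(csv_text))
]
conn.executemany("INSERT INTO outlier_counts VALUES (?, ?)", rows)
conn.commit()


def top_outlier_physicians(connection, limit):
    """Return (physician_id, outlier_count) pairs, highest counts first."""
    cursor = connection.execute(
        "SELECT physician_id, outlier_count FROM outlier_counts "
        "ORDER BY outlier_count DESC LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()


print(top_outlier_physicians(conn, 2))
```

A query like this is cheap because the counts are precomputed, which is the reason for pre-running the model instead of clustering on each page load.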