Medicare Fraud Detection

The purpose of this notebook is to perform exploratory data analysis, data preprocessing, feature selection, feature engineering, model training and evaluation for fraud detection using the Medicare Fraud Detection dataset.

Problem Statement

  • The USA offers a government-backed health insurance program, typically for elderly individuals, to cover their medical bills for hospital visits and treatment.
  • It has been found that doctors, providers, beneficiaries and their associates engage in unfair practices to inflate claim amounts for their own benefit. For example, doctors submit claim forms listing more sophisticated tests, diagnoses or procedures than were actually performed, or use fake patient data, so as to receive a larger insurance payment than is rightfully due.
  • Due to these practices, insurance providers suffer heavy losses every year, which affects the smooth running of their business. Hence they increase premiums, which makes affordable health care accessible to only the elite few.
  • Given the dataset with information regarding beneficiaries, inpatients and outpatients, our target is to build a suitable model that predicts how likely a healthcare provider is to commit health insurance fraud.

Dataset Description

The following is a description of the available attributes of the entities in the dataset.
Note that I did not find any data schema definition, so the following reflects my own understanding. I spent a lot of time understanding each and every column, since understanding the data and its context is the most important step in any data science project. The context here is how the US Medicare system and insurance claim filing work, and the ways in which this system is exploited.

  • Provider - Provider ID
  • PotentialFraud - Yes if the provider is possibly fraudulent, else No
  • BeneID - Beneficiary ID
  • DOB - Date of Birth
  • DOD - Date of Death
  • Gender, Race, State, County - Self explanatory
  • ChronicCond_Heartfailure, ChronicCond_Alzheimer, ChronicCond_KidneyDisease, ChronicCond_Cancer, ChronicCond_ObstrPulmonary, ChronicCond_Depression, ChronicCond_Diabetes, ChronicCond_IschemicHeart, ChronicCond_Osteoporasis, ChronicCond_rheumatoidarthritis, ChronicCond_stroke - Binary field of 1/2 to specify whether BeneID had these chronic conditions or not
  • RenalDiseaseIndicator - 0 if BeneID has no indication of renal diseases, else Y
  • NoOfMonths_PartACov - Number of months of Part A coverage. The Medicare system has 4 types of coverage (Part A, B, C and D); Part A covers inpatient visits, treatment and nursing care. Eligibility for Part A: minimum age of 65 and 10 years of full-time work
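  • NoOfMonths_PartBCov - Number of months of Part B coverage. Part B covers outpatient care, such as doctor visits and other medical services that do not require being admitted to hospital overnight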
  • IPAnnualReimbursementAmt - amount Medicare will pay for inpatient visits
  • IPAnnualDeductibleAmt - amount the beneficiary must pay to qualify for the annual inpatient reimbursement
  • OPAnnualReimbursementAmt - amount Medicare will pay for outpatient visits
  • OPAnnualDeductibleAmt - amount the beneficiary must pay to qualify for the annual outpatient reimbursement
  • ClaimID, ClaimStartDt, ClaimEndDt, Provider - self explanatory
  • InscClaimAmtReimbursed - how much Medicare paid
  • AttendingPhysician, OperatingPhysician, OtherPhysician - IDs of the respective physicians
  • AdmissionDt - Date of admission
  • ClmAdmitDiagnosisCode - Diagnosis code assigned when the patient was admitted
  • DeductibleAmtPaid - amount the beneficiary actually paid, as opposed to the previously listed OPAnnualDeductibleAmt, which is how much they were supposed to pay
  • DischargeDt - date of discharge for inpatients
  • DiagnosisGroupCode - used to categorize inpatient visits
  • ClmDiagnosisCode_1 to ClmDiagnosisCode_10 - There is an official categorization of diseases called ICD (International Classification of Diseases). Diagnosis Code 1 is the most important diagnosis, specifying the major disease; subsequent codes are for subsequent diagnoses. Based on the codes given, the dataset appears to use ICD-9. This is the most important attribute, as it specifies the disease in terms of medical coding, and any mistake here will lead to rejection of the claim form.
  • ClmProcedureCode_1 to ClmProcedureCode_6 - Codes for the procedure which was allegedly performed given the diagnosis code
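
As a rough sketch of how the attributes described above might be prepared for provider-level modelling (not necessarily how the notebook does it), the snippet below decodes the beneficiary flags, derives a claim duration, and rolls claims up to one row per Provider. Several things here are assumptions: the use of pandas, the reading of the 1/2 chronic condition flags as 1 = present and 2 = absent, the toy rows (invented for illustration, not taken from the dataset), and the hypothetical names decode_beneficiaries, provider_table, IsDeceased, ClaimDurationDays, ClaimCount, TotalReimbursed and MeanClaimDurationDays.

```python
import pandas as pd

# Toy rows invented purely for illustration; they are not real dataset records.
beneficiaries = pd.DataFrame({
    "BeneID": ["B1", "B2"],
    "DOD": [None, "2009-03-01"],
    "ChronicCond_Diabetes": [1, 2],
    "RenalDiseaseIndicator": ["0", "Y"],
})
claims = pd.DataFrame({
    "ClaimID": ["C1", "C2", "C3"],
    "BeneID": ["B1", "B2", "B2"],
    "Provider": ["P1", "P1", "P2"],
    "ClaimStartDt": ["2009-01-01", "2009-02-10", "2009-02-20"],
    "ClaimEndDt": ["2009-01-04", "2009-02-10", "2009-02-25"],
    "InscClaimAmtReimbursed": [1000, 200, 500],
})
labels = pd.DataFrame({"Provider": ["P1", "P2"], "PotentialFraud": ["Yes", "No"]})


def decode_beneficiaries(df):
    """Turn the coded beneficiary fields into 0/1 indicators."""
    out = df.copy()
    chronic_cols = [c for c in out.columns if c.startswith("ChronicCond_")]
    # Assumption: 1 means the condition is present, 2 means it is absent.
    out[chronic_cols] = (out[chronic_cols] == 1).astype(int)
    # Per the description above: "Y" indicates renal disease, 0 indicates none.
    out["RenalDiseaseIndicator"] = (
        out["RenalDiseaseIndicator"].astype(str) == "Y"
    ).astype(int)
    # Hypothetical derived flag: a recorded date of death marks a deceased beneficiary.
    out["IsDeceased"] = out["DOD"].notna().astype(int)
    return out


def provider_table(claims, beneficiaries, labels):
    """Join claims with beneficiaries and aggregate to one row per Provider."""
    df = claims.merge(decode_beneficiaries(beneficiaries), on="BeneID", how="left")
    start = pd.to_datetime(df["ClaimStartDt"])
    end = pd.to_datetime(df["ClaimEndDt"])
    df["ClaimDurationDays"] = (end - start).dt.days
    agg = df.groupby("Provider", as_index=False).agg(
        ClaimCount=("ClaimID", "count"),
        TotalReimbursed=("InscClaimAmtReimbursed", "sum"),
        MeanClaimDurationDays=("ClaimDurationDays", "mean"),
    )
    agg = agg.merge(labels, on="Provider", how="left")
    # Encode the target as 1 for "Yes" and 0 for "No".
    agg["PotentialFraud"] = (agg["PotentialFraud"] == "Yes").astype(int)
    return agg


providers = provider_table(claims, beneficiaries, labels)
print(providers)
```

Aggregating to the provider level matches the stated target, since PotentialFraud is given per Provider while the remaining attributes are recorded per beneficiary or per claim.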

Contents of repository

  • Notebook containing EDA, feature engineering, feature selection, model building and evaluation
  • Inference script to generate predictions on any given data
  • Model predictions on unseen data, stored in an Excel file
  • List of dependencies and their versions