Author: Larissa Huang
This project demonstrates several fraud analysis techniques, including the following:
- Highly imbalanced fraud data
- Resampling data
- Tools: SMOTE, scikit-learn, train_test_split, matplotlib
- Logistic Regression, Decision Tree, Random Forest
- Performance metrics
- Hyperparameter optimization
- Ensemble methods (model weight adjustments)
- Tools: confusion matrix, classification report, roc_auc_score, precision-recall curve, GridSearchCV, VotingClassifier, Seaborn
- Customer segmentation
- K-means clustering to detect fraud by flagging outliers and small clusters
- DBSCAN clustering
- Tools: MiniBatchKMeans, silhouette score, homogeneity score, elbow curve
- Clean text data (tokenization, stopwords, stemming, lemmatization)
- Flag certain words and topics
- Topic modeling for fraud detection
- Topic visualization
- Tools: nltk, LDA, bag-of-words, doc2bow, pyLDAvis, gensim, corpora
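
As a rough illustration of the clustering item above (K-means used to flag outliers and small clusters), here is a minimal sketch. It is not the project's actual code. The synthetic data, the choice of `n_clusters=3`, and the 99th-percentile distance cutoff are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 500 "normal" points
# near the origin plus 5 distant points standing in for unusual activity.
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
unusual = rng.normal(loc=8.0, scale=0.5, size=(5, 2))
X = StandardScaler().fit_transform(np.vstack([normal, unusual]))

# Fit MiniBatchKMeans; n_clusters=3 is an arbitrary choice for this sketch.
kmeans = MiniBatchKMeans(n_clusters=3, random_state=0, n_init=10)
labels = kmeans.fit_predict(X)

# Distance from each point to the centroid of its assigned cluster.
distances = np.linalg.norm(X - kmeans.cluster_centers_[labels], axis=1)

# Flag points beyond the 99th percentile of distances
# (hypothetical cutoff; tune it for real data).
threshold = np.percentile(distances, 99)
flagged = distances > threshold

# Cluster sizes: unusually small clusters can also be worth reviewing.
cluster_sizes = np.bincount(labels, minlength=kmeans.n_clusters)

print("flagged points:", int(flagged.sum()))
print("cluster sizes:", cluster_sizes.tolist())
```

Distance-based flags and cluster sizes are complementary here: a tight group of unusual points may form its own small cluster and therefore sit close to its centroid, so checking both signals is usually safer than relying on one.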