This repository contains a system for detecting potential tax fraud in financial data.
- data_pipeline.ipynb: Jupyter notebook containing the data generation, exploration, feature engineering, model training, and evaluation pipeline.
- app.py: Streamlit application for deploying the trained model as a web app for real-time fraud prediction. Note: The trained_model.pkl file generated by the Jupyter notebook is not included in the repository due to its potential size.
- Data Generation: Simulates a dataset of financial transactions with features like income, expenses, tax liability, and fraud indicators.
- Data Exploration: Analyzes the generated data to understand the relationships between features and potential fraud.
- Feature Engineering: Creates a "Fraud" feature using anomaly detection techniques.
- Model Training: Trains a Random Forest classification model to predict tax fraud based on financial data.
- Model Evaluation: Evaluates the performance of the trained model using metrics like accuracy, precision, recall, F1 score, and ROC-AUC score.
- Model Saving: Saves the trained model as trained_model.pkl for later use in the web app.
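
The notebook steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names (`Income`, `Expenses`, `Tax_Liability`), the distributions, the use of `IsolationForest` as the anomaly detector, and all hyperparameters are assumptions made for the example.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Hypothetical feature names; the notebook may use different ones.
FEATURES = ["Income", "Expenses", "Tax_Liability"]


def simulate_transactions(n_rows=2000, seed=0):
    """Generate a toy dataset; the distributions are illustrative only."""
    rng = np.random.default_rng(seed)
    income = rng.normal(60_000, 15_000, n_rows).clip(min=5_000)
    expenses = income * rng.uniform(0.2, 0.8, n_rows)
    tax = (income - expenses) * 0.25
    df = pd.DataFrame(
        {"Income": income, "Expenses": expenses, "Tax_Liability": tax}
    )
    # Under-report tax on 5% of rows so there is something unusual to detect.
    odd = rng.choice(n_rows, size=n_rows // 20, replace=False)
    df.loc[odd, "Tax_Liability"] *= 0.1
    return df


def label_with_anomaly_detection(df, contamination=0.05, seed=0):
    """Add a "Fraud" column (1 = flagged) using an assumed IsolationForest."""
    detector = IsolationForest(contamination=contamination, random_state=seed)
    # fit_predict returns -1 for anomalies and 1 for inliers.
    flags = detector.fit_predict(df[FEATURES])
    return df.assign(Fraud=(flags == -1).astype(int))


def train_and_evaluate(df, seed=0):
    """Train a Random Forest on the labelled data and compute test metrics."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[FEATURES], df["Fraud"],
        test_size=0.25, stratify=df["Fraud"], random_state=seed,
    )
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]  # probability of class 1
    metrics = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
        "roc_auc": roc_auc_score(y_test, proba),
    }
    return model, metrics


if __name__ == "__main__":
    data = label_with_anomaly_detection(simulate_transactions())
    model, metrics = train_and_evaluate(data)
    print(metrics)
    # Save the model under the file name the web app expects.
    with open("trained_model.pkl", "wb") as fh:
        pickle.dump(model, fh)
```

Because the "Fraud" column in this sketch is itself produced by an anomaly detector, the reported metrics measure how well the Random Forest reproduces those flags, not how well it finds confirmed fraud.
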
- Upload a CSV file containing financial transaction data.
- View the uploaded data.
- Make real-time predictions on whether each transaction is likely fraudulent using the trained model.
- Download the predicted data with a new "Predicted Fraud" column.
- View basic exploratory data analysis (EDA) visualizations of the uploaded data, including:
  - Value counts of predicted fraud
  - Correlation heatmap

Note: This web app requires the trained_model.pkl file to be present in the same directory so that it can load the trained model.
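
The prediction and download steps above could be factored into small helpers like the ones below. This is a hedged sketch rather than the code in app.py: the function names and the `feature_columns` argument are hypothetical, and the Streamlit wiring is left out so the helpers can be run on their own.

```python
import pickle

import pandas as pd


def load_model(path="trained_model.pkl"):
    """Load the pickled model saved by the notebook."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def add_predictions(df, model, feature_columns):
    """Return a copy of df with a "Predicted Fraud" column appended."""
    missing = [c for c in feature_columns if c not in df.columns]
    if missing:
        raise ValueError(f"Uploaded CSV is missing columns: {missing}")
    result = df.copy()
    result["Predicted Fraud"] = model.predict(df[feature_columns])
    return result


def to_csv_bytes(df):
    """Encode the predicted data as CSV bytes for a download button."""
    return df.to_csv(index=False).encode("utf-8")


def fraud_counts(df):
    """Value counts of the predictions, as shown in the EDA view."""
    return df["Predicted Fraud"].value_counts()


def correlation(df):
    """Correlation matrix of the numeric columns, the input to a heatmap."""
    return df.select_dtypes("number").corr()
```

In a Streamlit script, the uploaded file from `st.file_uploader` would be read with `pd.read_csv`, passed through `add_predictions`, and the result of `to_csv_bytes` handed to `st.download_button`.
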
- Python 3.x
- Jupyter Notebook
- Streamlit
- pandas
- matplotlib
- seaborn
- scikit-learn
- pickle (part of the Python standard library; no separate installation needed)
- Clone this repository.
- Install the required libraries with pip install -r requirements.txt (this assumes a requirements.txt file listing the dependencies is present).
- For the data pipeline:
  - Open data_pipeline.ipynb in Jupyter Notebook and run all cells to generate the data, train the model, and save it.
- For the web app:
  - Run streamlit run app.py from the command line in the project directory.
  - This launches the Streamlit app in your web browser, where you can upload a CSV file and view predictions.
- Go to the application link: https://taxpayer-fraud-detection.streamlit.app/
- Download Test_Dataset.csv from the GitHub repository and upload it to the app (for testing only).
- Enjoy the output ✨🍿
- You can replace the simulated data generation in the Jupyter notebook with your actual financial data for training the model.
- The web app provides a basic set of EDA visualizations. You can customize it further to include additional visualizations based on your needs.
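
As a starting point for swapping in real data, the simulated-data step could be replaced by a loader along these lines. This is only a sketch under assumptions: the required column names are hypothetical and should match whatever features the notebook actually trains on.

```python
import io

import pandas as pd

# Hypothetical column names; adjust to the features used in the notebook.
REQUIRED_COLUMNS = ["Income", "Expenses", "Tax_Liability"]


def load_training_data(source, required_columns=REQUIRED_COLUMNS):
    """Read a CSV (path or file-like object) and keep only usable rows."""
    df = pd.read_csv(source)
    missing = sorted(set(required_columns) - set(df.columns))
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    df = df.copy()
    for col in required_columns:
        # Non-numeric entries become NaN and are dropped below.
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df.dropna(subset=required_columns).reset_index(drop=True)


if __name__ == "__main__":
    sample = io.StringIO(
        "Income,Expenses,Tax_Liability\n"
        "50000,20000,7500\n"
        "61000,unknown,9000\n"
    )
    # The second row has a non-numeric expense and is dropped.
    print(load_training_data(sample))
```
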