D.O.M.E. (Detection Of Misstatements Engine)
- Protect the corporate world from misstatements!
D.O.M.E. is a data product that takes in financial statements and identifies whether each statement is misstated.
Who can potentially benefit?
The following groups can benefit from our data product:
- Auditors can get a better idea of which corporations are more likely to misrepresent their fiscal outlook by filing misstatements.
- Investors can be aware of misstatement risks before investing in any corporation.
How does it work?
D.O.M.E. uses machine learning and big data analytics to classify any financial statement as either misstated or not misstated with about 82% accuracy.
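As a reference point for the accuracy figure above, the following is a minimal sketch of how accuracy, precision and recall could be computed from true and predicted labels. The function name `classification_metrics` and the toy labels are illustrative assumptions; they are not taken from the project code and do not reproduce the project's results.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision and recall for binary labels (1 = misstated)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # misstated, flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # clean, not flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # clean, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # misstated, missed

    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels for illustration only (not project data).
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

If misstatements are rare in the data, a classifier can reach high accuracy while missing most of them, so precision and recall are worth reading alongside accuracy.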
Results
rf_and_logistic_notebook.ipynb
: new version of random forest and logistic regression, with results printed.
How to Run:
- Copy the file rf_and_logistic.py onto the SFU cluster from: https://github.com/chiu/accounting-ml-project/blob/master/machine_learning/rf_and_logistic
- Run the following command on the cluster:
spark-submit rf_and_logistic.py
Data:
Data files for this app can be found on the SFU cluster, on HDFS under /user/vcs/
File Dictionary:
data_integration:
code for integrating financial reports from Compustat with AAER and IBES data.
data_integration.py
: merges the annual Compustat, AAER and IBES datasets.
aaer_labeling.py
: contains a custom UDF for labelling records as misstatement or not misstatement; used to create our class label.
ibes_integration_fix-Copy1.ipynb
: fix for a bug in which annual data was joined with IBES using incorrect join conditions.
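The labelling step described above can be pictured as a lookup: a firm-year record is marked as a misstatement when it appears in the AAER data. The sketch below is a plain-Python illustration under that assumption. The key names (`gvkey`, `fyear`), the function `label_misstatement`, and the sample identifiers are hypothetical and are not taken from the project's labelling script.

```python
def label_misstatement(record, aaer_firm_years):
    """Return 1 if the record's (firm, fiscal year) pair is in the AAER set, else 0."""
    key = (record["gvkey"], record["fyear"])
    return 1 if key in aaer_firm_years else 0


# Hypothetical AAER firm-years, for illustration only.
aaer_firm_years = {("001234", 2001), ("005678", 2003)}

# Hypothetical annual records to be labelled.
records = [
    {"gvkey": "001234", "fyear": 2001},  # same firm and year as an AAER entry
    {"gvkey": "001234", "fyear": 2002},  # same firm, different year
    {"gvkey": "009999", "fyear": 2003},  # different firm
]
labels = [label_misstatement(r, aaer_firm_years) for r in records]
print(labels)
```

On Spark, a function like this could be wrapped with `pyspark.sql.functions.udf` and applied to a DataFrame; matching on both firm and year, rather than firm alone, is the kind of join condition that matters when combining annual data with other sources.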
eda:
code for the heatmap and the number-of-misstatements-per-industry plots.
AAPL.png
comparing_std_summ_std.ipynb
industry_wise_segmentation.py
: chart of the number of corporations with misstatements
timeseries.html
MSFT.png
corr_matrix_plot_2-Copy1 (1).ipynb
: code for making correlation matrix heat map
num_aaer_per_firm-Copy1.ipynb
: finding the number of firms with AAERs
timeseries.py
: code for time series plots involving Earnings per Share
Visualise_PCA_clusters.ipynb
: visualization for PCA clusters
heatmap_correlation_matrix.png
reasons_for_misstatement-Copy2.ipynb
machine_learning:
location of the logistic regression and random forest code.
rf_and_logistic_notebook.ipynb
new version of random forest and logistic regression with results printed.
rf_and_logistic.py
code for random forest and logistic regression
clustering_kmeans.py
code for k-means clustering on PCA components
old_version_rf_and_logistic.ipynb
old version of random forest and logistic regression with results printed.
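For readers unfamiliar with the clustering step listed above, here is a minimal plain-Python sketch of k-means (Lloyd's algorithm) on 2-D points that stand in for PCA components. It is not the code in clustering_kmeans.py, which may use a library implementation and a different initialisation; the function name `kmeans`, the fixed starting centroids, and the toy points are assumptions made for illustration.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Run Lloyd's algorithm from the given starting centroids.

    points and centroids are sequences of equal-length numeric tuples.
    Returns (final centroids, cluster index for each point).
    """
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(max_iter):
        # Assignment step: nearest centroid (Euclidean distance) for each point.
        new_assignments = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing, so the algorithm has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Toy 2-D points standing in for the first two PCA components (not project data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignments = kmeans(points, centroids=[points[0], points[3]])
print(assignments)
```

Starting centroids are passed in explicitly so the result is reproducible; library implementations usually pick them randomly or with a seeding heuristic.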
experimental: sandbox for code
obsolete:
old code that contains previous iterations of current code.
data_integration_obsolete
: old code for integration
machine_learning_obsolete
: old code for ML
nullcount
folder containing the number of null observations for each feature attribute
preprocessing
tableau_charts
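The nullcount output mentioned above records how many null observations each feature has. A minimal plain-Python sketch of that kind of count is shown here; the function name `count_nulls`, the column names, and the sample rows are hypothetical, and the project's own output was produced by different code.

```python
def count_nulls(rows):
    """Count None values per column across a list of dict rows."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            # Register every column seen, adding 1 only when the value is missing.
            counts[column] = counts.get(column, 0) + (1 if value is None else 0)
    return counts


# Hypothetical records with missing values, for illustration only.
rows = [
    {"revenue": 100.0, "total_assets": None, "eps": 1.2},
    {"revenue": None, "total_assets": None, "eps": 0.8},
    {"revenue": 250.0, "total_assets": 900.0, "eps": None},
]
null_counts = count_nulls(rows)
print(null_counts)
```

Counts like these can help decide which feature attributes are too sparse to keep before modelling.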
poster:
materials for poster
report:
materials for report
slides:
materials for slides which were made for the video.
Folder Contents:
.
├── LICENSE
├── README.md
├── data_integration
│ ├── DGLS_sheets_integration.ipynb
│ ├── aaer_labeling.py
│ ├── data_integration.py
│ └── ibes_integration_fix-Copy1.ipynb
├── eda
│ ├── AAPL.png
│ ├── MSFT.png
│ ├── Visualise_PCA_clusters.ipynb
│ ├── comparing_std_summ_std.ipynb
│ ├── corr_matrix_plot_2-Copy1\ (1).ipynb
│ ├── heatmap_correlation_matrix.png
│ ├── industry_wise_segmentation.py
│ ├── num_aaer_per_firm-Copy1.ipynb
│ ├── reasons_for_misstatement-Copy2.ipynb
│ ├── timeseries.html
│ └── timeseries.py
├── experimental
│ └── one_hot_encoding_experiment.py
├── machine_learning
│ ├── clustering_kmeans.py
│ ├── old_version_rf_and_logistic.ipynb
│ ├── performance_metricslogistic_with_validation.csv
│ ├── performance_metricslogisticregression.csv
│ ├── performance_metricslogisticregressionwithbestthreshold.csv
│ ├── performance_metricsrandomforest.csv
│ ├── performance_metricsrf_with_validation.csv
│ ├── rf_and_logistic.py
│ └── rf_and_logistic_notebook.ipynb
├── obsolete
│ ├── data_integration_obsolete
│ │ ├── attempt_1hot.py
│ │ ├── load_compustat.py
│ │ └── load_integrated_data.py
│ ├── machine_learning_obsolete
│ │ ├── logistic_balancing_weights.ipynb
│ │ ├── misstatement_detection-Copy15.ipynb
│ │ ├── misstatement_detection-Copy15.py
│ │ ├── nn-example.ipynb
│ │ ├── nn_integrated.py
│ │ └── nn_trial_2.py
│ ├── nullcount
│ │ ├── _SUCCESS
│ │ └── part-00000-e4758198-5946-4939-8f03-746415fb32de-c000.csv
│ ├── preprocessing
│ │ └── pca_take2-Copy2.ipynb
│ └── tableau_charts
│   └── Misstated\ count\ analysis(industrywise).twbx
├── poster
│ ├── Screenshot-2018-3-30\ rf_and_logistic_v2-Copy1.png
│ ├── cmpt733_vcs_poster_v1.pdf
│ ├── data_pipeline_flow.png
│ ├── data_pipeline_flow_v2.png
│ ├── heatmap_correlation_matrix.png
│ ├── methodology_final.jpg
│ ├── misstatements_per_industry.png
│ ├── num_aaer_vs_reason.png
│ ├── num_corp_with_misstatements.png
│ ├── pcaPlot.png
│ ├── pcaPlot2.png
│ ├── pca_plot.png
│ ├── poster-733\ (1)
│ │ ├── SFUBigData_logo.jpg
│ │ ├── beamerposter.sty
│ │ ├── beamerthemeconfposter.sty
│ │ ├── heatmap_correlation_matrix.png
│ │ ├── logo.png
│ │ ├── main.tex
│ │ ├── matplotlib.svg
│ │ ├── misstatements_per_industry.png
│ │ ├── num_aaer_vs_reason.png
│ │ ├── num_corp_with_misstatements.png
│ │ ├── pcaPlot.png
│ │ ├── pcaPlot2.png
│ │ ├── placeholder.jpg
│ │ ├── sample.bib
│ │ ├── tableau_viz.pdf
│ │ ├── v2_word_cloud_logistic_regression.png
│ │ ├── v3_word_cloud_logistic_regression.png
│ │ └── word_cloud_logistic_regression.png
│ ├── poster-733\ (10).pdf
│ ├── poster.pdf
│ ├── poster_for_cornerstone_printing_vc.pdf
│ ├── poster_for_printing_staples.pdf
│ ├── poster_presentation_final.pdf
│ ├── screenshot-2018-3-30_rf_and_logistic_v2-copy1_480.png
│ ├── tableau_viz.pdf
│ ├── v2_word_cloud_logistic_regression.png
│ ├── v3_word_cloud_logistic_regression.png
│ ├── word\ cloud.png
│ └── word_cloud_logistic_regression.png
├── report
│ ├── 733-paper-nips\ (16).pdf
│ └── report.pdf
└── slides
  ├── Detecting\ Misstatements\ v2.pptx
  ├── Detecting\ Misstatements.pptx
  └── random_forest_medium.png