D.O.M.E. (Detection Of Misstatements Engine)

  • Protect the corporate world from misstatements!

D.O.M.E. is a data product that takes in financial statements and identifies whether each statement is misstated.

Who can potentially benefit?

The following groups can benefit from our data product:

  • Auditors can get a better idea of which corporations are more likely to misrepresent their fiscal outlook by filing misstatements.
  • Investors can be aware of misstatement risks before investing in any corporation.

How does it work?

D.O.M.E. uses machine learning and big data analytics to classify any financial statement as either misstated or not misstated with about 82% accuracy.


rf_and_logistic_notebook.ipynb: new version of the random forest and logistic regression code, with results printed.

How to Run:


Data files for this app can be found on the SFU cluster, on HDFS under /user/vcs/

File Dictionary:


Code for integrating financial reports from Compustat with AAER and IBES data.

data_integration.py: merges the annual Compustat, AAER and IBES datasets.

aaer_labelling.py: contains a custom UDF for labelling records as misstatement or not misstatement. Used for creating our class label.

ibes_integration_fix-Copy1.ipynb: fix for a bug in which annual data was joined with IBES using incorrect join conditions.
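The labelling step described for aaer_labelling.py can be pictured with a minimal plain-Python sketch. It is not the code in that file: the lookup table, the firm identifiers, the years and the function name `label_record` are all made up for illustration, and it assumes that a record is identified by a firm and a fiscal year.

```python
from typing import Dict, Set

# Hypothetical AAER lookup: firm identifier -> fiscal years flagged as misstated.
# These identifiers and years are invented for the example.
AAER_MISSTATED_YEARS: Dict[str, Set[int]] = {
    "FIRM_A": {2001, 2002},
    "FIRM_B": {2005},
}


def label_record(firm_id: str, fiscal_year: int,
                 aaer: Dict[str, Set[int]] = AAER_MISSTATED_YEARS) -> int:
    """Return 1 if the firm-year appears in the AAER lookup, otherwise 0."""
    return 1 if fiscal_year in aaer.get(firm_id, set()) else 0


# Label a few example firm-year records.
records = [("FIRM_A", 2001), ("FIRM_A", 2003), ("FIRM_C", 2005)]
labels = [label_record(firm, year) for firm, year in records]
print(labels)
```

In a Spark job, a function of this shape could be wrapped with `pyspark.sql.functions.udf` and applied to the joined DataFrame; whether the project does exactly that is an assumption.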


Code for the heatmap and the number-of-misstatements-per-industry plots.



industry_wise_segmentation.py: chart of the number of corporations with misstatements



corr_matrix_plot_2-Copy1 (1).ipynb: code for making the correlation matrix heatmap

num_aaer_per_firm-Copy1.ipynb: finds the number of firms with an AAER

timeseries.py: code for time series plots involving Earnings per Share

Visualise_PCA_clusters.ipynb: visualization for PCA clusters
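For readers unfamiliar with what goes into a correlation matrix heatmap, the sketch below computes pairwise Pearson correlations in plain Python. It is only an illustration: the feature names and values are invented, and the notebook in this repository may compute the matrix differently.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Nested dict holding the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Invented feature columns, not taken from the project data.
features = {
    "feature_a": [1.0, 2.0, 3.0, 4.0],
    "feature_b": [2.0, 4.0, 6.0, 8.0],
    "feature_c": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(features)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Each cell of the heatmap would then be coloured by the corresponding value in `matrix`.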




Location of the logistic regression and random forest code.

rf_and_logistic.py: code for random forest and logistic regression

clustering_kmeans.py: code for k-means clustering on PCA components

old_version_rf_and_logistic.ipynb: old version of random forest and logistic regression with results printed.

experimental: sandbox for code
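As a rough picture of what k-means clustering on PCA components involves (the task listed for clustering_kmeans.py), here is a small plain-Python sketch. It assumes each record has already been reduced to two components; the points, the starting centroids and the function name `kmeans` are hypothetical and not taken from the project code.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kmeans(points: Sequence[Point], centroids: Sequence[Point],
           iterations: int = 10) -> Tuple[List[Point], List[int]]:
    """Plain k-means: alternate nearest-centroid assignment and centroid update."""
    centroids = list(centroids)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # leave a centroid in place if it has no members
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Invented 2-D points standing in for the first two PCA components.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignments = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(assignments)
print(centroids)
```

Fixed starting centroids keep the example deterministic; a real run would normally pick them at random or with a library routine.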


old code that contains previous iterations of current code.

data_integration_obsolete: old code for integration

machine_learning_obsolete: old code for ML

nullcount: folder containing the number of null observations for each feature attribute
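The kind of per-feature null count stored in the nullcount folder can be sketched as follows. This is an assumption about the general idea, not the code that produced those files; the rows and feature names are invented, and missing values are represented as `None`.

```python
from typing import Dict, List, Optional


def count_nulls(rows: List[Dict[str, Optional[float]]]) -> Dict[str, int]:
    """Count how many rows hold None for each feature attribute."""
    counts: Dict[str, int] = {}
    for row in rows:
        for name, value in row.items():
            counts[name] = counts.get(name, 0) + (1 if value is None else 0)
    return counts


# Invented records; None marks a missing observation.
rows = [
    {"feature_a": 1.0, "feature_b": None},
    {"feature_a": None, "feature_b": None},
    {"feature_a": 3.0, "feature_b": 2.0},
]
null_counts = count_nulls(rows)
print(null_counts)
```

Counts like these help decide which feature attributes are too sparse to use in a model.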




materials for poster


materials for report


materials for slides which were made for the video.

