====
Url: https://data.cityofnewyork.us/Housing-Development/Property-Valuation-and-Assessment-Data/rgy2-tti8
Variables: RECORD, BBLE, BLOCK, LOT, EASEMENT, OWNER, BLDGCL, TAXCLASS, LTFRONT, LTDEPTH, STORIES, FULLVAL, AVLAND, AVTOT, EXLAND, EXTOT, EXCD1, STADDR, ZIP, EXMPTCL, BLDFRONT, BLDDEPTH, AVLAND2, AVTOT2, EXLAND2, EXTOT2, EXCD2, PERIOD, YEAR, VALTYPE.
- Build metrics and detect potentially fraudulent records
- Data Quality Report (DQR)
- Create 50 new, more insightful variables derived from the original variables
- Partition based on 7 key metrics
- Handle missing values and z-scale the variables
- PCA (visualization; decided to keep 13 PCs)
- Project the original features onto the directions of the 13 PCs
- Calculate each record's Euclidean distance to the origin as the fraud score
- Use the h2o package to implement an autoencoder
- Use the PCA output as the autoencoder input
- Set two hidden layers, each with five nodes
- Calculate each record's reconstruction MSE as the fraud score
- 76% overlap between the 10,000 highest fraud scores from the Euclidean distance and autoencoder methods
- Check the 30 records with the highest fraud scores and identify fraud patterns within them.
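
The PCA-distance score and the top-k overlap check described above can be sketched roughly as follows. This is a minimal illustration in Python with NumPy, not the project's actual code: the function names pca_fraud_scores and top_k_overlap are hypothetical, the data is synthetic, and the PC scores are assumed to be used without any further rescaling. The autoencoder step is not shown.

```python
import numpy as np


def pca_fraud_scores(X, n_components):
    """Z-scale the columns of X, project onto the top principal components,
    and return each row's Euclidean distance to the origin in that space."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    Z = (X - mean) / std
    # PCA via SVD: the rows of vt are the principal directions,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    pcs = Z @ vt[:n_components].T
    # Assumption: PC scores are not rescaled before measuring the distance.
    return np.sqrt((pcs ** 2).sum(axis=1))


def top_k_overlap(scores_a, scores_b, k):
    """Fraction of records shared by the k highest scores of two methods."""
    top_a = set(np.argsort(scores_a)[-k:].tolist())
    top_b = set(np.argsort(scores_b)[-k:].tolist())
    return len(top_a & top_b) / k


# Synthetic stand-in for the engineered variables (not the NYC data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[0] = 25.0  # plant one obvious outlier in row 0

scores = pca_fraud_scores(X, n_components=3)
most_suspicious = int(np.argmax(scores))
```

In the workflow above, n_components would be 13 and k would be 10,000, with the second score vector coming from the autoencoder's per-record MSE.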
==== Keep updating...