/SU19SparkClass

Repository for the team project in the Summer 2019 Spark Class.

CSCI-GA.3033-001: Summer 2019 Team Project

Authors: Cody Gilbert, Fang Han, Jeremy Lao

This repository contains all the files used to create and execute the HMDA Data Exploration web application. The project is broken down into separate sections by subfolder; see each subfolder for details of execution, methodology, and application environments.

Software Used

  • Scala Ver. 2.11.8
  • Python Ver. 3.7
  • Plotly Ver. 3.10.0
  • Apache Spark Ver. 2.2.0 (big data analysis)
  • Apache Hadoop Ver. 2.6.0 with YARN
  • Python Ver. 3.6+
  • Apache Spark Ver. 2.4.3 (local machine version for webapp execution)

Hardware Used

Folders

The following folders are included as required:

  • data_ingest
  • etl_code
  • profiling_code
  • app_code
  • screenshots
  • test_code

/data_ingest

This folder contains the scripts used to ingest each dataset.

Included Files

  • get-zip-bash folder containing tools to fetch the zipped HMDA data
  • unzip-bash folder containing tools to unzip and store the HMDA codes data

HDFS Data

  • /user/jjl359/project/data/HMDA_2007_to_2017_codes.csv HMDA codes data.
  • /user/jjl359/project/data/HMDA_2007_to_2017.csv HMDA labels data
  • /user/fh643/InstitutionData/InstitutionData/data/panel_07-09/_*.csv HMDA nationwide institution panel data from 2007 to 2009
  • /user/fh643/InstitutionData/InstitutionData/data/panel_10-17/_*.csv HMDA nationwide institution panel data from 2010 to 2017
  • /user/cjg507/sparkproject/geometries/states/tl_2017_us_state.* US Census State Shapefile data
  • /user/cjg507/sparkproject/geometries/states/tl_2017_us_state.tl_2017_us_state.shp.xml US Census State Shapefile data metadata, schema, column names and explanations
  • /user/cjg507/sparkproject/geometries/states/tl_2017_us_county.* US Census County Shapefile data
  • /user/cjg507/sparkproject/geometries/states/tl_2017_us_state.tl_2017_us_county.shp.xml US Census County Shapefile data metadata, schema, column names and explanations

Use Bash Scripts for HMDA Zip Data Ingest

The bash scripts download the Home Mortgage Disclosure Act (HMDA) data zip files from the CFPB's website.

There are two sets of download, unzip, and concatenate scripts because the dataset is published in two representations: a larger one that spells out each field as descriptive text (labels), and a more compact one that encodes the text options as integers (codes).

How to execute the following scripts:

chmod +x *.bash
./<script_name> netID

Example:

./get_hmda_code_zip_files.bash jjl359

This should download the zip and subsequently unzip the LAR files into your scratch workspace.

Use Bash Script to Unzip Data Files into Scratch

The scripts form a pipeline and are meant to be executed in the following order:

  1. Unzip the files into the target folder
  2. Concatenate the files together

As noted above, there are two sets of unzip and concatenate scripts, one for each representation of the dataset (labels vs. integer codes).

How to execute the following scripts:

chmod +x *.bash
./<script_name> netID

Example:

./unzip_data.bash jjl359
./unzip_files_hmda_codes.bash jjl359

This unzips the LAR files into your scratch workspace.

To combine all 11 yearly files, use the concatenate bash scripts in the folder where the files live:

./concatenate.bash
./concatenate_hmda_codes.bash

This will ensure all the files are combined into one set, with only one set of headers.

Once the files are ready in scratch, they can be put into HDFS with:

hdfs dfs -put <filename> /user/<username>/<target-folder>

HMDA Nationwide Institution Panel Data

HMDA-related institution data are available as yearly CSV files in both panel and transmittal formats from 2007 to 2017 on the CFPB's HMDA website.

Panel format was chosen over transmittal because it contains richer information. All CSV files were downloaded from the aforementioned site, transported onto Dumbo via SCP, and subsequently uploaded onto Dumbo HDFS with the shell command

hdfs dfs -put /user/fh643/InstitutionData

These raw CSV files now reside at the following locations:

  • data from 2007 to 2009: /user/fh643/InstitutionData/InstitutionData/data/panel_07-09/_*.csv
  • data from 2010 to 2017: /user/fh643/InstitutionData/InstitutionData/data/panel_10-17/_*.csv

US Census Geography Data Acquisition

The county and state geometries are contained in shapefiles created by the US Census Bureau and were downloaded directly from the US government open data source Data.gov.

These data files were downloaded as tl_2017_us_county.zip and tl_2017_us_state.zip respectively. The files were then passed to the Dumbo cluster via SCP, and unzipped to the directories /home/cjg507/geometries/states and /home/cjg507/geometries/counties. The files were then uploaded to Dumbo HDFS with the command

hdfs dfs -put /home/cjg507/geometries sparkproject/

where the shapefile data is now contained in the following directories:

  • State: /user/cjg507/sparkproject/geometries/states
  • County: /user/cjg507/sparkproject/geometries/counties

/etl_code

This folder contains the code for data cleaning and pre-processing.

Script Files

  • src folder containing the scripts for HMDA ETL
  • pom.xml Maven dependencies file for the HMDA ETL scripts in src
  • runscript.sh Script used to run the Scala ETL code for US Census Geography data
  • translateShapefiles.scala Scala ETL code for translating US Census Shapefiles to JSON files
  • JoinData.scala contains scripts/functions used to join the HMDA data with the institution dataset

HDFS Data

  • /user/cjg507/sparkproject/geometries/states/stateGeom.json JSON-formatted state geometry data with the following schema:
Column Name | Data Type | Description
STFP | Int | Primary key. US Census State Federal Information Processing Standard (FIPS) code
STUSPS | String | US Postal Service state abbreviation
STNAME | String | State name
STLAT | Float | State midpoint latitude
STLON | Float | State midpoint longitude
STGEOM | String (WKT) | State boundary as a well-known text (WKT) formatted string; includes long strings (> 1000 chars) and commas
  • /user/cjg507/sparkproject/geometries/county/countyGeom.json JSON-formatted county geometry data with the following schema:
Column Name | Data Type | Description
CYKEY | Int | Primary key. Concatenation of the US Census state and county FIPS codes
STFP | Int | US Census State Federal Information Processing Standard (FIPS) code
CYFP | Int | US Census County Federal Information Processing Standard (FIPS) code
CYNAME | String | Full county name
CYFSTAT | String | County functional status; preserved as a potential identifier of special county classifications
CYLAT | Float | County midpoint latitude
CYLON | Float | County midpoint longitude
CYGEOM | String (WKT) | County boundary as a well-known text (WKT) formatted string; includes long strings (> 1000 chars) and commas

HMDA Data: Execute the Scala-Spark File Using Maven to ETL the Raw Data into Usable CSV Files

Compile the code with Maven, using the pom.xml in the folder, then submit the job:

/opt/maven/bin/mvn package
nohup spark2-submit --class DataPrep --master yarn target/scala-0.0.1-SNAPSHOT.jar  &

or deploy to the cluster:

spark2-submit --class DataPrep --deploy-mode cluster --executor-memory 100G --total-executor-cores 2048 target/scala-0.0.1-SNAPSHOT.jar

The columns of interest in the HMDA dataset are summarized below:

Column of Interest | Description | Used in Analysis | Range of Values | Data Type
as_of_year | The year that the data represents | Used | 2007 to 2017 | String or Integer
respondent_id | 9 to 12 digit identifier of the lending institution (ID RSSD or Tax ID) | Used | 1-9 digit ID RSSD or 9 digit tax identifier (xx-xxxxxxx) | String or Integer
agency_code | The agency that regulates the lender (for example the Fed or FDIC) | Not used | one to nine, representing different agencies | String or Integer (String if NA is submitted)
loan_type | Whether the loan is conforming / not conforming / jumbo | Used | four distinct values (Conventional / FHA / VA / FSA), though NA is submitted in some cases | String
property_type | 1-4 family home / multifamily / manufactured | Used | there are only three property types, but NA is sometimes submitted | String or Integer
loan_purpose | purchase / home improvement / refinancing | Not used | there are only three choices, but NA if nothing is indicated on the application | String or Integer
owner_occupancy | owner-occupied unit or buy-to-rent unit | Not used | there are three options (owner occupied / not owner occupied / not applicable) | String or Integer
loan_amount_000s | the amount of the loan being applied for | Used | the range is from 1,000 dollars to 99 million dollars, with some NA or blank values; the loan amount appears normally distributed between 1,000 and 500,000 dollars | Integer unless NA
preapproval | whether the loan application was pre-approved | Not used | three options | Integer or String
action_taken | whether the loan application was originated or denied | Used | there are 8 distinct actions that can be taken (such as loan originated or application denied) | String or Integer
state_code | the state where the property is located | Used | abbreviation or FIPS (Federal Information Processing Standard) number of the state | String or Integer
county_code | the county where the loan is located | Used | the name of the county or a three-digit FIPS number | String or Integer
applicant_sex | gender of the applicant | Used | male or female are the options | String or Integer
applicant_income_000s | income of the applicant | Used | the range of incomes is 1,000 to 99 million; the average income over the 11-year period is 3.3 million, but the weighted average income by frequency of applications is 98 thousand | String or Integer
rate_spread | the amount of interest charged above the conventional or market rate | Not used | |
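
As an illustration, the following spark-shell sketch shows the kind of column selection and cleaning the DataPrep job performs; the output path, casts, and filter conditions are assumptions, and the authoritative logic lives in the src folder:

import spark.implicits._

// Read the concatenated HMDA codes CSV produced by the data_ingest step
val rawHmda = spark.read.option("header", "true")
  .csv("/user/jjl359/project/data/HMDA_2007_to_2017_codes.csv")

// Keep only the columns of interest and cast the numeric fields,
// dropping records whose loan amount or income is "NA"
val hmda = rawHmda
  .select("as_of_year", "respondent_id", "agency_code", "loan_type", "action_taken",
          "state_code", "county_code", "applicant_sex", "applicant_income_000s", "loan_amount_000s")
  .filter($"loan_amount_000s" =!= "NA" && $"applicant_income_000s" =!= "NA")
  .withColumn("loan_amount_000s", $"loan_amount_000s".cast("int"))
  .withColumn("applicant_income_000s", $"applicant_income_000s".cast("int"))

// Write the cleaned subset back to HDFS (illustrative output location)
hmda.write.option("header", "true").csv("/user/jjl359/project/data/hmda_cleaned")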

HMDA Nationwide Institution Panel Data

Benefiting from the relatively small size of the institution data, its ETL was done interactively through the Spark shell. Scripts and functions used are aggregated into a Scala object for reusability in etl_code/JoinData.scala. As the file name indicates, the primary purpose of the ETL is to join the institution data with the HMDA data.

Selected columns from the institution data before they are joined with the HMDA data are:

Name | Description | Data Type
Year | reporting year | IntegerType
RespondentID | ten-digit number that uniquely identifies a reporting institution | LongType
RespondentName | name of the reporting institution | StringType
AgencyCode | the agency that regulates the lender | IntegerType
RespondentState | state in which the institution reported | StringType
RespondentCity | city in which the institution reported | StringType

As a result of the ETL operations, the institution list was successfully joined to the HMDA dataset on a compound primary key composed of RespondentID, RespondentState, and AgencyCode, which maps any record to a unique respondent name.

The joined dataset is saved on HDFS at:

  • Joined Data: /user/fh643/InstitutionData/HmdaJoined
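
A minimal spark-shell sketch of the compound-key join described above; the read options, output format, and exact column matches are assumptions, and the authoritative code is in etl_code/JoinData.scala:

// HMDA records and the institution panel data (paths from the sections above)
val hmda = spark.read.option("header", "true")
  .csv("/user/jjl359/project/data/HMDA_2007_to_2017_codes.csv")
val institutions = spark.read.option("header", "true")
  .csv("/user/fh643/InstitutionData/InstitutionData/data/panel_10-17/")

// Join on the compound key (respondent id, state, agency code) so that every
// HMDA record maps to a unique respondent name
// (the HMDA state_code may first need mapping to the state abbreviation; see the app_code section)
val joined = hmda.join(institutions,
  hmda("respondent_id") === institutions("RespondentID") &&
  hmda("state_code")    === institutions("RespondentState") &&
  hmda("agency_code")   === institutions("AgencyCode"))

joined.write.parquet("/user/fh643/InstitutionData/HmdaJoined")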

US Census Geography Shapefile Conversion

Shapefile data are distributed over several different files and require specialized tools to convert into Spark-readable text. The open-source Spark plugin GeoSpark provides a Scala ShapefileReader, among other tools, that was used to translate the data into Spark DataFrames. This process was executed with runscript.sh.

The Scala script translateShapefiles.scala was used to convert the shapefiles into the following JSON files:

  • State: /user/cjg507/sparkproject/geometries/states/stateGeom.json
  • County: /user/cjg507/sparkproject/geometries/county/countyGeom.json

The JSON files could then more easily be used to clean and profile the data in the following steps.
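
A minimal sketch of that conversion for the state shapefiles, assuming GeoSpark 1.x package names (the authoritative code is translateShapefiles.scala, launched via runscript.sh):

import org.apache.spark.api.java.JavaSparkContext
import org.datasyslab.geospark.formatMapper.shapefileParser.ShapefileReader
import org.datasyslab.geosparksql.utils.{Adapter, GeoSparkSQLRegistrator}

// Register GeoSpark's SQL functions and serializers with the current session
GeoSparkSQLRegistrator.registerAll(spark)

// Read the shapefile directory into a spatial RDD, then convert it to a
// DataFrame in which the geometry becomes a WKT string column
val jsc = JavaSparkContext.fromSparkContext(spark.sparkContext)
val stateRDD = ShapefileReader.readToGeometryRDD(jsc, "/user/cjg507/sparkproject/geometries/states")
val stateDF  = Adapter.toDf(stateRDD, spark)

// Persist as JSON for the cleaning and profiling steps
stateDF.write.json("/user/cjg507/sparkproject/geometries/states/stateGeom.json")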

The cleanMapData.scala script imports the geometry JSON files, drops unnecessary columns, and joins the county and state data on the STFP column, which contains the state FIPS number (a unique key for each US state) present in both geometry tables. The number of counties in each US state was calculated and saved to /user/cjg507/sparkproject/geometries/numStates.csv to analyze the county distribution over the data and to check for join errors. The number of records for each US geographical region was calculated and saved to /user/cjg507/sparkproject/geometries/numRegions.csv to analyze the overall county distribution.
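
A minimal sketch of the join and counting described above, assuming the column names from the schema tables (cleanMapData.scala is the authoritative version):

// Load the geometry JSON files produced in the previous step
val states   = spark.read.json("/user/cjg507/sparkproject/geometries/states/stateGeom.json")
val counties = spark.read.json("/user/cjg507/sparkproject/geometries/county/countyGeom.json")

// Join counties to states on the shared state FIPS key (STFP) and count the
// number of counties per state, as a sanity check on the join
val numCounties = counties
  .join(states.select("STFP", "STNAME"), Seq("STFP"))
  .groupBy("STNAME")
  .count()

numCounties.coalesce(1).write.option("header", "true")
  .csv("/user/cjg507/sparkproject/geometries/numStates.csv")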

/profiling_code

This folder contains the scripts and tools used to profile the input datasets.

Script Files

  • src folder containing the scripts for HMDA profiling
  • pom.xml Maven dependencies file for HMDA profiling scripts in src
  • cleanMapData.scala Scala code for cleaning and profiling map data
  • plotCounties.py Python code for plotting HMDA records by US Census geographical location
  • parse_fields.scala Scala scripts that prepared the HMDA data for profiling
  • compute_denial_per_fields.scala Scala code for generating summary statistics for several key parameters
  • compute_overall_denial.scala Scala code for generating state-wise summary statistics
  • heatMap_slider.ipynb Python Jupyter Notebook script used to plot the overall denial rate heatmap
  • plotParam.py Python script used to plot denial rates by parameters

Data Files

  • HMDACounties.html choropleth of the number of HMDA records by county

HMDA Data: Execute the Scala-Spark File Using Maven to Profile the Raw Data and Calculate Denial Rates by Various Feature Mixtures

Compile the code with Maven, using the pom.xml in the folder, then submit the profiling jobs:

/opt/maven/bin/mvn package
nohup spark2-submit --class DataProfiler --master yarn target/scala-0.0.1-SNAPSHOT.jar
nohup spark2-submit --class CalculateAverageDenialRate --master yarn target/scala-0.0.1-SNAPSHOT.jar  &

or deploy to the cluster:

spark2-submit --class DataProfiler --deploy-mode cluster --executor-memory 100G --total-executor-cores 2048 target/scala-0.0.1-SNAPSHOT.jar

spark2-submit --class CalculateAverageDenialRate --deploy-mode cluster --executor-memory 100G --total-executor-cores 2048 target/scala-0.0.1-SNAPSHOT.jar

Three Levels of Denial Rate Analysis:

  1. Overall Denial Rate by State: hdfs dfs -ls /user/jjl359/project/denial-rate-analysis/denial_overall
  2. Denial Rate by Race-Ethnicity pair: hdfs dfs -ls /user/jjl359/project/denial-rate-analysis/high_level
  3. Denial Rate by Year, State, Race-Ethnicity Pair: hdfs dfs -ls /user/jjl359/project/denial-rate-analysis/low_level

Year | State | Race | Ethnicity | Denial Rate
2007 | AR | White | Hispanic | x%
... | ... | ... | ... | ...
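
The aggregation behind these outputs can be sketched as follows in spark-shell; hmda stands for the cleaned HMDA DataFrame from the ETL step, and the action_taken denial codes (3 and 7) follow the HMDA documentation:

import spark.implicits._
import org.apache.spark.sql.functions._

// Flag denials, then average the flag within each (year, state, race, ethnicity) group
val withDenied = hmda.withColumn("denied",
  when($"action_taken".isin(3, 7), 1.0).otherwise(0.0))

val lowLevel = withDenied
  .groupBy("as_of_year", "state_code", "applicant_race_1", "applicant_ethnicity")
  .agg(avg($"denied").alias("denial_rate"))

lowLevel.write.option("header", "true")
  .csv("/user/jjl359/project/denial-rate-analysis/low_level")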

Data Profiling Output

More Denial Rates And Visualization

  • Denial Rate by Race: /user/fh643/VisualPrep/raceDenials
  • Denial Rate by Gender: /user/fh643/VisualPrep/sexDenials
  • Denial Rate by Ethnicity: /user/fh643/VisualPrep/ethDenials
  • Denial Rate by Income: /user/fh643/VisualPrep/incomeDenials

Figures: Overall Denial Rate; Denial Rates by Key Parameters

US Census Geography Data Profiling

To better understand the geographical distribution of the HMDA data, the combined state-county table was inner-joined to the HMDA data set by state and county name. The number of HMDA records per state and county was calculated and the results saved to /user/cjg507/sparkproject/geometries/hmdaCountyJoinedCount.json. This dataset was transferred to a local machine, and the plotCounties.py script was executed on it to generate the choropleth saved to HMDACounties.html. The results show that the overwhelming majority of counties have few records within the HMDA data set, and that analysis should be performed at the state level rather than the county level to draw sufficiently well-informed conclusions.

To validate the HMDA to state and county join, an anti-join was created and output was saved to /user/cjg507/sparkproject/withoutJoins.json. The anti-joined table was profiled by a breakdown of county and state, the results of which were saved in /user/cjg507/sparkproject/missingBreakdown.csv. The results show that approximately 2.5% of the data was not joined, due to either special characters within the county names or the state and county names being missing altogether. This amount of data was considered minor in the context of the greater data set and was dropped from further analysis.
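
A rough spark-shell sketch of this profiling step; hmdaGeom stands for the HMDA records inner-joined to the combined state-county table, geom for that combined table, and the join column names are assumptions (plotCounties.py consumes the first output):

// Count HMDA records per state and county for the choropleth
val countyCounts = hmdaGeom
  .groupBy("STNAME", "CYNAME")
  .count()
countyCounts.write.json("/user/cjg507/sparkproject/geometries/hmdaCountyJoinedCount.json")

// Anti-join: HMDA records whose state/county names failed to match the geometry table
val unmatched = hmda.join(geom, Seq("state_name", "county_name"), "left_anti")
unmatched.write.json("/user/cjg507/sparkproject/withoutJoins.json")

// Breakdown of the unmatched records by state and county, for inspection
unmatched.groupBy("state_name", "county_name").count()
  .write.option("header", "true").csv("/user/cjg507/sparkproject/missingBreakdown.csv")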

/app_code

Included Files

  • src folder containing the scripts for the HMDA initial model assessment
  • pom.xml Maven dependencies file for the HMDA initial model assessment scripts in src
  • runscript.sh Script used to run modeling Scala code on NYU Dumbo cluster
  • dataModeling.scala Scala code for modeling lender approval probabilities with HMDA data
  • website folder contains the Flask web application files and associated Python files used to create the US lender exploration web application. The majority of the included files, folders, and Python scripts were automatically generated by Flask during app creation. This listing will focus on specific customized files and directories.
    • run.py Flask entry point Python script.
    • src/entry.py Used to bootstrap the Flask application and define initial setup.
    • src/driverCode.py Python and Pyspark script that provides the runModel class for lender probability modeling
    • src/forms.py Defines the UI form used to collect applicant demographic data
    • src/formsEntries.py Defines the user-provided demographic options that link human-readable options to HMDA codes for modeling
    • src/recommend.py Reads the UI form, feeds it into the model, and visualizes the recommendation result through a popup HTML page
    • src/lenderModel precalculated Naive Bayes Spark MLLib Pipeline model folder.
    • model/modelingMatrix.csv the modeling template created by the modeling code.
    • log/logging.conf configuration file for webapp logging
    • log/flaskLog.txt debug/information log for the Flask webapp. Used for debugging and analysis.

HDFS Data

  • /user/jjl359/project/data/HMDA_2007_to_2017_codes.csv HMDA data used for modeling
  • /user/fh643/InstitutionData/data/panel_07-09 Institution (lender identification) data for years 2007-2009
  • /user/fh643/InstitutionData/data/panel_10-17 Institution (lender identification) data for years 2010-2017
  • /user/cjg507/sparkproject/modelingMatrix.csv Webapp modeling template
  • /user/cjg507/lenderModel final MLLib NaiveBayes model

Initial Model Evaluation

This code performs the machine learning model evaluation: it computes the AUC (area under the ROC curve) for Naive Bayes, SVM, and Logistic Regression.

This segment of the application uses the following folder for its data: hdfs dfs -ls /user/jjl359/project/df_for_logistic_regression

Running spark-submit:

/opt/maven/bin/mvn package

spark2-submit --class ModelEval --deploy-mode cluster --executor-memory 50G --total-executor-cores 9182 target/scala-0.0.1-SNAPSHOT.jar > output.txt
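
The evaluation itself can be sketched as follows; data stands for a prepared DataFrame with features and label columns, and the 80/20 split and default model settings are assumptions rather than the exact settings used by ModelEval:

import org.apache.spark.ml.classification.{LinearSVC, LogisticRegression, NaiveBayes}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)
val evaluator = new BinaryClassificationEvaluator().setMetricName("areaUnderROC")

// Fit each candidate model on the training split and compare AUC on the test split
val nbAuc  = evaluator.evaluate(new NaiveBayes().fit(train).transform(test))
val svmAuc = evaluator.evaluate(new LinearSVC().fit(train).transform(test))
val lrAuc  = evaluator.evaluate(new LogisticRegression().fit(train).transform(test))
println(s"NaiveBayes AUC = $nbAuc, SVM AUC = $svmAuc, LogisticRegression AUC = $lrAuc")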

Creating Model Feature and Label Input

After the HMDA and institution data had been cleaned and profiled, the next task was to create a model that predicts the probability of lender approval for various lenders over time, given the applicant demographics. To construct such a model, the input features must include each of the features on which an approval prediction will be made, together with the corresponding binary label of approved or denied. These features, and their locations in the given datasets, are the following:

  • Year of Record: HMDA.as_of_year
  • Applicant Demographic Features
    • Income: HMDA.applicant_income_000s
    • Loan Amount: HMDA.loan_amount_000s
    • Race: HMDA.applicant_race_1
    • Ethnicity: HMDA.applicant_ethnicity
    • Gender: HMDA.applicant_sex
    • State: HMDA.state_code
  • Lender Identifiers
    • Respondent ID: InstitutionData.Respondent ID & HMDA.respondent_id
    • Agency Code: InstitutionData.Agency Code & HMDA.agency_code
    • Institution Name: InstitutionData.Respondent Name (Panel)
  • Action Taken: HMDA.action_taken

This meant that a single Spark dataframe had to be constructed with the above features for each record to be used to fit the model.

The binary label of "Approved" or "Denied" could be derived from the HMDA.action_taken column. This column contained several classifications, most of which could be easily interpreted as either "Approved" or "Denied". Classifications that were ambiguous or indicated missing data were dropped. The following table provides the binary mapping/filtering:

HMDA Classification | Mapping | No. Records
"Application denied by financial institution" | Denied | 32269700
"Preapproval request denied by financial institution" | Denied | 1578696
"Loan originated" | Approved | 89180013
"Application approved but not accepted" | Approved | 8314173
"Preapproval request approved but not accepted" | Approved | 823436
"Loan purchased by the institution" | Approved | 32350741
"File closed for incompleteness" | Dropped | 5741376
"Application withdrawn by applicant" | Dropped | 17204311

The features used to predict the binary labels and their associated probabilities were located entirely within the HMDA dataset; however, the lender name used to determine the institutions to be modeled, and to present to the user within the webapp, was defined only within the institution dataset. As determined previously in the institution data profiling section, an institution is uniquely defined by Respondent ID, Agency Code, and State Code.

The state codes were not present in the given HMDA data; however, they are defined within its associated documentation. A dataframe relating the state codes to state abbreviations was manually created and joined to the HMDA data to create a new state code column.
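
A small sketch of that manually created lookup and join (only two illustrative rows are shown; the real mapping covers every state, following the FIPS numbering in the HMDA documentation):

import spark.implicits._

// Hand-built mapping from HMDA numeric state codes to USPS abbreviations
val stateCodes = Seq((1, "AL"), (2, "AK")).toDF("state_code", "state_abbr")

// Attach the abbreviation so HMDA records can be matched to RespondentState
val hmdaWithAbbr = hmda.join(stateCodes, Seq("state_code"), "left")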

The institution dataset was then joined to the HMDA data to create the final feature and label dataframe.

Creating Webapp Template Matrix

The web application used to model approval probabilities will ask the user for all of the applicant demographic features listed above, and will generate the predicted years and lenders itself. To create a probability for each combination of demographics, year, and lender, the webapp creates a dataframe that is the cartesian product (demographics x set of years considered x set of lenders). Although there is only one set of demographics and 11 years, there are several thousand unique lenders within the HMDA dataset. Predicting over a cartesian product with the entire set of lenders would be infeasible for an interactive application, so the number of lenders considered was reduced to the 20 lenders with the largest number of records in the HMDA dataset.

The top 20 lenders were extracted and a template dataframe containing the cartesian product of the (set of years X set of top 20 lenders) was pre-calculated in Spark and saved as the CSV file /user/cjg507/sparkproject/modelingMatrix.csv. To produce a modeling dataframe, the webapp will only have to create a column for each demographic feature and copy the same value for each row within the template year X lender dataframe.
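
A sketch of how such a template could be pre-calculated, assuming a joined feature DataFrame features with the institution columns named as in the tables above:

import spark.implicits._
import org.apache.spark.sql.functions._

// Top 20 lenders by number of HMDA records
val topLenders = features
  .groupBy("RespondentID", "AgencyCode", "RespondentName")
  .count()
  .orderBy(desc("count"))
  .limit(20)
  .drop("count")

// Cartesian product of the 11 years and the 20 lenders: a 220-row template
val years = (2007 to 2017).toDF("as_of_year")
val template = years.crossJoin(topLenders)

template.coalesce(1).write.option("header", "true")
  .csv("/user/cjg507/sparkproject/modelingMatrix.csv")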

Creating the Modeling Pipeline and Naive Bayes Model

The web application will need to quickly calculate approval probabilities to be considered interactive. To speed model prediction, the model will be fit offline, saved to a local machine, and loaded during webapp execution. As examined in the HMDA data profiling, the Naive Bayes model produced the highest (albeit remarkably poor) AUC accuracy, and thus will be the final model used in the webapp.

The feature dataframe will require several transformations to allow for modeling with the Spark MLLib NaiveBayes model. These transformations will have to be performed during both fitting and prediction, so a standardized method of "funneling" data from a readable text dataframe to the model, called a Pipeline, will be used. The Pipeline consists of 4 major stage categories:

  • String Indexing
  • One-Hot Encoding
  • Assembling
  • Final Modeling/Fitting

The string indexing step is powered by the MLLib StringIndexer class, which maps input strings to integer categories. These categories flag the downstream model to treat the given feature as categorical. One-hot encoding is provided by the MLLib OneHotEncoder class, which maps a set of integers to a sparse set of binary vectors; this is a requirement of the Bernoulli NaiveBayes model. Assembling is provided by the MLLib VectorAssembler class, which joins the columns of sparse vectors produced by the one-hot encoders into a single sparse feature vector. This feature vector is the final input to the NaiveBayes model. The final modeling is provided by the MLLib NaiveBayes model, which fits the model and/or produces a binary probability of approval and a prediction label.

The full bucketed joined HMDA-Institution dataset was fit using the above pipeline, and the resulting trained model was saved to /user/cjg507/lenderModel.
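
A condensed sketch of that pipeline (the feature column names, the handleInvalid setting, and the fit on a labeled DataFrame are assumptions; the authoritative code is dataModeling.scala):

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// Categorical features drawn from the feature list above
// (bucketed income and loan-amount columns would be added here as well)
val catCols = Array("as_of_year", "applicant_race_1", "applicant_ethnicity",
  "applicant_sex", "state_code", "RespondentID")

val indexers = catCols.map(c =>
  new StringIndexer().setInputCol(c).setOutputCol(c + "_idx").setHandleInvalid("keep"))
val encoders = catCols.map(c =>
  new OneHotEncoder().setInputCol(c + "_idx").setOutputCol(c + "_vec"))
val assembler = new VectorAssembler()
  .setInputCols(catCols.map(_ + "_vec")).setOutputCol("features")
val nb = new NaiveBayes().setModelType("bernoulli")
  .setLabelCol("label").setFeaturesCol("features")

val stages: Array[PipelineStage] = indexers ++ encoders ++ Array(assembler, nb)
val pipeline = new Pipeline().setStages(stages)

// Fit on the labeled HMDA-institution DataFrame and save the trained model
val model = pipeline.fit(labeled)
model.write.overwrite().save("/user/cjg507/lenderModel")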

Flask Web Application

  • Why Flask: We chose Flask to host our final visualization and model output because:

    • Flask is lightweight, which is great for prototyping and facilitates test-driven development
    • Flask has nice safety features built-in that allow us to validate form input and encrypt server-side information with ease
    • With our model developed primarily in PySpark, using python-based Flask is very natural.
    • Since the initial phase of the project, Plot.ly has been a preferred visualization tool for us. The consistency in language between front and back end is quite ideal.
  • Flask Installation

    • Ensure Flask & flask_sqlalchemy are installed
     pip install -U Flask
     pip install flask_sqlalchemy
    • Install a local version of Spark and verify that PySpark is reachable from your Python path. This can be verified by
     import pyspark
  • Launch Web Application: to execute the Flask web application, run run.py with

     python run.py

/screenshots

Screenshots

/test_code

Code in this folder was used to prototype four machine learning models on the entire data set using the Scala Spark REPL.

TestCode

Access

Access screenshot