Mortgage Workflow

The Dataset

The dataset used with this workflow is derived from Fannie Mae’s Single-Family Loan Performance Data, with all rights reserved by Fannie Mae. This processed dataset is redistributed with permission and consent from Fannie Mae.

To acquire this dataset, please visit the RAPIDS Datasets Homepage.

Introduction

The Mortgage workflow is composed of three core phases:

ETL - Extract, Transform, Load
Data Conversion
ML - Machine Learning Training

ETL

Data is:

Read in from storage
Transformed to emphasize key features
Loaded into volatile memory for conversion
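
The three ETL steps above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the workflow's actual implementation, which runs on RAPIDS. The pipe-delimited sample, the column names (loan_id, interest_rate, current_upb, delinquency_status), and the derived delinquent label are all assumptions made for the example.

```python
import csv
import io

# Hypothetical pipe-delimited sample; column names are illustrative only.
RAW = """loan_id|interest_rate|current_upb|delinquency_status
1001|4.25|200000.00|0
1002|3.75|150000.00|3
1003|5.00||1
"""

def extract(source):
    """Read pipe-delimited rows from a file-like object into dictionaries."""
    return list(csv.DictReader(source, delimiter="|"))

def transform(rows):
    """Cast fields to numbers and derive a binary delinquency label."""
    out = []
    for row in rows:
        out.append({
            "loan_id": int(row["loan_id"]),
            "interest_rate": float(row["interest_rate"]),
            # Treat a missing balance as 0.0 (an assumption for this sketch).
            "current_upb": float(row["current_upb"] or 0.0),
            # Label is 1 when the status code is greater than zero.
            "delinquent": int(int(row["delinquency_status"]) > 0),
        })
    return out

def load(rows):
    """Keep the transformed rows in memory, keyed by loan id."""
    return {row["loan_id"]: row for row in rows}

table = load(transform(extract(io.StringIO(RAW))))
```

Keeping each step as its own function mirrors the phase boundaries described above, so any one of them can be swapped for a GPU-backed equivalent without touching the others.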

Data Conversion

Features are:

Broken into (labels, data) pairs
Distributed across many workers
Converted into compressed sparse row (CSR) matrix format for XGBoost
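
To make the conversion steps above concrete, here is a small pure-Python sketch of splitting rows into (label, features) pairs, assigning them to workers, and building CSR arrays per worker. The helper names (split_labels, partition, to_csr), the round-robin assignment, and the sample rows are hypothetical. In practice a library such as SciPy or CuPy would build the sparse matrices, and Dask would handle the distribution.

```python
def split_labels(rows, label_index):
    """Separate each row into a (label, features) pair."""
    pairs = []
    for row in rows:
        features = row[:label_index] + row[label_index + 1:]
        pairs.append((row[label_index], features))
    return pairs

def partition(pairs, n_workers):
    """Assign pairs to workers round-robin (an assumption for this sketch)."""
    return [pairs[i::n_workers] for i in range(n_workers)]

def to_csr(dense_rows):
    """Convert dense rows to the three CSR arrays: data, indices, indptr."""
    data, indices, indptr = [], [], [0]
    for row in dense_rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # nonzero value
                indices.append(col)   # column of that value
        indptr.append(len(data))      # running count marks the end of the row
    return data, indices, indptr

# Illustrative rows: three features followed by a label in the last column.
rows = [
    [0.0, 4.25, 0.0, 1],
    [0.0, 0.0, 3.5, 0],
    [2.0, 0.0, 0.0, 1],
    [0.0, 0.0, 0.0, 0],
]
pairs = split_labels(rows, label_index=3)
shards = partition(pairs, n_workers=2)
csr_shards = [to_csr([features for _, features in shard]) for shard in shards]
```

CSR stores only the nonzero values plus their column indices and a row-pointer array, which is why it suits feature matrices where most entries are zero.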

Machine Learning

The CSR data is fed into a distributed training session with xgboost.dask.

Performance

We regularly benchmark RAPIDS on this workload to measure its performance not only against Apache Spark on CPUs but also against past versions of RAPIDS.