LendingHome Interview

Hi, my name is Zeyu Li. This is the repo for the take-home exercise for the Summer 2018 Data Scientist Intern position at LendingHome.

Speak for myself

This part would not normally come first, but given its importance I decided to move it to the front. Here is what I would like to say on my own behalf about this internship opportunity. Who wouldn't want to seize it? It's my first time speaking with recruiters!

I spent almost a week on this project: defining the problem, cleaning data, searching for useful supporting data, designing and extracting features, coding, plotting, creating slides, and organizing the artifacts before submission. As I understand this exercise, achieving an impressive score (a metric above 0.90) is not the top priority. What matters is not the result alone but the whole process, from receiving the data to delivering the artifacts. These artifacts include not only the results but also the exploration of the dataset, the selection and comparison of machine learning models, the design of features, and the explanation of the results derived from each model.

With that in mind, I believe I have done a great job. I won't say much about how good my work is, because the work will speak for itself.

Looking forward to working with you in Summer 2018 at LendingHome!

Files

Once the data directory has been created, the project directory contains three directories and a README.md.

.
├── data
│   ├── data.csv
│   ├── dataX.csv
│   ├── dataY.csv
│   └── zillow.csv
├── README.md
├── slides
│   └── LendingHome Interview.pdf
└── src
    ├── DatasetExploration.ipynb
    ├── FeatureEngineering.ipynb
    └── Models.ipynb

3 directories, 9 files

None of the data files are included in this repo, because they would take extra space and transfer time. The notes below explain how to get the data and make it ready to go!

The data folder and all its contents are not included in this repo, to keep the repo small.
data.csv is the Fix & Flip dataset provided by the LendingHome recruiters and shared via Google Drive (for convenience, the file was renamed to data.csv from LH_data_scientist_intern_exercise.csv).
dataX.csv and dataY.csv are the processed feature matrix and target matrix, generated when you run FeatureEngineering.ipynb.
zillow.csv is a supporting dataset available online here. A description of this dataset is here, and it is also summarized in the presentation slides. This dataset is "free for public use by consumers, media, analysts, academics etc., consistent with our published Terms of Use." Because this project is an exercise rather than a commercial practice, and proper attribution to Zillow is given, the dataset was used here. Please rename the file to zillow.csv for convenience.
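
Once the two input files have been downloaded and renamed, a quick check like the one below can confirm they are where this README expects them. This is only a minimal sketch: check_data_files is a hypothetical helper name, not something shipped in this repo, and it assumes the command is run from the project root.

```shell
# Hypothetical helper (not part of this repo): report which of the two
# expected input files are missing from a data directory.
check_data_files() {
  dir="${1:-data}"          # default to ./data when no argument is given
  missing=0
  for f in data.csv zillow.csv; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f"
      missing=1
    fi
  done
  return "$missing"         # 0 if both files exist, 1 otherwise
}
```

Calling check_data_files with no argument checks ./data. It prints one line per missing file and returns a non-zero status if either file is absent. dataX.csv and dataY.csv are deliberately not checked, since they are generated later by FeatureEngineering.ipynb.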

Run

The Jupyter Notebook App is a server-client application that allows editing and running notebook documents via a web browser. Notebook documents have gained popularity in recent years because they are both human-readable documents containing the analysis description and the results (figures, tables, etc.) and executable documents that can be run to perform data analysis. This property makes them popular with data analysts who need to see interim results. For those reasons, we use .ipynb notebooks as the environment to edit and execute the source code.

We have three Jupyter notebook files in src that create all the statistics, models, and figures in the presentation slides. To reproduce the results shown in the presentation, please follow these steps:

Clone the repo to your local machine

$ git clone https://github.com/zyli93/LH_interview.git

cd into LH_interview and create the data directory

$ cd LH_interview
$ mkdir data

Copy/paste the two datasets and rename them

$ cp <original_path>/LH_data_scientist_intern_excercise.csv data/
$ cp <original_path>/Zip_Zhvi_Summary_AllHomes.csv data/
$ mv data/LH_data_scientist_intern_excercise.csv data/data.csv
$ mv data/Zip_Zhvi_Summary_AllHomes.csv data/zillow.csv

Run Jupyter-notebook to view the .ipynb files.

$ jupyter-notebook

This command opens a new browser window at the project's root directory. Click src and then choose one of the three .ipynb files to see the source code. If jupyter-notebook is unavailable on your machine, or you are working in a non-graphical environment, you can also view the notebooks online on GitHub.
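
If no browser is available on the machine holding the repo, one option is to convert the notebooks to static HTML with jupyter nbconvert and open the resulting files elsewhere. The sketch below assumes nbconvert is installed alongside Jupyter; render_notebooks and the DRY_RUN switch are hypothetical names introduced here for illustration, not part of this repo.

```shell
# Hypothetical helper (not part of this repo): convert every notebook in a
# directory to static HTML. Set DRY_RUN=1 to print the commands instead of
# running them.
render_notebooks() {
  src_dir="${1:-src}"
  for nb in "$src_dir"/*.ipynb; do
    [ -e "$nb" ] || continue      # skip when the glob matched nothing
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "jupyter nbconvert --to html $nb"
    else
      jupyter nbconvert --to html "$nb"
    fi
  done
}
```

Running render_notebooks src from the project root writes one .html file next to each notebook. Without further options, nbconvert renders the outputs already saved in the notebook rather than re-executing the cells.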

Submission

The submission has two parts: slides and source code. The slides are available here; a PDF copy is also in ./slides of this GitHub repo.

According to the requirements file, a presentation is preferred. I understood "presentation" to mean a set of slides; please let me know if I was wrong. I also considered making a video presentation and posting it on YouTube, but did not because of confidentiality concerns around this exercise. Because there is no video, the slides are wordier than usual so that they read smoothly on their own. If a presentation video is needed, please let me know and I'll make and post it ASAP.