RecursionFairies_AdaProject

Project for the ADA course at EPFL, 2019


Abstract

It is always difficult to find the right place to live. Many aspects can influence the choice (prices, income, education, life expectancy...), and each person gives more importance to some aspects and less to others. The UN already publishes a Human Development Index (HDI), which measures quality of life through three criteria: "a long and healthy life", "knowledge" and "a decent standard of living" [1]. But this is a static index! You cannot give more importance to the features that really matter to you! We would like to improve this index and make it personalizable. We gathered data covering the different aspects taken into account in the HDI, for the 50 states of the US. The goal is to create a map of the USA on which you can select the importance given to each of these aspects, in order to find the best place to live in the USA according to your desires!

Research questions

  • Can we create a new index based on our desires and expectations?
  • Are we able to select the right datasets to implement this index?
  • Are the features selected to substitute the three areas of the HDI meaningful? Is it really useful to build such an index with this data?
  • Does our index correspond to our expectations? Are some areas better than others?
  • When we change the importance of the parameters, do striking differences appear? Is it possible to identify places that are generally better than others?

Datasets

A list of internal milestones up until project milestone 2

For each dataset, we will extract the data of interest. After collecting all the required information, we will create maps and statistics for each of these aspects. The idea is to "merge" the different aspects into an interactive index, in which you can select with sliders how much importance to attribute to each feature, in order to create an interactive map showing which state is perfect for your desires (a minimal sketch of this idea follows the list below). Our work for milestone 2 will be divided in two parts:

  • First two weeks: loading and cleaning the datasets, and extracting the data of interest
  • Last two weeks: analysis of the extracted data, visualization of the results, preparation of our interactive index
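
A minimal sketch of the index we have in mind, using pandas; the states, features and weights below are placeholder examples, not our final choices:

```python
import pandas as pd

# Toy per-state feature table; the real values would come from our cleaned datasets.
states = pd.DataFrame(
    {"life_expectancy": [78.8, 81.3, 79.1],
     "median_rent": [1200.0, 2400.0, 950.0],
     "crime_rate": [3.2, 2.9, 4.1]},
    index=["Ohio", "California", "Texas"],
)

# Min-max normalise each feature to [0, 1] so features with different units are comparable.
normalised = (states - states.min()) / (states.max() - states.min())

# Flip features where a lower raw value is better.
for col in ("median_rent", "crime_rate"):
    normalised[col] = 1.0 - normalised[col]

# User-chosen importance weights; on the final map these would come from sliders.
weights = {"life_expectancy": 0.5, "median_rent": 0.3, "crime_rate": 0.2}

index = sum(w * normalised[col] for col, w in weights.items())
print(index.sort_values(ascending=False))  # best-matching state first
```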

Questions for TAs

  • Is the dataset list for the project reasonable? Are there too few, or too many, datasets?
  • Do you think the project is too ambitious, or too modest? For example, the final dataset will only have 50 rows: is that too few? We decided to focus only on the USA because we found interesting datasets more easily there, and because the differences between the states make our research interesting.
  • Do you know of any other datasets that we could use?

[1] United Nations Development Programme, Human Development Index (HDI): http://hdr.undp.org/en/content/human-development-index-hdi

Milestone 2 checklist:

Data cleaning and analysis:

Once we had all the data, our first task was to clean it. We came across a variety of challenges during this task:

  • The first dilemma was to decide which locations we would apply the prediction to. The USA is a big and complex country, composed of states, federal districts, territories, and so on. After discussion, we decided to apply our analysis to all 50 states plus the District of Columbia, to obtain more global results.
  • The most frequent challenge was missing data: for example, the state of Hawaii was missing from the rent dataset, and life expectancy was missing for Maine and Wisconsin. We recovered all the missing values from other sources on the internet, which gave us precise figures (otherwise we would have had to drop these features or approximate the missing values); a sketch of this patching follows the list.
  • Another challenge was to select meaningful information for our model. Since we had to predict the HDI for only 51 locations (50 states and 1 district), we decided to be quite selective, to avoid overfitting problems in our machine-learning prediction.
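
For illustration, patching a hand-recovered value might look like the following; the numbers are placeholders, not the figures we actually used:

```python
import numpy as np
import pandas as pd

# Toy rent table with the Hawaii gap we hit; the values are placeholders.
rent = pd.DataFrame({"median_rent_2018": [1500.0, np.nan, 1100.0]},
                    index=["California", "Hawaii", "Maine"])

# Values recovered by hand from external sources (placeholder figure here);
# this is more precise than dropping the feature or imputing it.
manual_fixes = {"Hawaii": 1900.0}
for state, value in manual_fixes.items():
    rent.loc[state, "median_rent_2018"] = value

assert rent["median_rent_2018"].notna().all()
```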

The notebooks that we have written in order to clean and analyze the datasets are:

  • crime.ipynb: a crime dataset is analyzed; we hope to extract good insights on the quality of life from the number of crimes committed in each state.
  • McDonalds.ipynb: a dataset on food quality (number of fast-food restaurants per state, obesity rate, share of people eating unhealthily).
  • school_dataset.ipynb: a dataset with tuition fees of private schools and elementary schools.
  • homelessData.ipynb: we extracted homeless population figures from 2007 to 2018.
  • rentData.ipynb: we selected the 2018 rent prices for studios and for 1-, 2-, 3- and 4-bedroom homes.
  • csv_datasets.ipynb: we analyze and extract features related to university rankings, life expectancy and personal income.
  • diabetes_physical_activity_alcohol_obesity.ipynb: we extract and analyse the prevalence of physical activity, alcohol consumption, diabetes and obesity.

What comes next?

Now that we have all the datasets cleaned and analyzed, we can proceed further with the project: first, we need to train a model that will enable us to predict the HDI of every US state, based on the dataset that we have created.

This task poses some challenges: our dataset has many dimensions and few samples, so there is a risk of overfitting.

We can try to counter this effect by applying dimensionality-reduction algorithms such as PCA, Factor Analysis or Independent Component Analysis.
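
For instance, a quick way to choose the number of PCA components is to look at the cumulative explained variance; a sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(51, 30))  # stand-in for 51 locations x ~30 features

# Standardise first: PCA is sensitive to scale and our features mix units.
X_std = StandardScaler().fit_transform(X)

pca = PCA().fit(X_std)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, 0.95)) + 1
print(f"{n_components} components retain 95% of the variance")
```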

Once we have reduced the dimension of our dataset, we can train the model using a regression technique such as SVR or Ridge Regression.
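
With only 51 samples, leave-one-out cross-validation is a natural way to evaluate such a model; a sketch with scikit-learn, again on synthetic stand-in data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(51, 30))                              # stand-in feature matrix
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=51)  # stand-in HDI target

# Standardise -> reduce dimension -> regularised regression, evaluated with
# leave-one-out CV, which suits a tiny 51-sample dataset.
model = make_pipeline(StandardScaler(), PCA(n_components=10), Ridge(alpha=1.0))
scores = cross_val_score(model, X, y, cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")
print("leave-one-out mean squared error:", -scores.mean())
```

Keeping the scaling and PCA inside the pipeline means they are refit on each training fold, so no information leaks from the held-out state.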

Once we have computed the weights for our model, we can build our index by adjusting those weights to personalize it to the user's needs. We will offer a visual representation using sliders, so that the user can adjust the weights in an intuitive manner.
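
In a notebook, ipywidgets sliders are one possible way to expose these weights; a minimal sketch, where the per-class scores and the scoring rule are stand-ins:

```python
import ipywidgets as widgets

# Precomputed class scores for one state (stand-in values).
class_scores = {"health": 0.8, "crime": 0.6, "education": 0.7, "monetary": 0.5}

def personalised_index(health=1.0, crime=1.0, education=1.0, monetary=1.0):
    weights = {"health": health, "crime": crime,
               "education": education, "monetary": monetary}
    total = sum(weights.values()) or 1.0  # guard against all-zero weights
    score = sum(weights[c] * class_scores[c] for c in class_scores) / total
    print(f"personalised index: {score:.3f}")

# Each (min, max, step) tuple becomes a slider in the notebook.
widgets.interact(personalised_index,
                 health=(0.0, 2.0, 0.1), crime=(0.0, 2.0, 0.1),
                 education=(0.0, 2.0, 0.1), monetary=(0.0, 2.0, 0.1));
```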

Furthermore, we may try to split our features into meaningful subsets (health, crime, education, monetary) in order to gauge their impact on our final HDI score. We may also investigate the relation between these subsets and the actual features used to compute the UN HDI (personal income, life expectancy and average years of education).

To conclude, a further improvement would be to test the robustness of our model: whenever we drop a column (for instance, if one of the organizations computing our metrics stops publishing), we should still be able to learn a good enough predictor for the HDI. The fact that our dataset has many correlated features means we can probably tolerate the loss of some columns without losing much accuracy.
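
A sketch of such a robustness check, dropping one column at a time and re-evaluating (stand-in data again):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(51, 10)),
                 columns=[f"feature_{i}" for i in range(10)])
y = X.iloc[:, :4].sum(axis=1) + rng.normal(scale=0.1, size=51)

def loo_mse(features):
    scores = cross_val_score(Ridge(alpha=1.0), features, y,
                             cv=LeaveOneOut(), scoring="neg_mean_squared_error")
    return -scores.mean()

baseline = loo_mse(X)
for col in X.columns:  # retrain with each column removed in turn
    change = loo_mse(X.drop(columns=col)) - baseline
    print(f"without {col}: MSE change {change:+.4f}")
```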

Milestone 3 checklist:

Notebook

The main notebook for milestone 3 is ADA_project_final_notebook.ipynb.

Data preprocessing:

As discussed, the features belong to 4 categories, so we split the dataset accordingly. We normalized some features (for example, the number of homeless people per 10,000 citizens seems more meaningful than the raw count), computed correlations between features, and performed other data analyses.
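
For example, the per-capita normalization and a correlation check might look like this (the figures are placeholders, not our actual data):

```python
import pandas as pd

# Toy state-level table; the numbers are placeholders.
df = pd.DataFrame(
    {"homeless_count": [151_000, 580, 91_000],
     "population": [39_500_000, 578_000, 19_500_000],
     "median_income": [75_000, 65_000, 72_000]},
    index=["California", "Wyoming", "New York"],
)

# Raw counts mostly track population size; rates per 10,000 are comparable.
df["homeless_per_10k"] = df["homeless_count"] / df["population"] * 10_000

print(df[["homeless_per_10k", "median_income"]].corr())
```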

Machine learning regression:

We were finally able to apply machine-learning techniques to our data, to see whether a regressor would give a good prediction of the HDI from our features.
We applied linear regression, ridge regression and SVR. For the three algorithms, we compared performance on standard versus normalized data, and with/without polynomial expansion. We then applied the same algorithms after PCA, computed either on all features at once or on the 4 precomputed classes (i.e. education, health, security and economy) separately. The results are shown in the ADA_project_final_notebook.ipynb file. The main conclusion of all these computations is that ridge regression with polynomial expansion is the most accurate regressor on the entire dataset, and remains close to the best when applied to fewer features.
One of our goals was to be able to modify the HDI based on the 4 classes, allowing a personalised score for each state. For this, we take a weight vector for the 4 classes and appropriately weigh the coefficients of the linear regression model for the HDI to predict our new personalized HDI; a sketch of this re-weighting follows.
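
A sketch of the re-weighting, with stand-in coefficients and an illustrative feature-to-class mapping:

```python
import numpy as np

# Stand-in coefficients of a fitted linear HDI model, one per feature,
# together with the class each feature belongs to (illustrative mapping).
coefs = np.array([0.12, -0.05, 0.20, 0.08])
feature_class = ["health", "security", "education", "economy"]

def personalised_hdi(x, class_weights):
    """Scale each coefficient by the user weight of its class, then
    apply the re-weighted linear model to one state's features."""
    w = np.array([class_weights[c] for c in feature_class])
    return float((coefs * w) @ x)

x_state = np.array([0.7, 0.4, 0.9, 0.6])  # normalised features of one state
print(personalised_hdi(x_state, {"health": 1.0, "security": 0.5,
                                 "education": 1.5, "economy": 1.0}))
```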

Website creation:

Structure of the repo:

  • Data: all the files we analysed
  • Pickles: directory containing all the pickle files
  • plotlySave: directory containing all the plotly files (as HTML)
  • ADA_project_final_notebook.ipynb: principal notebook; all the analysis is in it
  • crime.ipynb: cleaning and analysing of crime features
  • csv_datasets.ipynb: cleaning and analysing of csv table features
  • data_story_and_maps.ipynb: copy of ADA_project_final_notebook.ipynb to have interactive graphs
  • diabetes_physical_activity_alcohol_obesity.ipynb: cleaning and analysing of diabetes, physical activity, alcohol and obesity features
  • HDIOutputAndInputs.ipynb: notebook that combines the inputs (i.e. features) and outputs (i.e. HDI) for our ML task
  • homelessData.ipynb: cleaning and analysing of homeless features
  • Life_expectancy_alcohol_diabetes.ipynb: cleaning and analysing of life expectancy, alcohol and diabetes features
  • LinearRidgeRegressionAndSVR.ipynb: comparisons of various ML algorithms
  • McDonalds.ipynb: cleaning and analysing of McDonalds features
  • Milestone2.ipynb: notebook that combines the result tables of all milestone 2 data cleaning
  • rentData.ipynb: cleaning and analysing of rent features
  • school_dataset.ipynb: cleaning and analysing of school features
  • toHaveNormalData.ipynb: to save the pickle file with all the features
  • README.md: a README

Goals:

  1. Our first goal was to show the feasibility of predicting the Human Development Index based on non-official parameters.
    This objective has been achieved.
  2. Our second objective was to provide a more personalized index.
    As shown on the website, we reached this objective, even if it can only be validated empirically.

Contribution list:

  • Angelo Sciullo: data cleaning, analysis and preprocessing; most of the Data Story.

  • Harshvardhan: data cleaning and analysis. ML and weighted prediction implementation.

  • Loris Pilotto: data cleaning and analysis. ML model implementation.

  • Stefano Huber: data cleaning and analysis. Creation of the website.

  • The presentation will be held by Stefano Huber, and the other members of the group will assist during the creation of the poster.

You can find our data story on the USA Human Development Index here!