fses-youth-analysis

This repository contains code (in the form of a jupyter notebook) for analysis of the FSES (Comenius University in Bratislava) young people survey from 2013. The dataset was chosen for a Udacity data science project because it offers interesting insights into behavioral aspects (hobbies, views, opinions etc.) as well as features of commercial interest, in particular spending habits. Since young people are often an important target group for marketing, this dataset seems ideal for a business intelligence case study.

Software requirements

To run the code, jupyter and a basic scientific python software stack are needed, containing the following packages (version numbers refer to the versions this code was last successfully tested with):

  • numpy 1.21.2
  • scipy 1.7.3
  • matplotlib 3.5.0
  • scikit-learn 1.0.1
  • seaborn 0.11.2
  • pandas 1.4.1
  • ipykernel 6.4.1 (for running the code using jupyter-notebook)
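To see how an existing environment compares with the versions listed above, a small check like the following can help. This is only a sketch: `TESTED_VERSIONS` and `compare_versions` are hypothetical names that are not part of this repository, and the check relies on `importlib.metadata` from the standard library (Python 3.8 or newer).

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the list above (the versions the code was last tested with).
TESTED_VERSIONS = {
    "numpy": "1.21.2",
    "scipy": "1.7.3",
    "matplotlib": "3.5.0",
    "scikit-learn": "1.0.1",
    "seaborn": "0.11.2",
    "pandas": "1.4.1",
    "ipykernel": "6.4.1",
}


def compare_versions(tested=TESTED_VERSIONS):
    """Return {package: (tested_version, installed_version or None)}."""
    report = {}
    for name, wanted in tested.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (wanted, installed)
    return report


for name, (wanted, installed) in compare_versions().items():
    status = "ok" if installed == wanted else "differs"
    print(f"{name}: tested with {wanted}, installed {installed} ({status})")
```

A differing version does not necessarily mean the notebook will fail; it only shows where the environment deviates from the tested setup.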

Installation

For information about how to install jupyter, see the jupyter documentation.

The necessary python packages can be installed using a package manager of choice (e.g. pip) or using a virtual environment management system like anaconda.

Instructions on how to install anaconda can be found in the anaconda documentation.

Once anaconda is installed, create and activate a new environment, then install all required packages into it like this:

conda install -c anaconda numpy scipy matplotlib scikit-learn seaborn pandas ipykernel

There is also an environment file required_software.yml which can be used to install all the packages at the respective versions with which the code was tested:

conda env create -f required_software.yml

To make the new environment available in jupyter-notebook, activate it, then run:

python -m ipykernel install --user --name=name_of_your_conda_env

Usage

First, clone this repository. Then download the columns.csv and responses.csv files from Kaggle and place them in the parent directory of the repository (the directory that contains the cloned folder). The notebook expects to find them there.
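The expected file layout can be checked before opening the notebook. The following is a minimal sketch, assuming it is run from inside the repository directory; `load_survey` and `DATA_FILES` are hypothetical names and not necessarily how the notebook itself reads the data.

```python
from pathlib import Path

import pandas as pd

# File names taken from the usage instructions above.
DATA_FILES = ("columns.csv", "responses.csv")


def load_survey(data_dir=Path("..")):
    """Load both CSV files from data_dir (default: parent of the current directory)."""
    data_dir = Path(data_dir)
    missing = [name for name in DATA_FILES if not (data_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(
            f"Missing in {data_dir.resolve()}: {', '.join(missing)}"
        )
    columns = pd.read_csv(data_dir / "columns.csv")
    responses = pd.read_csv(data_dir / "responses.csv")
    return columns, responses
```

Calling `load_survey()` from the repository directory either returns both tables or names the files that still need to be downloaded.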

After that, go to the repository directory and run the notebook:

jupyter-notebook fses_analysis.ipynb

Note: In the first cell of fses_analysis.ipynb, there is a parameter called save_plots. When it is set to True, the notebook creates a new folder called Plots and stores all generated plots there, in addition to showing them in-line. By default, this parameter is set to False, in which case plots are only shown in-line.
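The behavior described in this note could be implemented roughly as follows. This is a sketch rather than the notebook's actual code: `finish_plot` is a hypothetical helper, the PNG format is an assumption, and the `Agg` backend is selected only so that the example also runs outside jupyter.

```python
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend for running outside jupyter
import matplotlib.pyplot as plt

save_plots = False        # parameter named in the note above
PLOT_DIR = Path("Plots")  # folder name from the note above


def finish_plot(fig, name, save, plot_dir=PLOT_DIR):
    """Hypothetical helper: write fig to plot_dir/name.png if save is True."""
    if not save:
        return None  # plot is only shown in-line
    plot_dir = Path(plot_dir)
    plot_dir.mkdir(exist_ok=True)  # create the folder on first use
    target = plot_dir / f"{name}.png"
    fig.savefig(target, bbox_inches="tight")
    return target


fig, ax = plt.subplots()
ax.hist([1, 2, 3, 3, 3, 4, 5], bins=range(1, 7))
finish_plot(fig, "example_histogram", save=save_plots)
```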

Files

This repository contains two files:

  • fses_analysis.ipynb, the jupyter notebook that contains the analysis code
  • required_software.yml, a YAML file that lists all python packages and their versions used for running the code. It can be used with anaconda to recreate a python environment with exactly the same packages in order to run the code.

Analysis Summary

All but a few variables are encoded as categorical values ranging from 1 to 5, where 1 refers to low and 5 to high agreement with the statement presented in the questionnaire.

An exploratory analysis of the spending-habit variables shows that most distributions peak at the center value of 3, meaning participants report spending neither a large nor a very small amount of money. The only exceptions are "spending for healthy food" and "spending in shopping centers", both of which tend to be answered with higher values.
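Finding the most frequent answer per variable amounts to counting the answers in each of the five categories. The sketch below illustrates this on synthetic data; the column names and values are made up and do not come from the survey.

```python
import pandas as pd

# Synthetic stand-in for two spending columns (hypothetical names and values).
answers = pd.DataFrame({
    "spending_a": [3, 3, 2, 4, 3, 1, 5, 3],
    "spending_b": [4, 5, 4, 3, 5, 4, 2, 4],
})

# Count the answers per category 1..5 for each column (0 if a category is unused).
counts = answers.apply(
    lambda col: col.value_counts().reindex(range(1, 6), fill_value=0)
)

# Most frequent answer per column.
modes = counts.idxmax()
print(counts)
print(modes)
```

In this toy example the first column peaks at the center value and the second at a higher value, mirroring the two patterns described above.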

Comparing the overall spending of male and female participants, a significant difference was found, with female participants showing lower spending values than male ones. The significance was assessed using a hypothesis test. However, no such difference was found between people with higher and lower education.
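The summary does not state which hypothesis test was used. As an illustration of how such a group comparison can work, the sketch below uses a two-sided permutation test on the difference in group means. This is a swapped-in technique, not necessarily the one from the notebook, and `permutation_test` as well as the example data are hypothetical.

```python
import numpy as np


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_permutations):
        # Randomly reassign group labels and recompute the difference.
        shuffled = rng.permutation(pooled)
        diff = shuffled[: a.size].mean() - shuffled[a.size:].mean()
        hits += int(abs(diff) >= abs(observed))
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (hits + 1) / (n_permutations + 1)
    return float(observed), float(p_value)


# Made-up spending scores for two groups (not survey data).
group_1 = [2, 3, 2, 3, 3, 2, 4, 3, 2, 3]
group_2 = [3, 4, 4, 3, 5, 4, 3, 4, 4, 5]
difference, p_value = permutation_test(group_1, group_2)
print(f"difference in means: {difference:.2f}, p-value: {p_value:.4f}")
```

A small p-value indicates that a difference at least as large as the observed one rarely arises when the group labels are shuffled at random.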

In general, it seems very difficult to predict spending habits from behavioral aspects. Only very few variables are strongly correlated with spending, and some of those are obvious, like going shopping as a hobby.
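One way to screen for such correlations is to correlate every candidate variable with the spending variable and rank the results by absolute value. The following sketch uses synthetic data with hypothetical column names and Pearson correlation, which may differ from the measure used in the notebook.

```python
import pandas as pd

# Made-up answers on the 1..5 scale (not survey data).
data = pd.DataFrame({
    "shopping_hobby": [1, 2, 3, 4, 5, 5, 2, 4],
    "unrelated_item": [3, 1, 4, 2, 5, 1, 3, 2],
    "spending":       [1, 2, 3, 4, 5, 4, 2, 5],
})

# Pearson correlation of every feature column with the spending column.
correlations = data.drop(columns="spending").corrwith(data["spending"])

# Rank features by the strength of the correlation, regardless of sign.
ranked = correlations.abs().sort_values(ascending=False)
print(ranked)
```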

Using a simple neural network model with a single hidden layer and summing the absolute values of the coefficients for each input, the most predictive features are:

  • Having shopping as a hobby
  • Knowing the right people
  • Saving a lot of money
  • The belief that bad people will suffer one day
  • Being well-mannered and looking after one's appearance
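The ranking above is based on the input-layer coefficients of the network. A minimal sketch of this idea is shown below, assuming scikit-learn's `MLPRegressor` as the model (the summary does not name the class that was used) and synthetic data with hypothetical feature names. The hidden-layer size and solver are arbitrary choices for the example.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic data: only the first feature influences the target (not survey data).
rng = np.random.default_rng(0)
feature_names = ["informative_item", "noise_item_1", "noise_item_2"]  # hypothetical
X = rng.integers(1, 6, size=(400, 3)).astype(float)
y = X[:, 0] + rng.normal(scale=0.3, size=400)

# Standardize inputs so that weight magnitudes are comparable across features.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Single hidden layer, as described above; size and solver are assumptions.
model = MLPRegressor(hidden_layer_sizes=(8,), solver="lbfgs",
                     max_iter=1000, random_state=0)
model.fit(X_std, y)

# coefs_[0] has shape (n_features, n_hidden): one row of weights per input.
importance = np.abs(model.coefs_[0]).sum(axis=1)

# Sort features from largest to smallest summed absolute weight.
ranking = sorted(zip(feature_names, importance), key=lambda p: p[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Summed absolute weights are only a rough indicator of importance, since they ignore the output-layer weights and interactions between hidden units.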

However, the neural network model predicts spending habits poorly, with an R² score of just 0.29. This shows that predicting spending habits from "soft factors" like personal opinions and hobbies is difficult, and that considerably more sophisticated models or better features are needed.
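For reference, the R² score compares the model's squared error with that of a baseline that always predicts the mean of the target: a value of 1 means perfect prediction, and a value of 0 means no improvement over the mean. The sketch below computes it by hand; `r2_score_manual` is a hypothetical helper and the numbers are made up.

```python
import numpy as np


def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of the mean baseline
    return float(1.0 - ss_res / ss_tot)


y_true = [1, 2, 3, 4, 5]
print(r2_score_manual(y_true, y_true))           # perfect predictions
print(r2_score_manual(y_true, [3, 3, 3, 3, 3]))  # always predicting the mean
```

By this definition, a score of 0.29 means the model removes about 29% of the squared error of the mean baseline.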