This repository contains code (in the form of a Jupyter notebook) for analysing the FSES (Comenius University in Bratislava) young people survey from 2013. The dataset was chosen for a Udacity data science project because it offers insights into behavioral aspects (hobbies, views, opinions, etc.) as well as features of commercial interest (in particular spending habits). Since young people are often an important target group for marketing, the dataset is well suited to a business intelligence case study.
To run the code, the jupyter software as well as a basic scientific python software stack is needed, containing the following packages (version numbers refer to the versions this code was last successfully tested with):
numpy 1.21.2
scipy 1.7.3
matplotlib 3.5.0
scikit-learn 1.0.1
seaborn 0.11.2
pandas 1.4.1
ipykernel 6.4.1 (for running the code using jupyter-notebook)
For information about how to install jupyter, see the jupyter documentation.
The necessary python packages can be installed using a package manager of choice (e.g. pip) or using a virtual environment management system like anaconda.
Instructions on how to install anaconda can be found in the anaconda documentation.
Once anaconda is installed, create and activate a new environment, then install all required packages like this:
conda install -c anaconda numpy scipy matplotlib scikit-learn seaborn pandas ipykernel
There is also an environment file, required_software.yml, which can be used to install all the packages at the versions with which the code was tested:
conda env create -f required_software.yml
To make the new environment available in jupyter-notebook, activate it, then run:
python -m ipykernel install --user --name=name_of_your_conda_env
First, clone this repository. While still in the parent directory of the repository, download the columns.csv and responses.csv files from kaggle. It is important that both files are located in the parent directory of this repository, not inside it.
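Because the notebook expects both CSV files next to the repository rather than inside it, a quick check of the layout can catch a misplaced download. The following is a minimal sketch and not part of the repository: the helper name locate_data_files is hypothetical, and it assumes it is given the path of the repository directory.

```python
from pathlib import Path

# File names taken from the instructions above.
DATA_FILES = ("columns.csv", "responses.csv")


def locate_data_files(repo_dir="."):
    """Return the paths of the data files in the parent directory of repo_dir.

    Hypothetical helper: raises FileNotFoundError naming any missing file.
    """
    parent = Path(repo_dir).resolve().parent
    paths = {name: parent / name for name in DATA_FILES}
    missing = [name for name, path in paths.items() if not path.is_file()]
    if missing:
        raise FileNotFoundError(f"Expected {', '.join(missing)} in {parent}")
    return paths
```

When called from the repository directory, locate_data_files() returns a dictionary whose values can be passed on to pandas.read_csv.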
After that, go to the repository directory and you should be able to run the code in a jupyter notebook:
jupyter-notebook fses_analysis.ipynb
Note: In the first cell of fses_analysis.ipynb, there is a parameter called save_plots. When this is set to True, the notebook creates a new folder called Plots and stores all generated plots there, in addition to showing them in-line. By default, this parameter is set to False, in which case plots are only shown in-line.
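The behaviour of the save_plots switch can be pictured with a small helper. This is only a sketch of the idea, not the notebook's actual code: the function name store_plot, the name argument and the PNG format are assumptions; only the save_plots flag and the Plots folder come from the description above.

```python
from pathlib import Path


def store_plot(fig, name, save_plots=False, out_dir="Plots"):
    """Save fig as <out_dir>/<name>.png when save_plots is True.

    fig is expected to be a matplotlib Figure (any object with a savefig
    method works). Returns the path written, or None if nothing was saved.
    """
    if not save_plots:
        return None  # default: the plot is only shown in-line
    folder = Path(out_dir)
    folder.mkdir(exist_ok=True)  # create the folder on first use
    target = folder / f"{name}.png"
    fig.savefig(target)
    return target
```

In a notebook cell, such a helper would be called right after a figure has been created, for example store_plot(fig, "spending_histogram", save_plots=save_plots), where the file name is a placeholder.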
This repository contains two files:
- fses_analysis.ipynb, the jupyter notebook that contains the analysis code
- required_software.yml, a YAML file that contains all python packages and their versions used for running the code. It can be used with anaconda to recreate a python environment with the exact same packages in order to run the code.
All but a few variables are encoded in categorical values ranging from 1 to 5, where 1 refers to a low and 5 to a high agreement with the statement presented in the questionnaire.
Exploratory analysis of the spending-habit variables shows that most distributions peak at the center value of 3, meaning that most participants report spending neither a large nor a very small amount of money. The only exceptions are "spending for healthy food" and "spending in shopping centers", both of which tend to be answered with higher values.
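Checking where a distribution peaks amounts to counting the answers per value. The sketch below uses made-up answers rather than survey data, and the helper name answer_counts is hypothetical; missing answers (NaN), if any, are skipped.

```python
import numpy as np


def answer_counts(answers, low=1, high=5):
    """Count how often each value from low to high occurs, ignoring NaN."""
    values = np.asarray(answers, dtype=float)
    values = values[~np.isnan(values)].astype(int)
    counts = np.bincount(values, minlength=high + 1)[low:high + 1]
    return dict(zip(range(low, high + 1), counts.tolist()))


# Made-up answers on the 1-5 scale, for illustration only.
example = [3, 3, 2, 4, 3, 5, 1, 3, np.nan, 4]
counts = answer_counts(example)
most_common = max(counts, key=counts.get)
print(counts, most_common)
```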
Comparing the overall spending of male and female participants shows a significant difference, with female participants reporting lower spending values than male ones. The significance was assessed using a hypothesis test. However, no such difference was found between people with higher and lower education.
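The text does not say which hypothesis test was used. As one possible illustration, the sketch below runs a two-sided permutation test on the difference of group means. The data are synthetic, the function name permutation_test is hypothetical, and the choice of test is an assumption rather than the notebook's method.

```python
import numpy as np


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference mean(a) - mean(b) and the p-value.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the difference.
        shuffled = rng.permutation(pooled)
        diff = shuffled[:a.size].mean() - shuffled[a.size:].mean()
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add one to numerator and denominator so the p-value is never exactly 0.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Synthetic 1-5 answers for two groups; not taken from the survey.
rng = np.random.default_rng(42)
group_a = rng.integers(1, 6, size=200)
group_b = np.clip(rng.integers(1, 6, size=200) - 1, 1, 5)
diff, p = permutation_test(group_a, group_b, n_permutations=2_000)
print(f"difference in means: {diff:.2f}, p-value: {p:.4f}")
```

A permutation test makes no normality assumption, which is convenient for answers on a 1 to 5 scale; a rank-based test would be another reasonable option.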
In general, it seems very difficult to predict spending habits from behavioral aspects. Only very few variables are strongly correlated with spending, and some of those are obvious, like going shopping as a hobby.
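Ranking variables by their correlation with spending can be sketched as follows. The example uses the Pearson correlation from NumPy; which correlation measure the notebook uses is not stated, so this is an assumption. The variable names and values are made up.

```python
import numpy as np


def rank_by_correlation(features, target):
    """Sort feature names by the absolute Pearson correlation with target."""
    target = np.asarray(target, dtype=float)
    corr = {
        name: float(np.corrcoef(np.asarray(values, dtype=float), target)[0, 1])
        for name, values in features.items()
    }
    return sorted(corr.items(), key=lambda item: abs(item[1]), reverse=True)


# Made-up answers on the 1-5 scale; the names are placeholders.
spending = [1, 2, 3, 4, 5, 3, 2, 4]
features = {
    "shopping_hobby": [1, 2, 3, 5, 5, 3, 1, 4],  # tracks spending closely
    "unrelated_item": [3, 1, 4, 2, 3, 5, 2, 3],  # little relation to spending
}
ranked = rank_by_correlation(features, spending)
for name, r in ranked:
    print(f"{name}: r = {r:+.2f}")
```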
Using a simple neural network model with a single hidden layer and looking at the absolute sum of coefficients for each input, the most predictive features are:
- Having shopping as a hobby
- Knowing the right people
- Saving a lot of money
- The belief that bad people will suffer one day
- Being well-mannered and looking after one's appearance
However, the neural network model is not very good at predicting spending habits: its R2 score is just 0.29. This shows that predicting spending habits from "soft factors" like personal opinions and hobbies is quite difficult, and that considerably more sophisticated models or better features would be needed to solve this task.
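To make the two quantities above concrete, the sketch below ranks inputs by the sum of the absolute first-layer weights (one reading of "absolute sum coefficients", taken here as an assumption) and computes the R2 score from its definition, 1 - SS_res / SS_tot. It uses made-up numbers rather than the trained model, and the function names are hypothetical. For a fitted scikit-learn MLPRegressor, the first-layer weight matrix of shape (n_inputs, n_hidden) is available as coefs_[0]; whether the notebook uses that class is not stated.

```python
import numpy as np


def input_importance(first_layer_weights, names):
    """Rank inputs by the sum of absolute weights leaving each input.

    first_layer_weights has shape (n_inputs, n_hidden).
    """
    weights = np.asarray(first_layer_weights, dtype=float)
    scores = np.abs(weights).sum(axis=1)
    order = np.argsort(scores)[::-1]  # largest score first
    return [(names[i], float(scores[i])) for i in order]


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up weights for three inputs and two hidden units.
weights = [[0.9, -0.7], [0.1, 0.2], [-0.4, 0.3]]
importance = input_importance(weights, ["input_a", "input_b", "input_c"])
print(importance)

# Made-up targets and predictions.
score = r2_score([1, 2, 3, 4, 5], [2, 2, 3, 4, 4])
print(score)
```

An R2 score of 1 corresponds to perfect predictions and 0 to a model that does no better than always predicting the mean, so 0.29 means that roughly 29% of the variance in spending is explained.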