/Insurance-company-Piccardi-Music-two-data-analysis-on-mock-practical-cases

Two data analysis projects to practice Pandas, Data Wrangling, Data Visualization, Regression, Observational Studies, Statistics and Supervised Learning (with scikit-learn and statsmodels).

Primary language: Jupyter Notebook

📈Insurance_company_ _Piccardi_Music

This project contains two data analyses, each applied to a different mock practical case. The cases were given as a group assignment for the EPFL course CS-401 Applied Data Analysis.

Authors

Insurance Company

This data study involves an experiment about honesty. Oftentimes, we are asked to confirm our honest intentions by signing at the end of a document. For example, in tax returns or insurance policy forms, we are often asked to sign our names under a text that reads something like “I hereby certify that the above statements are true and correct to the best of my knowledge.”

However, when individuals sign after lying in the form, they may not feel the need to correct the falsehoods they have reported. In that context, signing at the beginning rather than at the end of the document could decrease dishonesty, as those filling in the form would be aware of the ethical requirements before they provide the information.

This intuition has led researchers to partner up with a motorcycle insurance company to run a randomized experiment. In this insurance company (as well as in many others), customers had to report the exact odometer kilometrage in order for the company to adjust the insurance premiums. Note that motorcycles with lower kilometrage are less likely to have issues, and thus will result in a lower insurance premium. Therefore, customers have an incentive to lie, reporting a kilometrage lower than the real value, in order to save money.

In the experiment, two different forms were created: one where the signing was done at the end, and another where the signing was done at the beginning. The insurance company then randomized these forms (i.e., each customer received exactly one form, each with probability 50%) and sent back the data that customers had provided.

In this assignment, we took the role of the researcher analyzing this data.
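Because the forms were assigned at random, a difference in the reported kilometrage between the two groups can be checked with a permutation test. The following is a minimal sketch using only the Python standard library; it is not necessarily the method used in the notebook, and the function name, the variable names and the numbers are invented for illustration.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference mean(group_a) - mean(group_b) and an
    approximate p-value obtained by reshuffling the group labels.
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (extreme + 1) / (n_permutations + 1)


# Made-up reported kilometrages, NOT taken from the assignment data.
sign_at_top = [26100, 24800, 27300, 25900, 26700, 25400]
sign_at_bottom = [23900, 24100, 22800, 25000, 23300, 24600]

difference, p_value = permutation_test(sign_at_top, sign_at_bottom, n_permutations=2_000)
print(f"difference in means: {difference:.1f} km, p-value: {p_value:.4f}")
```

A small p-value would suggest that the position of the signature is associated with the reported kilometrage; with real data, the reshuffling should respect however the company actually randomized the forms.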

Piccardi Music

Piccardi Music, a music label, has provided us with data to answer two questions:

  • Will this album be a hit? Build a regressor to predict whether an album will be well received or not.
  • Second Album Syndrome: shed light on one of the business’s oldest enigmas, the “second album syndrome.” In a nutshell, the “second album syndrome” is a theory that states that the second album of a band is always bad. Does data confirm it?
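One simple way to look at the "second album syndrome" is to compare, for every artist, the score of the second album with the score of the first. The sketch below uses only the Python standard library and assumes a hypothetical record layout of (artist, release year, score); the names and values are invented and do not come from the Piccardi Music data, and the notebook may proceed differently.

```python
from collections import defaultdict
from statistics import mean


def second_album_differences(albums):
    """Return {artist: score of second album - score of first album}.

    `albums` is an iterable of (artist, release_year, score) tuples.
    Artists with fewer than two albums are skipped.
    """
    by_artist = defaultdict(list)
    for artist, year, score in albums:
        by_artist[artist].append((year, score))

    differences = {}
    for artist, records in by_artist.items():
        if len(records) < 2:
            continue
        records.sort(key=lambda record: record[0])  # oldest release first
        differences[artist] = records[1][1] - records[0][1]
    return differences


# Made-up example records, for illustration only.
albums = [
    ("Band A", 2001, 7.0), ("Band A", 2003, 6.0),
    ("Band B", 1998, 5.0), ("Band B", 2000, 8.0),
    ("Band C", 2010, 9.0),
]

diffs = second_album_differences(albums)
share_worse = sum(d < 0 for d in diffs.values()) / len(diffs)
print(f"mean change: {mean(diffs.values()):+.2f}, share with a worse second album: {share_worse:.0%}")
```

Since the theory claims the second album is always bad, a single artist with a positive difference already contradicts the strict version; the mean change and the share of worse second albums describe the weaker, statistical version of the claim.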

Results

Since the analyses and experiments performed were very different, we suggest reading the two Jupyter notebooks, Insurance Company and Piccardi Music (both can be opened directly on GitHub), where the assignment requests are carefully explained together with the code and our interpretation of the results.

How to install and reproduce results

Download this repository as a zip file and extract it into a folder. The easiest way to run the code is to install the Anaconda 3 distribution (available for Windows, macOS and Linux). To do so, follow the guidelines on the official website (select the Python 3 version): https://www.anaconda.com/download/

Additional packages required are:

  • pandas
  • seaborn
  • matplotlib
  • scipy
  • scikit-learn (imported in Python as sklearn)
  • tqdm
  • statsmodels

To install them, open Anaconda Prompt (anaconda3) and first move to the folder where you extracted the repository:

cd *THE_FOLDER_PATH_WHERE_YOU_DOWNLOADED_AND_EXTRACTED_THIS_REPOSITORY*

Then run the following command for each of the packages listed above:

conda install *PACKAGE_NAME*

Some packages might require more complex installation procedures. If the above command doesn't work for a package, search for "How to install PACKAGE_NAME on YOUR_MACHINE'S_OS" and follow the guides you find.
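To confirm that the installation worked, the short sketch below checks whether each required package can be imported. It is not part of the repository; the function name is hypothetical, and it relies only on the standard library. Note that scikit-learn is imported under the name sklearn.

```python
from importlib.util import find_spec

# Import names of the packages listed above
# (scikit-learn is installed as "scikit-learn" but imported as "sklearn").
REQUIRED = ["pandas", "seaborn", "matplotlib", "scipy", "sklearn", "tqdm", "statsmodels"]


def missing_packages(names=REQUIRED):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Run it from the same Anaconda environment that will be used for the notebooks, since each environment has its own set of installed packages.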

Finally, run the Jupyter notebooks Insurance_company.ipynb and Piccardi_Music.ipynb.

Files description

  • Data analysis 1/data: folder containing the data to be analysed for the first mock case.

  • Data analysis 1/Insurance_company.ipynb: main file, a jupyter notebook containing all the code to rerun all the experiments and analysis for the first mock case.

  • Data analysis 2/data: folder containing the data to be analysed for the second mock case.

  • Data analysis 2/Piccardi Music.ipynb: main file, a jupyter notebook containing all the code to rerun all the experiments and analysis for the second mock case.

🛠 Skills

Python, Pandas, Seaborn, Matplotlib, SciPy, scikit-learn, Statsmodels. Creating visualizations to explain findings, regression analysis and supervised learning, and handling tabular data with Pandas.
