The code in this repo uses the Python pandas and sklearn libraries to analyse the Stack Overflow 2020 Developer Survey results.
The focus of this analysis is to look at differences in responses by country and attempt to build linear classifiers for specific countries.
The author is a noob data scientist completing the Udacity Data Science Nanodegree, and the first assignment requires a repo and a blog.
The classifiers suck and should not be used in any meaningful way, but the journey was bountiful and well documented.
A general exploration of the data is presented, and then several specific business questions are answered:
- Which countries had the most respondents?
- Were there any country-based biases in the technology preference survey responses?
- Can we predict a respondent's country of origin using their answers, excluding answers that obviously and directly predict country?
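As a rough illustration of the first question, here is a minimal sketch of counting respondents per country with pandas. It runs on a tiny made-up DataFrame rather than the real survey file, and it assumes the country is stored in a column named `Country`; the helper name `top_countries` is hypothetical and is not taken from the notebooks.

```python
import pandas as pd

# Toy stand-in for survey_results_public.csv (one row per respondent).
# The values here are invented for illustration only.
toy = pd.DataFrame(
    {"Country": ["United States", "India", "United States",
                 "Germany", "India", "United States", None]}
)


def top_countries(df: pd.DataFrame, n: int = 10) -> pd.Series:
    """Return the n most frequent Country values (missing values are ignored)."""
    return df["Country"].value_counts().head(n)


top = top_countries(toy, n=2)
print(top)
```

With the real data, the same call would be made on the DataFrame loaded from notebook/assets/survey_results_public.csv.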
I chose not to include the 94MB survey results in the git repo, so you will need to get them yourself.
- Go here
- download and ensure this file is in place: notebook/assets/survey_results_public.csv
- download and ensure this file is in place: notebook/assets/survey_results_schema.csv
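Before starting the notebooks it can help to confirm both files landed in the right place. The following is a small sketch using only the standard library; the function name `missing_survey_files` is hypothetical, and the default path assumes the script is run from the repo root.

```python
from pathlib import Path

# File names taken from the download steps above.
EXPECTED_FILES = ("survey_results_public.csv", "survey_results_schema.csv")


def missing_survey_files(assets_dir="notebook/assets"):
    """Return the expected survey files that are not present in assets_dir."""
    base = Path(assets_dir)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]


missing = missing_survey_files()
if missing:
    print("Missing survey files:", ", ".join(missing))
else:
    print("All survey files are in place.")
```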
This repo assumes a local docker installation, and uses the jupyter/datascience-notebook to create a portable workspace that requires no other installation.
To get this running locally on OSX or Linux:
- install docker desktop
- in your terminal of choice with CWD set to the repo root execute:
./bin/go.sh
- in the lines of text output, find the localhost link and copy/paste it into your browser. Example link:
http://127.0.0.1:8888/lab?token=abc123beepbeep456boopboop
./bin/go.sh creates a docker container hosting the jupyter notebook, with a mount to the ./notebook directory of this repo.
This is the content of the bin/go.sh file circa July 17, 2021:
docker run --rm -p 8888:8888 --name ds-so2020 -e JUPYTER_ENABLE_LAB=yes -v $(pwd)/notebook:/home/jovyan/work jupyter/datascience-notebook:latest
NOTE: only tested on mac
Simply run ./setup.sh to create a python virtual env and install all required dependencies.
From above:
The author is a noob data scientist completing the Udacity Data Science Nanodegree, and the first assignment requires a repo and a blog.
That aside, the code seeks to analyse the SO 2020 dataset with several objectives in mind:
* practice data prep techniques including removal and imputation
* experiment with sklearn models
* attempt to build classifiers that identify membership to a specific country as a boolean
* create a well documented future reference for all of the above
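The objectives above (removal, imputation, and a boolean country classifier) can be sketched roughly as follows. This is an illustration on invented toy data, not the pipeline used in the notebooks: the feature columns `years_coding` and `editor` are hypothetical, and logistic regression is just one possible choice of linear classifier.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the survey data; feature column names are hypothetical.
toy = pd.DataFrame({
    "Country": ["India", "Germany", "India", None, "India", "Germany", "Germany"],
    "years_coding": [3.0, 12.0, np.nan, 7.0, 2.0, 15.0, 10.0],
    "editor": ["vim", "emacs", "vim", "vim", np.nan, "emacs", "emacs"],
})

# Removal: rows without a Country cannot be labelled, so drop them.
labelled = toy.dropna(subset=["Country"])

# Boolean target: does the respondent belong to the chosen country?
y = (labelled["Country"] == "India").astype(int)

# Drop Country from the features, since it directly reveals the target.
features = labelled.drop(columns=["Country"])

# Categorical imputation: treat a missing answer as its own category,
# then one-hot encode the column.
features["editor"] = features["editor"].fillna("Unknown")
X = pd.get_dummies(features, columns=["editor"])

# Numeric imputation (median) followed by a linear classifier.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    LogisticRegression(max_iter=1000),
)
model.fit(X, y)

# Predictions on the training rows, only to show the output shape.
predictions = model.predict(X)
print(predictions)
```

A real evaluation would hold out a test set; this sketch predicts on the training rows only to keep the example short.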
Any analysis presented in the blog will have a specific python notebook to show the work and demonstrate reproducibility.
There are some notebook files not used in the blog which are also included in the repo.
The notebooks in the jupyter workspace (i.e., ./notebook) are listed below, followed by the list of python helper files and descriptions.
- 0_overview.ipynb - notebook with markdown stating business context
- 1_basics.ipynb - notebook showing first steps to probe dataset
- 2_value_counts.ipynb - notebook showing some drill down steps to probe dataset
- 3_top_10_countries_measured_by_response_count.ipynb - showing survey responses by country
- 4_multiple_choice_responses.ipynb - show answers by deviation from mean and Microsoft sentiment analysis
- 5_country_classifier.ipynb - orchestrate the data prep and modelling phases of CRISP-DM
- 6_visualise_binary_classifier_results.ipynb - analyse binary classifier results
- assets.* - the contents of the SO 2020 dataset zip file
- pickles - the classifier results used in the blog
- all libraries are in directories named for the notebook where they are used
The ipynb files should not be read directly using an IDE - they are meant to be interacted with using a browser. The Installation section above outlines how to run ./bin/go.sh and then copy/paste the provided URL into a browser.
All notebooks can be rerun to reproduce the results.
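For readers curious about the "deviation from mean" idea mentioned for 4_multiple_choice_responses.ipynb, here is one possible interpretation as a sketch: compare the share of respondents in each country who picked an option against the share across all respondents. It assumes multiple-choice answers are stored as semicolon-separated strings in a column such as `LanguageWorkedWith`; the toy data is invented, and the notebook itself may compute this differently.

```python
import pandas as pd

# Toy stand-in for the survey data; values are invented for illustration.
toy = pd.DataFrame({
    "Country": ["India", "India", "Germany", "Germany"],
    "LanguageWorkedWith": ["Python;Java", "Java", "Python;Rust", "Python"],
})

# One indicator column per answer option (1 if the respondent selected it).
options = toy["LanguageWorkedWith"].str.get_dummies(sep=";")

# Share of all respondents who selected each option.
overall_share = options.mean()

# Share of respondents within each country who selected each option.
share_by_country = options.groupby(toy["Country"]).mean()

# Positive values mean the option is more popular in that country
# than across all respondents; negative values mean less popular.
deviation = share_by_country - overall_share
print(deviation)
```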
If you want to contribute, fork and submit a PR, that would be top notch.
Go nuts. Really, just get right in there.
This is the work of Kyle Zeeuwen. There is some inspiration and probably borrowed code from the course presenter Josh Bernhard, specifically this repo.
- Udacity is so far so good 👍. The review of my first submission was thorough and valuable. Some summary notes can be found here
- Cover photo : https://unsplash.com/photos/oMpAz-DN-9I : free via Unsplash : great photo by Greg Rakozy
- Stack Overflow for conducting the survey and sharing the results