/ds-stackoverflow2020-analysis

Analysis of the Stack Overflow 2020 Developer Survey results. Includes code and writeup.


Overview

The code in this repo uses the Python pandas and sklearn libraries to analyse the Stack Overflow 2020 Developer Survey Results.

The focus of this analysis is to look at differences in responses by country and attempt to build linear classifiers for specific countries.

The author is a noob data scientist completing the Udacity Data Science Nano degree, and the first assignment requires a repo and a blog.

The classifiers suck and should not be used in any meaningful way, but the journey was bountiful and well documented.

Business Overview

A general exploration of the data is presented, and then several specific business questions are answered:

  • Which countries had the most respondents?
  • Were there any country-based biases in the technology preference survey responses?
  • Can we predict a respondent's country of origin using their answers, excluding answers that obviously and directly predict country?
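The first two questions can be approached with a few lines of pandas. The block below is a minimal sketch, not the code from the notebooks: the function names are made up for illustration, and it assumes the survey CSV has a `Country` column and stores multi-select answers as semicolon-separated strings (for example in a column such as `LanguageWorkedWith`).

```python
import pandas as pd


def top_countries(df: pd.DataFrame, n: int = 10) -> pd.Series:
    """Return the n countries with the most respondents.

    Assumes a "Country" column; the function name is illustrative only.
    """
    # value_counts() ignores missing values and sorts in descending order.
    return df["Country"].value_counts().head(n)


def tech_share_by_country(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Share of each country's respondents who picked each option.

    Assumes `column` holds semicolon-separated multi-select answers.
    Respondents who skipped the question or the country are left out,
    so the shares are relative to those who answered both.
    """
    subset = df[["Country", column]].dropna()
    # One 0/1 indicator column per option, e.g. "Python;SQL" -> Python=1, SQL=1.
    indicators = subset[column].str.get_dummies(sep=";")
    # The mean of a 0/1 indicator within a country is the share who picked it.
    return indicators.groupby(subset["Country"]).mean()
```

With the survey file in place, something like `top_countries(pd.read_csv("notebook/assets/survey_results_public.csv"))` would give the respondent counts, and comparing rows of the share table is one way to look for country-level differences in technology preferences.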

Installation Notes

Get the survey results

I chose not to include the 94MB survey results in the git repo. So you will need to get them yourself.

  • Go here
  • download and ensure this file is in place: notebook/assets/survey_results_public.csv
  • download and ensure this file is in place: notebook/assets/survey_results_schema.csv
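A quick way to confirm both files landed where the notebooks expect them is a small check like the one below. This is a sketch only: the helper name `missing_survey_files` is hypothetical and is not part of the repo.

```python
from pathlib import Path

# The two paths named in the installation notes, relative to the repo root.
REQUIRED_FILES = (
    "notebook/assets/survey_results_public.csv",
    "notebook/assets/survey_results_schema.csv",
)


def missing_survey_files(repo_root="."):
    """Return the required survey files that are not present under repo_root."""
    root = Path(repo_root)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

Calling `missing_survey_files()` from the repo root returns an empty list once both downloads are in place.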

(Option A) Run notebooks via docker

This repo assumes a local docker installation, and uses the jupyter/datascience-notebook image to create a portable workspace that requires no other installation.

To get this running locally on OSX or Linux:

  • install docker desktop
  • in your terminal of choice with CWD set to the repo root execute: ./bin/go.sh
  • in the lines of text output, find the localhost link and copy/paste it into your browser. Example link:
    • http://127.0.0.1:8888/lab?token=abc123beepbeep456boopboop

./bin/go.sh creates a docker container hosting the jupyter notebook with a mount to the ./notebook directory of this repo.

This is the content of the bin/go.sh file circa July 17, 2021:

docker run --rm -p 8888:8888 --name ds-so2020 -e JUPYTER_ENABLE_LAB=yes -v $(pwd)/notebook:/home/jovyan/work jupyter/datascience-notebook:latest

(Option B) Run notebooks locally via ipython

NOTE: only tested on mac

Simply run ./setup.sh to create a Python virtual env and install all required dependencies.

Motivation

From above:

The author is a noob data scientist completing the Udacity Data Science Nano degree, and the first assignment requires a repo and a blog.

That aside, the code seeks to analyse the SO 2020 dataset with several objectives in mind:

  • practice data prep techniques including removal and imputation
  • experiment with sklearn models
  • attempt to build classifiers that identify membership to a specific country as a boolean
  • create a well documented future reference for all of the above
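To make these objectives concrete, here is a minimal sketch of a boolean "is this respondent from country X" classifier built with sklearn. It is not the code from the notebooks. It assumes a `Country` column, uses `LogisticRegression` as a stand-in linear classifier, and the function names and any feature columns passed in are placeholders chosen for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_country_classifier(numeric_cols, categorical_cols):
    """Pipeline: impute missing values, encode/scale features, fit a linear model."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),   # imputation for numbers
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    prep = ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])
    # class_weight="balanced" because one country is usually a small minority.
    model = LogisticRegression(max_iter=1000, class_weight="balanced")
    return Pipeline([("prep", prep), ("model", model)])


def fit_country_classifier(df, country, numeric_cols, categorical_cols, seed=0):
    """Fit the pipeline and return (classifier, held-out accuracy)."""
    # Removal: rows without a country cannot be labelled, so drop them.
    df = df.dropna(subset=["Country"])
    y = df["Country"] == country          # boolean target: member or not
    X = df[list(numeric_cols) + list(categorical_cols)]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    clf = build_country_classifier(list(numeric_cols), list(categorical_cols))
    clf.fit(X_train, y_train)
    return clf, clf.score(X_test, y_test)
```

Accuracy alone can look flattering when the target country is rare, so precision and recall for the positive class are worth checking before trusting any such model.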

File Descriptions

Any analysis presented in the blog will have a specific Python notebook to show the work and demonstrate reproducibility.

There are some notebook files not used in the blog which are also included in the repo.

The notebooks in the jupyter workspace (i.e., ./notebook) are listed below, followed by the list of Python helper files and descriptions.

Notebooks

Misc

Libraries

  • all libraries are in directories named for the notebook where they are used

How to interact with project

The ipynb files should not be read directly using an IDE - they are meant to be interacted with using a browser. The Installation section above outlines how to run ./bin/go.sh and then copy/paste the provided URL into a browser.

All notebooks can be rerun to reproduce the results.

If you want to contribute, fork the repo and submit a PR. That would be top notch.

Licencing

Go nuts. Really, just get right in there.

MIT License

Authors

This is the work of Kyle Zeeuwen. There is some inspiration and probably borrowed code from the course presenter Josh Bernhard, specifically this repo.

Acknowledgements