intro-clf-trees

An introduction to multi-class classification with decision trees 🌳

Note: this repo is intended primarily for explanation and prioritizes simplicity over optimality, so please do reach out if something is unclear or unhelpful.

A short exploration of scikit-learn's make_classification()
- A function that generates "synthetic" data sets suitable for classification problems
Generating an example of classifiable web-session data
- The idea is to analyze some typical features of a user's session on a website to predict which page they are ultimately looking for.
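
As a rough illustration of the function described above, here is a minimal sketch of a make_classification() call. The parameter values (1000 samples, 6 features, 3 classes) are assumptions chosen for this example and are not necessarily the ones used in the notebook.

```python
from sklearn.datasets import make_classification

# Hypothetical settings for illustration only; the notebook may use others.
X, y = make_classification(
    n_samples=1000,          # number of synthetic records (e.g. web sessions)
    n_features=6,            # total number of feature columns
    n_informative=4,         # features that actually carry class signal
    n_redundant=1,           # linear combinations of the informative features
    n_classes=3,             # more than two classes makes this multi-class
    n_clusters_per_class=1,
    random_state=42,         # fixed seed so the data set is reproducible
)

print(X.shape)  # (1000, 6)
print(y.shape)  # (1000,)
```

X holds the feature matrix and y holds one integer class label per row, which is the shape of data a classifier expects.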

2.1-understanding-decision-tree-classification

Explaining the problem we are trying to solve and the data we'll be working with
Brief introduction to what a decision tree is and how it can be used to predict the classification of a record
Brief explanation of why you should almost never rely on a single decision tree
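
One common reason for not relying on a single tree is overfitting: an unrestricted tree can memorize its training data. The sketch below, which uses synthetic data and hypothetical settings rather than the notebook's own, fits one DecisionTreeClassifier and one RandomForestClassifier so their scores can be compared. The notebook may give other reasons or use a different comparison.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-class data; all values here are illustrative assumptions.
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=4, n_redundant=1,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A single tree with no depth limit keeps splitting until every leaf is pure.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A random forest averages many trees trained on bootstrapped samples.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

tree_train_acc = tree.score(X_train, y_train)
tree_test_acc = tree.score(X_test, y_test)
forest_test_acc = forest.score(X_test, y_test)

print(f"single tree, train accuracy: {tree_train_acc:.3f}")
print(f"single tree, test accuracy:  {tree_test_acc:.3f}")
print(f"random forest, test accuracy: {forest_test_acc:.3f}")
```

A gap between the single tree's training and test accuracy is the symptom to look for. How the forest compares will depend on the data and the seed.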

data

Contains a CSV generated by 1.1-generate-an-example-classification-data-set.ipynb, which simulates a simplified version of a user's session data on a website.

Dependencies:

yml files: where the software package dependencies are specified (one for macOS, the other for 64-bit Windows).

You can automatically create a virtual environment in Anaconda with the following command (follow the prompts to activate it).

conda env create -f path/to/intro-clf-trees-venv-win.yml

You can also activate the venv in Anaconda Navigator by selecting it from the dropdown menu in the toolbar. Then launch a Jupyter Notebook server to open the ipynb notebooks within that venv. Here's a helpful article: Getting started with Python environments (using Conda), Oct 21, 2018, consulted Jun 5, 2020.

Explaining the problem we are trying to solve and the data we'll be working with

Business Requirement: Improve Usability

Let's suppose we manage a website for an organization which stores and publishes a large quantity of important content.
The organization would like to improve the usability of its site to "just work, you know, the way Amazon or Google does."
However, unlike Amazon or Google, which can prioritize paid or trending content and allow the rest to be buried at the end of a search result, our organization is committed to improving usability across all of its content, even the content that is rarely relevant, because when it is relevant, it's crucial.
They want a "quick win": an innovation that radically transforms usability across the whole site and attracts new users.
But it would also be great if you didn't move things around too much, because their most loyal users already know their way around and get frustrated if they need to learn a whole new layout.

Proposed Solution: A Content Recommender

A small, unobtrusive panel with a short list of "quick links" that take each user directly to the content they want, without having to slog through a long path of clicks, scrolls, and dropdowns, and without altering the current structure of the site.
We are not actually going to analyze their content and try to correlate it with their search queries
- their content is "messy" and hard to access, analyze, or understand
- their queries are inconsistent
  - they are either "naive" inputs by newcomers accustomed to "Google-like power"
  - or highly specialized inputs from expert power users
Instead, we are going to analyze the usage patterns in the web-session data
- and label each session according to the content the user interacted with
- The advantages of this data set are
  - that it is much less likely to have been made "messy" in the past
  - and it is constantly refreshing and updating itself (every time a user uses the site, they add a relevant data record to our set)
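
To make the labeling idea above concrete, here is a minimal sketch of turning raw page-visit events into one labeled record per session. Everything in it is hypothetical: the event format, the page names, the feature names, and above all the assumption that the last page visited in a session is the content the user was looking for. The repo's actual data is synthetic and may be structured differently.

```python
from collections import defaultdict

# Hypothetical raw events: (session_id, page, seconds_on_page), in visit order.
events = [
    ("s1", "home", 5),
    ("s1", "search", 12),
    ("s1", "reports/annual", 90),
    ("s2", "home", 3),
    ("s2", "forms/travel", 40),
]


def build_records(events):
    """Group events by session and emit one labeled record per session."""
    sessions = defaultdict(list)
    for session_id, page, seconds in events:
        sessions[session_id].append((page, seconds))

    records = []
    for session_id, visits in sessions.items():
        # Assumption: the final page of the session is the target content.
        path, (target_page, _) = visits[:-1], visits[-1]
        records.append({
            "session_id": session_id,
            "clicks_before_target": len(path),
            "seconds_before_target": sum(seconds for _, seconds in path),
            "label": target_page,
        })
    return records


records = build_records(events)
for record in records:
    print(record)
```

Each record pairs a few session features with a label, so every new session on the site adds one more training example for a classifier.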

paulo-metrostar/intro-clf-trees