Intro to Machine Learning - Pattern Recognition for Fun and Profit

Overview

This is a nice free introduction to Machine Learning with Python.

[Image: xkcd comic]

Here is how the folks at NVIDIA see the relationship between Artificial Intelligence, Machine Learning and Deep Learning:

[Image: AI versus ML versus Deep Learning]

Towards the beginning of my career, I was interested in AI and joined a society founded by Donald Michie - who was then at the University of Edinburgh. I wonder how much things have progressed since then?

Machine Learning is hot right now, and of course the cloud providers have noticed.

Here is Google's Cloud offering:

    http://cloud.google.com/products/machine-learning/

For a more sober view of things, the following article is worth reading:

    http://www.cio.com/article/3223191/artificial-intelligence/a-practical-guide-to-machine-learning-in-business.html

Prerequisites

Chris Manning, Stanford, 3 Apr 2017:

"Essentially, Python has just become the lingua franca of nearly all the deep learning toolkits, so that seems the thing to use."

    http://youtu.be/OQQ-W_63UgQ?list=PL3FW7Lu3i5Jsnh1rnUwq_TcylNr7EkRe6&t=2102

For an explanation of why Python (as contrasted with other languages) is a good choice for natural language processing, the following link is worth a look:

    http://www.nltk.org/book_1ed/ch00-extras.html

  1. Python (Python 2 support has been dropped from a number of projects, so use Python 3)

  2. pip, or possibly pip3 (if both Python 2 and Python 3 are installed)

pip (or pip3) is the package manager for Python, much as npm is the package manager for the Node.js platform.

scikit-learn

The course uses this library, which it refers to as sklearn.

The latest version may be found here:

    http://scikit-learn.org/stable/

To install this library in multi-user (system-wide) mode (not recommended) with pip (replace with pip3 if using Python 3); note that the -U option upgrades an existing install to the latest version:

    pip install -U scikit-learn

To install this library in single-user mode (recommended) with pip (replace with pip3 if using Python 3):

    pip install --user scikit-learn

Libraries

It's not really possible to do much of anything in Python without additional libraries.

Essential libraries include:

Useful optional libraries include:

Verify library presence and version with pip as with scikit-learn:

    pip list --format=freeze | grep numpy

[Replace numpy above as necessary.]

Or verify library presence and version with Python:

    python -c "import numpy as im; print(im.__version__)"

[Likewise replace numpy above as necessary.]

Or use try_import.py for multiple libraries as shown:

    $ python try_import.py numpy scipy sklearn keras pytorch
    "numpy" was imported
    "scipy" was imported
    "sklearn" was imported
    Using TensorFlow backend.
    "keras" was imported
    "pytorch" could not be imported - try "pip install --user pytorch"
    $

Install the library with pip (either multi-user or single-user) as with scikit-learn above.
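
The check above can be approximated in a few lines. This is only a sketch of how such a script might work, not the actual contents of try_import.py; the function name try_import and the module names in the usage line are my own assumptions. Also bear in mind that the name you import and the name you install with pip can differ (sklearn is installed as scikit-learn), so treat the suggested pip command as a hint only.

```python
import importlib


def try_import(names):
    """Try to import each named module and report the outcome.

    Returns a dict mapping each name to True (imported) or False (not found).
    Hypothetical helper for illustration; not the repo's try_import.py.
    """
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # ModuleNotFoundError is a subclass of ImportError.
            results[name] = False
            print('"{0}" could not be imported - try "pip install --user {0}"'.format(name))
        else:
            results[name] = True
            print('"{0}" was imported'.format(name))
    return results


# Usage: json ships with Python; the second name is deliberately bogus.
results = try_import(["json", "no_such_module_xyz"])
```

Only ImportError is caught here, so a library that is installed but broken in some other way will still raise.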

NumPy

NumPy enables a nice performance optimization known as single instruction, multiple data (SIMD).

Basically, this allows a whole vector or matrix to be handled in a single operation rather than one element at a time (compare 'vectors\ pt1.py' to 'vectors\ pt2.py').
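
To make the difference concrete, here is a minimal sketch contrasting an element-by-element loop with the equivalent whole-array NumPy expression. It is not taken from the two vectors files mentioned above; the variable names and values are made up for illustration.

```python
import numpy as np

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

# Pure-Python approach: add the vectors one element at a time.
loop_sum = [x + y for x, y in zip(a, b)]

# NumPy approach: a single expression operates on the whole array.
vec_a = np.array(a)
vec_b = np.array(b)
vec_sum = vec_a + vec_b

# The dot product is likewise a single call rather than a loop.
dot = float(np.dot(vec_a, vec_b))

print(loop_sum)          # [5.0, 7.0, 9.0]
print(vec_sum.tolist())  # [5.0, 7.0, 9.0]
print(dot)               # 32.0
```

On three elements the speed difference is invisible; it only shows up on large arrays, where NumPy does the looping in compiled code.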

Matplotlib / Seaborn

Matplotlib is great for plotting variables, but can be very low-level.

To make these graphs look a little better, check out my No More Blue repo.

Or - for a higher-level library - check out Seaborn.

[Seaborn will greatly simplify a number of difficult matplotlib graphing exercises.]

StatsModels

Although not used in this course, StatsModels is also worth a look.

It provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.

Some Seaborn functions will optionally use StatsModels if it is installed.

requirements.txt

Of course, it's also possible (as with npm or composer) to install all dependencies in one fell swoop (probably a best practice).

Simply list the dependencies in a file (for example requirements or requirements.txt) and install from it:

    pip install --user -r requirements.txt

[Note the --user option, which may be omitted for a global install, and the -r option, which specifies the input file.]
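
If both Python 2 and Python 3 are installed, it can be unclear which interpreter a bare pip belongs to. One way around that is to run pip as a module of the interpreter you actually intend to use. The sketch below builds the same command as above from Python; the helper names pip_install_command and install_requirements are hypothetical, invented for this example.

```python
import subprocess
import sys


def pip_install_command(requirements_path="requirements.txt", user=True):
    """Build the pip command line as a list, tied to the running interpreter."""
    cmd = [sys.executable, "-m", "pip", "install"]
    if user:
        # Single-user install, as recommended above.
        cmd.append("--user")
    cmd += ["-r", requirements_path]
    return cmd


def install_requirements(requirements_path="requirements.txt", user=True):
    """Run the install; raises CalledProcessError if pip exits with an error."""
    subprocess.check_call(pip_install_command(requirements_path, user))


# Show the command without running it.
print(" ".join(pip_install_command()))
```

Calling install_requirements() performs the actual install; nothing is installed just by defining or printing the command.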

TODO

  • Finish course
  • Update Quick Hit links to make them easier to navigate
  • Update everything for the most recent (and secure) version of TensorFlow

Credits

Based upon:

    http://www.udacity.com/course/intro-to-machine-learning--ud120

You can find an interview with co-author Katie Malone here:

    http://www.se-radio.net/2017/03/se-radio-episode-286-katie-malone-intro-to-machine-learning/

Alternatives

The following look like interesting options too:

    http://web.stanford.edu/class/cs224n/

    http://openclassroom.stanford.edu/MainFolder/CoursePage.php?course=MachineLearning

Data Cleaning

A lot (let's say three quarters) of a data scientist's time is spent massaging data. This is a pretty important (let's say critically important) part of a data scientist's job, and it is not often discussed.

Garbage in, garbage out.

[Not to mention the (very expensive) computer time wasted.]

For a quick introduction to data cleaning with numpy and pandas, have a look at this great tutorial:

    http://realpython.com/python-data-cleaning-numpy-pandas/

You can see my stab at it here.

For a more complicated example, check out my ML with Missing Data repo.
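
As a small taste of what that massaging looks like, here is a minimal pandas sketch. The data is invented for illustration and is not from the tutorial or from either repo; it drops rows with a missing name, strips stray whitespace, converts the age column to numbers (anything unparseable becomes NaN) and removes duplicate rows.

```python
import pandas as pd

# Made-up messy input: padded names, a duplicate row, a bad age, a missing name.
raw = pd.DataFrame({
    "name": [" Alice", "Bob ", "Bob ", "Carol", None],
    "age": ["34", "41", "41", "n/a", "29"],
})

cleaned = (
    raw.dropna(subset=["name"])  # drop rows without a name
    .assign(
        name=lambda df: df["name"].str.strip(),  # trim whitespace
        age=lambda df: pd.to_numeric(df["age"], errors="coerce"),  # "n/a" -> NaN
    )
    .drop_duplicates()  # the two Bob rows are now identical
    .reset_index(drop=True)
)

print(cleaned)
```

Whether to drop, fill or flag the remaining NaN depends entirely on the dataset, which is a large part of why this step eats so much time.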

Quick Hits

For an easy (and quick) introduction to the various Python tools and ML concepts:

    http://www.youtube.com/playlist?list=PLOU2XLYxmsIIuiBfYad6rFYQU_jL2ryal

This series is from mid-2016 so there is a small amount of 'code rot', plus it seems to use Python 2 rather than Python 3, but even so it's a quick and fun way to get a brief overview of ML and the tools & techniques involved.

End to End

For a deeper dive into the Iris dataset, check out my ML with SciPy repo.

This project shows a full end-to-end workflow.
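
For a rough idea of the shape of such a workflow, here is a minimal sketch using scikit-learn's bundled copy of the Iris dataset. It is not the code from the repo mentioned above; the choice of a k-nearest-neighbours classifier, the 25% test split and the random seed are all assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 1. Load the data: 150 flowers, 4 measurements each, 3 species.
X, y = load_iris(return_X_y=True)

# 2. Hold back a quarter of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 3. Train a simple classifier on the training split only.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# 4. Evaluate on the data the model has never seen.
accuracy = accuracy_score(y_test, model.predict(X_test))
print("Test accuracy: {:.2f}".format(accuracy))
```

Load, split, train, evaluate: the same four steps apply whichever classifier is swapped in at step 3.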

Tools

There are a number of tools, such as Python, IPython, and Jupyter Notebooks.

One website that gets a lot of mentions is Anaconda:

    http://www.anaconda.com/download/