/dive-into-machine-learning

Dive into Machine Learning with Python Jupyter notebook and scikit-learn

Creative Commons Attribution 4.0 InternationalCC-BY-4.0

Dive into Machine Learning Creative Commons License Awesome

Hi there! This guide is for you:

I learned Python by hacking first, and getting serious later. I wanted to do this with Machine Learning. If this is your style, join me in getting a bit ahead of yourself.

Note: There are several fields within "Data" and Machine Learning is just one. It's good to know the context: What is the difference between Data Analytics, Data Analysis, Data Mining, Data Science, Machine Learning, and Big Data?

Get your feet wet!

I suggest you get your feet wet ASAP. You'll boost your confidence.

Tools you'll need

You can install Python 3 and all of these packages in a few clicks with the Anaconda Python distribution. Anaconda is popular in Data Science and Machine Learning communities.

If you're using Python 2.7, don't worry. You don't have to migrate to Python 3 just for this guide. Also, if you're using pip/virtualenv instead of Anaconda, that's alright too! Cf. conda vs. pip vs. virtualenv if you're curious.

Let's go!

Learn how to use IPython Notebook (5-10 minutes). (You can learn by screencast instead.)

Now, follow along with this brief exercise (10 minutes): An introduction to machine learning with scikit-learn. Do it in ipython or IPython Notebook. It'll really boost your confidence.

I'll wait.

What just happened?

You just classified some hand-written digits using scikit-learn. Neat huh?

scikit-learn is the go-to library for machine learning in Python. Some recognizable logos use it, including Spotify and Evernote. Machine learning is hard. You'll be glad your tools are easy to work with.

I encourage you to look at the scikit-learn homepage and spend about 5 minutes looking over the names of the strategies (Classification, Regression, etc.), and their applications. Don't click through yet! Just get a glimpse of the vocabulary.

Dive in

A Visual Introduction to Machine Learning

Let's learn a bit more about Machine Learning, and a couple of common ideas and concerns. Read "A Visual Introduction to Machine Learning, Part 1" by Stephanie Yee and Tony Chu.

A Visual Introduction to Machine Learning, Part 1

It won't take long. It's a beautiful introduction ... Try not to drool too much!

A Few Useful Things to Know about Machine Learning

OK. Let's dive deeper.

Read "A Few Useful Things to Know about Machine Learning" by Prof. Pedro Domingos. It's densely packed with valuable information, but not opaque. The author understands that there's a lot of "black art" and folk wisdom, and they invite you in.

Take your time with this one. Take notes. Don't worry if you don't understand it all yet.

The whole paper is packed with value, but I want to call out two points:

  • Data alone is not enough. This is where science meets art in machine-learning. Quoting Domingos: "... the need for knowledge in learning should not be surprising. Machine learning is not magic; it can’t get something from nothing. What it does is get more from less. Programming, like all engineering, is a lot of work: we have to build everything from scratch. Learning is more like farming, which lets nature do most of the work. Farmers combine seeds with nutrients to grow crops. Learners combine knowledge with data to grow programs."
  • More data beats a cleverer algorithm. Listen up, programmers. We like cool tools. Resist the temptation to reinvent the wheel, or to over-engineer solutions. Your starting point is to Do the Simplest Thing that Could Possibly Work. Quoting Domingos: "Suppose you’ve constructed the best set of features you can, but the classifiers you’re getting are still not accurate enough. What can you do now? There are two main choices: design a better learning algorithm, or gather more data. [...] As a rule of thumb, a dumb algorithm with lots and lots of data beats a clever one with modest amounts of it. (After all, machine learning is all about letting data do the heavy lifting.)"

So knowledge and data are critical. Focus your efforts on those, before fussing about algorithms. In practice, this means that unless you have to increase complexity, you should continue to Do Simple Things; don't rush to neural networks just because they're cool. To improve your model, get more data and use your knowledge of the problem to manipulate the data. You should spend most of your time on these steps. Only optimize your choice of algorithms after you've got enough data, and you've processed it well.

What has the most impact in Machine Learning

(Chart inspired by a slide from Alex Pinto's talk, "Secure Because Math: A Deep-Dive on ML-Based Monitoring".)

Talking Machines

Subscribe to Talking Machines, a podcast about machine learning. It's great. It's a low-effort, high-yield way to learn more.

I suggest this listening order:

  • Start with the "Starting Simple" episode. It supports what we read from Domingos. Ryan Adams talks about starting simple, as discussed above. Adams also stresses the importance of feature engineering. Feature engineering is an exercise of the "knowledge" Domingos writes about.
  • Then, start over from the first episode

Play to learn

Pick one or two of these IPython Notebooks and play along.

There are more places to find great IPython Notebooks:

Immerse yourself

Pick one of the courses below and do the whole thing.

Recommended course

Prof. Andrew Ng's Machine Learning is a free online course I've seen recommended often. And emphatically.

It's helpful if you decide on a pet project to play around with, as you go, so you have a way to apply your knowledge. You could use one of these Awesome Public Datasets. And remember, IPython Notebook is your friend.

Also, you should grab an in-depth textbook to use as a reference. The two best options are Understanding Machine Learning and Elements of Statistical Learning. You'll see these recommended as reference textbooks. I favor UML, but here's context and comparison. Download both books, they're free.

Busy schedule? Read Ray Li's review of this course for some helpful tips.

Other courses

Here are some other free online courses I've seen recommended. (Machine Learning, Data Science, and related topics.)

Learn Pandas well

If you're focusing on Python, you should get more familiar with Pandas.

Cheat sheets

Bookmark these cheat sheets:

More topics

More Data Science materials

Not repeating the materials mentioned above, here are some more Data Science resources:

Bayesian Statistics and Machine Learning

From the "Bayesian Machine Learning" overview on Metacademy:

... Bayesian ideas have had a big impact in machine learning in the past 20 years or so because of the flexibility they provide in building structured models of real world phenomena. Algorithmic advances and increasing computational resources have made it possible to fit rich, highly structured models which were previously considered intractable.

You can learn more by studying one of the following resources. Both resources use Python, PyMC, and Jupyter Notebooks.

Questions, answers, chats

For now, the best StackExchange site is stats.stackexchange.com – machine-learning. (There's also datascience.stackexchange.com, but it's still in Beta.) And there's /r/machinelearning. There are also many relevant discussions on Quora, for example: What is the difference between Data Analytics, Data Analysis, Data Mining, Data Science, Machine Learning, and Big Data?

You should also join the Gitter channel for scikit-learn!

For help and community in meatspace, seek out meetups. Data Science Weekly's Big List of Data Science Resources may help you.

Assorted Opinions and Other Resources

The rest of the stuff that might not be structured enough for a course, but seems important to know.

Risks

"Machine learning systems automatically learn programs from data." Pedro Domingos, in "A Few Useful Things to Know about Machine Learning." The programs you generate will require maintenance. Like any way of creating programs faster, you can rack up technical debt.

Here is the abstract of Machine Learning: The High-Interest Credit Card of Technical Debt:

Machine learning offers a fantastically powerful toolkit for building complex systems quickly. This paper argues that it is dangerous to think of these quick wins as coming for free. Using the framework of technical debt, we note that it is remarkably easy to incur massive ongoing maintenance costs at the system level when applying machine learning. The goal of this paper is highlight several machine learning specific risk factors and design patterns to be avoided or refactored where possible. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, changes in the external world, and a variety of system-level anti-patterns.

If you're following this guide, you should read that paper. You can also listen to a podcast episode interviewing one of the authors of this paper.

A few more articles on the risks:

Welcome to the Danger Zone

So you are dabbling with Machine Learning. You've got Hacking Skills. Maybe you've got some "knowledge" in Domingos' sense (some "Substantive Expertise" or "Domain Knowledge"). This diagram is modified slightly from Drew Conway's "Data Science Venn Diagram." It isn't a perfect fit for us, but it may get the point across:

Drew Conway's Data Science Venn Diagram, modified slightly

Please don't sell yourself as a Machine Learning expert while you're still in the Danger Zone. Don't build bad products or publish junk science. (Also please don't be evil.) This guide can't tell you how you'll know you've "made it" into Machine Learning competence ... let alone expertise. It's hard to evaluate proficiency without schools or other institutions. This is a common problem for self-taught people.

Towards Expertise

You need practice. On Hacker News, user olympus commented to say you could use competitions to practice and evaluate yourself. Kaggle and ChaLearn are hubs for Machine Learning competitions. You can find some examples of code for popular Kaggle competitions here. For smaller exercises, try HackerRank.

You also need understanding. You should review what Kaggle competition winners say about their solutions, for example, the "No Free Hunch" blog. These might be over your head at first but once you're starting to understand and appreciate these, you know you're getting somewhere.

Competitions and challenges are one way to practice. You shouldn't limit yourself, though - and you should also understand that Machine Learning isn't all about Kaggle competitions.

Here's a complementary way to practice: do practice studies.

  1. Ask a question. Start your own study. The "most important thing in data science is the question" (Dr. Jeff T. Leek). So start with a question. Then, find real data. Analyze it. Then ...
  2. Communicate results. When you have a novel finding, reach out for peer review.
  3. Fix issues. Learn. Share what you learn.

And repeat. Re-phrasing this, it fits with the scientific method: formulate a question (or problem statement), create a hypothesis, gather data, analyze the data, and communicate results. (You should watch this video about the scientific method in data science, and/or read this article.)

How can you come up with interesting questions? Here's one way. Every Sunday, browse datasets and write down some questions. Also, sign up for Data is Plural, a newsletter of interesting datasets; look at these, datasets, and write down questions. Stay curious. When a question inspires you, start a study.

This advice, to do practice studies and learn from peer review, is based on a conversation with Dr. Randal S. Olson. Here's more advice from Olson, quoted with permission:

I think the best advice is to tell people to always present their methods clearly and to avoid over-interpreting their results. Part of being an expert is knowing that there's rarely a clear answer, especially when you're working with real data.

As you repeat this process, your practice studies will become more scientific, interesting, and focused. The most important part of this process is peer review.

Ask for Peer Review

Here are some communities where you can reach out for peer review:

Post to any of those, and ask for feedback. You'll get feedback. You'll learn a ton. As experts review your work you will learn a lot about the field. You'll also be practicing a crucial skill: accepting critical feedback.

When I read the feedback on my Pull Requests, first I repeat to myself, "I will not get defensive, I will not get defensive, I will not get defensive." You may want to do that before you read reviews of your Machine Learning work too.

An Anecdote About User Experience

If you create software for users, and you want to use machine learning to benefit your users, you must understand your users. I won't get into a whole user experience rant here, but in short, you must think about user experience.

I have a friend who worked at <Redacted> Music Streaming Service. This company used machine learning in their recommendation and radio services. He complained about the way the company scored the radio feature's performance. There was disagreement about what should be scored. They used a metric, "no song skips." But why? Sure that indicates the recommendation wasn't awful, what if you want to measure engagement? Other metrics could measure positive engagement: "favorites," shares, listening time, or whether the listener returns to the radio station later. Measuring "no skips" might work for the passive listener, but the engaged listener is different. Perhaps the engaged listener will skip 5 songs, but find 20 songs they love and come back to the service later.

My takeaway: user experience matters just as much as ever. You must understand which kind of user you're trying to benefit with your machine learning techniques. Write user stories. (Some readers may notice this anecdote is very unscientific. Well, user stories are sometimes unscientific. Yet still so important.)

Try to measure success for your users. You must be able to measure before you can optimize.

It helps to have knowledge of the domain (substantive expertise). You may also benefit from UX expertise. Work with UX experts if you can.

Machine Learning in Internet Security

There was a great BlackHat webcast on this topic, Secure Because Math: Understanding Machine Learning-Based Security Products. Slides are here, video recording is here. Equally relevant to InfoSec and AppSec.

Deep Learning

In early editions of this guide, there was no specific "Deep Learning" section. I omitted it intentionally. I think it is not effective for us to jump too far ahead. I also know that if you become an expert in traditional Machine Learning, you'll be capable of moving onto advanced subjects like Deep Learning, whether or not I've put that in this guide. We're just trying to get you started here!

Maybe this is a way to check your progress: ask yourself, does Deep Learning seem like magic? If so, take that as a sign that you aren't ready to work with it professionally. Let the fascination motivate you to learn more. I have read some argue you can learn Deep Learning in isolation; I have read others recommend it's best to master traditional Machine Learning first. Why not start with traditional Machine Learning, and develop your reasoning and intuition there? You'll only have an easier time learning Deep Learning after that. After all of it, you'll able to tackle all sorts of interesting problems.

In any case, when you decide you're ready to dive into Deep Learning, here are some helpful resources.

"Big" Data?

Scaling data analysis is a familiar problem now, and there's no shortage of ways to address it. Beware needless hype and companies selling you flashy, proprietary solutions. You can do it all with open-source tools. Even if "buy" instead of "build," you may want to buy from vendors who use known good stacks. No news here.

Here are some open-source tools to reach for:

And here are some things to read and listen to:

Finding Open-Source Libraries

Alternative ways to "Dive into Machine Learning"

Here are some other guides to Machine Learning. They can be alternatives or complements to this guide.