Hi there! You might find this guide helpful if:
- You know Python or you're learning it 🐍
- You're new to Machine Learning
- You care about the ethics of ML
- You learn by doing
For some great alternatives, jump to the end or check out Nam Vu's guide, Machine Learning for Software Engineers.
Of course, there is no easy path to expertise. Also, I'm not an expert! I just want to connect you with some great resources from experts. Applications of ML are all around us. I think it's in the public interest for more people to learn more about ML, especially hands-on, and there are many different ways to learn.
Whatever motivates you to dive into machine learning, if you know a bit of Python, these days you can get hands-on with a machine learning "Hello World!" in minutes.
- Python. Python 3 is the best option.
- Jupyter Notebook. (Formerly known as IPython Notebook.)
- Some scientific computing packages:
  - numpy
  - pandas
  - scikit-learn
  - matplotlib
You can install Python 3 and all of these packages in a few clicks with the Anaconda Python distribution. Anaconda is popular in Data Science and Machine Learning communities. (Use whichever tool works for you. If you're unsure or need more context about using conda/virtualenv/poetry/pipenv, here's a very helpful guide)
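Once everything is installed, a quick sanity check can save some confusion later. This is a minimal sketch that uses only the standard library to ask whether the packages listed above are visible to the current interpreter. The helper name `missing_packages` is made up for this example, and note that scikit-learn is imported under the name `sklearn`.

```python
# Check which of the packages listed above this interpreter can find.
from importlib.util import find_spec

# Import names, not install names: scikit-learn is imported as "sklearn".
REQUIRED = ["numpy", "pandas", "sklearn", "matplotlib"]


def missing_packages(names=REQUIRED):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Not installed yet:", ", ".join(missing))
    else:
        print("All set: every package was found.")
```

If something shows up as missing, install it with whichever tool you chose (conda, pip, and so on) and run the check again.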
Some options you can use from your browser:
- Binder is Jupyter Notebook's official choice to try JupyterLab
- Deepnote allows for real-time collaboration
- Google Colab provides "free" GPUs
For other options, see:
- markusschanta/awesome-jupyter, "Hosted Notebook Solutions"
- ml-tooling/best-of-jupyter, "Notebook Environments"
Learn how to use Jupyter Notebook (5-10 minutes). (You can learn by screencast instead.)
Now, follow along with this brief exercise: An introduction to machine learning with scikit-learn. Code along in `ipython` or a Jupyter Notebook, executing each step as you go.
You just classified some hand-written digits using scikit-learn. Neat huh?
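If you'd like a compact recap of that exercise, here is a minimal sketch. It assumes scikit-learn is installed. The train/test split, the `gamma` and `C` values, and the function name `classify_digits` are illustrative choices for this sketch, not necessarily what the tutorial does step by step.

```python
# Recap sketch: classify scikit-learn's bundled hand-written digits with an SVM.
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split


def classify_digits(seed=0):
    """Train an SVM on part of the digits data; return accuracy on the rest."""
    digits = datasets.load_digits()  # 8x8 images, flattened to 64 features
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, test_size=0.25, random_state=seed
    )
    clf = svm.SVC(gamma=0.001, C=100.0)  # illustrative hyperparameters
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)  # fraction of held-out digits classified correctly


if __name__ == "__main__":
    print(f"Held-out accuracy: {classify_digits():.3f}")
```

Holding out some data that the model never saw during training is the important habit here; the particular classifier matters less at this stage.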
Let's learn a bit more about Machine Learning, and a couple of common ideas and concerns. Read "A Visual Introduction to Machine Learning, Part 1" by Stephanie Yee and Tony Chu.
It won't take long. It's a beautiful introduction ... Try not to drool too much!
OK. Let's dive deeper.
Read "A Few Useful Things to Know about Machine Learning" by Prof. Pedro Domingos. It's densely packed with valuable information, but not opaque.
Here are two excerpts you may find interesting:
- Data alone is not enough. This is where science meets art in machine-learning. Quoting Domingos: "... the need for knowledge in learning should not be surprising. Machine learning is not magic; it can’t get something from nothing. What it does is get more from less. Programming, like all engineering, is a lot of work: we have to build everything from scratch. Learning is more like farming, which lets nature do most of the work. Farmers combine seeds with nutrients to grow crops. Learners combine knowledge with data to grow programs."
- More data can beat a cleverer algorithm. Quoting Domingos: "Suppose you’ve constructed the best set of features you can, but the classifiers you’re getting are still not accurate enough. What can you do now? There are two main choices: design a better learning algorithm, or gather more data. [...] As a rule of thumb, a dumb algorithm with lots and lots of data beats a clever one with modest amounts of it. (After all, machine learning is all about letting data do the heavy lifting.)"
- What is the difference between Data Analytics, Data Analysis, Data Mining, Data Science, Machine Learning, and Big Data?
- Another handy term: "Data Engineering."
- "MLOps" overlaps with Data Eng, and there's an introductory MLOps section later in this guide.
Next, play along with one or more of these notebooks.
- Dr. Randal Olson's Example Machine Learning notebook: "let's pretend we're working for a startup that just got funded to create a smartphone app that automatically identifies species of flowers from pictures taken on the smartphone. We've been tasked by our head of data science to create a demo machine learning model that takes four measurements from the flowers (sepal length, sepal width, petal length, and petal width) and identifies the species based on those measurements alone."
- Various notebooks by topic:
- Notebooks in a series:
- ageron/handson-ml2 - "Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in Python." Scikit-Learn, Keras, TensorFlow 2.
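To get a feel for the flower-species task described above before opening a notebook, here is a minimal sketch on the classic Iris data that ships with scikit-learn. The decision tree and the 10-fold cross-validation are assumptions made for this example, and `iris_cv_scores` is a made-up helper name; the notebooks themselves go much further.

```python
# Sketch: predict iris species from four measurements, scored by cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def iris_cv_scores(folds=10, seed=0):
    """Return one accuracy score per cross-validation fold."""
    iris = load_iris()  # sepal length/width and petal length/width, in cm
    model = DecisionTreeClassifier(random_state=seed)
    return cross_val_score(model, iris.data, iris.target, cv=folds)


if __name__ == "__main__":
    scores = iris_cv_scores()
    print(f"Mean accuracy over {len(scores)} folds: {scores.mean():.3f}")
```

Looking at the spread of the per-fold scores, not just the mean, gives a rough sense of how much the result depends on which samples were held out.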
Find more great Jupyter Notebooks when you're ready:
- Jupyter's official Gallery of Interesting Jupyter Notebooks: Statistics, Machine Learning and Data Science (permalink)
Pick one of the courses below and start on your way.
Prof. Andrew Ng's Machine Learning is a popular and esteemed free online course. I've seen it recommended often. And emphatically.
It's recommended to grab a textbook to use as an in-depth reference. The two I saw recommended most often were Understanding Machine Learning and Elements of Statistical Learning. You only need to use one of the two options as your main reference; here's some context/comparison to help you pick which one is right for you.
You might like to have a pet project to play with, on the side. When you are ready for that, you could explore one of these: Awesome Public Datasets, paperswithcode.com/datasets, datasetlist.com
- Study tips for Prof. Andrew Ng's course, by Ray Li
- If you're wondering, Is it still a relevant course? or trying to figure out if it fits for you personally, check out these reviews:
It's hard to make time available every week. So, you can try to study more effectively within the time you have available. Here are some ways to do that:
- "Learning How to Learn" by Barbara Oakley, a free video course on Coursera.
- Prefer book/audiobook? These are great options:
- Barbara Oakley's book A Mind for Numbers: How to Excel at Math and Science (reviews) — "We all have what it takes to excel in areas that don't seem to come naturally to us at first"
- Make It Stick: the Science of Successful Learning (reviews)
I am not a machine learning expert. I'm just a software developer and these resources/tips were useful to me as I learned some ML on the side.
- Data science courses as Jupyter Notebooks:
  - `microsoft/Data-Science-For-Beginners` (added in 2021): "10-week, 20-lesson curriculum all about Data Science. Each lesson includes pre-lesson and post-lesson quizzes, written instructions to complete the lesson, a solution, and an assignment. Our project-based pedagogy allows you to learn while building, a proven way for new skills to 'stick'."
  - See also `microsoft/ML-For-Beginners`
More free online courses I've seen recommended. (Machine Learning, Data Science, and related topics.)
- Coursera's Data Science Specialization
- Prof. Pedro Domingos's introductory video series. Prof. Pedro Domingos wrote the paper "A Few Useful Things to Know About Machine Learning", which you may remember from earlier in the guide.
- `ossu/data-science` (see also `ossu/computer-science`)
- Stanford CS229: Machine Learning
- Harvard CS109: Data Science
- Advanced Statistical Computing (Vanderbilt BIOS8366). Interactive.
- Kevin Markham's video series, Intro to Machine Learning with scikit-learn, starts with what we've already covered, then continues on at a comfortable pace.
- UC Berkeley's Data 8: The Foundations of Data Science course and the textbook Computational and Inferential Thinking teach critical concepts in Data Science.
- Prof. Mark A. Girolami's Machine Learning Module (GitHub Mirror). "Good for people with a strong mathematics background."
- An epic Quora thread: How can I become a data scientist?
- There are more alternatives linked at the bottom of this guide
Start with the support forums and chats related to the course(s) you're taking.
Check out datascience.stackexchange.com and stats.stackexchange.com, for example the machine-learning tag. There are some subreddits, like /r/LearnMachineLearning and /r/MachineLearning.
Don't forget about meetups. Also, nowadays there are many active and helpful online communities around the ML ecosystem. Look for chat invitations on project pages and so on.
You'll want to get more familiar with Pandas.
- Essential: Things in Pandas I Wish I'd Had Known Earlier (as a Jupyter Notebook)
- Essential: 10 Minutes to Pandas
- Another helpful tutorial: Real World Data Cleanup with Python and Pandas
- Video series from Data School, about Pandas. "Reference guide to 30 common pandas tasks (plus 6 hours of supporting video)."
- Here are some docs I found especially helpful as I continued learning:
- Bookmarks for scaling `pandas` and alternatives
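To give a taste of what those tutorials cover, here is a minimal sketch of two everyday pandas moves: filling in a missing value and summarizing by group. The table is hypothetical toy data invented for this example, and it assumes pandas is installed.

```python
# Sketch: fill a missing measurement, then summarize by group with pandas.
import pandas as pd

# Hypothetical toy data, made up for illustration.
df = pd.DataFrame(
    {
        "species": ["setosa", "setosa", "virginica", "virginica"],
        "petal_length_cm": [1.4, None, 5.8, 6.1],
    }
)

# Replace each missing value with the median of its own species.
df["petal_length_cm"] = df.groupby("species")["petal_length_cm"].transform(
    lambda s: s.fillna(s.median())
)

# Average petal length per species.
summary = df.groupby("species")["petal_length_cm"].mean()
print(summary)
```

Filling with a per-group median is only one of many reasonable choices; the resources above discuss when dropping or flagging missing data is the better call.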
These debugging tools can be used inside (or outside) a Jupyter notebook:
There are many more tools than that, but those might get you started, or might be especially useful while you're learning. Beyond learning, troubleshooting is more than just logs or debuggers, of course... there are also some MLOps links, later in this guide.
Some good cheat sheets I've come across. (Please submit a Pull Request to add other useful cheat sheets.)
- scikit-learn algorithm cheat sheet
- `FavioVazquez/ds-cheatsheets`
- Statistics
  - `wzchen/probability-cheatsheet`: "This cheatsheet is a 10-page reference in probability that covers a semester's worth of introductory probability. The cheatsheet is based off of Harvard's introductory probability course, Stat 110. It is co-authored by former Stat 110 Teaching Fellow William Chen and Stat 110 Professor Joe Blitzstein."
  - Probabilities and statistics refresher cheat sheet from Stanford CS 229
- Stanford CS 229 cheat sheets, available on the web and as PDFs
"Machine learning systems automatically learn programs from data." Pedro Domingos, in "A Few Useful Things to Know about Machine Learning." The programs you generate will require maintenance. Like any way of creating programs faster, you can rack up technical debt.
Here is the abstract of Machine Learning: The High-Interest Credit Card of Technical Debt:
Machine learning offers a fantastically powerful toolkit for building complex systems quickly. This paper argues that it is dangerous to think of these quick wins as coming for free. Using the framework of technical debt, we note that it is remarkably easy to incur massive ongoing maintenance costs at the system level when applying machine learning. The goal of this paper is highlight several machine learning specific risk factors and design patterns to be avoided or refactored where possible. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, changes in the external world, and a variety of system-level anti-patterns.
If you're reading this guide, you should read that paper. You can also listen to a podcast episode interviewing one of the authors of this paper.
- Awesome Production Machine Learning, "a curated list of awesome open source libraries to deploy, monitor, version and scale your machine learning." It includes a section about privacy-preserving ML, by the way!
- "Rules of Machine Learning: Best Practices for [Reliable] ML Engineering," by Martin Zinkevich, regarding ML engineering practices.
- The High Cost of Maintaining Machine Learning Systems
- Overfitting vs. Underfitting: A Conceptual Explanation
- 11 Clever Methods of Overfitting and How to Avoid Them
- "So, you want to build an ethical algorithm?" An interactive tool to prompt discussions (source)
That's not a comprehensive list, of course! They are just some gateways and starting-points. Know some other resources? Please share them, pull requests are welcome!
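The overfitting and underfitting links above are easier to digest with a toy experiment in hand. This is a minimal sketch, assuming only NumPy: it fits polynomials of increasing degree to noisy samples of a sine curve and compares error on the training points with error on fresh points. The curve, noise level, sample sizes, and degrees are all arbitrary choices made for illustration.

```python
# Sketch: compare training error and held-out error as model flexibility grows.
import numpy as np


def true_curve(x):
    """The underlying signal that the noisy samples are drawn around."""
    return np.sin(2 * np.pi * x)


def train_and_test_error(degree, seed=0, noise=0.2):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, 15)   # few points to learn from
    x_test = np.linspace(0.0, 1.0, 200)   # fresh points to evaluate on
    y_train = true_curve(x_train) + rng.normal(0.0, noise, x_train.size)
    y_test = true_curve(x_test) + rng.normal(0.0, noise, x_test.size)

    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares fit

    def mse(x, y):
        return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

    return mse(x_train, y_train), mse(x_test, y_test)


if __name__ == "__main__":
    for degree in (1, 3, 9):
        train_mse, test_mse = train_and_test_error(degree)
        print(f"degree={degree}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

Training error can only go down as the degree grows, so on its own it says little. The gap between training error and held-out error is the thing to watch.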
What are some ways to practice?
One way: competitions and challenges
You need practice. On Hacker News, user olympus commented to say you could use competitions to practice and evaluate yourself. Kaggle and ChaLearn are hubs for Machine Learning competitions. (You can find more competitions here or here.)
You also need understanding. You should review what Kaggle competition winners say about their solutions, for example, the "No Free Hunch" blog. These might be over your head at first but once you're starting to understand and appreciate these, you know you're getting somewhere.
Competitions and challenges are just one way to practice! Machine Learning isn't just about Kaggle competitions.
Another way: try doing some practice studies
Here's a complementary way to practice: do practice studies.
- Ask a question. Start exploring some data. The "most important thing in data science is the question" (Dr. Jeff T. Leek). So start with a question. Then, find real data. Analyze it. Then ...
- Communicate results. When you think you have a novel finding, ask for review. When you're still learning, ask in informal communities (some are linked below).
- Learn from feedback. Consider learning in public; it works great for some folks. (Don't pressure yourself yet though! Everybody is different, and it's good to know your learning style.)
How can you come up with interesting questions? Here's one way. Pick a day each week to look for public datasets and write down some questions that come to mind. Also, sign up for Data is Plural, a newsletter of interesting datasets. When a question inspires you, try exploring it with the skills you're learning.
This advice, to do practice studies and learn from review, is based on a conversation with Dr. Randal S. Olson. Here's more advice from Olson, quoted with permission:
I think the best advice is to tell people to always present their methods clearly and to avoid over-interpreting their results. Part of being an expert is knowing that there's rarely a clear answer, especially when you're working with real data.
As you repeat this process, your practice studies will become more scientific, interesting, and focused. Also, here's a video about the scientific method in data science.
More machine learning career-related links
- "Advice on building a machine learning career and reading research papers by Prof. Andrew Ng"
- Some links for finding/following interesting papers/code:
- Papers With Code is a popular site to follow, and it can lead you to other resources. github.com/paperswithcode
- MIT: Papers + Code — "Peer-review is the lifeblood of scientific validation and a guardrail against runaway hype in AI. Our commitment to publishing in the top venues reflects our grounding in what is real, reproducible, and truly innovative."
- papers.labml.ai/papers/weekly, monthly
- Pull requests welcome!
- /r/LearnMachineLearning
- /r/MachineLearning
- /r/DataIsBeautiful
- /r/DataScience
- Cross-Validated: stats.stackexchange.com
- `ossu/data-science` has a Discord server and newsletter
OpenReview.net "aims to promote openness in scientific communication, particularly the peer review process."
- Open Peer Review: We provide a configurable platform for peer review that generalizes over many subtle gradations of openness, allowing conference organizers, journals, and other "reviewing entities" to configure the specific policy of their choice. We intend to act as a testbed for different policies, to help scientific communities experiment with open scholarship while addressing legitimate concerns regarding confidentiality, attribution, and bias.
- Open Publishing: Track submissions, coordinate the efforts of editors, reviewers and authors, and host… Sharded and distributed for speed and reliability.
- Open Access: Free access to papers for all, free paper submissions. No fees.
- Open Discussion: Hosting of accepted papers, with their reviews, comments. Continued discussion forum associated with the paper post acceptance. Publication venue chairs/editors can control structure of review/comment forms, read/write access, and its timing.
- Open Directory: Collection of people, with conflict-of-interest information, including institutions and relations, such as co-authors, co-PIs, co-workers, advisors/advisees, and family connections.
- Open Recommendations: Models of scientific topics and expertise. Directory of people includes scientific expertise. Reviewer-paper matching for conferences with thousands of submissions, incorporating expertise, bidding, constraints, and reviewer balancing of various sorts. Paper recommendation to users.
- Open API: We provide a simple REST API [...]
- Open Source: We are committed to open source. Many parts of OpenReview are already in the OpenReview organization on GitHub. Some further releases are pending a professional security review of the codebase.
OpenReview.net is created by Andrew McCallum’s Information Extraction and Synthesis Laboratory in the College of Information and Computer Sciences at University of Massachusetts Amherst
OpenReview.net is built over an earlier version described in the paper Open Scholarship and Peer Review: a Time for Experimentation published in the ICML 2013 Peer Review Workshop.
OpenReview is a long-term project to advance science through improved peer review, with legal nonprofit status through Code for Science & Society. We gratefully acknowledge the support of the great diversity of OpenReview Sponsors––scientific peer review is sacrosanct, and should not be owned by any one sponsor.
Production, Deployment, MLOps
If you are learning about MLOps but find it overwhelming, these resources might help you get your bearings:
- MLOps Stack Template by Henrik Skogström
- Lessons on ML Platforms from Netflix, DoorDash, Spotify, and more by Ernest Chan in Towards Data Science
- MLOps Stack Canvas at ml-ops.org
Recommended awesomelists to save/star/watch:
- EthicalML/awesome-artificial-intelligence-guidelines
- EthicalML/awesome-production-machine-learning
- visenger/awesome-ml-model-governance
- visenger/awesome-MLOps
Take note: some experts warn us not to get too far ahead of ourselves, and encourage learning ML fundamentals before moving on to deep learning. That's paraphrasing from some of the linked coursework in this guide. For example, Prof. Andrew Ng encourages building foundations in ML before studying DL. Perhaps you're ready for that now, or perhaps you'd like to get started soon and learn some DL in parallel to your other ML learnings.
When you're ready to dive into Deep Learning, here are some helpful resources.
- Dive into Deep Learning - An interactive book about deep learning (view on GitHub)
- Quickstart:
- "Implemented with NumPy/MXNet, PyTorch, and TensorFlow"
- "Adopted at 200 universities from 50 countries"
- "The entire book is drafted in Jupyter notebooks, seamlessly integrating exposition figures, math, and interactive examples with self-contained code."
- "You can modify the code and tune hyperparameters to get instant feedback to accumulate practical experiences in deep learning."
- `explosion/thinc` is an interesting library that wraps PyTorch, TensorFlow and MXNet models.
  - "Concise functional-programming approach to model definition, using composition rather than inheritance."
  - "Integrated config system to describe trees of objects and hyperparameters."
- `fastai/fastbook` by Jeremy Howard and Sylvain Gugger: "an introduction to deep learning, fastai and PyTorch."
- Prof. Andrew Ng's courses on Deep Learning! There are five courses, as part of the Deep Learning Specialization on Coursera. These courses are part of his new venture, deeplearning.ai
  - Some course notes about it: ashishpatel26/Andrew-NG-Notes
- Deep Learning, a free book published by MIT Press. By Ian Goodfellow, Yoshua Bengio and Aaron Courville.
- A notable testimonial for it is here: "What are the best ways to pick up Deep Learning skills as an engineer?"
- paperswithcode.com — "The mission of Papers with Code is to create a free and open resource with Machine Learning papers, code, datasets, methods and evaluation tables."
- `labmlai/annotated_deep_learning_paper_implementations`: "Implementations/tutorials of deep learning papers with side-by-side notes." 50+ of them! Really nicely annotated and explained.
- Distill.pub publishes explorable explanations, definitely worth exploring and following!
- Replicate "makes it easy to share a running machine learning model"
- Easily try out deep learning models from your browser
- The demos link to papers/code on GitHub, if you want to dig in and see how something works
- The models run in containers built by `cog`, "containers for machine learning." It's an open-source tool for putting models into reproducible Docker containers.
Machine Learning can be powerful, but it is not magic.
Whenever you apply Machine Learning to solve a problem, you are going to be working in some specific problem domain. To get good results, you or your team will need "substantive expertise" (to re-use a phrase from earlier), which is related to "domain knowledge." Learn what you can, for yourself... But you should also collaborate with experts. You'll have better results if you collaborate with subject-matter experts and domain experts.
I couldn't say it better:
Machine learning won’t figure out what problems to solve. If you aren’t aligned with a human need, you’re just going to build a very powerful system to address a very small—or perhaps nonexistent—problem.
That quote is from "The UX of AI" by Josh Lovejoy. In other words, You Are Not The User. Suggested reading: Martin Zinkevich's "Rules of ML Engineering", Rule #23: "You are not a typical end user"
Here are some additional Data Science resources:
- Python Data Science Handbook, as Jupyter Notebooks
- Accessible data science book, no coding experience required: Data Smart by John Foreman
- Data Science Workflow: Overview and Challenges (read the article and also the comment by Joseph McCarthy)
- `r0f1/datascience`: "A curated list of awesome resources for practicing data science using Python, including not only libraries, but also links to tutorials, code snippets, blog posts and talks."
From the "Bayesian Machine Learning" overview on Metacademy:
... Bayesian ideas have had a big impact in machine learning in the past 20 years or so because of the flexibility they provide in building structured models of real world phenomena. Algorithmic advances and increasing computational resources have made it possible to fit rich, highly structured models which were previously considered intractable.
Here are some awesome resources for learning Bayesian methods.
- The free book Probabilistic Programming and Bayesian Methods for Hackers. Made with a "computation/understanding-first, mathematics-second point of view." Uses PyMC. It's available in print too!
- Like learning by playing? Me too. Try 19 Questions, "a machine learning game which asks you questions and guesses an object you are thinking about," and explains which Bayesian statistics techniques it's using!
- Time Series Forecasting with Bayesian Modeling by Michael Grogan, a 5-project series - paid but the first project is free.
- Bayesian Modelling in Python. Uses PyMC as well.
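Before opening those resources, it may help to see the core move of Bayesian updating in a few lines. This is a minimal sketch of the textbook Beta-Binomial update for a coin's bias, written in plain Python. It does not use PyMC, the counts are hypothetical, and the function names are made up for this example.

```python
# Sketch: Bayesian updating of a coin's bias with a Beta prior (conjugate update).


def update_beta(alpha, beta, successes, failures):
    """Return the posterior Beta parameters after observing the given counts."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Beta(1, 1) is a uniform prior: every bias is considered equally plausible.
prior = (1.0, 1.0)

# Hypothetical data: 7 heads and 3 tails.
posterior = update_beta(*prior, successes=7, failures=3)

print("Prior mean:", beta_mean(*prior))
print("Posterior parameters:", posterior)
print("Posterior mean:", round(beta_mean(*posterior), 3))
```

The closed-form update only works for special prior/likelihood pairs like this one. Libraries such as PyMC exist for the richer, structured models where no such shortcut is available.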
These next two links are non-sequiturs, not specifically related to ML. But since you're here, I have a hunch you might find them interesting too:
- Maggie Appleton's "A Brief History & Ethos of the Digital Garden"
- Shawn Wang's "Digital Garden Terms of Service"
Here are some other guides to learning Machine Learning.
- Example Machine Learning notebook, exercise, and guide by Dr. Randal S. Olson. Mentioned in Notebooks section as well, but it has a similar goal to this guide (introduce you, and show you where to go next). Rich "Further Reading" section.
- Courses by cloud vendors. These are usually high quality content but steer you heavily to use vendor-specific tools/services. I encourage you to "make the most" of the resources that they make freely available. To avoid getting locked into vendor specifics, just make sure you're learning from other resources as well.
  - `microsoft/ML-For-Beginners`
  - `microsoft/Data-Science-For-Beginners`
  - Machine Learning Crash Course from Google with TensorFlow APIs.
  - Amazon AWS: Amazon has opened up its internal training to the public and also offers certification.
- Machine Learning for Developers is good for people who are more familiar with Java or Scala than Python.
- ageron/handson-ml2 aka Hands-On Machine Learning 2nd Edition by Aurélien Géron
- rasbt/python-machine-learning-book-3rd-edition aka Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow 2 by Sebastian Raschka and Vahid Mirjalili
- `josephmisiti/awesome-machine-learning`, `svaksha/Pythonidae`
- Machine Learning for Software Engineers, by Nam Vu. In their words, it's a "top-down and results-first approach designed for software engineers." Definitely bookmark and use it, as well - it can answer lots of questions and connect you with great resources.