/ml-class

Code for the Inquiryum Machine Learning Fundamentals Course


CAP4770 Introduction to Data Mining

Spring 2023

essentials

| # | Resource |
|---|----------|
| 1 | The current version of the syllabus |
| 2 | Welcome video |
| 3 | What should you do the first week of the course |
| 4 | Instructor: Ron Zacharski, ron.zacharski@gmail.com, 575.680.4041 |
| 5 | Experience Point Sheet |
| 6 | The FIU Deep Learning Slack workspace |
| 7 | The Lab Submission Form |

Important: How to pass the course

As you will read in the details below, this class is a programming-intensive course where you work at your own pace. Historically, about ⅓ of the students get an A, ⅓ an F, and ⅓ between an A and an F. What separates the 'A' students from the 'F' ones is that the 'A' students keep a regular schedule and consistently submit their work. If they have a question or need help debugging, they message me on Slack. They are not necessarily the most proficient programmers, or the best at math. The attribute that best defines them is self-discipline.

Course Catalog Description

Data mining applications, data preparation, data reduction and various data mining techniques such as association, clustering, classification, anomaly detection.

Course Description

This course provides an introduction to practical machine learning tools for data mining with an emphasis on XGBoost and Deep Learning.

Prerequisites

Prerequisite Course: COP 3530 Data Structures

Corequisite Course: COP 4710 Database Management

Note: While very little material from either of these courses will be used in this course, these prerequisites give you a level of programming maturity that is required.

An asynchronous online class

This class is asynchronous meaning there is no mandatory real-time interaction. You will be working through the Inquiryum Machine Learning Fundamentals Course. You can watch the videos anytime you want. You can play them at a faster speed, you can rewatch them or pause them. You can work on the course material in 20 minute blocks throughout a day, or devote a large contiguous block of time once per week. When you need help you can use the FIU Deep Learning Slack workspace to get assistance from me or your classmates.

The advantage of this approach is that it gives you great flexibility in when you work on the material and for how long. And, as described below under mastery learning, it allows you to work at your own pace.

Instructor availability

Slack Office Hours: Tuesdays and Wednesdays 11am-2pm ET

I will be sitting at my laptop on the Slack channel Tuesdays and Wednesdays from 11am until 2pm ET. This means that if you message me, I will respond within 5 minutes unless I am helping another student. My next level of availability is Tuesdays and Wednesdays from 2pm to 4pm and Thursdays from 11am until 2pm ET. My average response time during those periods is 30 minutes. Feel free to message me outside of those times, but my response delay might be significant. I often turn off Slack notifications at midnight. There may be times from Friday through Sunday when I don't have cell or wifi coverage and will not be able to receive your message. There may also be other times when I don't have cell coverage; in those cases I will post a message on Slack beforehand. The reason is that, while I am based in Santa Fe, I often go off exploring the Southwest in my van and sometimes lose cell phone coverage. If your question is better addressed over Zoom, we can arrange a meeting time through Slack. I also encourage those in class to help others (see my honor code policy below).

The above hours may be subject to change if other times benefit more students. These changes will be announced in the Slack channel.

Course Objectives

Students will gain hands-on experience with the following algorithms and libraries, learning when and how to apply them to problems in data mining:

  • Numpy, Pandas, sklearn

  • entropy and decision trees

  • bagging and pasting

  • random forest

  • XGBoost

  • deep learning basics

  • Convolutional Neural Networks (CNN)

  • Clustering

  • Working with text

Expected Outcomes

Basic Machine Learning (ML) Techniques

Students should be able to

  • architect a scalable ML pipeline
  • run ML jobs on a GPU using Jupyter Notebooks in Colab
  • evaluate different ML models
  • determine the best ML algorithm to use for an application
  • reduce the dimensionality of a dataset
  • develop different linear models to solve classification problems
  • communicate effectively about ML applications (terminology)

XGBoost

Students should be able to

  • apply decision tree algorithms to create a classifier
  • use random forest techniques
  • combine a number of weak classifiers into a strong one by using boosting.
  • effectively use the XGBoost algorithm

Deep Learning

Students should be able to

  • build a simple deep learning system for image classification
  • build CNNs for computer vision
  • pre-process text datasets into a form usable for classification
  • build a CNN for text classification
  • adjust hyperparameters to improve performance

Labs and Projects

The majority of the effort in the course goes into labs and projects, which have different levels of expected knowledge and independence.

Labs

  • In the form of Jupyter Notebook tutorials which provide detailed explanations and sample executable code.
  • You are to:
    • write a small amount of code to complete the task
    • answer any non-coding questions the Notebook may ask.

Projects

  • Follows examples shown in the course videos and in the labs.
  • Builds off of concepts and skills you learned completing the labs.
  • Project definition provides
    • a dataset
    • a short problem description
  • You are to
    • design and create the machine learning algorithm used to solve the problem.
    • write the code in a Jupyter Notebook
    • test and evaluate your solution.
    • save your notebook to GitHub.

Mastery Learning

Traditional classes are based on time-based learning. You spend a specific amount of time on a topic and then you move on to the next topic. For example, in a traditional intro course on Python programming you might cover for loops in week 5, take a quiz on them, and then move on to Python dictionaries in week 6. Suppose you got a 75% on that quiz in week 5. That means that you did not learn 25% of the material. Then perhaps in week 10 you take a test on list comprehensions and get an 80% (you did not master 20% of the material). These gaps in your mastery start adding up, and eventually, either in some future class or on the job, you hit a wall because your current task requires skill in areas that you failed to master.

This class doesn't work like that.

In contrast to time-based learning, in mastery learning you stay on a topic until you master it, working at your own pace. This online class is based on this approach. As I mentioned, the lectures are a set of videos (mostly screencasts) that you can watch at any time. If the material is easy for you, you can speed up the videos and watch them at 1.5x speed. If you find the material challenging, you can rewatch the videos, google for more information, or interact with other learners on the Slack channel.

Obviously, the work-at-your-own-pace approach will collide with the end of the semester, and there will be some material that you will not cover. The course is designed so that the essential core information is presented first, to enable you to develop solid foundational skills with no gaps.

Mastery Learning Difficulties

This course is work-at-your-own-pace. Other courses you might be taking have fixed deadlines. So, for example, you might have a gnarly project for a programming class due this week and a big operating systems project due next week. It is likely that you will work on those projects, since they have immediate deadlines, and put off working on this course. It is human nature. Just block out a regular time each week to work on the course and you will do fine.

Starting on week 8, there is a limit of 3 submissions per week.
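
To see what this limit means in practice, here is a rough sketch that counts how many submission slots remain from a given week onward. It assumes the 16-week schedule shown later in this syllabus and that finals week (week 16) still accepts submissions. The function name and constants are illustrative only and are not part of any course tooling.

```python
FIRST_CAPPED_WEEK = 8   # the limit starts on week 8 (from the syllabus)
MAX_PER_WEEK = 3        # at most 3 submissions per week (from the syllabus)
LAST_WEEK = 16          # ASSUMPTION: submissions are still accepted in finals week


def capped_slots_remaining(current_week: int) -> int:
    """Count submission slots left in the capped weeks from current_week onward.

    Weeks before week 8 have no limit, so they are not counted here.
    """
    first_week = max(current_week, FIRST_CAPPED_WEEK)
    weeks_left = max(LAST_WEEK - first_week + 1, 0)
    return weeks_left * MAX_PER_WEEK


for week in (8, 12, 14, 16):
    print(f"week {week}: {capped_slots_remaining(week)} submission slots left")
```

Under these assumptions, someone who has submitted nothing by week 14 has only 9 slots left, which is fewer than the number of labs and projects in the course.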

The course material

| Order | Lesson |
|-------|--------|
| 1 | JumpStart |
| 2 | Labs |
| 3 | Projects |

Again, the class is work-at-your-own-pace, but I provide a suggested schedule below.

Week-by-Week

| Week | Date | Unit | Topics | Labs and projects |
|------|------|------|--------|-------------------|
| 1 | 9 Jan | Intro | Intro to class & Quickstart to ML | Quickstart lab |
| 2 | 16 Jan | basics | Numpy, Pandas | Numpy & Pandas labs |
| 3 | 23 Jan | basics | kNN, sklearn | sklearn lab |
| 4 | 30 Jan | basics | entropy and decision trees | decision tree lab |
| 5 | 6 Feb | basics | one-hot encoding, cross-validation, hyperparameters | working with data lab |
| 6 | 13 Feb | basics | Regression & Clustering | regression and clustering labs |
| 7 | 20 Feb | XGBoost | Intro to boosting, bagging & pasting | bagging and pasting lab |
| 8 | 27 Feb | XGBoost | random forest, patches, xgboost | XGBoost lab, First Project |
| 9 | 6 Mar | DNN | our first neural network - classifying images | a first look at deep learning lab |
| 10 | 13 Mar | DNN | Neural Network anatomy & classification | -- |
| 11 | 20 Mar | DNN | Introduction to Convolutional Neural Networks (CNN) | CNN lab |
| 12 | 27 Mar | DNN | project work | Projects 2 & 3 |
| 13 | 3 Apr | DNN | CNNs and text classification | NLP & Embeddings lab |
| 14 | 10 Apr | DNN | CNN and text classification cont'd | Amazon Reviews Project |
| 15 | 17 Apr | RL | Generative AI | GAN lab |
| 16 | 24 Apr | | FINALS WEEK | FINISH PROJECTS |

Deadlines will be announced in the Slack channel.

Required materials

Google Colab Cloud Account

While the free Colab account is the minimum requirement, for the last 6 weeks of the class it may be beneficial to subscribe to Google Colab Pro for $9.99/mo.

Laptop

Inquiryum’s Machine Learning Fundamentals Course

No purchases of books or equipment are required.

Slack

Slack is a work chat application that many tech companies use. We are going to be using Slack in a number of ways. First, all my announcements for the class will be in Slack. Second, if you have a particular programming question, you can ask it in a general channel and hopefully you will quickly get an answer or suggestion from either me or fellow learners.

Slack check-in

Twice per week one of our Slackbots will ask you three questions:

  1. What have you accomplished since the last class?
  2. What are you working on now?
  3. What is holding you back?

Failure to do the Slack check-in will result in the following deduction of points:

| Number of missed check-ins | Points deducted |
|----------------------------|-----------------|
| 1 | 0 |
| 2 | 10 |
| 3 | 25 |
| 4 | 100 |
| 5 | 250 |
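
The deduction schedule above can be read as a simple lookup. The sketch below is only an illustration and makes two assumptions that the syllabus does not state: that each row gives the total deduction for that many misses (not an amount added per miss), and that more than 5 misses is treated the same as 5. The function name is hypothetical.

```python
# Missed check-ins -> points deducted, copied from the table above.
# The 0-miss entry is implied rather than listed.
CHECKIN_DEDUCTIONS = {0: 0, 1: 0, 2: 10, 3: 25, 4: 100, 5: 250}


def checkin_deduction(missed: int) -> int:
    """Return the total points deducted for a number of missed check-ins."""
    if missed < 0:
        raise ValueError("missed must be non-negative")
    # ASSUMPTION: the table stops at 5, so larger counts reuse the 5-miss value.
    return CHECKIN_DEDUCTIONS[min(missed, 5)]


for missed in range(6):
    print(f"{missed} missed -> {checkin_deduction(missed)} points deducted")
```

The jump from 25 to 100 points at the fourth miss shows how steeply the penalty grows: the first miss is free, but repeated misses quickly cost more than several labs are worth.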

You will be responsible for logging into Slack on Tuesdays and Fridays to answer these questions. When you initially sign in to Slack make sure to join the scrum channel.

Sign up for Slack here.

Okay but how do I pass?

Grading is based on a method developed by Professor Lee Sheldon at Indiana University. It is based on obtaining experience points (XP). The number of XP determines what level you are at. You start the class at Level Zero and with 0 XP. The level you obtain at the end of the semester determines your final grade. Here is the chart:

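| Level | XP | Grade |
|-------|-----|-------|
| Zero | 0 | F |
| One | 550 | D |
| Two | 740 | C |
| Three | 800 | C+ |
| Four | 840 | B- |
| Five | 871 | B |
| Six | 914 | B+ |
| Seven | 950 | A- |
| Eight | 990 | A |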

Here are the ways of earning XP:

  • there will be around 15 labs. On average, each will be worth 30 XP

  • there are 4-5 machine learning projects. On average, each is worth 150 XP
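
As a rough illustration of how the chart and the XP sources fit together, the sketch below looks up a level and grade from an XP total and estimates the XP available from the averages quoted above. It assumes each XP value in the chart is the minimum needed to reach that level. The names `LEVELS` and `level_for` are hypothetical, and actual lab and project values may differ from the averages.

```python
from bisect import bisect_right

# (minimum XP, level, grade) rows copied from the chart above.
LEVELS = [
    (0, "Zero", "F"),
    (550, "One", "D"),
    (740, "Two", "C"),
    (800, "Three", "C+"),
    (840, "Four", "B-"),
    (871, "Five", "B"),
    (914, "Six", "B+"),
    (950, "Seven", "A-"),
    (990, "Eight", "A"),
]
THRESHOLDS = [minimum for minimum, _, _ in LEVELS]


def level_for(xp: int) -> tuple[str, str]:
    """Return (level, grade) for an XP total.

    ASSUMPTION: each chart value is the minimum XP required for that level.
    """
    index = bisect_right(THRESHOLDS, max(xp, 0)) - 1
    _, level, grade = LEVELS[index]
    return level, grade


# Estimated XP available, using the approximate averages from the syllabus.
labs_xp = 15 * 30                 # around 15 labs at about 30 XP each
low_total = labs_xp + 4 * 150     # with 4 projects
high_total = labs_xp + 5 * 150    # with 5 projects

print(f"Estimated XP available: {low_total} to {high_total}")
print("Labs alone:", level_for(labs_xp))
print("875 XP:", level_for(875))
```

With these averages, the labs alone total about 450 XP, which is below the 550 XP needed for Level One, so completing projects is necessary to pass.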

Accessibility Statement

The Office of Disability Resources has been designated by the college as the primary office to guide, counsel, and assist students with disabilities. If you receive services through the Office of Disability Resources and require accommodations for this class, make an appointment with me as soon as possible to discuss your approved accommodation needs. Bring your accommodation letter, along with a copy of our class syllabus with you to the appointment. I will hold any information you share with me in strictest confidence unless you give me permission to do otherwise.

If you have not made contact with the Office of Disability Resources and have reasonable accommodation needs (note-taking assistance, extended time for tests, etc.), I will be happy to refer you. The office will require appropriate documentation of disability.

Title IX Statement

Florida International University's faculty are committed to supporting students and upholding the University’s Policy on Sexual Harassment and Sexual Misconduct. Under Title IX and this Policy, discrimination based upon sex or gender is prohibited. If you experience an incident of sex- or gender-based discrimination, we encourage you to report it. While you may talk to me, understand that as a “Responsible Employee” of the University, I MUST report to FIU's Title IX Coordinator what you share. If you wish to speak to someone confidentially, please contact the confidential resources described on the [FIU Title IX webpage](https://dei.fiu.edu/crca/title-ix). They can connect you with support services and help you explore your options. You may also seek assistance from FIU’s Title IX Coordinator.

Honor Code Policy

The general policy for any computer science class is

  1. You must write all programs yourself (without help from others or from websites), unless specified. You are not to communicate to others in any way about your assignments. You are also not to get code for your projects from Google, StackOverflow, Chegg, YouTube, or any other website unless permitted in writing.

  2. Do not share your code with other students, either this semester, or in any future semester. Remember that giving unauthorized help violates the Honor Code just as much as receiving unauthorized help does.

  3. Do not post your code or class materials anywhere. You may not upload your solutions to any publicly-available website, post part of your solution on StackOverflow or any similar site, or post assignments/notes/etc from the course, even if they were instructor-authored materials.

  4. Explicitly cite any sources you use.

  5. Do not look at solutions from previous semesters. Professors evolve and reuse assignments over many years in order to perfect them. If someone does leave their code (or other materials) lying around from a previous offering of the course, you may not look at them when completing your own.

  6. Be prepared to explain anything you submit. Your instructor may, at any time, call you in to his/her office to explain any part of your program. You will be expected to convincingly walk him/her through your code, demonstrating your thought process behind it. If you cannot, this may be considered an Honor Code violation.

  7. When in doubt, ask your instructor what constitutes plagiarism. If you’re not sure whether you need to cite a source for a quotation in a paper, or list the URL of a website from which you got some code, ask. If you do not ask, and the instructor deems it to be unauthorized help, this may be considered an Honor Code violation.

From The University of Mary Washington Computer Science Department Honor Code Policy

The amendments to this general policy are as follows (the numbers correspond to the numbers in the policy):

  1. I am more flexible than the policy "you are not to communicate to others in any way about your assignments." My rule of thumb is: what would a responsible adult do on the job? If you had a deadline on the job at a startup and didn't know how to do something, the responsible thing wouldn't be to sit at your workstation getting more and more frustrated and depressed and missing the deadline. The responsible person would get whatever help was necessary to complete the task. On the other hand, a responsible person wouldn't let someone else do all the work and present it as their own. That would be a violation of this policy.
  2. Regarding "Remember that giving unauthorized help violates the Honor Code just as much as receiving unauthorized help does." Again, I refer to the 'responsible adult' mentioned above. I would like people to help each other and still do the work to learn the material. Sharing a complete assignment violates this point, but helping a person debug one cell of a notebook is fine.
  3. Sadly, this contradicts what you want to do in your professional life. In your professional life, you want to post solutions to things you figured out as a way of helping people in the community. In fact, in this class we are going to be using some material that people have posted. However, to prevent plagiarism, you will only post your material to a private GitHub repository. Sorry.
  4. You should acknowledge, in writing in your submission, the people who helped you. For example, "Ann Mulkern helped me with the code to divide the dataset into training and testing sets."
  5. All the rest of the conditions of the computer science policy hold as is.

Avatar names, pseudonyms, noms de plume

During the first week of class you will need to fill out the Avatar Form for your avatar name, pseudonym, whatever. This is the name that will appear on the Experience Point Google Spreadsheet that will be viewable by everyone in the class. If you wish to remain anonymous, don’t share your avatar name with anyone. To further protect the anonymity of those who wish to remain anonymous, the spreadsheet will also be populated by fictitious avatar names.