
Repository of course materials for COMP 135 at Tufts University, taught by Kyle Harrington.


Introduction to Machine Learning and Data Mining

Comp 135: Introduction to Machine Learning and Data Mining
Department of Computer Science
Tufts University
Spring 2016

Course Web Page (redirects to current page): https://github.com/kephale/TuftsCOMP135_Spring2016

Announcement(s):

What is this course about?

Machine learning is the study of algorithmic methods for learning and prediction based upon data. Approaches range from extracting patterns from large collections of data, such as social media and scientific datasets, to online learning in real-time for applications like active robots. ML is becoming increasingly widespread because of the growing availability of computational power and datasets, as well as recent advances in ML algorithms. It is now commonplace for ML to produce results that humans have not been able to achieve.

As this is an introductory course, we will focus on breadth across the field of ML, but this will still require significant cognitive effort on your part. The ideal candidate for this course is an upper-level undergraduate or beginning graduate student comfortable with some mathematical techniques and with a solid grounding in programming. Mathematical topics that will prove useful include statistics, probability, calculus, and linear algebra. We will review some of the essential topics, and the only explicit requirements are previous coursework of (COMP 15) and (COMP/MATH 22 or 61), or consent of the instructor. COMP 160 is highly recommended.

Class Times:

Tu, Th 10:30AM - 11:45AM
Tisch Library, 304-Auditorium

Instructor:

Kyle Harrington kyle@eecs.tufts.edu
Office Hours: By appointment

Teaching Assistants:

Sepideh Sadeghi, sepideh.sadeghi@tufts.edu
Office Hours: Mon noon-1pm, Fri 10am-noon
Location for Office Hours: Halligan 121

Hao Cui, Hao.Cui@tufts.edu
Office Hours: Tue 4:30-5:30 pm, Thu 4:30-5:30 pm
Location for Office Hours: Halligan 121

Grading

  • Written homework assignments (20%)
  • Quizzes (20%)
  • In-class midterm exam (20%): March 17
  • Final project (40%)

Rules for late submissions:

All work must be turned in on the date specified. Unless there is a last-minute emergency, please notify Kyle Harrington of special circumstances at least two days in advance.

If you aren't done by the due date, then turn in what you have finished for partial credit.

Collaboration

On homework assignments and projects: discussion about problems and concepts is great. Each assignment must be completed by you and only you, and is expected to be unique. Code should be written by you; writeups should be written by you. If you have collaborated (helping or being helped), just say so; there is no harm in it.

On quizzes and exams: no collaboration is allowed.

Failure to follow these guidelines may result in disciplinary action for all parties involved. For this and other issues concerning academic integrity please consult the booklet available from the office of the Dean of Student Affairs.

Tentative List of Topics

  • Supervised Learning basics: nearest neighbors, decision trees, linear classifiers, and simple Bayesian classifiers; feature processing and selection; avoiding over-fitting; experimental evaluation.
  • Unsupervised learning: clustering algorithms; generative probabilistic models; the EM algorithm; association rules.
  • Theory: basic PAC analysis for classification.
  • More supervised learning: neural networks; backpropagation; dual perceptron; kernel methods; support vector machines.
  • Additional topics selected from: active learning; aggregation methods (boosting and bagging); time series models (HMM); reinforcement learning

Reference Material

We will use a mixture of primary research materials, portions of texts, and online sources. Required reading material will be listed as such. The following is a list of recommended reference material.

  • (We will often use this one) Machine Learning. Tom M. Mitchell, McGraw-Hill, 1997.
  • Introduction to Machine Learning. Ethem Alpaydin, 2010.
  • An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. N. Cristianini and J. Shawe-Taylor, 2000.
  • Data Mining: Practical Machine Learning Tools and Techniques. Ian H. Witten and Eibe Frank, 2005.
  • Machine Learning: The Art and Science of Algorithms that Make Sense of Data. Peter Flach, 2012.
  • Pattern Classification. R. Duda, P. Hart, and D. Stork, 2001.
  • Artificial Intelligence: A Modern Approach. Stuart Russell and Peter Norvig, 2010.
  • Principles of Data Mining. D. Hand, H. Mannila, and P. Smyth, 2001.
  • Reinforcement Learning: An Introduction. R. Sutton and A. Barto, 1998.

See also Roni Khardon's version of COMP 135.

Programming and Software

Weka is a great machine learning package that has been around for a while. It is quite extensible, and we will be using it for some assignments. You can use weka.jar on the CS department servers through the command line. If you have trouble, there is excellent documentation on the Weka wiki.
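
For example, a command-line invocation looks roughly like the following sketch (the weka.jar location and the mydata.arff dataset name are placeholders; adjust the paths for the CS department servers):

```
# Hypothetical invocation: trains the J48 decision-tree learner on the -t
# file and, by default, also reports 10-fold cross-validation results.
java -cp weka.jar weka.classifiers.trees.J48 -t mydata.arff
```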

There are some languages that are particularly useful in the context of machine learning, either because of their innate capabilities or because of libraries implemented in the language. When code examples are provided in class, they will likely be in one of these languages:

  • Python
  • Java
  • Julia
  • Matlab
  • Clojure
  • R

Jupyter is a notebook-based programming environment that supports many programming languages. We will use it for numerous in-class demos, and you may want to use it for your homework and final projects as well.

Slides

Slides are made with Reveal.JS. This has some perks that do not exist in PowerPoint/Keynote: the slides embed into the web more elegantly than PDFs, and, because they are built on HTML5/CSS, they support essentially all functionality that one can get in a web browser.

When browsing the slides, notice that there is also an "overview" mode (press 'o' after loading a particular set of slides). This will tile the slides in an arrangement that is encoded within the presentation file, and should facilitate rapid browsing.

Schedule

| Date  | Lecture                                                        | Assignments and Notes                                               | Due Date               |
|-------|----------------------------------------------------------------|---------------------------------------------------------------------|------------------------|
| 01/21 | Introduction to Machine Learning                               |                                                                     | 01/27                  |
| 01/26 | Instance learning                                              |                                                                     | 02/03                  |
| 01/28 | Decision trees pt 1                                            |                                                                     |                        |
| 02/02 | Decision trees pt 2                                            |                                                                     |                        |
| 02/04 | Naive Bayes                                                    |                                                                     |                        |
| 02/09 | Measuring ML success pt 1                                      | Chapter 5 - Mitchell; Final project proposal (see due date); Slides | 03/07                  |
| 02/11 | Measuring ML success pt 2                                      |                                                                     | 02/16                  |
| 02/16 | Features                                                       |                                                                     |                        |
| 02/18 | No class, Monday schedule                                      |                                                                     |                        |
| 02/23 | Features                                                       |                                                                     |                        |
| 02/25 | Linear threshold units pt 1                                    |                                                                     |                        |
| 03/01 | Linear threshold units pt 2                                    |                                                                     |                        |
| 03/08 | Clustering                                                     |                                                                     | 03/15                  |
| 03/10 | Reinforcement Learning and Games                               |                                                                     |                        |
| 03/15 | Unsupervised learning                                          |                                                                     |                        |
| 03/17 | Midterm                                                        |                                                                     | Before class on 03/31  |
| 03/22 | No class, Spring recess                                        |                                                                     |                        |
| 03/24 | No class, Spring recess                                        |                                                                     |                        |
| 03/29 | Distribution Approximation and EM                              |                                                                     | 05/05                  |
| 03/31 | Boosting (Alex Lenail)                                         |                                                                     |                        |
| 04/05 | Neural Networks                                                |                                                                     | 04/12                  |
| 04/07 | Support Vector Machines                                        |                                                                     |                        |
| 04/12 | Reinforcement Learning                                         |                                                                     |                        |
| 04/14 | Game Theory and Retrospective                                  |                                                                     | 04/26                  |
| 04/19 | Project presentations                                          | Quiz 2                                                              |                        |
| 04/21 | Project presentations & Random Forests                         |                                                                     |                        |
| 04/26 | Project presentations & Long Short-term Memory                 |                                                                     |                        |
| 04/28 | Project presentations & Automatic Optimization of ML Pipelines |                                                                     |                        |
| 05/05 | Program Discovery                                              | A Field Guide to Genetic Programming                                |                        |

Assignments, Quizzes, and Exams

Assignment1

Note that it is possible to call Weka from the command line (i.e. on the homework server)

Submission of assignment 1

Write a one-paragraph description of what you find.

  • Open "Visualize" and investigate how pairs of attributes relate to each other?
  • What types of clusters can you find (try "Cluster"/"Choose"/"SimpleKMeans" test with different "numClusters")
  • If you're feeling adventurous, then try to build a classifier ("Classify"/"Choose"/"weka.classifiers.trees.J48" and choose a nominal attribute to classify over, like "location_name". In the case of "location_name", before building the classifier use "Preprocess" and remove all "location" attributes except "location_name". You will want to use the abbreviated dataset for this.)

Assignment2

This assignment is not required; completing it earns a 10% bonus on a quiz.

Git is the current standard for code sharing and collaborative coding. This course is run off of GitHub, using git to control and track the history of changes. For this assignment, clone this repository, open Lecture02/notebooks/instance_based_learning.ipynb, complete the assignment by adding new cells to the notebook, and submit a pull request on GitHub. The new cells should implement an exhaustive-search version of kNN. The current version uses a KD-tree to obtain the nearest neighbors; the line of code that you should replace with your exhaustive-search implementation is: query_result = kdtree.query( [0.5, 0.5], k=10 )
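
For reference, an exhaustive-search kNN can be as simple as the following sketch (the `points` array and the `knn_exhaustive` name are hypothetical stand-ins, not the notebook's actual variables or required interface):

```python
import numpy as np

def knn_exhaustive(points, query, k):
    """Exhaustive-search kNN: measure the distance from `query` to every
    point, then keep the k smallest. Returns a (distances, indices) pair,
    mirroring the shape of scipy's kdtree.query result."""
    diffs = points - np.asarray(query)          # offset from query to each point
    dists = np.sqrt((diffs ** 2).sum(axis=1))   # Euclidean distance for each point
    idx = np.argsort(dists)[:k]                 # indices of the k nearest points
    return dists[idx], idx

# Example usage on stand-in random data (the notebook's real data differs):
points = np.random.rand(100, 2)
query_result = knn_exhaustive(points, [0.5, 0.5], k=10)
```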

  • For help getting going with git and GitHub, check out the GitHub guides
  • Setup Jupyter on your computer (use Python for this assignment. This is the default language Jupyter installs)
  • See Slides from Lecture 2 for information on the k-Nearest Neighbors algorithm
  • We already have an existing Jupyter Notebook, but it is missing a classic implementation of kNN with exhaustive search!
  • Some Python and Jupyter tutorials are linked in the programming and software section

Submission:

Additional instructions on submitting a pull request:

  1. In order to make a pull request, you will need to "fork" the class repository (https://github.com/kephale/TuftsCOMP135_Spring2016/). On the GitHub page, at the top right, you will see a "Fork" button. If you click this and then follow the instructions, it will create a copy of the repository under your username.
  2. You will need to clone your fork (this will download your version of the class repository).
  3. Make your changes to the file (this would involve opening Jupyter, editing the file, and resaving it). If you have already changed the file without using git, all you have to do is copy your updated version over the existing file in the fork that you just cloned.
  4. Add your changed files, commit the changes, and push to the repository (see the command sketch after this list).
  5. Once you have done this, you can open up the webpage for your fork and click on the "New pull request" button. Follow the instructions to send a pull request to the course's repository.
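
For steps 2-4, the commands look roughly like the following sketch (YOURUSERNAME is a placeholder for your GitHub username):

```
# Hypothetical command sequence; YOURUSERNAME is a placeholder.
git clone https://github.com/YOURUSERNAME/TuftsCOMP135_Spring2016.git
cd TuftsCOMP135_Spring2016
# ... edit Lecture02/notebooks/instance_based_learning.ipynb in Jupyter ...
git add Lecture02/notebooks/instance_based_learning.ipynb
git commit -m "Add exhaustive-search kNN"
git push origin master   # assumes the default master branch
```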

If you have any issues with GitHub, see the GitHub guides (https://guides.github.com/activities/hello-world/)

Nearly every major corporation (Google, Facebook, Microsoft, Twitter, etc.) and university uses git, if not GitHub specifically, to manage code for almost all of their open-source projects. This is especially true for the open-source machine learning code being released by these corporations and universities. When it comes time to work on final projects, especially with multiple people involved, git will turn out to be one of your most powerful tools.

FinalProjects

The final project for this course is an opportunity to apply what you've learned about machine learning to a real-world problem, to extend an existing machine learning algorithm, or to implement and explore a cutting-edge ML technique.

Collaboration is an option, but it should be discussed with Kyle in advance. A clear delineation of work should be proposed ahead of time. GitHub is strongly encouraged for collaborative projects, to make it easier to measure each person's contribution.

If you would like a suggestion for a project, contact Kyle two weeks before the proposal deadline to schedule a meeting. We will talk through problems/data that may be of interest to you.

Proposals

Due: March 7

Submission: Email a PDF of your proposal to Kyle

Write a 300-500 word abstract describing your proposed project. This should include 2-3 references of papers you expect to include in your final paper.

See an example project proposal here

Project

Due: May 5

Turn in an 8-12 page paper. A rough outline is:

  • Background on problem
  • Related work
  • Your method
  • Results
  • Conclusion and future work
  • References

There should be at least 10 references.

Resources

  • Ask faculty around the department if they have datasets that might be interesting for Machine Learning
  • Google Scholar - Search for articles published in "ICML", "NIPS", or "Machine Learning"; or search for keywords relevant to problems/algorithms that interest you
  • Huge list of datasets

Quiz1

Quiz 1 will cover:

  • kNN
  • Decision trees
  • Naive Bayes
  • Measuring success of ML algorithms

License