TuftsCOMP135_Spring2016

Repository of course materials for COMP135 at Tufts University, taught by Kyle Harrington.


Introduction to Machine Learning and Data Mining

COMP 135: Introduction to Machine Learning and Data Mining
Department of Computer Science
Tufts University
Spring 2016

Course Web Page (redirects to current page): https://github.com/kephale/TuftsCOMP135_Spring2016

Announcement(s):

If you have an issue running Assignment 2 and are using Python 3 (check the top right corner of your Jupyter notebook), then replace:

dataset = np.matrix( zip(points_x,points_y) )

with

dataset = np.matrix( list( zip(points_x,points_y) ) )

This arises because in Python 3 the zip() function returns a lazy iterator rather than a list, and np.matrix() cannot build a matrix from it. Wrapping the zip() call in list() forces the iterator to be evaluated into a concrete list of tuples, which numpy accepts.
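For a quick self-contained check, here is the fixed line in context (the coordinate values below are made up for illustration):

```python
import numpy as np

points_x = [0.0, 1.0, 2.0]
points_y = [3.0, 4.0, 5.0]

# Python 3: zip() returns a lazy iterator, so force it into a list
# before handing it to np.matrix, which needs a concrete sequence.
dataset = np.matrix(list(zip(points_x, points_y)))
print(dataset.shape)  # (3, 2): one row per (x, y) pair
```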

What is this course about?

Machine learning is the study of algorithmic methods for learning and prediction based upon data. Approaches range from extracting patterns from large collections of data, such as social media and scientific datasets, to online learning in real time for applications like active robots. ML is becoming increasingly widespread because of the growing availability of computational power and datasets, as well as recent advances in ML algorithms. It is now commonplace for ML to produce results that humans have not been able to achieve.

As this is an introductory course, we will focus on breadth across the field of ML, but the material will still require significant cognitive effort on your part. The ideal candidate for this course is an upper-level undergraduate or beginning graduate student comfortable with some mathematical techniques and with a solid grounding in programming. Mathematical background that will prove useful includes statistics, probability, calculus, and linear algebra. We will review some of the essential topics, and the only explicit requirements are previous coursework in COMP 15 and COMP/MATH 22 or 61, or consent of the instructor. COMP 160 is highly recommended.

Class Times:

Tu, Th 10:30AM - 11:45AM
Tisch Library, 304-Auditorium

Instructor:

Kyle Harrington kyle@eecs.tufts.edu
Office Hours: By appointment

Teaching Assistants:

Sepideh Sadeghi, sepideh.sadeghi@tufts.edu
Office Hours: Mon noon-1pm, Fri 10am-noon
Location for Office Hours: Halligan 121

Hao Cui, Hao.Cui@tufts.edu
Office Hours: Tue 4:30-5:30 pm, Thu 4:30-5:30 pm
Location for Office Hours: Halligan 121

Grading

  • Written homework assignments (20%)
  • Quizzes (20%)
  • In-class midterm exam (20%): March 17
  • Final project (40%)

Rules for late submissions:

All work must be turned in on the date specified. Unless there is a last-minute emergency, please notify Kyle Harrington of special circumstances at least two days in advance.

If you aren't done by the due date, then turn in what you have finished for partial credit.

Collaboration

On homework assignments and projects: discussion about problems and concepts is great, but each assignment must be completed by you and only you, and is expected to be unique. Code and writeups should be written by you alone. If you have collaborated (helping or being helped), just say so; there is no harm in saying so.

On quizzes and exams: no collaboration is allowed.

Failure to follow these guidelines may result in disciplinary action for all parties involved. For this and other issues concerning academic integrity please consult the booklet available from the office of the Dean of Student Affairs.

Tentative List of Topics

  • Supervised Learning basics: nearest neighbors, decision trees, linear classifiers, and simple Bayesian classifiers; feature processing and selection; avoiding over-fitting; experimental evaluation.
  • Unsupervised learning: clustering algorithms; generative probabilistic models; the EM algorithm; association rules.
  • Theory: basic PAC analysis for classification.
  • More supervised learning: neural networks; backpropagation; dual perceptron; kernel methods; support vector machines.
  • Additional topics selected from: active learning; aggregation methods (boosting and bagging); time series models (HMM); reinforcement learning

Reference Material

We will use a mixture of primary research materials, portions of texts, and online sources. Required reading material will be listed as such. The following is a list of recommended reference material.

  • (We will often use this one) Machine Learning. Tom M. Mitchell, McGraw-Hill, 1997.
  • Introduction to Machine Learning. Ethem Alpaydin, 2010.
  • An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. N. Cristianini and J. Shawe-Taylor, 2000.
  • Data Mining: Practical Machine Learning Tools and Techniques. Ian H. Witten and Eibe Frank, 2005.
  • Machine Learning: The Art and Science of Algorithms that Make Sense of Data. Peter Flach, 2012.
  • Pattern Classification. R. Duda, P. Hart, and D. Stork, 2001.
  • Artificial Intelligence: A Modern Approach. Stuart Russell and Peter Norvig, 2010.
  • Principles of Data Mining. D. Hand, H. Mannila, and P. Smyth, 2001.
  • Reinforcement Learning: An Introduction. R. Sutton and A. Barto, 1998.

Roni Khardon's Version of COMP-135.

Programming and Software

Weka is a great machine learning package that has been around for a while. It is quite extensible, and we will be using it for some assignments. You can run weka.jar on the CS department servers from the command line. If you have trouble, there is excellent documentation on the Weka wiki.
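If you would rather script Weka runs than click through the GUI, one option is to invoke the jar from Python. The sketch below is only a suggestion; the weka.jar and ARFF file paths are placeholders to adjust for your setup:

```python
import subprocess

# Train and cross-validate Weka's J48 decision tree learner.
# "weka.jar" and "mydata.arff" are placeholder paths, not course files.
cmd = ["java", "-cp", "weka.jar",
       "weka.classifiers.trees.J48", "-t", "mydata.arff"]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout)  # the learned tree plus evaluation statistics
```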

There are some languages that are particularly useful in the context of machine learning, either because of their innate capabilities or because of libraries implemented in the language. When code examples are provided in class, they will likely be in one of these languages:

  • Python
  • Java
  • Julia
  • Matlab
  • Clojure
  • R

Jupyter is a notebook-based programming environment that supports many programming languages. We will use it for numerous in-class demos, and you may want to use it for your homework and final projects as well.

Slides

Slides are made with Reveal.JS, which has some perks that Powerpoint/Keynote do not. Reveal.JS slides embed into the web more elegantly than PDFs, and because they are built on HTML5/CSS, they support essentially all functionality available in a web browser.

When browsing the slides, notice that there is also an "overview" mode (press 'o' after loading a particular set of slides). This will tile the slides in an arrangement that is encoded within the presentation file, and should facilitate rapid browsing.

Schedule

| Date  | Lecture                          | Assignments and Notes                                       | Due Date |
|-------|----------------------------------|-------------------------------------------------------------|----------|
| 01/21 | Introduction to Machine Learning | Assignment 1 (See due date)                                 | 01/27    |
| 01/26 | Instance learning                | Assignment 2 (See due date)                                 | 02/03    |
| 01/28 | Decision trees pt 1              |                                                             |          |
| 02/02 | Decision trees pt 2              |                                                             |          |
| 02/04 | Naive Bayes                      |                                                             |          |
| 02/09 | Measuring ML success pt 1        | Chapter 5 - Mitchell; Final project proposal (See due date) | 03/02    |
| 02/11 | Measuring ML success pt 2        | Assignment 3 (See due date)                                 | 02/17    |
| 02/16 | Features                         |                                                             |          |
| 02/18 | No class, Monday Schedule        |                                                             |          |
| 02/23 | Features                         | Quiz 1                                                      |          |
| 02/25 | Linear threshold units pt 1      |                                                             |          |
| 03/01 | Linear threshold units pt 2      | Assignment 4 (See due date)                                 | 03/07    |
| 03/03 | Clustering pt 1                  |                                                             |          |
| 03/08 | Clustering pt 2                  |                                                             |          |
| 03/10 | Unsupervised learning            |                                                             |          |
| 03/15 | Association rules                | Chapter 10 - Mitchell; Final project (See due date)         | 04/25    |
| 03/17 | Midterm                          |                                                             |          |
| 03/22 | No class, Spring recess          |                                                             |          |
| 03/24 | No class, Spring recess          |                                                             |          |
| 03/29 | Computational learning theory    | Chapter 7 - Mitchell                                        |          |
| 03/31 | Perceptron                       |                                                             |          |
| 04/05 | Kernel-based methods             |                                                             |          |
| 04/07 | SVM                              | Assignment 5 (See due date)                                 | 04/13    |
| 04/12 | Active learning                  |                                                             |          |
| 04/14 | MDPs and Reinforcement Learning  |                                                             |          |
| 04/19 | Reinforcement learning pt 2      | Quiz 2                                                      |          |
| 04/21 | Aggregation methods              |                                                             |          |
| 04/26 | Project presentations            |                                                             |          |
| 04/28 | Project presentations            |                                                             |          |

Assignments, Quizzes, and Exams

Assignment 1

Note that it is possible to call Weka from the command line (e.g., on the homework server).

Submission of assignment 1

Write a one-paragraph description of what you can find.

  • Open "Visualize" and investigate how pairs of attributes relate to each other.
  • What types of clusters can you find? (Try "Cluster"/"Choose"/"SimpleKMeans" and test with different "numClusters".)
  • If you're feeling adventurous, then try to build a classifier ("Classify"/"Choose"/"weka.classifiers.trees.J48") and choose a nominal attribute to classify over, like "location_name". In the case of "location_name", before building the classifier use "Preprocess" to remove all "location" attributes except "location_name". You will want to use the abbreviated dataset for this.

Assignment 2

This assignment is optional; completing it earns a 10% bonus on a quiz.

Git is the current standard for code sharing and collaborative coding. This course is run off of GitHub, using git to control and track the history of changes. For this assignment, clone this repository, open Lecture02/notebooks/instance_based_learning.ipynb, complete the assignment by adding new cells to the notebook, and submit a pull request on GitHub. The new cells should implement kNN with an exhaustive search. The current version uses a KD-tree to obtain the nearest neighbors; the line of code that you should replace with your exhaustive-search implementation is: query_result = kdtree.query( [0.5, 0.5], k=10 )
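To give a sense of what is being asked, here is a rough sketch of exhaustive-search kNN (not the official solution; the function and variable names are illustrative). It mirrors the (distances, indices) return convention of scipy's KDTree.query so it can slot into the notebook:

```python
import numpy as np

def knn_exhaustive(points, query, k):
    """Brute-force kNN: compute the distance from the query point to
    every data point, then keep the k smallest."""
    points = np.asarray(points, dtype=float)  # (n, d) array of data points
    query = np.asarray(query, dtype=float)    # length-d query point
    dists = np.sqrt(((points - query) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]           # indices of the k closest points
    return dists[nearest], nearest            # same ordering as KDTree.query

# Illustrative usage, replacing kdtree.query([0.5, 0.5], k=10):
# query_result = knn_exhaustive(dataset, [0.5, 0.5], k=10)
```

Unlike the KD-tree, this scans every point on every query, which is exactly the classic linear-time behavior the assignment asks you to implement.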

  • For help getting going with git and GitHub, check out the GitHub Guides
  • Set up Jupyter on your computer (use Python for this assignment; it is the default language Jupyter installs)
  • See the slides from Lecture 2 for information on the k-nearest neighbors algorithm
  • We already have an existing Jupyter notebook, but it is missing a classic implementation of kNN with exhaustive search!
  • Some Python and Jupyter tutorials are linked in the Programming and Software section

Submission:

Additional instructions on submitting a pull request:

1. In order to make a pull request, you will need to "fork" the class repository (https://github.com/kephale/TuftsCOMP135_Spring2016/). On the GitHub page, at the top right, you will see a "Fork" button. If you click this and follow the instructions, it will create a copy of the repository under your username.
2. You will need to clone your fork with git clone (this will download your version of the class repository).
3. Make your changes to the file (this would involve opening Jupyter, editing the file, and resaving it). If you have already changed the file without using git, all you have to do is copy your updated version over the existing file in the fork that you just downloaded.
4. Add your changed files with git add, commit the changes with git commit, and push to the repository with git push.
5. Once you have done this, you can open up the webpage for your fork and click on the "New pull request" button. Follow the instructions to send a pull request to the course's repository.

If you have any issues with GitHub, then see the GitHub Guides (https://guides.github.com/activities/hello-world/).

Nearly every major corporation (Google, Facebook, Microsoft, Twitter, etc.) and university uses git to manage code for almost all of their open-source projects, if not GitHub specifically. This is especially true for the open-source machine learning code being released by these corporations and universities. When it comes time to work on final projects, especially with multiple people involved, git will turn out to be one of your most powerful tools.

Quiz 1

Quiz 1 will cover:

  • kNN
  • Decision trees
  • Naive Bayes
  • Measuring success of ML algorithms