GWU_data_mining

Materials for GWU DNSC 6279

This course provides exposure to various data preprocessing, statistics, and machine learning techniques that can be used both to discover relationships in large data sets and to build predictive models.

Techniques covered will include basic and analytical data preprocessing, regression models, decision trees, neural networks, clustering, association analysis, and basic text mining.

Techniques will be presented in the context of data driven organizational decision making using statistical and machine learning approaches.

Course Schedule

Weeks  Topics
1      Section 00: Intro and History
2-3    Section 01: Basic Data Prep
3-4    Section 02: Analytical Data Prep
5-6    Section 03: Regression
7      Section 04: Decision Trees
8      Section 05: Neural Networks
9      Project proposal presentations

Additional reference material

Course Syllabus

Pre-requisite Courses

Stochastics for Analytics I, Statistics for Analytics, or equivalent (JUD/DAD), MSBA Program Candidacy or instructor approval.

Instructor

Mr. Patrick Hall

E-mail: jphall@gwu.edu

Twitter: @jpatrickhall

Office Hours: In departmental office space before lectures when business travel allows.

Recommended Textbooks

Introduction to Data Mining, by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar

An Introduction to Statistical Learning with Applications in R, Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani

Reading Assignments

The student is responsible for studying and understanding all assigned materials. If reading generates questions that are not discussed in class, the student has the responsibility of addressing the instructor privately or raising the issue in an appropriate digital medium.

Blackboard

Some materials for this class have personal or corporate copyrights or licenses that prevent them from being shared on GitHub. Those materials or other internal information will be shared with students via Blackboard.

Grading

The course grade will be based on homework assignments, quizzes, a final exam, and a team project. Each grading component is described in detail below.

Quizzes

There will be several in-class quizzes, typically one every week. They will be based on current and prior assigned readings and material covered in the class sessions. No make-up quizzes will be given. The lowest quiz grade will be dropped.

Quizzes are individual assignments.

Homework Assignments

You will be given several homework assignments during the semester. Homework assignments will typically require the use of software. A typical homework assignment will consist of a few problems with several parts. You may be given up to several weeks to complete the assignment. Late homework assignments may be rejected.

In preparing your homework assignments, please follow these guidelines:

  • Ensure any submitted computer program solutions are commented and runnable in a standard Python, R, or SAS environment.
  • Ensure any written solutions are typed or easily readable by anyone.
  • Ensure a clear logical flow and mark your answers.
  • Print or type your name(s) in the top right-hand corner of every page, or in a header, of any papers submitted.

Homework assignments may be completed in groups.

Final Exam

The final exam will be scheduled during finals week. Graduate final exams are scheduled by the university late in the semester, and the exam date will be announced at that time. No make-up final exams will be given.

The final exam is an individual assignment.

Project

The project is designed to serve as an exercise in applying one or more of the data mining techniques covered in the course to analyze real-life data sets. A primary objective is to understand the complexities that arise in mining large, real-life data sets that are often inconsistent, incomplete, and unclean. Students can use a variety of software tools to perform the analysis, including standard Python, R, or SAS packages. This is a semester-long project, and students have the option to work in two- or three-person teams. The deliverables include a formal project proposal (due mid-semester) and a final report (due at the end of the semester at the time of your final project presentation).

Students may select a current Kaggle contest or their MSBA practicum project as the project for this class.

Projects can be a group or individual assignment.

Grading Weights
  • Quizzes: 25%
  • Homework assignments: 25%
  • Final exam: 25%
  • Project: 25%
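
As a rough illustration of how these weights combine with the dropped-lowest-quiz rule, here is a minimal Python sketch. It assumes every component is scored on a 0-100 scale and that homework assignments are averaged equally; neither assumption is stated in this syllabus. The function names and example scores are hypothetical.

```python
# Grading weights as listed in the syllabus.
WEIGHTS = {"quizzes": 0.25, "homework": 0.25, "final_exam": 0.25, "project": 0.25}


def quiz_average(quiz_scores):
    """Average the quiz scores after dropping the single lowest one."""
    if len(quiz_scores) < 2:
        raise ValueError("need at least two quiz scores to drop the lowest")
    kept = sorted(quiz_scores)[1:]  # discard the lowest score
    return sum(kept) / len(kept)


def course_grade(quiz_scores, homework_scores, final_exam, project):
    """Weighted course grade; assumes all scores are on a 0-100 scale."""
    components = {
        "quizzes": quiz_average(quiz_scores),
        # Assumption: homework assignments count equally.
        "homework": sum(homework_scores) / len(homework_scores),
        "final_exam": final_exam,
        "project": project,
    }
    return sum(WEIGHTS[name] * score for name, score in components.items())


# Hypothetical scores, for illustration only.
grade = course_grade(
    quiz_scores=[60, 80, 90, 100],
    homework_scores=[85, 95],
    final_exam=88,
    project=92,
)
print(grade)
```

In this example the quiz score of 60 is dropped, so the quiz component is the average of the remaining three scores.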

Academic Integrity

If you are struggling with an assignment or class materials, require extra time for an assignment, or simply require additional assistance, see the instructor immediately.

Cheating and plagiarism will not be tolerated. Any case will automatically result in loss of all the points for the assignment, and may be a reason for a failing grade and/or grounds for dismissal. In case of a group assignment, all group members will receive a zero grade.

Any suspected case of cheating or plagiarism or behavior in violation of the rules of this course will be reported to the Office of Academic Integrity. Students are expected to know and understand all college policies, especially the code of academic integrity.

Disability Services

Please contact the Disability Support Services to establish eligibility and to coordinate reasonable accommodation.

Attendance

Regular attendance is expected. Students are held responsible for all of the work of the courses in which they are registered, and all absences must be excused by the instructor before provision is made to make up the work missed.

Class Policy Changes

The instructor reserves the right to revise any item on this syllabus, including, but not limited to, any class policy, course outline or schedule, grading policy, or test. Note that the requirements for deliverables may be clarified and expanded in class, via email, on GitHub, or on Blackboard. Students are expected to complete the deliverables incorporating such additions.

Software

Official software packages

These packages will be used for in-class demonstrations and homework solutions.

  • H2O.ai is a package of high-performance functions and algorithms for preprocessing data and training statistical and machine learning models. It can be accessed without the need for coding through a standalone web browser client, or by installing additional coding interfaces for R and/or Python.

  • Anaconda Python: Python is an approachable, general-purpose programming language with excellent add-on libraries for math and data analysis. Anaconda Python is a commercial distribution of Python that bundles these add-on packages (and many others) together with convenient development utilities like the Spyder IDE.

  • SAS 9.4 and Enterprise Miner are commercial packages for preprocessing data and training statistical and machine learning models. Enterprise Miner allows for the construction of complex data mining workflows without writing code. Enterprise Miner is a proprietary commercial product and not freely available. You may access Enterprise Miner through the SAS on Demand for Academics portal or by contacting the GWU Instructional Technology Lab.

Other useful free software:

  • R is a tremendously popular language for data analysis, with thousands of user contributed packages for different types of data analysis tasks.

  • RStudio is a widely used IDE for the R language.

  • SAS 9.4 University Edition is a free edition of SAS' proprietary commercial data analysis software. SAS University Edition contains the newest versions of several SAS software packages along with learning tools and utilities for new users. It also requires a virtual machine player, which you may need to install separately.

Using Git

You are welcome to use git and/or GitHub to save and manage your own copies of class materials.

The easiest way to do so is to download this entire repository as a zip file. However, you will need to download a new copy whenever changes are made to this repository. To download the course repository, navigate to the course GitHub repository (i.e., this page), click the 'Clone or Download' button, and then select 'Download Zip'.

If you would like to take advantage of the version control capabilities of git, follow these steps.

Install required software
Fork and pull materials

Navigate to the course GitHub repository (i.e. this page) and click the 'Fork' button.

Enter the following statements on the git bash command line:

$ cd <parent directory>

$ mkdir GWU_data_mining

$ cd GWU_data_mining

$ git init

$ git remote add origin https://github.com/<your username>/GWU_data_mining.git

$ git remote add upstream https://github.com/jphall663/GWU_data_mining.git

$ git pull origin master

$ git lfs install

$ git lfs track '*.jpg' '*.png' '*.csv' '*.sas7bdat'
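
To make the effect of the `git lfs track` command above more concrete, the following Python sketch shows the attribute lines that command records in `.gitattributes` for each pattern, and approximates which files the patterns would cover. The matching uses Python's `fnmatch` on the file name only, which is a simplification of git's own pattern rules, and the example paths are hypothetical rather than files from this repository.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Patterns passed to `git lfs track` in the commands above.
LFS_PATTERNS = ["*.jpg", "*.png", "*.csv", "*.sas7bdat"]


def gitattributes_lines(patterns):
    """Lines that `git lfs track` writes to .gitattributes, one per pattern."""
    return [f"{pattern} filter=lfs diff=lfs merge=lfs -text" for pattern in patterns]


def is_lfs_tracked(path, patterns=LFS_PATTERNS):
    """Approximate check: does the file name match any tracked pattern?

    Simplification: compares only the final path component, case-sensitively.
    """
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in patterns)


for line in gitattributes_lines(LFS_PATTERNS):
    print(line)

# Hypothetical paths, for illustration only.
for example in ["data/example.csv", "notebooks/example.ipynb", "figs/plot.png"]:
    print(example, is_lfs_tracked(example))
```

Files matching the tracked patterns are stored through Git LFS, while other files, such as notebooks, are stored as ordinary git objects.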