/cmc-csci143

big data course materials

Primary LanguageTeX

CSCI143: Big Data

Salary information:

  1. https://www.levels.fyi/
    1. devops
    2. distributed systems engineer
    3. data scientist
  2. employers illegally collude to reduce salaries

About the Instructor

Name Mike Izbicki (call me Mike)
Email mizbicki@cmc.edu
Office Adams 216
Office Hours MW 2pm-3pm, or by appointment
Webpage izbicki.me
Research Machine Learning (see izbicki.me/research.html for some past projects)

Fun facts:

  1. grew up in San Clemente
  2. 7 years in the navy
  3. phd/postdoc at UC Riverside
  4. taught in DPRK
  5. My wife is pregnant and due to have a baby early April. This may result in a class session being rescheduled, depending on when the baby decides to come.

About the Course

What is big data?

Depends entirely on the person who is talking

  1. Most non-computer scientists (muggles) think anything bigger than 1G is big data
  2. Facebook considers "tens of petabytes" to be a "SMALL data problem"
  3. One of the biggest problems in industry is people apply tools for "Facebook big data" to "muggle big data", and a major goal of this course is to teach you why this is bad and how to avoid it
  4. For us, "big data" means:
    1. managing a cluster of computers to solve a computational problem; if it can be solved on a single computer, it's SMALL data
    2. all the interesting/applied parts of upper division computer science compressed into a single course

We will work with the following three datasets:

  1. All geolocated tweets sent from 2017-today, 4 terabytes
  2. The common crawl of the web since 2008, >1 petabyte
  3. The internet archive, >50 petabytes as of 2014

By the end of this course, you will build your own "google" search engine. You will manage a cluster of machines that work together to:

  1. download all the data from the internet
  2. extract key information from the HTML
  3. store it in a format suitable for sub 200ms queries
  4. and serve the data in a webpage

In order to make your search engine scalable, we will use the following technologies:

  1. Docker containers

    1. used to easily deploy code to thousands of computers
    2. requires concepts from operating systems, networks, architecture; closely related to "virtual machines"
    3. widely used in industry, see https://stackshare.io/docker
  2. Databases

    1. stores and accesses the data efficiently
      1. application and database on same computer (SQLite)
      2. application and database on different computers (Postgres), our focus
      3. database on a cluster of computers in the same datacenter (Postgres + extensions Postgres-xl and Citus)
      4. database on a cluster of computers spread throughout the world (YugabyteDB, CocroachDB)
    2. SQL to manipulate data, python to build applications
    3. NoSQL (e.g. MongoDB, CouchDB) sucks and you should probably never use it (strongly held personal opinion)
    4. Postgres implements full text search in 70+ languages using custom libraries I've written
    5. Postgres widely used in industry, see https://stackshare.io/postgresql
  3. With these technologies, you can create a fully functioning, highly scalable web business

    1. former CMC student Biniyam Asnake created the business NextDorm as his senior thesis (slightly different tech stack, but same ideas)

Who should take this course?

This course is designed for data science majors, not computer science majors. I'm happy to have CS majors in this course (and I think you'll find this course fun), but know that:

  1. you probably have not fully met the prereqs for this course
  2. some material in this course will duplicate material in your other CS courses
    1. you should not take both this course and CSCI133 Databases
    2. the course number CSCI143 comes from the fact that all CMC upper division CS courses start with CSCI14, and the 3 is for databases

Prerequisites:

  1. Discrete math: CSCI055 or MATH055

    1. Basic probability / counting
    2. Basic graph theory
  2. Foundations of data science: CSCI 036, ECON 122, or ECON 160

    1. Basic machine learning
    2. Basic SQL (also covered in CSCI040 Computing for the Web; not covered in any computer science class except CSCI133 Databases, which you should not take if you take this course)
    3. Regular expressions (for CS majors, typically covered in a theory of computing or compilers class)
  3. Data structures: CSCI046 or CSCI70 (Mudd) or CSCI62 (Pomona)

    1. All courses cover:
      1. Big-oh notation
      2. Balanced binary search trees
    2. CSCI046 covers:
      1. Basic Unix shell commands
      2. Advanced git
      3. Vim text editor
      4. Analyzing multi-gigabyte Twitter datasets
    3. Data structures pre-req CSCI040:
      1. Markdown
      2. HTML / CSS
      3. Basic SQL
      4. Programming web servers with the flask library
      5. Web scraping with the requests and bs4 libraries

Relation to other CS courses:

One purpose of this course is to provide DS majors with an overview of CS concepts. Therefore, there is a lot of material in this course that is covered in other upper division CS courses required for CS majors.

  1. Overlapping concepts

    1. CSCI105 Computer Systems (10% overlap)
      1. types of storage: tape vs HDD vs SDD vs NVME vs RAM
      2. RAID
      3. parallel vs distributed architectures
    2. CSCI135 Operating Systems (10% overlap)
      1. permissions systems
      2. processes vs threads
      3. virtual machines vs containers
    3. CSCI125 Networking (10% overlap)
      1. private vs public networks
      2. IP addresses
      3. TCP ports
      4. virtual networks
    4. CSCI121 Software Development (10% overlap)
      1. version control systems (i.e. git)
      2. test driven development / continuous integration
      3. microservices vs monolithic architectures
      4. 12 factor applications
    5. CSCI133 Databases (50% overlap)
      1. SQL
      2. ACID/MVCC/transactions
      3. indexing techniques
    6. A lot of the concepts we'll be covering "should" be covered in other CS courses, but because CS professors are often more theory minded than practice minded, they don't get covered. In that sense, this course is similar to the Missing Semester of Your CS Education course taught at MIT.
  2. Concepts we don't cover from CSCI133 Databases

    1. relational algebra
    2. technical implementation details / C programming
    3. relationship between the database and operating system
  3. BigData concepts from a CS perspective that we will not talk about:

    1. Frameworks for distributed computation (e.g. Apache Hadoop, Apache Spark)
    2. Distributed Filesystems (e.g. HDFS, IPFS); we will talk about S3
    3. Geo-distributed databases

Textbook:

Big data is a rapidly changing field, and all currently printed textbooks are both incomplete and already out of date. Therefore, we won't be using a textbook. Instead, we will be using online documentation. The main references we will use are given below, but I will provide more specific links each week.

  1. Docker documentation

  2. Postgresql documentation

  3. SQLite documentation

  4. SQLAlchemy documentation

  5. 12 Factor Web Apps

Grades:

You will have:

  1. Occasional labs (worth 2pts each)
  2. Weekly homeworks (worth 10-25 points each)
  3. Twitter MapReduce project (worth 20 points -- only students who did not take CS46 with me)
  4. One open notes midterm (20 points, week after spring break)
  5. One open notes final (30 points, during finals week)
  6. In total, there will be about 250 points in the class.

Your final grade will be computed according to the following table, with one caveat.

If your grade satisfies then you earn
95 ≤ grade A
90 ≤ grade < 95 A-
87 ≤ grade < 90 B+
83 ≤ grade < 87 B
80 ≤ grade < 83 B-
77 ≤ grade < 80 C+
73 ≤ grade < 77 C
70 ≤ grade < 73 C-
67 ≤ grade < 70 D+
63 ≤ grade < 67 D
60 ≤ grade < 63 D-
60 > grade F

CAVEAT: In order to get an A/A- in this course, you must also complete one of the following two tasks to learn about the history of unix programming:

  1. watch the following documentaries:

    1. RevolutionOS (from 2001)

    2. The Internet's Own Boy: The Story of Aaron Swartz (from 2014)

  2. read chapters 1-3 of The Art of Unix Programming by ESR

Late Work Policy:

You lose 10% on labs/projects for each day late. If you have extenuating circumstances, contact me in advance of the due date and I may extend the due date for you.

Schedule

Week Date Topic
0 M, 25 Jan DevOps: Unix Shell
0 W, 27 Jan DevOps: Unix Shell
1 M, 01 Feb DevOps: Docker
1 W, 03 Feb DevOps: Docker
2 M, 08 Feb DevOps: CRUD Apps
2 W, 10 Feb DevOps: CRUD Apps
3 M, 15 Feb SQL: Basics
3 W, 17 Feb SQL: Basics
4 M, 22 Feb SQL: Intermediate Data Types
4 W, 24 Feb SQL: Intermediate Data Types
5 M, 01 Mar SQL: ACID/MVCC/Transactions
5 W, 03 Mar SQL: ACID/MVCC/Transactions
6 M, 08 Mar Spring Break
6 W, 10 Mar Spring Break
7 M, 15 Mar SQL: ACID/MVCC/Transactions
7 W, 17 Mar SQL: ACID/MVCC/Transactions
8 M, 22 Mar Indexing: b-tree
8 W, 24 Mar Indexing: b-tree
9 M, 29 Mar Indexing: Multilingual Full Text Search
9 W, 31 Mar Indexing: Multilingual Full Text Search
10 M, 05 Apr Indexing: Multilingual Full Text Search
10 W, 07 Apr Indexing: Multilingual Full Text Search
11 M, 12 Apr Counting: Triggers
11 W, 14 Apr Counting: Triggers
12 M, 19 Apr Counting: Probabilistic Data Structures
12 W, 21 Apr Counting: Probabilistic Data Structures
13 M, 26 Apr Counting: Probabilistic Data Structures
13 W, 28 Apr Counting: Probabilistic Data Structures
14 M, 03 May DBA (DataBase Admin)
14 W, 05 May DBA (DataBase Admin)

Technology Policy

  1. You must complete all programming assignments on the lambda server.

  2. You must use either vim or emacs to complete all programming assignments. In particular, you may not use VSCode, IDLE, or PyCharm for any reason.

  3. You must not share your lambda-server password with anyone else.

Violations of any of these policies will be treated as academic integrity violations.

Collaboration Policy

You are encouraged to discuss all labs and projects with other students, subject to the following constraints:

  1. you must be the person typing in all code for your assignments, and
  2. you must not copy another student's code.

You may use any online resources you like as references.

WARNING: All material in this class is cumulative. If you work "too closely" with another student on an assignment, you won't understand how to complete subsequent assignments, and you will quickly fall behind. You should view collaboration as a way to improve your understanding, not as a way to do less work.

You are ultimately responsible for ensuring you learn the material!

Accommodations for Disabilities

I've tried to design the course to be as accessible as possible for people with disabilities. (We'll talk a bit about how to design accessible software in class too!) If you need any further accommodations, please ask.

I want you to succeed and I'll make every effort to ensure that you can.