cmc-csci143: A repository from anan-ara

CSCI143: Big Data

Salary information:

About the Instructor


Name	Mike Izbicki (call me Mike)
Email	mizbicki@cmc.edu
Office	Adams 216
Office Hours	MW 2pm-3pm, or by appointment
Webpage	izbicki.me
Research	Machine Learning (see izbicki.me/research.html for some past projects)

Fun facts:

grew up in San Clemente
7 years in the navy
phd/postdoc at UC Riverside
taught in DPRK
My wife is pregnant and due to have a baby early April. This may result in a class session being rescheduled, depending on when the baby decides to come.

About the Course

What is big data?

Depends entirely on the person who is talking

Most non-computer scientists (muggles) think anything bigger than 1G is big data
Facebook considers "tens of petabytes" to be a "SMALL data problem"
One of the biggest problems in industry is people apply tools for "Facebook big data" to "muggle big data", and a major goal of this course is to teach you why this is bad and how to avoid it
For us, "big data" means:
1. managing a cluster of computers to solve a computational problem; if it can be solved on a single computer, it's SMALL data
2. all the interesting/applied parts of upper division computer science compressed into a single course

We will work with the following three datasets:

All geolocated tweets sent from 2017-today, 4 terabytes
The common crawl of the web since 2008, >1 petabyte
The internet archive, >50 petabytes as of 2014

By the end of this course, you will build your own "google" search engine. You will manage a cluster of machines that work together to:

download all the data from the internet
extract key information from the HTML
store it in a format suitable for sub 200ms queries
and serve the data in a webpage

In order to make your search engine scalable, we will use the following technologies:

Docker containers
1. used to easily deploy code to thousands of computers
2. requires concepts from operating systems, networks, architecture; closely related to "virtual machines"
3. widely used in industry, see https://stackshare.io/docker
Databases
1. stores and accesses the data efficiently
  1. application and database on same computer (SQLite)
  2. application and database on different computers (Postgres), our focus
  3. database on a cluster of computers in the same datacenter (Postgres + extensions Postgres-xl and Citus)
  4. database on a cluster of computers spread throughout the world (YugabyteDB, CocroachDB)
2. SQL to manipulate data, python to build applications
3. NoSQL (e.g. MongoDB, CouchDB) sucks and you should probably never use it (strongly held personal opinion)
4. Postgres implements full text search in 70+ languages using custom libraries I've written
5. Postgres widely used in industry, see https://stackshare.io/postgresql
With these technologies, you can create a fully functioning, highly scalable web business
1. former CMC student Biniyam Asnake created the business NextDorm as his senior thesis (slightly different tech stack, but same ideas)

Who should take this course?

This course is designed for data science majors, not computer science majors. I'm happy to have CS majors in this course (and I think you'll find this course fun), but know that:

you probably have not fully met the prereqs for this course
some material in this course will duplicate material in your other CS courses
1. you should not take both this course and CSCI133 Databases
2. the course number CSCI143 comes from the fact that all CMC upper division CS courses start with CSCI14, and the 3 is for databases

Prerequisites:

Discrete math: CSCI055 or MATH055
1. Basic probability / counting
2. Basic graph theory
Foundations of data science: CSCI 036, ECON 122, or ECON 160
1. Basic machine learning
2. Basic SQL (also covered in CSCI040 Computing for the Web; not covered in any computer science class except CSCI133 Databases, which you should not take if you take this course)
3. Regular expressions (for CS majors, typically covered in a theory of computing or compilers class)
Data structures: CSCI046 or CSCI70 (Mudd) or CSCI62 (Pomona)
1. All courses cover:
  1. Big-oh notation
  2. Balanced binary search trees
2. CSCI046 covers:
  1. Basic Unix shell commands
  2. Advanced git
  3. Vim text editor
  4. Analyzing multi-gigabyte Twitter datasets
3. Data structures pre-req CSCI040:
  1. Markdown
  2. HTML / CSS
  3. Basic SQL
  4. Programming web servers with the flask library
  5. Web scraping with the requests and bs4 libraries

Relation to other CS courses:

One purpose of this course is to provide DS majors with an overview of CS concepts. Therefore, there is a lot of material in this course that is covered in other upper division CS courses required for CS majors.

Overlapping concepts
1. CSCI105 Computer Systems (10% overlap)
  1. types of storage: tape vs HDD vs SDD vs NVME vs RAM
  2. RAID
  3. parallel vs distributed architectures
2. CSCI135 Operating Systems (10% overlap)
  1. permissions systems
  2. processes vs threads
  3. virtual machines vs containers
3. CSCI125 Networking (10% overlap)
  1. private vs public networks
  2. IP addresses
  3. TCP ports
  4. virtual networks
4. CSCI121 Software Development (10% overlap)
  1. version control systems (i.e. git)
  2. test driven development / continuous integration
  3. microservices vs monolithic architectures
  4. 12 factor applications
5. CSCI133 Databases (50% overlap)
  1. SQL
  2. ACID/MVCC/transactions
  3. indexing techniques
6. A lot of the concepts we'll be covering "should" be covered in other CS courses, but because CS professors are often more theory minded than practice minded, they don't get covered. In that sense, this course is similar to the Missing Semester of Your CS Education course taught at MIT.
Concepts we don't cover from CSCI133 Databases
1. relational algebra
2. technical implementation details / C programming
3. relationship between the database and operating system
BigData concepts from a CS perspective that we will not talk about:
1. Frameworks for distributed computation (e.g. Apache Hadoop, Apache Spark)
2. Distributed Filesystems (e.g. HDFS, IPFS); we will talk about S3
3. Geo-distributed databases

Textbook:

Big data is a rapidly changing field, and all currently printed textbooks are both incomplete and already out of date. Therefore, we won't be using a textbook. Instead, we will be using online documentation. The main references we will use are given below, but I will provide more specific links each week.

Grades:

You will have:

Occasional labs (worth 2pts each)
Weekly homeworks (worth 10-25 points each)
Twitter MapReduce project (worth 20 points -- only students who did not take CS46 with me)
One open notes midterm (20 points, week after spring break)
One open notes final (30 points, during finals week)
In total, there will be about 250 points in the class.

Your final grade will be computed according to the following table, with one caveat.

If your grade satisfies	then you earn
95 ≤ grade	A
90 ≤ grade < 95	A-
87 ≤ grade < 90	B+
83 ≤ grade < 87	B
80 ≤ grade < 83	B-
77 ≤ grade < 80	C+
73 ≤ grade < 77	C
70 ≤ grade < 73	C-
67 ≤ grade < 70	D+
63 ≤ grade < 67	D
60 ≤ grade < 63	D-
60 > grade	F

CAVEAT: In order to get an A/A- in this course, you must also complete one of the following two tasks to learn about the history of unix programming:

watch the following documentaries:
1. RevolutionOS (from 2001)
2. The Internet's Own Boy: The Story of Aaron Swartz (from 2014)
read chapters 1-3 of The Art of Unix Programming by ESR

Late Work Policy:

You lose 10% on labs/projects for each day late. If you have extenuating circumstances, contact me in advance of the due date and I may extend the due date for you.

Schedule

Week	Date	Topic
0	M, 25 Jan	DevOps: Unix Shell
0	W, 27 Jan	DevOps: Unix Shell
1	M, 01 Feb	DevOps: Docker
1	W, 03 Feb	DevOps: Docker
2	M, 08 Feb	DevOps: CRUD Apps
2	W, 10 Feb	DevOps: CRUD Apps
3	M, 15 Feb	SQL: Basics
3	W, 17 Feb	SQL: Basics
4	M, 22 Feb	SQL: Intermediate Data Types
4	W, 24 Feb	SQL: Intermediate Data Types
5	M, 01 Mar	SQL: ACID/MVCC/Transactions
5	W, 03 Mar	SQL: ACID/MVCC/Transactions
6	M, 08 Mar	Spring Break
6	W, 10 Mar	Spring Break
7	M, 15 Mar	SQL: ACID/MVCC/Transactions
7	W, 17 Mar	SQL: ACID/MVCC/Transactions
8	M, 22 Mar	Indexing: b-tree
8	W, 24 Mar	Indexing: b-tree
9	M, 29 Mar	Indexing: Multilingual Full Text Search
9	W, 31 Mar	Indexing: Multilingual Full Text Search
10	M, 05 Apr	Indexing: Multilingual Full Text Search
10	W, 07 Apr	Indexing: Multilingual Full Text Search
11	M, 12 Apr	Counting: Triggers
11	W, 14 Apr	Counting: Triggers
12	M, 19 Apr	Counting: Probabilistic Data Structures
12	W, 21 Apr	Counting: Probabilistic Data Structures
13	M, 26 Apr	Counting: Probabilistic Data Structures
13	W, 28 Apr	Counting: Probabilistic Data Structures
14	M, 03 May	DBA (DataBase Admin)
14	W, 05 May	DBA (DataBase Admin)

Technology Policy

You must complete all programming assignments on the lambda server.
You must use either vim or emacs to complete all programming assignments. In particular, you may not use VSCode, IDLE, or PyCharm for any reason.
You must not share your lambda-server password with anyone else.

Violations of any of these policies will be treated as academic integrity violations.

Collaboration Policy

You are encouraged to discuss all labs and projects with other students, subject to the following constraints:

you must be the person typing in all code for your assignments, and
you must not copy another student's code.

You may use any online resources you like as references.

WARNING: All material in this class is cumulative. If you work "too closely" with another student on an assignment, you won't understand how to complete subsequent assignments, and you will quickly fall behind. You should view collaboration as a way to improve your understanding, not as a way to do less work.

You are ultimately responsible for ensuring you learn the material!

Accommodations for Disabilities

I've tried to design the course to be as accessible as possible for people with disabilities. (We'll talk a bit about how to design accessible software in class too!) If you need any further accommodations, please ask.

I want you to succeed and I'll make every effort to ensure that you can.