About the instructor:
Name | Mike Izbicki (call me Mike) |
Email | mizbicki@cmc.edu |
Office | Adams 216 |
Office Hours | MW 2pm-3pm, or by appointment |
Webpage | izbicki.me |
Research | Machine Learning (see izbicki.me/research.html for some past projects) |
Fun facts:
- grew up in San Clemente
- 7 years in the Navy
- PhD/postdoc at UC Riverside
- taught in DPRK
- My wife is pregnant and due to have a baby early April. This may result in a class session being rescheduled, depending on when the baby decides to come.
What is big data?
Depends entirely on the person who is talking
- Most non-computer scientists (muggles) think anything bigger than 1 GB is big data
- Facebook considers "tens of petabytes" to be a "SMALL data problem"
- One of the biggest problems in industry is that people apply tools built for "Facebook big data" to "muggle big data". A major goal of this course is to teach you why this is bad and how to avoid it.
- For us, "big data" means:
    - managing a cluster of computers to solve a computational problem; if it can be solved on a single computer, it's SMALL data
    - all the interesting/applied parts of upper division computer science compressed into a single course
We will work with the following three datasets:
- All geolocated tweets sent from 2017-today, 4 terabytes
- The common crawl of the web since 2008, >1 petabyte
- The Internet Archive, >50 petabytes as of 2014
By the end of this course, you will build your own "google" search engine. You will manage a cluster of machines that work together to:
- download all the data from the internet
- extract key information from the HTML
- store it in a format suitable for sub 200ms queries
- and serve the data in a webpage
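
The extract, store, and query steps above can be sketched on a single machine. This is a minimal sketch under stated assumptions, not the course's implementation: it uses only the Python standard library, stores pages in an in-memory SQLite database rather than the Postgres cluster this course focuses on, and skips the download and web-serving steps. The names `TitleAndTextParser`, `extract`, `build_index`, and `search` are hypothetical.

```python
import sqlite3
from html.parser import HTMLParser


class TitleAndTextParser(HTMLParser):
    """Collects the <title> and the visible text of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    """Return (title, body_text) for one HTML document."""
    parser = TitleAndTextParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), " ".join(parser.text_parts)


def build_index(pages):
    """pages: iterable of (url, html) pairs. Returns a SQLite connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT, body TEXT)")
    for url, html in pages:
        title, body = extract(html)
        conn.execute("INSERT INTO pages VALUES (?, ?, ?)", (url, title, body))
    conn.commit()
    return conn


def search(conn, term):
    """Return (url, title) for every page whose body contains term."""
    # LIKE with a leading wildcard scans every row; no index is used.
    rows = conn.execute(
        "SELECT url, title FROM pages WHERE body LIKE ? ORDER BY url",
        (f"%{term}%",),
    )
    return rows.fetchall()
```

The `LIKE` query reads every row on every search. That is fine for a handful of pages, but it will not meet a sub 200ms target at web scale, which is why indexing gets several weeks of the schedule.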
In order to make your search engine scalable, we will use the following technologies:
- Docker containers
    - used to easily deploy code to thousands of computers
    - requires concepts from operating systems, networks, architecture; closely related to "virtual machines"
    - widely used in industry, see https://stackshare.io/docker
- Databases
    - stores and accesses the data efficiently
        - application and database on same computer (SQLite)
        - application and database on different computers (Postgres), our focus
        - database on a cluster of computers in the same datacenter (Postgres + extensions Postgres-XL and Citus)
        - database on a cluster of computers spread throughout the world (YugabyteDB, CockroachDB)
    - SQL to manipulate data, python to build applications
    - NoSQL (e.g. MongoDB, CouchDB) sucks and you should probably never use it (strongly held personal opinion)
    - Postgres implements full text search in 70+ languages using custom libraries I've written
    - Postgres is widely used in industry, see https://stackshare.io/postgresql
- With these technologies, you can create a fully functioning, highly scalable web business
    - former CMC student Biniyam Asnake created the business NextDorm as his senior thesis (slightly different tech stack, but same ideas)
Who should take this course?
This course is designed for data science majors, not computer science majors. I'm happy to have CS majors in this course (and I think you'll find this course fun), but know that:
- you probably have not fully met the prereqs for this course
- some material in this course will duplicate material in your other CS courses
- you should not take both this course and CSCI133 Databases
- the course number CSCI143 comes from the fact that all CMC upper division CS courses start with CSCI14, and the 3 is for databases
Prerequisites:
- Discrete math: CSCI055 or MATH055
    - Basic probability / counting
    - Basic graph theory
- Foundations of data science: CSCI 036, ECON 122, or ECON 160
    - Basic machine learning
    - Basic SQL (also covered in CSCI040 Computing for the Web; not covered in any computer science class except CSCI133 Databases, which you should not take if you take this course)
    - Regular expressions (for CS majors, typically covered in a theory of computing or compilers class)
- Data structures: CSCI046 or CSCI70 (Mudd) or CSCI62 (Pomona)
    - All courses cover:
        - Big-oh notation
        - Balanced binary search trees
    - CSCI046 covers:
        - Basic Unix shell commands
        - Advanced git
        - Vim text editor
        - Analyzing multi-gigabyte Twitter datasets
    - CSCI040 (the prerequisite for data structures) covers:
        - Markdown
        - HTML / CSS
        - Basic SQL
        - Programming web servers with the `flask` library
        - Web scraping with the `requests` and `bs4` libraries
Relation to other CS courses:
One purpose of this course is to provide DS majors with an overview of CS concepts. Therefore, there is a lot of material in this course that is covered in other upper division CS courses required for CS majors.
- Overlapping concepts
    - CSCI105 Computer Systems (10% overlap)
        - types of storage: tape vs HDD vs SSD vs NVMe vs RAM
        - RAID
        - parallel vs distributed architectures
    - CSCI135 Operating Systems (10% overlap)
        - permissions systems
        - processes vs threads
        - virtual machines vs containers
    - CSCI125 Networking (10% overlap)
        - private vs public networks
        - IP addresses
        - TCP ports
        - virtual networks
    - CSCI121 Software Development (10% overlap)
        - version control systems (i.e. git)
        - test driven development / continuous integration
        - microservices vs monolithic architectures
        - 12 factor applications
    - CSCI133 Databases (50% overlap)
        - SQL
        - ACID/MVCC/transactions
        - indexing techniques
    - A lot of the concepts we'll be covering "should" be covered in other CS courses, but because CS professors are often more theory-minded than practice-minded, they don't get covered. In that sense, this course is similar to the Missing Semester of Your CS Education course taught at MIT.
- Concepts we don't cover from CSCI133 Databases
    - relational algebra
    - technical implementation details / C programming
    - relationship between the database and operating system
- Big data concepts from a CS perspective that we will not talk about:
    - Frameworks for distributed computation (e.g. Apache Hadoop, Apache Spark)
    - Distributed Filesystems (e.g. HDFS, IPFS); we will talk about S3
    - Geo-distributed databases
Textbook:
Big data is a rapidly changing field, and all currently printed textbooks are both incomplete and already out of date. Therefore, we won't be using a textbook. Instead, we will be using online documentation. The main references we will use are given below, but I will provide more specific links each week.
Grades:
You will have:
- Occasional labs (worth 2pts each)
- Weekly homeworks (worth 10-25 points each)
- Twitter MapReduce project (worth 20 points; only for students who did not take CSCI046 with me)
- One open notes midterm (20 points, week after spring break)
- One open notes final (30 points, during finals week)
- In total, there will be about 250 points in the class.
Your final grade will be computed according to the following table, with one caveat.
If your grade satisfies | then you earn |
---|---|
95 ≤ grade | A |
90 ≤ grade < 95 | A- |
87 ≤ grade < 90 | B+ |
83 ≤ grade < 87 | B |
80 ≤ grade < 83 | B- |
77 ≤ grade < 80 | C+ |
73 ≤ grade < 77 | C |
70 ≤ grade < 73 | C- |
67 ≤ grade < 70 | D+ |
63 ≤ grade < 67 | D |
60 ≤ grade < 63 | D- |
grade < 60 | F |
CAVEAT: In order to get an A/A- in this course, you must also complete one of the following two tasks to learn about the history of Unix programming:
- watch the following documentaries:
    - Revolution OS (from 2001)
    - The Internet's Own Boy: The Story of Aaron Swartz (from 2014)
- read chapters 1-3 of The Art of Unix Programming by ESR
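
As a rough illustration of how the grade table above is applied, the sketch below maps points to a letter grade. It assumes that "grade" in the table means the percentage of total points earned and that no rounding is applied; neither is stated in this syllabus. The name `letter_grade` is hypothetical, and the A/A- caveat is not modeled.

```python
# Lower bounds copied from the grade table, highest first.
GRADE_CUTOFFS = [
    (95, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
]


def letter_grade(points_earned, points_possible):
    """Map a point total to a letter grade using the table's cutoffs."""
    # Assumption: the table's "grade" is a percentage of total points.
    percent = 100 * points_earned / points_possible
    for cutoff, letter in GRADE_CUTOFFS:
        if percent >= cutoff:
            return letter
    return "F"
```

For example, with 250 points in the class, 225 points is exactly 90% and falls in the A- row under these assumptions.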
Late Work Policy:
You lose 10% on labs/projects for each day late. If you have extenuating circumstances, contact me in advance of the due date and I may extend the due date for you.
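One possible reading of the penalty, as a sketch only: each day late removes 10% of the score you would otherwise have earned, linearly, with a floor at zero. The policy above does not say whether the 10% is taken from the earned score or the maximum score, or how partial days count, so treat every detail here as an assumption. The name `late_score` is hypothetical.

```python
def late_score(raw_score, days_late, penalty_per_day=0.10):
    """Score after a linear late penalty (assumed reading of the policy)."""
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    # Assumption: the penalty is linear in whole days and never goes below zero.
    multiplier = max(0.0, 1.0 - penalty_per_day * days_late)
    return raw_score * multiplier
```

Under this reading, a 20 point homework that would have earned full marks is worth about 16 points if turned in two days late.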
Week | Date | Topic |
---|---|---|
0 | M, 25 Jan | DevOps: Unix Shell |
0 | W, 27 Jan | DevOps: Unix Shell |
1 | M, 01 Feb | DevOps: Docker |
1 | W, 03 Feb | DevOps: Docker |
2 | M, 08 Feb | DevOps: CRUD Apps |
2 | W, 10 Feb | DevOps: CRUD Apps |
3 | M, 15 Feb | SQL: Basics |
3 | W, 17 Feb | SQL: Basics |
4 | M, 22 Feb | SQL: Intermediate Data Types |
4 | W, 24 Feb | SQL: Intermediate Data Types |
5 | M, 01 Mar | SQL: ACID/MVCC/Transactions |
5 | W, 03 Mar | SQL: ACID/MVCC/Transactions |
6 | M, 08 Mar | Spring Break |
6 | W, 10 Mar | Spring Break |
7 | M, 15 Mar | SQL: ACID/MVCC/Transactions |
7 | W, 17 Mar | SQL: ACID/MVCC/Transactions |
8 | M, 22 Mar | Indexing: b-tree |
8 | W, 24 Mar | Indexing: b-tree |
9 | M, 29 Mar | Indexing: Multilingual Full Text Search |
9 | W, 31 Mar | Indexing: Multilingual Full Text Search |
10 | M, 05 Apr | Indexing: Multilingual Full Text Search |
10 | W, 07 Apr | Indexing: Multilingual Full Text Search |
11 | M, 12 Apr | Counting: Triggers |
11 | W, 14 Apr | Counting: Triggers |
12 | M, 19 Apr | Counting: Probabilistic Data Structures |
12 | W, 21 Apr | Counting: Probabilistic Data Structures |
13 | M, 26 Apr | Counting: Probabilistic Data Structures |
13 | W, 28 Apr | Counting: Probabilistic Data Structures |
14 | M, 03 May | DBA (DataBase Admin) |
14 | W, 05 May | DBA (DataBase Admin) |
- You must complete all programming assignments on the lambda server.
- You must use either vim or emacs to complete all programming assignments. In particular, you may not use VSCode, IDLE, or PyCharm for any reason.
- You must not share your lambda-server password with anyone else.
Violations of any of these policies will be treated as academic integrity violations.
You are encouraged to discuss all labs and projects with other students, subject to the following constraints:
- you must be the person typing in all code for your assignments, and
- you must not copy another student's code.
You may use any online resources you like as references.
WARNING: All material in this class is cumulative. If you work "too closely" with another student on an assignment, you won't understand how to complete subsequent assignments, and you will quickly fall behind. You should view collaboration as a way to improve your understanding, not as a way to do less work.
You are ultimately responsible for ensuring you learn the material!
I've tried to design the course to be as accessible as possible for people with disabilities. (We'll talk a bit about how to design accessible software in class too!) If you need any further accommodations, please ask.
I want you to succeed and I'll make every effort to ensure that you can.