- Class Q & A Piazza
- Canvas
- Class Data
Office Hours:
- Clark Fitzgerald (rcfitzgerald@ucdavis.edu): Monday 1-2pm and Thursday 2-3pm, both in MSB 4208 (the conference room in the corner of the 4th floor of the math building)
- Si Teng Hao (ssthao@ucdavis.edu): Friday 2-4pm, in the statistics TA room on the first floor of MSB
Lecture content is in the lecture directory.
Date | Topic | Video |
---|---|---|
1-8 | introduction, syllabus, first steps in R | |
1-10 | group by computation, zip files | |
1-15 | debugging | |
1-17 | parallelism with independent local processors | |
1-22 | vectorization, apply family of functions | |
1-24 | size and efficiency of objects, intro to S4 / Matrix | |
1-29 | unsupervised learning / cluster analysis, agglomerative nested clustering | |
1-31 | introduction to bash, file navigation, help, permissions, executables | |
2-5 | SLURM cluster model, example job submissions | |
2-7 | mid quarter evaluation, bash pipes and filters, students practice SLURM, SWC lesson example hw | |
2-12 | review course suggestions, bash coding style guidelines | https://youtu.be/vqA18iYk7BM |
2-14 | shared memory parallelism | https://youtu.be/l34IBkk8xcc |
2-19 | Python introduction | https://youtu.be/t7UZR_hVMpY |
2-21 | Python iterators, generators, integration with shell pipelines | https://youtu.be/e-RAah4Eey4 |
2-26 | bootstrap, data flow, intermediate variables | https://youtu.be/ORvLvBj8dzo |
2-28 | performance monitoring, chunked streaming computation | https://youtu.be/RjrijQXd1dY |
3-5 | profiling in R, test driven development | https://youtu.be/lRKVzirgumw |
3-7 | database interfaces, SQL | https://youtu.be/O99Vx0L6hZM |
3-12 | MapReduce, Hive | https://youtu.be/vdusmDcgGPg |
3-14 | compiled languages | https://youtu.be/uwBm0ESc9-s |
Catalog Description:
High-performance computing in high-level data analysis languages; different computational approaches and paradigms for efficient analysis of big data; interfaces to compiled languages; R and Python programming languages; high-level parallel computing; MapReduce; parallel algorithms and reasoning.
As of January 2019, the fastest machine in the world is the Summit supercomputer at Oak Ridge National Laboratory.
This is an experiential course. Students will learn how to work with big data by actually working with big data. We'll cover the foundational concepts that are useful for data scientists and data engineers.
These are the goals of the course:
- Develop skills and confidence to analyze data larger than memory
- Identify when and where programs are slow, and what options are available to speed them up
- Critically evaluate new data technologies, and understand them in the context of existing technologies and concepts
The class will cover the following topics. In class we'll mostly use the R programming language, but these concepts apply more or less to any language; a short R sketch of the first topic appears after this list.
- 'group by' computation
- debugging
- profiling
- memory efficiency
- high-level parallel programming
- interfacing to faster languages
- object oriented programming
- shell (bash)
- cluster computing (SLURM)
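To make the first topic concrete, here is a minimal 'group by' sketch in base R. The `spending` data frame and its columns are made up for illustration; they are not the actual usaspending schema.

```r
# Hypothetical spending records: one row per payment.
spending <- data.frame(
    agency = c("DOD", "NASA", "DOD", "HHS"),
    amount = c(100, 50, 200, 75)
)

# Group the amounts by agency and sum within each group.
totals <- tapply(spending$amount, spending$agency, sum)
totals
#  DOD  HHS NASA
#  300   75   50
```

This split-apply-combine pattern shows up throughout the course; the interesting part of big data is doing it when the table doesn't fit in memory.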
Optional topics:
- creating reusable software
- Databases, Hive, Postgres
- Other languages: Python, Julia, C
- test driven development
- metaprogramming
- GPUs (graphical processing units)
- Hadoop MapReduce
We won't do the following in class:
- Go in depth into the latest and greatest packages for manipulating data. Feel free to use them on assignments, unless otherwise directed.
- Give a comprehensive overview of machine learning, predictive analytics, deep neural networks, algorithm design, or any particular subfield of statistics.
- Learn low-level concepts that distributed applications build on, such as network sockets and MPI.
These are all worth learning, but out of scope for this class.
We'll use the raw data behind usaspending.gov as the primary example dataset for this class. These are comprehensive records of how the US government spends taxpayer money. From their website:
> USA Spending tracks federal spending to ensure taxpayers can see how their money is being used in communities across America.
How did I get this data? I downloaded the raw Postgres database. Nehad Ismail, our excellent department systems administrator, helped me set it up. It's about 1 terabyte when built. The largest tables are around 200 GB and have hundreds of millions of rows.
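If you're curious what working with this database looks like, here is a hedged sketch using the DBI and RPostgres packages. The connection settings and the `awards` table name are illustrative assumptions, not the actual class setup.

```r
library(DBI)

# Placeholder connection; the real host, database name, and
# credentials will come with the assignments.
con <- dbConnect(RPostgres::Postgres(), dbname = "usaspending")

# With tables in the hundreds of gigabytes, let the database do the
# filtering instead of pulling whole tables into R.
first_rows <- dbGetQuery(con, "SELECT * FROM awards LIMIT 10")

dbDisconnect(con)
```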
You may find these books useful, but they aren't necessary for the course. I'll post other references along with the lecture notes.
- The Art of R Programming, by Norm Matloff
- Advanced R, by Hadley Wickham
- Linux Pocket Guide, by Daniel Barrett
Category | Grade Percentage |
---|---|
Assignments | 75 |
Group Project | 20 |
Participation | 5 |
- If there is any cheating, then we will have an in-class exam.
- There will be around 6-8 assignments.
- Homework must be turned in by the due date; no late homework will be accepted.
- The lowest assignment score will be dropped.
The safest way to get full points for participation is to have some interaction on Piazza on a weekly basis. Other ways are to interact in class, in discussion section, or at office hours.
I expect you to ask lots of questions as you learn this material. Here is where you can do this:
- In class or at office hours
- Online with Piazza. Make sure your posts don't give away solutions to the assignment. Including a handful of lines of code is usually fine.
For private or sensitive questions you can make a private post on Piazza or email the instructor or TA.
Asking good technical questions is an important skill. Stack Overflow offers some sound advice on how to ask questions. Summarizing,
- Check that your question hasn't been asked.
- Make the question specific, self-contained, and reproducible.
You're welcome to opt in or out of Piazza's Network service, which lets employers find you.
For the group project you will form groups of 2-3 and pursue a more open-ended question using the usaspending data set. This is your opportunity to pursue a question that personally interests you while creating a public 'portfolio project' that shows off your big data processing skills to potential employers or admissions committees.
Start early! Programming takes a long time, and you may also have to wait a long time for your job submission to complete on the cluster.
I encourage you to talk about assignments, but you need to do your own work, and keep your work private.
OK
- Talking about general approaches.
- Using short snippets of code (5 lines or so) from lecture, Piazza, or other sources. Acknowledge where it came from in a comment or in the assignment.
NOT OK
- Any violations of the UC Davis code of student conduct.
- Using other people's code without acknowledging it.
- Copying large blocks of code.
- Storing your code in a publicly available repository.
Adapted from Nick Ulle's Fall 2018 STA141A class.
Point values and weights may differ among assignments. They indicate which aspects are most important, so that you spend your time where it matters most. Check the homework submission page on Canvas to see the point values for each assignment.
The grading criteria are correctness, code quality, and communication. The following describes what an excellent homework solution should look like:
The report does the following:
- solves all the questions contained in the prompt
- makes conclusions that are supported by evidence in the data
- discusses efficiency and limitations of the computation
- cites any sources used
The attached code runs without modification.
The code is idiomatic and efficient. Different steps of the data processing are logically organized into scripts and small, reusable functions. Variable names are descriptive. The style is consistent and easy to read.
Plots include titles, axis labels, and legends or special annotations where appropriate. Tables include only columns of interest, are clearly explained in the body of the report, and are not too large. Numbers are reported in human readable terms, e.g. 31 billion rather than 31415926535. Writing is clear, correct English.
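For example, one line of R turns a raw count into a readable figure (a small sketch; the number comes from the example above):

```r
x <- 31415926535
sprintf("%.0f billion", x / 1e9)  # "31 billion"
```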
The report points out anomalies or notable aspects of the data discovered over the course of the analysis. It discusses assumptions in the overall approach and examines how credible they are. It mentions ideas for extending or improving the analysis or the computation.