Reading and Discussion List for Stanford CS348K

Web site for most recent offering of the course: http://cs348k.stanford.edu/spring23

Lecture 1: Throughput Computing Review

Lecture slides

Post-Lecture Required Readings/Videos:

Students that have not taken CS149 or feel they need a refresher on basic parallel computer architecture should first watch this pre-recorded lecture that is similar to lecture 2 in CS149. It's a full 90 minutes so feel welcome to skip/fast-forward through the parts that you know. The technical content begins eight minutes in.
The Compute Architecture of Intel Processor Graphics Gen9. Intel Corporation
- This is not an academic paper, but a whitepaper from Intel describing the architectural geometry of a recent GPU. I'd like you to read the whitepaper, focusing on the description of the processor in Sections 5.3-5.5. Then, given your knowledge of the concepts discussed in the prerecorded video and in lecture 1 (multi-core, SIMD, multi-threading, etc.), I'd like you to describe the organization of the processor (using terms from the lecture, not Intel terms). For example:
  - What is the basic processor building block?
  - How many times is this block replicated for additional parallelism?
  - How many hardware threads does it support?
  - What width of SIMD instructions are executed by those threads? Are there different widths supported? Why is this the case?
  - Does the core have superscalar execution capabilities?
- Consider your favorite data-parallel programming language, such as GLSL/HLSL shading languages from graphics, CUDA, OpenCL, ISPC, NumPy, TensorFlow, or just an OpenMP #pragma parallel for. Can you think through how an "embarrassingly parallel" for loop can be mapped to this architecture? (You don't need to write this down in your writeup, but you could if you wish.)
- Note that an update to the Gen9 architecuture is Gen11, which you can read about here. (We chose to have to read the Gen9 whitepaper since it's a bit more detailed on the compute sections.)
- For those that want to go further, I also encourage you to read NVIDIA's V100 (Volta) Architecture whitepaper, linked in the "further reading" below. Can you put the organization of this GPU in correspondence with the organization of the Intel GPU? You could make a table contrasting the features of a modern AVX-capable Intel CPU, Intel Integrated Graphics (Gen9), NVIDIA GPUs (Volta, Ampere) etc. Hint: here are some diagrams from CS149: here and here.

Other Recommended Readings:

Volta: Programmability and Performance. Hot Chips 29 (2017)
- This Hot Chips presentation documents features in NVIDIA Volta GPU. Take a good look at how a chip is broken down into 80 streaming multi-processors (SMs), and that each SM can issue up to 4 warp instructions per clock, and supports up to concurrent 64 warps. You may also want to look at the NVIDIA Volta Whitepaper.
The Story of ISPC. Pharr (2018)
- Matt Pharr's multi-part blog post is an riveting description of the history of ISPC, a simple, and quite useful, language and compiler for generating SIMD code for modern CPUs from a SPMD programming model. ISPC was motivated by the frustration that the SPMD programming benefits of CUDA and GLSL/HLSL on GPUs could easily be realized on CPUs, provided applications were written in a simpler, constrained programming system that did not have all the analysis challenges of a language like C/C++.
Scalability! But at What COST? McSherry, Isard, and Murray. HotOS 2015
- The arguments in this paper are very consistent with the way we think about performance in the visual computing domain. In other words, efficiency and raw performance are different than "scalable".

Lecture 2: Digital Camera Processing Pipeline (Part I)

Lecture slides

Post-Lecture Required Readings:

What Makes a Graphics Systems Paper Beautiful. Fatahalian (2019)
- A major theme of this course is "thinking like a systems architect". This short blog post discusses how systems artitects think about the intellectual merit and evaluation of systems. Read the blog post, and if you have time, click through to some of the paper links. These are the types of issues, and the types of systems, we will be discussing in this class. The "real" reading of the night is Google HDR+ paper listed below, and I want you to analyze that paper's arguments in the terms of the goals and constraints underlying the design of the camera application in Google Pixel smartphones.
Burst Photography for High Dynamic Range and Low-light Imaging on Mobile Cameras. Hasinoff et al. SIGGRAPH Asia 2016
- A bit of background: This paper addresses a common concern in photography that all of us have probably encountered before: the tension between generating images that are sharp (but noisy due too low of light, or distorted by using a flash) and images that are blurry but accurately depict the lighting of a scene. If the scene is dark and the photographer uses a short exposure (to allow light to flow into the camera for only a short period of time), then the image will be sharp, but too noisy to look good. (Remember in class we talked about a few sources of noise: photon noise, sensor noise, etc.). If the photographer uses a long exposure to allow more light to enter the camera, then as objects in the scene move (or the camera moves... e.g., camera shake when hand-holding a camera) over the duration of the exposure, the result is a blurry image. You can see examples of blurry images from long exposures here and here. One way to solve the problem is to use a flash, but if you turn on your camera's flash, you'll lose the original look and feel of the scene. No one likes a photo of them taken with flash! When researchers first started studying ways to use computation to improve images, the thinking was that if you could take two pictures... a blurry long exposure image, and a sharp low exposure image, then there might be a way to combine the two images to produce a single good photograph. Here's one example of this technique, where there is a blurred long exposure to capture true colors and a second short exposure taken with flash to get a sharp version of the scene. Another variant of this technique, called "bracketing", applies the idea to the problem of avoiding oversaturation of pixels. You can learn more about bracketing at these links.
  - https://www.photoup.net/best-practices-for-shooting-bracketed-images-for-hdr/
  - https://www.youtube.com/watch?v=54JXUJhvFSs
- This is a very technical paper. But don't worry, your job is not to understand all the technical details of the algorithms, it is to approach the paper with a systems mindset, and think about the end-to-end considerations that went into the particular choice of algorithms. In general, I want you to pay the most attention to Section 1, Section 4.0 (you can ignore the detailed subpixel alignment in 4.1), Section 5 (I will talk about why merging is done in the Fourier domain in class), and Section 6. Specifically, as you read this paper, I'd like you think about the following issues:
- Any good system typically has a philosophy underlying its design. This philosophy serves as a framework for which the system architect determines whether design decisions are good or bad, consistent with principles or not, etc. Page 2 of the paper clearly lists some of the principles that underlie the philosophy taken by the creators of the camera processing pipeline at Google. For each of the four principles, given an assessment of why the principle is important.
- Designing a good system is about meeting design goals, subject to certain constraints. (If there were no constraints, it would be easy to use unlimited resources to meet the goals.) What are the major constraints of the system? For example are there performance constraints? Usability constraints? Etc.
- The main technical idea of this paper is to combine a sequence of similarly underexposed photos, rather than attempt to combine a sequence of photos with different exposures (the latter is called “bracketing”). What are the arguments in favor of the chosen approach? Appeal to the main system design principles.
- Why is the motivation for the weighted merging process described in Section 5? Why did the authors not use the simpler approach of just adding up all the aligned images? (Can you justify the designer's decision by appealing to one of the stated design principles?)
- Finally, take a look at the “finishing” steps in Section 6. Many of those steps should sound familiar to you after today’s lecture (or for sure after lecture 3).

Other Recommended Readings:

The old Stanford CS448A course notes remain a very good reference for camera image processing pipeline algorithms and issues.
Clarkvision.com has some very interesting material on cameras.
Demosaicking: Color Filter Array Interpolation. Gunturk et al. IEEE Signal Processing Magazine, 2005
Unprocessing Images for Learned Raw Denoising. Brooks et al. CVPR 2019
A Non-Local Algorithm for Image Denoising. Buades et al. CVPR 2005
A Gentle Introduction to Bilateral Filtering and its Applications. Paris et al. SIGGRAPH 2008 Course Notes
A Fast Approximation of the Bilateral Filter using a Signal Processing Approach. Paris and Durand. MIT Tech Report 2006 (extends their ECCV 2006 paper)

Lecture 3: Digital Camera Processing Pipeline (Part II)

Lecture slides

Post-Lecture Required Readings:

The Frankencamera: An Experimental Platform for Computational Photography. A. Adams et al. SIGGRAPH 2010
- Frankencamera was a paper written right about the time mobile phone cameras were becoming “acceptable” in quality, phones were beginning to contain a non-trivial amount of compute power, and computational photography papers were an increasingly hot topic in the SIGGRAPH community. (Note that the Frankencamera paper predates the HDR+ paper by six years.) At this time many compelling image processing and editing techniques were being created, and many of them revolved around generating high-quality photographs from a sequence of multiple shots or exposures. However, current cameras at the time provided a very poor API for software applications to control the camera hardware and its components. Many of the pieces were there for a programmable camera platform to be built, but someone had to attempt to architect a coherent system to make them accessible. Frankencamera was an attempt to do that: It involved two things:
  - The design of an API for programming cameras (a mental model of an abstract programmable camera architecture).
  - And two implementations of that architecture: an open camera reference design, and an implementation on a Nokia smartphone.
- When you read the paper, we’re going to focus on the abstract "architecture" presented by a Frankencamera. Specifically I’d like you to think about the following:
  1. I’d like you to describe the major pieces of the Frankcamera abstract machine (the system’s nouns): e.g., devices, sensors, processors, etc. Give some thought to why a sensor is not just any other device? Is there anything special about the sensor?
  2. Then describe the major operations the machine could perform (the system’s verbs). For example, in your own words, what is a "shot"? Would you say a shot is a command to the abstract machine? Or is a shot a set of commands? What do you think about the word “timeline” as a good word to describe what a shot is “shot”?
  3. What output does executing a shot generate? How is a "frame" different from a shot? Why is this distinction made by the system?
  4. Would you say that F-cam is a “programmable” camera architecture or a “configurable architecture”. What kinds of “programs” does the abstract machine run? (Note/hint: see question 2)
  5. It's always good to establish the scope of what a system is trying to do. In this case, how would you characterize the particular type of computational photography algorithms that F-cam seeks to support/facilitate/enable?
- Students may be interested that vestiges of ideas from the Frankencamera can now be seen in the Android Camera2 API: https://developer.android.com/reference/android/hardware/camera2/package-summary

Other Recommended Readings:

Synthetic Depth-of-Field with a Single-Camera Mobile Phone. Wadha et al. SIGGRAPH 2018.
- This is a paper about the implementation of "Portrait Mode" in Google Pixel smartphones. It's a dense paper, similar to the HDR+ paper from 2016, but it is a detailed description of how the system works under the hood.
Handheld Mobile Photography in Very Low Light. Liba et al. SIGGRAPH Asia 2019
- This is a paper about the implementation of "Night Sight" in Google Pixel smartphones.
Exposure Fusion. Mertens et al. Computer Graphics and Applications, 2007
Local Laplacian Filters: Edge-aware Image Processing with a Laplacian Pyramid. Paris et al. SIGGRAPH 2011
The Laplacian Pyramid as a Compact Image Code. Burt and Adelson, IEEE Transactions on Communications 1983.

Lecture 4: Digital Camera Processing Pipeline (Part III)

Lecture slides

For the required reading for the next class, please see required readings under lecture 5. During we continued discussion of digital camera processing pipeline topics. For suggested going further readings, please see the list of readings given under Lecture 4 .

Lecture 5: Efficiently Scheduling Image Processing Algorithms

Lecture slides

Pre-Lecture Required Reading: (to read BEFORE lecture 5)

Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. Ragan-Kelley, Adams, et al. PLDI 2013
- Note: Alternatively you may read the selected chapters in the Ragan-Kelley thesis linked below in recommended readings. (Or the CACM article.) The thesis chapters involve a little more reading than the paper, but in my opinion is a more accessible explanation of the topic, so I recommend it for students.
- In reading this paper, I want you to specifically focus on describing the philosophy of Halide. Specifically, if we ignore the "autotuner" described in Section 5 of the paper (you can skip this section if you wish), what is the role of the Halide programmer, and what is the role of the Halide system/compiler?
  - Hint 1: Which component is responsible for major optimization decisions?
  - Hint 2: Can a change to a schedule change the output of a Halide program?
- Let's consider what type of programmer Halide provides the most value for. Ignoring the autoscheduler (and just considering the language algorithm expression language and scheduling language), what class of programmer do you think is the target of Halide? Novices? Experts? CS149 students? Why?
- In your own words, in two sentences or less, attempt to summarize what you think is the most important idea in the design of Halide?
- Advanced question: In my opinion, there is one major place where the core design philosophy of Halide is violated. It is described in Section 4.3 in the paper, but is more clearly described in Section 8.3 of the Ph.D. thesis. (see sliding window optimizations and storage folding). Why do you think am I claiming this compiler optimization is a significant departure from the core principles of Halide? (there are also valid arguments against my opinion.)
  - Hint: what aspects of the program’s execution is not explicitly described in the schedule in these situations?

Post-Lecture Required Reading: (to read AFTER lecture 5)

Learning to Optimize Halide with Tree Search and Random Programs. Adams et al. SIGGRAPH 2019
- This paper documents the design of the modern autoscheduling algorithm that is now implemented in the Halide compiler. This is quite a technical paper, so I recommend that you adopt the "read for high-level understanding first, then dive into some details" reading strategy I suggested in class. Your goal should be to get the big points of the paper, not all the details.
- The back-tracking tree search used in this paper is certainly not a new idea (you might have implemented algorithms like this in an introductory AI class), but what was interesting was the way the authors formulated the scheduling problem as a sequence of choices that could be optimized using tree search. Please summarize how scheduling a Halide program is modeled as a sequence of choices.
  - Note: one detail you might be interested to take a closer look at is the "coarse-to-fine refinement" part of Section 3.2. This is a slight modification to a standard back tracking tree search.
- An optimizer's goal is to minimize a cost. In the case of this paper, the cost is the runtime of the scheduled program. Why is a machine-learned model used to predict the scheduled program's runtime? Why not just compile the program and run it on a machine?
- The other interesting part of this paper is the engineering of the learned cost model. This part of the work was surprisingly difficult. Observe that the authors do not present an approach based on end-to-end learning where the input is a Halide program DAG and the output is an estimated cost, instead they use traditional compiler analysis of the program AST to compute a collection of program features, then what is learned is how to weight these features when estimating cost (See Section 4.2). For those of you with a bit of deep learning background, I'm interested in your thoughts here. Do you like the hand-engineered features approach?

Other Recommended Readings:

Decoupling Algorithms from the Organization of Computation for High Performance Image Processing. Ragan-Kelley (MIT Ph.D. thesis, 2014)
- Please read Chapters 1, 4, 5, and 6.1 of the thesis
- You might also find this CACM article on Halide from 2018 an accessible read.
Halide Language Website (contains documentation and many tutorials)
Check out this useful Youtube Video on Halide scheduling
Differentiable Programming for Image Processing and Deep Learning in Halide. Li et al. SIGGRAPH 2018
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Chen et al. OSDI 2018
- TVM is another system that provides Halide-like scheduling functionality, but targets ML applications. (See Section 4.1 in the paper for a description of the schedule space)
Learning to Optimize Tensor Programs. Chen et al. NIPS 2018

Lecture 6: Efficient DNN Inference (Part I)

Lecture slides

Other Recommended Readings:

Stanford CS231: Convolutional Neural Networks for Visual Recognition.
- If you haven't taken CS231N, I recommend that you read through the lecture notes of modules 1 and 2 for very nice explanation of key topics.
An Introduction to different Types of Convolutions in Deep Learning, by Paul-Louis Pröve (a nice little tutorial)
Going Deeper with Convolutions, Szegedy et al. CVPR 2015 (this is the Inception paper).
- You may also enjoy reading this useful blog post about versions of the Inception network.
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. Howard et al. 2017
What is the State of Neural Network Pruning?, Blalock et al. MLSys 2020
- This paper is a good read even if you are not interested in DNN pruning, because the paper addresses issues and common mistakes in how to compare performance-oriented academic work.

Lecture 7: Efficient DNN Inference (Part II)

Lecture slides

Post-Lecture Required Reading:

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Dao et al. NeurIPS 2022.
- This recent paper applied a clever trick that allows the sequence of operations (a matrix multiply, a softmax, and then another matrix multiply) in an "attention" block to be able to be fused into a single memory-bandwidth efficient operation. The result of this trick is not only better performance, but that it became practical to run sequence models on larger sequences. As shown in the paper, the ability to provide longer sequences (wider context for reasoning) lead to higher quality model output.
- Your job in your write is to address one thing. In your own words, describe how the algorithm works to correctly compute softmax(QK^T)V block by block. Specifically:
  - Start by making sure you understand the expression for computing softmax(x) for a vector x. Convince yourself that the basic formula suggestions you need to read all elements of the vector before you can compute the first element of softmax(x).
  - Now make sure you understand the factored form of the softmax, which is rwritten in terms of x1 and x2, where x1 and x2, are slices of the original vector x.
  - Now apply this understanding to the blocked algorithm. Line 9 of Algorithm 1 generates a partial result for a subblock of S_{ij}=(Q_i K_j^T). Then line 10 computes statistics of the rows of S_{ij}. Offer a high level explanation of why the subsequent lines correctly compute the desired result. The proof on page 23 in Appendix C works out all the math. Pretty cool, right?

Other recommended Readings:

NVIDIA CUTLASS Github repo
NVIDIA CuDNN Documentation
Facebook Tensor Comprehensions
- The associated Arxiv paper is Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions, Vasilache et al. 2018.

Lecture 8: DNN Hardware Accelerators

Lecture slides

Pre-Lecture Required Reading: (to read BEFORE lecture 8)

In-Datacenter Performance Analysis of a Tensor Processing Unit. Jouppi et al. ISCA 2017
- Like many computer architecture papers, the TPU paper includes a lot of facts about details of the system. I encourage you to understand these details, but try to look past all the complexity and try and look for the main lessons learned: things like motivation, key constraints, key principles in the design. Here are the questions I'd like to see you address.
- What was the motivation for Google to seriously consider the use of a custom processor for accelerating DNN computations in their datacenters, as opposed to using CPUs or GPUs? (Section 2)
- I'd like you to resummarize how the matrix_multiply operation works. More precisely, can you flesh out the details of how the TPU carries out the work described in this sentence at the bottom of page 3: "A matrix operation takes a variable-sized B256 input, multiplies it by a 256x256 constant weight input, and produces a B256 output, taking B pipelined cycles to complete".
- We are going to talk about the "roofline" charts in Section 4 during class. These graphs plot the max performance of the chip (Y axis) given a program with an arithmetic intensity (X -- ratio of math operations to data access). How are these graphs used to assess the performance of the TPU and to characterize the workload run on the TPU?
- Section 8 (Discussion) of this paper is an outstanding example of good architectural thinking. Make sure you understand the points in this section as we'll discuss a number of them in class. Particularly for us, what is the point of the bullet "Pitfall: Architects have neglected important NN tasks."?

Other Recommended Readings:

NVIDIA Tensor Core
Google TPU v3
Cerebras WSE
- HotChips: https://hotchips.org/assets/program/tutorials/HC2020.Cerebras.NataliaVassilieva.v02.pdf
- Corporate Whitepaper... https://cerebras.net/resources/achieving-industry-best-ai-performance-through-a-systems-approach/
NVIDIA DLA (open source)
- http://nvdla.org
GraphCore IPU
- https://www.graphcore.ai/products/ipu
Microsoft Brainwave
- https://www.microsoft.com/en-us/research/uploads/prod/2018/03/mi0218_Chung-2018Mar25.pdf
SambaNova's Cardinal (very little public documentation)
- https://sambanova.ai/
Apple's ML accelerator in their M12 chip
Anything from companies on this list ;-)
- https://www.crn.com/slide-shows/components-peripherals/the-10-coolest-ai-chip-startups-of-2020>
Academic efforts, like:
- SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks, Parashar et al. ISCA 2017
- EIE: Efficient Inference Engine on Compressed Deep Neural Network, Han et al. ISCA 2016
- vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design, Rhu et al. MICRO 2016
- Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Network, Chen et al. ISCA 2016
An excellent survey organizing different types of designs:
- Efficient Processing of Deep Neural Networks: A Tutorial and Survey, Zhe et al. IEEE 2017

Lecture 9: System Support for Curating Training Data

Lecture slides

Pre-Lecture Required Reading: (to read BEFORE lecture 9)

There are two required readings for this lecture. Use the second reading to supplement the first. The prompt questions are shared across the readings.

Snorkel: Rapid Training Data Creation with Weak Supervision. Ratner et al. VLDB 2017.
Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale. Bach et al. SIGMOD 2019
- This is a paper about the deployment of Snorkel at Google. Pay particular attention to the enumeration of "core principles" in Section 1 and the final "Discussion" in section 7. Skimming this paper in conjunction with the first required reading is recommended.
- You also might want to check out the Snorkel company blog for more digestible examples.
Prompt questions:
- First let's get our terminology straight. What is meant by "weak supervision", and how does it differ from the traditional supervised learning scenario where a training procedure is given a training set consisting of (1) a set of data examples and (2) a corresponding set of "ground truth" labels for each of these examples?
- Like in all systems, I'd like everyone to pay particular attention to the design principles described in Section 1 of the Ratner et al. paper, as well as the principles defined in the Drybell paper. If you had to boil the entire philosophy of Snorkel down to one thing, what would you say it is? Hint: look at principle 1 in the Ratner et al. paper. "If accurate labels are expensive, but data is cheap, find a way to use all the sources of __________ you can get your hands on."
- Previously humans injected their knowedge by labeling data examples. Under the Snorkel paradigm, now humans inject their knowledge now?
- The main abstraction in Snorkel is the labeling function. Please describe what the output interface of a labeling function is. Then, and most importantly, what is the value of this abstraction. Hint: you probably want to refer to the key principle of Snorkel in your answer.
- What is the role of the final model training part of Snorkel? (That is, training an off-the-shelf DNN architecture on the probabalistic labels produced by Snorkel.) Why not just use the probablistic labels as the model itself?
- One interesting aspect of Snorkel is the notion of learning from "non-servable assets". (This is more clearly articulated in the Snorkel DryBell paper, so take a look there). This is definitely not an issue that would be high on the list of academic concerns, but it is quite important in indusrty. What is a non-servable asset and how does the Snorkel approach make it possible to "learn from" non-servable assets even if they are not available at later runtimes.
- From the ML perspective, the key technical insight of Snorkel (and of most follow on Snorkel papers, see here, here, and here for some examples) is the mathematical modeling of correlations between labeling functions in order to more accurately estimate probabilistic labels. We will not talk in detail about these generative algorithms in class, but many of you will enjoy learning about them.
- In general, I'd like you to reflect on Snorkel and (if time) some of the recommended papers below. (see the Rekall blog post, or the Model Assertions paper for different takes.) I'm curious about your comments.

Other Recommended Readings:

Accelerating Machine Learning with Training Data Management. Alex Ratner's Stanford Ph.D. Dissertation (2019)
- This is the thesis that covers Snorkel and related systems. As was the case when we studied Halide, it can often be helpful to read Ph.D. theses, since they are written up after the original publication on a topic, and often include more discussion of the bigger picture, and also the widsom of hindsight. I highly recommend Alex's thesis.
Two related papers:
- Overton: A Data System for Monitoring and Improving Machine-Learned Products, Ré et al. 2019
- Ludwig: a type-based declarative deep learning toolbox, Molino et al. 2019
- Often when you hear about machine learning abstractions, we think about ML frameworks like PyTorch, TensorFlow, or MX.Net. Instead of having to write key model ML layers yourself, these frameworks present the abstraction of a ML graph "operator", and allow model creation by composing operators into DAGs. However, the abstraction of designing models by wiring up data flow graphs of operators is still quite low. One might characterize these abstractions as being targeted for an ML engineer---someone who has taken a lot of ML classes, and has experience implementing model architectures in TensorFlow, experience selecting the right model for the job, or with the know-how to adjust hyperparameters to make training successful.
  The two papers for our discussion are efforts to begin to raise the level of abstraction even higher. These are systems that emerged out of two major companies (Overton out of Apple, and Ludwig out of Uber), and they share the underlying philosophy that some of the operational details of getting modern ML to work can be abstracted away from users that simply want to use technologies to quickly train, validate, and continue to maintain accurate models. In short these two systems can be thought of as different takes on Karpathy’s software 2.0 argument, which you can read in this Medium blog post. I’m curious about your thoughts on this post as well!
Model Assertions for Monitoring and Improving ML Models. Kang et al. MLSys 2020
- A similar idea here, but with different human-provided priors about how to turn existing knowledge in a model into additional supervision.
Rekall: Specifying Video Events using Compositions of Spatiotemporal Labels, Fu et al. 2019
- In the context of Snorkel, Rekall could be viewed as a system for writing labeling functions for learning models for detecting events in video. Alternatively, from a databases perspective, Rekall can be viewed as a system for defining models by not learning anything at all -- and just having the query itself be the model.
- Blog post: https://dawn.cs.stanford.edu/2019/10/09/rekall/, code: https://github.com/scanner-research/rekall
Waymo's recent blog post on image retrieval systems as data-curation systems, Guo et al 2020.
Selection via Proxy: Efficient Data Selection for Deep Learning, Coleman et al. ICLR 2020.
The unsupervised or semi-supervised learning angle. We'd be remiss not to talk about the huge body of work that attempts to reduce the amount of labeled training data required using unsupervised learning techniques.
- Language Models are Few-Shot Learners. Brown et al. NeurIPS 2020. (The GPT-3 paper)
- A Simple Framework for Contrastive Learning of Visual Representations. Chen et al. ICLM 2020 (The SimCLR paper)
- Unsupervised Learning of Visual Features by Contrasting Cluster Assignments. Caron et al. NeurIPS 2020. (The SwAV paper)
- Data Distillation: Towards Omni-Supervised Learning, Radosavovic et al. CVPR 2018

Lecture 10: Video Compression (Traditional and Learned)

Lecture slides

Post-Lecture Required Reading:

Warehouse-scale video acceleration: co-design and deployment in the wild. Ranganathan et al. ASPLOS 2021

This is a recent paper about the deployment of fixed-function hardware for video encoding and decoding in Google datacenters (Youtube, Google Photos, etc). I thought it was a great example of a systems paper describing goals, constraints, and cross-cutting issues. Please address the following in your summary:

An interesting stat from the paper was that it takes over one CPU-hour of compute to encode 150 frames of 2160p video using VP9 (top of Section 4.5). State why companies like Google care so much about spending large amounts of computation to achieve very high quality (or similarly, very low bitrate) video encoding. Why does more computation help? (the last part is a review from lecture).
Please describe the reasons why Youtube must encode a video multiple times and at many resolutions/quality settings? And why is multiple output target (MOT) encoding a useful approach in this environment? Hint, consider low resolution output versions of a high-resolution input video.)
An interesting paragraph towards the end of Section 2 (see header "Video Usage Patterns at Scale") broke the workload down into videos that are viral and highly watched (constituting most of the watch time), videos that are modestly watched, and videos that are rarely watched at all. For each of these categories, describe whether you think the VCU is a helpful platform that that type of video.
We all have laptops and mobile devices with hardware accelerators for video encoding, but the requirements for accelerator hardware in a datacenter are quite different. What were some of the most interesting differences to you? What are opportunities that the datacenter case affords that might not be there in the case of hardware for consumer devices?
In Section 3.3 there is an interesting analysis of how much bandwidth is needed for the VCU hardware to avoid being BW-bound. What is the role of custom data compression hardware on the VCU?
There are key sections talking about the stateless design of the encoder. (End of 3.2). Give some reasons why, in a datacenter environment, the stateless approach is beneficial.

Other Recommended Readings:

Overview of the H.264/AVC Video Coding Standard. Wiegand et al. IEEE TCSVT '03
vbench: Benchmarking Video Transcoding in the Cloud. Lottarini et al. ASPLOS 18
Salsify: Low-Latency Network Video through Tighter Integration between a Video Codec and a Transport Protocol. Fouladi et al. NSDI 2018
Encoding, Fast and Slow: Low-Latency Video Processing Using Thousands of Tiny Threads. Fouladi et al. NSDI 17
Gradient-Based Pre-Processing for Intra Prediction in High Efficiency Video Coding. BenHajyoussef et al. 2017
The Design, Implementation, and Deployment of a System to Transparently Compress Hundreds of Petabytes of Image Files for a File-Storage Service. Horn et al. 2017.
Neural Adaptive Content-Aware Internet Video Delivery. Yeo et al OSDI 18.
Learning Binary Residual Representations for Domain-specific Video Streaming. Tsai et al. AAAI 18

Lecture 11: The Future of Video Conferencing Systems

Lecture slides

Post-Lecture Required Reading:

Beyond Being There. Hollan and Stornetta. CHI 1992

This is a classic paper from the early human-computer interaction community that challenges the notion that recreating reality in digital form should be the "north star" goal of the design of virtual environments. We're now 30 years past the paper, and technology and our ways of communicating using technology has progressed significantly, but one might argue given all the recent talk about "The Metaverse" that technologists may still be making the same mistakes. Please address the following in your summary:

In the section titled "Being There", the authors provide a "crutch vs. shoes" analogy. Consider modern video-conferencing systems you might use on a daily basis, and does the analogy apply?
I'd like you to reflect on your usage of Zoom/Teams/Meet/Slack/Discord/etc. and consider what are the features of these systems that cause you to choose to use them. Consider the modalities of text, audio, and video. When and why do you choose to use each? Are there situations where too much resolution or too much capture fidelity hurts the experience of communicating?
Now think about a topic that you are all experts in: online/virtual classes at Stanford. Can you think of examples where professors used the pandemic as an opportunity to communicate in a way that was "better than being there"? Can you think of situations where classes tried to replicate the experience of "being there" and it fell flat? Comment on how you think the themes of "beyond being there" might apply to virtual education and teaching?
The authors hypothesize: "What if we were able to create communication tools that were richer than face-to-face?" But of course, this was back in 1992. Today we have technologies and systems that arguably answer this question in the positive. What tools, platforms, systems come to mind to you?
The idea of intersubjectivity discussed in the document is interesting. What are ways that current tools fail in this regard and how might they be improved? Note: one good example of a success in the intersubjectivity department is the simple animated "dot dot dot" indicating someone on the other side of a DM conversation is typing. Perhaps one failure example is how little information we get out of the online/offline indicator on Slack.
So is Facebook, er. I mean Meta, right? Is a high-fidelity VR social experience going to change the way we work with others? Or will the next big innovation in how with communicate be something more like Instastram, TikTok, or WhatsApp?

Recommended Readings:

SVE: Distributed Video Processing at Facebook Scale. Huang et al. SOSP 2017
Nonverbal Overload: A Theoretical Argument for the Causes of Zoom Fatigue. Bailenson 2021.
Social Translucence: An Approach to Designing Systems that Support Social Processes. Erickson and Kellogg. TOCHI 2000.

Lecture 12: The Light Field and NeRF Preliminaries

Lecture slides

Recommended Readings:

Light Field Rendering. Levoy and Hanrahan 1996.
Jump: Virtual Reality Video. Anderson et al. SIGGRAPH Asia 2016
Instant 3D Photography. Hedman and Kopf. SIGGRAPH 2018

Lecture 13: The NeRF Explosion

Lecture slides

Recommended Readings:

Neural Volumes: Learning Dynamic Renderable Volumes from Images. Lombardi et al. SIGGRAPH 2019
Representing Scenes as Neural Radiance Fields for View Synthesis. Mildenhall et al. ECCV 2020
- This is the original NeRF paper
PlenOctrees for Real-time Rendering of Neural Radiance Fields. Yu et al ICCV 2021
Plenoxels: Radiance Fields without Neural Networks. Fridovich-Keil et al. CVPR 2022
Instant Neural Graphics Primitives with a Multiresolution Hash Encoding. Müller et al. SIGGRAPH 2022
Neural Geometric Level of Detail: Real-time Rendering with Implicit 3D Shapes. Takikawa et al. CVPR 2021
Block-NeRF: Scalable Large Scene Neural View Synthesis. Tancik et al. CVPR 2022

Lecture 14: Simulating Virtual Worlds to Train Agents

Lecture slides

Pre-Lecture Required Reading:

Generative Agents: Interative Simulacra of Human Behavior. Park et al. CHI 2023

Generating plausible agents that behave "like humans" has long been an interest of video game designers seeking to create non-playable characters. But agents that behave realistically have many other applications as well: they can serve as proxies for software testers to find bugs in games or help designers assess the playability or difficulty of game levels. If we think more broadly, behavior that emerges from many agents performing plausible tasks over time in a simulated world can potentially give rise to global phenomenon such as organization of teams or the creation of empires (as anyone that's played games like The Sims might have experienced! :-)) This paper is about designing simulated agents that leverage queries to large-language models (e.g. ChatGPT) to produce interesting behavior without significant hand-coded logic or programmed rules. This paper touches on a number of themes from the course, and I'd like you to think about the following questions:

First let's start with some technical details. The paper's experiments are performed in a small "Sims"-like work called Smallville. The key subroutine used by agents in this paper is a query to a stateless large language model (LLM). For those of you that have used ChatGPT or similar systems like Google's Bard, just picture this module working like those systems. The input query is a text string of finite length (e.g., a few thousand characters), and the output of the LLM is text string response. It's easy to picture how to code-up a "bot" to operate within Smallville (use game APIs to move to place X, turn stove to "on", etc.), and it's easy to understand how one could generate prompts for an LLM and receive responses, the agents described in this paper need to translate the text string responses from the LLM to agent actions in the game. What is the mechanism for turning LLM responses into agent actions in the game? (For example, if the agent is in a bedroom and the LLM says the character should clean up the kitchen, how does the agent turn this direction into actions in the game?) This is discussed in Section 5.
The paper hypothesizes that this stateless module (with small, finite inputs) will be insufficient for creating characters that behave over long time scales in a consistent and rational way. Summarize the reasons for this challenge? (hint: consider continuity)
To address the challenge described above, the paper's solution is to "summarize" a long history of the agent into a finite-length input for the LLM. There are two parts to this approach. The first is the "memory stream". Describe what the memory stream's purpose is in the agent architecture. Then describe how retrieval is used to select what data fromt the memory stream should be used in each query.
Of course, over a long simulation, enough activity happens to an agent that a memory stream grows quite long. One way to address this might be to ask ChatGPT to summary a long text string into a shorter one. But the authors go with a different approach that they call reflection. How is reflection implemented and give your thoughts on this approach, which indeed is a form of summarization of the memory stream.
Ideas in a paper can sometimes sound really interesting, but then you get to the evaluation section and realize that the cool ideas aren't really that helpful. This is a particular hard piece of work to evaluate, and I'd like you to take a detailed look at the evaluation sections (Section 6 and 7). What do you think? Do you believe that important aspects of the agent architecture have merit?

Post-Lecture Required Reading:

An Extensible, Data-Oriented Architecture for High-Performance, Many-World Simulation. Shacklett et al. SIGGRAPH 2023

If you were to create a computer game today, you'd probably not write the entire game from scratch. Instead, you'd choose to write your game using the APIs provided by a game engine framework, such as e.g., Unity, Unreal, or Phaser) because it would be not only far more productive to do so, but also because you'd probably not be able to implement key parts of your game (like advanced rendering, physics, input collection from controllers, etc) as well as domain experts in these areas. In other words, existing engines provide valuable, well-implemented building blocks for creating a game, which allows game developers to focus on implementing the logic and content specific to their game (specific game rules, creating worlds, etc.).

In this paper, the authors observed that there was an emerging need to create simulators that execute at very high throughput when running a "batch" of thousands of independent instances of a world. Early examples of these "batch simulators" can be found here for Atari games (github here), robotics physics simulations, and navigation of indoor environments. These batch simulators were all written from scratch. This paper is based on a simple claim: it is likely that in the near future there will be a need to make many more unique simulators for training agents like computer game bots, robots, etc., and that not everyone that wants to create a simulator will an expert at high-performance programming on GPUs. Therefore, there should be a "game framework" for high performance batch simulators.

As you read the paper, please respond to the following questions:

As always, make sure you read and understand the requirements and goals of the system as presented in sections 1 and 2. The paper lists these goals clearly, and please make sure you understand all of them, but I'd like you to focus your response on the "PERFORMANCE" goal. Specifically, performance in this paper does not mean "run in parallel on the GPU", it means "efficiently run in parallel on the GPU". Think back to your CS149 principles of SIMD execution and efficient memory access (we discussed this in lecture 1 as well as the "homework" reading in the first week), what are the key ideas in this paper that pertain specifically to the "high performance" goal?
- Hint: GPUs perform best when they can execute the same instruction across many threads AND, when those threads access data, high-performance GPU memory performs best when adjacent threads are accessing adjacent memory addresses.
There are two major abstractions in the presented system: components and the computation graph (there's a good description in Section 4). Let's focus on components first. How are components for all worlds stored in a single table. Give at least one reason why this leads to high GPU performance. (See Section 5.1). Note, it may be helpful to dive into the description of how Madrona implements component deletion as a way to check your understanding of this part of the paper (see the text related to Figure 2.)
The authors choose to implement all game logic components as a computation graph (all instances use the same graph), and then execute the computation graph for ALL game instances by running a single graph node N for all instances, then moving on to the next graph node. Give at lease one reason why this leads to high GPU performance (possible answers lie in rest of Section 5).
This is a good paper to dig into the evaluation, so we can discuss in class what questions the evaluation is trying to answer. There are four configurations evaluated in Section 7, (BATCH-ECS-GPU, ECS-GPU, ECS-CPU, and REF-CPU). For each of the configurations, please provide an explanation of how the configuration is used in a comparison to make a claim about "X is better than Y". For example, the paper provides ECS-CPU as a good high-performance C++ implementation because comparing just BATCH-ECS-GPU to REF-CPU alone doesn't prove the ideas in the paper are the reason for the observed speedups. For example, with only that comparison, the speedup could be due to the faster speed of a GPU vs a CPU, or low performance of Python code vs. high oerformance CUDA code.
- Hint: the goal of a scientific paper evaluation is to show that the ideas in the paper have merit. It is not to show that the authors are better programmers than the programmers of prior work.
Ignoring robotics applications, are there good reasons to train AI agents for the purpose of making video games better? What might you do with a very high performance batch simulator?

Lecture 14: Diffusion-based Image Generation (Part I)

Pre-Lecture Required Preparation:

PabloRdrRbl/readings

Reading and Discussion List for Stanford CS348K

Lecture 1: Throughput Computing Review

Lecture 2: Digital Camera Processing Pipeline (Part I)

Lecture 3: Digital Camera Processing Pipeline (Part II)

Lecture 4: Digital Camera Processing Pipeline (Part III)

Lecture 5: Efficiently Scheduling Image Processing Algorithms

Lecture 6: Efficient DNN Inference (Part I)

Lecture 7: Efficient DNN Inference (Part II)

Lecture 8: DNN Hardware Accelerators

Lecture 9: System Support for Curating Training Data

Lecture 10: Video Compression (Traditional and Learned)

Lecture 11: The Future of Video Conferencing Systems

Lecture 12: The Light Field and NeRF Preliminaries

Lecture 13: The NeRF Explosion

Lecture 14: Simulating Virtual Worlds to Train Agents

Lecture 14: Diffusion-based Image Generation (Part I)