/cs420

KAIST CS420: Compiler Design (2020 Spring)

KAIST CS420: Compiler Design (2022 Spring)

Logistics

Course description

Context

Compilers bridge the gap between human and machine. Human wants to easily express complex idea. On the other hand, machine understands only a few words (instructions) to be efficiently implemented in silicon. Compilers transform programs from a form suitable for human to easily express complex idea, to a form suitable for machine to efficiently execute. Since the gap between human and machine is fundamentally wide, compilers have been constructed and widely used since the beginning of the history of computing. Even, the first practical compiler predates the first practical operating systems (according to Wikipedia)!

In response to industry shifts, new compilers should be written and written again. First, human wants to express more and more complex idea, especially in the era of artificial intelligence and big data. Second, machine changes in response to physics (e.g. the ending of Dennard scaling and Moore's law) and industrial needs (e.g. Internet of Things and distributed systems). New compilers should be constructed to close the new gap between changing human and changing machine. For this reason, industrial needs for (and salary of) compiler engineers have been constantly high.

Goal

In this class, we will learn how to construct a compiler by actually building one. You are going to benefit from the provided skeleton code of a clean slate educational compiler--dubbed KECC: KAIST Educational C Compiler (think: KENS for networking or Pintos, xv6 for operating systems). We are going to discuss parsing only briefly, because the topic is assumed to be dealt with in CS322: Formal Languages and Automata. (You don't need to know parsing to take this course, though.) We will focus on translation from human-friendly form to machine-friendly form, and compiler optimizations. Specifically, we will discuss (1) how to transform a C program to an SSA-based intermediate representation (IR); (2) how to perform register promotion, static single assignment, global value numbering, and register allocation optimizations on the IR; and (3) how to transform an IR program to a RISC-V assembly program. KECC will provide a significant amount of skeleton code so that you can focus on the topic of this course.

We will also briefly discuss the recent trends of compiler construction. I see two crucial recent trends: scripting languages and parallelism. (1) Scripting languages like JavaScript and Python, unlike C, should be compiled (or interpreted) at run-time, and therefore, there is no clear distinction of compile- and run-time. It is a challenge in that compile time should also be optimized, but it is also an opportunity in that compile may gather and benefit from run-time information. (2) It is crucial to exploit the massive parallelism of modern applications like deep learning and high-performance computing (HPC), because they require so huge computation. Due to the complexity of workloads, their parallelism should be automatically discovered and exploited by compilers, which is a big challenge.

We will also briefly study the theory of compiler. We will focus on the correctness of compiler. In general, in what sense a compiler is correct, and how to prove it? Specifically, how to prove the correctness of KECC's transformations and optimizations? As it will turn out, this compiler correctness theory will greatly help you efficiently build your own compiler.

Resources

Tools

Make sure you're capable of using the following development tools:

  • Git: for downloading KECC and version-controlling your development. If you're not familiar with Git, walk through this tutorial.

    • IMPORTANT: you should not expose your work to others. In particular, you should not fork the upstream and push there. Please the following steps:

      • Directly clone the upstream without forking it.

        $ git clone --origin upstream git@github.com:kaist-cp/kecc-public.git
        $ cd kecc-public
        $ git remote -v
        upstream	git@github.com:kaist-cp/kecc-public.git (fetch)
        upstream	git@github.com:kaist-cp/kecc-public.git (push)
      • To get updates from the upstream, fetch and merge upstream/main.

        $ git fetch upstream
        $ git merge upstream/main
    • If you want to manage your development in a Git server, please create your own private repository.

      • You may upgrade your GitHub account to "PRO", which is free of charge.
        Refer to the documentation

      • Set up your repository as a remote.

        $ git remote add origin git@github.com:<github-id>/kecc-public.git
        $ git remote -v
        origin	 git@github.com:<github-id>/kecc-public.git (fetch)
        origin	 git@github.com:<github-id>/kecc-public.git (push)
        upstream git@github.com:kaist-cp/kecc-public.git (fetch)
        upstream git@github.com:kaist-cp/kecc-public.git (push)
      • Push to your repository.

        $ git push -u origin main
  • Rust: as the language of homework implementation. We chose Rust because its ownership type system greatly simplifies the development of large-scale system software.

    We recommend you to read this page that describes how to study Rust.

  • Visual Studio Code (optional): for developing your homework. If you prefer other editors, you're good to go.

  • You can connect to server by ssh s<student-id>@cp-service.kaist.ac.kr -p13002, e.g., ssh s20071163@cp-service.kaist.ac.kr -p13002.

  • Compiler Explorer (optional): for comparing the result of KECC to LLVM IR or RISC-V assembly code. See this video for instructions.

    • NOTE: If you want to see LLVM IR code after using a specific optimization pass (e.g., mem2reg, gvn, etc.), please follow the steps:
      1. Uncheck Directives and Comments options in the filter menu at the top of the compiler window.
      2. Type your C code, and obtain LLVM IR code by compiling it using -O0 -Xclang -disable-O0-optnone -emit-llvm flags.
        • See this link for the description of each compilation flag.
      3. Copy LLVM IR code.
      4. Change the target language from C to LLVM IR and compiler from clang to opt.
      5. Paste LLVM IR code, and optimize it using -mem2reg(or -gvn) flag.

Prerequisites

  • It is strongly recommended that students already took courses on:

    • Mathematics (calculus MAS101, MAS102, or discrete mathematics CS204): proposition statement and proof
    • Data structures (CS206): linked list, stack, queue
    • Systems programming (CS230): memory layout and stack frame
    • Programming languages (CS320): lambda calculus, interpreter

    Without a proper understanding of these topics, you will likely struggle in this course.

  • Other recommendations which would help you in this course:

    • Basic understanding of computer architecture (CS311)
    • Programming experience in Rust

Grading & honor code

Homework (80%)

You will implement translations and optimizations on KECC. All homework submissions will be automatically graded online so that you can immediately see your score. If your compiler is correct and the generated assemblies perform comparably to those generated by gcc -O1, you're going to get A+ (if not A#) even if you miss the final exam.

Since compiler construction requires nontrivial undertaking, you're encouraged to ask questions on the homework in the issue tracker at the early stage of the semester.

Final exam (20%, 13th of June, 13:00-15:45)

The exam will evaluate your understanding of compiler theory. There will not be a midterm exam.

Attendance (?%)

You should submit a token to the Course Management website for each session. You should answer to the quiz by the end of the day.

Communication

Registration

Rules

  • Course-related announcements and information will be posted on the website as well as on the GitHub issue tracker. You are expected to read all announcements within 24 hours of their being posted. It is highly recommended to watch the repository so that new announcements will automatically be delivered to you email address.

  • Ask questions on course materials and assignments in this repository's issue tracker.

    • Don't send emails to the instructor or TAs for course materials and assignments.

    • Before asking a question, search it in Google and Stack Overflow.

    • Describe your question as detailed as possible. It should include following things:

      • Environment (OS, gcc, g++ version, and any other related program information).
      • Command(s) that you used and the result. Any logs should be formatted in code. Refer to this.
      • Any directory or file changes you've made. If it is solution file, just describe which part of the code is modified.
      • Googling result. Search before asking, and share the keyword used for searching and what you've learned from it.
    • Give a proper title to your issue.

    • Read this for more instructions.

    • I'm requiring you to ask questions online first for two reasons. First, clearly writing a question is the first step to reach an answer. Second, you can benefit from questions and answers of other students.

  • Ask your questions via email only if they are either confidential or personal. Any questions failing to do so (e.g. email questions on course materials) will not be answered.

  • We are NOT going to discuss new questions during the office hour. Before coming to the office hour, please check if there is a similar question on the issue tracker. If there isn't, file a new issue and start discussion there. The agenda of the office hour will be the issues that are not resolved yet.

  • Emails to the instructor or the head TA should begin with "CS420:" in the subject line, followed by a brief description of the purpose of your email. The content should at least contain your name and student number. Any emails failing to do so (e.g. emails without student number) will not be answered.

  • If we should go online, we'll meet at the Zoom room. your Zoom name should be <your student number> <your name> (e.g., 20071163 강지훈). Change your name by referring to this.

  • This course is conducted in English. But you may ask questions in Korean. Then I will translate it to English.