ICtoAIacclerators: A C++ repository from soumyadip1995

From ICs to AI Accelerators

Deep learning frameworks are still evolving, which makes it hard to design custom hardware for them. Reconfigurable devices such as field-programmable gate arrays (FPGAs) make it easier to evolve hardware, frameworks and software alongside each other. Designing robust hardware requires a broad base of knowledge, ranging from undergraduate CS fundamentals to embedded systems and, definitely, deep learning. It is becoming harder to find people who understand the full stack from first principles. This is a micro-curriculum that will help you understand the system stack, starting from the building blocks of ICs and ending at AI accelerators, from a first-principles perspective.

Note:- This is in no way complete. I will keep updating this repo.

Transistors and Digital Logic

Building blocks of ICs. Learn about transistors:- follow the book Microelectronic Circuits by Sedra and Smith. Cover BJTs, FETs and power transistors, plus some basic circuit theory, divided into sub-chapters. IC design wiki.
Sequential and combinational circuits. Synchronous and asynchronous design, register-transfer level (RTL), introduction to VLSI design, Verilog and VHDL.
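To make the combinational/sequential distinction above concrete, here is a minimal Python sketch rather than real HDL. The names `full_adder`, `ripple_add` and `Accumulator` are made up for illustration: the adder's outputs depend only on its current inputs, while the accumulator holds state that changes only on a clock tick.

```python
def full_adder(a, b, carry_in):
    """Combinational: outputs depend only on the current inputs."""
    partial = a ^ b                               # XOR gate
    total = partial ^ carry_in                    # XOR gate
    carry_out = (a & b) | (partial & carry_in)    # two AND gates and an OR gate
    return total, carry_out


def ripple_add(x, y, width=4):
    """Chain `width` full adders; returns (sum, final carry)."""
    carry, result = 0, 0
    for bit in range(width):
        s, carry = full_adder((x >> bit) & 1, (y >> bit) & 1, carry)
        result |= s << bit
    return result, carry


class Accumulator:
    """Sequential: a register whose value only changes on a clock tick."""

    def __init__(self, width=4):
        self.width = width
        self.value = 0

    def tick(self, data_in):
        # On each clock edge, latch value + data_in (wrapping on overflow).
        self.value, _ = ripple_add(self.value, data_in, self.width)
        return self.value
```

Calling `ripple_add(15, 1)` on the default 4-bit width wraps to a sum of 0 with a carry of 1, which is the kind of behaviour a Verilog version would need to reproduce.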

Microprocessors and Microcontrollers

8051
8085, 8086 - Basics, Instruction cycle
AVR family:- Arduino, and some basic projects such as an LCD controller, servo motors, sensors, etc.
PIC family:- some background knowledge.

FPGAs

Talk about how FPGAs are built - An FPGA from 7400s. The most basic building block of an FPGA is the cell, or slice. Talk about programmable look-up tables (LUTs).
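A programmable LUT is essentially a small memory indexed by its input bits. The sketch below is a rough Python model under that assumption; the `LUT` class and its interface are hypothetical and do not correspond to any vendor's primitive.

```python
class LUT:
    """An n-input look-up table: the truth table *is* the configuration."""

    def __init__(self, num_inputs, truth_table):
        if len(truth_table) != 2 ** num_inputs:
            raise ValueError("truth table needs 2**num_inputs entries")
        self.num_inputs = num_inputs
        self.table = list(truth_table)

    def __call__(self, *inputs):
        if len(inputs) != self.num_inputs:
            raise ValueError("wrong number of inputs")
        index = 0
        for bit in inputs:            # first input is the most significant bit
            index = (index << 1) | (bit & 1)
        return self.table[index]


# "Programming" the same hardware with different tables gives different gates.
xor2 = LUT(2, [0, 1, 1, 0])
and2 = LUT(2, [0, 0, 0, 1])
```

The point of the design is that `xor2` and `and2` are the same structure with different stored bits, which is what makes the fabric reconfigurable.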

Build Your Own FPGA
Internal functionality of FPGA look-up tables. Getting Started with FPGAs - https://www.allaboutcircuits.com/technical-articles/getting-started-with-fpgas-look-up-tables-and-flip-flops/

5 easy steps to building an embedded processor system inside an FPGA. Designing an FPGA from Scratch (38-part tutorial):- writing a software device driver and an application program to run on the system. Pick out a suitable development board.
All about FPGAs
How to get started with FPGA programming? What is FPGA programming?

Digital logic design - combinational and sequential circuits.
Verilog/VHDL language
Simulation - ModelSim
Synthesis and implementation with the Xilinx ISE Design Suite:- Xilinx ISE

ARM Architecture

Read about RISC architectures - RISC wiki
Learn about ARM organisation: the ARM core dataflow model, and the 3-stage (ARM7) and 5-stage (ARM9) pipelines. Explaining Pipelining in ARM Processors.
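As a rough illustration of the pipelines mentioned above, the following Python sketch schedules instructions through an idealised pipeline with no stalls or hazards, which is a simplifying assumption. The function names are hypothetical. With k stages and n instructions, the ideal cycle count is k + n - 1.

```python
ARM7_STAGES = ("fetch", "decode", "execute")
ARM9_STAGES = ("fetch", "decode", "execute", "memory", "write")


def pipeline_schedule(instructions, stages=ARM7_STAGES):
    """Map each cycle (1-based) to the (stage, instruction) pairs active in it."""
    schedule = {}
    for i, instr in enumerate(instructions):
        for s, stage in enumerate(stages):
            # Instruction i enters the first stage in cycle i + 1.
            schedule.setdefault(i + s + 1, []).append((stage, instr))
    return schedule


def total_cycles(num_instructions, num_stages):
    """Ideal cycle count for a stall-free pipeline."""
    if num_instructions == 0:
        return 0
    return num_stages + num_instructions - 1
```

For four instructions, the 3-stage schedule takes 6 cycles and the 5-stage one takes 8; the deeper pipeline pays off only because each stage does less work per cycle and can be clocked faster.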

RISC-V Architecture

Include material on the RISC-V architecture as well.

ARM CPU

ARM assembly basics tutorial series:- Writing ARM Assembly. Learn about the assembly language, data types and addressing modes. A good reading source is Computer Organization and Architecture by William Stallings. 32-bit ARM and 16-bit Thumb instruction sets.
ARM Assembly Language.
ARM Datatypes
ARM Addressing Modes
ARM Instruction Formats.
ARM processors and cores.
- ARM Cortex M
- ARM Cortex A

ARM Operating System

Operating System Overview
- Scheduling.
Memory Management
- Translation Lookaside Buffers
ARM memory management:- ARM Developer, Learn the Architecture. Download the full tutorial PDF.
ARM Linux distributions:- Linux ARM distros, ARM Linux Distributions wiki - https://en.wikipedia.org/wiki/Category:ARM_Linux_distributions
Building an MMU (Verilog, 1000):- ARM9, explain TLBs and other fun things. Maybe also a memory controller, depending on how the FPGA is, then add the init code to your bootloader.
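Before writing an MMU in Verilog, it can help to model the translation lookaside buffer in software. The sketch below assumes 4 KiB pages, a flat single-level page table stored as a dict, and LRU replacement. All of these are illustrative choices and not the ARM9 MMU design; the `TLB` class is hypothetical.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size for this sketch


class TLB:
    """A tiny fully associative TLB with LRU replacement."""

    def __init__(self, page_table, capacity=4):
        self.page_table = page_table      # virtual page number -> physical frame
        self.capacity = capacity
        self.entries = OrderedDict()      # cached translations, oldest first
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)         # mark as most recently used
        else:
            self.misses += 1
            if vpn not in self.page_table:
                raise KeyError(f"page fault at {vaddr:#x}")
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
            self.entries[vpn] = self.page_table[vpn]   # page table walk
        return self.entries[vpn] * PAGE_SIZE + offset
```

A hardware TLB does the `vpn in self.entries` comparison against all entries in parallel; the dict lookup here only stands in for that behaviour.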

Building an ARM7 CPU, Coding a BootROM, Coding an Assembler.

Coding an assembler:- write it in Python. This happens in parallel with building the CPU. It initially outputs just binary files, but that changes once you write a linker.
Building an ARM7 CPU (Verilog, 1500):- Break this into sub-chapters. A simple pipeline to start: fetch, decode, execute.
Coding a bootrom (Assembler, 40) - from geohotz. Memory Management Unit - wiki. https://developer.arm.com/architectures/learn-the-architecture/memory-management/the-memory-management-unit-mmu
Write your own OS
https://medium.com/@g33konaut/writing-an-x86-hello-world-boot-loader-with-assembly-3e4c5bdd96cf
Bootloader in C:- https://www.codeproject.com/Articles/664165/Writing-a-boot-loader-in-Assembly-and-C-Part
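As a starting point for the Python assembler mentioned above, here is a minimal sketch that encodes three ARM data-processing instructions (`mov`, `add`, `sub`) into 32-bit words and emits a flat little-endian binary. It assumes the "always" condition code, no S flag, no shifts, and immediates limited to 0..255 (no rotation). The function names and the accepted syntax are illustrative choices.

```python
import struct

OPCODES = {"add": 0x4, "sub": 0x2, "mov": 0xD}   # ARM data-processing opcodes
COND_ALWAYS = 0xE


def parse_operand(text):
    """Return (is_immediate, value) for 'rN' or '#imm'."""
    text = text.strip().lower()
    if text.startswith("#"):
        value = int(text[1:], 0)
        if not 0 <= value <= 0xFF:
            raise ValueError("this sketch only supports immediates 0..255")
        return True, value
    if text.startswith("r"):
        reg = int(text[1:])
        if not 0 <= reg <= 15:
            raise ValueError(f"bad register: {text}")
        return False, reg
    raise ValueError(f"bad operand: {text}")


def assemble_line(line):
    """Encode one instruction such as 'add r0, r1, r2' as a 32-bit word."""
    mnemonic, _, rest = line.strip().partition(" ")
    mnemonic = mnemonic.lower()
    operands = [parse_operand(part) for part in rest.split(",")]
    if operands[0][0]:
        raise ValueError("destination must be a register")
    rd = operands[0][1]
    if mnemonic == "mov":                 # mov has no first source register
        rn, (is_imm, op2) = 0, operands[1]
    else:
        if operands[1][0]:
            raise ValueError("first source must be a register")
        rn, (is_imm, op2) = operands[1][1], operands[2]
    return ((COND_ALWAYS << 28) | (int(is_imm) << 25) | (OPCODES[mnemonic] << 21)
            | (rn << 16) | (rd << 12) | op2)


def assemble(source):
    """Assemble a multi-line program into a flat little-endian binary."""
    words = [assemble_line(l) for l in source.splitlines() if l.strip()]
    return b"".join(struct.pack("<I", w) for w in words)
```

For example, `assemble_line("mov r0, #1")` yields `0xE3A00001`. Labels, branches and a symbol table would be the natural next additions before a linker becomes necessary.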

Compiler Design

Read the Compiler Design Tutorial by Tutorials Point. Tutorials Point Compiler Design - https://www.tutorialspoint.com/compiler_design/compiler_design_overview.htm
Write a compiler in Haskell. Learn Haskell - covers the basics of compilers.
Tutorial for implementation of functional languages - https://www.microsoft.com/en-us/research/uploads/prod/1992/01/tutor.pdf
Write a C compiler - https://github.com/nlsandler/write_a_c_compiler, https://norasandler.com/2017/11/29/Write-a-Compiler.html
Haskell C Compiler - https://github.com/NunoDasNeves/haskell-c-compiler
Implementing a JIT Compiled Language with Haskell and LLVM. LLVM Tutorial. A JIT compiler runs after the program has started and compiles the code (usually bytecode or some kind of VM instructions) on the fly, or just in time, into a form that is usually faster, typically the host CPU's native instruction set. Unlike a standard compiler, a JIT has access to dynamic runtime information, so it can make better optimizations, such as inlining functions that are called frequently. This is in contrast to a traditional compiler, which compiles all the code to machine language before the program is first run.
Optimizing a Compiler - Ycombinator:- https://news.ycombinator.com/item?id=15821899
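The JIT idea described above can be sketched in a few lines of Python. This is a toy under loud assumptions: the "bytecode" is a made-up stack language, and the "native code" is just a Python lambda produced at runtime with `eval`, standing in for real machine code. The names `interpret`, `compile_to_python` and `HotFunction` are hypothetical.

```python
def interpret(bytecode, x):
    """Slow path: walk the bytecode on every call."""
    stack = []
    for op, arg in bytecode:
        if op == "push":
            stack.append(arg)
        elif op == "load_x":
            stack.append(x)
        elif op in ("add", "mul"):
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if op == "add" else a * b)
        else:
            raise ValueError(f"unknown op: {op}")
    return stack.pop()


def compile_to_python(bytecode):
    """Fast path: translate the bytecode once into a Python function."""
    stack = []
    for op, arg in bytecode:
        if op == "push":
            stack.append(repr(arg))
        elif op == "load_x":
            stack.append("x")
        elif op in ("add", "mul"):
            b, a = stack.pop(), stack.pop()
            stack.append(f"({a} {'+' if op == 'add' else '*'} {b})")
        else:
            raise ValueError(f"unknown op: {op}")
    return eval(f"lambda x: {stack.pop()}")


class HotFunction:
    """Interpret until the function is 'hot', then switch to compiled code."""

    def __init__(self, bytecode, threshold=3):
        self.bytecode = bytecode
        self.threshold = threshold
        self.calls = 0
        self.compiled = None

    def __call__(self, x):
        if self.compiled is not None:
            return self.compiled(x)
        self.calls += 1
        if self.calls >= self.threshold:     # runtime information drives the decision
            self.compiled = compile_to_python(self.bytecode)
        return interpret(self.bytecode, x)
```

The call counter is the "dynamic runtime information" part: an ahead-of-time compiler cannot know which functions will turn out to be hot.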

System on Chip (SoC)

Needed for System On Chip design for ASICs.

SoC wiki
SoC design methodology, overview of the SoC design process.
Canonical SoC design, system design flow, system architecture, components of the system, hardware and software, processor architectures, system architecture and complexity. Parameterized systems-on-a-chip, system-on-a-chip peripheral cores.
Overview of SoC external memory, internal memory, size, cache memory, cache organization, cache data. Types of cache:- split-level caches, multi-level caches. SoC memory system.
SoC Notes:- SOC Notes
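To make cache organization from the list above more tangible, here is a small Python model of a direct-mapped cache. Direct mapping, the 16-byte block size and the 4-line capacity are assumptions picked for the sketch, and the class name is hypothetical. It tracks only tags, not data.

```python
class DirectMappedCache:
    """Each memory block maps to exactly one cache line (index = block mod lines)."""

    def __init__(self, num_lines=4, block_size=16):
        self.num_lines = num_lines
        self.block_size = block_size
        self.tags = [None] * num_lines    # None means the line is empty
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Return True on a hit, False on a miss (and fill the line)."""
        block = address // self.block_size
        index = block % self.num_lines
        tag = block // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.misses += 1
        self.tags[index] = tag            # evict whatever was in this line
        return False
```

Accessing addresses 0, 4, 64 and 0 in that order gives miss, hit, miss, miss: addresses 0 and 64 collide on line 0, which is the conflict-miss behaviour that set-associative designs reduce.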

Peripheral Devices

Buffers and latches, crystal, reset circuit, chip select logic circuit, timers and counters. Universal asynchronous receiver-transmitter (UART), pulse-width modulators.
Building a UART (Verilog, 100):- An intro chapter to Verilog: copy a real UART, introducing the concept of MMIO. Serial test echo program and LED control. Software Serial - arduino.cc
Implementing a UART in Verilog and Migen
UART, Serial Port, RS-232 Interface
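Before building the UART in Verilog, the framing can be prototyped in software. This sketch assumes the common 8N1 format (one start bit, eight data bits sent LSB first, no parity, one stop bit), which the list above does not specify. The function names are made up for illustration.

```python
def uart_frame(byte):
    """Serialize one byte as an 8N1 frame: start bit, 8 data bits, stop bit."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte out of range")
    data_bits = [(byte >> i) & 1 for i in range(8)]   # LSB goes out first
    return [0] + data_bits + [1]                      # line idles high, start is low


def uart_decode(bits):
    """Recover the byte from a 10-bit frame, checking start and stop bits."""
    if len(bits) != 10 or bits[0] != 0 or bits[9] != 1:
        raise ValueError("framing error")
    return sum(bit << i for i, bit in enumerate(bits[1:9]))
```

For instance, the ASCII character `A` (0x41) becomes the bit sequence `0 1 0 0 0 0 0 1 0 1` on the wire. A hardware receiver would additionally oversample the line to find the middle of each bit.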

Understanding Memory

Semiconductor main memory:- SRAM, DRAM, chip logic. Flash memory:- NOR and NAND flash memory, external memory
DDR DRAM:- DDR SDRAM

Understanding AI Accelerators

Jetson Nano Developer's Kit

Basically a Raspberry Pi on Steroids.

Jetson Nano embedded technical specifications - https://developer.nvidia.com/embedded/develop/hardware
Jetson Nano DL benchmarks:- https://developer.nvidia.com/embedded/jetson-nano-dl-inference-benchmarks
Jetson Nano Developer's Kit:- https://developer.nvidia.com/embedded/jetson-nano-developer-kit

Google TPU

What is a Tensor Processing Unit? RISC, CISC and the TPU instruction set. GPU vs TPU. Matrix Multiply Unit (MXU). Parallel processing on the Matrix Multiply Unit. Why matrix multiplication? Matrix machine. Systolic array: cycle 1 and cycle 2. Use cases of the TPU.
Edge TPU performance benchmarks
TensorFlow models on the Edge TPU
Implement a Multilayer perceptron for image classification using the CIFAR Dataset (GPU vs TPU).
A Survey of Accelerator Architectures for Deep Neural Networks
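The cycle-by-cycle behaviour of a systolic array can be simulated in plain Python. The sketch below uses an output-stationary layout for simplicity: each processing element (PE) keeps one output value, rows of A flow left to right and columns of B flow top to bottom, with inputs skewed by one cycle per row or column. This is an illustrative arrangement and is not claimed to match the TPU's actual dataflow; `systolic_matmul` is a hypothetical name.

```python
def systolic_matmul(A, B):
    """Multiply A (n x k) by B (k x m) on a simulated n x m systolic array.

    Returns (C, cycles). Each PE only ever talks to its left and upper neighbour.
    """
    n, k_dim, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]      # one accumulator per PE
    a_reg = [[0] * m for _ in range(n)]    # A values travelling right
    b_reg = [[0] * m for _ in range(n)]    # B values travelling down
    cycles = n + m + k_dim - 2
    for t in range(cycles):
        # Visit PEs from the far corner so each one reads its neighbour's
        # value from the previous cycle before that neighbour is updated.
        for i in reversed(range(n)):
            for j in reversed(range(m)):
                if j == 0:                 # left edge: feed row i, delayed by i cycles
                    k = t - i
                    a_in = A[i][k] if 0 <= k < k_dim else 0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:                 # top edge: feed column j, delayed by j cycles
                    k = t - j
                    b_in = B[k][j] if 0 <= k < k_dim else 0
                else:
                    b_in = b_reg[i - 1][j]
                a_reg[i][j], b_reg[i][j] = a_in, b_in
                acc[i][j] += a_in * b_in   # multiply-accumulate
    return acc, cycles
```

Multiplying two 2x2 matrices this way takes 4 cycles. In hardware, all PEs fire simultaneously in each cycle, so the cost grows with n + m + k rather than n * m * k.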

CUDA Programming

CUDA provides two APIs (Application Programming Interfaces) for developers: the CUDA driver API and the CUDA runtime API. The CUDA driver API is more fundamental (low-level) and more flexible. The CUDA runtime API is built on top of the CUDA driver API and is easier to use. We only consider the CUDA runtime API. CUDA C++ extends C++ by allowing the programmer to define C++ functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C++ functions.

A kernel is defined using the __global__ declaration specifier, and the number of CUDA threads that execute that kernel for a given kernel call is specified using the new <<<...>>> execution configuration syntax. A few examples have been provided in cuda programming.
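The launch model described above can be mimicked in plain Python to see how each thread picks its element. This is an emulation, not CUDA: `launch` is a hypothetical helper that runs the "threads" one after another instead of in parallel, and the index computation mirrors the usual `blockIdx.x * blockDim.x + threadIdx.x` pattern for a one-dimensional grid.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Emulate kernel<<<grid_dim, block_dim>>>(args) sequentially."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def vec_add(block_idx, block_dim, thread_idx, a, b, out, n):
    """Kernel body: each 'thread' handles exactly one element."""
    i = block_idx * block_dim + thread_idx
    if i < n:                  # guard: the last block may have spare threads
        out[i] = a[i] + b[i]


n = 10
a = list(range(n))
b = [10] * n
out = [0] * n
threads_per_block = 4
blocks = (n + threads_per_block - 1) // threads_per_block   # round up
launch(vec_add, blocks, threads_per_block, a, b, out, n)
```

With 10 elements and 4 threads per block, 3 blocks are launched and the last two threads do nothing, which is why the bounds check inside the kernel matters.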
