Intro to C for High-Level ‘Grammers

DISCLAIMER:

The following is a set of notes on low-level system programming targeting programmers more accustomed to the higher-level world of interpreted languages.

I am NOT an expert in C programming and I can barely follow any assembler language that is not nasm. This document may contain many incorrect statements but how about you and I go toe-to-toe on Bird Law and see who comes out the victor?

Each day we’ve worked on this repo is separated into its own branch. If you want to follow sequentially as this repo has evolved, select the branch by day; day-one is the branch corresponding to the state of the repo at the time of writing this message.

Have fun and please feel free to absolutely roast me for any and all statements.

Preliminaries

We are not going to talk C at all until we cover some basic UNIX OS concepts and discuss build tools. Even if you don’t care about C, this stuff might be useful.

WTF do I need to install?

You need a standard compiler and make.

You most likely already have either gcc or clang installed on a UNIX-like OS, so the only thing you should need is make.

Linux

If it isn’t installed by default then consult your distro’s package docs. This should return something if it’s installed:

which make

macOS

brew install make

BSD

As with Linux, this is going to depend on your BSD flavor but obviously if you are using BSD you probably shouldn’t be reading this. If you use FreeBSD and you are somehow unaware, core make in FreeBSD (i.e. what you use for ports) is not the same thing as the commie GNU make that most people are familiar with. You would need to install gmake to be perfectly consistent with these notes but who cares?

Windows

The best starting place is to install gentoo first.

Okay fine, then you should probably use WSL because literally nobody but game devs have the sanity or motivation necessary to learn C using Windows APIs.

If you ARE using WSL, you are most likely using Ubuntu:

sudo apt install make

There is also cygwin but I haven’t the slightest idea how that works, so good luck.

Makefiles

IMPORTANT: You can’t copy paste this code because Makefiles are whitespace sensitive and this org file doesn’t preserve tabs. :(

Makefiles are the most convenient build tool ever created. They have been around for almost 50 years. You can use them for almost anything. Start by creating a file titled Makefile and give it a target “hello”:

hello:
  echo hello world

If you run this via make:

$ make hello
echo hello world
hello world

That is, for a specified target, the set of tabbed lines directly beneath are the commands which will be run (called a recipe). Makefiles are tab sensitive! If you don’t want to see the command you invoked, prefix it with an @ symbol.

hello:
  @echo hello world
$ make hello
hello world

You can provide any number of targets.

hello:
  @echo hello world

goodbye:
  @echo goodbye moon
$ make hello
hello world
$ make goodbye
goodbye moon

You can also provide any number of recipes to each target.

hello:
  @echo hello world
  @echo hello earth

goodbye:
  @echo goodbye moon
  @echo goodbye sun
$ make hello
hello world
hello earth
$ make goodbye
goodbye moon
goodbye sun

Targets can be composed with other targets as dependencies. What this means is that the other targets specified to the direct right of the `:` symbol will be evaluated before the indented target recipes fire.

hello_goodbye: hello goodbye
  @echo all done

hello:
  @echo hello world

goodbye:
  @echo goodbye moon
$ make hello_goodbye
hello world
goodbye moon
all done

Incidentally, the top-most target is taken as a default value if no target is given as an argument to make. NOTE THAT THE TARGET NAMES ARE COMPLETELY ARBITRARY AND THE TOP-MOST WILL ALWAYS SERVE AS THE DEFAULT:

$ make
hello world
goodbye moon
all done

Like shell scripts, we can bind identifiers to expressions. make will literally inject these values wherever it encounters them within $(). i.e.,

HELLO=hello world
GOODBYE=goodbye moon
CAN_BE_TARGET_TOO_LOL=i literally dont matter

$(CAN_BE_TARGET_TOO_LOL): hello goodbye
  @echo $(CAN_BE_TARGET_TOO_LOL)

hello:
  @echo $(HELLO)

goodbye:
  @echo $(GOODBYE)
$ make
hello world
goodbye moon
i literally dont matter

Sometimes in shell scripting we want the output of an evaluated shell expression, for instance:

$ echo today is $(date | awk -F: '{ print $1}')
today is Thu Apr 4 01

Of course, this couldn’t quite work in a Makefile as is: how would the parser distinguish between substitution and evaluation? Solution: just add another $. The same goes for any other $ that is meant for the shell rather than for make, like awk’s $1:

HELLO=hello world
GOODBYE=goodbye moon
CAN_BE_TARGET_TOO_LOL=i literally dont matter

$(CAN_BE_TARGET_TOO_LOL): hello goodbye
  @echo $(CAN_BE_TARGET_TOO_LOL)
  @echo but at least its $$(date | awk -F: '{ print $$1 }')

hello:
  @echo $(HELLO)

goodbye:
  @echo $(GOODBYE)
$ make
hello world
goodbye moon
i literally dont matter
but at least its Thu Apr 4 01

That’s enough for now, we’re actually ready to start a C project.

Building a C project

Here comes some boilerplate.

filename: Makefile

CC=clang
CFLAGS=-Wall -Wextra -pedantic -Wconversion \
			 -Wunreachable-code -Wswitch-enum -Wno-gnu
EXE=run

all: main.c
	$(CC) main.c -o $(EXE) $(CFLAGS)

clean:
	rm -rf $(EXE)

And at last, perhaps the simplest C program imaginable:

filename: main.c

int main(void) {
  return 0;
}

Note that main.c should exist at the project’s root, together with the Makefile. After we run make, we can run our program by giving its executable name relative to our current directory, i.e.,

$ make && ./run

Aaaaaaand…. Nothing happens. :D
What this program does is simply return the number 0 to the operating system as its exit status (nothing is written to standard out). It is a convention in UNIX that an “exit value of zero” is an indication of success. It is extremely important that this convention is followed. This is how we have the capability of running conditional shell commands and applications in succession. Observe the following behavior with our newly compiled binary:

$ ./run && echo hello world!
> hello world!
$ ./run || echo "hello world!"
> 

Now, you may be asking, “why zero??? wouldn’t boolean logic dictate true be 1 as convention?” That is an excellent question! In fact, try running this:

$ true && echo hello world!
> hello world!
$ true || echo "hello world!"
> 

Well dear friends, the `true` command is in fact a C program which simply returns 0 on every call*. Lol. This convention was chosen long, long ago to allow for context to be given to any non-zero exit code.

You can view the previous exit code in shell by using the `$?` special variable.

$ true
$ echo $?
> 0
$ false
$ echo $?
> 1
$ ./run
$ echo $?
> 0

*NOTE: most modern shells build this command in, rather than relying on the set of system core utilities. https://www.gnu.org/software/coreutils/manual/coreutils.html#true-invocation

Examining the Makefile

filename: Makefile

CC=clang
CFLAGS=-Wall -Wextra -pedantic -Wconversion \
			 -Wunreachable-code -Wswitch-enum -Wno-gnu
EXE=run

all: main.c
	$(CC) main.c -o $(EXE) $(CFLAGS)

clean:
	rm -rf $(EXE)

I have defined a few variables here. CC, for instance, specifies what compiler I would like to use, and EXE is an identifier for my eventual executable binary.

The clean target is a convenient way that I can remove the previous executable binary. Probably the most interesting of these variables is CFLAGS. Compiler flags of course are used to set the “strictness” of our compiler (among other things). I don’t want to go into the details of why I have chosen these flags at the present time, just suffice it to say that this is a very strict set and a very good collection in my humble opinion.

The PATH variable aka what’s with “./” ???

In order to execute our application, we MUST specify the path to its binary. That is, we cannot simply run it with run, that is, not yet.

You see, when we run

$ ./run

The operating system transforms the relative path ./run into an absolute path that may look something like: /home/ziggy/src/my_app/run
In fact, this has to be done for ALL executables. So why is it that some utilities on your machine like ls or echo can be called without this specification? The answer is through an environment variable called PATH. On my system, my path variable looks like this:

$ echo $PATH
> /home/ziggy/.opam/default/bin:/home/ziggy/.opam/default/bin:/home/ziggy/.cabal/bin:/home/ziggy/.ghcup/bin:/home/ziggy/.nvm/versions/nod
e/v21.6.1/bin:/home/ziggy/.cargo/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/opt/android-sdk/emulator:/opt/android-sdk/tools:/opt/andr
oid-sdk/tools/bin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/home/ziggy/bin:/home/ziggy/third
-party/julia-1.8.4/bin:/home/ziggy/go/bin:/home/ziggy/.local/bin:/home/ziggy/.fzf/bin:/home/ziggy/bin:/home/ziggy/third-party/julia-1.8
.4/bin:/home/ziggy/go/bin:/home/ziggy/.local/bin

This enormous variable tells the shell what directories contain executables and in what order to search for them. Executables contained within these directories can be called without a path specified because the shell will go through each “:” delimited path and attempt to prepend it to the command name you have called! To add a new path to the PATH variable, you need only reassign its value!

$ export PATH=$PATH:/path/to/add

One important note is that user specified paths should typically be appended (as opposed to prepended, i.e. PATH=/path/to/add:$PATH) as we don’t want any of our own personal executables to take precedence. Remember, PATH is evaluated from left to right. If we were to put the directory containing our executable run at the front of PATH, and there were a critical executable on our system also named run, our silly program would be run first.

One final note: in our example we have exported an environment variable but, as many are probably aware, simply exporting an environment variable will not cause it to persist. In order to have our PATH keep this amendment across shell sessions (i.e. after closing and opening a new shell), you will need to add this command to your shell configuration file (.bashrc, .zshrc, etc).

Amending the Makefile to install software

Very often in the wild you will encounter software on the internet which gives the following instructions for their installation:

make install clean

In fact, there is an entire operating system whose software management architecture is based on maintaining a tree of directories full of Makefiles (the ports system of FreeBSD).

If you would like this functionality, first add to PATH a local directory from which you are comfortable executing binaries. A very common choice for this is a local bin directory, $HOME/bin for instance. Once that directory is in PATH, simply copying your binary to that directory would allow you to call it as a normal executable from shell.

filename: Makefile

CC=clang
CFLAGS=-Wall -Wextra -pedantic -Wconversion \
			 -Wunreachable-code -Wswitch-enum -Wno-gnu
EXE=run
BIN_DIR=$(HOME)/bin

all: build

install: build
	cp $(EXE) $(BIN_DIR)/$(EXE)

build: main.c
	$(CC) main.c -o $(EXE) $(CFLAGS)

clean:
	rm -rf $(EXE)

Of course, a person you want to distribute this software to may not have that directory in PATH, or they may not WANT your executable IN that path. This is why it is considered polite to mark the installation target as install. We can come back to fancy ways to augment our Makefile to assist in installation but odds are the majority of people who care to even receive this Makefile are going to understand that they must specify an appropriate BIN_DIR.

Let’s get started!

C is weakly typed

In C, only a small set of primitive data types carry any actual semantics. For instance, declaring a variable as float informs the compiler how it should process the value I am giving it in memory. To us, the float may carry the semantics of the real number 2.75, but to the compiler, it knows to store and represent the underlying DATA as an IEEE 754 floating-point number and apply arithmetic operations accordingly. The value 2.75 to us can, in the real world of computers, be represented as the value 0x40300000. That is if you like, the so called IEEE 754 standard provides an implementation of our “real number” interface we call float.

Exposing this reality involves a bit of misdirection (later we’ll call this indirection ;)). WARNING: YOU WILL NOT UNDERSTAND THIS AT THE PRESENT MOMENT THAT IS OKAY

#include <stdio.h>

int main(void) {
  float x = 2.75;
  printf("x is %x\n", *(int *) &x);
  return 0;
}
$ make && ./run
> x is 40300000

This is an example of something called type punning and it is a mechanism by which we can circumvent the type system (or the interface/implementation construct of the compiler, if you’d like) to get at the underlying raw data on which the CPU operates. IMHO, this capability is what it REALLY means to be WEAKLY typed.

Now, I know what you’re thinking, “wtf is with the %, * and & stuff, you haven’t even done hello world!” Well, let’s start with “hello, world!” and build up in increasing detail until we understand this particular example. By doing so I hope that you will then understand the scariest of scary topics in C: what is a pointer?

hello world! (finally)

Here is hello world in C:

#include <stdio.h>

int main(void) {
  printf("hello, world!\n");
  return 0;
}

The function printf() is provided by the inclusion of the stdio.h header file.

A header exposes the function prototypes (C’s notion of an interface) of an often (but not always) pre-compiled library (called an object). If all that is given to you is the prototype of printf(), how then can this compile? Well, stdio is a part of the C standard library, and your compiler is nice enough to link the implementation objects for you.

printf() does not automatically add a new line. Hence the additional `\n`.

As previously discussed, we return 0 to indicate the program terminated successfully.

POINTERS I: string literals

C has string literals but they come with some things to keep in mind. First, as you are probably aware, a string is actually an array of characters. How those characters are encoded can vary language to language but in C the char type is a single byte, which in practice means good ol’ fashioned ASCII (chars are 8-bit on basically every machine you will ever touch; the standard technically only promises at least 8 bits).

#include <stdio.h>

int main(void) {
  char *str = "hello, world!";
  printf("%s\n", str);
  return 0;
}

What’s with the *? That, dear hearts, is the pointer syntax: char * declares the str variable as a pointer to a char.

So what is this pointer thing? Well, a string is a sequence of bytes (char is 8-bit) stored in memory. Your CPU simply must know where in the hell the sequence of bytes starts if it is ever to iterate over it. If it cannot iterate over it, it simply cannot display it.

A pointer stores a memory address. On your machine, it is likely a 64-bit unsigned integer. Why? Because your memory addresses, where the data actually lives, are labeled with 64-bit unsigned integers. So, to be more precise,

a pointer is just a variable that stores a memory address, which is just a number

Let’s pretend for a moment that we’re on a 16-bit machine for simplicity: its memory addresses are 16-bit unsigned integers (we always represent pointers in hex). This is what your string would look like in memory:

0x0006: 'h'
0x0007: 'e'
0x0008: 'l'
0x0009: 'l'
0x000A: 'o'
0x000B: ' '
0x000C: 'w'
0x000D: 'o'
0x000E: 'r'
0x000F: 'l'
0x0010: 'd'
0x0011: '\n'
0x0012: '\0'
0x0013: 0x0006

Of course, those characters I have placed, ‘h’ and so forth, are actually numbers themselves: the ASCII character code of ‘h’ and so forth, i.e.,

0x0006: 0x68

Let’s direct our attention to the very last two entries here:

0x0012: '\0'
0x0013: 0x0006

The very last one, well, that’s your pointer.

It is absolutely critical to understand that a pointer is itself a variable which is stored in memory! Its value is a memory address!

Do you see how our pointer contains the memory address where ‘h’ is stored?

Okay, so what’s the deal with 0x0012? Well, that is what is known as the null terminator, it is a special non-printable character that is used to indicate the end of a string! We will show why this is done, and how to utilize this fact in due course (indeed printf() itself is leveraging it), but I would like to mention that the compiler is going to add this character by default for string literals, i.e. it has been added because you defined str as pointing to “hello, world!”. The key point here is the use of double quotes, that is what we mean by string literal.

These outputs I have given are what are sometimes known as “hex dumps”, and they are a way of diagramming the state of memory as it physically exists. A few notes on that.

Most hexdumps are formatted more like this (here is an actual segment of a hexdump from our compiled program):

00002000: 0100 0200 6865 6c6c 6f2c 2077 6f72 6c64  ....hello, world
00002010: 2100 2573 0a00 0000 011b 033b 2400 0000  !.%s.......;$...

The left column is the address of the first byte on each row (strictly speaking, its offset into the file). The middle columns are the raw bytes in hex, sixteen per row. The right column prints any values which just so happen to be printable characters, and a “.” for everything else.

It should be noted that not all memory addresses are actually shown; it is up to the reader to understand that, say, the byte 0x68 (corresponding to the letter ‘h’) lives at the memory address 0x00002004 (use your finger and count to it, starting from zero). Do you see the 0x00 immediately following ‘!’ (0x21)? If you look at the right printout, you see it is non-printable. That corresponds to our ‘\0’ (null terminator). What follows is our format string, which contains what we will come to know as a format specifier (the “%s”).

But where is the pointer? It is stored somewhere in this enormous hexdump (it’s 1033 memory addresses lol). I promise you it is in there, but the thing is, real hexdumps are always enormous despite how small our source code is. The reason why is, this is not just a raw binary format, I mean, this is certainly raw binary, but it has been embellished with additional information that the operating system (namely the kernel) uses at runtime to begin execution. In actuality, the pointer is defined very far away from the string literal, and the reason is that

string literals are stored in an unmodifiable section of memory called static memory!

In the future, we may have opportunity to look at the build step of the toolchain just before we are turned into machine code, this is where we exist in an intermediate language called assembler. In this format, it is very clear that string literals are stored in a very different section than stack allocated variables. But this detail need not concern us for now, this is already tl;dr.

tl;dr: For the interested reader, you can actually compile a program into raw binary (which will of course be rejected by the kernel if you were to attempt to run it). This process involves telling the compiler that standard libraries may not be implemented, compiling to object and then utilizing a custom linker to only output the raw binary.

A little bit easier thing to do would be to compile to a so-called .COM file, which was the very simple (and hilariously dangerous and exploitable) headerless binary format of old DOS machines. Incidentally, the same kind of flat, headerless binary is what x86 machines load from the boot sector during the boot procedure, before the system hands off control to the operating system init procedure.

The modern practice of writing flat binaries like this to be run in so-called real mode is often called boot sector programming.

In any case, a popular tool which can help you to explore these different formats (typically used to examine assembler) is the entirely online application https://godbolt.org/.

Our very first segfault (and const keyword) :D

String literals are typically stored in the so-called “read-only” section of static memory. Once declared, a string literal CANNOT be modified, i.e., you cannot index into a location and change that character.
Okay… but the thing is… you can actually try. On most systems, the following code is going to cause the kernel to kill our program with a “segmentation fault”, which is a fancy way of saying “crash because we tried to do something bad with memory”.

#include <stdio.h>

int main(void) {
  char *hello = "Hello, world!\n";
  hello[1] = 'i'; // BAD!

  return 0;
}

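So why did this happen? Well, because we are trying to access a read-only section of memory and write to it. The compiler didn’t stop us, so the operating system kernel did. Why didn’t the compiler try to stop us? Because, for historical reasons, a string literal in C has the type “array of char” rather than “array of const char”, so as far as the type system is concerned the write is fine (the standard just calls the result undefined behavior). Probably this also has to do with C still being used to target architectures where string literals may NOT be read-only? Well, in any case here is some idiomatic C to make your life easier: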

#include <stdio.h>

int main(void) {
  const char *hello = "Hello, world!\n";
  hello[1] = 'i'; // compiler stops us :D

  return 0;
}

You’ll note that when you try to compile this you now get an appropriate error message:

main.c:5:12: error: read-only variable is not assignable
    5 |   hello[1] = 'i';
      |   ~~~~~~~~ ^

Which is 1000000x better than us just crashing. Here we can actually see what exactly went wrong. The const keyword (or qualifier) is a promise we make to the compiler to never alter the data we are initializing. However, it is important to note that while we cannot alter the data which we point to in this case, we can redirect this pointer to another memory address, at which point the data at that memory address will not be modifiable through it either.
This subtlety rarely comes up but you should know. In any case, when you say const, make sure you mean it. Here’s some code to prove the subtlety:

#include <stdio.h>

int main(void) {
  const char *hello = "Hello, world!\n";
  char *goodbye = "Goodbye, moon\n";

  hello = goodbye;

  printf("%s\n", hello); // Goodbye, moon

  hello[0] = 'H'; // <-- compiler says no still

  return 0;
}

Format specifiers (AKA data is just data)

We have been using stuff like %s and %c and %d in our string literals. These things are “placeholders”, but they are nameless. So if you come from the world of JavaScript or Python, know that what you think of as an “interpolated string” is effectively what we are doing here, save for the fact that these values are injected by order as arguments from something like printf (the f stands for “format” and these type of strings (which contain format specifiers) are often called “format strings”).
When you say “%s”, you mean that the data you are passing is a string, and printf then knows to walk that string from the pointer to the null terminator. When you say “%d” you mean a signed integer (an int, 32-bit on most machines) in decimal format, “%c” the character representation of an 8-bit integer, etc, etc. Why do you have to do this? C is weakly typed, so again, at the end of the day, if you pass a string hello like

const char *hello = "Hello, world!\n";
printf("%s\n", hello);

if you were to not specify that you want to print a string, C only knows that you gave it some data. It has absolutely no additional context to draw from in order to format your output correctly. Sometimes this is useful! Want to know some ASCII character codes? Here:

#include <stdio.h>

int main(void) {
  const char *abc = "abcdefghijklmnop";
  while (*abc != '\0') printf("ASCII: %d\n", *abc++);
  return 0;
}

This works because I am printing each character in the string as a number. What about the whole incrementing abc thing? Well, it turns out, since strings are arrays in the sense that they are stored in a contiguous memory region, C affords us a nice syntactic sugar where incrementing a pointer causes it to point to the next piece of data (*abc++ reads the current character and then moves abc forward by one). This is an example of pointer arithmetic, and we will talk about this more later.

Table of Format Specifiers

Here is a table of the most common format specifiers.

Specifier  Description
%d         Decimal integer
%i         Integer, base 10
%u         Unsigned decimal integer
%f         Floating point number (with lowercase output)
%F         Floating point number (with uppercase output)
%e         Scientific notation (lowercase)
%E         Scientific notation (uppercase)
%g         Shortest representation of %f or %e
%G         Shortest representation of %F or %E
%x         Unsigned hexadecimal integer (lowercase)
%X         Unsigned hexadecimal integer (uppercase)
%o         Unsigned octal integer
%s         String of characters
%c         Character
%p         Pointer (address in hexadecimal)
%n         Number of characters written/read so far
%%         A literal ‘%’ character

You can even create your own apparently

C arrays

TODO

Projects

Links to candbox related projects

What started as a simple parsing example is slowly becoming my attempt at a very stupid programming language…

References

There is an infinite supply of C programming resources and I’ll note a few here in order of what I feel is the most helpful.

Beej’s Guide to Network Programming is super famous but this one is just as incredible in my opinion. I wish I had been aware of these guides’ existence when I first started writing C. Beej’s writing style is incredibly easy to parse and he has an incredible sense of what students of the C language tend to struggle with. This guide can be read cover to cover without any boredom or dullness arising.

As I mentioned before, Beej’s most famous guide is the Network Programming one but I might as well link his page. My dude has a way of explaining the things.

K&R C

Obviously…