DISCLAIMER:
The following is a set of notes on low-level systems programming, aimed at programmers more accustomed to the higher-level world of interpreted languages.
I am NOT an expert in C programming and I can barely follow any assembler language that is not nasm. This document may contain many incorrect statements but how about you and I go toe-to-toe on Bird Law and see who comes out the victor?
Each day of work on this repo lives in its own branch. If you want to follow along sequentially as the repo has evolved, select the branch by day; day-one is the branch corresponding to the state of the repo at the time of writing this message.
Have fun and please feel free to absolutely roast me for any and all statements.
We are not going to talk C at all until we cover some basic UNIX OS concepts and discuss build tools. Even if you don’t care about C, this stuff might be useful.
You need a standard compiler and make.
Most UNIX-like OSs ship with either gcc or clang, or make one trivial to install,
so the only other thing you will likely need is make.
If it isn’t installed by default then consult your distro’s
package docs. This should return something if it’s installed:
which make
brew install make
As with Linux, this is going to depend on your BSD flavor but obviously if you are using BSD you probably shouldn’t be reading this. If you use FreeBSD and you are somehow unaware, core make in FreeBSD (i.e. what you use for ports) is not the same thing as the commie GNU make that most people are familiar with. You would need to install gmake to be perfectly consistent with these notes but who cares?
The best starting place is to install gentoo
first.
Okay fine, then you should probably use WSL because
literally nobody but game devs have the sanity or motivation
necessary to learn C using Windows APIs.
If you ARE using WSL, you are most likely using Ubuntu:
sudo apt install make
There is also cygwin but I haven’t
the slightest idea how that works, so good luck.
IMPORTANT: You can’t copy paste this code because Makefiles
are whitespace sensitive and this org file doesn’t preserve
tabs. :(
Makefiles are the most convenient build tool ever created. They have been around for almost 50 years. You can use them for almost anything. Start by creating a file titled Makefile and give it a target “hello”
hello:
echo hello world
If you run this via make:
$ make hello
echo hello world
hello world
That is, for a specified target, the set of tab-indented lines directly beneath it are the commands which will be run (together called a recipe). Makefiles are tab sensitive! If you don’t want to see the command you invoked echoed back, prefix it with an @ symbol.
hello:
@echo hello world
$ make hello
hello world
You can provide any number of targets.
hello:
@echo hello world
goodbye:
@echo goodbye moon
$ make hello
hello world
$ make goodbye
goodbye moon
You can also provide any number of commands in each target’s recipe.
hello:
@echo hello world
@echo hello earth
goodbye:
@echo goodbye moon
@echo goodbye sun
$ make hello
hello world
hello earth
$ make goodbye
goodbye moon
goodbye sun
Targets can be composed with other targets as dependencies. What this means is that the other targets specified to the direct right of the `:` symbol will be evaluated before the indented target recipes fire.
hello_goodbye: hello goodbye
@echo all done
hello:
@echo hello world
goodbye:
@echo goodbye moon
$ make hello_goodbye
hello world
goodbye moon
all done
Incidentally, the top-most target is taken as a default value if no target is given as an argument to make. NOTE THAT THE TARGET NAMES ARE COMPLETELY ARBITRARY AND THE TOP-MOST WILL ALWAYS SERVE AS THE DEFAULT:
$ make
hello world
goodbye moon
all done
Like shell scripts, we can bind identifiers to expressions. make will literally inject these values wherever it encounters them within $(). i.e.,
HELLO=hello world
GOODBYE=goodbye moon
CAN_BE_TARGET_TOO_LOL=i literally dont matter
$(CAN_BE_TARGET_TOO_LOL): hello goodbye
@echo $(CAN_BE_TARGET_TOO_LOL)
hello:
@echo $(HELLO)
goodbye:
@echo $(GOODBYE)
$ make
hello world
goodbye moon
i literally dont matter
Sometimes in shell scripting we want the output of an evaluated shell expression, for instance:
$ echo today is $(date | awk -F: '{ print $1}')
today is Thu Apr 4 01
Of course, this can’t quite work in a Makefile as is: how would the parser distinguish between make’s substitution and the shell’s evaluation? Solution: just add another $. The same goes for awk’s $1, which make would otherwise swallow as one of its own (empty) variables:
HELLO=hello world
GOODBYE=goodbye moon
CAN_BE_TARGET_TOO_LOL=i literally dont matter
$(CAN_BE_TARGET_TOO_LOL): hello goodbye
@echo $(CAN_BE_TARGET_TOO_LOL)
@echo but at least its $$(date | awk -F: '{ print $$1 }')
hello:
@echo $(HELLO)
goodbye:
@echo $(GOODBYE)
$ make
hello world
goodbye moon
i literally dont matter
but at least its Thu Apr 4 01
That’s enough for now, we’re actually ready to start a C project.
Here comes some boilerplate.
filename: Makefile
CC=clang
CFLAGS=-Wall -Wextra -pedantic -Wconversion \
-Wunreachable-code -Wswitch-enum -Wno-gnu
EXE=run
all: main.c
$(CC) main.c -o $(EXE) $(CFLAGS)
clean:
rm -rf $(EXE)
And at last, perhaps the simplest C program imaginable:
filename: main.c
int main(void) {
return 0;
}
Note that main.c should exist at the project’s root, together with the
Makefile. After we run make, we can run our program by giving
its executable name relative to our current directory, i.e.,
$ make && ./run
Aaaaaaand…. Nothing happens. :D
What this program does is simply return the number 0 to whoever launched it
(here, the shell) as its exit status. Nothing is written to standard out
(stdout), which is why we see no output. It is a convention in UNIX that an
“exit value of zero” is an indication of success.
It is extremely important that this convention is followed. This is how
we have the capability of running conditional shell commands and applications in
succession. Observe the following behavior with our newly compiled binary:
$ ./run && echo hello world!
> hello world!
$ ./run || echo "hello world!"
>
Now, you may be asking, “why zero??? wouldn’t boolean logic dictate true be 1 as convention?” That is an excellent question! In fact, try running this:
$ true && echo hello world!
> hello world!
$ true || echo "hello world!"
>
Well dear friends, the `true` command is in fact a C program which simply
returns 0 on every call*. Lol.
This convention was chosen long, long ago so that any non-zero exit code
can carry context about what went wrong.
You can view the previous exit code in shell by using the `$?` special variable.
$ true
$ echo $?
> 0
$ false
$ echo $?
> 1
$ ./run
$ echo $?
> 0
*NOTE: most modern shells build this command in, rather than relying on the set of system core utilities. https://www.gnu.org/software/coreutils/manual/coreutils.html#true-invocation
filename: Makefile
CC=clang
CFLAGS=-Wall -Wextra -pedantic -Wconversion \
-Wunreachable-code -Wswitch-enum -Wno-gnu
EXE=run
all: main.c
$(CC) main.c -o $(EXE) $(CFLAGS)
clean:
rm -rf $(EXE)
I have defined a few variables here. CC, for instance, specifies what compiler I would like to use, and EXE is an identifier for my eventual executable binary.
The clean target is a convenient way that I can remove the previous executable binary. Probably the most interesting of these variables is CFLAGS. Compiler flags of course are used to set the “strictness” of our compiler (among other things). I don’t want to go into the details of why I have chosen these flags at the present time, just suffice it to say that this is a very strict set and a very good collection in my humble opinion.
In order to execute our application, we MUST specify the path to its binary.
That is, we cannot simply run it by typing run; at least, not yet.
You see, when we run
$ ./run
The operating system transforms the relative path ./run into an absolute path
that may look something like: /home/ziggy/src/my_app/run
In fact, this has to be done for ALL executables.
So why is it that some utilities on your machine like ls or echo can be
called without this specification? The answer is through an
environment variable called PATH. On my system, my path variable looks like
this:
$ echo $PATH
> /home/ziggy/.opam/default/bin:/home/ziggy/.opam/default/bin:/home/ziggy/.cabal/bin:/home/ziggy/.ghcup/bin:/home/ziggy/.nvm/versions/nod
e/v21.6.1/bin:/home/ziggy/.cargo/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/opt/android-sdk/emulator:/opt/android-sdk/tools:/opt/andr
oid-sdk/tools/bin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/home/ziggy/bin:/home/ziggy/third
-party/julia-1.8.4/bin:/home/ziggy/go/bin:/home/ziggy/.local/bin:/home/ziggy/.fzf/bin:/home/ziggy/bin:/home/ziggy/third-party/julia-1.8
.4/bin:/home/ziggy/go/bin:/home/ziggy/.local/bin
This enormous variable tells the shell what directories contain executables and in what order to search for them. Executables contained within these directories can be called without a path specified because the shell will go through each “:” delimited directory and attempt to prepend it to the command name you have called! To add a new path to the PATH variable, you need only reassign its value!
$ export PATH=$PATH:/path/to/add
One important note is that user specified paths should typically be appended
(as opposed to prepended, i.e. PATH=/path/to/add:$PATH) as we don’t want any of
our own personal executables to take precedence. Remember,
PATH is evaluated from left to right. If we were to put the directory containing
our executable run at the front of PATH, and there were a critical executable on
our system also named run, our silly program would be run first.
One final note: in our example we have exported an environment variable but, as many are probably aware, simply exporting an environment variable will not cause it to persist. In order to have our PATH keep this amendment across shell sessions (i.e. after closing and opening a new shell), you will need to add this command to your shell configuration file (.bashrc, .zshrc, etc).
Very often in the wild you will encounter software on the internet which gives the following instructions for their installation:
make install clean
In fact, there is an entire operating system whose software management
architecture is based on maintaining a tree of directories full
of Makefiles (the ports system of FreeBSD).
If you would like this functionality, first add to PATH a local directory
from which you are comfortable executing binaries. A very
common choice for this is a local bin directory, $HOME/bin for instance.
Once that directory is in PATH, simply copying your binary to that directory
allows you to call it as a normal executable from the shell.
filename: Makefile
CC=clang
CFLAGS=-Wall -Wextra -pedantic -Wconversion \
-Wunreachable-code -Wswitch-enum -Wno-gnu
EXE=run
BIN_DIR=$(HOME)/bin
all: build
install: build
cp $(EXE) $(BIN_DIR)/$(EXE)
build: main.c
$(CC) main.c -o $(EXE) $(CFLAGS)
clean:
rm -rf $(EXE)
Of course, a person you want to distribute this software to may not have that directory in PATH, or they may not WANT your executable IN that path. This is why it is considered polite to name the installation target install. We can come back to fancy ways to augment our Makefile to assist in installation, but odds are the majority of people who care to even receive this Makefile are going to understand that they must specify an appropriate BIN_DIR.
In C, only a small set of primitive data types carry any actual semantics.
For instance, declaring a variable as float informs the compiler how it
should process the value I am giving it in memory. To us, the float may
carry the semantics of the real number 2.75, but to the compiler, it knows
to store and represent the underlying DATA as an
IEEE 754 floating-point number and apply arithmetic operations accordingly.
The value 2.75 to us is, in the real world of computers, represented as
the 32-bit pattern 0x40300000. That is, if you like, the so-called IEEE 754
standard provides an implementation of our “real number” interface we call float.
Exposing this reality involves a bit of misdirection (later we’ll call this indirection ;)). WARNING: YOU WILL NOT UNDERSTAND THIS AT THE PRESENT MOMENT. THAT IS OKAY.
#include <stdio.h>
int main(void) {
float x = 2.75f;
/* reinterpret the float's bytes as an unsigned int (%x expects unsigned) */
printf("x is %x\n", *(unsigned int *) &x);
return 0;
}
$ make && ./run
> x is 40300000
This is an example of something called type punning, and it is a mechanism by
which we can circumvent the type system (or the interface/implementation
construct of the compiler, if you’d like) to get at the underlying raw data
on which the CPU operates.
IMHO, this capability is what it REALLY means to be WEAKLY typed.
Now, I know what you’re thinking, “wtf is with the %, * and & stuff, you haven’t even done hello world!” Well, let’s start with “hello, world!” and build up in increasing detail until we understand this particular example. By doing so, I hope that you will then understand the scariest of scary topics in C: what is a pointer?
Here is hello world in C:
#include <stdio.h>
int main(void) {
printf("hello, world!\n");
return 0;
}
The function printf() is provided by the inclusion of the stdio.h header
file.
A header exposes the function prototypes (C’s notion of an interface) of
a library that is often (but not always) pre-compiled into object files. If all
that is given to you is the prototype of printf(), how then can this compile?
Well, stdio is a part of the C standard library, and your compiler is nice
enough to link the implementation for you.
printf() does not automatically add a new line. Hence the additional `\n`.
As previously discussed, we return 0 to indicate the program terminated
successfully.
C has string literals but they come with some things to keep in mind. First, as you are probably aware, a string is actually an array of characters. How those characters are encoded can vary language to language, but in C a char is a single byte, and on practically every machine you will touch that means 8 bits holding good ol’ fashioned ASCII (strictly speaking, the standard only guarantees that a char is at least 8 bits, and it does not mandate ASCII).
#include <stdio.h>
int main(void) {
char *str = "hello, world!";
printf("%s\n", str);
return 0;
}
What’s with the *?
That, dear hearts, is how char * declares the str variable as a pointer
to a char.
So what is this pointer thing? Well, a string is a sequence of bytes
(char is 8-bit) stored in memory. Your CPU simply must know where in the
hell the sequence of bytes starts if it is ever to iterate over it. If it
cannot iterate over it, it simply cannot display it.
A pointer stores a memory address. On your
machine, it is likely a 64-bit unsigned integer. Why? Because your memory
addresses, where the data actually lives, are labeled with 64-bit unsigned
integers. So, to be more precise:
a pointer is just a variable that stores a memory address, which is just a number.
Let’s pretend for a moment that we’re on a 16-bit machine for simplicity; its memory addresses are 16-bit unsigned integers (we always represent pointers in hex). This is what your string would look like in memory:
0x0006: 'h'
0x0007: 'e'
0x0008: 'l'
0x0009: 'l'
0x000A: 'o'
0x000B: ' '
0x000C: 'w'
0x000D: 'o'
0x000E: 'r'
0x000F: 'l'
0x0010: 'd'
0x0011: '\n'
0x0012: '\0'
0x0013: 0x0006
Of course, those characters I have placed, 'h' and so forth, are actually numbers themselves: the ASCII character code of 'h' and so forth, i.e.,
0x0006: 0x68
Let’s direct our attention to the very last two entries here:
0x0012: '\0'
0x0013: 0x0006
The very last one, well, that’s your pointer.
It is absolutely critical to understand that a pointer is itself a variable which is stored in memory! Its value is a memory address!
Do you see how our pointer contains the memory address where ‘h’ is stored?
Okay, so what’s the deal with 0x0012? Well, that is what is known as the
null terminator. It is a special non-printable character that is used to
indicate the end of a string! We will show why this is done, and how to utilize
this fact, in due course (indeed printf() itself is leveraging it),
but I would like to mention that the compiler
adds this character automatically for string literals, i.e. it has been
added because you defined str as pointing to “hello, world!”. The key point
here is the use of double quotes; that is what we mean by a string literal.
These outputs I have given are what are sometimes known as “hex dumps”, and they
are a way of diagramming the state of memory as it physically exists. A few
notes on that.
Most hexdumps are formatted more like this (here is an actual segment of a hexdump from our compiled program):
00002000: 0100 0200 6865 6c6c 6f2c 2077 6f72 6c64  ....hello, world
00002010: 2100 2573 0a00 0000 011b 033b 2400 0000  !.%s.......;$...
The left column is the address of the first byte in the row. The right column
prints any values which just so happen to be printable characters (everything
else is shown as a dot).
It should be noted that not all memory addresses are
actually shown; it is up to the reader to understand that, say, the byte 0x68
(corresponding to the letter ‘h’) lives at the memory address
0x00002004 (use your finger and count to it, starting from zero). Do you see the
0x00 immediately following ‘!’? If you look at the right printout, you see it is
non-printable.
That corresponds to our ‘\0’ (null terminator). What follows that is
our format string, which contains what we will come to know as a
format specifier (the “%s”).
But where is the pointer? It is stored somewhere in this enormous hexdump (it’s 1033 memory addresses lol). I promise you it is in there, but the thing is, real hexdumps are always enormous no matter how small our source code is. The reason is that this is not just raw machine code: it has been embellished with additional information that the operating system (namely the kernel) uses at runtime to begin execution. In actuality, the pointer is defined very far away from the string literal, and the reason is that
string literals are stored in a read-only section of what is called static memory!
In the future, we may have the opportunity to look at the step of the
toolchain just before our program is turned into machine code, where it
exists in an intermediate language called assembly. In this format, it is
very clear that string literals are stored in a very different section
than stack-allocated variables. But this detail need not concern us for now,
this is already tl;dr.
tl;dr: For the interested reader, you can actually compile a program
into raw binary (which will of course be rejected by the kernel if you
were to attempt to run it). This process involves telling the compiler
that the standard library may not be available, compiling to an object file,
and then instructing the linker to output only the raw binary.
A slightly easier thing to do would be to compile to a so-called .COM file,
which was the very simple (and hilariously dangerous and exploitable)
headerless binary format of old DOS machines. Incidentally, the same kind of
headerless binary, running in 16-bit real mode, is what an x86 BIOS loads from
the boot sector before control is handed off to the operating system.
The modern practice of writing software which compiles to such binaries
to be run in this so-called real mode is often
called boot sector programming.
In any case, a popular tool which can help you to explore these different formats (typically used to examine assembler) is the entirely online application https://godbolt.org/.
String literals are always stored in the so called “read-only”
section of static memory. Once declared, a string literal CANNOT
be modified, i.e., you cannot index into a location and change that
character.
Okay… but the thing is… you can actually try. On most systems, the
following code is going to cause the kernel to kill our program with a
“segmentation fault”, which is a fancy way of saying “crash because we tried
to do something bad with memory”.
#include <stdio.h>
int main(void) {
char *hello = "Hello, world!\n";
hello[1] = 'i'; // BAD!
return 0;
}
So why did this happen? Well, because we are trying to access a read-only section of memory and write to it. The compiler didn’t stop us, so the operating system kernel did. Why didn’t the compiler try to stop us? For historical reasons, a string literal in C has the type “array of char” rather than “array of const char” (const did not exist in early C), so the assignment type-checks even though writing to a literal is undefined behavior. Well, in any case here is some idiomatic C to make your life easier:
#include <stdio.h>
int main(void) {
const char *hello = "Hello, world!\n";
hello[1] = 'i'; // compiler stops us :D
return 0;
}
You’ll note that when you compile this you now get an appropriate error message:
main.c:5:12: error: read-only variable is not assignable
    5 |   hello[1] = 'i';
      |   ~~~~~~~~ ^
Which is 1000000x better than us just crashing. Here we can actually see what
exactly went wrong.
The const keyword (or qualifier) is a promise we make to the compiler to
never alter the data we are initializing. However, it is important to note
that while we cannot alter the data which we point to in this case, we can
still redirect this pointer to another memory address, at which point the
data at that new address will not be modifiable through it either.
This subtlety rarely comes up but you should know. In any case, when you say
const, make sure you mean it. Here’s some code to prove the subtlety:
#include <stdio.h>
int main(void) {
const char *hello = "Hello, world!\n";
char *goodbye = "Goodbye, moon\n";
hello = goodbye;
printf("%s", hello); // Goodbye, moon
hello[0] = 'H'; // <-- compiler still says no
return 0;
}
We have been using stuff like %s and %x in our string literals, and you
will see %c and %d too. These things are “placeholders”, but they are nameless.
So if you come from the world of JavaScript or Python, know that what you
think of as an “interpolated string” is effectively what we are doing
here, save for the fact that the values are injected in order, as the
arguments to something like printf (the f stands for “format”, and
strings which contain format specifiers are often
called “format strings”).
When you say “%s”, you mean that the data you are passing is a string,
and printf then knows to walk that string, from the pointer to the null
terminator. When you say “%d” you mean a signed int (32-bit on most platforms)
in decimal format, “%c” the character representation of a small integer, etc, etc.
Why do you have to do this? C is weakly typed,
so again, at the end of the day, if you pass a string hello like
const char *hello = "Hello, world!\n";
printf("%s\n", hello);
if you were to not specify that you want to print a string, C only knows that you gave it some data. It has absolutely no additional context to draw from in order to format your output correctly. Sometimes this is useful! Want to know some ASCII character codes? Here:
#include <stdio.h>
int main(void) {
const char *abc = "abcdefghijklmnop";
while (*abc != '\0') printf("ASCII: %d\n", *abc++);
return 0;
}
This works because I am printing each character in the string as a number.
What about the whole incrementing abc thing? Well, it turns out, since strings
are arrays in the sense that they are stored in a contiguous memory region,
C affords us a nice bit of syntactic sugar where incrementing a pointer causes
the pointer to point to the next piece of data. This is an example of
pointer arithmetic, and we will talk about this more later.
Here is a table of the most common format specifiers.
Specifier  Description
%d         Decimal integer
%i         Integer, base 10
%u         Unsigned decimal integer
%f         Floating point number (with lowercase output)
%F         Floating point number (with uppercase output)
%e         Scientific notation (lowercase)
%E         Scientific notation (uppercase)
%g         Shortest representation of %f or %e
%G         Shortest representation of %F or %E
%x         Unsigned hexadecimal integer (lowercase)
%X         Unsigned hexadecimal integer (uppercase)
%o         Unsigned octal integer
%s         String of characters
%c         Character
%p         Pointer (address in hexadecimal)
%n         Number of characters written/read so far
%%         A literal ‘%’ character
You can even create your own apparently
TODO
Links to candbox related projects
What started as a simple parsing example is slowly becoming my attempt at a very stupid programming language…
There is an infinite supply of C programming resources
and I’ll note a few here in order of what I feel is the
most helpful.
Beej’s Guide to Network Programming is super famous but this one is just as incredible in my opinion. I wish I had been aware of these guides’ existence when I first started writing C. Beej’s writing style is incredibly easy to parse and he has an incredible sense of what students of the C language tend to struggle with. This guide can be read cover to cover without any boredom or dullness arising.
As I mentioned before, Beej’s most famous guide is the Network Programming one but I might as well link his page. My dude has a way of explaining the things.
Obviously…