The subject.md
file contains exercises for this implementation if you want to
do it yourself. In that case, don't peek at the solutions in vec.c
beforehand!
The C-standard library doesn't offer a good dynamic data structure often found
in other languages such vector
in C++ or Vec
in Rust. This is an
overview of the idea, design and implementation of such data structure in C.
When we talk about a dynamic data structures, we usually mean a container with a growing memory buffer. Such a buffer has a certain strategy to allocate more memory when the existing memory runs out. A typical strategy is to double the size of the buffer each time a limit is reached. This way we minimize the number of reallocations in the buffer which is our most expensive operation. While we still have space left in the buffer we can append to the buffer with negligible cost. When we reach the limit and reallocate, we incur the cost of the allocating the space for a new buffer and copying over the data from the old buffer.
Such data structures can be typed, but in this tutorial we will make our best to mimic the usefulness of vectors in other languages by providing a generic interface which will accept any type of element.
Now let's implement a dynamic vector data structure. Let's consider the
equivalent Rust called Vec
. Vec
implements tons of functions, but in essence
it's a growable and shrinkable buffer of elements of type <T>
, and the main operations
are push
and pop
. Pushing a new element to the end of the buffer and
"popping" a.k.a. removing the element from the end and returning you a pointer to the element (this is an important detail). In addition, we implement the insert
and remove
operations that allow you to specify an index to which the element is added or which is removed. Furthermore, we usually have a get
method, or we can access the elements by their index using the []
syntax.
It's important to note that there is a difference in performance between the operations. Adding to the end of the array and deleting from the end will always be cheapest. Everything else, such as inserting an element at the beginning or in the middle of the array, will incur the penalty of copying data around.
However, in various tests I've made, those penalties have been quite small compared to the penalties that incur with different data structures claiming to solve this problem. Sure you can prepend to a linked list without removing or shuffling around elements, but in a dynamic vector you are dealing with data that is all laid out sequantially in memory. This is important because modern processors are very fast when the required data can be found in the cache. In a linked list or other pointer-type list, the cache misses from pointer indirection far outweigh any other gains in my experience.
First, let's talk a little about design patterns. We should adopt conventions that make our API easy to reason with and practices that make the code safe and easy to debug.
- Allocation and deallocation happen at the same level as the variable declaration.
This first pattern helps to avoid unnecessary allocations leading to increased performance as well as easier time with leaks. Let's see what it means.
// This "constructor" returns an allocated pointer to a `t_foo`. But what if
// we didn't need a pointer, because we use t_foo in the caller's scope only?
t_foo *create_a_foo(int foosize);
// We could just return a `t_foo`, but now the problem is the error condition.
// You can't return a NULL if something goes wrong.
t_foo create_a_foo(int foosize);
// This is how we create a `t_foo` without allocating a pointer (if it wasn't
// already allocated) and we can return an integer representing some error condition.
int create_a_foo(t_foo *foo, int foosize);
// ...or if we want to be passing foo around directly as a parameter to
// another function, we can return a pointer to `t_foo`.
t_foo *create_a_foo(t_foo *foo, int foosize);
- Parameter order is (dst, src, params, fpointers).
We always have the destination before the source if a destination pointer or variable is applicable.
int foo(char *dst, char *src, int len, (*bar)(char *));
We create a struct called s_vec
and typedef it to t_vec
.
typedef struct s_vec
{
unsigned char *memory; // Pointer to the first byte of allocated memory.
size_t elem_size; // Size of a vector element in bytes.
size_t alloc_size; // Total size of allocated bytes.
size_t len; // Length of the used-up part of the vector in
// `elem_size` chunks.
} t_vec;
When we access elements in the vector our bounds are 0 -> len - 1. We might have allocated more memory in total, but we will only access memory in the byte-range 0 -> len * elem_size.
Here is our vec.h
header file with implementation prototypes;
#ifndef VEC_H
# define VEC_H
#include "stdlib.h"
#include "unistd.h"
#include "string.h"
#include "stdbool.h"
typedef struct s_vec
{
unsigned char *memory;
size_t elem_size;
size_t alloc_size;
size_t len;
} t_vec;
int vec_new(t_vec *src, size_t init_len, size_t elem_size);
void vec_free(t_vec *src);
int vec_from(t_vec *dst, void *src, size_t len, size_t elem_size);
int vec_resize(t_vec *src, size_t target_len);
int vec_clear(t_vec *src);
int vec_push(t_vec *src, void *elem);
int vec_pop(void *dst, t_vec *src);
int vec_copy(t_vec *dst, t_vec *src);
void *vec_get(t_vec *src, size_t index);
int vec_insert(t_vec *dst, void *elem, size_t index);
int vec_remove(t_vec *src, size_t index);
int vec_append(t_vec *dst, t_vec *src);
int vec_prepend(t_vec *dst, t_vec *src);
void vec_iter(t_vec *src, void (*f) (void *));
void *vec_find(t_vec *src, bool (*f) (void *));
int vec_map(t_vec *dst, t_vec *src, void (*f) (void *));
int vec_filter(t_vec *dst, t_vec *src, bool (*f) (void *));
int vec_reduce(void *dst, t_vec *src, void (*f) (void *, void *));
void vec_sort(t_vec *src, int (*f)(void *, void *));
#endif
Function vec_new
will take an allocated pointer dst
and
allocate len * elem_size amount of bytes in the buffer.
int main(void)
{
t_vec v;
int ret;
ret = vec_new(&v, 10, sizeof(int));
if (ret < 0)
printf("Error!");
}
Deallocation:
int main(void)
{
t_vec v;
int ret;
ret = vec_new(&v, 10, sizeof(int));
if (ret < 0)
printf("Error!");
vec_free(&v);
}
A handy from
method that creates a vec out data passed in as a pointer.
int main(void)
{
int vals[] = {1, 2, 3};
t_vec v;
int ret;
ret = vec_from(&v, vals, 3, sizeof(int));
if (ret < 0)
printf("Error!");
vec_free(&v);
}
Resizing is used by many functions that we will implement later, so better do it
now. In the resizing function it's handy to call the vec_copy
method so we'll
implement that as well.
The copy function is very simple and will only copy at most as many bytes as are
available in the dst
vector.
int main(void)
{
int vals[] = {1, 2, 3};
t_vec v;
t_vec d;
vec_from(&v, vals, 3, sizeof(int));
vec_new(&v, 3, sizeof(int));
vec_copy(&d, &v);
vec_free(&v);
vec_free(&d);
}
The resizing function vec_resize
will take in a target_size
parameter and either shrink
(destructively) or grow the vector to the target size copying the old contents
over to the new alloaction.
static int vec_resize(t_vec *src, size_t target_size);
We have several functions that we use to manipulate the vector. We can push
elements to the end of the array or pop
them from the end or get
a pointer
to an element at any index in the vector. These are the least expensive
operations to perform.
int main(void)
{
int vals[] = {1, 2, 3};
int ret;
int *ptr;
t_vec v;
vec_from(&v, vals, 3, sizeof(int));
vec_push(&v, &vals[0]);
vec_push(&v, &vals[1]);
vec_pop(&ret, &v);
ptr = vec_get(&v, 0);
vec_free(&v);
}
A little more involved operation is inserting or removing an element from any index in the vector.
int main(void)
{
int vals[] = {1, 2, 3};
t_vec v;
vec_from(&v, vals, 3, sizeof(int));
vec_insert(&v, &vals[0], 2);
vec_remove(&v, 2);
vec_free(&t1);
}
When we iterate our vector we usually use iterator functions. It's quite surprising how many use cases these few simple functions cover, reducing errors and boilerplate code.
A simple iterator takes in a function pointer that is called for each element of the vector, with a pointer to the element passed as an argument. Any modifications to the element will modify the original values.
void print_int(void *src)
{
printf("%d\n", *(int *)src);
}
int main(void)
{
t_vec t1;
int base[] = {1, 2, 3, 4, 5};
vec_from(&t1, base, 5, sizeof(int);
vec_iter(&t1, print_int);
vec_free(&t1);
}
A map
function which copies over the current elements (this operation is
non-destructive), sends each element to f
and then appends the element to a new vector dst
.
void *add_one(void *src)
{
*(int *)src += 1;
}
int main()
{
t_vec values;
t_vec mapped_values;
vec_new(&values, 10, sizeof(int));
vec_new(&mapped_values, 0, sizeof(int));
for (int i = 0; i < 10; i++)
vec_push(&values, &i);
vec_map(&mapped_values, &values, add_one);
}
Next a filtering function. Now the function pointer will return true
or false
indicating if the element should be added to the resulting vector.
bool filter_even(void *src)
{
if (*(int *)src % 2 == 0)
return (true);
return (false);
}
int main()
{
t_vec values;
t_vec even;
vec_new(&values, 10, sizeof(int));
vec_new(&even, 0, sizeof(int));
for (int i = 0; i < 10; i++)
vec_push(&values, &i);
vec_filter(&even, &values, filter_even);
}
Finally, a function that iterates the elements of the vector, and at each iteration,
applies the function f
that gets the acc
element passed to vec_reduce
as the first
parameter (the "accumulator"), and the current element as the second parameter. With the accumulator, the results of iterating over the elements are reduced to one element such as in summing a list of integers.
void sum(void *acc, void *elem)
{
*(int *)acc += *(int *)elem;
}
int main()
{
t_vec t1;
int base[] = {1, 2, 3, 4, 5};
int result = 0;
vec_from(&t1, base, 5, sizeof(int));
vec_reduce(&result, &t1, reduce_tester);
assert(result == 15);
vec_free(&t1);
}
That's a quick overview on how to implement a performant and easy-to-use
dynamic data structure in C. This structure can be used as a fundamental
building block in many programs. The implementation has been kept simple
because frankly, it doesn't have to be more complicated. New functions
could definitely be added to grow the functionality of t_vec
even
further.