- klox is a fork of the clox interpreter with a proof-of-concept demonstration of O(1) garbage collection.
- The tradeoff: Allocation and GC at O(1) cost to main thread, but dereference costs O(log32(n)) instead of O(1).
- The question: Does this performance tradeoff make for an interesting/desirable language runtime?
- Bonus question: Could the persistent and partially-persistent data structures of this work be leveraged for interesting language constructs (e.g. by-value semantics for large recursive structures like maps)?
- This took a lot of time to produce in the hope that it would be an interesting concept for a language runtime, so I would very much appreciate your feedback: dkopko at runbox.com.
- Build CB
$ cd c
$ make -j CBROOT=/your/path/to/cb
$ cd ..
$ ./util/test.py # Runs the test suite
$ ./c/BUILD/RelWithDebInfo/klox [filename] # Runs the REPL, or optionally provided script file
- Uses a power-of-2 sized ring for very fast sequential allocation.
- "Pointers" become offsets into a ring, dereferenced via a bitmasked add to the ring's base memory address.
- Objects of the target language (those items directly traversed/marked via the
garbage collector) use integer enumeration for their handles.
- Object handles are dereferenced via an O(log32(n)) lookup of
handle -> offset
.
- Object handles are dereferenced via an O(log32(n)) lookup of
- Program memory is tri-partite, consisting of regions A, B, and C within a
ring.
- At all times, regions B and C are read-only, and mutation of structures only occurs in region A.
- Newly created structures are allocated in A. Old structures in B or C must be copied to A before being mutated. (Copies are of small records, not large nor recursive structures.)
- Dereferencing an object handle checks regions A, B, C (in that order) until the desired target is resolved.
- O(1) Garbage Collection process:
- Main thread executes with 2 active regions. Regions A and B have useful data, region C is empty.
- Main thread reaches point of execution where GC is invoked.
- Regions shift: C = B, B = A, A = new region
- Request is sent to GC thread to compact live objects of B and C into "newB"
- Main thread continues execution, temporarily in a mode with 3 active regions. GC thread concurrently performs live object compaction into newB.
- GC thread completes compaction, responds to main thread with compacted newB region.
- Main thread reaches an instruction which observes and integrates GC
response.
- Regions shift: C = empty, B = newB.
- Main thread continues execution, again with only 2 active regions.
The shift in cost to instructions (measured in rdtsc ticks in my latop) is as follows (see ilat_compare.20210714-014511.out):
OP_INVOKE lat: 75.5 -> 264.7 (+ 250.5 %), Runtime%: 17.5 (+ 9.7), AbsRuntime: 24593582980 (+ 250.5 %)
OP_POP lat: 25.8 -> 23.2 (- 10.1 %), Runtime%: 12.1 (- 8.9), AbsRuntime: 17053609960 (- 10.1 %)
OP_CONSTANT lat: 26.7 -> 26.8 (+ 0.2 %), Runtime%: 11.3 (- 6.3), AbsRuntime: 15920016520 (+ 0.2 %)
OP_GET_PROPERTY lat: 44.9 -> 101.2 (+ 125.4 %), Runtime%: 10.0 (+ 3.1), AbsRuntime: 14069303340 (+ 125.4 %)
OP_GET_GLOBAL lat: 30.9 -> 96.3 (+ 211.8 %), Runtime%: 8.2 (+ 4.1), AbsRuntime: 11485070980 (+ 211.8 %)
OP_SET_GLOBAL lat: 42.5 -> 348.2 (+ 718.9 %), Runtime%: 6.1 (+ 4.9), AbsRuntime: 8603748740 (+ 718.9 %)
OP_EQUAL lat: 26.7 -> 35.5 (+ 33.0 %), Runtime%: 6.0 (- 1.1), AbsRuntime: 8486099980 (+ 33.0 %)
OP_RETURN lat: 27.2 -> 53.9 (+ 98.3 %), Runtime%: 4.8 (+ 1.0), AbsRuntime: 6733731096 (+ 98.3 %)
OP_GET_LOCAL lat: 25.1 -> 24.4 (- 2.7 %), Runtime%: 3.9 (- 2.4), AbsRuntime: 5490038904 (- 2.7 %)
OP_TRUE lat: 28.5 -> 28.5 (+ 0.1 %), Runtime%: 3.6 (- 2.0), AbsRuntime: 5132829560 (+ 0.1 %)
OP_NIL lat: 29.3 -> 28.5 (- 2.7 %), Runtime%: 3.5 (- 2.1), AbsRuntime: 4988193700 (- 2.7 %)
OP_CALL lat: 51.5 -> 139.5 (+ 171.0 %), Runtime%: 3.1 (+ 1.3), AbsRuntime: 4414534240 (+ 171.0 %)
OP_ADD lat: 28.7 -> 43.7 (+ 52.3 %), Runtime%: 3.0 (- 0.1), AbsRuntime: 4203289480 (+ 52.3 %)
OP_JUMP_IF_FALSE lat: 26.6 -> 34.1 (+ 28.3 %), Runtime%: 2.5 (- 0.5), AbsRuntime: 3522163660 (+ 28.3 %)
OP_LESS lat: 27.3 -> 41.2 (+ 51.0 %), Runtime%: 1.6 (- 0.1), AbsRuntime: 2216343500 (+ 51.0 %)
OP_SUBTRACT lat: 23.1 -> 38.8 (+ 68.0 %), Runtime%: 0.8 (+ 0.1), AbsRuntime: 1182540060 (+ 68.0 %)
OP_SET_PROPERTY lat: 222.1 -> 268.9 (+ 21.1 %), Runtime%: 0.8 (- 0.2), AbsRuntime: 1087259920 (+ 21.1 %)
OP_LOOP lat: 31.0 -> 34.3 (+ 10.5 %), Runtime%: 0.6 (- 0.2), AbsRuntime: 792237360 (+ 10.5 %)
OP_FALSE lat: 30.8 -> 31.7 (+ 2.9 %), Runtime%: 0.4 (- 0.2), AbsRuntime: 633526160 (+ 2.9 %)
OP_PRINT lat: 21348.4 -> 177190.3 (+ 730.0 %), Runtime%: 0.1 (+ 0.1), AbsRuntime: 89835500 (+ 730.0 %)
OP_NOT lat: 31.7 -> 38.4 (+ 21.3 %), Runtime%: 0.1 (- 0.0), AbsRuntime: 89677720 (+ 21.3 %)
OP_SUPER_INVOKE lat: 62.2 -> 158.0 (+ 154.2 %), Runtime%: 0.0 (+ 0.0), AbsRuntime: 52667300 (+ 154.2 %)
OP_GREATER lat: 26.2 -> 50.1 (+ 91.4 %), Runtime%: 0.0 (+ 0.0), AbsRuntime: 29524440 (+ 91.4 %)
OP_GET_UPVALUE lat: 41.8 -> 63.8 (+ 52.5 %), Runtime%: 0.0 (- 0.0), AbsRuntime: 21273020 (+ 52.5 %)
OP_JUMP lat: 30.3 -> 31.0 (+ 2.2 %), Runtime%: 0.0 (- 0.0), AbsRuntime: 21104460 (+ 2.2 %)
OP_SET_LOCAL lat: 29.4 -> 28.9 (- 1.6 %), Runtime%: 0.0 (- 0.0), AbsRuntime: 7251060 (- 1.6 %)
OP_CLOSURE lat: 898.9 -> 2138.5 (+ 137.9 %), Runtime%: 0.0 (+ 0.0), AbsRuntime: 583820 (+ 137.9 %)
OP_DEFINE_GLOBAL lat: 367.0 -> 1502.3 (+ 309.4 %), Runtime%: 0.0 (+ 0.0), AbsRuntime: 408620 (+ 309.4 %)
OP_METHOD lat: 312.9 -> 1655.4 (+ 429.0 %), Runtime%: 0.0 (+ 0.0), AbsRuntime: 276460 (+ 429.0 %)
OP_CLASS lat: 457.4 -> 973.3 (+ 112.8 %), Runtime%: 0.0 (+ 0.0), AbsRuntime: 96360 (+ 112.8 %)
OP_NEGATE lat: 36.4 -> 78.9 (+ 116.5 %), Runtime%: 0.0 (+ 0.0), AbsRuntime: 54340 (+ 116.5 %)
OP_INHERIT lat: 644.8 -> 1521.6 (+ 136.0 %), Runtime%: 0.0 (+ 0.0), AbsRuntime: 38040 (+ 136.0 %)
OP_CLOSE_UPVALUE lat: 104.4 -> 504.4 (+ 383.2 %), Runtime%: 0.0 (+ 0.0), AbsRuntime: 16140 (+ 383.2 %)
OP_MULTIPLY lat: 167.8 -> 457.8 (+ 172.8 %), Runtime%: 0.0 (+ 0.0), AbsRuntime: 8240 (+ 172.8 %)
OP_DIVIDE lat: 293.3 -> 644.4 (+ 119.7 %), Runtime%: 0.0 (+ 0.0), AbsRuntime: 5800 (+ 119.7 %)
OP_SET_UPVALUE lat: 184.0 -> 408.0 (+ 121.7 %), Runtime%: 0.0 (+ 0.0), AbsRuntime: 2040 (+ 121.7 %)
OP_GET_SUPER lat: 1540.0 -> 940.0 (- 39.0 %), Runtime%: 0.0 (- 0.0), AbsRuntime: 940 (- 39.0 %)
struct cb
- A "continuous buffer". This is a power-of-2-sized ring implementation with methods paralleling typical memory allocator routines. Allocations return acb_offset_t
into the ring, which can be dereferenced to a raw pointer viacb_at()
. Such raw pointers into this ring can become invalid due to resizing: if allocations within thestruct cb
exceed the available size, it will resize to the next larger power-of-2 size (and call acb_on_resize_t
callback, allowing a rewrite of raw pointers to be implemented).cb_offset_t
- A location within thestruct cb
ring. These follow Serial Number Arithmetic, so must be compared withcb_offset_cmp()
instead of standard comparison operations.struct cb_region
- A subregion of astruct cb
within which allocations can be made. (NOTE: This is a separate concept to the garbage collector's A/B/C regions which are implicit ranges ofcb_offset_t
.)struct cb_bst
- A partially-persistent red-black tree implementation. (The partial persistence feature is not leveraged in this POC.)struct structmap
- An O(log32(n))uint64_t->uint64_t
map. Used primarily to map theObjID
integers ofObj
allocations to theircb_offset_t
locations. This structure was inspired by Phil Bagwell's Ideal Hash Tries, a.k.a. "Hash Array Mapped Trees (HAMT)", however it removes the hashing and the dense packing.ObjID
a unique ID for an allocation of anObj
, which is a structure the GC will be responsible for collecting/compacting. These are generated inobjtable_add()
, which is primarily invoked throughassignObjectToID()
upon the creation ofObj
s. These are resolved viaobjtable_lookup()
, usually through anOID<T>::clip()
orOID<T>::mlip()
.- A key concept is being able to adequately size the newB
cb_region
which the GC thread will use as a destination for its compaction of live objects. A confounding factor is the padding needed to fulfill alignment concerns of objects, and how the needed padding may fluctuate depending on order of allocations. All objects therefore overestimate their size by including their maximal alignment needs, in patterns that look like the following:sizoef(struct X) + alignof(struct X) - 1
. - "Internal size" - The maximal allocation size (inclusive of alignment padding) of the internal nodes of a data structure.
- "External size" - The maximal allocation size (inclusive of alignment padding) of the referred-to entries a data structure. (NOTE: This is used only by the ObjTable, the mapping of ObjID integers to allocated objects. This is because the ObjTable is considered to own the objects it refers to, and so these objects will need to be considered to be part of the ObjTable's size for compaction sizing concerns.)
CBO<T>
- Acb_offset_t
reference to aT
. Dereferenced through an O(1)cb_at()
. Used to refer to raw allocations.OID<T>
- AnObjID
reference to aT
. Dereferenced through an O(log32(n))objtable_lookup()
. Used to refer to Obj allocations which are the purview of the garbage collector.RCBP<T>
- A raw pointer into astruct cb
ring which will be rewritten if thatstruct cb
were to be resized due to a new allocation. NOTE: At present, these may only exist outside of thestruct cb
ring, e.g. in a C-language stack frame.- How to read some of the dereferencing methods:
- cp() - "const pointer", O(1)
- mp() - "mutable pointer", O(1)
- clp() - "const local pointer", O(1)
- mlp() - "mutable local pointer", O(1)
- crp() - "const remote pointer" (pointing to a
struct cb
other thanthread_cb
), O(1) - mrp() - "mutable remote pointer" (pointing to a
struct cb
other thanthread_cb
), O(1) - clip() - "const local id-based pointer", O(log32(n))
- mlip() - "mutable local id-based pointer", O(log32(n)) + possible
deriveMutableObjectLayer()
copy.
deriveMutableObjectLayer()
- Copies an object from the read-only B or C regions to the mutable A region and updates the ObjTable, such that this Obj for this ObjID can be modified until the next freeze of the A region due to a GC. (NOTE: The reason it is considered a "mutable object layer" and not just a "mutable object" is because ObjClass methods and the ObjInstance fields maps will not be copied into the mutable region A. Instead, these Obj types will be copied with empty maps into the mutable section A. New methods/fields can be added, but lookups will check the map "layers" at each of the the A, B, and C regions. This preserves O(1) copying of ObjClass and ObjInstance objects.)- GC is presently still initiated (as in
clox
) by the main thread invokingreallocate()
and it choosing to callcollectGarbage()
. This will cause a shift of regions on the main thread and emission (viagc_submit_request()
) of a request to the GC thread that it should perform compaction. The GC thread waits ingc_main_loop()
for such requests to arrive, at which point it will perform its compaction to the request's destination region, and then respond back the main thread by posting a response viagc_submit_response()
. The main thread will observe and integrate any ready compacted result of a GC cycle duringOP_LOOP/OP_CALL/OP_INVOKE/OP_SUPER_INVOKE/OP_RETURN
instructions through a call tointegrate_any_gc_response()
. PIN_SCOPE
- Used to extend the lifetime ofstruct cb
-allocated data within it scope, so that it may still be referred to after a potential GC due to an allocation.
- The cb ring will resize if an allocation calls for more memory than is available. The expectation is that a program's running memory size can be stated in advance to a power-of-2 order of magnitude in order to avoid ring resizes, or else the program can pay the necessary penalty the resize.
- I intentionally avoided some optimization paths (e.g. labels as values for threaded interpretation) so that performance comparison of klox to clox would show only the impact of the new data structures and GC approach. The purpose of this POC is to evaluate the costs and tradeoffs of this approach to an O(1) garbage collector.
- I am aware that there are probably a few cases where
PIN_SCOPE
as used is insufficient protection, but these bugs do not undermine the overall concept of this POC and the test suite is passing. If this POC is considered worth expanding on, I expect it would be with translation of the concept to another language's runtime anyway, so I'm not convinced these gaps are worth fixing. - Q: Why are verbose names used for things (e.g. clip()) that could be implemented as operator->() overloads? A: These names are both greppable and make costs more apparent when read, facilitating optimization.
- Q: Why are there so many assert()s? A: I follow a style called "Assertion-Oriented Programming" (AOP). The completion of this work would not have been possible without heavy reliance on assertions. Even when one of my assert()-stated assumptions have been wrong, it has often taken hours to get to the bottom of such a bug. If these assert()s did not exist, there is absolutely no way I could have fixed the same bugs because they would have only shown up at some unrelated later point of execution, and I would have long ago tired of this effort.
- The CB project's README contains old notes which elaborate on some of the concepts. In those notes, the naming of the regions are inverted (the mutable region is called "C").
- The CB project also has a "structmap" class, which was an earlier
implementation that was abandoned. Aside from
struct cb
,struct cb_region
, andstruct cb_bst
, most of the remaining code of that project is abandoned.
This work is dedicated to my family, with a special thank you to my parents, who always supported me.