Flatritrie

This repository contains a set of C++ templates implementing an optimized Trie structures for matching IP addresses to metadata (like GeoIP, routing, ID, etc.). This structures (Tritrie and Flatritrie) were designed to:

return the most specific match (a longest-prefix match),
work with 32-bit (IPv4) and 128-bit (IPv6) keyspace,
be queried fast, at the expense of high memory usage and bigger load time,
work with DPDK to handle high network throughput.

As a result they also:

are immutable after construction,
can have multiple lockless readers (since immutable).

Tritrie is a sparse trie that matches IP addresses by 1 to 8 bits at a time. The “tri” in the name works best for 3 BITS obviously. Flatritrie is a similar, flattened version.

Made them out of curiousity if simple enough algorithm can outperform a hashmap, which I though would be fast for small datasets. It’s not. For certain datasets and query patterns they seem to outperform (at least one implementation of) LC-trie.

Context

Matching an IP address to some additional metadata is a known and a common problem. Hardware routers do it with a TCAM, Linux kernel does it with LC-tries (Level-Compressed tries). IP like a.b.c.d can be matched by a.0.0.0/8 or a.b.c.0/24 entry and we will usually want to use the most specific match available.

Simplest, a bit naive, but quite fast solution (given enough RAM) is to expand all network addresses from the most generic (small mask) to the most detailed and put them in a hashmap - but that won’t scale to IPv6.

The idea was to check whether trivial querying operations (few lookups, on huge structure in RAM) could outperform a more memory-compact solutions but with a more complicated query routine. And to gain some knowledge about how the C++ evolved after the 2000.

Query example

Tritrie<3> or Flatritrie<3> will match address 3 bits at a time, for at most 11 memory accesses for IPv4. With following entries:

123.250.0.0/16 -> 300
123.250.85.17/32 -> 400

Querying for 123.250.85.17, will:

Traverse first 5 levels without finding anything: 011 -> 110 -> 111 -> 111 -> 101
At 6th level (001) find a match “300”.
If nothing more would match, the 300 would be returned as a best match.
Match rest of the levels: 010 -> 101 -> 000 -> 100 -> 01 and at the last one, match 400.
Return 400 as the best match.

Querying Tritrie and Flatritrie works conceptually in the same way. Flatritrie tries to pack the data better in the memory to achieve better cache locality. It’s construction is similar to a “Finite State Automaton” with states defined in a table.

Construction by example

Inserting into Tritrie<B> will start from the Trie Root, in each step take B bits from the address and check if there’s a node at current level. If there’s none - new entry is allocated and linked into the Trie. Pretty much generic trie with one quirk.

The inserted mask can split the last “level”. During the insertion of an IP [111][101][10X] (mask /8) Tritrie<3> will traverse first 2 complete levels, and then insert two nodes, one for [100] and another one for [101]. If there’s already a node (for example [111][101][100]/9) it will be overwritten. To achieve correct behaviour, the input data needs to be inserted from the most generic masks, to most specific.

Flatritrie build it’s structure based on an existing Tritrie.

Benchmarks.

Performance benchmarks in benchmark.cpp use a real-life data - a GeoIP database: GeoLite2 by MaxMind, available from https://www.maxmind.com. (Creative Commons Attribution-ShareAlike 4.0).

Test executed on AMD Ryzen 7 1700X, with 32GB RAM Corsair MK32GX4M2A2666C16 - 2133MT/s (according to dmidecode), board B350M-A.

Middle-sized data

Data set is created by using only GeoIP entries for Poland. They consist of 5469 entries from /12 to /32 for a total of 21250764 IPs when expanded. Benchmarks are sensitive to data, query depths, etc.

Structure, memory usage, generation time and performance in millions queries per second (Mq/s) and nanoseconds per query (ns/q):

Structure	Gen	Rnd 1.5%	Rnd 100%	Rep Pos	Rep Neg	RAM
Hashmap	1s	12 Mq/s	19.1	233	233	949MB
		80.5 ns/q	52.3	4.28	4.29
Tri	4.28ms	39.8	11.3	18	18.5	0.9MB

TriTrie<8>	46ms	105.7	26.2	170.5	197.8	150MB
		9.46	38.18	5.86	5.05
Flatritrie<8>	46 +	131.955	30.64	150.17	197.9	160MB
	73ms	7.57	32.64	6.65	5.05
Tritrie<6>	10ms	130.3	34.35	106	129	19MB

Flatritrie<6>	3.67 +	138	34.19	121.1	129.4	21MB
	8.22ms	7.24	29.24	8.25	7.72
Flatritrie<4>	2.51 +	95.87	40.22	91.64	96
	1.86

Fri/flatri have suboptimal implementation (for eg. CLZ using a loop). See lower for lc-trie comparison.

The tests are as follows (see benchmark.cpp):

Rnd 1.5%: Query completely random addresses using fast pseudo-random generator. Only 1.475-1.485% queries match some IP.
Rnd 100%: Query IPs at random, but from the dataset with 100% matches. This uses a data set prepared beforehand by selecting 5mln times a random subnet from the test data and generating a random IP from it.
Rep Pos - Ask for an IP defined deeply with /32 mask which exists in dataset repetively (the same IP each time).
Rep Neg - Repetively ask for IP which is not in the set.

Relation between number of bits matched at each level, performance and RAM (in million queries per second) for Flatritrie:

BITS	3	4	4U	5	6	7	8
Rnd 1.5%	87.92	94.56	95.4	128.6	140.75	111	127.2
Rnd 100%	30.01	38.08	39.47	32.37	34.12	24.7	31.4
Rep Pos	67.15	91.81	119.48	102.12	121.03	144.6	179.2
Rep Neg	69.64	96.45	110.9	109.1	129.58	156.9	198.1
RAM	21MB	23MB	23MB	29MB	39MB	180MB	180MB

4U being the union version, which uses union and struct instead of bitshifts to read IP address nibble by nibble. I didn’t check how compiler compiles it.

Whole GeoIP Database

Additional test loading full GeoIP database and doing a random queries with 97.92% matched.

Algo	RAM	Mq/s	Build time [ms]
Flatritrie<3>	62MB	35	108 + 66
Flatritrie<4> Union	155MB	36	154 + 141
Flatritrie<6>	1.3GB	27	554 + 810

7 and 8 bits seem infeasible.

Comparison with LCtree implementation

Benchmark with a BSD-licensed LC-trie implementation: https://github.com/chuckination/lctrie.git

Data sets based on parts of BGP routing table from lctrie repo:

Tiny set: 5000 entries, 38.2% matched during testing (random IP shifted right 4 bits).
Big: 590040 entries, 48.6% matched (random IP shifted right 4 bits).
GeoIP: 336109 entries, 96.9% matched.

Set	Tritrie 8b	Flat 8b	Flat 8b, union	Flat 4b, union	LCtrie
Tiny	67 Mq/s	75 Mq/s	79.6 Mq/s	77.4 Mq/s	62.5 Mq/s
Big	49 Mq/s	59 Mq/s	62.1 Mq/s	67.4 Mq/s	50.7 Mq/s
GeoIP	-	-	-	36.4 Mq/s	62.7 Mq/s

Flat 4b union uses a union and struct to read IP nibble by nibble instead of bit shifting - I didn’t check what compilator does with it though.

LCtrie is close in terms of query performance, uses WAY less memory, builds faster - but can sometimes be beaten on smaller datasets where query algorithm simplicity seems to be advantageous.

Full GeoIP dataset doesn’t fit into memory when using 8 BIT levels.

Next steps

Flatritrie and Tritrie aren’t much different in benchmarks. Possibly using Flatritrie table allocator in Tritrie would bury the difference. Allocating entries grouped by mask ranges might be sensible.

Flatritrie, on the other hand, could optimise table positions with after-creation knowledge to gain better cache locality than it does currently.

Precalculating statistics for data and choosing levels optimally.

Version <8> and <4> can be simpler, using an union instead of bitshifts and work directly with network-byte order addresses.

blaa/flatritrie