Extremely efficient string interning solution for Rust crates.
String interning: The technique of representing all strings which are equal by a pointer or ID that is unique to the contents of those strings, such that an O(n) string equality check becomes an O(1) pointer equality check.
Interned strings in Stringleton are called "symbols", in the tradition of Ruby.
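To make the definition above concrete, here is a minimal, single-threaded sketch of string interning using only the standard library. The names `Interner`, `intern`, and `same_symbol` are hypothetical and chosen for illustration; this is not how Stringleton is implemented, only the general technique it builds on.

```rust
use std::collections::HashSet;

/// A toy interner: every distinct string is stored once and handed out as a
/// `&'static str` that points at the single shared copy.
#[derive(Default)]
pub struct Interner {
    strings: HashSet<&'static str>,
}

impl Interner {
    /// Returns the canonical copy of `s`, allocating (and leaking) it the
    /// first time this content is seen.
    pub fn intern(&mut self, s: &str) -> &'static str {
        if let Some(&existing) = self.strings.get(s) {
            return existing;
        }
        // Leaking is deliberate: interned strings live for the whole program.
        let leaked: &'static str = Box::leak(s.to_owned().into_boxed_str());
        self.strings.insert(leaked);
        leaked
    }
}

/// O(1) equality: compares address and length, never the bytes themselves.
pub fn same_symbol(a: &'static str, b: &'static str) -> bool {
    std::ptr::eq(a, b)
}
```

Because equal contents always map to the same leaked allocation, comparing two interned strings never has to look at their bytes. The deliberate leak is also why interning an unbounded set of strings grows memory without limit.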
- Ultra fast: Getting the string representation of a Symbol is a lock-free memory load. No reference counting or atomics involved.
- Symbol literals (sym!(...)) are "free" at the call-site. Multiple invocations with the same string value are eagerly reconciled on program startup using linker tricks.
- Symbols are tiny. Just a single pointer - 8 bytes on 64-bit platforms.
- Symbols are trivially copyable - no reference counting.
- No size limit - symbol strings can be arbitrarily long (i.e., this is not a "small string optimization" implementation).
- Debugger friendly: If your debugger is able to display a plain Rust &str, it is capable of displaying Symbol.
- Dynamic library support: Symbols can be passed across dynamic linking boundaries (terms and conditions apply - see the documentation of stringleton-dylib).
- no_std support: std synchronization primitives used in the symbol registry can be replaced with once_cell and spin. See below for caveats.
- serde support - symbols are serialized/deserialized as strings.
- Fast bulk-insertion of symbols at runtime.
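The "single pointer" and "trivially copyable" properties can be illustrated with a small standalone sketch. It assumes one layout that is consistent with the description above: a thin reference to a `&'static str` stored elsewhere. `ToySymbol`, `hello`, and `symbol_size` are hypothetical names, and the real `Symbol` type may be laid out differently.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for a symbol: a thin pointer to a `&'static str`
/// slot (here a `static`), rather than the fat `&str` itself.
#[derive(Clone, Copy, Debug)]
pub struct ToySymbol(&'static &'static str);

impl ToySymbol {
    /// Reading the string is a single pointer dereference.
    pub fn as_str(self) -> &'static str {
        *self.0
    }
}

/// Equality compares slot addresses, not string contents.
impl PartialEq for ToySymbol {
    fn eq(&self, other: &Self) -> bool {
        std::ptr::eq(self.0, other.0)
    }
}
impl Eq for ToySymbol {}

/// The canonical slot for one particular string.
static HELLO: &str = "Hello, World!";

pub fn hello() -> ToySymbol {
    ToySymbol(&HELLO)
}

/// Size of the toy symbol in bytes: one machine word.
pub fn symbol_size() -> usize {
    size_of::<ToySymbol>()
}
```

A plain `&str` is two words (pointer plus length), so the extra indirection is what keeps the handle at one word while still allowing strings of any length.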
Use Stringleton when:
- You have lots of little strings that you need to frequently copy and compare.
- Your strings come from trusted sources.
- You want good debugger support for your symbols.

Avoid Stringleton when:
- You have an unbounded number of distinct strings, or strings coming from untrusted sources. Since symbols are never garbage-collected, this is a source of memory leaks, which is a denial-of-service hazard.
- You need a bit-stable representation of symbols that does not change between runs.

Consider whether smol_str or cowstr is a better fit for such use cases.
Add stringleton as a dependency of your project, and then you can do:
use stringleton::{sym, Symbol};
// Enable the `sym!()` macro in the current crate. This should go at the crate root.
stringleton::enable!();
let foo = sym!(foo);
let foo2 = sym!(foo);
let bar = sym!(bar);
let message = sym!("Hello, World!");
let message2 = sym!("Hello, World!");
assert_eq!(foo, foo2);
assert_eq!(bar.as_str(), "bar");
assert_eq!(message, message2);
assert_eq!(message.as_str().as_ptr(), message2.as_str().as_ptr());

- std (enabled by default): Use synchronization primitives from the standard library. Implies alloc. When disabled, critical-section and spin must both be enabled (see below for caveats).
- alloc (enabled by default): Support creating symbols from String.
- serde: Implements serde::Serialize and serde::Deserialize for symbols, which will be serialized/deserialized as plain strings.
- debug-assertions: Enables expensive debugging checks at runtime - mostly useful to diagnose problems in complicated linker scenarios.
- critical-section: When std is not enabled, this enables once_cell as a dependency with the critical-section feature enabled. Only relevant in no_std environments. See the critical-section crate for more details.
- spin: When std is not enabled, this enables spin as a dependency, which is used to obtain global read/write locks on the symbol registry. Only relevant in no_std environments (and is a pessimization in other environments).
Stringleton tries to be as efficient as possible, but it may make different
tradeoffs than other string interning libraries. In particular, Stringleton is
optimized towards making the use of the sym!(...) macro practically free.
Consider this function:
fn get_symbol() -> Symbol {
    sym!("Hello, World!")
}

This compiles into a single load instruction. Using cargo disasm on x86-64 (Linux):

get_symbol:
  8bf0 mov rax, qword ptr [rip + 0x52471]
  8bf7 ret

This is "as fast as it gets", but the price is that all symbols in the program are deduplicated when the program starts. Any theoretically faster solution would need fairly deep cooperation from the compiler aimed at this specific use case.
Also, symbol literals are always a memory load. The compiler cannot perform
optimizations based on the contents of symbols, because it doesn't know how they
will be reconciled until link time. For example, while sym!(a) != sym!(a) is
always false, the compiler cannot eliminate code paths relying on that.
Stringleton relies on magical linker tricks (supported by linkme and ctor)
to minimize the cost of the sym!(...) macro at runtime. These tricks are
broadly compatible with dynamic libraries, but there are a few caveats:
- When a Rust dylib crate appears in the dependency graph, and it has stringleton as a dependency, things should "just work", due to Rust's linkage rules.
- When a Rust cdylib crate appears in the dependency graph, Cargo seems to be a little less clever, and the cdylib dependency may need to use the stringleton-dylib crate instead. Due to Rust's linkage rules, this will cause the "host" crate to also link dynamically with Stringleton, and everything will continue to work.
- When a library is loaded dynamically at runtime, and it does not appear in the dependency graph, the "host" crate must be prevented from linking statically to stringleton, because it would either cause duplicate symbol definitions, or worse, the host and client binaries would disagree about which Registry to use. To avoid this, the host binary can use stringleton-dylib explicitly instead of stringleton, which forces dynamic linkage of the symbol registry.
- Dynamically unloading libraries is extremely risky (dlclose() and similar). Unloading a library that has any calls to the sym!(..) or static_sym!(..) macros is instant UB. Such a library can in principle use Symbol::new(), but probably not Symbol::new_static().
To summarize:
- When no dynamic libraries are present in the project, it is always best to use stringleton directly.
- When only normal Rust dynamic libraries (crate-type = ["dylib"]) are present, it is also fine to use stringleton directly - Cargo and rustc will figure out how to link things correctly.
- cdylib dependencies should use stringleton-dylib. The host can use stringleton.
- When loading dynamic libraries at runtime, both sides should use stringleton-dylib instead of stringleton.
- Do not unload dynamic libraries at runtime unless you are really, really sure what you are doing.
Stringleton works in no_std environments, but it does fundamentally require
two things:
- Allocator support, in order to maintain the global symbol registry. The registry is a hashbrown hash map.
- Some synchronization primitives to control access to the global symbol registry when new symbols are created.
The latter can be supported by the spin and critical-section features:
- spin replaces std::sync::RwLock, and is almost always a worse choice when std is available.
- critical-section replaces std::sync::OnceLock with once_cell::sync::OnceCell, and enables the critical-section feature of once_cell. Using critical-section requires additional work, because you must manually link in a crate that provides the relevant synchronization primitive for the target platform.
Do not use these features unless you are familiar with the tradeoffs.
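To show why both kinds of primitive are needed, here is a rough sketch of a global registry built from `std::sync::OnceLock` (one-time initialization) and `std::sync::RwLock` (shared reads, exclusive writes). It uses a standard-library `HashSet` instead of hashbrown, and the names `REGISTRY`, `registry`, and `intern` are hypothetical; the crate's actual registry is not reproduced here.

```rust
use std::collections::HashSet;
use std::sync::{OnceLock, RwLock};

/// Process-wide set of interned strings, created lazily on first access.
static REGISTRY: OnceLock<RwLock<HashSet<&'static str>>> = OnceLock::new();

fn registry() -> &'static RwLock<HashSet<&'static str>> {
    REGISTRY.get_or_init(|| RwLock::new(HashSet::new()))
}

/// Interns `s`, taking only a read lock when the string is already known.
pub fn intern(s: &str) -> &'static str {
    // Fast path: shared read lock, no allocation.
    {
        let set = registry().read().unwrap();
        if let Some(&existing) = set.get(s) {
            return existing;
        }
    } // The read guard is released here, before the write lock is requested.

    // Slow path: exclusive write lock. Re-check, because another thread may
    // have inserted the same string between the two lock acquisitions.
    let mut set = registry().write().unwrap();
    if let Some(&existing) = set.get(s) {
        return existing;
    }
    let leaked: &'static str = Box::leak(s.to_owned().into_boxed_str());
    set.insert(leaked);
    leaked
}
```

In a no_std build, the same two roles would have to be filled by the replacements named above, which is why disabling std requires enabling both features.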
Stringleton works in WASM binaries, but since the wasm32-unknown-unknown target
does not support static constructors, the sym!(..) macro will fall back to a
slightly slower implementation that uses atomics and a single branch. (Note that
WASM is normally single-threaded, so atomic operations have no overhead.)
Please note that it is not possible to pass a Symbol across a WASM boundary,
because the host and the guest have different views of memory, and use separate
registries. However, it is possible to pass an opaque u64 representing the
symbol across such a boundary using Symbol::to_ffi() and
Symbol::try_from_ffi(). Getting the string representation of the symbol is
only possible on the side that owns the symbol.
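The idea of an opaque, validated handle can be sketched without the crate. In the toy model below, the handle is the address of a slot owned by one side's registry, and converting back succeeds only for handles that registry actually issued. `BoundarySymbol` and `SLOTS` are hypothetical; the method names mirror Symbol::to_ffi() and Symbol::try_from_ffi() from the text, but their real signatures and behavior are not reproduced here.

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;

/// Slots owned by this side, keyed by the address used as the opaque handle.
static SLOTS: Mutex<BTreeMap<u64, &'static &'static str>> = Mutex::new(BTreeMap::new());

#[derive(Clone, Copy, Debug)]
pub struct BoundarySymbol(&'static &'static str);

impl BoundarySymbol {
    /// Registers a new symbol (this sketch skips deduplication for brevity).
    pub fn new(s: &'static str) -> Self {
        let slot: &'static &'static str = Box::leak(Box::new(s));
        let handle = slot as *const &'static str as usize as u64;
        SLOTS.lock().unwrap().insert(handle, slot);
        BoundarySymbol(slot)
    }

    pub fn as_str(self) -> &'static str {
        *self.0
    }

    /// Opaque handle: the slot address widened to 64 bits.
    pub fn to_ffi(self) -> u64 {
        self.0 as *const &'static str as usize as u64
    }

    /// Accepts only handles issued by this registry. Unknown values are
    /// rejected without ever being dereferenced.
    pub fn try_from_ffi(handle: u64) -> Option<Self> {
        let found = SLOTS.lock().unwrap().get(&handle).copied();
        found.map(BoundarySymbol)
    }
}
```

The other side of the boundary can store and return the `u64`, but it has no way to read the string, which matches the restriction described above.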
The name is a portmanteau of "string" and "singleton".