Stringleton

Extremely efficient string interning solution for Rust crates.

String interning: The technique of representing all equal strings by a pointer or ID that is unique to the contents of the string, such that an O(n) string equality check becomes an O(1) pointer equality check.

Interned strings in Stringleton are called "symbols", in the tradition of Ruby.
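To make the definition concrete, here is a minimal sketch of the general technique using only the standard library. It is not how Stringleton is implemented; `ToyInterner` and `same_symbol` are hypothetical names made up for this illustration. Like real symbols, the interned strings are leaked and never freed.

```rust
use std::collections::HashSet;

/// A toy interner (hypothetical, not Stringleton's registry): every distinct
/// string is stored exactly once and is never freed.
#[derive(Default)]
struct ToyInterner {
    strings: HashSet<&'static str>,
}

impl ToyInterner {
    /// Returns the canonical `&'static str` for `s`, leaking a copy the first
    /// time a given string is seen.
    fn intern(&mut self, s: &str) -> &'static str {
        if let Some(&existing) = self.strings.get(s) {
            return existing;
        }
        let leaked: &'static str = Box::leak(s.to_owned().into_boxed_str());
        self.strings.insert(leaked);
        leaked
    }
}

/// O(1) equality: compares the pointers (address and length), never the bytes.
fn same_symbol(a: &'static str, b: &'static str) -> bool {
    std::ptr::eq(a, b)
}
```

Interning "hello" twice through the same `ToyInterner` yields the same pointer, so `same_symbol` returns true without looking at the string contents, while two different strings always yield different pointers.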

Distinguishing characteristics

Ultra fast: Getting the string representation of a Symbol is a lock-free memory load. No reference counting or atomics involved.
Symbol literals (sym!(...)) are "free" at the call-site. Multiple invocations with the same string value are eagerly reconciled on program startup using linker tricks.
Symbols are tiny. Just a single pointer - 8 bytes on 64-bit platforms.
Symbols are trivially copyable - no reference counting.
No size limit - symbol strings can be arbitrarily long (i.e., this is not a "small string optimization" implementation).
Debugger friendly: If your debugger is able to display a plain Rust &str, it is capable of displaying Symbol.
Dynamic library support: Symbols can be passed across dynamic linking boundaries (terms and conditions apply - see the documentation of stringleton-dylib).
no_std support: std synchronization primitives used in the symbol registry can be replaced with once_cell and spin. See below for caveats.
serde support - symbols are serialized/deserialized as strings.
Fast bulk-insertion of symbols at runtime.

Good use cases

You have lots of little strings that you need to frequently copy and compare.
Your strings come from trusted sources.
You want good debugger support for your symbols.

Bad use cases

You have an unbounded number of distinct strings, or strings coming from untrusted sources. Since symbols are never garbage-collected, this is a source of memory leaks, which is a denial-of-service hazard.
You need a bit-stable representation of symbols that does not change between runs.
Consider whether smol_str or cowstr is a better fit for such use cases.

Usage

Add stringleton as a dependency of your project, and then you can do:

use stringleton::{sym, Symbol};

// Enable the `sym!()` macro in the current crate. This should go at the crate root.
stringleton::enable!();

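fn main() {
    // Identifiers and string literals are both accepted.
    let foo = sym!(foo);
    let foo2 = sym!(foo);
    let bar = sym!(bar);
    let message = sym!("Hello, World!");
    let message2 = sym!("Hello, World!");

    // Equal contents always yield the same symbol, backed by the same string data.
    assert_eq!(foo, foo2);
    assert_eq!(bar.as_str(), "bar");
    assert_eq!(message, message2);
    assert_eq!(message.as_str().as_ptr(), message2.as_str().as_ptr());
}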

Crate features

std (enabled by default): Use synchronization primitives from the standard library. Implies alloc. When disabled, critical-section and spin must both be enabled (see below for caveats).
alloc (enabled by default): Support creating symbols from String.
serde: Implements serde::Serialize and serde::Deserialize for symbols, which will be serialized/deserialized as plain strings.
debug-assertions: Enables expensive debugging checks at runtime - mostly useful to diagnose problems in complicated linker scenarios.
critical-section: When std is not enabled, this enables once_cell as a dependency with the critical-section feature enabled. Only relevant in no_std environments. See critical-section for more details.
spin: When std is not enabled, this enables spin as a dependency, which is used to obtain global read/write locks on the symbol registry. Only relevant in no_std environments (and is a pessimization in other environments).

Efficiency

Stringleton tries to be as efficient as possible, but it may make different tradeoffs than other string interning libraries. In particular, Stringleton is optimized towards making the use of the sym!(...) macro practically free.

Consider this function:

fn get_symbol() -> Symbol {
    sym!("Hello, World!")
}

This compiles into a single load instruction (plus the return). Using cargo disasm on x86-64 (Linux):

get_symbol:
  8bf0    mov  rax, qword ptr [rip + 0x52471]
  8bf7    ret

This is "as fast as it gets", but the price is that all symbols in the program are deduplicated when the program starts. Any theoretically faster solution would need fairly deep cooperation from the compiler aimed at this specific use case.

Also, symbol literals are always a memory load. The compiler cannot perform optimizations based on the contents of symbols, because it doesn't know how they will be reconciled until link time. For example, while sym!(a) != sym!(a) is always false, the compiler cannot eliminate code paths relying on that.

Dynamic libraries

Stringleton relies on magical linker tricks (supported by linkme and ctor) to minimize the cost of the sym!(...) macro at runtime. These tricks are broadly compatible with dynamic libraries, but there are a few caveats:

When a Rust dylib crate appears in the dependency graph, and it has stringleton as a dependency, things should "just work", due to Rust's linkage rules.
When a Rust cdylib crate appears in the dependency graph, Cargo seems to be a little less clever, and the cdylib dependency may need to use the stringleton-dylib crate instead. Due to Rust's linkage rules, this will cause the "host" crate to also link dynamically with Stringleton, and everything will continue to work.
When a library is loaded dynamically at runtime, and it does not appear in the dependency graph, the "host" crate must be prevented from linking statically to stringleton, because it would either cause duplicate symbol definitions, or worse, the host and client binaries would disagree about which Registry to use. To avoid this, the host binary can use stringleton-dylib explicitly instead of stringleton, which forces dynamic linkage of the symbol registry.
Dynamically unloading libraries (dlclose() and similar) is extremely risky. Unloading a library that has any calls to the sym!(..) or static_sym!(..) macros is instant UB. Such a library can in principle use Symbol::new(), but probably not Symbol::new_static().

To summarize:

When no dynamic libraries are present in the project, it is always best to use stringleton directly.
When only normal Rust dynamic libraries (crate-type = ["dylib"]) are present, it is also fine to use stringleton directly - Cargo and rustc will figure out how to link things correctly.
cdylib dependencies should use stringleton-dylib. The host can use stringleton.
When loading dynamic libraries at runtime, both sides should use stringleton-dylib instead of stringleton.
Do not unload dynamic libraries at runtime unless you are really, really sure what you are doing.

`no_std` caveats

Stringleton works in no_std environments, but it does fundamentally require two things:

Allocator support, in order to maintain the global symbol registry. This is a hashbrown hash map.
Some synchronization primitives to control access to the global symbol registry when new symbols are created.

The latter can be supported by the spin and critical-section features:

spin replaces std::sync::RwLock, and is almost always a worse choice when std is available.
critical-section replaces std::sync::OnceLock with once_cell::sync::OnceCell, and enables the critical-section feature of once_cell. Using critical-section requires additional work, because you must manually link in a crate that provides the relevant synchronization primitive for the target platform.

Do not use these features unless you are familiar with the tradeoffs.
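To show what the two synchronization primitives are responsible for, here is a rough sketch of a global registry built on the std types named above. It is an illustration under assumptions, not Stringleton's actual code: `REGISTRY`, `registry` and `intern_global` are hypothetical names, and a std `HashSet` stands in for the hashbrown map. In a no_std build, the spin and once_cell types would take the places of `RwLock` and `OnceLock`.

```rust
use std::collections::HashSet;
use std::sync::{OnceLock, RwLock};

/// One-time initialization (the `OnceLock` role) of a process-wide set that is
/// guarded by a read/write lock (the `RwLock` role). Hypothetical stand-in.
static REGISTRY: OnceLock<RwLock<HashSet<&'static str>>> = OnceLock::new();

fn registry() -> &'static RwLock<HashSet<&'static str>> {
    REGISTRY.get_or_init(|| RwLock::new(HashSet::new()))
}

/// Interns `s`, taking the exclusive write lock only when the string is new.
fn intern_global(s: &str) -> &'static str {
    {
        // Fast path: a shared read lock is enough for strings already known.
        let set = registry().read().unwrap();
        if let Some(&existing) = set.get(s) {
            return existing;
        }
    }
    // Slow path: take the write lock and check again, because another thread
    // may have inserted the same string in the meantime.
    let mut set = registry().write().unwrap();
    if let Some(&existing) = set.get(s) {
        return existing;
    }
    let leaked: &'static str = Box::leak(s.to_owned().into_boxed_str());
    set.insert(leaked);
    leaked
}
```

Both requirements from the list are visible here: the `Box::leak` call needs an allocator, and the lock is only contended when new symbols are created.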

WASM caveats

stringleton works in WASM binaries, but since the wasm32-unknown-unknown target does not support static constructors, the sym!(..) macro will fall back to a slightly slower implementation that uses atomics and a single branch. (Note that WASM is normally single-threaded, so atomic operations have no overhead.)
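As an illustration of what "atomics and a single branch" can look like, here is a sketch of a lazily resolved call site. This is a guess at the general shape, not the code the macro expands to: `CallSite`, `intern_slow` and `SITE_REGISTRY` are hypothetical names, and the linear-scan registry is deliberately simplistic.

```rust
use std::sync::atomic::{AtomicPtr, Ordering};
use std::sync::Mutex;

/// Simplistic stand-in registry (hypothetical): a list of leaked entries.
static SITE_REGISTRY: Mutex<Vec<&'static &'static str>> = Mutex::new(Vec::new());

/// Slow path: find or create the canonical entry for `s`.
fn intern_slow(s: &'static str) -> &'static &'static str {
    let mut reg = SITE_REGISTRY.lock().unwrap();
    for &entry in reg.iter() {
        if *entry == s {
            return entry;
        }
    }
    let entry: &'static &'static str = Box::leak(Box::new(s));
    reg.push(entry);
    entry
}

/// One instance per literal in the source; starts out unresolved (null).
struct CallSite {
    literal: &'static str,
    cached: AtomicPtr<&'static str>,
}

impl CallSite {
    const fn new(literal: &'static str) -> Self {
        Self { literal, cached: AtomicPtr::new(std::ptr::null_mut()) }
    }

    /// One atomic load plus one branch once the site has been resolved.
    fn get(&self) -> &'static &'static str {
        let ptr = self.cached.load(Ordering::Acquire);
        if !ptr.is_null() {
            // SAFETY: non-null values stored in `cached` always come from
            // `intern_slow`, which only returns leaked `'static` references.
            return unsafe { &*ptr };
        }
        let entry = intern_slow(self.literal);
        let raw = entry as *const &'static str as *mut &'static str;
        self.cached.store(raw, Ordering::Release);
        entry
    }
}
```

Two call sites with the same literal resolve to the same entry on first use, and every later call only pays for the load and the null check.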

Please note that it is not possible to pass a Symbol across a WASM boundary, because the host and the guest have different views of memory, and use separate registries. However, it is possible to pass an opaque u64 representing the symbol across such a boundary using Symbol::to_ffi() and Symbol::try_from_ffi(). Getting the string representation of the symbol is only possible on the side that owns the symbol.
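The following toy model sketches why the opaque u64 is only meaningful to the side that owns the symbol. It does not use the real Symbol::to_ffi() and Symbol::try_from_ffi(), whose exact signatures are not shown here; `ToySymbol` and `ToyRegistry` are hypothetical, and the assumption that unknown values are rejected is one plausible reading of the `try_` prefix.

```rust
use std::collections::HashMap;

/// Toy symbol (hypothetical): a single thin pointer to a leaked `&'static str`.
#[derive(Clone, Copy, Debug)]
struct ToySymbol(&'static &'static str);

impl ToySymbol {
    /// The opaque value is just the pointer's address widened to 64 bits.
    fn to_ffi(self) -> u64 {
        self.0 as *const &'static str as usize as u64
    }

    fn as_str(self) -> &'static str {
        *self.0
    }
}

/// Toy registry (hypothetical): one per side of the boundary.
#[derive(Default)]
struct ToyRegistry {
    by_text: HashMap<&'static str, ToySymbol>,
    by_addr: HashMap<u64, ToySymbol>,
}

impl ToyRegistry {
    fn intern(&mut self, s: &str) -> ToySymbol {
        if let Some(&sym) = self.by_text.get(s) {
            return sym;
        }
        let text: &'static str = Box::leak(s.to_owned().into_boxed_str());
        let sym = ToySymbol(Box::leak(Box::new(text)));
        self.by_text.insert(text, sym);
        self.by_addr.insert(sym.to_ffi(), sym);
        sym
    }

    /// Accepts only values that this registry handed out (an assumption of
    /// this sketch); anything else is rejected instead of being dereferenced.
    fn try_from_ffi(&self, value: u64) -> Option<ToySymbol> {
        self.by_addr.get(&value).copied()
    }
}
```

In this model, a value produced by the host's registry round-trips on the host, but the guest's registry does not recognize it, so the guest can carry the number around and hand it back without ever reading the string behind it.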

Name

The name is a portmanteau of "string" and "singleton".

simonask/stringleton