
Optimized block functions for the ChaCha stream cipher



This is an optimized library for ChaCha, a stream cipher with a 256-bit key and a 64-bit nonce.

HChaCha is also implemented, which is used to build XChaCha, a variant which extends the nonce from 64 bits to 192 bits. See Extending the Salsa20 nonce.

The fastest implementation for the underlying CPU that passes the internal tests is selected at runtime.

All assembler is PIC safe.

If you encrypt anything without using a MAC (HMAC, Poly1305, etc), you will be found, and made fun of.


The library can be initialized, i.e. the fastest implementation that passes the internal tests is selected, in two ways, neither of which is thread safe:

  1. int chacha_startup(void); explicitly initializes the library and returns a non-zero value if no implementation passes the internal tests.

  2. Do nothing and use the library as normal. It will auto-initialize itself when needed, and hard exit if no suitable implementation is found.


Common assumptions:

  • chacha_key, chacha_iv, and chacha_iv24 variables can be accessed through their b member, which is an array of unsigned bytes.

  • rounds is an even number 2 or greater.

  • If in is NULL, the keystream itself is stored to out (useful for things like random number generation or generating intermediate keys).


in and out are assumed to be word aligned. Incremental support has no alignment requirements, but will obviously slow down if non word-aligned pointers are passed.
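Because the one-shot functions assume word-aligned buffers, a caller may want to check pointers before picking the one-shot or the incremental path. A minimal sketch, assuming the caller knows the word size for its target; is_aligned is a hypothetical helper and not part of the library:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of chacha.h.
 * Returns 1 if p is a multiple of `word` bytes, 0 otherwise.
 * `word` must be a non-zero power of two (e.g. sizeof(size_t));
 * which word size the library expects is an assumption left to the caller.
 */
static int is_aligned(const void *p, size_t word) {
    assert(word != 0 && (word & (word - 1)) == 0);
    return ((uintptr_t)p & (uintptr_t)(word - 1)) == 0;
}
```

If either pointer fails the check, the incremental functions are the safer choice, since they accept unaligned buffers at some cost in speed.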

void chacha(const chacha_key *key, const chacha_iv *iv, const uint8_t *in, uint8_t *out, size_t inlen, size_t rounds);

void xchacha(const chacha_key *key, const chacha_iv24 *iv, const uint8_t *in, uint8_t *out, size_t inlen, size_t rounds);

Encrypts inlen bytes from in to out, using key, iv, and rounds.


Incremental in and out buffers are not required to be word aligned. Unaligned buffers will require copying to aligned buffers however, which will obviously incur a speed penalty.

void chacha_init(chacha_state *S, const chacha_key *key, const chacha_iv *iv, size_t rounds);

void xchacha_init(chacha_state *S, const chacha_key *key, const chacha_iv24 *iv, size_t rounds);

Initializes the chacha_state with key, iv, and rounds, and sets the internal block counter to 0.

size_t chacha_update(chacha_state *S, const uint8_t *in, uint8_t *out, size_t inlen);

size_t xchacha_update(chacha_state *S, const uint8_t *in, uint8_t *out, size_t inlen);

Generates/xors up to inlen + 63 bytes depending on how many bytes are in the internal buffer, and returns the number of encrypted bytes written to out.

size_t chacha_final(chacha_state *S, uint8_t *out);

size_t xchacha_final(chacha_state *S, uint8_t *out);

Generates/encrypts any leftover data in the state to out, and returns the number of bytes written.


void hchacha(const uint8_t key[32], const uint8_t iv[16], uint8_t out[32], size_t rounds);

Computes HChaCha into out, using key, iv, and rounds.
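For readers who want to see what HChaCha computes, here is a portable reference sketch. It is not the library's optimized code: hchacha_ref and its helpers are hypothetical names, and the layout (constants, key, 16-byte iv, output taken from the first and last four state words) follows the usual HChaCha definition rather than anything stated in this README.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static uint32_t rotl32(uint32_t x, int n) {
    return (x << n) | (x >> (32 - n));
}

/* Little-endian load/store, independent of host byte order. */
static uint32_t load32_le(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static void store32_le(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* ChaCha quarter round on four state words. */
#define QR(a, b, c, d)                      \
    do {                                    \
        a += b; d ^= a; d = rotl32(d, 16);  \
        c += d; b ^= c; b = rotl32(b, 12);  \
        a += b; d ^= a; d = rotl32(d, 8);   \
        c += d; b ^= c; b = rotl32(b, 7);   \
    } while (0)

/* Hypothetical reference routine; same argument order as hchacha(). */
static void hchacha_ref(const uint8_t key[32], const uint8_t iv[16],
                        uint8_t out[32], size_t rounds) {
    uint32_t x[16];
    size_t i;

    assert(rounds >= 2 && rounds % 2 == 0);

    /* "expand 32-byte k" constants, then key, then the 16-byte iv. */
    x[0] = 0x61707865u; x[1] = 0x3320646eu;
    x[2] = 0x79622d32u; x[3] = 0x6b206574u;
    for (i = 0; i < 8; i++) x[4 + i] = load32_le(key + 4 * i);
    for (i = 0; i < 4; i++) x[12 + i] = load32_le(iv + 4 * i);

    /* Each iteration is one column round plus one diagonal round. */
    for (i = 0; i < rounds; i += 2) {
        QR(x[0], x[4], x[8],  x[12]);
        QR(x[1], x[5], x[9],  x[13]);
        QR(x[2], x[6], x[10], x[14]);
        QR(x[3], x[7], x[11], x[15]);
        QR(x[0], x[5], x[10], x[15]);
        QR(x[1], x[6], x[11], x[12]);
        QR(x[2], x[7], x[8],  x[13]);
        QR(x[3], x[4], x[9],  x[14]);
    }

    /* Unlike the stream cipher, no feed-forward add: emit words 0-3 and 12-15. */
    for (i = 0; i < 4; i++) {
        store32_le(out + 4 * i, x[i]);
        store32_le(out + 16 + 4 * i, x[12 + i]);
    }
}
```

Under the nonce-extension construction, XChaCha would presumably run this on the key and the first 16 bytes of the 24-byte iv, then use the 32-byte result as the ChaCha key with the remaining 8 bytes as the nonce.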



const size_t rounds = 20;
chacha_key key = {{..}};
chacha_iv iv = {{..}};
uint8_t in[100] = {..}, out[100];

chacha(&key, &iv, in, out, 100, rounds);


Encrypting incrementally, i.e. with multiple calls to collect/write data. Note that passing in data to be encrypted will not always result in data being written out. The implementation collects data until there is at least 1 block (64 bytes) of data available.

const size_t rounds = 20;
chacha_state S;
chacha_key key = {{..}};
chacha_iv iv = {{..}};
uint8_t in[100] = {..}, out[100], *out_pointer = out;
size_t i, bytes_written;

chacha_init(&S, &key, &iv, rounds);

/* add one byte at a time, extremely inefficient */
for (i = 0; i < 100; i++) {
    bytes_written = chacha_update(&S, in + i, out_pointer, 1);
    out_pointer += bytes_written;
}
/* flush whatever is still buffered in the state */
bytes_written = chacha_final(&S, out_pointer);
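To make the buffering behaviour concrete, the sketch below models only the byte accounting of update/final, with no encryption. It assumes output is emitted in whole 64-byte blocks as soon as enough input has been collected, which is how the description above reads; model_state, model_update, and model_final are hypothetical names and not library functions.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_BYTES 64

/* Hypothetical stand-in for the buffered byte count inside chacha_state. */
typedef struct {
    size_t buffered; /* always less than BLOCK_BYTES between calls */
} model_state;

/* Returns how many bytes an update of inlen bytes would write under this model. */
static size_t model_update(model_state *s, size_t inlen) {
    size_t total, written;

    assert(s->buffered < BLOCK_BYTES);
    total = s->buffered + inlen;
    written = total - (total % BLOCK_BYTES); /* whole blocks only */
    s->buffered = total % BLOCK_BYTES;       /* remainder stays buffered */
    return written;
}

/* Returns how many leftover bytes a final call would write under this model. */
static size_t model_final(model_state *s) {
    size_t written = s->buffered;
    s->buffered = 0;
    return written;
}
```

Fed 100 single-byte updates, this model writes nothing for the first 63 calls, 64 bytes on the 64th call, nothing for the remaining 36 calls, and 36 bytes on final. Since at most 63 bytes can be buffered, a single update never writes more than inlen + 63 bytes, so out must have room for that much.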


x86-64, SSE2-32, and SSE3-32 versions are lightly modified from DJB's public domain implementations.


x86 (32 bit)


x86-64 will almost always be slower than SSE2, but on some older AMDs it may be faster.



See asm-opt#configuring for full configure options.

If you would like to use Yasm with a gcc-compatible compiler, pass --yasm to configure.

The Visual Studio projects are generated assuming Yasm is available. You will need to have Yasm.exe somewhere in your path to build them.


make lib

and make install-lib OR copy bin/chacha.lib and app/include/chacha.h to your desired location.


./configure --pic
make shared
make install-shared


make util
bin/chacha-util [bench|fuzz]


Benchmarking will implicitly test every available version. If any fail, it will exit with an error indicating which versions did not pass. Features tested include:

  • Partial block generation
  • Single block generation
  • Multi block generation
  • Counter handling when the 32-bit low half overflows to the upper half
  • Streaming and XOR modes
  • Incremental encryption
  • Input/Output alignment
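The counter case in the list above is the easiest one to get wrong in a new implementation. A minimal sketch of a 64-bit block counter kept as two 32-bit halves, with the carry propagated by hand; block_counter, counter_add, and counter_value are hypothetical names, and the library's internal state layout is not documented here:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 64-bit block counter split into 32-bit halves. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} block_counter;

/* Advance the counter by `blocks`, carrying from the low half into the high half. */
static void counter_add(block_counter *c, uint32_t blocks) {
    uint32_t old_lo;

    assert(c != NULL);
    old_lo = c->lo;
    c->lo += blocks;    /* unsigned arithmetic wraps modulo 2^32 */
    if (c->lo < old_lo) /* a wrap means the low half overflowed */
        c->hi += 1;
}

/* Reassemble the two halves, handy for checking against 64-bit arithmetic. */
static uint64_t counter_value(const block_counter *c) {
    assert(c != NULL);
    return ((uint64_t)c->hi << 32) | c->lo;
}
```

A test in this spirit starts the low half near 0xffffffff, processes a few blocks, and compares the result with plain 64-bit addition.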


Fuzzing tests every available implementation for the current CPU against the reference implementation. Features tested are:

  • HChaCha output
  • One-shot ChaCha
  • Incremental ChaCha with potentially unaligned output


I have not updated the benchmarks yet, so raw cycle counts should have ~10-20 cycles added to account for the overhead of targets not being hardcoded.


Impl.     1 byte               576 bytes            8192 bytes
rounds    8      12     20     8      12     20     8      12     20
SSSE3-64  237    300    437    1.71   2.23   3.30   1.46   1.90   2.82
SSE2-64   262    337    500    1.98   2.65   3.97   1.68   2.29   3.42
SSSE3-32  287    350    487    2.04   2.69   3.99   1.72   2.37   3.59
SSE2-32   312    400    562    2.43   3.26   4.95   2.12   2.90   4.52


SSSE3-64 162 237 362
SSSE3-32 175 250 375
SSE2-64 200 275 450
SSE2-32 200 275 450


Impl.     1 byte               576 bytes            8192 bytes
rounds    8      12     20     8      12     20     8      12     20
AVX-64    176    240    364    1.22   1.68   2.64   1.04   1.46   2.29
SSSE3-64  180    248    384    1.35   1.88   2.94   1.18   1.65   2.59
AVX-32    184    248    380    1.50   2.03   3.10   1.24   1.72   2.68
SSSE3-32  228    292    428    1.84   2.47   3.74   1.65   2.23   3.41


AVX-64 116 180 308
AVX-32 128 192 320
SSSE3-64 128 192 328
SSSE3-32 136 204 336

Timings are with Turbo Boost and Hyperthreading enabled, so they are not exact. For reference, OpenSSL and Crypto++ give ~0.8 cpb for AES-128-CTR, ~1.1 cpb for AES-256-CTR, and ~7.4 cpb for SHA-512.


Impl.     1 byte               576 bytes            8192 bytes
rounds    8      12     20     8      12     20     8      12     20
AVX2-64   146    194    313    0.68   0.97   1.48   0.52   0.71   1.08
AVX2-32   170    218    337    0.83   1.11   1.66   0.62   0.83   1.24
AVX-64    146    194    316    1.06   1.50   2.33   0.94   1.32   2.05
AVX-32    158    206    328    1.32   1.82   2.81   1.12   1.57   2.47


(these are all literally the same version, timing differences are noise)

AVX2-64 81 155 251
AVX2-32 87 155 254
AVX-64 87 155 274
AVX-32 87 152 251

AMD FX-8120

Timings are with Turbo on, so they are not exact, and I'm not sure how to adjust for it. Depending on the clock speed assumed (3.1 GHz vs 4.0 GHz), OpenSSL gives 0.73-0.94 cpb for AES-128-CTR, 1.03-1.33 cpb for AES-256-CTR, and 10.96-14.1 cpb for SHA-512.
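The two ends of each range above differ by roughly the 4.0/3.1 clock ratio, which is what one would expect when the same wall-clock time is converted to cycles at two different assumed clock speeds. A small sketch of that conversion; rescale_cpb is a hypothetical helper, not part of the library or its benchmark utility:

```c
#include <assert.h>

/*
 * Hypothetical helper: convert a cycles-per-byte figure computed for one
 * assumed clock speed into the figure for another clock speed, keeping the
 * underlying wall-clock time fixed. Cycles scale linearly with frequency.
 */
static double rescale_cpb(double cpb, double assumed_ghz, double actual_ghz) {
    assert(assumed_ghz > 0.0 && actual_ghz > 0.0);
    return cpb * actual_ghz / assumed_ghz;
}
```

For example, rescale_cpb(0.73, 3.1, 4.0) lands close to the 0.94 cpb upper bound quoted for AES-128-CTR, so the ChaCha numbers for this CPU are best read with the same uncertainty in mind.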


Impl.     1 byte               576 bytes            8192 bytes
rounds    8      12     20     8      12     20     8      12     20
XOP-64    194    269    418    1.09   1.47   2.25   0.93   1.22   1.80
AVX-64    245    344    544    1.41   1.97   3.14   1.20   1.63   2.51
XOP-32    247    322    471    1.44   1.96   3.01   1.26   1.70   2.59
AVX-32    276    375    573    1.88   2.53   3.78   1.62   2.16   3.23


XOP-64 84 160 309
XOP-32 91 165 318
AVX-64 144 243 441
AVX-32 144 237 441

ZedBoard (Cortex-A9)

I don't have access to the cycle counter yet, so cycles are computed by taking the microseconds times the clock speed (666 MHz) divided by 1 million. For comparison, on long messages, OpenSSL 1.0.0e gives 52.3 cpb for aes-128-cbc (woof), and djb's armneon6 Salsa20/20 implementation gives 8.2 cpb.
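The conversion described above can be written out as follows. This is only a sketch of the arithmetic; cycles_from_micros and cycles_per_byte are hypothetical helper names, and the clock speed is passed in Hz:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: estimate elapsed cycles from a wall-clock time in
 * microseconds and a clock speed in Hz (666000000 for a 666 MHz part).
 * The product can overflow 64 bits for very long timings; that is ignored here.
 */
static uint64_t cycles_from_micros(uint64_t micros, uint64_t clock_hz) {
    return micros * clock_hz / 1000000u;
}

/* Hypothetical helper: cycles per byte for a message of `bytes` bytes. */
static double cycles_per_byte(uint64_t cycles, uint64_t bytes) {
    assert(bytes > 0);
    return (double)cycles / (double)bytes;
}
```

At 666 MHz one microsecond corresponds to 666 cycles, so timer resolution matters a great deal for the short-message figures.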


Impl.     1 byte               576 bytes            8192 bytes
rounds    8      12     20     8      12     20     8      12     20
NEON-32   460    573    814    3.53   4.73   7.13   3.06   4.26   6.47
ARMv6-32  437    565    793    5.33   7.07   10.87  5.07   6.93   10.73


NEON shares the same implementation as ARMv6 as NEON latencies are too high for a single block.

NEON-32 294 446 658
ARMv6-32 294 446 658


Public Domain, or MIT