P256 ECDH and ECDSA for Cortex-M4, Cortex-M33 and other 32-bit ARM processors
This library implements highly optimized assembler versions of the NIST P-256 (secp256r1) elliptic curve operations for Cortex-M4/Cortex-M33. While optimized for these processors, it also works on other newer 32-bit ARM processors.
The DSP extension CPU feature is required for Cortex-M33.
For full API documentation, see the header file p256-cortex-m4.h.
To use it in your project, add the following files: p256-cortex-m4.h, p256-cortex-m4-config.h and p256-cortex-m4.c. Then add exactly one of the asm files, whichever matches your toolchain, as a compilation unit. If you use Keil, add p256-cortex-m4-asm-keil.s as a source file and add --cpreproc to "Misc Controls" under Options -> Asm for the file. If you use GCC, add p256-cortex-m4-asm-gcc.S to your Makefile just like any other C source file.
To compile in only the features needed, the file p256-cortex-m4-config.h can be modified to include only specific algorithms. If the library is used on a Cortex-A processor, the has_d_cache setting should also be enabled in order to prevent side-channel attacks. There are also optimization options to trade code space for performance. The same options can also be defined directly on the command line when compiling, using e.g. -Dinclude_p256_sign=0 to omit the code for creating a signature.
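A command-line override of this kind usually relies on guarded defaults in the configuration header. The fragment below is a minimal sketch of that pattern, not a copy of p256-cortex-m4-config.h: only the option name include_p256_sign comes from the text above, while the default value of 1 and the guard layout are assumptions made for illustration. Consult the real config header for the actual option names and defaults.

```c
/* Guarded default: only takes effect when the build did not already pass
 * -Dinclude_p256_sign=... on the command line.
 * The default of 1 is an assumption made for this sketch. */
#ifndef include_p256_sign
#define include_p256_sign 1
#endif

#if include_p256_sign
/* Signing code would be compiled in here. */
#endif
```

With this layout, compiling with -Dinclude_p256_sign=0 skips the guarded default and leaves the #if block out of the build.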
The library does not include a hash implementation (used during sign and verify), nor does it include a secure random number generator (used during keygen and sign). These functions must be implemented externally. Note that the random number generator must be cryptographically secure. In particular, rand() from the C standard library must not be used, while /dev/urandom, as found on many Unix systems, is suitable.
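On a Unix-like host, the random source could be sketched as below. The name generate_secure_random_data matches the placeholder used in the examples in this document, but the int return value is an addition of this sketch, and the function is not part of the library. On a bare-metal microcontroller there is no /dev/urandom, so a hardware random number generator peripheral or a vendor-provided secure RNG would have to take its place.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Fills buf with len bytes read from /dev/urandom.
 * Returns 1 on success and 0 on failure. On failure the buffer contents
 * must not be used as key material. */
static int generate_secure_random_data(void *buf, size_t len) {
    assert(buf != NULL);
    FILE *f = fopen("/dev/urandom", "rb");
    if (f == NULL) {
        return 0;
    }
    size_t got = fread(buf, 1, len, f);
    fclose(f);
    return got == len;
}
```

Callers should check the return value before passing the buffer on to key generation or signing.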
Note: all uint32_t arrays represent 256-bit integers in little-endian byte order (native to the CPU), aligned on a 4-byte boundary. The uint8_t arrays either represent pure byte strings, or integers in big-endian byte order (no alignment requirements). When interacting with other libraries, make sure to carefully understand the data format used by those libraries. Some data conversion routines for easier interoperability are included in the API.
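To make the two layouts concrete, the sketch below converts a 32-byte big-endian integer into eight 32-bit words with the least significant word first. The helper be_bytes_to_le_words is hypothetical and only illustrates the data format; it is not the library's conversion routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts a 32-byte big-endian integer into eight
 * 32-bit words, least significant word first. Shifts are used so that the
 * resulting values do not depend on the byte order of the host CPU. */
static void be_bytes_to_le_words(uint32_t out[8], const uint8_t in[32]) {
    assert(out != NULL && in != NULL);
    for (size_t i = 0; i < 8; i++) {
        /* Word i holds bits 32*i .. 32*i+31 of the integer, which are
         * input bytes 28-4*i .. 31-4*i. */
        const uint8_t *p = in + 32 - 4 * (i + 1);
        out[i] = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
                 ((uint32_t)p[2] << 8) | (uint32_t)p[3];
    }
}
```

For example, the integer 1 is stored as 31 zero bytes followed by 0x01 in the big-endian form, and as the word array {1, 0, 0, 0, 0, 0, 0, 0} in the little-endian form.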
Generate a key pair for either ECDSA or ECDH (a key pair should not be used for both purposes).
uint32_t pubkey_x[8], pubkey_y[8], privkey[8];
do {
    generate_secure_random_data(privkey, sizeof(privkey));
} while (!p256_keygen(pubkey_x, pubkey_y, privkey));
The result will now be contained in pubkey_x, pubkey_y and privkey (little-endian).
In this example, SHA-256 is used as the hash algorithm.
// Input values
uint8_t message[] = ...;
size_t message_len = ...;
uint32_t privkey[8] = ...;
// Output values (the signature)
uint32_t signature_r[8], signature_s[8];
uint8_t hash[32];
sha256_hash(message, message_len, hash);
uint32_t k[8]; // must be kept secret
do {
    generate_secure_random_data(k, sizeof(k));
} while (!p256_sign(signature_r, signature_s, hash, sizeof(hash), privkey, k));
In this example, SHA-256 is used as the hash algorithm.
// Input values
uint8_t message[] = ...;
size_t message_len = ...;
uint32_t pubkey_x[8] = ..., pubkey_y[8] = ...;
uint32_t signature_r[8] = ..., signature_s[8] = ...;
uint8_t hash[32];
sha256_hash(message, message_len, hash);
if (p256_verify(pubkey_x, pubkey_y, hash, sizeof(hash), signature_r, signature_s)) {
    // Signature is valid
} else {
    // Signature is invalid
}
After both parties have generated their key pair and exchanged their public keys, the shared secret can be generated. Both parties execute the following code.
// Input values
uint32_t others_public_key_x[8] = ..., others_public_key_y[8] = ...; // Received from remote party
uint32_t my_private_key[8] = ...; // Generated locally earlier during keygen
// Output value
uint8_t shared_secret[32];
if (!p256_ecdh_calc_shared_secret(shared_secret, my_private_key, others_public_key_x, others_public_key_y)) {
    // The other party sent an invalid public key, so abort and take action
    // The shared_secret will at this point contain an undefined value, and should hence not be read
} else {
    // The shared_secret is now the same for both parties and may be used for cryptographic purposes
}
If you are receiving or sending 32-byte long uint8_t arrays representing 256-bit integers in big-endian byte order, you may convert them to or from uint32_t arrays in little-endian byte order (which are commonly used in this library) using p256_convert_endianness.
For example, before validating a signature, call:
// Input values
uint8_t signature_r_in[32] = ..., signature_s_in[32] = ...;
// Output values
uint32_t signature_r[8], signature_s[8];
p256_convert_endianness(signature_r, signature_r_in, 32);
p256_convert_endianness(signature_s, signature_s_in, 32);
After generating a signature, call:
// Input values
uint32_t signature_r[8] = ..., signature_s[8] = ...; // from p256_sign
// Output values
uint8_t signature_r_out[32], signature_s_out[32];
p256_convert_endianness(signature_r_out, signature_r, 32);
p256_convert_endianness(signature_s_out, signature_s, 32);
The same technique can be used for public keys.
The library has been tested against test vectors from Project Wycheproof (https://github.com/google/wycheproof). To run the tests, first execute node testgen.js > tests.c using Node >= 10.4. Then add the project files according to "How to use", plus tests.c and nrf52_tests_main.c, to a new clean nRF52840 project using e.g. Segger Embedded Studio or Keil µVision. Compile and run, and make sure all tests pass by verifying that main returns 0.
Currently the work has been tested successfully on nRF52840, nRF5340 and MAX32670.
The following numbers were obtained on an nRF52840 with ICACHE turned on, using GCC as compiler with -O2 optimization.
Operation | Cycles | Time at 64 MHz |
---|---|---|
Key generation ECDH/ECDSA | 327k | 5.1 ms |
Sign ECDSA | 375k | 5.9 ms |
Verify ECDSA | 976k | 15.3 ms |
Shared secret ECDH | 906k | 14.2 ms |
Point decompression | 48k | 0.75 ms |
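The time column follows from dividing the cycle count by the clock frequency. A minimal sketch for estimating the time at another clock rate is shown below; cycles_to_ms is a hypothetical helper, and the estimate assumes the cycle count itself stays the same, which may not hold on devices with different flash wait states or cache behavior.

```c
#include <assert.h>
#include <stdint.h>

/* Converts a cycle count to milliseconds at the given clock frequency. */
static double cycles_to_ms(uint64_t cycles, uint64_t clock_hz) {
    assert(clock_hz != 0);
    return (double)cycles * 1000.0 / (double)clock_hz;
}
```

For instance, cycles_to_ms(327000, 64000000) gives roughly 5.1, matching the key generation row of the table.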
With all features enabled, the full library takes 8.9 kB in compiled form. 1.5 kB can be saved by enabling options that reduce code size at the cost of performance.
The stack usage is at most 2 kB.
The implementation runs in constant time (unless input values are invalid) and uses a constant code memory access pattern regardless of the scalar/private key, in order to protect against side-channel attacks. If desired, in particular when the processor has a data cache (like Cortex-A processors), the has_d_cache option can be enabled, which also makes the RAM access pattern constant at the expense of a ~10% performance decrease.
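The general idea behind a constant RAM access pattern can be illustrated with a table lookup that reads every entry and selects the wanted one with a mask. This is a generic sketch of the technique in C, not the library's code (which is written in assembly), and ct_table_lookup is a hypothetical name. Note that a C compiler is free to transform such code, so C alone does not guarantee constant-time behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copies table[secret_index] into out while reading every table entry, so
 * the sequence of addresses accessed does not depend on secret_index. */
static void ct_table_lookup(uint32_t out[8], const uint32_t table[][8],
                            uint32_t entries, uint32_t secret_index) {
    assert(out != NULL && table != NULL);
    for (int w = 0; w < 8; w++) {
        out[w] = 0;
    }
    for (uint32_t i = 0; i < entries; i++) {
        /* diff is zero only when i equals secret_index. */
        uint32_t diff = i ^ secret_index;
        /* mask is all ones when diff is zero, otherwise zero; it is
         * computed arithmetically, without a branch on the secret. */
        uint32_t mask = (uint32_t)(((uint64_t)diff - 1u) >> 32);
        for (int w = 0; w < 8; w++) {
            out[w] |= table[i][w] & mask;
        }
    }
}
```

The extra reads are where the cost comes from: every lookup touches the whole table instead of a single entry.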
The code is written in Keil's assembler format but was converted to GCC's assembler syntax using the included script convert-keil-to-gcc.sh (reads from stdin and writes to stdout).
The code is licensed under the MIT license.
Thanks to ASSA ABLOY PPI for funding this work!