/aes

Fast constant-time AES implementations on 32-bit architectures

Primary LanguageAssemblyMIT LicenseMIT

Fast constant-time AES implementations on 32-bit architectures

This repository contains efficient constant-time implementations for the Advanced Encryption Standard (AES) algorithm as supplementary material for the paper Fixslicing AES-like Ciphers - New bitsliced AES speed records on ARM Cortex-M and RISC-V published in TCHES 2021/1.

Structure of the repository

The repository structure is as follows:

aes
│   README.md
│   LICENSE   
│
├───armcortexm
│   ├───1storder_masking
│   ├───barrel_shiftrows
│   └───fixslicing
│   
├───opt32
│   ├───barrel_shiftrows
│   └───fixslicing
│   
├───riscv
│   ├───barrel_shiftrows
│   └───fixslicing

where armcortexm and riscv directories respectively refer to assembly implementations for ARM Cortex-M and RV32I, whereas opt32 refers to C language implementations. Note that the main goal of the opt32 directory is to provide cross-platform implementations and to serve a didactic purpose. Therefore if you intend to run it for benchmarking, you should consider some modifications regarding execution speed.

AES representations

Each directory includes two different AES representations:

  • barrel_shiftrows

    Processes 8 blocks in parallel. Requires 1408 and 1920 bytes to store all the round keys for AES-128 and AES-256, respectively.

  • fixslicing

    Processes 2 blocks in parallel. Requires 352 and 480 bytes to store all the round keys for AES-128 and AES-256, respectively. Two fixsliced versions are proposed:

    • Fully-fixsliced: faster but at the cost of a larger code size
    • Semi-fixsliced: slightly slower but more compact.

Performance

Since the fixsliced representations require 4 times less RAM to store all the round keys, they are more suited to the most resource-constrained platforms. Still, the barrel-shiftrows representation might be worthy of consideration for use-cases that deal with large amount of data on architectures with numerous general-purpose registers (e.g. RV32I). The table below summarizes the performance of each version on ARM Cortex-M3 and E31 RISC-V processors in cycles per byte. Note that those implementations are non-unrolled to ensure greater clarity and limit the impact on code size. Unrolling them would result in slightly better performance and we refer to the paper for more details.

Algorithm Parallel blocks ARM Cortex-M3 E31 RISC-V core
AES-128 semi-fixsliced 2 87.1 93.4
AES-128 fully-fixsliced 2 84.3 89.3
AES-128 barrel-shiftrows 8 94.8 78.9
AES-256 semi-fixsliced 2 119.9 128.4
AES-256 fully-fixsliced 2 115.6 122.4
AES-256 barrel-shiftrows 8 127.9 105.7

First-order masking

A first-order masked implementation based on fixslicing can be found in armcortexm/1storder_masking. The masking scheme is the one described in the article Masking the AES with Only Two Random Bits and is strongly based on the code from the corresponding repository. Note that the code in charge of the randomness generation is specific to the STM32F407VG development board and some changes would be necessary to run it on another board (e.g. adapting the RNG_SR address). The table below summarizes their performance on ARM Cortex-M4 in cycles per byte. Once again, results can be slightly enhanced by unrolling the code.

Algorithm Parallel blocks ARM Cortex-M4
1st-order masked AES-128 semi-fixsliced 2 199.3
1st-order masked AES-128 fully-fixsliced 2 195.8

Caution

This masking scheme was mainly introduced to achieve first-order masking while limiting the amount of randomness to generate. Please be aware that other first-order masking schemes provide a better security level. Note that no practical evaluation has been undertaken to assess the security of our masked implementations!