ECHO hash function home design security hard soft compare

The news that Intel will include an AES instruction set in a range of new products was warmly welcomed by the cryptographic community. Not only does this provide additional resistance to a range of side-channel attacks, but it also provides huge performance benefits. We implemented in assembly all the SHA3 candidates that could benefit from the new Intel AES instructions set and herein summarize the results.

summary




performance comparison details


Performance estimates for all SHA-3 candidates considered below come from Nehalem measurements on a Core i7 920 processor clocked at 2.67 GHz with GNU/Linux Debian running a
2.6.26-1-amd64
kernel. The compiler was
icc for amd64, Version 11.0, Build 20081105
. Since the new Westmere CPUs have been released, we were able to check these estimates with a Core i5 540M processor clocked at 2.53 GHz, running Ubuntu with a
2.6.34-amd64
kernel. As one can notice, our estimates were pretty accurate. For those algorithms that solely use the AES round in its entirety, we give the number of AES rounds/Byte and also the cost, which is computed as the number of cycles/AES round. In general terms the lower the cost, the more efficiently the AES round is being used with respect to AES-NI.

AES-NI estimations
(Nehalem Core i7)
AES-NI results
(Westmere Core i5)
AES encryptions
256-bit hash 512-bit hash 256-bit hash 512-bit hash #AES/Byte
256-bit / 512-bit
cost
256-bit / 512-bit
Arirang 14.9 16 Arirang
Cheetah 7.6 7.6 Cheetah
ECHO-DP 6.612.3 6.812.6 1.33 / 2.50 5.11 / 5.04 ECHO-DP
ECHO-DP32 ‡∗ 8.315.3 1.33 / 2.50 6.24 / 6.12 ECHO-DP32 ‡∗
ECHO-SP 5.78.1 5.88.4 1.14 / 1.67 5.09 / 5.03 ECHO-SP
ECHO-SP32 †∗ 7.110.1 1.14 / 1.67 6.23 / 6.05 ECHO-SP32 †∗
Lane 5.513.9 4.913.5 1.31 / 1.75 3.74 / 7.71 Lane
Lesamnta 30.819.9 29.519 Lesamnta
LUX 6.6 6.6 LUX
SHAvite-3 5.6
  7.6
5.5
  8.7
4.8
  7.7
4.3
  8.7
0.81 / 1.31 5.93 / 3.28
  9.38 / 6.64
SHAvite-3
Vortex 4.45.2 4.45.2 Vortex
Grøstl×15.829.1Grøstl×

                                         double-pipe version    single-pipe version    pure 32-bit
x86
implementation (using 8 xmm registers)
                                                                       implementation and specification mismatch, see here   
               × Grøstl, though not using a full AES round, benefits from AES-NI (figures collected by using the ebash supercop package on our Core i5)

Click on the algorithm name to download its assembly implementation. See there for instructions on compiling and running the code.

While we tried to implement each of the above algorithms as efficiently as possible, improvements might be possible. If you have other implementation ideas that you would like to share with us or obtain better figures with a supporting public assembly implementation, do not hesitate to get in touch, we will update this page accordingly.


methodology

As the Westmere processors (the first with AES-NI) were not yet available at the time, we have proposed a new methodology that can be used to get an accurate emulation of AES-NI. Our methodology relies on the fact that Westmere (formerly Nehalem-C) and Nehalem processors share the same micro-architecture. This means that if we can find suitable instructions patterns that behave exactly as AES-NI instructions, we will get very good estimates for the future performance of AES-based SHA-3 candidates on a Westmere processor, but using a Nehalem processor. Previously, a substitution instruction was proposed for Westmere processors. However this substitution does not exhibit the correct behaviour for Westmere and can give misleading results. In a companion paper, we provide a particularly accurate replacement instructions pattern for aesenc and we explain how to derive it from publicly available information only. The corresponding slides have been presented to the Asiacrypt 2009 conference.


compiling and running

package structure

For each candidate, the package consists in one or two subfolders, depending on the implemented versions (256- and/or 512-bit versions). Each subfolder contains three main files:

  • compress.h: the core source file containing the inline assembly implementation of the compression function.
  • nist_api.h: the NIST API, basically from the candidate's package, sometimes modified for compatibility with our implementation of the compression function.
  • wrap.c: the main routine with two modes: a benchmark mode that uses the "replacement instructions" in the compression function (see our companion paper) to be run on Nehalem, and an "emulator" mode that uses the AES-NI genuine instructions to be run on Westmere or inside an emulator on other processors.

compiling

To compile a candidate, a Makefile is present at the root of the candidate's folder. There are five (or three, depending on the hash length variants) compilation toggles in the Makefile:
  • westmere256 and westmere512: compiles the Westmere AES-NI version of the algorithm. Two binaries are generated:
        • A first one to check the implementation correctness with Intel's SDE emulator or by genuine execution on Westmere CPUs.
        • A second one to perform speed measurements on Westmere.
  • nehalem256 and nehalem512: compiles the replacement versions of the algorithm for benchmarking purposes (estimates on Nehalem).
  • clean: removes compiled files.

running

  • The benchmark binaries are run with a single argument, a number which represents the repeat factor:
    ./westmere_binary name_of_file_to_be_hashed
  • The Westmere correctness checking binaries can be either run on Westmere CPUs, or through Intel's SDE:
    sde -- westmere_binary name_of_file_to_be_hashed

measurements

In order to get stable measurements, it is recommended to shut down all services. To lower the impact of the remaining process, we also set the CPU affinity and maximize its priority (this requires root privileges). Although the number of samples needed to get stable results depends on the code and the surrounding "OS noise", we found that about 108 samples give stable and clean results for all the candidates.

Finally, timings may also vary depending on the BIOS settings for the Core i7 920 and the Core i5 540M. The results given here have been obtained with:

  • Genuine frequency and QPI (no overclocking)
  • Hyperthreading off
  • Turbo Boost off

compiling with gcc

It is also possible to compile our source codes with gcc insead of icc (gcc supports AES-NI in 4.4 and above). For this, set the CC variable in the Makefile to read:

 CC=gcc -funsafe-math-optimizations  

Compiling without the above flag will not give proper results. This is related to the Nehalem AES-NI replacement nature: the mulps instructions might indeed generate floating point exceptions at run time (because of uncontrolled overflows), leading to huge penalties during the handling of the exceptions. The gcc compilation flag "-funsafe-math-optimizations" changes the mxcsr register so that the overflow FPU/SSE exceptions are masked, see gcc documentation.

The icc compiler does not need special care as an equivalent flag "-fp-model fast" is implied by -O1 or -O3. Another workaround is to manually set the mxcsr register at the beginning of the loop measuring the compression function.

Finally, switching to gcc also implies, for some candidates, to change icc's
-fast
option to -O3. This can lead to some differences in the results.