ChaCha20: add optimized versions for amd64 (SSSE3 & AVX2) #392

Draft: wants to merge 1 commit into base 1.1
Conversation

hg (Contributor) commented on May 22, 2022

With the 'new' protocol, ChaCha is taking a decent amount of CPU time, at least in a debug build (optimization makes the perf output unreadable):

Children      Self  Command  Shared Object         Symbol
-   99.72%     0.00%  tincd    libc.so.6             [.] __libc_init_first
     __libc_init_first
   - main
      - 99.71% main_loop
         - 90.08% event_loop
            - 69.62% handle_incoming_vpn_data
               - 69.20% handle_incoming_vpn_packet
                  - 68.06% receive_udppacket
                     - 67.88% sptps_receive_data
                        - 67.68% sptps_receive_data_datagram
                           - 49.15% chacha_poly1305_decrypt
                                39.06% chacha_encrypt_bytes
                                9.83% poly1305_auth
                           + 17.80% receive_sptps_record
                  + 0.67% lookup_node_udp
            + 13.34% handle_device_data
            + 4.69% recvmmsg
            + 1.37% handle_meta_io
         + 9.50% epoll_wait

tincd is using the lowest common denominator implementation of this function. Let's add a couple of optimized ones based on compiler intrinsics.

All the hard work has been done by Romain Dolbeau. I just copied it with some adjustments.
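For a flavor of what the intrinsics version does, here is a minimal sketch of my own (not Dolbeau's code, and not the code in this PR verbatim) of one ChaCha column round in SSSE3: the four rows of the state sit in four __m128i registers, so all four column quarter-rounds run in parallel, and the 16-bit and 8-bit rotates become single pshufb shuffles.

#include <emmintrin.h>   /* SSE2 */
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 (pshufb) */

/* Rotate each 32-bit lane left by n bits using plain SSE2 shifts. */
#define ROTL32_SSE2(v, n) \
    _mm_or_si128(_mm_slli_epi32((v), (n)), _mm_srli_epi32((v), 32 - (n)))

/* One ChaCha column round; a, b, c, d hold the four rows of the state,
 * so the four column quarter-rounds run in parallel, one per 32-bit lane. */
static inline void chacha_column_round_ssse3(__m128i *a, __m128i *b,
                                             __m128i *c, __m128i *d)
{
    /* Rotations by 16 and 8 bits are byte permutations: one pshufb each. */
    const __m128i rot16 = _mm_set_epi8(13, 12, 15, 14, 9, 8, 11, 10,
                                       5, 4, 7, 6, 1, 0, 3, 2);
    const __m128i rot8  = _mm_set_epi8(14, 13, 12, 15, 10, 9, 8, 11,
                                       6, 5, 4, 7, 2, 1, 0, 3);

    *a = _mm_add_epi32(*a, *b);
    *d = _mm_xor_si128(*d, *a);
    *d = _mm_shuffle_epi8(*d, rot16);   /* d = rotl32(d, 16) */

    *c = _mm_add_epi32(*c, *d);
    *b = _mm_xor_si128(*b, *c);
    *b = ROTL32_SSE2(*b, 12);

    *a = _mm_add_epi32(*a, *b);
    *d = _mm_xor_si128(*d, *a);
    *d = _mm_shuffle_epi8(*d, rot8);    /* d = rotl32(d, 8) */

    *c = _mm_add_epi32(*c, *d);
    *b = _mm_xor_si128(*b, *c);
    *b = ROTL32_SSE2(*b, 7);
}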

Compatibility

x86 / amd64

We'll be shipping three versions of the function (or two, with older compilers that lack AVX2 support):

  • the generic C implementation
  • one based on SSSE3 intrinsics
  • one based on AVX2 intrinsics

The right one is picked at runtime based on the capabilities of the current CPU.
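Roughly along these lines (a hypothetical sketch using GCC/Clang's __builtin_cpu_supports(); the function names and signature are illustrative, not the actual code in this PR):

#include <stdint.h>

/* Illustrative signature; the real chacha_encrypt_bytes() differs. */
typedef void (*chacha_encrypt_fn)(void *ctx, const uint8_t *in,
                                  uint8_t *out, uint32_t len);

void chacha_encrypt_bytes_generic(void *ctx, const uint8_t *in, uint8_t *out, uint32_t len);
void chacha_encrypt_bytes_ssse3(void *ctx, const uint8_t *in, uint8_t *out, uint32_t len);
void chacha_encrypt_bytes_avx2(void *ctx, const uint8_t *in, uint8_t *out, uint32_t len);

/* Pick the fastest implementation the current CPU supports. */
static chacha_encrypt_fn resolve_chacha(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init();

    if (__builtin_cpu_supports("avx2"))
        return chacha_encrypt_bytes_avx2;

    if (__builtin_cpu_supports("ssse3"))
        return chacha_encrypt_bytes_ssse3;
#endif
    return chacha_encrypt_bytes_generic;
}

/* Resolve once, then keep the result in a function pointer. */
void chacha_encrypt_bytes(void *ctx, const uint8_t *in, uint8_t *out, uint32_t len)
{
    static chacha_encrypt_fn impl;

    if (!impl)
        impl = resolve_chacha();

    impl(ctx, in, out, len);
}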

Other architectures

Only the generic C implementation is used for now. ARM NEON support could be added later.

Benchmarks

  • i5-4460
  • Linux 5.17.5
  • gcc 12.1
  • iperf3 in two network namespaces
  • performance CPU governor, as few processes as possible, all the basic benchmarking stuff

TL;DR: 20-22% increase in throughput.

bench_chacha.c

Percentage is relative to generic C implementation.
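bench_chacha.c itself isn't included in this comment; roughly, one "op" is a single call encrypting a buffer of the given size, along these lines (hypothetical sketch, not the actual benchmark):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Illustrative signature; see the dispatch sketch above. */
void chacha_encrypt_bytes(void *ctx, const uint8_t *in, uint8_t *out, uint32_t len);

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* One "op" is a single call encrypting a buffer of the given size. */
static void bench(void *ctx, size_t size)
{
    static uint8_t in[1 << 20], out[1 << 20];   /* big enough for the 1048576 case */
    const double start = now_sec();
    double elapsed;
    uint64_t ops = 0;

    do {
        chacha_encrypt_bytes(ctx, in, out, (uint32_t)size);
        ops++;
        elapsed = now_sec() - start;
    } while (elapsed < 1.0);                    /* run for roughly a second */

    printf("%8zu: %14.2f op/s\n", size, ops / elapsed);
}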

C

      32:     2790793.08 op/s
     256:     1261587.31 op/s
     512:      728262.56 op/s
    1024:      390193.12 op/s
   16384:       26361.08 op/s
  131072:        3320.87 op/s
 1048576:         415.73 op/s

SSE

      32:     3112408.34 op/s (+11%)
     256:     2441758.81 op/s (+93%)
     512:     1627719.13 op/s (+123%)
    1024:      972969.81 op/s (+149%)
   16384:       74304.47 op/s (+181%)
  131072:        9427.75 op/s (+183%)
 1048576:        1182.82 op/s (+184%)

AVX2

      32:     3159181.11 op/s (+13%)
     256:     2449003.64 op/s (+94%)
     512:     2450859.66 op/s (+236%)
    1024:     1628639.74 op/s (+317%)
   16384:      145438.38 op/s (+451%)
  131072:       18729.81 op/s (+464%)
 1048576:        2330.21 op/s (+460%)

Always resolving the correct function on every call (instead of doing it once and storing the result in a function pointer) is a bit slower:

      32:     3126362.45 op/s
     256:     2395000.08 op/s
     512:     2399900.36 op/s
    1024:     1600087.45 op/s
   16384:      144505.38 op/s
  131072:       18464.47 op/s
 1048576:        2295.46 op/s
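For reference, the "always resolving" variant is essentially just this (illustrative; it builds on the hypothetical resolve_chacha() sketch above):

/* Re-run the CPU-feature check on every call instead of caching the result;
 * this is the slightly slower variant measured above. */
void chacha_encrypt_bytes_resolve_every_call(void *ctx, const uint8_t *in,
                                             uint8_t *out, uint32_t len)
{
    resolve_chacha()(ctx, in, out, len);
}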

iperf3:

buildtype=release

C

[  5]  55.00-56.00  sec   115 MBytes   965 Mbits/sec    2    557 KBytes
[  5]  56.00-57.00  sec   115 MBytes   965 Mbits/sec    1    491 KBytes
[  5]  57.00-58.00  sec   116 MBytes   975 Mbits/sec    0    648 KBytes
[  5]  58.00-59.00  sec   115 MBytes   965 Mbits/sec    1    588 KBytes
[  5]  59.00-60.00  sec   115 MBytes   965 Mbits/sec    2    522 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  6.73 GBytes   963 Mbits/sec  136             sender
[  5]   0.00-60.00  sec  6.73 GBytes   963 Mbits/sec                  receiver

SSSE3

[  5]  55.00-56.00  sec   130 MBytes  1.09 Gbits/sec   25    600 KBytes
[  5]  56.00-57.00  sec   131 MBytes  1.10 Gbits/sec    2    560 KBytes
[  5]  57.00-58.00  sec   130 MBytes  1.09 Gbits/sec    2    515 KBytes
[  5]  58.00-59.00  sec   132 MBytes  1.11 Gbits/sec    0    683 KBytes
[  5]  59.00-60.00  sec   131 MBytes  1.10 Gbits/sec    2    649 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  7.64 GBytes  1.09 Gbits/sec  2659             sender
[  5]   0.00-60.00  sec  7.64 GBytes  1.09 Gbits/sec                  receiver

AVX2

[  5]  55.00-56.00  sec   142 MBytes  1.20 Gbits/sec    2    602 KBytes
[  5]  56.00-57.00  sec   141 MBytes  1.19 Gbits/sec    2    574 KBytes
[  5]  57.00-58.00  sec   142 MBytes  1.20 Gbits/sec    1    550 KBytes
[  5]  58.00-59.00  sec   142 MBytes  1.20 Gbits/sec    2    520 KBytes
[  5]  59.00-60.00  sec   142 MBytes  1.20 Gbits/sec    1    494 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  8.29 GBytes  1.19 Gbits/sec  1126             sender
[  5]   0.00-60.00  sec  8.28 GBytes  1.19 Gbits/sec                  receiver

I thought that optimizing for a specific CPU, or the auto-vectorization performed at -O3, might remove the need for hand-written SIMD code:

buildtype=release + -march=native -mtune=native

C

[  5]  55.00-56.00  sec   110 MBytes   923 Mbits/sec    1    498 KBytes
[  5]  56.00-57.00  sec   110 MBytes   923 Mbits/sec    0    646 KBytes
[  5]  57.00-58.00  sec   110 MBytes   923 Mbits/sec    3    581 KBytes
[  5]  58.00-59.00  sec   110 MBytes   923 Mbits/sec    2    506 KBytes
[  5]  59.00-60.00  sec   110 MBytes   923 Mbits/sec    0    650 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  6.40 GBytes   916 Mbits/sec  2579             sender
[  5]   0.00-60.01  sec  6.40 GBytes   916 Mbits/sec                  receiver

AVX2

[  5]  55.00-56.00  sec   141 MBytes  1.18 Gbits/sec    4    649 KBytes
[  5]  56.00-57.00  sec   142 MBytes  1.20 Gbits/sec    1    626 KBytes
[  5]  57.00-58.00  sec   142 MBytes  1.20 Gbits/sec    2    602 KBytes
[  5]  58.00-59.00  sec   142 MBytes  1.20 Gbits/sec    1    571 KBytes
[  5]  59.00-60.00  sec   141 MBytes  1.18 Gbits/sec    3    539 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  8.30 GBytes  1.19 Gbits/sec  981             sender
[  5]   0.00-60.00  sec  8.30 GBytes  1.19 Gbits/sec                  receiver

-O3 + -march=native -mtune=native

C

[  5]  55.00-56.00  sec   111 MBytes   933 Mbits/sec    0    680 KBytes
[  5]  56.00-57.00  sec   111 MBytes   933 Mbits/sec    3    619 KBytes
[  5]  57.00-58.00  sec   112 MBytes   944 Mbits/sec    1    554 KBytes
[  5]  58.00-59.00  sec   111 MBytes   933 Mbits/sec    0    691 KBytes
[  5]  59.00-60.00  sec   111 MBytes   933 Mbits/sec    2    634 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  6.48 GBytes   927 Mbits/sec  3267             sender
[  5]   0.00-60.00  sec  6.47 GBytes   926 Mbits/sec                  receiver

AVX2

[  5]  55.00-56.00  sec   139 MBytes  1.16 Gbits/sec    1    639 KBytes
[  5]  56.00-57.00  sec   139 MBytes  1.16 Gbits/sec    1    607 KBytes
[  5]  57.00-58.00  sec   138 MBytes  1.15 Gbits/sec    1    578 KBytes
[  5]  58.00-59.00  sec   139 MBytes  1.16 Gbits/sec    2    546 KBytes
[  5]  59.00-60.00  sec   138 MBytes  1.15 Gbits/sec    1    510 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  8.01 GBytes  1.15 Gbits/sec  312             sender
[  5]   0.00-60.00  sec  8.00 GBytes  1.15 Gbits/sec                  receiver

Not really.

Speeds up local iperf3 in Linux network namespaces by roughly 20%.
hg (Contributor, Author) commented on May 22, 2022

I tried the NEON implementation from the same author. It doesn't have as much of an edge over the generic one, at least on a Neoverse-N1, so iperf3 speeds are comparable.

If anyone has an old Raspberry Pi and wants to test this on a slower CPU — be my guest. The code is in pretty bad shape since it's work in progress, but it does work.

Otherwise, I see no reason to merge NEON support.

generic

      32:     2846064.09 op/s
     256:     1424432.86 op/s
     512:      850478.15 op/s
    1024:      473508.42 op/s
   16384:       33032.01 op/s
  131072:        4146.80 op/s
 1048576:         517.55 op/s
[  5]  55.00-56.00  sec   100 MBytes   839 Mbits/sec    0    658 KBytes
[  5]  56.00-57.00  sec   101 MBytes   849 Mbits/sec    8    571 KBytes
[  5]  57.00-58.00  sec   102 MBytes   860 Mbits/sec    2    484 KBytes
[  5]  58.00-59.00  sec   101 MBytes   849 Mbits/sec    0    622 KBytes
[  5]  59.00-60.00  sec   101 MBytes   849 Mbits/sec    1    532 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  5.67 GBytes   812 Mbits/sec  2875             sender
[  5]   0.00-60.00  sec  5.67 GBytes   812 Mbits/sec                  receiver

NEON

       32:     2933899.98 op/s (+3%)
      256:     1943603.43 op/s (+36%)
      512:     1242317.12 op/s (+46%)
     1024:      724495.40 op/s (+53%)
    16384:       53621.92 op/s (+62%)
   131072:        6754.49 op/s (+62%)
  1048576:         844.45 op/s (+63%)
[  5]  55.00-56.00  sec   101 MBytes   849 Mbits/sec    1    536 KBytes
[  5]  56.00-57.00  sec   100 MBytes   839 Mbits/sec    0    666 KBytes
[  5]  57.00-58.00  sec   101 MBytes   849 Mbits/sec    3    584 KBytes
[  5]  58.00-59.00  sec   100 MBytes   839 Mbits/sec    0    704 KBytes
[  5]  59.00-60.00  sec   102 MBytes   860 Mbits/sec    2    626 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  5.81 GBytes   832 Mbits/sec  1660             sender
[  5]   0.00-60.00  sec  5.81 GBytes   832 Mbits/sec                  receiver
Badly modified `ns_ping.py` for running iperf3
#!/usr/bin/env python3

"""Create two network namespaces and run ping between them."""

import os
import signal
import subprocess as subp
import typing as T

from testlib import external as ext, util, template, cmd
from testlib.log import log
from testlib.proc import Tinc, Script
from testlib.test import Test

util.require_root()
util.require_command("ip", "netns", "list")
util.require_path("/dev/net/tun")

IP_FOO = "192.168.1.1"
IP_BAR = "192.168.1.2"
MASK = 24


def init(ctx: Test) -> T.Tuple[Tinc, Tinc]:
    """Initialize new test nodes."""
    foo, bar = ctx.node(), ctx.node()

    log.info("create network namespaces")
    assert ext.netns_add(foo.name)
    assert ext.netns_add(bar.name)

    log.info("initialize two nodes")

    stdin = f"""
        init {foo}
        set Port 0
        set Subnet {IP_FOO}
        set Interface {foo}
        set Address localhost
        set AutoConnect no
    """
    foo.cmd(stdin=stdin)
    foo.add_script(Script.TINC_UP, template.make_netns_config(foo.name, IP_FOO, MASK))
    foo.start()

    stdin = f"""
        init {bar}
        set Port 0
        set Subnet {IP_BAR}
        set Interface {bar}
        set Address localhost
        set AutoConnect no
    """
    bar.cmd(stdin=stdin)
    bar.add_script(Script.TINC_UP, template.make_netns_config(bar.name, IP_BAR, MASK))

    cmd.exchange(foo, bar)

    return foo, bar


def ping(namespace: str, ip_addr: str) -> int:
    """Run an iperf3 client from one network namespace against the other."""
    log.info("running iperf3 from netns %s against %s", namespace, ip_addr)

    proc = subp.run(
        ["ip", "netns", "exec", namespace, "iperf3", "-t", "60", "-c", ip_addr],
        check=False,
    )

    log.info("iperf3 finished with code %d", proc.returncode)
    return proc.returncode


with Test("ns-ping") as context:
    foo_node, bar_node = init(context)
    bar_node.cmd("start")

    log.info("waiting for nodes to come up")
    bar_node[Script.TINC_UP].wait()

    log.info("add script foo/host-up")
    bar_node.add_script(foo_node.script_up)

    log.info("add ConnectTo clause")
    bar_node.cmd("add", "ConnectTo", foo_node.name)

    log.info("bar waits for foo")
    bar_node[foo_node.script_up].wait()

    subp.Popen(["ip", "netns", "exec", bar_node.name, "iperf3", "-s"], stdout=subp.DEVNULL, stderr=subp.DEVNULL)

    log.info("ping must work after connection is up")
    assert not ping(foo_node.name, IP_BAR)

gsliepen (Owner) commented:

The benchmark code you wrote is not really measuring the usage pattern tinc has. The sptps_speed program (which you might have to explicitly build using ninja -C build src/sptps_speed) tests the SPTPS protocol more thoroughly, using typical MTU-sized packets.

Performance measurements using debug builds are mostly useless. Have you tried configuring the build system with -Dbuildtype=debugoptimized?

I also have a branch (PR #360) that adds AES-256-GCM support to SPTPS, which depends on OpenSSL, but it will then also use OpenSSL for Chacha20-Poly1305. It might be interesting to compare the speed with this optimized version as well.

hg (Contributor, Author) commented on May 23, 2022

Oh yes. If you're willing to reintroduce a dependency on libssl, that would of course be best: we get highly optimized code for all possible architectures for free.

TL;DR:

flavor      throughput
generic     0.974 Gbit/s
avx2        1.19 Gbit/s
libssl1.1   1.28 Gbit/s
libssl3.1   1.23 Gbit/s*

*: probably a debug build

detailed results

generic

Generating keys for 10 seconds:          27182.71 op/s
Ed25519 sign for 10 seconds:             25932.53 op/s
Ed25519 verify for 10 seconds:            8360.30 op/s
ECDH for 10 seconds:                      6783.36 op/s
SPTPS/TCP authenticate for 10 seconds:    3173.35 op/s
SPTPS/TCP transmit for 10 seconds:           2.47 Gbit/s
SPTPS/UDP authenticate for 10 seconds:    3167.00 op/s
SPTPS/UDP transmit for 10 seconds:           2.46 Gbit/s

[  5]  55.00-56.00  sec   116 MBytes   975 Mbits/sec    0    686 KBytes
[  5]  56.00-57.00  sec   119 MBytes   996 Mbits/sec    1    634 KBytes
[  5]  57.00-58.00  sec   118 MBytes   986 Mbits/sec    2    573 KBytes
[  5]  58.00-59.00  sec   118 MBytes   986 Mbits/sec    1    506 KBytes
[  5]  59.00-60.00  sec   118 MBytes   986 Mbits/sec    0    662 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-60.00  sec  6.80 GBytes   974 Mbits/sec  5725             sender
[  5]   0.00-60.01  sec  6.80 GBytes   973 Mbits/sec                  receiver

avx2

Generating keys for 10 seconds:          27313.96 op/s
Ed25519 sign for 10 seconds:             26049.17 op/s
Ed25519 verify for 10 seconds:            8411.54 op/s
ECDH for 10 seconds:                      6792.54 op/s
SPTPS/TCP authenticate for 10 seconds:    3156.87 op/s
SPTPS/TCP transmit for 10 seconds:           4.00 Gbit/s
SPTPS/UDP authenticate for 10 seconds:    3160.10 op/s
SPTPS/UDP transmit for 10 seconds:           4.13 Gbit/s
[  5]  25.00-26.00  sec   142 MBytes  1.20 Gbits/sec    1    509 KBytes
[  5]  26.00-27.00  sec   142 MBytes  1.20 Gbits/sec    0    691 KBytes
[  5]  27.00-28.00  sec   142 MBytes  1.20 Gbits/sec    1    672 KBytes
[  5]  28.00-29.00  sec   142 MBytes  1.20 Gbits/sec    1    649 KBytes
[  5]  29.00-30.00  sec   142 MBytes  1.20 Gbits/sec    2    619 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.00  sec  4.12 GBytes  1.19 Gbits/sec  273             sender
[  5]   0.00-30.00  sec  4.12 GBytes  1.19 Gbits/sec                  receiver

alt (OpenSSL 1.1.1o)

Generating keys for 10 seconds:          27294.88 op/s
Ed25519 sign for 10 seconds:             26004.59 op/s
Ed25519 verify for 10 seconds:            8409.03 op/s
ECDH for 10 seconds:                      6786.60 op/s

Cipher suite 0 (Chacha20-Poly1305):
SPTPS/TCP authenticate for 10 seconds:    3182.28 op/s
SPTPS/TCP transmit for 10 seconds:           6.24 Gbit/s
SPTPS/UDP authenticate for 10 seconds:    3178.83 op/s
SPTPS/UDP transmit for 10 seconds:           6.16 Gbit/s

Cipher suite 1 (AES-256-GCM):
SPTPS/TCP authenticate for 10 seconds:    3177.70 op/s
SPTPS/TCP transmit for 10 seconds:           8.89 Gbit/s
SPTPS/UDP authenticate for 10 seconds:    3169.44 op/s
SPTPS/UDP transmit for 10 seconds:           8.73 Gbit/s
[  5]  25.00-26.00  sec   154 MBytes  1.29 Gbits/sec    1    628 KBytes
[  5]  26.00-27.00  sec   154 MBytes  1.29 Gbits/sec    4    619 KBytes
[  5]  27.00-28.00  sec   155 MBytes  1.30 Gbits/sec    2    611 KBytes
[  5]  28.00-29.00  sec   155 MBytes  1.30 Gbits/sec    2    595 KBytes
[  5]  29.00-30.00  sec   154 MBytes  1.29 Gbits/sec    1    585 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.00  sec  4.42 GBytes  1.28 Gbits/sec  3427             sender
[  5]   0.00-30.00  sec  4.41 GBytes  1.28 Gbits/sec                  receiver

alt (OpenSSL 3.1)

I don't remember what flags were used to build the library. It may be a debug build, so don't pay much attention to it.

Generating keys for 10 seconds:          27368.60 op/s
Ed25519 sign for 10 seconds:             26068.22 op/s
Ed25519 verify for 10 seconds:            8434.89 op/s
ECDH for 10 seconds:                      6839.51 op/s

Cipher suite 0 (Chacha20-Poly1305):
SPTPS/TCP authenticate for 10 seconds:    3152.25 op/s
SPTPS/TCP transmit for 10 seconds:           5.83 Gbit/s
SPTPS/UDP authenticate for 10 seconds:    3149.35 op/s
SPTPS/UDP transmit for 10 seconds:           5.77 Gbit/s

Cipher suite 1 (AES-256-GCM):
SPTPS/TCP authenticate for 10 seconds:    3151.33 op/s
SPTPS/TCP transmit for 10 seconds:           7.69 Gbit/s
SPTPS/UDP authenticate for 10 seconds:    3149.60 op/s
SPTPS/UDP transmit for 10 seconds:           7.88 Gbit/s
[  5]  25.00-26.00  sec   148 MBytes  1.24 Gbits/sec    4    670 KBytes
[  5]  26.00-27.00  sec   148 MBytes  1.24 Gbits/sec    2    649 KBytes
[  5]  27.00-28.00  sec   148 MBytes  1.24 Gbits/sec    1    632 KBytes
[  5]  28.00-29.00  sec   148 MBytes  1.24 Gbits/sec    2    609 KBytes
[  5]  29.00-30.00  sec   148 MBytes  1.24 Gbits/sec    1    591 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.00  sec  4.24 GBytes  1.23 Gbits/sec  973             sender
[  5]   0.00-30.00  sec  4.24 GBytes  1.22 Gbits/sec                  receiver

I'll leave it here for now until after #360 is merged.

hg marked this pull request as draft on May 23, 2022
gsliepen added the "enhancement" (New feature requests or performance improvement) and "1.1" (Issue related to Tinc 1.1) labels on May 26, 2022