[QAT] No devices found in container without privilege #1700

Open
Kewei-Lu opened this issue Mar 26, 2024 · 2 comments


@Kewei-Lu

Describe the bug
Some processes report "No devices found" when running the openssl speed command in a container that is not privileged.

To Reproduce

  1. Build the docker image based on demo/openssl-qat-engine/Dockerfile
  2. Deploy qat device-plugin and check resources are available
$ kubectl describe node node1
 Allocatable:
   cpu:                128
   ephemeral-storage:  67612704657
   hugepages-1Gi:      0
   hugepages-2Mi:      0
   memory:             263621052Ki
   pods:               110
   qat.intel.com/cy:   128
  3. Deploy openssl-qat-engine with the manifest below
kind: Pod
apiVersion: v1
metadata:
  name: openssl-qat-engine
spec:
  containers:
  - name: openssl-qat-engine
    image: [My local registry]/qat-engine:latest
    imagePullPolicy: Always
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    resources:
      requests:
        qat.intel.com/cy: '16'
      limits:
        qat.intel.com/cy: '16'
    securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          add:
          - NET_BIND_SERVICE
          - IPC_LOCK
        readOnlyRootFilesystem: false
  4. Log in to the container and verify using openssl
$ kubectl exec -it openssl-qat-engine bash
$ openssl engine -c -t -v qatengine
(qatengine) Reference implementation of QAT crypto engine(qat_hw & qat_sw) v1.4.0
 [RSA, AES-128-CBC-HMAC-SHA256, AES-256-CBC-HMAC-SHA256, ChaCha20-Poly1305, id-aes128-GCM, id-aes192-GCM, id-aes256-GCM, SHA3-256, SHA3-384, SHA3-512, TLS1-PRF, X25519, X448, SM2]
     [ available ]
     ENABLE_EXTERNAL_POLLING, POLL, SET_INSTANCE_FOR_THREAD,
     GET_NUM_OP_RETRIES, SET_MAX_RETRY_COUNT, SET_INTERNAL_POLL_INTERVAL,
     GET_EXTERNAL_POLLING_FD, ENABLE_EVENT_DRIVEN_POLLING_MODE,
     GET_NUM_CRYPTO_INSTANCES, DISABLE_EVENT_DRIVEN_POLLING_MODE,
     SET_EPOLL_TIMEOUT, SET_CRYPTO_SMALL_PACKET_OFFLOAD_THRESHOLD,
     ENABLE_INLINE_POLLING, ENABLE_HEURISTIC_POLLING,
     GET_NUM_REQUESTS_IN_FLIGHT, INIT_ENGINE, SET_CONFIGURATION_SECTION_NAME,
     ENABLE_SW_FALLBACK, HEARTBEAT_POLL, DISABLE_QAT_OFFLOAD, HW_ALGO_BITMAP,
     SW_ALGO_BITMAP
80FBC9006E7F0000:error:1280006A:DSO support routines:dlfcn_bind_func:could not bind to the requested symbol name:../crypto/dso/dso_dlfcn.c:188:symname(EVP_PKEY_base_id): /usr/lib/x86_64-linux-gnu/engines-3/qatengine.so: undefined symbol: EVP_PKEY_base_id
80FBC9006E7F0000:error:1280006A:DSO support routines:DSO_bind_func:could not bind to the requested symbol name:../crypto/dso/dso_lib.c:176:

# This works fine
$ openssl speed -engine qatengine -elapsed -async_jobs 8 rsa2048
Engine "qatengine" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing 2048 bits private rsa's for 10s: 199083 2048 bits private RSA's in 10.00s
Doing 2048 bits public rsa's for 10s: 959389 2048 bits public RSA's in 10.00s
version: 3.0.2
built on: Fri Feb 16 08:51:30 2024 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -ffile-prefix-map=/build/openssl-olCZw9/openssl-3.0.2=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_TLS_SECURITY_LEVEL=2 -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
CPUINFO: OPENSSL_ia32cap=0x7ffef3ffffebffff:0xfb417ffef3bfb7ef
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.000050s 0.000010s  19908.3  95938.9

# With multiple processes (-multi 8), the errors appear
$ openssl speed -engine qatengine -elapsed -async_jobs 8 -multi 8 rsa2048

Forked child 0
Forked child 1
Forked child 2
Forked child 3
Forked child 4
Forked child 5
Forked child 6
Forked child 7
No devices found
No devices found
No devices found
No device found
No device found
No device found
Engine "qatengine" set.
Engine "qatengine" set.
Engine "qatengine" set.
+DTP:2048:private:rsa:10
+DTP:2048:private:rsa:10
+DTP:2048:private:rsa:10
No devices found
No device found
...
Got: +F2:2:2048:6680.100000:83645.800000 from 0
Got: +F2:2:2048:6630.500000:83514.600000 from 1
Don't understand line 'ADF_UIO_PROXY err: icp_adf_userProcessToStart: Failed to start SHIM' from child 2
Got: +F2:2:2048:11644.800000:144419.200000 from 2
Got: +F2:2:2048:6952.300000:84143.700000 from 3
Don't understand line 'ADF_UIO_PROXY err: icp_adf_userProcessToStart: Failed to start SHIM' from child 4
Got: +F2:2:2048:11627.200000:142830.069930 from 4
Don't understand line 'ADF_UIO_PROXY err: icp_adf_userProcessToStart: Failed to start SHIM' from child 5
Got: +F2:2:2048:11668.800000:143320.000000 from 5
Don't understand line 'ADF_UIO_PROXY err: icp_adf_userProcessToStart: Failed to start SHIM' from child 6
Got: +F2:2:2048:11376.800000:127321.378621 from 6
Don't understand line 'ADF_UIO_PROXY err: icp_adf_userProcessToStart: Failed to start SHIM' from child 7
Got: +F2:2:2048:11647.200000:142272.027972 from 7

# The result still seems to be accelerated, but I am not sure whether via qat_sw or qat_hw
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -ffile-prefix-map=/build/openssl-olCZw9/openssl-3.0.2=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_TLS_SECURITY_LEVEL=2 -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
CPUINFO: OPENSSL_ia32cap=0x7ffef3ffffebffff:0xfb417ffef3bfb7ef
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.000013s 0.000001s  78227.7 951466.8

Expected behavior
All 8 processes should be able to use qatengine, since 16 instances are passed through to the container.

Screenshots
image
image

System (please complete the following information):

  • OS version: CentOS Stream release 8
  • Kernel version: 6.8.1-1.el8.elrepo.x86_64
  • Device plugins version: v0.29.0
  • Hardware info: CPU: 6454S + intree driver

Additional context

As the screenshot shows, only some of the processes fail to get a QAT handle, which puzzles me.

What makes the problem trickier is that if I add privileged: true to the pod manifest, everything works (i.e., I can run 16 processes with openssl speed without any errors), but I don't think privileged mode is acceptable in a production environment.
image
image

@mythi
Contributor

mythi commented Mar 26, 2024

$ openssl speed -engine qatengine -elapsed -async_jobs 8 rsa2048

Can you try with 4 jobs? It could be that the qatlib allocation limitation triggers the problem you're seeing. Try setting the QAT_POLICY=1 environment variable. If that helps, we'll need to update our docs a bit.

Ref: https://github.com/intel/qatlib/blob/ec817626e7de237b24cfb91b7cad076902df603a/INSTALL#L519-L522

@Kewei-Lu
Author

Nice catch! The error no longer appears after adding that environment variable to the container. Really appreciate it :)
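
For anyone landing here later, a minimal sketch of where the variable could go in the pod manifest from the reproduction steps. This is only a fragment, and it assumes the variable is set through the container's env list; the thread does not show the exact manifest that was used, so treat the placement as an assumption.

```yaml
# Fragment of the pod spec from the reproduction steps (sketch only).
spec:
  containers:
  - name: openssl-qat-engine
    env:
    - name: QAT_POLICY   # variable suggested in the comment above
      value: "1"         # env values in a pod spec must be strings
    resources:
      limits:
        qat.intel.com/cy: '16'
```

The rest of the manifest (image, command, securityContext) stays as in the original report, without privileged: true.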
