Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error: Esys called in bad sequence #118

Open
Danigaralfo opened this issue Jul 15, 2024 · 6 comments
Open

Error: Esys called in bad sequence #118

Danigaralfo opened this issue Jul 15, 2024 · 6 comments
Assignees
Labels
bug Something isn't working

Comments

@Danigaralfo
Copy link

Danigaralfo commented Jul 15, 2024

Hello,

We have encountered an issue in how the provider behaves when a process spawns two or more threads that access the TPM. In our case, this process was an Apache server that uses the TPM as a generator/storage for the private key. We have built the provider with debug mode enabled as well as activated the tpm2-tss traces.

Depending on the browser, this issue appears or not: with Firefox does not appear, but with Chrome/Edge does. This is because Firefox only sends one request, while Chrome/Edge send two (spawning three threads in the Apache process, one for key loading, and two for request processing).

You can see the logs in chrome_logs.txt. The flow would be as follows (In a general way. The order is not exactly like this always, as one thread could have started the TPM operations before the other one gets even started, but you get the point):

  1. The two requests are received, so the Apache process spawns the three mentioned threads.
  2. One stand-alone thread does some RSA operations, probably because it's re-generating the private key with the context saved in the private key file.
  3. When the key has being loaded by this stand-alone thread, the two other threads start the random number generation:
  4. When the two threads start the operations, the second one fails (lines 16 to 36 in chrome_logs.txt), returning an ESAPI async sequence error.
  5. The Apache server returns an internal SSL error, and the browser does not load the web page, even when the first spawned thread finish flawlessly.

This behaviour could also be observable in the signs operations. Seeing this, we thought that this could be a synchronization problem between the two threads accessing the TPM. This shouldn't be a problem, since the abrmd is a thing, but the daemon manages the access to the TPM at a process level (it uses the Dbus): it delegates the thread access to the developer.

In order to debug more, we tried to address the issue. We found some problems with the RAND operations order, so we ended up deactivating the rand operations in OpenSSL with the provider. We did as follows:

  1. Deactivated the RAND operations in OpenSSL configuration.
  2. Added some mutex to control the execution of the threads. I can provide the file with the mutexes if required.
  3. Changed the traces of the provider, so we could be able to see which thread was doing the TPM operations and which mutex was being locked. Built the modified provider, load it into OpenSSL and restart the Apache process.
  4. Make the requests with the browser that was failing before.

You can see the logs of this execution in chrome_sign_mutex_logs.txt. Now, the execution would go as follows (I will skip the rand generation since it is not being done in the TPM now):

  1. The two requests are received, so the Apache process spawns the two mentioned threads.
  2. One stand-alone thread does some RSA operations, probably because it's re-generating the private key with the context saved in the private key file.
  3. When the key has being loaded by this stand-alone thread, the first thread that gets into the sign operations claims the mutex twice (since DIGEST_INIT is done twice).
  4. The thread with the mutex start the signs operations. When SIGN DIGEST_SIGN estimate is called, the mutex is freed once.
  5. When the SIGN DIGEST_SIGN is done, the mutex is freed totally. Then, the other thread makes the same process.
  6. With all operations done, the Apache server returns the webpage and the browser displays it.

With this, the problem seems to be "solved", at least for the sign operations. We're looking into it more deeply, in order to make a better description of the problem, but some help may be useful.

Any suggestion is appreciated.
Thanks.

chrome_logs.txt
chrome_sign_mutex_logs.txt

@Danigaralfo
Copy link
Author

To add more information, we're using the following versions:

  • tpm2-abrmd: 2.4.1
  • tpm2-openssl: 1.2.0
  • tpm2-tss: 3.2.2

@Danigaralfo
Copy link
Author

We have found the issue. Long story short: the threads are using the same context at the same time. The provider should protect the Esys operations or treat them as atomic operations.

Let's take the random operations as an example. In the TSS function Esys_GetRandom_Async, the esys_ctx->state variable is set to ESYS_STATE_INTERNALERROR by default:

/* Check context, sequence correctness and set state to error for now */
    if (esysContext == NULL) {
        LOG_ERROR("esyscontext is NULL.");
        return TSS2_ESYS_RC_BAD_REFERENCE;
    }
    r = iesys_check_sequence_async(esysContext);
    if (r != TSS2_RC_SUCCESS)
        return r;
    esysContext->state = ESYS_STATE_INTERNALERROR;

If the second thread starts its execution after the first one set this variable to ESYS_STATE_INTERNALERROR, iesys_check_sequence_async(esysContext) will return an error. An example (the state lines were added by me in the code to check the value of esys_ctx->state):

[Tue Jul 16 13:34:56.371358 2024] [ssl:info] [pid 23727:tid 4142355392] [client 192.168.208.186:64050] AH01964: Connection to child 0 established (server 127.0.1.1:443)
RAND GET_CTX_PARAMS [ state ]
RAND NEW
RAND GET_CTX_PARAMS [ max_request ]
RAND GENERATE
[Tue Jul 16 13:34:56.374992 2024] [ssl:info] [pid 23727:tid 4121949120] [client 192.168.208.186:64051] AH01964: Connection to child 1 established (server 127.0.1.1:443)
RAND GET_CTX_PARAMS [ state ]
trace:esys:src/tss2-esys/api/Esys_GetRandom.c:228:Esys_GetRandom_Finish() context=0x1a5c6d8, randomBytes=0xf6e74a3c
[Tue Jul 16 13:34:56.375839 2024] [ssl:debug] [pid 23727:tid 4121949120] ssl_engine_kernel.c(2425): [client 192.168.208.186:64051] AH02645: Server name not provided via TLS extension (using default/first virtual host)
State before state-check in Esys_GetRandom_Finish(): 1 (tid: 4142355392)
RAND NEW
RAND GET_CTX_PARAMS [ max_request ]
RAND GENERATE
trace:esys:src/tss2-esys/api/Esys_GetRandom.c:140:Esys_GetRandom_Async() context=0x1a5c6d8, bytesRequested=0020
State before state-check in Esys_GetRandom_Async(): 3 (tid: 4121949120)
ERROR:esys:src/tss2-esys/esys_iutil.c:1145:iesys_check_sequence_async() Esys called in bad sequence.
ERROR:esys:src/tss2-esys/api/Esys_GetRandom.c:74:Esys_GetRandom() Error in async function ErrorCode (0x00070007)

@gotthardp
Copy link
Contributor

@Danigaralfo I implemented a mutex-based solution. Would you be able to test the https://github.com/tpm2-software/tpm2-openssl/tree/threads branch, please?

gotthardp added a commit that referenced this issue Dec 15, 2024
@Danigaralfo
Copy link
Author

Hello @gotthardp
Sorry for the delay. I will try to test it tomorrow. I'll tell you the results. Thanks!

@Danigaralfo
Copy link
Author

I've checked the commits and I saw that you used CRYPTO_RWLOCK mutex from OpenSSL.

This would resolve the problem with threads, but with more processes created from the parent (APACHE behaves in this way with the StartServers configuration) I think that this will not work.

As I said, tomorrow I will be able to test it. I'll test it using fork and configuring APACHE to use more than one process.

@Danigaralfo
Copy link
Author

I've just tested the APACHE server with the patch that you provided.

When talking about threads, the patch works. When talking about processes, It is still failing with the following errors:

[Thu Dec 19 11:11:43.117649 2024] [ssl:info] [pid 4215:tid 4248] [client 192.168.100.163:54346] AH01964: Connection to child 257 established (server 127.0.1.1:443)
[Thu Dec 19 11:11:43.119102 2024] [ssl:debug] [pid 4215:tid 4248] ssl_engine_kernel.c(2421): [client 192.168.100.163:54346] AH02645: Server name not provided via TLS extension (using default/first virtual host)
[Thu Dec 19 11:11:43.119102 2024] [ssl:debug] [pid 4215:tid 4245] ssl_engine_kernel.c(2421): [client 192.168.100.171:56503] AH02645: Server name not provided via TLS extension (using default/first virtual host)
[Thu Dec 19 11:11:43.121404 2024] [ssl:info] [pid 4167:tid 4222] [client 192.168.100.171:56504] AH01964: Connection to child 134 established (server 127.0.1.1:443)
[Thu Dec 19 11:11:43.124046 2024] [ssl:debug] [pid 4167:tid 4222] ssl_engine_kernel.c(2421): [client 192.168.100.171:56504] AH02645: Server name not provided via TLS extension (using default/first virtual host)
DER DECODER DECODE
TSS2 DECODER DECODE 0x87
TSS2 DECODER LOAD parent: persistent 0x81000001
TSS2 DECODER DECODE 0x87
TSS2 DECODER LOAD parent: persistent 0x81000001
TSS2 DECODER DECODE found RSA
RSA LOAD
RSA GET_PARAMS [ bits security-bits max-size ]
RSA HAS 85
RSA HAS 2
RSA NEW
RSA IMPORT [ n e ]
RSA MATCH 0x6
RSA HAS 2
RSA MATCH 0x6
DER DECODER DECODE
RSA HAS 2
RSA MATCH 0x6
SIGN DIGEST_INIT rsa MD=SHA256
SIGN DIGEST_INIT rsa MD=SHA2-256
SIGN SET_CTX_PARAMS rsa [ pad-mode ]
SIGN SET_CTX_PARAMS rsa [ saltlen ]
SIGN DIGEST_SIGN estimate
SIGN DIGEST_SIGN
ERROR:tcti:src/util/io.c:114:write_all() failed to write to fd 3: Device or resource busy
ERROR:tcti:src/tss2-tcti/tcti-device.c:124:tcti_device_transmit() wrong number of bytes written. Expected 164, wrote 0.
ERROR:esys:src/tss2-esys/api/Esys_Hash.c:198:Esys_Hash_Async() Finish (Execute Async) ErrorCode (0x000a000a)
ERROR:esys:src/tss2-esys/api/Esys_Hash.c:78:Esys_Hash() Error in async function ErrorCode (0x000a000a)
[Thu Dec 19 11:11:43.221886 2024] [ssl:info] [pid 4167:tid 4222] [client 192.168.100.171:56504] AH02008: SSL library error 1 in handshake (server 127.0.1.1:443)
[Thu Dec 19 11:11:43.222001 2024] [ssl:info] [pid 4167:tid 4222] SSL Library Error: error:4000000E:tpm2::cannot hash (655370 tcti:IO failure)
[Thu Dec 19 11:11:43.222059 2024] [ssl:info] [pid 4167:tid 4222] SSL Library Error: error:0A080006:SSL routines::EVP lib
[Thu Dec 19 11:11:43.222093 2024] [ssl:info] [pid 4167:tid 4222] [client 192.168.100.171:56504] AH01998: Connection closed to child 134 with abortive shutdown (server 127.0.1.1:443)
[Thu Dec 19 11:11:43.230119 2024] [ssl:info] [pid 4167:tid 4230] [client 192.168.100.171:56505] AH01964: Connection to child 138 established (server 127.0.1.1:443)
[Thu Dec 19 11:11:43.230753 2024] [ssl:debug] [pid 4167:tid 4230] ssl_engine_kernel.c(2421): [client 192.168.100.171:56505] AH02645: Server name not provided via TLS extension (using default/first virtual host)
SIGN DIGEST_INIT rsa MD=SHA256
SIGN DIGEST_INIT rsa MD=SHA2-256
SIGN SET_CTX_PARAMS rsa [ pad-mode ]
SIGN SET_CTX_PARAMS rsa [ saltlen ]
SIGN DIGEST_SIGN estimate
SIGN DIGEST_SIGN
ERROR:esys:src/tss2-esys/esys_iutil.c:1145:iesys_check_sequence_async() Esys called in bad sequence.
ERROR:esys:src/tss2-esys/api/Esys_Hash.c:78:Esys_Hash() Error in async function ErrorCode (0x00070007)
[Thu Dec 19 11:11:43.250594 2024] [ssl:info] [pid 4167:tid 4230] [client 192.168.100.171:56505] AH02008: SSL library error 1 in handshake (server 127.0.1.1:443)
[Thu Dec 19 11:11:43.250656 2024] [ssl:info] [pid 4167:tid 4230] SSL Library Error: error:4000000E:tpm2::cannot hash (458759 esapi:Function called in the wrong order)
[Thu Dec 19 11:11:43.250704 2024] [ssl:info] [pid 4167:tid 4230] SSL Library Error: error:0A080006:SSL routines::EVP lib
[Thu Dec 19 11:11:43.250730 2024] [ssl:info] [pid 4167:tid 4230] [client 192.168.100.171:56505] AH01998: Connection closed to child 138 with abortive shutdown (server 127.0.1.1:443)

The thing here is the line ERROR:tcti:src/util/io.c:114:write_all() failed to write to fd 3: Device or resource busy when the first sign operations are being performed.

Maybe, instead of using mutexes, why don't use semaphores in shared memory? Allowing all (parent and forked) processes to access them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants