tls-eio: decide on improvements needed #464
Hello @talex5, indeed there have been a lot of PRs and issues. I also suspect this issue is slightly badly titled. From what I have learned so far:
Going into more detail below. Please let me know whether this is useful for you.

Concurrent writes

Thinking about it, I fail to understand the use case (application-wise) where you have a single TLS connection and write to it concurrently from multiple tasks. Usually (HTTP/DNS/...) you have a request-response mechanism with only a single task involved. Would you mind elaborating a concrete, real application that writes concurrently from multiple tasks?

That could indeed be an issue in the lwt backend, though I haven't seen a tls-lwt usage that concurrently writes to a single fd.

Errors using a flow after failure

Maybe we should distinguish some errors here: (a) out of memory, which in my opinion should never be intercepted (indeed, that should be fixed in tls-lwt); (b) any other exception. For (b), my assumption is that any (exposed) TLS alert (from the other side) is fatal, i.e. no further communication is possible. Any local error (while processing data from the other side) is fatal as well, with no way to recover from it. Any socket issue (e.g. ECONNRESET) is also fatal. So, leaving out of memory aside, I don't think it makes sense to use a flow once it has moved to the failed state.

Errors using a flow after cancellation

I agree, we can stick (at least for now) to "cancelling an operation on a flow cancels that flow". Again, I don't see a concrete use case for allowing a read/write to be cancelled and the flow to be used afterwards for a different (or the same) purpose.

Half-close / shutdown

As said, this needs to be implemented in the tls core. There is also the consideration that closing a TLS connection does not imply the underlying transport is finished, i.e. you could re-use that TCP connection for e.g. another TLS handshake (though, to be honest, I lack real-world usage thereof). "What should EIO be doing until half-close / shutdown is in tls?" I don't know; after all, it's your framework.
Now I understand a bit better (your "Errors using flow after failure"). The background is that we have a TLS session where we did some request and received a response. Now the server has finished with the response and sends a close_notify message. Due to the usage of TCP, it can be the case that TLS receives the final data and the close_notify in a single read.

My current approach would be to replace the `handle` function with:

```ocaml
let handle tls buf =
  match Tls.Engine.handle_tls tls buf with
  | Ok (state', `Response resp, `Data data) ->
    let state' = match state' with
      | `Ok tls -> `Active tls
      | `Eof -> `Eof
      | `Alert a -> `Error (Tls_alert a)
    in
    t.state <- state' ;
    safely (resp |> when_some (write_t t)) >|= fun () ->
    (* HERE *)
    if data = None && state' <> `Active _ then state' else `Ok data
  | Error (alert, `Response resp) ->
    t.state <- `Error (Tls_failure alert) ;
    write_t t resp >>= fun () -> read_react t
in
```

What do you think? This would return errors earlier, while keeping the last received bytes intact.
I don't want to do concurrent writes at all. The problematic case is:

1. The server sends a close notification.
2. The client's read sees it and replies with a close confirmation (a write on the underlying flow).
3. At the same time, the client application writes some data.

Then the two writes from (2) and (3) may get interleaved. The fuzz tests don't detect this because once the server has sent the close request it stops reading from the socket, so it doesn't matter if the data sent from the client gets corrupted. I tried adding an assert to the fuzz tests to detect overlapping writes; then the client's read operation gets an assertion failure. But is sending a close confirmation the only case where a TLS read does a socket write? If so, using the TLS 1.3 close behaviour would make all of this complexity go away. If not, we need a lock around the writes.
It doesn't compile for me.
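For reference: the check at the `(* HERE *)` line compares against the pattern `` `Active _ ``, which is only legal in a pattern, not in an expression, so that is presumably the build failure. Below is a minimal sketch, using simplified stand-in types rather than the real tls-lwt definitions, of how that check could be written so it type-checks.

```ocaml
(* Simplified, hypothetical stand-in types -- not the real tls-lwt ones.
   The point is only that [`Active _] must appear in a pattern, so the check
   from the proposal above is expressed as a match instead of [<>]. *)
type state = [ `Active of unit | `Eof | `Error of exn ]

let classify (state' : state) (data : string option) =
  match data, state' with
  | None, ((`Eof | `Error _) as e) -> e    (* no data left: surface EOF/error now *)
  | _, _ -> `Ok data                       (* otherwise deliver the data first *)
```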
I wonder how this close_tls system can work even for TLS 1.2. If you want to shut down a TLS connection but continue using the raw socket afterwards, then you'll still need to wait until you've received the client's confirmation of the shutdown. But doing a TLS read after sending the close notify just fails immediately with end-of-file.
So you do a read and a write at the same time on the same flow. Would you mind pointing me to an application that does this? Usually, a "read" in TLS doesn't invoke a "write", apart from the renegotiation / rekeying that may occur. Of course this can be tracked in the state as well, and then a "read" would never do a "write" (but only record in the state that further writes could use an updated crypto context). But, as said above, I'd be keen to see the concrete application / protocol that behaves as you described.
Indeed, to support such a use case we would need to revise the API. I would wait to think about it until we find a protocol / application that requires such behaviour (re-using the TCP socket once the TLS session is finished).
Most users of TLS will do this, I would think. For example, an HTTP client requesting multiple resources will want to write the requests while reading the responses. If it waits for each response before sending the next request it will be very slow, and if it waits until it has sent all the requests before starting to read then it's likely to deadlock (as the server may decide to stop reading more requests until the client has read some responses). Other examples: capnp-rpc allows either party to send at any time; a chat application will typically allow users to send messages while also waiting for incoming ones; an HTTP connection using push notifications allows both parties to send at any time.
OK, so that explains why the fuzz tests didn't find it then. close_notify isn't tested because we know we can't currently use a flow after sending close_notify, and we don't test renegotiation / rekeying yet.
On the other hand, in the past 8 years of OCaml-TLS being used, there hasn't been such an issue with either Lwt or Async.
Interesting, I've not observed such code in the wild. But maybe that is a thing. To me it sounds like the internal engine will need to be restructured then: a read no longer involves a write. If a new cryptographic context is established, this is preserved in the state, and the next write may use that information to send a KeyUpdate/ChangeCipherSpec message to the other peer and use the new keys. The invariant to preserve is that no read leads to a write (this is not the case at the moment), and no write leads to a read (this is already in place). Does that sound sensible?
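A purely illustrative sketch (not the real `Tls.Engine` API) of the invariant described above: handling an incoming record never produces bytes to send; instead, any new cryptographic context is recorded in the state and picked up by the next write.

```ocaml
(* Hypothetical, simplified engine state: [crypto] stands in for the current
   cryptographic context, [pending_rekey] for a context established by a read
   but not yet announced to the peer. *)
type engine = {
  crypto : string;
  pending_rekey : string option;
}

(* A read only updates the state -- it never produces output records. *)
let handle_incoming engine ~new_crypto =
  { engine with pending_rekey = Some new_crypto }

(* A write first announces any pending rekey (KeyUpdate / ChangeCipherSpec),
   then sends the payload under the keys in use afterwards. *)
let prepare_write engine payload =
  match engine.pending_rekey with
  | None -> engine, engine.crypto, [ payload ]
  | Some c -> { crypto = c; pending_rekey = None }, c, [ "key-update"; payload ]
```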
Yes, it's very unlikely this race would be a problem normally, since it only happens when rekeying (which I doubt many people do) or after closing.
Here's httpaf, for example: https://github.com/inhabitedtype/httpaf/blob/9cefa96f39712d8c8a4957c98970c2ac4a814ae2/lwt-unix/httpaf_lwt_unix.ml#L279-L283 It runs separate read and write threads, so it will read the next request while writing out a previous response.
I think there are two things that would be useful:
We need to be able to do writes from
From what I can see:
This is not in scope at the moment. I worked on shutdown semantics in #488, and figured that while a single TLS record won't lead to both a response (to the other peer) and some data for the application, dealing with multiple records at a time will make this possible. This is only the case during a handshake (or renegotiation), and is a relic of TLS < 1.3.
You can model the delay in your tls-eio, if desired, to arrive at the semantics that you think are necessary.
In #460 (comment), @hannesm commented:
I agree we're sending too many (half-baked) PRs. Let's first decide what behaviour we want and then write the code afterwards. But I think we really need @hannesm's input on this.
As I see it, there are 4 issues being discussed:

1. Concurrent writes
2. Errors using a flow after failure
3. Errors using a flow after cancellation
4. Half-close / shutdown
As far as I can see, 1, 2 and 3 also affect tls-lwt (4 doesn't because the documentation doesn't claim to support it in the first place).
Concurrent writes
Reading from a TLS flow may cause a write. This means that performing a read and a write of the TLS flow at the same time (which should be allowed) may cause overlapping writes on the underlying flow (which isn't).
I'm not sure why we haven't seen this problem in tls-lwt, which seems to be doing the same thing (`Lwt_cs.write_full` doesn't take a lock, even though it may do multiple writes on the FD).

This problem doesn't show up in the fuzz tests because the mock socket handles concurrent writes in order. But I tried modifying it to fail in the case of concurrent writes and it still didn't trigger. Do concurrent writes only happen on errors, perhaps?
In #458 (comment), @hannesm says:
I propose that Eio should explicitly document that the behaviour of concurrent writes is unspecified (see ocaml-multicore/eio#387). It's possible that there's nothing to fix here, but I suspect we need some kind of lock when reporting errors at least. I can make a PR to do that, but I'd like to know whether this is really a problem and how to trigger it in the fuzz tests if so.
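A minimal sketch of the locking idea, assuming Eio's `Eio.Mutex` API; the `t` record and the `write_underlying` parameter are hypothetical stand-ins for the tls-eio internals, not its real definitions.

```ocaml
(* Serialise all writes to the underlying flow: application writes and writes
   triggered by reads (close confirmations, alerts, error reporting) would all
   go through [locked_write], so they can never interleave. *)
type 'flow t = {
  flow : 'flow;
  write_lock : Eio.Mutex.t;
}

let locked_write t ~write_underlying bufs =
  Eio.Mutex.use_rw ~protect:true t.write_lock (fun () ->
    write_underlying t.flow bufs)
```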
Errors using a flow after failure
If an exception occurs while reading or writing the underlying flow, tls-eio moves into the `Error` state and all future operations fail with the same error.

This can be confusing. For example, if a read failed with `Out_of_memory` then all future reads will also claim to be out of memory. The stack-trace only indicates where the error was rethrown, not the original cause.

This also affects tls-lwt, but people don't expect to get reasonable backtraces from Lwt anyway.
I propose that when a stored error is re-reported it is wrapped in `Tls_previously_failed _` to indicate that the error hasn't just happened.

In #458, @bikallem instead proposes not storing errors at all. I think that could be confusing, however, since you will likely get bogus errors about protocol violations. In any case, I think the tls library as a whole should decide whether it wants to do this and tls-eio should do the same as tls-lwt.
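A hedged sketch of that wrapping idea; the `t` record and the `check_state` helper are hypothetical, not the real tls-eio internals. The first failure is stored as-is, and later operations re-report it wrapped, so it's obvious that the error happened earlier and that the current backtrace isn't its real origin.

```ocaml
exception Tls_previously_failed of exn

type t = { mutable state : [ `Active | `Failed of exn ] }

(* Record the first failure and raise it normally. *)
let fail t e =
  t.state <- `Failed e;
  raise e

(* Any later operation calls this first: a stored error is re-reported wrapped,
   making clear it is the original failure being repeated. *)
let check_state t =
  match t.state with
  | `Active -> ()
  | `Failed original -> raise (Tls_previously_failed original)
```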
Errors using a flow after cancellation
If a read or write operation is cancelled, this raises a
Cancelled
exception which is stored as noted above. This means that a TLS flow cannot be used again after being cancelled. The reported error is confusing because it looks like it has been cancelled again and doesn't mention the location of the original cancellation. This is a bigger problem than for other exceptions because the original cancellation exception will typically not have been shown to the user.This should also affect Lwt, which also raises an exception to indicate cancellation.
There seem to be three reasonable options here:
Option 1 seems simplest here. I'm not convinced there's much use for cancelling a flow operation if you might want it later, and we've managed with this behaviour in tls-lwt so far without trouble.
Half-close / shutdown
tls-eio claims to implement the `Eio.Flow.two_way` signature, which allows performing a half-close. However, tls always closes both directions (and tls-eio gives a confusing behaviour when you try to read, by simply returning end-of-file).

#460 tried to fix this by shutting down the underlying flow, but that won't work (both because TLS needs to write in order to read, and because the underlying flow is supposed to remain usable after a TLS shutdown).
It should be possible to support half-close properly (tracked in #452). @hannesm wrote:
So the main question is what tls-eio should do until then. Some options are:

1. Document in `tls_eio.mli` that shutdown closes both directions for now and note that we intend to improve this in future.
4. Change the API to provide a separate `close_tls` operation, as tls-lwt does.

Part of the problem is that some users may be satisfied with the current behaviour (i.e. they don't plan to read after shutting down the sending side), so (2) or (4) will annoy them unnecessarily. And changing the API in (4) and then changing it back once it's working seems annoying too. So I think I prefer either (1) or (3).
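For option (4), a purely illustrative sketch of what a separate close operation (analogous to tls-lwt's `close_tls`) might look like; the `t` type and its fields are hypothetical stand-ins, not the real tls-eio or `Tls.Engine` definitions.

```ocaml
(* The key property: closing the TLS session sends a close_notify but leaves
   the underlying flow untouched, so the caller can keep using the transport. *)
type session  (* abstract stand-in for the TLS engine state *)

type t = {
  mutable state : [ `Active of session | `Closed | `Error of exn ];
  write_underlying : string -> unit;               (* writes to the transport *)
  send_close_notify : session -> session * string; (* produces the close_notify record *)
}

let close_tls t =
  match t.state with
  | `Active s ->
      let _s', notify = t.send_close_notify s in
      t.state <- `Closed;
      t.write_underlying notify
  | `Closed | `Error _ -> ()
```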