Discussion: UDP protocol performance optimization #194
Comments
Generate derived keys in background, and save them into a cache for future usage? |
I am also looking for a way to do a PING-PONG exchange between client and server, so that a temporary session key can be passed along. |
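Well, then we may have to add synchronization between the key derivation task and the UDP association tasks. The lock may become a new bottleneck. Another solution may be using a faster KDF, for example:
// Nightly-only benchmark. Assumes the hkdf and rand crates, plus blake3
// built with its Digest trait support (the traits-preview feature).
#![feature(test)]
extern crate test;

use hkdf::SimpleHkdf;
use rand::Rng;
use test::Bencher;

#[bench]
fn bench_blake3(b: &mut Bencher) {
    const SUBKEY_INFO: &[u8] = b"ss-subkey";
    const KEY: &[u8] = b"12345678901234567890123456789012";
    let mut iv = [0u8; 32];
    let mut key = [0u8; 32];
    let mut rng = rand::thread_rng();
    b.iter(|| {
        // Fresh random salt per iteration, as for every UDP packet.
        rng.fill(&mut iv);
        let hk = SimpleHkdf::<blake3::Hasher>::new(Some(&iv), KEY);
        hk.expand(SUBKEY_INFO, &mut key).expect("hkdf-blake3");
    });
}
The result shows: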
The HMAC algorithm should be the key for optimization. |
There are several obvious flaws in this method:
So I don't think this will solve this problem effectively. |
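My opinion is not very professional, so please bear with me. This approach has one problem, though: the time factor. Every device keeps its own clock. Even though I use NTP to keep the server's time as closely synchronized as possible (within ±2 seconds of the phone's time), the clock error can still cause the two sides to generate different keys near a boundary. I have tried some optimizations, but the result is still not ideal. The main issue is that when the user is moving at high speed, the phone frequently switches base stations. Connections that were already established are not closed, so traffic keeps using the old key, the server fails to decrypt it, and quality suffers. With UDP, of course, the impact would be much smaller. |
What I am working on now is how to securely transmit the time factor used to generate the key. My tentative approach is to append it directly to the end of the ciphertext, but since that changes the packet format, I am not inclined to go this way. |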
I did a quick calculation based on your benchmark: 10^6 us / 2.2 us * 1500 bytes ≈ 681 MB/s of bandwidth per CPU core. So it looks to me that, for most of our users, the bottleneck is their internet speed... |
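For anyone who wants to redo this estimate with their own benchmark numbers, here is a minimal Rust sketch of the same arithmetic. The function name is made up for illustration; the 2.2 us per-packet cost and the 1500-byte packet size are the figures quoted in the comment above, and the result is only an upper bound for one core that spends all its time on this step.

```rust
/// Upper bound on throughput (bytes per second) for a single CPU core when
/// every packet costs `per_packet_us` microseconds of CPU time.
/// Hypothetical helper, not part of any shadowsocks crate.
fn max_throughput_bytes_per_sec(per_packet_us: f64, packet_bytes: f64) -> f64 {
    // 10^6 microseconds per second divided by the per-packet cost
    // gives the number of packets one core can process per second.
    let packets_per_sec = 1_000_000.0 / per_packet_us;
    packets_per_sec * packet_bytes
}

/// Same bound expressed in gigabits per second (decimal units).
fn max_throughput_gbps(per_packet_us: f64, packet_bytes: f64) -> f64 {
    max_throughput_bytes_per_sec(per_packet_us, packet_bytes) * 8.0 / 1e9
}
```

With the numbers from the comment, `max_throughput_bytes_per_sec(2.2, 1500.0)` comes out at roughly 681.8 million bytes per second, which is about 5.45 Gbit/s per core.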
Yes, it's true. I made some simple speedtests locally with iperf3:
On my laptop (i7-9750H), when testing with
and both
CPU consumption is
You are correct that average users won't run into these extreme cases, so we can set these numbers aside and focus on how to make UDP associations perform as well as TCP channels. Or maybe UDP associations should perform better than TCP channels, because UDP tests don't have protocol overhead like ACKs, ... |
I have been doing experiments and benchmarks on changing the current Shadowsocks AEAD protocol. If standardized, the spec will be maintained by the Shadowsocks.NET organization.
Shadowsocks 2022 Edition
Goals
Non-Goals
PSK
The popular VPN protocol WireGuard provides a simple userspace program
Instead of asking the user to provide a password, Shadowsocks 2022 is taking the same approach. A user can use
Subkey Derivation
HKDF_SHA1 is replaced by BLAKE3's key derivation mode. A randomly generated 32-byte salt is appended to the PSK to be used as key material.
I believe BLAKE3's key derivation mode alone without HKDF is secure enough for this purpose.
Required Method
TCP
ChaCha20-Poly1305
Protocol
The first message MUST be the header or start with the header.
Replay Protection
Both server and client MUST record all incoming salts during the last 30 seconds. When a new TCP session is started, the first received message is decrypted and its timestamp MUST be checked against system time. If the time difference is within 30 seconds, then the salt is checked against all stored salts. If no repeated salt is discovered, then the salt is added to the pool and the session is successfully established.
UDP
XChaCha20-Poly1305
The official Go implementation of ChaCha20-Poly1305 provides XChaCha20-Poly1305. RustCrypto's
Protocol
Session ID based Routing and Sliding Window Replay Protection
An implementation SHOULD implement a NAT table using session ID as the key. A NAT entry SHOULD at least store the following information:
Upon receiving an encrypted packet, the packet is decrypted using the first 24 bytes as nonce. The header is verified by checking the timestamp against system time. Then the session ID in the header is used to look up its NAT entry. If the lookup is successful, pass the packet ID to the sliding window filter to complete verification. The last seen time and address MUST be updated after packet verification. Updating last seen address ensures that, when the client changes network, the session won't be interrupted.
Optional Methods
Reduced-round variants of
|
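The UDP verification steps described above (timestamp check, session ID lookup, sliding window filter, then updating last seen time and address) can be sketched roughly as follows. This is not taken from the draft or from any implementation: the type names, the 64-packet window, and the choice to create a NAT entry the first time a session ID is seen are assumptions for illustration. The 30-second tolerance is borrowed from the TCP replay rule, since the UDP text only says the timestamp is checked against system time.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

/// Assumed tolerance, borrowed from the TCP replay rule.
const MAX_TIME_DIFF_SECS: u64 = 30;
/// Assumed window size: one bit per packet ID in a u64 bitmap.
const WINDOW_SIZE: u64 = 64;

/// Tracks the highest packet ID seen plus a bitmap of the IDs just below it.
/// Bit 0 of `bitmap` stands for `highest`, bit n for `highest - n`.
struct SlidingWindow {
    highest: Option<u64>,
    bitmap: u64,
}

impl SlidingWindow {
    fn new() -> Self {
        SlidingWindow { highest: None, bitmap: 0 }
    }

    /// Returns true and records `id` if it is new; false if it is a replay
    /// or has fallen out of the window.
    fn check_and_update(&mut self, id: u64) -> bool {
        let highest = match self.highest {
            Some(h) => h,
            None => {
                self.highest = Some(id);
                self.bitmap = 1;
                return true;
            }
        };
        if id > highest {
            // Slide the window forward so that bit 0 is the new highest ID.
            let shift = id - highest;
            self.bitmap = if shift >= WINDOW_SIZE { 1 } else { (self.bitmap << shift) | 1 };
            self.highest = Some(id);
            return true;
        }
        let offset = highest - id;
        if offset >= WINDOW_SIZE {
            return false; // too old to track
        }
        let mask = 1u64 << offset;
        if self.bitmap & mask != 0 {
            return false; // already seen
        }
        self.bitmap |= mask;
        true
    }
}

/// Per-session state; the fields are inferred from the surrounding text.
struct NatEntry {
    window: SlidingWindow,
    last_seen_secs: u64,
    last_seen_addr: SocketAddr,
}

/// NAT table keyed by the 64-bit session ID.
struct NatTable {
    entries: HashMap<u64, NatEntry>,
}

impl NatTable {
    fn new() -> Self {
        NatTable { entries: HashMap::new() }
    }

    /// Call after the packet has been decrypted. `timestamp` and `now` are
    /// Unix seconds. Returns true if the packet should be accepted.
    fn verify(&mut self, session_id: u64, packet_id: u64, timestamp: u64, now: u64, from: SocketAddr) -> bool {
        if now.abs_diff(timestamp) > MAX_TIME_DIFF_SECS {
            return false;
        }
        // Assumption: an unknown session ID starts a new session.
        let entry = self.entries.entry(session_id).or_insert_with(|| NatEntry {
            window: SlidingWindow::new(),
            last_seen_secs: now,
            last_seen_addr: from,
        });
        if !entry.window.check_and_update(packet_id) {
            return false;
        }
        // Updated only after verification, so a replayed packet cannot
        // redirect the session to another address.
        entry.last_seen_secs = now;
        entry.last_seen_addr = from;
        true
    }
}
```

Because the last seen address is only rewritten after the sliding window accepts the packet, a client that changes network keeps its session, while a replayed packet sent from another address is rejected without touching the entry.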
Open Questions
Reference Implementations
/cc @madeye @Mygod @riobard from ss org |
I'm sorry but UDP tunnel in userspace won't reach the same performance as TCP because stream-based abstraction enjoys extensive optimization from the kernel while packet-based abstraction doesn't. |
Ah, thanks for your work.
It doesn't seem to add too much implementation complexity, so I am OK with it. I wouldn't suggest using
In terms of UDP sessions, how do we generate a globally unique session ID? |
Well yes, I did some tests with TCP,
UDP,
Conclusion: UDP can only reach 50% of TCP's bandwidth. Hmm... But @riobard, the
If we can lower the cost of each UDP packet, the bandwidth should at least match the current TCP implementation (7 Gbps).
The UDP associations could have the same performance (7Gbps on my laptop) as the TCP channels. |
@zonyitoo Don't test |
XChaCha20-Poly1305 is not an IETF standard. And it should be very easy to wrap a fast ChaCha20-Poly1305 implementation into XChaCha20-Poly1305 by adding an HChaCha20 layer.
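To illustrate how thin the HChaCha20 layer is, here is a minimal sketch in plain Rust with no crates. It follows the XChaCha construction as I understand it from the CFRG draft: HChaCha20 over the key and the first 16 nonce bytes yields a subkey, and the remaining 8 nonce bytes, prefixed with 4 zero bytes, become the 12-byte nonce for an ordinary ChaCha20-Poly1305 implementation. The function names are mine, and this is not constant-time-audited or otherwise production-ready code.

```rust
/// One ChaCha quarter round on state words a, b, c, d.
fn quarter_round(s: &mut [u32; 16], a: usize, b: usize, c: usize, d: usize) {
    s[a] = s[a].wrapping_add(s[b]);
    s[d] = (s[d] ^ s[a]).rotate_left(16);
    s[c] = s[c].wrapping_add(s[d]);
    s[b] = (s[b] ^ s[c]).rotate_left(12);
    s[a] = s[a].wrapping_add(s[b]);
    s[d] = (s[d] ^ s[a]).rotate_left(8);
    s[c] = s[c].wrapping_add(s[d]);
    s[b] = (s[b] ^ s[c]).rotate_left(7);
}

/// Little-endian load of the i-th 32-bit word of `bytes`.
fn load_le(bytes: &[u8], i: usize) -> u32 {
    u32::from_le_bytes([bytes[4 * i], bytes[4 * i + 1], bytes[4 * i + 2], bytes[4 * i + 3]])
}

/// HChaCha20: derives a 32-byte subkey from a key and a 16-byte nonce prefix.
fn hchacha20(key: &[u8; 32], nonce16: &[u8; 16]) -> [u8; 32] {
    // State layout: 4 constant words ("expand 32-byte k"), 8 key words,
    // 4 nonce words.
    let mut s = [0u32; 16];
    s[0] = 0x6170_7865;
    s[1] = 0x3320_646e;
    s[2] = 0x7962_2d32;
    s[3] = 0x6b20_6574;
    for i in 0..8 {
        s[4 + i] = load_le(key, i);
    }
    for i in 0..4 {
        s[12 + i] = load_le(nonce16, i);
    }
    // 20 rounds = 10 double rounds (column rounds, then diagonal rounds).
    for _ in 0..10 {
        quarter_round(&mut s, 0, 4, 8, 12);
        quarter_round(&mut s, 1, 5, 9, 13);
        quarter_round(&mut s, 2, 6, 10, 14);
        quarter_round(&mut s, 3, 7, 11, 15);
        quarter_round(&mut s, 0, 5, 10, 15);
        quarter_round(&mut s, 1, 6, 11, 12);
        quarter_round(&mut s, 2, 7, 8, 13);
        quarter_round(&mut s, 3, 4, 9, 14);
    }
    // Unlike the ChaCha20 block function, the input state is not added back.
    // The subkey is the first and last rows of the final state.
    let mut out = [0u8; 32];
    for i in 0..4 {
        out[4 * i..4 * i + 4].copy_from_slice(&s[i].to_le_bytes());
        out[16 + 4 * i..20 + 4 * i].copy_from_slice(&s[12 + i].to_le_bytes());
    }
    out
}

/// Maps an XChaCha20 (key, 24-byte nonce) pair to the (subkey, 12-byte nonce)
/// pair that a regular ChaCha20-Poly1305 implementation would be given.
fn xchacha_to_chacha(key: &[u8; 32], nonce24: &[u8; 24]) -> ([u8; 32], [u8; 12]) {
    let mut prefix = [0u8; 16];
    prefix.copy_from_slice(&nonce24[..16]);
    let subkey = hchacha20(key, &prefix);
    let mut nonce12 = [0u8; 12]; // first 4 bytes stay zero
    nonce12[4..].copy_from_slice(&nonce24[16..]);
    (subkey, nonce12)
}
```

The per-packet cost of this wrapper is one extra ChaCha permutation, which is the trade-off being discussed: a 24-byte random nonce per packet in exchange for one more block-sized computation.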
The session ID is 64-bit long. I wouldn't worry about the probability of collision of two randomly generated 64-bit integers. If you don't want random numbers, it's still safe to use a counter as session ID, as long as you stick with |
@database64128 Is IETF C20P1305 slow and/or broken? If not, please refrain from adding even more choices/confusions to the current mess. We're already having a hard time explaining which cipher is the best choice for the average users. The mistakes from the horribly long list of stream ciphers must be avoided. |
@riobard I think in the spec we can advise implementations to gate optional ciphers behind optional features. Advanced users and developers can build their own binary by enabling the optional features. The spec only requires implementing one method |
BTW, we should focus more on the protocol itself and on how to lower the CPU cost of the UDP protocol. Devices with limited compute, like mobile phones or routers, will benefit a lot if we can make the protocol faster. |
@database64128 No, the intended goal is to reduce the number of optional ciphers, ideally leaving only one mandatory cipher (which is IETF C20P1305), so there's no need for average users to even think about the choice. If you read carefully the current spec, that's exactly what is being written:
The fact that various implementations decide to add other AEAD ciphers is very unfortunate, as it creates more confusion for little benefit. You failed to explain why it is necessary to create another cipher, and I don't see any benefits changing the status quo. |
@zonyitoo I'd like to but there's only so much you could do with UDP in userspace. |
Well, I think what @database64128 's proposal said is TCP protocol uses only |
@riobard I don't see any problem with letting the user select from existing Shadowsocks AEAD ciphers. The only practical difference between existing Shadowsocks AEAD ciphers is probably performance. My spec only has ONE mandatory method: To give users some flexibility, some optional methods are suggested, only because they are just as secure, and can yield significant performance boosts. For example, switching from ChaCha20-Poly1305 to AES-256-GCM increases the maximum TCP throughput by 25%, the separate header proposal for UDP is twice as fast as the mandatory XChaCha20-Poly1305 construction. Please keep in mind that performance boosts translate to less energy consumption on resource-constrained devices. |
@zonyitoo No, he's also changing key derivation procedure which breaks backward compatibility for no benefits. To the end users, they'll just see one more entry in the (already long enough) list of available ciphers. The technical details are irrelevant to the discussion of reducing optional choices (and thus complexity). @database64128 Two points:
That's your opinion and I disagree. Complexity is the root of all evil in software. That's why TLS 1.3 is getting rid of most options and there's not even a choice of cipher in WireGuard.
This only happens on devices with hardware-accelerated AES instructions, and even on those devices the software needs to be carefully designed to properly use those instructions. On iOS (arguably one of the easier platforms to support), it was very late (IIRC in 2021 at earliest) when client apps (e.g. Surge) could figure out how to correctly make use of AES acceleration. And the performance boost isn't really worth the additional complexity and potential implementation defects (AES/GCM is notoriously difficult to get right). Optimized implementation of C20P1305 is more than enough on the majority of modern devices. So no, I just don't see the benefits of the proposal. |
What backward compatibility are you talking about? The ability for a user to choose an arbitrary password? I don't see how maintaining such "compatibility" could provide any benefits.
And here I am, getting rid of the uncertainty of user-provided password, so we don't have to worry about choosing a secure key derivation algorithm for passwords. HKDF_SHA1 is replaced by BLAKE3's key derivation mode because HKDF_SHA1 is slow, and because I don't want to use anything marked as "obsolete" in the year of 2022.
This is a very cynical take on the issue. And if you are not confident about using AES, just stick to the default ChaCha20-Poly1305. |
Err.. Since we are talking about a new protocol for the future, I think we can assume that end users are using devices with hardware-accelerated AES instructions (mostly
These tests are done on my laptop (
Well, from the test shown above, the optimized C20P1305 is still about 50% slower than aes-256-gcm.
Hmm? I don't think so, the current AEAD protocol should remain unchanged. All the current discussion is only applied to the new protocol. The version 1 (stream protocol), version 3 (AEAD protocol) will not be changed. @riobard Since SHA1 is marked as cryptographically broken, it is a good chance to replace it with a new modern hash function. I am Ok with this proposal to choose Blake3, because it is actually faster than
The 1st test is As for |
Maybe we should set aside the topic about using exactly 1 chosen cipher or keep the 3 selected ones in version 3 (AEAD protocol). We should discuss more about the design of the protocol itself. |
I disagree.
|
Ah, it uses MD5, I just remembered. Alright, how about using another KDF to replace it? Generating one by hand is not user friendly in most cases. |
It's actually much more user-friendly than the current best practice of generating a password in your password manager GUI, then copy-paste it to your config files. Now all you need is a cryptographically-secure 32-byte key. You can generate one with a one-liner in your favorite shell, which anyone running a Shadowsocks server should be familiar with. Or you can use existing userspace programs like When we ask a user to input a password, they may not bother to actually generate a secure one. But when we ask that they must provide a base64-encoded 32-byte key, it's very unlikely that any weak key gets used, unless the user very much intends to do so. |
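As a concrete illustration of how little is involved in producing such a key, here is a minimal sketch in Rust with no external crates. It assumes a Unix-like system where /dev/urandom is available, and the function names are made up for this example. In a shell, `openssl rand -base64 32` produces the same kind of output.

```rust
use std::fs::File;
use std::io::{self, Read};

const B64: &[u8; 64] = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Standard base64 with '=' padding, written out by hand to avoid a crate.
fn base64_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        // Pack up to 3 bytes into a 24-bit group, padding with zeros.
        let b1 = *chunk.get(1).unwrap_or(&0);
        let b2 = *chunk.get(2).unwrap_or(&0);
        let n = (chunk[0] as u32) << 16 | (b1 as u32) << 8 | b2 as u32;
        out.push(B64[(n >> 18) as usize & 63] as char);
        out.push(B64[(n >> 12) as usize & 63] as char);
        out.push(if chunk.len() > 1 { B64[(n >> 6) as usize & 63] as char } else { '=' });
        out.push(if chunk.len() > 2 { B64[n as usize & 63] as char } else { '=' });
    }
    out
}

/// Reads 32 bytes from the OS random source and returns them base64-encoded.
/// Assumption: /dev/urandom exists (Linux, macOS, BSD).
fn generate_psk() -> io::Result<String> {
    let mut key = [0u8; 32];
    File::open("/dev/urandom")?.read_exact(&mut key)?;
    Ok(base64_encode(&key))
}
```

A 32-byte key always encodes to 44 base64 characters ending in a single `=`, which also gives implementations a cheap sanity check on user input before decoding.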
I do agree with you, but we just cannot tell users what to do. Users can generate a 32-byte key with any tool. Users who know exactly what they are doing can generate a Base64-encoded one and pass it to the
Providing one won't take much effort, and it only runs once when the process starts. How about Argon2, which has been proven to be a cryptographically secure KDF (password hash)? |
From my point of view, AES-256-GCM on modern devices is fast enough in this age. |
Well then C20P1305 is fast enough for the vast majority 😎 |
Just saw an HN submission about using kTLS in NGINX. Some interesting comments I'd like to quote:
|
Don't be silly: Netflix edge devices stream at 50~100Gbps. We can talk about CPU usage when 10Gbps fiber is as common as 100Mbps. |
We should talk about CPU usage, because not everyone can afford beefy machines like Netflix's edge devices. My dual-core AMD EPYC $18/mo Digital Ocean VPS runs at 50% (out of 100%) CPU utilization when I download over WireGuard over Shadowsocks 2022 |
Yeah… go figure which part eats your CPU budget. |
As outlined in this blog post by Cloudflare, using
|
shadowsocks-rust's Shadowsocks 2022 implementation is not 100% complete, but it's ready for benchmarks.
|
Many people rely on domestic relays with better international connectivity for their Shadowsocks servers. Most servers serve more than one person, and using more than one port is not always an option. With legacy Shadowsocks, we have Outline, which supports multiple passwords on a single port using trial decryption, and mmp-go, which relays from one port to multiple servers based on the password used. These solutions seek to maintain backward compatibility by resorting to brute force, which has performance and security implications. We need a solution that's built into the protocol, fast, and secure by default.
Shadowsocks 2022 Extensible Identity Headers
Identity headers are one or more additional layers of headers, each consisting of the next layer's PSK hash. The next layer of an identity header is the next identity header, or the protocol header if it's the last identity header. Identity headers are encrypted with the current layer's identity PSK using an AES block cipher. Identity headers are implemented in such a way that's fully backward compatible with current Shadowsocks 2022 implementations. Each identity processor is fully transparent to the next.
TCP
In TCP requests, identity headers are located between salt and AEAD chunks.
UDP
In UDP packets, identity headers are located between the separate header (session ID, packet ID) and AEAD ciphertext.
When iPSKs are used, the separate header MUST be encrypted with the first iPSK. Each identity processor MUST decrypt and re-encrypt the separate header with the next layer's PSK.
Scenarios
A set of PSKs, delimited by
A relay decrypts the first identity header with its identity key, looks up the PSK hash table to find the target server, and relays the remainder of the request.
A single-port-multi-user-capable server decrypts the identity header with its identity key, looks up the user PSK hash table to find the cipher for the user PSK, and processes the remainder of the request.
In the above graph, To start a TCP session, To process the TCP request, To send a UDP packet, To process the UDP packet, |
Please open new issues for any SIP. @zonyitoo I think you can close this issue and open a new one. |
@madeye The spec has not been finalized and is still in design phase. I think we should keep discussions in one place, under this thread. New issues will be opened when the spec is ready. |
Agree. Expecting a formal SIP issue about AEAD-2022. |
I just finished implementing Test method:
Note that Next step would be to evaluate whether downlink can benefit from |
Downlink now uses Test method:
|
How did you run the test? I can't reproduce this result. |
Port 30001 is forwarded to iperf3 server's port by sslocal and ssserver. |
Why do you use -R to test only server -> client traffic? |
Because that's what matters.
I also did upload tests and the results are similar. |
shadowsocks-go v1.0.0 has been released as the reference Go implementation of Shadowsocks 2022.
Thanks to |
why is the udp throughput less than this? |
Motivation
QUIC, a UDP-based protocol, is becoming popular on the Internet, so UDP packet relay performance deserves more attention in shadowsocks.
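As we all know, the shadowsocks UDP protocol (AEAD) creates a packet in the following steps:
1. Generate a random IV or salt
2. Derive a key with HKDF-SHA1
3. Encrypt the whole data payload with the chosen AEAD cipher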
Recently I have done some benchmarks on the cost of each step. All these tests are written in Rust.
1. Generate a random IV or salt
The test result is:
The random_iv_or_salt takes 4 ns more because it needs to verify the generated IV.
2. Derive a key with HKDF-SHA1
The test result is:
From the result of 1. we can see that the HKDF-SHA1 algorithm takes most of the time.
3. Encrypt the whole data payload with the chosen AEAD cipher
The Cipher that I was using here is from shadowsocks-crypto, which is the actual library used in shadowsocks-rust. The test result is:
Analysis
Please ignore the absolute numbers of each test. When we compare the results in 3. and 2., we can easily draw a conclusion: the key derivation process takes roughly 50% of the time when creating a UDP packet. So if we can optimize this process, we may get at most a 50% performance improvement!
There should be no way to optimize it without changing the protocol. Here are some possible designs: