Unable to achieve 10 Gbit/s throughput on Hetzner server #637
I'm guessing this is CPU bottlenecked. What does `/proc/cpuinfo` show on these hosts?
The servers are SX133 and SX134 servers from Hetzner (linked in the issue description). `/proc/cpuinfo` details: …

The machines have 4 and 8 physical cores respectively.
Hey, just a short follow-up on whether anything can be done to achieve proper 10 Gbit/s throughput, or how to investigate when it doesn't happen.
Had the same problem with Hetzner CX cloud servers. Without Nebula, iperf3 would report around 7 Gbit/s between two servers. With Nebula it wouldn't go above 1 Gbit/s. I think it has something to do with Nebula carrying TCP over UDP, and that UDP traffic on Hetzner is either rate limited or the routers can't handle it. TCP over TCP would be the solution IMO, but Nebula does not support that at the moment.
Here's the link to the thread on the NebulaOSS Slack channel: https://nebulaoss.slack.com/archives/CS01XE0KZ/p1619532900073100
@HenkVanMaanen I cannot confirm that. What speed does plain UDP iperf3 give you? For me it's as fast as TCP mode between 2 dedicated 10 Gbit/s servers (Hetzner SX133 and SX134):
So 10 Gbit/s works on UDP between these 2 machines on the Hetzner network. Nebula-based iperf3 tops out at ~3.5 Gbit/s between the same machines, no matter if via TCP or UDP (and no matter the number of flows). I also re-measured with …
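For reference, underlay measurements of this kind correspond to iperf3 runs of roughly the following shape; this is a sketch, with the host name, duration, and flow count as placeholders rather than the exact commands used above.

```bash
# TCP, single flow, against the peer's public address:
iperf3 -c sx134.example -t 30

# UDP, single flow, unlimited target bitrate (-b 0):
iperf3 -c sx134.example -u -b 0 -t 30

# UDP, 5 parallel flows (-P 5); multiple flows are what reached ~9.5 Gbit/s
# on the underlay here, while a single flow stayed around 4-5 Gbit/s:
iperf3 -c sx134.example -u -b 0 -P 5 -t 30
```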
Replying to some more topics I read on that thread:
The same is true for me: Nebula uses only ~125% CPU, which is evenly spread out across cores, not maxing out a single core. Interface MTUs:
There is only 1 physical link active on my servers, so confusing different links is impossible.
I did see some packet drops at tun. I changed ….

I **do** see drops in …. Increasing …. I tried with the default …. Is it possible to verify that the …? Similarly, in ….

Following this post I used ….
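For anyone following along, the drop counters and MTUs referenced in this comment can be inspected with standard tools; a sketch, where `nebula1` is an assumed tun device name (it depends on `tun.dev` in the config).

```bash
# Drop counters on the Nebula tun device; TX drops usually mean the tun
# queue cannot absorb bursts:
ip -s link show dev nebula1

# MTU of every interface, to rule out a link/tun MTU mismatch:
ip -o link | awk '{print $2, "mtu", $5}'

# Rough per-core CPU picture while the iperf3 run is active (mpstat is
# part of the sysstat package):
mpstat -P ALL 1
```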
I found that changing the sysctl … helps here.

The fact that I had to set …. But even with all drops being fixed, the throughput of Nebula does not improve.
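The specific sysctl is not preserved in the comment above; for UDP-heavy tunnels the usual candidates are the kernel's buffer and backlog ceilings, roughly as below (a hedged sketch with illustrative values; raising the ceilings only helps if the application also requests bigger buffers, e.g. via `read_buffer`/`write_buffer` in Nebula's `listen` section).

```bash
# Common kernel ceilings for UDP-heavy workloads (illustrative values):
sysctl -w net.core.rmem_max=26214400         # max receive buffer an app may request
sysctl -w net.core.wmem_max=26214400         # max send buffer an app may request
sysctl -w net.core.netdev_max_backlog=10000  # per-CPU queue for incoming packets

# Check whether the kernel is discarding UDP datagrams for lack of buffer space:
netstat -su | grep -iE 'receive errors|buffer errors'
```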
These are the results between two CX servers, direct tunnel:
@HenkVanMaanen You're giving …; my run was with ….
Some more info: in the thread view in htop, ….

Interestingly, if I …. So now in sum it takes less than 100%. Then, for the time the process is spending, I checked in htop that the fractions are …. On the receiver side, ….
Not sure how accurate that is, as the throughput of the single-threaded run drops from 1.4 Gbit/s to 0.44 Gbit/s when strace is active.
I wonder what the … is.
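The per-thread view and syscall split can be reproduced with standard tools; a generic sketch (not the exact commands used above), keeping in mind the caveat that `strace` itself slows the process down considerably.

```bash
# Per-thread CPU usage of the nebula process (htop shows the same with its
# thread view enabled):
top -H -p "$(pidof nebula)"

# Syscall time summary across nebula's threads; ptrace-based tracing is
# expensive, so throughput measured while this runs is not representative:
strace -c -f -p "$(pidof nebula)"
```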
All around 1 Gbit/s. Via TCP I get 4 Gbit/s.
@nh2 just curious, for the encryption method in your config, are you using AES?
@sfxworks Yes, AES.
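Whether the CPUs expose hardware AES, and roughly how fast one core can run the cipher, can be checked like this; a sketch, and note that `openssl speed` exercises OpenSSL rather than Nebula's Go crypto, so it only gives a rough upper bound.

```bash
# Hardware AES flag (x86 exposes an "aes" flag; ARM exposes an "aes" feature too):
grep -m1 -o -w aes /proc/cpuinfo || echo "no hardware AES flag found"

# Rough single-core AES-256-GCM throughput as an upper-bound sanity check:
openssl speed -evp aes-256-gcm
```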
@HenkVanMaanen is using CX servers (Hetzner Cloud virtual servers); I'm using SX servers (dedicated bare-metal). This might explain why I can get up to 10 Gbit/s outside of Nebula.
The content of this comment is the most telling for me: #637 (comment)

When you test your underlay network with multiple flows directly (5 in that run) you see a maximum throughput of about 9.5 Gbit/s, while a single flow gets about 4 Gbit/s. When you run with Nebula you see nearly the same throughput as the single-flow underlay test, at 3.5 Gbit/s. Nebula will (currently) only be 1 flow on the underlay network between two hosts.

The throughput limitation is likely to be anything between and/or including the two NICs in the network, since it looks like you have already ruled out CPU on the host directly. The folks at Slack have run into similar situations with AWS, and this PR may be of interest to you: #768

I do not see the output for ….
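Because the tunnel is a single UDP 5-tuple on the underlay, it generally hashes to one NIC receive queue and one network path; a sketch of how one might look at the queue layout and NIC-level drops (the interface name `eth0` is an assumption).

```bash
# Number of RX/TX queues the NIC exposes; a single UDP flow is served by
# only one of them:
ethtool -l eth0

# RSS indirection table, i.e. how flow hashes are spread over RX queues:
ethtool -x eth0

# NIC-level drop/discard counters (names vary by driver):
ethtool -S eth0 | grep -iE 'drop|discard'
```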
Closing this for inactivity. Please see also the discussion at #911 |
Reopened by request from @nh2. |
An update from my side: I have tried for a long time now, and failed to get 10 Gbit/s out of Nebula in any setting I tried. If anybody has a reproducible setup where this works, it would be great to post it. (I saw the linked #911, but in there I can also only find claims like "Nebula is used to do many gigabits per second in production on hundreds of thousands of hosts", not basic evidence such as "here's how I set up these 2 servers with Nebula, look at my iperf showing 10 Gbit/s".)

In other words: instead of finding out why 10 Gbit/s doesn't work in this case, it seems better to first find anybody for whom 10 Gbit/s throughput reliably works.

I also observed that when putting a big data pusher such as Ceph inside Nebula, Nebula would cap out at 1-2 Gbit/s and 100% CPU and start dropping packets. As a result, important small-data services inside Nebula, for example Consul consensus, would also get their packets dropped. This would then destabilise my entire cluster. My only solution so far was to remove big data pushers such as Ceph from Nebula, which defeats the point of running everything inside the VPN.
Overall, the "many gigabits per second" relates to exactly what @nbrownus mentions above: the cited number is an aggregate. At Slack, we didn't encounter workloads with single-path host-to-host tunnels trying to do 10 Gbit/s over a small-ish MTU. Nebula allows you to configure MTUs for different network segments, and Slack uses this internally across production. I do understand that in your case Hetzner does not allow a higher MTU, which contributes to this bottleneck.

More broadly, Nebula's default division of work is per-tunnel. If you have 4+ hosts talking to a single host over Nebula and you turn on multi-routine processing, Nebula will quickly match the maximum line rate of a single 10 Gbit interface.

In the case of Ceph, are you often sending many Gbit/s between individual hosts? We are certainly open to enhancing this if more people ask for a bump when using individual tunnels with small MTUs. We will also be sharing our research here in a future blog post for people to validate, which will have tips for optimizing performance.
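For reference, multi-routine processing is enabled through the top-level `routines` setting in the Nebula config. Below is a minimal sketch of the performance-related knobs mentioned in this thread, with illustrative values, written as a fragment to merge into an existing config.yml rather than a complete configuration.

```bash
# Sketch of performance-related Nebula config knobs (YAML); merge into your
# existing config.yml. Values are illustrative, not recommendations.
cat <<'EOF' > nebula-perf-snippet.yml
routines: 4                 # parallel tun/UDP reader routines (default 1)
listen:
  read_buffer: 10485760     # UDP socket receive buffer, bytes
  write_buffer: 10485760    # UDP socket send buffer, bytes
  batch: 64                 # packets per batched socket read/write
tun:
  tx_queue: 2000            # tun device transmit queue length
EOF
```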
Hi @nh2 - We've identified a bug in Nebula, beginning with v1.6.0 (released June 2022), affecting Nebula nodes configured with a listen port of `0`. I understand that you opened this issue in February 2022, prior to the bug, but have continued debugging since v1.6.0. Given that this is the case, I will humbly request that you re-test your configuration.

Additionally, in December 2022, prior to closing this issue, @nbrownus asked you to run a few commands to collect some extra debugging information. We believe that the output of those commands would still be useful. Thank you!
@johnmaguire Thanks! I'm using a fixed listen port of ….
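For anyone else checking their setup against that bug: a fixed listen port is configured like this (a sketch; 4242 is just the port used in Nebula's example configs, while `port: 0` asks the OS for a random port).

```bash
# Sketch: a fixed, non-ephemeral Nebula listen port (YAML fragment to merge
# into config.yml). Port 0 would instead let the OS pick a random port.
cat <<'EOF' > nebula-listen-snippet.yml
listen:
  host: 0.0.0.0
  port: 4242
EOF
```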
Hi @nh2, I just wanted to make note of the blog post we recently published about performance: https://www.defined.net/blog/nebula-is-not-the-fastest-mesh-vpn/

I hope that answers some of your questions here, and I'm happy to clarify any of the points. I'll close this issue in a week unless there is something further to discuss that isn't covered there. Thanks!
@rawdigits The blog post looks great and is very useful. But I believe it is still about aggregate throughput, whereas my issue report is about the point-to-point connection between single hosts. I can get 10 Gbit/s between 2 Hetzner servers via WireGuard and via iperf3 UDP (5 Gbit/s with a single flow, the full 10 Gbit/s with multiple flows, as mentioned in #637 (comment)). But I cannot get this with Nebula.
Yes, that is the standard workflow. When you write a file to CephFS, the client that does the write …. So for example, you write a 10 GB file. With Ceph-on-Nebula it takes ~100 seconds (capped at ~1 Gbit/s); with Ceph outside of the VPN it takes ~10 seconds (capped at ~10 Gbit/s). That factor makes a big difference for what workloads/apps you can handle.

A tangentially related issue is that in my tests, Nebula starts dropping packets when large transfer rates occur. Concretely, when I had both Ceph and Consul (the consensus server) running on Nebula and Ceph did some large transfer, Nebula would drop packets, including those of Consul. This caused instability (consensus being lost). The issue disappears when running the same over a normal link instead of Nebula, apparently even when the normal link is 1 Gbit/s instead of 10 Gbit/s. My guess is that Nebula gets CPU-bottlenecked, leading to UDP packet loss that would happen differently on a real link. But I still don't fully understand why that causes such big instabilities: both Ceph and Consul use TCP, so theoretically a CPU-bottlenecked Nebula on a 10 Gbit/s interface should not lose more Consul-related packets than a physical 1 Gbit/s interface; but it somehow does.

I think we should probably rename the issue to make clear it's about point-to-point performance, not aggregate. I understand the blog post says …, but there are still good reasons to use Nebula even when point-to-point is the main use case: …
@nh2 what is the upper limit you're able to achieve using Nebula? Also, would it be possible for you to share your tweaks to the default config values? I'm facing a similar issue, but cannot saturate even a 1 Gbps link (…).

@rawdigits I did read the blog, and I do understand the limitations, but I was hoping (looking at the "performance per core" graphs) that Nebula would be able to give me 1 Gbps. I can get up to 5 Gbps in multi-threaded …. Also, when I run …. When I do this, ….
The config I'm using in production currently has no performance tuning, only non-performance-relevant settings, as I have not managed to boost throughput significantly with any of them:
same here as @nh2 ... just tested today on Hetzner dedicated cloud servers ... tried tuning multiple parameters and nothing helped significantly
just tested tailscale (following their getting started guide) and got basically the same results
I think only #768 could improve the situation - is there anything I can do to help get it merged (even as an experimental feature) - @rawdigits ?
Maybe some of the ideas from https://toonk.io/sending-network-packets-in-go/ could be useful?
@ondrej-smola I made a v1.8.2-multiport release that is just v1.8.2 with this PR merged in, if you want to test with it; binaries here: https://github.com/wadey/nebula/releases/tag/v1.8.2-multiport
Hey @ondrej-smola - I was just wondering if you had a chance to test the build @wadey provided. If so, how did it go?
@wadey @johnmaguire thank you for creating the release - I am on parental leave but should be back in June
I've noticed a fairly drastic drop using Nebula over Hetzner's cloud networks.

Hetzner private network (no Nebula): …

With 1.9.3: …

With the above build (1.8.2-multiport): …

Servers are both Hetzner's Ampere servers (hardware AES is enabled).
I'm benchmarking Nebula with storage servers from dedicated server provider Hetzner where 10 Gbit/s links are cheap.
Unless you ask them to connect your servers by a dedicated switch, the MTU cannot be changed, so jumbo frames are not possible.
In this setup, I have not been able to achieve more than 2 Gbit/s with `iperf3` over Nebula, no matter how I tune `read_buffer`/`write_buffer`/`batch`/`routines`.

In https://theorangeone.net/posts/nebula-intro/ it was said:

> …

and on https://i.reddit.com/r/networking/comments/iksyuu/overlay_network_mesh_options_nebula_wireguard/:

> …

but that's evidently not the case for me.
Did all those setups use jumbo frames?
Is there anything that can be done to achieve 10 Gbit/s throughput without jumbo frames?