tcpdirect support #41
base: master
Conversation
Usage:

Tx:
./multi_neper.py --client --hosts 192.168.1.46 \
    --devices eth1 --buffer-size 409600 \
    --flows 1 --threads 1 --length 10

Rx:
./multi_neper.py --hosts 192.168.1.46 \
    --devices eth1 --src-ips 192.168.1.47 \
    --flows 1 --threads 1 --length 10 \
    --buffer-size 409600

Run ./multi_neper.py -h to view other flags.
Set CUDA_VISIBLE_DEVICES for each Neper call, and call cudaSetDevice to force cudaMalloc to allocate buffers on the correct GPU (see the sketch below).
taskset the same number of CPUs as the number of threads.
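A minimal sketch of the GPU-pinning idea above, assuming the CUDA runtime API (the helper name and error handling are illustrative, not neper's actual code):

```c
#include <cuda_runtime.h>
#include <stdio.h>

/* With CUDA_VISIBLE_DEVICES set per process (e.g. by the multi_neper.py
 * wrapper), device ordinal 0 is the only GPU this process can see, and
 * cudaSetDevice() makes the following cudaMalloc() allocate there. */
static void *alloc_gpu_buffer(size_t bytes)
{
        void *buf = NULL;

        if (cudaSetDevice(0) != cudaSuccess ||
            cudaMalloc(&buf, bytes) != cudaSuccess) {
                fprintf(stderr, "GPU buffer allocation failed\n");
                return NULL;
        }
        return buf;
}
```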
multi_neper.py
Outdated
run_pre_neper_cmds(dev)

# TODO flow-steering rules installed in Neper now
# control_port = args.control_port + i
Flow-steering rules are installed in the tcp_stream binary now; remove the extraneous comments and functions (`install_flow_steer_rules()` and `del_flow_steer_rules()`).
Yeah, delete any commented-out code or dead code. I still see functions to install flow-steering rules.
Please, if possible, do a pass to delete all dead code to ease the code review as well.
Also force a 1:1 thread:flow ratio. Because flow steering is required for TCPDirect, force incrementing threads to listen on incrementing ports, i.e. thread_0 listens on port x, thread_1 listens on port x+1, thread_2 listens on port x+2, etc.
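A minimal sketch of the port layout (names are illustrative, not neper's actual code):

```c
/* With a forced 1:1 thread:flow ratio, thread i listens on base_port + i,
 * so flow-steering rules can match a distinct port per thread. */
static int port_for_thread(int base_port, int thread_index)
{
        return base_port + thread_index;  /* thread_0 -> x, thread_1 -> x+1, ... */
}
```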
Small nits regarding build warnings. Build via `make tcp_stream WITH_TCPDIRECT=1` instead of `make tcp_stream_cuda2`.
This option (--iostat-ms) on the client side periodically prints io statistics (tx and rx operations, bytes and Mbps), so it is possible to monitor throughput variations in real time. netperf has a similar option. Tested: run with --iostat-ms 1000.
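A minimal sketch of the per-interval rate computation such periodic statistics imply (the helper name is illustrative; the actual output format may differ):

```c
#include <stdint.h>

/* Mbps over one reporting interval: delta bytes * 8 bits, divided by the
 * interval length (interval_ms/1000 s) and by 1e6 for "mega". */
static double interval_mbps(uint64_t bytes_now, uint64_t bytes_prev,
                            int interval_ms)
{
        return (double)(bytes_now - bytes_prev) * 8.0 /
               ((double)interval_ms * 1000.0);
}
```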
In bidirectional mode, acks are piggybacked behind data, and this creates unwanted dependencies between the forward and reverse flows. To solve the problem, IN BIDIRECTIONAL STREAM MODE ONLY, we use one TCP socket per direction (the user-specified number of flows is doubled after option parsing), used as follows:
- client and server always read from all sockets
- the client sends only on half of the sockets (those with even f_id); this is done by disabling EPOLLOUT on alternate sockets
- the server starts sending on all sockets, but stops sending and disables EPOLLOUT on sockets on which data is received; this is done in stream_handler()

This allows half of the sockets to be in tx and half in rx without control-plane modifications. For backward compatibility, this is controlled by the --split-bidir command line option, which implies -rw on both sides.

Tested: manual test with --split-bidir and different '-m' values on client and server.
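A minimal sketch of the EPOLLOUT toggling idea described above, assuming a plain epoll loop (names are illustrative, not the actual stream_handler() code):

```c
#include <sys/epoll.h>

/* Re-arm a socket with or without EPOLLOUT; only sockets that should
 * transmit keep EPOLLOUT set. */
static void set_want_tx(int epfd, int sock_fd, int want_tx)
{
        struct epoll_event ev = {
                .events = EPOLLIN | (want_tx ? EPOLLOUT : 0),
                .data.fd = sock_fd,
        };

        epoll_ctl(epfd, EPOLL_CTL_MOD, sock_fd, &ev);
}

/*
 * Client: transmit only on flows with an even id:
 *     set_want_tx(epfd, fd, (f_id % 2) == 0);
 *
 * Server: start with EPOLLOUT on every socket, then call
 *     set_want_tx(epfd, fd, 0);
 * as soon as data is received on that socket.
 */
```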
fix error when -Wincompatible-pointer-types is included
Signed-off-by: Antonio Ojea <[email protected]>
There seems to be no use for the numlist object
percentiles in CSV file were incorrectly divided by MILLION, resulting in mostly 0 values. Remove the divisor. Probably the feature was never used, otherwise it would have been noticed.
histogram methods were implemented as virtual functions, but since there is only one possible implementation this was overkill. Simplify the code by exposing the actual methods. The implementation still remains opaque. No functional changes. Tested with ./tcp_rr -c -H 127.0.0.1 -p1,2,10,50,90,999.9999,100 -A/tmp/x.csv -l 4 and verified that the csv file has the correct data. (histograms are only exercised in rr tests)
Histograms store samples in buckets with pseudo-logarithmic size. The previous implementation used a table of thresholds and binary search to locate the correct bucket. This patch replaces the thresholds with the fast pseudo-logarithm algorithm used in lr-cstats and bpftrace, so we can locate the bucket in a handful of instructions. This gives memory savings, reduced cache thrashing, and better performance. Tests show that with a hot cache a lookup now takes less than 2us compared to 20-25us with the previous approach. Also, we can remove the now-useless neper_histo_factory. The actual resolution of the buckets is approximately the same as in the previous implementation (about 1.5%).

In passing, correct a few bugs in the previous implementation:
- resolution was supposed to be 0.25%, but due to an implementation bug it was around 1% or even bigger at low values, and caused the thresholds to become negative
- conversion from double to int for the sample could have unchecked overflows

Tested with tcp_rr and verifying that the distribution and csv files contain correct values.
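A minimal sketch of this style of pseudo-logarithmic bucket indexing (constants and names are illustrative; the actual implementation may differ in details):

```c
#include <stdint.h>

#define FRAC_BITS 6   /* sub-bucket bits below the MSB, ~1.5% resolution */

static inline uint32_t bucket_index(uint64_t sample)
{
        uint32_t msb;
        uint64_t frac;

        /* Small values map linearly onto the first buckets. */
        if (sample < (1ULL << FRAC_BITS))
                return (uint32_t)sample;

        /* Bucket = (position of the MSB, FRAC_BITS bits just below it). */
        msb = 63 - __builtin_clzll(sample);
        frac = (sample >> (msb - FRAC_BITS)) & ((1ULL << FRAC_BITS) - 1);

        return ((msb - FRAC_BITS + 1) << FRAC_BITS) + (uint32_t)frac;
}
```

Each power-of-two range is split into 2^FRAC_BITS sub-buckets, so the relative bucket width is roughly 1/64, i.e. about 1.5%, matching the resolution mentioned above.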
Allow arbitrary percentiles to be specified, instead of just integer values plus p99.9 and p99.99. This also makes the code faster because we can compute just the values requested instead of all 103 entries. Any floating point number between 0 and 100 is now accepted, with 999 and 9999 mapped to 99.9 and 99.99 for backward compatibility. Tested as usual with ./tcp_rr -c -H 127.0.0.1 -p1,2,10,50,90,999,9999,100 -A/tmp/x.csv and verifying the correct content of the csv file.
Computing percentiles is expensive, as it requires scanning all the 4k-8k buckets used to store samples, and is done for each flow. Benchmarks show the original code took an average of 20us per flow, with frequent peaks in the 60-80us range. This patch eliminates the cost by not storing samples in buckets if no percentiles are requested, and otherwise achieves a ~5x reduction by tracking the range of buckets that contain values in each epoch. Also change the precision to 6 bits, which halves the cost without much impact on the results. This value may become a command line flag. Tested, as usual, by running tcp_rr and verifying the logs and csv
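A minimal sketch of the per-epoch bucket-range tracking mentioned above (struct and field names are illustrative, not the actual code):

```c
#include <stdint.h>
#include <string.h>

#define NUM_BUCKETS 4096   /* illustrative size */

struct histo_epoch {
        uint32_t count[NUM_BUCKETS];
        uint32_t lo;        /* lowest non-empty bucket this epoch  */
        uint32_t hi;        /* highest non-empty bucket this epoch */
};

static void histo_epoch_reset(struct histo_epoch *h)
{
        memset(h->count, 0, sizeof(h->count));
        /* Empty range: lo > hi until the first sample arrives. */
        h->lo = NUM_BUCKETS;
        h->hi = 0;
}

/* 'b' is the bucket index of a new sample (e.g. from the pseudo-log
 * indexing sketched earlier). */
static void histo_epoch_add(struct histo_epoch *h, uint32_t b)
{
        h->count[b]++;
        if (b < h->lo)
                h->lo = b;
        if (b > h->hi)
                h->hi = b;
}

/* The percentile scan then only walks buckets [lo, hi] instead of all
 * NUM_BUCKETS entries, and is skipped entirely when no percentiles were
 * requested. */
```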
neper_snaps methods were implemented as virtual functions, but since there is only one possible implementation this was overkill. Simplify the code by exposing the actual methods. The implementation still remains opaque. No functional changes. Tested with ./tcp_rr -c -H 127.0.0.1 -p1,2,10,50,90,999.9999,100 -A/tmp/x.csv -l 4 and verified that the csv file has the correct data. (histograms are only exercised in rr tests)
This option on the client side will delay the client from creating any threads (and thus flows) after the control connection has been established. It can be useful if multiple neper server-client pairs are created over the same link, and the link gets too congested from earlier pairs for the later ones to successfully establish the control connection. The option can also be used to set up simulated packet dropping rules between making the control connection and sending traffic. Tested: ./tcp_stream -c -H 127.0.0.1 --wait-start 5 --logtostderr Verified that the client waited for 5 seconds before starting to send traffic.
The current code takes a list of comma-separated 'double' values representing the latency percentile data points that the user is interested in. The user might repeat a particular percentile value. To prevent printing the same percentile data point twice, the code (percentile.c) sorts the list of percentiles requested by the user in the order of their values, i.e. higher percentiles are sorted to the end. After sorting and removing the duplicates, the code incorrectly sets the number of percentiles to one less than the actual count. This results in the code not printing the last percentile in the provided list of percentile arguments. For example, passing `-p 50.0,95.0,99.0` results in `percentiles=50,95` on stdout. This patch fixes the issue by setting the total count of the requested percentiles to the correct value.
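A minimal sketch of the sort-and-deduplicate step in question (the actual fix is in percentile.c; these names are illustrative):

```c
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
        double x = *(const double *)a, y = *(const double *)b;

        return (x > y) - (x < y);
}

/* Sort the requested percentiles and drop duplicates in place; return the
 * number of unique entries (the bug was returning one fewer than this). */
static int dedup_percentiles(double *p, int n)
{
        int last = 0;

        if (n <= 0)
                return 0;
        qsort(p, n, sizeof(p[0]), cmp_double);
        for (int i = 1; i < n; i++) {
                if (p[i] != p[last])
                        p[++last] = p[i];
        }
        return last + 1;
}
```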
Stream clients only take 1 snapshot, but don't properly return the timespec start_time, and try to take snapshots too many times, reducing throughput on udp_stream significantly. Reduce it so only 2 snapshots are needed. author: lixiaoyan@
Many LOG <- printf swaps, removal of commented-out code, indent fixes.
cleanup: address PR comments
Looks good to me, likely needs a check from Willem.
3194c71 to f2849c2
Includes instructions for running Neper with udmabuf tcpdevmem.
No need to mention CUDA here?
tcpdirect.cu
Outdated
Is this commit removing a lot of code? If so, why?
A more useful commit comment would explain why this is being added.
Before neper, we had to use netperf, which did not have -T $THREADS -F $flows.
So we had a super_netperf.sh script to run multiple instances in parallel.
Due to the multi-flow and multi-thread support in neper, we no longer need that. Seems like we are re-adding super_netperf.sh anyway?
tcpdirect.cu
Outdated
commenting out code is not very clean. Either it's needed, or needs to be removed?
Add tcpdevmem README
Enable allocation of CUDA/UDMA buffers for sender/receiver, then use those buffers to send traffic via TCPDirect.
Include multi_neper.py to run parallel neper instances between pairs of sender/receiver IP addresses (i.e. eth1->eth1, eth2->eth2, eth3->eth3) and aggregate throughputs of runs.
Usage documented in README_tcpdevmem.md
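For reference, a minimal sketch of the generic udmabuf allocation path (a memfd exported as a dma-buf via /dev/udmabuf); this shows the standard kernel interface, not necessarily the exact sequence used in this PR:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/udmabuf.h>

/* Allocate 'size' bytes (page aligned) of host memory and export it as a
 * dma-buf fd through /dev/udmabuf. Returns the dma-buf fd, or -1 on error. */
static int alloc_udmabuf(size_t size)
{
        int memfd, devfd, buf_fd;
        struct udmabuf_create create = { 0 };

        /* udmabuf requires a sealable memfd with F_SEAL_SHRINK set. */
        memfd = memfd_create("neper-udmabuf", MFD_ALLOW_SEALING);
        if (memfd < 0)
                return -1;
        if (ftruncate(memfd, size) < 0 ||
            fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK) < 0)
                goto out_memfd;

        devfd = open("/dev/udmabuf", O_RDWR);
        if (devfd < 0)
                goto out_memfd;

        create.memfd = memfd;
        create.offset = 0;
        create.size = size;
        buf_fd = ioctl(devfd, UDMABUF_CREATE, &create);

        close(devfd);
        close(memfd);
        return buf_fd;

out_memfd:
        close(memfd);
        return -1;
}
```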