tcpdirect support #41
base: master
Conversation
Usage:

Tx:
./multi_neper.py --client --hosts 192.168.1.46 \
    --devices eth1 --buffer-size 409600 \
    --flows 1 --threads 1 --length 10

Rx:
./multi_neper.py --hosts 192.168.1.46 \
    --devices eth1 --src-ips 192.168.1.47 \
    --flows 1 --threads 1 --length 10 \
    --buffer-size 409600

Run ./multi_neper.py -h to view other flags.
Set CUDA_VISIBLE_DEVICES for each Neper call, and call cudaSetDevice to force cudaMalloc to allocate buffers on the correct GPU (see the sketch below).
taskset the same number of CPUs as the number of threads.
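A minimal sketch of the GPU-pinning idea above, assuming the CUDA runtime API (the helper name and error handling are illustrative, not neper's actual code):

```c
#include <cuda_runtime.h>
#include <stdio.h>

/* With CUDA_VISIBLE_DEVICES set per process (e.g. by the multi_neper.py
 * wrapper), device ordinal 0 is the only GPU this process can see, and
 * cudaSetDevice() makes the following cudaMalloc() allocate there. */
static void *alloc_gpu_buffer(size_t bytes)
{
        void *buf = NULL;

        if (cudaSetDevice(0) != cudaSuccess ||
            cudaMalloc(&buf, bytes) != cudaSuccess) {
                fprintf(stderr, "GPU buffer allocation failed\n");
                return NULL;
        }
        return buf;
}
```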
multi_neper.py
Outdated
run_pre_neper_cmds(dev)

# TODO flow-steering rules installed in Neper now
# control_port = args.control_port + i
Flow-steering rules are installed in the tcp_stream binary now; remove the extraneous comments and functions (`install_flow_steer_rules()` and `del_flow_steer_rules()`).
Yeah, delete any commented-out code or dead code. I still see functions to install flow-steering rules.
Please, if possible, do a pass to delete all dead code to ease the code review as well.
Also force a 1:1 thread:flow ratio. Because flow steering is required for TCPDirect, force incrementing threads to listen on incrementing ports, i.e. thread_0 listens on port x, thread_1 listens on port x+1, thread_2 listens on port x+2, etc.
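A minimal sketch of the port layout (names are illustrative, not neper's actual code):

```c
/* With a forced 1:1 thread:flow ratio, thread i listens on base_port + i,
 * so flow-steering rules can match a distinct port per thread. */
static int port_for_thread(int base_port, int thread_index)
{
        return base_port + thread_index;  /* thread_0 -> x, thread_1 -> x+1, ... */
}
```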
Small nits regarding build warnings. Build via `make tcp_stream WITH_TCPDIRECT=1` instead of `make tcp_stream_cuda2`.
This option (--iostat-ms) on the client side periodically prints io statistics (tx and rx operations, bytes and Mbps), so it is possible to monitor throughput variations in real time. netperf has a similar option. Tested: run with --iostat-ms 1000.
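A minimal sketch of the per-interval rate computation such periodic statistics imply (the helper name is illustrative; the actual output format may differ):

```c
#include <stdint.h>

/* Mbps over one reporting interval: delta bytes * 8 bits, divided by the
 * interval length (interval_ms/1000 s) and by 1e6 for "mega". */
static double interval_mbps(uint64_t bytes_now, uint64_t bytes_prev,
                            int interval_ms)
{
        return (double)(bytes_now - bytes_prev) * 8.0 /
               ((double)interval_ms * 1000.0);
}
```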
In bidirectional mode, acks are piggybacked behind data, and this creates unwanted dependencies between the forward and reverse flows. To solve the problem, IN BIDIRECTIONAL STREAM MODE ONLY, we use one TCP socket per direction (the user-specified number of flows is doubled after option parsing), used as follows:
- client and server always read from all sockets
- the client sends only on half of the sockets (those with even f_id); this is done by disabling EPOLLOUT on alternate sockets
- the server starts sending on all sockets, but stops sending and disables EPOLLOUT on sockets on which data is received; this is done in stream_handler()

This allows half of the sockets to be in tx and half in rx without control-plane modifications. For backward compatibility, this is controlled by the --split-bidir command line option, which implies -rw on both sides.

Tested: manual test with --split-bidir and different '-m' values on client and server.
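A minimal sketch of the EPOLLOUT toggling idea described above, assuming a plain epoll loop (names are illustrative, not the actual stream_handler() code):

```c
#include <sys/epoll.h>

/* Re-arm a socket with or without EPOLLOUT; only sockets that should
 * transmit keep EPOLLOUT set. */
static void set_want_tx(int epfd, int sock_fd, int want_tx)
{
        struct epoll_event ev = {
                .events = EPOLLIN | (want_tx ? EPOLLOUT : 0),
                .data.fd = sock_fd,
        };

        epoll_ctl(epfd, EPOLL_CTL_MOD, sock_fd, &ev);
}

/*
 * Client: transmit only on flows with an even id:
 *     set_want_tx(epfd, fd, (f_id % 2) == 0);
 *
 * Server: start with EPOLLOUT on every socket, then call
 *     set_want_tx(epfd, fd, 0);
 * as soon as data is received on that socket.
 */
```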
fix error when -Wincompatible-pointer-types is included
Signed-off-by: Antonio Ojea <[email protected]>
There seems to be no use for the numlist object
percentiles in CSV file were incorrectly divided by MILLION, resulting in mostly 0 values. Remove the divisor. Probably the feature was never used, otherwise it would have been noticed.
histogram methods were implemented as virtual functions, but since there is only one possible implementation this was overkill. Simplify the code by exposing the actual methods. The implementation still remains opaque. No functional changes. Tested with ./tcp_rr -c -H 127.0.0.1 -p1,2,10,50,90,999.9999,100 -A/tmp/x.csv -l 4 and verified that the csv file has the correct data. (histograms are only exercised in rr tests)
Histograms store samples in buckets with pseudo-logarithmic size. The previous implementation used a table of thresholds and binary search to locate the correct bucket. This patch replaces the thresholds with the fast pseudo-logarithm algorithm used in lr-cstats and bpftrace, so we can locate the bucket in a handful of instructions. This gives memory savings, reduced cache thrashing, and better performance. Tests show that with a hot cache a lookup now takes less than 2us compared to 20-25us with the previous approach. Also, we can remove the now-useless neper_histo_factory. The actual resolution of the buckets is approximately the same as in the previous implementation (about 1.5%).

In passing, correct a few bugs in the previous implementation:
- resolution was supposed to be 0.25%, but due to an implementation bug it was around 1% or even bigger at low values, and caused the thresholds to become negative
- conversion from double to int for the sample could have unchecked overflows

Tested with tcp_rr and verifying that the distribution and csv files contain correct values.
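A minimal sketch of this style of pseudo-logarithmic bucket indexing (constants and names are illustrative; the actual implementation may differ in details):

```c
#include <stdint.h>

#define FRAC_BITS 6   /* sub-bucket bits below the MSB, ~1.5% resolution */

static inline uint32_t bucket_index(uint64_t sample)
{
        uint32_t msb;
        uint64_t frac;

        /* Small values map linearly onto the first buckets. */
        if (sample < (1ULL << FRAC_BITS))
                return (uint32_t)sample;

        /* Bucket = (position of the MSB, FRAC_BITS bits just below it). */
        msb = 63 - __builtin_clzll(sample);
        frac = (sample >> (msb - FRAC_BITS)) & ((1ULL << FRAC_BITS) - 1);

        return ((msb - FRAC_BITS + 1) << FRAC_BITS) + (uint32_t)frac;
}
```

Each power-of-two range is split into 2^FRAC_BITS sub-buckets, so the relative bucket width is roughly 1/64, i.e. about 1.5%, matching the resolution mentioned above.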
Allow arbitrary percentiles to be specified, instead of just integer values plus p99.9 and p99.99. This also makes the code faster because we can compute just the values requested instead of all 103 entries. Any floating point number between 0 and 100 is now accepted, with 999 and 9999 mapped to 99.9 and 99.99 for backward compatibility. Tested as usual with ./tcp_rr -c -H 127.0.0.1 -p1,2,10,50,90,999,9999,100 -A/tmp/x.csv and verifying the correct content of the csv file.
Computing percentiles is expensive, as it requires scanning all the 4k-8k buckets used to store samples, and is done for each flow. Benchmarks show the original code took an average of 20us per flow, with frequent peaks in the 60-80us range. This patch eliminates the cost by not storing samples in buckets if no percentiles are requested, and otherwise achieves a ~5x reduction by tracking the range of buckets that contain values in each epoch. Also change the precision to 6 bits, which halves the cost without much impact on the results. This value may become a command line flag. Tested, as usual, by running tcp_rr and verifying the logs and csv
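A minimal sketch of the per-epoch bucket-range tracking mentioned above (struct and field names are illustrative, not the actual code):

```c
#include <stdint.h>
#include <string.h>

#define NUM_BUCKETS 4096   /* illustrative size */

struct histo_epoch {
        uint32_t count[NUM_BUCKETS];
        uint32_t lo;        /* lowest non-empty bucket this epoch  */
        uint32_t hi;        /* highest non-empty bucket this epoch */
};

static void histo_epoch_reset(struct histo_epoch *h)
{
        memset(h->count, 0, sizeof(h->count));
        /* Empty range: lo > hi until the first sample arrives. */
        h->lo = NUM_BUCKETS;
        h->hi = 0;
}

/* 'b' is the bucket index of a new sample (e.g. from the pseudo-log
 * indexing sketched earlier). */
static void histo_epoch_add(struct histo_epoch *h, uint32_t b)
{
        h->count[b]++;
        if (b < h->lo)
                h->lo = b;
        if (b > h->hi)
                h->hi = b;
}

/* The percentile scan then only walks buckets [lo, hi] instead of all
 * NUM_BUCKETS entries, and is skipped entirely when no percentiles were
 * requested. */
```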
neper_snaps methods were implemented as virtual functions, but since there is only one possible implementation this was overkill. Simplify the code by exposing the actual methods. The implementation still remains opaque. No functional changes. Tested with ./tcp_rr -c -H 127.0.0.1 -p1,2,10,50,90,999.9999,100 -A/tmp/x.csv -l 4 and verified that the csv file has the correct data. (histograms are only exercised in rr tests)
This option on the client side will delay the client from creating any threads (and thus flows) after the control connection has been established. It can be useful if multiple neper server-client pairs are created over the same link, and the link gets too congested from earlier pairs for the later ones to successfully establish the control connection. The option can also be used to set up simulated packet dropping rules between making the control connection and sending traffic. Tested: ./tcp_stream -c -H 127.0.0.1 --wait-start 5 --logtostderr Verified that the client waited for 5 seconds before starting to send traffic.
The current code takes a list of comma-separated 'double' values representing the latency percentile data points that the user is interested in. The user might repeat a particular percentile value. To prevent printing the same percentile data point twice, the code (percentile.c) sorts the list of percentiles requested by the user in the order of their values, i.e. higher percentiles are sorted to the end. After sorting and removing the duplicates, the code incorrectly sets the number of percentiles to one less than the actual count. This results in the code not printing the last percentile in the provided list of percentile arguments. For example, passing `-p 50.0,95.0,99.0` results in `percentiles=50,95` on stdout. This patch fixes the issue by setting the total count of the requested percentiles to the correct value.
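A minimal sketch of the sort-and-deduplicate step in question (the actual fix is in percentile.c; these names are illustrative):

```c
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
        double x = *(const double *)a, y = *(const double *)b;

        return (x > y) - (x < y);
}

/* Sort the requested percentiles and drop duplicates in place; return the
 * number of unique entries (the bug was returning one fewer than this). */
static int dedup_percentiles(double *p, int n)
{
        int last = 0;

        if (n <= 0)
                return 0;
        qsort(p, n, sizeof(p[0]), cmp_double);
        for (int i = 1; i < n; i++) {
                if (p[i] != p[last])
                        p[++last] = p[i];
        }
        return last + 1;
}
```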
Stream clients only take 1 snapshot, but don't properly return the timespec start_time, and try to take snapshots too many times, reducing throughput on udp_stream significantly. Reduce it so only 2 snapshots are needed. author: lixiaoyan@
Many LOG <- printf swaps, removal of commented-out code, indent fixes.
cleanup: address PR comments
Looks good to me, likely needs a check from Willem.
3194c71 to f2849c2
Includes instructions for running Neper with udmabuf tcpdevmem.
No need to mention CUDA here?
tcpdirect.cu
Outdated
Is this commit removing a lot of code? If so, why?
A more useful commit comment would explain why this is being added.
Before neper, we had to use netperf, which did not have -T $THREADS -F $flows.
So we had a super_netperf.sh script to run multiple instances in parallel.
Due to the multi-flow and multi-thread support in neper, we no longer need that. Seems like we are re-adding super_netperf.sh anyway?
tcpdirect.cu
Outdated
commenting out code is not very clean. Either it's needed, or needs to be removed?
Add tcpdevmem README
Enable allocation of CUDA/UDMA buffers for sender/receiver, then use those buffers to send traffic via TCPDirect.
Include multi_neper.py to run parallel neper instances between pairs of sender/receiver IP addresses (i.e. eth1->eth1, eth2->eth2, eth3->eth3) and aggregate throughputs of runs.
Usage documented in README_tcpdevmem.md
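For reference, a minimal sketch of the generic udmabuf allocation path (a memfd exported as a dma-buf via /dev/udmabuf); this shows the standard kernel interface, not necessarily the exact sequence used in this PR:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/udmabuf.h>

/* Allocate 'size' bytes (page aligned) of host memory and export it as a
 * dma-buf fd through /dev/udmabuf. Returns the dma-buf fd, or -1 on error. */
static int alloc_udmabuf(size_t size)
{
        int memfd, devfd, buf_fd;
        struct udmabuf_create create = { 0 };

        /* udmabuf requires a sealable memfd with F_SEAL_SHRINK set. */
        memfd = memfd_create("neper-udmabuf", MFD_ALLOW_SEALING);
        if (memfd < 0)
                return -1;
        if (ftruncate(memfd, size) < 0 ||
            fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK) < 0)
                goto out_memfd;

        devfd = open("/dev/udmabuf", O_RDWR);
        if (devfd < 0)
                goto out_memfd;

        create.memfd = memfd;
        create.offset = 0;
        create.size = size;
        buf_fd = ioctl(devfd, UDMABUF_CREATE, &create);

        close(devfd);
        close(memfd);
        return buf_fd;

out_memfd:
        close(memfd);
        return -1;
}
```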