Commit

infiniband: fix markdown and drop obsolete refs
Fixes #2972

Signed-off-by: Sandro Bonazzola <[email protected]>
sandrobonazzola committed Jul 11, 2022
1 parent eaeb3c8 commit 15f9541
Showing 1 changed file with 54 additions and 38 deletions.
92 changes: 54 additions & 38 deletions source/develop/networking/infiniband.md
@@ -9,65 +9,81 @@ authors:

## Introduction

Although targeted at high-performance computing, Infiniband networks can be a fairly cheap alternative to 10 Gigabit Ethernet.
Nevertheless, it is not an out-of-the-box experience. Do not expect to get close to wire speed; be happy with every MB/s that you can reach beyond Gigabit Ethernet.
This page gives the interested reader a first impression of the problems one might encounter.

## IPoIB

IP over Infiniband (IPoIB) encapsulates IP packets inside Infiniband packets.
That adds a lot of overhead, but combined with an NFS server it is the easiest setup that is fully supported within oVirt.

### Hypervisor node setup

On the hypervisor node you have to load the modules required for IPoIB. These consist of the driver for your card, the transport, and a management module.
For Mellanox ConnectX cards, create `/etc/modules-load.d/ib.conf` with the following lines:

```
mlx4_ib
ib_ipoib
ib_umad
```
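
To confirm that the modules listed in `ib.conf` are actually loaded, one could compare them against `/proc/modules`. The following is only a minimal sketch: the helper name `missing_modules` is hypothetical, and it assumes the usual `/proc/modules` layout in which the module name is the first whitespace-separated field of each line. Modules compiled directly into the kernel do not appear there and would be reported as missing.

```python
REQUIRED_MODULES = ["mlx4_ib", "ib_ipoib", "ib_umad"]  # same names as in ib.conf


def missing_modules(proc_modules_text, required=REQUIRED_MODULES):
    """Return the required modules that do not appear in /proc/modules text."""
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in required if name not in loaded]


if __name__ == "__main__":
    try:
        with open("/proc/modules") as handle:
            missing = missing_modules(handle.read())
    except OSError:
        print("cannot read /proc/modules (not a Linux host?)")
    else:
        print("missing modules:", ", ".join(missing) if missing else "none")
```

Keeping the parsing in a function that takes plain text makes the check easy to try against a saved copy of `/proc/modules` from another node.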

After loading these modules you should see an Infiniband interface `ib0` inside oVirt (and additionally `ib1` if you have a two-port card).
Add a new network as usual and assign it to the interface with a static IP.

### Issue: Mellanox TSO bug

The kernel advertises TSO (TCP segmentation offload) for Mellanox ConnectX cards although it is not supported.
As a result, the hardware creates corrupt IP fragments on the sender side during large requests, and the receiving client cannot use LRO (large receive offload).
The result of a lengthy discussion is stated [here](http://www.spinics.net/lists/linux-rdma/msg17787.html).
So check whether your Mellanox card has revision **a0**. Here is an example of a card that is not TSO compatible:

```console
# lspci | grep Mellanox
03:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB DDR / 10GigE] (rev a0)
```

If you have such an old card, disable TSO and make that setting permanent in a startup script.

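```bash
# Count Mellanox devices that report hardware revision a0.
isOldCard=$(lspci | grep Mellanox | grep -c 'rev a0')
if [ "$isOldCard" -gt 0 ]; then
    # Disable TSO on the IPoIB interfaces (ib1 exists only on two-port cards).
    ethtool -K ib0 tso off
    ethtool -K ib1 tso off
fi
```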
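
The revision check can also be sketched in Python, which makes it easy to try against saved `lspci` output. This is an illustration only, not part of oVirt or vdsm; the function name `has_rev_a0_card` is hypothetical. It matches the trailing `(rev a0)` field rather than any occurrence of `a0`, so a PCI address such as `a0:00.0` does not trigger a false positive.

```python
import re
import subprocess

# Matches a Mellanox device line whose trailing revision field is exactly "a0".
_REV_A0 = re.compile(r"Mellanox.*\(rev a0\)\s*$")


def has_rev_a0_card(lspci_output):
    """Return True if any Mellanox line in the lspci text ends in '(rev a0)'."""
    return any(_REV_A0.search(line) for line in lspci_output.splitlines())


if __name__ == "__main__":
    try:
        result = subprocess.run(["lspci"], capture_output=True, text=True)
    except FileNotFoundError:
        print("lspci not found")
    else:
        print("rev a0 card present:", has_rev_a0_card(result.stdout))
```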

### Issue: Old hardware and MTU 2044

If you are running on old switch hardware, then your maximum IPoIB MTU will be limited to 2044 bytes. That is no problem at all, at least at the switch level.
On your NFS server and hypervisor nodes, however, this can result in unnecessary CPU cycles and reduced throughput.
See this [discussion thread](http://www.spinics.net/lists/linux-rdma/msg15133.html) for details.

If you are not afraid of compiling kernels yourself and you know what you are doing, then you can benefit from a dirty hack that limits the IPoIB MTU inside the kernel to 3072 bytes.
With that, receive operations are served within a single 4K page and unnecessary copy operations can be avoided.
Add the following modification to `ipoib_add_port()` in `drivers/infiniband/ulp/ipoib/ipoib_main.c`:

```c
...
if (!ib_query_port(hca, port, &attr))
	/* Limit max MTU to 3KB */
	/* priv->max_ib_mtu = ib_mtu_enum_to_int(attr.max_mtu); */
	priv->max_ib_mtu = 3072;
else {
...
```
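
The page-size argument can be checked with a little arithmetic. The sketch below rests on assumptions that are not stated on this page: that each received datagram is stored together with a 40-byte global route header (GRH), that the IPoIB encapsulation header is 4 bytes (which would explain 2044 = 2048 - 4), and that pages are 4096 bytes. The constant and function names are made up for illustration.

```python
PAGE_SIZE = 4096      # assumption: 4K pages, as mentioned in the text
GRH_BYTES = 40        # assumption: global route header stored with each datagram
IPOIB_ENCAP_LEN = 4   # assumption: IPoIB encapsulation header


def ipoib_mtu(ib_mtu):
    """IP-level MTU left after the IPoIB encapsulation header."""
    return ib_mtu - IPOIB_ENCAP_LEN


def fits_single_page(ib_mtu):
    """True if the GRH plus a full-size datagram fits in one page."""
    return ib_mtu + GRH_BYTES <= PAGE_SIZE


for ib_mtu in (2048, 3072, 4096):
    print(ib_mtu, ipoib_mtu(ib_mtu), fits_single_page(ib_mtu))
```

Under these assumptions a 4096-byte IB MTU spills over a single page once the GRH is added, while 3072 bytes leaves room to spare.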
## NFS over RDMA

In this setup the NFS sunrpc layer directly accesses the basic Infiniband mechanisms to exchange data between NFS server and client. The configuration is explained [here](https://www.kernel.org/doc/Documentation/filesystems/nfs/nfs-rdma.txt) and might be affected by some bugs:

* [Cannot read more than 812 bytes from NFS server file](https://bugzilla.redhat.com/show_bug.cgi?id=1046011):
  The bug can occur between kernels 3.7 and 3.13. It should be fixed by a commit that used to be available at `http://article.gmane.org/gmane.linux.nfs/60953`, but the gmane service has been discontinued.
* [NFS crashes](http://www.mail-archive.com/[email protected]/msg14145.html): Not yet tested/observed.

At the moment oVirt does not allow mounting NFS shares over RDMA, so the only option is to modify the mount operation in `/usr/share/vdsm/storage/storageServer.py` yourself:

```python
...
class NFSConnection(object):
    DEFAULT_OPTIONS = ["soft", "nosharecache", "rdma", "port=20049"]
...
```
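
To see what such an option list amounts to on the command line, the sketch below joins it into the `-o` argument of a plain `mount` call. This is not how vdsm builds its mount command; `mount_command`, the export `nfs-server:/export`, and the mount point `/mnt/rdma` are hypothetical names used only for illustration.

```python
DEFAULT_OPTIONS = ["soft", "nosharecache", "rdma", "port=20049"]


def mount_command(export, mountpoint, options=DEFAULT_OPTIONS):
    """Build the argument list for mounting an NFS export with the given options."""
    return ["mount", "-t", "nfs", "-o", ",".join(options), export, mountpoint]


# Hypothetical export and mount point, for illustration only.
print(" ".join(mount_command("nfs-server:/export", "/mnt/rdma")))
```

Running the printed command by hand on a hypervisor node can be a quick way to verify that the NFS server accepts RDMA mounts before touching vdsm.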

A request for enhancement of oVirt has been filed as Red Hat Bug [1057043](https://bugzilla.redhat.com/show_bug.cgi?id=1057043).
