Bandwidth problem with sja1105 port #47

Open
Meng0527 opened this issue Apr 30, 2019 · 22 comments

Comments

@Meng0527

Hi, I found a problem when I tried the Qbv demo. I connected two hosts through an sja1105 and found that the bandwidth between them is unstable: it is sometimes close to 1000 Mbps, but sometimes only about 500 Mbps. The switch is set to the default TSN configuration. When I connected the two hosts through two sja1105 switches, the situation got worse and the bandwidth was only about 10 Mbps. Do you know the reason for this problem?

@vladimiroltean
Contributor

The default configuration for Qbv does not permit 1000 Mbps bandwidth for best-effort traffic anyway.
However, per the IEEE spec, when the Qbv engine is not running, all gates should be open, therefore all bandwidth is available for regular traffic. And the ptp4l program only starts the Qbv engine when the PTP offset is small enough.
So in this case, perhaps the issue is when the best-effort bandwidth is full, not when it isn't?
May I suggest that your situation might be re-stated as "PTP time sync is not stable and the Qbv engine is stopping"? If you think this is not the case, can you please provide port counters (sja1105-tool status port) so we can investigate possible frame drops?

@Meng0527
Author

Meng0527 commented May 6, 2019

My description may have been confusing. I did not enable Qbv when testing bandwidth; the test was meant as a baseline for comparison with the Qbv demo. The bandwidth is now normal again, so I will report back the next time this problem happens.

@Meng0527
Author

Meng0527 commented May 8, 2019

I think I know the reason for the sja1105 bandwidth change. When this happens, there are errors on the ingress port of the sja1105 that cause frame drops, so the tester cannot get the correct bandwidth value. The dropped frames account for about one thousandth of the total. The errors are mainly CRCERR, with some SOFERR and MIIERR. When testing only one sja1105, this happens occasionally; when it is connected to another sja1105 or to a normal switch, the problem always occurs.
Here are the port counters of the ingress port.
MAC-Level Diagnostic Counters
N_RUNT 0
N_SOFERR 123
N_ALIGNERR 0
N_MIIERR 255

MAC-Level Diagnostic Flags
TYPEERR 0
SIZEERR 0
TCTIMEOUT 0
PRIORERR 0
NOMASTER 0
MEMOV 0
MEMERR 0
INVTYP 0
INTCYOV 0
DOMERR 0
PCFBAGDROP 0
SPCPRIOR 0
AGEPRIOR 0
PORTDROP 0
LENDROP 0
BAGDROP 0
POLICEERR 0
DRPNON664ERR 0
SPCERR 0
AGEDRP 0

High-Level Diagnostic Counters
N_N664ERR 0
N_VLANERR 0
N_UNRELEASED 0
N_SIZERR 0
N_CRCERR 6735
N_VLNOTFOUND 0
N_BEPOLERR 0
N_POLERR 0
N_RXFRM 353977
N_RXBYTE 181234516
N_TXFRM 15
N_TXBYTE 1404
N_QFULL 0
N_PART_DROP 0
N_EGR_DISABLED 0
N_NOT_REACH 0
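
For a rough sense of scale, the share of errored frames can be worked out from the counters above. The following is only a sketch: the helper name `err_ppm` is made up for illustration, and it assumes that N_RXFRM counts all frames received on the port, which the thread does not confirm.

```shell
# err_ppm RXFRM ERRORS: errored frames per million received frames.
# Hypothetical helper; integer arithmetic only, so the result is rounded down.
err_ppm() {
    rxfrm=$1
    errors=$2
    if [ "$rxfrm" -eq 0 ]; then
        echo 0
        return
    fi
    echo $(( errors * 1000000 / rxfrm ))
}

# N_RXFRM and N_CRCERR copied from the port dump above.
err_ppm 353977 6735
```

For this dump the result is roughly 19000 ppm, i.e. a little under 2% of received frames.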

@vladimiroltean
Contributor

Thank you for the investigation done so far.
Do you still have a system running with this packet loss issue?
Could you please tell me the output of the following:

source /etc/init.d/S46sja1105-link-speed-fixup
etsec_mdio write 3 0x1c $((3 << 10)) && etsec_mdio read 3 0x1c
etsec_mdio write 3 0x18 $(((7 << 12) | 7)) && etsec_mdio read 3 0x18

May I also know which port number these errors are seen on? I need this info for some further commands.

@Meng0527
Author

This is the result of a recent test. These errors are seen on port 0 (eth5).
[root@OpenIL:~]# sja1105-tool status port 0
Port 0

MAC-Level Diagnostic Counters
N_RUNT 0
N_SOFERR 6
N_ALIGNERR 0
N_MIIERR 21

MAC-Level Diagnostic Flags
TYPEERR 0
SIZEERR 0
TCTIMEOUT 0
PRIORERR 0
NOMASTER 0
MEMOV 0
MEMERR 0
INVTYP 0
INTCYOV 0
DOMERR 0
PCFBAGDROP 0
SPCPRIOR 0
AGEPRIOR 0
PORTDROP 0
LENDROP 0
BAGDROP 0
POLICEERR 0
DRPNON664ERR 0
SPCERR 0
AGEDRP 0

High-Level Diagnostic Counters
N_N664ERR 0
N_VLANERR 0
N_UNRELEASED 0
N_SIZERR 0
N_CRCERR 657
N_VLNOTFOUND 0
N_BEPOLERR 0
N_POLERR 0
N_RXFRM 667782
N_RXBYTE 341902336
N_TXFRM 55
N_TXBYTE 5498
N_QFULL 0
N_PART_DROP 0
N_EGR_DISABLED 0
N_NOT_REACH 0

@Meng0527
Author

[root@OpenIL:init.d]# ./S46sja1105-link-speed-fixup start
Setting ETH2 link speed to 1000
Setting ETH3 link speed to 1000
Setting ETH4 link speed to 1000
Setting ETH5 link speed to 1000
[root@OpenIL:init.d]# etsec_mdio write 3 0x1c $((3 << 10)) && etsec_mdio read 3 0x1c
0xe00
[root@OpenIL:init.d]# etsec_mdio write 3 0x18 $(((7 << 12) | 7)) && etsec_mdio read 3 0x18
0x71e7
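
The two write commands use shell arithmetic to build the value written, and the readbacks contain additional bits. The sketch below only does that arithmetic; the helper names `hex` and `extra_bits` are invented for illustration, and what the extra bits mean inside the PHY is not stated in the thread.

```shell
# hex VALUE: print an integer as 0x-prefixed hexadecimal.
hex() {
    printf '0x%x\n' "$1"
}

# extra_bits WRITTEN READBACK: bits set in the readback but not in the
# value that was written.
extra_bits() {
    printf '0x%x\n' $(( $2 & ~$1 ))
}

# Values produced by the arithmetic in the two write commands.
hex $((3 << 10))
hex $(((7 << 12) | 7))

# Compare them with the values read back in the transcript above.
extra_bits $((3 << 10)) 0xe00
extra_bits $(((7 << 12) | 7)) 0x71e7
```

The written values are 0xc00 and 0x7007, so the readbacks carry the extra bits 0x200 and 0x1e0 respectively.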

@vladimiroltean
Contributor

vladimiroltean commented May 10, 2019

Can you please further run the following commands after you observe the RGMII errors? You should run them once before the frame errors occur, and once afterwards (the reason is that the counters get cleared upon read):

source /etc/init.d/S46sja1105-link-speed-fixup
etsec_mdio read 6 0x11
etsec_mdio read 6 0x12
etsec_mdio read 6 0x13
etsec_mdio read 6 0x1A
etsec_mdio write 6 0x17 $((0xf00 | 1)) && etsec_mdio read 6 0x15

Also, what would it take for me to try to reproduce this? How many cables do you have connected to the switch? Is the temperature higher than usual? It happens even when the link partner is another LS1021A-TSN switch port, right? Are both boards connected to the same ground reference?
Are the PHY LEDs still on when this issue happens? Does it happen on a single board/single port?

@Meng0527
Author

Meng0527 commented May 13, 2019

Before sending the test stream:
Port 0

MAC-Level Diagnostic Counters
N_RUNT 0
N_SOFERR 0
N_ALIGNERR 0
N_MIIERR 0
High-Level Diagnostic Counters
N_CRCERR 0
N_RXFRM 0
N_RXBYTE 0
N_TXFRM 91
N_TXBYTE 17874

[root@OpenIL:]# /etc/init.d/S46sja1105-link-speed-fixup start
Setting ETH2 link speed to 1000
Setting ETH3 link speed to 1000
Setting ETH4 link speed to 1000
Setting ETH5 link speed to 1000
[root@OpenIL:]# etsec_mdio read 6 0x11
0x321
[root@OpenIL:]# etsec_mdio read 6 0x12
0x0
[root@OpenIL:]# etsec_mdio read 6 0x13
0xff
[root@OpenIL:]# etsec_mdio read 6 0x1A
0xc3e
[root@OpenIL:]# etsec_mdio write 6 0x17 $((0xf00 | 1)) && etsec_mdio read 6 0x15
0x0

@Meng0527
Author

Meng0527 commented May 13, 2019

After sending the test stream:
Port 0

MAC-Level Diagnostic Counters
N_RUNT 0
N_SOFERR 255
N_ALIGNERR 0
N_MIIERR 255

High-Level Diagnostic Counters
N_CRCERR 15789
N_RXFRM 3232540
N_RXBYTE 1655060480
N_TXFRM 3
N_TXBYTE 222

[root@OpenIL:etc]# /etc/init.d/S46sja1105-link-speed-fixup start
Setting ETH2 link speed to 1000
Setting ETH3 link speed to 1000
Setting ETH4 link speed to 1000
Setting ETH5 link speed to 1000
[root@OpenIL:etc]# etsec_mdio read 6 0x11
0x2321
[root@OpenIL:etc]# etsec_mdio read 6 0x12
0x0
[root@OpenIL:etc]# etsec_mdio read 6 0x13
0xff
[root@OpenIL:etc]# etsec_mdio read 6 0x1A
0x2c3e
[root@OpenIL:etc]# etsec_mdio write 6 0x17 $((0xf00 | 1)) && etsec_mdio read 6 0x15
0x0
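
To see what changed between the "before" and "after" readouts, the two values of each register can be XORed. This is a minimal sketch; `changed_bits` is a hypothetical helper name, and the meaning of the individual bits is not taken from any datasheet here.

```shell
# changed_bits BEFORE AFTER: bits that differ between two register reads.
changed_bits() {
    printf '0x%x\n' $(( $1 ^ $2 ))
}

# Readouts copied from the two transcripts (before and after the stream).
changed_bits 0x321 0x2321   # register 0x11
changed_bits 0xc3e 0x2c3e   # register 0x1A
changed_bits 0xff 0xff      # register 0x13, unchanged
```

Both register 0x11 and register 0x1A differ only in bit 13 (0x2000); registers 0x12, 0x13 and the 0x15 readback are identical in the two runs.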

@Meng0527
Author

Meng0527 commented May 13, 2019

The following is a simplified topology:
image
image
1. Usual temperature.
2. Yes.
3. Yes.
4. The PHY LEDs are still on.
5. It sometimes happens on a single board.

@vladimiroltean
Contributor

Are you performing any SPI transactions to any of the switches when this is happening? Or are the systems simply idling and passing traffic?

@Meng0527
Author

No, I did nothing with it while sending the test stream.

@vladimiroltean
Contributor

The PHY counters I asked you to read are indicating that bad start-of-stream delimiters have been found in received frames since the last readout. So whatever the SJA1105 port is seeing, the PHY is seeing too.
You have shown two diagrams above. In both of them, the tester is connected to ETH4 and ETH5. However, the ETH4/ETH5 pair is also used in the second diagram to interconnect two LS1021A-TSN boards. Then you are showing a list of counters for SJA1105 port 0, which is confusingly ETH5. What is the link partner of the port that's seeing bad SSD frames? Always the tester, always the LS1021A-TSN, or both?

May I know what the tester is testing for? Frame preemption, by any chance? Does the tester have the ability to decode raw Ethernet code words? Do you have a capture of the frames that trigger the bad SSD error? What is the structure of the test stream?

@Meng0527
Author

The ETH5 port connected to the tester (LS1021ATSN in Figure 1 and LS1021ATSN-1 in Figure 2) sometimes sees packet loss; the ETH5 port connected to another LS1021ATSN (LS1021ATSN-2 in Figure 2) always sees it. The counter values of the ports that lose packets are very close in the two diagrams, so I only show one.
The tester only performs basic parameter testing (bandwidth, delay, etc.) and does not involve any TSN functions.
The test stream consists of broadcast Ethernet frames with a length of 512 bytes.
Test frames captured.zip
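
The 512-byte frame length can be cross-checked against the switch counters posted earlier by dividing N_RXBYTE by N_RXFRM. This is a sketch under assumptions: `avg_frame_bytes` is a made-up helper, and whether N_RXBYTE includes the frame check sequence is not stated in the thread.

```shell
# avg_frame_bytes RXBYTE RXFRM: average bytes per received frame (rounded down).
avg_frame_bytes() {
    if [ "$2" -eq 0 ]; then
        echo 0
        return
    fi
    echo $(( $1 / $2 ))
}

# N_RXBYTE and N_RXFRM from the "after sending the test stream" dump.
avg_frame_bytes 1655060480 3232540
```

For that dump the quotient is exactly 512, which matches the frame length described for the test stream.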

@vladimiroltean
Contributor

Have you made any progress with this? I am not able to confirm the behavior with traffic based on your PCAP, or provide other debugging hints. Is your switch configuration XML different from the standard?

@jihe123

jihe123 commented Nov 7, 2019

Hello, I am running a demo with one TSN board (LS1021ATSN), following the PDF (Open Industrial Linux User Guide Release v0.2), but when I did the schedule configuration (6.8.6), something went wrong:
[root@OpenIL:~]# sja1105-tool conf mod schedule-table entry-count 2
[root@OpenIL:sja1105]# for i in 0 1; do sja1105-tool conf mod schedule-table[$i]
\destports 0b00100;done
Index out of bounds!
Please adjust the entry count of the table:
config modify entry-count )
modify failed!
Index out of bounds!
Please adjust the entry count of the table:
config modify entry-count )
modify failed!

I am new to this. Do you know what is wrong? Thank you.

@elsinkior

elsinkior commented Feb 7, 2020

As part of a demonstration of the TSN functionality of the LS1021ATSN-PA board running Open-IL v1.7 (Xenomai/Cobalt v3.1-devel), I followed the procedure given for this hardware in the "Open Industrial Linux User Guide, Rev 1.6, 08/2019" (chapter 7.2), after setting up the network topology presented in chapter 7.2.2, page 114 (three LS1021ATSN-PA boards linked together).

Image pasted at 2020-2-6 10-43

I encountered a problem at the very first step, when setting up the standard configuration (the expected results are covered in chapter 7.2.8.5.4).

The bandwidth obtained from board 2 to board 3 and from board 1 to board 3 is "chaotic", very far from 950 Mbit/s.

I ran "sja1105-tool status port" on each board and observed the N_MIIERR counters incrementing (port 1 on board 1, ports 1 and 2 on board 2, and port 2 on board 3) while iPerf3 was running (source 172.15.0.1, destination 172.15.0.3). Bandwidth drops rapidly (by over 90%) and oscillates around 10 Mbit/s. The board 2 to board 3 test shows the same issue.

This test was performed over TCP. Over UDP the problem is less obvious, but there is still a loss of 50%.

In addition, in the "Rate-Limiting - Prioritizing configuration" scenario with priorities in place (flow 1 to 3 has priority over flow 2 to 3), I saw (testing over UDP) the bandwidths inverted: 1 to 3 at around 100 Mbit/s and 2 to 3 at around 500 Mbit/s for 5 s, then the other way round, and so on.

Regarding the "Synchronized Qbv" demonstration, despite the bandwidth problem I could observe the expected result for the "3-HOP" scenario (stable latency of 30 ms).
The "1 HOP" scenario, on the other hand, gives an inconsistent result: unstable latency around 15 ms.

Do you have any suggestions for further investigation, or any idea about the origin of the problem, to help us set up a representative demonstration?

In addition to the issue description, below is a test that reads the control register of PHY3 and PHY2 (on the BCM5464R) while iPerf3 is running.

The collision test bit appears to have been set about twenty times on the board where I monitored it, on PORT 2 (ETH3), connected to board 1, and on PORT 1 (ETH2), connected to board 3.

The test steps are described below.

Could you give me more information about the registers read in #47 (comment)? (I could not find a register specification for the BCM5464R.)

  1. B3 : Start iPerf3 server

[root@OpenIL:~]# iperf3 -1 -f m -i 0.5 -s -p 5202

  2. B1 : Start iPerf3 client

[root@OpenIL:~]# iperf3 -t 86400 -p 5202 -c 172.15.0.3
...
[ 4] 6.00-6.46 sec 512 KBytes 9.09 Mbits/sec 28 1.41 KBytes

  3. B2 : Start register read loop

while true; do etsec_mdio read 3 0x0; done | tee /tmp/outP1_ETH2;
while true; do etsec_mdio read 4 0x0; done | tee /tmp/outP2_ETH3;

  4. B1 : Stop iPerf3 client

  5. B2 : Stop register read loop

  6. B2 : Read port status

PORT 2

MAC-Level Diagnostic Counters
N_RUNT 0
N_SOFERR 3
N_ALIGNERR 0
N_MIIERR 5

PORT 3

MAC-Level Diagnostic Counters
N_RUNT 0
N_SOFERR 2
N_ALIGNERR 0
N_MIIERR 117

  7. B2 : Count the number of collision test bit rising edges

[root@OpenIL:~]# grep 11e1 /tmp/outP1_ETH2 | wc -l
17

[root@OpenIL:~]# grep 11e1 /tmp/outP2_ETH3 | wc -l
19
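
Note that `grep 11e1 | wc -l` counts every read that returned the value, not the transitions into it. A sketch of an edge counter follows; `count_rising_edges` is a hypothetical helper, and it assumes the log holds one register value per line in the form printed by the read command (for example `0x11e1`).

```shell
# count_rising_edges VALUE: read one register value per line on stdin and
# count how often VALUE appears right after a different value (or as the
# very first line).
count_rising_edges() {
    awk -v want="$1" '
        # Appending "" forces string comparison, so the hex text is never
        # interpreted as a number.
        ($1 "") == (want "") && (prev "") != (want "") { n++ }
        { prev = $1 }
        END { print n + 0 }
    '
}

# Made-up readings for illustration only (not taken from the boards).
printf '0x1140\n0x11e1\n0x11e1\n0x1140\n0x11e1\n' | count_rising_edges 0x11e1
```

With the made-up input above, three lines match but only two transitions are counted. Against a real log the call would look like `count_rising_edges 0x11e1 < /tmp/outP1_ETH2`.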

Thank you for your feedback,

@vladimiroltean
Contributor

Hi there,

I'm sorry for the trouble and I'm also aware of the NXP support ticket you have opened.
The N_MIIERR counter is described in UM10944 PDF as:

This field counts the number of frames that started with a valid start
sequence (preamble plus SOF delimiter byte) but terminated with the MII
error input being asserted.

So it is perhaps indicative of a hardware issue (misconfiguration or otherwise): the PHY has either asserted the RX_ER signal, or deasserted the RX_DV signal of the switch's MAC. We have not seen this manifest during development or testing.

The unfortunate part is that the default LS1021A-TSN image is not equipped with software for proper debugging for this kind of issue. The sja1105-tool being a user space driver, it does not register net devices in the kernel, so it cannot register with the PHY library, to get a driver in control of the BCM5464R or cannot even perform any sort of MDIO access towards the PHY. This cannot be changed given that sja1105-tool is what it is (a user space driver).

What the etsec_mdio script does is more of a hack: it copies what the kernel driver does (drivers/net/ethernet/freescale/fsl_pq_mdio.c) and does that from a shell script with raw access to the MDIO controller registers, via devmem. But since the kernel MDIO driver is also running, and the PHY library is polling the 2 AR8031 PHYs for eth0 and eth1 once per second, the results are not completely defined, since MDIO access is not atomic with respect to memory writes in the controller's register map. So the devmem commands might (and will) interfere with the kernel driver doing its work, and vice versa. So I would not blindly trust the 0x11e1 value that you got 17 times in MII_BMCR.
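
For reference, the 0x11e1 value can be broken down using the standard IEEE 802.3 clause 22 layout of the basic control register (register 0). This is a sketch only: `decode_bmcr` and the bit labels are names chosen for illustration, and any vendor-specific use of the low bits by the PHY is not covered.

```shell
# decode_bmcr VALUE: list the clause 22 control register bits set in VALUE.
decode_bmcr() {
    v=$(( $1 ))
    out=""
    for entry in \
        0x8000:reset 0x4000:loopback 0x2000:speed-lsb \
        0x1000:autoneg-enable 0x0800:power-down 0x0400:isolate \
        0x0200:autoneg-restart 0x0100:full-duplex \
        0x0080:collision-test 0x0040:speed-msb \
        0x0020:unidirectional-enable
    do
        mask=${entry%%:*}
        name=${entry#*:}
        if [ $(( v & $mask )) -ne 0 ]; then
            out="$out $name"
        fi
    done
    # Bits 4..0 are reserved in the standard register definition.
    if [ $(( v & 0x001f )) -ne 0 ]; then
        out="$out reserved-bits"
    fi
    echo "${out# }"
}

decode_bmcr 0x11e1
```

Under that layout, 0x11e1 has the collision-test bit set but also a bit in the reserved range, which may be one more reason to treat the readout with caution.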

I think that even unbinding eth0 and eth1 would be enough to get a more reliable read:

echo soc:ethernet@2d10000 > /sys/bus/platform/drivers/fsl-gianfar/unbind
echo soc:ethernet@2d50000 > /sys/bus/platform/drivers/fsl-gianfar/unbind

But if there is a PHY configuration issue, that would still be difficult to spot with raw MDIO accesses. The BCM5464R PHY, in the default OpenIL setup, is left with mostly the defaults configured via pin strapping, with the exception of link speeds which are forced to 1000 in the /etc/init.d/S46sja1105-link-speed-fixup init script. That file is provided in case you need to change the PHY fixed speed if you have a link partner that runs at 100 Mbps. Since you are not in that situation, I would disable the init script altogether, since in theory it is possible that that, too, interferes with the kernel MDIO driver and, as a result, writes something else to the PHY than what is expected.

The lack of a PHY driver was one of the main reasons for moving sja1105 to a kernel driver, and if you are willing to spend some time, then it would be helpful if you could give the mainline kernel a try. There, the switch ports are registered as swp2, swp3, swp4, swp5, and the MAC statistics can be retrieved with ethtool -S swp2 (it's the same information, but has the advantage that the drivers/net/phy/broadcom.c file gets engaged in configuring the BCM5464R). There is even a fork of OpenIL that enables the mainline kernel for this board. The steps for compiling the image (make nxp_ls1021atsn_defconfig && make) are the same.

Hope this helps,
-Vladimir

@elsinkior

Thank you very much for your reply. I will continue the investigation taking into account your advice :)

@elsinkior

elsinkior commented Feb 11, 2020

Thank you very much for your advice.

Following your recommendations, I took these steps:

  • rebuilt Open-IL (https://github.com/vladimiroltean/openil-community.git) to get a driver that manages the MDIO interface to the Broadcom PHY attached to the SJA1105.
  • ran a new iPerf3 test between B1 and B3, measuring bandwidth and reading the error counters directly via ethtool -S swp2

Unfortunately, I still observe the same issue, with disastrous bandwidth:

"ethtool -S swp2" result:

NIC statistics:

tx_packets: 253860
tx_bytes: 314867536
rx_packets: 109783
rx_bytes: 6632319
n_runt: 0
**n_soferr: 17**
n_alignerr: 0
**n_miierr: 94**
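
To watch a single counter over time, its value can be pulled out of the `ethtool -S` text. The sketch below is an assumption-laden helper, not part of ethtool: `get_stat` is a made-up name, and it expects `name: value` lines like the ones shown above.

```shell
# get_stat NAME: read "ethtool -S" style text on stdin and print the value
# of the counter called NAME.
get_stat() {
    awk -v key="$1" '
        {
            line = $0
            gsub(/[*]/, "", line)        # drop emphasis markers, if present
            sub(/^[ \t]+/, "", line)     # drop leading indentation
            n = index(line, ":")
            if (n == 0) next
            name = substr(line, 1, n - 1)
            value = substr(line, n + 1)
            gsub(/[ \t]/, "", value)
            if (name == key) print value
        }
    '
}

# Sample text modelled on the output above.
printf 'NIC statistics:\n     n_soferr: 17\n     n_miierr: 94\n' | get_stat n_miierr
```

On the board this would be used as `ethtool -S swp2 | get_stat n_miierr`, sampled before and after an iPerf3 run so that the two values can be compared.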

"ifconfig" result:

swp2 Link encap:Ethernet HWaddr 00:04:9F:EF:05:05

UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:109469 errors:0 **dropped:6626** overruns:0 frame:0
TX packets:252486 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:6599238 (6.2 MiB) TX bytes:314766316 (300.1 MiB)

Some phytool executions on the PHYs connected to TSN switch ports 2 and 1:

phytool read swp2/3/0x12
0x0003

phytool read swp3/4/0x12
0x0020

phytool read swp3/4/0x12
0x004f
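
The readings above are hexadecimal. Assuming register 0x12 is an error counter that clears on read (suggested by the earlier remark that these PHY counters are cleared upon read, but not confirmed in the thread), successive reads of one port can be summed to get a total. `sum_hex` is a hypothetical helper.

```shell
# sum_hex VALUE...: add up hexadecimal (or decimal) readings and print the
# total in decimal.
sum_hex() {
    total=0
    for v in "$@"; do
        total=$(( total + $v ))
    done
    echo "$total"
}

sum_hex 0x0003            # single read on swp2
sum_hex 0x0020 0x004f     # two successive reads on swp3
```

That gives 3 for swp2 and 111 (32 + 79) for the two swp3 reads.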

image

A hardware issue seems to be the cause of the errors encountered with our three boards.

@vladimiroltean
Contributor

Thanks for the work investigating this.
I have observed some Ethernet PHY issues being correlated with booting the board with the microUSB cable not plugged in. I haven't found the reason for that. Just thought I'd make sure you have those plugged. It's rather strange to have 3 boards fail in the same way, when mostly everybody else hasn't seen that happen.

@vladimiroltean
Contributor

Does the PHY report receive errors from other link partners too? What happens if you change the link speed with "ethtool -s swp2 advertise 0x8" (for 100 Mbps, or 0x20 to go back to 1 Gbps)? Could the cables be an issue?
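
The `advertise` argument is a bitmask of link modes; the values below are the ones listed in the ethtool manual page for the common copper modes. The wrapper `advertise_mask` is a made-up convenience function, shown only as a sketch.

```shell
# advertise_mask MODE: print the ethtool "advertise" bitmask for a link mode.
advertise_mask() {
    case "$1" in
        10baseT/Half)   echo 0x001 ;;
        10baseT/Full)   echo 0x002 ;;
        100baseT/Half)  echo 0x004 ;;
        100baseT/Full)  echo 0x008 ;;
        1000baseT/Full) echo 0x020 ;;
        *)
            echo "unknown mode: $1" >&2
            return 1
            ;;
    esac
}

advertise_mask 100baseT/Full
advertise_mask 1000baseT/Full
```

On the board this could be used as `ethtool -s swp2 advertise "$(advertise_mask 100baseT/Full)"`, which is equivalent to the 0x8 form quoted above.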

4 participants