Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loading a configuration file is not always reliable #39

Open
sylane opened this issue Nov 29, 2018 · 34 comments
Open

Loading a configuration file is not always reliable #39

sylane opened this issue Nov 29, 2018 · 34 comments

Comments

@sylane
Copy link
Contributor

sylane commented Nov 29, 2018

Trying to load the configuration file config.error.xml on one of my LS1021ATSN devkit, sja1105-tool randomly fail:

[root@etsn1:~]# /usr/bin/sja1105-tool config load -f config.error.xml 
[root@etsn1:~]# /usr/bin/sja1105-tool config load -f config.error.xml 
[root@etsn1:~]# /usr/bin/sja1105-tool config load -f config.error.xml 
configuration is invalid
static_config_flush failed
staging_area_flush failed
[root@etsn1:~]# /usr/bin/sja1105-tool config load -f config.error.xml 
[root@etsn1:~]# /usr/bin/sja1105-tool config load -f config.error.xml 
[root@etsn1:~]# /usr/bin/sja1105-tool config load -f config.error.xml 
configuration is invalid
static_config_flush failed
staging_area_flush failed
[root@etsn1:~]# /usr/bin/sja1105-tool config load -f config.error.xml 
[root@etsn1:~]# /usr/bin/sja1105-tool config load -f config.error.xml 
configuration is invalid
static_config_flush failed
staging_area_flush failed
[root@etsn1:~]# /usr/bin/sja1105-tool config load -f config.error.xml 
[root@etsn1:~]# /usr/bin/sja1105-tool config load -f config.error.xml 
[root@etsn1:~]# /usr/bin/sja1105-tool config load -f config.error.xml 
[root@etsn1:~]# /usr/bin/sja1105-tool config load -f config.error.xml 
[root@etsn1:~]# /usr/bin/sja1105-tool config load -f config.error.xml 
local crc failed while uploading config
static_config_flush failed
staging_area_flush failed
[root@etsn1:~]# /usr/bin/sja1105-tool config load -f config.error.xml 
[root@etsn1:~]# /usr/bin/sja1105-tool config load -f config.error.xml 
[root@etsn1:~]# /usr/bin/sja1105-tool config load -f config.error.xml 
[root@etsn1:~]# /usr/bin/sja1105-tool config load -f config.error.xml 
configuration is invalid
static_config_flush failed
staging_area_flush failed

Rebooting the router did not change anything.
The same config load fine on my two other routers that have the exact same software installed.
The error descriptions suggest that this has nothing to do with the switch fabric, that it is a validation error, what could cause this ?

Thank you for your help.

EDIT 1:
Sometime I get this error too:

global crc failed while uploading config
static_config_flush failed
staging_area_flush failed

EDIT 2:
I got similar errors, less often, on the other routers with some other configuration file.

@vladimiroltean
Copy link
Contributor

vladimiroltean commented Nov 29, 2018

Sounds like either a filesystem corruption, or a memory corruption. The latter is more likely since I think we went already through the filesystem corruption possibility in a previous ticket and we ruled that out.
@Kay-Potthoff opened this pull request with some memory corruption fixes a while ago. I didn't merge it because some of the fixes were improper and I didn't have time to rework them. You can try to apply them though and see if it helps.
Also, to be on the same page, what is your sja1105-tool --version?

@sylane
Copy link
Contributor Author

sylane commented Nov 30, 2018

[root@etsn1:~]# sja1105-tool --version
libsja1105 version: master
libsja1105 build date: Nov 19 2018 16:13:37
sja1105-tool version: master
sja1105-tool build date: Nov 19 2018 16:13:38

@vladimiroltean
Copy link
Contributor

While running watch -n 1 "sja1105-tool conf load -f config.error.xml; echo \$?", I never see any output other than 0.
Question: Is the NETCONF server running on your system? (ps aux | grep netopeer) If it is, can you stop it? (/etc/init.d/S90netconf stop)

@sylane
Copy link
Contributor Author

sylane commented Nov 30, 2018

Netconf is not running, I had to remove it as it was making sja1105-tool even more unreliable.

@vladimiroltean
Copy link
Contributor

vladimiroltean commented Nov 30, 2018

I looked again through the code, "configuration is invalid" is a message that gets printed if the switch returns CONFIGS = 0 in the general status. So this is not a validation failure in the software. What is your SPI frequency in /etc/sja1105/sja1105.conf? Could you try lowering it?
And does your kernel include this patch? nxp-archive/openil_linux@d23888f You can easily tell by looking at ls -la /proc/device-tree/soc/dspi@2100000/spidev@1/ and seeing whether the fsl,spi-cs-sck-delay property is present.

@sylane
Copy link
Contributor Author

sylane commented Nov 30, 2018

We build the system using OpenIL OpenIL-201808, is this patch supposed to be in ? Because it doesn't look like:

[root@etsn1:~]# ls -la /proc/device-tree/soc/dspi@2100000/spidev@1/
total 0
-r--r--r--    1 root     root             4 Jan  1 05:56 #address-cells
-r--r--r--    1 root     root             4 Jan  1 05:56 #size-cells
drwxr-xr-x    2 root     root             0 Jan  1 05:56 .
drwxr-xr-x    6 root     root             0 Jan  1 05:56 ..
-r--r--r--    1 root     root            14 Jan  1 05:56 compatible
-r--r--r--    1 root     root             7 Jan  1 05:56 name
-r--r--r--    1 root     root             4 Jan  1 05:56 reg
-r--r--r--    1 root     root             4 Jan  1 05:56 spi-max-frequency
[root@etsn1:~]# cat /etc/sja1105/sja1105.conf
[spi_setup]
	staging_area = /lib/firmware/sja1105.bin
	device       = /dev/spidev0.1
	bits         = 8
	speed        = 1000000
	delay        = 0
	cs_change    = 0
	mode         = SPI_CPHA
	dry_run      = false
	auto_flush   = false

[general]
	screen_width     = 120
	entries_per_line = 10
	verbose          = false
	debug            = false

@sylane
Copy link
Contributor Author

sylane commented Nov 30, 2018

That doens't explain why I have one router without any error, one with very few and a lat one with a lot of them... Same software, same config...

@vladimiroltean
Copy link
Contributor

It should have went into that release tag according to https://github.com/openil/linux/commits/OpenIL-linux-201808, I don't know where you have it. When I made the patch it was because of some weirdly-behaving board during PTP commands, there seems to be some variability on the SPI interface. It would be good if we could at least rule that out.

@sylane
Copy link
Contributor Author

sylane commented Nov 30, 2018

I tried with speed 500000 and 100000, same errors...

@sylane
Copy link
Contributor Author

sylane commented Nov 30, 2018

I randomly get local crc failed while uploading config, global crc failed while uploading config and configuration is invalid.

@vladimiroltean
Copy link
Contributor

Yes, I get that, those are all responses from the switch.
To prove to you that it's not the program that is sending bad data, do set dry_run = true in sja1105.conf and compare the dumps from a functional and a non-functional board using the same xml. If they're the same it's most likely an SPI issue, and you should at least try to apply the patch that repairs the timing violation between the chip select and the clock signal (which is still there at 500 and 100 KHz).

@sylane
Copy link
Contributor Author

sylane commented Nov 30, 2018

I just tried dru_run = true but the output is different fro every run, the 6th and 7th bytes of the first transfer keep changing...

@sylane
Copy link
Contributor Author

sylane commented Nov 30, 2018

Is there a way to dump the SPI without doing a dry_run ? This would be the only way to be sure what was sent on the SPI when failing.

@vladimiroltean
Copy link
Contributor

Ok, that's interesting, so we're back at the memory corruption hypothesis.
Could you instrument the code around this area? Obviously sja1105_static_config_check_valid didn't trigger a validation error at the point it was called, perhaps call it again after static_config_upload (SPI transfer)?

Is there a way to dump the SPI without doing a dry_run ?

Only using a probe, or dumps in the spidev driver, I think. But there are some statistics under ls -la /sys/bus/spi/devices/spi0.1/statistics/ that you can try to check out.

@sylane
Copy link
Contributor Author

sylane commented Nov 30, 2018

I don't thing the changes in the first SPI transaction is relevant, because it happens all the time on all my routers. It sounds more like some time or random related configuration... What is the first SPI message send to the switch ?

@sylane
Copy link
Contributor Author

sylane commented Nov 30, 2018

I can confirm that on all my routers, beside the first transaction bytes 6th and 7th, the SPI transactions in dry run are identical for all runs.

@vladimiroltean
Copy link
Contributor

vladimiroltean commented Nov 30, 2018

First command ((0000): 80 00 0**1 1**0 58 9C 9E BF) is a buggy sja1105_inhibit_tx (ETH_PORT_CONTROL_ADDR = 0x11) - buggy because there's no memset of the buffer to zero - that's the difference we're looking at).
Second command is a reset ((0000): 81 00 **44 0**0 00 00 00 04).

After applying this patch:

From 70d3574145b8382dc11c680cbaa487997089c62e Mon Sep 17 00:00:00 2001
From: Vladimir Oltean <[email protected]>
Date: Fri, 30 Nov 2018 18:13:56 +0200
Subject: [PATCH] lib: fix uninitialized spi payload memory in
 sja1105_inhibit_tx

Signed-off-by: Vladimir Oltean <[email protected]>
---
 src/lib/control/port.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/lib/control/port.c b/src/lib/control/port.c
index 53cc30d..f78d67f 100644
--- a/src/lib/control/port.c
+++ b/src/lib/control/port.c
@@ -45,6 +45,7 @@ int sja1105_inhibit_tx(struct sja1105_spi_setup *spi_setup,
 	uint8_t packed_buf[BUF_LEN];
 	int i;
 
+	memset(packed_buf, 0, BUF_LEN);
 	for (i = 0; i < SJA1105T_NUM_PORTS; i++) {
 		gtable_pack(packed_buf, &port_mask->inhibit_tx[i], i, i, 4);
 	}
-- 
2.7.4

I can at least do the following:

[root@OpenIL:~]# sja1105-tool config upload > 1.dump
[root@OpenIL:~]# sja1105-tool config upload > 2.dump
[root@OpenIL:~]# diff -s 1.dump 2.dump 
Files 1.dump and 2.dump are identical

@sylane
Copy link
Contributor Author

sylane commented Nov 30, 2018

The data generated on the router that fail often and the router that almost never fail is the same (beside the first buggy transaction). So this is definitely something related to the switch state.

@vladimiroltean
Copy link
Contributor

You can try to insert delays between the sja1105_cold_reset operation and the static_config_upload if you believe it to be because the switch is in a state where it doesn't accept a new static config yet.

@sylane
Copy link
Contributor Author

sylane commented Nov 30, 2018

It fails even at speed 10000 so this doesn't seem related to speed.

@vladimiroltean
Copy link
Contributor

Could you try to apply the SPI timing violation patch first, before we examine more exotic possibilities?

@sylane
Copy link
Contributor Author

sylane commented Nov 30, 2018

Are you talking about this : nxp-archive/openil_linux@d23888f ?
If it is I am not sure how to apply it. I just built the image using OpenIL, I am no expert on buildroot.

@vladimiroltean
Copy link
Contributor

Just go to your buildroot tree, then cd output/build/linux-*, then patch -p1 < file.patch, then rm .stamp_built, then go back to the buildroot top dir and run make linux to produce an updated kernel image.

@sylane
Copy link
Contributor Author

sylane commented Nov 30, 2018

I applied the patch, removed .stamp_built called make linux then later on make (part of our makefile)... With the new image it seems the patch is not there:

[root@etsn1:~]# ls -la /proc/device-tree/soc/dspi@2100000/spidev@1/
total 0
-r--r--r--    1 root     root             4 Jan  1 00:04 #address-cells
-r--r--r--    1 root     root             4 Jan  1 00:04 #size-cells
drwxr-xr-x    2 root     root             0 Jan  1 00:04 .
drwxr-xr-x    6 root     root             0 Jan  1 00:04 ..
-r--r--r--    1 root     root            14 Jan  1 00:04 compatible
-r--r--r--    1 root     root             7 Jan  1 00:04 name
-r--r--r--    1 root     root             4 Jan  1 00:04 reg
-r--r--r--    1 root     root             4 Jan  1 00:04 spi-max-frequency

Not sure what I did wrong...

@sylane
Copy link
Contributor Author

sylane commented Nov 30, 2018

ok, just found out the command was for the wrong spi:

[root@etsn1:~]# ls -la /proc/device-tree/soc/dspi@2100000/spidev@0/ 
total 0
-r--r--r--    1 root     root             4 Jan  1 00:07 #address-cells
-r--r--r--    1 root     root             4 Jan  1 00:07 #size-cells
drwxr-xr-x    2 root     root             0 Jan  1 00:07 .
drwxr-xr-x    6 root     root             0 Jan  1 00:04 ..
-r--r--r--    1 root     root            14 Jan  1 00:07 compatible
-r--r--r--    1 root     root             4 Jan  1 00:07 fsl,spi-cs-sck-delay
-r--r--r--    1 root     root             4 Jan  1 00:07 fsl,spi-sck-cs-delay
-r--r--r--    1 root     root             7 Jan  1 00:07 name
-r--r--r--    1 root     root             4 Jan  1 00:07 reg
-r--r--r--    1 root     root             4 Jan  1 00:07 spi-max-frequency

@sylane
Copy link
Contributor Author

sylane commented Nov 30, 2018

With the patch applied it continue to fail the same way, with default speed or with 10000.

Is /proc/device-tree/soc/dspi@2100000/spidev@0 corresponding to configured /dev/spidev0.1 ?

@vladimiroltean
Copy link
Contributor

For me no:

[root@OpenIL:~]# ls -la /sys/bus/spi/devices/spi0.1/of_node
lrwxrwxrwx    1 root     root             0 Jan  1 00:01 /sys/bus/spi/devices/spi0.1/of_node -> ../../../../../../../firmware/devicetree/base/soc/dspi@2100000/spidev@1
[root@OpenIL:~]# ls -la /sys/bus/spi/devices/spi0.1/spidev/
total 0
drwxr-xr-x    3 root     root             0 Jan  1 00:01 .
drwxr-xr-x    5 root     root             0 Jan  1 00:01 ..
drwxr-xr-x    3 root     root             0 Jan  1 00:13 spidev0.1

@vladimiroltean
Copy link
Contributor

This is hardly a fix for the root cause, but could you let me know if the config upload process is at least more reliable with the following patch?

From 18418e7d67a1068cf3e5028ad0c005fe89183d60 Mon Sep 17 00:00:00 2001
From: Vladimir Oltean <[email protected]>
Date: Fri, 30 Nov 2018 23:23:23 +0200
Subject: [PATCH] tool: try to upload config 50 times to device before failing

Signed-off-by: Vladimir Oltean <[email protected]>
---
 src/tool/staging-area.c | 86 ++++++++++++++++++++++++++++---------------------
 1 file changed, 49 insertions(+), 37 deletions(-)

diff --git a/src/tool/staging-area.c b/src/tool/staging-area.c
index 6e1c3e7..a768887 100644
--- a/src/tool/staging-area.c
+++ b/src/tool/staging-area.c
@@ -286,7 +286,7 @@ int static_config_flush(struct sja1105_spi_setup *spi_setup,
 {
 	struct sja1105_general_status status;
 	struct sja1105_egress_port_mask port_mask;
-	int i, rc;
+	int i, rc, retries = 50;
 
 	rc = sja1105_static_config_check_valid(config);
 	if (rc < 0) {
@@ -308,17 +308,56 @@ int static_config_flush(struct sja1105_spi_setup *spi_setup,
 	 * follow, and that switch cold reset is thus safe
 	 */
 	usleep(1000);
-	/* Put the SJA1105 in programming mode */
-	rc = sja1105_cold_reset(spi_setup);
-	if (rc < 0) {
-		loge("sja1105_reset failed");
-		goto hardware_left_floating_error;
-	}
-	rc = static_config_upload(spi_setup, config);
-	if (rc < 0) {
-		loge("static_config_upload failed");
+	do {
+		/* Put the SJA1105 in programming mode */
+		rc = sja1105_cold_reset(spi_setup);
+		if (rc < 0) {
+			loge("Failed to reset switch");
+			continue;
+		}
+		/* Wait for the switch to come out of reset */
+		usleep(1000);
+		rc = static_config_upload(spi_setup, config);
+		if (rc < 0) {
+			loge("static_config_upload failed");
+			continue;
+		}
+		/* Check that SJA1105 responded well to the config upload */
+		if (spi_setup->dry_run == 0) {
+			/* These checks simply cannot pass (and do not even
+			 * make sense to have) if we are in dry run mode */
+			rc = sja1105_general_status_get(spi_setup, &status);
+			if (rc < 0)
+				continue;
+			if (status.ids == 1) {
+				loge("Mismatch between hardware and staging area "
+				     "device id. Wrote 0x%" PRIx64 ", wants 0x%" PRIx64,
+				     config->device_id, spi_setup->device_id);
+				continue;
+			}
+			if (status.crcchkl == 1) {
+				loge("local crc failed while uploading config");
+				continue;
+			}
+			if (status.crcchkg == 1) {
+				loge("global crc failed while uploading config");
+				continue;
+			}
+			if (status.configs == 0) {
+				loge("configuration is invalid");
+				continue;
+			}
+		}
+	} while (--retries && (spi_setup->dry_run == 0) &&
+	        (status.crcchkl == 1 || status.crcchkg == 1 ||
+	         status.configs == 0 || status.ids == 1));
+
+	if (!retries) {
+		rc = -EIO;
+		loge("Failed to upload config to device, giving up");
 		goto hardware_left_floating_error;
 	}
+
 	/* Configure the CGU (PHY link modes and speeds) */
 	rc = sja1105_clocking_setup(spi_setup, &config->xmii_params[0],
 	                           &config->mac_config[0]);
@@ -326,33 +365,6 @@ int static_config_flush(struct sja1105_spi_setup *spi_setup,
 		loge("sja1105_clocking_setup failed");
 		goto hardware_left_floating_error;
 	}
-	/* Check that SJA1105 responded well to the config upload */
-	if (spi_setup->dry_run == 0) {
-		/* These checks simply cannot pass (and do not even
-		 * make sense to have) if we are in dry run mode */
-		rc = sja1105_general_status_get(spi_setup, &status);
-		if (rc < 0) {
-			goto hardware_left_floating_error;
-		}
-		if (status.ids == 1) {
-			loge("Mismatch between hardware and staging area "
-			     "device id. Wrote 0x%" PRIx64 ", wants 0x%" PRIx64,
-			     config->device_id, spi_setup->device_id);
-			goto hardware_left_floating_error;
-		}
-		if (status.crcchkl == 1) {
-			loge("local crc failed while uploading config");
-			goto hardware_left_floating_error;
-		}
-		if (status.crcchkg == 1) {
-			loge("global crc failed while uploading config");
-			goto hardware_left_floating_error;
-		}
-		if (status.configs == 0) {
-			loge("configuration is invalid");
-			goto hardware_left_floating_error;
-		}
-	}
 	return SJA1105_ERR_OK;
 staging_area_invalid_error:
 	sja1105_err_remap(rc, SJA1105_ERR_STAGING_AREA_INVALID);
-- 
2.7.4

@sylane
Copy link
Contributor Author

sylane commented Dec 3, 2018

I'll test it. I would say it should work given that I am now calling the tool multiple times if it fails and as a result the prototype seems to work. This is similar but done in the tool itself.

@vladimiroltean
Copy link
Contributor

If it works it points to an unreliable communication channel over SPI. To confirm this is the case, I would suggest to run sja1105-tool reg 0 (read the device ID) in a loop overnight and record the number of times you get an output different than 0x00000000: 9e00030e.

@sylane
Copy link
Contributor Author

sylane commented Dec 3, 2018

I tested the patch. It seems to work, though it would be nice to log that the operation was successful otherwise it sounds like it failed when retrying:

[root@etsn1:~]# sja1105-tool config load -f config.error.xml; echo $?
0
[root@etsn1:~]# sja1105-tool config load -f config.error.xml; echo $?
local crc failed while uploading config
0
[root@etsn1:~]# sja1105-tool config load -f config.error.xml; echo $?
0
...
[root@etsn1:~]# sja1105-tool config load -f config.error.xml; echo $?
0
[root@etsn1:~]# sja1105-tool config load -f config.error.xml; echo $?
configuration is invalid
0
[root@etsn1:~]# sja1105-tool config load -f config.error.xml; echo $?
0
...
[root@etsn1:~]# sja1105-tool config load -f config.error.xml; echo $?
0
[root@etsn1:~]# sja1105-tool config load -f config.error.xml; echo $?
local crc failed while uploading config
local crc failed while uploading config
0
[root@etsn1:~]# sja1105-tool config load -f config.error.xml; echo $?
0
...
[root@etsn1:~]# sja1105-tool config load -f config.error.xml; echo $?
0
[root@etsn1:~]# sja1105-tool config load -f config.error.xml; echo $?
global crc failed while uploading config
0
[root@etsn1:~]# sja1105-tool config load -f config.error.xml; echo $?
0
...
[root@etsn1:~]# sja1105-tool config load -f config.error.xml; echo $?
0
[root@etsn1:~]# sja1105-tool config load -f config.error.xml; echo $?
local crc failed while uploading config
0
[root@etsn1:~]# sja1105-tool config load -f config.error.xml; echo $?
0
...
[root@etsn1:~]# sja1105-tool config load -f config.error.xml; echo $?
0
[root@etsn1:~]# sja1105-tool config load -f config.error.xml; echo $?
configuration is invalid
local crc failed while uploading config
0
[root@etsn1:~]# sja1105-tool config load -f config.error.xml; echo $?
0
[root@etsn1:~]# sja1105-tool config load -f config.error.xml; echo $?
0
[root@etsn1:~]# sja1105-tool config load -f config.error.xml; echo $?
global crc failed while uploading config
0
[root@etsn1:~]# sja1105-tool config load -f config.error.xml; echo $?
0
[root@etsn1:~]# sja1105-tool config load -f config.error.xml; echo $?
0
[root@etsn1:~]# sja1105-tool config load -f config.error.xml; echo $?
global crc failed while uploading config
0

@vladimiroltean
Copy link
Contributor

it would be nice to log that the operation was successful

I don't intend to merge this patch yet, as I'm not convinced that it's the correct fix for the problem you're having. It would be good if you could perform some integrity checking on your SPI link.

@sylane
Copy link
Contributor Author

sylane commented Dec 3, 2018

I'll run it over night to be sure, but for now I run it ~10k times without any error (talking about sja1105-tool reg 0).

@sylane
Copy link
Contributor Author

sylane commented Dec 5, 2018

So I run sja1105-tool reg 0 for a very long time and it always read the proper value. Any more idea on how to check the integrity of the SPI link ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants