Loading a configuration file is not always reliable #39
Comments
Sounds like either filesystem corruption or memory corruption. The latter is more likely, since I think we already went through the filesystem corruption possibility in a previous ticket and ruled it out. |
|
While running |
Netconf is not running, I had to remove it as it was making |
I looked again through the code, " |
We build the system using OpenIL
|
That doesn't explain why I have one router without any errors, one with very few, and a last one with a lot of them... Same software, same config... |
It should have gone into that release tag according to https://github.com/openil/linux/commits/OpenIL-linux-201808, I don't know where you have it. When I made the patch it was because of a weirdly behaving board during PTP commands; there seems to be some variability on the SPI interface. It would be good if we could at least rule that out. |
I tried with speed |
I randomly get |
Yes, I get that, those are all responses from the switch. |
I just tried |
Is there a way to dump the SPI traffic without doing a dry_run? This would be the only way to be sure what was sent over SPI when it fails. |
Ok, that's interesting, so we're back at the memory corruption hypothesis.
Only using a probe, or dumps in the spidev driver, I think. But there are some statistics under |
I don't think the change in the first SPI transaction is relevant, because it happens all the time on all my routers. It sounds more like some time-related or random-related configuration... What is the first SPI message sent to the switch? |
I can confirm that on all my routers, besides the 6th and 7th bytes of the first transaction, the SPI transactions in dry run are identical across all runs. |
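For anyone wanting to repeat that comparison mechanically, here is a minimal sketch, not taken from sja1105-tool. It assumes the dry-run output of two runs has already been captured into two byte buffers of equal length. The function name `first_diff_ignoring` and its parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Compare two captured SPI byte streams of equal length, ignoring the
 * half-open offset range [skip_from, skip_to). Returns the offset of the
 * first differing byte outside that range, or -1 if there is none.
 * All names here are hypothetical; nothing is taken from sja1105-tool.
 */
static long first_diff_ignoring(const uint8_t *a, const uint8_t *b, size_t len,
				size_t skip_from, size_t skip_to)
{
	size_t i;

	for (i = 0; i < len; i++) {
		if (i >= skip_from && i < skip_to)
			continue;	/* known-variable bytes, not compared */
		if (a[i] != b[i])
			return (long)i;
	}
	return -1;
}
```

Assuming "6th and 7th" is counted from 1, the skipped range would be offsets 5 and 6, i.e. `first_diff_ignoring(run1, run2, len, 5, 7)`. A return value of -1 then means the two runs match everywhere else.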
First command ( After applying this patch:
I can at least do the following:
|
The data generated on the router that fails often and on the router that almost never fails is the same (besides the first buggy transaction). So this is definitely something related to the switch state. |
You can try to insert delays between the |
It fails even at speed |
Could you try to apply the SPI timing violation patch first, before we examine more exotic possibilities? |
Are you talking about this : nxp-archive/openil_linux@d23888f ? |
Just go to your buildroot tree, then |
I applied the patch, removed
Not sure what I did wrong... |
ok, just found out the command was for the wrong spi:
|
With the patch applied it continues to fail the same way, with default speed or with Is |
For me no:
|
This is hardly a fix for the root cause, but could you let me know if the config upload process is at least more reliable with the following patch?

From 18418e7d67a1068cf3e5028ad0c005fe89183d60 Mon Sep 17 00:00:00 2001
From: Vladimir Oltean <[email protected]>
Date: Fri, 30 Nov 2018 23:23:23 +0200
Subject: [PATCH] tool: try to upload config 50 times to device before failing

Signed-off-by: Vladimir Oltean <[email protected]>
---
 src/tool/staging-area.c | 86 ++++++++++++++++++++++++++++---------------------
 1 file changed, 49 insertions(+), 37 deletions(-)

diff --git a/src/tool/staging-area.c b/src/tool/staging-area.c
index 6e1c3e7..a768887 100644
--- a/src/tool/staging-area.c
+++ b/src/tool/staging-area.c
@@ -286,7 +286,7 @@ int static_config_flush(struct sja1105_spi_setup *spi_setup,
 {
 	struct sja1105_general_status status;
 	struct sja1105_egress_port_mask port_mask;
-	int i, rc;
+	int i, rc, retries = 50;
 
 	rc = sja1105_static_config_check_valid(config);
 	if (rc < 0) {
@@ -308,17 +308,56 @@ int static_config_flush(struct sja1105_spi_setup *spi_setup,
 	 * follow, and that switch cold reset is thus safe
 	 */
 	usleep(1000);
-	/* Put the SJA1105 in programming mode */
-	rc = sja1105_cold_reset(spi_setup);
-	if (rc < 0) {
-		loge("sja1105_reset failed");
-		goto hardware_left_floating_error;
-	}
-	rc = static_config_upload(spi_setup, config);
-	if (rc < 0) {
-		loge("static_config_upload failed");
+	do {
+		/* Put the SJA1105 in programming mode */
+		rc = sja1105_cold_reset(spi_setup);
+		if (rc < 0) {
+			loge("Failed to reset switch");
+			continue;
+		}
+		/* Wait for the switch to come out of reset */
+		usleep(1000);
+		rc = static_config_upload(spi_setup, config);
+		if (rc < 0) {
+			loge("static_config_upload failed");
+			continue;
+		}
+		/* Check that SJA1105 responded well to the config upload */
+		if (spi_setup->dry_run == 0) {
+			/* These checks simply cannot pass (and do not even
+			 * make sense to have) if we are in dry run mode */
+			rc = sja1105_general_status_get(spi_setup, &status);
+			if (rc < 0)
+				continue;
+			if (status.ids == 1) {
+				loge("Mismatch between hardware and staging area "
+				     "device id. Wrote 0x%" PRIx64 ", wants 0x%" PRIx64,
+				     config->device_id, spi_setup->device_id);
+				continue;
+			}
+			if (status.crcchkl == 1) {
+				loge("local crc failed while uploading config");
+				continue;
+			}
+			if (status.crcchkg == 1) {
+				loge("global crc failed while uploading config");
+				continue;
+			}
+			if (status.configs == 0) {
+				loge("configuration is invalid");
+				continue;
+			}
+		}
+	} while (--retries && (spi_setup->dry_run == 0) &&
+		(status.crcchkl == 1 || status.crcchkg == 1 ||
+		 status.configs == 0 || status.ids == 1));
+
+	if (!retries) {
+		rc = -EIO;
+		loge("Failed to upload config to device, giving up");
 		goto hardware_left_floating_error;
 	}
+
 	/* Configure the CGU (PHY link modes and speeds) */
 	rc = sja1105_clocking_setup(spi_setup, &config->xmii_params[0],
 	                            &config->mac_config[0]);
@@ -326,33 +365,6 @@ int static_config_flush(struct sja1105_spi_setup *spi_setup,
 		loge("sja1105_clocking_setup failed");
 		goto hardware_left_floating_error;
 	}
-	/* Check that SJA1105 responded well to the config upload */
-	if (spi_setup->dry_run == 0) {
-		/* These checks simply cannot pass (and do not even
-		 * make sense to have) if we are in dry run mode */
-		rc = sja1105_general_status_get(spi_setup, &status);
-		if (rc < 0) {
-			goto hardware_left_floating_error;
-		}
-		if (status.ids == 1) {
-			loge("Mismatch between hardware and staging area "
-			     "device id. Wrote 0x%" PRIx64 ", wants 0x%" PRIx64,
-			     config->device_id, spi_setup->device_id);
-			goto hardware_left_floating_error;
-		}
-		if (status.crcchkl == 1) {
-			loge("local crc failed while uploading config");
-			goto hardware_left_floating_error;
-		}
-		if (status.crcchkg == 1) {
-			loge("global crc failed while uploading config");
-			goto hardware_left_floating_error;
-		}
-		if (status.configs == 0) {
-			loge("configuration is invalid");
-			goto hardware_left_floating_error;
-		}
-	}
 	return SJA1105_ERR_OK;
 staging_area_invalid_error:
 	sja1105_err_remap(rc, SJA1105_ERR_STAGING_AREA_INVALID);
--
2.7.4
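As a side note on the shape of such a retry loop: in C, `continue` inside a `do`/`while` jumps straight to the loop condition, so the condition should not depend on data that a failed iteration never filled in. Below is a minimal standalone sketch of a bounded retry that tracks success explicitly and logs when a later attempt succeeds. It is not the tool's code. `upload_attempt_fn`, `upload_with_retries` and `fail_n_times` are hypothetical names, and the callback merely stands in for one reset, upload and status-check cycle.

```c
#include <stdio.h>

/* Hypothetical callback: returns 0 when one reset + upload + status check
 * cycle succeeded, negative on any failure. */
typedef int (*upload_attempt_fn)(void *ctx);

/* Try up to max_attempts times. Returns the 1-based number of the attempt
 * that succeeded, or -1 if every attempt failed. */
static int upload_with_retries(upload_attempt_fn attempt, void *ctx,
			       int max_attempts)
{
	int i;

	for (i = 1; i <= max_attempts; i++) {
		if (attempt(ctx) == 0) {
			if (i > 1)
				fprintf(stderr,
					"config upload succeeded on attempt %d\n", i);
			return i;
		}
		fprintf(stderr, "config upload attempt %d/%d failed\n",
			i, max_attempts);
	}
	return -1;
}

/* Simulated attempt for demonstration only: fails until the counter
 * pointed to by ctx reaches zero, then succeeds. */
static int fail_n_times(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return -1;
	}
	return 0;
}
```

With this shape, success on the very last allowed attempt is still reported as success, and a failed reset or upload can never fall through to the next configuration step.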
I'll test it. I would say it should work given that I am now calling the tool multiple times if it fails and as a result the prototype seems to work. This is similar but done in the tool itself. |
If it works it points to an unreliable communication channel over SPI. To confirm this is the case, I would suggest to run |
I tested the patch. It seems to work, though it would be nice to log that the operation was successful; otherwise it looks as if it failed when retrying:
|
I don't intend to merge this patch yet, as I'm not convinced that it's the correct fix for the problem you're having. It would be good if you could perform some integrity checking on your SPI link. |
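One possible form of such an integrity check, sketched under assumptions: read a value that should never change (for example a fixed identification register) many times over the link and count how often the result differs from the expected one. Nothing below is sja1105-tool API. `read_word_fn`, `check_link`, `struct link_stats` and the simulated `fake_read` are hypothetical, and the word `0x12345678` is an arbitrary placeholder, not a real register value.

```c
#include <stdint.h>

/* Hypothetical read callback: stores one 32-bit word read over SPI in
 * *value and returns 0, or returns a negative value on an I/O error. */
typedef int (*read_word_fn)(void *ctx, uint32_t *value);

struct link_stats {
	unsigned long reads;		/* read attempts issued */
	unsigned long io_errors;	/* reads that returned an error */
	unsigned long mismatches;	/* reads that returned the wrong word */
};

/* Read the same constant word repeatedly and count bad results. */
static struct link_stats check_link(read_word_fn read_word, void *ctx,
				    uint32_t expected, unsigned long iterations)
{
	struct link_stats st = { 0, 0, 0 };
	unsigned long i;

	for (i = 0; i < iterations; i++) {
		uint32_t value = 0;

		st.reads++;
		if (read_word(ctx, &value) < 0) {
			st.io_errors++;
			continue;
		}
		if (value != expected)
			st.mismatches++;
	}
	return st;
}

/* Simulated link for demonstration: flips one bit on every
 * corrupt_every-th read (0 disables corruption). */
struct fake_link {
	unsigned long calls;
	unsigned long corrupt_every;
	uint32_t word;
};

static int fake_read(void *ctx, uint32_t *value)
{
	struct fake_link *f = ctx;

	f->calls++;
	*value = f->word;
	if (f->corrupt_every && f->calls % f->corrupt_every == 0)
		*value ^= 0x1;
	return 0;
}
```

A nonzero mismatch count over a long run would support the unreliable-link hypothesis. A clean run over many thousands of reads makes it less likely, although reads alone do not exercise long write bursts such as a config upload.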
I'll run it overnight to be sure, but so far I have run it ~10k times without any error (talking about |
So I run |
When trying to load the configuration file config.error.xml on one of my LS1021ATSN devkits, sja1105-tool randomly fails:
Rebooting the router did not change anything.
The same config loads fine on my two other routers, which have the exact same software installed.
The error descriptions suggest that this has nothing to do with the switch fabric and that it is a validation error. What could cause this?
Thank you for your help.
EDIT 1:
Sometimes I get this error too:
EDIT 2:
I got similar errors, less often, on the other routers with some other configuration file.