Fix diagnostic messages #937

Closed
valegagge opened this issue Jan 29, 2024 · 22 comments · Fixed by #941 or #942

@valegagge
Member

Bug description

In the log reported here, I noticed some weird diagnostic messages that might be worth investigating further.

  1. 426.026906 <ERROR> from BOARD 10.0.1.8 (left_leg-eb8-j0_3) time=467s 485m 342u : MC: generic motor error. (Error is a000000) .==> Here the joint ID is missing

  2. 418.960928 <WARNING> from BOARD 10.0.1.2 (left_arm-eb2-j0_1) time=460s 417m 341u : SYS: a service has detected that some CAN boards are not broacasting anymore. Type of service category is eomn_serv_category_temperatures. Lost can boards on (can1map, can2map) = ([ 13 ], [ ] ). ==> Why temperature? it was disabled. And what is the address 13?

  3. 420.013520 <ERROR> from BOARD 10.0.1.2 (left_arm-eb2-j0_1) time=461s 466m 194u : SYS: a service has detected that some CAN boards have stopped transmission. Type of service category is eomn_serv_category_all. Lost CAN boards are on (can1map, can2map) = ([ 2 ], [ ]). Time since last contact: 0 [ms]. ==> Why 0 ms?

Steps to reproduce

See the ticket robotology/icub-tech-support#1726

Expected behavior

No response

Example repository

No response

Additional context

No response

@pattacini
Member

pattacini commented Jan 30, 2024

A bug related to diagnostic is also:

Just mentioning to create a xref.

@MSECode
Contributor

MSECode commented Feb 5, 2024

Regarding the errors:

  • Point 3 is easy to fix --> I've seen that in the EMS code we are not passing some of the info that the icub-main side tries to read. As shown by the screenshots, we are not adding the timing information to the last significant bits of the par64, as is done for other errors:

image

but we are just passing the masks of the CAN boards that are justLOST:

image

so we just need to add that piece of information to the par64, since mspassed can be calculated for all the cases.
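
A minimal sketch of the kind of packing being discussed, written in Python for brevity (the firmware and icub-main are C++). The bit layout is an assumption made for illustration: the elapsed milliseconds in the upper 32 bits and two 16-bit CAN board maps in the lower 32 bits, with CAN1 in the lowest half. The function names are hypothetical and do not come from the code base.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def pack_par64(mspassed, can1map, can2map):
    """Pack elapsed ms (assumed upper 32 bits) and two 16-bit board maps."""
    return ((mspassed & MASK32) << 32) | ((can2map & MASK16) << 16) | (can1map & MASK16)


def unpack_par64(par64):
    """Reverse of pack_par64: returns (mspassed, can1map, can2map)."""
    mspassed = (par64 >> 32) & MASK32
    can2map = (par64 >> 16) & MASK16
    can1map = par64 & MASK16
    return mspassed, can1map, can2map


def boards_in(canmap):
    """List the CAN addresses whose bit is set in a 16-bit map."""
    return [adr for adr in range(16) if canmap & (1 << adr)]
```

Under this assumed layout, a map with only bit 2 set decodes to the address list `[2]`, which is the form printed in the diagnostic messages quoted above.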

@MSECode
Contributor

MSECode commented Feb 5, 2024

Point 2 instead is quite weird, since it is related to this message:
eocanprotASperiodic_parser_PER_AS_MSG__THERMOMETER_MEASURE, which should not be related to any board, since I suppose no board currently uses this feature.
But, supposing we are initializing something wrongly, that error is related to an overrun on the CAN bus carrying temperature data, i.e. eoerror_value_AS_arrayoftemperaturedataoverflow

@MSECode
Contributor

MSECode commented Feb 5, 2024

For point 1, we just need to parse the par16 at high level, since it contains the joint ID, and it would be nice to make the mask in the par64 human-readable. For the generic error we are just passing the whole fault_state.bitmask to the par64, which is a uint32 made of this:

image

so we can create an ad-hoc function to check which bits are set to 1
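
A rough sketch of such a helper, in Python for brevity. The meaning of each bit of fault_state.bitmask is only visible in the screenshot, so the sketch returns bit positions and accepts an optional, hypothetical name table rather than inventing flag names.

```python
def set_bits(bitmask, width=32):
    """Return the positions of the bits set to 1, lowest first."""
    return [bit for bit in range(width) if (bitmask >> bit) & 1]


def describe_fault(bitmask, names=None):
    """Map each set bit to a label; unknown bits fall back to 'bitN'.

    `names` is a hypothetical {bit position: label} table supplied by the caller.
    """
    names = names or {}
    return [names.get(bit, f"bit{bit}") for bit in set_bits(bitmask)]
```

For the value printed in the first log line (a000000), `set_bits(0x0a000000)` gives positions 25 and 27.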

@MSECode
Contributor

MSECode commented Feb 8, 2024

Regarding again the diagnostic error at point 2:
Probably the error is related to the fact that the EMS board in question already had other errors in discovering other CAN boards, as shown in the image below. As you can see, on the CAN1 of the ems eb2, where the strain is attached, we cannot transmit anything on the CAN.

image

Another thing I'm wondering is whether the temperature service related to the embObjFTsensor class is always initialized when using the FT sensors. Below is an image showing the line of code I'd like to check more closely:

image

@MSECode
Contributor

MSECode commented Feb 12, 2024

UPDATE

Point 3:

420.013520 <ERROR> from BOARD 10.0.1.2 (left_arm-eb2-j0_1) time=461s 466m 194u : 
SYS: a service has detected that some CAN boards have stopped transmission.
Type of service category is eomn_serv_category_all. Lost CAN boards are on (can1map, can2map) = ([ 2  ], [  ]). 
Time since last contact: 0 [ms]

We need to remove the high-level parsing of the time since last contact, since it does not make sense to ask for that time when the boards have just been lost: it would basically be zero. So it is correct that the time interval is not passed for that specific error at fw level.
Instead, we realized that at software level we are not parsing the time since last contact for the warning eoerror_value_SYS_canservices_monitor_retrievedcontact, which is triggered when the boards are touched again after being missed. So we need to update the software part of the diagnostic for that warning, adding both the boards' CAN mapping and the time data.

Point 2:

418.960928 <WARNING> from BOARD 10.0.1.2 (left_arm-eb2-j0_1) time=460s 417m 341u :
 SYS: a service has detected that some CAN boards are not broacasting anymore. 
Type of service category is eomn_serv_category_temperatures. 
Lost can boards on (can1map, can2map) = ([ 13  ], [  ] )

Here, one of the problems is that the CAN mapping should be switched at software level, since the parsing is done wrongly.
Other than that, we are checking the instantiation of the services. We need to check whether it is done as expected.

Branches where we are keeping the development are:

@MSECode
Contributor

MSECode commented Feb 13, 2024

UPDATE 2

Point 1

Solved: the ID is now shown. Working on improving the management of the generic error so that it can be fully readable by the user.

Point 2

Under investigation. Need to understand whether the services are initialized correctly.

Point 3

It should now be fully solved in all its parts, and the fixes are available here:

Problems related to that error can be summarized as follows:

  1. As already shown, for the various "error" conditions (info, warning and error) of the CANmonitor, different info was passed to the higher level, but the icub-main code was treating all cases in the same way. First, if we have no problem in reading the boards, or if we are just now losing the transmission with them, i.e. if we are in the conditions State::OK or State::justLOST, there's no reason to send up the time from last contact, which will always be close to the tick() time of the CANmonitor (usually close to 1 ms). So we fixed the software part in icub-main to parse the mspassed only for the conditions State::justFOUND and State::stillLOST, when it is actually meaningful.
  2. Then, another problem was that the service category was not displayed correctly in the message. This was due to a problem with the array s_mn_servicecategory_strings.
  3. Finally, we had a couple of issues with the CANmonitor. First of all, the constexpr Config, responsible for taking the custom configuration parameters for the CANmonitor as input and updating its private variables with those values, was missing the update of the private variable periodofreport. Therefore, even if we modified its value, the CANmonitor object was always set with the default value of 10 s.
  4. The last problem was the incorrect updating of the CAN mask for boards2report, whose values were always zero when the boolean forcereport was set to false, as for the states OK and stillLOST. This was due to the following situation:
    If forcereport is set to false, the CANmonitor sends the report for the states only if regularreportnow is set to true (assuming that our report mode is Report::ALL). The boolean regularreportnow is set to true only when (timeoflastreport + _config.periodofreport) < timenow. Moreover, in the old code the MAP boards2report was re-initialized at each tick() of the CANmonitor (which happens every 1 ms). Furthermore, we enter the state-machine part where the variable boards2report is updated if and only if the boolean flag checknow is true, and this happens only when (timeoflastcheck + _config.periodofcheck) < timenow.
    All that said, the CAN masks were always zero because, without forcing the report at each periodofcheck, at the moment in which regularreportnow goes to true the other boolean checknow is false, because of timing conditions due to the tick() being executed every 1 ms. So boards2report is not updated in that tick and, since it is re-initialized at each tick(), the variable is always sent as zero.

NOTE:
The timing conditions are explained here. Suppose that we have the following conditions:

  • the tick() runs every 1 ms
  • periodofcheck is 250 ms
  • periodofreport is 5 s

So, at the moment in which we are ready for the report, the current time will be 5001 ms, which does not coincide with any of the instants, spaced by about 250 ms, at which the boolean checknow is set to true.
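
The timing argument above can be reproduced with a small simulation. This is a minimal sketch in Python (the CANmonitor itself is C++ firmware), assuming the old tick() behaved as described in this comment: the map is cleared on every tick, refreshed only when the check period has expired, and sent only when the report period has expired. The `keep_between_ticks` switch is a hypothetical stand-in for a fix and is not taken from the actual patch.

```python
def simulate(total_ms, lost_mask, keep_between_ticks,
             period_check=250, period_report=5000):
    """Run a 1 ms tick loop and return the (time, mask) of every regular report."""
    time_last_check = 0
    time_last_report = 0
    boards2report = 0
    reports = []
    for now in range(1, total_ms + 1):
        if not keep_between_ticks:
            boards2report = 0  # old behaviour: map re-initialized at each tick
        checknow = (time_last_check + period_check) < now
        if checknow:
            time_last_check = now
            boards2report = lost_mask  # state machine refreshes the map
        regularreportnow = (time_last_report + period_report) < now
        if regularreportnow:
            time_last_report = now
            reports.append((now, boards2report))
    return reports


old = simulate(30000, lost_mask=0x4, keep_between_ticks=False)
new = simulate(30000, lost_mask=0x4, keep_between_ticks=True)
```

With these periods the checks land every 251 ms and the reports every 5001 ms, so within 30 s the two never fall on the same tick: every report in `old` carries a zero mask, while every report in `new` carries 0x4.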

Here is some text and images showing the difference before and after the modifications, with the timing printed in the TRACE of the debugger:

BEFORE

Touched the boards @ S31:m752:u993 with timeoflastcheck: S31:m752:u935
Checked now the boards @ S32:m3:u935 with timeoflastcheck: S31:m752:u935
Touched the boards @ S32:m3:u992 with timeoflastcheck: S32:m3:u935
Checked now the boards @ S32:m254:u935 with timeoflastcheck: S32:m3:u935
Touched the boards @ S32:m254:u993 with timeoflastcheck: S32:m254:u935
Checked now the boards @ S32:m505:u935 with timeoflastcheck: S32:m254:u935
Touched the boards @ S32:m505:u993 with timeoflastcheck: S32:m505:u935
Regular report on the boards @ S32:m680:u935 with timeoflastreport: S27:m679:u935
[INFO] (theBATservice tsk11 @S32:m680:u0)-> {0x3c, p16 0x0009, p64 0x0000000000000000, dev 0, adr 9}: SYS: a service has verified that the TX of its CAN boards is regular..
Reported the boards @ S32:m681:u120 with timeoflastreport: S32:m680:u935
Checked now the boards @ S32:m756:u935 with timeoflastcheck: S32:m505:u935
Touched the boards @ S32:m756:u993 with timeoflastcheck: S32:m756:u935
Checked now the boards @ S33:m7:u935 with timeoflastcheck: S32:m756:u935
Touched the boards @ S33:m7:u992 with timeoflastcheck: S33:m7:u935
Checked now the boards @ S33:m258:u935 with timeoflastcheck: S33:m7:u935
Touched the boards @ S33:m258:u993 with timeoflastcheck: S33:m258:u935
Checked now the boards @ S33:m509:u935 with timeoflastcheck: S33:m258:u935
Touched the boards @ S33:m509:u993 with timeoflastcheck: S33:m509:u935
Checked now the boards @ S33:m760:u935 with timeoflastcheck: S33:m509:u935
Touched the boards @ S33:m760:u993 with timeoflastcheck: S33:m760:u935
Checked now the boards @ S34:m11:u935 with timeoflastcheck: S33:m760:u935
Touched the boards @ S34:m11:u993 with timeoflastcheck: S34:m11:u935
Checked now the boards @ S34:m262:u935 with timeoflastcheck: S34:m11:u935
Touched the boards @ S34:m262:u993 with timeoflastcheck: S34:m262:u935
Checked now the boards @ S34:m513:u935 with timeoflastcheck: S34:m262:u935
Touched the boards @ S34:m513:u993 with timeoflastcheck: S34:m513:u935
Checked now the boards @ S34:m764:u935 with timeoflastcheck: S34:m513:u935
Touched the boards @ S34:m764:u993 with timeoflastcheck: S34:m764:u935
Checked now the boards @ S35:m15:u935 with timeoflastcheck: S34:m764:u935
Touched the boards @ S35:m15:u993 with timeoflastcheck: S35:m15:u935
Checked now the boards @ S35:m266:u935 with timeoflastcheck: S35:m15:u935
Touched the boards @ S35:m266:u993 with timeoflastcheck: S35:m266:u935
Checked now the boards @ S35:m517:u935 with timeoflastcheck: S35:m266:u935
Touched the boards @ S35:m517:u993 with timeoflastcheck: S35:m517:u935
Checked now the boards @ S35:m768:u935 with timeoflastcheck: S35:m517:u935
Touched the boards @ S35:m768:u993 with timeoflastcheck: S35:m768:u935
Checked now the boards @ S36:m19:u935 with timeoflastcheck: S35:m768:u935
Touched the boards @ S36:m19:u993 with timeoflastcheck: S36:m19:u935
Checked now the boards @ S36:m270:u935 with timeoflastcheck: S36:m19:u935
Touched the boards @ S36:m270:u993 with timeoflastcheck: S36:m270:u935
Checked now the boards @ S36:m521:u935 with timeoflastcheck: S36:m270:u935
Touched the boards @ S36:m521:u994 with timeoflastcheck: S36:m521:u935
Checked now the boards @ S36:m772:u935 with timeoflastcheck: S36:m521:u935
Touched the boards @ S36:m772:u994 with timeoflastcheck: S36:m772:u935
Checked now the boards @ S37:m23:u935 with timeoflastcheck: S36:m772:u935
Touched the boards @ S37:m23:u993 with timeoflastcheck: S37:m23:u935
Checked now the boards @ S37:m274:u935 with timeoflastcheck: S37:m23:u935
Touched the boards @ S37:m274:u993 with timeoflastcheck: S37:m274:u935
Checked now the boards @ S37:m525:u935 with timeoflastcheck: S37:m274:u935
Touched the boards @ S37:m525:u994 with timeoflastcheck: S37:m525:u935
Regular report on the boards @ S37:m681:u935 with timeoflastreport: S32:m680:u935
[INFO] (theBATservice tsk11 @S37:m681:u0)-> {0x3c, p16 0x0009, p64 0x0000000000000000, dev 0, adr 9}: SYS: a service has verified that the TX of its CAN boards is regular..
Reported the boards @ S37:m682:u120 with timeoflastreport: S37:m681:u935
Checked now the boards @ S37:m776:u935 with timeoflastcheck: S37:m525:u935
Touched the boards @ S37:m776:u994 with timeoflastcheck: S37:m776:u935
Checked now the boards @ S38:m27:u935 with timeoflastcheck: S37:m776:u935
Touched the boards @ S38:m27:u993 with timeoflastcheck: S38:m27:u935
Checked now the boards @ S38:m278:u935 with timeoflastcheck: S38:m27:u935
Touched the boards @ S38:m278:u993 with timeoflastcheck: S38:m278:u935
Checked now the boards @ S38:m529:u935 with timeoflastcheck: S38:m278:u935
Touched the boards @ S38:m529:u994 with timeoflastcheck: S38:m529:u935
Checked now the boards @ S38:m780:u935 with timeoflastcheck: S38:m529:u935
Touched the boards @ S38:m780:u994 with timeoflastcheck: S38:m780:u935
Checked now the boards @ S39:m31:u935 with timeoflastcheck: S38:m780:u935
Touched the boards @ S39:m31:u993 with timeoflastcheck: S39:m31:u935
Checked now the boards @ S39:m282:u935 with timeoflastcheck: S39:m31:u935
Touched the boards @ S39:m282:u993 with timeoflastcheck: S39:m282:u935
Checked now the boards @ S39:m533:u935 with timeoflastcheck: S39:m282:u935
Touched the boards @ S39:m533:u994 with timeoflastcheck: S39:m533:u935
Checked now the boards @ S39:m784:u935 with timeoflastcheck: S39:m533:u935
Touched the boards @ S39:m784:u994 with timeoflastcheck: S39:m784:u935
Checked now the boards @ S40:m35:u935 with timeoflastcheck: S39:m784:u935
Touched the boards @ S40:m35:u993 with timeoflastcheck: S40:m35:u935
Checked now the boards @ S40:m286:u935 with timeoflastcheck: S40:m35:u935
Touched the boards @ S40:m286:u993 with timeoflastcheck: S40:m286:u935
Checked now the boards @ S40:m537:u935 with timeoflastcheck: S40:m286:u935
Touched the boards @ S40:m537:u994 with timeoflastcheck: S40:m537:u935

image (12)

AFTER

Checked now the boards @ S55:m303:u875 with timeoflastcheck: S55:m52:u875
Checked now the boards @ S55:m554:u875 with timeoflastcheck: S55:m303:u875
Regular report on the boards @ S55:m748:u875 with timeoflastreport: S50:m747:u875
[ERROR] (theBATservice tsk11 @S55:m748:u940)-> {0x3e, p16 0x0009, p64 0x00003a9b00000004, dev 0, adr 9}: SYS: a service has detected that some CAN boards are still not transmitting..
Checked now the boards @ S55:m805:u875 with timeoflastcheck: S55:m554:u875
Checked now the boards @ S56:m56:u875 with timeoflastcheck: S55:m805:u875
Checked now the boards @ S56:m307:u875 with timeoflastcheck: S56:m56:u875
Checked now the boards @ S56:m558:u875 with timeoflastcheck: S56:m307:u875
Checked now the boards @ S56:m809:u875 with timeoflastcheck: S56:m558:u875
Checked now the boards @ S57:m60:u875 with timeoflastcheck: S56:m809:u875
Checked now the boards @ S57:m311:u875 with timeoflastcheck: S57:m60:u875
Checked now the boards @ S57:m562:u875 with timeoflastcheck: S57:m311:u875
Checked now the boards @ S57:m813:u875 with timeoflastcheck: S57:m562:u875
Checked now the boards @ S58:m64:u875 with timeoflastcheck: S57:m813:u875
Checked now the boards @ S58:m315:u875 with timeoflastcheck: S58:m64:u875
Checked now the boards @ S58:m566:u875 with timeoflastcheck: S58:m315:u875
Checked now the boards @ S58:m817:u875 with timeoflastcheck: S58:m566:u875
Checked now the boards @ S59:m68:u875 with timeoflastcheck: S58:m817:u875
Checked now the boards @ S59:m319:u875 with timeoflastcheck: S59:m68:u875
Checked now the boards @ S59:m570:u875 with timeoflastcheck: S59:m319:u875
Checked now the boards @ S59:m821:u875 with timeoflastcheck: S59:m570:u875
Checked now the boards @ S60:m72:u875 with timeoflastcheck: S59:m821:u875
Checked now the boards @ S60:m323:u875 with timeoflastcheck: S60:m72:u875
Checked now the boards @ S60:m574:u875 with timeoflastcheck: S60:m323:u875
Regular report on the boards @ S60:m749:u875 with timeoflastreport: S55:m748:u875
[ERROR] (theBATservice tsk11 @S60:m749:u940)-> {0x3e, p16 0x0009, p64 0x00004e2400000004, dev 0, adr 9}: SYS: a service has detected that some CAN boards are still not transmitting..
Checked now the boards @ S60:m825:u875 with timeoflastcheck: S60:m574:u875
Checked now the boards @ S61:m76:u875 with timeoflastcheck: S60:m825:u875
Checked now the boards @ S61:m327:u875 with timeoflastcheck: S61:m76:u875
Checked now the boards @ S61:m578:u875 with timeoflastcheck: S61:m327:u875
Checked now the boards @ S61:m829:u875 with timeoflastcheck: S61:m578:u875
Checked now the boards @ S62:m80:u875 with timeoflastcheck: S61:m829:u875
Checked now the boards @ S62:m331:u875 with timeoflastcheck: S62:m80:u875
Checked now the boards @ S62:m582:u875 with timeoflastcheck: S62:m331:u875
Checked now the boards @ S62:m833:u875 with timeoflastcheck: S62:m582:u875
Checked now the boards @ S63:m84:u875 with timeoflastcheck: S62:m833:u875
[WARNING] (theBATservice tsk11 @S63:m84:u939)-> {0x3f, p16 0x0009, p64 0x0000574300000004, dev 0, adr 9}: SYS: a service has recovered all CAN boards that were not transmitting..
Checked now the boards @ S63:m335:u875 with timeoflastcheck: S63:m84:u875
Touched the boards @ S63:m335:u934 with timeoflastcheck: S63:m335:u875
Checked now the boards @ S63:m586:u875 with timeoflastcheck: S63:m335:u875
Touched the boards @ S63:m586:u934 with timeoflastcheck: S63:m586:u875
Checked now the boards @ S63:m837:u875 with timeoflastcheck: S63:m586:u875
Touched the boards @ S63:m837:u934 with timeoflastcheck: S63:m837:u875
Checked now the boards @ S64:m88:u875 with timeoflastcheck: S63:m837:u875
Touched the boards @ S64:m88:u933 with timeoflastcheck: S64:m88:u875
Checked now the boards @ S64:m339:u875 with timeoflastcheck: S64:m88:u875
Touched the boards @ S64:m339:u934 with timeoflastcheck: S64:m339:u875
Checked now the boards @ S64:m590:u875 with timeoflastcheck: S64:m339:u875
Touched the boards @ S64:m590:u934 with timeoflastcheck: S64:m590:u875
Checked now the boards @ S64:m841:u875 with timeoflastcheck: S64:m590:u875
Touched the boards @ S64:m841:u934 with timeoflastcheck: S64:m841:u875
Checked now the boards @ S65:m92:u875 with timeoflastcheck: S64:m841:u875
Touched the boards @ S65:m92:u933 with timeoflastcheck: S65:m92:u875
Checked now the boards @ S65:m343:u875 with timeoflastcheck: S65:m92:u875
Touched the boards @ S65:m343:u934 with timeoflastcheck: S65:m343:u875
Checked now the boards @ S65:m594:u875 with timeoflastcheck: S65:m343:u875
Touched the boards @ S65:m594:u934 with timeoflastcheck: S65:m594:u875
Checked now the boards @ S65:m845:u875 with timeoflastcheck: S65:m594:u875
Touched the boards @ S65:m845:u934 with timeoflastcheck: S65:m845:u875

image (11)
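
To read the intervals out of the traces above without doing the subtraction by hand, a small helper can convert the `S..:m..:u..` stamps to microseconds. This is a sketch in Python; the function names are hypothetical, and the stamp format is inferred from the trace lines as seconds, milliseconds and microseconds.

```python
import re

_STAMP = re.compile(r"S(\d+):m(\d+):u(\d+)")


def stamp_to_us(stamp):
    """Convert a trace stamp such as 'S32:m680:u935' to microseconds."""
    match = _STAMP.fullmatch(stamp)
    if match is None:
        raise ValueError(f"not a trace timestamp: {stamp!r}")
    sec, msec, usec = (int(group) for group in match.groups())
    return sec * 1_000_000 + msec * 1_000 + usec


def elapsed_ms(later, earlier):
    """Milliseconds between two trace stamps."""
    return (stamp_to_us(later) - stamp_to_us(earlier)) / 1000
```

Applied to the BEFORE trace, two consecutive checks (S31:m752:u935 and S32:m3:u935) are 251 ms apart, and the two regular reports (S27:m679:u935 and S32:m680:u935) are 5001 ms apart.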

cc: @valegagge

@valegagge
Member Author

@MSECode well done!!! 🚀

@valegagge
Member Author

We still need to perform the tests on the setup with two boards connected to the same CAN port (i.e. strain +bat).

@MSECode
Contributor

MSECode commented Feb 14, 2024

UPDATE 3:

Point 2

regarding the error at point 2, i.e.:

418.960928 <WARNING> from BOARD 10.0.1.2 (left_arm-eb2-j0_1) time=460s 417m 341u : SYS: a service has detected that some CAN boards are not broacasting anymore. Type of service category is eomn_serv_category_temperatures. Lost can boards on (can1map, can2map) = ([ 13  ], [  ] ). ==> Why temperature? it was disabled. And what is the address 13?

Trying to reproduce it with a setup basically equivalent to the ems eb1 of ergoCub SN00x, where we have the strain2c board and the BAT on CAN2 and the 2foc on CAN1, I realized that the aforementioned problem was related to the typo already found in the array used for converting the service category from the enum to the equivalent string.
Because of it, as shown in this line fixed the other day, we were picking the wrong index of the array for a specific enum value, and the categories eomn_serv_category_inertials3 (the one instantiated for the strain2c) and eomn_serv_category_temperatures (the one wrongly shown in the warning message) are one after the other, as shown here:

image

So the service eomn_serv_category_inertials3 was correctly instantiated, but we were translating the category enum wrongly when received from the fw.
Then, the CAN address 13 (the default for the strain) appearing on the CAN1 mask was due to the fact that the masks were swapped in the icub-main sw.
As you can see from the image below, the warning message is now shown correctly.
I got it by disconnecting the CAN of the strain2c from the ems just for the purpose of the test. Otherwise it works fine.

image

Therefore I would say that this point is solved too.
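
The enum-to-string mix-up described in this comment can be illustrated with a toy table. This is only a sketch in Python: the real contents of s_mn_servicecategory_strings and the exact typo are not shown in this thread, so the three-entry enum, its ordering and the faulty table below are all hypothetical. The `check_table` helper shows one way such a misalignment could be caught automatically.

```python
from enum import IntEnum

PREFIX = "eomn_serv_category_"


class Category(IntEnum):
    # Hypothetical subset; only the adjacency of the first two matters here.
    inertials3 = 0
    temperatures = 1
    ft = 2


def build_table(names):
    """Build a string table indexed by enum value."""
    return [PREFIX + name for name in names]


GOOD = build_table(c.name for c in Category)
# Hypothetical typo: the slot of inertials3 holds the neighbouring string.
BUGGY = build_table(["temperatures", "temperatures", "ft"])


def category_to_string(table, value):
    """Look up the string for an enum value, guarding against bad indexes."""
    return table[value] if 0 <= value < len(table) else PREFIX + "unknown"


def check_table(table):
    """Return the enum members whose string does not match their own name."""
    return [c for c in Category if category_to_string(table, c) != PREFIX + c.name]
```

With the faulty table, a service instantiated as inertials3 is printed as eomn_serv_category_temperatures, which is the symptom seen in the warning, and `check_table(BUGGY)` flags exactly that entry.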

@MSECode
Contributor

MSECode commented Feb 14, 2024

Add a space in the error for boards missing during CAN discovery, since the CAN address is attached to the "board 1 of X" number and is difficult to read at first glance.

@valegagge
Member Author

Just another minor improvement: from the image above, the info message "CAN board is regular" is transmitted with a short period. I guess you did that for testing purposes, but we shouldn't forget to set it back to 10 or even 20 seconds.

Now we are sure that when we lose a CAN board the error message is triggered immediately, aren't we?

@MSECode
Contributor

MSECode commented Feb 15, 2024

Reset to 10 seconds. It looks reasonable to me.

Regarding this point: "Now we are sure that when we lose a CAN board the error message is triggered immediately, aren't we?" Yes, we are sure about that, because for the states justFOUND and justLOST we set forcereport to TRUE in the code:

image
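
The reporting rules mentioned across this thread can be summarized in a small table-driven sketch. It is written in Python for brevity and is not the firmware code: the state names come from the comments above, while the function names are hypothetical and the report mode is assumed to be Report::ALL.

```python
from enum import Enum


class State(Enum):
    OK = "OK"
    justLOST = "justLOST"
    stillLOST = "stillLOST"
    justFOUND = "justFOUND"


# States for which the thread says forcereport is set to TRUE.
FORCE_REPORT = {State.justLOST, State.justFOUND}
# States for which the thread says the time since last contact is meaningful.
HAS_MSPASSED = {State.stillLOST, State.justFOUND}


def must_report(state, regularreportnow):
    """A report goes out at once for forced states, else only on the regular period."""
    return state in FORCE_REPORT or regularreportnow


def carries_mspassed(state):
    """Whether the high-level parser should print the time since last contact."""
    return state in HAS_MSPASSED
```

So a board that has just been lost is reported without waiting for periodofreport, while OK and stillLOST are reported only when the regular period expires.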

@MSECode
Contributor

MSECode commented Feb 16, 2024

Opened all PRs:

@valegagge
Member Author

Ciao @MSECode ,
after the changes required in iCub-main PR, we can go on with the tests.

For the final tests on the diagnostic, we can do the following actions on the setup you created (EMS + battery + strain + 2FOC).

  1. start YRI ==> all is ok
  2. remove the strain
  3. reattach the strain
  4. disconnect strain and battery
  5. reconnect strain and battery
  6. disconnect strain (or battery) and 2foc
  7. reconnect strain and 2foc

Please attach the YRI log here. Thank you!
I'm not available for the next two days, but I'll check the issue, or we can quickly catch up at lunch time.

What do you think?

@MSECode
Contributor

MSECode commented Feb 19, 2024

@valegagge
Here is the output of the tests:

  1. Start YRI --> All ok (the error is related to the fact that I request the FT service to be active without any sensor attached to the strain2c):
    image

  2. Removing the strain2c from the CAN we can see that we cannot find anymore the CAN boards related to both FT and inertials3 services, while the battery service is still ok:
    image

  3. Reconnecting the strain, nothing changes --> the transmission continues to fail. The discovery cannot be refreshed until YRI is restarted.

  4. If we disconnect the battery first, we can see from the logs that we are losing the strain too. However, when we reconnect the battery, both services restart. The battery service can restart since the BAT is powered from the power supply and not from the CAN only.
    Instead, if we do the opposite, i.e. we disconnect the strain2c first and the BAT later, the only service that restarts is the battery service.
    This is probably because, as long as the strain2c remains connected via CAN, it continues to stream frames even if the BAT is removed. We have errors anyway, since the BAT and strain2c buses are short-circuited, and having the BAT not attached is like having no CAN terminations, which is not good for the EMS.
    image

  5. If I disconnect the 2foc only, I get the following output, meaning that at first we have errors for the failing transmission over CAN, but at reconnection the errors go away:
    image

If we disconnect battery and 2foc the output is the following:
image

After some seconds
image

At reconnection
image

I suppose the 2foc cannot be recovered until restarting.

With 2foc and strain2c, this one:
First disconnect the FT (warning messages)
Then the 2foc (error and warning)
image

At reconnection, still errors
image

@valegagge
Member Author

Hi @MSECode ,
thank you very much for the exhaustive tests.

I just wrote down some notes for each point.

2: In the reported log I cannot see the just-lost error. It might be in the previous line... Did you see it?

3: It is ok, the strain needs to receive the configuration and the start command because it is powered by CAN.

4: Here I can see the just-stop and the restart, but not the still-lost. You probably attached the CAN before the still-lost appeared, didn't you?

5: I noticed this discrepancy:
image

Could you take a look at the diagnostic parser? If it is quick, we can fix it now; otherwise we'll plan to fix it in the future.

From your experiments and from a look into the code, the 2foc or, more in general, the motion control doesn't have the canMonitor service. So, we cannot take the motion control service into account for these tests.

  • When you said "I suppose 2foc cannot be recovered until restarting", it was not clear to me. Sorry.

From the penultimate image I see that we have another parser problem:
image

As before, if we have time we can try to fix it; otherwise we'll address it in the future.

In general, it seems that the inertial3 service diagnostic messages hide the canMonitor info of the ft service. Is it possible to remove the inertial3 service from the config file? That way our log should be clearer. We can check it together.

@MSECode
Contributor

MSECode commented Feb 20, 2024

Point 2: We have the justLOST error; it was probably above what was visible on the screen. Here you can see it:

image

Point4:

We also have the stillLOST message for both battery and ft (which have the canmonitor configured) in both cases, whether we disconnect only the battery or only the strain2c. In the latter case, the inertial and ft services cannot recover the transmission, as said at point 3.

image
image

image

Point5:

Size of fifo: fixed (there was a typo in the bitwise & between mask and data)
image

Finally here, in the error you mentioned:

image

that is the uint64 of the data, so a long unsigned. I removed the "." (dot) to make the print nicer, but the parsing is fine. In the fw we set the data using the method eo_common_canframe_data2u64, and in the sw part we just parse the whole uint64_t.

@MSECode
Contributor

MSECode commented Feb 20, 2024

Generic Error Improvement

I've hardcoded on the ems the raising of some of the bits of the bitmask for the motor fault:

image

@MSECode
Contributor

MSECode commented Feb 20, 2024

Also, the test of removing and reconnecting the 2foc and the bat or strain2c is fine.
Here is the full YRI log:

yarplog-2foc-ems-bat-strain-test.txt

@MSECode
Contributor

MSECode commented Feb 21, 2024

Here are the logs of the test with the latest changes done today:
yarplog-2foc-ems-bat-strain-test-last.txt

@pattacini pattacini linked a pull request Feb 23, 2024 that will close this issue
@valegagge
Member Author

PRs merged.

DoD satisfied. Closing.
