
CAN Communication Stops Responding to PINGs on VESC v6.05 (BETA 25) #792

Open
gowrav opened this issue Dec 14, 2024 · 2 comments

gowrav commented Dec 14, 2024

Description:

While using VESC firmware v6.05 (BETA 25), we encountered an issue where the CAN communication on the VESC becomes transmit-only under certain conditions. Specifically, the VESC controller continues to transmit telematics and LISP-triggered data, but it stops responding to PINGs issued from VESC Express or other CAN devices. This behavior renders the controller unresponsive to incoming CAN requests.

The issue clears only after a power cycle (resetting the controller). No other recovery path is available because the VESC App Configuration is set to ADC only.
Observations:
Termination Resistance: The CAN bus termination resistance was verified to be within limits (120 Ohms).
Power Cycle Dependency: Resetting the MCU via a power cycle restores communication, which suggests the issue is not with the wiring harness.
Reproducibility: The issue reproduces after riding for 20 minutes, though the exact trigger conditions remain unclear.

Attempted Fixes:
We wrote a LISP script that rewrites the CAN bus baud rate after detecting a loss of communication for 3 seconds, expecting that saving and applying the application configuration would stop and restart CAN communication. This did not resolve the issue.
Firmware/Software: This issue was observed on v6.05 (BETA 25) and has not yet been tested on the v6.05 release.

Potential Areas of Concern:
LISP Implementation: Could there be a LISP runtime-related issue causing the CAN driver to stop responding?
BLDC Firmware: Could a state in the BLDC firmware be locking or de-prioritizing incoming CAN communication?
ChibiOS (RTOS): Could there be a ChibiOS-related bug impacting the CAN driver or message prioritization?

Request for Assistance:
Debugging Steps:
What debugging strategies could help isolate the root cause in this scenario? Are there any logs or diagnostics within the firmware that could provide insight?
LISP Runtime: Could there be a conflict between LISP runtime execution and CAN driver operations?
Testing: Would testing on the v6.05 release be sufficient to rule out potential firmware bugs in the beta version?
Firmware Patch: Could a watchdog or an automatic CAN driver reset mechanism be implemented in the firmware to handle such cases?
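
The watchdog idea in the last item could look roughly like the sketch below. It is a host-side illustration only: `can_rx_watchdog_t`, `can_wd_init`, `can_wd_feed` and `can_wd_poll` are hypothetical names, not functions from the VESC firmware, and the 3000 ms timeout simply mirrors the 3 seconds used in our LISP script.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical watchdog state; none of these names exist in the VESC firmware. */
typedef struct {
    uint32_t last_rx_ms;   /* timestamp of the last processed CAN frame */
    uint32_t timeout_ms;   /* silence longer than this counts as a stall */
    uint32_t reset_count;  /* how many times a reset has been requested */
} can_rx_watchdog_t;

void can_wd_init(can_rx_watchdog_t *wd, uint32_t now_ms, uint32_t timeout_ms) {
    assert(wd != NULL && timeout_ms > 0);
    wd->last_rx_ms = now_ms;
    wd->timeout_ms = timeout_ms;
    wd->reset_count = 0;
}

/* Call from the RX path whenever an incoming frame (for example a PING) is processed. */
void can_wd_feed(can_rx_watchdog_t *wd, uint32_t now_ms) {
    wd->last_rx_ms = now_ms;
}

/* Call periodically. Returns true when the caller should restart the CAN driver. */
bool can_wd_poll(can_rx_watchdog_t *wd, uint32_t now_ms) {
    /* Unsigned subtraction stays correct when the millisecond counter wraps. */
    uint32_t silent_ms = now_ms - wd->last_rx_ms;
    if (silent_ms < wd->timeout_ms) {
        return false;
    }
    wd->reset_count++;
    wd->last_rx_ms = now_ms;  /* re-arm so the reset is not requested on every poll */
    return true;
}
```

For this to help, the poll would have to run in a thread that cannot itself be blocked by whatever is stalling the RX path, and the reset action would need to do more than the baud-rate change we already tried.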

Summary:
This issue affects critical communication over CAN and requires immediate resolution. Insights into debugging or workarounds would be greatly appreciated, especially in understanding whether the root cause lies within the firmware (LISP, BLDC code) or the underlying ChibiOS RTOS.

Looking forward to any suggestions or assistance.


gowrav commented Dec 14, 2024

Additional Context on Thread Behavior and Debugging Attempts

When the issue arises, we checked the state of all threads via UART from the terminal. Here are our observations and actions taken:

No Change from VESC Tool Configurations:
    Changing or resetting APP configurations using the VESC Tool did not result in any noticeable changes.
    Disabling CAN STATUS 1 and CAN STATUS 2 also had no effect.
    We attempted to change the CAN baud rate, expecting that the related threads would stop and restart, potentially resolving any hanging threads due to out-of-memory conditions. However, this approach also did not yield any improvement.

LISP Process Behavior:
    The LISP process appeared to be operating as normal throughout the investigation.

Loss of UART and Recovery:
    After a few minutes of probing, UART communication was lost entirely.
    Multiple attempts to reestablish UART communication (including disconnecting hardware and restarting the VESC Tool) were unsuccessful.
    Ultimately, a power cycle of the MCU's 3.3V section restored the MCU to its original state, with all functionalities returning to normal.

LISP Code Size Adjustment:
    Our LISP code size is not standard, as the extent of the sandbox code we are running is unusually large. To accommodate this, we increased the allocated LISP code size by 256 bytes.
    Our current hypothesis is that this increased LISP code size may be stepping on the CAN heap or threads, potentially causing the observed communication issues.

We hope this additional context helps narrow down the root cause. If there are specific debug points or areas we should focus on, please let us know.
[Image: Screenshot 2024-12-14 at 3 02 54 PM]

As the attached image shows, the thread responsible for "CAN process" has moved to WTMTX (waiting for mutex). My first thought is that something in either the LISP code or the VESC firmware is stalling it. However, even after killing the LISP code we were not able to recover. Kindly suggest what further steps we can follow to debug this.
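
One debugging step we are considering is recording who holds each lock, so that the terminal can report which thread "CAN process" is waiting on. The sketch below shows only the bookkeeping, in plain single-threaded C. `dbg_mutex_t` and its functions are hypothetical, it does not call any ChibiOS API, and the thread name `lbm_eval` and the site strings are made up for illustration; in firmware this would sit next to the real RTOS lock and unlock calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical owner-tracking wrapper; bookkeeping only, not a real lock. */
typedef struct {
    const char *owner;      /* name of the current holder, NULL when free */
    const char *last_site;  /* label of the code location that took the lock */
} dbg_mutex_t;

void dbg_mutex_init(dbg_mutex_t *m) {
    m->owner = NULL;
    m->last_site = NULL;
}

/* Records the holder on success; on failure the existing holder is left intact
 * so it can be inspected afterwards. */
bool dbg_mutex_try_lock(dbg_mutex_t *m, const char *who, const char *site) {
    assert(m != NULL && who != NULL && site != NULL);
    if (m->owner != NULL) {
        return false;
    }
    m->owner = who;
    m->last_site = site;
    return true;
}

/* Only the recorded holder may release the lock. */
void dbg_mutex_unlock(dbg_mutex_t *m, const char *who) {
    assert(m->owner != NULL && strcmp(m->owner, who) == 0);
    m->owner = NULL;
    m->last_site = NULL;
}

/* Returns the holder's name, or "(free)" when nobody holds the lock. */
const char *dbg_mutex_owner(const dbg_mutex_t *m) {
    return m->owner != NULL ? m->owner : "(free)";
}
```

If a thread sits in WTMTX indefinitely, printing `dbg_mutex_owner` and `last_site` for the lock it is waiting on would show which thread took the lock and where, which is the piece of information we are currently missing.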


gowrav commented Dec 15, 2024

The following is the part of the LISP code that handles CAN frames:


(defun proc-eid (id data)
  (progn
    (sleep 0.1) ;; sleep here to reduce CPU time gobbled by Random CAN data
    (define last_battery_can_data_at (systime))

    (if (eq batteryState_eid (str-from-n id))
      (progn
        (define soc_now (bufget-u8 data 0))
      )
    )

    (if (eq currentDerating_eid (str-from-n id))
      (progn
        (define bms_dis (bufget-u8 data 2))
        (define bms_cha (- (bufget-u8 data 3)))
      )
    )
  )
)

..

...
(defun event-handler ()
  (loopwhile t
    (progn
      (recv
        ((event-can-eid . ((? id) . (? data))) (proc-eid id data))
        ((event-data-rx . (? data)) (proc-data data))
        (_ nil) ; Ignore other events
      )
      ;;(proc-eid (can-recv-eid))
    )
  )
)
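
One thing worth checking in the handler above: `proc-eid` sleeps for 0.1 s on every extended-ID frame, including frames that match neither ID, so the event handler can process at most 10 frames per second. The small C simulation below estimates what happens to a bounded queue when frames arrive faster than that. The 100 ms service time comes from the `(sleep 0.1)` call; the arrival period and the queue capacity are assumptions chosen for illustration, and we have not confirmed how the LISP runtime actually behaves when its event queue fills.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t handled;    /* frames the handler finished */
    uint32_t dropped;    /* frames rejected because the queue was full */
    uint32_t max_depth;  /* deepest backlog seen */
} queue_stats_t;

/* Simulate duration_ms of traffic in 1 ms steps.
 * arrival_period_ms: one frame arrives every this many milliseconds.
 * service_ms: time the handler spends per frame (100 for the sleep above).
 * capacity: number of queue slots, an assumed figure. */
queue_stats_t simulate_handler(uint32_t duration_ms, uint32_t arrival_period_ms,
                               uint32_t service_ms, uint32_t capacity) {
    assert(arrival_period_ms > 0 && service_ms > 0 && capacity > 0);
    queue_stats_t s = {0, 0, 0};
    uint32_t depth = 0;      /* frames queued, including the one in service */
    uint32_t busy_left = 0;  /* milliseconds left on the frame in service */

    for (uint32_t t = 0; t < duration_ms; t++) {
        if (t % arrival_period_ms == 0) {
            if (depth < capacity) {
                depth++;
            } else {
                s.dropped++;
            }
            if (depth > s.max_depth) {
                s.max_depth = depth;
            }
        }
        if (busy_left == 0 && depth > 0) {
            busy_left = service_ms;  /* handler picks up the next frame */
        }
        if (busy_left > 0) {
            busy_left--;
            if (busy_left == 0) {    /* frame finished at the end of this tick */
                depth--;
                s.handled++;
            }
        }
    }
    return s;
}
```

With these assumptions, `simulate_handler(10000, 20, 100, 16)` models 10 seconds of frames arriving every 20 ms: the queue fills within the first 400 ms and stays full from then on. Whether a full queue blocks the sender or drops frames on the real controller is exactly what we cannot tell from the LISP side.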
