Data written to EEPROM gets randomly wiped out (rewritten as 0xff) after reset, wake from sleep, or power off/on #9047

Open
6 tasks done
php4fan opened this issue Dec 6, 2023 · 13 comments

Comments

@php4fan

php4fan commented Dec 6, 2023

Basic Infos

  • This issue complies with the issue POLICY doc.
  • I have read the documentation at readthedocs and the issue is not addressed there.
  • I have tested that the issue is present in current master branch (aka latest git).
  • I have searched the issue tracker for a similar issue.
  • If there is a stack dump, I have decoded it.
  • I have filled out all fields below.

Platform

  • Hardware: ESP8266
  • Core Version: 3.1.2
  • Development Env: Arduino IDE
  • Operating System: Manjaro Linux

Settings in IDE

  • Module: Sparkfun ESP8266 Thing Dev
  • Flash Mode: I don't know
  • Flash Size: 512 kB
  • lwip Variant: v2 Lower Memory
  • Reset Method: I don't know
  • Flash Frequency: I don't know
  • CPU Frequency: I don't know
  • Upload Using: SERIAL
  • Upload Speed: 115200 (serial upload only)

Problem Description

My code writes to the (fake?) EEPROM with EEPROM.write() and EEPROM.commit() and reads with EEPROM.read() the data that has been written in previous "sessions".

Sometimes, utterly randomly, after waking up from sleep, when my sketch reads a byte from a given position, instead of the value that it had previously written, it finds a 0xff.

My code had been running for literally years without issues on a dozen identical boards.

Recently I re-compiled and re-uploaded with the latest version of the Arduino core on a new unit of the identical board. I made no changes to the relevant part of the code, but the version compiled with the latest core has this issue.

Note that I'm observing the issue on a brand new chip, so this is not the Flash memory getting damaged by too many write cycles. Also, the old boards that are still running the code that was compiled with the old core, are still having no issue (they haven't got anywhere near 10k write cycles).

MCVE Sketch

NOTE: I cannot share the original sketch. I haven't written and tested a minimal sketch. The issue is hard to reproduce at will because it occurs randomly, but it happens often. I'm writing a minimal code example just to explain the issue, but I have not run and tested the code below; consider it as explanatory pseudo-code, it may very well contain mistakes.

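#include <Arduino.h>
#include <EEPROM.h>

void setup() {
    Serial.begin(115200);  // needed before any Serial output; baud rate chosen arbitrarily
    EEPROM.begin(512);     // load 512 bytes of emulated EEPROM from flash into RAM
    char x = EEPROM.read(0);
    if (x != 'X') Serial.printf("!!! Unexpected value found in EEPROM: %d\n", x);
    else Serial.println("Found expected value in EEPROM");
    EEPROM.write(0, 'X');  // only changes the RAM copy
    EEPROM.commit();       // erases the flash sector, then writes the RAM copy back
}

void loop() {
    // intentionally empty
}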

Debug Messages

Debug messages go here
@mcspr
Collaborator

mcspr commented Dec 6, 2023

Can you try these changes to EEPROM source?

Enable verbose output in the IDE preferences menu, then read the build log to find out where our Core and EEPROM lib files are located.
E.g. for me it looks like this right after sketch compilation is done:

Using library EEPROM at version 1.0 in folder: C:\Users\max\Documents\Arduino\hardware\esp8266com\esp8266\libraries\EEPROM <<< go here

Then, remove reinterpret_cast<uint32_t*> around the _data pointer in EEPROM.cpp and rebuild your app

diff --git a/libraries/EEPROM/EEPROM.cpp b/libraries/EEPROM/EEPROM.cpp
index e193237d..2f361ac7 100644
--- a/libraries/EEPROM/EEPROM.cpp
+++ b/libraries/EEPROM/EEPROM.cpp
@@ -65,7 +65,7 @@ void EEPROMClass::begin(size_t size) {

   _size = size;

-  if (!ESP.flashRead(_sector * SPI_FLASH_SEC_SIZE, reinterpret_cast<uint32_t*>(_data), _size)) {
+  if (!ESP.flashRead(_sector * SPI_FLASH_SEC_SIZE, _data, _size)) {
     DEBUGV("EEPROMClass::begin flash read failed\n");
   }

@@ -132,7 +132,7 @@ bool EEPROMClass::commit() {
     return false;

   if (ESP.flashEraseSector(_sector)) {
-    if (ESP.flashWrite(_sector * SPI_FLASH_SEC_SIZE, reinterpret_cast<uint32_t*>(_data), _size)) {
+    if (ESP.flashWrite(_sector * SPI_FLASH_SEC_SIZE, _data, _size)) {
       _dirty = false;
       return true;
     }

@php4fan
Author

php4fan commented Dec 7, 2023

I need a way to reliably test whether or not the proposed fix (once I apply it) is working. That is, to reproduce the issue; either at will or with a reasonable likelihood so that I can repeat the test many many times and see if it ever fails, first without the fix and then with the fix.

Running my current real-life sketch many times and waiting for the issue to happen is too time-consuming.

So, I need a test that I can rapidly run hundreds or thousands of times and see if it ever fails. Or better yet would be, if possible at all, a test that would provoke the issue systematically.

Any ideas on that? I've tried with a very minimal sketch but it's never triggering the issue in the first place. I guess if the program is too trivial, it'll never cause the kind of "unexpected" alignment of data in memory that supposedly causes the issue (assuming the issue is what you think it is).

@php4fan
Author

php4fan commented Dec 7, 2023

The only way I've been able to kind-of-reproduce the issue at will has been: with a minimal sketch basically like the example code I posted above (except I write and read a bunch of bytes instead of one), I reset several times in very, very rapid succession. But I guess that's because I manage to reset right while the EEPROM is being written, and that's definitely not what happens when I observe the issue spontaneously in real life.

@mcspr
Collaborator

mcspr commented Dec 7, 2023

Guess above is about sizing, mostly. You'll have to at least share your EEPROM class setup, not a random sketch that does nothing for our tests :/

new can never return an unaligned _data, as the address is always a multiple of 8, which satisfies our alignment requirement of a multiple of 4.
If you did not change the sector location, that should also be OK, as writing happens at a multiple-of-4 address.
But if the EEPROM class size is not a multiple of 4, there is a small problem with values being written / read incorrectly.

Note that the EEPROM class commits in two steps: erase sector, then write. Mayhaps you reset some time between these two operations.
(edit: and the value itself, 0xFF, is what the flash contains after the erase)
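
The two-step commit described above can be modelled on a desktop machine. What follows is a minimal host-side sketch, not the core's code: `FakeSector`, `eraseSector`, `programSector`, `commit` and the `resetAfterErase` flag are all hypothetical names, and the 4096-byte sector size is an assumption. It only illustrates why a reset that lands between the erase and the write leaves 0xFF behind.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side model of one flash sector (assumed 4096 bytes).
constexpr std::size_t kSectorSize = 4096;

struct FakeSector {
    std::array<std::uint8_t, kSectorSize> bytes{};
};

// Erasing NOR flash sets every bit, so every byte reads back as 0xFF.
void eraseSector(FakeSector& sector) {
    sector.bytes.fill(0xFF);
}

// Programming can only clear bits (1 -> 0), modelled here with a bitwise AND.
void programSector(FakeSector& sector, const std::uint8_t* data, std::size_t len) {
    assert(len <= kSectorSize);
    for (std::size_t i = 0; i < len; ++i) {
        sector.bytes[i] &= data[i];
    }
}

// Two-step commit: erase, then write. When resetAfterErase is true the
// function returns early, as if the chip had been reset between the steps.
bool commit(FakeSector& sector, const std::uint8_t* data, std::size_t len,
            bool resetAfterErase) {
    eraseSector(sector);
    if (resetAfterErase) {
        return false;  // the write step never runs; the sector stays all 0xFF
    }
    programSector(sector, data, len);
    return true;
}
```

Calling `commit` once with `resetAfterErase` set to false and then again with it set to true leaves every byte of the model sector at 0xFF, including bytes that the second call did not intend to change.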

@php4fan
Author

php4fan commented Dec 7, 2023

What is EEPROM class size? What setup do you need? The code I posted is literally how I read from and write to the EEPROM, except I do that with several bytes (one at a time) at several positions. There's no "new" involved in my code. Whatever EEPROM does internally, I don't know.

Mayhaps you reset some time between these two operations.

Maybe when I tried the trivial read-and-write test and I reset multiple times very quickly, yes.
When I observe the issue "in real life", no.

I don't reset at all during execution. My code runs and goes to sleep. Then it either wakes up when the timer goes off or I wake it up manually with RST, and then when the first value is read from EEPROM, sometimes the value 0xff is found instead of what was supposed to be written.
And as I said, this didn't happen with previous versions of the core.

In order to test the changes you suggested to the core (removing the reinterpret_cast), I need a suggestion on how to trigger the issue with a higher probability (based on whatever your guess is about what might be causing the issue). If I do the changes and just try it with my real code, I will need to run it for a looooong time before I can be remotely confident that it's working.

@mcspr
Collaborator

mcspr commented Dec 7, 2023

What setup do you need?

Whether you construct EEPROMClass manually; if you do, the address becomes different. I don't know if you do that.
If you have different value for begin() that is not 512, the size differs, so that's an important one.
If you always catch the error at read(0), or some other address. I also don't know that.

@php4fan
Author

php4fan commented Dec 7, 2023

Whether you construct EEPROMClass manually

I don't. I use the global instance EEPROM as shown in the code I posted.

If you have different value for begin() that is not 512

I don't. I call EEPROM.begin(512) once as shown in the code I posted.

The code I posted is LITERALLY all I do in my real code with EEPROM except I do more than one read and more than one write and commit (one or more consecutive writes followed immediately by one commit), at different addresses, but always with one char per write as shown.

If you always catch the error at read(0), or some other address.

The first read I do is at address 0, and if that's not the expected value (which is a "magic character", namely literally 'W'), I stop and don't read anything else, so I don't know whether or not the issue happens at other addresses as well.
Now that you mention it, when I did the trivial minimal test where I was trying to reproduce, I was writing and reading at other addresses (namely 100 - 107), never 0; so, there's a tiny chance that the issue only ever happens at address 0. That didn't occur to me, I assumed the address was irrelevant and that everything was getting erased, or a whole "page" or something.

I will retry the "trivial test" with address 0, and I'll also add more debugging to my real code to see whether, when the byte at address 0 is erased, other stuff also is.
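
A check of that kind could be sketched as follows. This is host-side C++ working on a plain byte buffer; `erasedOffsets` and `kErasedByte` are hypothetical names, and on the device the buffer would have to be filled from `EEPROM.read()` first. Note that a byte legitimately stored as 0xFF is indistinguishable from an erased one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Value that erased flash reads back as.
constexpr std::uint8_t kErasedByte = 0xFF;

// Returns the offsets (relative to data) of every byte equal to 0xFF.
// An empty result means no byte in the range looks erased.
std::vector<std::size_t> erasedOffsets(const std::uint8_t* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    std::vector<std::size_t> offsets;
    for (std::size_t i = 0; i < len; ++i) {
        if (data[i] == kErasedByte) {
            offsets.push_back(i);
        }
    }
    return offsets;
}
```

Logging the returned offsets whenever the magic character at address 0 is wrong would show whether only that byte or a wider range was wiped.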

@php4fan
Author

php4fan commented Dec 7, 2023

I've been able to watch more closely, and it turns out, when the issue happens, there's an unexpected reset very shortly after the normal reset. This might be a hardware issue after all.

Context: I have a "reset" button that when pressed, connects the RST pin to GND causing a reset.

So here's what happens when the issue happens (which again, is randomly and sporadically):

  • my sketch is executed fully, it writes stuff to the EEPROM and goes to deep sleep.
  • while the board is in deep sleep, I press the reset button
  • the board wakes up and starts executing my sketch. It calls EEPROM.begin(512) and it MAY have time to do some EEPROM.read() but DEFINITELY no write
  • an unexpected RESET happens
  • the sketch restarts and runs normally. When it reads from the EEPROM, everything is 0xff (not only the byte at address 0, also a few others that are scattered at small addresses not far from 0).

This by itself still doesn't explain why the EEPROM bytes would be erased. The unexpected reset is NOT happening during write/commit, nor during begin() for that matter. It may very well be happening during a read, though.
However, since the board is resetting for no reason, I guess it means something's wrong with the hardware, so whatever hardware issue is causing a reset, it might be causing the EEPROM to be erased as well?

@d-a-v
Collaborator

d-a-v commented Dec 8, 2023

The unexpected reset happens very shortly after the initial reset from a push button.
How short is "very shortly" ?
Is it an electrical bouncing ?
Is it acceptable for your application to try and add a 1sec or so delay in the setup function before doing anything else ?

@php4fan
Author

php4fan commented Dec 8, 2023

The unexpected reset is NOT happening during write/commit

Turns out I was wrong about that! It may be happening during a write/commit.

How short is "very shortly" ?

Apparently, long enough to execute at least all of this (plus whatever comes before the setup() method in the arduino framework gets called):

  Serial.begin(SERIAL_BAUD_RATE);
  WiFi.mode(WIFI_OFF);
  WiFi.persistent(false);
  pinMode(LED_PIN, OUTPUT);
  digitalWrite(LED_PIN, HIGH);
  pinMode(SS, OUTPUT);
  digitalWrite(SS, LOW);
  pinMode(0, INPUT_PULLUP);
  EEPROM.begin(512);
  Serial.println("Setup");

and possibly part of this but no more:

int something = EEPROM.read(SOME_ADDRESS);
EEPROM.write(ANOTHER_ADDRESS, 0);
EEPROM.write(YET_ANOTHER_ADDRESS, 0);
EEPROM.commit();

so, it could be during the commit. 🤦

Is it an electrical bouncing ?

I think that's unlikely. I tried quickly pressing the button twice and I'm very easily able to push it the second time before it gets to print "Setup" as per the code above. I don't think an electrical bouncing can be longer than a human pressing a button twice (a relatively big plastic mechanical button, like 5mm thick).
Plus I had never observed this before, having been running the same code on a dozen boards with the same circuitry for literally years.

However, a poorly soldered connection in the button is a possibility. 🤦

@php4fan
Author

php4fan commented Dec 8, 2023

If I used SPIFFS instead of EEPROM, would that be safe against a reset/power-off in the middle of writing?

By "safe" I mean you end up with either the new data completely written or the old data still intact, but not with corrupt data or all 0xff.

@Tygrec

Tygrec commented May 2, 2024

Hi, did you find a solution yet ? I have the exact same problem and I can't figure why this happens and how to fix it.

@php4fan
Author

php4fan commented May 2, 2024

Hi, did you find a solution yet ?

Yes, I use EEPROM_Rotate. I think it's this one: https://github.com/xoseperez/eeprom_rotate

My issue was that there sometimes is a sudden reset very soon after startup, whose root cause I've never figured out for sure but it could be hardware and it could be just a (unusually late) bounce from the reset button.
In my case, sometimes the reset happened exactly while writing to flash, which is expected to inevitably cause data corruption/loss.

I haven't fixed the reset itself, but as long as the reset only happens within a limited, short interval of time from startup, the only problem it causes for me is the flash data loss, so EEPROM_Rotate avoids that (and other problems). Their readme explains how.
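
The general idea of keeping more than one copy, so that an interrupted write cannot destroy the last good one, can be sketched on a desktop machine. This is not EEPROM_Rotate's code or API: `Slot`, `Store`, `save`, `load`, the two-slot layout, the 16-byte payload and the FNV-1a checksum are all assumptions chosen for illustration, and sequence-number wraparound is ignored.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kPayloadSize = 16;  // hypothetical payload size
constexpr std::size_t kSlotCount = 2;     // hypothetical number of rotating slots
using Payload = std::array<std::uint8_t, kPayloadSize>;

// One slot models one flash sector; a fresh Slot looks like erased flash.
struct Slot {
    std::uint32_t sequence = 0xFFFFFFFFu;
    std::uint32_t checksum = 0xFFFFFFFFu;
    Payload payload;
    Slot() { payload.fill(0xFF); }
};

struct Store {
    std::array<Slot, kSlotCount> slots;
};

// FNV-1a over the sequence number and the payload.
std::uint32_t checksumOf(std::uint32_t sequence, const Payload& payload) {
    std::uint32_t h = 2166136261u;
    for (int shift = 0; shift < 32; shift += 8) {
        h = (h ^ ((sequence >> shift) & 0xFFu)) * 16777619u;
    }
    for (std::uint8_t b : payload) {
        h = (h ^ b) * 16777619u;
    }
    return h;
}

// An erased or half-written slot fails this check.
bool isValid(const Slot& s) {
    return s.sequence != 0xFFFFFFFFu && s.checksum == checksumOf(s.sequence, s.payload);
}

// Index of the valid slot with the highest sequence number, or -1 if none.
int newestValid(const Store& store) {
    int best = -1;
    for (std::size_t i = 0; i < kSlotCount; ++i) {
        const Slot& s = store.slots[i];
        if (isValid(s) &&
            (best < 0 || s.sequence > store.slots[static_cast<std::size_t>(best)].sequence)) {
            best = static_cast<int>(i);
        }
    }
    return best;
}

// Writes to the slot that does NOT hold the newest valid copy.
// interruptAfterErase simulates a reset between the erase and the write.
bool save(Store& store, const Payload& payload, bool interruptAfterErase) {
    const int newest = newestValid(store);
    const std::size_t target =
        newest < 0 ? 0 : (static_cast<std::size_t>(newest) + 1) % kSlotCount;
    const std::uint32_t nextSeq =
        newest < 0 ? 0 : store.slots[static_cast<std::size_t>(newest)].sequence + 1;
    store.slots[target] = Slot{};  // erase only the target slot
    if (interruptAfterErase) {
        return false;  // the previous copy in the other slot is untouched
    }
    store.slots[target].payload = payload;
    store.slots[target].sequence = nextSeq;
    store.slots[target].checksum = checksumOf(nextSeq, payload);
    return true;
}

// Copies the newest valid payload into out; returns false if no slot is valid.
bool load(const Store& store, Payload& out) {
    const int newest = newestValid(store);
    if (newest < 0) {
        return false;
    }
    out = store.slots[static_cast<std::size_t>(newest)].payload;
    return true;
}
```

In this model an interrupted `save` costs only the update in progress: `load` keeps returning the previous payload until a later `save` completes.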
