feat: Implement `EventCacheStoreLock::lock()` with poison error, and `::lock_unchecked` #4285

Hywan · 2024-11-18T16:05:54Z

A flask of poison.

Since #4192, the event cache has a cross-process lock via EventCacheLock. It brings a lock method. Great! However, the caller is not aware if the lock has been acquired from the same holder or from another holder. The holder refers to the process name usually. If the holder of the lock changes, it means the in-memory data might now be invalid! That's absolutely terrifying. Imagine the following scenario:

the main app process adds events inside the event cache store
the main app process is paused
the push notification process is fired, and adds events inside the event cache store
the main app process is resumed
now what? how this process is able to know its data are invalid and should be reloaded?

Poor main app process 😢. We must help this poor main app process, don't we?

Hence this PR.

It introduces a generation counter. The principle is really simple: every time the lock is acquired from another holder, the generation is incremented by one. The generation does not change if the holder stays the same between lock acquisitions.

So.

The first patch do that: a generation value is added inside the stores (e.g. matrix-sdk-sqlite).
Next, EventCacheStoreLock::lock is renamed lock_unchecked. Why do we want an unchecked version of the lock? What does it even mean? A lock can be poisoned, but sometimes we just don't care! If we want to add a media in the cache, there is no in-memory invalidation involved. Using lock_unchecked is perfect in this case.
After that, we do a bit of Rust magic with BackingStore being automatically implemented for all Arc<T> where T: BackingStore.
And now we are ready to implement LockableEventCacheStore::is_poisoned. This “poisoning” API is restricted to the event cache; I didn't want to touch the CrossProcessStoreLock API. Maybe that would be useful for all cross-process store locks later, but for now, only event cache, nah!
All pieces are ready. We can implement EventCacheStoreLock::lock().
Finally, the test parts.

It's better to review this PR commit-by-commit for your own sanity.

Address [meta] EventCache storage #3280

codecov · 2024-11-18T16:48:27Z

Codecov Report

Attention: Patch coverage is 98.46154% with 1 line in your changes missing coverage. Please review.

Project coverage is 85.04%. Comparing base (21bb85a) to head (cd4c12e).
Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
...rates/matrix-sdk-base/src/event_cache/store/mod.rs	96.29%	1 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #4285   +/-   ##
=======================================
  Coverage   85.04%   85.04%           
=======================================
  Files         274      274           
  Lines       30043    30080   +37     
=======================================
+ Hits        25550    25583   +33     
- Misses       4493     4497    +4

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚨 Try these New Features:

Flaky Tests Detection - Detect and resolve failed and flaky tests

This patch introduces the _generation_ value for the event cache lock: every time the lock is acquired by a new holder, the generation is incremented by 1. This is fundamental to know if the lock has been poisoned or not.

We are about to introduce a lock which checks if it's been poisoned. To not break the current API semantics, `lock` is renamed `lock_unchecked` if poisoning isn't check/doesn't raise an error.

…kingStore`. This patch implements `BackingStore` for all `Arc<T>` where `T: BackingStore`.

This patch implements `LockableEventCacheStore:is_poisoned` to know whether the lock has been poisoned or not.

This patch implements `EventCacheStoreLock::lock`, which is similar to `lock_unchecked` but it returns a `Result<EventCacheStoreLockGuard, EventCacheStoreLockPoisonError<…>>` to handle the _poisoned_ case.

This patch adds a small `logged_in_base_client_with_store_config` function similarly to `logged_in_base_client`, but… with… a store config!

This patch adds tests for `EventCacheStoreLock::lock` and `lock_unchecked`.

bnjbvr

Thanks for making small commits, this made it easier to review.

main comment

I have my doubts about exposing the fact the store has been poisoned to callers of the event cache store. All the data that needs to be reset upon poisoning should be owned by the store itself, so that it can reset internally, and obtaining the lock is transparent to the callers; otherwise, each call site now needs to reflect what they need to do in that case, and that doesn't make for a very ergonomic API at all (and the Result<Result<>> reflects that weirdness).

Does it make sense to you too?

future directions

As discussed in the small chat we had today :) Those things can be implemented later, as follow-ups (even though I'd love to see the first item sooner rather than later :)).

crypto store

Since this is implementing from scratch the same mechanics that existed for the crypto-store, I strongly think we should merge implementations, instead of having two implementations that are close to each other but different. I think we should do it sooner rather than later, otherwise we'd be shipping new tech debt 😟

lock vs lock_unchecked

I like the idea of separating lock() from lock_unchecked(); I think we can go the next mile by splitting the store state into two parts:

bits that are OK to touch while holding a lock, independently of other processes: the media cache is a good example of that, likely. This would go into its own data structure, conceptually a SharedEventCacheStore, and it would contain references or own those bits.
bits that are not OK to touch while another process holds the lock, and that would need to be reset upon poisoning (aka no aliasing allowed), conceptually a UniqueEventCacheStore. This would contain the room events and such.

Then, to make it so that it's not possible to take the lock with lock_unchecked and touch the bits one is not allowed to, we'd slightly change the signatures of those two functions:

lock() -> Result<UniqueEventCacheStore, ...>
lock_unchecked() -> Result<SharedEventCacheStore, ...>

This way, we have static safety around those types, and it's not even possible to shoot oneself in the feet by misusing the lock.

bnjbvr · 2024-11-19T13:34:33Z

crates/matrix-sdk-common/src/store_locks.rs

@@ -54,6 +54,13 @@ use crate::{
    SendOutsideWasm,
 };

+/// A lock generation is an integer incremented each time it is taken by another
+/// holder. This is not used by all cross-process locks.


What's the meaning of the last sentence here? Consider removing it now, since we're likely to forget about this comment and let it rot even in a future where the crypto store lock would use it?

bnjbvr · 2024-11-19T13:35:06Z

crates/matrix-sdk-common/src/store_locks.rs

    };

+    type HolderExpirationGeneration = (String, Instant, LockGeneration);


Care to either comment, or use a small struct with named fields, please?

bnjbvr · 2024-11-19T13:35:32Z

crates/matrix-sdk-common/src/store_locks.rs

@@ -506,36 +520,45 @@ pub mod memory_store_helper {
        time::{Duration, Instant},
    };

+    use super::{LockGeneration, FIRST_LOCK_GENERATION};
+
+    type HolderExpirationGeneration = (String, Instant, LockGeneration);


Can you deduplicate this?

bnjbvr · 2024-11-19T13:38:11Z

crates/matrix-sdk-sqlite/src/event_cache_store.rs

-                        WHERE holder = ?2
+                    ON CONFLICT (key) DO
+                    UPDATE SET
+                        holder = excluded.holder,


Can you add in a comment where does this excluded entity comes from? More generally, can you add in a comment what this query does, because it's not quite the regular SQL one sees often? :)

bnjbvr · 2024-11-19T13:42:13Z

crates/matrix-sdk-base/src/event_cache/store/mod.rs

-struct LockableEventCacheStore(Arc<DynEventCacheStore>);
+struct LockableEventCacheStore {
+    store: Arc<DynEventCacheStore>,
+    generation_and_is_poisoned: Arc<Mutex<(LockGeneration, bool)>>,


Shouldn't the LockGeneration be optional before the first time the generation is loaded from the database?

Also: can you use a small struct for holding the LockGeneration and the bool? It will make documenting what the bool is better, and make the code using this lock easier to understand (no need to know what fields 0 and 1 are)

bnjbvr · 2024-11-19T13:48:16Z

crates/matrix-sdk-base/src/event_cache/store/mod.rs


-        Ok(EventCacheStoreLockGuard { cross_process_lock_guard, store: self.store.deref() })
+        Ok(if self.lockable_store.is_poisoned() {
+            Err(EventCacheStoreLockPoisonError(event_cache_store_lock_guard))


main comment here

Now that I see the full picture, I doubt that this is the best way to go. Why? Because the callers of lock() are now the ones who have to decide what to do based on getting a poisoned lock; and my guess is that they'll all do the same thing, that is, resetting the in-memory content of the event cache store. Since we are in the event cache store, could we instead reset the inner content ourselves, and then return the EventCacheStoreLockGuard? (This would be aligned with the end target of the crypto store cross-process lock, fwiw)

And considering the little code that was in lock_unchecked(), I'm sure we can keep the previous code of lock_unchecked() as it was, and not have lock_unchecked() call into lock(). That would also avoid the EventCacheStoreLockPoisonError type, which is super weird.

The EventCacheStoreLock isn't responsible to invalidate/refresh the in-memory data we have. However, EventCache and other users of this lock are responsible of that. Example: every time EventCache is calling store.lcok(), and is responsible of refreshing its own data. I don't see how EventCacheStoreLock can track all data used everywhere related to the event cache store. It's basically impossible.

Remember that the EventCache uses the EventCacheStoreLock to read load data from the store into memory, or to save data from memory into the store. The EventCacheStoreLock has no idea where the data that it has loaded or saved are used, and by whom.

I don't see how EventCacheStoreLock can track all data used everywhere related to the event cache store. It's basically impossible.

If it owns it, as I suggested, it doesn't need to track it.

A thought experiment: if we keep the current API (modulo making the return type of the lock look better, to not have to deal with a result of a result), then the next thing I would implement is a thin wrapper on top of the lock, to invalidate the data manually. So it seems like this is just kicking the can down the road, and pushing the problem elsewhere in our architecture.

Moreover, using a generic parameter for the data stored along the lock and that needs a reset would nicely solve the layer separation. A thin trait InvalidatableData { fn invalidate(); } (lol @ naming) would be sufficient to represent data that needs invalidation, and mocking it in testing situations should be straightforward (use a bool, toggle it when invalidate() is called).

I am really not sure the splitting of responsibility gives us anything valuable, so I still need to be convinced this is the best approach.

Would you have WIP code making use of this, so I can get a better opinion about how it's going to be used?

bnjbvr · 2024-11-19T13:52:10Z

crates/matrix-sdk-base/src/test_utils.rs

+
+/// Create a [`BaseClient`] with the given user id, if provided, or an hardcoded
+/// one otherwise, and with a store config.
+pub(crate) async fn logged_in_base_client_with_store_config(


Uhh, this is super ugly, and those methods I wanted to see go away with the new MatrixMockServer and MockClientBuilder; any chance you can reuse MockClientBuilder instead?

andybalaam · 2024-11-19T15:32:42Z

I will hold off on reviewing until you let me know the ideas have settled a bit or you want my input.

I also would like to share logic where it makes sense with the crypto store.

Hywan · 2024-11-20T08:32:36Z

I totally agree we should share the logic with the crypto store. However it implies a non-trivial work to move this generation and lock poison mechanism into the CrossProcessStoreLock API. I prefer to go step-by-step, especially because I'm running out of time to finish the event cache persistent storage. This PR is for the event cache. I agree a refactoring with all cross-process store locks will be necessary once this patch will land and once the design has proven to be mature. Moreover, the work on the crypto store isn't finished and I'm not confident I can rewrite and complete it. I think we should have a team discussion about that: to come with a solid design together. Having a bit of “duplication” isn't dramatically bad here: it's a specific solution, and I agree having a generic solution would be an improvement, but I don't see that as a blocker.

Hywan · 2024-11-20T13:57:37Z

After a team discussion, here is the new plan:

create new crate: matrix-sdk-cross-process-lock
move and rename CrossProcessStoreLock to CrossProcessLock inside matrix-sdk-cross-process-lock
improve the CrossProcessLock to include the generation mechanism introduced in #4285
add the lock_unchecked and lock+PoisonError logic inside this crate too, including the lock guards etc.
create a trait to represent a “poisonable” lock

Once we have that, the following items can be triggered:

matrix-sdk-crypto uses this new trait (+ friendly default to use CrossProcessStoreLock maybe?)
either:
- update the state store to be cross-process aware, then event cache store and crypto store use the state store
- use a new front type, like Stores, that is cross-process aware, and that gives access to all the stores
the event cache can implements EventCacheData that contains all in-memory data + owned access to the store (and reload in-memory data in case of poison)
the crypto can implements CryptoData that contains all in-memory data + owned access to the store (and reload in-memory data in case of poison) <- is it doable?
same dance for the state store?

Hywan force-pushed the feat-event-cache-lock-poisoned branch 5 times, most recently from cb04c54 to 05f0339 Compare November 18, 2024 16:33

Hywan added 7 commits November 18, 2024 19:52

feat: Event cache lock has a _generation_ value.

b06dac3

This patch introduces the _generation_ value for the event cache lock: every time the lock is acquired by a new holder, the generation is incremented by 1. This is fundamental to know if the lock has been poisoned or not.

refactor: Rename EventCacheStoreLock::lock to lock_unchecked.

dc31f93

We are about to introduce a lock which checks if it's been poisoned. To not break the current API semantics, `lock` is renamed `lock_unchecked` if poisoning isn't check/doesn't raise an error.

feat(common): Implement BackingStore for all Arc<T> where `T: Bac…

d4e69a3

…kingStore`. This patch implements `BackingStore` for all `Arc<T>` where `T: BackingStore`.

feat(base): Implement LockableEventCacheStore::is_poisoned.

eff9827

This patch implements `LockableEventCacheStore:is_poisoned` to know whether the lock has been poisoned or not.

feat(base): Implement EventCacheStoreLock::lock.

fa18ed4

This patch implements `EventCacheStoreLock::lock`, which is similar to `lock_unchecked` but it returns a `Result<EventCacheStoreLockGuard, EventCacheStoreLockPoisonError<…>>` to handle the _poisoned_ case.

test(base): Add logged_in_base_client_with_store_config.

4ace281

This patch adds a small `logged_in_base_client_with_store_config` function similarly to `logged_in_base_client`, but… with… a store config!

test(base): Add tests for EventCacheStoreLock::lock_*.

5073cd1

This patch adds tests for `EventCacheStoreLock::lock` and `lock_unchecked`.

Hywan force-pushed the feat-event-cache-lock-poisoned branch from 05f0339 to 5073cd1 Compare November 18, 2024 19:01

Hywan marked this pull request as ready for review November 18, 2024 19:07

Hywan requested review from a team as code owners November 18, 2024 19:07

Hywan requested review from bnjbvr and andybalaam and removed request for a team November 18, 2024 19:07

test(base): Improve test coverage.

cd4c12e

Hywan force-pushed the feat-event-cache-lock-poisoned branch from 48c54e1 to cd4c12e Compare November 19, 2024 08:33

bnjbvr requested changes Nov 19, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Implement `EventCacheStoreLock::lock()` with poison error, and `::lock_unchecked` #4285

feat: Implement `EventCacheStoreLock::lock()` with poison error, and `::lock_unchecked` #4285

Hywan commented Nov 18, 2024 •

edited

Loading

codecov bot commented Nov 18, 2024 •

edited

Loading

bnjbvr left a comment

bnjbvr Nov 19, 2024

bnjbvr Nov 19, 2024

bnjbvr Nov 19, 2024

bnjbvr Nov 19, 2024

bnjbvr Nov 19, 2024

bnjbvr Nov 19, 2024

Hywan Nov 19, 2024

bnjbvr Nov 20, 2024

bnjbvr Nov 19, 2024

andybalaam commented Nov 19, 2024

Hywan commented Nov 20, 2024

Hywan commented Nov 20, 2024

		};

		type HolderExpirationGeneration = (String, Instant, LockGeneration);

feat: Implement EventCacheStoreLock::lock() with poison error, and ::lock_unchecked #4285

Are you sure you want to change the base?

feat: Implement EventCacheStoreLock::lock() with poison error, and ::lock_unchecked #4285

Conversation

Hywan commented Nov 18, 2024 • edited Loading

codecov bot commented Nov 18, 2024 • edited Loading

Codecov Report

bnjbvr left a comment

Choose a reason for hiding this comment

main comment

future directions

crypto store

lock vs lock_unchecked

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

main comment here

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

andybalaam commented Nov 19, 2024

Hywan commented Nov 20, 2024

Hywan commented Nov 20, 2024

feat: Implement `EventCacheStoreLock::lock()` with poison error, and `::lock_unchecked` #4285

feat: Implement `EventCacheStoreLock::lock()` with poison error, and `::lock_unchecked` #4285

Hywan commented Nov 18, 2024 •

edited

Loading

codecov bot commented Nov 18, 2024 •

edited

Loading