TASK: Prevent multiple catchup runs #4751
base: 9.0
Conversation
This is great already, thanks a lot for your efforts!
I find it a bit hard to understand the queueing logic (even though we came up with it together *g)
Maybe that can be formulated a bit more explicitly so that it is easier to spot potential issues.
I added some initial suggestions inline.
This looks so much cleaner and easier to comprehend, thank you!
Two more comments about things I wasn't aware of before…
Let me know if you could use a rubber duck :)
@kitsunet I thought about this one again and realized that CATCHUPTRIGGER_ENABLE_SYNCHRONOUS_OPTION is a bad naming choice, since this is not so much about sync vs. async but more about whether we expect multiple parallel threads or not.
For example: I would like to be able to simply switch the catch-up trigger to the synchronous one for individual CRs for performance reasons – but they would still have to be queued in order to prevent concurrent processes from running into the "Failed to acquire checkpoint" exception.
I have a somewhat radical suggestion:
What if we replaced AsynchronousCatchUpRunnerState with a central authority that can be used to queue catch-ups independently of their implementation?
```php
final readonly class CatchUpQueue {
    public function __construct(
        private ContentRepositoryId $contentRepositoryId,
        private FrontendInterface $catchUpLock,
    ) {}

    public function queueCatchUp(ProjectionInterface $projection, \Closure $catchUpTrigger): void {
        // here goes your logic from SubprocessProjectionCatchUpTrigger and AsynchronousCatchUpRunnerState
    }

    public function releaseCatchUpLock(string $projectionClassName): void {
        // to be called from the part that invokes ContentRepository::catchUp()
    }
}
```
I think the change is smaller than it might first appear (since you already solved all of the logical issues), but it would allow us to use this for all implementations (for the testing context we could consider replacing the cache with a NullBackend).
E.g. in SubprocessProjectionCatchUpTrigger::triggerCatchUp():

```php
$catchUpQueue = $this->catchUpQueueFactory->build($contentRepositoryId);
foreach ($projections as $projection) {
    $catchUpQueue->queueCatchUp($projection, $this->startCatchUp(...));
}
```
Sorry for the back and forth, but I think we're getting there and I'm happy to help or take over!
The example code above is not correct yet because it would block catch-ups in the loop. It would rather have to be something like:

```php
final readonly class CatchUpQueue {
    // ...
    public function queueCatchUp(array $projections, \Closure $catchUpTrigger): void {
        // keep looping over $projections until none of them are still running
    }
}
```
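For context, here is a minimal sketch of how that looping variant could be implemented. It is purely illustrative, not the PR's actual code: the use of Flow's cache FrontendInterface as lock storage (following the constructor sketch above), the entry naming, the `->value` access on ContentRepositoryId and the back-off values are all assumptions.

```php
<?php
declare(strict_types=1);

use Neos\Cache\Frontend\FrontendInterface;
use Neos\ContentRepository\Core\Projection\ProjectionInterface;
use Neos\ContentRepository\Core\SharedModel\ContentRepository\ContentRepositoryId;

final readonly class CatchUpQueue
{
    public function __construct(
        private ContentRepositoryId $contentRepositoryId,
        private FrontendInterface $catchUpLock,
    ) {
    }

    /**
     * @param array<ProjectionInterface> $projections
     */
    public function queueCatchUp(array $projections, \Closure $catchUpTrigger): void
    {
        // keep looping until every requested projection has had its catch-up triggered
        while ($projections !== []) {
            foreach ($projections as $index => $projection) {
                $lockId = $this->lockId($projection::class);
                if ($this->catchUpLock->has($lockId)) {
                    // a catch-up for this projection is still running – retry in the next round
                    continue;
                }
                // note: has() + set() is not atomic – see the later discussion about symfony/lock
                $this->catchUpLock->set($lockId, 'running');
                $catchUpTrigger($projection);
                unset($projections[$index]);
            }
            if ($projections !== []) {
                usleep(random_int(10_000, 50_000)); // simple randomized back-off between rounds
            }
        }
    }

    public function releaseCatchUpLock(string $projectionClassName): void
    {
        $this->catchUpLock->remove($this->lockId($projectionClassName));
    }

    private function lockId(string $projectionClassName): string
    {
        return md5($this->contentRepositoryId->value . '|' . $projectionClassName);
    }
}
```

The has()/set() pair here is not atomic, which is exactly the gap the later switch to symfony/lock closes.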
Oh, that's a very cool idea, and I think this all makes sense. I'll see how far I get tomorrow and then we can see if you want to take over, but at least I get your idea and know this part of the code already, so it makes sense for me to go on.
This is a tough nut. I thought it would be nicer to wrap the debouncer – as I named it now – around the catch-up hooks.
No, this also seems like a bad idea, because the hooks are per projection and I can't know from the outside which projection has such a hook active. It feels unwise to have a global debouncer but a per-projection (and per-repository) unlock of it (via the hook). SignalSlot might be an option, but then it becomes messy to get hold of the debouncer to unlock the state.
I guess I would also like a rubber duck here :)
> Using the …

I think that's fine. It would only break for users that actually created a custom hook implementation – and it's really easy to fix.
I would like the implementation to be as independent of Flow as possible, because something like this will be needed in other places, too.
I'm up for it – tomorrow?
Will use caches to try to avoid even starting async catchups if one is still running for that projection. Also adds a simple single-slot queue with incremental backoff to ensure a requested catchup definitely happens.
…UpTrigger/SubprocessProjectionCatchUpTrigger.php Co-authored-by: Bastian Waidelich <[email protected]>
Force-pushed from 060e8ac to 242b5f6
Force-pushed from 242b5f6 to 387f409
@kitsunet This looks awesome already, thanks for holding on!!
Beware: Most if not all of my inline comments are rather nitpicky – that's because I couldn't find any bigger issues ;)
Except for two potential ones:
1. Global Queue concept
With the CR registry we tried to avoid our "global service" thinking, but now all CRs would conceptually share the same CatchUpDeduplicationQueue, i.e. share the same cache backend (my point: they could internally, but now they have to).
That might not be an issue in practice, but what if we added a factory like:
```php
class CatchUpDeduplicationQueueFactory {
    // ...
    public function build(ContentRepositoryId $contentRepositoryId): CatchUpDeduplicationQueue
    {
        // ...
    }
}
```
And with that, maybe it makes sense to take the one extra step and introduce an interface for it…
To be honest: while writing this down, I'm not sure what concrete issue there could be… Maybe we can just postpone this part.
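For illustration, such an interface might look roughly like this. The name, namespace and signatures below are assumptions that simply mirror the CatchUpQueue sketch from earlier in the discussion:

```php
<?php
declare(strict_types=1);

namespace Neos\ContentRepositoryRegistry\Service;

use Neos\ContentRepository\Core\Projection\ProjectionInterface;

interface CatchUpQueueInterface
{
    /**
     * Trigger (or queue) a catch-up for the given projections, deduplicating runs that are already in progress.
     *
     * @param array<ProjectionInterface> $projections
     */
    public function queueCatchUp(array $projections, \Closure $catchUpTrigger): void;

    /**
     * To be called by the part that actually invoked ContentRepository::catchUp() once it is done.
     */
    public function releaseCatchUpLock(string $projectionClassName): void;
}
```

The factory's build() above could then be typed against such an interface, which would also make it easy to swap in a no-op implementation for tests instead of relying on a NullBackend.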
2. Tests
An annoying one, but since this is a crucial part for the system to work, I would suggest we add some tests.
We could at least add a functional test, creating two instances and seeing how they interact? (A rough sketch of that idea follows below.)
But the best would be a test that covers this with multiple threads. I've had great experiences with paratestphp/paratest (we use that for the consistency tests of neos/eventstore-doctrineadapter, for example).
I'm happy to help with this, and we can – of course – add those in a separate PR as well (to get this one in asap).
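A rough, hedged sketch of that functional-test idea (not from the PR): DedupQueueDouble is a hypothetical stand-in for the real deduplication queue, reduced to the single behaviour under test, and symfony/lock's InMemoryStore plays the role of a cache backend shared by two instances.

```php
<?php
declare(strict_types=1);

use PHPUnit\Framework\TestCase;
use Symfony\Component\Lock\LockFactory;
use Symfony\Component\Lock\Store\InMemoryStore;

/**
 * Hypothetical stand-in for the real deduplication queue: skip a catch-up whose run lock is already taken.
 */
final class DedupQueueDouble
{
    public function __construct(private readonly LockFactory $lockFactory)
    {
    }

    public function queueCatchUp(string $projectionClassName, \Closure $catchUpTrigger): void
    {
        $runLock = $this->lockFactory->createLock('catchup-' . $projectionClassName);
        if (!$runLock->acquire()) {
            // a catch-up for this projection is already running – deduplicate
            return;
        }
        try {
            $catchUpTrigger();
        } finally {
            $runLock->release();
        }
    }
}

final class CatchUpDeduplicationTest extends TestCase
{
    public function testConcurrentCatchUpIsDeduplicated(): void
    {
        // both instances share the same lock store, like two processes sharing a cache backend
        $store = new InMemoryStore();
        $first = new DedupQueueDouble(new LockFactory($store));
        $second = new DedupQueueDouble(new LockFactory($store));

        $runs = [];
        $first->queueCatchUp('SomeProjection', function () use (&$runs, $second) {
            $runs[] = 'first';
            // while the first catch-up is still running, a second request must not start another one
            $second->queueCatchUp('SomeProjection', function () use (&$runs) {
                $runs[] = 'second';
            });
        });

        self::assertSame(['first'], $runs);
    }
}
```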
```php
{
    $queuedProjections = $this->triggerCatchUpAndReturnQueued($projections);
    $attempts = 0;
    /** @phpstan-ignore-next-line */
```
Just out of curiosity: What was PHPStan complaining about?
The new Projections::empty is in an internal class. If you have suggestions I'm happy to hear them – we could make Projections not internal, or I could avoid using it, but it seems very convenient.
A few more potential issues/considerations (in addition to the potential race condition mentioned above):
I should have tried this, as I thought the same, but I think we might end up in a concurrent lock situation if we don't tackle the whole package at once (because then you end up with multiple while loops waiting for catch-ups to finish).
We could check whether the lock is acquired in there and otherwise throw an exception (or just do nothing)?
Right, that needs to be fixed for sure!
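For illustration, such a guard could look like this – a hypothetical helper assuming symfony/lock, not the PR's actual code:

```php
<?php
declare(strict_types=1);

use Symfony\Component\Lock\LockInterface;

/**
 * Hypothetical helper: only release a lock that we actually hold.
 */
function releaseIfAcquired(?LockInterface $lock): void
{
    if ($lock === null || !$lock->isAcquired()) {
        // nothing to release – either throw a descriptive exception here or, as discussed, just return
        return;
    }
    $lock->release();
}
```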
Using symfony/lock ensures atomic locking, which should really prevent duplication even under load.
Step one: replace the cache with a lock; the overall logic is still the same.
We now have two separate lock steps: a "run" lock around the catchUp of a specific projection, preventing that same projection catch-up from being run multiple times in parallel, and the deduplication "queue" lock that checks whether a catch-up is already running and creates a queue lock to prevent multiple processes from waiting for another catch-up run and all potentially having to spawn background processes. This second queue lock is not atomic, as it depends on the "run" lock. That should be fine: it is there to prevent runaway spawning of parallel subrequests in busy installations, and for that it should work well with its randomized back-off.
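To make the two lock steps easier to follow, here is a condensed, hypothetical sketch of the idea on top of symfony/lock. Class name, method names, key prefixes and back-off values are assumptions for illustration only – the actual implementation lives in CatchUpDeduplicationQueue.

```php
<?php
declare(strict_types=1);

use Symfony\Component\Lock\LockFactory;

final class TwoStepCatchUpLocking
{
    public function __construct(private readonly LockFactory $lockFactory)
    {
    }

    /**
     * Step 1, the "run" lock: wraps the actual catch-up of one projection so the same
     * projection is never caught up twice in parallel.
     */
    public function runCatchUp(string $projectionClassName, \Closure $catchUp): bool
    {
        $runLock = $this->lockFactory->createLock('run-' . $projectionClassName);
        if (!$runLock->acquire()) {
            return false; // this projection is already catching up in another process
        }
        try {
            $catchUp();
            return true;
        } finally {
            $runLock->release();
        }
    }

    /**
     * Step 2, the "queue" lock: when a catch-up is already running, at most one caller
     * waits to trigger the follow-up run; everyone else returns immediately instead of
     * spawning further subprocesses.
     */
    public function queueFollowUpRun(string $projectionClassName, \Closure $spawnSubprocess): void
    {
        $queueLock = $this->lockFactory->createLock('queue-' . $projectionClassName);
        if (!$queueLock->acquire()) {
            return; // another process already queued the follow-up run
        }
        try {
            // randomized back-off so concurrent triggers don't all retry at the same moment
            usleep(random_int(10_000, 100_000));
            $spawnSubprocess();
        } finally {
            $queueLock->release();
        }
    }
}
```

The queue lock intentionally sits outside the run lock, which is why it cannot be atomic with it – matching the trade-off described above.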
```php
/**
 * This is not a regular unit test, it won't run with the rest of the testsuite.
 * the following two commands after another would run the parallel tests and then the validation of results:
 * requires "brianium/paratest": "^6.11"
```
I see – the latest paratest requires PHPUnit 10, but Flow blocks that.
I think this is now superseded by #5321