fix: parallel repair #131
Codecov Report
@@ Coverage Diff @@
## parallelisation_crosswordloop #131 +/- ##
================================================================
Coverage ? 83.62%
================================================================
Files ? 8
Lines ? 507
Branches ? 0
================================================================
Hits ? 424
Misses ? 48
Partials ? 35
Linking conversation from #129 and #130
Infectious was failing on my machine, although sporadically. I think it was just the race condition at play.
I don't think so, the race condition is between reading and writing of cells. For eg. |
Makes sense - if the data is malicious then we might end up setting different data into the same cell.
7990d89 to 3ad1f1e
All data races have been fixed. #128 needs to be merged for tests to pass. @musalbas @adlerjohn please review.
Can you explain why row/col mutexes have been added? From the original description of the race condition, this doesn't seem necessary.
After setting a cell with the rebuilt share, the entire row/col must be read to verify whether it has been fully set (ref). While the entire row is being read, another goroutine might be writing to the same row (just a different cell), which might lead to spurious results.
How can another goroutine be writing to the same row with the current loop? If I understand correctly, it only touches each row once in each loop.
I see, I think I can see what the race condition is, but I think that might still not fix it. Suppose the last two rows to complete a column are being repaired, but they both call |
#128 has been merged, maybe pull from default branch? Edit: wrong one, ignore.
e837207 to 5b9e765
I don't think so, because here the |
Oh that's a bug then, |
On another general note about the mutex - I think instead of adding rowColMutex to the EDS, we should define rowColMutex in solveCrossword and pass it as an argument to solveCrosswordRow/Col, because the mutex is only relevant to crossword solving. Unless we think there's a way to fix this by making EDS itself entirely thread safe - but my current intuition is that even if EDS were thread safe the bug would still exist, because it's a race condition in the crossword solver itself rather than in the underlying EDS.
Agreed, that would be better
Yes, the issue exists regardless of whether EDS is thread safe or not
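For illustration, the suggestion above could look roughly like the sketch below. It is not the rsmt2d code: `square`, `solve` and `setAndCheckCol` are hypothetical stand-ins for the EDS, solveCrossword and the per-row work, and the "repair" just writes a dummy value. The point it tries to show is that the mutex lives in the solver and is passed down, and that setting a cell and checking whether the column is complete happen under the same lock, so only the last writer sees the column as complete.

```go
package main

import (
	"fmt"
	"sync"
)

const missing = -1 // assumption: -1 marks a cell that has not been set yet

// square is a hypothetical stand-in for the EDS; it carries no lock of its own.
type square [][]int

// setAndCheckCol writes v into cell (r, c) and reports whether column c is
// now complete. Both steps run under mu, so no other writer can interleave
// between the write and the scan.
func setAndCheckCol(s square, r, c, v int, mu *sync.Mutex) bool {
	mu.Lock()
	defer mu.Unlock()
	s[r][c] = v
	for k := range s {
		if s[k][c] == missing {
			return false
		}
	}
	return true
}

// solve owns the mutex and hands it to every worker, instead of storing it
// on the square. It returns how many workers saw column c become complete.
func solve(s square, c int) int {
	var mu sync.Mutex
	var wg sync.WaitGroup
	completions := 0
	for r := range s {
		wg.Add(1)
		go func(r int) {
			defer wg.Done()
			if setAndCheckCol(s, r, c, r+1, &mu) {
				mu.Lock()
				completions++
				mu.Unlock()
			}
		}(r)
	}
	wg.Wait()
	return completions
}

func main() {
	s := square{{0, missing}, {0, missing}, {0, missing}}
	fmt.Println(solve(s, 1)) // prints 1: exactly one worker completes the column
}
```

If the write and the scan were taken under separate lock acquisitions, two workers finishing the same column could both report it as complete, which is the situation discussed earlier in this thread.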
instead of adding rowColMutex to the EDS, we should define rowColMutex in solveCrossword and pass it as an argument to solveCrosswordRow/Col, because the mutex is only relevant to crossword solving
Blocked by #134
Is there a reason why it's a single mutex now instead of one per row/col? Computing the Merkle roots in computeSharesRoot takes up a significant chunk of the time so it would be ideal to parallelize that too.
We can't separate the insertion logic from the validation logic, since we get the race condition mentioned above.
Going to close this because it doesn't appear to be actively worked on.
solveCrosswordRow sets cells for the crossword which are accessed in solveCrosswordColumn, which means that running both in parallel has a race condition. This PR resolves this by running all solveCrosswordRow calls in parallel and waiting for all of them to finish before executing all solveCrosswordColumn calls in parallel.
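A minimal sketch of the two-phase scheme described above, under stated assumptions: the grid is a toy 3x3 square in which every row and column XORs to zero, `missing`, `runPhase`, `repairLine` and `solveOnePass` are hypothetical names, and the real solver uses erasure coding rather than XOR parity. Each row worker only touches its own row, the `WaitGroup` acts as a barrier, and only then do the column workers start.

```go
package main

import (
	"fmt"
	"sync"
)

const missing = -1 // assumption: -1 marks a cell that still needs repair

// runPhase runs work(i) for every i in [0, n) on its own goroutine and
// returns only after all of them have finished.
func runPhase(n int, work func(i int)) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			work(i)
		}(i)
	}
	wg.Wait()
}

// repairLine fills a line that has exactly one missing cell, using the toy
// rule that all cells of a line XOR to zero.
func repairLine(n int, get func(k int) int, set func(k, v int)) {
	hole, holes, parity := 0, 0, 0
	for k := 0; k < n; k++ {
		if v := get(k); v == missing {
			hole, holes = k, holes+1
		} else {
			parity ^= v
		}
	}
	if holes == 1 {
		set(hole, parity)
	}
}

// solveOnePass repairs all rows in parallel, waits, then repairs all columns
// in parallel. No row worker overlaps in time with a column worker.
func solveOnePass(g [][]int) {
	n := len(g)
	runPhase(n, func(r int) {
		repairLine(n, func(k int) int { return g[r][k] }, func(k, v int) { g[r][k] = v })
	})
	runPhase(n, func(c int) {
		repairLine(n, func(k int) int { return g[k][c] }, func(k, v int) { g[k][c] = v })
	})
}

func main() {
	g := [][]int{
		{missing, missing, 3},
		{3, missing, 7},
		{2, 6, 4},
	}
	solveOnePass(g)
	fmt.Println(g) // prints [[1 2 3] [3 4 7] [2 6 4]]
}
```

In this example row 0 cannot be repaired in the row phase (two cells missing), but the column phase can finish it because row 1 was repaired first. That dependency is why the column phase waits for every row worker.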