Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: parallel repair #131

Conversation

rahulghangas
Copy link
Contributor

solveCrosswordRow sets cells for the crossword which are accessed in solveCrosswordColumn, which means that both being run in parallel has a race condition.

This PR resolves this by running all solveCrosswordRow in parallel and waiting for all these calls to finish before executing all solveCrosswordColumn calls in parallel

@codecov
Copy link

codecov bot commented Sep 28, 2022

Codecov Report

❗ No coverage uploaded for pull request base (parallelisation_crosswordloop@5737e85). Click here to learn what that means.
The diff coverage is n/a.

@@                       Coverage Diff                        @@
##             parallelisation_crosswordloop     #131   +/-   ##
================================================================
  Coverage                                 ?   83.62%           
================================================================
  Files                                    ?        8           
  Lines                                    ?      507           
  Branches                                 ?        0           
================================================================
  Hits                                     ?      424           
  Misses                                   ?       48           
  Partials                                 ?       35           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@rahulghangas
Copy link
Contributor Author

Linking conversation from #129 and #130

This seems to imply the reason why parallelisation_crosswordloop didn't work was not because of the decoder, but because of rsmt2d. In that case how come it worked with infectious but not leopard or klauspost?

Infectious was failing on my machine, although sporadically. I think it was just the race condition at play.

Blocking column repair until row repair is complete means that we aren't maximizing the parallelisation that we can do. Is there another way to fix this, eg add mutexes for setting cells?

I don't think so, the race condition is between reading and writing of cells. For eg. solveCrosswordRow is setting cells here while the same cell is being read by solveCrosswordCol here. iiuc, the cell must set with the rebuilt shares before being read

@musalbas musalbas self-requested a review September 28, 2022 13:35
extendeddatacrossword.go Outdated Show resolved Hide resolved
extendeddatacrossword.go Outdated Show resolved Hide resolved
@musalbas
Copy link
Member

I don't think so, the race condition is between reading and writing of cells. For eg. solveCrosswordRow is setting cells here while the same cell is being read by solveCrosswordCol here. iiuc, the cell must set with the rebuilt shares before being read

Makes sense - if the data is malicious then we might end up setting different data into the same cell

@rahulghangas
Copy link
Contributor Author

All data races have been fixed. #128 needs to be merged for tests to pass

@musalbas @adlerjohn please review

@musalbas
Copy link
Member

musalbas commented Sep 28, 2022

Can you explain why row/col mutexes has been added? From the original description of the race condition this doesn't seem necessary

extendeddatacrossword.go Outdated Show resolved Hide resolved
@rahulghangas
Copy link
Contributor Author

Can you explain why row/col mutexes has been added? From the original description of the race condition this doesn't seem necessary

After setting a cell with the rebuilt share, the entire row/col must be read to verify if the entire row has been set or not, ref

While the entire row is being read, another go routine might be writing to the same row (just a different cell), which might lead to spurious results

@musalbas
Copy link
Member

How can another go routine be writing to the same row with the current loop? If I understand correctly, it only touches each row once in each loop.

@rahulghangas
Copy link
Contributor Author

solveCrosswordRow hits all columns here and solveCrosswordCol hits all rows here

@Wondertan Wondertan self-requested a review September 28, 2022 16:24
@musalbas
Copy link
Member

I see, I think I can see what the race condition is, but I think that might still not fix it. Suppose the last two rows to complete a column are being repaired, but they both call noMissingData before rebuilt shares are inserted into the square. Wouldn't noMissingData return false for both so the orthogonal row/col will never get verified? We should think about what would fix it fully - maybe applying a mutex over the entire second half of solveCrosswordRow/Col?

Copy link
Member

@adlerjohn adlerjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#128 has been merged, maybe pull from default branch? Edit: wrong one, ignore.

@rahulghangas rahulghangas force-pushed the fix/parallel-repair branch 2 times, most recently from e837207 to 5b9e765 Compare October 1, 2022 06:43
@rahulghangas
Copy link
Contributor Author

Suppose the last two rows to complete a column are being repaired, but they both call noMissingData before rebuilt shares are inserted into the square. Wouldn't noMissingData return false for both so the orthogonal row/col will never get verified?

I don't think so, because here the dataCol of the eds gets mutated because this is returning a slice, not a copy. Since the mutex is on the column, whichever row completes the column will successfully call noMissingData on it

@musalbas
Copy link
Member

musalbas commented Oct 2, 2022

Oh that's a bug then, col[r] = rebuiltShares[c] is not supposed to mutate the the underlying DS. Assuming we fix that bug would the aforementioned issue exist and how can it be solved? (maybe applying a mutex over the entire second half of solveCrosswordRow/Col?)

@musalbas
Copy link
Member

musalbas commented Oct 2, 2022

On another general note about the mutex - I think instead of adding rowColMutex to the EDS, we should define rowColMutex in solveCrossword and pass it as arguements to solveCrosswordRow/Col, because the mutex is only relevant to crossword solving.

Unless we think there's a way to fix this by making EDS itself entirely thread safe - but my current intuition is that even if EDS was thread safe the bug would still exist because it's a race condition in the crossword solver itself rather than the underlying EDS.

@rahulghangas
Copy link
Contributor Author

On another general note about the mutex - I think instead of adding rowColMutex to the EDS, we should define rowColMutex in solveCrossword and pass it as arguements to solveCrosswordRow/Col, because the mutex is only relevant to crossword solving

Agreed, that would be better

Unless we think there's a way to fix this by making EDS itself entirely thread safe - but my current intuition is that even if EDS was thread safe the bug would still exist because it's a race condition in the crossword solver itself rather than the underlying EDS

Yes, the issue exists regardless of whether EDS is thread safe or not

maybe applying a mutex over the entire second half of solveCrosswordRow/Col
This seems like this would be the best way forward with the given implementation

Copy link
Member

@musalbas musalbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of adding rowColMutex to the EDS, we should define rowColMutex in solveCrossword and pass it as arguements to solveCrosswordRow/Col, because the mutex is only relevant to crossword solving

@rahulghangas
Copy link
Contributor Author

Blocked by #134

@musalbas
Copy link
Member

musalbas commented Oct 21, 2022

Is there a reason why it's a single mutex now instead of one per row/col? Computing the Merkle roots in computeSharesRoot takes up a significant chunk of the time so it would be ideal to parallelize that too.

@rahulghangas
Copy link
Contributor Author

I see, I think I can see what the race condition is, but I think that might still not fix it. Suppose the last two rows to complete a column are being repaired, but they both call noMissingData before rebuilt shares are inserted into the square. Wouldn't noMissingData return false for both so the orthogonal row/col will never get verified? We should think about what would fix it fully - maybe applying a mutex over the entire second half of solveCrosswordRow/Col?

We can't separate the insertion logic from the validation logic, since we get the race condition mentioned above

@rootulp
Copy link
Collaborator

rootulp commented Jun 20, 2023

Going to close this b/c it doesn't appear to be actively being worked on.

@rootulp rootulp closed this Jun 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants