Fix issue #991: Parallelize apply_Sparse_Matrix in lightning.qubit #992
Conversation
@mvandelle Good to hear. Let me know if you have any further questions. I'll make this a draft PR so the other developers know it's WIP.
@mvandelle Could you please link this PR to the issue? I also suggest using this PR for communication and submission. You can update the PR title accordingly.
Force-pushed from b6926cc to 9fc9633
Hi @mvandelle, I will move your PR to ready for review so we can trigger the CIs and see your tests in action. 🙂
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master     #992      +/-   ##
==========================================
- Coverage   97.67%   91.74%   -5.94%
==========================================
  Files         228      176      -52
  Lines       36405    24532   -11873
==========================================
- Hits        35560    22507   -13053
- Misses        845     2025    +1180
```

View full report in Codecov by Sentry.
Nice job, @mvandelle! Nice to see the improvements thread parallelization can bring.
I have a few general questions.
- Could you please check your code formatting? We have one CI complaining about that.
- Could you please point out where this function is tested (if it is at all) in our Python tests?
- In your first two plots we see generally poor performance for 64 threads. Why? Also, why does this change around 17 qubits?
- What can you say about the relation between the number of threads and the performance you observe in your benchmarks?
Our repository has a Makefile with some very nice functionality. `make format` is what you are looking for.
Thanks for the tips, it's done!
@mvandelle Thanks for your answers and for your update. We have just this final set of questions, and then we'll be able to complete the PR.
```cpp
const std::complex<fp_precision> *values_ptr,
std::vector<std::complex<fp_precision>> &result,
index_type start, index_type end) {
    for (index_type i = start; i < end; i++) {
```
Could you explain how collapsing these nested for-loops can improve performance?
By collapsing these nested for-loops we would lower the loop-control overhead and reduce the number of memory accesses, which could improve performance. Instead of looping over every value in the sparse matrix and adding each one to the temporary result, we can accumulate them locally and then do the multiplication by the vector value. That limits the number of accesses to the data and therefore improves performance.
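To make that concrete, here is a minimal sketch (not the PR's actual code; the CSR array names `row_map_ptr`, `entries_ptr`, and `vector_ptr` are hypothetical, chosen for illustration). Accumulating each row's partial sum in a local variable means `result[row]` is written once per row rather than once per non-zero entry:

```cpp
#include <complex>
#include <vector>

// Hypothetical CSR sparse matrix-vector product over the rows [start, end).
template <class fp_precision, class index_type>
void multiply_rows(const index_type *row_map_ptr, // row offsets, size nrows + 1
                   const index_type *entries_ptr, // column indices of non-zeros
                   const std::complex<fp_precision> *values_ptr,
                   const std::complex<fp_precision> *vector_ptr,
                   std::vector<std::complex<fp_precision>> &result,
                   index_type start, index_type end) {
    for (index_type row = start; row < end; row++) {
        std::complex<fp_precision> acc{0, 0}; // local accumulator
        for (index_type j = row_map_ptr[row]; j < row_map_ptr[row + 1]; j++) {
            acc += values_ptr[j] * vector_ptr[entries_ptr[j]];
        }
        result[row] = acc; // single memory write per row
    }
}
```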
```diff
@@ -81,4 +81,17 @@ TEMPLATE_TEST_CASE("apply_Sparse_Matrix", "[Sparse]", float, double) {
         REQUIRE(result_refs[vec] == approx(result).margin(1e-6));
     };
 }

     SECTION("Testing with different number of threads") {
         for (size_t num_threads : {1, 2, 4, 8, 16, 32}) {
```
Could you explain the difference between size_t and std::size_t, and in what scenarios one might be preferred over the other?
Both size_t and std::size_t are the same type. It is generally preferred to use std::size_t in code that follows strict namespace qualification or interacts extensively with the standard library, where types and functions are declared in the std namespace. Moreover, using std::size_t helps avoid potential ambiguities, where for example size_t in the global namespace could conflict with other type definitions. In simpler code where strict namespacing is unnecessary, using size_t makes the code more concise and in certain cases can maintain compatibility with older code.
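As a small illustration (my own example, not from the PR): `<cstddef>` guarantees only that `std::size_t` is declared; whether the unqualified `size_t` also appears in the global namespace is not guaranteed by that header, which is why the qualified form is the safer habit when working with the standard library:

```cpp
#include <cstddef> // guarantees std::size_t; ::size_t is not guaranteed here
#include <vector>

int main() {
    std::vector<int> v{1, 2, 3};
    // Qualified form: unambiguous and consistent with the standard library,
    // whose own sizes (e.g. v.size()) are of type std::size_t.
    for (std::size_t i = 0; i < v.size(); i++) {
        v[i] *= 2;
    }
    return 0;
}
```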
```cpp
// Divide the rows approximately evenly among the threads
index_type chunk_size = (vector_size + num_threads - 1) / num_threads;
std::vector<std::thread> threads;
```
How would you rewrite this code using C++20 multi-threading features (e.g., std::jthread)?
We could rewrite line 93, `std::vector<std::thread> threads;`, as `std::vector<std::jthread> threads;` in order to use jthread. It would simplify thread management, since jthread doesn't require explicit calls to join on the different threads after execution. We could also use `std::views::iota(start, end)` from the ranges library when calling emplace_back later in the code; it would help avoid manually handling the iteration indices and make the code easier to read. Also, if we later change the chunking logic, with ranges we won't have to rewrite the loop.
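A minimal sketch of what that rewrite could look like, assuming C++20 (`run_chunked` and `worker` are hypothetical names for illustration, not the PR's actual code). The jthreads join automatically when `threads` is destroyed, and `std::views::iota` replaces the hand-written index loop:

```cpp
#include <algorithm>
#include <ranges>
#include <thread>
#include <vector>

// Launch one std::jthread per chunk of rows; all threads join automatically
// when `threads` goes out of scope, so no explicit join loop is needed.
template <class index_type, class Worker>
void run_chunked(index_type vector_size, index_type num_threads, Worker worker) {
    const index_type chunk_size =
        (vector_size + num_threads - 1) / num_threads; // ceiling division
    std::vector<std::jthread> threads;
    threads.reserve(num_threads);
    for (auto t : std::views::iota(index_type{0}, num_threads)) {
        const index_type start = t * chunk_size;
        const index_type end = std::min(start + chunk_size, vector_size);
        if (start >= end) {
            break; // more threads than rows: nothing left to assign
        }
        threads.emplace_back(worker, start, end); // runs worker(start, end)
    }
} // jthreads join here
```

A worker lambda wrapping the per-chunk kernel, e.g. `[&](auto s, auto e) { /* multiply rows s..e */ }`, would then be passed in.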
Thanks @mvandelle
Fixes #991