Fix issue #991 : Parallelize apply_Sparse_Matrix in lightning.qubit #992

Open
mvandelle wants to merge 4 commits into master from technical-assignement
Conversation

@mvandelle commented Nov 11, 2024

Fixes #991

@mvandelle (Author)

Hi Thomas,
Thank you for your quick answer. You are right, it will be easier on Linux; I'm currently installing a dual boot on my personal PC. I'll let you know if I'm able to build the library after that.

@tomlqc (Contributor) commented Nov 11, 2024

@mvandelle Good to hear. Let me know if you have any further questions. I'll make this a draft PR so the other developers know it's WIP.

@tomlqc tomlqc marked this pull request as draft November 11, 2024 18:59
@tomlqc (Contributor) commented Nov 11, 2024

@mvandelle Could you please link this PR to the issue? I also suggest using this PR for communication and submission. You can update the PR title accordingly.

@mvandelle mvandelle changed the title PR for communication about the apply_Sparse_Matrix technical assignement Fix issue #991 : Parallelize apply_Sparse_Matrix in lightning.qubit Nov 11, 2024
@tomlqc (Contributor) commented Nov 15, 2024

Hi @mvandelle,
Don't hesitate to ask any further questions 😃

@mvandelle (Author)

Hi @tomlqc,
I'm sorry, this week was pretty crazy in terms of deadlines for my master's, so I didn't start the assignment. I set up the Linux dual boot and built the library successfully this time, so everything is ready for me to start this weekend. I'll definitely send you an update on my work Monday, if that is alright with you.

@tomlqc (Contributor) commented Nov 15, 2024

Hi @mvandelle,
No problem, I fully understand. This was just a reminder that you can ask for clarification if you need to.

@mvandelle (Author)

Hi @tomlqc,
I parallelized the function and designed some C++ tests for it. I'm not sure I understand what type of Python test you are expecting for this function?

@tomlqc (Contributor) commented Nov 18, 2024

Hi @mvandelle,
A good question. For many of the methods we have in Lightning, we have bindings to make them available in the Python API, and we test these methods' behaviour in the Python layer as well. For the Hamiltonian class, we already have some tests in Test_ObservablesLQubit.cpp. You could check if/where we test apply_Sparse_Matrix(), directly or indirectly, and add tests if necessary.

@mvandelle (Author) commented Nov 18, 2024

Hi @tomlqc,
I've checked the Hamiltonian class and saw that apply_sparse_matrix is used there. To be sure I understood what you want: should I make a Python file that uses a function that I know triggers apply_sparse_matrix?
I also have a question about the benchmark. For now it's handmade, by editing a text file with average timestamps for different numbers of threads. I tried using the BENCHMARK macro in test_sparseLinAlg.cpp, but it was not recognized when compiling the test suite. Is it disabled by default, and if so, where can I enable it?

@tomlqc (Contributor) commented Nov 19, 2024

Hi @mvandelle,
I realized that my answers need clarification. For the tests, make sure that your C++ code is tested in one of the C++ tests, e.g. TestHelpersStateVectors.hpp or Test_ObservablesLQubit.cpp via SparseHamiltonian, and also in one of the Python tests in pennylane-lightning/tests/lightning_qubit/. This might already be the case because you're optimizing an existing function, so please confirm that it is or update the tests.
For the benchmarks, you can use a script of your own, in which case please provide us with this code and show numbers and/or plots here in the PR. To upload larger scripts you may use https://gist.github.com.
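For illustration, a hand-rolled benchmark can be as simple as a std::chrono timing loop. A minimal sketch (run_spmv is a hypothetical wrapper around the apply_Sparse_Matrix call being measured, and the thread counts and repetition count are examples, not the PR's actual script):

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>

// Average wall-clock time per call, in milliseconds.
template <class Callable>
double time_ms(Callable &&run_spmv, std::size_t repetitions) {
    using clock = std::chrono::steady_clock;
    const auto t0 = clock::now();
    for (std::size_t r = 0; r < repetitions; r++) {
        run_spmv(); // one sparse matrix-vector product
    }
    const auto t1 = clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() /
           static_cast<double>(repetitions);
}

int main() {
    for (std::size_t num_threads : {1, 2, 4, 8, 16, 32}) {
        // Hypothetical call under test; replace the lambda body with the
        // actual apply_Sparse_Matrix(..., num_threads) invocation.
        const double ms = time_ms([&] { /* run_spmv(num_threads); */ }, 100);
        std::cout << num_threads << "," << ms << "\n"; // CSV: threads,ms
    }
    return 0;
}
```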

@mvandelle (Author)

Hi @tomlqc,
Thank you for the clarification.

@mvandelle mvandelle closed this Nov 20, 2024
@mvandelle mvandelle force-pushed the technical-assignement branch from b6926cc to 9fc9633 on November 20, 2024 21:36
@mvandelle mvandelle reopened this Nov 20, 2024
@mvandelle (Author)

Hi @tomlqc,
I committed the change to the apply_sparse_matrix function. I also updated the C++ test of this function, adding a test case that runs it with different numbers of threads. I checked the Python tests: this function is already tested with a sparse Hamiltonian in test_measurements_sparse.py. Here are the scripts I used to benchmark my implementation: https://gist.github.com/mvandelle/eb91eb52c7ddb20c74725a9061501f87. I plotted the runtime while increasing the number of qubits, and I also plotted the old implementation of apply_sparse_matrix to compare it with the new threaded one. "Auto thread" represents the performance of the new function when no specific number of threads is given as a parameter; in that case the number of threads is decided using std::thread::hardware_concurrency(). I'm sorry for the delay again, but the timing between this assessment and my university homework was really bad. Don't hesitate to ask me if you need anything else.
[Plots Result1, Result1_zoom, and Result2: runtime vs. number of qubits for the old and new implementations]

@tomlqc tomlqc requested a review from AmintorDusko November 21, 2024 20:05
@AmintorDusko (Contributor)

Hi @mvandelle, I will move your PR to ready for review so we can trigger the CIs and see your tests in action. 🙂

@AmintorDusko AmintorDusko marked this pull request as ready for review November 22, 2024 14:19
codecov bot commented Nov 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.74%. Comparing base (45a67d4) to head (cead5eb).

❗ There is a different number of reports uploaded between BASE (45a67d4) and HEAD (cead5eb): HEAD has 29 fewer uploads than BASE (36 vs. 7).
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #992      +/-   ##
==========================================
- Coverage   97.67%   91.74%   -5.94%     
==========================================
  Files         228      176      -52     
  Lines       36405    24532   -11873     
==========================================
- Hits        35560    22507   -13053     
- Misses        845     2025    +1180     


@AmintorDusko (Contributor) left a comment

Nice job, @mvandelle! Nice to see the improvements thread parallelization can bring.
I have a few general questions.

  1. Could you please check your code formatting? We have one CI complaining about that.
  2. Could you please point out where this function is tested (if it is at all) in our Python tests?
  3. In your first two plots we see generally bad performance for 64 threads. Why? Also, why does this change around 17 qubits?
  4. What can you say about the relation between the number of threads and the performance you observe in your benchmarks?

@mvandelle (Author)

Hi @AmintorDusko,
Thank you for the feedback.

  1. What formatter do you use for this library, so I can format my code correctly?
  2. From what I have seen, apply_Sparse_Matrix is called in the class definition of SparseHamiltonian in the ObservablesLQubit.hpp file. This class is then tested in test_measurements_sparse.py; when computing qml.expval(...), I expect it to call apply_Sparse_Matrix for the computation.
  3. The bad performance of 64 threads on smaller data has two main causes. The first is that, for smaller sparse matrices, thread management dominates the computation. The second is related to the hardware: on my machine I have 12 available cores, which gives me 24 available threads (the value returned by hardware_concurrency()). Using more than 24 threads results in thread contention, which hurts efficiency. However, once the computation reaches 17 qubits, the data is too big to fit in the cache. Most memory accesses then go directly to RAM, making memory access the bottleneck of this function. In that regime, using more threads helps saturate the available memory bandwidth: with more threads, here 64, threads waiting for memory can be swapped out for others that are ready to compute. This latency hiding reduces idle time and improves overall throughput, compensating for the overhead of managing the extra threads.
  4. We can see that the larger the sparse matrix gets, the more efficient the runs with more threads become. This is not always the case; here it mostly comes from the fact that parallelizing the problem does not introduce contention between the threads, since each parallel worker computes a different chunk of the result vector, so there is no slowdown from thread communication or synchronization. This is why, beyond a certain sparse-matrix size, the runs that perform best are the ones using many threads. (A sketch of the thread-count selection follows this list.)
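To illustrate the "auto thread" behaviour described in point 3, thread-count selection might look like the following sketch (the function name and the clamping heuristic are illustrative assumptions, not the PR's exact code):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>

// "Auto" selection: fall back to the hardware hint when the caller does not
// request a specific count; hardware_concurrency() is allowed to return 0,
// hence the fallback to 1.
std::size_t choose_num_threads(std::size_t requested, std::size_t num_rows) {
    std::size_t n = requested;
    if (n == 0) {
        n = std::max<std::size_t>(1, std::thread::hardware_concurrency());
    }
    // Spawning more threads than there are result rows only adds overhead.
    return std::min(n, std::max<std::size_t>(1, num_rows));
}
```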

@AmintorDusko (Contributor) commented Nov 22, 2024

Our repository has a Makefile with very nice functionalities. make format is what you are looking for.

@mvandelle (Author)

Thanks for the tip, it's done!

@tomlqc (Contributor) left a comment

@mvandelle Thanks for your answers and for your update. We just have this final set of questions, and then we'll be able to complete the PR.

const std::complex<fp_precision> *values_ptr,
std::vector<std::complex<fp_precision>> &result,
index_type start, index_type end) {
for (index_type i = start; i < end; i++) {
@tomlqc (Contributor):

Could you explain how collapsing these nested for-loops can improve performance?

@mvandelle (Author):

Collapsing these nested for-loops lowers the loop-control overhead and reduces the number of memory accesses, which can improve performance. Instead of updating the result vector once per value of the sparse matrix, we can accumulate each row's products in a temporary and write it to the result once the row is done. This limits the number of accesses to the data and therefore improves performance. (A sketch follows below.)
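A sketch of the accumulator idea, using the names visible in the quoted signature plus assumed CSR companions (row_map_ptr, entries_ptr, vector_ptr), illustrative rather than the PR's exact code:

```cpp
// For each row i, accumulate the products in a register-resident temporary
// and write to result[i] once, instead of updating result[i] per nonzero.
for (index_type i = start; i < end; i++) {
    std::complex<fp_precision> acc{0, 0};
    for (index_type j = row_map_ptr[i]; j < row_map_ptr[i + 1]; j++) {
        acc += values_ptr[j] * vector_ptr[entries_ptr[j]]; // one read per nonzero
    }
    result[i] = acc; // single write per row
}
```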

@@ -81,4 +81,17 @@ TEMPLATE_TEST_CASE("apply_Sparse_Matrix", "[Sparse]", float, double) {
REQUIRE(result_refs[vec] == approx(result).margin(1e-6));
};
}

SECTION("Testing with different number of threads") {
for (size_t num_threads : {1, 2, 4, 8, 16, 32}) {
@tomlqc (Contributor):

Could you explain the difference between size_t and std::size_t, and in what scenarios one might be preferred over the other?

@mvandelle (Author):

Both size_t and std::size_t name the same type. std::size_t is generally preferred in code that follows strict namespace qualification or interacts extensively with the standard library, where types and functions are declared in the std namespace. Moreover, using std::size_t helps avoid potential ambiguities, where for example a size_t in the global namespace could conflict with other type definitions. In simpler code where strict namespacing is unnecessary, size_t makes the code more concise and, in certain cases, maintains compatibility with older code. (A short example follows below.)
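A two-line illustration of the distinction. Note the standard only guarantees that <cstddef> declares std::size_t; whether it also provides a global ::size_t is left to the implementation, although in practice it does:

```cpp
#include <cstddef> // guaranteed to declare std::size_t

std::size_t a = 0; // always well-defined after including <cstddef>
size_t b = 0;      // relies on the header also injecting ::size_t
```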


// Divide the rows approximately evenly among the threads
index_type chunk_size = (vector_size + num_threads - 1) / num_threads;
std::vector<std::thread> threads;
@tomlqc (Contributor):

How would you rewrite this code using C++20 multi-threading features (e.g., jthreads)?

@mvandelle (Author):

We could rewrite line 93, std::vector<std::thread> threads;, as std::vector<std::jthread> threads; in order to use jthread. That would simplify the thread management, since jthread doesn't require an explicit join on the threads after execution. We could also use std::views::iota(start, end) from the std::ranges library when calling emplace_back later in the code; it avoids manually handling the iteration indices and makes the code easier to read. Also, if we later changed the chunking logic, using ranges would spare us rewriting the loop logic. (A sketch follows below.)
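A sketch of that rewrite, assuming the chunking variables quoted above (vector_size, num_threads, chunk_size) and a hypothetical worker callable that fills result[start, end):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// std::jthread (C++20) joins automatically in its destructor, so the
// explicit join loop after the work is dispatched can be dropped.
std::vector<std::jthread> threads;
threads.reserve(num_threads);
const index_type chunk_size = (vector_size + num_threads - 1) / num_threads;
for (std::size_t t = 0; t < num_threads; t++) {
    const index_type start = t * chunk_size;
    const index_type end = std::min<index_type>(start + chunk_size, vector_size);
    if (start >= end) {
        break; // no rows left for the remaining threads
    }
    threads.emplace_back(worker, start, end); // worker computes result[start, end)
}
// All jthreads join automatically when `threads` goes out of scope.
```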

@tomlqc (Contributor) commented Nov 26, 2024

Thanks @mvandelle

Development

Successfully merging this pull request may close these issues: Parallelize apply_Sparse_Matrix in lightning.qubit

4 participants