Fix issue #991 : Parallelize apply_Sparse_Matrix in lightning.qubit #992

Open
mvandelle wants to merge 4 commits into master from technical-assignement
Conversation

@mvandelle commented Nov 11, 2024

Fixes #991

@mvandelle (Author)

Hi Thomas,
Thank you for your quick answer. You are right, it will be easier on Linux; I'm currently installing a dual boot on my personal PC. I'll let you know if I'm able to build the library after that.

@tomlqc (Contributor) commented Nov 11, 2024

@mvandelle Good to hear. Let me know if you have any further questions. I'll make this a draft PR so the other developers know it's WIP.

@tomlqc tomlqc marked this pull request as draft November 11, 2024 18:59
@tomlqc (Contributor) commented Nov 11, 2024

@mvandelle Could you please link this PR to the issue? I also suggest using this PR for communication and submission. You can update the PR title accordingly.

@mvandelle mvandelle changed the title PR for communication about the apply_Sparse_Matrix technical assignement Fix issue #991 : Parallelize apply_Sparse_Matrix in lightning.qubit Nov 11, 2024
@tomlqc (Contributor) commented Nov 15, 2024

Hi @mvandelle,
Don't hesitate to ask any further questions 😃

@mvandelle (Author)

Hi @tomlqc,
I'm sorry, this week was pretty crazy in terms of deadlines for my master's, so I didn't start the assignment. I set up the Linux dual boot and built the library successfully this time, so everything is ready for me to start this weekend. I'll definitely send you an update on my work Monday, if that is alright with you.

@tomlqc (Contributor) commented Nov 15, 2024

Hi @mvandelle,
No problem, I fully understand. This was just a reminder that you can ask for clarification if you need to.

@mvandelle (Author)

Hi @tomlqc,
I parallelized the function and designed some C++ tests for it. I'm not sure I understand what type of Python test you are expecting for this function?

@tomlqc (Contributor) commented Nov 18, 2024

Hi @mvandelle,
A good question. For many of the methods we have in Lightning, we have bindings to make them available in the Python API, and we test these methods' behaviour in the Python layer as well. For the Hamiltonian class, we already have some tests in Test_ObservablesLQubit.cpp. You could check if/where we test apply_Sparse_Matrix(), directly or indirectly, and add tests if necessary.

@mvandelle (Author) commented Nov 18, 2024

Hi @tomlqc,
I've checked the Hamiltonian class and saw that apply_sparse_matrix is used there. To be sure I understood what you want: should I make a Python file that uses a function that I know triggers apply_sparse_matrix?
I also have a question about the benchmark. For now it's handmade, by editing a text file with average timestamps for different numbers of threads. I tried using the BENCHMARK macro in test_sparseLinAlg.cpp, but it was not recognized when compiling the test suite. Is it disabled by default, and if so, where can I enable it?

@tomlqc (Contributor) commented Nov 19, 2024

Hi @mvandelle,
I realized that my answers need clarification. For the tests, make sure that your C++ code is tested in one of the C++ tests, e.g. TestHelpersStateVectors.hpp or Test_ObservablesLQubit.cpp via SparseHamiltonian, and also in one of the Python tests in pennylane-lightning/tests/lightning_qubit/. This might already be the case because you're optimizing an existing function, so please confirm that it is or update the tests.
For the benchmarks, you can use a script of your own, in which case please provide us with this code and show numbers and/or plots here in the PR. To upload larger scripts you may use https://gist.github.com.
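For illustration, a hand-rolled benchmark can be as simple as a std::chrono timing loop. A minimal sketch (run_spmv is a hypothetical wrapper around the apply_Sparse_Matrix call being measured, and the thread counts and repetition count are examples, not the PR's actual script):

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>

// Average wall-clock time per call, in milliseconds.
template <class Callable>
double time_ms(Callable &&run_spmv, std::size_t repetitions) {
    using clock = std::chrono::steady_clock;
    const auto t0 = clock::now();
    for (std::size_t r = 0; r < repetitions; r++) {
        run_spmv(); // one sparse matrix-vector product
    }
    const auto t1 = clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() /
           static_cast<double>(repetitions);
}

int main() {
    for (std::size_t num_threads : {1, 2, 4, 8, 16, 32}) {
        // Hypothetical call under test; replace the lambda body with the
        // actual apply_Sparse_Matrix(..., num_threads) invocation.
        const double ms = time_ms([&] { /* run_spmv(num_threads); */ }, 100);
        std::cout << num_threads << "," << ms << "\n"; // CSV: threads,ms
    }
    return 0;
}
```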

@mvandelle (Author)

Hi @tomlqc,
Thank you for the clarification.

@mvandelle mvandelle closed this Nov 20, 2024
@mvandelle mvandelle force-pushed the technical-assignement branch from b6926cc to 9fc9633 on November 20, 2024 21:36
@mvandelle mvandelle reopened this Nov 20, 2024
@mvandelle (Author)

Hi @tomlqc,
I committed the change to the apply_sparse_matrix function. I also updated the C++ test of this function, adding a test case that runs it with different numbers of threads. I checked the Python tests: this function is already tested with a sparse Hamiltonian in test_measurements_sparse.py. Here are the scripts I used to benchmark my implementation: https://gist.github.com/mvandelle/eb91eb52c7ddb20c74725a9061501f87. I plotted the runtime while increasing the number of qubits, and I also plotted the old implementation of apply_sparse_matrix to compare it with the new threaded one. "Auto thread" represents the performance of the new function when no specific number of threads is given as a parameter; in that case the number of threads is decided using std::thread::hardware_concurrency(). I'm sorry for the delay again, but the timing between this assessment and my university homework was really bad. Don't hesitate to ask me if you need anything else.
[Plots Result1, Result1_zoom, and Result2: runtime vs. number of qubits for the old and new implementations]

@tomlqc tomlqc requested a review from AmintorDusko November 21, 2024 20:05
@AmintorDusko (Contributor)

Hi @mvandelle, I will move your PR to ready for review so we can trigger the CIs and see your tests in action. 🙂

@AmintorDusko AmintorDusko marked this pull request as ready for review November 22, 2024 14:19
codecov bot commented Nov 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.74%. Comparing base (45a67d4) to head (cead5eb).

❗ There is a different number of reports uploaded between BASE (45a67d4) and HEAD (cead5eb): HEAD has 29 fewer uploads than BASE (36 vs. 7).
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #992      +/-   ##
==========================================
- Coverage   97.67%   91.74%   -5.94%     
==========================================
  Files         228      176      -52     
  Lines       36405    24532   -11873     
==========================================
- Hits        35560    22507   -13053     
- Misses        845     2025    +1180     


@AmintorDusko (Contributor) left a comment

Nice job, @mvandelle! Nice to see the improvements thread parallelization can bring.
I have a few general questions.

  1. Could you please check your code formatting? We have one CI complaining about that.
  2. Could you please point out where this function is tested (if it is at all) in our Python tests?
  3. In your first two plots we see generally bad performance for 64 threads. Why? Also, why does this change around 17 qubits?
  4. What can you say about the relation between the number of threads and the performance you observe in your benchmarks?

@mvandelle (Author)

Hi @AmintorDusko,
Thank you for the feedback.

  1. What formatter do you use for this library, so I can format my code correctly?
  2. From what I have seen, apply_Sparse_Matrix is called in the class definition of SparseHamiltonian in the ObservablesLQubit.hpp file. This class is then tested in test_measurements_sparse.py; when computing qml.expval(...), I expect it to call apply_Sparse_Matrix for the computation.
  3. The bad performance of 64 threads on smaller data has two main causes. The first is that, for smaller sparse matrices, thread management dominates the computation. The second is related to the hardware: on my machine I have 12 available cores, which gives me 24 available threads (the value returned by hardware_concurrency()). Using more than 24 threads results in thread contention, which hurts efficiency. However, once the computation reaches 17 qubits, the data is too big to fit in the cache. Most memory accesses then go directly to RAM, making memory access the bottleneck of this function. In that regime, using more threads helps saturate the available memory bandwidth: with more threads, here 64, threads waiting for memory can be swapped out for others that are ready to compute. This latency hiding reduces idle time and improves overall throughput, compensating for the overhead of managing the extra threads.
  4. We can see that the larger the sparse matrix gets, the more efficient the runs with more threads become. This is not always the case; here it mostly comes from the fact that parallelizing the problem does not introduce contention between the threads, since each parallel worker computes a different chunk of the result vector, so there is no slowdown from thread communication or synchronization. This is why, beyond a certain sparse-matrix size, the runs that perform best are the ones using many threads. (A sketch of the thread-count selection follows this list.)
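To illustrate the "auto thread" behaviour described in point 3, thread-count selection might look like the following sketch (the function name and the clamping heuristic are illustrative assumptions, not the PR's exact code):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>

// "Auto" selection: fall back to the hardware hint when the caller does not
// request a specific count; hardware_concurrency() is allowed to return 0,
// hence the fallback to 1.
std::size_t choose_num_threads(std::size_t requested, std::size_t num_rows) {
    std::size_t n = requested;
    if (n == 0) {
        n = std::max<std::size_t>(1, std::thread::hardware_concurrency());
    }
    // Spawning more threads than there are result rows only adds overhead.
    return std::min(n, std::max<std::size_t>(1, num_rows));
}
```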

@AmintorDusko (Contributor) commented Nov 22, 2024

Our repository has a Makefile with very nice functionalities. make format is what you are looking for.

@mvandelle (Author)

Thanks for the tip, it's done!

@tomlqc (Contributor) left a comment

@mvandelle Thanks for your answers and for your update. We just have this final set of questions, and then we'll be able to complete the PR.

const std::complex<fp_precision> *values_ptr,
std::vector<std::complex<fp_precision>> &result,
index_type start, index_type end) {
for (index_type i = start; i < end; i++) {
@tomlqc (Contributor):

Could you explain how collapsing these nested for-loops can improve performance?

@mvandelle (Author):

Collapsing these nested for-loops lowers the loop-control overhead and reduces the number of memory accesses, which can improve performance. Instead of updating the result vector once per value of the sparse matrix, we can accumulate each row's products in a temporary and write it to the result once the row is done. This limits the number of accesses to the data and therefore improves performance. (A sketch follows below.)
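A sketch of the accumulator idea, using the names visible in the quoted signature plus assumed CSR companions (row_map_ptr, entries_ptr, vector_ptr), illustrative rather than the PR's exact code:

```cpp
// For each row i, accumulate the products in a register-resident temporary
// and write to result[i] once, instead of updating result[i] per nonzero.
for (index_type i = start; i < end; i++) {
    std::complex<fp_precision> acc{0, 0};
    for (index_type j = row_map_ptr[i]; j < row_map_ptr[i + 1]; j++) {
        acc += values_ptr[j] * vector_ptr[entries_ptr[j]]; // one read per nonzero
    }
    result[i] = acc; // single write per row
}
```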

@@ -81,4 +81,17 @@ TEMPLATE_TEST_CASE("apply_Sparse_Matrix", "[Sparse]", float, double) {
REQUIRE(result_refs[vec] == approx(result).margin(1e-6));
};
}

SECTION("Testing with different number of threads") {
for (size_t num_threads : {1, 2, 4, 8, 16, 32}) {
@tomlqc (Contributor):

Could you explain the difference between size_t and std::size_t, and in what scenarios one might be preferred over the other?

@mvandelle (Author):

Both size_t and std::size_t name the same type. std::size_t is generally preferred in code that follows strict namespace qualification or interacts extensively with the standard library, where types and functions are declared in the std namespace. Moreover, using std::size_t helps avoid potential ambiguities, where for example a size_t in the global namespace could conflict with other type definitions. In simpler code where strict namespacing is unnecessary, size_t makes the code more concise and, in certain cases, maintains compatibility with older code. (A short example follows below.)
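A two-line illustration of the distinction. Note the standard only guarantees that <cstddef> declares std::size_t; whether it also provides a global ::size_t is left to the implementation, although in practice it does:

```cpp
#include <cstddef> // guaranteed to declare std::size_t

std::size_t a = 0; // always well-defined after including <cstddef>
size_t b = 0;      // relies on the header also injecting ::size_t
```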


// Divide the rows approximately evenly among the threads
index_type chunk_size = (vector_size + num_threads - 1) / num_threads;
std::vector<std::thread> threads;
@tomlqc (Contributor):

How would you rewrite this code using C++20 multi-threading features (e.g., jthreads)?

@mvandelle (Author):

We could rewrite line 93, std::vector<std::thread> threads;, as std::vector<std::jthread> threads; in order to use jthread. That would simplify the thread management, since jthread doesn't require an explicit join on the threads after execution. We could also use std::views::iota(start, end) from the std::ranges library when calling emplace_back later in the code; it avoids manually handling the iteration indices and makes the code easier to read. Also, if we later changed the chunking logic, using ranges would spare us rewriting the loop logic. (A sketch follows below.)
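A sketch of that rewrite, assuming the chunking variables quoted above (vector_size, num_threads, chunk_size) and a hypothetical worker callable that fills result[start, end):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// std::jthread (C++20) joins automatically in its destructor, so the
// explicit join loop after the work is dispatched can be dropped.
std::vector<std::jthread> threads;
threads.reserve(num_threads);
const index_type chunk_size = (vector_size + num_threads - 1) / num_threads;
for (std::size_t t = 0; t < num_threads; t++) {
    const index_type start = t * chunk_size;
    const index_type end = std::min<index_type>(start + chunk_size, vector_size);
    if (start >= end) {
        break; // no rows left for the remaining threads
    }
    threads.emplace_back(worker, start, end); // worker computes result[start, end)
}
// All jthreads join automatically when `threads` goes out of scope.
```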

@tomlqc (Contributor) commented Nov 26, 2024

Thanks @mvandelle

Development

Successfully merging this pull request may close these issues: Parallelize apply_Sparse_Matrix in lightning.qubit

4 participants