
Improve performance of Roof Duality through parallelization & better variable to vertex mapping. #4

amahmudwave commented Dec 1, 2020

Three steps take most of the time:

  • Creation of implication network.
  • Computing Max-Flow.
  • Making implication network symmetric.

These three steps can be made faster using the techniques below, but they should be implemented only once we have a good parallel Max-Flow algorithm at our disposal (otherwise Amdahl's law will come into effect).
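(For reference: by Amdahl's law, if a fraction p of the runtime is parallelized with speedup s on that fraction, the overall speedup is bounded by 1 / ((1 - p) + p / s). If Max-Flow stays serial and dominates the runtime, p stays small and parallelizing the other two steps buys little.)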

Creation of Implication Network:

For very large graphs, we can create all the edges corresponding to a single vertex in parallel. Creating the edges of different vertices in parallel is not a good idea, since it would lead to too much contention and we would have to hold locks throughout the process. To parallelize the creation of all the edges of a single vertex, we need to do two things.

Keep an explicit posiform - Many biases in the QUBO for a variable do not contribute to the implication network, because they are flushed to zero when converted to an integral type. As a result, we cannot parallelize the creation of all the edges corresponding to a variable directly from the QUBO: the index of an edge in the adjacency list does not depend on the index/iterator offset of the QUBO bias for that vertex, but on the index/offset of the corresponding posiform coefficient. We can create the posiform itself in parallel, but keeping an explicit posiform will increase the memory requirement by around 25%.
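As a rough illustration of the flushing behavior (a sketch only; the scaling scheme and names here are assumptions, not the actual implementation), a bias whose scaled value rounds to zero simply produces no posiform coefficient, and hence no edge:

#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical conversion of floating-point QUBO biases to integral
// posiform coefficients. A bias that rounds to zero contributes no edge,
// which is why edge indices must follow the posiform, not the QUBO.
std::vector<std::int64_t> to_posiform_coefficients(
    const std::vector<double>& biases, double scale) {
  std::vector<std::int64_t> coefficients;
  coefficients.reserve(biases.size());
  for (double bias : biases) {
    std::int64_t coefficient = std::llround(bias * scale);
    if (coefficient != 0)  // flushed-to-zero biases are dropped
      coefficients.push_back(coefficient);
  }
  return coefficients;
}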

For creating the edges of a single vertex in parallel, we basically want something like this:

for (vertex = 0; vertex < last_parallel_vertex; vertex++) {
   // Note the loop counter does not run from 0 to num_out_edges: earlier
   // vertices have already created some out-edges of this vertex
   // (symmetric/residual edges), which is why we add counter[vertex]
   // to out_edge to compute index_of_edge_of_vertex.
   #pragma omp parallel for
   for (out_edge = 0; out_edge < num_posiform_coefficients; out_edge++) {
       // All threads write to different indices, so no lock is needed.
       index_of_edge_of_vertex = counter[vertex] + out_edge;
       // This does not need a lock either: no two threads access the
       // same destination/destination_complement vertex.
       index_of_destination = counter[destination];
       Create edge;
       // Skipping code for symmetric and residual edges;
       // 4 edges are created in total (see implication network).
       // Only one edge is created out of destination and one out of
       // destination_complement.
       counter[destination]++;
       counter[destination_complement]++;
   }
   counter[vertex] += num_posiform_coefficients;
   counter[vertex_complement] += num_posiform_coefficients;
}

// We are processing an upper triangular matrix in essence, so processing
// the edges of the last vertices in parallel is not efficient, since they
// do not have many outgoing edges.
for (vertex = last_parallel_vertex; vertex < num_vertices; vertex++) {
   Process the vertex serially;
}

Use of pointers instead of vectors - We have to preallocate the amount of memory we will need, and each thread then writes to its exact position. This could be done with vectors too, but std::vector's resize() adds too much overhead, since it value-initializes every element.
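A minimal sketch of the idea (assumed names; not the actual code):

#include <cstddef>
#include <memory>

struct Edge {
  int to_vertex;
  long long residual_capacity;
};

// Preallocate raw storage once; new Edge[n] leaves the POD members
// uninitialized, so there is no O(n) fill pass the way there is with
// std::vector::resize(), which value-initializes every element.
std::unique_ptr<Edge[]> allocate_edges(std::size_t num_edges) {
  return std::unique_ptr<Edge[]>(new Edge[num_edges]);
}

// Each thread then writes edges[index] for indices computed from the
// counter array above; since those indices are disjoint, no locks or
// initialization are needed.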

We should not try to parallelize graph creation for BQMs that do not use random-access iterators, for example adjMapBQM, since the benefit will be very low.

Computing Max-Flow:

There are many papers claiming much better performance than the push_relabel algorithm we have implemented, but we should note two things:

First, some of them are tuned for computer vision tasks and may perform very poorly on more general graphs.

Second, these algorithms are generally compared against off-the-shelf implementations of the push_relabel algorithm. Some comparisons online, where the serial and parallel versions were written by the same author, show that there is not much benefit in parallelizing a good implementation of push_relabel. So we can wait for a better parallel algorithm to be developed before we opt for parallelizing the max-flow computation.

The breadth-first search is not a bottleneck right now, but it could also be parallelized; that would require us to call OpenMP functions explicitly.
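For illustration, a level-synchronous BFS could be parallelized roughly as below (a sketch under assumed names and a CSR-style graph layout, not the repo's API; a tuned version would likely also need explicit OpenMP runtime calls):

#include <vector>

// Sketch: level-synchronous parallel BFS over a CSR graph, where
// offsets[v]..offsets[v+1] index into targets. Illustrative only.
std::vector<int> parallel_bfs(const std::vector<int>& offsets,
                              const std::vector<int>& targets,
                              int source) {
  const int n = static_cast<int>(offsets.size()) - 1;
  std::vector<int> distance(n, -1);
  std::vector<char> visited(n, 0);
  char* vis = visited.data();
  std::vector<int> frontier{source};
  distance[source] = 0;
  vis[source] = 1;

  for (int level = 1; !frontier.empty(); ++level) {
    std::vector<int> next;
    #pragma omp parallel
    {
      std::vector<int> local;  // thread-private slice of the next frontier
      #pragma omp for nowait
      for (int i = 0; i < static_cast<int>(frontier.size()); ++i) {
        const int u = frontier[i];
        for (int e = offsets[u]; e < offsets[u + 1]; ++e) {
          const int v = targets[e];
          char already;
          #pragma omp atomic capture
          { already = vis[v]; vis[v] = 1; }  // atomic test-and-set
          if (!already) {
            distance[v] = level;  // only the claiming thread writes this
            local.push_back(v);
          }
        }
      }
      #pragma omp critical
      next.insert(next.end(), local.begin(), local.end());
    }
    frontier.swap(next);
  }
  return distance;
}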

Making Residual Graph Symmetric: DONE

This step is very easy to parallelize and has already been parallelized. But note the current mapping: variables 0 to n-1 map to vertices 0 to n-1, vertex n is the source, vertices n+1 to 2n are the complements of the variables, and vertex 2n+1 is the sink. If instead we map variable 0 to vertices 0 and 1, variable 1 to vertices 2 and 3, and so on, then in the adjacency list of the implication network the out-edges of a vertex end up being created in sorted order with respect to the original variables.

In the step where we make the residual network symmetric, a thread processes an edge together with its symmetric version. We cannot allow the pair to be processed twice, once by the thread that encounters the original edge and once by the thread that encounters the symmetric edge, so we decide which thread processes the pair by comparing the indices of the source and destination variables (see code, and the sketch below).

Instead of doing this check for every edge, if the edges were sorted, each thread could do a binary search for the right starting position and process from there. We are skipping this optimization since it makes the code harder to read. Note that for it to work we must process the vertices in order, since that is what creates the edges in order: we must create all the edges out of the source when it is the source's turn, and not create the edges from the source to each vertex while processing the respective vertex.
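A rough sketch of that per-edge check (all names here are hypothetical stand-ins, not the actual code):

#include <vector>

// Hypothetical residual edge; the real structure lives in the
// implication network.
struct ResidualEdge {
  int from_vertex;
  int to_vertex;
};

int variable_of(int vertex);              // vertex -> original variable
void make_pair_symmetric(ResidualEdge&);  // processes edge + symmetric copy

void symmetrize(std::vector<ResidualEdge>& edges) {
  #pragma omp parallel for
  for (long long e = 0; e < static_cast<long long>(edges.size()); e++) {
    // An edge and its symmetric partner are encountered by two different
    // threads; only the side where the source variable index is smaller
    // does the work, so each pair is processed exactly once.
    if (variable_of(edges[e].from_vertex) < variable_of(edges[e].to_vertex))
      make_pair_symmetric(edges[e]);
  }
}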

The above-mentioned update to the step of making the residual graph symmetric is done; the mapping is now separated into a file of its own, mapping_policy.hpp. The code outside of mapping_policy is now more or less independent of the mapping methodology.
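For reference, the interleaved mapping described above might look roughly like this (illustrative names only; the actual interface of mapping_policy.hpp may differ):

// Sketch: variable i maps to vertices 2i (variable) and 2i + 1
// (complement); source and sink come after all variable vertices.
struct InterleavedMapping {
  int num_variables;

  int vertex(int variable) const { return 2 * variable; }
  int complement_vertex(int variable) const { return 2 * variable + 1; }
  int complement(int vertex) const { return vertex ^ 1; }  // flip last bit
  int variable(int vertex) const { return vertex / 2; }
  int source() const { return 2 * num_variables; }
  int sink() const { return 2 * num_variables + 1; }
};

A nice side effect of this layout is that taking a complement becomes a single bit flip, and it maps the source vertex to the sink, consistent with treating the sink as the source's complement.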
