Three steps take most of the time.
These three steps can be made faster with the following techniques, but the techniques should only be implemented once we have a good parallel Max-Flow algorithm at our disposal (otherwise Amdahl's law will limit the overall speedup).
Creation of Implication network:
For very large graphs, we can create all the edges corresponding to a single vertex in parallel. Creating the edges of different vertices in parallel is not a good idea, since it would lead to too much contention and we would have to hold locks throughout the process. To parallelize the creation of all the edges of a single vertex, we need two things.
Keep an explicit posiform - many biases in the QUBO for a variable do not contribute to the implication network because they are flushed to zero when converted to an integral type. As a result, the index where an edge lands in the adjacency list cannot be computed from the index/iterator offset of the QUBO bias for that vertex; it depends on the index/offset of the posiform coefficient instead. We can create the posiform itself in parallel, but keeping an explicit posiform increases the memory requirement by around 25%.
Basically we want something like this:
for (vertex = 0; vertex < last_parallel_vertex; vertex++) {
  // Note the loop counter does not go from 0 to num_out_edges, since processing
  // other vertices also creates out-edges of this vertex; that is why we add
  // counter[vertex] to out_edge to calculate index_of_edge_of_vertex.
  #pragma omp parallel for
  for (out_edge = 0; out_edge < num_posiform_coefficients; out_edge++) {
    // All threads write to different indices, so no lock is needed.
    index_of_edge_of_vertex = counter[vertex] + out_edge;
    // This does not need a lock either, since no two threads access the
    // same destination/destination_complement vertex.
    index_of_destination = counter[destination];
    Create edge;
    // Skipping code for symmetric and residual edges;
    // 4 edges will be created in total (see implication network).
    // Only one edge is created out of destination and destination_complement.
    counter[destination]++;
    counter[destination_complement]++;
  }
  counter[vertex] += num_posiform_coefficients;
  counter[vertex_complement] += num_posiform_coefficients;
}
// We are processing an upper triangular matrix in essence;
// processing the edges of the last vertices in parallel is not efficient,
// since they do not have many outgoing edges.
for (vertex = last_parallel_vertex; vertex < num_vertices; vertex++) {
  Process them in serial;
}
Use of pointers instead of vectors - We have to preallocate the amount of memory we will need so that each thread writes to its exact position. This can be done with vectors too, but the resize() function of vectors adds too much overhead (a rough sketch follows after the next point).
We should not try to parallelize the creation of the graph for BQMs that do not use random access iterators, for example adjMapBQM, since the benefit will be very low.
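As a rough illustration of the pointer-based preallocation, here is a minimal sketch; EdgeRecord, num_out_edges, edge_offsets and allocate_edges are placeholder names for this sketch, not the types or functions used in the repository:

#include <cstddef>
#include <memory>
#include <vector>

// Placeholder for whatever an edge of the implication network actually stores.
struct EdgeRecord {
  int to;
  long long capacity;
};

// Preallocate one flat edge array. edge_offsets[v] is the first slot owned by
// vertex v, so every thread writes to a disjoint range and no resize() call or
// lock is needed while the edges are being created.
std::unique_ptr<EdgeRecord[]> allocate_edges(const std::vector<int>& num_out_edges,
                                             std::vector<std::size_t>& edge_offsets) {
  edge_offsets.assign(num_out_edges.size() + 1, 0);
  for (std::size_t v = 0; v < num_out_edges.size(); ++v)
    edge_offsets[v + 1] = edge_offsets[v] + num_out_edges[v];
  // new[] default-initializes, so unlike std::vector we do not pay for
  // zero-filling memory that the threads are about to overwrite anyway.
  // A thread later writes directly to edges[edge_offsets[v] + k].
  return std::unique_ptr<EdgeRecord[]>(new EdgeRecord[edge_offsets.back()]);
}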
Computing Max-Flow:
There are many papers claiming much better performance than the push_relabel algorithm we have implemented, but we have to note two things:
First, some of them are tuned for computer vision tasks and may perform very poorly on more general graphs.
Second, these algorithms are generally compared against off-the-shelf implementations of the push_relabel algorithm; some comparisons online, where the serial and parallel versions of the code are written by the same author, show that there is not much benefit in parallelizing a good implementation of push_relabel. So we can wait for a better parallel algorithm to be developed before we opt for parallelizing the max-flow computation.
The breadth-first search is not a bottleneck now, but it could also be parallelized; that would require us to call OpenMP runtime functions explicitly.
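For reference, the kind of explicit OpenMP machinery a parallel BFS would need looks roughly like the level-synchronous sketch below; the CSR-style arrays row_start/col_index, the function name and the per-thread buffers are illustrative assumptions, not the solver's actual data structures:

#include <atomic>
#include <omp.h>
#include <vector>

// Level-synchronous BFS. Each thread expands part of the current frontier
// into its own buffer, indexed with omp_get_thread_num(); the buffers are
// merged once per level.
std::vector<int> parallel_bfs_levels(const std::vector<int>& row_start,
                                     const std::vector<int>& col_index,
                                     int source) {
  const int n = static_cast<int>(row_start.size()) - 1;
  std::vector<std::atomic<int>> level(n);
  for (auto& l : level) l.store(-1, std::memory_order_relaxed);
  level[source].store(0, std::memory_order_relaxed);

  std::vector<int> frontier{source};
  std::vector<std::vector<int>> local(omp_get_max_threads());

  for (int depth = 1; !frontier.empty(); ++depth) {
#pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < static_cast<int>(frontier.size()); ++i) {
      std::vector<int>& mine = local[omp_get_thread_num()];
      const int u = frontier[i];
      for (int e = row_start[u]; e < row_start[u + 1]; ++e) {
        const int v = col_index[e];
        int expected = -1;
        // Atomic claim so each vertex enters the next frontier exactly once.
        if (level[v].compare_exchange_strong(expected, depth))
          mine.push_back(v);
      }
    }
    // Merging the per-thread buffers serially is cheap compared to expansion.
    frontier.clear();
    for (auto& buf : local) {
      frontier.insert(frontier.end(), buf.begin(), buf.end());
      buf.clear();
    }
  }

  std::vector<int> result(n);
  for (int v = 0; v < n; ++v) result[v] = level[v].load();
  return result;
}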
Making Residual Graph symmetric: DONE
This step is very easy to parallelize and has been parallelized already. The catch is the mapping: we map variables 0 to n-1 to vertices 0 to n-1, vertex n is the source, vertices n+1 to 2n are the complements of the variables, and vertex 2n+1 is the sink. If we instead map variable 0 to vertices 0 and 1, variable 1 to vertices 2 and 3, and so on, then in the adjacency list of the implication network the out-edges of a vertex end up being created in sorted order based on the original variable. In the step where we make the residual network symmetric, a thread processes an edge and its symmetric version. But we cannot let the pair be processed twice by two different threads, that is, by the thread that encounters the original edge and by the thread that encounters the symmetric edge, so we decide which thread processes the pair by comparing the indices of the source and destination variables (see code).
Instead of doing this check every time, if the edges were sorted, each thread could do a binary search and start processing from the right position. We are skipping this optimization for now since it makes the code harder to read. Note that for it to work we must process the vertices in order, since that is what creates the edges in order: we must create all the edges out of the source when it is time to create the edges out of the source, and not create the edge from the source to each vertex while processing that vertex. A sketch of the pairing rule and the binary-search variant follows.
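To make the pairing rule and the binary-search idea concrete, here is a stripped-down sketch; the Edge struct, the adjacency layout and the function name are placeholders, and how the two residuals are actually combined is deliberately left out, only the thread-assignment logic is the point:

#include <algorithm>
#include <vector>

struct Edge {
  int to;              // destination vertex
  long long residual;  // residual capacity
};

// Every edge (u, v) has a symmetric counterpart out of v, and exactly one
// thread must handle each pair. Letting the thread that owns the
// lower-indexed endpoint do the work guarantees that.
void make_symmetric(std::vector<std::vector<Edge>>& adj) {
  const int num_vertices = static_cast<int>(adj.size());
#pragma omp parallel for schedule(dynamic)
  for (int u = 0; u < num_vertices; ++u) {
    // Because the edges were created in sorted order of destination, we can
    // binary-search for the first edge with destination > u instead of
    // comparing indices for every single edge.
    auto first = std::lower_bound(
        adj[u].begin(), adj[u].end(), u + 1,
        [](const Edge& e, int value) { return e.to < value; });
    for (auto it = first; it != adj[u].end(); ++it) {
      // The symmetric edge (it->to, u) can be located the same way, since the
      // out-edges of it->to are sorted as well.
      auto reverse_edge = std::lower_bound(
          adj[it->to].begin(), adj[it->to].end(), u,
          [](const Edge& e, int value) { return e.to < value; });
      // Combine it->residual and reverse_edge->residual here; the actual
      // combination rule is omitted. Only this thread touches this pair,
      // so no lock is needed.
      (void)reverse_edge;
    }
  }
}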
The above-mentioned update to the step of making the residual graph symmetric is done; the mapping is now separated into a file of its own, mapping_policy.hpp. The code outside of mapping_policy is now more or less independent of the mapping methodology.
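For concreteness, the interleaved mapping could be expressed by a policy struct along these lines; this mirrors the idea, not the actual contents of mapping_policy.hpp, and where the source and sink land here is an assumption of the sketch:

// Variable i occupies vertices 2*i (the variable itself) and 2*i + 1 (its
// complement); source and sink are placed after all variable vertices.
// With this layout the out-edges of a vertex are created already sorted by
// the original variable index, which is what the symmetric-residual step
// relies on.
struct InterleavedMapping {
  int num_variables;

  int variable_to_vertex(int variable) const { return 2 * variable; }
  int complement_to_vertex(int variable) const { return 2 * variable + 1; }
  int source() const { return 2 * num_variables; }
  int sink() const { return 2 * num_variables + 1; }
  int num_vertices() const { return 2 * num_variables + 2; }
};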