The way unordered subsets are managed in QDP++ is cumbersome for parscalarvec.
The subset is represented by a site table that refers to the 'linear site', which is hard to thread and vectorize. As an example, consider how we would currently do a sum over an unordered subset as of commit:
- We loop over the sites in the subset (this can be parallelized over threads, but see the inefficiencies below)
- We must find the block for each site
- We do redundant operations (we compute the whole block)
- We sum only one site from the block (actually I've generalized this to summation under a mask, but the mask has only 1 true element)
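To make the steps above concrete, here is a minimal standalone sketch of the site-table approach. It is not the QDP++ code: `kInner`, `Block`, `outerOf`, `innerOf`, `evalBlock` and `sumSiteTable` are hypothetical names, and the mapping from linear site to (outer block, inner lane) is assumed to be a plain divide/modulo, which need not match the real parscalarvec layout.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kInner = 4;          // hypothetical inner (vector) length
using Block = std::array<double, kInner>;  // one outer block of inner sites

// Assumed mapping from a linear site to (outer block, inner lane).
inline std::size_t outerOf(std::size_t site) { return site / kInner; }
inline std::size_t innerOf(std::size_t site) { return site % kInner; }

// Stand-in for evaluating an expression on a whole outer block.
inline Block evalBlock(const std::vector<Block>& field, std::size_t outer) {
    Block r{};
    for (std::size_t i = 0; i < kInner; ++i) r[i] = 2.0 * field[outer][i];
    return r;
}

// Sum over an unordered subset given as a table of linear sites.
// blockEvals reports how many whole-block evaluations were performed.
double sumSiteTable(const std::vector<Block>& field,
                    const std::vector<std::size_t>& sites,
                    std::size_t& blockEvals) {
    double total = 0.0;
    blockEvals = 0;
    for (std::size_t site : sites) {
        const std::size_t outer = outerOf(site);  // find the block for the site
        assert(outer < field.size());
        const Block r = evalBlock(field, outer);  // whole block is computed
        ++blockEvals;
        std::array<bool, kInner> mask{};          // all lanes false
        mask[innerOf(site)] = true;               // exactly one true lane
        for (std::size_t i = 0; i < kInner; ++i) {
            if (mask[i]) total += r[i];           // sum under the mask
        }
    }
    return total;
}
```

In this sketch, a subset with two sites in the same outer block evaluates that block twice, which is the redundancy in question.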
This can have several inefficiencies:
i) redundant computation within a thread if more than 1 site in the same outer block belongs to the thread. This also brings some additional memory traffic, though it may be OK (ugh) if the repeatedly accessed memory stays in cache.
ii) potentially redundant computation carried out in several threads, if sites in the same outer block are scheduled to different threads. This will also duplicate memory traffic and may cause memory ping-ponging.
A natural table for parscalarvec would split into two tables:
- a table of 'outer blocks' in the subset
- for each 'outer block', a table of inner sites in the subset, or a mask
This latter approach would allow multi-threading over the outer blocks,
and vectorization (under mask) for the ILattice bits.
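A rough sketch of what the two-table layout could look like, using the mask variant. All names here (`BlockedSubset`, `Mask`, `evalBlock`, `sumBlocked`, `kInner`) are made up for illustration and are not existing QDP++ types. The outer loop is serial in this sketch; it is the loop one would thread, with a reduction on the total.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kInner = 4;          // hypothetical inner (vector) length
using Block = std::array<double, kInner>;  // one outer block of inner sites
using Mask = std::array<bool, kInner>;     // which inner sites are in the subset

// Hypothetical two-table subset: outer blocks plus one mask per outer block.
struct BlockedSubset {
    std::vector<std::size_t> outer;  // outer blocks that contain subset sites
    std::vector<Mask> mask;          // mask[b] belongs to outer[b]
};

// Stand-in for evaluating an expression on a whole outer block.
inline Block evalBlock(const std::vector<Block>& field, std::size_t outer) {
    Block r{};
    for (std::size_t i = 0; i < kInner; ++i) r[i] = 2.0 * field[outer][i];
    return r;
}

// Masked sum: each outer block in the subset is evaluated exactly once.
double sumBlocked(const std::vector<Block>& field,
                  const BlockedSubset& s,
                  std::size_t& blockEvals) {
    assert(s.outer.size() == s.mask.size());
    double total = 0.0;
    blockEvals = 0;
    // Candidate loop for multi-threading over outer blocks.
    for (std::size_t b = 0; b < s.outer.size(); ++b) {
        const Block r = evalBlock(field, s.outer[b]);
        ++blockEvals;
        // Branch-free inner loop, the part that could vectorize under mask.
        for (std::size_t i = 0; i < kInner; ++i) {
            total += s.mask[b][i] ? r[i] : 0.0;
        }
    }
    return total;
}
```

Since every outer block appears once in the table, no two threads would touch the same block, and the number of block evaluations equals the number of distinct outer blocks in the subset.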
However, creating these tables from the 'site' table is like histogramming: go through the sites and 'bin' them into 'outer blocks'. This can be hard to parallelize because of write contention on the bins. For sets like rb, all, etc. this is not a big deal, as it can be done at startup and amortized. However, it can be a real cost for SftMom in chroma, which creates sets 'on the fly', and for user-defined sets/subsets that are created on the fly.
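One possible way to sidestep the write contention is to sort the site table first and then bin in a single scan, rather than scattering into bins directly. This is only a sketch of that swapped-in idea, not something QDP++ does today: `BlockedSubset`, `Mask`, `buildBlockedSubset` and `kInner` are hypothetical, and it again assumes the outer block is `site / kInner` and the inner lane is `site % kInner`. The version below is serial.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kInner = 4;       // hypothetical inner (vector) length
using Mask = std::array<bool, kInner>;  // which inner sites are in the subset

// Hypothetical two-table subset: outer blocks plus one mask per outer block.
struct BlockedSubset {
    std::vector<std::size_t> outer;
    std::vector<Mask> mask;
};

// Build the two tables from an unordered table of linear sites.
// Sorting groups sites of the same outer block together (under the assumed
// site / kInner mapping), so binning becomes one pass over consecutive runs.
BlockedSubset buildBlockedSubset(std::vector<std::size_t> sites) {
    std::sort(sites.begin(), sites.end());
    BlockedSubset out;
    for (std::size_t site : sites) {
        const std::size_t o = site / kInner;
        if (out.outer.empty() || out.outer.back() != o) {
            out.outer.push_back(o);      // first site seen in a new outer block
            out.mask.push_back(Mask{});  // start with all lanes false
        }
        out.mask.back()[site % kInner] = true;
    }
    assert(out.outer.size() == out.mask.size());
    return out;
}
```

The sort costs O(N log N) in the number of subset sites, so whether this is cheap enough for sets created on the fly would need measuring.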
Thoughts anyone?