layout | permalink | title |
---|---|---|
page |
/weft/index.html |
Verification of Producer-Consumer Synchronization in GPU Programs |
Important: as of 2022, some of the links below are dead, and these instructions may no longer work. For the most up-to-date instructions, please go to the GitHub repo. This page is no longer being maintained.
This webpage describes the artifact for the following PLDI 2015 paper:
[Verification of Producer-Consumer Synchronization in GPU Programs]({{ "/pdfs/weft.pdf" | relative_url }})
The artifact (a CDE package) can be downloaded here. The artifact link is dead, please refer to the GitHub repo above.
The version of Weft documented here is significantly improved from the original version presented in our PLDI paper. This version of Weft has the following modifications:
- The functionality of PTXSim originally described in our paper
has been subsumed entirely by Weft. This version of Weft parses
an input PTX file and performs the emulation internally before
performing the validation. Consequently, the process of translating
from emulation to verification is significantly faster. The original
files that would have been generated by PTXSim can still be
generated using the
-p
flag. - The construction of the barrier dependence graph has been greatly optimized for higher performance.
- The representation of happens-before/after relationships have been optimized for reduced memory usage.
- Weft is now multi-threaded (see the
-t
flag) so that we can fully leverage the available memory bandwidth on modern processors which is crucial for achieving high performance.
As a result of our improvements, you may notice some differences between the version of PTXSim and Weft described in our original paper submission and the current source code. We assure you that this version of Weft has a super-set of the functionality of the original Weft and is anywhere from 2-30X faster than the original version depending on the input program and the machine architecture. We will document these improvements along with new performance results in the camera-ready version of our paper.
Below we cover how to run Weft using our CDE package. The instructions for downloading, compiling, and running Weft from source are available at the github repository. Please consider reading the README to understand the advanced usage of WEFT.
This page provides instructions to reproduce the experiments described in the paper using the artifact. We assume a Linux based system, such as, Ubuntu.
First, unpack the archive
{% highlight bash %} tar -zxvf cde-weft.tar.gz {% endhighlight %}
Next, go to the WEFT directory
{% highlight bash %} cd cde-package/cde-root/home/sharmar/Desktop/artifact/Weft-master/ {% endhighlight %}
This folder, referred to as the main directory, contains two sub-directories:
- The directory
src
contains the source code of WEFT with a precompiled binary weft. - The directory
examples
contains the examples described in the paper.
To run WEFT on the examples, run the following commands:
{% highlight bash %} cd examples/ ./run_examples.sh.cde {% endhighlight %}
This shell file is just a wrapper around run_examples.sh which is the script we normally use to run all of our test examples. Alternatively, you can execute run_examples.sh directly from the terminal using the following command:
{% highlight bash %} ../../../../../../../cde-exec ./run_examples.sh {% endhighlight %}
The ugly looking cde-exec prefix is important. It ensures
that the CDE package performs as expected. For more details,
please see the CDE user manual. The CDE user manual
link is dead.
This shell file first compiles WEFT. Since the package contains a precompiled binary of WEFT, you should see the following output.
{% highlight bash %} make: Nothing to be done for 'all' {% endhighlight %}
Next, it generates the PTX file for the SAXPY kernels by compiling the CUDA files. Since, these files have been precompiled, you should again see
{% highlight bash %} make: Nothing to be done for 'all' {% endhighlight %}
Next WEFT analyzes the SAXPY kernels starting with saxpy_single.
{% highlight bash %} Running saxpy single... WEFT INFO: No deadlocks detected! WEFT INFO: Barriers properly recycled! WEFT INFO: No races detected! WEFT STATISTICS for saxpy_single.ptx CTA Thread Count: 320 Shared Memory Locations: 512 Physical Named Barriers; 6 Dynamic Barrier Instances: 8192 Static Instructions: 169 Dynamic Instructions: 9575616 Weft Statements: 3670016 Total Race Tests: 1878392832 {% endhighlight %}
The output says that WEFT has proven that the kernel is well-synchronized (deadlock free and recycles barriers properly) and race-free followed by some statistics (see [paper]({{ "/pdfs/weft.pdf" | relative_url }})). There are 320 threads in the CTA we are analyzing and these threads share 512 memory locations amongst them. This CTA uses 6 named barriers out of the 16 available. Since named barriers are a limited physical resource, they need to be reused and this CTA has 8192 distinct uses of the barriers. The PTX file (with statically analyzable loops) has 169 statements. After unrolling all loops the kernel has about 9 million instructions (across 320 threads). After translating these to the modeling language, WEFT analyzed about three million statements, that is, about hundred thousand statements per thread on average. WEFT proved that the 1.8 billion memory access pairs that can potentially race are safe (do not race).
The shell script continues to analyze all benchmarks described in the paper. Next, consider a kernel with a different output:
{% highlight bash %} Running DME chemistry kepler... WEFT WARNING: Program has shuffle instructions but warp-synchronous execution was not assumed! Enabling warp-synchronous assumption... WEFT INFO: No deadlocks detected! WEFT INFO: Barriers properly recycled! Found races between 32 pairs of threads on line 3248 of chem_kepler.cu and line 3115 of chem_kepler.cu ... WEFT INFO: Found 7168 total races! Run with '-d' flag to see detailed per-thread and per-address races WEFT INFO: RACES DETECTED! WEFT STATISTICS for chem_kepler.ptx ... {% endhighlight %}
This kernel has shuffle instructions that require a warp-synchronous execution
assumption. In this case, the shell script does not provide the -s
flag which enables
the warp-synchronous execution assumption. Therefore WEFT emits a warning
and enables the warp-synchronous assumption. This kernel is well-synchronized,
but it does have data races and WEFT shows the lines in the CUDA file that race.
More information can be obtained by running the kernel with -d
flag.
If you want to compile WEFT for your machine (and not use the precompiled binary) then you can recompile the source using the following (from the main directory):
{% highlight bash %} cd src ../../../../../../../cde-exec make clean ../../../../../../../cde-exec make {% endhighlight %}
Similarly, if you want to regenerate the SAXPY PTX files from the CUDA files then you can do the following (again from the main directory):
{% highlight bash %} cd examples/saxpy ../../../../../../../../cde-exec make clean ../../../../../../../../cde-exec make {% endhighlight %}
The other examples can also be compiled similarly.
Weft can also be run on individual examples using the cde-exec command. We encourage you to use the '-i' flag which instruments the execution of Weft and give detailed performance results including the timing and memory usage for each stage of execution. If you also have a multi-core machine we encourage you to explore the performance effects of variable number of threads using the '-t' flag.
{% highlight bash %} cd examples/saxpy ../../../../../../../../cde-exec ../../src/weft -i -t 4 -n 320 saxpy_single.ptx WEFT INFO: No deadlocks detected! WEFT INFO: Barriers properly recycled! WEFT INFO: No races detected! WEFT STATISTICS for saxpy_single.ptx CTA Thread Count: 320 Shared Memory Locations: 512 Physical Named Barriers; 6 Dynamic Barrier Instances: 8192 Static Instructions: 177 Dynamic Instructions: 11674560 Weft Statements: 3670016 Total Race Tests: 1878392832 WEFT INSTRUMENTATION Parse PTX: 1.385 ms 0 MB Emulate Threads: 3278.410 ms 267 MB Construct Barrier Dependence Graph: 838.145 ms 89 MB Compute Happens-Before/After Relationships: 1558.494 ms 3741 MB Check for Race Conditions: 1978.978 ms 0 MB Total: 7655.412 ms 4099 MB {% endhighlight %}
Stanford University and NVIDIA