Counter Analysis Toolkit
Starting with PAPI 6.0, releases include the Counter Analysis Toolkit (CAT) in the directory src/counter_analysis_toolkit/.
Simply running make in the directory src/counter_analysis_toolkit/ should produce the binary file cat_collect. Note that some of the source files are automatically generated and are very long, so parts of the build process might take multiple seconds, and the whole process could take a few minutes.
CAT is a collection of benchmarks that stress different parts of the CPU to help understand what type of information each native event collects. The example below uses the tool to stress branch instructions. Note that CAT expects the output directory to exist and will fail if it does not.
mkdir OUT_DIR
./cat_collect -in event_list.txt -out OUT_DIR -branch
The parameter -in event_list.txt specifies that the input file event_list.txt contains the list of events that the benchmark should monitor while executing its kernels.
Each line in the input file event_list.txt must contain a native event name followed by the number of qualifiers that cat_collect should append to the event name. For example, consider the following lines:
BR_INST_RETIRED 1
BR_MISP_RETIRED:ALL_BRANCHES 0
L2_RQSTS:DEMAND_DATA_RD_HIT 0
L2_RQSTS:ALL_PF 0
The first line instructs the tool to append all applicable qualifiers to the base event BR_INST_RETIRED, one qualifier at a time. On a Skylake architecture, for example, the possible qualifiers for this base event are: CONDITIONAL, COND, NEAR_CALL, ALL_BRANCHES, NEAR_RETURN, NOT_TAKEN, NEAR_TAKEN, and FAR_BRANCH.
Therefore, the first line of this file instructs the tool to monitor the following events:
BR_INST_RETIRED:CONDITIONAL
BR_INST_RETIRED:COND
BR_INST_RETIRED:NEAR_CALL
BR_INST_RETIRED:ALL_BRANCHES
BR_INST_RETIRED:NEAR_RETURN
BR_INST_RETIRED:NOT_TAKEN
BR_INST_RETIRED:NEAR_TAKEN
BR_INST_RETIRED:FAR_BRANCH
The last three lines of the example file specify a value of zero, so no additional qualifiers will be added. Specifying a value higher than one will produce combinations of qualifiers appended to a base event.
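The expansion rule described above can be sketched in a few lines of Python. This is only an illustration, not the tool's implementation: the function names are hypothetical, the qualifier list must be supplied by the caller (cat_collect obtains it from PAPI), and the behavior for counts above one is an assumption that every unordered combination of that many qualifiers is generated.

```python
from itertools import combinations

def parse_event_list(text):
    """Parse lines of the form '<event name> <qualifier count>'."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, count = line.rsplit(None, 1)
        entries.append((name, int(count)))
    return entries

def expand_event(base, count, qualifiers):
    """Return the event names monitored for one input line.

    count == 0: the name is used exactly as written.
    count == 1: each applicable qualifier is appended, one at a time.
    count  > 1: ASSUMPTION: every unordered combination of `count`
                qualifiers is appended; the tool's exact rule may differ.
    """
    if count == 0:
        return [base]
    return [":".join((base,) + combo) for combo in combinations(qualifiers, count)]

# Qualifiers listed above for BR_INST_RETIRED on Skylake.
SKYLAKE_BR_INST_RETIRED = [
    "CONDITIONAL", "COND", "NEAR_CALL", "ALL_BRANCHES",
    "NEAR_RETURN", "NOT_TAKEN", "NEAR_TAKEN", "FAR_BRANCH",
]

if __name__ == "__main__":
    for name in expand_event("BR_INST_RETIRED", 1, SKYLAKE_BR_INST_RETIRED):
        print(name)
```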
The parameter -out OUT_DIR will instruct the tool to store all output files under the directory OUT_DIR.
By using one or more of the flags -branch, -dcr, -dcw, -flops, -ic, -vec, and -instr, the user selects which kernels the tool will run: those that stress branch instructions, data cache reads, data cache writes, floating point operations, instruction caches, vector floating point operations, or instructions, respectively.
The kernels that stress the data cache reads and writes (flags "-dcr" and "-dcw", respectively) utilize OpenMP to deploy multiple threads to increase the pressure on the memory hierarchy. The number of threads used and their placement can affect the quality of the measurements. For optimal results the user should set the following two environment variables before running the benchmark:
export OMP_PROC_BIND=close
export OMP_PLACES=cores
The number of threads should ideally be no more than the number of physical cores in one socket and no less than half that number. For example, the following sets the number of OpenMP threads to 12:
export OMP_NUM_THREADS=12
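As a small worked example of the guideline above, the following Python sketch computes the suggested range of thread counts for a given socket and assembles the three environment variables. The function names are hypothetical, the number of physical cores per socket must be supplied by the user, and rounding half the core count up for odd core counts is an assumption.

```python
import math

def recommended_thread_range(physical_cores_per_socket):
    """Return (lowest, highest) thread counts suggested by the guideline."""
    if physical_cores_per_socket < 1:
        raise ValueError("a socket must have at least one physical core")
    # "No less than half": round up so odd core counts stay at or above half.
    lowest = math.ceil(physical_cores_per_socket / 2)
    return lowest, physical_cores_per_socket

def openmp_environment(num_threads):
    """Environment variables from this section, as a dict of strings."""
    return {
        "OMP_PROC_BIND": "close",
        "OMP_PLACES": "cores",
        "OMP_NUM_THREADS": str(num_threads),
    }

if __name__ == "__main__":
    low, high = recommended_thread_range(16)
    print(f"use between {low} and {high} threads")
    print(openmp_environment(12))
```

The returned dictionary could be merged into os.environ or passed as the env argument of subprocess.run when launching cat_collect from a script.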
Users who seek more fine-grained control of the cache benchmarks can modify the configuration file .cat_cfg. This file contains lines (commented out, by default) for manually setting the following attributes:
- the sizes of the different levels of the cache hierarchy, e.g., L3_DCACHE_SIZE=33554432,
- the number of physical cores that share a cache level, e.g., L3_SPLIT=8,
- the number of measurements performed by the benchmark that should fall within a particular cache level, e.g., PTS_PER_L2=7.
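To check which settings are active in a configuration file, a script could read the KEY=VALUE lines as sketched below. This is a minimal sketch under stated assumptions: the function name is hypothetical, comment lines are assumed to start with "#" (the actual comment marker in .cat_cfg is not stated here), and all values are assumed to be integers as in the three examples above.

```python
def parse_cat_cfg(text, comment_prefix="#"):
    """Return the active KEY=VALUE settings as a dict of integers.

    ASSUMPTION: lines starting with `comment_prefix` are commented out.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(comment_prefix):
            continue  # blank or commented-out line
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=VALUE line
        settings[key.strip()] = int(value.strip())
    return settings

if __name__ == "__main__":
    example = "#L3_DCACHE_SIZE=33554432\nL3_SPLIT=8\nPTS_PER_L2=7\n"
    print(parse_cat_cfg(example))
```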
For each event specified in the input file, cat_collect will run all benchmarks specified by the corresponding flags (-branch, -dcr, -dcw, -flops, -ic, -vec, and -instr). The measurements collected by measuring an event E1 when running benchmark B1 will be stored in file E1.B1. For example, if all benchmark flags have been specified by the user, then the measurements for event L2_RQSTS:ALL_PF will be stored in the output folder in the following files:
L2_RQSTS:ALL_PF.branch
L2_RQSTS:ALL_PF.data.reads
L2_RQSTS:ALL_PF.data.writes
L2_RQSTS:ALL_PF.flops
L2_RQSTS:ALL_PF.icache
L2_RQSTS:ALL_PF.vec
L2_RQSTS:ALL_PF.instr
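The naming scheme above can be captured in a small lookup table, which is handy when a post-processing script needs to locate the files for a given event. The flag-to-suffix pairs are taken from the list above; the function and variable names are hypothetical.

```python
# Suffix appended to the event name for each benchmark flag.
BENCHMARK_SUFFIX = {
    "-branch": "branch",
    "-dcr": "data.reads",
    "-dcw": "data.writes",
    "-flops": "flops",
    "-ic": "icache",
    "-vec": "vec",
    "-instr": "instr",
}

def output_file_names(event, flags):
    """Return the file names written for `event` under the given flags."""
    return [f"{event}.{BENCHMARK_SUFFIX[flag]}" for flag in flags]

if __name__ == "__main__":
    for name in output_file_names("L2_RQSTS:ALL_PF", ["-branch", "-dcr"]):
        print(name)
```

The returned names are relative to the directory given with -out, so a caller would join them with that directory before opening them.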
The following sections illustrate the format and meaning of the output files that will be generated by the different benchmarks of cat_collect.
When the flag -dcr is provided, the tool runs the benchmark that stresses reading from the data caches. The benchmark is automatically run six times using different parameters for each run and the output is the concatenation of the six runs. Each of the six runs makes multiple measurements varying the read buffer size from a small size (which is expected to fit in the L1 cache) to a very large size (which is expected to exceed the largest cache). The number of measurements and the exact sizes chosen depend on the size of the caches in the target system, or the values set in the configuration file, if any. The buffer size is incremented geometrically within each cache level.
The output file (say L2_RQSTS:ALL_PF.data.reads) contains two types of lines. Lines that start with a hash mark "#" contain meta-data about the following run, and lines that contain only numbers contain the measurements from a specific run.
Below is an excerpt of example output.
Line # | Excerpt Output |
---|---|
1 | # Core: 0 1 2 3 |
- The first line specifies that there were four threads and that they were bound to cores 0, 1, 2, and 3.
- The second line specifies the sizes (in bytes) of the three data caches found on the system.
- The third line specifies the parameters used in the run that generated the subsequent measurements.
- The following 12 lines have the same format. The first column contains the size of the buffer used and the remaining columns contain the value of the event being measured (which is specified by the file name), as measured by each thread. The reported value is normalized by the number of iterations in the kernel. In other words, the event measurements are reported per-iteration.
- The next meta-data line (line 16) specifies the parameters of a different run, the measurements of which follow. In this example, the difference between the first and second run is the value of the parameter "PPB", which specifies the number of pages per block in the pointer chain and has a direct effect on the behavior of the prefetching units of the cache.
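The file layout described above (meta-data lines starting with "#", followed by rows of numbers) can be read with a short Python sketch such as the following. The function name and the returned structure are hypothetical, the content of the meta-data lines is kept as raw text rather than interpreted, and the buffer size in the first column is assumed to be a whole number of bytes.

```python
def parse_cat_output(text):
    """Split a data cache output file into runs.

    Each run is a dict with:
      "meta": the '#' lines preceding the measurements (leading '#' removed)
      "rows": tuples (buffer_size, [value_thread0, value_thread1, ...])
    A new run starts at the first '#' line that follows measurement rows.
    """
    runs = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            if current is None or current["rows"]:
                current = {"meta": [], "rows": []}
                runs.append(current)
            current["meta"].append(line.lstrip("#").strip())
        else:
            if current is None:
                # Measurements with no preceding meta-data line.
                current = {"meta": [], "rows": []}
                runs.append(current)
            fields = line.split()
            # ASSUMPTION: first column is the buffer size in whole bytes.
            size = int(float(fields[0]))
            current["rows"].append((size, [float(v) for v in fields[1:]]))
    return runs
```

With the runs separated this way, each run can be plotted as buffer size against the per-thread (or averaged) event value.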
The graph shown below depicts an example of data collected by the data cache read benchmark. The X axis of the graph is in log scale. To improve readability, we mark on the X axis the indices at which the buffer size exceeds the size of the three caches on that architecture.
The six subplots show the data from the six different parameter sets, and the labels in the bottom part of the graph show the parameters used. In particular:
- Access Pattern: the first four runs perform pointer chasing using a random access pattern, while the last two perform pointer chasing using a sequential access pattern.
- Stride: the first, second, and fifth runs use a stride of 64 bytes (i.e., each 64-byte segment in memory will be accessed only once), while the third, fourth, and sixth use a stride of 128 bytes.
- Block Size: all runs segment the buffer in blocks when creating the pointer chain, such that each block is fully traversed before the next block starts being accessed. The first and third runs create a pointer chain within a block that spans 512 pages, while the second and fourth runs use a block of 16 pages. The notion of block does not apply to the sequential access pattern.
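To make the idea of a blocked random pointer chain concrete, the following Python sketch builds one over element indices. It illustrates the technique only and is not the code used by cat_collect: the function name is hypothetical, each element stands for one stride-sized segment of the buffer, and the 4096-byte page size used in the comment is an assumption.

```python
import random

def build_pointer_chain(num_elements, block_elements, seed=0):
    """Return (nxt, start), where nxt[i] is the element visited after i.

    The buffer is split into blocks of `block_elements` entries. Entries
    inside a block are linked in random order, and the last entry visited in
    a block links to the first entry visited in the next block, so each block
    is fully traversed before the next one is touched. The final entry links
    back to the start, closing the chain.
    """
    if num_elements < 1 or block_elements < 1:
        raise ValueError("sizes must be positive")
    rng = random.Random(seed)
    order = []
    for first in range(0, num_elements, block_elements):
        block = list(range(first, min(first + block_elements, num_elements)))
        rng.shuffle(block)  # random access pattern within the block
        order.extend(block)
    nxt = [0] * num_elements
    for pos, idx in enumerate(order):
        nxt[idx] = order[(pos + 1) % num_elements]
    return nxt, order[0]

if __name__ == "__main__":
    # Assuming 4096-byte pages and a 64-byte stride, a 16-page block
    # holds 16 * 4096 // 64 = 1024 elements.
    nxt, start = build_pointer_chain(num_elements=4096, block_elements=1024)
    i = start
    for _ in range(8):
        print(i)
        i = nxt[i]
```

A smaller block keeps consecutive accesses within fewer pages, which is one way the pages-per-block parameter can change how the prefetching units behave.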