The Counter Analysis Toolkit

Starting with PAPI 6.0, releases contain the Counter Analysis Toolkit (CAT) to help PAPI users understand and verify CPU native events. The code resides under the directory src/counter_analysis_toolkit/.


Building CAT

Simply running make in the directory src/counter_analysis_toolkit/ should produce the binary file cat_collect. Note that some of the source files are automatically generated and very long, so parts of the build may take several seconds and the whole process a few minutes.
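For example, starting from the top level of the PAPI source tree (adjust the path if your checkout is laid out differently):

cd src/counter_analysis_toolkit
make

If the build succeeds, the cat_collect binary appears in that same directory.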


Running CAT

CAT is a collection of benchmarks that stress different parts of the CPU to help users understand what type of information each native event collects. An example of using the tool (to stress branch instructions) is shown below. Note that CAT expects the output directory to already exist and will fail if it does not.

mkdir OUT_DIR
./cat_collect -in event_list.txt -out OUT_DIR -branch

The parameter -in event_list.txt specifies that the input file event_list.txt contains the list of events that the benchmark should monitor while executing its kernels. Each line in the input file event_list.txt must contain a native event name followed by the number of qualifiers that cat_collect should append to the event name. For example, consider the following lines:

BR_INST_RETIRED 1  
BR_MISP_RETIRED:ALL_BRANCHES 0  
L2_RQSTS:DEMAND_DATA_RD_HIT 0  
L2_RQSTS:ALL_PF 0  

The first line instructs the tool to append all applicable qualifiers to the base event BR_INST_RETIRED, one qualifier at a time. On a Skylake architecture, for example, the possible qualifiers for this base event are: CONDITIONAL, COND, NEAR_CALL, ALL_BRANCHES, NEAR_RETURN, NOT_TAKEN, NEAR_TAKEN, and FAR_BRANCH.

Therefore, the first line of this file instructs the tool to monitor the following events:

BR_INST_RETIRED:CONDITIONAL
BR_INST_RETIRED:COND
BR_INST_RETIRED:NEAR_CALL
BR_INST_RETIRED:ALL_BRANCHES
BR_INST_RETIRED:NEAR_RETURN
BR_INST_RETIRED:NOT_TAKEN
BR_INST_RETIRED:NEAR_TAKEN
BR_INST_RETIRED:FAR_BRANCH

The last three lines of the example file specify a value of zero, so no additional qualifiers will be added. Specifying a value higher than one will produce combinations of qualifiers appended to a base event.
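For illustration only (the combinations actually generated depend on which qualifiers apply to the event on the target architecture, and not every pairing is necessarily valid), a hypothetical input line such as

BR_INST_RETIRED 2

would instruct the tool to append combinations of two qualifiers at a time to the base event, e.g. a pairing along the lines of BR_INST_RETIRED:NEAR_TAKEN:NOT_TAKEN.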

The parameter -out OUT_DIR will instruct the tool to store all output files under the directory OUT_DIR.

By using one or more of the flags -branch, -dcr, -dcw, -flops, -ic, -vec, and -instr, the user determines whether the tool runs the kernels that stress branch instructions, data cache reads, data cache writes, floating-point operations, the instruction cache, vector floating-point operations, or instructions, respectively.
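For example, the following invocation runs the data cache read, data cache write, and floating-point kernels in a single pass, using the same input file and output directory as above:

./cat_collect -in event_list.txt -out OUT_DIR -dcr -dcw -flops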


Environment variables

The kernels that stress the data cache reads and writes (flags "-dcr" and "-dcw", respectively) utilize OpenMP to deploy multiple threads to increase the pressure on the memory hierarchy. The number of threads used and their placement can affect the quality of the measurements. For optimal results the user should set the following two environment variables before running the benchmark:

export OMP_PROC_BIND=close
export OMP_PLACES=cores

Regarding the number of threads, it is best to use no more than the number of physical cores in one socket and no fewer than half that number. Setting the number of OpenMP threads to 12, for example, can be done as follows:

export OMP_NUM_THREADS=12
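Putting the pieces together, a complete run of the data cache benchmarks might look like the following; the thread count of 12 is only an example and should be chosen to match the socket of the machine being measured:

export OMP_PROC_BIND=close
export OMP_PLACES=cores
export OMP_NUM_THREADS=12
./cat_collect -in event_list.txt -out OUT_DIR -dcr -dcw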

Configuration file

Users who seek more fine-grained control of the cache benchmarks can modify the configuration file .cat_cfg. This file contains lines (commented out by default) for manually setting the following attributes (a sample configuration is sketched after the list):

  • the sizes of the different levels of the cache hierarchy, e.g., L3_DCACHE_SIZE=33554432

  • the number of physical cores that share a cache level, e.g., L3_SPLIT=8

  • the number of measurements performed by the benchmark that should fall within a particular cache level, e.g., PTS_PER_L2=7
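As a minimal sketch, an edited .cat_cfg using the example values above might contain the following lines; the exact set of recognized keys and their default (commented-out) form should be checked against the .cat_cfg file shipped with the release:

L3_DCACHE_SIZE=33554432
L3_SPLIT=8
PTS_PER_L2=7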


Understanding the output

For each event specified in the input file, cat_collect will run all benchmarks specified by the corresponding flags (-branch, -dcr, -dcw, -flops, -ic, -vec, and -instr). The measurements collected for an event E1 while running benchmark B1 are stored in the file E1.B1. For example, if all benchmark flags have been specified by the user, then the measurements for the event L2_RQSTS:ALL_PF will be stored in the output directory in the following files:

L2_RQSTS:ALL_PF.branch
L2_RQSTS:ALL_PF.data.reads
L2_RQSTS:ALL_PF.data.writes
L2_RQSTS:ALL_PF.flops
L2_RQSTS:ALL_PF.icache
L2_RQSTS:ALL_PF.vec
L2_RQSTS:ALL_PF.instr

The following sections illustrate the format and meaning of the output files that will be generated by the different benchmarks of cat_collect.

Data Cache Reads/Writes

When the flag -dcr is provided, the tool runs the benchmark that stresses reading from the data caches. The benchmark is automatically run six times using different parameters for each run and the output is the concatenation of the six runs. Each of the six runs makes multiple measurements varying the read buffer size from a small size (which is expected to fit in the L1 cache) to a very large size (which is expected to exceed the largest cache). The number of measurements and the exact sizes chosen depend on the size of the caches in the target system, or the values set in the configuration file, if any. The buffer size is incremented geometrically within each cache level.

The output file (say L2_RQSTS:ALL_PF.data.reads) contains two types of lines. Lines that start with a hash mark "#" contain meta-data about the following run, and lines that contain only numbers contain the measurements from a specific run.

Below is an excerpt of example output:

# Core: 0 1 2 3
# L1:49152 L2:1310720 L3:12582912
# PTRN=3, STRIDE=64, PPB=512.000000, ThreadCount=4
10332 0.000850 0.000224 0.001233 0.000296
17377 0.000446 0.000266 0.000719 0.000587
29225 0.008044 0.001204 0.009319 0.000924
111694 0.005845 0.003518 0.006362 0.001503
253819 0.043994 0.039762 0.058863 0.016914
576790 0.098343 0.020495 0.076956 0.042899
2307160 2.567223 2.507576 2.525502 2.593909
4061117 2.786239 2.750716 2.828828 2.854764
7148474 2.700990 2.840585 2.863677 2.926115
25165824 2.789616 2.993953 2.961302 2.782289
50331648 2.913780 2.845519 2.958096 2.885143
100663296 2.960418 2.823707 2.994151 2.924065
# PTRN=3, STRIDE=64, PPB=16.000000, ThreadCount=4
10332 0.015985 0.023000 0.030157 0.012596
17377 0.000264 0.001310 0.001064 0.000571
29225 0.000282 0.000502 0.000836 0.001169
111694 0.002159 0.004715 0.002096 0.003334
253819 0.121256 0.109749 0.079567 0.084556
576790 0.025922 0.185804 0.024623 0.026075
2307160 2.015658 1.905451 2.022983 1.965422
4061117 2.069945 2.181070 2.138184 2.041927
7148474 2.238331 2.295703 2.276641 2.240004
25165824 2.266684 2.250697 2.294734 2.208849
50331648 2.265640 2.232139 2.285072 2.285199
100663296 2.277065 2.246612 2.286312 2.250087
...
  • The first line specifies that there were four threads and that they were bound to cores 0, 1, 2, and 3.
  • The second line specifies the sizes (in bytes) of the three data caches found on the system.
  • The third line specifies the parameters used in the run that generated the subsequent measurements.
  • The following 12 lines have the same format. The first column contains the size of the buffer used and the remaining columns contain the value of the event being measured (which is specified by the file name), as measured by each thread. The reported value is normalized by the number of iterations in the kernel. In other words, the event measurements are reported per-iteration.
  • The next meta-data line (the 16th line of the excerpt) specifies the parameters of a different run, the measurements of which follow. In this example, the difference between the first and second run is the value of the parameter "PPB", which specifies the number of pages per block in the pointer chain and has a direct effect on the behavior of the prefetching units of the cache.
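To post-process such a file, e.g. to produce plots like the one described below, the meta-data lines can be filtered out and the desired columns extracted. The following is a minimal sketch using standard Unix tools, assuming the example file name used earlier; it keeps only the measurement lines and prints the buffer size together with the first thread's per-iteration value:

grep -v '^#' OUT_DIR/L2_RQSTS:ALL_PF.data.reads | awk '{print $1, $2}' > thread0.dat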

The graph shown below depicts an example of data collected by the data cache read benchmark. The X axis of the graph is in log scale. To improve readability, we mark on the X axis the indices at which the buffer size exceeds the size of the three caches on that architecture.

The six subplots show the data from the six different parameter sets, and the labels in the bottom part of the graph show the parameters used. In particular:

  • Access Pattern: the first four runs perform pointer chasing using a random access pattern, while the last two perform pointer chasing using a sequential access pattern.

  • Stride: the first, second, and fifth runs use a stride of 64 bytes (i.e., each 64-byte segment in memory is accessed only once), while the third, fourth, and sixth use a stride of 128 bytes.

  • Block Size: all runs segment the buffer in blocks when creating the pointer chain, such that each block is fully traversed before the next block starts being accessed. The first and third runs create a pointer chain within a block that spans 512 pages, while the second and fourth runs use a block of 16 pages. The notion of block does not apply to the sequential access pattern.