PAPI Flops


Counting Floating Point Operations on Intel Sandy Bridge and Ivy Bridge

Counters

Intel's Sandy Bridge and Ivy Bridge CPU architectures provide a rich computing environment and a comprehensive performance monitoring unit with which to measure performance. These processors support 11 hardware performance counters per core: 3 fixed counters for core cycles, reference cycles, and core instructions executed, in addition to 8 programmable counters with minimal restrictions.

That's the good news. The bad news starts to show up when you actually use these counters in real situations. Most environments run with hyperthreading enabled, which allows each core to run two simultaneous interleaved threads and, ideally, keep the functional units filled to higher capacity. Those 8 programmable counters suddenly turn into 4, since each thread must maintain its own hardware counters. Further, most environments also run with a non-maskable interrupt (NMI) timer active. This can be implemented in a variety of ways, but cannot be guaranteed NOT to use one of the four remaining counters. That leaves 3 per thread. PAPI is thus only guaranteed 3 programmable counters at a given time, in addition to the 3 fixed counters mentioned earlier. The corollary is that any single PAPI derived event can consist of at most 3 programmable terms to be assured that it can always be counted. This is generally enough for most, but not all, situations.
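As a concrete starting point, here is a minimal sketch (assuming PAPI is installed and the program is linked with -lpapi) that asks PAPI how many hardware counters it can see on the current machine:

```c
#include <stdio.h>
#include <papi.h>

int main(void)
{
    /* Initialize the PAPI library before any other PAPI call. */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI_library_init failed\n");
        return 1;
    }

    /* Counters PAPI can see on this core. With hyperthreading and an
       active NMI watchdog, fewer programmable counters may be usable
       in practice than the hardware nominally provides. */
    printf("PAPI reports %d hardware counters\n", PAPI_num_counters());
    return 0;
}
```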

Floating Point Flavors

Sandy Bridge and Ivy Bridge introduce a set of more powerful AVX assembly instructions. These are vector floating point instructions that operate on up to 256 bits of information at a time: 4 simultaneous double precision operations, or 8 parallel single precision operations. You can't guarantee all 256 bits are always in use, so counting floating point operations can be a bit tricky. Because of this and the need for backwards compatibility, these chips continue to support earlier floating point instructions and hardware as well, including 128-bit SSE instructions, MMX instructions, and even the venerable x87 instructions, in both single and double precision versions. That makes 8 different flavors of floating point, and raises the potential need for as many as 8 events to count them all.
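To make the vector widths concrete, here is a small illustrative example (not from the original benchmarks) using AVX intrinsics; a single 256-bit instruction such as vaddpd performs 4 double precision operations at once:

```c
#include <stdio.h>
#include <immintrin.h>   /* AVX intrinsics; compile with -mavx */

int main(void)
{
    /* One 256-bit AVX register holds 4 doubles (or 8 floats). */
    __m256d a = _mm256_set_pd(4.0, 3.0, 2.0, 1.0);
    __m256d b = _mm256_set1_pd(10.0);

    /* A single vaddpd instruction: 1 instruction, 4 FP operations.
       Whether all 256 bits carry useful work is up to the code. */
    __m256d c = _mm256_add_pd(a, b);

    double out[4];
    _mm256_storeu_pd(out, c);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```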

Sandy Bridge Floating Point Events

For the last several generations, one of the performance events provided by Intel to count floating point instructions has been called FP_COMP_OPS_EXE. This event name is generally associated with one or more umasks, or attributes, to further define what kinds of floating point instructions are being counted. For Sandy Bridge, the available attributes include the following:

| Attribute | Description |
| --- | --- |
| `X87` | Number of x87 uops executed |
| `SSE_FP_PACKED_DOUBLE` | Number of SSE double precision FP packed uops executed |
| `SSE_FP_SCALAR_SINGLE` | Number of SSE single precision FP scalar uops executed |
| `SSE_PACKED_SINGLE` | Number of SSE single precision FP packed uops executed |
| `SSE_SCALAR_DOUBLE` | Number of SSE double precision FP scalar uops executed |

Although in theory it should be possible to combine all five of these attributes in a single event to count all variations of x87 and SSE floating point instructions, in practice these attributes are found to interact with each other in non-linear ways and must be empirically tested before they can be combined in a single counter. Further, the PACKED versions of these instructions represent more than one floating point operation each, and so can't simply be added to produce a meaningful result.
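For reference, single-attribute native events like those above can be measured directly through PAPI's C API. The sketch below is a minimal example, assuming a PAPI 5.x release (for PAPI_add_named_event) and a kernel resembling the multiply-add used later on this page; the event is only available on CPUs that expose it:

```c
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void)
{
    int evset = PAPI_NULL;
    long long count;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);
    if (PAPI_create_eventset(&evset) != PAPI_OK)
        exit(1);

    /* Native event name with a single umask appended after a colon.
       This fails cleanly on CPUs that do not expose the event. */
    if (PAPI_add_named_event(evset, "FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE") != PAPI_OK)
        exit(1);

    /* volatile keeps the compiler from folding the arithmetic away. */
    volatile double a = 1.000001, b = 0.999999;
    double c = 0.0;

    PAPI_start(evset);
    for (long i = 0; i < 1000000; i++)
        c += a * b;                     /* 1 mul + 1 add per pass */
    PAPI_stop(evset, &count);

    printf("c = %g, SSE_SCALAR_DOUBLE uops = %lld\n", c, count);
    return 0;
}
```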

Intel engineers have verified that variations of this event count speculatively, leading to variable amounts of overcounting, depending on the algorithm. Further, as is discussed later in this article, speculative retries during resource stalls are also counted. Knowing this, it may be possible to use the excess counts as a way to monitor resource inefficiency.

To make matters more confusing, it appears that combining multiple attributes in a single counter produces a result that resembles total cycles more than combined floating point operations.

Sandy Bridge and AVX

You may have noticed that the event attributes shown above don't reference AVX instructions. That requires a separate event in another counter. The name of this event is SIMD_FP_256, and it supports two attributes: PACKED_SINGLE and PACKED_DOUBLE. As in the case of FP_COMP_OPS_EXE, these two attributes cannot be combined in practice without silently producing anomalous results.

In contrast to the situation with FP_COMP_OPS_EXE, SIMD_FP_256 counts instructions retired rather than speculative instructions executed. That's a good thing, but overcounts are still observed, because this event also counts AVX operations that are not floating point, such as register loads and stores and various logical operations. Since such data movement operations will generally be proportional to actual work for a given algorithm, these counts, while theoretically inaccurate, should still prove useful as a measure of relative code performance.

The above discussion also does not mention MMX. There are no events available on Sandy Bridge that reference MMX. One can assume that MMX operations are being processed through SSE instructions and are counted as such.

Ivy Bridge Floating Point Events

Neither FP_COMP_OPS_EXE nor SIMD_FP_256 was originally documented on Ivy Bridge. Although rumor held that these events still existed, they were not exposed through the documentation. Due to user demand (thank you), as of late 2013 Intel has exposed these events in its documentation. We support them beginning with PAPI version 5.3, released December 2013. All experimentation for this white paper was done on Sandy Bridge; we expect similar results to hold for Ivy Bridge as well.

Counting Floating Point Events on Sandy and Ivy Bridge

In order to develop a feel for counting floating point events on the Sandy and Ivy Bridge architectures, we present a series of tables below that collect a number of different events from several different computational kernels, including a multiply-add, a simple matrix multiply, and optimized GEMMs for both single and double precision. We also show results from several events with multiple attributes. Results with an error of < 5% are marked with (*); errors < 15% with (**); errors > 15% with (***). Results that look suspiciously similar to PAPI_TOT_CYC are shown in bold. All these results were collected on Sandy Bridge; similar results should be expected on Ivy Bridge.
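The exact benchmark sources are not reproduced here, but the two unoptimized kernels are assumed to look something like the following sketch:

```c
#define N_ITER 1000000L
#define N      100

/* Multiply-add: 2 FP ops per iteration, 2,000,000 ops in total.
   volatile operands keep the compiler from folding the loop away. */
double multiply_add(volatile double a, volatile double b)
{
    double c = 0.0;
    for (long i = 0; i < N_ITER; i++)
        c += a * b;
    return c;
}

/* Naive matrix multiply: 2*n^3 FP ops, 2,000,000 for n = 100. */
void matmul(const double A[N][N], const double B[N][N], double C[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}
```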

Counting Basic Arithmetic

The first two data columns are 1,000,000 iterations of `c += a*b` (2,000,000 operations expected); the last two are the naive matrix multiply with n = 100 (2n³ = 2,000,000 operations expected).

| Sandy Bridge Native Event | `c += a*b` (single) | `c += a*b` (double) | matmul (single) | matmul (double) |
| --- | --- | --- | --- | --- |
| `FP_COMP_OPS_EXE:SSE_FP_SCALAR_SINGLE` | 2696310 (\*\*\*) | 0 | 2075676 (\*) | 0 |
| `FP_COMP_OPS_EXE:SSE_PACKED_SINGLE` | 0 | 0 | 0 | 0 |
| `FP_COMP_OPS_EXE:SSE_FP_SCALAR_SINGLE:SSE_PACKED_SINGLE` | **9208463** | **9413445** | **17797481** | **17507618** |
| `FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE` | 0 | 2263614 (\*\*) | 0 | 2266045 (\*\*) |
| `FP_COMP_OPS_EXE:SSE_FP_PACKED_DOUBLE` | 0 | 0 | 0 | 0 |
| `FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE:SSE_FP_PACKED_DOUBLE` | **9239106** | **9385100** | **17715967** | **17411885** |
| `FP_COMP_OPS_EXE:X87` | 0 | 0 | 0 | 0 |
| `SIMD_FP_256:PACKED_SINGLE` | 0 | 0 | 0 | 0 |
| `SIMD_FP_256:PACKED_DOUBLE` | 0 | 0 | 0 | 0 |
| `SIMD_FP_256:PACKED_SINGLE:PACKED_DOUBLE` | **9205620** | **9417067** | **17809940** | **17409846** |

The table above illustrates unoptimized arithmetic operations. There is apparently no use of packed SSE instructions, and no evidence of x87 or AVX instructions. All the operations counted here are scalar. The double precision counts are within 15% of the theoretically expected value, while one single precision count deviates by almost 35% and the other is high by about 4%. All attempts at combining more than one unit mask, or attribute, resulted in counts that look surprisingly similar to cycle counts. This was also true for unreported attribute combinations, suggesting that attribute bits cannot be combined.

Counting Optimized GEMMs on Sandy Bridge

All four GEMMs use n = 100, so 2n³ = 2,000,000 operations are expected in each case.

| Sandy Bridge Native Event | DGEMM (SSE) | SGEMM (SSE) | DGEMM (AVX) | SGEMM (AVX) |
| --- | --- | --- | --- | --- |
| `FP_COMP_OPS_EXE:SSE_FP_SCALAR_SINGLE` | 0 | 0 | 0 | 4179 |
| `FP_COMP_OPS_EXE:SSE_PACKED_SINGLE` | 0 | 505096 | 0 | 4616 |
| `FP_COMP_OPS_EXE:SSE_FP_SCALAR_SINGLE:SSE_PACKED_SINGLE` | **569342** | **285453** | **337105** | **200464** |
| `FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE` | 0 | 0 | 0 | 0 |
| `FP_COMP_OPS_EXE:SSE_FP_PACKED_DOUBLE` | 1014991 | 0 | 0 | 0 |
| `FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE:SSE_FP_PACKED_DOUBLE` | **569020** | **284256** | **334380** | **202028** |
| `FP_COMP_OPS_EXE:X87` | 43 | 36 | 46 | 62 |
| `SIMD_FP_256:PACKED_SINGLE` | 0 | 0 | 0 | 251104 |
| `SIMD_FP_256:PACKED_DOUBLE` | 0 | 0 | 505830 | 0 |
| `SIMD_FP_256:PACKED_SINGLE:PACKED_DOUBLE` | **569112** | **285963** | **333429** | **200616** |

This table shows a pattern similar to the one in the table above. Packed single and double precision counts show up in the right places and quantities for both the SSE optimized and AVX optimized GEMMs. A small number of scalar and packed SSE operations show up in the AVX SGEMM case, possibly a result of incomplete AVX packing. A very small number of x87 instructions are also counted in each case; since these are negligible, they are ignored. As in the previous table, events with multiple attributes produce counts that are surprisingly similar to the equivalent cycle count.

PAPI Preset Definitions

From the observations in the previous two tables, it becomes clear that no single definition can encompass all variations of floating point operations on Sandy and Ivy Bridge. The table below defines PAPI Preset events that encompass a range of cases with reasonable predictability while remaining within the constraint of using three counters or fewer. PAPI_FP_INS and PAPI_FP_OPS are defined identically, to include scalar operations only. This is a significant deviation from traditional definitions of these events, because all packed instructions are ignored. PAPI_SP_OPS and PAPI_DP_OPS count single and double precision operations respectively. They each consist of three terms covering scalar SSE, packed SSE, and packed AVX, with the packed terms scaled to represent operations rather than instructions. PAPI_VEC_SP and PAPI_VEC_DP count vector operations in single and double precision using appropriately scaled SSE and AVX instruction counts.

| PRESET Event | Definition |
| --- | --- |
| PAPI_FP_INS | `FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE` + `FP_COMP_OPS_EXE:SSE_FP_SCALAR_SINGLE` |
| PAPI_FP_OPS | same as above |
| PAPI_SP_OPS | `FP_COMP_OPS_EXE:SSE_FP_SCALAR_SINGLE` + 4 × `FP_COMP_OPS_EXE:SSE_PACKED_SINGLE` + 8 × `SIMD_FP_256:PACKED_SINGLE` |
| PAPI_DP_OPS | `FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE` + 2 × `FP_COMP_OPS_EXE:SSE_FP_PACKED_DOUBLE` + 4 × `SIMD_FP_256:PACKED_DOUBLE` |
| PAPI_VEC_SP | 4 × `FP_COMP_OPS_EXE:SSE_PACKED_SINGLE` + 8 × `SIMD_FP_256:PACKED_SINGLE` |
| PAPI_VEC_DP | 2 × `FP_COMP_OPS_EXE:SSE_FP_PACKED_DOUBLE` + 4 × `SIMD_FP_256:PACKED_DOUBLE` |
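These presets can be used directly from the C API; PAPI expands a preset into the native terms shown above and applies the scaling internally. A minimal sketch, assuming a double precision multiply-add kernel like the one measured earlier (per the results below, expect the count to run somewhat high):

```c
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void)
{
    int evset = PAPI_NULL;
    long long dp_ops;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);
    if (PAPI_create_eventset(&evset) != PAPI_OK)
        exit(1);

    /* The preset expands to at most three native terms, so it should
       always be schedulable under the 3-counter guarantee. */
    if (PAPI_add_event(evset, PAPI_DP_OPS) != PAPI_OK)
        exit(1);

    volatile double a = 1.000001, b = 0.999999;
    double c = 0.0;

    PAPI_start(evset);
    for (long i = 0; i < 1000000; i++)
        c += a * b;                 /* ~2,000,000 DP ops expected */
    PAPI_stop(evset, &dp_ops);

    printf("c = %g, PAPI_DP_OPS = %lld\n", c, dp_ops);
    return 0;
}
```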

The table below shows measurements taken on a Sandy Bridge processor; similar results should be expected on Ivy Bridge. In all cases where values are reported, the numbers deviate positively from theoretical values by varying magnitudes. The majority of counts are high by < 15%, which could be attributable to speculative execution. The deviations between measured PAPI_FP_INS and PAPI_FP_OPS offer an indication of run-to-run variability, ranging from 0.2% to 8 or 9%. Highly optimized operations, such as the GEMMs, actually show the best accuracy for both SSE and AVX versions, with deviations from theoretical on the order of 1 to 2%.

As before, the first two data columns are the multiply-add kernel, the next two are the naive matrix multiply, and the last two are the optimized GEMMs.

**SSE Only**

| PRESET Event | `c += a*b` (single) | `c += a*b` (double) | matmul (single) | matmul (double) | DGEMM | SGEMM |
| --- | --- | --- | --- | --- | --- | --- |
| PAPI_FP_INS | 2261986 | 2000135 | 2257735 | 2258084 | 0 | 0 |
| PAPI_FP_OPS | 2433941 | 2261266 | 2271744 | 2254170 | 0 | 0 |
| PAPI_SP_OPS | 2261309 | 0 | 2258701 | 0 | 0 | 2020312 |
| PAPI_DP_OPS | 0 | 2668139 | 0 | 2259404 | 2030176 | 0 |
| PAPI_VEC_SP | 0 | 0 | 0 | 0 | 0 | 2020408 |
| PAPI_VEC_DP | 0 | 0 | 0 | 0 | 2030180 | 0 |
| PAPI_TOT_CYC | 9256476 | 9504398 | 16462338 | 15271566 | 567725 | 285832 |

**SSE and AVX**

| PRESET Event | `c += a*b` (single) | `c += a*b` (double) | matmul (single) | matmul (double) | DGEMM | SGEMM |
| --- | --- | --- | --- | --- | --- | --- |
| PAPI_FP_INS | 2239110 | 2130978 | 2257627 | 2256830 | 0 | 4143 |
| PAPI_FP_OPS | 2261253 | 2132367 | 2271805 | 2258158 | 0 | 4108 |
| PAPI_SP_OPS | 2757764 | 0 | 2258695 | 0 | 0 | 2031857 |
| PAPI_DP_OPS | 0 | 2261838 | 0 | 2259436 | 2023100 | 0 |
| PAPI_VEC_SP | 0 | 0 | 0 | 0 | 0 | 2028112 |
| PAPI_VEC_DP | 0 | 0 | 0 | 0 | 2023332 | 0 |
| PAPI_TOT_CYC | 9284436 | 9398770 | 16460486 | 15287080 | 332749 | 198933 |

AVX and Cache

John McCalpin at TACC has observed that, in general, Intel performance counters increment at instruction issue unless the event name specifies "retired". This can lead to overcounting if an instruction is reissued, for example, while waiting for a cache miss to be satisfied. Further experiments appear to verify this hypothesis: for the STREAM benchmark, overcount rates are directly correlated with cache miss rates, at anywhere from 2.8 to 6.5 times the theoretical flop count, depending on the operation measured. Specifically in the case of AVX floating point instructions, it appears that overcounts can be explained by this instruction re-issue phenomenon; John's STREAM tests suggest a strong correlation between overcounting and average cache latency. This also suggests an explanation for the relatively small error in the AVX DGEMM and SGEMM results, since these algorithms have been optimized to minimize cache misses, and thus retries. Once again the user caveat is that while flop counts and rates for Sandy and Ivy Bridge may be valuable as a relative proxy for code and cache efficiency, they should not be assumed to be an absolute measure of the amount of work done.
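One way to explore this correlation yourself is to measure a STREAM-style triad with both a flop preset and a cache miss preset in the same event set. This is an illustrative sketch, not John's actual experiment; the event combination may exceed the available counters on some systems, in which case the events must be measured in separate runs:

```c
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

#define N 10000000   /* large enough to stream well past the caches */

static double a[N], b[N], c[N];

int main(void)
{
    int evset = PAPI_NULL;
    long long counts[2];

    for (long i = 0; i < N; i++) { b[i] = 1.5; c[i] = 2.5; }

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);
    if (PAPI_create_eventset(&evset) != PAPI_OK)
        exit(1);

    /* Flop preset plus a cache miss preset. If the combination does
       not fit in the available counters, add them to separate event
       sets and make two runs instead. */
    if (PAPI_add_event(evset, PAPI_DP_OPS) != PAPI_OK) exit(1);
    if (PAPI_add_event(evset, PAPI_L3_TCM) != PAPI_OK) exit(1);

    PAPI_start(evset);
    for (long i = 0; i < N; i++)       /* STREAM-style triad */
        a[i] = b[i] + 3.0 * c[i];      /* 2 FP ops per element */
    PAPI_stop(evset, counts);

    /* Compare the measured count against the theoretical 2*N flops;
       the excess should track the cache miss count. */
    printf("DP_OPS = %lld (expected %lld), L3_TCM = %lld\n",
           counts[0], 2LL * N, counts[1]);
    printf("overcount ratio = %.2f\n", counts[0] / (2.0 * N));
    return 0;
}
```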

Summing Up

Sandy Bridge and Ivy Bridge are powerful processors in the Intel lineage. Both offer a wealth of opportunities for performance measurement. However, measuring the traditional standby floating point metric must be done with care. Be forewarned that although accurate measurements can be made, particularly for highly optimized code, no single PAPI metric is likely to capture all floating point operations. Remember the error bars. Some measurements will be less accurate than others, and the errors will almost always be positive (overcounting) due to speculative execution. Since speculation is likely to be proportional to the amount of floating point work done, even these inaccurate measurements should provide insight when used within the same codes.

If these numbers inspire or challenge you to make more detailed observations with this hardware, please share your conclusions with us. We'd be happy to add further insight into the above report.


Counting Floating Point Operations on Intel Haswell

As pointed out by John McCalpin at TACC, the floating point counters have been disabled in the Intel Haswell CPU architecture. On Sandy Bridge and Ivy Bridge, these counters were mainly useful for determining what kinds of floating point instructions were being executed (scalar, 128-bit vector, 256-bit vector) and in what precision (single, double) by different jobs on the system. We are waiting on Intel to provide accurate floating point counters, preferably counted at retirement to eliminate the over-counting problem that makes the counters quantitatively less useful on Sandy Bridge and Ivy Bridge.
