diff --git a/_posts/2024-11-05-perf-tools-06-perf-practice-01.md b/_posts/2024-11-05-perf-tools-06-perf-practice-01.md new file mode 100644 index 0000000..7526a89 --- /dev/null +++ b/_posts/2024-11-05-perf-tools-06-perf-practice-01.md @@ -0,0 +1,136 @@ +--- +title: perf性能分析(6) -- perf实战(1) -- 分支预测 +date: 2024-11-05 +0800 # 2022-01-01 13:14:15 +0800 只写日期也行;不写秒也行;这样也行 2022-03-09T00:55:42+08:00 +categories: [perf] +tags: [perf, vtune, tbb] # TAG names should always be lowercase + +# 以下默认false +math: true +mermaid: true +# pin: true +--- + +使用`Debug`模式编译 `sort` 和 `unsort` 代码: + +```c++ +#include +#include +#include + +int main(int argc, char* argv[]) { + constexpr int kArrLen = 1024 * 1024; + int* data = new int[kArrLen]; + + for (int c = 0; c < kArrLen; c++) + data[c] = std::rand() % 256; + + // std::sort(data, data + kArrLen); // 是否排序 + + long long sum = 0; + for (int i = 0; i < 30000; i++) { + for (int c = 0; c < kArrLen; c++) { + if (data[c] >= 128) { // 故意選 256 一半 + sum += data[c]; + } + } + } + + delete[] data; + std::cout << "sum = " << sum << std::endl; +} +``` + +## 1. sort/unsort 性能测试及比较: perf stat ## + +The CPU is `Intel(R) Core(TM) i5-8260U CPU @ 1.60GHz`. + +`perf stat ./sort` 测试结果: + +```bash +$ perf stat ./sort + +sum = 3016043160000 + + Performance counter stats for './sort': + + 63,995.28 msec task-clock # 1.000 CPUs utilized + 1,039 context-switches # 16.236 /sec + 7 cpu-migrations # 0.109 /sec + 1,151 page-faults # 17.986 /sec + 240,342,649,070 cycles # 3.756 GHz + 473,524,864,500 instructions # 1.97 insn per cycle + 63,163,134,842 branches # 986.997 M/sec + 7,402,185 branch-misses # 0.01% of all branches + + 64.008866998 seconds time elapsed + + 63.983827000 seconds user + 0.010999000 seconds sys +``` + +`perf stat ./unsort` 测试结果: + +```bash +$ perf stat ./unsort + +sum = 3016043160000 + + Performance counter stats for './unsort': + + 206,441.99 msec task-clock # 1.000 CPUs utilized + 484 context-switches # 2.344 /sec + 44 cpu-migrations # 0.213 /sec + 1,150 page-faults # 5.571 /sec + 771,936,412,315 cycles # 3.739 GHz + 473,410,337,438 instructions # 0.61 insn per cycle + 63,144,272,701 branches # 305.869 M/sec + 15,732,825,641 branch-misses # 24.92% of all branches + + 206.439026771 seconds time elapsed + + 206.440760000 seconds user + 0.002000000 seconds sys +``` + +### 1.1. Result comparison ### + +`unsort`的分支预测失败率为`24.92%`, 且`CPU IPC`为`0.61`, 小于1.0. + +`sort`分支预测失败率为`0.01%`, `CPU IPC`为`1.97`. + +## 2. analysis using `perf record` and `perf ananotate` ## + +### 2.1. program unsort ### + +运行 `perf record` 和 `perf annotate` 命令: + +```bash +perf record ./unsort # 结束后生产 perf.data 文件 +perf annotate # 自动定位到最耗时的地方 +``` + +![perf_annotate_unsort](/assets/images/perf/20241105_perf_practice_branch_prediction/perf_annotate_unsort.png) + +`if`语句比较失败跳转至循环体, 循环体占用`24.44%`的CPU时间. 原因是`分支预测失败` -- `if`语句在`for`循环中失败次数较多. + +## 2.2. program sort ## + +![perf_annotate_sort](/assets/images/perf/20241105_perf_practice_branch_prediction/perf_annotate_sort.png) + +分支预测失败率为`.051%` + `4.31%`, 且`CPU IPC`为`1.97`. + +* 内存`load`操作: `41.68%`. +* `add`指令: `10.54%`. + +## 3. When build with `Release` ## + +如果使用`-O2`或者`-O3`编译,编译器使用了`vector`相关指令(针对32位的 `pcmpgtd`),消除了`if`分支语句。其结果就是,`sort`与`unsort`的性能差距不大,甚至`sort`的性能更好。 + +## references ## + +* [在 Linux 上使用 Perf 做效能分析(入門篇)](https://tigercosmos.xyz/post/2020/08/system/perf-basic/) +* [Category:Architecture/x86-assembly](https://www.aldeid.com/wiki/Category:Architecture/x86-assembly) +* [StackOverflow: Why is processing a sorted array faster than processing an unsorted array?](https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array) +* [StackOverflow: Why is processing an unsorted array the same speed as processing a sorted array with modern x86-64 clang?](https://stackoverflow.com/questions/66521344/why-is-processing-an-unsorted-array-the-same-speed-as-processing-a-sorted-array) +* [知乎 -- 分支对代码性能的影响和优化](https://zhuanlan.zhihu.com/p/487690985) +* [Modern Microprocessors -- A 90-Minute Guide!](https://www.lighterra.com/papers/modernmicroprocessors/) diff --git a/assets/images/perf/20241105_perf_practice_branch_prediction/perf_annotate_sort.png b/assets/images/perf/20241105_perf_practice_branch_prediction/perf_annotate_sort.png new file mode 100644 index 0000000..2f4c4d5 Binary files /dev/null and b/assets/images/perf/20241105_perf_practice_branch_prediction/perf_annotate_sort.png differ diff --git a/assets/images/perf/20241105_perf_practice_branch_prediction/perf_annotate_unsort.png b/assets/images/perf/20241105_perf_practice_branch_prediction/perf_annotate_unsort.png new file mode 100644 index 0000000..79055b5 Binary files /dev/null and b/assets/images/perf/20241105_perf_practice_branch_prediction/perf_annotate_unsort.png differ