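High-Performance Computing™

Language level

C/C++ performance optimization

Python performance optimization


Concurrency and third-party high-performance libraries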

OpenMP


Interpreters/virtual machines

JVM

CPython


Compiler level

Clang/LLVM

GraalVM

References

  1. Engineering a Compiler
  2. Static Single Assignment Book
  3. Modern Compiler Implementation in Java/C/ML
  4. Parsing Techniques
  5. The Definitive ANTLR 4 Reference
  6. Advanced Virtual Machine Design and Implementation

Operating system level

References

  1. Performance Analysis Methodology by Brendan Gregg, Joyent
  2. Thinking Methodically about Performance by Brendan Gregg, Joyent

Hardware level

CPU/PMU/TMAM/VTune or PMU-TOOLS

GPU

References


Other collected resources

  1. Intel® 64 and IA-32 Architectures Optimization Reference Manual
  2. IA-32 Intel Architecture Software Developer's Manual, Volume 3: System Programming Guide
  3. Intel® VTune™ Profiler Performance Analysis Cookbook
  4. Analyzing Open vSwitch* with DPDK Bottlenecks Using Intel® VTune™ Amplifier
  5. The microarchitecture of Intel, AMD, and VIA CPUs: An optimization guide for assembly programmers and compiler makers
  6. MicroFusion in Intel CPUs
  7. Branch prediction, uOPs, and out-of-order execution
  8. How TMA* Addresses Challenges in Modern Servers and Enhancements Coming in IceLake
  9. Andi Kleen's blog - pmu-tools part I, pmu-tools, other fork
  10. Top-Down performance analysis methodology
  11. Transactional Synchronization Extensions
  12. Performance Prediction Toolkit (PPT)
  13. wiki - Linux kernel profiling with perf
  14. pprof++: A Go Profiler with Hardware Performance Monitoring
  15. CUDA C Programming Guide
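  16. CUDA C++ Best Practices Guide, Interpreting the "CUDA C Best Practices Guide"
  17. 100 days of reading the CUDA manual in English
  18. GPU acceleration in Python with Numba
  19. GPU programming with Python
  20. Numba for CUDA GPUs
  21. CUDA programming source code repository, PyCUDA
  22. A simple approach to accelerating AI models with TensorRT, using image super-resolution as an example
  23. The methodology behind C/C++ performance optimization: TMAM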
  24. BLAS (Basic Linear Algebra Subprograms)
  25. LAPACK — Linear Algebra PACKage
  26. OpenBLAS
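  27. A comprehensive analysis of NVIDIA deep learning Tensor Cores (Part 1)
  28. OpenBLAS gemm from scratch
  29. Overview of TVM (an end-to-end optimization stack)
  30. Accelerating Python on the GPU by calling CUDA through Numba: an advanced look at grid-stride loops, multiple streams, and shared memory
  31. OpenMP parallel programming basics and simple examples
  32. Zhihu - Computer Arch column: Making the CPU black box less opaque
  33. CUDA and OpenCL architecture
  34. Evolution of Intel processor architectures
  35. Computer Organization and Design: The Hardware/Software Interface, study notes (2)
  36. C++ performance optimization
  37. Introduction to C++ performance optimization techniques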