- Engineering a Compiler
- Static Single Assignment Book
- Modern Compiler Implementation in Java/C/ML
- Parsing Techniques
- The Definitive ANTLR 4 Reference
- Advanced Virtual Machine Design and Implementation
- Performance Analysis Methodology by Brendan Gregg, Joyent
- Thinking Methodically about Performance by Brendan Gregg, Joyent
- Intel® 64 and IA-32 Architectures Optimization Reference Manual
- IA-32 Intel Architecture Software Developer's Manual, Volume 3: System Programming Guide
- Intel® VTune™ Profiler Performance Analysis Cookbook
- Analyzing Open vSwitch* with DPDK Bottlenecks Using Intel® VTune™ Amplifier
- The microarchitecture of Intel, AMD, and VIA CPUs: An optimization guide for assembly programmers and compiler makers
- MicroFusion in Intel CPUs
- Branch Prediction, uOPs, and Out-of-Order Execution
- How TMA* Addresses Challenges in Modern Servers and Enhancements Coming in IceLake
- Andi Kleen's blog - pmu-tools part I, pmu-tools, other fork
- Top-Down performance analysis methodology
- Transactional Synchronization Extensions
- Performance Prediction Toolkit (PPT)
- wiki - Linux kernel profiling with perf
- pprof++: A Go Profiler with Hardware Performance Monitoring
- CUDA C Programming Guide
- CUDA C++ Best Practices Guide, A Reading of the "CUDA C Best Practices Guide"
- 100 Days of Reading the CUDA English Manual
- GPU Acceleration in Python with Numba
- GPU Programming with Python
- Numba for CUDA GPUs
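- CUDA Programming source code repository, pyCUDA
- Exploring a Simple Approach to Accelerating AI Models with TensorRT: Image Super-Resolution as an Example
- The Methodology Behind C/C++ Performance Optimization: TMAM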
- BLAS (Basic Linear Algebra Subprograms)
- LAPACK — Linear Algebra PACKage
- OpenBLAS
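- A Comprehensive Analysis of NVIDIA Deep Learning Tensor Cores (Part 1)
- OpenBLAS GEMM from Scratch
- An Overview of TVM (an End-to-End Optimization Stack)
- Accelerating Python on the GPU by Calling CUDA through Numba: Advanced Topics in Grid-Stride Loops, Multiple Streams, and Shared Memory
- Parallel Programming: OpenMP Basics and Simple Examples
- Zhihu - Computer Arch column: Making the CPU Black Box Less Opaque
- CUDA and OpenCL Architecture
- The Evolution of Intel Processor Architecture
- Computer Organization and Design: The Hardware/Software Interface, Study Notes (Part 2)
- C++ Performance Optimization
- An Introduction to C++ Performance Optimization Techniques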