diff --git a/docs/contents/contributors.html b/docs/contents/contributors.html
index e777f368..dab10699 100644
--- a/docs/contents/contributors.html
+++ b/docs/contents/contributors.html
@@ -848,7 +848,7 @@
Operator Fusion: TensorFlow XLA performs aggressive operator fusion to produce optimized binaries for TPUs. On mobile, frameworks like NCNN also support fused operators.
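As a rough sketch of how fusion is exposed to developers, the snippet below compiles a small block with TensorFlow's XLA JIT, which can fuse the matmul, bias add, and ReLU into fewer kernels; the function name and shapes are illustrative assumptions, not taken from any particular codebase.

```python
import tensorflow as tf

# jit_compile=True routes the function through XLA, which can fuse the
# matmul, bias add, and ReLU so intermediate buffers are not materialized.
@tf.function(jit_compile=True)
def fused_block(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal([8, 128])
w = tf.random.normal([128, 64])
b = tf.zeros([64])
y = fused_block(x, w, b)  # compiled (and fused) on the first call
```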
+Hardware-Specific Code: Libraries are used to generate optimized binary code specialized for the target hardware. For example, TensorRT uses NVIDIA's CUDA/cuDNN libraries, which are hand-tuned for each GPU architecture. This hardware-specific code generation is key for performance. On TinyML devices, it can mean assembly code optimized for a specific Cortex-M4 CPU; vendors such as Arm provide optimized kernel libraries like CMSIS-NN for this purpose.
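A minimal sketch of this hardware-specific path, assuming a TensorRT 8.x-style Python API and a placeholder `model.onnx` file; the exact flags and precision choices depend on the TensorRT version and target GPU.

```python
import tensorrt as trt

# Build a hardware-specific engine from an ONNX model. TensorRT selects
# kernels backed by CUDA/cuDNN that are tuned for the GPU it is built on.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:  # placeholder model path
    parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # reduced precision if the GPU supports it
engine_bytes = builder.build_serialized_network(network, config)
```

The resulting engine is specific to the GPU it was built on, which is exactly the kind of hardware-specialized binary described above.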
Data Layout Optimizations: We can efficiently exploit the hardware's memory hierarchy (caches and registers) through techniques such as tensor/weight rearrangement, tiling, and data reuse. For example, TensorFlow XLA optimizes buffer layouts to maximize TPU utilization. This helps any memory-constrained system.
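The blocked matrix multiply below is a generic illustration of tiling for cache reuse, not code from any particular framework; the tile size of 64 is an arbitrary assumption that would be tuned per device.

```python
import numpy as np

def tiled_matmul(a, b, tile=64):
    # Blocked matrix multiply: working on tile x tile sub-blocks keeps the
    # active data small enough to stay in cache, so each loaded block is
    # reused many times before being evicted.
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                c[i:i+tile, j:j+tile] += (
                    a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile])
    return c

a = np.random.rand(256, 256).astype(np.float32)
b = np.random.rand(256, 256).astype(np.float32)
assert np.allclose(tiled_matmul(a, b), a @ b, atol=1e-3)
```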
Profiling-based Tuning: We can use profiling tools to identify bottlenecks and then, for example, adjust kernel fusion levels based on latency profiles. On mobile SoCs, vendors like Qualcomm provide profilers in SNPE to find optimization opportunities in CNNs. This data-driven approach is important for performance.
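The following is a hypothetical micro-benchmark illustrating the data-driven idea (it is not SNPE's profiler): time several candidate implementations of the same operation and keep whichever is fastest on the actual target hardware.

```python
import time
import numpy as np

def profile(fn, *args, repeats=50):
    # Crude latency profiler: warm up once, then report the median
    # wall-clock time over several runs.
    fn(*args)
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - start)
    return float(np.median(times))

x = np.random.rand(512, 512).astype(np.float32)

# Two candidate implementations of the same per-row reduction; a
# profiling-driven pipeline keeps whichever is faster on this device.
candidates = {
    "square-then-sum": lambda a: np.square(a).sum(axis=1),
    "einsum":          lambda a: np.einsum("ij,ij->i", a, a),
}
best = min(candidates, key=lambda name: profile(candidates[name], x))
print("fastest on this hardware:", best)
```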
By integrating framework models with these hardware libraries through conversion and execution pipelines, ML developers can achieve significant speedups and efficiency gains from low-level optimizations tailored to the target hardware. The tight integration between software and hardware is key to enabling performant deployment of ML applications, especially on mobile and TinyML devices.
diff --git a/docs/index.html b/docs/index.html
index c68dd2cc..df8e8aab 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -7,7 +7,7 @@
-
+