[ARM] Support FP16 post-ops fusion into ACL kernels #2067
Labels
enhancement
A feature or an optimization request
help wanted
platform:cpu-aarch64
Codeowner: @oneapi-src/onednn-cpu-aarch64
Summary
The current ACL integration prohibits fusing the first post-op into the ACL kernel when the dst data_type is FP16. The request is to enable this fusion conditionally.
Problem statement
The oneDNN post-ops fusion mechanism provides a significant performance boost by avoiding intermediate memory movement. Within the ACL integration, however, fusion is disabled for FP16 execution because of oneDNN's requirement on the precision of post-op computations (they are expected to run in FP32). Under that requirement, fusing even a single post-op into an FP16 primitive leads to multiple FP16<->FP32 data type conversions and expensive memory accesses. As a result, executing the operations separately (via separate oneDNN primitive calls) performs better than the fused version.
Preferred solution
Inside OpenVINO we simply relaxed the condition to allow FP16 post-op fusion (with FP16 internal compute) in the ACL integration. However, that solution might not be suitable for all oneDNN users because of accuracy restrictions.
Based on that, the proposal is to adopt the dnnl::accumulation_mode attribute as the trigger that selects the computational precision of post-ops. This way the desired balance between accuracy and performance can be chosen at the oneDNN user level.
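To make the proposal concrete, here is a minimal sketch of the decision it implies. It is not oneDNN or ACL code: the enums are hypothetical stand-ins named after `dnnl::accumulation_mode` values, the helper names `post_op_compute_type` and `can_fuse_first_post_op` are invented for illustration, and the choice of which modes select FP16 post-op compute is an assumption that would need to be agreed on.

```cpp
// Hypothetical stand-ins. They borrow names from oneDNN but are NOT its API.
enum class data_type { f16, f32 };
enum class accumulation_mode { strict, relaxed, any, f32, f16 };

// Assumed policy: for an FP16 destination, the accumulation mode chosen by
// the user decides whether post-ops are computed in FP16 or FP32.
constexpr data_type post_op_compute_type(data_type dst, accumulation_mode mode) {
    if (dst != data_type::f16) return data_type::f32;
    switch (mode) {
        // Assumption: these modes signal that reduced precision is acceptable.
        case accumulation_mode::relaxed:
        case accumulation_mode::any:
        case accumulation_mode::f16:
            return data_type::f16;
        // strict and f32 keep the FP32 post-op precision.
        default:
            return data_type::f32;
    }
}

// Fusing the first post-op into an FP16 kernel only pays off when the post-op
// also runs in FP16; otherwise FP16<->FP32 conversions are needed.
constexpr bool can_fuse_first_post_op(data_type dst, accumulation_mode mode) {
    return dst != data_type::f16
            || post_op_compute_type(dst, mode) == data_type::f16;
}
```

With this policy, a user who leaves the mode at `strict` keeps today's behavior (no fusion for an FP16 dst), while a user who opts into `relaxed`, `any`, or `f16` gets the fused FP16 path.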