Very fast face detection for ARM platforms.
The code is based on dlib, with the following enhancements:
- Reduced workload: uses only 1 filter (front-looking) instead of the 5 filters in frontal_face_detector.h
- Thread-level parallelism: uses 3 threads for face detection
- SIMD: uses ARM NEON to implement dlib/simd/
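
As a rough illustration of the thread-level parallelism item, here is a minimal sketch of spreading independent scan jobs over 3 worker threads with `std::thread`. How this project actually divides the work is not described here, so the split by pyramid level is an assumption, and `scan_level` and `scan_all_levels` are hypothetical names that are not part of dlib or of this code base.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-in for running the detector on one pyramid level.
// It only returns the level index so the sketch runs without dlib.
int scan_level(int level) { return level; }

// Run scan_level over levels [0, num_levels) using num_threads workers.
// Assumption: levels are independent, so worker t takes levels
// t, t + num_threads, t + 2 * num_threads, ...
std::vector<int> scan_all_levels(int num_levels, int num_threads = 3) {
    assert(num_levels >= 0 && num_threads > 0);
    std::vector<int> results(num_levels);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&results, num_levels, num_threads, t] {
            for (int level = t; level < num_levels; level += num_threads) {
                // Each slot is written by exactly one thread, so no lock is needed.
                results[level] = scan_level(level);
            }
        });
    }
    for (auto& w : workers) w.join();  // wait for all workers before reading results
    return results;
}
```

Calling `scan_all_levels(7)` runs the seven hypothetical levels on three threads and returns one result per level. Giving each worker its own output slots avoids locking; a real detector would collect rectangles per level and merge them after the join.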