Adjust Separable Filter 2D for performance
The following optimizations were carried out:
- Using kernel vectors instead of immediate values in vector intrinsics
- Using combined widen/multiply and widen/multiply-accumulate intrinsics (now possible due to the previous change)
- Preferring the "high" variants of these intrinsics (for NEON)
- Using a wider intermediate type (uint16_t), which avoids an extra narrowing step between the vertical and horizontal code paths
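
As a rough illustration of the points above, the following is a minimal sketch of a 3-tap vertical pass, not the actual code from this change. The function name `vertical_pass_3tap`, its signature, and the 3-tap kernel are assumptions made for the example. It assumes the kernel taps sum to at most 257, so that the result fits in uint16_t. The NEON path is only compiled on AArch64; elsewhere the scalar loop handles everything.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: multiplies three source rows by the kernel taps and
 * writes the sums as uint16_t, so no narrowing is needed before the
 * horizontal pass. Assumes k[0] + k[1] + k[2] <= 257 (no uint16_t overflow). */
static void vertical_pass_3tap(const uint8_t *r0, const uint8_t *r1,
                               const uint8_t *r2, const uint8_t k[3],
                               uint16_t *dst, size_t width)
{
    size_t x = 0;

#if defined(__aarch64__) && defined(__ARM_NEON)
    /* Kernel taps are held in vectors rather than passed as immediates,
     * which allows the widening multiply(-accumulate) intrinsics. */
    const uint8x16_t k0 = vdupq_n_u8(k[0]);
    const uint8x16_t k1 = vdupq_n_u8(k[1]);
    const uint8x16_t k2 = vdupq_n_u8(k[2]);

    for (; x + 16 <= width; x += 16) {
        const uint8x16_t a = vld1q_u8(r0 + x);
        const uint8x16_t b = vld1q_u8(r1 + x);
        const uint8x16_t c = vld1q_u8(r2 + x);

        /* Low halves: widening multiply, then widening multiply-accumulate. */
        uint16x8_t lo = vmull_u8(vget_low_u8(a), vget_low_u8(k0));
        lo = vmlal_u8(lo, vget_low_u8(b), vget_low_u8(k1));
        lo = vmlal_u8(lo, vget_low_u8(c), vget_low_u8(k2));

        /* High halves: the "high" variants read the upper 8 lanes directly. */
        uint16x8_t hi = vmull_high_u8(a, k0);
        hi = vmlal_high_u8(hi, b, k1);
        hi = vmlal_high_u8(hi, c, k2);

        vst1q_u16(dst + x, lo);
        vst1q_u16(dst + x + 8, hi);
    }
#endif

    /* Scalar tail, and the full loop on targets without the NEON path. */
    for (; x < width; ++x) {
        dst[x] = (uint16_t)(r0[x] * k[0] + r1[x] * k[1] + r2[x] * k[2]);
    }
}
```

With a kernel such as {1, 2, 1}, the largest possible sum is 255 * 4 = 1020, well within uint16_t, so the horizontal pass could consume `dst` directly.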
Edited by Igor Podgainoi