Let's discuss how we can improve the performance of this library. I expect to use `arm_math.h` rather than hand-written assembly.