Optimization of mat_add_f32 and mat_sub_f32: leveraging LOB to merge … #291

Merged
christophe0606 merged 1 commit into ARM-software:main from AlbertHuang-CPU:main
Jan 15, 2026

@AlbertHuang-CPU
Contributor

Writing the C code in this style lets the compiler generate assembly that uses LOB (Low Overhead Branch) instructions, which benefits speed, code size, and readability.
The changes were validated with the Benchmarks and Tests under the CMSIS-DSP/Testing/ directory, and the tests passed.
The optimization result looks good. For example, for the Tests built with Arm Clang 6.23 in an FPGA (MPS3) environment:
optimized: total cycles = 72826; Program Size: Code=66424 RO-data=244968 RW-data=28 ZI-data=2098048
original: total cycles = 73043; Program Size: Code=66456 RO-data=244968 RW-data=28 ZI-data=2098048
That is 217 fewer cycles and a Code size smaller by 32, with the data sizes unchanged.
Similar optimizations could be applied to f16, q15, mat_cmplx_mult, mat_vec_mult, etc.
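
For readers unfamiliar with the idea, here is a minimal sketch of the kind of loop shape the description refers to. It is not the code from this PR: the function names `mat_add_f32_sketch` and `mat_sub_f32_sketch` are hypothetical, and the matrices are assumed to be stored as flat arrays of `numSamples` floats. The point is only that a single down-counting loop, whose trip count is known on entry, is a form that a compiler targeting Armv8.1-M may lower to the LOB instructions (`WLS`/`DLS` with `LE`). Whether it actually does so depends on the compiler, the target, and the optimization flags.

```c
#include <stdint.h>

/* Hypothetical sketch, not the CMSIS-DSP source.
 * Element-wise add of two matrices stored as flat float arrays.
 * A single counted loop with the trip count fixed on entry is a shape
 * that a compiler may turn into a low-overhead loop on Armv8.1-M. */
void mat_add_f32_sketch(const float *pA, const float *pB, float *pDst,
                        uint32_t numSamples)
{
    for (uint32_t blkCnt = numSamples; blkCnt > 0U; blkCnt--) {
        *pDst++ = *pA++ + *pB++;
    }
}

/* Same loop shape for element-wise subtraction (also hypothetical). */
void mat_sub_f32_sketch(const float *pA, const float *pB, float *pDst,
                        uint32_t numSamples)
{
    for (uint32_t blkCnt = numSamples; blkCnt > 0U; blkCnt--) {
        *pDst++ = *pA++ - *pB++;
    }
}
```

Inspecting the generated assembly (for example with the compiler's `-S` option) is the only reliable way to confirm that a given build uses the LOB instructions for such a loop.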

@christophe0606
Contributor

@AlbertHuang-CPU This optimization is not yet applied to all CMSIS-DSP functions because there are still many cases where the compiler generates worse code.
But since it works in this case, I am merging the PR.

@christophe0606 merged commit 7bfa537 into ARM-software:main on Jan 15, 2026
11 checks passed
