The binary size change of libncnn.so (bytes)
Please enable GitHub Actions in YOUR FORKED REPO to make the code-format workflow work
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #6293 +/- ##
==========================================
- Coverage 95.89% 95.59% -0.30%
==========================================
Files 837 837
Lines 264994 264997 +3
==========================================
- Hits 254105 253327 -778
- Misses 10889 11670 +781
nihui left a comment
A lot of code is duplicated in gemm_arm_asimdhp.cpp; it should be extracted into gemm_fp16sa.h so that the implementation lives in a single file.
Apple AMX requires additional macro definitions, such as __ARM_FEATURE_APPLE_AMX or __ARM_FEATURE_APPLE_AMX2.
{
    try_initialize_global_cpu_info();
#if __aarch64__ && __APPLE__
    return g_hw_cpufamily == CPUFAMILY_ARM_FIRESTORM_ICESTORM // M1
           || g_hw_cpufamily == CPUFAMILY_ARM_AVALANCHE_BLIZZARD // M2
           || g_hw_cpufamily == CPUFAMILY_ARM_IBIZA // M3
           || g_hw_cpufamily == CPUFAMILY_ARM_LOBOS // M3 Pro
           || g_hw_cpufamily == CPUFAMILY_ARM_PALMA // M3 Max
           || g_hw_cpufamily == CPUFAMILY_ARM_DONAN // M4
           || g_hw_cpufamily == CPUFAMILY_ARM_BRAVA; // M4 Pro / M4
#else
    return 0;
Discover the CPU ISA info in initialize_global_cpu_info():
hw.optional.amx_version == 2
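A minimal sketch of what such a runtime check might look like, assuming the `hw.optional.amx_version` key named in the comment is readable through `sysctlbyname`. The helper names `get_hw_optional_int` and `cpu_has_amx_version_2` are hypothetical and are not part of ncnn; on non-Apple platforms the sketch simply reports 0.

```cpp
#include <cstddef>
#if defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

// Hypothetical helper (not ncnn API): reads an integer sysctl key.
// Returns 0 when the key does not exist or the platform is not Apple.
static int get_hw_optional_int(const char* name)
{
#if defined(__APPLE__)
    int value = 0;
    size_t len = sizeof(value);
    if (sysctlbyname(name, &value, &len, NULL, 0) != 0)
        return 0;
    return value;
#else
    (void)name;
    return 0;
#endif
}

// Hypothetical wrapper for the check suggested in the review comment.
// Assumption: the key reports 2 on CPUs with the AMX version of interest.
static bool cpu_has_amx_version_2()
{
    return get_hw_optional_int("hw.optional.amx_version") == 2;
}
```

Querying the key once during global CPU info initialization and caching the result would avoid maintaining a hard-coded list of CPU families.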
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Force-pushed from 854164c to 27ce7b1
Progress:
Unfortunately, I've been too busy with my internship work to fully finish this optimization; so far only the 32x8 microkernels with packB 32 are implemented. The performance gain could be much higher once it is fully implemented.
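For readers unfamiliar with the terminology, here is a scalar reference sketch of the general idea of packing B into 32-column panels and running a fixed-size microkernel over them. It is not the PR's code: the packed layout, the reading of "32x8" as 8 rows of A by 32 columns of B, the use of `float` instead of fp16, and the names `pack_b_32` and `microkernel_8x32_ref` are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Assumed layout: B is row-major K x N. It is split into panels of 32 columns;
// each panel stores K rows of 32 contiguous values, zero-padded at the edge.
static std::vector<float> pack_b_32(const float* B, int K, int N)
{
    const int NR = 32;
    const int panels = (N + NR - 1) / NR;
    std::vector<float> packed((size_t)panels * K * NR, 0.f);
    for (int p = 0; p < panels; p++)
    {
        for (int k = 0; k < K; k++)
        {
            for (int j = 0; j < NR; j++)
            {
                const int col = p * NR + j;
                if (col < N)
                    packed[((size_t)p * K + k) * NR + j] = B[(size_t)k * N + col];
            }
        }
    }
    return packed;
}

// Scalar reference microkernel: accumulates an 8 x 32 tile of C from
// 8 rows of row-major A (leading dimension lda) and one packed B panel.
static void microkernel_8x32_ref(const float* A, int lda, const float* Bpanel,
                                 int K, float* C, int ldc)
{
    for (int k = 0; k < K; k++)
    {
        for (int i = 0; i < 8; i++)
        {
            const float a = A[(size_t)i * lda + k];
            for (int j = 0; j < 32; j++)
                C[(size_t)i * ldc + j] += a * Bpanel[(size_t)k * 32 + j];
        }
    }
}
```

The point of packing is that the inner loop reads 32 contiguous B values per step of k, which is the access pattern a wide outer-product or SIMD kernel wants; a real kernel would replace the two inner loops with hardware instructions.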
Benchmarking
test_gemm.param.zip
benchncnn.cpp:
32 layers of [dim, dim] @ [dim, dim] gemms on Apple M4:
Testing