Releases: simd-everywhere/simde
v0.8.4-rc2
What's Changed since v0.8.4-rc1
- prefer NATIVE implementations over SIMDE_VECTOR_SUBSCRIPT_OPS by @mr-c in 4f11a25
- Fix the portable memory functions and test them in CI by @mr-c in #1311
- psabi warnings: always silence by @mr-c in #1332
Clang
- detect-clang: detect version 20 by @mr-c in 7c7031e
- clang: mark two bugs as being fixed in Clang 19 by @mr-c in d6b282d
GCC
- arm64 gcc FRINT: skip native call on GCC by @mr-c in #1299
- gcc 13 + 15: two bugs fixed, skip their workarounds if possible by @mr-c in #1300
- silence some warnings when in -pedantic mode by @mr-c in #1302
MSVC
loongarch implementations
- x86 sse2 for loongarch: fix GCC build failure by @jinboson in #1287
- x86 sse2: _mm_div_pd remove broken LSX implementation by @mr-c in 239188b
- x86 sse2: LSX fix for simde_x_mm_select_pd by @mr-c in c05bb81
PowerPC implementations
- x86 sse2: add powerpc*-darwin support for simde_mm_pause by @barracuda156 in #1319
X86
- sse2, sse3, avx: silence some false-positive warnings about uninitialized structs by @mr-c in 437b4f2
MMX
SSE*
AVX*
- use memcpy instead of HEDLEY_REINTERPRET_CAST by @mr-c in 40f8f99
- avx: fix native implementation of simde_mm256_permute2f128_si256 by @ethomag in #1338
- {sll,srl,sra}: clamp and test out-of-range variable shift amounts by @aqrit in 7bf1589 edb6dfd
- avx2: make simde_mm256_blend_epi16(a, b, imm8) map to native intrinsic by @ethomag in #1342
- avx2: _mm256_permute4x64_epi64: Avoid "may be used uninitialized" warning by @ethomag in 9898205
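The {sll,srl,sra} entry above concerns variable shift counts that exceed the element width. In C, shifting by the element width or more is undefined behavior, whereas the x86 instructions define the result: logical shifts produce zero and arithmetic shifts fill with the sign bit. A minimal scalar sketch of that clamping for a single 16-bit lane follows. The helper names are hypothetical and this is not SIMDe's actual code.

```c
#include <stdint.h>

/* Hypothetical helper: logical right shift of one 16-bit lane.
 * Counts above 15 yield 0 instead of invoking undefined behavior. */
static uint16_t lane_srl_u16(uint16_t value, uint64_t count) {
  if (count > 15) return 0;
  return (uint16_t) (value >> count);
}

/* Hypothetical helper: arithmetic right shift of one 16-bit lane.
 * Counts above 15 are clamped to 15, so the result is all sign bits. */
static int16_t lane_sra_i16(int16_t value, uint64_t count) {
  if (count > 15) count = 15;
  /* Right-shifting a negative value is implementation-defined in C,
   * so the sign fill is built explicitly from the complement. */
  if (value < 0)
    return (int16_t) ~(~value >> count);
  return (int16_t) (value >> count);
}
```

The same pattern extends to 32-bit and 64-bit lanes by changing the limit to 31 or 63.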
AVX512
- misc fixes for AVX512{F,VL}_NATIVE by @mr-c in #1279
- srli: add mask{,z} variants by @mr-c in #1310
- expand{,loadu}: implement more intrinsics by @mr-c in #1312
- cmpeq: finish the rest of the intrinsics by @mr-c in #1314
- slli & srli: add missing intrinsics by @mr-c in #1317
- loadu: almost all of the remaining intrinsics by @mr-c in #1320
- __mmask8 native by @mr-c in #1323
- sub: add mask{,z} variants by @michael-chuh in #1327
- storeu: add _mm256_mask_storeu_epi{8,32,64} by @michael-chuh in #1328
- storeu: fix mask storeu zeroing out original memory by @michael-chuh in #1331
- storeu: implement remaining intrinsics by @mr-c in #1334
- AVX512CD: implement remaining intrinsics by @mr-c in #1336
- cmpneq: finish the implementations; synch with cmpeq by @mr-c in 83141fd
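The mask storeu fix in this list is about leaving unselected destination elements untouched. A minimal portable sketch of that contract for eight 32-bit lanes is shown here. The function name and signature are hypothetical illustrations, not SIMDe's implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical portable masked store of eight 32-bit lanes.
 * Lane i is written only when bit i of mask is set; every other
 * destination lane keeps whatever was already in memory. */
static void mask_storeu_i32x8(void *mem_addr, uint8_t mask, const int32_t src[8]) {
  unsigned char *dst = (unsigned char *) mem_addr;
  for (size_t i = 0; i < 8; i++) {
    if ((mask >> i) & 1) {
      /* memcpy keeps the store valid for unaligned addresses. */
      memcpy(dst + i * sizeof(int32_t), &src[i], sizeof(int32_t));
    }
  }
}
```

Writing lane by lane avoids the failure mode named in the entry, where a full-width store of a zero-masked vector would overwrite memory the mask did not select.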
RISC-V
- arm neon cge: riscv64 remove broken vcge_s32 impl by @mr-c in 942df2b
- [Neon][RVV] Enable RVV when there are zve64d and zvl128b flags. by @Ruhung in #1284
- [Neon][RVV] Enable RVV segment load/store only when we have __riscv_zvlsseg flag. by @Ruhung in #1285
- [Neon][RVV] add min.h and max.h RVV implementations. by @Ruhung in #1283
- [Neon][RVV] Fallback to autovec without mrvv-vector-bits flag. by @Ruhung in #1282
- [NEON] Fix mvn.h to correctly handle RVV instructions. by @Ruhung in 1d74b5b
wasm
- remove use of non-standard _div instructions by @mr-c in 36e30c5
- __wasm_fp16__ is required for Float16 with emscripten or wasi by @mr-c in #1333
ARM
NEON
- ld1{,q_x[234]}: speedups on SSE[32] & WASM by @mr-c in #1329
- Avoid undefined behaviour with signed integer multiplication by @rathann in #1296
- Add NEON float16 multi-vectors to native aliases by @stellar-aria in bb7c56f
CI / testing
- build(deps): bump hendrikmuhs/ccache-action from 1.2.16 to 1.2.17 by @dependabot[bot] in #1278
- gh-action: test on clang 19 & 20 by @mr-c in #1303
- build(deps): bump actions/checkout from 4 to 5 by @dependabot[bot] in #1305
- build(deps): bump actions/setup-python from 5 to 6 by @dependabot[bot] in #1316
- build(deps): bump actions/setup-dotnet from 4 to 5 by @dependabot[bot] in #1315
- gh-actions: x86 upgrade to GCC-14 by @mr-c in 1995014
- emscripten64 testing: the "--experimental-wasm-memory64" flag is no longer needed by @mr-c in #1280
- gh-actions: run actionlint automatically by @mr-c in 7da3fb1
- gh-actions: make actionlint/shellcheck clean by @mr-c in 0f2aecf
- gh-actions: simplify gcc loongarch64 testing by @mr-c in c9ee1e6
- gh-actions: macos 16, undef _LIBCPP_ENABLE_ASSERTIONS by @mr-c in 51743e7
- arm neon: extend native alias testing to the vector types by @mr-c in 53f8c08
- circleci: drop loongson test, already covered better in gh-actions by @mr-c in 1dd80a3
- test.h: define __STDC_FORMAT_MACROS if not defined by @barracuda156 in #1318
- Add i686 and clang packit config by @rathann in #1297
- build(deps): bump ad-m/github-push-action from 0.8.0 to 1.0.0 by @dependabot[bot] in #1325
- ccache upgrade by @mr-c in #1324
- gh-actions emscripten: silence several warnings by @mr-c in 47513c1
- Revert "gh-actions: pin emsdk to earlier version until llvm/llvm-project#117200 is fixed and released" by @mr-c in ced7f1e
- x86 avx512: skip testing NANs with fpclass if -ffast-math or equivalent by @mr-c in 0efee69
New Contributors
- @rathann made their first contribution in #1296
- @stellar-aria made their first contribution in #1291
- @aqrit made their first contribution in #1309
- @barracuda156 made their first contribution in #1319
- @michael-chuh made their first contribution in #1327
*...
v0.8.4-rc1
SIMDe 0.8.4
Summary
TBD
Details
NEON
- avoid warnings when "__ARM_NEON_FP" is not defined. f046ab7 @clopez
- Rename ARM ROL/ROR functions with a SIMDE prefix. cb846d9 @Syonyk
- define native alias only under the inverse of the conditions of a pass-through 2b450c0 @mr-c
- cmla{_rot{90,180,270},}_lane: fix implementations with correct tests (confirmed on an ARMv8.3 system) 00ea77e @wewe5215
- crc32: define SIMDE_ARCH_ARM_CRC32 and consistently use it 01470d2 @mr-c
- qdmlal: fix saturation (#1194) cf1db25 @Ryo-not-rio
- qdmlsl: fix instructions to use saturation correctly 44a748a @Ryo-not-rio
- qdmulh: Fix vqdmulhs_s32 native alias. 403e942 @Syonyk
- qdmull: Fix SQDMULL implementation for 32-bit inputs. (#1255) 948b236 @Syonyk
- qrdmulh: Remove incorrect SSE code. 8e27139 @Syonyk
- qrshl: Fix incorrect UQRSHL implementation. 2c6adb6 @Syonyk
- qshl: Fix UQSHL to match hardware. Add extensive test vectors. (#1256) e5d5064 @Syonyk
- qshlu: Fix vqshlud_n_s64 implementation to be 64-bit. 3527e86 @Syonyk
- sli_n: Fix invalid shifts (#1253) 8067442 @Syonyk
- vminnmv_f16: remove duplicate statement (#1208) d1d9f82 @mr-c
x86 intrinsics
- avx512f: new intrinsics family: fmaddsub (#1246) 6daf535 @robinchrist
- fma: Use 128 bit fnmadd_pd to do 256 bit fnmadd_pd (#1197) bd05320 @AlexK-BD
- avx: _mm256_storeu_pd and _mm256_loadu_pd using 128 bit lanes 96054b8 @AlexK-BD
- avx: use INT64_C when the destination is i64 (#1238) 60a3a24 @jinboson
- sse4.2: Apply half tabular method in _mm_crc32 family for the best trade-off between performance and lookup table size 0f68b62 @Cuda-Chen
- sse2: move definition of 'value' to correct branch in simde_mm_loadl_epi64 b8e468a @K-os
- sse2: fix overflow error detected by clang scan-build in simde_mm_srl_epi{16,32,64} when count is too high 1a9d47f @mr-c
- some better implementations for MSVC and others without SIMDE_STATEMENT_EXPR_ 1691ae0 @mr-c
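The sse4.2 entry above mentions a "half tabular" method for the _mm_crc32 family. Assuming this refers to the common half-byte (nibble) lookup approach, the sketch below shows the idea: a 16-entry table replaces the usual 256-entry one, and each byte costs two lookups instead of one. All function names are hypothetical, and the lazy table initialization is a simplification that is not thread-safe.

```c
#include <stddef.h>
#include <stdint.h>

/* Reflected CRC-32C (Castagnoli) polynomial. */
#define CRC32C_POLY_REFLECTED UINT32_C(0x82F63B78)

/* Hypothetical helper: compute one entry of the 16-entry table by
 * running four bitwise CRC steps on a 4-bit index. */
static uint32_t crc32c_nibble_entry(uint32_t index) {
  uint32_t crc = index;
  for (int bit = 0; bit < 4; bit++) {
    crc = (crc >> 1) ^ ((crc & 1) ? CRC32C_POLY_REFLECTED : 0);
  }
  return crc;
}

/* Fold one byte into a running CRC with no initial or final inversion,
 * processing the byte as two 4-bit halves. */
static uint32_t crc32c_u8(uint32_t crc, uint8_t value) {
  static uint32_t table[16];
  static int table_ready = 0;
  if (!table_ready) {
    /* Simplification: a real implementation would use a constant table. */
    for (uint32_t i = 0; i < 16; i++) table[i] = crc32c_nibble_entry(i);
    table_ready = 1;
  }
  crc ^= value;
  crc = (crc >> 4) ^ table[crc & 0x0F];
  crc = (crc >> 4) ^ table[crc & 0x0F];
  return crc;
}

/* Standard CRC-32C of a buffer: invert on entry and on exit. */
static uint32_t crc32c_buffer(const uint8_t *data, size_t len) {
  uint32_t crc = UINT32_C(0xFFFFFFFF);
  for (size_t i = 0; i < len; i++) crc = crc32c_u8(crc, data[i]);
  return crc ^ UINT32_C(0xFFFFFFFF);
}
```

The table occupies 64 bytes rather than 1 KiB, which is the size versus speed trade-off the entry refers to.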
Arch support
Altivec
arm / arm64
- wasm: add u16x8 and u8x16 avgr NEON optimized implementations 7e65734 @wrv
- wasm simd128: fix a FAST_NANS error on arm64 a9ebb8a @mr-c
- arm neon native: FCMLA with 16-bit floats, requires the FP16 feature 4936149 @mr-c
- arm neon native: replace use of SIMDE_ARCH_ARM_CHECK(8+) with feature checks. afd77a9 @mr-c
LoongArch
- float16: use a portable version to avoid compilation errors 600050d @XiWeiGu
- x86/sse2: add lsx support b331ea2 @HecaiYuan
- x86/sse2: small fixes for loongarch d344e3c @jinboson
- x86/sse4.2: add loongarch lsx optimized implementations fa6a869 @HecaiYuan
- x86/sse4.1: add loongarch lsx optimized implementations f85ad3b @HecaiYuan
- x86/ssse3: add loongarch lsx optimized implementations 879be03 @HecaiYuan
- x86/sse3: add loongarch lsx optimized implementations 8fdc0e8 @HecaiYuan
- x86/sse: Fix type convert error for LSX. a6d4207 @yinshiyou
- x86/sse: add loongarch lsx optimized implementations 49f73d9 @HecaiYuan
- x86/avx2: add loongarch lasx optimized implementations (#1241) d62ab5a @jinboson
- x86/avx2: small fixes for loongarch 1bbb5af @jinboson
- x86/avx: add loongarch lasx optimized implementations (#1239) 5e406dc @jinboson
- x86/avx: reoptimized simde_mm256_addsub_ps/d with lasx 4242de3 @jinboson
- x86/clmul: _x_bitreverse_u64: add loongarch implementation (#1249) 866cc57 @jinboson
- x86/fma: add loongarch lasx optimized implementations d2cd71b @jinboson
- x86/f16c: add loongarch lasx optimized implementations a70fca2 @jinboso
RISCV64
- arm: improve performance in vqadd and vmvn in risc-v 17416b1 @zengdage
- arm/neon: additional RVV implementations (43 instructions) - part 1 (#1188) 6346405 @Ruhung
- arm/neon: additional RVV implementations (34 instructions) - part 2. (#1189) c903416 @wewe5215
- x86 sse2: fix _mm_pause for RISCV systems ed042d5 @mr-c
WASM
- arm neon st2: add vst2_u8 WASM optimized implementation 9aeb89e @wrv
- arm neon shll_n: add vshll WASM optimized implementations 1fdca85 @wrv
- arm neon st4: add vst4_u8 WASM optimized implementation 7f47244 @wrv
- sse2: remove redundant mm_add_pd optimized implementation for WASM (#1190) 8ee42f6 @wrv
- sse2: Wasm SIMD version of _mm_sad_epu8 bc37d4b @wrv
z/Arch
Compiler Specific
Clang
- Don't use _Float16 on s390x a1ce45c @mcatanzaro
- Don't use _Float16 on non-SSE2 x86 40f4d28 @mcatanzaro
- x86 avx512: fix clang type redef error f4daa86 @bd-jahn
GCC
- Use _Float16 in C++ on aarch64 with GCC 13+ e30e6ec @mcatanzaro
- arm neon: fix arm64 gcc11 build excess elements in vector failure d370f28 @Qingwu-Li
- arm neon: avoid vst1_*_x4 built-in functions in GCC 11 and before 557fd6d @Qingwu-Li
- arm neon sm3: gcc-14 -O3 complained about some possible uninitialized values 99ac62b @mr-c
- arm neon _vext_p6: reverse logic to avoid GCC14 i586 bug (#1251) e958b0a @mr-c
- risc64 gcc-14: Disable uninitialized variable warnings for some ARM neon SM3 functions b2ad094 @Syonyk
- simde-aes: gcc 13.2+ ignore unused variable warnings f4f5904 @mr-c
- arm neon gcc-12 FRINT workaround e5605e9 @mr-c
MSVC
Testing with Docker/Podman & CI
- meson: 0.55.1 is needed for Python 3.12+ 030c07c @mr-c
- x86/avx: Adding several overflow tests for various avx functions e8c881d @qvd808
- arm neon qdmlsl: unroll SIMDE_CONSTIFY for testing macro implemented functions 858b005 @mr-c
- native-aliases test: allow running on macos 6b6e4ef @mr-c
- arm neon abd & cvt tests: add missing import ab5c3e5 @mr-c
- Add tests for vqdmulhs_s32. f56ef45 @Syonyk
- x86 sse2: skip two extreme test cases for mm_cvtps_epi32 if SIMDE_FAST_ROUND_TIES is active. 0e6756b @mr-c
Appveyor
Circle CI
- switch container for gcc11 i686 -O2 test 56b7c7a @mr-c
- run on the primary development branch to prime the cache f0de562 @mr-c
- always save ccache cache 02cc09b 6eabe36 @mr-c
- add linux arm64 native aliases testing b036110 @mr-c
- use ccache consistently ab758b5 @mr-c
GitHub Actions
- GitHub has retired the macos-11 runners, add some more -13 (x86-64) and -14 (arm64) testing 32c959c @mr-c
- ensure that gcov is present when needed 6f52a1d @mr-c
- upgrade to Ubuntu 24.04 LTS; upgrade/add GCC 13 / clang 18 d67c190 @mr-c
- test loongson + lsx with gcc14 from Ubuntu Oracular 59bf8de @mr-c
- add CI testing for gcc 11 aarch64/arm64 4b96738 @mr-c
- upgrade gcc-qemu to gcc-14 561556c @mr-c
- test aarch64 without extra features 6686232 @mr-c
- add loongarch64 clang-18 test ac3870b @mr-c
- clean up install list 9cbeced @mr-c
- pin emsdk to earlier version until llvm/llvm-project#117200 is fixed and released 3257054 @mr-c
- upgrade Ubuntu Mantic to Ubuntu Noble (24.04) e1bc420 @mr-c
- macos: xcode 14.3.1 is no longer available, switch to macos-15 to test xcode 16.0 7035777 @mr-c
- msvc-arm64: turn off due to compiler issue 6802efa @mr-c
- macos 12: deprecated, going offline on 2024-12-03 2bb7f48 @mr-c
- update CI test for loongarch 0cf3528 @jinboson
- Add some native Linux arm64 clang builds 2f0c939 @mr-c
- aarch64 qemu testing: increase arm levels and features targeted. 067ab5d @mr-c
- Add more native Linux arm64 builds 693337a @mr-c
- more ccache 17b2cbf @mr-c
Misc
- pow: consistently use simde_math_pow 8f727c0 @mr-c
- math: typo fix, check SIMDE_MATH_NANF instead of the old-style SIMDE_NANF 40567df @mr-c
- math: Whoops, missing comma 73e43dd @Dave-Lowndes
- remove extraneous semicolons from many macro-defined functions 01f7a4f @mr-c
New Contributors
- @clopez made their first contribution in #1179
- @mcatanzaro made their first contribution in #1182
- @Ruhung made their first contribution in #1188
- @AlexK-BD made their first contribution in #1197
- @Epixu made their first contribution in #1199
- @yinshiyou made their first contribution in #1215
- @Qingwu-Li made their first contribution in #1216
- @K-os made their first contribution in #1223
- @XiWeiGu made their first contribution in #1224
- @Dave-Lowndes made their first contribution in #1233
- @bd-jahn made their first contribution in #1232
- @qvd808 made their first contribution in #1226
- @HecaiYuan made their first contribution in #1236
- @jinboson made their first contribution in #1238
- @robinchrist made their first contribution in #1246
- @Syonyk made their first contribution in #1253
- @Ryo-not-rio made their first contribution in #1195
Full Changelog: v0.8.2...v0.8.4-rc1
v0.8.2
SIMDe 0.8.2
Summary
- Start of RISCV64 optimized implementation using the RVV1.0 vector extension! Thank you @eric900115 @howjmay @zengdage
- 62 of the ARM Neon intrinsics added in SIMDe 0.8.0 had to be removed for not exactly matching the specs and real hardware (from the FCVTZS/FCVTMS/FCVTPS/FCVTNS families). This brings us down from 100% coverage of the NEON functions to 99.07%.
For the entire project: 126 files changed, 5522 insertions(+), 2772 deletions(-)
For just the simde folder: 89 files changed, 4330 insertions(+), 2199 deletions(-)
Details
Implementation of Arm intrinsics
NEON
- arm neon: disable some FCVTZS/FCVTMS/FCVTPS/FCVTNS family intrinsics 339ffe4 @mr-c
- arm neon sm3: check constant range 3d34fcd @mr-c
- arm 32 bits: native def fixes; workarounds for gcc 22900e6 @Cuda-Chen
- x86 implementations: allow _m128 access from SSE 114c3cd @mr-c
WASM intrinsics
x86 intrinsics
SVML
XOP
Arch support
arm / arm64
- arm platform: cleanup feature detection. 08c21f3 @mr-c
- arm: enable more intrinsic function for armv7 416091e @zengdage
RISCV64
- Initial Support for the RISC-V Vector Extension (RVV1.0) in ARM NEON (#1130) b4e805a @eric900115
- arm: fix some neon2rvv intrinsic function error 2a548e5 @zengdage
- arm: Add neon2rvv support in vand series intrinsics dac67f3 @howjmay
- arm: improve performance in vabd_xxx for risc-v b63ba04 @zengdage
- arm: improve performance in vhadd_xxx for risc-v a68fa90 @zengdage
Compiler Specific
Clang
- detect clang versions 18 & 19 ed4a5cd @mr-c
- arm neon clang: skip vrnd native before clang v18 e647f10 @mr-c
- apple clang arm64: ignore SHA2 be48ef8 @mr-c
Emscripten
MSVC
- x86 test msvc: really disable warning 4799,4730 487507d @mr-c
- sse2 MSVC _mm_pause implementation for x86 8d95f83 @mr-c
- SSE is good enough for native m128i and m128d types & functions 9982b27 @mr-c
Testing with Docker/Podman & CI
Cirrus CI
GitHub Actions
- test Mac arm64 0080b28 @mr-c
- macos: report log if there is a configuration failure. df3e930 @mr-c
- build(deps): bump actions/checkout from 3 to 4 (#1149) 9605608 @dependabot[bot]
- build(deps): bump codecov/codecov-action from 3 to 4 25382c1 @dependabot[bot]
- codecov: use token 2c45dd4 @mr-c
- Add gcc arm 32bit armv8-a test in CI 72bde75 @Cuda-Chen
- build for AMD Bulldozer version 2 9746537 @mr-c
Packit CI
Semaphore CI
Misc
- update list of fully implemented instruction sets (#1152) b568fcd @mr-c
- typo fixes from codespell 8639fef @mr-c
- README.md - move CLMUL to partial, list more of the CI.yml architectures 285b50d @Torinde
- Update README.md - link to VPCLMULQDQ; mention MSA (#1157) 517da84 @Torinde
- Update README.md (#1156) b88a66d @mr-c
- README: two more related projects 7429dff @mr-c
New Contributors
- @eric900115 made their first contribution in #1130
- @Cuda-Chen made their first contribution in #1116
- @Torinde made their first contribution in #1157
- @zengdage made their first contribution in #1172
- @howjmay made their first contribution in #1174
Full Changelog: v0.8.0...v0.8.2
v0.8.2-rc1
See draft release notes at https://github.com/simd-everywhere/simde/wiki/Release-Notes for changes since 0.8.0
Full Changelog: v0.8.0...v0.8.2-rc1
v0.8.0
SIMDe 0.8.0
Summary
- A complete set of implementations for all NEON intrinsics has been finished, up from 56.46% in the previous release! (@yyctw @wewe5215)
- SIMDe PRs are tested using Fedora Rawhide (@junaruga)
For the entire project: 656 files changed, 202635 insertions(+), 1724 deletions(-)
For just the simde folder: 295 files changed, 47053 insertions(+), 896 deletions(-)
X86
There are a total of 6876 SIMD functions on x86, 2930 (43.17%) of which have been implemented in SIMDe so far. Specifically for AVX-512, of the 5160 functions currently in AVX-512, SIMDe implements 1510 (29.26%).
Note: Intel has removed the intrinsics that were unique to Intel Xeon Phi (ER, PF, 4MAPS, and 4VNNIW) from their intrinsic list. SIMDe will retain those few implementations we already had, but this changes how our completeness statistics are calculated.
Newly added function families
- AES: 5 of 6 (83.33%)
Newly added AVX512 function families
- castph: 1 of 9 (11.11%) implemented.
- cvtus_storeu: 1 of 18 (5.56%) implemented.
- fpclass: 3 of 24 (12.50%) implemented.
- i32gather: 1 of 8 (12.50%) implemented.
- i64gather: 8 of 8 💯
- permutex: 3 of 12 (25.00%) implemented.
- rcp14: 1 of 24 (4.17%) implemented.
- reduce_max: 7 of 31 (22.58%) implemented.
- reduce_min: 7 of 31 (22.58%) implemented.
- shufflehi: 1 of 7 (14.29%) implemented.
- shufflelo: 1 of 7 (14.29%) implemented.
Additions to existing families
- AVX512BW: 7 additional, 337 of 790 (42.66%)
- AVX512DQ: 5 additional, 112 total of 376 (29.79%)
- AVX512F: 48 additional, 1087 total of 2812 (38.66%)
- AVX512_FP16: 15 additional, 17 total of 1105 (1.54%)
Neon
SIMDe currently implements 6670 out of 6670 (100.00%) NEON functions; up from 56.46% in the previous release!
Newly added families
- abal
- abal_high
- abd
- abdh
- abdl_high
- addhn_high
- aes
- bfdot
- bfdot_lane
- cadd_rot
- cale
- calt
- cmla_lane
- cmla_rot_lane
- copy_lane
- cvt_high
- cvt_n
- cvta
- cvtn
- cvtp
- cvtx
- cvtx_high
- div
- dupb_lane
- duph_lane
- eor3
- fmlal
- fms
- fms_lane
- fms_n
- ld2_dup
- ld2_lane
- ld3_dup
- ld3_lane
- ld4_dup
- maxnmv
- minnmv
- mla_lane
- mla_high_lane
- mls_lane
- mlsl_high_lane
- mmla
- mull_high_lane
- mull_high_n
- mulx
- mulx_lane
- pmaxnm
- pminnm
- qdmlal
- qdmlal_high
- qdmlal_high_lane
- qdmlal_high_n
- qdmlal_lane
- qdmlal_n
- qdmlsl
- qdmlsl_high
- qdmlsl_high_lane
- qdmlsl_high_n
- qdmlsl_lane
- qdmlsl_n
- qdmlslh
- qdmlslh_lane
- qdmulhh
- qdmulhh_lane
- qdmull_high
- qdmull_high_lane
- qdmull_high_n
- qdmull_lane
- qdmull_n
- qdmullh_lane
- qmovun_high
- qrdmlah
- qrdmlah_lane
- qrdmlahh
- qrdmlahh_lane
- qrdmlsh
- qrdmlsh_lane
- qrdmlshh
- qrdmlshh_lane
- qrdmulhh_lane
- qrshl
- qrshlh
- qrshrn_high_n
- qrshrnh_n
- qrshrun_high_n
- qrshrunh_n
- qshl_n
- qshlh_n
- qshluh_n
- qshrn_high_n
- qshrnh_n
- qshrun_high_n
- qshrunh_n
- raddhn
- raddhn_high
- rax
- recp
- rnd32x
- rnd32x
- rnd32x
- rnd64z
- rnda
- rndx
- rshrn_high_n
- rsubhn
- rsubhn
- set_lane
- sha1
- sha1h
- sha256
- sha512
- shll_high_n
- shrn_high_n
- sli_n
- sm3
- sm4
- sqrt
- st1_x2
- st1_x3
- st1_x4
- st1q_x2
- st1q_x3
- st1q_x4
- subhn_high
- sudot_lane
- usdot
- usdot_lane
Finally complete families
- cvtn
- mla_lane
Details
- simde-f16: improve _Float16 usage; better INFHF/NANHF defs 8910057 @mr-c
- simde_float16: prefer __fp16 if available aba26f6 @mr-c
Implementation of Arm intrinsics
NEON
- cvtn: vcvtnq_{s32_f32,s64_f64}: add SSE & AVX512 optimized implementations e134cc7 @mr-c
- cvtn: vcvtnq_u32_f32 is a V8 function 8432c70 @mr-c
- min: Remove non-working MMX specialization from simde_vmin_s16 6858b92 @M-HT
- shll: Extend constant range in simde_vshll_n_XXX intrinsics (#1064) beb1c61 @M-HT
- various: Implement some f16XN types and f16 related intrinsics. (#1071) aae2245 @yyctw
- qtbl/qtbx polyfills for A32V7 a2fef9e @easyaspi314
- arm: use SIMDE_ARCH_ARM_FMA 7198d6d @mr-c
- arm neon: Complex operations from Armv8.3-a (#1077) d08d67c @wewe5215
- more fp16 using intrinsics supported by architecture v7 (skip version) (#1081) 5e7c4d4 @yyctw
- st1{,q}_*_x{2,3,4}: initial implementation (#1082) 879d1a0 @yyctw
- part 1 of implement all intrinsics supported by architecture A64 (#1090) 2eedece @yyctw
- Add AES instructions. 23adcd2 805ccd2 @yyctw
- Modified simde_float16 to simde_float16_t (#1100) 8a05dc6 @yyctw
- implement all intrinsics supported by architecture A64-remaining part (#1093) 018ba24 @yyctw
- add enable vmlaq_laneq_f32 and vcvtq_n_f64_u64 c7d314b @yyctw
- implement all bf16-related intrinsics (#1110) c59db7c @yyctw
- arm/neon abs: negating INT_MIN is undefined behavior in C/C++ c200c16 @mr-c
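Negating INT_MIN overflows, which is undefined behavior for signed integers in C and C++. A minimal scalar sketch of an absolute value that sidesteps the overflow follows. The function name is hypothetical, and returning INT32_MIN unchanged for that input is an assumption modeled on the wrapping (non-saturating) behavior of NEON's vabs, not a description of the actual commit.

```c
#include <stdint.h>

/* Hypothetical helper: absolute value of one 32-bit lane.
 * INT32_MIN has no positive counterpart, so it is returned as-is
 * rather than negated. */
static int32_t abs_i32_no_ub(int32_t a) {
  if (a == INT32_MIN) return INT32_MIN;
  return (a < 0) ? -a : a;
}
```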
SVE Intrinsics
WASM intrinsics
- simd128: fix altivec_p7 version of wasm_f64x2_pmin 96d6e53 @mr-c
- simd128: add missing unsigned functions ea5e283 @mr-c
- simd128 f{32x4,64x2}_min: add workaround for a gcc<6 issue d5d6d10 @mr-c
- detect support for Relaxed SIMD mode 2e66dd4 @mr-c
- simd128/relaxed: begin MIPS implementations db8ad84 @mr-c
- relaxed: add f{32x4,64x2}_relaxed_{min,max} 9d1a34e @mr-c
- relaxed: updated names; reordered FMA operations 8cc8874 @mr-c
x86 intrinsics
SSE*
- sse: Fix issues related to MXCSR register (#1060) 653aba8 @M-HT
- sse: implement _mm_movelh_ps for Arm64 514564e @mr-c
- sse _mm_movemask_ps: remove unused code fba97e4 @mr-c
- sse2 mm_pause: more archs, add a basic test 692a2e8 @mr-
- sse4.1: use logical OR instead of bitwise OR in neon impl of _mm_testnzc_si128 edd4678 @mr-c
- sse4.1 _mm_testz_si128: fix backwards short circuit logic f132275 @mr-c
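For context on the _mm_testz_si128 entry: the intrinsic returns 1 when the bitwise AND of its two operands is all zeros and 0 otherwise. The sketch below shows one way a lane-wise version can short-circuit in the correct direction. The vec128 type and function name are hypothetical, and the details of the original bug are not shown in these notes.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit integer vector. */
typedef struct { uint64_t lane[2]; } vec128;

/* Returns 1 when (a AND b) has no bits set, 0 otherwise. */
static int testz_128(vec128 a, vec128 b) {
  for (int i = 0; i < 2; i++) {
    /* Early exit on the first non-zero lane: the answer is already 0.
     * Exiting early with 1 on a zero lane would be the backwards case. */
    if ((a.lane[i] & b.lane[i]) != 0) return 0;
  }
  return 1;
}
```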
AVX
- run test from #926 ce9708c @mr-c
- simde_mm256_shuffle_pd fix for natural vector size < 128 1594d7c @mr-c
AVX2
- correction of simde_mm256_sign_epi{8,16,32} (#1123) c376610 @Proudsalsa
AVX512
- fpclass: naive implementation 353bf5f @mr-c
- loadu: fix native detection 305f434 @mr-c
- set: add simde_x_mm512_set_m256{,d} 67e0c50 @mr-c
- gather: add MSVC native fallbacks 7b7e3f6 @mr-c
- AVX512FP16 / m512h initial support e97691c @mr-c
- fix many native aliases 75014b9 @mr-c
CLMUL
SVML
AES
MIPS MSA intrinsics
Arch support
x86(-64)
arm64
Altivec
Power
- sse2,wasm simd128: skip SIMDE_CONVERT_VECTOR_ implementations on PowerPC 4de999a @mr-c
- wasm simd128: more powerpc fixes 7cb5691 @mr-c
Compiler Specific
GCC
- GCC AVX512F: SIMDE_BUG_GCC_95399 was fixed in GCC 9.5, 10.4, 11.4, 12+ 3fa89c5 @mr-c
- GCC x86/x64: SIMDE_BUG_GCC_98521 was fixed in 10.3 edde42e @mr-c
- GCC x86: SIMDE_BUG_GCC_94482 was fixed in 8.5, 9.4, 10+ 43d86a3 @mr-c
- Add workaround for GCC bug 111609 fdafd8e @M-HT
- arm neon ld2: silence warnings at -O3 on gcc risc-v 8f56628 @mr-c
- avx512 abs: refine GCC compiler checks for _mm512{,_mask}_abs_pd (#1118) 5405bbd @thomas-schlichter
Clang
- clang powerpc: vec_bperm bug was fixed in clang-14 6feb28a @mr-c
- clmul: aarch64 clang has difficulties with poly64x1_t 1e1bd76 @mr-c
- aarch64: optimization bug 45541 was fixed in clang-15 7ca5712 @mr-c
- A32V7: Don't trust clang for load multiple on A32V7 927f141 @easyaspi314
- wasm: SIMDE_BUG_CLANG_60655 is fixed in the upcoming 17.0 release 25cebbe @mr-c
- simde-detect-clang.h: add clang 17 detection 923f8ac 684baa1 50d98c1 @Coeur
ClangCL
v0.8.0-rc2
See draft release notes at https://github.com/simd-everywhere/simde/wiki/Release-Notes for changes since 0.7.6
What's Changed since RC1
- WASM Relaxed SIMD updates by @mr-c in #1112
- emcc tot: set -Wno-switch-default by @mr-c in #1115
- avx512 abs: refine GCC compiler checks for _mm512{,_mask}_abs_pd by @thomas-schlichter in #1118
- correction of simde_mm256_sign_epi16(). by @Proudsalsa in #1123
- apply arm64 windows workaround only on older version msvc by @Changqing-JING in #1121
- gh-actions: add clang-17 by @mr-c in #1127
- Improve performance of simde_mm512_add_epi32 by @AymenQ in #1126
- typo: XCode -> Xcode by @Coeur in #1129
- Update simde-detect-clang.h for clang 13 detection by @Coeur in #1131
- Update simde-detect-clang.h for clang 17 detection by @Coeur in #1132
- build(deps): bump ad-m/github-push-action from 0.6.0 to 0.8.0 by @dependabot in #1134
- build(deps): bump actions/setup-dotnet from 3 to 4 by @dependabot in #1135
- build(deps): bump actions/setup-python from 4 to 5 by @dependabot in #1137
- build(deps): bump github/codeql-action from 2 to 3 by @dependabot in #1138
- GitHub Actions emscripten: use older release for now by @mr-c in #1133
- build(deps): bump actions/checkout from 3 to 4 by @dependabot in #1139
- docs: explain how to target a single test by @mr-c in #1140
- arm/neon abs: negating INT_MIN is undefined behavior by @mr-c in #1141
New Contributors
- @thomas-schlichter made their first contribution in #1118
- @Proudsalsa made their first contribution in #1123
- @Changqing-JING made their first contribution in #1121
- @AymenQ made their first contribution in #1126
- @Coeur made their first contribution in #1129
- @dependabot made their first contribution in #1134
Full Changelog: v0.8.0-rc1...v0.8.0-rc2
v0.8.0-rc1
See draft release notes at https://github.com/simd-everywhere/simde/wiki/Release-Notes
New Contributors
- @cbielow made their first contribution in #1055
- @M-HT made their first contribution in #1060
- @yyctw made their first contribution in #1071
- @Vineg made their first contribution in #1072
- @wewe5215 made their first contribution in #1077
Full Changelog: v0.7.6...v0.8.0-rc1
v0.7.6
Summary
See, I knew we should release more often!
Details
Implementation of Arm intrinsics
NEON
neon/abd,ext,cmla{,_rot{180,270,90}}: additional wasm128 implementations 3a18dff @mr-c
neon/cvtn: basic implementation of a few functions fefc785 @mr-c
neon/mla_lane: initial implementation using mla+dup 554ab18 @ngzhian
neon/shl,rshl: fix avx include to unbreak amalgamated headers 3748a9f @mr-c
neon/shll_n: make vshll_n_u32 test operational 356db0c @mr-c
neon/qabs: restore SSE2 impl for vqabsq_s8 f614843 @mr-c
x86 intrinsics
mmx: loongson impl promotions over SIMDE_SHUFFLE_VECTOR_ 51bf6f2 @mr-c
x86/sse*,avx: add additional SIMD128 implementations e28a87e @mr-c
SSE*
sse{,2,3,4.1},avx: more WASM shuffle implementations 097dd12 @mr-c
sse*,avx: add additional SIMD128 implementations e28a87e @mr-c
sse: allow native _mm_loadh_pi on MSVC x64 314452b @mr-c
AVX512
avx512: typo fix for typedef of __mmask64 e8390a3 4a9f01a @mr-c
avx512/madd: fix native alias arguments for _mm512_madd_epi16 bcf4adb @mr-c
Arch support
simde-arch: #include Hedley for setting F16C for MSVC 2022+ with AVX2 f9cf467 @mr-c
Testing with Docker/Podman & CI
tests: simde_assert_equal_{v,}f funcs were silently failing 395efd9 @mr-c
tests: Quiet another Clang < v5 warning that resurfaced d9d2b45 @mr-c
tests: audit use of HEDLEY_DIAGNOSTIC_PUSH and _POP 284c88a @mr-c
test: ignore -Wc99-extensions e264ff5 @mr-c
neon/aba: vaba_s32 test was not being run f86346a @mr-c
sve/and: the svand_n_s8_m test is incomplete, mark it as such b962f07 @mr-c
tests: combine declarations in test functions 76c7d37 @mr-c
Local testing with Docker/Podman
docker: add wasm64 target 29db539 @mr-c
Drone.io
GitHub Actions
gh-actions: confirm that all header files are installed 8d5e05a @mr-c
gh-actions: put wasm64 under CI 6702820 @mr-c
Netlify
netlify: disable for now caa0929 @mr-c
Misc
meson install: arm/neon/ld1 & x86/avx512.h 27836b1 @mr-c
Update clang version detection for 14..16 and add link 4957a9e @jan-wassenberg
v0.7.4
SIMDe 0.7.4
Summary
- Minimum meson version is now 0.54
- 40 new NEON families implemented, SVE API implementation started (14 families)
- Initial support for x86 F16C API
- Initial support for MIPS MSA API
- Initial support for Arm Scalable Vector Extensions (SVE) API
- Initial support for WASM SIMD128 API
- Initial support for the E2K (Elbrus) architecture
- MSVC has many fixes, now compiled in CI using /ARCH:AVX, /ARCH:AVX2, and /ARCH:AVX512
X86
There are a total of 7470 SIMD functions on x86, 2971 (39.77%) of which have been implemented in SIMDe so far.
Specifically for AVX-512, of the 5270 functions currently in AVX-512, SIMDe implements 1439 (27.31%)
Newly added function families
- AVX512CD: 21 of 42 (50.00%)
- AVX512VPOPCNTDQ: 18 of 18 💯
- AVX512_4VNNIW: 6 of 6 (100.00%)
- AVX512_BF16: 9 of 38 (23.68%)
- AVX512_BITALG: 24 of 24 💯
- AVX512_FP16: 2 of 1105 (0.18%)
- AVX512_VBMI2: 3 of 150 (2.00%)
- AVX512_VNNI: 36 of 36 💯
- AVX_VNNI: 8 of 16 (50.00%)
Additions to existing families
- AVX512F: 579 additional, 856 total of 2660 (31.80%)
- AVX512BW: 178 additional, 335 total of 828 (40.46%)
- AVX512DQ: 77 additional, 111 total of 399 (27.82%)
- AVX512_VBMI: 9 additional, 30 total of 30 💯
- KNCNI: 113 additional, 114 total of 595 (19.16%)
- VPCLMULQDQ: 1 additional, 2 total of 2 💯
Neon
SIMDe currently implements 3745 out of 6670 (56.15%) NEON functions. If you don't count 16-bit floats and poly types, it's 3745 / 4969 (75.37%).
Newly added families
- addhn
- bcax
- cage
- cmla
- cmla_rot90
- cmla_rot180
- cmla_rot270
- fma
- fma_lane
- fma_n
- ld2
- ld4_lane
- mlal_high_n
- mlal_lane
- mls_n
- mlsl_high_n
- mlsl_lane
- mull_lane
- qdmulh_lane
- qdmulh_n
- qrdmulh_lane
- qrshrn_n
- qrshrun_n
- qshlu_n
- qshrn_n
- qshrun_n
- recpe
- recps
- rshrn_n
- rsqrte
- rsqrts
- shll_n
- shrn_n
- sqadd
- sri_n
- st2
- st2_lane
- st3_lane
- st4_lane
- subhn
- subl_high
- xar
MSA
Overall, SIMDe implements 40 of 533 (7.50%) functions from MSA.
Details
Implementation of Arm intrinsics
NEON
- aarch64 + clang-1[345] fix for "implicit conversion changes signedness" a22c3cc @mr-c
- neon: Implement f16 types 21496f6 @Glitch18
- neon: port additional code to new style 1c744fd @nemequ
- neon: replace some more abs/labs/llabs usage with simde_math_* versions c59853a @nemequ
- neon: refactor to use different types on all targets c17957a @nemequ
- neon: test for MMX/SSE instead of x86 when choosing implementation 0366dab @nemequ
- neon/abd: add much better implementations c3ddbbe @nemequ 220db33 @ngzhian
- neon/abs: add SSE2 integer abs implementations 6396dc8 @aqrit
- neon/addhn: initial implementation e9ee066 @nemequ
- neon/add: Implement f16 functions e69239c @Glitch18
- neon/add{l,}v: SSE2/SSSE3 opts _vadd{lvq_s8, lvq_s16, lvq_u8, vq_u8} 8b4e375 dfffdde @mr-c
- neon/{add,sub}w_high: use vmovl_high instead of vmovl + get_high b897331 @nemequ
- neon/bcax: initial implementation 96ce481 0ed3dea @Glitch18
- neon/bsl: Implement f16 functions edb75b5 @Glitch18
- neon/cage: Initial f16 implementations 20df81d @Glitch18
- neon/cagt: Implement f16 functions 452a6d3 @Glitch18
- neon/ceq: Implement f16 functions f24ab3d @Glitch18
- neon/ceqz: Implement f16 functions dd2ebf2 de301cd @Glitch18
- neon/cge: Implement f16 functions a512986 f3ad0d4 647dc12 @Glitch18
- neon/cgez: complete implementation of CGEZ family 6d86a20 @Glitch18
- neon/cgt: Add implementation of remaining functions 9930c43 @Glitch18
- neon/cgt, simd128: improve some unsigned comparisons on x86 ae6702a @nemequ
- neon/cgtz: Add implementations of remaining functions 4d749b5 @Glitch18
- neon/cle: add some x86 implementations 5906cc9 d81c7e7 @nemequ 7894c7d @Glitch18
- neon/clez: Add implementations of scalar functions bc72880 @Glitch18
- neon/clt: Add implementations of scalar functions & SSE/AVX512 fallbacks bc636e1 6a19637 @Glitch18
- neon/cltz: Add scalar functions and natural vector fallbacks 2960ef0 @Glitch18
- neon/cmla, neon/cmla_rot{90,180,270}: check compiler versions e98152f @nemequ
- neon/cmla, neon/cmla_rot{90,180,270}: CMLA requires armv8.3+ 280faae @nemequ
- neon/cmla, neon/cmla_rot{90,180,270}, neon/fma: initial implementation 2aff4f9 @Glitch18
- neon/cnt: add x86 implementations of vcntq_s8 a558d6d @nemequ
- neon/cvt: add __builtin_convertvector implementations d06ea5b @nemequ
- neon/cvt: add out-of-range and NaN tests 7d0e2ac @nemequ
- neon/cvt: add some faster x86 float->int/uint conversions ceaaf13 @nemequ
- neon/cvt: Add vcvt_f32_f64 and vcvt_f64_f32 implementations 8398f73 @Glitch18
- neon/cvt: cast result of float/double comparison dc215cd @ngzhian
- neon/cvt: disable some code on 32-bit x86 which uses _mm_cvttsd_si64 48edfa9 @nemequ
- neon/cvt: don't use vec_ctsl on POWER 8f9582a @nemequ
- neon/cvt: fix a couple of s390x implementations' NaN handling a8bd33d @nemequ
- neon/cvt: fix compilation with -ffast-math d1d070d @nemequ
- neon/cvt: Implement f16 functions b6a9882 @Glitch18
- neon/cvt, relaxed-simd: add work-around for GCC bug #101614 11aa006 @nemequ
- neon/cvt, simd128: fix compiler errors on PPC 965e68e @nemequ
- neon/cvt: clang bug 46844 was fixed in clang 12.0 71e03a6 @mr-c
- neon/dot_lane: add remaining implementation 3f1c1fa 4a9ca8a @Glitch18
- neon/dup_lane: Complete implementation of function family 12fb731 df320d1 @Glitch18 014ee00 9461557 @nemequ
- neon/dup_lane: use dup_n 2b4a009 @ngzhian
- neon/dup_n: Implement f16 functions 14fdf88 @Glitch18
- neon/dup_n: replace remaining functions with dup_n implementations 27a13b0 @nemequ
- neon/dupq_lane: native and portable 893db57 @ngzhian
- neon/ext: add __builtin_shufflevector implementation de8fe89 @ngzhian
- neon/ext: add _mm_alignr_{,e}pi8 implementations 6d28f04 @nemequ
- neon/ext: clean up shuffle-based implementation f1de709 @nemequ
- neon/ext: simde_*{to,from}_m64 reqs MMX_NATIVE 13ee902 @mr-c
- neon/ext: unroll SIMDE_CONSTIFY for testing macro implemented functions 62834fa @mr-c
- neon/fma: add a couple x86 and PPC implementations 7a2860b @nemequ
- neon/fma: add more extensive feature checking e541dd1 @nemequ
- neon/fma_lane: Implement fmaq_lane functions a77e6ad 555ef3e @Glitch18
- neon/fma_n: initial implementation 06d5a62 @nemequ dab4342 @nemequ
- neon/get_high: add __builtin_shufflevector optimizations 4003afa @ngzhian
- neon/get_low: use __builtin_shufflevector if available ea3f75e @ngzhian
- neon/hadd,hsub: optimization for Wasm ebe09d8 @ngzhian
- neon/ld1: add Wasm SIMD implementation a79bc15 @ngzhian
- neon/ld1_dup: native and portable (64-bit vectors), f64 debb3c8 @ngzhian 6c71aac @Glitch18
- neon/ld1_dup: split from ld1, dup_n fallbacks, WASM implementations 4c586e0 @nemequ
- neon/ld1: Implement f16 functions 6e89a9c f26f775 @Glitch18
- neon/ld1_lane: Implement remaining functions de2de8d @Glitch18 9051a51 @ngzhian
- neon/ld1q: u8_x2, u8_x3, u8_x4 341006c @ngzhian
- neon/ld1[q]_*_x2: initial implementation cd14634 @dgazzoni
- neon/ld{2,3,4}: disable -Wmaybe-uninitialized on all recent GCC e142a59 @nemequ
- neon/ld{2,3,4}: silence false positive diagnostic on GCC 7 3f737a3 @nemequ
- neon/ld2: Implement remaining functions e68f728 @Glitch18 3b3014f @ngzhian 078bb00 @nemequ 041b1bd @mr-c
- neon/ld4_lane: native and portable implementations a973cab @ngzhian 179fb79 @Glitch18 0d1ab79 @nemequ
- neon/ld4: use conformant array parameters 723a8a8 @nemequ
- neon/ld4: work around spurious warning on clang < 10 64e9db0 @nemequ
- neon/min: add SSE2 vminq_u32 & vqsubq_u32 implementation 2cf165e 117de35 @nemequ
- neon/{min,max}nm: add some headers for -ffast-math ebe5c7d @nemequ
- neon/{min,max}nm: use simde_math_* prefixed min/max functions c1607d2 @nemequ
- neon/mlal_high_n: initial implementation d6f75fa @dgazzoni
- neon/mlal_lane: initial implementation 82e36ed 2168ca0 @nemequ
- neon/mls: add _mm_fnmadd_* implementations of vmls*_f* 70e0c20 @nemequ
- neon/mlsl_high_n: initial implementation ca1a4c3 @dgazzoni
- neon/mlsl_lane: initial implementation de78ae9 @nemequ
- neon/mls_n: initial implementation 042c6eb @nemequ
- neon/movl: improve WASM...
v0.7.4-rc3
Full Changelog: v0.7.4-rc2...v0.7.4-rc3