30 Sep 08:05

mr-c

0efee69

v0.8.4-rc2 Pre-release


What's Changed since v0.8.4-rc1

prefer NATIVE implementations over SIMDE_VECTOR_SUBSCRIPT_OPS by @mr-c in 4f11a25
Fix the portable memory functions and test them in CI by @mr-c in #1311
psabi warnings: always silence by @mr-c in #1332

Clang

detect-clang: detect version 20 by @mr-c in 7c7031e
clang: mark two bugs as being fixed in Clang 19 by @mr-c in d6b282d

GCC

arm64 gcc FRINT: skip native call on GCC by @mr-c in #1299
gcc 13 + 15: two bugs fixed, skip their workarounds if possible by @mr-c in #1300
silence some warnings when in -pedantic mode by @mr-c in #1302

MSVC

x86 avx: fix test for MSVC where CONSTIFY doesn't work by @mr-c in a5f746b

loongarch implementations

x86 sse2 for loongarch: fix GCC build failure by @jinboson in #1287
x86 sse2: _mm_div_pd remove broken LSX implementation by @mr-c in 239188b
x86 sse2: LSX fix for simde_x_mm_select_pd by @mr-c in c05bb81

PowerPC implementations

x86 sse2: add powerpc*-darwin support for simde_mm_pause by @barracuda156 in #1319

X86

sse2, sse3, avx: silence some false-positive warnings about uninitialized structs by @mr-c in 437b4f2

MMX

{sll,srl,sra}: clamp and test out-of-range variable shift amounts by @aqrit in 606ceb3 caf9394

SSE*

{sll,srl,sra}: clamp and test out-of-range variable shift amounts by @aqrit in cf4ec2d 11ac5d8
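The clamping entries above concern shift counts that are at least as wide as the element. On x86 a logical shift by such a count yields zero and an arithmetic shift fills the lane with the sign bit, whereas the same shift written directly in C is undefined. A minimal scalar sketch of one 16-bit lane, assuming the count arrives as a 64-bit value; the helper names are hypothetical and this is not SIMDe's actual code:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers (not SIMDe code): one 16-bit lane of a variable
 * shift, with the count clamped before any C shift is performed. */

/* Logical left shift: any count above 15 clears the lane. */
static uint16_t lane_sll16(uint16_t v, uint64_t count) {
  if (count > 15) return 0;                    /* avoid the undefined shift */
  return (uint16_t) ((uint32_t) v << count);
}

/* Logical right shift: any count above 15 clears the lane. */
static uint16_t lane_srl16(uint16_t v, uint64_t count) {
  if (count > 15) return 0;
  return (uint16_t) (v >> count);
}

/* Arithmetic right shift: counts above 15 behave like 15 (sign fill).
 * Negative inputs are shifted through their complement so the result
 * does not rely on implementation-defined signed right shifts. */
static int16_t lane_sra16(int16_t v, uint64_t count) {
  if (count > 15) count = 15;
  if (v < 0) return (int16_t) ~(~(int32_t) v >> count);
  return (int16_t) (v >> count);
}

/* Small usage example. */
static void shift_examples(void) {
  assert(lane_sll16(1, 3) == 8);
  assert(lane_sll16(1, 16) == 0);
  assert(lane_srl16(0x8000, 15) == 1);
  assert(lane_sra16(-4, 1) == -2);
  assert(lane_sra16(-32768, 64) == -1);
}
```

The same pattern extends to 32-bit and 64-bit lanes by changing the clamp limit to 31 or 63.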

AVX*

use memcpy instead of HEDLEY_REINTERPRET_CAST by @mr-c in 40f8f99
avx: fix native implementation of simde_mm256_permute2f128_si256 by @ethomag in #1338
{sll,srl,sra}: clamp and test out-of-range variable shift amounts by @aqrit in 7bf1589 edb6dfd
avx2: make simde_mm256_blend_epi16(a, b, imm8) map to native intrinsic by @ethomag in #1342
avx2: _mm256_permute4x64_epi64: Avoid "may be used uninitialized" warning by @ethomag in 9898205
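The memcpy entry in this list reflects a general C rule: reading an object through a pointer cast to an unrelated type can break strict-aliasing assumptions, while copying the bytes with memcpy is well defined and is usually compiled to a plain register move. A minimal sketch, assuming two hypothetical 256-bit container structs (these are illustration types, not SIMDe's private types) and a 32-bit IEEE 754 float:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 256-bit containers used only for this illustration. */
typedef struct { int32_t i32[8]; } vec256_i;
typedef struct { float   f32[8]; } vec256_f;

/* Reinterpret the bytes of an integer vector as floats without a
 * pointer cast, so no strict-aliasing rule is violated. */
static vec256_f cast_i_to_f(vec256_i a) {
  vec256_f r;
  assert(sizeof(r) == sizeof(a));   /* both containers must be 32 bytes */
  memcpy(&r, &a, sizeof(r));
  return r;
}
```

For example, filling every lane of the input with 0x3F800000 gives 1.0f in every lane of the result on an IEEE 754 platform.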

AVX512

misc fixes for AVX512{F,VL}_NATIVE by @mr-c in #1279
srli: add mask{,z} variants by @mr-c in #1310
expand{,loadu}: implement more intrinsics by @mr-c in #1312
cmpeq: finish the rest of the intrinsics by @mr-c in #1314
slli & srli: add missing intrinsics by @mr-c in #1317
loadu: almost all of the remaining intrinsics by @mr-c in #1320
__mmask8 native by @mr-c in #1323
sub: add mask{,z} variants by @michael-chuh in #1327
storeu: add _mm256_mask_storeu_epi{8,32,64} by @michael-chuh in #1328
storeu: fix mask storeu zeroing out original memory by @michael-chuh in #1331
storeu: implement remaining intrinsics by @mr-c in #1334
AVX512CD: implement remaining intrinsics by @mr-c in #1336
cmpneq: finish the implementations; synch with cmpeq by @mr-c in 83141fd
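The masked storeu fix in this list is about the contract of a masked store: lanes whose mask bit is clear must leave the destination memory untouched rather than overwrite it with zeros. A minimal scalar model of that contract for eight 32-bit lanes; the function name is hypothetical and this is not the SIMDe implementation:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of a masked unaligned store of eight
 * 32-bit lanes: lane i is written only when bit i of the mask is set. */
static void mask_storeu_i32x8(void *mem_addr, uint8_t k, const int32_t a[8]) {
  unsigned char *dst = (unsigned char *) mem_addr;
  assert(mem_addr != NULL);
  for (int i = 0; i < 8; i++) {
    if ((k >> i) & 1) {
      /* memcpy keeps the store valid for unaligned destinations. */
      memcpy(dst + (size_t) i * sizeof(int32_t), &a[i], sizeof(int32_t));
    }
    /* Lanes with a clear mask bit are skipped: memory keeps its old value. */
  }
}
```

With a destination pre-filled with 9 and a mask of 0x05, only lanes 0 and 2 change; every other lane still reads 9 afterwards.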

RISC-V

arm neon cge: riscv64 remove broken vcge_s32 impl by @mr-c in 942df2b
[Neon][RVV] Enable RVV when there are zve64d and zvl128b flags. by @Ruhung in #1284
[Neon][RVV] Enable RVV segment load/store only when we have __riscv_zvlsseg flag. by @Ruhung in #1285
[Neon][RVV] add min.h and max.h RVV implementations. by @Ruhung in #1283
[Neon][RVV] Fallback to autovec without mrvv-vector-bits flag. by @Ruhung in #1282
[NEON] Fix mvn.h to correctly handle RVV instructions. by @Ruhung in 1d74b5b

wasm

remove use of non-standard _div instructions by @mr-c in 36e30c5
__wasm_fp16__ is required for Float16 with emscripten or wasi by @mr-c in #1333

ARM

arm fp16 misc cleanups by @mr-c in #1335

NEON

ld1{,q_x[234]}: speedups on SSE[32] & WASM by @mr-c in #1329
Avoid undefined behaviour with signed integer multiplication by @rathann in #1296
Add NEON float16 multi-vectors to native aliases by @stellar-aria in bb7c56f
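The signed-multiplication entry in this list touches a common pitfall: SIMD integer multiplies wrap around, but overflowing a signed multiply in C is undefined behaviour. One way to get the wrapping result is to multiply as unsigned and copy the bits back. A minimal sketch, assuming `int` is 32 bits wide; the helper name is hypothetical and not taken from SIMDe:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: wrapping 32-bit multiply, as one lane of a SIMD
 * integer multiply would produce, without signed-overflow UB. */
static int32_t mul_wrap_i32(int32_t a, int32_t b) {
  uint32_t r = (uint32_t) a * (uint32_t) b;   /* unsigned wraps mod 2^32 */
  int32_t out;
  memcpy(&out, &r, sizeof(out));              /* reinterpret the bits */
  return out;
}

/* Small usage example. */
static void mul_examples(void) {
  assert(mul_wrap_i32(-3, 7) == -21);
  assert(mul_wrap_i32(INT32_MAX, 2) == -2);
  assert(mul_wrap_i32(INT32_MIN, -1) == INT32_MIN);
}
```

Narrower lanes need extra care: two `uint16_t` operands are promoted to `int` before the multiply, so they should be widened to `uint32_t` explicitly first.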

CI / testing

build(deps): bump hendrikmuhs/ccache-action from 1.2.16 to 1.2.17 by @dependabot[bot] in #1278
gh-action: test on clang 19 & 20 by @mr-c in #1303
build(deps): bump actions/checkout from 4 to 5 by @dependabot[bot] in #1305
build(deps): bump actions/setup-python from 5 to 6 by @dependabot[bot] in #1316
build(deps): bump actions/setup-dotnet from 4 to 5 by @dependabot[bot] in #1315
gh-actions: x86 upgrade to GCC-14 by @mr-c in 1995014
emscripten64 testing: the "--experimental-wasm-memory64" flag is no longer needed by @mr-c in #1280
gh-actions: run actionlint automatically by @mr-c in 7da3fb1
gh-actions: make actionlint/shellcheck clean by @mr-c in 0f2aecf
gh-actions: simplify gcc loongarch64 testing by @mr-c in c9ee1e6
gh-actions: macos 16, undef _LIBCPP_ENABLE_ASSERTIONS by @mr-c in 51743e7
arm neon: extend native alias testing to the vector types by @mr-c in 53f8c08
circleci: drop loongson test, already covered better in gh-actions by @mr-c in 1dd80a3
test.h: define __STDC_FORMAT_MACROS if not defined by @barracuda156 in #1318
Add i686 and clang packit config by @rathann in #1297
build(deps): bump ad-m/github-push-action from 0.8.0 to 1.0.0 by @dependabot[bot] in #1325
ccache upgrade by @mr-c in #1324
gh-actions emscripten: silence several warnings by @mr-c in 47513c1
Revert "gh-actions: pin emsdk to earlier version until llvm/llvm-project#117200 is fixed and released" by @mr-c in ced7f1e
x86 avx512: skip testing NANs with fpclass if -ffast-math or equivalent by @mr-c in 0efee69
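The fpclass entry at the end of this list exists because fast-math modes let the compiler assume that NaNs never occur, so tests that feed NaNs become unreliable. A minimal sketch of the idea, assuming the `__FAST_MATH__` macro that GCC and Clang define under `-ffast-math`; SIMDe has its own configuration macros, and the helper names here are hypothetical:

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Classify NaN from the bit pattern, so the answer does not depend on
 * floating-point comparisons that fast-math modes may fold away. */
static int is_nan_bits_f32(float v) {
  uint32_t u;
  memcpy(&u, &v, sizeof(u));
  /* Exponent all ones and a non-zero mantissa, sign ignored. */
  return (u & UINT32_C(0x7FFFFFFF)) > UINT32_C(0x7F800000);
}

/* Hypothetical test helper: NaN inputs are exercised only when the
 * compiler has not been told that NaNs never occur.
 * Returns 1 if the NaN cases ran, 0 if they were skipped. */
static int run_nan_cases(void) {
#if defined(__FAST_MATH__)
  return 0;
#else
  assert(is_nan_bits_f32(NAN));
  assert(!is_nan_bits_f32(INFINITY));
  assert(!is_nan_bits_f32(1.0f));
  return 1;
#endif
}
```

Finite-value cases can stay outside the guard, since they behave the same way in both modes.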

New Contributors

@rathann made their first contribution in #1296
@stellar-aria made their first contribution in #1291
@aqrit made their first contribution in #1309
@barracuda156 made their first contribution in #1319
@michael-chuh made their first contribution in #1327
*...

Contributors

mr-c, aqrit, and 8 other contributors


01 Feb 14:12

mr-c

v0.8.4-rc1

fb20a06

v0.8.4-rc1 Pre-release


SIMDe 0.8.4

Summary

TBD

Details

NEON

avoid warnings when "__ARM_NEON_FP" is not defined. f046ab7 @clopez
Rename ARM ROL/ROR functions with a SIMDE prefix. cb846d9 @Syonyk
define native alias only under the inverse of the conditions of a pass-through 2b450c0 @mr-c
cmla{_rot{90,180,270},}_lane: fix implementations with correct tests (confirmed on an ARMv8.3 system) 00ea77e @wewe5215
crc32: define SIMDE_ARCH_ARM_CRC32 and consistently use it 01470d2 @mr-c
qdmlal: fix saturation (#1194) cf1db25 @Ryo-not-rio
qdmlsl: fix instructions to use saturation correctly 44a748a @Ryo-not-rio
qdmulh: Fix vqdmulhs_s32 native alias. 403e942 @Syonyk
qdmull: Fix SQDMULL implementation for 32-bit inputs. (#1255) 948b236 @Syonyk
qrdmulh: Remove incorrect SSE code. 8e27139 @Syonyk
qrshl: Fix incorrect UQRSHL implementation. 2c6adb6 @Syonyk
qshl: Fix UQSHL to match hardware. Add extensive test vectors. (#1256) e5d5064 @Syonyk
qshlu: Fix vqshlud_n_s64 implementation to be 64-bit. 3527e86 @Syonyk
sli_n: Fix invalid shifts (#1253) 8067442 @Syonyk
vminnmv_f16: remove duplicate statement (#1208) d1d9f82 @mr-c

x86 intrinsics

avx512f: new intrinsics family: fmaddsub (#1246) 6daf535 @robinchrist
fma: Use 128 bit fnmadd_pd to do 256 bit fnmadd_pd (#1197) bd05320 @AlexK-BD
avx: _mm256_storeu_pd and _mm256_loadu_pd using 128 bit lanes 96054b8 @AlexK-BD
avx: use INT64_C when the destination is i64 (#1238) 60a3a24 @jinboson
sse4.2: Apply half tabular method in _mm_crc32 family for the best trade-off between performance and lookup table size 0f68b62 @Cuda-Chen
sse2: move definition of 'value' to correct branch in simde_mm_loadl_epi64 b8e468a @K-os
sse2: fix overflow error detected by clang scan-build in simde_mm_srl_epi{16,32,64} when count is too high 1a9d47f @mr-c
some better implementations for MSVC and others without SIMDE_STATEMENT_EXPR_ 1691ae0 @mr-c
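The sse4.2 entry in this list mentions a "half tabular" CRC method. One plausible reading, stated here as an assumption rather than a description of the actual commit, is a lookup table indexed by 4-bit nibbles: 16 entries (64 bytes) instead of the 256 entries (1 KiB) of a byte-indexed table, at the cost of two lookups per byte. A minimal sketch for CRC-32C (reflected polynomial 0x82F63B78), with hypothetical function names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 16-entry table for the reflected CRC-32C polynomial, filled on first use. */
static uint32_t crc32c_nibble_table[16];
static int crc32c_table_ready = 0;

static void crc32c_init_table(void) {
  for (uint32_t i = 0; i < 16; i++) {
    uint32_t r = i;
    for (int bit = 0; bit < 4; bit++) {
      r = (r & 1) ? (r >> 1) ^ UINT32_C(0x82F63B78) : (r >> 1);
    }
    crc32c_nibble_table[i] = r;
  }
  crc32c_table_ready = 1;
}

/* Fold one byte into a running CRC, low nibble first, with no initial
 * or final inversion (the same contract as _mm_crc32_u8). */
static uint32_t crc32c_u8(uint32_t crc, uint8_t v) {
  if (!crc32c_table_ready) crc32c_init_table();
  crc ^= v;
  crc = (crc >> 4) ^ crc32c_nibble_table[crc & 0x0F];
  crc = (crc >> 4) ^ crc32c_nibble_table[crc & 0x0F];
  return crc;
}

/* Conventional CRC-32C of a buffer: start from all ones, invert at the end. */
static uint32_t crc32c_buf(const void *data, size_t len) {
  const uint8_t *p = (const uint8_t *) data;
  uint32_t crc = UINT32_C(0xFFFFFFFF);
  assert(data != NULL || len == 0);
  for (size_t i = 0; i < len; i++) crc = crc32c_u8(crc, p[i]);
  return ~crc;
}
```

The standard check string "123456789" gives 0xE3069283 with `crc32c_buf`, which is the published CRC-32C check value.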

Arch support

Altivec

wasm: add u16x8 and u8x16 avgr AltiVec optimized implementations f9bf637 @wrv

arm / arm64

wasm: add u16x8 and u8x16 avgr NEON optimized implementations 7e65734 @wrv
wasm simd128: fix a FAST_NANS error on arm64 a9ebb8a @mr-c
arm neon native: FCMLA with 16-bit floats, requires the FP16 feature 4936149 @mr-c
arm neon native: replace use of SIMDE_ARCH_ARM_CHECK(8+) with feature checks. afd77a9 @mr-c

LoongArch

float16: use a portable version to avoid compilation errors 600050d @XiWeiGu
x86/sse2: add lsx support b331ea2 @HecaiYuan
x86/sse2: small fixes for loongarch d344e3c @jinboson
x86/sse4.2: add loongarch lsx optimized implementations fa6a869 @HecaiYuan
x86/sse4.1: add loongarch lsx optimized implementations f85ad3b @HecaiYuan
x86/ssse3: add loongarch lsx optimized implementations 879be03 @HecaiYuan
x86/sse3: add loongarch lsx optimized implementations 8fdc0e8 @HecaiYuan
x86/sse: Fix type convert error for LSX. a6d4207 @yinshiyou
x86/sse: add loongarch lsx optimized implementations 49f73d9 @HecaiYuan
x86/avx2: add loongarch lasx optimized implementations (#1241) d62ab5a @jinboson
x86/avx2: small fixes for loongarch 1bbb5af @jinboson
x86/avx: add loongarch lasx optimized implementations (#1239) 5e406dc @jinboson
x86/avx: reoptimized simde_mm256_addsub_ps/d with lasx 4242de3 @jinboson
x86/clmul: _x_bitreverse_u64: add loongarch implementation (#1249) 866cc57 @jinboson
x86/fma: add loongarch lasx optimized implementations d2cd71b @jinboson
x86/f16c: add loongarch lasx optimized implementations a70fca2 @jinboson

RISCV64

arm: improve performance in vqadd and vmvn in risc-v 17416b1 @zengdage
arm/neon: additional RVV implementations (43 instructions) - part 1 (#1188) 6346405 @Ruhung
arm/neon: additional RVV implementations (34 instructions) - part 2. (#1189) c903416 @wewe5215
x86 sse2: fix _mm_pause for RISCV systems ed042d5 @mr-c

WASM

arm neon st2: add vst2_u8 WASM optimized implementation 9aeb89e @wrv
arm neon shll_n: add vshll WASM optimized implementations 1fdca85 @wrv
arm neon st4: add vst4_u8 WASM optimized implementation 7f47244 @wrv
sse2: remove redundant mm_add_pd optimized implementation for WASM (#1190) 8ee42f6 @wrv
sse2: Wasm SIMD version of _mm_sad_epu8 bc37d4b @wrv

z/Arch

neon/cvz: stop using deprecated functions. 776d0b6 @mr-c

Compiler Specific

Clang

Don't use _Float16 on s390x a1ce45c @mcatanzaro
Don't use _Float16 on non-SSE2 x86 40f4d28 @mcatanzaro
x86 avx512: fix clang type redef error f4daa86 @bd-jahn

GCC

Use _Float16 in C++ on aarch64 with GCC 13+ e30e6ec @mcatanzaro
arm neon: fix arm64 gcc11 build excess elements in vector failure d370f28 @Qingwu-Li
arm neon: avoid vst1_*_x4 built-in functions in GCC 11 and before 557fd6d @Qingwu-Li
arm neon sm3: gcc-14 -O3 complained about some possible uninitialized values 99ac62b @mr-c
arm neon _vext_p6: reverse logic to avoid GCC14 i586 bug (#1251) e958b0a @mr-c
riscv64 gcc-14: Disable uninitialized variable warnings for some ARM neon SM3 functions b2ad094 @Syonyk
simde-aes: gcc 13.2+ ignore unused variable warnings f4f5904 @mr-c
arm neon gcc-12 FRINT workaround e5605e9 @mr-c

MSVC

add simde_MemoryBarrier to avoid including <windows.h> f47e3c5 @Epixu

Testing with Docker/Podman & CI

meson: 0.55.1 is needed for Python 3.12+ 030c07c @mr-c
x86/avx: Adding several overflow tests for various avx functions e8c881d @qvd808
arm neon qdmlsl: unroll SIMDE_CONSTIFY for testing macro implemented functions 858b005 @mr-c
native-aliases test: allow running on macos 6b6e4ef @mr-c
arm neon abd & cvt tests: add missing import ab5c3e5 @mr-c
Add tests for vqdmulhs_s32. f56ef45 @Syonyk
x86 sse2: skip two extreme test cases for mm_cvtps_epi32 if SIMDE_FAST_ROUND_TIES is active. 0e6756b @mr-c

Appveyor

stop testing with MSVC 2022 until they fix their regressions b6ea9ba @mr-c

Circle CI

switch container for gcc11 i686 -O2 test 56b7c7a @mr-c
run on the primary development branch to prime the cache f0de562 @mr-c
always save ccache cache 02cc09b 6eabe36 @mr-c
add linux arm64 native aliases testing b036110 @mr-c
use ccache consistently ab758b5 @mr-c

GitHub Actions

GitHub has retired the macos-11 runners, add some more -13 (x86-64) and -14 (arm64) testing 32c959c @mr-c
ensure that gcov is present when needed 6f52a1d @mr-c
upgrade to Ubuntu 24.04 LTS; upgrade/add GCC 13 / clang 18 d67c190 @mr-c
test loongson + lsx with gcc14 from Ubuntu Oracular 59bf8de @mr-c
add CI testing for gcc 11 aarch64/arm64 4b96738 @mr-c
upgrade gcc-qemu to gcc-14 561556c @mr-c
test aarch64 without extra features 6686232 @mr-c
add loongarch64 clang-18 test ac3870b @mr-c
clean up install list 9cbeced @mr-c
pin emsdk to earlier version until llvm/llvm-project#117200 is fixed and released 3257054 @mr-c
upgrade Ubuntu Mantic to Ubuntu Noble (24.04) e1bc420 @mr-c
macos: xcode 14.3.1 is no longer available, switch to macos-15 to test xcode 16.0 7035777 @mr-c
msvc-arm64: turn off due to compiler issue 6802efa @mr-c
macos 12: deprecated, going offline on 2024-12-03 2bb7f48 @mr-c
update CI test for loongarch 0cf3528 @jinboson
Add some native Linux arm64 clang builds 2f0c939 @mr-c
aarch64 qemu testing: increase arm levels and features targeted. 067ab5d @mr-c
Add more native Linux arm64 builds 693337a @mr-c
more ccache 17b2cbf @mr-c

Misc

pow: consistently use simde_math_pow 8f727c0 @mr-c
math: typo fix, check SIMDE_MATH_NANF instead of the old-style SIMDE_NANF 40567df @mr-c
math: Whoops, missing comma 73e43dd @Dave-Lowndes
remove extraneous semicolons from many macro-defined functions 01f7a4f @mr-c

New Contributors

@clopez made their first contribution in #1179
@mcatanzaro made their first contribution in #1182
@Ruhung made their first contribution in #1188
@AlexK-BD made their first contribution in #1197
@Epixu made their first contribution in #1199
@yinshiyou made their first contribution in #1215
@Qingwu-Li made their first contribution in #1216
@K-os made their first contribution in #1223
@XiWeiGu made their first contribution in #1224
@Dave-Lowndes made their first contribution in #1233
@bd-jahn made their first contribution in #1232
@qvd808 made their first contribution in #1226
@HecaiYuan made their first contribution in #1236
@jinboson made their first contribution in #1238
@robinchrist made their first contribution in #1246
@Syonyk made their first contribution in #1253
@Ryo-not-rio made their first contribution in #1195

Full Changelog: v0.8.2...v0.8.4-rc1

Contributors

clopez, K-os, and 20 other contributors


02 May 09:58

mr-c

v0.8.2

71fd833

v0.8.2 Latest


SIMDe 0.8.2

Summary

Start of RISCV64 optimized implementation using the RVV1.0 vector extension! Thank you @eric900115 @howjmay @zengdage
62 of the ARM Neon intrinsics added in SIMDe 0.8.0 had to be removed for not exactly matching the specs and real hardware (from the FCVTZS/FCVTMS/FCVTPS/FCVTNS families). This brings us down from 100% coverage of the NEON functions to 99.07%.

For the entire project: 126 files changed, 5522 insertions(+), 2772 deletions(-)

For just the simde folder: 89 files changed, 4330 insertions(+), 2199 deletions(-)

Details

Implementation of Arm intrinsics

NEON

arm neon: disable some FCVTZS/FCVTMS/FCVTPS/FCVTNS family intrinsics 339ffe4 @mr-c
arm neon sm3: check constant range 3d34fcd @mr-c
arm 32 bits: native def fixes; workarounds for gcc 22900e6 @Cuda-Chen
x86 implementations: allow _m128 access from SSE 114c3cd @mr-c

WASM intrinsics

wasm x86 impl: some were incorrectly marked SSE instead of SSE2 fee149a @mr-c

x86 intrinsics

SVML

SSE is good enough for native m128i and m128d types & functions 9982b27 @mr-c

XOP

fix some native functions 608200b @mr-c

Arch support

arm / arm64

arm platform: cleanup feature detection. 08c21f3 @mr-c
arm: enable more intrinsic functions for armv7 416091e @zengdage

RISCV64

Initial Support for the RISC-V Vector Extension (RVV1.0) in ARM NEON (#1130) b4e805a @eric900115
arm: fix some neon2rvv intrinsic function errors 2a548e5 @zengdage
arm: Add neon2rvv support in vand series intrinsics dac67f3 @howjmay
arm: improve performance in vabd_xxx for risc-v b63ba04 @zengdage
arm: improve performance in vhadd_xxx for risc-v a68fa90 @zengdage

Compiler Specific

Clang

detect clang versions 18 & 19 ed4a5cd @mr-c
arm neon clang: skip vrnd native before clang v18 e647f10 @mr-c
apple clang arm64: ignore SHA2 be48ef8 @mr-c

Emscripten

use __builtin_roundeven{f,} from version 3.1.43 onwards 4379740 @mr-c

MSVC

x86 test msvc: really disable warning 4799,4730 487507d @mr-c
sse2 MSVC _mm_pause implementation for x86 8d95f83 @mr-c
SSE is good enough for native m128i and m128d types & functions 9982b27 @mr-c

Testing with Docker/Podman & CI

CI: don't run twice on dependabot branches 70748cd @mr-c

GitHub Actions

test Mac arm64 0080b28 @mr-c
macos: report log if there is a configuration failure. df3e930 @mr-c
build(deps): bump actions/checkout from 3 to 4 (#1149) 9605608 @dependabot[bot]
build(deps): bump codecov/codecov-action from 3 to 4 25382c1 @dependabot[bot]
codecov: use token 2c45dd4 @mr-c
Add gcc arm 32bit armv8-a test in CI 72bde75 @Cuda-Chen
build for AMD Bulldozer version 2 9746537 @mr-c

Packit CI

Drop i386 (i686) support. (#1155) cf68aaf @junaruga

Semaphore CI

stop testing on GCC 5 & 6, clang 3.9 & 4 due to forced upgrade to Ubuntu 20.04 9982f10 @mr-c

Misc

update list of fully implemented instruction sets (#1152) b568fcd @mr-c
typo fixes from codespell 8639fef @mr-c
README.md - move CLMUL to partial, list more of the CI.yml architectures 285b50d @Torinde
Update README.md - link to VPCLMULQDQ; mention MSA (#1157) 517da84 @Torinde
Update README.md (#1156) b88a66d @mr-c
README: two more related projects 7429dff @mr-c

New Contributors

@eric900115 made their first contribution in #1130
@Cuda-Chen made their first contribution in #1116
@Torinde made their first contribution in #1157
@zengdage made their first contribution in #1172
@howjmay made their first contribution in #1174

Full Changelog: v0.8.0...v0.8.2

Contributors

junaruga, mr-c, and 6 other contributors


30 Apr 16:39

mr-c

v0.8.2-rc1

71fd833

v0.8.2-rc1 Pre-release


See draft release notes at https://github.com/simd-everywhere/simde/wiki/Release-Notes for changes since 0.8.0

Full Changelog: v0.8.0...v0.8.2-rc1


14 Mar 13:03

mr-c

v0.8.0

589c7d5

v0.8.0

SIMDe 0.8.0

Summary

A complete set of implementations for all NEON intrinsics has been finished, up from 56.46% in the previous release! (@yyctw @wewe5215)
SIMDe PRs are tested using Fedora Rawhide (@junaruga)

For the entire project: 656 files changed, 202635 insertions(+), 1724 deletions(-)

For just the simde folder: 295 files changed, 47053 insertions(+), 896 deletions(-)

There are a total of 6876 SIMD functions on x86, 2930 (43.17%) of which have been implemented in SIMDe so far. Specifically for AVX-512, of the 5160 functions currently in AVX-512, SIMDe implements 1510 (29.26%).

Note: Intel has removed the intrinsics that were unique to Intel Xeon Phi (ER, PF, 4MAPS, and 4VNNIW) from their intrinsic list. SIMDe will retain those few implementations we already had, but this changes how our completeness statistics are calculated.

Newly added function families

AES: 5 of 6 (83.33%)

Newly added AVX512 function families

castph: 1 of 9 (11.11%) implemented.
cvtus_storeu: 1 of 18 (5.56%) implemented.
fpclass: 3 of 24 (12.50%) implemented.
i32gather: 1 of 8 (12.50%) implemented.
i64gather: 8 of 8 💯
permutex: 3 of 12 (25.00%) implemented.
rcp14: 1 of 24 (4.17%) implemented.
reduce
reduce_max: 7 of 31 (22.58%) implemented.
reduce_min: 7 of 31 (22.58%) implemented.
shufflehi: 1 of 7 (14.29%) implemented.
shufflelo: 1 of 7 (14.29%) implemented.

Additions to existing families

AVX512BW: 7 additional, 337 total of 790 (42.66%)
AVX512DQ: 5 additional, 112 total of 376 (29.79%)
AVX512F: 48 additional, 1087 total of 2812 (38.66%)
AVX512_FP16: 15 additional, 17 total of 1105 (1.54%)

Neon

SIMDe currently implements 6670 out of 6670 (100.00%) NEON functions; up from 56.46% in the previous release!

Newly added families

abal
abal_high
abd
abdh
abdl_high
addhn_high
aes
bfdot
bfdot_lane
cadd_rot
cale
calt
cmla_lane
cmla_rot_lane
copy_lane
cvt_high
cvt_n
cvta
cvtn
cvtp
cvtx
cvtx_high
div
dupb_lane
duph_lane
eor3
fmlal
fms
fms_lane
fms_n
ld2_dup
ld2_lane
ld3_dup
ld3_lane
ld4_dup
maxnmv
minnmv
mla_lane
mla_high_lane
mls_lane
mlsl_high_lane
mmla
mull_high_lane
mull_high_n
mulx
mulx_lane
pmaxnm
pminnm
qdmlal
qdmlal_high
qdmlal_high_lane
qdmlal_high_n
qdmlal_lane
qdmlal_n
qdmlsl
qdmlsl_high
qdmlsl_high_lane
qdmlsl_high_n
qdmlsl_lane
qdmlsl_n
qdmlslh
qdmlslh_lane
qdmulhh
qdmulhh_lane
qdmull_high
qdmull_high_lane
qdmull_high_n
qdmull_lane
qdmull_n
qdmullh_lane
qmovun_high
qrdmlah
qrdmlah_lane
qrdmlahh
qrdmlahh_lane
qrdmlsh
qrdmlsh_lane
qrdmlshh
qrdmlshh_lane
qrdmulhh_lane
qrshl
qrshlh
qrshrn_high_n
qrshrnh_n
qrshrun_high_n
qrshrunh_n
qshl_n
qshlh_n
qshluh_n
qshrn_high_n
qshrnh_n
qshrun_high_n
qshrunh_n
raddhn
raddhn_high
rax
recp
rnd32x
rnd32x
rnd32x
rnd64z
rnda
rndx
rshrn_high_n
rsubhn
rsubhn
set_lane
sha1
sha1h
sha256
sha512
shll_high_n
shrn_high_n
sli_n
sm3
sm4
sqrt
st1_x2
st1_x3
st1_x4
st1q_x2
st1q_x3
st1q_x4
subhn_high
sudot_lane
usdot
usdot_lane

Finally complete families

cvtn
mla_lane

Details

simde-f16: improve _Float16 usage; better INFHF/NANHF defs 8910057 @mr-c
simde_float16: prefer __fp16 if available aba26f6 @mr-c

Implementation of Arm intrinsics

NEON

cvtn: vcvtnq_{s32_f32,s64_f64}: add SSE & AVX512 optimized implementations e134cc7 @mr-c
cvtn: vcvtnq_u32_f32 is a V8 function 8432c70 @mr-c
min: Remove non-working MMX specialization from simde_vmin_s16 6858b92 @M-HT
shll: Extend constant range in simde_vshll_n_XXX intrinsics (#1064) beb1c61 @M-HT
various: Implement some f16XN types and f16 related intrinsics. (#1071) aae2245 @yyctw
qtbl/qtbx polyfills for A32V7 a2fef9e @easyaspi314
arm: use SIMDE_ARCH_ARM_FMA 7198d6d @mr-c
arm neon: Complex operations from Armv8.3-a (#1077) d08d67c @wewe5215
more fp16 using intrinsics supported by architecture v7 (skip version) (#1081) 5e7c4d4 @yyctw
st1{,q}_*_x{2,3,4}: initial implementation (#1082) 879d1a0 @yyctw
part 1 of implement all intrinsics supported by architecture A64 (#1090) 2eedece @yyctw
Add AES instructions. 23adcd2 805ccd2 @yyctw
Modified simde_float16 to simde_float16_t (#1100) 8a05dc6 @yyctw
implement all intrinsics supported by architecture A64-remaining part (#1093) 018ba24 @yyctw
add enable vmlaq_laneq_f32 and vcvtq_n_f64_u64 c7d314b @yyctw
implement all bf16-related intrinsics (#1110) c59db7c @yyctw
arm/neon abs: negating INT_MIN is undefined behavior in C/C++ c200c16 @mr-c
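The last entry in this list refers to a C rule: `-x` overflows when `x` is the most negative value of its type, which is undefined behaviour. The non-saturating NEON absolute value returns the input unchanged in that case, while the saturating form (the qabs family) clamps to the maximum. A minimal scalar sketch of both behaviours for one 32-bit lane; the helper names are hypothetical and this is not SIMDe's code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Wrapping absolute value: INT32_MIN maps to itself, with no signed
 * negation involved. */
static int32_t abs_wrap_i32(int32_t v) {
  uint32_t u = (uint32_t) v;         /* well-defined modular conversion */
  int32_t out;
  if (v < 0) u = UINT32_C(0) - u;    /* unsigned negation cannot overflow */
  memcpy(&out, &u, sizeof(out));     /* 0x80000000 maps back to INT32_MIN */
  return out;
}

/* Saturating absolute value: INT32_MIN clamps to INT32_MAX. */
static int32_t abs_sat_i32(int32_t v) {
  if (v == INT32_MIN) return INT32_MAX;
  return (v < 0) ? -v : v;
}

/* Small usage example. */
static void abs_examples(void) {
  assert(abs_wrap_i32(-5) == 5);
  assert(abs_wrap_i32(INT32_MIN) == INT32_MIN);
  assert(abs_sat_i32(INT32_MIN) == INT32_MAX);
}
```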

SVE Intrinsics

Improve performance of simde_mm512_add_epi32 (#1126) 6cde31c @AymenQ

WASM intrinsics

simd128: fix altivec_p7 version of wasm_f64x2_pmin 96d6e53 @mr-c
simd128: add missing unsigned functions ea5e283 @mr-c
simd128 f{32x4,64x2}_min: add workaround for a gcc<6 issue d5d6d10 @mr-c
detect support for Relaxed SIMD mode 2e66dd4 @mr-c
simd128/relaxed: begin MIPS implementations db8ad84 @mr-c
relaxed: add f{32x4,64x2}_relaxed_{min,max} 9d1a34e @mr-c
relaxed: updated names; reordered FMA operations 8cc8874 @mr-c

x86 intrinsics

sse{,2,4.1}, avx{,2} *_stream_{,load}: use __builtin_nontemporal_{load,store} 6ce6030 @mr-c

SSE*

sse: Fix issues related to MXCSR register (#1060) 653aba8 @M-HT
sse: implement _mm_movelh_ps for Arm64 514564e @mr-c
sse _mm_movemask_ps: remove unused code fba97e4 @mr-c
sse2 mm_pause: more archs, add a basic test 692a2e8 @mr-c
sse4.1: use logical OR instead of bitwise OR in neon impl of _mm_testnzc_si128 edd4678 @mr-c
sse4.1 _mm_testz_si128: fix backwards short circuit logic f132275 @mr-c
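The two sse4.1 fixes at the end of this list concern intrinsics that reduce a whole 128-bit comparison to a single 0 or 1. A scalar model of the documented semantics can help when reviewing such code. The sketch below treats the vector as two 64-bit halves; the type and function names are hypothetical, and it does not reproduce the NEON code that was fixed:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit value as two 64-bit halves. */
typedef struct { uint64_t u64[2]; } v128;

/* _mm_testz_si128 semantics: 1 when (a AND b) is all zeros. */
static int testz_v128(v128 a, v128 b) {
  return ((a.u64[0] & b.u64[0]) | (a.u64[1] & b.u64[1])) == 0;
}

/* _mm_testnzc_si128 semantics: 1 when both (a AND b) and
 * ((NOT a) AND b) contain at least one set bit. */
static int testnzc_v128(v128 a, v128 b) {
  uint64_t and_bits    = (a.u64[0] & b.u64[0]) | (a.u64[1] & b.u64[1]);
  uint64_t andnot_bits = (~a.u64[0] & b.u64[0]) | (~a.u64[1] & b.u64[1]);
  /* Each side is reduced to a boolean first, then combined logically. */
  return (and_bits != 0) && (andnot_bits != 0);
}

/* Small usage example. */
static void ptest_examples(void) {
  v128 a  = { { UINT64_C(0x0F), 0 } };
  v128 b1 = { { UINT64_C(0xF0), 0 } };
  v128 b2 = { { UINT64_C(0xFF), 0 } };
  assert(testz_v128(a, b1) == 1);
  assert(testnzc_v128(a, b1) == 0);
  assert(testz_v128(a, b2) == 0);
  assert(testnzc_v128(a, b2) == 1);
}
```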

AVX

run test from #926 ce9708c @mr-c
simde_mm256_shuffle_pd fix for natural vector size < 128 1594d7c @mr-c

AVX2

correction of simde_mm256_sign_epi{8,16,32} (#1123) c376610 @Proudsalsa

AVX512

fpclass: naive implementation 353bf5f @mr-c
loadu: fix native detection 305f434 @mr-c
set: add simde_x_mm512_set_m256{,d} 67e0c50 @mr-c
gather: add MSVC native fallbacks 7b7e3f6 @mr-c
AVX512FP16 / m512h initial support e97691c @mr-c
fix many native aliases 75014b9 @mr-c

CLMUL

fix natives, some require VPCLMULQDQ f819c52 @mr-c

SVML

enable SIMDE_X86_SVML_NATIVE for MSVC 2019+ 593af95 @mr-c

AES

aes: initial implementation of most aes instructions (#1072) 8632391 @Vineg

MIPS MSA intrinsics

msa neon impl: float64x2_t is not avail in A32V7 ae4c4ab @mr-c

Arch support

x86(-64)

fix SIMDE_ARCH_X86_SSE4_2 define 5e4b308 @cbielow

arm64

x86 aes: add neon implementation using the crypto extension fb3554f @mr-c

Altivec

neon/st1: disable last remaining AltiVec implementation 0521245 @mr-c

Power

sse2,wasm simd128: skip SIMDE_CONVERT_VECTOR_ implementations on PowerPC 4de999a @mr-c
wasm simd128: more powerpc fixes 7cb5691 @mr-c

Compiler Specific

GCC

GCC AVX512F: SIMDE_BUG_GCC_95399 was fixed in GCC 9.5, 10.4, 11.4, 12+ 3fa89c5 @mr-c
GCC x86/x64: SIMDE_BUG_GCC_98521 was fixed in 10.3 edde42e @mr-c
GCC x86: SIMDE_BUG_GCC_94482 was fixed in 8.5, 9.4, 10+ 43d86a3 @mr-c
Add workaround for GCC bug 111609 fdafd8e @M-HT
arm neon ld2: silence warnings at -O3 on gcc risc-v 8f56628 @mr-c
avx512 abs: refine GCC compiler checks for _mm512{,_mask}_abs_pd (#1118) 5405bbd @thomas-schlichter

Clang

clang powerpc: vec_bperm bug was fixed in clang-14 6feb28a @mr-c
clmul: aarch64 clang has difficulties with poly64x1_t 1e1bd76 @mr-c
aarch64: optimization bug 45541 was fixed in clang-15 7ca5712 @mr-c
A32V7: Don't trust clang for load multiple on A32V7 927f141 @easyaspi314
wasm: SIMDE_BUG_CLANG_60655 is fixed in the upcoming 17.0 release 25cebbe @mr-c
simde-detect-clang.h: add clang 17 detection 923f8ac 684baa1 50d98c1 @Coeur

ClangCL

fp16: don't use _Float16 on ClangCL if not supported 8a6b8c5 @mr-c
svml: don't...

Contributors

junaruga, Coeur, and 13 other contributors


07 Mar 14:15

mr-c

v0.8.0-rc2

c200c16

v0.8.0-rc2 Pre-release


See draft release notes at https://github.com/simd-everywhere/simde/wiki/Release-Notes for changes since 0.7.6

What's Changed since RC1

WASM Relaxed SIMD updates by @mr-c in #1112
emcc tot: set -Wno-switch-default by @mr-c in #1115
avx512 abs: refine GCC compiler checks for _mm512{,_mask}_abs_pd by @thomas-schlichter in #1118
correction of simde_mm256_sign_epi16(). by @Proudsalsa in #1123
apply arm64 windows workaround only on older version msvc by @Changqing-JING in #1121
gh-actions: add clang-17 by @mr-c in #1127
Improve performance of simde_mm512_add_epi32 by @AymenQ in #1126
typo: XCode -> Xcode by @Coeur in #1129
Update simde-detect-clang.h for clang 13 detection by @Coeur in #1131
Update simde-detect-clang.h for clang 17 detection by @Coeur in #1132
build(deps): bump ad-m/github-push-action from 0.6.0 to 0.8.0 by @dependabot in #1134
build(deps): bump actions/setup-dotnet from 3 to 4 by @dependabot in #1135
build(deps): bump actions/setup-python from 4 to 5 by @dependabot in #1137
build(deps): bump github/codeql-action from 2 to 3 by @dependabot in #1138
GitHub Actions emscripten: use older release for now by @mr-c in #1133
build(deps): bump actions/checkout from 3 to 4 by @dependabot in #1139
docs: explain how to target a single test by @mr-c in #1140
arm/neon abs: negating INT_MIN is undefined behavior by @mr-c in #1141

New Contributors

@thomas-schlichter made their first contribution in #1118
@Proudsalsa made their first contribution in #1123
@Changqing-JING made their first contribution in #1121
@AymenQ made their first contribution in #1126
@Coeur made their first contribution in #1129
@dependabot made their first contribution in #1134

Full Changelog: v0.8.0-rc1...v0.8.0-rc2

Contributors

Coeur, mr-c, and 5 other contributors


20 Nov 17:41

mr-c

v0.8.0-rc1

e651ec3

v0.8.0-rc1 Pre-release


See draft release notes at https://github.com/simd-everywhere/simde/wiki/Release-Notes

New Contributors

@cbielow made their first contribution in #1055
@M-HT made their first contribution in #1060
@yyctw made their first contribution in #1071
@Vineg made their first contribution in #1072
@wewe5215 made their first contribution in #1077

Full Changelog: v0.7.6...v0.8.0-rc1

Contributors

Vineg, cbielow, and 3 other contributors


16 May 16:51

mr-c

v0.7.6

fefc785

v0.7.6

Summary

See, I knew we should release more often!

Details

Implementation of Arm intrinsics

NEON

neon/abd,ext,cmla{,_rot{180,270,90}}: additional wasm128 implementations 3a18dff @mr-c
neon/cvtn: basic implementation of a few functions fefc785 @mr-c
neon/mla_lane: initial implementation using mla+dup 554ab18 @ngzhian
neon/shl,rshl: fix avx include to unbreak amalgamated headers 3748a9f @mr-c
neon/shll_n: make vshll_n_u32 test operational 356db0c @mr-c
neon/qabs: restore SSE2 impl for vqabsq_s8 f614843 @mr-c

x86 intrinsics

mmx: loongson impl promotions over SIMDE_SHUFFLE_VECTOR_ 51bf6f2 @mr-c
x86/sse*,avx: add additional SIMD128 implementations e28a87e @mr-c

SSE*

sse{,2,3,4.1},avx: more WASM shuffle implementations 097dd12 @mr-c
sse*,avx: add additional SIMD128 implementations e28a87e @mr-c
sse: allow native _mm_loadh_pi on MSVC x64 314452b @mr-c

AVX512

avx512: typo fix for typedef of __mmask64 e8390a3 4a9f01a @mr-c
avx512/madd: fix native alias arguments for _mm512_madd_epi16 bcf4adb @mr-c

Arch support

simde-arch: #include Hedley for setting F16C for MSVC 2022+ with AVX2 f9cf467 @mr-c

Testing with Docker/Podman & CI

tests: simde_assert_equal_{v,}f funcs were silently failing 395efd9 @mr-c
tests: Quiet another Clang < v5 warning that resurfaced d9d2b45 @mr-c
tests: audit use of HEDLEY_DIAGNOSTIC_PUSH and _POP 284c88a @mr-c
test: ignore -Wc99-extensions e264ff5 @mr-c
neon/aba: vaba_s32 test was not being run f86346a @mr-c
sve/and: the svand_n_s8_m test is incomplete, mark it as such b962f07 @mr-c
tests: combine declarations in test functions 76c7d37 @mr-c

Local testing with Docker/Podman

docker: add wasm64 target 29db539 @mr-c

Drone.io

remove Drone.io fd10911 @mr-c

GitHub Actions

gh-actions: confirm that all header files are installed 8d5e05a @mr-c
gh-actions: put wasm64 under CI 6702820 @mr-c

Netlify

netlify: disable for now caa0929 @mr-c

Misc

meson install: arm/neon/ld1 & x86/avx512.h 27836b1 @mr-c
Update clang version detection for 14..16 and add link 4957a9e @jan-wassenberg

Contributors

mr-c, ngzhian, and jan-wassenberg


05 May 05:35

mr-c

v0.7.4

0c26988

v0.7.4

SIMDe 0.7.4

Summary

Minimum meson version is now 0.54
40 new NEON families implemented, SVE API implementation started (14 families)
Initial support for x86 F16C API
Initial support for MIPS MSA API
Initial support for Arm Scalable Vector Extensions (SVE) API
Initial support for WASM SIMD128 API
Initial support for the E2K (Elbrus) architecture
MSVC has many fixes, now compiled in CI using /ARCH:AVX, /ARCH:AVX2, and /ARCH:AVX512

X86

There are a total of 7470 SIMD functions on x86, 2971 (39.77%) of which have been implemented in SIMDe so far.
Specifically for AVX-512, of the 5270 functions currently in AVX-512, SIMDe implements 1439 (27.31%)

Newly added function families

AVX512CD: 21 of 42 (50.00%)
AVX512VPOPCNTDQ: 18 of 18 💯
AVX512_4VNNIW: 6 of 6 (100.00%)
AVX512_BF16: 9 of 38 (23.68%)
AVX512_BITALG: 24 of 24 💯
AVX512_FP16: 2 of 1105 (0.18%)
AVX512_VBMI2: 3 of 150 (2.00%)
AVX512_VNNI: 36 of 36 💯
AVX_VNNI: 8 of 16 (50.00%)

Additions to existing families

AVX512F: 579 additional, 856 total of 2660 (31.80%)
AVX512BW: 178 additional, 335 total of 828 (40.46%)
AVX512DQ: 77 additional, 111 total of 399 (27.82%)
AVX512_VBMI: 9 additional, 30 total of 30 💯
KNCNI: 113 additional, 114 total of 595 (19.16%)
VPCLMULQDQ: 1 additional, 2 total of 2 💯

Neon

SIMDe currently implements 3745 out of 6670 (56.15%) NEON functions. If you don't count 16-bit floats and poly types, it's 3745 / 4969 (75.37%).

Newly added families

addhn
bcax
cage
cmla
cmla_rot90
cmla_rot180
cmla_rot270
fma
fma_lane
fma_n
ld2
ld4_lane
mlal_high_n
mlal_lane
mls_n
mlsl_high_n
mlsl_lane
mull_lane
qdmulh_lane
qdmulh_n
qrdmulh_lane
qrshrn_n
qrshrun_n
qshlu_n
qshrn_n
qshrun_n
recpe
recps
rshrn_n
rsqrte
rsqrts
shll_n
shrn_n
sqadd
sri_n
st2
st2_lane
st3_lane
st4_lane
subhn
subl_high
xar

MSA

Overall, SIMDe implements 40 of 533 (7.50%) functions from MSA.

Details

Implementation of Arm intrinsics

NEON

aarch64 + clang-1[345] fix for "implicit conversion changes signedness" a22c3cc @mr-c
neon: Implement f16 types 21496f6 @Glitch18
neon: port additional code to new style 1c744fd @nemequ
neon: replace some more abs/labs/llabs usage with simde_math_* versions c59853a @nemequ
neon: refactor to use different types on all targets c17957a @nemequ
neon: test for MMX/SSE instead of x86 when choosing implementation 0366dab @nemequ
neon/abd: add much better implementations c3ddbbe @nemequ 220db33 @ngzhian
neon/abs: add SSE2 integer abs implementations 6396dc8 @aqrit
neon/addhn: initial implementation e9ee066 @nemequ
neon/add: Implement f16 functions e69239c @Glitch18
neon/add{l,}v: SSE2/SSSE3 opts _vadd{lvq_s8, lvq_s16, lvq_u8, vq_u8} 8b4e375 dfffdde @mr-c
neon/{add,sub}w_high: use vmovl_high instead of vmovl + get_high b897331 @nemequ
neon/bcax: initial implementation 96ce481 0ed3dea @Glitch18
neon/bsl: Implement f16 functions edb75b5 @Glitch18
neon/cage: Initial f16 implementations 20df81d @Glitch18
neon/cagt: Implement f16 functions 452a6d3 @Glitch18
neon/ceq: Implement f16 functions f24ab3d @Glitch18
neon/ceqz: Implement f16 functions dd2ebf2 de301cd @Glitch18
neon/cge: Implement f16 functions a512986 f3ad0d4 647dc12 @Glitch18
neon/cgez: complete implementation of CGEZ family 6d86a20 @Glitch18
neon/cgt: Add implementation of remaining functions 9930c43 @Glitch18
neon/cgt, simd128: improve some unsigned comparisons on x86 ae6702a @nemequ
neon/cgtz: Add implementations of remaining functions 4d749b5 @Glitch18
neon/cle: add some x86 implementations 5906cc9 d81c7e7 @nemequ 7894c7d @Glitch18
neon/clez: Add implementations of scalar functions bc72880 @Glitch18
neon/clt: Add implementations of scalar functions & SSE/AVX512 fallbacks bc636e1 6a19637 @Glitch18
neon/cltz: Add scalar functions and natural vector fallbacks 2960ef0 @Glitch18
neon/cmla, neon/cmla_rot{90,180,270}: check compiler versions e98152f @nemequ
neon/cmla, neon/cmla_rot{90,180,270}: CMLA requires armv8.3+ 280faae @nemequ
neon/cmla, neon/cmla_rot{90,180,270}, neon/fma: initial implementation 2aff4f9 @Glitch18
neon/cnt: add x86 implementations of vcntq_s8 a558d6d @nemequ
neon/cvt: add __builtin_convertvector implementations d06ea5b @nemequ
neon/cvt: add out-of-range and NaN tests 7d0e2ac @nemequ
neon/cvt: add some faster x86 float->int/uint conversions ceaaf13 @nemequ
neon/cvt: Add vcvt_f32_f64 and vcvt_f64_f32 implementations 8398f73 @Glitch18
neon/cvt: cast result of float/double comparison dc215cd @ngzhian
neon/cvt: disable some code on 32-bit x86 which uses _mm_cvttsd_si64 48edfa9 @nemequ
neon/cvt: don't use vec_ctsl on POWER 8f9582a @nemequ
neon/cvt: fix a couple of s390x implementations' NaN handling a8bd33d @nemequ
neon/cvt: fix compilation with -ffast-math d1d070d @nemequ
neon/cvt: Implement f16 functions b6a9882 @Glitch18
neon/cvt, relaxed-simd: add work-around for GCC bug #101614 11aa006 @nemequ
neon/cvt, simd128: fix compiler errors on PPC 965e68e @nemequ
neon/cvt: clang bug 46844 was fixed in clang 12.0 71e03a6 @mr-c
neon/dot_lane: add remaining implementation 3f1c1fa 4a9ca8a @Glitch18
neon/dup_lane: Complete implementation of function family 12fb731 df320d1 @Glitch18 014ee00 9461557 @nemequ
neon/dup_lane: use dup_n 2b4a009 @ngzhian
neon/dup_n: Implement f16 functions 14fdf88 @Glitch18
neon/dup_n: replace remaining functions with dup_n implementations 27a13b0 @nemequ
neon/dupq_lane: native and portable 893db57 @ngzhian
neon/ext: add __builtin_shufflevector implementation de8fe89 @ngzhian
neon/ext: add _mm_alignr_{,e}pi8 implementations 6d28f04 @nemequ
neon/ext: clean up shuffle-based implementation f1de709 @nemequ
neon/ext: simde_*{to,from}_m64 reqs MMX_NATIVE 13ee902 @mr-c
neon/ext: unroll SIMDE_CONSTIFY for testing macro implemented functions 62834fa @mr-c
neon/fma: add a couple x86 and PPC implementations 7a2860b @nemequ
neon/fma: add more extensive feature checking e541dd1 @nemequ
neon/fma_lane: Implement fmaq_lane functions a77e6ad 555ef3e @Glitch18
neon/fma_n: initial implementation 06d5a62 dab4342 @nemequ
neon/get_high: add __builtin_shufflevector optimizations 4003afa @ngzhian
neon/get_low: use __builtin_shufflevector if available ea3f75e @ngzhian
neon/hadd,hsub: optimization for Wasm ebe09d8 @ngzhian
neon/ld1: add Wasm SIMD implementation a79bc15 @ngzhian
neon/ld1_dup: native and portable (64-bit vectors), f64 debb3c8 @ngzhian 6c71aac @Glitch18
neon/ld1_dup: split from ld1, dup_n fallbacks, WASM implementations 4c586e0 @nemequ
neon/ld1: Implement f16 functions 6e89a9c f26f775 @Glitch18
neon/ld1_lane: Implement remaining functions de2de8d @Glitch18 9051a51 @ngzhian
neon/ld1q: u8_x2, u8_x3, u8_x4 341006c @ngzhian
neon/ld1[q]_*_x2: initial implementation cd14634 @dgazzoni
neon/ld{2,3,4}: disable -Wmaybe-uninitialized on all recent GCC e142a59 @nemequ
neon/ld{2,3,4}: silence false positive diagnostic on GCC 7 3f737a3 @nemequ
neon/ld2: Implement remaining functions e68f728 @Glitch18 3b3014f @ngzhian 078bb00 @nemequ 041b1bd @mr-c
neon/ld4_lane: native and portable implementations a973cab @ngzhian 179fb79 @Glitch18 0d1ab79 @nemequ
neon/ld4: use conformant array parameters 723a8a8 @nemequ
neon/ld4: work around spurious warning on clang < 10 64e9db0 @nemequ
neon/min: add SSE2 vminq_u32 & vqsubq_u32 implementation 2cf165e 117de35 @nemequ
neon/{min,max}nm: add some headers for -ffast-math ebe5c7d @nemequ
neon/{min,max}nm: use simde_math_* prefixed min/max functions c1607d2 @nemequ
neon/mlal_high_n: initial implementation d6f75fa @dgazzoni
neon/mlal_lane: initial implementation 82e36ed 2168ca0 @nemequ
neon/mls: add _mm_fnmadd_* implementations of vmls*_f* 70e0c20 @nemequ
neon/mlsl_high_n: initial implementation ca1a4c3 @dgazzoni
neon/mlsl_lane: initial implementation de78ae9 @nemequ
neon/mls_n: initial implementation 042c6eb @nemequ
neon/movl: improve WASM...