
Tags: hexon/vectorcmp

v1.4.2

Optimize AVX(1) VPSHUFB by loading the shuffle mask into a register

instead of repeatedly reading it from memory

AsmVectorEquals16-32   2.208µ ± 2%   1.890µ ± 2%  -14.42% (p=0.000 n=10)
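
Roughly what this kind of change looks like in Go assembler syntax (a hedged sketch, not the actual diff; the mask symbol and registers are made up): the shuffle control mask used to be a memory operand inside the hot loop and is now loaded into a register once, up front.

```asm
// Sketch only; shufmask<> and X9 are hypothetical.
// Before, the control mask was re-read from memory on every iteration:
//     VPSHUFB shufmask<>(SB), X0, X0
// After, it is loaded once before the loop and reused:
	VMOVDQU shufmask<>(SB), X9       // single load of the control mask
loop:
	// ... load input, compare ...
	VPSHUFB X9, X0, X0               // register operand, no per-iteration memory read
	// ... store result, advance pointers, branch back to loop ...
```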

v1.4.1

Bugfix: we accidentally did signed comparisons instead of unsigned
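
The commit doesn't include the failing case, but the pitfall is the usual one: the same bit pattern orders differently under signed and unsigned interpretation. A tiny scalar illustration in Go:

```go
package main

import "fmt"

func main() {
	// 0xFF is 255 as an unsigned byte but -1 as a signed one, so
	// "greater than" gives opposite answers depending on signedness.
	var a, b byte = 0xFF, 0x01

	fmt.Println(a > b)             // true: 255 > 1 (unsigned)
	fmt.Println(int8(a) > int8(b)) // false: -1 > 1 (signed)
}
```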

v1.4.0

Add VectorNotEquals APIs

v1.3.3

Add bounds checking before calling into ASM

This avoids memory corruption for bad calls
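
A minimal sketch of the idea (hypothetical names and output layout, not the library's actual API): validate lengths in Go before handing slices to the assembly routine, so a too-small output buffer fails loudly instead of corrupting memory.

```go
package vectorcmp

// asmVectorEquals16 stands in for the real assembly routine; the bodyless
// declaration is the usual form for asm-backed Go functions.
func asmVectorEquals16(out []byte, in []uint16, needle uint16)

// VectorEquals16 is a hypothetical wrapper. Assumed layout: one result bit
// per input element, packed into the output byte slice.
func VectorEquals16(out []byte, in []uint16, needle uint16) {
	// The length check happens in Go, before any pointer reaches assembly.
	if len(out)*8 < len(in) {
		panic("vectorcmp: output buffer too small for input")
	}
	asmVectorEquals16(out, in, needle)
}
```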

v1.3.2

Avoid BMI2 in the AVX(1) implementation

I could've also generated every combination of with/without BMI2 and AVX2,
but I simply made the fastest option require AVX2+BMI2 and kept AVX(1) as
the fallback.

v1.3.1

Check cpuflag for BMI2

Found out the hard way that I shouldn't assume that having AVX implies
having BMI2.
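
A sketch of what this check plus the fallback from the previous entry can look like using golang.org/x/sys/cpu (only the cpu.X86 flags are real API; the function names are made up, and whether the library uses this package or its own CPUID code isn't stated here):

```go
package vectorcmp

import "golang.org/x/sys/cpu"

// Each feature is checked separately: AVX alone does not guarantee BMI2.
var (
	hasAVX2BMI2 = cpu.X86.HasAVX2 && cpu.X86.HasBMI2 // needed for the fastest variant
	hasAVX      = cpu.X86.HasAVX                     // enough for the AVX(1) fallback
)

// Hypothetical stand-ins for the assembly routines and the pure-Go fallback.
func vectorEquals16AVX2BMI2(out []byte, in []uint16, needle uint16) { /* asm */ }
func vectorEquals16AVX(out []byte, in []uint16, needle uint16)      { /* asm */ }
func vectorEquals16Generic(out []byte, in []uint16, needle uint16)  { /* pure Go */ }

func vectorEquals16(out []byte, in []uint16, needle uint16) {
	switch {
	case hasAVX2BMI2:
		vectorEquals16AVX2BMI2(out, in, needle)
	case hasAVX:
		vectorEquals16AVX(out, in, needle)
	default:
		vectorEquals16Generic(out, in, needle)
	}
}
```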

v1.3.0

Add IsNaN methods for floats
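
Not the library's API, but the property a vectorised IsNaN can be built on: NaN is the only floating-point value that compares unequal to itself, so a not-equal comparison of a vector against itself marks exactly the NaN lanes. The scalar form of that trick:

```go
package main

import (
	"fmt"
	"math"
)

// isNaN64 reports whether f is NaN via self-comparison; NaN != NaN is the
// only case in which a value is unequal to itself.
func isNaN64(f float64) bool {
	return f != f
}

func main() {
	fmt.Println(isNaN64(math.NaN())) // true
	fmt.Println(isNaN64(1.5))        // false
}
```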

v1.2.2

Adding a testcase for NaNs revealed that they weren't handled correctly

v1.2.1

Enable AVX implementations for 64-bit-wide inputs

The XMM registers are 128 bits wide, so that's only 2 rows per round. I
implemented merging four 2-bit results together so we can write back a full
byte. For that I increased the number of rounds for 64-bit inputs to 4
(instead of the default 2).
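
A scalar sketch of that merge step (the real work happens in assembly; the function name is made up): each of the four rounds yields a 2-bit comparison result, and the four results are shifted into place so one full byte can be written back at once.

```go
package main

import "fmt"

// mergeRounds packs four 2-bit round results into one output byte,
// round 0 in the lowest two bits, round 3 in the highest two.
func mergeRounds(rounds [4]uint8) uint8 {
	var out uint8
	for i, r := range rounds {
		out |= (r & 0b11) << (2 * i)
	}
	return out
}

func main() {
	// Rounds produced the bit pairs 01, 11, 00, 10 -> byte 0b10001101.
	fmt.Printf("%08b\n", mergeRounds([4]uint8{0b01, 0b11, 0b00, 0b10}))
}
```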

v1.2.0

Call VZEROALL for a 30% benchmark improvement

Call VZEROALL before leaving any ASM function.

I didn't expect this large a speedup, so I'm a little suspicious.
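
A sketch of the shape of the change in Go assembler syntax (the routine name and body are hypothetical; only the epilogue is the point). One plausible explanation for the speedup, though the commit doesn't confirm it, is that clearing the registers avoids the penalty for leaving dirty upper YMM state behind when SSE code runs afterwards.

```asm
#include "textflag.h"

// Hypothetical routine; the relevant part is the two instructions before returning.
TEXT ·vectorEquals16AVX(SB), NOSPLIT, $0
	// ... AVX comparison loop using the X/Y registers ...
	VZEROALL   // clear all vector registers before returning to Go code
	RET
```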