Optimize AVX(1) VPSHUFB by loading its memory operand into a register
instead of repeatedly reading it from memory
AsmVectorEquals16-32 2.208µ ± 2% 1.890µ ± 2% -14.42% (p=0.000 n=10)
Avoid BMI2 in the AVX(1) implementation
I could've also generated variants for every combination of BMI2/AVX2
support, but I simply went with having the fastest option require
AVX2+BMI2 and using AVX(1) as the fallback.
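A minimal Go sketch of that dispatch decision, assuming golang.org/x/sys/cpu for the feature checks; the equals* names and signatures below are illustrative stand-ins, not this repository's actual API:

    package main

    import (
        "fmt"

        "golang.org/x/sys/cpu"
    )

    // Portable fallback; also stands in here for the generated ASM variants.
    func equalsGeneric(a, b []uint64) bool {
        if len(a) != len(b) {
            return false
        }
        for i := range a {
            if a[i] != b[i] {
                return false
            }
        }
        return true
    }

    // Hypothetical stand-ins for the AVX2+BMI2 and AVX(1) assembly versions.
    func equalsAVX2BMI2(a, b []uint64) bool { return equalsGeneric(a, b) }
    func equalsAVX(a, b []uint64) bool      { return equalsGeneric(a, b) }

    // Chosen once at startup: the fastest path requires both AVX2 and BMI2,
    // plain AVX(1) is the fallback, and generic Go covers everything else.
    var vectorEquals = func() func(a, b []uint64) bool {
        switch {
        case cpu.X86.HasAVX2 && cpu.X86.HasBMI2:
            return equalsAVX2BMI2
        case cpu.X86.HasAVX:
            return equalsAVX
        default:
            return equalsGeneric
        }
    }()

    func main() {
        fmt.Println(vectorEquals([]uint64{1, 2}, []uint64{1, 2}))
    }

Picking the function once in a package-level variable avoids re-checking CPU features on every call.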
Enable AVX implementations for 64-bit-wide inputs
The XMM registers are 128 bits wide, so that's only 2 rows per round. I
implemented merging 4x 2-bit results together so we can write back a
whole byte. For that I increased the number of rounds for 64-bit inputs
to 4 (instead of the default 2).
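A scalar Go sketch of that merge step (not the actual ASM; the row layout and names are assumptions for illustration): each of the 4 rounds compares two 64-bit rows and contributes 2 result bits, which are shifted into place so a full byte can be written back at once.

    package main

    import "fmt"

    // equalRows compares 8 consecutive 64-bit rows against needle, two rows
    // per round over 4 rounds, and packs the per-row results into one byte
    // (bit i set when rows[i] == needle).
    func equalRows(rows *[8]uint64, needle uint64) byte {
        var out byte
        for round := 0; round < 4; round++ { // 4 rounds instead of the default 2
            var pair byte
            if rows[2*round] == needle {
                pair |= 1 // first row of the pair -> bit 0
            }
            if rows[2*round+1] == needle {
                pair |= 2 // second row of the pair -> bit 1
            }
            out |= pair << (2 * round) // merge the 2-bit result into the byte
        }
        return out
    }

    func main() {
        rows := [8]uint64{7, 1, 7, 7, 2, 7, 3, 4}
        fmt.Printf("%08b\n", equalRows(&rows, 7)) // prints 00101101
    }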
Call VZEROALL for a 30% benchmark improvement
Call VZEROALL before leaving any ASM function.
I didn't expect this large a speedup, so I'm a little suspicious, though
avoiding AVX-SSE transition penalties (dirty upper YMM state stalling the
SSE instructions in the surrounding code) could plausibly account for it.