@AymenQ AymenQ commented Oct 21, 2022

Improve performance of several of the float-to-integer intrinsics.

The first commit re-writes _mm_cvtps_pi8 and _mm_cvtps_pi16 to use
the signed saturating extract-narrow (sqxtn) intrinsic instead of
emulating the saturating-narrow behaviour with several masks.

The second commit adds support for using the round-to-integral
intrinsics introduced by FEAT_FRINTTS in Armv8.5-A. These match the
out-of-bounds behaviour specified by the SSE float-to-integer
conversions instead of saturating, so this required a test modification.

Sample codegen diffs are provided in the individual commit messages.

NEON provides a saturating extract-narrow that already respects the
saturating behaviour required by _mm_cvtps_pi[8,16]. Use this instead
of using a mask-based implementation.
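
As a rough illustration of the narrowing step only (not the code from this PR), the sketch below narrows four int32 lanes to int16 with signed saturation. The helper name narrow_sat_s32_to_s16 is hypothetical. On targets that define __ARM_NEON it uses the vqmovn_s32 intrinsic, which corresponds to sqxtn; elsewhere it falls back to a scalar loop that models the same per-lane result.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: narrow four int32 lanes to int16 with signed
 * saturation, i.e. values above INT16_MAX or below INT16_MIN are clamped. */
static void narrow_sat_s32_to_s16(const int32_t in[4], int16_t out[4])
{
#if defined(__ARM_NEON)
    /* vqmovn_s32 is the saturating extract-narrow (sqxtn) intrinsic. */
    vst1_s16(out, vqmovn_s32(vld1q_s32(in)));
#else
    /* Portable scalar model of the same per-lane behaviour. */
    for (int i = 0; i < 4; i++) {
        int32_t v = in[i];
        if (v > INT16_MAX)
            v = INT16_MAX;
        if (v < INT16_MIN)
            v = INT16_MIN;
        out[i] = (int16_t) v;
    }
#endif
}
```

Because the clamping happens inside the narrowing instruction, no comparison masks are needed before it, which is where the shorter codegen below comes from.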

Example codegen for _mm_cvtps_pi16(a) with GCC 11.2.0 (-O3):

Prior to commit:
    adrp    x0, 0x400000
    movi    v5.4s, 0x4f, lsl 24
    movi    v3.4s, 0xc7, lsl 24
    movi    v2.4s, 0x7f, msl 8
    mvni    v6.4s, 0x7f, msl 8
    ldr     q4, [x0, 2960]
    fcmge   v5.4s, v5.4s, v0.4s
    fcmgt   v3.4s, v0.4s, v3.4s
    fcmge   v1.4s, v0.4s, v4.4s
    fcmgt   v4.4s, v4.4s, v0.4s
    and     v1.16b, v1.16b, v5.16b
    and     v3.16b, v3.16b, v4.16b
    and     v2.16b, v2.16b, v1.16b
    orr     v1.16b, v3.16b, v1.16b
    bic     v1.16b, v6.16b, v1.16b
    mrs     x0, fpcr
    lsr     w1, w0, 16
    tbz     w0, 22, 0x400a9c
    tbz     w1, 7, 0x400a84
    fcvtzs  v0.4s, v0.4s
    orr     v2.16b, v2.16b, v1.16b
    and     v0.16b, v0.16b, v3.16b
    orr     v0.16b, v2.16b, v0.16b
    xtn     v0.4h, v0.4s
    ret
    fcvtps  v0.4s, v0.4s
    orr     v2.16b, v2.16b, v1.16b
    and     v0.16b, v0.16b, v3.16b
    orr     v0.16b, v2.16b, v0.16b
    xtn     v0.4h, v0.4s
    ret
    tbz     w1, 7, 0x400ab8
    fcvtms  v0.4s, v0.4s
    orr     v2.16b, v2.16b, v1.16b
    and     v0.16b, v0.16b, v3.16b
    orr     v0.16b, v2.16b, v0.16b
    xtn     v0.4h, v0.4s
    ret
    fcvtns  v0.4s, v0.4s
    orr     v2.16b, v2.16b, v1.16b
    and     v0.16b, v0.16b, v3.16b
    orr     v0.16b, v2.16b, v0.16b
    xtn     v0.4h, v0.4s

After commit:
    mrs     x0, fpcr
    lsr     w1, w0, 16
    tbz     w0, 22, 0x400a48
    tbz     w1, 7, 0x400a3c
    fcvtzs  v0.4s, v0.4s
    sqxtn   v0.4h, v0.4s
    ret
    fcvtps  v0.4s, v0.4s
    sqxtn   v0.4h, v0.4s
    ret
    tbz     w1, 7, 0x400a58
    fcvtms  v0.4s, v0.4s
    sqxtn   v0.4h, v0.4s
    ret
    fcvtns  v0.4s, v0.4s
    sqxtn   v0.4h, v0.4s

Use the frint32x instruction (FEAT_FRINTTS) for _mm_cvtps_epi32 and
_mm_cvtpd_epi32. This respects the currently set rounding mode.

The round-to-integral instructions match the behaviour specified by the
corresponding SSE instructions: floats that would not fit into 32 bits
instead return the integer indefinite (INT32_MIN). The current
implementation uses a C-style cast, which compiles to fcvtzs here and so
saturates out-of-bound results instead of returning the integer
indefinite.
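
The difference between the two behaviours can be sketched with a pair of scalar models. These are illustrative only: both function names are hypothetical, and both assume the input has already been rounded to an integral value (the job frinti or frint32x does in the listings below). The range check comes before the cast, so the cast itself never sees an out-of-range value.

```c
#include <stdint.h>

/* Model of the SSE / frint32x result: anything that does not fit in
 * int32, including NaN, becomes the integer indefinite (INT32_MIN). */
static int32_t to_int32_indefinite(double rounded)
{
    /* NaN fails both comparisons, so it also takes this branch. */
    if (!(rounded >= -2147483648.0 && rounded <= 2147483647.0))
        return INT32_MIN;
    return (int32_t) rounded;
}

/* Model of a saturating conversion such as fcvtzs: out-of-range values
 * clamp to the nearest representable int32, and NaN becomes 0. */
static int32_t to_int32_saturating(double rounded)
{
    if (rounded != rounded) /* true only for NaN */
        return 0;
    if (rounded >= 2147483647.0)
        return INT32_MAX;
    if (rounded <= -2147483648.0)
        return INT32_MIN;
    return (int32_t) rounded;
}
```

The two models differ only for NaN and for inputs above INT32_MAX: for 3e9 the first returns INT32_MIN and the second INT32_MAX.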

The test for _mm_cvtpd_epi32 has therefore been modified to check for
the integer indefinite for out-of-bound inputs when these instructions
are supported on the -march target.
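
One way such a test could pick its expected value is sketched below. This is an assumption, not the test code from the commit: it assumes the ACLE feature macro __ARM_FEATURE_FRINT is the compile-time signal that the frint32x path is in use, and the helper name expected_out_of_range_lane is hypothetical.

```c
#include <stdint.h>

/* Hypothetical test helper: expected result lane for an input that lies
 * outside the int32 range after rounding. */
static int32_t expected_out_of_range_lane(int is_negative)
{
#if defined(__ARM_FEATURE_FRINT)
    /* frint32x path: integer indefinite regardless of sign. */
    (void) is_negative;
    return INT32_MIN;
#else
    /* Saturating path: clamp towards the sign of the input. */
    return is_negative ? INT32_MIN : INT32_MAX;
#endif
}
```

Only positive out-of-range inputs distinguish the two paths, since negative ones give INT32_MIN either way.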

Example codegen for _mm_cvtpd_epi32(a) with GCC 11.2.0 (-O3,
-march=armv8.5-a):

Prior to commit:
    frinti   v0.2d, v0.2d
    sub      sp, sp, 0x10
    str      xzr, [sp, 8]
    fmov     d1, d0
    mov      d0, v0.d[1]
    fcvtzs   w1, d1
    fcvtzs   w0, d0
    stp      w1, w0, [sp]
    ldr      q0, [sp]
    add      sp, sp, 0x10

After commit:
    frint32x v0.2d, v0.2d
    fcvtzs   v0.2d, v0.2d
    xtn      v0.2s, v0.2d
    mov      d0, v0.d[0]
@AymenQ AymenQ requested review from jserv and marktwtn as code owners October 21, 2022 16:41
@jserv jserv merged commit 9c82799 into DLTcollab:master Oct 22, 2022
jserv commented Oct 22, 2022

Thank @AymenQ for contributing!
