Improve performance of float-to-integer intrinsics #546

AymenQ · 2022-10-21T16:41:21Z

Improve performance of several of the float-to-integer intrinsics.

The first commit rewrites _mm_cvtps_pi8 and _mm_cvtps_pi16 to use
the signed saturating extract-narrow instruction (sqxtn) instead of
emulating the saturating-narrow behaviour with several masks.

The second commit adds support for using the round-to-integral
intrinsics introduced by FEAT_FRINTTS in Armv8.5-A. These match the
out-of-bounds behaviour specified by the SSE float-to-integer
conversions instead of saturating, so this required a test modification.
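
To illustrate why the test needed changing, the two out-of-bounds behaviours can be modelled in scalar C. This is only a sketch: the function names are hypothetical and are not part of sse2neon, and the input is assumed to have already been rounded to an integral value.

```c
#include <stdint.h>

/* Hypothetical model of a saturating conversion, as performed by the
 * AArch64 fcvtzs instruction: out-of-range values clamp to the nearest
 * representable int32_t and NaN becomes 0. */
static int32_t model_cvt_saturating(double rounded)
{
    if (rounded != rounded) /* NaN */
        return 0;
    if (rounded >= 2147483647.0)
        return INT32_MAX;
    if (rounded <= -2147483648.0)
        return INT32_MIN;
    return (int32_t) rounded;
}

/* Hypothetical model of the SSE behaviour: anything that does not fit
 * in 32 bits (including NaN) yields the integer indefinite value. */
static int32_t model_cvt_indefinite(double rounded)
{
    /* NaN fails both comparisons, so it also takes this branch. */
    if (!(rounded >= -2147483648.0 && rounded <= 2147483647.0))
        return INT32_MIN;
    return (int32_t) rounded;
}
```

The two models agree for in-range inputs and for large negative inputs; they differ for large positive inputs and NaN, which is where a test written against the saturating behaviour would fail.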

Sample codegen diffs are provided in the individual commit messages.

NEON provides a saturating extract-narrow that already respects the
saturating behaviour required by _mm_cvtps_pi[8,16]. Use this instead of
using a mask-based implementation.

Example codegen for _mm_cvtps_pi16(a) with GCC 11.2.0 (-O3):

Prior to commit:

    adrp x0, 0x400000
    movi v5.4s, 0x4f, lsl 24
    movi v3.4s, 0xc7, lsl 24
    movi v2.4s, 0x7f, msl 8
    mvni v6.4s, 0x7f, msl 8
    ldr q4, [x0, 2960]
    fcmge v5.4s, v5.4s, v0.4s
    fcmgt v3.4s, v0.4s, v3.4s
    fcmge v1.4s, v0.4s, v4.4s
    fcmgt v4.4s, v4.4s, v0.4s
    and v1.16b, v1.16b, v5.16b
    and v3.16b, v3.16b, v4.16b
    and v2.16b, v2.16b, v1.16b
    orr v1.16b, v3.16b, v1.16b
    bic v1.16b, v6.16b, v1.16b
    mrs x0, fpcr
    lsr w1, w0, 16
    tbz w0, 22, 0x400a9c
    tbz w1, 7, 0x400a84
    fcvtzs v0.4s, v0.4s
    orr v2.16b, v2.16b, v1.16b
    and v0.16b, v0.16b, v3.16b
    orr v0.16b, v2.16b, v0.16b
    xtn v0.4h, v0.4s
    ret
    fcvtps v0.4s, v0.4s
    orr v2.16b, v2.16b, v1.16b
    and v0.16b, v0.16b, v3.16b
    orr v0.16b, v2.16b, v0.16b
    xtn v0.4h, v0.4s
    ret
    tbz w1, 7, 0x400ab8
    fcvtms v0.4s, v0.4s
    orr v2.16b, v2.16b, v1.16b
    and v0.16b, v0.16b, v3.16b
    orr v0.16b, v2.16b, v0.16b
    xtn v0.4h, v0.4s
    ret
    fcvtns v0.4s, v0.4s
    orr v2.16b, v2.16b, v1.16b
    and v0.16b, v0.16b, v3.16b
    orr v0.16b, v2.16b, v0.16b
    xtn v0.4h, v0.4s

After commit:

    mrs x0, fpcr
    lsr w1, w0, 16
    tbz w0, 22, 0x400a48
    tbz w1, 7, 0x400a3c
    fcvtzs v0.4s, v0.4s
    sqxtn v0.4h, v0.4s
    ret
    fcvtps v0.4s, v0.4s
    sqxtn v0.4h, v0.4s
    ret
    tbz w1, 7, 0x400a58
    fcvtms v0.4s, v0.4s
    sqxtn v0.4h, v0.4s
    ret
    fcvtns v0.4s, v0.4s
    sqxtn v0.4h, v0.4s
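
The per-lane effect of the saturating narrow can be sketched as follows. This is a minimal illustration, not the code from the commit: the helper names are hypothetical, and the NEON path assumes the ACLE intrinsic vqmovn_s32 (which compilers typically lower to sqxtn on AArch64) is available whenever __ARM_NEON is defined. A scalar fallback is included so the sketch also builds on other targets.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical scalar model of one lane of a signed saturating narrow
 * from 32 to 16 bits: values outside the int16_t range are clamped. */
static int16_t model_sat_narrow_s32_to_s16(int32_t x)
{
    if (x > INT16_MAX)
        return INT16_MAX;
    if (x < INT16_MIN)
        return INT16_MIN;
    return (int16_t) x;
}

/* Hypothetical helper: narrow four int32_t lanes to int16_t with
 * saturation, using a single NEON operation when available. */
static void narrow4_s32_to_s16(const int32_t in[4], int16_t out[4])
{
#if defined(__ARM_NEON)
    /* One saturating narrow replaces the compare/mask/select sequence. */
    vst1_s16(out, vqmovn_s32(vld1q_s32(in)));
#else
    for (int i = 0; i < 4; i++)
        out[i] = model_sat_narrow_s32_to_s16(in[i]);
#endif
}
```

Because the clamping happens inside the narrowing step itself, no range comparison against the float input is needed beforehand, which is where the instruction savings in the listing above come from.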

Use the frint32x instruction (FEAT_FRINTTS) for _mm_cvtps_epi32 and
_mm_cvtpd_epi32. This respects the currently set rounding mode.

The round-to-integral instructions match the behaviour specified by the
corresponding SSE instructions: floats that would not fit into 32 bits
instead return the integer indefinite value (INT32_MIN). The current
implementation uses a C-style cast, which saturates out-of-bounds results
instead of returning the integer indefinite value. The test for
_mm_cvtpd_epi32 has therefore been modified to check for the integer
indefinite value for out-of-bounds inputs when these instructions are
supported on the -march target.

Example codegen for _mm_cvtpd_epi32(a) with GCC 11.2.0 (-O3,
-march=armv8.5-a):

Prior to commit:

    frinti v0.2d, v0.2d
    sub sp, sp, 0x10
    str xzr, [sp, 8]
    fmov d1, d0
    mov d0, v0.d[1]
    fcvtzs w1, d1
    fcvtzs w0, d0
    stp w1, w0, [sp]
    ldr q0, [sp]
    add sp, sp, 0x10

After commit:

    frint32x v0.2d, v0.2d
    fcvtzs v0.2d, v0.2d
    xtn v0.2s, v0.2d
    mov d0, v0.d[0]
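
A rough sketch of how the round-then-convert pairing might look for four single-precision lanes is given below. It is not the code from the commit. It assumes the ACLE feature macro __ARM_FEATURE_FRINT and the intrinsic vrnd32xq_f32 are provided by the toolchain when targeting Armv8.5-A; the helper names are hypothetical, and the portable fallback (which uses the 2^23 addition trick to round in the current rounding mode, and assumes the compiler is not run with -ffast-math) exists only so the sketch builds elsewhere.

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_FRINT)
#include <arm_neon.h>
#else
/* Hypothetical fallback: round to integral in the current rounding mode.
 * Adding and subtracting 2^23 forces the fraction bits to be rounded off;
 * volatile stops the compiler from folding the two operations away. */
static float model_round_current_mode(float x)
{
    const float big = 8388608.0f; /* 2^23 */
    volatile float t;
    if (!(x > -big && x < big))
        return x; /* already integral, infinite, or NaN */
    if (x >= 0.0f) {
        t = x + big;
        t = t - big;
    } else {
        t = x - big;
        t = t + big;
    }
    return t;
}
#endif

/* Hypothetical helper: convert four floats to int32_t, returning
 * INT32_MIN for lanes that do not fit (the SSE out-of-bounds result). */
static void cvt4_f32_to_s32_indefinite(const float in[4], int32_t out[4])
{
#if defined(__ARM_FEATURE_FRINT)
    /* frint32x yields -2^31 for out-of-range or NaN lanes, so the
     * following convert produces INT32_MIN for them. */
    float32x4_t rounded = vrnd32xq_f32(vld1q_f32(in));
    vst1q_s32(out, vcvtq_s32_f32(rounded));
#else
    for (int i = 0; i < 4; i++) {
        float r = model_round_current_mode(in[i]);
        /* NaN fails the comparison and takes the INT32_MIN branch. */
        if (r >= -2147483648.0f && r < 2147483648.0f)
            out[i] = (int32_t) r;
        else
            out[i] = INT32_MIN;
    }
#endif
}
```

Guarding on the feature macro mirrors the description above: the new out-of-bounds behaviour only applies when the -march target supports these instructions, so a test has to expect different results depending on the build.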

jserv · 2022-10-22T00:59:39Z

Thanks, @AymenQ, for contributing!

AymenQ added 2 commits October 21, 2022 12:54

AymenQ requested review from jserv and marktwtn as code owners October 21, 2022 16:41

jserv merged commit 9c82799 into DLTcollab:master Oct 22, 2022
