Created attachment 18851 [details]
IR file to run with llc

rL308142 introduced an optimization that converts cmov to branches when profitable. We noticed a couple of regressions (around 2-5%) on Skylake hardware on internal benchmarks. Performance was back to normal when -x86-cmov-converter=false was supplied.

I've tried to reduce the IR as much as possible and added a main method along with it. However, the regression (measured with the time command) is not clearly visible on the reduced case; it seems to be attributed to noise. I've attached the IR and the generated assembly. Perhaps something will jump out wrt the heuristic, which seems to have a performance cliff: we convert a cmov to a branch when the gain is greater than 25% of the misprediction penalty.

Reproduce as:

llc -mcpu=skylake -mattr=+sse2,+cx16,+prfchw,+bmi2,+xsavec,+fsgsbase,+popcnt,+aes,+xsaves,+mmx,+rdseed,+clflushopt,+xsave,+avx,+rtm,+fma,+bmi,+rdrnd,+sse4.1,+sse4.2,+avx2,+sse,+lzcnt,+pclmul,+f16c,+ssse3,+sgx,+cmov,+movbe,+xsaveopt,+adx,+sse3, -O3 -x86-cmov-converter=false test.ll ; mv test.s test.falsenative.s; gcc test.falsenative.s -o test.falsenative
llc -mcpu=skylake -mattr=+sse2,+cx16,+prfchw,+bmi2,+xsavec,+fsgsbase,+popcnt,+aes,+xsaves,+mmx,+rdseed,+clflushopt,+xsave,+avx,+rtm,+fma,+bmi,+rdrnd,+sse4.1,+sse4.2,+avx2,+sse,+lzcnt,+pclmul,+f16c,+ssse3,+sgx,+cmov,+movbe,+xsaveopt,+adx,+sse3, -O3 -x86-cmov-converter=true test.ll ; mv test.s test.truenative.s; gcc test.truenative.s -o test.truenative

echo "time for x86-cmov-converter=true"
time ./test.truenative > chk2
echo "time for x86-cmov-converter=false"
time ./test.falsenative > chk

time for x86-cmov-converter=true
real    0m0.155s
user    0m0.065s
sys     0m0.007s

time for x86-cmov-converter=false
real    0m0.154s
user    0m0.064s
sys     0m0.008s
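To make the 25% figure concrete, here is a rough sketch of the shape of that check. The names and the exact expression are my assumptions, not the actual code in X86CmovConversion.cpp:

// Rough sketch only, not the real implementation.
// GainInCycles: estimated cycles saved per iteration by removing the cmov.
// MispredictPenalty: taken from the target's scheduling model.
static bool worthConvertingCmov(unsigned GainInCycles, unsigned MispredictPenalty) {
  // "convert a cmov to a branch when the gain is greater than 25% of the
  // misprediction penalty"
  return GainInCycles * 4 > MispredictPenalty;
}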
Just FYI, this is the commit:

[X86] X86::CMOV to Branch heuristic based optimization.

LLVM compiler recognizes opportunities to transform a branch into IR select instruction(s) - later it will be lowered into X86::CMOV instruction, assuming no other optimization eliminated the SelectInst.
However, it is not always profitable to emit X86::CMOV instruction. For example, branch is preferable over an X86::CMOV instruction when:
1. Branch is well predicted
2. Condition operand is expensive, compared to True-value and the False-value operands

In CodeGenPrepare pass there is a shallow optimization that tries to convert SelectInst into branch, but it is not enough.
This commit implements a machine optimization pass that converts X86::CMOV instruction(s) into branch, based on a conservative heuristic.

Differential Revision: https://reviews.llvm.org/D34769

Amjad, could you please take a look at this?
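As an aside, a hypothetical example of the second case the commit message lists (expensive condition operand, cheap true/false values), where a well-predicted branch tends to beat a cmov. Purely illustrative, not from the regressing benchmark:

// The condition depends on a (potentially cache-missing) load, while both
// select operands are cheap constants. With a well-predicted branch the CPU
// can speculate past the compare; a cmov keeps `acc` data-dependent on the load.
static int count_signs(const int *idx, const int *data, int n) {
  int acc = 0;
  for (int i = 0; i < n; ++i) {
    int v = data[idx[i]];      // expensive condition operand
    acc += (v > 0) ? 1 : -1;   // cheap true/false values
  }
  return acc;
}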
Hi Anna,

Thanks for reporting this issue.
I took a look at your reproducer, and I am still convinced that the CMOV heuristic is profitable when running on industry benchmarks.
Furthermore, I saw 10-40% gains on several workloads when running on Skylake machines, and did not see regressions that are directly caused by this change.

I believe that the code you provided should not suffer from misprediction due to the new branches that the CMOV was converted to; do you agree?

Having said that, I agree that in this example the benefit of converting the CMOV to a branch is not high enough and cannot justify the overhead that occurs in the code (e.g. more spills).
The CMOV heuristic is already conservative, but I can add one more constraint to help overcome this issue.

Please, can you apply this patch and check if it solves the regressions you see in your benchmark?

Index: lib/Target/X86/X86CmovConversion.cpp
===================================================================
--- lib/Target/X86/X86CmovConversion.cpp	(revision 309259)
+++ lib/Target/X86/X86CmovConversion.cpp	(working copy)
@@ -401,7 +401,8 @@
     WorthOptLoop = Diff[0] * 8 >= LoopDepth[0].Depth;
   else if (Diff[1] > Diff[0])
     WorthOptLoop =
-        (Diff[1] - Diff[0]) * 2 >= (LoopDepth[1].Depth - LoopDepth[0].Depth);
+        (Diff[1] - Diff[0]) * 2 >= (LoopDepth[1].Depth - LoopDepth[0].Depth) &&
+        (Diff[1] * 8 >= LoopDepth[1].Depth);
 
   if (!WorthOptLoop)
     return false;

I checked this patch with your reproducer and it results in keeping the CMOVs.
I am evaluating this patch before uploading it for review. Your feedback would be very helpful.

Thanks,
Amjad
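A hedged reading of the extra clause (the meaning of the variables is my assumption from the surrounding code, not documented semantics): Diff[1] is roughly the per-iteration critical-path shortening on the pessimistic dependency estimate, and LoopDepth[1].Depth is the loop's critical path with the CMOV kept, so the new clause additionally requires the absolute gain to be a non-trivial fraction of the loop depth. A hypothetical numeric instance of how it can flip the decision:

// Hypothetical numbers only, to show how the extra clause can change the outcome.
unsigned Diff0 = 1, Diff1 = 3;       // assumed per-iteration gain (cycles) on the two estimates
unsigned Depth0 = 38, Depth1 = 40;   // assumed loop critical path (cycles) on the two estimates
bool OldWorth = (Diff1 - Diff0) * 2 >= (Depth1 - Depth0);   // 4 >= 2   -> true: convert
bool NewWorth = OldWorth && (Diff1 * 8 >= Depth1);          // 24 >= 40 -> false: keep the CMOV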
Hi Amjad, thanks for your response.

(In reply to Amjad Aboud from comment #3)
> I took a look at your reproducer, and I am still convinced that the CMOV
> heuristic is profitable when running on industry benchmarks.

It might be profitable for *many* industry benchmarks, but the regressions we saw are also from industry-standard benchmarks for Java programs (i.e. not internally developed).

> Furthermore, I saw 10-40% gains on several workloads when running on Skylake
> machines, and did not see regressions that are directly caused by this
> change.

My main concern is that I didn't see any performance numbers on the review thread regarding this optimization. SPEC benchmarks or the LNT performance suite perhaps? I understand that CMOV to branch conversion is itself tricky to get right performance-wise, since it is difficult to statically estimate the branch predictor.

The second concern I have is that perhaps the heuristic is not conservative enough: is there a reason for requiring the gain to be greater than 25% of the misprediction penalty, where that penalty is calculated from the scheduler model? Curious about the "25%".

> I believe that the code you provided should not suffer from misprediction
> due to the new branches that the CMOV was converted to; do you agree?

Could you provide more information about this transform? While reducing the testcase, I kept the branches intact, but I removed instructions which were outside of these loops. I assumed this would not affect the rate of mispredictions due to the new branches in the reduced versus original IR.

> Having said that, I agree that in this example the benefit of converting
> the CMOV to a branch is not high enough and cannot justify the overhead
> that occurs in the code (e.g. more spills).
> The CMOV heuristic is already conservative, but I can add one more
> constraint to help overcome this issue.
>
> Please, can you apply this patch and check if it solves the regressions you
> see in your benchmark?
>
> Index: lib/Target/X86/X86CmovConversion.cpp
> ===================================================================
> --- lib/Target/X86/X86CmovConversion.cpp	(revision 309259)
> +++ lib/Target/X86/X86CmovConversion.cpp	(working copy)
> @@ -401,7 +401,8 @@
>      WorthOptLoop = Diff[0] * 8 >= LoopDepth[0].Depth;
>    else if (Diff[1] > Diff[0])
>      WorthOptLoop =
> -        (Diff[1] - Diff[0]) * 2 >= (LoopDepth[1].Depth - LoopDepth[0].Depth);
> +        (Diff[1] - Diff[0]) * 2 >= (LoopDepth[1].Depth - LoopDepth[0].Depth) &&
> +        (Diff[1] * 8 >= LoopDepth[1].Depth);
> 
>    if (!WorthOptLoop)
>      return false;
>
> I checked this patch with your reproducer and it results in keeping the CMOVs.
> I am evaluating this patch before uploading it for review. Your feedback
> would be very helpful.

Thanks for the patch! I will try this on the regressed benchmarks and then on the entire set of Java benchmarks.

Also, could you please post some performance numbers from SPEC or LNT for this optimization on the original review thread, so that the community has a sense of the performance gains/losses to expect? Thank you!

>
> Thanks,
> Amjad
Hi Amjad,

The patch has helped for one of the benchmarks, whose reduced reproducer I had initially attached to the bug. However, the second benchmark still has a 5% regression. There is one cmovns which is converted to a branch, and it's within an innermost loop nested at level 3.

Here's the assembly snippet without the cmov to branch conversion:

# BB#219:                               # %bb1382
                                        #   in Loop: Header=BB0_72 Depth=2
	lea	rbx, [rax + 12]
	test	r9d, r9d
	je	.LBB0_220
# BB#221:                               # %select.false
                                        #   in Loop: Header=BB0_72 Depth=2
	mov	edi, r11d
	sub	edi, r9d
.LBB0_222:                              # %bb1386
                                        #   Parent Loop BB0_42 Depth=1
                                        #     Parent Loop BB0_72 Depth=2
                                        # =>    This Inner Loop Header: Depth=3
	mov	ecx, r9d
	sub	ecx, edi
	mov	r9d, r11d
	mov	ebp, 0
	; this one is converted to branches.
	cmovns	r9d, ebp
	add	r9d, ecx
	cmp	dword ptr gs:[104], 0
	jne	.LBB0_223
# BB#224:                               # %bb1467
                                        #   in Loop: Header=BB0_222 Depth=3
	cmp	esi, r9d
	jbe	.LBB0_289
.LBB0_225:                              # %bb1498
                                        #   in Loop: Header=BB0_222 Depth=3
	mov	ebp, r9d
	mov	ecx, dword ptr [rbx + 4*rbp]

With the cmov to branch conversion:

# BB#220:                               # %bb1382
                                        #   in Loop: Header=BB0_73 Depth=2
	lea	rbx, [rax + 12]
	test	r15d, r15d
	je	.LBB0_221
# BB#222:                               # %select.false
                                        #   in Loop: Header=BB0_73 Depth=2
	mov	edi, r13d
	sub	edi, r15d
.LBB0_223:                              # %bb1386
                                        #   in Loop: Header=BB0_73 Depth=2
	; no cmovns here: gets converted to branch.
	xor	ecx, ecx
	sub	r15d, edi
	js	.LBB0_224
.LBB0_225:                              # %bb1386
                                        #   in Loop: Header=BB0_73 Depth=2
	add	r15d, ecx
	cmp	dword ptr gs:[104], 0
	jne	.LBB0_226
.LBB0_227:                              # %bb1467
                                        #   in Loop: Header=BB0_73 Depth=2
	cmp	esi, r15d
	jbe	.LBB0_292
.LBB0_228:                              # %bb1498
                                        #   in Loop: Header=BB0_73 Depth=2
	mov	ebp, r15d
	mov	ecx, dword ptr [rbx + 4*rbp]
	cmp	ecx, r9d
	je	.LBB0_230
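For reference, here is my reconstruction of what that cmovns computes, inferred from the assembly rather than taken from the original source: it looks like a wrap-around index update, where the select on the sign of the subtraction becomes the cmovns.

// Hypothetical reconstruction; names are invented, not from the original source.
// In the first snippet: i = r9d, step = edi, wrap = r11d.
static int next_index(int i, int step, int wrap) {
  int t = i - step;
  // The ternary below is the select that lowers to `sub; cmovns; add`,
  // or, after the conversion, to `sub; js` over a small fix-up block.
  return (t >= 0) ? t : t + wrap;
}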
Hi Anna,

The code you posted is missing some basic blocks (e.g. LBB0_224, which the "js" jumps to). Also, can you provide me with the LLVM IR for this innermost loop? I can use it to figure out why the heuristic chose to convert to a branch.

Thanks,
Amjad
Created attachment 18862 [details]
innermost loop IR (this has the cmov that is converted to a branch)

This cannot be run with llc, but it can be matched directly against the assembly output coming up.
Created attachment 18863 [details]
assembly where the optimization is triggered: cmovs to branches

Relevant (innermost loop) snippet from:

llc input.ll -x86-cmov-converter=true -mattr=+sse2,+cx16,+prfchw,+bmi2,+xsavec,+fsgsbase,+popcnt,+aes,+xsaves,+mmx,+rdseed,+clflushopt,+xsave,+avx,+rtm,+fma,+bmi,+rdrnd,+sse4.1,+sse4.2,+avx2,+sse,+lzcnt,+pclmul,+f16c,+ssse3,+sgx,+cmov,+movbe,+xsaveopt,+adx,+sse3, -O3 --x86-asm-syntax=intel

Note: this is with the patch Amjad pasted in the bug.
Created attachment 18864 [details]
assembly generated without cmov conversion

Relevant (innermost loop) assembly snippet from:

llc input.ll -x86-cmov-converter=false -mattr=+sse2,+cx16,+prfchw,+bmi2,+xsavec,+fsgsbase,+popcnt,+aes,+xsaves,+mmx,+rdseed,+clflushopt,+xsave,+avx,+rtm,+fma,+bmi,+rdrnd,+sse4.1,+sse4.2,+avx2,+sse,+lzcnt,+pclmul,+f16c,+ssse3,+sgx,+cmov,+movbe,+xsaveopt,+adx,+sse3, -O3 --x86-asm-syntax=intel
Hi Amjad,

I've attached the IR and the assembly with and without the cmov conversion. All snippets are for the innermost loop.

Thanks
Created attachment 18865 [details]
A patch to the CMOV heuristic that prints some information on the innermost loop that was optimized

Hi Anna,

Thanks for the reproducer. I modified it a bit and managed to run llc; however, the CMOV was not converted, so I assume my changes led to a different reproducer.

So, I suggest a different approach, which will save you from sharing more sources. I am attaching another patch to the heuristic that prints some information on the innermost loop that was optimized. If you can share that info with me, I might be able to understand the case and provide a solution.

Thanks,
Amjad
Created attachment 18868 [details]
output from the print patch

Hi Amjad,

Attached is the output from running llc with your print patch on the IR. Since exactly one CMOVns is changed to a branch, I've attached the entire output from that print. (The assembly and input IR are the same as those previously attached.)

Hope this helps.

Thanks,
Anna
Thanks Anna, that was super helpful.

I modified the solution patch a little more; it should fix both regressions now. I will need to evaluate this solution before uploading it for review. Hopefully, I will commit the fix in a couple of days.

Index: lib/Target/X86/X86CmovConversion.cpp
===================================================================
--- lib/Target/X86/X86CmovConversion.cpp	(revision 309259)
+++ lib/Target/X86/X86CmovConversion.cpp	(working copy)
@@ -401,7 +401,8 @@
     WorthOptLoop = Diff[0] * 8 >= LoopDepth[0].Depth;
   else if (Diff[1] > Diff[0])
     WorthOptLoop =
-        (Diff[1] - Diff[0]) * 2 >= (LoopDepth[1].Depth - LoopDepth[0].Depth);
+        (Diff[1] - Diff[0]) * 2 >= (LoopDepth[1].Depth - LoopDepth[0].Depth) &&
+        (Diff[1] * 7 >= LoopDepth[1].Depth);
 
   if (!WorthOptLoop)
     return false;

Regards,
Amjad
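To see why tightening the multiplier from 8 to 7 matters (hypothetical numbers, same assumed meaning of the variables as in the earlier sketch): a smaller multiplier demands that the estimated gain be a larger fraction of the loop's critical path before the conversion is allowed, so borderline loops like the one in the second benchmark now keep their CMOV.

// Hypothetical: Diff[1] = 5 cycles of gain, LoopDepth[1].Depth = 38 cycles.
bool ConvertWithTimes8 = 5 * 8 >= 38;   // 40 >= 38 -> true: cmov converted (the regressing case)
bool ConvertWithTimes7 = 5 * 7 >= 38;   // 35 >= 38 -> false: cmov kept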
Thanks a lot Amjad! That has brought the performance back on both benchmarks.

Could you perhaps check in (a smaller version of) this print patch as part of DEBUG? It might help with future regressions.

Also, what I noticed from the patch is that we're fixing exactly these cases based on a hardcoded value (7). I guess you're trying to make it more general? Looking forward to the fix :)
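For reference, a minimal sketch of what folding the print patch into the pass's debug output could look like, assuming the pass follows the usual LLVM DEBUG/dbgs() pattern; the DEBUG_TYPE string, helper name, and parameters here are placeholders, not the actual ones in X86CmovConversion.cpp:

#include "llvm/Support/Debug.h"
#include "llvm/Support/raw_ostream.h"

#define DEBUG_TYPE "x86-cmov-conversion"   // placeholder; use the pass's real DEBUG_TYPE

// Hypothetical helper the pass could call after computing its estimates;
// prints only in +Asserts builds with -debug-only=x86-cmov-conversion.
static void dumpCmovHeuristic(unsigned Diff0, unsigned Diff1,
                              unsigned Depth0, unsigned Depth1,
                              bool WorthOptLoop) {
  DEBUG(llvm::dbgs() << "cmov-conversion: Diff=[" << Diff0 << "," << Diff1
                     << "] Depth=[" << Depth0 << "," << Depth1
                     << "] WorthOptLoop=" << WorthOptLoop << "\n");
}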
I uploaded a fix to https://reviews.llvm.org/D36081.
The fix was committed as https://reviews.llvm.org/rL310352.