Created attachment 18851 [details]
IR file to run with llc

rL308142 introduced an optimization that converts cmov to branches when profitable. We noticed a couple of regressions (around 2-5%) on Skylake hardware on internal benchmarks. Performance was back to normal when -x86-cmov-converter=false was supplied.

I've tried to reduce the IR as much as possible and added a main method along with it. However, the regression (measured with the time command) is not clearly visible on the reduced case; it seems to be attributed to noise. I've attached the IR and the generated assembly. Perhaps something will jump out wrt the heuristic, which seems to have a performance cliff: we convert a cmov to a branch when the gain is greater than 25% of the misprediction penalty.

Reproduce as:

llc -mcpu=skylake -mattr=+sse2,+cx16,+prfchw,+bmi2,+xsavec,+fsgsbase,+popcnt,+aes,+xsaves,+mmx,+rdseed,+clflushopt,+xsave,+avx,+rtm,+fma,+bmi,+rdrnd,+sse4.1,+sse4.2,+avx2,+sse,+lzcnt,+pclmul,+f16c,+ssse3,+sgx,+cmov,+movbe,+xsaveopt,+adx,+sse3, -O3 -x86-cmov-converter=false test.ll ; mv test.s test.falsenative.s; gcc test.falsenative.s -o test.falsenative
llc -mcpu=skylake -mattr=+sse2,+cx16,+prfchw,+bmi2,+xsavec,+fsgsbase,+popcnt,+aes,+xsaves,+mmx,+rdseed,+clflushopt,+xsave,+avx,+rtm,+fma,+bmi,+rdrnd,+sse4.1,+sse4.2,+avx2,+sse,+lzcnt,+pclmul,+f16c,+ssse3,+sgx,+cmov,+movbe,+xsaveopt,+adx,+sse3, -O3 -x86-cmov-converter=true test.ll ; mv test.s test.truenative.s; gcc test.truenative.s -o test.truenative

echo "time for x86-cmov-converter=true"
time ./test.truenative > chk2
echo "time for x86-cmov-converter=false"
time ./test.falsenative > chk

time for x86-cmov-converter=true
real    0m0.155s
user    0m0.065s
sys     0m0.007s

time for x86-cmov-converter=false
real    0m0.154s
user    0m0.064s
sys     0m0.008s
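To make the 25% figure concrete, here is a rough sketch of the shape of that check. The names and the exact expression are my assumptions, not the actual code in X86CmovConversion.cpp:

// Rough sketch only, not the real implementation.
// GainInCycles: estimated cycles saved per iteration by removing the cmov.
// MispredictPenalty: taken from the target's scheduling model.
static bool worthConvertingCmov(unsigned GainInCycles, unsigned MispredictPenalty) {
  // "convert a cmov to a branch when the gain is greater than 25% of the
  // misprediction penalty"
  return GainInCycles * 4 > MispredictPenalty;
}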
Just FYI, this is the commit:

[X86] X86::CMOV to Branch heuristic based optimization.

LLVM compiler recognizes opportunities to transform a branch into IR select instruction(s) - later it will be lowered into X86::CMOV instruction, assuming no other optimization eliminated the SelectInst.
However, it is not always profitable to emit X86::CMOV instruction. For example, branch is preferable over an X86::CMOV instruction when:
1. Branch is well predicted
2. Condition operand is expensive, compared to True-value and the False-value operands

In CodeGenPrepare pass there is a shallow optimization that tries to convert SelectInst into branch, but it is not enough.
This commit implements a machine optimization pass that converts X86::CMOV instruction(s) into branch, based on a conservative heuristic.

Differential Revision: https://reviews.llvm.org/D34769

Amjad, could you please take a look at this?
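As an aside, a hypothetical example of the second case the commit message lists (expensive condition operand, cheap true/false values), where a well-predicted branch tends to beat a cmov. Purely illustrative, not from the regressing benchmark:

// The condition depends on a (potentially cache-missing) load, while both
// select operands are cheap constants. With a well-predicted branch the CPU
// can speculate past the compare; a cmov keeps `acc` data-dependent on the load.
static int count_signs(const int *idx, const int *data, int n) {
  int acc = 0;
  for (int i = 0; i < n; ++i) {
    int v = data[idx[i]];      // expensive condition operand
    acc += (v > 0) ? 1 : -1;   // cheap true/false values
  }
  return acc;
}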
Hi Anna,

Thanks for reporting this issue.
I took a look at your reproducer, and I am still convinced that the CMOV heuristic is profitable when running on industry benchmarks.
Furthermore, I saw 10-40% gains on several workloads when running on Skylake machines, and did not see regressions that are directly caused by this change.

I believe that the code you provided should not suffer from misprediction due to the new branches that the CMOV was converted to; do you agree?

Having said that, I agree that in this example the benefit of converting the CMOV to a branch is not high enough and cannot justify the overhead that occurs in the code (e.g. more spills).
The CMOV heuristic is already conservative, but I can add one more constraint to help overcome this issue.

Please, can you apply this patch and check if it solves the regressions you see in your benchmark?

Index: lib/Target/X86/X86CmovConversion.cpp
===================================================================
--- lib/Target/X86/X86CmovConversion.cpp	(revision 309259)
+++ lib/Target/X86/X86CmovConversion.cpp	(working copy)
@@ -401,7 +401,8 @@
     WorthOptLoop = Diff[0] * 8 >= LoopDepth[0].Depth;
   else if (Diff[1] > Diff[0])
     WorthOptLoop =
-        (Diff[1] - Diff[0]) * 2 >= (LoopDepth[1].Depth - LoopDepth[0].Depth);
+        (Diff[1] - Diff[0]) * 2 >= (LoopDepth[1].Depth - LoopDepth[0].Depth) &&
+        (Diff[1] * 8 >= LoopDepth[1].Depth);
 
   if (!WorthOptLoop)
     return false;

I checked this patch with your reproducer and it results in keeping the CMOVs.
I am evaluating this patch before uploading it for review. Your feedback would be very helpful.

Thanks,
Amjad
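A hedged reading of the extra clause (the meaning of the variables is my assumption from the surrounding code, not documented semantics): Diff[1] is roughly the per-iteration critical-path shortening on the pessimistic dependency estimate, and LoopDepth[1].Depth is the loop's critical path with the CMOV kept, so the new clause additionally requires the absolute gain to be a non-trivial fraction of the loop depth. A hypothetical numeric instance of how it can flip the decision:

// Hypothetical numbers only, to show how the extra clause can change the outcome.
unsigned Diff0 = 1, Diff1 = 3;       // assumed per-iteration gain (cycles) on the two estimates
unsigned Depth0 = 38, Depth1 = 40;   // assumed loop critical path (cycles) on the two estimates
bool OldWorth = (Diff1 - Diff0) * 2 >= (Depth1 - Depth0);   // 4 >= 2   -> true: convert
bool NewWorth = OldWorth && (Diff1 * 8 >= Depth1);          // 24 >= 40 -> false: keep the CMOV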
Hi Amjad, thanks for your response.

(In reply to Amjad Aboud from comment #3)
> I took a look at your reproducer, and I am still convinced that the CMOV
> heuristic is profitable when running on industry benchmarks.

It might be profitable for *many* industry benchmarks, but the regressions we saw are also from industry-standard benchmarks for Java programs (i.e. not internally developed).

> Furthermore, I saw 10-40% gains on several workloads when running on Skylake
> machines, and did not see regressions that are directly caused by this
> change.

My main concern is that I didn't see any performance numbers on the review thread regarding this optimization. SPEC benchmarks or the LNT performance suite perhaps? I understand that CMOV to branch conversion is itself tricky to get right performance-wise, since it is difficult to statically estimate the branch predictor.

The second concern I have is that perhaps the heuristic is not conservative enough: is there a reason for requiring the gain to be greater than 25% of the misprediction penalty, where that penalty is calculated from the scheduler model? Curious about the "25%".

> I believe that the code you provided should not suffer from misprediction
> due to the new branches that the CMOV was converted to; do you agree?

Could you provide more information about this transform? While reducing the testcase, I kept the branches intact, but I removed instructions which were outside of these loops. I assumed this would not affect the rate of mispredictions due to the new branches in the reduced versus original IR.

> Having said that, I agree that in this example the benefit of converting
> the CMOV to a branch is not high enough and cannot justify the overhead
> that occurs in the code (e.g. more spills).
> The CMOV heuristic is already conservative, but I can add one more
> constraint to help overcome this issue.
>
> Please, can you apply this patch and check if it solves the regressions you
> see in your benchmark?
>
> Index: lib/Target/X86/X86CmovConversion.cpp
> ===================================================================
> --- lib/Target/X86/X86CmovConversion.cpp	(revision 309259)
> +++ lib/Target/X86/X86CmovConversion.cpp	(working copy)
> @@ -401,7 +401,8 @@
>      WorthOptLoop = Diff[0] * 8 >= LoopDepth[0].Depth;
>    else if (Diff[1] > Diff[0])
>      WorthOptLoop =
> -        (Diff[1] - Diff[0]) * 2 >= (LoopDepth[1].Depth - LoopDepth[0].Depth);
> +        (Diff[1] - Diff[0]) * 2 >= (LoopDepth[1].Depth - LoopDepth[0].Depth) &&
> +        (Diff[1] * 8 >= LoopDepth[1].Depth);
> 
>    if (!WorthOptLoop)
>      return false;
>
> I checked this patch with your reproducer and it results in keeping the CMOVs.
> I am evaluating this patch before uploading it for review. Your feedback
> would be very helpful.

Thanks for the patch! I will try this on the regressed benchmarks and then on the entire set of Java benchmarks.

Also, could you please post some performance numbers from SPEC or LNT for this optimization on the original review thread, so that the community has a sense of the performance gains/losses to expect? Thank you!

>
> Thanks,
> Amjad
Hi Amjad,

The patch has helped for one of the benchmarks, whose reduced reproducer I had initially attached to the bug. However, the second benchmark still has a 5% regression. There is one cmovns which is converted to a branch, and it's within an innermost loop nested at level 3.

Here's the assembly snippet without the cmov to branch conversion:

# BB#219:                               # %bb1382
                                        #   in Loop: Header=BB0_72 Depth=2
	lea	rbx, [rax + 12]
	test	r9d, r9d
	je	.LBB0_220
# BB#221:                               # %select.false
                                        #   in Loop: Header=BB0_72 Depth=2
	mov	edi, r11d
	sub	edi, r9d
.LBB0_222:                              # %bb1386
                                        #   Parent Loop BB0_42 Depth=1
                                        #     Parent Loop BB0_72 Depth=2
                                        # =>    This Inner Loop Header: Depth=3
	mov	ecx, r9d
	sub	ecx, edi
	mov	r9d, r11d
	mov	ebp, 0
	; this one is converted to branches.
	cmovns	r9d, ebp
	add	r9d, ecx
	cmp	dword ptr gs:[104], 0
	jne	.LBB0_223
# BB#224:                               # %bb1467
                                        #   in Loop: Header=BB0_222 Depth=3
	cmp	esi, r9d
	jbe	.LBB0_289
.LBB0_225:                              # %bb1498
                                        #   in Loop: Header=BB0_222 Depth=3
	mov	ebp, r9d
	mov	ecx, dword ptr [rbx + 4*rbp]

With the cmov to branch conversion:

# BB#220:                               # %bb1382
                                        #   in Loop: Header=BB0_73 Depth=2
	lea	rbx, [rax + 12]
	test	r15d, r15d
	je	.LBB0_221
# BB#222:                               # %select.false
                                        #   in Loop: Header=BB0_73 Depth=2
	mov	edi, r13d
	sub	edi, r15d
.LBB0_223:                              # %bb1386
                                        #   in Loop: Header=BB0_73 Depth=2
	; no cmovns here: gets converted to branch.
	xor	ecx, ecx
	sub	r15d, edi
	js	.LBB0_224
.LBB0_225:                              # %bb1386
                                        #   in Loop: Header=BB0_73 Depth=2
	add	r15d, ecx
	cmp	dword ptr gs:[104], 0
	jne	.LBB0_226
.LBB0_227:                              # %bb1467
                                        #   in Loop: Header=BB0_73 Depth=2
	cmp	esi, r15d
	jbe	.LBB0_292
.LBB0_228:                              # %bb1498
                                        #   in Loop: Header=BB0_73 Depth=2
	mov	ebp, r15d
	mov	ecx, dword ptr [rbx + 4*rbp]
	cmp	ecx, r9d
	je	.LBB0_230
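For reference, here is my reconstruction of what that cmovns computes, inferred from the assembly rather than taken from the original source: it looks like a wrap-around index update, where the select on the sign of the subtraction becomes the cmovns.

// Hypothetical reconstruction; names are invented, not from the original source.
// In the first snippet: i = r9d, step = edi, wrap = r11d.
static int next_index(int i, int step, int wrap) {
  int t = i - step;
  // The ternary below is the select that lowers to `sub; cmovns; add`,
  // or, after the conversion, to `sub; js` over a small fix-up block.
  return (t >= 0) ? t : t + wrap;
}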
Hi Anna,

The code you posted is missing some basic blocks (e.g. LBB0_224, which the "js" jumps to). Also, can you provide me with the LLVM IR for this innermost loop? I can use it to figure out why the heuristic chose to convert to a branch.

Thanks,
Amjad
Created attachment 18862 [details]
innermost loop IR (this has the cmov that is converted to a branch)

This cannot be run with llc, but it can be matched directly against the assembly output coming up.
Created attachment 18863 [details]
assembly where the optimization is triggered: cmovs to branches

Relevant (innermost loop) snippet from:

llc input.ll -x86-cmov-converter=true -mattr=+sse2,+cx16,+prfchw,+bmi2,+xsavec,+fsgsbase,+popcnt,+aes,+xsaves,+mmx,+rdseed,+clflushopt,+xsave,+avx,+rtm,+fma,+bmi,+rdrnd,+sse4.1,+sse4.2,+avx2,+sse,+lzcnt,+pclmul,+f16c,+ssse3,+sgx,+cmov,+movbe,+xsaveopt,+adx,+sse3, -O3 --x86-asm-syntax=intel

Note: this is with the patch Amjad pasted in the bug.
Created attachment 18864 [details]
assembly generated without cmov conversion

Relevant (innermost loop) assembly snippet from:

llc input.ll -x86-cmov-converter=false -mattr=+sse2,+cx16,+prfchw,+bmi2,+xsavec,+fsgsbase,+popcnt,+aes,+xsaves,+mmx,+rdseed,+clflushopt,+xsave,+avx,+rtm,+fma,+bmi,+rdrnd,+sse4.1,+sse4.2,+avx2,+sse,+lzcnt,+pclmul,+f16c,+ssse3,+sgx,+cmov,+movbe,+xsaveopt,+adx,+sse3, -O3 --x86-asm-syntax=intel
Hi Amjad,

I've attached the IR and the assembly with and without the cmov conversion. All snippets are for the innermost loop.

Thanks
Created attachment 18865 [details]
A patch to the CMOV heuristic that prints some information on the innermost loop that was optimized

Hi Anna,

Thanks for the reproducer. I modified it a bit and managed to run llc; however, the CMOV was not converted, so I assume my changes led to a different reproducer.

So, I suggest a different approach, which will save you from sharing more sources. I am attaching another patch to the heuristic that prints some information on the innermost loop that was optimized. If you can share that info with me, I might be able to understand the case and provide a solution.

Thanks,
Amjad
Created attachment 18868 [details]
output from the print patch

Hi Amjad,

Attached is the output from running llc with your print patch on the IR. Since exactly one CMOVns is changed to a branch, I've attached the entire output from that print. (The assembly and input IR are the same as those previously attached.)

Hope this helps.

Thanks,
Anna
Thanks Anna, that was super helpful.

I modified the solution patch a little more; it should fix both regressions now. I will need to evaluate this solution before uploading it for review. Hopefully, I will commit the fix in a couple of days.

Index: lib/Target/X86/X86CmovConversion.cpp
===================================================================
--- lib/Target/X86/X86CmovConversion.cpp	(revision 309259)
+++ lib/Target/X86/X86CmovConversion.cpp	(working copy)
@@ -401,7 +401,8 @@
     WorthOptLoop = Diff[0] * 8 >= LoopDepth[0].Depth;
   else if (Diff[1] > Diff[0])
     WorthOptLoop =
-        (Diff[1] - Diff[0]) * 2 >= (LoopDepth[1].Depth - LoopDepth[0].Depth);
+        (Diff[1] - Diff[0]) * 2 >= (LoopDepth[1].Depth - LoopDepth[0].Depth) &&
+        (Diff[1] * 7 >= LoopDepth[1].Depth);
 
   if (!WorthOptLoop)
     return false;

Regards,
Amjad
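To see why tightening the multiplier from 8 to 7 matters (hypothetical numbers, same assumed meaning of the variables as in the earlier sketch): a smaller multiplier demands that the estimated gain be a larger fraction of the loop's critical path before the conversion is allowed, so borderline loops like the one in the second benchmark now keep their CMOV.

// Hypothetical: Diff[1] = 5 cycles of gain, LoopDepth[1].Depth = 38 cycles.
bool ConvertWithTimes8 = 5 * 8 >= 38;   // 40 >= 38 -> true: cmov converted (the regressing case)
bool ConvertWithTimes7 = 5 * 7 >= 38;   // 35 >= 38 -> false: cmov kept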
Thanks a lot Amjad! That has brought the performance back on both benchmarks.

Could you perhaps check in (a smaller version of) this print patch as part of DEBUG? It might help with future regressions.

Also, what I noticed from the patch is that we're fixing exactly these cases based on a hardcoded value (7). I guess you're trying to make it more general? Looking forward to the fix :)
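For reference, a minimal sketch of what folding the print patch into the pass's debug output could look like, assuming the pass follows the usual LLVM DEBUG/dbgs() pattern; the DEBUG_TYPE string, helper name, and parameters here are placeholders, not the actual ones in X86CmovConversion.cpp:

#include "llvm/Support/Debug.h"
#include "llvm/Support/raw_ostream.h"

#define DEBUG_TYPE "x86-cmov-conversion"   // placeholder; use the pass's real DEBUG_TYPE

// Hypothetical helper the pass could call after computing its estimates;
// prints only in +Asserts builds with -debug-only=x86-cmov-conversion.
static void dumpCmovHeuristic(unsigned Diff0, unsigned Diff1,
                              unsigned Depth0, unsigned Depth1,
                              bool WorthOptLoop) {
  DEBUG(llvm::dbgs() << "cmov-conversion: Diff=[" << Diff0 << "," << Diff1
                     << "] Depth=[" << Depth0 << "," << Depth1
                     << "] WorthOptLoop=" << WorthOptLoop << "\n");
}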
I uploaded a fix to https://reviews.llvm.org/D36081.
The fix was committed as https://reviews.llvm.org/rL310352.