the question had been asked member, disappointingly deleted. comments saying measurements flawed , not make sense.
however able reproduce original problem small benchmark under jmh:
package bench; import org.openjdk.jmh.annotations.*; import org.openjdk.jmh.runner.*; import org.openjdk.jmh.runner.options.*; import java.util.concurrent.*; @state(scope.benchmark) public class loopinc { private int getvalue() { return threadlocalrandom.current().nextint(2); } @benchmark public int directinc() { int result = 0; (int = 0; < 1000; i++) { switch (getvalue()) { case 0: break; case 1: result++; break; } } return result; } @benchmark public int indirectinc() { int result = 0; (int = 0; < 1000; i++) { boolean incr = false; switch (getvalue()) { case 0: break; case 1: incr = true; break; } if (incr) { result++; } } return result; } public static void main(string[] args) throws runnerexception { options options = new optionsbuilder() .include("bench.loopinc.*") .warmupiterations(5) .measurementiterations(10) .forks(3) .timeunit(timeunit.milliseconds) .build(); new runner(options).run(); } } the benchmarks shows indirectinc works 3 times faster, though "optimization" not obvious @ all. 1 assume indirectinc should work bit slower because involves intermediate operation.
benchmark mode cnt score error units loopinc.directinc thrpt 30 127,301 ± 0,202 ops/ms loopinc.indirectinc thrpt 30 378,147 ± 1,144 ops/ms
java version "1.8.0_51" java(tm) se runtime environment (build 1.8.0_51-b16) java hotspot(tm) 64-bit server vm (build 25.51-b03, mixed mode) what causes jit compile indirectinc better similar directinc?
ok, how approach these things.
try reproduce it. okay, reproduces:
benchmark mode cnt score error units loopinc.directinc thrpt 15 175.678 ± 1.118 ops/ms loopinc.indirectinc thrpt 15 641.413 ± 9.722 ops/mstry see generated assembly
-prof perfasm. long story short, produces lot of generated code, , want limit loop unrolling. but, can affect performance, , can pretty cause. so, let's re-run-xx:loopunrolllimit=1. okay, score lower, difference still there, excellent:benchmark mode cnt score error units loopinc.directinc thrpt 15 161.147 ± 6.101 ops/ms loopinc.indirectinc thrpt 15 489.430 ± 1.698 ops/mstake @ generated code, still nothing pops out our eye. well, seems interesting. let's on properly. can characterize workload? of course can, of
-prof perfnorm, normalizes hardware counters per benchmark op. let's see:benchmark mode cnt score error units loopinc.directinc thrpt 15 161.875 ± 3.038 ops/ms loopinc.directinc:·cpi thrpt 3 0.967 ± 0.196 #/op loopinc.directinc:·l1-dcache-load-misses thrpt 3 0.394 ± 3.663 #/op loopinc.directinc:·l1-dcache-loads thrpt 3 2149.594 ± 228.166 #/op loopinc.directinc:·l1-dcache-store-misses thrpt 3 0.114 ± 1.001 #/op loopinc.directinc:·l1-dcache-stores thrpt 3 1073.666 ± 96.066 #/op loopinc.directinc:·l1-icache-load-misses thrpt 3 0.965 ± 22.984 #/op loopinc.directinc:·llc-loads thrpt 3 0.204 ± 2.763 #/op loopinc.directinc:·llc-stores thrpt 3 0.060 ± 0.633 #/op loopinc.directinc:·branch-misses thrpt 3 536.068 ± 43.293 #/op loopinc.directinc:·branches thrpt 3 3728.890 ± 220.539 #/op loopinc.directinc:·cycles thrpt 3 26219.146 ± 6287.590 #/op loopinc.directinc:·dtlb-load-misses thrpt 3 0.063 ± 0.124 #/op loopinc.directinc:·dtlb-loads thrpt 3 2136.942 ± 165.990 #/op loopinc.directinc:·dtlb-store-misses thrpt 3 0.022 ± 0.029 #/op loopinc.directinc:·dtlb-stores thrpt 3 1084.787 ± 417.281 #/op loopinc.directinc:·itlb-load-misses thrpt 3 0.081 ± 0.333 #/op loopinc.directinc:·itlb-loads thrpt 3 3.623 ± 19.955 #/op loopinc.directinc:·instructions thrpt 3 27114.052 ± 1843.720 #/op loopinc.indirectinc thrpt 15 489.164 ± 2.692 ops/ms loopinc.indirectinc:·cpi thrpt 3 0.281 ± 0.015 #/op loopinc.indirectinc:·l1-dcache-load-misses thrpt 3 0.503 ± 9.071 #/op loopinc.indirectinc:·l1-dcache-loads thrpt 3 2149.806 ± 369.040 #/op loopinc.indirectinc:·l1-dcache-store-misses thrpt 3 0.167 ± 1.370 #/op loopinc.indirectinc:·l1-dcache-stores thrpt 3 1073.895 ± 186.741 #/op loopinc.indirectinc:·l1-icache-load-misses thrpt 3 0.313 ± 1.275 #/op loopinc.indirectinc:·branch-misses thrpt 3 1.102 ± 0.375 #/op loopinc.indirectinc:·branches thrpt 3 2143.670 ± 228.475 #/op loopinc.indirectinc:·cycles thrpt 3 8701.665 ± 706.183 #/op loopinc.indirectinc:·dtlb-load-misses thrpt 3 0.020 ± 0.301 #/op loopinc.indirectinc:·dtlb-loads thrpt 3 2141.965 ± 135.852 #/op loopinc.indirectinc:·dtlb-store-misses thrpt 3 0.002 ± 0.029 #/op loopinc.indirectinc:·dtlb-stores thrpt 3 1070.376 ± 81.445 #/op loopinc.indirectinc:·itlb-load-misses thrpt 3 0.007 ± 0.135 #/op loopinc.indirectinc:·itlb-loads thrpt 3 0.310 ± 5.768 #/op loopinc.indirectinc:·instructions thrpt 3 30968.207 ± 3627.540 #/opoh, both benchmarks have comparable number of instructions. slower 1 takes more cycles (that's why cpi not ideal in
directinc;indirectinc, however, produces close-to-ideal cpi). if closely @ possible causes: there not many cache misses, not many tlb misses, slow benchmark has lots of branch misses. aha! know in generated code.let's @ generated code again.
-prof perfasmconveniently highlights jumps. , see this...directinc:
╭│ 0x00007fa0a82a50ff: jmp 0x00007fa0a82a5116 11.39% 16.90% ││ ↗ 0x00007fa0a82a5101: inc %edx ;*iinc ││ │ ; - org.openjdk.loopinc::directinc@46 (line 18) 12.52% 23.11% ││ │↗↗ 0x00007fa0a82a5103: mov %r10,0xe8(%r11) ;*invokevirtual putlong ││ │││ ; - java.util.concurrent.threadlocalrandom::nextseed@27 (line 241) 12.00% 8.14% ││ │││ 0x00007fa0a82a510a: inc %r8d ;*iinc ││ │││ ; - org.openjdk.loopinc::directinc@46 (line 18) 0.03% 0.03% ││ │││ 0x00007fa0a82a510d: cmp $0x3e8,%r8d │╰ │││ 0x00007fa0a82a5114: jge 0x00007fa0a82a50c7 ;*aload_0 │ │││ ; - org.openjdk.loopinc::directinc@11 (line 19) 0.80% 0.91% ↘ │││ 0x00007fa0a82a5116: mov 0xf0(%r11),%r10d ;*invokevirtual getint │││ ; - java.util.concurrent.threadlocalrandom::current@9 (line 222) 4.28% 1.23% │││ 0x00007fa0a82a511d: test %r10d,%r10d ╭│││ 0x00007fa0a82a5120: je 0x00007fa0a82a517b ;*ifne ││││ ; - java.util.concurrent.threadlocalrandom::current@12 (line 222) 2.11% 0.01% ││││ 0x00007fa0a82a5122: movabs $0x9e3779b97f4a7c15,%r10 0.01% 0.07% ││││ 0x00007fa0a82a512c: add 0xe8(%r11),%r10 ;*ladd ││││ ; - java.util.concurrent.threadlocalrandom::nextseed@24 (line 242) 7.73% 1.89% ││││ 0x00007fa0a82a5133: mov %r10,%r9 1.21% 1.84% ││││ 0x00007fa0a82a5136: shr $0x21,%r9 1.90% 0.03% ││││ 0x00007fa0a82a513a: xor %r10,%r9 2.02% 0.03% ││││ 0x00007fa0a82a513d: movabs $0xff51afd7ed558ccd,%rcx 0.94% 1.82% ││││ 0x00007fa0a82a5147: imul %rcx,%r9 ;*lmul ││││ ; - java.util.concurrent.threadlocalrandom::mix32@9 (line 182) 7.01% 2.40% ││││ 0x00007fa0a82a514b: mov %r9,%rcx ││││ 0x00007fa0a82a514e: shr $0x21,%rcx 1.89% 0.70% ││││ 0x00007fa0a82a5152: xor %r9,%rcx 3.11% 2.55% ││││ 0x00007fa0a82a5155: movabs $0xc4ceb9fe1a85ec53,%r9 0.99% 1.50% ││││ 0x00007fa0a82a515f: imul %r9,%rcx 7.66% 2.89% ││││ 0x00007fa0a82a5163: shr $0x20,%rcx 3.70% 1.97% ││││ 0x00007fa0a82a5167: mov %ecx,%r9d 0.11% ││││ 0x00007fa0a82a516a: , $0x1,%r9d ;*iand ││││ ; - java.util.concurrent.threadlocalrandom::nextint@34 (line 356) 3.76% 11.13% ││││ 0x00007fa0a82a516e: cmp $0x1,%r9d │╰││ 0x00007fa0a82a5172: je 0x00007fa0a82a5101 10.48% 16.62% │ ││ 0x00007fa0a82a5174: test %r9d,%r9d │ ╰│ 0x00007fa0a82a5177: je 0x00007fa0a82a5103 ;*lookupswitch │ │ ; - org.openjdk.loopinc::directinc@15 (line 19) │ ╰ 0x00007fa0a82a5179: jmp 0x00007fa0a82a5103 ;*aload_0 │ ; - org.openjdk.loopinc::directinc@11 (line 19) ↘ 0x00007fa0a82a517b: mov $0xffffff5d,%esiindirectinc:
0.01% 0.01% ↗ 0x00007f65588d8260: mov %edx,%r9d 0.01% │ 0x00007f65588d8263: nopw 0x0(%rax,%rax,1) 11.99% 11.38% │ 0x00007f65588d826c: data16 data16 xchg %ax,%ax ;*iconst_0 │ ; - org.openjdk.loopinc::indirectinc@11 (line 34) │ 0x00007f65588d8270: mov 0xf0(%r8),%r10d ;*invokevirtual getint │ ; - java.util.concurrent.threadlocalrandom::current@9 (line 222) │ 0x00007f65588d8277: test %r10d,%r10d │ 0x00007f65588d827a: je 0x00007f65588d8331 ;*ifne │ ; - java.util.concurrent.threadlocalrandom::current@12 (line 222) 0.01% │ 0x00007f65588d8280: movabs $0x9e3779b97f4a7c15,%r10 11.80% 11.49% │ 0x00007f65588d828a: add 0xe8(%r8),%r10 ;*ladd │ ; - java.util.concurrent.threadlocalrandom::nextseed@24 (line 242) 0.01% 0.01% │ 0x00007f65588d8291: mov %r10,0xe8(%r8) ;*invokevirtual putlong │ ; - java.util.concurrent.threadlocalrandom::nextseed@27 (line 241) │ 0x00007f65588d8298: mov %r9d,%edx 0.01% 0.01% │ 0x00007f65588d829b: inc %edx 11.12% 12.40% │ 0x00007f65588d829d: mov %r10,%rcx 0.01% │ 0x00007f65588d82a0: shr $0x21,%rcx 0.03% │ 0x00007f65588d82a4: xor %r10,%rcx 0.06% 0.03% │ 0x00007f65588d82a7: movabs $0xff51afd7ed558ccd,%r10 12.38% 13.94% │ 0x00007f65588d82b1: imul %r10,%rcx ;*lmul │ ; - java.util.concurrent.threadlocalrandom::mix32@9 (line 182) 0.03% 0.01% │ 0x00007f65588d82b5: mov %rcx,%r10 │ 0x00007f65588d82b8: shr $0x21,%r10 0.03% │ 0x00007f65588d82bc: xor %rcx,%r10 11.43% 12.62% │ 0x00007f65588d82bf: movabs $0xc4ceb9fe1a85ec53,%rcx 0.01% │ 0x00007f65588d82c9: imul %rcx,%r10 0.34% 0.30% │ 0x00007f65588d82cd: shr $0x20,%r10 0.85% 0.76% │ 0x00007f65588d82d1: mov %r10d,%r10d 11.81% 11.51% │ 0x00007f65588d82d4: , $0x1,%r10d 2.16% 1.78% │ 0x00007f65588d82d8: cmp $0x1,%r10d 3.45% 3.00% │ 0x00007f65588d82dc: cmovne %r9d,%edx <----- here 17.55% 15.86% │ 0x00007f65588d82e0: inc %r11d ;*iinc │ ; - org.openjdk.loopinc::indirectinc@56 (line 33) │ 0x00007f65588d82e3: cmp $0x3e8,%r11d ╰ 0x00007f65588d82ea: jl 0x00007f65588d8260 ;*if_icmpge ; - org.openjdk.loopinc::indirectinc@8 (line 33)notice
cmovneinstead ofjmp-- why have more "predictable" branches. hotspot profiles branches, , emits conditional move when branch profile branch flat. in other words, dodge branch misprediction paying bit added latency of conditional move. however, in case, switch special: has more two alternatives (0, 1, , "nothing"). why, speculate,resultincrement not being folded cmov. (generally speaking, hotspot have stored 0resultin "default", blew it, oh well)to confirm hypothesis, let's make
directcompleteinccase, still useswitch, cover cases:@benchmark public int directcompleteinc() { int result = 0; (int = 0; < 1000; i++) { switch (getvalue()) { case 1: result++; break; default: break; } } return result; }...and measure it, , time without options, op did:
benchmark mode cnt score error units loopinc.directcompleteinc thrpt 5 644.414 ± 0.371 ops/ms loopinc.directinc thrpt 5 174.974 ± 0.103 ops/ms loopinc.indirectinc thrpt 5 644.015 ± 0.533 ops/msthere.
confirm
directcompleteincusingcmov-prof perfasm. does.drink up.
Comments
Post a Comment