java - Why indirect increment is faster than direct inc? -

the question had been asked member, disappointingly deleted. comments saying measurements flawed , not make sense.

however able reproduce original problem small benchmark under jmh:

package bench;  import org.openjdk.jmh.annotations.*; import org.openjdk.jmh.runner.*; import org.openjdk.jmh.runner.options.*; import java.util.concurrent.*;  @state(scope.benchmark) public class loopinc {      private int getvalue() {         return threadlocalrandom.current().nextint(2);     }      @benchmark     public int directinc() {         int result = 0;         (int = 0; < 1000; i++) {             switch (getvalue()) {                 case 0:                     break;                 case 1:                     result++;                     break;             }         }         return result;     }      @benchmark     public int indirectinc() {         int result = 0;         (int = 0; < 1000; i++) {             boolean incr = false;             switch (getvalue()) {                 case 0:                     break;                 case 1:                     incr = true;                     break;             }              if (incr) {                 result++;             }         }         return result;     }      public static void main(string[] args) throws runnerexception {         options options = new optionsbuilder()                 .include("bench.loopinc.*")                 .warmupiterations(5)                 .measurementiterations(10)                 .forks(3)                 .timeunit(timeunit.milliseconds)                 .build();         new runner(options).run();     } }

the benchmarks shows indirectinc works 3 times faster, though "optimization" not obvious @ all. 1 assume indirectinc should work bit slower because involves intermediate operation.

benchmark             mode  cnt    score   error   units loopinc.directinc    thrpt   30  127,301 ± 0,202  ops/ms loopinc.indirectinc  thrpt   30  378,147 ± 1,144  ops/ms

java version "1.8.0_51" java(tm) se runtime environment (build 1.8.0_51-b16) java hotspot(tm) 64-bit server vm (build 25.51-b03, mixed mode)

what causes jit compile indirectinc better similar directinc?

ok, how approach these things.

try reproduce it. okay, reproduces:

benchmark             mode  cnt    score   error   units loopinc.directinc    thrpt   15  175.678 ± 1.118  ops/ms loopinc.indirectinc  thrpt   15  641.413 ± 9.722  ops/ms

try see generated assembly -prof perfasm. long story short, produces lot of generated code, , want limit loop unrolling. but, can affect performance, , can pretty cause. so, let's re-run -xx:loopunrolllimit=1. okay, score lower, difference still there, excellent:
```
benchmark             mode  cnt    score   error   units loopinc.directinc    thrpt   15  161.147 ± 6.101  ops/ms loopinc.indirectinc  thrpt   15  489.430 ± 1.698  ops/ms 
```

take @ generated code, still nothing pops out our eye. well, seems interesting. let's on properly. can characterize workload? of course can, of -prof perfnorm, normalizes hardware counters per benchmark op. let's see:

benchmark                                     mode  cnt      score      error   units loopinc.directinc                            thrpt   15    161.875 ±    3.038  ops/ms loopinc.directinc:·cpi                       thrpt    3      0.967 ±    0.196    #/op loopinc.directinc:·l1-dcache-load-misses     thrpt    3      0.394 ±    3.663    #/op loopinc.directinc:·l1-dcache-loads           thrpt    3   2149.594 ±  228.166    #/op loopinc.directinc:·l1-dcache-store-misses    thrpt    3      0.114 ±    1.001    #/op loopinc.directinc:·l1-dcache-stores          thrpt    3   1073.666 ±   96.066    #/op loopinc.directinc:·l1-icache-load-misses     thrpt    3      0.965 ±   22.984    #/op loopinc.directinc:·llc-loads                 thrpt    3      0.204 ±    2.763    #/op loopinc.directinc:·llc-stores                thrpt    3      0.060 ±    0.633    #/op loopinc.directinc:·branch-misses             thrpt    3    536.068 ±   43.293    #/op loopinc.directinc:·branches                  thrpt    3   3728.890 ±  220.539    #/op loopinc.directinc:·cycles                    thrpt    3  26219.146 ± 6287.590    #/op loopinc.directinc:·dtlb-load-misses          thrpt    3      0.063 ±    0.124    #/op loopinc.directinc:·dtlb-loads                thrpt    3   2136.942 ±  165.990    #/op loopinc.directinc:·dtlb-store-misses         thrpt    3      0.022 ±    0.029    #/op loopinc.directinc:·dtlb-stores               thrpt    3   1084.787 ±  417.281    #/op loopinc.directinc:·itlb-load-misses          thrpt    3      0.081 ±    0.333    #/op loopinc.directinc:·itlb-loads                thrpt    3      3.623 ±   19.955    #/op loopinc.directinc:·instructions              thrpt    3  27114.052 ± 1843.720    #/op  loopinc.indirectinc                          thrpt   15    489.164 ±    2.692  ops/ms loopinc.indirectinc:·cpi                     thrpt    3      0.281 ±    0.015    #/op loopinc.indirectinc:·l1-dcache-load-misses   thrpt    3      0.503 ±    9.071    #/op loopinc.indirectinc:·l1-dcache-loads         thrpt    3   2149.806 ±  369.040    #/op loopinc.indirectinc:·l1-dcache-store-misses  thrpt    3      0.167 ±    1.370    #/op loopinc.indirectinc:·l1-dcache-stores        thrpt    3   1073.895 ±  186.741    #/op loopinc.indirectinc:·l1-icache-load-misses   thrpt    3      0.313 ±    1.275    #/op loopinc.indirectinc:·branch-misses           thrpt    3      1.102 ±    0.375    #/op loopinc.indirectinc:·branches                thrpt    3   2143.670 ±  228.475    #/op loopinc.indirectinc:·cycles                  thrpt    3   8701.665 ±  706.183    #/op loopinc.indirectinc:·dtlb-load-misses        thrpt    3      0.020 ±    0.301    #/op loopinc.indirectinc:·dtlb-loads              thrpt    3   2141.965 ±  135.852    #/op loopinc.indirectinc:·dtlb-store-misses       thrpt    3      0.002 ±    0.029    #/op loopinc.indirectinc:·dtlb-stores             thrpt    3   1070.376 ±   81.445    #/op loopinc.indirectinc:·itlb-load-misses        thrpt    3      0.007 ±    0.135    #/op loopinc.indirectinc:·itlb-loads              thrpt    3      0.310 ±    5.768    #/op loopinc.indirectinc:·instructions            thrpt    3  30968.207 ± 3627.540    #/op

oh, both benchmarks have comparable number of instructions. slower 1 takes more cycles (that's why cpi not ideal in directinc; indirectinc, however, produces close-to-ideal cpi). if closely @ possible causes: there not many cache misses, not many tlb misses, slow benchmark has lots of branch misses. aha! know in generated code.

let's @ generated code again. -prof perfasm conveniently highlights jumps. , see this...

directinc:

                  ╭│      0x00007fa0a82a50ff: jmp    0x00007fa0a82a5116  11.39%   16.90%  ││ ↗    0x00007fa0a82a5101: inc    %edx               ;*iinc                   ││ │                                                  ; - org.openjdk.loopinc::directinc@46 (line 18)  12.52%   23.11%  ││ │↗↗  0x00007fa0a82a5103: mov    %r10,0xe8(%r11)    ;*invokevirtual putlong                   ││ │││                                                ; - java.util.concurrent.threadlocalrandom::nextseed@27 (line 241)  12.00%    8.14%  ││ │││  0x00007fa0a82a510a: inc    %r8d               ;*iinc                   ││ │││                                                ; - org.openjdk.loopinc::directinc@46 (line 18)   0.03%    0.03%  ││ │││  0x00007fa0a82a510d: cmp    $0x3e8,%r8d                   │╰ │││  0x00007fa0a82a5114: jge    0x00007fa0a82a50c7  ;*aload_0                   │  │││                                                ; - org.openjdk.loopinc::directinc@11 (line 19)   0.80%    0.91%  ↘  │││  0x00007fa0a82a5116: mov    0xf0(%r11),%r10d   ;*invokevirtual getint                      │││                                                ; - java.util.concurrent.threadlocalrandom::current@9 (line 222)   4.28%    1.23%     │││  0x00007fa0a82a511d: test   %r10d,%r10d                     ╭│││  0x00007fa0a82a5120: je     0x00007fa0a82a517b  ;*ifne                     ││││                                                ; - java.util.concurrent.threadlocalrandom::current@12 (line 222)   2.11%    0.01%    ││││  0x00007fa0a82a5122: movabs $0x9e3779b97f4a7c15,%r10   0.01%    0.07%    ││││  0x00007fa0a82a512c: add    0xe8(%r11),%r10    ;*ladd                     ││││                                                ; - java.util.concurrent.threadlocalrandom::nextseed@24 (line 242)   7.73%    1.89%    ││││  0x00007fa0a82a5133: mov    %r10,%r9   1.21%    1.84%    ││││  0x00007fa0a82a5136: shr    $0x21,%r9   1.90%    0.03%    ││││  0x00007fa0a82a513a: xor    %r10,%r9   2.02%    0.03%    ││││  0x00007fa0a82a513d: movabs $0xff51afd7ed558ccd,%rcx   0.94%    1.82%    ││││  0x00007fa0a82a5147: imul   %rcx,%r9           ;*lmul                     ││││                                                ; - java.util.concurrent.threadlocalrandom::mix32@9 (line 182)   7.01%    2.40%    ││││  0x00007fa0a82a514b: mov    %r9,%rcx                     ││││  0x00007fa0a82a514e: shr    $0x21,%rcx   1.89%    0.70%    ││││  0x00007fa0a82a5152: xor    %r9,%rcx   3.11%    2.55%    ││││  0x00007fa0a82a5155: movabs $0xc4ceb9fe1a85ec53,%r9   0.99%    1.50%    ││││  0x00007fa0a82a515f: imul   %r9,%rcx   7.66%    2.89%    ││││  0x00007fa0a82a5163: shr    $0x20,%rcx   3.70%    1.97%    ││││  0x00007fa0a82a5167: mov    %ecx,%r9d            0.11%    ││││  0x00007fa0a82a516a: ,    $0x1,%r9d          ;*iand                     ││││                                                ; - java.util.concurrent.threadlocalrandom::nextint@34 (line 356)   3.76%   11.13%    ││││  0x00007fa0a82a516e: cmp    $0x1,%r9d                     │╰││  0x00007fa0a82a5172: je     0x00007fa0a82a5101  10.48%   16.62%    │ ││  0x00007fa0a82a5174: test   %r9d,%r9d                     │ ╰│  0x00007fa0a82a5177: je     0x00007fa0a82a5103  ;*lookupswitch                     │  │                                                ; - org.openjdk.loopinc::directinc@15 (line 19)                     │  ╰  0x00007fa0a82a5179: jmp    0x00007fa0a82a5103  ;*aload_0                     │                                                   ; - org.openjdk.loopinc::directinc@11 (line 19)                     ↘     0x00007fa0a82a517b: mov    $0xffffff5d,%esi

indirectinc:

  0.01%    0.01%  ↗  0x00007f65588d8260: mov    %edx,%r9d   0.01%           │  0x00007f65588d8263: nopw   0x0(%rax,%rax,1)  11.99%   11.38%  │  0x00007f65588d826c: data16 data16 xchg %ax,%ax  ;*iconst_0                   │                                                ; - org.openjdk.loopinc::indirectinc@11 (line 34)                   │  0x00007f65588d8270: mov    0xf0(%r8),%r10d    ;*invokevirtual getint                   │                                                ; - java.util.concurrent.threadlocalrandom::current@9 (line 222)                   │  0x00007f65588d8277: test   %r10d,%r10d                   │  0x00007f65588d827a: je     0x00007f65588d8331  ;*ifne                   │                                                ; - java.util.concurrent.threadlocalrandom::current@12 (line 222)   0.01%           │  0x00007f65588d8280: movabs $0x9e3779b97f4a7c15,%r10  11.80%   11.49%  │  0x00007f65588d828a: add    0xe8(%r8),%r10     ;*ladd                   │                                                ; - java.util.concurrent.threadlocalrandom::nextseed@24 (line 242)   0.01%    0.01%  │  0x00007f65588d8291: mov    %r10,0xe8(%r8)     ;*invokevirtual putlong                   │                                                ; - java.util.concurrent.threadlocalrandom::nextseed@27 (line 241)                   │  0x00007f65588d8298: mov    %r9d,%edx   0.01%    0.01%  │  0x00007f65588d829b: inc    %edx  11.12%   12.40%  │  0x00007f65588d829d: mov    %r10,%rcx            0.01%  │  0x00007f65588d82a0: shr    $0x21,%rcx   0.03%           │  0x00007f65588d82a4: xor    %r10,%rcx   0.06%    0.03%  │  0x00007f65588d82a7: movabs $0xff51afd7ed558ccd,%r10  12.38%   13.94%  │  0x00007f65588d82b1: imul   %r10,%rcx          ;*lmul                   │                                                ; - java.util.concurrent.threadlocalrandom::mix32@9 (line 182)   0.03%    0.01%  │  0x00007f65588d82b5: mov    %rcx,%r10                   │  0x00007f65588d82b8: shr    $0x21,%r10            0.03%  │  0x00007f65588d82bc: xor    %rcx,%r10  11.43%   12.62%  │  0x00007f65588d82bf: movabs $0xc4ceb9fe1a85ec53,%rcx            0.01%  │  0x00007f65588d82c9: imul   %rcx,%r10   0.34%    0.30%  │  0x00007f65588d82cd: shr    $0x20,%r10   0.85%    0.76%  │  0x00007f65588d82d1: mov    %r10d,%r10d  11.81%   11.51%  │  0x00007f65588d82d4: ,    $0x1,%r10d   2.16%    1.78%  │  0x00007f65588d82d8: cmp    $0x1,%r10d   3.45%    3.00%  │  0x00007f65588d82dc: cmovne %r9d,%edx         <----- here  17.55%   15.86%  │  0x00007f65588d82e0: inc    %r11d              ;*iinc                   │                                                ; - org.openjdk.loopinc::indirectinc@56 (line 33)                   │  0x00007f65588d82e3: cmp    $0x3e8,%r11d                   ╰  0x00007f65588d82ea: jl     0x00007f65588d8260  ;*if_icmpge                                                            ; - org.openjdk.loopinc::indirectinc@8 (line 33)

notice cmovne instead of jmp -- why have more "predictable" branches. hotspot profiles branches, , emits conditional move when branch profile branch flat. in other words, dodge branch misprediction paying bit added latency of conditional move. however, in case, switch special: has more two alternatives (0, 1, , "nothing"). why, speculate, result increment not being folded cmov. (generally speaking, hotspot have stored 0 result in "default", blew it, oh well)

to confirm hypothesis, let's make directcompleteinc case, still use switch, cover cases:

@benchmark public int directcompleteinc() {     int result = 0;     (int = 0; < 1000; i++) {         switch (getvalue()) {             case 1:                 result++;                 break;             default:                 break;         }     }     return result; }

...and measure it, , time without options, op did:

benchmark                   mode  cnt    score    error   units loopinc.directcompleteinc  thrpt    5  644.414 ±  0.371  ops/ms loopinc.directinc          thrpt    5  174.974 ±  0.103  ops/ms loopinc.indirectinc        thrpt    5  644.015 ±  0.533  ops/ms

there.

confirm directcompleteinc using cmov -prof perfasm. does.
drink up.

WIKI

Search This Blog

java - Why indirect increment is faster than direct inc? -

Comments

Post a Comment