TL:DR: __int128
division helper functions internally end up doing an unsigned div reg64
(after some branching on the values being positive and the upper halves being 0
). On Intel CPUs before Ice Lake, 64-bit div
is faster than the signed idiv reg64
that GCC inlines for signed long long
. Faster by enough to make up for all the extra overhead of the helper function, and extended-precision for the other operations.
You probably wouldn't see this effect on AMD CPUs: long long
would be faster as expected because idiv r64
is similar enough in perf to div r64
there.
And unsigned long long
is faster than unsigned __int128
even on older Intel CPUs. For example on my i7-6700k (Skylake) at 3.9GHz (run under perf stat
to be sure of CPU frequency during the test):
- 2097 (i128) vs. 2332 (i64) - your original test (run back-to-back for CPU freq warm-up)
- 2075 (u128) vs. 1900 (u64) - unsigned versions. Slightly less branching in u128 division vs. i128, but major difference for i64 vs. u64 where the only difference is
div
vs. idiv
.
Also, drawing any general conclusions from a very specific micro-benchmark like this would be a bad idea. It is interesting to dig into why exactly the extended-precision __int128
type manages to be faster in this division benchmark with positive numbers small enough to fit in a 32-bit integer, though. (Fun fact: clang with some tuning options makes code to use 32-bit division if numbers fits in 32 bits, like if the high half of dividend|divisor
is all-zero for unsigned. This is worthwhile on older Intel, but shouldn't be on AMD, or Intel since Ice Lake.)
Your benchmark is heavily weighted towards division, which you do twice per iteration (/
and %
), even though it's much more expensive than other operations and in most code used much less often. (e.g. sum a whole array then divide once to get the average.)
Your benchmark is also has no instruction-level parallelism: each step has a data dependency on the previous step. This prevents auto-vectorization or anything that would show some of the advantages of narrower types, or the benefits of wider CPU pipelines like Alder Lake's 6 uops per cycle vs. Skylake's (or Core2's) 4 uops. Similar widths in recent AMD, like 6 uops / clock in Zen 3 and 4, and in Zen 1 and 2 a limit of 5 instruction or 6 uops, whichever is less.
(It's also not careful to avoid warm-up effects like the first timed region being slow until the CPU gets up to max turbo. Idiomatic way of performance evaluation?. But that happens much faster than the couple seconds of your timed regions, so that's not a problem here.)
128-bit integer division (especially signed) is too complicated for GCC to want to inline, so gcc emits a call to a helper function, __divti3
or __modti3
. (TI = tetra-integer, GCC's internal name for an integer that's 4x the size of int
.) These functions are documented in the GCC-internals manual.
You can see the compiler-generated asm on the Godbolt compiler-explorer. i.e. 128-bit addition with add/adc, multiplication with one mul
full-multiply of the low halves, and 2x non-widening imul
of the cross products. Yes these are slower than the single-instruction equivalents for int64_t
.
But Godbolt doesn't show you the asm for libgcc helper functions. It doesn't disassemble them even in "compile-to-binary" and disassemble mode (instead of the usual compiler asm text output) because it dynamically links libgcc_s instead of libgcc.a
.
Extended-precision signed division is done by negating if necessary and doing unsigned division of 64-bit chunks, then fixing up the sign of the result if necessary.
With both inputs small and positive, no actual negation is needed (just testing and branching). There are also fast-paths for small numbers (high-half divisor = 0, and quotient will fit in 64 bits), which is the case here. The end result is that the path of execution through __divti3
looks like this:
This is from manually single-stepping into the call to __divti3
with gdb, after compiling with g++ -g -O3 int128-bench.cpp -o int128-bench.O3
on my Arch GNU/Linux system, with gcc-libs 10.1.0-2.
# Inputs: dividend = RSI:RDI, divisor = RCX:RDX
# returns signed quotient RDX:RAX
| >0x7ffff7c4fd40 <__divti3> endbr64 # in case caller was using CFE (control-flow enforcement), apparently this instruction has to pollute all library functions now. I assume it's cheap at least in the no-CFE case.
│ 0x7ffff7c4fd44 <__divti3+4> push r12
│ 0x7ffff7c4fd46 <__divti3+6> mov r11,rdi
│ 0x7ffff7c4fd49 <__divti3+9> mov rax,rdx │ 0x7ffff7c4fd4c <__divti3+12> xor edi,edi
│ 0x7ffff7c4fd4e <__divti3+14> push rbx
│ 0x7ffff7c4fd4f <__divti3+15> mov rdx,rcx
│ 0x7ffff7c4fd52 <__divti3+18> test rsi,rsi # check sign bit of dividend (and jump over a negation)
│ 0x7ffff7c4fd55 <__divti3+21> jns 0x7ffff7c4fd6e <__divti3+46>
... taken branch to
| >0x7ffff7c4fd6e <__divti3+46> mov r10,rdx
│ 0x7ffff7c4fd71 <__divti3+49> test rdx,rdx # check sign bit of divisor (and jump over a negation), note there was a mov rdx,rcx earlier
│ 0x7ffff7c4fd74 <__divti3+52> jns 0x7ffff7c4fd86 <__divti3+70>
... taken branch to
│ >0x7ffff7c4fd86 <__divti3+70> mov r9,rax
│ 0x7ffff7c4fd89 <__divti3+73> mov r8,r11
│ 0x7ffff7c4fd8c <__divti3+76> test r10,r10 # check high half of abs(divisor) for being non-zero
│ 0x7ffff7c4fd8f <__divti3+79> jne 0x7ffff7c4fdb0 <__divti3+112> # falls through: small-number fast path
│ 0x7ffff7c4fd91 <__divti3+81> cmp rax,rsi # check that quotient will fit in 64 bits so 128b/64b single div won't fault: jump if (divisor <= high half of dividend)
│ 0x7ffff7c4fd94 <__divti3+84> jbe 0x7ffff7c4fe00 <__divti3+192> # falls through: small-number fast path
│ 0x7ffff7c4fd96 <__divti3+86> mov rdx,rsi
│ 0x7ffff7c4fd99 <__divti3+89> mov rax,r11
│ 0x7ffff7c4fd9c <__divti3+92> xor esi,esi
│ >0x7ffff7c4fd9e <__divti3+94> div r9 #### Do the actual division ###
│ 0x7ffff7c4fda1 <__divti3+97> mov rcx,rax
│ 0x7ffff7c4fda4 <__divti3+100> jmp 0x7ffff7c4fdb9 <__divti3+121>
...taken branch to
│ >0x7ffff7c4fdb9 <__divti3+121> mov rax,rcx
│ 0x7ffff7c4fdbc <__divti3+124> mov rdx,rsi
│ 0x7ffff7c4fdbf <__divti3+127> test rdi,rdi # check if the result should be negative
│ 0x7ffff7c4fdc2 <__divti3+130> je 0x7ffff7c4fdce <__divti3+142>
... taken branch over a neg rax / adc rax,0 / neg rdx
│ >0x7ffff7c4fdce <__divti3+142> pop rbx
│ 0x7ffff7c4fdcf <__divti3+143> pop r12
│ 0x7ffff7c4fdd1 <__divti3+145> ret
... return back to the loop body that called it
Intel CPUs (since IvyBridge) have zero-latency mov
(except for Ice Lake), so all that overhead doesn't worsen the critical path latency (which is your bottleneck) significantly. Or at least not enough to make up the difference between idiv
and div
.
The branching is handled by branch prediction and speculative execution, only checking the predictions after the fact when the actual input register values are the same. The branching goes the same way every time so it's trivial for branch prediction to learn. Since division is so slow, there's plenty of time for out-of-order exec to catch up.
64-bit operand-size integer division is very slow on Intel CPUs, even when the numbers are actually small and would fit in a 32-bit integer, and the extra microcode for signed integer division is even more expensive.
e.g. on my Skylake (i7-6700k), https://uops.info/ shows that (table search result )
idiv r64
is 56 uops for the front-end, with latency from 41 to 95 cycles (from divisor to quotient, which is the relevant case here I think).
div r64
is 33 uops for the front-end, with latency from 35 to 87 cycles. (for that same latency path).
The latency best-case happens for small quotients or small dividends or something, I can never remember which.
Similar to the branching that GCC does in software for 128-bit division in terms of 64-bit, I think the CPU microcode is internally doing 64-bit division in terms of narrower operations, probably the 32-bit that's only 10 uops for signed or unsigned, with much lower latency. (Ice Lake improves the divider so 64-bit division isn't much slower than 32-bit.)
This is why you found long long
so much slower than int
for this benchmark. In a lot of cases it's about the same, or half speed if memory bandwidth or SIMD are involved. (Only 2 elements per 128-bit of vector width, not 4).
AMD CPUs handle 64-bit operand size more efficiently, with performance only depending on the actual values, so about the same for div r32 vs. div r64 with the same numbers.
BTW, the actual values tend to be something like a=1814246614 / b=1814246613
= 1, then a=1 % b=1814246612
(with b
decreasing by 1 every iteration). Only ever testing division with quotient=1 seems very silly. (The first iteration might be different, but we get into this state for the 2nd and later.)
Performance of integer operations other than division is not data-dependent on modern CPUs. (Unless of course there are compile-time constants that allow different asm to be emitted. Like division by a constant is much cheaper when done with a multiplicative inverse calculated at compile time.)
re: double
: see Floating point division vs floating point multiplication for division vs. multiplication. FP division is often harder to avoid, and its performance is relevant in more cases, so it's handled better.
Related: