x86 - How abundant is hardware support for the FMA instruction set?


Steam's hardware survey is helpful because it gives an overview of hardware support for the SSE instruction sets. However, I can't find any resources on how abundant FMA support is. Is there data on this somewhere? Or is there another instruction set that FMA is more or less tied to, so that if you have one you have the other, which I could base an estimate on?

FMA3 was introduced by AMD in Piledriver (May 2012): Vishera FX CPUs, and Trinity and Richland APUs. Piledriver has a serious performance bug with 256b (AVX ymm) store throughput (vmovaps/vmovups: 1 per 17/20 cycles). (See Agner Fog's microarch doc and other sources.) Either disable 256b AVX routines on Piledriver, or write a Piledriver-specific version that uses 128b xmm FMA. (Or use FMA4, so it can run on Bulldozer, too.)

Its successor, Steamroller, is found in Kaveri APUs. (FX CPUs are still Piledriver.) Steamroller fixes the perf bug with 256b stores, but a 256b instruction still takes twice as many cycles as the 128b version, so you're not gaining anything (except a tiny reduction in loop overhead) from 256b AVX. I.e. you might write your code to run the 128b FMA4 version whenever FMA4 is available.
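A minimal sketch of that dispatch rule in C, to make it concrete. The enum and function names are hypothetical, and the two booleans are assumed to come from whatever feature detection you already have. It only encodes the heuristic described above (FMA4 present means an AMD Bulldozer-family CPU, so prefer 128b); it is not a tuned dispatcher.

```c
#include <stdbool.h>

/* Hypothetical kernel identifiers; the names are illustrative only. */
typedef enum {
    KERNEL_NO_FMA,    /* fallback path without fused multiply-add */
    KERNEL_FMA4_XMM,  /* 128b FMA4: Bulldozer, Piledriver, Steamroller */
    KERNEL_FMA3_YMM   /* 256b FMA3: e.g. Intel Haswell */
} kernel_kind;

/*
 * Pick a kernel from two feature flags supplied by the caller.
 * FMA4 is checked first: per the reasoning above, CPUs that report it
 * gain nothing (or lose badly) from 256b AVX, so the 128b FMA4 version
 * is preferred even when FMA3 is also available.
 */
kernel_kind choose_kernel(bool has_fma3, bool has_fma4)
{
    if (has_fma4)
        return KERNEL_FMA4_XMM;
    if (has_fma3)
        return KERNEL_FMA3_YMM;
    return KERNEL_NO_FMA;
}
```

For example, a Piledriver CPU (FMA3 and FMA4 both set) gets `KERNEL_FMA4_XMM`, while Haswell (FMA3 only) gets `KERNEL_FMA3_YMM`.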

FMA3 was introduced by Intel at the same time as AVX2, in Haswell (June 2013). Many people have not upgraded from SandyBridge/IvyBridge, because there's only a small performance difference, except in code that can use AVX2 / FMA to advantage. (I.e. not most stuff.)

FMA3 has a separate CPUID feature flag from AVX2. Another answer wrongly says it's part of AVX2, probably because Intel introduced both with Haswell.
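To illustrate that the flags really are separate, here is a hedged sketch in C. The bit positions are the documented ones (FMA3: leaf 1, ECX bit 12; AVX2: leaf 7 subleaf 0, EBX bit 5; FMA4: leaf 0x80000001, ECX bit 16). The function and macro names are my own, and the `cpu_has_*` wrappers assume GCC or Clang's `<cpuid.h>` on x86. It deliberately skips the OS-support check (OSXSAVE / XGETBV) that real code needs before executing VEX-encoded instructions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Documented CPUID bit positions; the macro names are illustrative. */
#define CPUID_1_ECX_FMA3     (1u << 12)  /* leaf 1, ECX bit 12 */
#define CPUID_7_EBX_AVX2     (1u << 5)   /* leaf 7 subleaf 0, EBX bit 5 */
#define CPUID_EXT1_ECX_FMA4  (1u << 16)  /* leaf 0x80000001, ECX bit 16 */

/* Pure decoders: take a raw register value, return the feature bit. */
bool reg_has_fma3(uint32_t leaf1_ecx)    { return (leaf1_ecx & CPUID_1_ECX_FMA3) != 0; }
bool reg_has_avx2(uint32_t leaf7_ebx)    { return (leaf7_ebx & CPUID_7_EBX_AVX2) != 0; }
bool reg_has_fma4(uint32_t ext_leaf_ecx) { return (ext_leaf_ecx & CPUID_EXT1_ECX_FMA4) != 0; }

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>

/* Each wrapper queries a different leaf/register: three independent flags.
 * NOTE: these only report what the CPU advertises; they do not verify
 * that the OS saves and restores YMM state. */
bool cpu_has_fma3(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return reg_has_fma3(ecx);
}

bool cpu_has_avx2(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return reg_has_avx2(ebx);
}

bool cpu_has_fma4(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return false;
    return reg_has_fma4(ecx);
}
#endif
```

Since the three bits live in different registers and leaves, code that wants FMA should test the FMA bit itself rather than infer it from AVX2.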

So in summary, a lot of AMD users have FMA support, although if it's Bulldozer it's FMA4-only. On Intel, even Nehalem CPUs are fast enough for many people, so there hasn't been much reason to upgrade. I don't have numbers, though.

