c++ - How to avoid SSE pipeline flush?


I've been encountering a subtle issue with SSE. Here is the case: I want to optimise my ray tracer with SSE so that I can get a basic feeling for how to improve performance with SSE.

I'd like to start with this function.

vector3f add( const vector3f& v0 , vector3f& v1 ); 

(Actually, I tried to optimise the cross product first; addition is shown here for simplicity, and I know it is not the bottleneck of my ray tracer.)

Here is part of the definition of the struct:

struct vector3f {
    union {
        struct { float x; float y; float z; float reserved; };
        __m128 data;
    };
    // ... rest of the definition omitted
};

The issue is that there will be an SSE register flush with this declaration: the compiler is not smart enough to hold the value in an SSE register for further uses. With the following declaration, however, it avoids the flushing.

__m128 add( __m128 v0_data, __m128 v1_data ); 
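For concreteness, here is a minimal sketch of what a function with that signature might look like. Only the `add` signature comes from the declaration above; `add3` and `lane0` are hypothetical helpers added to show a chained call and a scalar read-back.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_add_ps, _mm_cvtss_f32

// Takes and returns the raw SSE type by value, so operands and result
// can stay in XMM registers when the call is inlined.
inline __m128 add(__m128 v0_data, __m128 v1_data) {
    return _mm_add_ps(v0_data, v1_data);
}

// Hypothetical helper: chains two additions without any named storage
// in memory between them.
inline __m128 add3(__m128 a, __m128 b, __m128 c) {
    return add(add(a, b), c);
}

// Hypothetical helper: reads lane 0 (the x component) back as a scalar.
inline float lane0(__m128 v) {
    return _mm_cvtss_f32(v);
}
```

Whether the intermediate result really stays in a register still depends on the compiler and on inlining, so the generated assembly is worth checking.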

I can go this way in this case, but it is an ugly design for a matrix, which holds four __m128 data. And you can't have operators that work on vector3f, only on the raw data :(.

The disturbing thing is that you have to change the higher-level code everywhere to adapt to this change. This way of optimising through SSE is not an option for a large game engine: you'll have to change a huge amount of code before it works.

Without avoiding the SSE register flushing, its power is drained away by useless flush instructions, which renders SSE useless, I guess.

It seems that a union is a bad thing to use here. As long as the compiler sees the __m128 unified with something else, it has problems understanding when to update the values, which leads to excessive memory operations.

MSVC is not the worst-performing compiler in this situation. Check the code generated by GCC 5.1.0: on my machine it runs 12 times slower than the code generated by MSVC2013 (which is the one with register spilling), and 20+ times slower than the optimal code.

It is interesting that compilers start doing silly things only when you use the x, y, z members to access the data. For instance, MSVC2013 spills registers only when you read them via the scalar members after a computation (I guess to make sure these members are up to date). The terrible behavior of GCC seen above disappears if you set the initial values with _mm_setr_ps instead of writing them directly to the members.
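A minimal sketch of initialising through the SIMD member might look like the following. It keeps the union layout from the question; the two constructors and the body of `add` are assumptions made for illustration, not the asker's actual code.

```cpp
#include <xmmintrin.h>  // __m128, _mm_setr_ps, _mm_add_ps

// Layout as in the question. An anonymous struct inside a union is a
// compiler extension that MSVC and GCC both accept.
struct vector3f {
    union {
        struct { float x; float y; float z; float reserved; };
        __m128 data;
    };

    // Hypothetical constructor: fills the whole register in one go
    // instead of writing x, y and z one by one.
    vector3f(float x_, float y_, float z_) {
        data = _mm_setr_ps(x_, y_, z_, 0.0f);
    }

    // Hypothetical constructor: wraps an already computed register.
    explicit vector3f(__m128 v) {
        data = v;
    }
};

// Only the SIMD member is touched; x, y, z are never read or written here.
inline vector3f add(const vector3f& v0, const vector3f& v1) {
    return vector3f(_mm_add_ps(v0.data, v1.data));
}
```

Note that `_mm_setr_ps` takes its arguments in memory order (x first), whereas `_mm_set_ps` takes them reversed.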

It is better to avoid unions in this case. It seems that the OP has already come to the same decision (see the current vector3fv code). Making it harder to access a single coordinate also has a "psychological" performance effect: a person will think twice before writing scalar code. You can write setters/getters either with extract/insert intrinsics (which makes the compiler generate exactly these instructions), or with simple pointer arithmetic (which lets the compiler choose the way):

float getx() const { return ((float*)&data)[0]; } 
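As a rough sketch of the intrinsic-based alternative, the accessors below use only baseline SSE. Note the substitution: the dedicated `_mm_extract_ps`/`_mm_insert_ps` intrinsics need SSE4.1, so a shuffle plus `_mm_cvtss_f32` stands in for them here. The struct name is borrowed from the answer, but its body and the accessor names are hypothetical.

```cpp
#include <xmmintrin.h>  // __m128, _mm_shuffle_ps, _mm_cvtss_f32, _mm_move_ss

// Hypothetical union-free vector: the only member is the SSE register.
struct vector3fv {
    __m128 data;

    // Getters via intrinsics: broadcast the wanted lane, then read lane 0.
    float getx() const { return _mm_cvtss_f32(data); }
    float gety() const {
        return _mm_cvtss_f32(_mm_shuffle_ps(data, data, _MM_SHUFFLE(1, 1, 1, 1)));
    }
    float getz() const {
        return _mm_cvtss_f32(_mm_shuffle_ps(data, data, _MM_SHUFFLE(2, 2, 2, 2)));
    }

    // Setter via intrinsics: replace lane 0, keep the upper three lanes.
    void setx(float v) { data = _mm_move_ss(data, _mm_set_ss(v)); }

    // Setter through memory: the compiler chooses the instructions.
    void sety(float v) {
        float tmp[4];
        _mm_storeu_ps(tmp, data);
        tmp[1] = v;
        data = _mm_loadu_ps(tmp);
    }
};
```

Keeping the scalar accessors this explicit makes every scalar read or write visible at the call site, which is the "think twice" effect described above.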

When you remove the union and use only __m128, the generated code becomes better on these compilers. However, MSVC2013 still has unnecessary moves: one useless register move per arithmetic operation. I suppose this is an inefficiency in the compiler's inlining algorithm. You can remove these moves in MSVC2013 by declaring the functions as __vectorcall. Note that using this new calling convention also allows you to avoid register spilling in case the SIMD functions have not been inlined at all.
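One possible way to apply this while keeping the code portable is sketched below. `__vectorcall` is an MSVC keyword (available since MSVC2013), so it is hidden behind `VEC_CALL`, a hypothetical macro that expands to nothing on other compilers. The struct body and the `operator+` overload are likewise assumptions made for illustration.

```cpp
#include <xmmintrin.h>  // __m128, _mm_add_ps

// Hypothetical macro: use __vectorcall where the compiler provides it.
#if defined(_MSC_VER)
#  define VEC_CALL __vectorcall
#else
#  define VEC_CALL
#endif

// Hypothetical union-free vector holding only the SSE register.
struct vector3fv {
    __m128 data;
};

// Arguments are taken by value so that, under __vectorcall, they can be
// passed in XMM registers even when the call is not inlined.
inline vector3fv VEC_CALL add(vector3fv v0, vector3fv v1) {
    vector3fv r;
    r.data = _mm_add_ps(v0.data, v1.data);
    return r;
}

// Operator syntax on the wrapper type, so higher-level code can keep
// writing a + b instead of calling intrinsics directly.
inline vector3fv VEC_CALL operator+(vector3fv a, vector3fv b) {
    return add(a, b);
}
```

With a wrapper like this, the calling code can keep working on a vector type rather than on raw __m128 values, which addresses the design concern raised in the question.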

