Ekstatische Lyriken Pinnwand

-ffast-math, fesetround(), and nearbyint()

written by Pj on Friday April 6th, 2012 -- 9:41 a.m.
in reply to FORTRAN is the answer

edit this message - return to message index
(only moderators may edit messages)
I really don't want to do assembly anymore.  Last time I was doing assembly I had created a set of NASM macros that allowed me to specify names for my function parameters and local stack-based variables.  I then realized I had a C compiler, just with a different language syntax.  It worked for me, since I found C's syntax repulsive, but I've since grown used to it, so it doesn't bother me anymore.  Thus I'd rather stay away from assembly if I can.  Especially since I lost the patches I had to make to NASM to make my macros work, and although I sent them to the NASM author, he rejected them, and spoke of his own ideas for an even better set of macros, which last I checked (which was years afterwards) hadn't materialized in the current version.

So if I can find a way to make GCC generate the code I want, then I'm fine with that.  Writing C code is a lot faster and easier than writing assembly code.

Anyway, I decided to look into this some more, and I thought that, with all of GCC's compiler options, maybe there's something that deals with this.  So I searched a little and found -ffast-math

Without that option, code like "int x = floor(y)" turns into shit like this:

  08049524  83EC24            sub esp,byte +0x24
08049527 DD4508 fld qword [ebp+0x8]
0804952a DD1C24 fstp qword [esp]
0804952d E87EF8FFFF call dword 0x8048db0 floor@plt

08049532 D97DF6 fnstcw [ebp-0xa]
08049535 0FB745F6 movzx eax,word [ebp-0xa]
08049539 B40C mov ah,0xc
0804953b 668945F4 mov [ebp-0xc],ax
0804953f D96DF4 fldcw [ebp-0xc]

08049542 DB5DF0 fistp dword [ebp-0x10]
08049545 D96DF6 fldcw [ebp-0xa]
08049555 83C424 add esp,byte +0x24

The instructions in red are the ones necessary to call floor().  They set up room on the stack for the parameter, put the parameter on the stack, call the function, then remove the space from the stack at the end.  The instructions in blue implements GCC's standard "round towards zero" when storing floating point values to integers.  (which is why it's necessary to call floor() explicitly rather than simply storing the value to an integer) These instructions store the current floating point control word, modify its flags to round towards zero, then at the end, restore the previous flag settings.  In all, only the two instructions in black are actually required to perform the operation, assuming you already have your floating point control word set up for the style of rounding you prefer. 

...and, if that wasn't enough of a disaster, here's the dump of the floor() function:

  00008e40  DD442404          fld qword [esp+0x4]
00008e44 83EC08 sub esp,byte +0x8
00008e47 9BD97C2404 fstcw [esp+0x4]
00008e4c BA00040000 mov edx,0x400
00008e51 0B542404 or edx,[esp+0x4]
00008e55 81E2FFF70000 and edx,0xf7ff
00008e5b 891424 mov [esp],edx
00008e5e D92C24 fldcw [esp]

00008e61 D9FC frndint
00008e63 D96C2404 fldcw [esp+0x4]
00008e67 83C408 add esp,byte +0x8

00008e6a C3 ret

Here, the instruction at the top retrieves the argument from the stack.  Then the instructions in red make room on the stack to store the original and modified versions of the control word, then set the control word to round to nearest, and at the end, restore the previous value of the control word.  Obviously the "frndint" instruction is what performs the rounding. 

So in all, what could have simply been two instructions (load from memory, store to memory) was turned into 24 instructions.

With -ffast-math, this improves somewhat, and becomes this:

  08049466  D97DFE            fnstcw [ebp-0x2]
08049469 DD4508 fld qword [ebp+0x8]
0804946c 0FB745FE movzx eax,word [ebp-0x2]
08049470 B404 mov ah,0x4
08049472 668945FC mov [ebp-0x4],ax
08049476 D96DFC fldcw [ebp-0x4]

08049479 DB5DF8 fistp dword [ebp-0x8]
0804947c D96DFE fldcw [ebp-0x2]

Here, the instructions in red store the floating point control word, then set it to round towards negative.  The instructions in black load the value into the FPU, then perform the floor() by storing it back to memory as an integer.  Then the instruction in blue restores the previous value of the control word.

While 8 instructions is still too many, it's certainly better than 24.

Curious what exactly the default settings for rounding were, I tried using round() instead of floor(), but found that it again reverted to calling a function rather than doing the round() in-line.

So instead I looked again at the dump of libm, where I found the round() function, which contains the following code:

  0000ab80  55                push ebp
0000ab81 89E5 mov ebp,esp
0000ab83 83EC18 sub esp,byte +0x18
0000ab86 DD4508 fld qword [ebp+0x8]
0000ab89 DD55E8 fst qword [ebp-0x18]
0000ab8c 8B55EC mov edx,[ebp-0x14]
0000ab8f 895DF4 mov [ebp-0xc],ebx
0000ab92 E8D089FFFF call dword 0x3567 __cxa_finalize@plt+0xd3
0000ab97 81C35DA40100 add ebx,0x1a45d
0000ab9d 8975F8 mov [ebp-0x8],esi
0000aba0 8B75E8 mov esi,[ebp-0x18]
0000aba3 89D1 mov ecx,edx
0000aba5 C1F914 sar ecx,0x14
0000aba8 81E1FF070000 and ecx,0x7ff
0000abae 8D8101FCFFFF lea eax,[ecx-0x3ff]
0000abb4 83F813 cmp eax,byte +0x13
0000abb7 897DFC mov [ebp-0x4],edi
0000abba 7F54 jg 0xac10 round+0x90
0000abbc 85C0 test eax,eax
0000abbe 0F88AC000000 js dword 0xac70 round+0xf0
0000abc4 BFFFFF0F00 mov edi,0xfffff
0000abc9 89C1 mov ecx,eax
0000abcb D3FF sar edi,cl
0000abcd 85D7 test edi,edx
0000abcf 0F84CB000000 jz dword 0xaca0 round+0x120
0000abd5 DC834490FFFF fadd qword [ebx-0x6fbc]
0000abdb D9EE fldz
0000abdd D9C9 fxch st1
0000abdf DFE9 fucomip st1
0000abe1 DDD8 fstp st0
0000abe3 7612 jna 0xabf7 round+0x77
0000abe5 BE00000800 mov esi,0x80000
0000abea 89C1 mov ecx,eax
0000abec D3FE sar esi,cl
0000abee F7D7 not edi
0000abf0 8D1416 lea edx,[esi+edx]
0000abf3 31F6 xor esi,esi
0000abf5 21FA and edx,edi
0000abf7 8955EC mov [ebp-0x14],edx
0000abfa 8975E8 mov [ebp-0x18],esi
0000abfd DD45E8 fld qword [ebp-0x18]
0000ac00 8B5DF4 mov ebx,[ebp-0xc]
0000ac03 8B75F8 mov esi,[ebp-0x8]
0000ac06 8B7DFC mov edi,[ebp-0x4]
0000ac09 89EC mov esp,ebp
0000ac0b 5D pop ebp
0000ac0c C3 ret
0000ac0d 8D7600 lea esi,[esi+0x0]
0000ac10 83F833 cmp eax,byte +0x33
0000ac13 7E13 jng 0xac28 round+0xa8
0000ac15 3D00040000 cmp eax,0x400
0000ac1a 75E4 jnz 0xac00 round+0x80
0000ac1c D8C0 fadd st0
0000ac1e 6690 xchg ax,ax
0000ac20 EBDE jmp short 0xac00 round+0x80
0000ac22 8DB600000000 lea esi,[esi+0x0]
0000ac28 81E913040000 sub ecx,0x413
0000ac2e BFFFFFFFFF mov edi,0xffffffff
0000ac33 D3EF shr edi,cl
0000ac35 85F7 test edi,esi
0000ac37 74C7 jz 0xac00 round+0x80
0000ac39 DC834490FFFF fadd qword [ebx-0x6fbc]
0000ac3f D9EE fldz
0000ac41 D9C9 fxch st1
0000ac43 DFE9 fucomip st1
0000ac45 DDD8 fstp st0
0000ac47 7615 jna 0xac5e round+0xde
0000ac49 B933000000 mov ecx,0x33
0000ac4e 29C1 sub ecx,eax
0000ac50 B801000000 mov eax,0x1
0000ac55 D3E0 shl eax,cl
0000ac57 01C6 add esi,eax
0000ac59 7303 jnc 0xac5e round+0xde
0000ac5b 83C201 add edx,byte +0x1
0000ac5e F7D7 not edi
0000ac60 21FE and esi,edi
0000ac62 8955EC mov [ebp-0x14],edx
0000ac65 8975E8 mov [ebp-0x18],esi
0000ac68 DD45E8 fld qword [ebp-0x18]
0000ac6b EB93 jmp short 0xac00 round+0x80
0000ac6d 8D7600 lea esi,[esi+0x0]
0000ac70 DC834490FFFF fadd qword [ebx-0x6fbc]
0000ac76 D9EE fldz
0000ac78 D9C9 fxch st1
0000ac7a DFE9 fucomip st1
0000ac7c DDD8 fstp st0
0000ac7e 0F8673FFFFFF jna dword 0xabf7 round+0x77
0000ac84 81E200000080 and edx,0x80000000
0000ac8a 31F6 xor esi,esi
0000ac8c 83F8FF cmp eax,byte -0x1
0000ac8f 0F8562FFFFFF jnz dword 0xabf7 round+0x77
0000ac95 81CA0000F03F or edx,0x3ff00000
0000ac9b E957FFFFFF jmp dword 0xabf7 round+0x77
0000aca0 85F6 test esi,esi
0000aca2 0F852DFFFFFF jnz dword 0xabd5 round+0x55
0000aca8 E953FFFFFF jmp dword 0xac00 round+0x80

Fuck! Is that what it's doing every time I call round()?

Assuming they were trying to get a rounding style that the FPU doesn't support, I looked to the man page, where I found this:

These functions round x to the nearest integer, but round halfway cases away from zero (regardless of the current rounding direction, see fenv(3)), instead of to the nearest even integer like rint(3).

...and sure enough, looking at the dump of rint(), I finally see some reasonable code:

  00009230  DD442404          fld qword [esp+0x4]
00009234 D9FC frndint
00009236 C3 ret

So then I change my code to use rint(), and finally get the output I was looking for:

  08049467  DD4508            fld qword [ebp+0x8]
0804946a DB5DF4 fistp dword [ebp-0xc]

So I tried my test function again from the other day:

void whatever(double *x, double *y, double a) {
double t;
t = *x * cos(a) - *y * sin(a);
*y = *x * sin(a) + *y * cos(a);
*x = t;

With -ffast-math, the output is this:

  08049480  55                push ebp
08049481 89E5 mov ebp,esp
08049483 DD4510 fld qword [ebp+0x10]

08049486 D9FB fsincos
08049488 8B4508 mov eax,[ebp+0x8]
0804948b 8B550C mov edx,[ebp+0xc]

0804948e DD00 fld qword [eax]
08049490 DD02 fld qword [edx]

08049492 D9C2 fld st2
08049494 D8C9 fmul st1
08049496 D9C4 fld st4
08049498 D8CB fmul st3
0804949a DEC1 faddp st1
0804949c DD1A fstp qword [edx]
0804949e D9CA fxch st2
080494a0 DEC9 fmulp st1
080494a2 D9C9 fxch st1
080494a4 DECA fmulp st2
080494a6 DEE1 fsubrp st1
080494a8 DD18 fstp qword [eax]
080494aa 5D pop ebp
080494ab C3 ret

The three instructions in red set up the stack, then load the first parameter, and wouldn't be there if this weren't a function, so they don't matter.  (If this code were in-line with other code, the values would likely already be in the FPU stack.) The instruction in orange is the fsincos that I wanted it to in-line.  The instructions in green retrieve the pointers from the stack, again they don't matter since they wouldn't be there if this wasn't a function.  The instructions in blue load the values from the pointer, and again, wouldn't be there if this wasn't a function.  (In other words, this is a great example of why you want to in-line simple functions, and GCC will do so automatically as long as they are contained within the same source file, as it needs to see the code for them.) Finally, the instructions in black solve the equations, storing the results back to the original pointers.  The instructions in purple restore the stack and return.

There's really nothing to complain about with the above code.  It's as good as anything I might write myself.

At this point, my complaints rest entirely with GCC's need to set the control word all the time.  To solve this problem, I looked for more compiler options, but found none.  So then I looked for some functions and found fesetround(), but it appeared that GCC ignores the current rounding mode when compiling code, which makes sense since it usually won't know what it is since it it may have been changed in another function.  Thus setting the rounding mode is irrelevant without a function that rounds according to the current rounding mode.  So I went looking for one, and found nearbyint() which does just that.  By using nearbyint(), the code output by GCC when converting a double to an int (like with "int x = nearbyint(y)") is simply the instruction to store the value to an integer, which will use whatever rounding mode was set with fesetround(). 

So in summary, if you want decent math output from GCC, it seems that the following rules apply:

1.  Compile with the -ffast-math option.  Ignore the documentation's mention of "unsafe" optimizations.  All it means is that it will perform math the way the FPU performs math, rather than exactly the way some random standard you don't care about says that the math must be performed.  It's all a bunch of shit about the exact result you get when you perform calculations involving infinity or not-a-number, or in other words, nothing you care about.

2.  Avoid using floor(), ceil(), and especially round().  Instead use fesetround() to set your preferred rounding mode, then use nearbyint() for rounding.  Also use nearbyint() when storing floating-point results to integer variables rather than using the implicit conversion, as the implicit conversion is slower and likely isn't how you want to round anyway. 

3.  Don't forget -O3.  Always use -O3.

Applying those rules to the first code example in this post (it was already compiled with -O3, so I only had to apply the first two rules) reduces it's execution time from 16.0 ms to 6.4 ms, which beats the 6.6 ms time of my assembly code. 

So, awesome! I don't have to write any assembly code now!

Now if only there were an option to make the modulus functions work correctly.

Here's a graph of how modulus is supposed to work: Wolfram|Alpha graph of x mod 1 for -3 < x < +3

In C, the left half of the graph is shifted downward so that it is below the x axis, and so that there is a continuous line from (-1, -1) to (+1, +1).  In other words, the waveform repeats with a period equal to the divisor, except around zero where the period size is double.  ...and that's rather undesired behavior when I need a modulus value, since I'm usually trying to divide a continuous range of values into discrete values, each of which represents an equal-size piece of the input range. 

The integer modulus operator (the % symbol) has the same issue, but since I haven't yet needed to use it with potentially negative values, I haven't written a clever macro to perform it correctly.

The CPU and FPU lack instructions that produce these results.  Each does have a way of calculating remainders, but that isn't exactly the same thing, so it's no surprise that the results for negative inputs aren't correct.  So using a macro to get the correct result is as good as a dedicated function would be.

Presently I use this macro to replace fmod() with some simple math that calculates the correct result:

#define fmod(x, y) ((x) - (y) * floor((double) (x) / (y)))

I suppose I need to change that to nearbyint() now.

I have no such macro for integer math since I haven't yet written any code where it would matter, as I've always used % with values that are never negative.  ...and I can't think of any way to do it that isn't just nasty.  I tried looking for some assembly code, but couldn't find anything that wasn't just a forum post by someone who didn't know that the divide instruction exists.  So the best I have to offer is this:

#define imod(x, y) (0 <= x ? x % y : (x + 1) % y + y - 1)

It isn't too bad I guess...

  0804ab5c  85D2              test edx,edx
0804ab5e 7820 js 0x804ab80
0804ab60 89D0 mov eax,edx
0804ab62 C1FA1F sar edx,0x1f
0804ab65 F7F9 idiv ecx

[some code removed]
0804ab80 83C201 add edx,byte +0x1
0804ab83 89D0 mov eax,edx
0804ab85 C1FA1F sar edx,0x1f
0804ab88 F7F9 idiv ecx
0804ab8a 8D5411FF lea edx,[ecx+edx-0x1]

[jump to removed code]

Normally the "[some code removed]" would be a jump to the "[jump to removed code]" but GCC decided to do it the other way for some reason.  So it appears that the % just gives us the remainder of IDIV, using three instructions, and that the case for negative numerators requires five instructions.  Plus it adds 2 to 3 instructions of overhead depending on which path it follows.  However, IDIV isn't exactly a fast instruction anyway. 

The real problem here is that, with constant divisors, GCC will do some neat tricks to perform the calculation faster than IDIV.  However, since it has defined the behavior for negative inputs the way it has, you don't get automatic clever tricks that work correctly.

For example, if you want to do "x % 16" you can instead do "x & 15" and you get the correct result even for negative numbers.  However, for "x % 16" GCC generates six instructions, presumably to preserve the incorrect behavior for negative numerators.  So you get better code than IDIV, but worse code than a simple AND instruction.  There may just as well be similar tricks for correct results with constant denominators like 10, but GCC won't generate those, instead it will generate other code that produces the wrong results.

...but at this point I'm almost complaining that GCC won't optimize my code for me.  So it doesn't matter.  I just wanted it to stop de-optimizing it, and the three steps listed above seem to do that just fine.  If I ever need the integer modulus fixed, I'm sure I can use an inline assembly function provided I can find a way to calculate it that doesn't suck.


return to previous message - return to message index

Your Reply

Name: No registration necessary. Simply choose
a name and password and type them in.
You may want to read the rules before you spend a lot of time writing something.
Plain Text - What you type is what you will see.
Some HTML - Use this if you are including HTML tags.
Pure HTML - Copies your post directly into the web page.
first, then