-ffast-math, fesetround(), and nearbyint() -- Ekstatische Lyriken Pinnwand

I really don't want to do assembly anymore. Last time I was doing assembly I had created a set of NASM macros that allowed me to specify names for my function parameters and local stack-based variables. I then realized I had a C compiler, just with a different language syntax. It worked for me, since I found C's syntax repulsive, but I've since grown used to it, so it doesn't bother me anymore. Thus I'd rather stay away from assembly if I can. Especially since I lost the patches I had to make to NASM to make my macros work, and although I sent them to the NASM author, he rejected them, and spoke of his own ideas for an even better set of macros, which last I checked (which was years afterwards) hadn't materialized in the current version.

So if I can find a way to make GCC generate the code I want, then I'm fine with that. Writing C code is a lot faster and easier than writing assembly code.

Anyway, I decided to look into this some more, and I thought that, with all of GCC's compiler options, maybe there's something that deals with this. So I searched a little and found -ffast-math.

Without that option, code like "int x = floor(y)" turns into shit like this:

  08049524  83EC24            sub esp,byte +0x24
  08049527  DD4508            fld qword [ebp+0x8]
  0804952a  DD1C24            fstp qword [esp]
  0804952d  E87EF8FFFF        call dword 0x8048db0 floor@plt
  08049532  D97DF6            fnstcw [ebp-0xa]
  08049535  0FB745F6          movzx eax,word [ebp-0xa]
  08049539  B40C              mov ah,0xc
  0804953b  668945F4          mov [ebp-0xc],ax
  0804953f  D96DF4            fldcw [ebp-0xc]
  08049542  DB5DF0            fistp dword [ebp-0x10]
  08049545  D96DF6            fldcw [ebp-0xa]
  08049555  83C424            add esp,byte +0x24

The instructions in red are the ones necessary to call floor(). They set up room on the stack for the parameter, put the parameter on the stack, call the function, then remove the space from the stack at the end. The instructions in blue implements GCC's standard "round towards zero" when storing floating point values to integers. (which is why it's necessary to call floor() explicitly rather than simply storing the value to an integer) These instructions store the current floating point control word, modify its flags to round towards zero, then at the end, restore the previous flag settings. In all, only the two instructions in black are actually required to perform the operation, assuming you already have your floating point control word set up for the style of rounding you prefer.

...and, if that wasn't enough of a disaster, here's the dump of the floor() function:

  00008e40  DD442404          fld qword [esp+0x4]
  00008e44  83EC08            sub esp,byte +0x8
  00008e47  9BD97C2404        fstcw [esp+0x4]
  00008e4c  BA00040000        mov edx,0x400
  00008e51  0B542404          or edx,[esp+0x4]
  00008e55  81E2FFF70000      and edx,0xf7ff
  00008e5b  891424            mov [esp],edx
  00008e5e  D92C24            fldcw [esp]
  00008e61  D9FC              frndint
  00008e63  D96C2404          fldcw [esp+0x4]
  00008e67  83C408            add esp,byte +0x8
  00008e6a  C3                ret

Here, the instruction at the top retrieves the argument from the stack. Then the instructions in red make room on the stack to store the original and modified versions of the control word, then set the control word to round to nearest, and at the end, restore the previous value of the control word. Obviously the "frndint" instruction is what performs the rounding.

So in all, what could have simply been two instructions (load from memory, store to memory) was turned into 24 instructions.

With -ffast-math, this improves somewhat, and becomes this:

  08049466  D97DFE            fnstcw [ebp-0x2]
  08049469  DD4508            fld qword [ebp+0x8]
  0804946c  0FB745FE          movzx eax,word [ebp-0x2]
  08049470  B404              mov ah,0x4
  08049472  668945FC          mov [ebp-0x4],ax
  08049476  D96DFC            fldcw [ebp-0x4]
  08049479  DB5DF8            fistp dword [ebp-0x8]
  0804947c  D96DFE            fldcw [ebp-0x2]

Here, the instructions in red store the floating point control word, then set it to round towards negative. The instructions in black load the value into the FPU, then perform the floor() by storing it back to memory as an integer. Then the instruction in blue restores the previous value of the control word.

While 8 instructions is still too many, it's certainly better than 24.

Curious what exactly the default settings for rounding were, I tried using round() instead of floor(), but found that it again reverted to calling a function rather than doing the round() in-line.

So instead I looked again at the dump of libm, where I found the round() function, which contains the following code:

  0000ab80  55                push ebp
  0000ab81  89E5              mov ebp,esp
  0000ab83  83EC18            sub esp,byte +0x18
  0000ab86  DD4508            fld qword [ebp+0x8]
  0000ab89  DD55E8            fst qword [ebp-0x18]
  0000ab8c  8B55EC            mov edx,[ebp-0x14]
  0000ab8f  895DF4            mov [ebp-0xc],ebx
  0000ab92  E8D089FFFF        call dword 0x3567 __cxa_finalize@plt+0xd3
  0000ab97  81C35DA40100      add ebx,0x1a45d
  0000ab9d  8975F8            mov [ebp-0x8],esi
  0000aba0  8B75E8            mov esi,[ebp-0x18]
  0000aba3  89D1              mov ecx,edx
  0000aba5  C1F914            sar ecx,0x14
  0000aba8  81E1FF070000      and ecx,0x7ff
  0000abae  8D8101FCFFFF      lea eax,[ecx-0x3ff]
  0000abb4  83F813            cmp eax,byte +0x13
  0000abb7  897DFC            mov [ebp-0x4],edi
  0000abba  7F54              jg 0xac10 round+0x90
  0000abbc  85C0              test eax,eax
  0000abbe  0F88AC000000      js dword 0xac70 round+0xf0
  0000abc4  BFFFFF0F00        mov edi,0xfffff
  0000abc9  89C1              mov ecx,eax
  0000abcb  D3FF              sar edi,cl
  0000abcd  85D7              test edi,edx
  0000abcf  0F84CB000000      jz dword 0xaca0 round+0x120
  0000abd5  DC834490FFFF      fadd qword [ebx-0x6fbc]
  0000abdb  D9EE              fldz
  0000abdd  D9C9              fxch st1
  0000abdf  DFE9              fucomip st1
  0000abe1  DDD8              fstp st0
  0000abe3  7612              jna 0xabf7 round+0x77
  0000abe5  BE00000800        mov esi,0x80000
  0000abea  89C1              mov ecx,eax
  0000abec  D3FE              sar esi,cl
  0000abee  F7D7              not edi
  0000abf0  8D1416            lea edx,[esi+edx]
  0000abf3  31F6              xor esi,esi
  0000abf5  21FA              and edx,edi
  0000abf7  8955EC            mov [ebp-0x14],edx
  0000abfa  8975E8            mov [ebp-0x18],esi
  0000abfd  DD45E8            fld qword [ebp-0x18]
  0000ac00  8B5DF4            mov ebx,[ebp-0xc]
  0000ac03  8B75F8            mov esi,[ebp-0x8]
  0000ac06  8B7DFC            mov edi,[ebp-0x4]
  0000ac09  89EC              mov esp,ebp
  0000ac0b  5D                pop ebp
  0000ac0c  C3                ret
  0000ac0d  8D7600            lea esi,[esi+0x0]
  0000ac10  83F833            cmp eax,byte +0x33
  0000ac13  7E13              jng 0xac28 round+0xa8
  0000ac15  3D00040000        cmp eax,0x400
  0000ac1a  75E4              jnz 0xac00 round+0x80
  0000ac1c  D8C0              fadd st0
  0000ac1e  6690              xchg ax,ax
  0000ac20  EBDE              jmp short 0xac00 round+0x80
  0000ac22  8DB600000000      lea esi,[esi+0x0]
  0000ac28  81E913040000      sub ecx,0x413
  0000ac2e  BFFFFFFFFF        mov edi,0xffffffff
  0000ac33  D3EF              shr edi,cl
  0000ac35  85F7              test edi,esi
  0000ac37  74C7              jz 0xac00 round+0x80
  0000ac39  DC834490FFFF      fadd qword [ebx-0x6fbc]
  0000ac3f  D9EE              fldz
  0000ac41  D9C9              fxch st1
  0000ac43  DFE9              fucomip st1
  0000ac45  DDD8              fstp st0
  0000ac47  7615              jna 0xac5e round+0xde
  0000ac49  B933000000        mov ecx,0x33
  0000ac4e  29C1              sub ecx,eax
  0000ac50  B801000000        mov eax,0x1
  0000ac55  D3E0              shl eax,cl
  0000ac57  01C6              add esi,eax
  0000ac59  7303              jnc 0xac5e round+0xde
  0000ac5b  83C201            add edx,byte +0x1
  0000ac5e  F7D7              not edi
  0000ac60  21FE              and esi,edi
  0000ac62  8955EC            mov [ebp-0x14],edx
  0000ac65  8975E8            mov [ebp-0x18],esi
  0000ac68  DD45E8            fld qword [ebp-0x18]
  0000ac6b  EB93              jmp short 0xac00 round+0x80
  0000ac6d  8D7600            lea esi,[esi+0x0]
  0000ac70  DC834490FFFF      fadd qword [ebx-0x6fbc]
  0000ac76  D9EE              fldz
  0000ac78  D9C9              fxch st1
  0000ac7a  DFE9              fucomip st1
  0000ac7c  DDD8              fstp st0
  0000ac7e  0F8673FFFFFF      jna dword 0xabf7 round+0x77
  0000ac84  81E200000080      and edx,0x80000000
  0000ac8a  31F6              xor esi,esi
  0000ac8c  83F8FF            cmp eax,byte -0x1
  0000ac8f  0F8562FFFFFF      jnz dword 0xabf7 round+0x77
  0000ac95  81CA0000F03F      or edx,0x3ff00000
  0000ac9b  E957FFFFFF        jmp dword 0xabf7 round+0x77
  0000aca0  85F6              test esi,esi
  0000aca2  0F852DFFFFFF      jnz dword 0xabd5 round+0x55
  0000aca8  E953FFFFFF        jmp dword 0xac00 round+0x80

Fuck! Is that what it's doing every time I call round()?

Assuming they were trying to get a rounding style that the FPU doesn't support, I looked to the man page, where I found this:

These functions round x to the nearest integer, but round halfway cases away from zero (regardless of the current rounding direction, see fenv(3)), instead of to the nearest even integer like rint(3).

...and sure enough, looking at the dump of rint(), I finally see some reasonable code:

  00009230  DD442404          fld qword [esp+0x4]
  00009234  D9FC              frndint
  00009236  C3                ret

So then I change my code to use rint(), and finally get the output I was looking for:

  08049467  DD4508            fld qword [ebp+0x8]
  0804946a  DB5DF4            fistp dword [ebp-0xc]

So I tried my test function again from the other day:

void whatever(double *x, double *y, double a) {
  double t;
  t = *x * cos(a) - *y * sin(a);
  *y = *x * sin(a) + *y * cos(a);
  *x = t;
};

With -ffast-math, the output is this:

  08049480  55                push ebp
  08049481  89E5              mov ebp,esp
  08049483  DD4510            fld qword [ebp+0x10]
  08049486  D9FB              fsincos
  08049488  8B4508            mov eax,[ebp+0x8]
  0804948b  8B550C            mov edx,[ebp+0xc]
  0804948e  DD00              fld qword [eax]
  08049490  DD02              fld qword [edx]
  08049492  D9C2              fld st2
  08049494  D8C9              fmul st1
  08049496  D9C4              fld st4
  08049498  D8CB              fmul st3
  0804949a  DEC1              faddp st1
  0804949c  DD1A              fstp qword [edx]
  0804949e  D9CA              fxch st2
  080494a0  DEC9              fmulp st1
  080494a2  D9C9              fxch st1
  080494a4  DECA              fmulp st2
  080494a6  DEE1              fsubrp st1
  080494a8  DD18              fstp qword [eax]
  080494aa  5D                pop ebp
  080494ab  C3                ret

The three instructions in red set up the stack, then load the first parameter, and wouldn't be there if this weren't a function, so they don't matter. (If this code were in-line with other code, the values would likely already be in the FPU stack.) The instruction in orange is the fsincos that I wanted it to in-line. The instructions in green retrieve the pointers from the stack, again they don't matter since they wouldn't be there if this wasn't a function. The instructions in blue load the values from the pointer, and again, wouldn't be there if this wasn't a function. (In other words, this is a great example of why you want to in-line simple functions, and GCC will do so automatically as long as they are contained within the same source file, as it needs to see the code for them.) Finally, the instructions in black solve the equations, storing the results back to the original pointers. The instructions in purple restore the stack and return.

There's really nothing to complain about with the above code. It's as good as anything I might write myself.

At this point, my complaints rest entirely with GCC's need to set the control word all the time. To solve this problem, I looked for more compiler options, but found none. So then I looked for some functions and found fesetround(), but it appeared that GCC ignores the current rounding mode when compiling code, which makes sense since it usually won't know what it is since it it may have been changed in another function. Thus setting the rounding mode is irrelevant without a function that rounds according to the current rounding mode. So I went looking for one, and found nearbyint() which does just that. By using nearbyint(), the code output by GCC when converting a double to an int (like with "int x = nearbyint(y)") is simply the instruction to store the value to an integer, which will use whatever rounding mode was set with fesetround().

So in summary, if you want decent math output from GCC, it seems that the following rules apply:

1. Compile with the -ffast-math option. Ignore the documentation's mention of "unsafe" optimizations. All it means is that it will perform math the way the FPU performs math, rather than exactly the way some random standard you don't care about says that the math must be performed. It's all a bunch of shit about the exact result you get when you perform calculations involving infinity or not-a-number, or in other words, nothing you care about.

2. Avoid using floor(), ceil(), and especially round(). Instead use fesetround() to set your preferred rounding mode, then use nearbyint() for rounding. Also use nearbyint() when storing floating-point results to integer variables rather than using the implicit conversion, as the implicit conversion is slower and likely isn't how you want to round anyway.

3. Don't forget -O3. Always use -O3.

Applying those rules to the first code example in this post (it was already compiled with -O3, so I only had to apply the first two rules) reduces it's execution time from 16.0 ms to 6.4 ms, which beats the 6.6 ms time of my assembly code.

So, awesome! I don't have to write any assembly code now!

Now if only there were an option to make the modulus functions work correctly.

Here's a graph of how modulus is supposed to work: Wolfram|Alpha graph of x mod 1 for -3 < x < +3

In C, the left half of the graph is shifted downward so that it is below the x axis, and so that there is a continuous line from (-1, -1) to (+1, +1). In other words, the waveform repeats with a period equal to the divisor, except around zero where the period size is double. ...and that's rather undesired behavior when I need a modulus value, since I'm usually trying to divide a continuous range of values into discrete values, each of which represents an equal-size piece of the input range.

The integer modulus operator (the % symbol) has the same issue, but since I haven't yet needed to use it with potentially negative values, I haven't written a clever macro to perform it correctly.

The CPU and FPU lack instructions that produce these results. Each does have a way of calculating remainders, but that isn't exactly the same thing, so it's no surprise that the results for negative inputs aren't correct. So using a macro to get the correct result is as good as a dedicated function would be.

Presently I use this macro to replace fmod() with some simple math that calculates the correct result:

#define fmod(x, y) ((x) - (y) * floor((double) (x) / (y)))

I suppose I need to change that to nearbyint() now.

I have no such macro for integer math since I haven't yet written any code where it would matter, as I've always used % with values that are never negative. ...and I can't think of any way to do it that isn't just nasty. I tried looking for some assembly code, but couldn't find anything that wasn't just a forum post by someone who didn't know that the divide instruction exists. So the best I have to offer is this:

#define imod(x, y) (0 <= x ? x % y : (x + 1) % y + y - 1)

It isn't too bad I guess...

  0804ab5c  85D2              test edx,edx
  0804ab5e  7820              js 0x804ab80
  0804ab60  89D0              mov eax,edx
  0804ab62  C1FA1F            sar edx,0x1f
  0804ab65  F7F9              idiv ecx
                              [some code removed]
  0804ab80  83C201            add edx,byte +0x1
  0804ab83  89D0              mov eax,edx
  0804ab85  C1FA1F            sar edx,0x1f
  0804ab88  F7F9              idiv ecx
  0804ab8a  8D5411FF          lea edx,[ecx+edx-0x1]
                              [jump to removed code]

Normally the "[some code removed]" would be a jump to the "[jump to removed code]" but GCC decided to do it the other way for some reason. So it appears that the % just gives us the remainder of IDIV, using three instructions, and that the case for negative numerators requires five instructions. Plus it adds 2 to 3 instructions of overhead depending on which path it follows. However, IDIV isn't exactly a fast instruction anyway.

The real problem here is that, with constant divisors, GCC will do some neat tricks to perform the calculation faster than IDIV. However, since it has defined the behavior for negative inputs the way it has, you don't get automatic clever tricks that work correctly.

For example, if you want to do "x % 16" you can instead do "x & 15" and you get the correct result even for negative numbers. However, for "x % 16" GCC generates six instructions, presumably to preserve the incorrect behavior for negative numerators. So you get better code than IDIV, but worse code than a simple AND instruction. There may just as well be similar tricks for correct results with constant denominators like 10, but GCC won't generate those, instead it will generate other code that produces the wrong results.

...but at this point I'm almost complaining that GCC won't optimize my code for me. So it doesn't matter. I just wanted it to stop de-optimizing it, and the three steps listed above seem to do that just fine. If I ever need the integer modulus fixed, I'm sure I can use an inline assembly function provided I can find a way to calculate it that doesn't suck.
/div

-ffast-math, fesetround(), and nearbyint()

Replies

Your Reply