Today I noticed GCC using the fsqrt instruction in a function, rather than calling a sqrt() function which would certainly just use the instruction.  It was so much more optimized than I recall seeing in the past that I felt inclined to write a simple function to examine how well it generates code now.
So I wrote this function:
#include <math.h>

void whatever(double *x, double *y, double a) {
  double t;
  t = *x * cos(a) - *y * sin(a);
  *y = *x * sin(a) + *y * cos(a);
  *x = t;
}
In case you're interested, this is the equation to rotate an (x, y) coordinate around the origin by the angle a.  The temporary variable is necessary because both the new x and new y values need to be calculated from both of the original x and y values, and so the new x has to be held aside until the new y has been calculated from the old x.
Here's the code that GCC generated for this function, in NASM syntax because GAS is fucking unreadable:
0804a460 <whatever>:
  0804a460  55                push ebp
  0804a461  89E5              mov ebp,esp
  0804a463  56                push esi
  0804a464  53                push ebx
  0804a465  83EC20            sub esp,byte +0x20
  0804a468  8B5D08            mov ebx,[ebp+0x8]
  0804a46b  8D45F0            lea eax,[ebp-0x10]
  0804a46e  DD4510            fld qword [ebp+0x10]
  0804a471  8D55E8            lea edx,[ebp-0x18]
  0804a474  8B750C            mov esi,[ebp+0xc]
  0804a477  DD1C24            fstp qword [esp]
  0804a47a  8954240C          mov [esp+0xc],edx
  0804a47e  89442408          mov [esp+0x8],eax
  0804a482  E861EBFFFF        call dword 0x8048fe8 <sincos@plt>
  0804a487  DD45E8            fld qword [ebp-0x18]
  0804a48a  DD45F0            fld qword [ebp-0x10]
  0804a48d  DD03              fld qword [ebx]
  0804a48f  DD06              fld qword [esi]
  0804a491  D9C1              fld st1
  0804a493  D8CB              fmul st3
  0804a495  D9C4              fld st4
  0804a497  D8CA              fmul st2
  0804a499  DEC1              faddp st1
  0804a49b  DD1E              fstp qword [esi]
  0804a49d  D9CB              fxch st3
  0804a49f  DEC9              fmulp st1
  0804a4a1  D9C9              fxch st1
  0804a4a3  DECA              fmulp st2
  0804a4a5  DEE1              fsubrp st1
  0804a4a7  DD1B              fstp qword [ebx]
  0804a4a9  83C420            add esp,byte +0x20
  0804a4ac  5B                pop ebx
  0804a4ad  5E                pop esi
  0804a4ae  5D                pop ebp
  0804a4af  C3                ret
Wow...
On the one hand, I'm impressed to see that it figured out that all of the sin() and cos() calls use the same angle, and thus it only needs to calculate each value once.  I'm also impressed to see that it realizes that both values can be calculated at once, and calls a sincos() function.  It also doesn't utilize a temporary variable, since the values have to be copied into the FPU stack anyway, and so the original x and y values are still available there even as their memory locations are overwritten with the results of the two equations.  Even the series of instructions to solve the equations is nicely written.
However, why the fuck is it calling sincos()?  The x87 FPU has had an fsincos instruction since the 80387, and it does essentially the same thing: one instruction, both results.  There are separate fsin and fcos instructions too, but when you need both values fsincos is generally cheaper than issuing the pair.  So why is it calling sincos()?  It obviously expects that I have an FPU as it has planted the call to this function in the middle of a bunch of FPU instructions, so it can't be expecting that maybe the instruction isn't available.  So what the hell?
I can't get over how absurd this is.  I feel compelled to write a color-coded example, so here's what the above code does written in plain English:
The code above first sets up the stack as all functions must do upon entry and saves registers according to the calling convention.  It then loads the angle into the FPU, and from there stores it onto the CPU stack.  It also places onto the CPU stack two pointers to local variables.  Then it calls sincos().  In sincos(), the usual stack manipulation will occur, then the angle will be loaded from the CPU stack into the FPU stack.  Then the FPU instruction fsincos will be used to calculate the sine and cosine of that angle.  Then the sine and cosine will be stored from the FPU stack into the two pointers which were passed as parameters to sincos().  Then sincos() will undo its stack manipulation and return.  Now, the code above loads the sine and cosine values into the FPU stack from the local variables they were stored to by the sincos() function.  It then loads the x and y values from their pointers into the FPU stack.  Then the two equations are solved, and the results are written to the pointers given to the function as parameters.  Finally, the stack setup is reversed, and the function returns.
The ridiculous thing about this is that everything above in red is the inverse of everything in blue.  All of the red and blue can be removed and the exact same fucking thing will happen, but without a bunch of unnecessary movement of data.
...but, whatever.  I always knew GCC was bad at compiling math.  I'm just thrilled to see that it no longer calls a function every time I use sqrt().