Ekstatische Lyriken Pinnwand

I rescind my previous statement.

written by Pj on Thursday April 5th, 2012 -- 6:57 a.m.
in reply to GCC sucks less at compiling math equations!

edit this message - return to message index
(only moderators may edit messages)
Today I'm working on some code for my Minecraft server.  (Why I'm doing that rather than writing a server for my own game is a very good question.) Since Minecraft refuses to process block changes faster than a certain rate, I have to send them no faster than that rate in order to avoid lag as Minecraft buffers the block changes and everything that is received after them.  So I store a copy of the map for each player (a nice waste of 36 MB per user) and periodically send the differences between that map and the real map to the user.  Note that I'm not simply buffering the block changes locally because I want to always be sending the player the changes which are closest to the player's position, and the player can move a lot in the several minutes of backed-up data that can occur when large portions of the map are changed.

This introduces a new problem: How to compare the player's map to the real map in a time frame that makes sense for something that is done 10 times a second for up to 36 players.  Given that the map is 36 MB in size, simply comparing each block isn't feasible.

So instead I've divided the map into 8x8x64 chunks, each of which has a flag (per user) which is set by the function that changes map blocks.  Then I simply scan those flags, and find the closest chunk with a flag set, and send blocks from that chunk.  This is repeated a few times if necessary to come up with the 100 blocks I'm allowed to send each player every 0.1 seconds. 

I'd like to reduce this to 1x1x64 chunks, one chunk for each (x, y) position on the map, as it seems a bit more responsive to player movement and looks better.  However, as the map is 768x768x64, the (x, y) array contains 589824 bytes whose flags need to be checked, and, if the flag is set, then the distance to that chunk calculated.  Then, because the chunk size is so small, it's far more likely that I'll need to repeat this search multiple times to find the next closest chunk.  ...and if that wasn't bad enough, everyone demands this "/fly" command on the server which places glass blocks under the player so that he can walk in the air, and implementing it means that the chunks around the player are always marked as needing updating, and so every search first results in the 9 chunks nearest the player before any changes I actually need to send turn up.

So I've been writing some code to test the feasibility of sorting an array of all of the (x, y) positions by distance from the player.  I figure that even if this takes a bit too long on its own, I can simply not do it only once for one player every 0.1 seconds, choosing the player whose distance table is most out of date, and also adding a requirement for how far the player must move before the table is recalculated.

So, first thing to do is fill in the table with the distances:

__attribute__ ((noinline)) static void calculate_distances() {
double px = floor(234.5);
double py = floor(123.4);
double pz = floor(012.3);
int i = 0;
for (int y = 0; y < MAP_Y_SIZE; y++) {
for (int x = 0; x < MAP_X_SIZE; x++) {
double dx = px - x;
double dy = py - y;
int d = sqrt(dx * dx + dy * dy) * 1000;
distance_data[i].x = x;
distance_data[i].y = y;
distance_data[i].d = d;
i++;
};
};
};

This results in the following assembly code:

08049490 calculate_distances:
08049490 55 push ebp
08049491 89E5 mov ebp,esp
08049493 57 push edi
08049494 31FF xor edi,edi
08049496 56 push esi
08049497 53 push ebx
08049498 83EC4C sub esp,byte +0x4c
0804949b C745DC00000000 mov dword [ebp-0x24],0x0
080494a2 8DB600000000 lea esi,[esi+0x0]
080494a8 897DE4 mov [ebp-0x1c],edi
080494ab 8B75DC mov esi,[ebp-0x24]
080494ae 31DB xor ebx,ebx
080494b0 DB45E4 fild dword [ebp-0x1c]
080494b3 DC2DC87A0508 fsubr qword [dword 0x8057ac8]
080494b9 C1E604 shl esi,0x4
080494bc 81C6C4E8FA0A add esi,0xafae8c4
080494c2 D8C8 fmul st0
080494c4 D905C07A0508 fld dword [dword 0x8057ac0]
080494ca 8DB600000000 lea esi,[esi+0x0]
080494d0 895DE4 mov [ebp-0x1c],ebx
080494d3 DB45E4 fild dword [ebp-0x1c]
080494d6 D8E9 fsubr st1
080494d8 D8C8 fmul st0
080494da D8C2 fadd st2
080494dc D9C0 fld st0
080494de D9FA fsqrt
080494e0 DBE8 fucomi st0
080494e2 7A06 jpe 0x80494ea calculate_distances+0x5a
080494e4 742A jz 0x8049510 calculate_distances+0x80
080494e6 DDD8 fstp st0
080494e8 EB06 jmp short 0x80494f0 calculate_distances+0x60
080494ea DDD8 fstp st0
080494ec 8D742600 lea esi,[esi+0x0]
080494f0 DD1C24 fstp qword [esp]
080494f3 D9C9 fxch st1
080494f5 DD5DC8 fstp qword [ebp-0x38]
080494f8 DD5DB8 fstp qword [ebp-0x48]
080494fb E8E4F8FFFF call dword 0x8048de4 sqrt@plt
08049500 DD45B8 fld qword [ebp-0x48]
08049503 DD45C8 fld qword [ebp-0x38]
08049506 D9CA fxch st2
08049508 EB08 jmp short 0x8049512 calculate_distances+0x82
0804950a 8DB600000000 lea esi,[esi+0x0]
08049510 DDD9 fstp st1
08049512 891E mov [esi],ebx
08049514 83C301 add ebx,byte +0x1
08049517 897E04 mov [esi+0x4],edi
0804951a D97DE2 fnstcw [ebp-0x1e]
0804951d D80DC47A0508 fmul dword [dword 0x8057ac4]
08049523 0FB745E2 movzx eax,word [ebp-0x1e]
08049527 B40C mov ah,0xc
08049529 668945E0 mov [ebp-0x20],ax
0804952d D96DE0 fldcw [ebp-0x20]
08049530 DB5EFC fistp dword [esi-0x4]
08049533 D96DE2 fldcw [ebp-0x1e]
08049536 83C610 add esi,byte +0x10
08049539 81FB00030000 cmp ebx,0x300
0804953f 758F jnz 0x80494d0 calculate_distances+0x40
08049541 DDD8 fstp st0
08049543 DDD8 fstp st0
08049545 8145DC00030000 add dword [ebp-0x24],0x300
0804954c 83C701 add edi,byte +0x1
0804954f 817DDC00000900 cmp dword [ebp-0x24],0x90000
08049556 0F854CFFFFFF jnz dword 0x80494a8 calculate_distances+0x18
0804955c 83C44C add esp,byte +0x4c
0804955f 5B pop ebx
08049560 5E pop esi
08049561 5F pop edi
08049562 5D pop ebp
08049563 C3 ret

I'll spare everyone my analysis of the code, mostly because I just don't fucking care to analyze it.  It's too fucking long, I'll tell you that much.

Here's the time data from a loop of 100 executions of the code:


Lag percentiles per function:
90% 50% 10% 1% worst total
16.0 16.0 16.0 19.9 19.9 21.0% calculate_distances()

Units are milliseconds.  The 'worst' field is obviously the longest time of all trials.  The "90%" field means that 90% of the trials took at least that long.  The "10%" field means that 10% of the trials took at least that long.  I guess the 50% field is the best to pay attention to.

Anyway, I decided to rewrite this code in assembly, and came up with this: (displayed in objdump format, to match the display of GCC's output, as I doubt anyone would try to read the assembly code anyway)

080493d0 calculate_distances:
080493d0 60 pushad
080493d1 9BD93D60E20508 fstcw [dword 0x805e260]
080493d8 9BDBE3 finit
080493db D92D307A0508 fldcw [dword 0x8057a30]
080493e1 DB052C7A0508 fild dword [dword 0x8057a2c]
080493e7 DD051C7A0508 fld qword [dword 0x8057a1c]
080493ed DB1D50E20508 fistp dword [dword 0x805e250]
080493f3 DB0550E20508 fild dword [dword 0x805e250]
080493f9 DD05147A0508 fld qword [dword 0x8057a14]
080493ff DB1D50E20508 fistp dword [dword 0x805e250]
08049405 DB0550E20508 fild dword [dword 0x805e250]
0804940b DD050C7A0508 fld qword [dword 0x8057a0c]
08049411 DB1D50E20508 fistp dword [dword 0x805e250]
08049417 DB0550E20508 fild dword [dword 0x805e250]
0804941d BF00000000 mov edi,0x0
08049422 B900000000 mov ecx,0x0

08049427 y_loop:
08049427 BB00000000 mov ebx,0x0

0804942c x_loop:
0804942c 898FC4E8D60A mov [edi+0xad6e8c4],ecx
08049432 899FC0E8D60A mov [edi+0xad6e8c0],ebx
08049438 DB87C4E8D60A fild dword [edi+0xad6e8c4]
0804943e D8E2 fsub st2
08049440 DCC8 fmul to st0
08049442 DB87C0E8D60A fild dword [edi+0xad6e8c0]
08049448 D8E2 fsub st2
0804944a DCC8 fmul to st0
0804944c DEC1 faddp st1
0804944e D9FA fsqrt
08049450 D8CC fmul st4
08049452 DB9FC8E8D60A fistp dword [edi+0xad6e8c8]
08049458 81C70C000000 add edi,0xc
0804945e 43 inc ebx
0804945f 81FB00030000 cmp ebx,0x300
08049465 75C5 jnz 0x804942c x_loop
08049467 41 inc ecx
08049468 81F900030000 cmp ecx,0x300
0804946e 75B7 jnz 0x8049427 y_loop
08049470 9BDBE3 finit
08049473 D92D60E20508 fldcw [dword 0x805e260]
08049479 61 popad
0804947a C3 ret

No real "optimization" since with code this trivial there really isn't anything to optimize as its rather obvious how it should go together.  I used ebx and ecx to as the x and y loop iterators, and edi as the 'i' variable, except that I just use it as a pointer to the structure and increment it by 12 each loop.  GCC probably made a pointer too, I imagine.  Don't care to dig through that code to look for it though.

Here's the run time data for that code...

Lag percentiles per function:
90% 50% 10% 1% worst total
6.6 6.6 6.6 10.3 10.3 9.9% calculate_distances()

Same calculations in only 41% of the time.  Nice.

...and, yes, I compiled with -O3.  Like I always tell everyone, compiler optimization doesn't so much improve your code as it merely improves the awful mess the compiler would otherwise output without optimizations enabled.

I'll now have to try rewriting the sorting algorithm in assembly.  I'm interested to see how the improvement in execution speed compares when the code being rewritten is non-trivial.

Replies

FORTRAN is the answer - crunge - 4/5/12
return to previous message - return to message index

Your Reply

Name: No registration necessary. Simply choose
a name and password and type them in.
Password:
Subject:
You may want to read the rules before you spend a lot of time writing something.
Message:
Plain Text - What you type is what you will see.
Some HTML - Use this if you are including HTML tags.
Pure HTML - Copies your post directly into the web page.
first, then