Ekstatische Lyriken Pinnwand

The joy of debugging...

written by Pj on Monday September 12th, 2011 -- 3:39 a.m.

edit this message - return to message index
(only moderators may edit messages)
Every now and then I come across a bug in some program I'm writing that just doesn't seem to have a logical cause.  So I decided to document this one.  Then, hating to leave the story incomplete, I figured out the problem, documenting all the steps along the way.  It's amazing how a simple mistake can turn into hours of debugging.

I'm working on a multi-source program, and find that sometimes it doesn't work correctly.  After trying to find the problem for a while, I discover something interesting...

When I edit a certain source file, the program stops working.  If I then edit another certain source file, it starts working again.  This is true even if the edits are trivial, such as adding or removing a comment.

So I check the object files.  They're identical for both working and non-working versions.  So I check the executable...  It's different. 

I eventually figure out this is because my build script sorts files according to modification time, so that the most recently modified files are compiled first, so that there's less wait to see errors.  So, I modify it so that it still does this, but always puts the object files in the same order when linking. 

I then start fresh looking for the problem.  After commenting out various parts of the program, I track down a single line of code which can be removed to fix the problem, which is the one shown here in bold type:

double new_real = real[f] * cos(angle) - imaginary[f] * sin(angle);
double new_imaginary = real[f] * sin(angle) + imaginary[f] * cos(angle);
real[f] = new_real; imaginary[f] = new_imaginary; phase[f] = new_phase;
real[f] = 1.0 / 32.0 * sample + 31.0 / 32.0 * real[f];
double a = sqrt(real[f] * real[f] + imaginary[f] * imaginary[f]);
zeo_frequency_data[FREQUENCY_HISTORY_SIZE - 1][f + s * F_COUNT] = a;


Exactly what could be wrong with this is a bit of a mystery.  The memory access to real[f] is clearly valid, given that the previous line doesn't trigger the problem.  ...and sample, being a simple variable, can't cause an improper memory access, not to mention that it's not a write and so won't corrupt anything anyway.  The only effect is the value of real[f], which isn't accessed outside of the code I've presented.  Once in the array on the last line of code, the value is only accessed by the drawing code, but disabling the drawing code doesn't prevent the problem from occurring. 

So I start playing with the math.  Removing either sample or real[f] fails to fix the problem, however, removing both will fix the problem.  So I play some more and discover this:

real[f] = sample + real[f]; // Triggers the problem...
real[f] = sample * real[f]; // Corrects the problem...

I also note that assigning something else to real[f] after the problem-inducing line of code causes the problem to disappear, but that may simply be because the optimization removes the previous line of code. 

So I convert the line to a function call, and insert the function in a different file (for reasons to be explained) and end up with this:

real[f] = main_function(sample, real[f]);

double main_function(double sample, double real) {
return 1.0 / 32.0 * sample + 31.0 / 32.0 * real;
};


The problem still occurs, which is good.  So now I add another function call, and another function in yet another file, to end up with this:

real[f] = main_function(sample, real[f]);
real[f] = easy_function(sample, real[f]);

double main_function(double sample, double real) {
return 1.0 / 32.0 * sample + 31.0 / 32.0 * real;
};

double easy_function(double sample, double real) {
return 0;
};


In this way, the code is still executed, but the value is discarded, and the line can't be optimized away because, when creating the object code, the compiler doesn't know what the functions it is calling do, and so it doesn't know that the second one doesn't actually depend on the result of the first.

However, now the code works.  So it must be something caused by the calculated value.  Exactly what, I can't imagine.  I try assigning values like 0, +1000000, -1000000, etc., but none trigger the problem.  ...and with the draw code disabled, the problem still occurs.

So, there's a few lines (the one listed above) which access the value, but none of them make any data-access decisions based on the value.  They only perform calculations with it.  Math shouldn't cause memory corruption issues, or any issues at all for that matter, since the value is never used for anything besides more math.

Then I realize the last two lines in the code I listed can be removed without affecting the value, and so I remove them, and the problem continues to exist.  I also recall that the line before the problem-inducing line can be removed without fixing the problem, so I remove it too.  That allows me to remove the two lines before it which also access the value of real[f].  The result is that there's only one line left in the program that accesses real[f], the one which, when removed, makes the problem go away...

real[f] = 1.0 / 32.0 * sample + 31.0 / 32.0 * real[f];

...and then I remember from before that if I remove sample without removing real[f] that the problem remains.  So I do that again, and again, the problem remains.

So I try some variations on the formula...

real[f] = 1.0 / 32.0 + 31.0 / 32.0 * real[f]; // Doesn't work.
real[f] = 1.0 + 31.0 * real[f]; // Doesn't work.
real[f] = 1.0 + real[f]; // Works!
real[f] = 1 + 3 * real[f]; // Doesn't work.
real[f] = 1 + 2 * real[f]; // Works!
real[f] = 0.1 + 2 * real[f]; // Doesn't work.
real[f] = 0.1 + 3 * real[f]; // Doesn't work.
real[f] = 0.1 + 1 * real[f]; // Doesn't work.
real[f] = 0.1 + real[f]; // Doesn't work.
real[f] = 1 + real[f]; // Works!


What the hell is going on here? At this point I'm not even changing the instructions involved, just the value of the constants used by those instructions.  ...and it's all being done to a variable whose result is never used for anything.  ...and, not that it should be possible or anything, but it apparently has nothing to do with overflow, given that the second-to-last line works, but the last doesn't.

So I try variations of real[f] = X + 2 * real[f], using different values for X. 

Values that result in a working program: 0, 1, 2, 4, 5, 8, 9, 10, 13, 16, 17, 18, 20

Values that result in a broken program: 3, 6, 7, 11, 12, 14, 15, 19

Seriously, I double-checked.  It's repeatable.

Perhaps it's time I insert a printf() to see what the value is turning into...

X = 3 yeilds 1020847100762815239274396370466657796096
X = 7 yeilds 2381976568446568942012167348365083803648
X = 19 yeilds 6465364971497829596878297926574421311488

X = 20, one of the values that fixes the problem, results in a number that just continues to grow in length.  Eventually it reaches infinity (or so says the FPU).

Now it occurs to me that perhaps the memory access error isn't here in this code.  Perhaps some other code is incorrectly accessing the memory that is correctly used to store this variable.  To test this theory, I replace the line with this:

if (real[f] != 0.0) printf("Holy fuck!\n");

That gets triggered a lot.

Just to double-check, I text search the file for "real" and find a variable of the same name in a different set of brackets.  I rename the variable, just in case, but the problem remains. 

Then I replace the above code with this:

if (real[f] != 0.0) {
printf("Holy fuck!\n");
real[f] = 0.0;
};


I was hoping to see how often the variable is overwritten, but interestingly, when I do this I never see the "Holy fuck!" message, which is super-interesting because the code addition that prevents "Holy fuck!" from being displayed isn't actually executed unless "Holy fuck!" is displayed.  What the fuck?

I really don't know where to go from here.

So I continue with my conclusion that the memory corruption is elsewhere.  I revert to the earlier "Holy fuck!" code so that the message is displayed again, then begin disabling pieces of code to see if disabling anything else causes the message to go away.  What code I find that I can disable isn't responsible, and being less than certain what the problem might be, I'm not willing to start disabling things on a line-by-line basis to tear it down to the bare minimum.  So I decide to display the contents of the variables in binary, hoping to see something useful.

if (real[f] != 0.0) {
printf("Holy fuck!\n");
for (int i = 0; i < 8; i++) {
printf(" %02x", ((unsigned char*) &real[f])[i]);
};
printf("\n");
//real[f] = 0.0;
};


However, once again, this puts an end to the problem, so I never see the contents.

So I try adding the value to the printf() for the "Holy fuck!" message, and from that, I get this:

Holy fuck! 531424756029718528.000000

It's the very same number in every message.

So I add more code to assign this number to a variable, then display the number.  I get this:

00 00 00 00 00 80 9d 43

...and I also get this:

Holy fuck! 416798398060...and about a hundred more digits. 

So I add this new number into the program to display it in binary, which yields this:

00 00 00 00 00 00 ec 5c

The "Holy fuck!" number now changes back to its original value. 

So I count, and see that the message is displayed 128 times.  Looking at the code, I see that the opportunity to display it is much more often than that.  So I add more to the printf() to check the state of the two loop variables, i and f

Then, I see the problem only occurs when f is 257.  ...and I realize, f shouldn't be 257.

for (int f = 0; f <= F_COUNT; f++) {

Oops, I guess that should be "<" rather than "<=".

Well, now my code works, but...since imaginary[] is after real[] in the declaration, isn't it after it in memory? (not that I assume that must be true, but it seems likely) Checking the pointers reveals that the reverse is true.  So what is in the memory that was being incorrectly accessed?

I check with objdump...

objdump -t monitor | grep bss | sort
0804fac0 l     O .bss	00000808              phase.18449
080502e0 l O .bss 00000808 imaginary.18448
08050b00 l O .bss 00000808 real.18447
08051308 g O .bss 00000004 main_pause_flag
Oh, reverse order, the second most likely order.

I'm going to have to remember to use objdump next time I suspect a variable is being corrupted.  I figured out early-on that it seemed to just be paused, since clicking on the window (the only command the program accepts, to pause and un-pause it) would cause it to process one more data set before pausing again.  If I'd looked to see what was near the pause flag, it would have told me the problem right away. 

As for all those weird symptoms...

The reason the first assignment to real[f] caused no problems was because, if you look at the formula that calculates new_real, it always results in zero when real[f] was zero to begin with.  Thus, it writes to the memory, it just doesn't change the value.

As for sample + real[f] vs. sample * real[f], again, the latter fails to change real[f] from zero to non-zero.

As for the various formulas that work vs. the ones that don't...  To understand that you have to look up the binary format of a floating point number.  Only the first four bytes of the eight byte float overwrite the pause variable.  Those bytes are (part of) the mantissa, the exponent is stored in the last two bytes. 

real[f] = 1.0 / 32.0 + 31.0 / 32.0 * real[f]; // Doesn't work.
Binary result after 128 interations: 2b f1 76 96 3c 73 ef 3f

real[f] = 1.0 + 31.0 * real[f]; // Doesn't work.
Binary result after 128 interations: 0d e9 f0 bc b8 c4 42 67

real[f] = 1.0 + real[f]; // Works!
Binary result after 128 interations: 00 00 00 00 00 00 60 40

real[f] = 1 + 3 * real[f]; // Doesn't work.
Binary result after 128 interations: 68 69 43 9a 23 59 8d 4c

real[f] = 1 + 2 * real[f]; // Works!
Binary result after 128 interations: 00 00 00 00 00 00 f0 47

real[f] = 0.1 + 2 * real[f]; // Doesn't work.
Binary result after 128 interations: 9a 99 99 99 99 99 b9 47

...and I'm sure the rest follow the same pattern.

So what about this?

Then I replace the above code with this:

if (real[f] != 0.0) {
printf("Holy fuck!\n");
real[f] = 0.0;
};

I was hoping to see how often the variable is overwritten, but interestingly, when I do this I never see the "Holy fuck!" message, which is super-interesting because the code addition that prevents "Holy fuck!" from being displayed isn't actually executed unless "Holy fuck!" is displayed.  What the fuck?

This one is rather interesting.  Without the assignment to real[f], it gets moved from the "bss" section, which is for zero-initialized data not included in the executable, to the "rodata" section, which is for pre-initialized read-only data contained in the executable.  There, whatever data is located after it, is non-zero, triggering the message to be displayed.  Once the assignment is added, it becomes "bss" data again, where once again the pause flag is located after it, and so it is always zero.

The same thing happens with the code that attempts to display a hex dump of the number, even though that code doesn't write to the number.  Both situations don't make any sense.  Why move the array to the rodata section, which requires you to initialize it (increasing executable size) just because it isn't written to? ...and if you're willing to do that, why decide not to do that just because the value is accessed twice instead of once? I have no idea.

Anyway, my whole point of this blog post is that I run into problems like this far too often.  Something just seems totally out of whack, as if there's a bug in the compiler or something, yet in the end it's just something trivially stupid that I've done to cause the problem.  The problem wasn't more than a few lines away from where I was looking, but the "<=" was at one time in the code's history correct, so it didn't stand out as an error, and the fact that the first write to real[f] didn't cause a problem seemed to indicate that all of the memory access was valid. 

Lesson learned, I guess.  Just because you can write to a variable without triggering a problem doesn't exclude the possibility that you're not writing to it the same thing that was there to begin with.  ...and also, just because you do write different values to it doesn't mean that all of the values you choose aren't identical in the few bytes that matter. 

...and I'm not sure this is even worth a blog post, but it's already written, so whatever.

Replies

return to message index

Your Reply

Name: No registration necessary. Simply choose
a name and password and type them in.
Password:
Subject:
You may want to read the rules before you spend a lot of time writing something.
Message:
Plain Text - What you type is what you will see.
Some HTML - Use this if you are including HTML tags.
Pure HTML - Copies your post directly into the web page.
first, then