In my nine-to-five job, I was looking at crashes with a certain input of several thousand data records.
I tried to find the data record responsible for the crash. Divide in half, check which half crashes.
At some point, I was left with a set of ~200 data records that crashed. No further subdivision crashed.
Reshuffling the records meant it no longer crashed. With so many data records, debugger breakpoints got triggered
all the time, with little to tell me which call actually caused the crash. (Hint: It wasn't the last data record that crashed...)
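For anyone who hasn't done this kind of halving before, here is a rough sketch of the idea. The `Record` type and the `crashes` predicate are made up for illustration (in real life the "predicate" was running the actual process on a subset of the input). Note the `break` branch: that is the dead end described above, where neither half crashes on its own.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical record type and crash predicate; neither is from the real code base.
using Record = int;
using CrashTest = std::function<bool(const std::vector<Record>&)>;

// Repeatedly halve the input, keeping whichever half still reproduces the crash.
// Stops when a single record remains or when neither half crashes on its own.
std::vector<Record> bisect(std::vector<Record> records, const CrashTest& crashes) {
    while (records.size() > 1) {
        const auto mid = records.begin() + static_cast<std::ptrdiff_t>(records.size() / 2);
        std::vector<Record> first(records.begin(), mid);
        std::vector<Record> second(mid, records.end());
        if (crashes(first)) {
            records = std::move(first);
        } else if (crashes(second)) {
            records = std::move(second);
        } else {
            break;  // the failure needs records from both halves (or their order)
        }
    }
    return records;
}
```

When the crash depends on a single record, this narrows things down in a logarithmic number of runs. When it depends on a combination of records or on memory layout, it stalls with a larger set, which is where the real work starts.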
I think I repressed most of the subsequent bughunt.
I know I was frantically juggling printf()'s, gdb, and valgrind. In the end, it turned out to be this:
An internal string class used memory pools for allocation. A certain off-by-one error caused a crash
only if the string in question resided at the very start of the memory pool, which happened only in very specific circumstances.
(It also happened very deep into the process; those strings were not the complete data records, so it was
still non-trivial to figure out which data record I was looking at once I had found the point of the crash.)
A second off-by-one error masked the data corruption caused by the first, so if the process didn't crash, everything
appeared to be correct.
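To illustrate why the position inside the pool matters, here is a toy model. Everything in it is an assumption for the sake of the example: the `Pool` class, its size, and the idea that the stray write lands one byte *before* the string. The real class looked nothing like this, and I use a checked `write()` that reports failure where the unchecked original would have touched memory outside the pool.

```cpp
#include <array>
#include <cstddef>

// Hypothetical fixed-size pool; a stand-in, not the real implementation.
class Pool {
public:
    static constexpr std::size_t kSize = 32;

    // Reserves n bytes and stores their offset in `offset`.
    // Returns false when the pool has no room left.
    bool allocate(std::size_t n, std::size_t& offset) {
        if (n > kSize - used_) return false;
        offset = used_;
        used_ += n;
        return true;
    }

    // Checked stand-in for a raw pointer write: returns false where the
    // unchecked version would have written outside the pool.
    bool write(std::ptrdiff_t index, char value) {
        if (index < 0 || index >= static_cast<std::ptrdiff_t>(kSize)) return false;
        bytes_[static_cast<std::size_t>(index)] = value;
        return true;
    }

    char at(std::size_t index) const { return bytes_[index]; }

private:
    std::array<char, kSize> bytes_{};
    std::size_t used_ = 0;
};

// Assumed bug, for illustration only: the write lands one byte before the string.
bool buggy_store(Pool& pool, std::size_t string_offset) {
    return pool.write(static_cast<std::ptrdiff_t>(string_offset) - 1, '!');
}
```

For a string at offset 0, the write falls outside the pool (the crash case). For any later string it silently clobbers the last byte of its neighbour, and nothing complains unless somebody happens to look at that byte.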
Eventually I refactored the whole string class to, basically, std::string.
(The internal class was written, reputedly, "because not all target platforms had good-enough std::string back then". Go figure.
The Linux and Windows versions of the software were actually faster after the refactoring; the AIX version ran at half speed, but I figured, to hell with AIX.
And no, AIX's std::string was not the culprit that "inspired" the homegrown string handling; it was HP-UX -- which we'd stopped supporting over a decade ago.)
Mikumiku747 wrote:
...not much has tipped me off as to why undefined behaviour is feared so much by everyone.
UB can (and usually does...) give you heisenbugs: breakage that cannot be readily reproduced.
Crashes that happen only in production, or only once in a thousand calls, and when you make the exact same call a second time, nothing happens.
Breakage that happens all the time except when run in a debugger.
Stuff like that, where most of the debugging procedures you might be familiar with stop working.
(Sidenote: I found a bug in IBM's debugger for AIX in the process, as that was exactly what was happening -- the error vanished when run in the debugger. It turned out that the IBM debugger, in that version, did not honor the library search path of the AIX host, i.e. the debugger loaded a different set of libraries than the original process. The IBM lady on the phone was rather stricken when she told me that, indeed, their product was at fault for that particular muckup.)