Hi,
Another interesting gcc optimization issue. I've moved the timer queue check ahead of the time-shared queue check in my scheduler, and my kernel started to behave inconsistently, randomly throwing page faults in the sched_awake() function. This happens only when compiled with gcc; with Clang it works as it should. After hours and hours of debugging, I've found the very unlikely and unexpected reason, and if it's UB, I would really like to know which rule I've broken. As far as I know, using "continue" in a loop should not cause any UB.
Here's the relevant part of the code (yes, the algorithm is a bit unorthodox, but perfectly correct and does exactly what I want):
Code:
void sched_pick()
{
    tcb_t *tcba = (tcb_t*)LDYN_tcbalarm;
    ccb_t *ccb = (ccb_t*)LDYN_ccb;
    uint i, nonempty;
    do {
        for(nonempty=false,i=PRI_SYS; i<PRI_IDLE; i++) {
            if(ccb->hd_timerq && i==tcba->priority && tcba->alarmusec <= ccb->sched_ticks) {
                sched_awake(tcba);
                goto found;
            }
            if(ccb->hd_active[i]) {
                nonempty = true;
                if(ccb->cr_active[i] == 0) {
                    ccb->cr_active[i] = ccb->hd_active[i];
                    continue;
                } else
                    goto found;
            }
        }
    } while(nonempty);
    ...
found:
    ...
And this is what "gcc -ansi -Wall -Wextra -Wpedantic -O2 -fno-delete-null-pointer-checks -fno-stack-protector" compiled it into:
Code:
ffffffffffe0ef40 <sched_pick>:
ffffffffffe0ef40: 31 c0 xor %eax,%eax
ffffffffffe0ef42: e9 59 fa ff ff jmpq ffffffffffe0e9a0 <sched_awake+0x290>
ffffffffffe0ef47: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
ffffffffffe0ef4e: 00 00
WTF, huh? All the loops and queue head checks are gone! The semantics have definitely changed! No wonder poor sched_awake() got random input! Even worse, this jumps into sched_awake() after the input verification!
I've tried all the loop-related "-fno-*" command line arguments, but nothing changed. Interestingly, when I finally tried "-fno-partial-inlining", gcc generated the correct code. Also, removing the "continue" (which is no longer needed, yet I think it should not cause any trouble) solves the issue: the generated code now contains the necessary and very important checks and loops, along with setting the proper argument for sched_awake():
Code:
ffffffffffe0ec80 <sched_pick>:
ffffffffffe0ec80: 55 push %rbp
ffffffffffe0ec81: 53 push %rbx
ffffffffffe0ec82: 48 83 ec 08 sub $0x8,%rsp
ffffffffffe0ec86: 48 8b 34 25 78 00 00 mov 0xffffffff80000078,%rsi
ffffffffffe0ec8d: 80
ffffffffffe0ec8e: 31 d2 xor %edx,%edx
ffffffffffe0ec90: 31 db xor %ebx,%ebx
ffffffffffe0ec92: eb 40 jmp ffffffffffe0ecd4 <sched_pick+0x54>
ffffffffffe0ec94: 0f 1f 40 00 nopl 0x0(%rax)
ffffffffffe0ec98: 48 8b 04 dd 88 00 00 mov -0x7fffff78(,%rbx,8),%rax
ffffffffffe0ec9f: 80
ffffffffffe0eca0: 48 8d 0c dd 00 00 00 lea 0x0(,%rbx,8),%rcx
ffffffffffe0eca7: 00
ffffffffffe0eca8: 48 85 c0 test %rax,%rax
ffffffffffe0ecab: 74 1d je ffffffffffe0ecca <sched_pick+0x4a>
ffffffffffe0ecad: 48 8b 14 dd c8 00 00 mov -0x7fffff38(,%rbx,8),%rdx
...
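To make the "continue is not needed any more" point concrete, here is a minimal sketch that compares the two loop shapes at the source level. It is a simplified model, not my kernel code: model_ccb_t, NPRI, pick_with_continue(), pick_without_continue() and pick_both() are hypothetical names made up for this illustration, the timer queue check is left out, and the queues are plain ints. It only shows that both variants should pick the same entry; it does not explain what gcc does.

```c
#include <assert.h>
#include <string.h>

#define NPRI 4 /* hypothetical number of priority levels */

/* Hypothetical, simplified stand-in for the ccb_t used in the post. */
typedef struct {
    int hd_active[NPRI]; /* queue heads, 0 = empty queue */
    int cr_active[NPRI]; /* current entries, 0 = must be reset to head */
} model_ccb_t;

/* Same control flow as the loop in the post; returns the level found or -1. */
static int pick_with_continue(model_ccb_t *ccb)
{
    unsigned int i;
    int nonempty;
    do {
        for (nonempty = 0, i = 0; i < NPRI; i++) {
            if (ccb->hd_active[i]) {
                nonempty = 1;
                if (ccb->cr_active[i] == 0) {
                    ccb->cr_active[i] = ccb->hd_active[i];
                    continue; /* last statement on this path, so redundant */
                } else
                    return (int)i;
            }
        }
    } while (nonempty);
    return -1;
}

/* Identical, except that the redundant "continue" is removed. */
static int pick_without_continue(model_ccb_t *ccb)
{
    unsigned int i;
    int nonempty;
    do {
        for (nonempty = 0, i = 0; i < NPRI; i++) {
            if (ccb->hd_active[i]) {
                nonempty = 1;
                if (ccb->cr_active[i] == 0)
                    ccb->cr_active[i] = ccb->hd_active[i];
                else
                    return (int)i;
            }
        }
    } while (nonempty);
    return -1;
}

/* Runs both variants on copies of the same state and checks that they agree. */
int pick_both(model_ccb_t state)
{
    model_ccb_t a = state, b = state;
    int ra = pick_with_continue(&a);
    int rb = pick_without_continue(&b);
    assert(ra == rb);                       /* same level picked */
    assert(memcmp(&a, &b, sizeof a) == 0);  /* same queue state afterwards */
    return ra;
}
```

Calling pick_both() with a few hand-made states (all queues empty, some heads set with cr_active still zero, and so on) should never trip the asserts, which is what I would expect from the C semantics of "continue".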
My question is, do any of you know of any UB related to "continue" in a loop? I have never seen anything like this before, and I've been a professional developer for quite some time now. If this is a developer error on my part, I'd like to learn from it, but right now I fail to see the UB.
Cheers,
bzt