OSDev.org

The Place to Start for Operating System Developers

All times are UTC - 6 hours




 Post subject: What is your longest bug?
PostPosted: Thu Apr 27, 2017 6:18 pm 
Joined: Sun Sep 19, 2010 10:05 pm
Posts: 1074
After fighting with a particularly annoying EHCI issue for nearly 4 days, I'm trying to think of the longest time it's taken me to find and fix a bug. This one feels like the longest right now, but I'm sure I've spent over a week or two on one issue before. I could probably search back through the forums and find out.

What about you guys? What was your "white whale" bug that eluded you the longest? Or maybe still eludes you...

_________________
Project: OZone
Source: GitHub
Current Task: LIB/OBJ file support
"The more they overthink the plumbing, the easier it is to stop up the drain." - Montgomery Scott


 Post subject: Re: What is your longest bug?
PostPosted: Thu Apr 27, 2017 7:41 pm 
Joined: Fri Oct 21, 2011 9:47 pm
Posts: 286
Location: Tustin, CA USA
I will never forget it. 3 months. It was for work, not the personal stuff, and it was an ERP suite. It was all centered around memory leaks. My business locked me in a room for 2 weeks of that and slipped pizza under the door.

_________________
Adam

The name is fitting: Century Hobby OS -- At this rate, it's gonna take me that long!
Read about my mistakes and missteps with this iteration: Journal

"Sometimes things just don't make sense until you figure them out." -- Phil Stahlheber


 Post subject: Re: What is your longest bug?
PostPosted: Thu Apr 27, 2017 8:16 pm 
Joined: Sun Feb 09, 2014 7:11 pm
Posts: 89
Location: Within a meter of a computer
I've been dealing with SMP bugs for a while. They've been different bugs, but they all fall under the banner of 'SMP doesn't work, find and fix the race conditions'.

_________________
"If the truth is a cruel mistress, then a lie must be a nice girl"
Working on Cardinal
Find me at #Cardinal-OS on freenode!


 Post subject: Re: What is your longest bug?
PostPosted: Thu Apr 27, 2017 9:19 pm 
Joined: Sun Sep 19, 2010 10:05 pm
Posts: 1074
I just looked back through my old posts and realized that one of my first questions on this forum, in 2013, was about how to set up queues properly for OHCI controllers. And the last question I posted, yesterday, was about how to set up EHCI queues properly.

I've really come a long way...

_________________
Project: OZone
Source: GitHub
Current Task: LIB/OBJ file support
"The more they overthink the plumbing, the easier it is to stop up the drain." - Montgomery Scott


 Post subject: Re: What is your longest bug?
PostPosted: Fri Apr 28, 2017 12:44 am 
Joined: Thu Nov 16, 2006 12:01 pm
Posts: 7612
Location: Germany
I remember a bug in the printf() implementation for PDCLib that drove me to distraction. You can still see the comment here.

I hadn't checked in the source before I got this working, but the code I had gave almost correct results. I had been trying to track this bug down for weeks (not full time, of course, this has always been a spare-time endeavour). And I did spend virtually all of the Breakpoint 2006 demo party staring at the code, stepping through the debugger and generally tearing my hair out.

And then, after three days of drinking, eating junk food, and staring at the screen in frustration, it struck me like... well... #-o :oops: :evil: -- I was adding to the wrong variable...

(Unfortunately I hadn't checked in the previous version of the source yet, as I wanted it to work before checking in, so I cannot show you the diff.)

I had a similar issue at the office once, where I did spend almost two weeks full-time trying to nail down a bug, which turned out to be something along the lines of a sign error. I was ashamed to report this to my superior. But he smiled and said:

"Every bug is trivial... once you've found it."

Any similarity between that remark and my signature is not coincidental.

_________________
Every good solution is obvious once you've found it.


 Post subject: Re: What is your longest bug?
PostPosted: Fri Apr 28, 2017 1:15 am 
Joined: Sun Sep 19, 2010 10:05 pm
Posts: 1074
Solar wrote:
...Breakpoint 2006 demo party...

8)

That's probably the worst thing about living in the U.S. I've always wanted to go to one of those.

I did get to watch some of Revision 2017 a few weeks ago live on Twitch. Almost like being there...

_________________
Project: OZone
Source: GitHub
Current Task: LIB/OBJ file support
"The more they overthink the plumbing, the easier it is to stop up the drain." - Montgomery Scott


 Post subject: Re: What is your longest bug?
PostPosted: Fri Apr 28, 2017 1:58 am 
Joined: Sun Feb 01, 2009 6:11 am
Posts: 1070
Location: Germany
I haven't looked at it for a few years now and I probably won't fix it any more, but the most puzzling OS-Dev bug for me is still in my sis900 driver. I have one computer with an early revision where it works just fine. And I have another computer with a newer revision on-board and that one works fine, too, but only as long as you never send a packet that is larger than 128 bytes (exactly 128 is fine). If you do, the NIC is dead and won't receive or send anything any more. I spent quite a few hours on it back then and I still don't know what the problem is. It doesn't really make any sense to me, but I don't usually use this test computer any more, so whatever. But I'm sure it would turn out to be one of the trivial bugs once I found it.

_________________
Developer of tyndur - community OS of Lowlevel (German)


 Post subject: Re: What is your longest bug?
PostPosted: Fri Apr 28, 2017 5:40 am 
Joined: Thu Aug 06, 2015 6:41 am
Posts: 97
Location: Netherlands
My second longest bug was an issue with my GDT when I tried switching to PMode for the first time. I think it took me about one or two weeks to solve. I originally had my 32-bit PMode data selector at offset 0x08 and my code selector at 0x10, as opposed to the CS = 0x08 and DS = 0x10 that almost all tutorials and code examples use, but that shouldn't matter as long as you put the right descriptor in the right register. After switching them around, the jump to PMode worked but loading DS didn't. It took me ages to figure out, because both entries worked when placed at offset 0x08 but neither did when placed at offset 0x10... So it couldn't be a problem with the selector structures themselves, or at least that's what I thought for at least a week. At some point I decided to go through it character by character, comparing it to everything else I could find on the internet, and it turned out to be a typo that caused the length of my descriptor structure to be off by one byte. That is why the second descriptor never worked: it wasn't at the offset it was supposed to be at.
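One way to catch this kind of off-by-one-byte layout mistake at build time is sketched below in C. This is not the original code (which is not shown in the post and may well have been assembly); the struct name `gdt_entry`, the table `gdt`, and the helper `gdt_selector` are all hypothetical, and `__attribute__((packed))` assumes GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>

/* One 8-byte segment descriptor, field by field. */
struct gdt_entry {
    uint16_t limit_low;    /* limit bits 0..15 */
    uint16_t base_low;     /* base bits 0..15 */
    uint8_t  base_mid;     /* base bits 16..23 */
    uint8_t  access;       /* present bit, DPL, type */
    uint8_t  flags_limit;  /* flags in the high nibble, limit bits 16..19 in the low nibble */
    uint8_t  base_high;    /* base bits 24..31 */
} __attribute__((packed));

/* The build fails if a typo makes a descriptor anything other than 8 bytes. */
static_assert(sizeof(struct gdt_entry) == 8, "GDT descriptor must be exactly 8 bytes");

/* Hypothetical flat 32-bit table: null descriptor, code, data. */
const struct gdt_entry gdt[3] = {
    {0, 0, 0, 0, 0, 0},
    {0xFFFF, 0, 0, 0x9A, 0xCF, 0},  /* code: base 0, 4 GiB limit */
    {0xFFFF, 0, 0, 0x92, 0xCF, 0},  /* data: base 0, 4 GiB limit */
};

/* Byte offset of entry `index` inside the table, which is also the
   selector value for that entry when RPL and TI are both 0. */
unsigned gdt_selector(unsigned index) {
    return (unsigned)((const char *)&gdt[index] - (const char *)gdt);
}
```

With this layout `gdt_selector(1)` is 0x08 and `gdt_selector(2)` is 0x10. If the first descriptor were one byte off, the second would no longer sit at 0x10, which matches the symptom described above: either entry works at 0x08, neither works at 0x10.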

My longest bug still hasn't been properly fixed, but I found a way to at least make it work. When I reboot in Bochs (only in Bochs) using either the PS/2 or the port 0xcf9 method (but not when triple faulting or pressing 'reset'), the RTC doesn't fire interrupts anymore. The way I made this work is by doing an int 28h, which is mapped to IRQ 8 (the RTC interrupt); after that it worked. The mysterious part of this is that it doesn't work if I just send an EOI, it has to be an int 28h...


 Post subject: Re: What is your longest bug?
PostPosted: Fri Apr 28, 2017 5:57 am 
Joined: Tue Mar 04, 2014 5:27 am
Posts: 1108
Kevin wrote:
I haven't looked at it for a few years now and I probably won't fix it any more, but the most puzzling OS-Dev bug for me is still in my sis900 driver. I have one computer with an early revision where it works just fine. And I have another computer with a newer revision on-board and that one works fine, too, but only as long as you never send a packet that is larger than 128 bytes (exactly 128 is fine). If you do, the NIC is dead and won't receive or send anything any more. I spent quite a few hours on it back then and I still don't know what the problem is. It doesn't really make any sense to me, but I don't usually use this test computer any more, so whatever. But I'm sure it would turn out to be one of the trivial bugs once I found it.


I've read the errata for a few of Intel's one- and ten-gigabit chips and looked at their DPDK code, which in places contradicts the chip documentation (e.g. doing exactly what the doc says not to), and learned that the whole thing is a big mess. I also had an interesting bug where the ring buffer would sometimes get stuck before completing the first round if data was arriving fast. But if I did some extra flushing or something of the sort during that first round only, everything would then just work. I'm still not sure whether it was some odd caching issue, because the workaround was considered sufficient at the time.


 Post subject: Re: What is your longest bug?
PostPosted: Fri Apr 28, 2017 6:04 am 
Joined: Tue Mar 04, 2014 5:27 am
Posts: 1108
sleephacker wrote:
When I reboot in Bochs (only in Bochs) using either the PS/2 or the port 0xcf9 method (but not when triple faulting or pressing 'reset'), the RTC doesn't fire interrupts anymore. The way I made this work is by doing an int 28h, which is mapped to IRQ 8 (the RTC interrupt); after that it worked. The mysterious part of this is that it doesn't work if I just send an EOI, it has to be an int 28h...


What if you just do iret without doing int 0x28?

I had a weird PS/2 mouse problem years ago. I could never properly disable it on one PC. I don't remember if the PC hung or if the mouse couldn't be re-enabled again afterwards or it never got disabled. Things seemed to work on other PCs, though. Never got to the bottom of it.


 Post subject: Re: What is your longest bug?
PostPosted: Fri Apr 28, 2017 6:23 am 
Joined: Mon Mar 25, 2013 7:01 pm
Posts: 5099
sleephacker wrote:
The mysterious part of this is that it doesn't work if I just send an EOI, it has to be an int 28h...

When you initialize the RTC, do you acknowledge its pending interrupts by reading status register C?


 Post subject: Re: What is your longest bug?
PostPosted: Fri Apr 28, 2017 6:57 am 
Joined: Thu Aug 06, 2015 6:41 am
Posts: 97
Location: Netherlands
Octocontrabass wrote:
sleephacker wrote:
The mysterious part of this is that it doesn't work if I just send an EOI, it has to be an int 28h...

When you initialize the RTC, do you acknowledge its pending interrupts by reading status register C?

It all makes sense now!

My int 28h handler reads status register C, which is why int 28h fixed it. I removed the int 28h and put a read from status reg C in the initialisation, and now it works!

Thank you!
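For anyone hitting the same thing, here is a rough C sketch of the idea: enable the periodic interrupt, then read status register C once during initialisation so that a flag left pending across the reboot is cleared. It is not my actual code. The `port_io` hooks, `rtc_init`, and the `mock_*` stand-ins are hypothetical names, and the mock only models the read-to-clear behaviour of register C so the sketch can run in user space.

```c
#include <assert.h>
#include <stdint.h>

#define CMOS_INDEX 0x70  /* CMOS/RTC index port */
#define CMOS_DATA  0x71  /* CMOS/RTC data port */
#define RTC_REG_B  0x0B  /* status register B */
#define RTC_REG_C  0x0C  /* status register C: interrupt flags, cleared by reading */
#define RTC_PIE    0x40  /* register B bit 6: periodic interrupt enable */

/* Hypothetical port-access hooks; a kernel would back them with out/in. */
struct port_io {
    void    (*outb)(uint16_t port, uint8_t value);
    uint8_t (*inb)(uint16_t port);
};

static uint8_t rtc_read(const struct port_io *io, uint8_t reg) {
    io->outb(CMOS_INDEX, reg);
    return io->inb(CMOS_DATA);
}

static void rtc_write(const struct port_io *io, uint8_t reg, uint8_t value) {
    io->outb(CMOS_INDEX, reg);
    io->outb(CMOS_DATA, value);
}

/* Enables the periodic interrupt, then reads register C once so that flags
   left over from before the reboot are acknowledged. Returns those flags. */
uint8_t rtc_init(const struct port_io *io) {
    assert(io && io->outb && io->inb);
    uint8_t b = rtc_read(io, RTC_REG_B);
    rtc_write(io, RTC_REG_B, b | RTC_PIE);
    return rtc_read(io, RTC_REG_C);
}

/* --- Mock hardware (assumption: only the register C latch is modelled) --- */
static uint8_t mock_regs[128];
static uint8_t mock_index;

static void mock_outb(uint16_t port, uint8_t value) {
    if (port == CMOS_INDEX)
        mock_index = value & 0x7F;
    else if (port == CMOS_DATA)
        mock_regs[mock_index] = value;
}

static uint8_t mock_inb(uint16_t port) {
    if (port != CMOS_DATA)
        return 0xFF;
    uint8_t value = mock_regs[mock_index];
    if (mock_index == RTC_REG_C)
        mock_regs[RTC_REG_C] = 0;  /* reading register C clears its flags */
    return value;
}

const struct port_io mock_io = { mock_outb, mock_inb };
```

On real hardware the two hooks would wrap the `out` and `in` instructions, and the interrupt handler for IRQ 8 would still need to read register C on every interrupt before sending the EOI.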


 Post subject: Re: What is your longest bug?
PostPosted: Fri Apr 28, 2017 10:48 am 
Joined: Thu May 17, 2007 1:27 pm
Posts: 999
The hardest bug I remember is not related to OS development. It was a subtle race condition at work. We have a program that solves a certain computationally hard problem. It does this by running multiple iterations of some algorithm and that algorithm itself is distributed over many compute nodes. In extremely rare cases (like once in a billion) it was possible for one of the compute nodes to prove that the whole problem was infeasible without taking into account all other computations. Thus the code looked like this:
Code:
while(problem_not_solved()) {
    /* Lots of precomputation, multiple code paths of MPI calls to set everything up. */

    while(any_work_left()) {
        if(do_work() == GLOBALLY_INFEASIBLE)
            break; /* leaves the local work queue non-empty */
    }
    wait_for_other_nodes();

    /* Multiple MPI code paths to collect results and postprocess them. */
}

The program would mostly work correctly but hang sometimes (like once per 500 runs or so). I spent days debugging the precomputation and postprocessing code and the actual do_work() procedure. Just to illustrate how difficult this code was to debug: it typically ran on ~160 cores concurrently (we had dual-socket Xeon nodes with 10 cores/socket), and the outer loop ran some thousands of times per invocation while the inner loop ran billions of times per invocation. In the end the problem was that if the break statement was executed, the work queue would not be emptied, which prevented wait_for_other_nodes() from completing. However, there was a load balancer that moved work between multiple nodes, so the bug would usually go undetected because a node's work queue was still indirectly drained by other nodes. Unless all of them triggered the bug simultaneously! The fix was just to clear the local work queue before the break.
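The fix can be sketched as a single-node C model. This is not the real code: there is no MPI and no load balancer here, the queue is reduced to a counter, and `work_queue`, `infeasible_at`, `apply_fix`, and `run_inner_loop` are hypothetical names introduced only for illustration. The return value stands in for "wait_for_other_nodes() would be able to complete".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum work_result { WORK_OK, GLOBALLY_INFEASIBLE };

/* Hypothetical per-node work queue, reduced to a count of pending items. */
struct work_queue {
    size_t pending;
};

static bool any_work_left(const struct work_queue *q) {
    return q->pending > 0;
}

/* Stand-in for do_work(): consumes one item and reports infeasibility
   once exactly `infeasible_at` items remain in the queue. */
static enum work_result do_work(struct work_queue *q, size_t infeasible_at) {
    assert(q->pending > 0);
    q->pending--;
    return q->pending == infeasible_at ? GLOBALLY_INFEASIBLE : WORK_OK;
}

/* Runs the inner loop. Returns true if the local queue is empty afterwards,
   which is what the barrier needs in this simplified model. */
bool run_inner_loop(struct work_queue *q, size_t infeasible_at, bool apply_fix) {
    while (any_work_left(q)) {
        if (do_work(q, infeasible_at) == GLOBALLY_INFEASIBLE) {
            if (apply_fix)
                q->pending = 0;  /* the fix: clear the local queue before the break */
            break;
        }
    }
    return q->pending == 0;
}
```

Starting with 10 items and `infeasible_at` set to 4, the loop breaks with 4 items still queued, so the function returns false without the fix and true with it. In the real program the load balancer would normally drain those leftover items through other nodes, which is why the hang only showed up when every node broke out at once.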

_________________
managarm: Microkernel-based OS capable of running a Wayland desktop (Discord: https://discord.gg/7WB6Ur3). My OS-dev projects: [mlibc: Portable C library for managarm, qword, Linux, Sigma, ...] [LAI: AML interpreter] [xbstrap: Build system for OS distributions].


 Post subject: Re: What is your longest bug?
PostPosted: Fri Apr 28, 2017 11:38 am 
Joined: Fri Apr 03, 2015 9:41 am
Posts: 492
U365 had the worst bug I ever encountered. It was a memory bug: memory regions were overlapping. My team and I were forced to rewrite about 1/5 of the whole OS: the memory manager, the file system... It was a nightmare. The bug was open for more than a year... Now it's finally fixed. Fully.

_________________
Developing U365.
Source:
only testing: http://gitlab.com/bps-projs/U365/tree/testing

OSDev newbies can copy any code from my repositories, just leave a notice that this code was written by U365 development team, not by you.


 Post subject: Re: What is your longest bug?
PostPosted: Fri Apr 28, 2017 12:41 pm 
Joined: Mon Aug 25, 2014 1:27 pm
Posts: 67
My longest bug?

It was a millipede... I called him Tony :-)

