OSDev.org
https://forum.osdev.org/

Fault tolerant OS
https://forum.osdev.org/viewtopic.php?f=15&t=33090

Author:  bharathm1 [ Sun Jul 29, 2018 1:16 pm ]
Post subject:  Fault tolerant OS

Hello, I am wondering if there are any operating systems (or OS theory) that are fault tolerant. By fault tolerant I mean: if a kernel thread is faulty, hogging a CPU or hitting some error, is there a way to isolate the problem to that single kernel thread, or to some pool of threads related to it, and let the rest of the system work properly?

I know that an error in one part of the kernel can compromise the correctness of the whole kernel, but sometimes it might not affect most of the kernel. For example, when a kernel thread is stuck in a while loop, making no progress and hogging the CPU without holding any locks, we might be able to terminate it safely if some magic oracle says this kernel thread (possibly a driver) is no good. But then again, having such an oracle is a major problem as well.
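
One common, imperfect stand-in for such an oracle is a watchdog: every thread has to report progress periodically, and a thread that stays silent for too long is treated as stuck. The sketch below is only an illustration of that idea. The `Watchdog` class, its method names and the thread names are hypothetical, and a real kernel would also have to tell "slow" apart from "stuck", which this does not attempt.

```python
import time


class Watchdog:
    """Hypothetical heartbeat watchdog: flags threads that stop reporting progress."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout    # seconds of silence tolerated per thread
        self.clock = clock        # injectable so the logic can be exercised without sleeping
        self.last_seen = {}       # thread name -> time of its last heartbeat

    def kick(self, name):
        """Called by a thread each time it makes progress."""
        self.last_seen[name] = self.clock()

    def stalled(self):
        """Return the names of threads that have been silent for longer than the timeout."""
        now = self.clock()
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

A monitor would call `stalled()` periodically and decide what to do with each reported thread. Whether it is actually safe to terminate that thread is a separate question that the watchdog cannot answer.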

So are there any OS designs that have fault tolerance (of the kind I described) as a design goal?

Thank you

Author:  Brendan [ Sun Jul 29, 2018 3:02 pm ]
Post subject:  Re: Fault tolerant OS

Hi,

bharathm1 wrote:
I know that an error in one part of the kernel can compromise the correctness of the whole kernel, but sometimes it might not affect most of the kernel.


Sometimes it might not affect most of the kernel, but you never know whether it did or not, so that doesn't help: you have to assume that almost all of the kernel might have been ruined regardless.

To fix that, you want to isolate the pieces so you can know that if one piece has a problem it can't ruin other pieces. In other words; a micro-kernel ends up being necessary.

Of course a micro-kernel isn't enough on its own. You'd also need code to monitor, terminate and restart drivers; and (in some cases) ways to recover lost state.

This has been done before (e.g. Minix 3).
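
As a rough user-space illustration of the "monitor, terminate and restart" part, the sketch below runs a driver as a separate process and restarts it when it exits abnormally. It is a minimal sketch under stated assumptions, not how Minix 3 implements it: `supervise` and `max_restarts` are hypothetical names, the "driver" is just a command line, and both hang detection and recovery of lost state are left out.

```python
import subprocess
import sys


def supervise(cmd, max_restarts=3):
    """Run `cmd` as an isolated child process and restart it after a crash.

    Returns the exit code of every attempt, oldest first.
    """
    exit_codes = []
    for _ in range(max_restarts + 1):
        # The child has its own address space, so a crash cannot corrupt the supervisor.
        code = subprocess.run(cmd).returncode
        exit_codes.append(code)
        if code == 0:
            break  # clean exit: nothing to restart
    return exit_codes


if __name__ == "__main__":
    # Hypothetical driver that always crashes with exit code 3.
    crashing_driver = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(supervise(crashing_driver, max_restarts=2))
```

Capping the number of restarts is a deliberate choice here: a driver that crashes every time it starts would otherwise be restarted forever.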


Cheers,

Brendan

Author:  bharathm1 [ Mon Jul 30, 2018 3:20 pm ]
Post subject:  Re: Fault tolerant OS

Do you think there is any hope for fault tolerant monolithic kernels?

I think ideas such as Nooks, where the address space of kernel drivers is isolated, are a good line of research, though they push some burden onto driver programmers.

What are the main principles that present monolithic kernels (e.g. Linux) violate that make them so terrible at fault tolerance?

Author:  OSwhatever [ Mon Jul 30, 2018 3:36 pm ]
Post subject:  Re: Fault tolerant OS

The Department of Computer Science at the University of Illinois published a paper, "Building a Self-Healing Operating System":

http://choices.cs.illinois.edu/selfhealing.pdf

It goes through a few techniques for "healing" an error.

Author:  Brendan [ Mon Jul 30, 2018 9:38 pm ]
Post subject:  Re: Fault tolerant OS

Hi,

bharathm1 wrote:
Do you think there is any hope for fault tolerant monolithic kernels?


That really depends on what kinds of faults you're trying to tolerate. A driver fails to initialise because it can't allocate enough memory? Easy. A single bit flip (if there's no memory encryption)? Maybe. A CPU failing while holding kernel locks? No.

bharathm1 wrote:
I think ideas such as Nooks, where the address space of kernel drivers is isolated, are a good line of research, though they push some burden onto driver programmers.


If drivers are isolated it's either a micro-kernel or a hybrid (and is no longer monolithic); regardless of whether that isolation is implemented with the hardware's virtual memory management or if it's done in software only, and regardless of whether the driver is still in an area that would've been considered "kernel space".

bharathm1 wrote:
What are the main principles that present monolithic kernels (e.g. Linux) violate that make them so terrible at fault tolerance?


The main principle that is missing is isolation (and if a kernel had it, that would prevent it from being called a true monolithic kernel).
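
To make the isolation point concrete, here is a toy model, not real kernel code. A `bytearray` stands in for memory, with bytes 8..15 belonging to a hypothetical buggy driver and bytes 16..23 to some other component. All names and the layout are assumptions made up for this sketch. Run in the same address space, the driver's stray loop overwrites its neighbour; run in a separate process, it can only damage its own copy, and the caller accepts back only the driver's own region.

```python
import subprocess
import sys

DRIVER_START, DRIVER_END = 8, 16   # hypothetical region owned by the driver


def buggy_driver(memory):
    """Meant to fill only its own buffer, but the loop runs past the end."""
    for i in range(DRIVER_START, len(memory)):   # bug: should stop at DRIVER_END
        memory[i] = 0xFF


def run_shared(memory):
    """Same address space: every stray write lands in live memory."""
    buggy_driver(memory)


def run_isolated(memory):
    """Separate process: the driver only ever sees a copy of memory."""
    child_code = (
        "import sys\n"
        "mem = bytearray(sys.stdin.buffer.read())\n"
        f"for i in range({DRIVER_START}, len(mem)):\n"   # same bug as above
        "    mem[i] = 0xFF\n"
        f"sys.stdout.buffer.write(mem[{DRIVER_START}:{DRIVER_END}])\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        input=bytes(memory), capture_output=True, check=True,
    )
    # Accept only the driver's own region back; everything else stays untouched.
    memory[DRIVER_START:DRIVER_END] = result.stdout
```

In the shared case nothing tells you that bytes 16..23 were damaged, which is why the whole kernel has to be treated as suspect after a fault. In the isolated case the damage is limited by construction.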

Linux is a special case - it maps all physical memory into kernel space (so any dodgy pointer anywhere in many millions of lines of code can corrupt anything that's in memory anywhere); so you can get all your hopes for fault tolerance and nail them to all your hopes for security, and glue on a few extra hopes (e.g. for decent NUMA optimisations), and then throw that huge ball of hopes in the trash.


Cheers,

Brendan

All times are UTC - 6 hours