Fault tolerant OS

bharathm1
Posts: 4
Joined: Sun Jul 01, 2018 7:23 am

Fault tolerant OS

Post by bharathm1 »

Hello, I am wondering if there are any operating systems (or OS theory) that are fault tolerant. By fault tolerant I mean: if a kernel process is faulty, hogging a CPU or hitting some error, is there a way to isolate the problem to that single kernel thread, or to some pool of threads related to the faulty thread, and let the rest of the system work properly?

I know that an error in one part of the kernel can compromise the correctness of the whole kernel, but sometimes it might not affect most of the kernel. For example, when a kernel thread is stuck in a while loop, making no progress and hogging the CPU without holding any locks, we might be able to safely terminate it if some magic oracle says this kernel thread (possibly a driver) is no good. But then again, having such an oracle is a major problem as well.
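As a rough illustration of how far short of a "magic oracle" a practical check falls, here is a minimal heartbeat watchdog sketch. The `Watchdog` class, the worker names and the tick-based clock are all hypothetical and only simulate the idea; a real kernel would need timers and a safe way to stop the thread.

```python
from dataclasses import dataclass, field


@dataclass
class Watchdog:
    """Flags workers that have not reported progress within `timeout` ticks."""
    timeout: int
    last_beat: dict = field(default_factory=dict)

    def heartbeat(self, worker: str, now: int) -> None:
        # A healthy worker calls this whenever it makes forward progress.
        self.last_beat[worker] = now

    def suspects(self, now: int) -> list:
        # Anything silent for longer than the timeout is only a *suspect*:
        # a slow-but-correct worker looks exactly like a stuck one.
        return sorted(w for w, t in self.last_beat.items()
                      if now - t > self.timeout)


wd = Watchdog(timeout=3)
wd.heartbeat("disk_driver", now=0)   # hypothetical worker names
wd.heartbeat("net_driver", now=0)
wd.heartbeat("disk_driver", now=4)   # the disk driver keeps making progress
stuck = wd.suspects(now=5)           # the net driver has been silent for 5 ticks
print(stuck)
```

The watchdog can only say that a worker has gone quiet, not that it is wrong, and it says nothing about whether killing the worker is safe (for example, whether it holds locks).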

So are there any OS designs that have fault tolerance (of the kind I described) as a design goal?

Thank you
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Fault tolerant OS

Post by Brendan »

Hi,
bharathm1 wrote:I know that getting an error in one part of the kernel can compromise the correctness of kernel but sometimes it might not affect most of the kernel.
Sometimes it might not affect most of the kernel, but you never know whether it did or not, so that doesn't help - you have to assume that almost all of the kernel might have been ruined regardless.

To fix that, you want to isolate the pieces so you can know that if one piece has a problem it can't ruin other pieces. In other words, a micro-kernel ends up being necessary.

Of course a micro-kernel isn't enough on its own. You'd also need code to monitor, terminate and restart drivers; and (in some cases) ways to recover lost state.
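The monitor / terminate / restart / recover-state loop can be sketched roughly as below. This is only a toy under stated assumptions: `CounterDriver`, `Supervisor`, the `"bad"` request and the checkpoint scheme are hypothetical, and a Python exception stands in for a fault that a real micro-kernel would detect across an address-space boundary.

```python
class CounterDriver:
    """Hypothetical driver: counts requests, faults on a poisoned request."""

    def __init__(self, state=None):
        self.handled = state if state is not None else 0

    def handle(self, request: str) -> int:
        if request == "bad":
            raise RuntimeError("driver fault")
        self.handled += 1
        return self.handled

    def checkpoint(self) -> int:
        # State worth preserving across a restart.
        return self.handled


class Supervisor:
    """Monitors one driver, replaces it on failure, restores saved state."""

    def __init__(self, factory):
        self.factory = factory
        self.driver = factory()
        self.saved = self.driver.checkpoint()
        self.restarts = 0

    def submit(self, request: str):
        try:
            result = self.driver.handle(request)
        except Exception:
            # Terminate: drop the faulty instance. Restart: build a fresh one
            # from the last checkpoint. The failed request itself is lost.
            self.driver = self.factory(self.saved)
            self.restarts += 1
            return None
        # Only checkpoint after a request has completed successfully.
        self.saved = self.driver.checkpoint()
        return result


sup = Supervisor(CounterDriver)
results = [sup.submit(r) for r in ["read", "write", "bad", "read"]]
print(results, sup.restarts)
```

The sketch only works because the driver's state is small and checkpointed outside the driver; state that lives only inside the failed instance cannot be recovered this way.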

This has been done before (e.g. Minix 3).


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
bharathm1
Posts: 4
Joined: Sun Jul 01, 2018 7:23 am

Re: Fault tolerant OS

Post by bharathm1 »

Do you think there is any hope for fault tolerant monolithic kernels?

I think ideas such as Nooks, where the address space of kernel drivers is isolated, are a good line of research, though they push some burden onto driver programmers.

What are the main principles that present monolithic kernels (e.g. Linux) violate that make them terrible at fault tolerance?
OSwhatever
Member
Posts: 595
Joined: Mon Jul 05, 2010 4:15 pm

Re: Fault tolerant OS

Post by OSwhatever »

The Department of Computer Science at the University of Illinois published a paper, "Building a Self-Healing Operating System":

http://choices.cs.illinois.edu/selfhealing.pdf

It goes through a few techniques for "healing" an error.
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Fault tolerant OS

Post by Brendan »

Hi,
bharathm1 wrote:Do you think there is any hope for fault tolerant monolithic kernels?
That really depends on what kinds of faults you're trying to tolerate. A driver fails to initialise because it can't allocate enough memory? Easy. A single bit flip (if there's no memory encryption)? Maybe. A CPU failing while holding kernel locks? No.
bharathm1 wrote:I think ideas such as nooks where we isolate the address space of kernel drivers is a good line of research though it pushes some burden to driver programmers.
If drivers are isolated it's either a micro-kernel or a hybrid (and is no longer monolithic), regardless of whether that isolation is implemented with the hardware's virtual memory management or in software only, and regardless of whether the driver is still in an area that would've been considered "kernel space".
bharathm1 wrote:What are the main principles that present monolithic kernels (e.g. Linux) violate that make them terrible at fault tolerance?
The main principle that is missing is isolation (which, if it existed, would prevent the kernel from being called a true monolithic kernel).

Linux is a special case - it maps all physical memory into kernel space (so any dodgy pointer anywhere in many millions of lines of code can corrupt anything that's in memory anywhere); so you can get all your hopes for fault tolerance and nail them to all your hopes for security, and glue on a few extra hopes (e.g. for decent NUMA optimisations), and then throw that huge ball of hopes in the trash.
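A toy illustration of why a single shared mapping defeats isolation, using Python bounds checking as a stand-in for hardware or software protection. The memory layout, region names and `buggy_write` helper are all hypothetical; this simulates the idea and says nothing about how any real kernel lays out memory.

```python
# One flat "physical memory" shared by two hypothetical drivers.
memory = bytearray(16)
DISK = slice(0, 8)     # region owned by the disk driver (assumed layout)
NET = slice(8, 16)     # region owned by the network driver (assumed layout)

bad_offset = 11        # bug: the disk driver's buffer is only 8 bytes long


def buggy_write(region, index: int, value: int) -> bool:
    """Write through a bad index; report whether the write went through."""
    try:
        region[index] = value
        return True
    except IndexError:
        return False


# Shared mapping: the disk driver can address all of memory, so the bad
# offset silently lands inside the network driver's region.
flat_ok = buggy_write(memory, DISK.start + bad_offset, 0xFF)
net_after_flat = bytes(memory[NET])

# Isolated: the disk driver only holds a bounded view of its own region,
# so the same bad offset is rejected instead of corrupting its neighbour.
memory[NET] = bytes(8)                      # undo the corruption first
disk_view = memoryview(memory)[DISK]
view_ok = buggy_write(disk_view, bad_offset, 0xFF)
net_after_view = bytes(memory[NET])

print(flat_ok, net_after_flat.hex())
print(view_ok, net_after_view.hex())
```

In the isolated case the fault is also detected at the moment it happens, which is what lets something else decide to terminate and restart the offending driver.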


Cheers,

Brendan