How do B-tree-based file systems work?

physecfed · **Posted:** Sun Mar 13, 2016 12:42 am

alexfru wrote:

Within a node, keys and their respective values are logically tied/paired. You can have an array of k-v pairs in a node. Or you can have it split into two distinct arrays, an array of keys and an array of values, still directly corresponding to one another.
Obviously, if you want quick key search (which is the hole point of the data structure at hand), you keep the array of keys ordered/sorted. So, when you insert a new key between, let's say, key3 and key4 (because key3 < new key < key 4), you also insert its corresponding value between value3 and value4 (irrespective of how values relate to one another). The new key and its value both have the same index in separate arrays or in the same array, the index that used to belong to key4-value4 pair. key4-value4 moves one place.

The same ordering logic applies to links to children as those links are stuck between adjacent keys and there are two extra links on the sides.

I'm sorry, I'm having trouble understanding what it is that you don't understand.

When I first sat down to write a response, I confused myself. That, combined with the fact that I'm using iffy terminology at best, should show you the grip I have on the subject. It might stem from the fact that I'm trying to think of B-trees in terms of linked lists, a data structure I understand well enough. So, let's reapproach this.

I'm looking at the Wikipedia article for a 2-3 tree. It says,

Wikipedia wrote:

...where every node with children (internal node) has either two children (2-node) and one data element or three children (3-nodes) and two data elements. Nodes on the outside of the tree (leaf nodes) have no children and one or two data elements.

So, it sounds like this is a general representation of the following nodes/leaves (some of the syntax is pseudocodish, I wrote it quickly to see if I was getting the point):

Code:

typedef struct _node2 {
   data_t    data;         // Some hypothetical data
   node*    children[2];   // Two child links 
} node2_t;                 // 2-node

typedef struct _node3 {
   data_t   data[2];       // Two data elements
   node*   children[3];    // Three child links
} node3_t;                 // 3-node

typedef struct _leaf1 {
   data_t   data;         
} leaf1_t;                 // Leaf with one data element 

typedef struct _leaf2 {
   data_t   data[2];         
} leaf2_t;                 // Leaf with two data element 

So, what does "self-balancing" refer to? To me, it sounds like the processes of altering the tree perform balancing in order to assure that the search and accesses on the tree take place quickly - it sounds like trading speed of modification for speed of access.

As to the key-value stuff, is this as simple as exchanging the hypothetical "data_t" element in the above representation to say, char* key and char* value?

I suppose what is tripping me up now is the determination of where keys go in the tree. The Wikipedia article's section on insertion talks about searching for the proper insertion point for the key - what determines where an arbitrary key goes when inserted into the tree?

alexfru · **Joined:** Tue Mar 04, 2014 5:27 am **Posts:** 1108

physecfed wrote:

alexfru wrote:

Within a node, keys and their respective values are logically tied/paired. You can have an array of k-v pairs in a node. Or you can have it split into two distinct arrays, an array of keys and an array of values, still directly corresponding to one another.
Obviously, if you want quick key search (which is the hole point of the data structure at hand), you keep the array of keys ordered/sorted. So, when you insert a new key between, let's say, key3 and key4 (because key3 < new key < key 4), you also insert its corresponding value between value3 and value4 (irrespective of how values relate to one another). The new key and its value both have the same index in separate arrays or in the same array, the index that used to belong to key4-value4 pair. key4-value4 moves one place.

The same ordering logic applies to links to children as those links are stuck between adjacent keys and there are two extra links on the sides.

I'm sorry, I'm having trouble understanding what it is that you don't understand.

When I first sat down to write a response, I confused myself. That, combined with the fact that I'm using iffy terminology at best, should show you the grip I have on the subject. It might stem from the fact that I'm trying to think of B-trees in terms of linked lists, a data structure I understand well enough. So, let's reapproach this.

You can think of B trees as of a collection of lists, each list residing in a node of a tree and each node pointing to many child nodes (or each list element pointing to a child node (almost, since there's a difference of 1 in the number of keys and child nodes)). It's sort of a hybrid if you want to think of it that way. But I don't think it adds any clarity to what should already be obvious. I think you should've seen enough depictions of B trees.

physecfed wrote:

I'm looking at the Wikipedia article for a 2-3 tree. It says,

Wikipedia wrote:

...where every node with children (internal node) has either two children (2-node) and one data element or three children (3-nodes) and two data elements. Nodes on the outside of the tree (leaf nodes) have no children and one or two data elements.

So, it sounds like this is a general representation of the following nodes/leaves (some of the syntax is pseudocodish, I wrote it quickly to see if I was getting the point):

Code:

typedef struct _node2 {
   data_t    data;         // Some hypothetical data
   node*    children[2];   // Two child links 
} node2_t;                 // 2-node

typedef struct _node3 {
   data_t   data[2];       // Two data elements
   node*   children[3];    // Three child links
} node3_t;                 // 3-node

typedef struct _leaf1 {
   data_t   data;         
} leaf1_t;                 // Leaf with one data element 

typedef struct _leaf2 {
   data_t   data[2];         
} leaf2_t;                 // Leaf with two data element 

Right, but you don't have to have distinct types for leaf nodes (it's easier to manipulate with one type) and your keys are magically lost (unless you've packed them together with their corresponding values into data_t). It's the keys that bring order to a B tree. Without them the tree is just a pile of values eccentrically (dis)organized.

physecfed wrote:

So, what does "self-balancing" refer to? To me, it sounds like the processes of altering the tree perform balancing in order to assure that the search and accesses on the tree take place quickly - it sounds like trading speed of modification for speed of access.

A question out of the blue.

Are we done with the previous stuff and moving on to other things?
There isn't much "self-balancing", not in the sense of a tree magically rebalancing itself or maintaining the balance having a cost of zero. Certainly, there are additional tree manipulations for this. If you've read about 2-3/2-3-4/red-black/AVL/etc trees, there are rotation operations that are done to make sure that the distances from every leaf node to the root are about the same. B trees are very similar.

physecfed wrote:

As to the key-value stuff, is this as simple as exchanging the hypothetical "data_t" element in the above representation to say, char* key and char* value?

Like I said, you need a key and a value for it and you can have them packed together inside data_t or you can have them in separate arrays of data_t and key_t, whatever. Doesn't matter.

physecfed wrote:

I suppose what is tripping me up now is the determination of where keys go in the tree.

They go inside nodes.

physecfed wrote:

The Wikipedia article's section on insertion talks about searching for the proper insertion point for the key - what determines where an arbitrary key goes when inserted into the tree?

Look at all those pics and recall that we're talking about ordered/sorted trees and that the search assumes that keys are ordered, so you can choose to go left, right or between based on the sought key and the key you're looking at. You can't insert a key anywhere you like without breaking the order.

physecfed · **Posted:** Mon Mar 21, 2016 6:12 pm

alexfru wrote:

A question out of the blue.

Are we done with the previous stuff and moving on to other things?
There isn't much "self-balancing", not in the sense of a tree magically rebalancing itself or maintaining the balance having a cost of zero. Certainly, there are additional tree manipulations for this. If you've read about 2-3/2-3-4/red-black/AVL/etc trees, there are rotation operations that are done to make sure that the distances from every leaf node to the root are about the same. B trees are very similar.

I won't say all the previous stuff is in the rearview yet, but I can tell you where I was tripping up and why I didn't seem to grasp even the most basic concepts.

The pictures and graphical representations of B-trees had me all backwards with respect to how the tree was implemented. Because of the fact that they're used for key-value stuff and databasing, I was thinking that the data values in the nodes themselves were a "key" directly; not from an organizational approach, but that the data values in the tree served as a pointer to the next node directly. In a nutshell, I thought the data and pointers in a node were intrinsically linked; that is, for nodes that had two elements and three children, I thought that element A led to C1, element B to C3 and some operation blahblah(A, B) mysteriously gave the location of C2. I was confusing the tree-structure with the key-value stuff too much and couldn't see the forest for a tree that was upside down.

In the past week I've done a lot of looking into B-trees from various angles only to find out that my old college C++ textbook had a good implementation of a binary tree in the later chapters that we didn't originally cover in class. I know enough about trees so far (however little, granted) to understand somewhat shakily that "not all binary trees are B-trees"; that B-trees are a subset of binary trees that meet certain conditions. The book contains a tree implementation where the data value contained is an int, with two pointers called left and right. It appears that where the node is inserted is based upon the new value compared to the old, where values less than the parent are inserted as the left child, and values greater are inserted to the right. Where does that stand with relationship to a B-tree (that is, what are the criteria that must be met for a binary tree to be a B-tree? I understand there's different types of B-trees).

I have other, more "practical" questions and code I'd like to write to help figure out the concept but let's square away all previous issues first.

gerryg400 · **Posted:** Mon Mar 21, 2016 8:27 pm

physecfed wrote:

I know enough about trees so far (however little, granted) to understand somewhat shakily that "not all binary trees are B-trees"; that B-trees are a subset of binary trees that meet certain conditions. The book contains a tree implementation where the data value contained is an int, with two pointers called left and right. It appears that where the node is inserted is based upon the new value compared to the old, where values less than the parent are inserted as the left child, and values greater are inserted to the right. Where does that stand with relationship to a B-tree (that is, what are the criteria that must be met for a binary tree to be a B-tree? I understand there's different types of B-trees).

You need to begin by understanding that B-trees are not a type of binary tree and binary trees are not a type of B-tree. They are entirely different. If a B-tree were constructed with only 1 inner node per node it might be arguably described as a binary tree but I would then argue that it is no longer a B-tree.

I see that Wikipedia wrote:

The B-tree is a generalization of a binary search tree in that a node can have more than two children

What this is saying is that a B-Tree is a tree. (Not a binary tree). However it misses the major feature of a B-Tree. Not only do the nodes have potentially more than 2 children B-tree nodes contain potentially more than one inner node. The size of the nodes is chosen to maximise performance.

Schol-R-LEA · **Posted:** Thu Mar 24, 2016 9:20 am

There is an excellent, if quite dated, book called File Structures which discusses data structures for organizing and traversing information in secondary storage. It covers B-Trees and B+-Trees in depth, along with hash tables and tries, mainly from application perspective but still applicable to file system design (indeed, you would need to understand how it works in general use before you can realistically hope to use it in FS implementation).

Gerry is (mostly) right: discussing B-Trees in terms of binary trees is a Bad Idea, even if some discussion of binary trees is unavoidable as groundwork. While the B-Tree can in some ways be seen as a generalization of a balanced binary tree, and is often presented in that way due to persistent confusion on the subject, that view is very misleading - they are very different in structure, purpose, and dynamic behavior. They are related, especially historically, so comparing them isn't apples-to-oranges, but it is more like apples-to-apple-orchards. You need to understand binary trees before you can understand B-Trees, but saying that a B-Tree is a binary tree is like saying an aircraft carrier is the same thing as a log canoe.

From a practical standpoint, the main difference is that binary trees are used for organizing and traversing data nodes in memory, while B-Trees and their derivatives are mainly for use sorting and searching keys related to data in some sort of secondary medium with a significantly longer access time than memory (disk, CD, mag tape, some kinds of flash memory, etc.). Whereas a balanced binary tree (which includes red-black trees, AVL trees, splay trees, etc.) is meant to reduce the number of nodes traversed while searching, the B-Tree also has to reduce the number of reads from and writes to mass storage, which on modern machines can be several orders of magnitude slower than primary store (milliseconds or even seconds, versus nanoseconds). You wouldn't use a B-Tree for data in memory, as it would be slower on average than a more basic form of balanced tree, and more cumbersome; conversely, using a general binary tree for traversing nodes in secondary storage would require so many disk reads that the advantages of a tree over a list would be irrelevant, since the access time for the disk would dominate the operation.

The salient point is that not only is the data in secondary storage, usually so are at least some of the keys as well. The tree itself can be part in and part out of main memory, so in order to traverse the B-Tree, you need to read from secondary storage, usually more than once. Being able to do so efficiently is an important property of B-Trees and their ilk, as it allows you to work on structures where not only the is data too large to fit in memory, the index to the data is as well, so caching the whole index isn't an option. This is important in a large database, but it has obvious implications for file system structures as well (it could even be applied to paging schedulers, I think, though I've never heard of it being done).

SpyderTL · **Joined:** Sun Sep 19, 2010 10:05 pm **Posts:** 1074

So, again, for the sake of simplicity, a B-Tree is similar to a Binary Tree, in that it is a natural evolution of a tree structure that has nodes that fit within (or completely fill) a block of data on a device that can only read/write entire blocks?

physecfed · **Posted:** Mon Mar 28, 2016 3:35 am

gerryg400 wrote:

physecfed wrote:

I know enough about trees so far (however little, granted) to understand somewhat shakily that "not all binary trees are B-trees"; that B-trees are a subset of binary trees that meet certain conditions. The book contains a tree implementation where the data value contained is an int, with two pointers called left and right. It appears that where the node is inserted is based upon the new value compared to the old, where values less than the parent are inserted as the left child, and values greater are inserted to the right. Where does that stand with relationship to a B-tree (that is, what are the criteria that must be met for a binary tree to be a B-tree? I understand there's different types of B-trees).

You need to begin by understanding that B-trees are not a type of binary tree and binary trees are not a type of B-tree. They are entirely different. If a B-tree were constructed with only 1 inner node per node it might be arguably described as a binary tree but I would then argue that it is no longer a B-tree.

I see that Wikipedia wrote:

The B-tree is a generalization of a binary search tree in that a node can have more than two children

What this is saying is that a B-Tree is a tree. (Not a binary tree). However it misses the major feature of a B-Tree. Not only do the nodes have potentially more than 2 children B-tree nodes contain potentially more than one inner node. The size of the nodes is chosen to maximise performance.

So binary trees and B-trees are only related in that they are trees, with binary trees being the subset where you can have at most two child nodes. That makes sense well enough; when it comes to the self-balancing variety; I'm assuming this is on the part of the interface logic - when nodes are added or removed, the functions modifying the tree remap and "anneal" the tree to ensure it exists in the most balanced state (that is, there's nothing magic about the tree that corrects itself). Am I on the right track?

Schol-R-LEA wrote:

There is an excellent, if quite dated, book called File Structures which discusses data structures for organizing and traversing information in secondary storage. It covers B-Trees and B+-Trees in depth, along with hash tables and tries, mainly from application perspective but still applicable to file system design (indeed, you would need to understand how it works in general use before you can realistically hope to use it in FS implementation).

Gerry is (mostly) right: discussing B-Trees in terms of binary trees is a Bad Idea, even if some discussion of binary trees is unavoidable as groundwork. While the B-Tree can in some ways be seen as a generalization of a balanced binary tree, and is often presented in that way due to persistent confusion on the subject, that view is very misleading - they are very different in structure, purpose, and dynamic behavior. They are related, especially historically, so comparing them isn't apples-to-oranges, but it is more like apples-to-apple-orchards. You need to understand binary trees before you can understand B-Trees, but saying that a B-Tree is a binary tree is like saying an aircraft carrier is the same thing as a log canoe.

From a practical standpoint, the main difference is that binary trees are used for organizing and traversing data nodes in memory, while B-Trees and their derivatives are mainly for use sorting and searching keys related to data in some sort of secondary medium with a significantly longer access time than memory (disk, CD, mag tape, some kinds of flash memory, etc.). Whereas a balanced binary tree (which includes red-black trees, AVL trees, splay trees, etc.) is meant to reduce the number of nodes traversed while searching, the B-Tree also has to reduce the number of reads from and writes to mass storage, which on modern machines can be several orders of magnitude slower than primary store (milliseconds or even seconds, versus nanoseconds). You wouldn't use a B-Tree for data in memory, as it would be slower on average than a more basic form of balanced tree, and more cumbersome; conversely, using a general binary tree for traversing nodes in secondary storage would require so many disk reads that the advantages of a tree over a list would be irrelevant, since the access time for the disk would dominate the operation.

The salient point is that not only is the data in secondary storage, usually so are at least some of the keys as well. The tree itself can be part in and part out of main memory, so in order to traverse the B-Tree, you need to read from secondary storage, usually more than once. Being able to do so efficiently is an important property of B-Trees and their ilk, as it allows you to work on structures where not only the is data too large to fit in memory, the index to the data is as well, so caching the whole index isn't an option. This is important in a large database, but it has obvious implications for file system structures as well (it could even be applied to paging schedulers, I think, though I've never heard of it being done).

Looking at the current prices for that book, I'd probably be some sort of fool to not go for it. The past couple of days I've been attempting to draw up a bunch of silly self-taught data structure classes in C++ to try and get the hang of these trees and what-not. That being said about binary trees in comparison, because you say they're better suited to memory ops, can they be efficiently used in program (non-disk) applications such as scheduling? I'm trying to (gently) plan for the future in that regard.

Going off the 2-3 tree as an example, seeing that it has the condition that all leaves must remain at the same level, I'm taking it this is useful in disk searching to establish a hard upper limit on the number of reads that must take place to traverse it.

Earlier it was said that

iansjack wrote:

In addition, you want to make the tree as balanced as possible, so that each node at a given level has roughly the same number of subnodes.

Does this mean that at level, say N, that each node attempts to maintain the same number of links (that is, 3 nodes with 3 children each is better than a 2-3-2 fanout)? Is this sort of rebalancing for B-trees similar in implementation mechanics to balancing binary trees after insertion/removal?

Finally, in the case of a 2-3 tree, is it advised to use a single structure, i.e.

Code:

typedef struct node {
    datatype_t d1, d2;
    struct node* a;
    struct node* b;
    struct node* c;
} node_t;

where in the case of a 2-node, the third pointer is null and in the case of a leaf, all pointers are null?

OSDev.org

How do B-tree-based file systems work?

Who is online