gerryg400 wrote:
physecfed wrote:
I know enough about trees so far (however little, granted) to understand somewhat shakily that "not all binary trees are B-trees"; that B-trees are a subset of binary trees that meet certain conditions. The book contains a tree implementation where the data value contained is an int, with two pointers called left and right. It appears that where the node is inserted is based upon the new value compared to the old, where values less than the parent are inserted as the left child, and values greater are inserted to the right. Where does that stand with relationship to a B-tree (that is, what are the criteria that must be met for a binary tree to be a B-tree? I understand there's different types of B-trees).
You need to begin by understanding that
B-trees are not a type of binary tree and
binary trees are not a type of B-tree. They are entirely different. If a B-tree were constructed with only 1 inner node per node it might be arguably described as a binary tree but I would then argue that it is no longer a B-tree.
I see that Wikipedia wrote:
The B-tree is a generalization of a binary search tree in that a node can have more than two children
What this is saying is that a B-Tree is a tree. (Not a binary tree). However it misses the major feature of a B-Tree. Not only do the nodes have potentially more than 2 children B-tree nodes contain potentially more than one inner node. The size of the nodes is chosen to maximise performance.
So binary trees and B-trees are only related in that they are trees, with binary trees being the subset where you can have at most two child nodes. That makes sense well enough; when it comes to the
self-balancing variety; I'm assuming this is on the part of the interface logic - when nodes are added or removed, the functions modifying the tree remap and "anneal" the tree to ensure it exists in the most balanced state (that is, there's nothing magic about the tree that corrects itself). Am I on the right track?
Schol-R-LEA wrote:
There is an excellent, if quite dated, book called
File Structures which discusses data structures for organizing and traversing information in secondary storage. It covers B-Trees and B+-Trees in depth, along with hash tables and tries, mainly from application perspective but still applicable to file system design (indeed, you would need to understand how it works in general use before you can realistically hope to use it in FS implementation).
Gerry is (mostly) right: discussing B-Trees in terms of binary trees is a Bad Idea, even if
some discussion of binary trees is unavoidable as groundwork. While the B-Tree
can in some ways be seen as a generalization of a balanced binary tree, and is often presented in that way due to persistent confusion on the subject, that view is very misleading - they are very different in structure, purpose, and dynamic behavior. They
are related, especially historically, so comparing them isn't apples-to-oranges, but it is more like apples-to-apple-orchards. You need to understand binary trees before you can understand B-Trees, but saying that a B-Tree
is a binary tree is like saying an aircraft carrier is the same thing as a log canoe.
From a practical standpoint, the main difference is that binary trees are used for organizing and traversing data nodes in memory, while B-Trees and their derivatives are mainly for use sorting and searching keys related to data in some sort of secondary medium with a significantly longer access time than memory (disk, CD, mag tape, some kinds of flash memory, etc.). Whereas a balanced binary tree (which includes red-black trees, AVL trees, splay trees, etc.) is meant to reduce the number of nodes traversed while searching, the B-Tree
also has to reduce the number of reads from and writes to mass storage, which on modern machines can be several orders of magnitude slower than primary store (milliseconds or even seconds, versus nanoseconds). You wouldn't use a B-Tree for data in memory, as it would be slower on average than a more basic form of balanced tree, and more cumbersome; conversely, using a general binary tree for traversing nodes in secondary storage would require so many disk reads that the advantages of a tree over a list would be irrelevant, since the access time for the disk would dominate the operation.
The salient point is that not only is the data in secondary storage, usually so are at least some of the
keys as well.
The tree itself can be part in and part out of main memory, so in order to traverse the B-Tree, you need to read from secondary storage, usually more than once. Being able to do so efficiently is an important property of B-Trees and their ilk, as it allows you to work on structures where not only the is data too large to fit in memory, the index to the data is as well, so caching the whole index isn't an option. This is important in a large database, but it has obvious implications for file system structures as well (it could even be applied to paging schedulers, I think, though I've never heard of it being done).
Looking at the current prices for that book, I'd probably be some sort of fool to not go for it. The past couple of days I've been attempting to draw up a bunch of silly self-taught data structure classes in C++ to try and get the hang of these trees and what-not. That being said about binary trees in comparison, because you say they're better suited to memory ops, can they be efficiently used in
program (non-disk) applications such as scheduling? I'm trying to (gently) plan for the future in that regard.
Going off the
2-3 tree as an example, seeing that it has the condition that all leaves must remain at the same level, I'm taking it this is useful in disk searching to establish a hard upper limit on the number of reads that must take place to traverse it.
Earlier it was said that
iansjack wrote:
In addition, you want to make the tree as balanced as possible, so that each node at a given level has roughly the same number of subnodes.
Does this mean that at level, say N, that each node attempts to maintain the same number of links (that is, 3 nodes with 3 children each is better than a 2-3-2 fanout)? Is this sort of rebalancing for B-trees similar in implementation mechanics to balancing binary trees after insertion/removal?
Finally, in the case of a 2-3 tree, is it advised to use a single structure, i.e.
Code:
typedef struct node {
datatype_t d1, d2;
struct node* a;
struct node* b;
struct node* c;
} node_t;
where in the case of a 2-node, the third pointer is null and in the case of a leaf, all pointers are null?