B-tree Index

On the previous page, we learned what indices are and when to use them.

The standard index structure in most databases is the B-tree. For a long time, I thought the “B” in B-tree meant binary. I incorrectly assumed that each level in the tree had two branches and didn’t care to understand it further.

It wasn’t until I was giving a talk on Postgres that I realized that I should really study how B-trees worked. In this talk, I brought a case of Skittles and made a deal with the audience. If they could ask me a question about Postgres that I couldn’t answer, they got a package of Skittles.

When it came to the intricacies of how data is stored, and how indices really worked, I ended up giving away a lot of candy. It’s probably not the best motivation to learn something new, but I like Skittles, and I wanted to keep more of them to myself.

What is the “B” in B-tree?

So if “B” doesn’t refer to binary, what does it refer to? Edward McCreight, one of the inventors of the structure, explains it 16 minutes into this talk at CPM 20131.

You just have no idea what a lunchtime conversation can turn into. So there we were, Rudy and I, at lunch. We had to give the thing a name. So, B, we were thinking… B is, you know… We were working for Boeing at the time. We couldn’t use the name without talking to the lawyers. So, there is a B.

It has to do with balance, another B.

Rudy [Bayer] was the senior author, who [is] several years older than I am and had many more publications than I did. So there is another B.

And so, at the lunch table we never did resolve whether there was one of those that made more sense than the rest.

What Rudy likes to say is: the more you think about what the B in B-trees means, the better you understand B-trees. -Edward McCreight, CPM 2013 Conference

B-tree implementation in Postgres

The original B-tree structure was invented in 1970 by Rudolf Bayer and Edward McCreight. Eleven years later, Philip Lehman and S. Bing Yao wrote a white paper on Efficient Locking for Concurrent Operations on B-Trees2. Since PG has to be highly concurrent, this is largely what its B-tree index is based on.

Though the “B” is somewhat ambiguous, I prefer to think of it as meaning “balanced”. Instead of a binary search tree where each node can only have 2 child nodes, a node in a B-tree can have many children. In Postgres, a node is called a page and consists of an 8KB block of disk space.

Tree Structure

The “tree” is a structure that contains a root page, non-leaf pages, and leaf pages, which can be scanned by following pointers.

Usually, it’s more efficient to traverse the tree to find a leaf page than to scan an entire table. Exceptions to this include tables with only a few rows or columns with a small number of unique values, i.e. low cardinality.

Boolean fields are a good example of low cardinality. These don’t benefit much from an index because the database has to scan half the index and the data file to retrieve the desired data. It’s more efficient to just scan the data file. One thing to note is that this doesn’t apply to covering or compound indices, which we’ll go into more detail in a few paragraphs.

Traversing the Tree

Each page has a right-link pointer to the page’s right sibling and a “high key” which refers to the upper bound of the keys in that page. These are required to allow the page to be split concurrently. When splitting, a lock is created to prevent this page from being modified while reading it.

When traversing the index, the search compares the page’s high key with the search key. If the search key is greater than the high key, the page must have been split concurrently and the search follows the right-link to find the new page. It’s possible to have to move right a few times, if the page was split more than once.

There is also a left-link pointer, which allows the index scan to go backwards. This is useful for ORDER BY clauses when the index was created in an ascending order, but the query calls for it to be sorted in a descending order.

Reading Pages

Because Postgres is a concurrent system, it has to manage read/write locks to ensure the data is safe. To keep things as fast as possible, it minimizes how often locks are set or unset. An index scan reads an entire leaf page and gets any matching items at once. It then copies their heap tuple IDs into local storage and releases the lock, but maintains a pin in that page due to the possibility of concurrent deletes. After that, it fetches data for those tuple IDs.

Splitting Pages

Assuming an index is created on a table without any data, it starts out simply with a root page. As new rows are inserted, the appropriate data is saved to the index. Since pages are 8KB, eventually the page will fill up and must be split.

Lehman and Yao assume fixed-size keys, but we must deal with variable-size keys. Therefore there is not a fixed maximum number of keys per page; we just stuff in as many as will fit. When we split a page, we try to equalize the number of bytes, not items, assigned to each of the resulting pages. Note we must include the incoming item in this calculation, otherwise it is possible to find that the incoming item doesn’t fit on the split page where it needs to go! -Postgres BTree Readme3

Deleting from the index

Before deleting, a super-exclusive lock is set. This prevents any other process from locking or putting a pin on a page. With the lock in place, non-full vacuums won’t cause problems.

Vacuums delete index entries before reclaiming heap tuple line pointers; therefore, the super-exclusive lock guarantees that it can’t reclaim for re-use a line pointer that an index scanning process might be about to visit.

If all the data in a page is deleted, the page is essentially hidden from scans unless it’s the right-most page. It’s a two stage process.

In the first stage, the page is unlinked from its parent and marked as half-dead. The target and parent page is locked, then the target’s downlink is changed to point to the right sibling and the old downlink is removed. At the same time, the target page is marked as half-dead which causes new searches to ignore it. In the second stage, the page is unlinked from its siblings.

The page is marked as reclaimable during a VACUUM if all transactions that might have been reading the page have been completed. Once this happens, the space in the file can be reclaimed by writing a new page over the top.

Because the right-most page on any level is never deleted, the height of the tree never decreases. This can cause the tree to become skinny with several single page levels below the root. When this happens, the “fast root” is tracked, which is the lowest single page level. From then on, searches start at the fast root instead of the real root to reduce the number of pages a scan has to look at.

When B-tree indices are effective

As we learned on the previous page, indices, in a relational database system, may speed up SELECT queries, but slow down INSERT, UPDATE, and DELETE queries. Specifically, in a SELECT, they may speed up JOIN, WHERE and ORDER BY clauses.

In Postgres, the query planner considers things like:

If there are few rows or low cardinality, it will likely do a table scan. For range queries, it will use an index if all the matches can be found on a small number of leaf nodes or pages. If the range is too large, it will determine that a table scan is faster.

Elements in an index are scanned from left to right. Therefore, a query with a LIKE operator may use an index if there is a wildcard on the right, but not on the left. Let’s assume there is a table called customers with an index like this:

CREATE TABLE customers (
  name VARCHAR(255) NOT NULL

CREATE INDEX idx_customers
    ON customers (name);

If we ran this query, the index may be used:

SELECT * FROM customers WHERE name LIKE 'B%';

But it will not use an index with a left bound wildcard:

SELECT * FROM customers WHERE name LIKE '%ria%';

Note: a trigram index solves the problem of having a wildcard on the left, which will be explained later in the chapter.

Covering and Compound indices

In a relational database, an index can be built using more than one column. This makes it possible for queries that are filtering on the same columns to use the full index. Since everything can be read in one file, it is much faster; however, it causes more strain on the database’s write speed.

For the same reason as our above example regarding the use of LIKE, the order of the columns in the index matter. The columns in the index must match columns in the WHERE clause starting from the left and moving right. For example, let’s say there is an address table with these columns and index:

CREATE TABLE addresses (
  address VARCHAR(255),
  city VARCHAR(255) NOT NULL,
  region VARCHAR(3) NOT NULL,
  postal_code VARCHAR(11) NOT NULL

CREATE INDEX idx_address
    ON addresses (postal_code, region, city);

This query can use the index because the leftmost two columns are found in the index:

SELECT * FROM addresses WHERE postal_code = '01234' AND region = 'MA'

However, this query will not be able to use the index because the leftmost column in the index is postal_code, which is not present.

SELECT * FROM addresses WHERE region = 'MA' AND city = 'Pittsfield'

  1. Edward M McCreight 

  2. Efficient Locking for Concurrent Operations on B-Trees 

  3. Postgres source file: src/backend/access/nbtree/README