PostgreSQL/Index Btree
The term B-tree index denotes the implementation of a balanced tree. B-trees are characterized by the criterion that the distance from the root node to every leaf node is the same. Such trees support the very fast evaluation of search criteria like WHERE status=5
. In most cases, such trees have a high branching factor and hence a low depth. If, for example, there is a branching factor of 500, the tree can manage about 125 million entries with 3 page-reads.
In addition, the PostgreSQL implementation optimizes the locking behavior of concurrent write operations to the tree with a strategy that was originally proposed by Lehmann and Yao. The idea is to add additional (redundant as of the perspective of the complete tree) pointers to every page.
SQL syntax
editB-tree is the default index type. It is created by the SQL command CREATE INDEX when the keyword USING
is omitted.
-- create a b-tree index
CREATE INDEX test_idx ON table_1 (column_1);
-- equivalent syntax:
CREATE INDEX test_idx ON table_1 USING BTREE(column_1);
Description
editThe file containing the B-tree consists of different page types.
- The very first page (#0) of the file holds meta-information about the index, e.g., the pointer to the root page, which is not always located on page #1, or the current tree-depth.
- Internal pages contain pairs of keys and pointers. Keys hold the values which shall be indexed, and the pointers point to internal pages of the next level or to leaf pages. Those pairs are denoted index entries.
- Leaf pages contain such pairs as well. But in this case, they point to pages and rows in the data file (heap). Such pointers are called TupleIds or TIDs.
- Internal pages plus leaf pages constitute the B-tree.
Over time, pages need to be split because their capacity gets exhausted. First, the tree grows in breadth. In rare cases, it gets necessary that the high of the tree must be extended. In this case, the root page gets split and a new root page is created.
There are some special rules to optimize the access to B-trees, especially to reduce the probability of locks in a multiuser environment. Therefore the pages contain some additional information that enhances a pure B-tree implementation.
- The first index entry of every page contains a value, which is treated as an upper bound for all keys of this page. It does not contain a pointer to another page. It is called the high key, and in the above graphic, it is shown in red color. The rightmost page of every level doesn't contain a high key. It's plus infinity per definition.
- The second (or first in the case of no high key) index entry points to the left child of the page. It does not contain a real key. Sometimes it's called minus infinity. Leaf pages don't use it.
- The pages of every level are connected to each other via a double linked list. This helps to speed up queries like
WHERE status>17
because the need to traverse the tree upwards disappears.
Statements to create the shown B-tree
edit-- PostgreSQL version 14.1 at Ubuntu 20.4
-- a helper to create huge text values via 'gen_random_bytes()'
CREATE EXTENSION IF NOT EXISTS pgcrypto;
-- a helper to inspect physical pages
CREATE EXTENSION IF NOT EXISTS pageinspect;
-- table with a text field
DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (
key integer,
val text -- will be indexed with B-tree
);
-- insert huge text values to enforce page splits in index file
INSERT INTO t1
(SELECT
generate_series(11, 22, 1),
concat(generate_series(11, 22, 1), ' ', gen_random_bytes(1024)::text)
);
-- same as:
-- INSERT INTO t1 VALUES (11, '11 ' || gen_random_bytes(1024)::text);
-- INSERT INTO t1 VALUES (12, '12 ' || gen_random_bytes(1024)::text);
-- ...
-- create the B-tree
CREATE INDEX t1_btree_idx ON t1 (val);
-- Inspect the pages of the created B-tree
-- read meta page: it shows that page 9 is the root of the B-tree
SELECT * FROM bt_metap('t1_btree_idx');
-- read page 9: root page of B-tree
SELECT itemoffset, ctid, itemlen, nulls, vars, left(data, 23) AS data FROM bt_page_items('t1_btree_idx', 9);
-- read pages 3 + 8 (internal pages)
SELECT itemoffset, ctid, itemlen, nulls, vars, left(data, 23) AS data FROM bt_page_items('t1_btree_idx', 3);
SELECT itemoffset, ctid, itemlen, nulls, vars, left(data, 23) AS data FROM bt_page_items('t1_btree_idx', 8);
-- read pages 1, 2, 4 and 5, 6, 7 (leaf pages)
SELECT itemoffset, ctid, itemlen, nulls, vars, left(data, 23) AS data FROM bt_page_items('t1_btree_idx', 1);
SELECT itemoffset, ctid, itemlen, nulls, vars, left(data, 23) AS data FROM bt_page_items('t1_btree_idx', 2);
SELECT itemoffset, ctid, itemlen, nulls, vars, left(data, 23) AS data FROM bt_page_items('t1_btree_idx', 4);
SELECT itemoffset, ctid, itemlen, nulls, vars, left(data, 23) AS data FROM bt_page_items('t1_btree_idx', 5);
SELECT itemoffset, ctid, itemlen, nulls, vars, left(data, 23) AS data FROM bt_page_items('t1_btree_idx', 6);
SELECT itemoffset, ctid, itemlen, nulls, vars, left(data, 23) AS data FROM bt_page_items('t1_btree_idx', 7);
-- show the TIDs per key (in heap file, NOT in index file)
SELECT key, ctid FROM t1;
External links
editPostgreSQL Documentation concerning B-tree implementation