PostgreSQL/Index BRIN

PostgreSQL
Index BRIN

BRIN (Block-Range Index) is a structure that accelerates queries on tables that contain a huge number of rows (> millions) and where the rows occur in a certain physical order within the data file. Typical use cases are such where a column contains a timestamp or a generated sequence number that seldomly or never change over time, e.g.: IoT data, computed values, sensor output, log information.

You can imagine a BRIN as a 'virtual partitioning' of a table. If a query fits into such a virtual partition the number of rows, which must be scanned, decreases significantly.

The power of BRIN results from the fact that it needs only very little space. Typical BRIN sizes for a table with hundreds of millions of rows are some kB, which easily fits into RAM. All other index types need much more space, 25 - 50% of the table size is not unusual.

SQL syntax

DROP TABLE IF EXISTS t3;

-- create a table with a timestamp column
CREATE TABLE t3 (id INTEGER, ts TIMESTAMP);

-- insert data
INSERT INTO t3 VALUES (1, '2022-01-01 00:00:01');
INSERT INTO t3 VALUES (2, '2022-01-01 00:00:02');
-- ...

-- create a BRIN index
CREATE INDEX t3_brin_idx ON t3 USING BRIN(ts);

-- use the index with the usual operators. (As far as there are less than 100,000 rows, the index is not used.)
SELECT * FROM t3 WHERE ts = '2022-01-01 00:00:02';

Description

Hint: The PostgreSQL terminology knows the term 'block' and uses it synonymously with 'page' (8192 bytes), see here. In the context of BRIN, the term 'block-range' denotes a contiguous sequence of many adjacent pages within a data file.

When a BRIN is created, the sequence of pages of the data file (heap) is virtually divided into slices, called block-ranges. E.g., if the file contains 600 pages, the first 128 belongs to block-range #1, the second 128 belongs to block-range #2, ... up to block-range #5, which contains the remaining number of pages. The default size for a block-range is 128 pages; it can be changed within the CREATE INDEX command. Next, all rows are scanned and the minimum and maximum values of the indexed column per block-range are saved. Please note that each min/max pair constitutes a numeric value-range.

The BRIN structure consists of those block-range numbers and their related value-ranges, e.g., block-range #1: min=11, max=25; block-range #2: min=25, max=31. Hence the name Block Range Index.

Ideally, the value-ranges don't overlap, but this is not necessary. The correlation between the order of rows in the data file and their content in the column of interest arises from the sequence of INSERT commands combined with growing values: later INSERTs (to a table without significant free space) should contain equal or higher values. This correlation may get lost over time by later UPDATE commands. In this case, the benefit of BRIN may get lost.

When such a BRIN structure is used during the execution of a query, it is known that all values within the rows of a certain block-range are within its data-range or, vice versa, no value outside its value-range is in any of its columns. But it's unknown which concrete values are really in the data! This has significant consequences for using the BRIN structure; see below.

Complex data types, e.g., geometric objects like rectangles, store more complex data instead of min/max values, e.g., a bounding box.

The file containing the BRIN consists of 3 different page types.

The very first page (#0) of the file holds meta-information, e.g., the number of Range-map pages.
The second (#1) and some more pages contain the so-called Range-map. It consists of tuple IDs (TID) which point to pages of the next BRIN level, the Index Pages. Because TIDs have a fixed size (of 6 bytes) it's possible to store them like an array: one after the next without any links between them. Their position correlates with the block-range number.
The rest of the pages contain the Index Pages. They contain the minimum and maximum values (value-ranges) per block-range.

Modus Operandi

Select

If the WHERE condition of an SQL command specifies a criterion for a column with a BRIN, the following steps are conducted:

All range-map pages are entirely scanned.
The related index pages are read one by one. If the searched value fits into the value-range of any of its index entries, all pages of this block-range are considered part of the result set.
All rows of the identified pages are read from the heap and their columns evaluated. This is necessary because BRIN knows only that the value of every row must be in a certain range, e.g., between 11 and 25. But if the search criterium is WHERE col = 20, the identified rows potentially contain other values like 11 or 15.

Summary: BRIN doesn't contain exact pointers to certain rows in the heap file. It contains only information about ranges of blocks and ranges of values. Nevertheless, under certain conditions (huge number of rows, correlation between physical row order and column value, few data changes) this information is enough to reduce the number of rows, which must be read to evaluate a search condition, dramatically. The uncertainness of BRIN correlates with its tiny size.

Update

If a row is added or the BRIN-column of a row changes, the following steps are conducted:

The page number, where the row is physically stored in the data file, is identified.
Depending on the page number, the block-range number is computed (page number divided by block-range-size).
The block-range number determines the position within the range-map.
The related index entry at the index page is read.
If the new value of the row fits into the value-range of this index entry, no action is necessary. If the value is outside of the value-range, the value-range is updated (enlarged) on the lower or upper bound.

It is possible that value-ranges overlap. This decreases the efficiency of BRIN.

SQL syntax - more

DROP TABLE IF EXISTS t4;

-- create a table whose rows occupy about 200 byte each
CREATE TABLE t4 (
  id INTEGER,
  ts TIMESTAMP,
  some_space TEXT NOT NULL DEFAULT
              'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ' ||
              'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ' ||
              'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ')
;

-- insert 1 million rows
INSERT INTO t4 (id, ts)
  (SELECT generate_series(1, 1000000, 1),
          generate_series(now(), now() + '1000000 second', '1 second'));

-- size of heap: about 200 MB
SELECT pg_size_pretty(pg_total_relation_size('t4'));

-- create BRIN and show its size: about 48 kB
CREATE INDEX t4_brin_idx ON t4 USING BRIN(ts);
SELECT pg_size_pretty(pg_total_relation_size('t4_brin_idx'));

-- create BTREE and show its size: about 21 MB
CREATE INDEX t4_btree_idx ON t4 USING BTREE(ts);
SELECT pg_size_pretty(pg_total_relation_size('t4_btree_idx'));

-- size of BRIN to BTREE is about 1 : 400

-- ----------------------------------------------------------------

-- show the meta page of BRIN (page #0)
SELECT * FROM brin_metapage_info(get_raw_page('t4_brin_idx', 0));

-- show (the only) revmap page of BRIN (page #1)
SELECT * FROM brin_revmap_data(get_raw_page('t4_brin_idx', 1)) where pages !='(0,0)';

-- show index pages (in this example there is only a single one: page #2)
SELECT itemoffset, blknum, value
FROM   brin_page_items(get_raw_page('t4_brin_idx', 2), 't4_brin_idx') 
ORDER BY itemoffset;

External links

PostgreSQL Documentation concerning BRIN

Index GIN

PostgreSQL
Index BRIN

Index Hash