Structured Query Language/Managing Indexes

Structured Query Language
Managing Indexes

Indexes are a key feature of all SQL databases. They provide quick access to the data. Therefore almost all implementations support a CREATE INDEX statement.

Nevertheless, the CREATE INDEX statement is not part of the SQL standard! The reason for this is unknown. Possibly it is a deliberate decision against all implementation issues. Or it results from the wide range of different syntaxes realized by vendors and the lack of finding a compromise.

On this page, we offer some basic ideas concerning indexes and the syntax which is common to a great number of implementations.

CREATE [UNIQUE] INDEX <index_name> ON <table_name> (<column_name> [, <column_name>]);

The Concept of Indexes

DBMSs offer quick access to data stored in their tables. One might think that such high-speed access is due to the fast hardware of modern computers: millions of CPU cycles per second, I/O rates in the range of milliseconds, access to RAM within micro- or nanoseconds, etc. That is true, but only partly so. Instead, the use of intelligent software algorithms, especially in the case of handling large amounts of data, is the dominant factor.

Consider a request to the DBMS to determine, whether or not a person with a certain name can be found in a table with 1 million entries. With a primitive, linear algorithm, the system has to read 500,000 rows (on average) to decide the question. The binary search algorithm implements a more sophisticated strategy that answers the question after reading 20 rows or less. In this case, this choice of algorithm leads to a factor of 25,000 in performance. In order to really grasp the magnitude of this improvement, you may want to multiply your salary by 25,000.

Admittedly this comparison between the linear access and the binary search algorithm is a little bit simple. First, DBMS usually read blocks containing multiple rows and not single rows. But this didn't change the situation. If a block contains 100 rows, modify the above example from 1 million to 100 million rows. Second, the binary search algorithm assumes that the data is ordered. This means that during data entry, there is an additional step for sorting the actual input into the existing data. This applies only once and is independent of the number of read accesses. In summary, there is additional work during data entry and much less work during data access. It depends on the typical use of the data whether the additional work is worthwhile.

The index is an additional storage holding data that is copied or deducted from the original data in the table. It consists only of redundant data. What parts make up the index? In the common case of a binary search strategy, the index holds the original values of the tables column plus a back-reference to the original row. In most cases, the index is organized as a balanced tree with the column's value as the tree's key and the back-reference as additional information for each key.

The binary search algorithm is one of many options for building indexes. The common characteristics of indexes are that they: consist only of redundant information; use additional resources in the sense of CPU cycles, RAM or disc space; and offer better performance for queries on large data sets. In the cases of small tables or tables with many indexes, it is possible that the disadvantages (in performance or resource usage) outweigh the benefits of using an index.

Basic Index

If an application retrieves data by a certain criterion - e.g., a person name for a phone book application - and this criterion consists of a tables column, this column should have an index.

CREATE INDEX person_lastname_idx ON person(lastname);

The index has its own freely selectable name - person_lastname_idx in this example - and is built on a certain column of a certain table. The index may be defined and created directly after the CREATE TABLE statement (when there is no data in the table) or after some or a huge number of INSERT commands. After it is created, the DBMS should be in the state to answer questions like the following quicker than before.

SELECT count(*) 
FROM   person
WHERE  lastname = 'Miller';

The index may be used during the evaluation of the WHERE clause. The DBMS has the choice between - on the one hand - reading all person rows and counting such where the lastname is 'Miller' or - on the other hand - reading the index (possibly with binary search) and counting all nodes with value 'Miller'. Which strategy is used depends on a lot of decisions. If, for example, the DBMS knows that about 30% of all rows contain 'Miller', it may choose a different strategy than if it knows that only 0.3% contains 'Miller'.

A table may have more than one index.

CREATE INDEX person_firstname_idx ON person(firstname);

What will happen in such a situation to a query like the following one?

SELECT count(*) 
FROM   person
WHERE  lastname = 'Miller'
AND    firstname = 'Henry';

Again, the DBMS has more than one choice to retrieve the expected result. It may use only one of the two indexes, read the resulting rows and look for the missing other value. Or it reads both indexes and counts the common back-references. Or it ignores both indexes, reads the data and counts such rows where both criteria apply. As mentioned it depends on a lot of decisions.

Multiple Columns

If an application typically searches in two columns within one query, e.g. for first- and lastname, it can be useful to build one index for both columns. This strategy is very different from the above example, where we build two independent indexes, one per column.

CREATE INDEX person_fullname_idx ON person(lastname, firstname);

In this case the key of the balanced tree is the concatenation of last- and firstname. The DBMS can use this index for queries which ask for last- and firstname. It can also use the index for queries for lastname only. But it cannot use the index in queries for firstname only. The firstname can occur at different places within the balanced tree. Therefore it is worthless for such queries.

Functional Index

In some cases an existing index cannot be used for queries on the underlying column. Suppose the query to person names should be case-insensitive. To do so the application converts all user-input to upper-case and use the UPPER() function to the column in scope.

-- Original user input was: 'miller'
SELECT count(*) 
FROM   person
WHERE  UPPER(lastname) = 'MILLER';

As the criterion in the WHERE clause looks only for uppercase characters and the index is built in a case-sensitive way, the key in the balanced tree is worthless: 'miller' is sorted at a very different place than 'Miller'. To overcome the problem, one can define an index, which uses exactly the same strategy as the WHERE criterion.

CREATE INDEX person_uppername_idx ON person(UPPER(lastname)); -- not supported by MySQL

Now the 'UPPER()' query can use this so-called functional index.

Unique Index

The Primary Key of every table is unique, which means that no two rows can contain the same value. Sometimes one column or the concatenation of some columns is also unique. To ensure this criterion you can define a UNIQUE CONSTRAINT, or you can define an index with the additional UNIQUE criterion. (Often UNIQUE CONSTRAINTS silently use UNIQUE INDEX in the background.)

CREATE UNIQUE INDEX person_lastname_unique_idx ON person(lastname);

Unique indexes can only be created on existing data, if the column in scope really has nothing but unique values (which is not the case in our database example).

Drop an Index

Indexes can be dropped by the command:

DROP INDEX <index_name>;

Temporary Table

Structured Query Language
Managing Indexes

Managing Rights