Oracle and DB2, Comparison and Compatibility/Storage Model/Indexes/Summary

Summary

edit

Clustering

edit

Relational Theory states that the physical organization of data should have no performance limitations. This is in an ideal world, the real world implementation of databases as structures living on both disk and memory mean that collocated data is returned quicker. This collocation is called CLUSTERING. Since what data is related (and hence clustered) is determined by your application requirements, this structuring would seem to be more of a logical constraint. In reality there is a physical implication in that data physically grouped together (either on disk or in RAM) is accessed quicker since a single block fetch can return large amounts of required data (as opposed to large numbers of expensive I/O requests to gather this same data from different physical locations). A consideration when using clustering techniques is that if your application is doing singleton selects of random data, no amount of clustering is going to help. For applications that need to return large amounts of related data, clustering will help, however there is an operational overhead in that the data needs to be stored in a specific order, and maintaining this order takes effort.

Clustering is a technique for organizing data by putting commonly accessed data in the same block. By far the biggest single overhead in database operation is physical I/O. By organizing data that is frequently accessed together into clusters, large amounts of related data can be accessed with a single block fetch. One way of clustering data is by an index. Data can be clustered in the following ways:


CTAS / CTAL.


In Oracle, Create Table As Select (CTAS) with the ORDER BY clause will perform a full-table scan. In DB2, you can use Create Table As Select, or Create Table As LIKE. When all the rows have been collected, they are sorted in a temporary tablespace (PSATEMP for Oracle, and TEMPSPACE1 for DB2) before populating the new table. Data is organized by the values of the columns specified in the ORDER BY clause. This requires you to manually re-organize the table so that related table rows are stored in adjacent blocks.

This technique is best for databases where most of the SQL queries traverse un-clustered one-to-many data. A good example of this are Bill-Of-Materials (BOM) structures that reflect all the orders a customer has, all the line items on an order, or even all the charges on a credit card statement for a billing period. In this case line items on an order (for example) are re-sequenced into adjacent data blocks.

Row re-sequencing can help by:

Clustering related rows together onto the same data block Organizing data to avoid disk sorts after retrieval. Packing adjacent rows together on a single data block to reduce block fetches for index row scans.

Row re-sequencing does not help queries that perform full-scans or index unique scans. This means that for most tables, physically re-sequencing the rows is not going to make a dramatic difference in SQL response time, so it’s important to understand the types of queries that are going to be run against the data.


Index Cluster Tables


Another mechanism for re-sequencing table data is to use an index. A table cluster is a group of tables that share common columns. Since they share common columns, they also share the same data blocks and because of this they are often used together. When you create cluster tables all rows for each table are stored in the same data blocks.

Index cluster tables can be either multi-table or single-table.


Single-table Index Cluster Tables


In a single-table index cluster table Oracle guarantees row sequence by using a clustering factor that approximates to block size. Rows of data for a table are laid out sequentially in a data block according to the values in the index columns. Because of this an index range scan will always fetch as many rows as possible in a single I/O.

While this is optimal for retrieving data, there needs to be mechanism to handle the insertion of new data into a table where there is no room in the target block. Oracle uses an overflow area to handle this and maintaining the overflow area is very important. You have to periodically reorganize the single-table index cluster table to ensure that all row orders are maintained.


Multi-table Index Cluster Tables


In a multi-table index cluster, related table rows are grouped together to reduce disk I/O. This is similar to Single-table Index Clustering, except that now related data from more than one table are stored together. Using the BOM example above, a Multi-Table Index Cluster would have all the line item rows for a customer order stored in the same block as the customer order. In this way instead of fetching the customer order, and then doing individual fetches for line items associated with that order, you will do just one block fetch and get all the data you need in a single operation.

Index clusters will only reduce I/O when the most of the data required by an application is accessed through the cluster index. If you use any other indexes that might be defined on that data, you will get random block fetches.


Sorted Hash Clusters.


Starting with 10g, it is possible to have Oracle place related data in adjacent blocks using a mathematical indexing function (called a hash function). Per the Oracle documentation

"In a sorted hash cluster, the rows corresponding to each value of the hash function are sorted on a specified set of columns in ascending order, which can improve response time during subsequent operations on the clustered data."

In English what this means is that the hash cluster allows you to retrieve data without the overhead of searching an index. The goal of hashing is to generate a unique address for each unique column value. The value of the address is generated by a mathematical expression, and it has been found that it’s not possible to guarantee that these addresses are unique. This is a teensy weensy problem and it results in collisions, where an address is computed and lo and behold, it’s already in use. Since accessing data this way is really fast, there are advantages to using this technique. The advantages come with a trade off, in that when collisions occur you can end up with block overflows and chaining (an overhead in writing data), and an overhead in reading data in that multiple blocks might need to be read to get the data you need. Because of this, sorted hash clusters work best when you are working with very large datasets that are relatively stable and are accessed mainly by equality queries against the hash key. You also need to really understand your data, since you have to specify the number of hash keys and storage per key in advance and have cluster key values that are unique and evenly distributed.

Sorted Hash Clusters are really fast, and like anything that is fast, you need to have a specific need to do it, and knowledge of the consequences.

Clustering is a very important concept in database performance, and it directly affects the layout of data in the data blocks.