Relational Database Design/Logical vs. Physical design

Physical design meaning "business implementation design" edit

The purpose of physical design in this case is to optimize performance (usually speed) of a database in a particular business case.

The particular techniques used mentioned in one textbook are: appropriate indexing, query optimization, vertical partitioning, appropriate level of normalization. This requires looking at the database structure and the queries going to be used.

Appropriate indexing can increase the performance of queries pulling a number of rows or performing joins. Range queries , or queries with inequality conditions, usually perform better with B+ Tree indexes (or ISAM indexes, a static form of B+ Trees).

Clustered indexes , where rows are sorted in index order, and probably stored as blocks as leaf nodes of B+ Trees, also often perform better when retrieving ranges, as contiguous blocks can be loaded with all or most rows in a block belonging to the range.

Multifield indexes may perform better for some queries which select on multiple attributes.

In some databases, a query planner can be forced to use indexes or do sequential scans, or use a better performing plan by doing hacks like "id1 + 0 = id2" instead of "id1 = id2", because the latter defaults to some unfavourable automatic optimization, whereas the former forces a sequential scan, which buffer loads better, perhaps. This might be called idiosyncratic physical design. Others call this query optimization. A database may have a feature to evaluate alternate forms of the same query. Examples of alternate forms include the placing of parentheses for conditional grouping e.g.

  SELECT a, b, c
    FROM A
   WHERE c > c1 AND c < c2
GROUP BY c

instead of

  SELECT * 
    FROM (select a,b,c from A where c > c1 and c< c2) 
GROUP BY c

Physical design meaning "software design" edit

It might be better to call it more abstract vs. less abstract design, as often physical design just means database specific organization of an abstract operating system concept, a File. A random access file may be organized as blocks on a disc, and the main optimization might be that the start of the blocks of whatever "physical" design, is aligned to a physical block that is read by some basic input-output system call on a hard drive which has a buffer mapped to an operating system range of memory. I'm just guessing.

Let's take an example of a file based simple database system, the XBase family of databases. A Xbase database table physical object consists of a file with a header , followed by records. The header contains essential information like the length of the header (or the start of the records), the number of records, the length of each record, and a list of field descriptions, including name, size, and type for each field. The table physical object is arranged as a heap, with any newly created rows of the table , appended to the end of the file, and the number of records incremented in the header.

This implies that if a SQL data declaration feature of ALTER TABLE were to be implemented, the easiest thing to do would be to create a new table physical object, create a new header with any altered field descriptions, and re-write every record with any additional fields, or without any deleted fields.

The field descriptions most conveniently describe fields with a fixed sized type e.g. NUMERIC(10,2) means a decimal number of 10 characters with the last 2 characters to the right of the decimal place, and this could be represented by a 11 character field , one extra character for a decimal point.

When a variable length field type is required, such as VARCHAR or MEMO, then the size of the field is the size of say a 4-byte 32 bit unsigned number (4 x 8 = 32). This number is an offset in another file associated to the main DBF file, e.g. the DBT file, which stores variable length data. At that offset, the data begins with a 4-byte number for the length of the data, followed by that number of bytes for the data itself.

The SQL data manipulation features , the "logical" design capabilities, of DELETE, UPDATE, and INSERT can be implemented by using a byte flag for each record signifying deletion, altering the content of records at a given field's offset for updating, and appending to the end of the file for inserting a record.

The select feature can be implemented either in a brute force way, or sequential scan of each record, or if there are indexes and they pertain to the fields used in conditions in the WHERE clause, an index can be looked up.

For an equality condition, the index is simply looked up, but for a comparative condition, such as BETWEEN , ≥ or ≤ , then a physical index which has an ordered physical arrangement could be used.

If the number of records are small, then an in memory hash table or in memory balanced binary tree can be used as an index by sequentially scanning a indexed table at the beginning of an application run. Given the amount of memory available nowadays, this might be almost always the better choice, and techniques such a double buffering where blocks of records are read in at a time for sequential scanning each block, may keep initializing an index acceptably fast.

If the time to build a very large index makes application startup prohibitively slow , using an external B+ tree for range index and an external , block based hash algorithm such as linear hashing or extendible hashing can be used, which would also give the advantage of persisting an index to permanent disc storage on the fly , whenever a memory cached block is written back to disc. By knowing that indexes are one of the physical mechanisms determining the time performance of the logical capability "SELECT ... WHERE ..", the database designer can :

  • optimize a query by making indexes available where needed, and checking what sort of query plan is being generated ( are joins using indexes when a large speedup is expected by doing so, instead of a sequential scan  ? )
  • attempt to influence the query planner by tweaking database features, or the SELECT statement itself,
  • be confident to be able dump data to a file, in say a DBF format, and program indexing features in order to extend an application,
  • and other cool stuff that might make a relational database designer feel good about his job.

Although the heap organization seems logical, sometimes a choice is made to use the primary key as hashing material for a hash based organization of a file in blocks, and the use of a extendible hashing or linear hashing instead of a memory built hash table, is required, because that is how the records are organized, as entries in a block/ bucket of a on disc hash table. Extendible and linear hashing grow incrementally by blocks, so a "block heap" becomes the granularity of the heap append-to file area, instead of appending a new record to a "record heap" file area.

Extendible and linear hashing is described in the Data Structures wiki book, as well as B+ trees . A Java implementation for a memory caching file persisting linear hashing is also there, as well as 2 java and a C++ implementation of B+ trees, as well as a java implementation of a B Tree , which gives contrast to the practical design difference in B+ trees.