Relational Database Design/Normalization

Database practitioners talk about normalization a lot, but it's often poorly understood. Some people see it as an academic detail that is impractical in the real world. Many practitioners are convinced that normalization is always too costly in terms of performance. In fact there are important practical benefits to normalization. Even though there may be legitimate reasons to eschew normalization in some circumstances, it is always best to make this decision from a standpoint of understanding the costs and benefits than dismissing it out of ignorance.

Normalization provides a set of rules and patterns that can be applied to any database to avoid common logical inconsistencies. Normalizing a database design will typically improve:

Consistency, since errors that could be made with the database would be structurally impossible
Extensibility, since changes to the database structure will only affect parts of the database they are logically dependent on
Efficiency, since redundant information will not need to be stored

The database community has identified several distinct levels of normalization, each more stringent than the last. These are referred to as normal forms and are numbered from one (the lowest form of normalization, referred to as first normal form or 1NF) through five (fifth normal form or 5NF). It's quite common in practice to speak of one database design as being more or less normalized than another, as defined by these levels.

In practical applications, you'll often see 1NF, 2NF, and 3NF along with the occasional 4NF. Fifth normal form is very rarely seen but is briefly discussed in this article. Fifth normal form means that the SQL selecting newbies won't produce misleading results by doing a join with artificial, spurious rows occurring due to a many-to-many dependency, or something equal to or more than a 3-way relationship, which are rare in comparison to the most of the many-to-one that will be detected by the earlier forms.

Why perform normalization ?

A database is a set of facts about the world. Database administrators would like their database to contain only true facts about the world, and no false facts. Normalization allows the RDBMS to prevent some classes of false fact from occurring in the database, and therefore helps consistency. Obviously much of the information in the database can't be verified by the computer; the only types of error that can be prevented are errors that are inconsistent and entail some logical contradiction. Generally speaking, with each additional level of normalization one or more classes of logical error are made impossible by virtue of the structure of the tables.

An anomaly occurs in a database if the database starts in a consistent state and an apparently valid action causes the database to become inconsistent. This can be further specified as an update, insertion or deletion anomaly depending on whether the action that causes the error is a row update, insertion or deletion respectively.

The example above shows an update anomaly. A database user has attempted to update an employee's address. However, since the address occurs more than once in the table and only one copy has been updated, there is now inconsistent data in the table: the employee cannot have two addresses at once.

This example shows an insertion anomaly. A new faculty member has been hired, but this information cannot be inserted into the database until he has been assigned to teach at least one course.

What's an FD ?

A Functional Dependency, or FD, looks like

$a\rightarrow b$

Which means that b is functionally dependent on a. Another way to read this is to say, "a determines b."

The idea here is just that if you know a value for one field (a) then you also know the value for the other field (b). For example, if you know someone's Social Security Number (in the U.S.) then you can find that person's last name, first name, etc. Knowing one piece of data is enough for you to determine the others.

Another example might be that if you know the serial number on a car, you can also find (determine) the car's manufacturer and model.

${\text{serial number}}\rightarrow ({\text{manufacturer}},{\text{model}})$

Thus manufacturer and model depend on serial number. This is a functional dependency.

Notice that this relationship often does not work in reverse. If you know a car has a given manufacturer, that is not enough information to find out what its serial number is (since there are many individual cars, with different serial numbers, all made by the same manufacturer).

What's the point of knowing Functional Dependencies ?

If you don't know your functional dependencies, then forever will you be condemned to reincarnate your update, insertion and deletion anomalies. Contemplate your Functional Dependencies, and you may be able to achieve BCNF (Boyce-Codd normal form) across your whole database, and if not, at least there are two levels in normalization. You could do 3NF, and break the hold of inefficient database design, which your ER modeling did not account for. The point of Functional Dependencies is that they're like used car warranties, and preserving FDs avoids update anomaly costly repairs.

What Boyce-Codd Normal Form does is to connect the dots between the concept of FDs and the concept of relational table key attributes, remembering that any candidate key of a table can be a primary key. In BCNF, all FDs left-hand-sided attributes', are candidate keys.

Consider the negative case, that when one or more FDs are not candidate keys, then the schema is not in BCNF, and the schema is prone to a lossy join or a functional dependency integrity constraint is not preserved.

The normal forms

Before we begin our discussion of the normal forms, it's important to point out that they are guidelines and guidelines only. Occasionally, it becomes necessary to stray from them to meet practical business requirements. However, when variations take place, it's extremely important to evaluate any possible ramifications they could have on your system and account for possible inconsistencies. That said, let's explore the normal forms.

Normal forms are as fun as watching legal dramas: "I swear to tell the truth, the whole truth, and nothing but the truth, so help me god". "All non-key attributes will tell a fact about the key, the whole key, and nothing but the key, so help me Codd." (about 2NF, 3NF, and BCNF).

Some terms used to understand each of the normal forms:

1NF - atomic values

2NF, 3NF, BCNF - a key is a set of attributes of a relation which can be used to identify a row in a relation. A superkey is a superset of a candidate key, a minimal set of attributes which can be used to identify a row. Hence, all the attributes of a relation can be a superkey.

2NF - non-key attributes are dependent on other non-key attributes, whole candidate keys, but not dependent on a part of a composite key.
3NF - non-key attributes are directly dependent on whole candidate keys, or are a part of a candidate key.
BCNF - non-key attributes are directly dependent on whole candidate keys and are not part of keys.

The above summary seems good enough, but it isn't, because the definition of superkey is that all the attributes of a relation can be a key, and therefore, there are no "non-key" attributes. So it is better to express the normal forms in terms of functional dependency, X -> A "X determines A", or "A depends on X". 2NF then becomes "X cannot be a part of a candidate key, but can be a candidate key, or an attribute of a superkey that is not in the minimal key set of the superkey." 3NF becomes "X is a whole superkey, or A is part of a candidate key". BCNF becomes "X is a whole superkey"

Why have a NF ?

1NF - makes it easier to write a query. The example is a Clients table where the Client has three separate phone numbers (perhaps to allow office, home and cellphone numbers to be entered). Your DB is not even 1NF if you have a client / customer table with multiple telephone number fields.

Not being in first normal form makes queries complex and inefficient if you want to look up a client from a given phone number. Since you don't know which of the fields the phone number will appear in, you have to do something like the following:

SELECT Id
  FROM Clients
 WHERE FirstPhoneNumber = '555-123-4567'
       OR SecondPhoneNumber = '555-123-4567'
       OR ThirdPhoneNumber = '555-123-4567';

2NF - one textbook says it isn't relevant given 3NF, and BCNF, and is probably being posh about why one splits a table into a parent and child relations linked by a foreign key. An example, like having a client table, is a contractor table, where contractors have more than one skill, and having the contractor-skill table own a contractor's address field means owning a non-prime attribute that is partially dependent on the whole key contractor-skill, i.e. address determined by contractor only. If a contractor has more than one skill, then there will be redundant rows containing the same contractor name and address. 2NF would mandate splitting the table according to the FDs, N -> A , and the multivalue dependency, N ->-> S, or (Name , Address), and (Name, Skill ); as well as being 2NF, this is 3NF, BCNF, and also 4NF. Why ?

3NF - is like a restrictive form of BCNF, so maybe easier to look at BCNF first to see if it doable, before doing 3NF.

BCNF - the mapping is easier because for every FD X->A, X must be a superkey. This makes it easy to show when breaking a non-normalized 1NF or 2NF relation into a BC normal form can't be done without losing a conflicting FD, and therefore the decomposition is not dependency preserving , and the normalization should then revert to 3NF , which can always accommodate for conflicting FDs during attempted BCNF.

3NF - the only additional relaxation is that if a FD X-> A applies, then A can be part of a candidate key. To see a decomposition which is not BCNF because an FD is not preserved, but is 3NF, an example would be helpful, and this author is not cognitive enough to think of one, so some internet and book derived examples will be given, with the proviso that they may be withdrawn for unrelated socially backward reason (usually legal). The examples are : enrol ( student, class, assistant), with FDs student class -> assistant , assistant -> class ( only allowed to assist one class) , and part_supply, (contract, supplier, department, project, part, quantity, value), with the FDs

1) project part -> contract

2) supplier department -> part

3) project -> supplier , and

4) contract -> the rest of the original relation.

For the original part_supply relation, the second and thirds FDs have determinants X which don't fit the statement "X is a superkey".

The transitive axiom of Armstrong's Axioms suggests that the first FD, 1) Project part -> contract , doesn't make the original relation violate BCNF, because Project,Part is a superkey or alternate key , as it transitively maps to the original singly attributed key, contract.

Steps for attempting BCNF decomposition

The recommended decomposition step is to use a violating FD which has a single dependent A in X -> A, possible because of another Armstrong Axiom stating X-> A B C , can always be decomposed to X -> A , X->B, X -> C, singly attributed dependent FDs .

(The decomposition axiom is another way of stating the Reflexivity Axiom combined with the Transitivity Axiom, i.e. that if X is a superset of Y then X -> Y, and since, in this instance, A B C is a superset of A, and X -> A B C , so X transitively -> A because A B C -> A. )

Going back to the statement 2 paragraphs back, to use a FD X -> A where

a) A is a single attribute of R , and

b) X must be a subset of R .

c) the FD X -> A is a member of a minimal cover set of FDs , which implies the original set of FDs have all been transformed into FDs with single dependents, and no lesser set of FDs can be created with Armstrong's Axioms , that can regenerate the original set of FDs using Armstrong's axioms.

if a) and b), then decompose a relation R into R - A, and X A e.g.

R = ( contract, supllier, department, project, part, quantity, value) , X = supplier department, A = part

So R -A = ( contract, supplier, department, project, quantity, value) , and XA = ( supplier, department, part ).

Because X -> A, X can be a key of XA and X can be joined naturally with R - A , because the attributes of R is a superset of X.

Then step 2 is to check the resulting decomposition relations for BCNF e.g. in this example, R-A is not BCNF because project -> supplier FD exists, and project is not a superkey of (contract supplier dept project, quantity, value).

Recurse if BCNF is violated, so now R = contract, supplier, dept, project, quantity, value, the next singly attributed dependent FD is project -> supplier, so remove A (supplier) to form X-A as before, to give( contract , department, project, quantity , value ).

The decomposition is now ,

(supplier, department, part) (project, supplier) (contract, department, project, quantity, value)

However, the only remaining violating FD is project, part -> contract , but here X ( project, part) is not a subset of R ( contract, department, project, quantity, value).

Therefore BCNF decomposition cannot continue. At this point, attempting to achieve BCNF is abandoned, and achieving 3NF is the next goal.

3NF decomposition follows on with the products of a failed BCNF decomposition, and achieves a lossless joining , FD preserving decomposition in all cases

The technique for decomposition into 3NF is to start with a lossless decomposition of the original relation, and the set of violating FDs. The former is created by the attempt to decompose into BCNF as above , and the latter is the remaining FDs when the BCNF decomposition cannot proceed any further.

This is the rationale for starting with BCNF as a goal, because the abandoned results of failure to achieve BCNF can be used to achieve 3NF.

To achieve 3NF, for every violating FD X -> A, simply add a relation XA . In this example, the last FD , project, part -> contract, can form a relation so a successful 3NF decomposition is now:

(supplier, department, part) (project, supplier) (contract, department, project, quantity, value) (project, part, contract).

To summarize without the example,

Step 1: Attempt to decompose into BCNF.

Step 2: Attempt to decompose into 3NF.

To do step 1,

a. remember Armstrong's axioms :

Reflexivity, if X is a subset of Y, then Y -> X,

Transitivity, if X -> Y and Y -> Z, then X -> Z,

Augmentation, if XB -> YB, then X-> Y

Reflexitivy and Transitivity implies Decomposition: ABC -> A , X -> ABC, then X -> A, ( and X->B, and X-> C)

Union is also implied, X -> Y, X -> Z, then X -> YZ

b) convert the original set of FDs , into a minimal covering set of FDs using Armstrong's axioms.

b1. Use Decomposition to make all FDs singly attributed on the right hand side ( singly attributed dependent).

b2. Use Armstrong's axioms to remove redundant determinant attributes , and then

b3. remove equivalent transformed FDs.

c) Use one FD from the remaining set of minimal covering set of FDs.

d) for this FD, X -> A (single), X and A must be respectively a subset and a member of R 's attributes. i.e. R is a superset of X union A .

e) remove from A from R , to get R - A as one decomposition product, and another relation X A as another decomposition product (i.e. attributes making X and A ).

f) remove the used FD from the covering set of FDs, and go back to c). , using the set of relations - R, union with each decomposition product R -A, and X A as the source set of relations to be decomposed , and continue until t all the minimal covering set of FDs has been exhausted (emptied), or the remaining singly attributed FDs cannot be used to decompose any of the growing set of decomposed relations.

If there are remaining FDs, then BCNF cannot be achieved whilst preserving all FDs.

Switch to 3NF covering relation generation by constructing:

Get each FD that is part of a residual minimal set of FDs i.e. each FD is of the form X-> A where A is a singly attributed dependent, with Armstrong's axioms applied to remove any redundancy on the determinant side (X), and then removing any remaining essentially equivalent FDs , until no more FDs can be removed without losing the ability to reconstruct the original set of FDs.

With this FD, construct a relation XA and remove the FD from the remaining set of FDs; continue until all FDs are processed.

The sales pitch for normalization

The resulting set of relations , either from successful BCNF conversion, or from BCNF/3NF conversion, can be recomposed by joins into the original relation without loss or generation of spurious rows, and preserves all the original functional dependencies. Update, insertion and deletion anomalies should not occur when using these normalized relations.

4NF and 5NF deal with 1-to-many and many-to-many relationships , or multivalued dependencies

Achieving BCNF may achieve 4NF if the resulting decomposed relations have single attributed keys. A multivalued dependency is expressed as X ->-> A, which means for every instance of X, there is a fixed set of A values, which are independent of any other attributes except X. In fact, a FD X -> A where each X value implies a single A value, is a special case of MVD.

For instance, stereotype_expertise is a multivalued relation, with stereotype values / multivalued dependency expertise instances:

medical_graduate ->-> physiology, anatomy , pharmacology ; 

computer_science_graduate ->-> concrete mathematics, database design, programming ; 

idiot ->->  pharmacology, database design.

In 4NF, a multivalue dependency will always hold if the determinant and the multivalued dependents of this single MVD are the only attributes of the relation, or if the determinant is a superkey and there are other attributes apart from X, Y in the MVD X->-> Y. This last criteria doesn't make sense, because if X is a superkey, then there is only one row for each X.

The rationale for 4NF is illustrated by the example of a relation which has 2 independent multivalue dependencies, e.g. Person Skill Language. A person has a set of skills, and a set of languages. An example of a design problem is when a person adds a language, there is a design choice : have null values for skill in the row with the new person language detail, to having always a cross product of skills and language for every person, so for each row of cross product skill and language , have person, so this facilitates a search for someone with a certain skill and language. In this kind of table, person cannot be a candidate key, due to multiple skill + language mappings, and actually all 3 attributes form the key. So by 4NF, this relation needs to be decomposed until all MVDs X->->Y have either only X Y as attributes , or X is a superkey ( which really doesn't make sense since there are multiple Y for a X ). What is the 4NF decomposition of Person Skill Language ?

[http://www.bkent.net/Doc/simple5.htm#label4 William Kent, "A Simple Guide to Five Normal Forms in Relational Database Theory", Communications of the ACM 26(2), Feb. 1983, 120-125]

5NF

This author's good but slightly inconsistent textbook states that

a 3NF decomposition relation is in 5NF if there is no candidate key in the relation which is composite.

Presumably, a composite key can be projected with only one of its two or more attributes, into a smaller decomposition, which when re-joined, with the rest of the relation , will result in spurious rows. 5NF is like a declarative version of the procedural decomposition of BCNF by its functional dependencies, and the patching with additional single FD relations of 3NF, except it is being applied to MVDs as well, in that the lossless join condition is incorporated into the definition of join dependency, which is that a 5NF relation can be decomposed into any set of smaller relations, and all of these can be rejoined without spurious rows.

Normal Forms, without the logic fetish

1NF

First normal form (1NF) sets the very basic rules for an organized database: Eliminate duplicate columns from the same table. Create separate tables for each group of related data and identify each row with a unique column or set of columns (the primary key).

2NF

Second normal form (2NF) further addresses the concept of removing duplicate data: Meet all the requirements of the first normal form. Remove subsets of data that apply to multiple rows of a table and place them in separate tables. Create relationships between these new tables and their predecessors through the use of foreign keys.

3NF

Third normal form (3NF) goes one large step further: Meet all the requirements of the second normal form. Remove columns that are not dependent upon the primary key. If a remaining column ( attribute) plays a part in a functional dependency, all the attributes of the functional dependency are in the table, and the determinants of the attribute form a key, or all the dependent attributes are keying attributes. The latter means that another table which has the determinant attributes as a key, maps these determinant attributes against the dependent attributes , and these form a foreign key into this table.

Boyce-Codd NF

If an attribute of a table plays a part as a determinant attribute in a functional dependency, then it and all the other determinant attributes of the functional dependency are in the table and all of them together form a superset of or are equal to a key of the table. All the dependent attributes of the functional dependency are also in the table.

4NF

Fourth normal form (4NF) has one additional requirement: Meet all the requirements of the third normal form. A relation is in 4NF if it is in BCNF and it has a single column key, or the table only has attributes of one multivalued dependency, or the determinant of the multivalued dependency is a superset of or equal to a the key.

5th Normal Form (5NF)

The purpose of 5NF is to achieve a lossless join decomposition with respect to all the candidate keys of a relation. 5NF is defined in terms of not breaking join dependencies that hold over a relation. A Join dependency is a statement that a particular decomposition will result in a lossless recomposition. A multivalued dependency, is a special case of a join dependency, expressed as a subrelation consisting of the MVDs determinant and dependent attributes, and a subrelation consisting of all the other attributes in the original relation and the MVDs determinant attributes, and hence the join occurs on the determinant attributes. A paper showed that if a 3NF relation has all candidate keys as single attributes, then all join dependencies will hold (there will be no decomposition involving candidate keys which will lead to a lossy recomposition).

Domain-Key NF

Domain-Key Normal Form eliminates insertion and deletion anomalies while relying only on domain constraints and key constraints (no arbitrary CHECK constraints are needed). Unfortunately there is no general method for transforming a database into DKNF, so it is usually the hardest normal form to achieve.

Denormalization

After reading about Normalization above and how things should be broken up, you may be thinking:

Well, now I have all these tables all over the place, and I have to join a buncha them just to get simple sets of data out of the system! How is that gonna affect the performance?

Well, you may be right. In some cases where performance is an issue, it may be necessary to denormalize parts of your database/schema and have wider tables that may contain duplicated/redundant data.

However, such decisions are best left in the hands of an experienced DBA, because there may be other ways to improve performance such as physical design, indexes, query optimization, etc. that DBA's specialize in. Database Denormalization, just like Normalization, is best left in the hands of the experienced.

But if you were cocky enough to want to be good at Database Denormalization, then reading a chapter out of a good database textbook about database tuning would help; the one I read, simply reiterated the previous lessons about Boyce-Codd Normalization and 3NF Normalization, and the choice between the two, so that there is already a denormalization step from BCNF to 3NF when attempting to achieve the normalization goals of no lossy joins and functional dependency preservation. This book also goes on to talk briefly about using update, insert, and delete, SQL triggers to enforce functional dependencies, that aren't being enforced because of denormalization, and then to compare the query plan and execution costs of these triggers, vs the query plan and execution costs of preserving normalization , and hence the costs of mandatory joins for some queries, the latter often requiring the creation of indexes to lower the cost of joins where only a small proportion of rows are needed in the query.