# Data Management in Bioinformatics/Normalization

 Chapter Navigation Top E/R Theory - Normalization - Data Querying - Integrating SQL and Programming Languages

# Functional Dependency (FD)

## Introduction of FD

gene_id name annotation
... ... ...
g23 hsp39 XX protein
g24 hsp39 XX protein

Functional Dependency: X->Y if two tuples agree on X, then they agree on Y. e.g. "gene_id -> name" and "gid-> annotation" are FDs, but "name->gene_id" and "name->annotation" are NOT FDs.

Relationship between key and FD:

A set of attribute{A1, A2, ...At} is a key for relation R if

(1) {A1, A2, ...At} functionally determines all other attributes;

(2) when no proper subset of {A1, A2, ...At} functionally determin all other attributes, it is called superkey (or Candidate key?)

rules of FDs:

Observation:

        Correct:
X->Y, A->B:   XA->BY;
X->AY:   X->A, X->Y;
X->Y, A->Y: XA->Y;
Incorrect:
XY->A: X->A, Y->A;


Formal inference rules for functional dependencies:

        Armstrong’s axioms:
1. Reflexivity: If Y ⊆ X then X -> Y (called trivial FD);
2. Transitivity: If X -> Y and Y -> Z then X -> Z;
3. Augmentation: If X -> Y then XZ -> YZ where Z is a set of attributes.


Where do FDs come from?

looking either at

              (1) Domain Knowledge (the meaning of the attributes);
(2) the data;

a b c
1 2 3
4 2 6

b->a is not FD, but a->b is a valid FD.

## BCNF

Formal way to break tables:

gene name annotation experiment id experiment description experiment level
... ... ... ... ... ...

         R1 (gid, name, annotation);    gid->name annotation;
R2 (exp id, exp desc);         exp id -> exp desc;
R3 (gid, expid, exp level);   gid, expid -> exp level;


${\displaystyle OriginalRelation->find\ \ a\ FD\ X->Y\left\{{\begin{array}{ll}{X,Y},&{\hbox{gid, name, annotation;}}\\{everything\ \ exceptY},&{\hbox{gid, exp id, exp desc, exp level;}}\end{array}}\right.}$

${\displaystyle gid,expid,expdesc,explevel\left\{{\begin{array}{ll}{expid,expdesc},&{\hbox{;}}\\{gid,\ expid,\ explevel},&{\hbox{;}}\end{array}}\right.}$

A relation R is in BCNF if for all dependencies X -> Y , at least one of the following holds:

• X -> Y is a trivial FD (Y ⊆ X)

• X is a superkey for R

## 3NF

Formal Definition: a relation R is in 3NF if for all dependencies X® Y in F+, at least one of the following holds:

• X -> Y is a trivial FD (Y ⊆ X)

• X is a superkey for R

• Y ⊂ a candidate key for R

## Comparison of BCNF and 3NF

RemovesRedundancy PreservesFDs LosslessDecomposition Yes No Yes No Yes Yes

## Lossy Decomposition

Suppose we have a table such as the one below.

a b c
1 2 3
4 2 6

into the following "normalized" tables

a b
1 2
4 2
b c
2 3
2 6

This means we followed

b → a or
b → c

as our FD for the decomposition of the original table. (Remember, BCNF decomposes X → Y to the sets {X, Y} and {everything except Y}). However, notice that column b has non-unique values (two of digit 2). When we recombine these two tables in a join, we wind up with a table in which the original relationships are lost:

a b c
1 2 3
1 2 6
4 2 3
4 2 6

We declare this as a lossy decomposition, because the FDs assumed by the decomposed tables do not hold true (both b → a and b → c are false). Thus, BCNF can not remove all forms of redundancy. We will need to use another model, Multivalued Dependency (MD), to aid us in decomposing tables properly for normalization.

# Multivalued Dependency (MD)

Let's explore a concrete scenario. We'd like to represent belongings of affluent students. We create a multivalued dependency (MD) of

SID →→ car

read as, "A given student ID can have many cars."

Compare this to an FD like

SID → name

which we read as, "A given SID can have at most one name."

Suppose we also had the following MD:

SID →→ clothes

We can get the following table:

SID car clothes
1 Honda jeans
1 Toyota khakis
1 Toyota jeans
1 Honda khakis

Perhaps we want to decompose this table further. We decompose along the MD of

SID →→ car

This leaves us with {SID, clothes}, leaving us with the following tables

SID →→ car {SID, clothes}
SID car
1 Honda
1 Toyota
sid clothes
1 jeans
1 khakis

We can't actually do this, because these are not FDs! Neither SID →→ car or SID →→ clothes have a superkey. The key must be all three attributes. We need new rules for decomposing MDs.

## Rules for MDs

### Spotting trivial MDs

We say that an MD X →→ Y is trivial in R if XY = {all attributes} .

For example, consider the following relation R1 where we have the MD a →→ b:

a b R1 1 2 1 3 2 2 … …

Because we can have many bs to each a, we can not generate data to violate the MD.

Now consider the following relation R2:

a b c R3 1 2 6 1 3 7 2 2 6 … … …

In this case, a →→ b is non-trivial, because the relation is three-way.

### Non-trivial MDs come in pairs

In fact, a →→ b has a partner MD which is also non-trivial: a →→ c

Here are the listings of all non-trivial pairs of MDs in R2

 a →→ b a →→ c
 b →→ a b →→ c
 c →→ a c →→ b

Compare this to examples of trivial MDs for R2:

 a →→ bc ab →→ c b →→ ac bc →→ a c →→ ab ca →→ b …

### Implications of FDs do not hold for MDs

In FDs, we know ${\displaystyle X\rightarrow YZ\iff {\begin{array}{c}X\rightarrow Y\\{\mbox{and}}\\X\rightarrow Z\end{array}}}$

However, this same implication does not exist for MDs. That is, ${\displaystyle X\rightarrow \rightarrow YZ\nLeftrightarrow {\begin{array}{c}X\rightarrow \rightarrow Y\\{\mbox{and}}\\X\rightarrow \rightarrow Z\end{array}}}$

### MDs are actually about independence

Given the MDs

a →→ b
and
a →→ c

We know that, for each a

• there can be many a
• there can be many a
• and, bs are independent of cs with respect to a, which we represent symbolically as ${\displaystyle b\perp c|a}$  [NOTE: the actual first binary relation symbol should be \Perp instead of \perp, but this symbol is not supported by Mediawiki.]

### Every FD is an MD

By definition, XYX →→ Y. If X has at most one Y, then it satisfies the condition of having many Y.

For example, given the following relation

a b

and an FD of ab, then we also have a trivial MD of a →→ b.

Now, consider the FD ab with following relation:

a b c d

In this case, we actually have two non-trivial MDs:

 a →→ b a →→ cd

(Remember, non-trivial MDs come in pairs.)