Chapter Navigation
Top	E/R Theory - Normalization - Data Querying - Integrating SQL and Programming Languages

Functional Dependency (FD)

Introduction of FD

gene_id	name	annotation
...	...	...
g23	hsp39	XX protein
g24	hsp39	XX protein

Functional Dependency: X->Y if two tuples agree on X, then they agree on Y. e.g. "gene_id -> name" and "gid-> annotation" are FDs, but "name->gene_id" and "name->annotation" are NOT FDs.

Relationship between key and FD:

A set of attribute{A₁, A₂, ...A_t} is a key for relation R if

(1) {A₁, A₂, ...A_t} functionally determines all other attributes;

(2) when no proper subset of {A₁, A₂, ...A_t} functionally determin all other attributes, it is called superkey (or Candidate key?)

rules of FDs:

Observation:

        Correct: 
               X->Y, A->B:   XA->BY;   
               X->AY:   X->A, X->Y;
               X->Y, A->Y: XA->Y;
        Incorrect:
               XY->A: X->A, Y->A;

Formal inference rules for functional dependencies:

        Armstrong’s axioms:
               1. Reflexivity: If Y ⊆ X then X -> Y (called trivial FD);
               2. Transitivity: If X -> Y and Y -> Z then X -> Z;
               3. Augmentation: If X -> Y then XZ -> YZ where Z is a set of attributes.

Where do FDs come from?

looking either at

              (1) Domain Knowledge (the meaning of the attributes); 
              (2) the data;

a	b	c
1	2	3
4	2	6

b->a is not FD, but a->b is a valid FD.

BCNF

Formal way to break tables:

gene	name	annotation	experiment id	experiment description	experiment level
...	...	...	...	...	...

⇒

         R1 (gid, name, annotation);    gid->name annotation;
         R2 (exp id, exp desc);         exp id -> exp desc;
         R3 (gid, expid, exp level);   gid, expid -> exp level;

$OriginalRelation->find\ \ a\ FD\ X->Y\left\{{\begin{array}{ll}{X,Y},&{\hbox{gid, name, annotation;}}\\{everything\ \ exceptY},&{\hbox{gid, exp id, exp desc, exp level;}}\end{array}}\right.$

$gid,expid,expdesc,explevel\left\{{\begin{array}{ll}{expid,expdesc},&{\hbox{;}}\\{gid,\ expid,\ explevel},&{\hbox{;}}\end{array}}\right.$

A relation R is in BCNF if for all dependencies X -> Y , at least one of the following holds:

• X -> Y is a trivial FD (Y ⊆ X)

• X is a superkey for R

3NF

Formal Definition: a relation R is in 3NF if for all dependencies X® Y in F+, at least one of the following holds:

• X -> Y is a trivial FD (Y ⊆ X)

• X is a superkey for R

• Y ⊂ a candidate key for R

Comparison of BCNF and 3NF

	Removes Redundancy	Preserves FDs	Lossless Decomposition
BCNF	Yes	No	Yes
3NF	No	Yes	Yes

Lossy Decomposition

Suppose we have a table such as the one below.

a	b	c
1	2	3
4	2	6

into the following "normalized" tables

a	b
1	2
4	2

b	c
2	3
2	6

This means we followed

b → a or

b → c

as our FD for the decomposition of the original table. (Remember, BCNF decomposes X → Y to the sets {X, Y} and {everything except Y}). However, notice that column b has non-unique values (two of digit 2). When we recombine these two tables in a join, we wind up with a table in which the original relationships are lost:

a	b	c
1	2	3
1	2	6
4	2	3
4	2	6

We declare this as a lossy decomposition, because the FDs assumed by the decomposed tables do not hold true (both b → a and b → c are false). Thus, BCNF can not remove all forms of redundancy. We will need to use another model, Multivalued Dependency (MD), to aid us in decomposing tables properly for normalization.

Multivalued Dependency (MD)

Let's explore a concrete scenario. We'd like to represent belongings of affluent students. We create a multivalued dependency (MD) of

SID →→ car

read as, "A given student ID can have many cars."

Compare this to an FD like

SID → name

which we read as, "A given SID can have at most one name."

Suppose we also had the following MD:

SID →→ clothes

We can get the following table:

SID	car	clothes
1	Honda	jeans
1	Toyota	khakis
1	Toyota	jeans
1	Honda	khakis
…	…	…

Perhaps we want to decompose this table further. We decompose along the MD of

SID →→ car

This leaves us with {SID, clothes}, leaving us with the following tables

SID →→ car

{SID, clothes}

SID	car
1	Honda
1	Toyota
…	…

sid	clothes
1	jeans
1	khakis
…	…

We can't actually do this, because these are not FDs! Neither SID →→ car or SID →→ clothes have a superkey. The key must be all three attributes. We need new rules for decomposing MDs.

Rules for MDs

Spotting trivial MDs

We say that an MD $X \to\to Y$ is trivial in $R$ if $X \cup Y = {all attributes}$ .

For example, consider the following relation $R 1$ where we have the MD $a \to\to b$ :

a	b
$R 1$
1	2
1	3
2	2
…	…

Because we can have many $b$ s to each $a$ , we can not generate data to violate the MD.

Now consider the following relation $R 2$ :

a	b	c
$R 3$
1	2	6
1	3	7
2	2	6
…	…	…

In this case, $a \to\to b$ is non-trivial, because the relation is three-way.

Non-trivial MDs come in pairs

In fact, $a \to\to b$ has a partner MD which is also non-trivial: $a \to\to c$

Here are the listings of all non-trivial pairs of MDs in $R 2$

a \to\to b

a \to\to c

b \to\to a

b \to\to c

c \to\to a

c \to\to b

Compare this to examples of trivial MDs for $R 2$ :

a \to\to bc

ab \to\to c

b \to\to ac

bc \to\to a

c \to\to ab

ca \to\to b

…

Implications of FDs do not hold for MDs

In FDs, we know $X\rightarrow YZ\iff {\begin{array}{c}X\rightarrow Y\\{\mbox{and}}\\X\rightarrow Z\end{array}}$

However, this same implication does not exist for MDs. That is, $X\rightarrow \rightarrow YZ\nLeftrightarrow {\begin{array}{c}X\rightarrow \rightarrow Y\\{\mbox{and}}\\X\rightarrow \rightarrow Z\end{array}}$

MDs are actually about independence

Given the MDs

a \to\to b

and

a \to\to c

We know that, for each $a$

there can be many $a$
there can be many $a$
and, $b$ s are independent of $c$ s with respect to $a$ , which we represent symbolically as $b\perp c|a$ [NOTE: the actual first binary relation symbol should be \Perp instead of \perp, but this symbol is not supported by Mediawiki.]

Every FD is an MD

By definition, $X \to Y \Rightarrow X \to\to Y$ . If $X$ has at most one $Y$ , then it satisfies the condition of having many $Y$ .

For example, given the following relation

a	b
…	…

and an FD of $a \to b$ , then we also have a trivial MD of $a \to\to b$ .

Now, consider the FD $a \to b$ with following relation:

a	b	c	d
…	…	…	…

In this case, we actually have two non-trivial MDs:

a \to\to b

a \to\to cd

(Remember, non-trivial MDs come in pairs.)

Data Management in Bioinformatics/Normalization

Contents

Functional Dependency (FD)

Introduction of FD

BCNF

3NF

Comparison of BCNF and 3NF

Lossy Decomposition

Multivalued Dependency (MD)

Rules for MDs

Spotting trivial MDs

Non-trivial MDs come in pairs

Implications of FDs do not hold for MDs

MDs are actually about independence

Every FD is an MD