Data Management in Bioinformatics/Normalization
Chapter Navigation | |
Top | E/R Theory - Normalization - Data Querying - Integrating SQL and Programming Languages |
Functional Dependency (FD)
editIntroduction of FD
editgene_id | name | annotation |
---|---|---|
... | ... | ... |
g23 | hsp39 | XX protein |
g24 | hsp39 | XX protein |
Functional Dependency: X->Y if two tuples agree on X, then they agree on Y. e.g. "gene_id -> name" and "gid-> annotation" are FDs, but "name->gene_id" and "name->annotation" are NOT FDs.
Relationship between key and FD:
A set of attribute{A1, A2, ...At} is a key for relation R if
(1) {A1, A2, ...At} functionally determines all other attributes;
(2) when no proper subset of {A1, A2, ...At} functionally determin all other attributes, it is called superkey (or Candidate key?)
rules of FDs:
Observation:
Correct: X->Y, A->B: XA->BY; X->AY: X->A, X->Y; X->Y, A->Y: XA->Y; Incorrect: XY->A: X->A, Y->A;
Formal inference rules for functional dependencies:
Armstrong’s axioms: 1. Reflexivity: If Y ⊆ X then X -> Y (called trivial FD); 2. Transitivity: If X -> Y and Y -> Z then X -> Z; 3. Augmentation: If X -> Y then XZ -> YZ where Z is a set of attributes.
Where do FDs come from?
looking either at
(1) Domain Knowledge (the meaning of the attributes); (2) the data;
a | b | c |
---|---|---|
1 | 2 | 3 |
4 | 2 | 6 |
b->a is not FD, but a->b is a valid FD.
BCNF
editFormal way to break tables:
gene | name | annotation | experiment id | experiment description | experiment level |
---|---|---|---|---|---|
... | ... | ... | ... | ... | ... |
⇒
R1 (gid, name, annotation); gid->name annotation; R2 (exp id, exp desc); exp id -> exp desc; R3 (gid, expid, exp level); gid, expid -> exp level;
A relation R is in BCNF if for all dependencies X -> Y , at least one of the following holds:
• X -> Y is a trivial FD (Y ⊆ X)
• X is a superkey for R
3NF
editFormal Definition: a relation R is in 3NF if for all dependencies X® Y in F+, at least one of the following holds:
• X -> Y is a trivial FD (Y ⊆ X)
• X is a superkey for R
• Y ⊂ a candidate key for R
Comparison of BCNF and 3NF
editRemoves Redundancy |
Preserves FDs |
Lossless Decomposition | |
---|---|---|---|
BCNF | Yes | No | Yes |
3NF | No | Yes | Yes |
Lossy Decomposition
editSuppose we have a table such as the one below.
a | b | c |
---|---|---|
1 | 2 | 3 |
4 | 2 | 6 |
into the following "normalized" tables
|
|
This means we followed
- b → a or
- b → c
as our FD for the decomposition of the original table. (Remember, BCNF decomposes X → Y to the sets {X, Y} and {everything except Y}). However, notice that column b has non-unique values (two of digit 2). When we recombine these two tables in a join, we wind up with a table in which the original relationships are lost:
a | b | c |
---|---|---|
1 | 2 | 3 |
1 | 2 | 6 |
4 | 2 | 3 |
4 | 2 | 6 |
We declare this as a lossy decomposition, because the FDs assumed by the decomposed tables do not hold true (both b → a and b → c are false). Thus, BCNF can not remove all forms of redundancy. We will need to use another model, Multivalued Dependency (MD), to aid us in decomposing tables properly for normalization.
Multivalued Dependency (MD)
editLet's explore a concrete scenario. We'd like to represent belongings of affluent students. We create a multivalued dependency (MD) of
- SID →→ car
read as, "A given student ID can have many cars."
Compare this to an FD like
- SID → name
which we read as, "A given SID can have at most one name."
Suppose we also had the following MD:
- SID →→ clothes
We can get the following table:
SID | car | clothes |
---|---|---|
1 | Honda | jeans |
1 | Toyota | khakis |
1 | Toyota | jeans |
1 | Honda | khakis |
… | … | … |
Perhaps we want to decompose this table further. We decompose along the MD of
- SID →→ car
This leaves us with {SID, clothes}, leaving us with the following tables
SID →→ car | {SID, clothes} | ||||||||||||||||
|
|
We can't actually do this, because these are not FDs! Neither SID →→ car or SID →→ clothes have a superkey. The key must be all three attributes. We need new rules for decomposing MDs.
Rules for MDs
editSpotting trivial MDs
editWe say that an MD X →→ Y is trivial in R if X ∪ Y = {all attributes} .
For example, consider the following relation R1 where we have the MD a →→ b:
R1 | |
a | b |
---|---|
1 | 2 |
1 | 3 |
2 | 2 |
… | … |
Because we can have many bs to each a, we can not generate data to violate the MD.
Now consider the following relation R2:
R3 | ||
a | b | c |
---|---|---|
1 | 2 | 6 |
1 | 3 | 7 |
2 | 2 | 6 |
… | … | … |
In this case, a →→ b is non-trivial, because the relation is three-way.
Non-trivial MDs come in pairs
editIn fact, a →→ b has a partner MD which is also non-trivial: a →→ c
Here are the listings of all non-trivial pairs of MDs in R2
| ||
| ||
|
Compare this to examples of trivial MDs for R2:
a →→ bc |
ab →→ c |
b →→ ac |
bc →→ a |
c →→ ab |
ca →→ b |
… |
Implications of FDs do not hold for MDs
editIn FDs, we know
However, this same implication does not exist for MDs. That is,
MDs are actually about independence
editGiven the MDs
- a →→ b
- and
- a →→ c
We know that, for each a
- there can be many a
- there can be many a
- and, bs are independent of cs with respect to a, which we represent symbolically as [NOTE: the actual first binary relation symbol should be \Perp instead of \perp, but this symbol is not supported by Mediawiki.]
Every FD is an MD
editBy definition, X → Y ⇒ X →→ Y. If X has at most one Y, then it satisfies the condition of having many Y.
For example, given the following relation
a | b |
---|---|
… | … |
and an FD of a → b, then we also have a trivial MD of a →→ b.
Now, consider the FD a → b with following relation:
a | b | c | d |
---|---|---|---|
… | … | … | … |
In this case, we actually have two non-trivial MDs:
a →→ b |
a →→ cd |
(Remember, non-trivial MDs come in pairs.)