Data Management in Bioinformatics/Data Querying

Chapter Navigation
Top E/R Theory - Normalization - Data Querying - Integrating SQL and Programming Languages

Data Querying

edit

Query Languages

edit
  • SQL (Structured Query Language ). SQL is a Data Definition and Manipulation Language (both DDL and DML)
  • RA (Relational Algebra)
  • Datalog

Creating Tables

edit
  • Command : CREATE TABLE

For the relations:

Gene(gid, name, annotation)

Experiment(eid, name,date)

Microarray(gid, eid, exprlevel)

Tables can be created using the following set of commands in SQL:

CREATE TABLE Gene( gid INTEGER, name VARCHAR(20), annotation  
VARCHAR(50), PRIMARY KEY (gid));

CREATE TABLE Experiment(eid INTEGER, name VARCHAR(10), date  
DATE, PRIMARY KEY (eid));

CREATE TABLE Microarray (a INTEGER, b INTEGER, exprlevel REAL,  
PRIMARY KEY (gid, eid));

A primary key is used to uniquely identify each row in a table. It can either be part of the actual record itself, or it can be an artificial field. SQL will automatically index the table based on the primary key. Indexing with other keys (other than the primary key) is also possible; it can be done by using the CREATE INDEX command.

Note that the primary key for the "Microarray" table is composed of keys from the other two tables i.e. Gene and Experiment. These fields are called Foreign Keys. It is important to reference where the table gets its primary key from. The syntax is as follows.

CREATE TABLE Microarray (a INTEGER, b INTEGER, exprlevel REAL,  
PRIMARY KEY (gid, eid), FOREIGN KEY (a) REFERENCES Gene 
(gid), FOREIGN KEY (b) Experiment (eid) );

These types of constraints are often called Referential Integrity Constraints.

Storing Data

edit
  • Command : INSERT INTO

Example:

INSERT INTO Gene VALUES(1, "1433E", "enzyme binding");

Most database management systems have the capability to load bulk data at once.

Querying

edit

Projection

edit
  1. To display all the data stored in the Gene table
    • SQL
      SELECT * 
      FROM Gene;
      
    • RA  
    • Datalog  
  2. To display the names of all genes in the Gene table
    • SQL
      SELECT name
      FROM Gene;
      
    • RA  
    • Datalog  
    • Note: to list out only distinct values use SELECT DISTINCT in SQL. RA and Datalog return distinct values by default.
  3. To display gene names that meet certain criterion
    • SQL
      SELECT * 
      FROM Gene
      WHERE annotation like '%phosphates%';
      
    • RA  
    • Datalog  

Cartesian Product

edit

Definition : The Cartesian product of two tables of size nxa and mxb is a table having n*m rows and a+b columns

  • SQL
    SELECT * 
    FROM Gene, Experiment;
    
  • RA  
  • Datalog  

Join

edit

Example 1

edit
  • Table
Genes
Expression
Experiments
gid name anno
... ... ...
... ... ...
... ... ...
gid eid explevel
... ... ...
... ... ...
... ... ...
eid name date
... ... ...
... ... ...
... ... ...
  • E-R Diagram

File:Cs5634-ER-pic1.gif

  • SQL
    SELECT Genes.*, Expts.*, exp_level
      FROM Genes, Expression, Expts
     WHERE Genes.gid = Expression.gid
       AND Expression.eid = Expts.gid;
    
  • RA
     Genes ?? Expression ?? Expts
    
     (A ?? B) ?? C = A ?? (B ?? C)
    
  • Datalog
     Answer(x, y, t, a, b, w) ← Genes(x, y, z), Expression(x, t, w), Expts(t, a, b)
    
     Genes' x match Expression's x
     Expts' t match Expression's t
    

Example 2

edit

Question: ...

Expression
gid eid explevel
... ... ...
... ... ...
... ... ...
  • RA CannotbeMax = ... Allpairs = ... Allpairs - CannotbeMax
  • SQL
    CREATE VIEW CannotbeMax(gid, eid) AS
        SELECT E1.gid, E1.eid
          FROM Expression AS E1, Expression AS E2
         WHERE E1.explevel < E2.explevel;
    
    CREATE VIEW Allpairs(gid, eid) AS
        SELECT gid, eid
          FROM expression;
    
    SELECT * FROM Allpairs
    EXCEPT
    SELECT * FROM CannotbeMax;
    
    - EXCEPT
    ∪ UNION
    ∩ INTERSECT
    
  • Datalog
     CannotbeMax(x, y) <- Expression(x, y, t), Expression(a, b, w), t < w.
    
     Answer(x, y) <- Expression(x, y, a), NOT CannotbeMax(x, y).
    

Interesting Query

edit

Without MAX() operator

  • max
- ref. Join section Example 2
  • min
  • second largest
  • median

Relational Division

edit
A X B = C
C / B = ?
Student & Courses
StudentId CourseID
one CS2204
one CS4104
one CS6354
two CS2204
two CS4104
Required For Graduation(R4G)
CourseID
CS2204
CS4104
CS6354

Through Relational Algebra
1. sid = sidStudent&Courses
2. Ideal = sid X R4G
3. Reality = Student&Courses
4. Defaulters = Ideal - Reality
5. Answer = sid -  sidDefaulters
Through SQL

1.

CREATE VIEW SID(sid)<BR>
          SELECT sid<BR>
          FROM   Student&Courses<BR>

2.

CREATE VIEW IDEAL(sid,courseId)<BR>
           SELECT sid,courseID<BR>
           FROM  SID,R4G<BR>

3.

CREATE VIEW REALITY(sid,courseId)<BR>
           SELECT *<BR>
           FROM Student&Courses<BR>

4.

CREATE VIEW Defaulters(sid,courseId)<BR>
           SELECT * <BR>
           FROM IDEAL<BR>
           EXCEPT<BR>
           SELECT *<BR>
           FROM REALITY<BR>

5.

SELECT * <BR>
  FROM SID<BR>
  EXCEPT <BR>
  SELECT sid<BR>
  FROM Defaulters <BR>


Exercise:

  • Find students who satisfied graduate requirements except for 1.
  • Find students who satisfied most of the graduate requirements.


A View is not really computed, an row most of times cannot be inserted because of ambiguities.
SQL Injection, is a hacking technique where the input is formatted such that it executes a SQL query which is not expected.


Bells and Whistles
1. ORDER BY
2. MAX,AVG,SUM,MIN
3. GROUP BY, HAVING
4. SELECT has a functionality of Printing