Data Management in Bioinformatics/SQL Exercises
For these questions, we will consider the following tables about microarray gene expression data. Your task is to express each of the given queries in SQL. Example data are provided in these tables for your convenience but note that they are only example data. Your queries must work for all potential data residing in the given tables, not just those illustrated here.
Genes gid | name | organism | annotation -----+---------+----------+---------------------------- g1 | YLR180C | yeast | hypothetical protein g2 | YLR181D | yeast | response to desiccation g3 | sp15 | yeast | drought stress responsive g4 | pdp77 | pine | putative stress responsive g5 | hsp70 | pine | heat shock protein g6 | hsp90 | pine | heat shock protein
Expression gid | experimentid | level | significance -----+--------------+-------+-------------- g1 | exp12 | 3.5 | 1 g2 | exp23 | -3 | 1 g3 | exp12 | 1 | 2 g3 | exp13 | -1.5 | 2 g3 | exp23 | 1.7 | 4 g4 | exp12 | 1.5 | 2 g4 | exp13 | 1.5 | 2 g4 | exp23 | 1.5 | 2 g4 | exp6 | 1.5 | 2 g5 | exp6 | 2 | 1 g5 | exp13 | 2.5 | 2 g6 | exp6 | -3.86 | 3
Experiments experimentid | name | whoperformed | date --------------+--------------------+---------------+------------ exp12 | Systematic Torture | Prof. Pain | 2004-06-02 exp23 | Heaped Abuse | Tommy Student | 2004-06-03 exp13 | Salt Stress | Gasch | 1998-07-04 exp6 | Sorbitol Exposure | Gasch | 1999-07-05
Membership gid | category -----+-------------------- g1 | glutathione g2 | antioxidant g3 | glycine binding g1 | amino acid binding g4 | amino acid binding g5 | amino acid binding g6 | binding
GOTree category | parent_category --------------------+-------------------- antioxidant | molecular function binding | molecular function glutathione | antioxidant glycine binding | amino acid binding amino acid binding | binding
Question 1
editFind the names of experiments performed by Prof. Pain after Jan 1, 2004.
Answer
editSELECT name
FROM Experiments
WHERE whoperformed = 'Prof. Pain'
AND date > '2004-01-01';
Question 2
editFind the names of genes that were either positively expressed twofold or more with a significance of at least 1.0, in some experiment, or negatively expressed twofold or less with a significance of at least 1.0, in some experiment. List them alongside their organisms in a two-column format.
Answer
editSELECT Genes.gid, name, level, significance
FROM Expression, Genes
WHERE Expression.gid = Genes.gid
AND significance >= 1.0
AND (level >= 2.0 OR level <= -2.0);
Question 3
editWhat is the grandparent category of 'glycine binding'?
Answer
editSELECT Parents.parent_category
FROM GOTree as Children, GOTree as Parents
WHERE Children.category = 'glycine binding'
AND Children.parent_category = Parents.category;
Question 4
editFind the names of experiments that were performed before some Gasch experiment.
Straightforward Answer
editSELECT E1.name
FROM Experiments AS E1, Experiments AS E2
WHERE E1.date < E2.date
AND E2.whoperformed = 'Gasch';
Subquery Answer
editSELECT name
FROM Experiments
WHERE Experiments.date < (
SELECT MAX(date)
FROM Experiments
WHERE whoperformed = 'Gasch'
);
Question 5
editFind the names of pine genes that were positively expressed more than 0.5-fold (with a significance of 1.0 or more) in at least two experiments.
Straightforward Answer
editFirst, we must find the experiments where genes are upreglated and significant.
CREATE VIEW Upregulated AS
SELECT gid, experimentid
FROM Expression
WHERE significance >= 1.0
AND level >= 0.5;
Next, we must determine the genes which were upregulated in at least two experiments. We do this by taking the product of the upregulated genes and selecting rows where the gene ID is the same but the experiment ID is different.
CREATE VIEW UpInTwoOrMore AS
SELECT DISTINCT U1.gid AS gid
FROM Upregulated AS U1, Upregulated AS U2
WHERE U1.gid = U2.gid
AND U1.experimentid <> U2.experimentid;
Finally, we determine which of these genes come from pine, and project their names.
SELECT name
FROM Genes, UpInTwoOrMore
WHERE Genes.gid = UpInTwoOrMore.gid
AND organism = 'pine';
We can alternatively do these steps all in one single query:
SELECT DISTINCT name
FROM Genes, Expression as E1, Expression as E2
WHERE Genes.gid = E1.gid
AND E1.gid = E2.gid
AND E1.level >= 0.5
AND E2.level >= 0.5
AND E1.significance >= 1.0
AND E2.significance >= 1.0
AND E1.experimentid <> E2.experimentid
AND organism = 'pine';
Subquery Answer
editAs another approach, we can make use of subqueries to find the answer. The key to this is to make a correlated subquery where the subquery depends on some property (in this case the gene ID) of the parent query. Note that we'll still need to make use of the Upregulated view created above to reduce code redundancy.
SELECT DISTINCT name
FROM Genes, Upregulated as U1
WHERE Genes.gid = U1.gid
AND organism = 'pine'
AND U1.gid = (
SELECT DISTINCT U1.gid
FROM Upregulated as U2
WHERE U1.gid = U2.gid
AND U1.experimentid <> U2.experimentid
);
GROUP BY Answer
editWe can make use of advanced features for certain database systems such as PostgreSQL and MySQL to make these queries in slightly more convenient ways via the GROUP BY and COUNT.
CREATE VIEW UpInTwoOrMore AS
SELECT gid
FROM Expression
WHERE level >= 0.5
AND significance >= 1.0
GROUP BY gid
HAVING COUNT(*) > 1;
SELECT name
FROM Genes, UpInTwoOrMore
WHERE Genes.gid = UpInTwoOrMore.gid
AND organism = 'pine';
Question 6
editFind the names of pine genes that were up-regulated 0.5-fold or more (with a significance of 1.0 or more) in at least three experiments.
Straightforward Answer
editSimilar to the answer for question 5. The caveat here is that while the equality evaluations are transitive, while inequality evaluations are not, and so every case must be covered.
CREATE VIEW UpInThreeOrMore AS
SELECT DISTINCT U1.gid AS gid
FROM Upregulated AS U1, Upregulated AS U2, Upregulated as U3
WHERE U1.gid = U2.gid
AND U1.gid = U3.gid
AND U1.experimentid <> U2.experimentid
AND U1.experimentid <> U3.experimentid
AND U2.experimentid <> U3.experimentid;
SELECT name
FROM Genes, UpInThreeOrMore
WHERE Genes.gid = UpInThreeOrMore.gid
AND organism = 'pine';
Alternatively:
SELECT DISTINCT name
FROM Genes, Expression as E1, Expression as E2, Expression as E3
WHERE Genes.gid = E1.gid
AND E1.gid = E2.gid
AND E1.gid = E3.gid
AND E1.level >= 0.5
AND E2.level >= 0.5
AND E3.level >= 0.5
AND E1.significance >= 1.0
AND E2.significance >= 1.0
AND E3.significance >= 1.0
AND E1.experimentid <> E2.experimentid
AND E1.experimentid <> E3.experimentid
AND E2.experimentid <> E3.experimentid
AND organism = 'pine';
Subquery Answer
editWe need to build in another correlated subquery for our original correlated subquery to make this work.
SELECT DISTINCT name
FROM Genes, Upregulated AS U1
WHERE Genes.gid = U1.gid
AND organism = 'pine'
AND U1.gid = (
SELECT DISTINCT U1.gid
FROM Upregulated AS U2
WHERE U1.gid = U2.gid
AND U1.gid = (
SELECT DISTINCT U1.gid
FROM Upregulated AS U3
WHERE U1.gid = U2.gid
AND U1.gid = U3.gid
AND U1.experimentid <> U2.experimentid
AND U1.experimentid <> U3.experimentid
AND U2.experimentid <> U3.experimentid
)
);
GROUP BY Answer
editSimply adjust the count evaluation.
CREATE VIEW UpInThreeOrMore AS
SELECT gid
FROM Expression
WHERE level >= 0.5
AND significance >= 1.0
GROUP BY gid
HAVING COUNT(*) > 2;
SELECT name
FROM Genes, UpInThreeOrMore
WHERE Genes.gid = UpInThreeOrMore.gid
AND organism = 'pine';
Question 7
editFind the names of pine genes that were up-regulated 0.5-fold or more (with a significance of 1.0 or more) in at exactly two experiments.
Straightforward Answer
editThe key here is identifying that taking the set of genes upregulated in two or more experiments and subtracting the set of genes upregulated in three or mor experiments gives the set of genes upregulated in precisely two experiments. Thus, our answer is the answer to question 5 subtracted by the answer to question 6.
CREATE VIEW UpInTwo AS
SELECT *
FROM UpInTwoOrMore
EXCEPT
SELECT *
FROM UpInThreeOrMore;
SELECT name
FROM Genes, UpInTwo
WHERE Genes.gid = UpInTwo.gid
AND organism = 'pine';
Or alternatively:
SELECT DISTINCT name
FROM Genes, Expression as E1, Expression as E2
WHERE Genes.gid = E1.gid
AND E1.gid = E2.gid
AND E1.level >= 0.5
AND E2.level >= 0.5
AND E1.significance >= 1.0
AND E2.significance >= 1.0
AND E1.experimentid <> E2.experimentid
AND organism = 'pine'
EXCEPT
SELECT DISTINCT name
FROM Genes, Expression as E1, Expression as E2, Expression as E3
WHERE Genes.gid = E1.gid
AND E1.gid = E2.gid
AND E1.gid = E3.gid
AND E1.level >= 0.5
AND E2.level >= 0.5
AND E3.level >= 0.5
AND E1.significance >= 1.0
AND E2.significance >= 1.0
AND E3.significance >= 1.0
AND E1.experimentid <> E2.experimentid
AND E1.experimentid <> E3.experimentid
AND E2.experimentid <> E3.experimentid
AND organism = 'pine';
Subquery Answer
editSELECT DISTINCT name
FROM Genes, Upregulated as U1
WHERE Genes.gid = U1.gid
AND organism = 'pine'
AND U1.gid = (
SELECT DISTINCT U1.gid
FROM Upregulated as U2
WHERE U1.gid = U2.gid
AND U1.experimentid <> U2.experimentid
)
EXCEPT
SELECT DISTINCT name
FROM Genes, Upregulated AS U1
WHERE Genes.gid = U1.gid
AND organism = 'pine'
AND U1.gid = (
SELECT DISTINCT U1.gid
FROM Upregulated AS U2
WHERE U1.gid = U2.gid
AND U1.gid = (
SELECT DISTINCT U1.gid
FROM Upregulated AS U3
WHERE U1.gid = U2.gid
AND U1.gid = U3.gid
AND U1.experimentid <> U2.experimentid
AND U1.experimentid <> U3.experimentid
AND U2.experimentid <> U3.experimentid
)
);
GROUP BY Answer
editCREATE VIEW UpInTwo AS
SELECT *
FROM UpInTwoOrMore
EXCEPT
SELECT *
FROM UpInThreeOrMore;
SELECT name
FROM Genes, UpInTwo
WHERE Genes.gid = UpInTwo.gid
AND organism = 'pine';
Question 8
editFind the genes that showed positive expression in every experiment recorded for it.
SELECT Expression.gid, level from Expression WHERE level>1.0;
Question 9
editWhat is the name of the gene that was most positively expressed in experiment exp23? Assume a minimum significance of 1.0.
Answer [suggested by ChatGPT3]
editHere's one way you could write an SQL query to find the name of the gene that was most positively expressed in experiment exp23:
SELECT Genes.name FROM Genes JOIN Expression ON Genes.gid = Expression.gid WHERE Expression.experimentid = 'exp23' AND Expression.level > 0 AND Expression.significance >= 1 ORDER BY Expression.level DESC LIMIT 1;
This query selects the "name" column from the "Genes" table and join it with the "Expression" table on the "gid" column. It then filters the results based on the conditions provided in the "WHERE" clause. The conditions in the WHERE clause are checking for rows where the experiment id is 'exp23', level is greater than 0 and significance is greater than or equal to 1. This will return all the rows in the "Genes" table where the genes were positively expressed in experiment exp23 with a minimum significance of 1.0. Then the query orders the results by level in descending order and limits the results to 1, which will give the gene with the highest level of expression.
This query will return the name of the gene that was most positively expressed in experiment exp23 with a minimum significance of 1.0.
Question 10
editThis question builds on question 8; what is the name of the gene that was "second most positively expressed"? Assume again a minimum significance of 1.0.
Question 11
editWhich gene(s) were positively expressed in ALL the experiments listed in the Experiments table? No constraints on significance level.
Question 12
editPrepare a table of genes, their annotation, and any experiment in which they were either the highest or lowest expressed (of any significance level). Include a fourth column to say if they were the highest or lowest.