生物信息学数据管理/SQL练习
针对以下问题,我们将考虑以下关于微阵列基因表达数据的表。您的任务是将每个给定的查询用 SQL 表达出来。为了方便起见,这些表中提供了示例数据,但请注意,它们只是示例数据。您的查询必须适用于所有潜在的数据,而不仅仅是这里所示的示例数据。
Genes gid | name | organism | annotation -----+---------+----------+---------------------------- g1 | YLR180C | yeast | hypothetical protein g2 | YLR181D | yeast | response to desiccation g3 | sp15 | yeast | drought stress responsive g4 | pdp77 | pine | putative stress responsive g5 | hsp70 | pine | heat shock protein g6 | hsp90 | pine | heat shock protein
Expression gid | experimentid | level | significance -----+--------------+-------+-------------- g1 | exp12 | 3.5 | 1 g2 | exp23 | -3 | 1 g3 | exp12 | 1 | 2 g3 | exp13 | -1.5 | 2 g3 | exp23 | 1.7 | 4 g4 | exp12 | 1.5 | 2 g4 | exp13 | 1.5 | 2 g4 | exp23 | 1.5 | 2 g4 | exp6 | 1.5 | 2 g5 | exp6 | 2 | 1 g5 | exp13 | 2.5 | 2 g6 | exp6 | -3.86 | 3
Experiments experimentid | name | whoperformed | date --------------+--------------------+---------------+------------ exp12 | Systematic Torture | Prof. Pain | 2004-06-02 exp23 | Heaped Abuse | Tommy Student | 2004-06-03 exp13 | Salt Stress | Gasch | 1998-07-04 exp6 | Sorbitol Exposure | Gasch | 1999-07-05
Membership gid | category -----+-------------------- g1 | glutathione g2 | antioxidant g3 | glycine binding g1 | amino acid binding g4 | amino acid binding g5 | amino acid binding g6 | binding
GOTree category | parent_category --------------------+-------------------- antioxidant | molecular function binding | molecular function glutathione | antioxidant glycine binding | amino acid binding amino acid binding | binding
查找 Pain 教授在 2004 年 1 月 1 日之后进行的实验名称。
SELECT name
FROM Experiments
WHERE whoperformed = 'Prof. Pain'
AND date > '2004-01-01';
查找在某些实验中表达量至少为两倍或更高的基因名称,或在某些实验中表达量至少为两倍或更低的基因名称。将它们与其生物体一起列出,以两列格式显示。
SELECT Genes.gid, name, level, significance
FROM Expression, Genes
WHERE Expression.gid = Genes.gid
AND significance >= 1.0
AND (level >= 2.0 OR level <= -2.0);
'甘氨酸结合' 的祖先类别是什么?
SELECT Parents.parent_category
FROM GOTree as Children, GOTree as Parents
WHERE Children.category = 'glycine binding'
AND Children.parent_category = Parents.category;
查找在某个 Gasch 实验之前进行的实验名称。
SELECT E1.name
FROM Experiments AS E1, Experiments AS E2
WHERE E1.date < E2.date
AND E2.whoperformed = 'Gasch';
SELECT name
FROM Experiments
WHERE Experiments.date < (
SELECT MAX(date)
FROM Experiments
WHERE whoperformed = 'Gasch'
);
查找在至少两个实验中表达量超过 0.5 倍(显著性为 1.0 或更高)的松树基因名称。
首先,我们必须找到基因上调且显著的实验。
CREATE VIEW Upregulated AS
SELECT gid, experimentid
FROM Expression
WHERE significance >= 1.0
AND level >= 0.5;
接下来,我们必须确定在至少两个实验中上调的基因。我们通过获取上调基因的乘积并选择基因 ID 相同但实验 ID 不同的行来做到这一点。
CREATE VIEW UpInTwoOrMore AS
SELECT DISTINCT U1.gid AS gid
FROM Upregulated AS U1, Upregulated AS U2
WHERE U1.gid = U2.gid
AND U1.experimentid <> U2.experimentid;
最后,我们确定这些基因中哪些来自松树,并投影它们的名称。
SELECT name
FROM Genes, UpInTwoOrMore
WHERE Genes.gid = UpInTwoOrMore.gid
AND organism = 'pine';
我们也可以在一个查询中完成所有这些步骤。
SELECT DISTINCT name
FROM Genes, Expression as E1, Expression as E2
WHERE Genes.gid = E1.gid
AND E1.gid = E2.gid
AND E1.level >= 0.5
AND E2.level >= 0.5
AND E1.significance >= 1.0
AND E2.significance >= 1.0
AND E1.experimentid <> E2.experimentid
AND organism = 'pine';
作为另一种方法,我们可以使用子查询来找到答案。关键是要创建一个 **相关子查询**,其中子查询依赖于父查询的某些属性(在本例中为基因 ID)。请注意,我们仍然需要使用上调上面创建的视图以减少代码冗余。
SELECT DISTINCT name
FROM Genes, Upregulated as U1
WHERE Genes.gid = U1.gid
AND organism = 'pine'
AND U1.gid = (
SELECT DISTINCT U1.gid
FROM Upregulated as U2
WHERE U1.gid = U2.gid
AND U1.experimentid <> U2.experimentid
);
我们可以使用某些数据库系统(如 PostgreSQL 和 MySQL)的扩展功能,通过以下方式以更方便的方式执行这些查询GROUP BY和COUNT.
CREATE VIEW UpInTwoOrMore AS
SELECT gid
FROM Expression
WHERE level >= 0.5
AND significance >= 1.0
GROUP BY gid
HAVING COUNT(*) > 1;
SELECT name
FROM Genes, UpInTwoOrMore
WHERE Genes.gid = UpInTwoOrMore.gid
AND organism = 'pine';
查找在至少三个实验中表达量上调 0.5 倍或更高(显著性为 1.0 或更高)的松树基因名称。
类似于问题 5 的答案。这里的关键是,虽然等式评估是可传递的,但不等式评估不是,因此必须涵盖每种情况。
CREATE VIEW UpInThreeOrMore AS
SELECT DISTINCT U1.gid AS gid
FROM Upregulated AS U1, Upregulated AS U2, Upregulated as U3
WHERE U1.gid = U2.gid
AND U1.gid = U3.gid
AND U1.experimentid <> U2.experimentid
AND U1.experimentid <> U3.experimentid
AND U2.experimentid <> U3.experimentid;
SELECT name
FROM Genes, UpInThreeOrMore
WHERE Genes.gid = UpInThreeOrMore.gid
AND organism = 'pine';
或者
SELECT DISTINCT name
FROM Genes, Expression as E1, Expression as E2, Expression as E3
WHERE Genes.gid = E1.gid
AND E1.gid = E2.gid
AND E1.gid = E3.gid
AND E1.level >= 0.5
AND E2.level >= 0.5
AND E3.level >= 0.5
AND E1.significance >= 1.0
AND E2.significance >= 1.0
AND E3.significance >= 1.0
AND E1.experimentid <> E2.experimentid
AND E1.experimentid <> E3.experimentid
AND E2.experimentid <> E3.experimentid
AND organism = 'pine';
我们需要在原始的相关子查询中构建另一个相关子查询才能使它工作。
SELECT DISTINCT name
FROM Genes, Upregulated AS U1
WHERE Genes.gid = U1.gid
AND organism = 'pine'
AND U1.gid = (
SELECT DISTINCT U1.gid
FROM Upregulated AS U2
WHERE U1.gid = U2.gid
AND U1.gid = (
SELECT DISTINCT U1.gid
FROM Upregulated AS U3
WHERE U1.gid = U2.gid
AND U1.gid = U3.gid
AND U1.experimentid <> U2.experimentid
AND U1.experimentid <> U3.experimentid
AND U2.experimentid <> U3.experimentid
)
);
只需调整计数评估。
CREATE VIEW UpInThreeOrMore AS
SELECT gid
FROM Expression
WHERE level >= 0.5
AND significance >= 1.0
GROUP BY gid
HAVING COUNT(*) > 2;
SELECT name
FROM Genes, UpInThreeOrMore
WHERE Genes.gid = UpInThreeOrMore.gid
AND organism = 'pine';
找出在恰好两个实验中上调 0.5 倍或更多(显著性为 1.0 或更多)的松树基因的名称。
这里的关键是识别,将上调于两个或更多个实验的基因集减去上调于三个或更多个实验的基因集,即可得到在恰好两个实验中上调的基因集。因此,我们的答案是问题 5 的答案减去问题 6 的答案。
CREATE VIEW UpInTwo AS
SELECT *
FROM UpInTwoOrMore
EXCEPT
SELECT *
FROM UpInThreeOrMore;
SELECT name
FROM Genes, UpInTwo
WHERE Genes.gid = UpInTwo.gid
AND organism = 'pine';
或者
SELECT DISTINCT name
FROM Genes, Expression as E1, Expression as E2
WHERE Genes.gid = E1.gid
AND E1.gid = E2.gid
AND E1.level >= 0.5
AND E2.level >= 0.5
AND E1.significance >= 1.0
AND E2.significance >= 1.0
AND E1.experimentid <> E2.experimentid
AND organism = 'pine'
EXCEPT
SELECT DISTINCT name
FROM Genes, Expression as E1, Expression as E2, Expression as E3
WHERE Genes.gid = E1.gid
AND E1.gid = E2.gid
AND E1.gid = E3.gid
AND E1.level >= 0.5
AND E2.level >= 0.5
AND E3.level >= 0.5
AND E1.significance >= 1.0
AND E2.significance >= 1.0
AND E3.significance >= 1.0
AND E1.experimentid <> E2.experimentid
AND E1.experimentid <> E3.experimentid
AND E2.experimentid <> E3.experimentid
AND organism = 'pine';
SELECT DISTINCT name
FROM Genes, Upregulated as U1
WHERE Genes.gid = U1.gid
AND organism = 'pine'
AND U1.gid = (
SELECT DISTINCT U1.gid
FROM Upregulated as U2
WHERE U1.gid = U2.gid
AND U1.experimentid <> U2.experimentid
)
EXCEPT
SELECT DISTINCT name
FROM Genes, Upregulated AS U1
WHERE Genes.gid = U1.gid
AND organism = 'pine'
AND U1.gid = (
SELECT DISTINCT U1.gid
FROM Upregulated AS U2
WHERE U1.gid = U2.gid
AND U1.gid = (
SELECT DISTINCT U1.gid
FROM Upregulated AS U3
WHERE U1.gid = U2.gid
AND U1.gid = U3.gid
AND U1.experimentid <> U2.experimentid
AND U1.experimentid <> U3.experimentid
AND U2.experimentid <> U3.experimentid
)
);
CREATE VIEW UpInTwo AS
SELECT *
FROM UpInTwoOrMore
EXCEPT
SELECT *
FROM UpInThreeOrMore;
SELECT name
FROM Genes, UpInTwo
WHERE Genes.gid = UpInTwo.gid
AND organism = 'pine';
找出在所有记录的实验中表现出阳性表达的基因。
SELECT Expression.gid, level from Expression WHERE level>1.0;
在实验 exp23 中表达最正面的基因的名称是什么?假设最小显著性为 1.0。
以下是一种编写 SQL 查询以查找在实验 exp23 中表达最正面的基因名称的方法
SELECT Genes.name FROM Genes JOIN Expression ON Genes.gid = Expression.gid WHERE Expression.experimentid = 'exp23' AND Expression.level > 0 AND Expression.significance >= 1 ORDER BY Expression.level DESC LIMIT 1;
此查询从 "Genes" 表中选择 "name" 列,并根据 "gid" 列将其与 "Expression" 表联接。然后,它根据 "WHERE" 子句中提供的条件过滤结果。WHERE 子句中的条件检查实验 ID 是否为 'exp23',水平是否大于 0,显著性是否大于或等于 1。这将返回 "Genes" 表中所有在实验 exp23 中以至少 1.0 的显著性水平正向表达的基因的行。然后,查询按水平降序对结果进行排序,并将结果限制为 1,这将给出表达水平最高的基因。
此查询将返回在实验 exp23 中以至少 1.0 的显著性水平表达最正面的基因的名称。
这个问题建立在问题 8 的基础上;表达“第二正面”的基因的名称是什么?再次假设最小显著性为 1.0。
哪些基因在“Experiments”表中列出的所有实验中都表现出阳性表达?对显著性水平没有限制。
准备一个基因表,其中包含它们的注释以及任何在其中表达最高或最低(任何显著性水平)的实验。包含第四列来表明它们是最高还是最低。