下一代测序 (NGS)/Velvet

Velvet 实践：第 1 部分

本实践将涵盖以下内容

编译 Velvet
单端
K-mer 长度
覆盖率截止值
整个基因组序列作为输入？？？

准备环境

首先确保您位于主目录，方法是键入

cd

并通过键入以下内容绝对确保您在那里

pwd

现在为本练习和其他两个 Velvet 实践创建子目录。所有这些目录都将作为整个课程名为 NGS 的目录的子目录创建。为此，您可以使用以下命令

mkdir -p NGS/velvet/{part1,part2,part3}

# The -p tells mkdir (make directory) not to worry if a parent directory is missing.
# That is, if a sub-directory cannot be made because its parent directory does not exist,
# just make the parent directory  first rather than reporting an error.
# The “one at a time” approach would be:
mkdir NGS
mkdir NGS/velvet
mkdir NGS/velvet/part1
mkdir NGS/velvet/part2
mkdir NGS/velvet/part3

创建目录后，检查结构并通过键入以下内容移动到准备进行第一个 Velvet 练习的目录中

ls -R NGS;

cd NGS/velvet/part1; pwd;

下载并编译 Velvet

您可以在以下位置找到最新版本的 Velvet

http://www.ebi.ac.uk/~zerbino/velvet/

您可以访问此 URL 并下载最新版本的 Velvet，或者等效地，您可以键入以下内容，这将下载、解压缩、检查、编译并执行您本地编译的 Velvet 版本

cd ~/NGS/velvet/part1; pwd;

cp ~/NGS/Data/velvet_1.2.07.tgz . ;

tar xzf ~/NGS/Data/velvet_1.2.07.tgz;

ls -R; cd velvet_1.2.07;

make velveth velvetg;

./velveth'''

查看您创建的可执行文件。它们将通过以下命令显示为绿色

ls --color=always;

The switch --color, instructs that files be coloured according to their type.
This is often the default. Here we are just being cautious. 
The =always is included as colouring is usually suppressed in scripts.
If you run this exercise using the script provided, just --color would not be enough.
--color=always insists on file colouring even from a script.

查看命令生成的输出，您将看到以下参数传递到编译器中

“MAXKMERLENGTH=31” 和 “CATEGORIES=2”

这表示默认编译设置为最大大小为 31 的 De Bruijn 图 KMER，并允许最多 2 个读取类别。您可以使用命令行参数覆盖这些和其他默认配置选项。假设您想使用 3 个类别运行 KMER 长度为 41 的 Velvet，需要通过键入以下内容重新编译 Velvet 来启用此功能

make clean; make velveth velvetg MAXKMERLENGTH=41 CATEGORIES=3; ./velveth

Velvet 也可用于处理 SOLID 颜色空间数据。为此，您需要另一个 make 参数。使用以下命令清除上次编译并尝试以下参数

make clean; make MAXKMERLENGTH=41 CATEGORIES=3 color ./velveth_de

有关 Velvet 编译和运行时参数的更多说明，请参阅 Velvet 手册：https://github.com/dzerbino/velvet/wiki/Manual

单端读取组装

The data you will examine is from Staphylococcus aureus USA300 which has a genome of around 3MB.
The reads are  Illumina and are unpaired, also known as single-end library. 
Even though you have carefully installed velvet in your own  workspace, we will use a pre-installed version. 
The data needed for this section can be obtained from the Sequence Read Archive (SRA). 
For the following example use the run data SRR022825 and SRR022823 from the SRA Sample SRS004748.
The SRA experiment could be viewed by setting your browser to the URL:
	
http://www.ebi.ac.uk/ena/data/view/SRS004748

以下练习重点介绍使用单端读取的 Velvet，可用参数如何影响组装以及如何衡量和比较变化。首先，先回到您为本练习准备的目录，创建一个新文件夹，为本部分命名一个合适的名称，然后进入该文件夹。从互联网下载文件的命令将是

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR022/SRR022825/SRR022825.fastq.gz

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR022/SRR022823/SRR022823.fastq.gz

或者，如果您在本地安装了这些文件，只需创建指向这些文件的软链接。继续复制（或键入）

<syntax highlight="text> cd ~/NGS/velvet/part1 mkdir SRS004748 cd SRS004748 pwd ln -s ~/NGS/Data/SRR022825.fastq.gz . ln -s ~/NGS/Data/SRR022823.fastq.gz . ls -l </syntaxhighlight>

You are ready to process your data with velvet, which is a compressed fastq file. Velvet has two main components:
velveth -used to construct, from raw read data, a dataset organised in the fashion expected by the second component, velvetg.
velvetg -the core of velvet where the de Bruijn graph assembly is built and manipulated.
You can always get further information about the usage of both velvet programs by typing velvetg or velveth in your terminal.

现在使用以下选项为 SRR022825.fastq.gz 和 SRR022823.fastq.gz 中的读取运行 velveth

- 25 的 De Bruijn 图 k-mer

- 名为 run_25 的输出目录

velveth run_25 25 -fastq.gz -short SRR022825.fastq.gz SRR022823.fastq.gz

velveth 自行运行一段时间，最后在输出目录中生成一些文件。进入输出目录 run_25，看看 velveth 到目前为止做了什么。UNIX 命令 less 允许您查看输出文件（按 q 退出）。以防您仍然需要提示

cd run_25; 
ls -l;
head Sequences;

现在向上移动一个目录级别，并使用以下命令在您的输出目录上运行 velvetg

cd ..
time velvetg run_25

回到您的结果目录以检查 velvetg 的效果

cd run_25; ls -l;

为您：运行上面的命令后

Q1：您在 run_25 文件夹中看到了哪些额外文件？
Q2：您认为它们可能代表什么？
Q3：在 run_25 中的 Log 文件中，N50 是多少？

N50 statistic: Broadly, it is the median (not average) of a sorted data set using the length of a set of sequences.
Usually it is the length of the contig whose length.
When added to the length of all longer contigs, makes a total greater that half the sum of the lengths of all contigs. 
Easy, but messy – a more formal definition can be found here:
	
http://www.broadinstitute.org/crd/wiki/index.php/N50

备份 contigs.fa 文件，并使用以下命令计算 N50（以及 N25、N75）值

cp contigs.fa contigs.fa.0

您现在尝试

gnx -min 100 -nx 25,50,75 contigs.fa

Q4. N50 值是否与 Log 文件中存储的值一致？
Q5. 如果没有，您认为可能是什么原因？

In order to improve our results, take a closer look at the standard options of velvetg by typing 'velvetg' without parameters.
For the moment focus on the two options -cov_cutoff and -exp_cov. 
Clearly -cov_cutoff will allow you to exclude contigs for which the kmer coverage is low, implying unacceptably poor quality. 
The -exp_cov switch is used to give velvetg an idea of the coverage to expect. 
If the expected coverage of any contig is substantially in excess of the suggested expected value,
maybe this would indicate a repeat. 
For further details of how to choose the parameters, go to 'Choice of a coverage cutoff': 	
  http://wiki.github.com/dzerbino/velvet/

简而言之，每个重叠群的 Kmer 覆盖率（以及更多信息）都存储在 stats.txt 文件中，可与 R 一起使用以可视化覆盖率分布。查看 stats.txt 文件，启动 R，加载并使用以下命令可视化数据

R
library(plotrix) 
data <- read.table("stats.txt", header=TRUE)
x11()
weighted.hist(data$short1_cov, data$lgth, breaks=0:50)

加权直方图是可视化覆盖率信息的更好方法，因为存在噪音（许多非常短的重叠群）。您可以在下面看到示例输出

选择预期覆盖率和覆盖率截止值后，您可以通过键入以下内容退出 R

q()
n

加权直方图表明，预期覆盖率约为 14，而低于 6 的所有内容可能都是噪音。

覆盖率也约为 20、30 和大于 50，这可能是污染或重复（取决于数据集），但目前您不必担心。

要查看改进，请首先使用 -cov_cutoff 6 重新运行 velvetg，并检查 N50 后仅使用/添加 -exp_cov 14 到命令行选项。

还要保留 contigs 文件的副本以供比较

 cd ~/NGS/velvet/part1/SRS004748
time velvetg run_25 -cov_cutoff 6
# Make a copy of the run
cp run_25/contigs.fa run_25/contigs.fa.1

time velvetg run_25 -exp_cov 14
cp run_25/contigs.fa run_25/contigs.fa.2

time velvetg run_25 -cov_cutoff 6 -exp_cov 14
cp run_25/contigs.fa run_25/contigs.fa.3

Q6. 没有参数时的 N50 是多少？
Q7. -cov_cutoff 6 时的 N50 是多少？
Q8. -exp_cov 14 时的 N50 是多少？
Q9. -cov_cutoff 6 -exp_cov 14 时的 N50 是多少？
Q10 您是否注意到 velvetg 运行时间有所变化？如果是，您能解释一下原因吗？

您一直在使用给定的 -exp_cov 和 -cov_cutoff 参数运行 velvetg。现在尝试使用不同的截止值、预期参数进行实验，并探索其他设置（例如 -max_coverage、-max_branch_length、-unused_reads、-amos_file、-read_trkg 或查看 velvetg 帮助菜单）。

In particular, look at the -amos_file parameter which instructs velvetg to create a version of the assembly that can be
processed and viewed with a program called AMOS Hawkeye. 
Another  program, called tablet, can also understand and display the AMOS file. 
For now, we will take a look at tablet.

仅使用 -cov_cutoff 6 运行 velvetg，但请求 amos 文件

velvetg run_25 -cov_cutoff 6 -amos_file yes

现在使用 tablet 快速查看组装

tablet run_25/velvet_asm.afg &

velvet_asm.afg 是 velvetg 对包含 -amos_file yes 开关的响应而生成的。tablet 将占用相当多的内存，您可以安全地忽略 tablet 的抱怨。文件加载后，选择其中一个较长的重叠群。注意缺少读取信息和重叠群显示的一维性。在您已经看到足够的信息后，关闭 tablet。现在重新运行 velvetg，添加 -exp_cov 7 额外参数，无需保存任何文件，因为 amos 文件需要更改。现在重新运行 velvetg，添加 -exp_cov 14 额外参数，无需保存任何文件，因为 AMOS 文件需要更改。

velvetg run_25 -cov_cutoff 6 -exp_cov 14 -amos_file yes

再次使用 tablet 查看 amos 文件

tablet run_25/velvet_asm.afg &

与之前一样，选择一个较长的重叠群，并探索显示选项，然后关闭 tablet。为您

Q11. 您认为为什么第二次使用 tablet 时只有读取信息？
Q12. 您可能已经注意到，-exp_cov 14 开启时，velvetg 运行时间更长。这有道理吗？

如果你想进一步探索天鹅绒的行为，你可以尝试以下操作。

减少序列覆盖率，只选择一个输入文件进行组装，例如SRR022825.fastq.gz。

增加序列覆盖率，从同一个ENA样本SRS004748下载更多文件 (http://www.ebi.ac.uk/ena/data/view/SRS004748）。

针对你：N50 如何变化

Q13. 减少覆盖率？
Q14. 增加覆盖率？