Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
1 parent d9411bd, commit 5a6c2b9
Showing 6 changed files with 143 additions and 5 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
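@@ -1,6 +1,36 @@
# Downloading Sequencing Data from Public Databases on the Command Line

!!! note "Overview"
!!! Abstract "Overview"

Many analyses rely on genomes that others have submitted to public databases, so the genome data must be downloaded first. The conventional approach is to search through the NCBI Entrez or EBI web interface and then download the retrieved results as files.

For large-scale downloads, work on the server command line instead and build an automated workflow so that data can be synchronized unattended. This section describes how to obtain data from the NCBI/EBI/DDBJ public databases on the server command line.
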
## Basic Concepts

### 1. NCBI

Genome-related databases commonly used at NCBI:

- GenBank
- Assembly
- SRA
- GEO

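## About Download Speed

Download speed varies considerably by region and time of day. Overall, the peak speeds observed on our network connection in mainland China are roughly as follows:

- EBI ENA via `ascp`: peak 10~30 Mb/s
- AWS `s3 cp`: peak ~1 Mb/s
- NCBI over HTTP with wget/curl/aria2c: speed differs noticeably between tools, but overall it stays below 1 MB/s and is stable at about 100 KB/s most of the time
- `prefetch` over HTTP: the slowest, ~10 KB/s
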
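!!! Tip "Pros vs. Cons"

Downloading from ENA with `ascp` is currently the preferred option. The drawback is that ENA is not fully stable: connection failures and authentication errors are common. When the cause is on the server side, the download can usually continue after waiting a while, but this requires manual intervention.

ENA serves FASTQ files directly, which avoids the I/O and CPU cost of dumping FASTQ from SRA files. However, ENA downloads are easily interrupted, leaving incomplete files, and unlike SRA files, FASTQ files cannot be checked with `vdb-validate`. The MD5 checksums of the FASTQ files therefore have to be fetched from the ENA reports API.
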
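
The checksum step described above could be sketched roughly as follows. This is a minimal sketch, not a tested workflow: it assumes the ENA Portal API `filereport` endpoint with the `fastq_ftp` and `fastq_md5` fields (both should be checked against the current ENA documentation), and the function names `ena_filereport_url` and `filereport_to_md5list` are hypothetical helpers introduced only for illustration.

```shell
# Build the (assumed) ENA Portal API filereport URL for one run accession.
ena_filereport_url() {
  local acc="$1"
  printf 'https://www.ebi.ac.uk/ena/portal/api/filereport?accession=%s&result=read_run&fields=fastq_ftp,fastq_md5&format=tsv\n' "$acc"
}

# Convert a filereport TSV (read from stdin) into md5sum-style lines:
#   <md5><two spaces><file name>
# Columns are located by header name; multiple files per run are
# separated by ";" in both the fastq_ftp and fastq_md5 columns.
filereport_to_md5list() {
  awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    {
      n = split($(col["fastq_ftp"]), urls, ";")
      split($(col["fastq_md5"]), sums, ";")
      for (j = 1; j <= n; j++) {
        m = split(urls[j], parts, "/")   # keep only the file name
        print sums[j] "  " parts[m]
      }
    }'
}

# Example usage (requires network access, so it is left commented out):
#   curl -fsSL "$(ena_filereport_url SRR17840141)" | filereport_to_md5list > SRR17840141.md5
#   md5sum -c SRR17840141.md5
```

Running `md5sum -c` in the directory that holds the downloaded FASTQ files then reports each file as OK or FAILED, so truncated downloads can be detected and fetched again.
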
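@@ -1,2 +1,63 @@
# Downloading Raw Sequencing Data

!!! Abstract "Overview"

This section covers downloading raw high-throughput sequencing data from NCBI/EBI/DDBJ, such as SRA files or raw reads stored as fastq.gz.

## Kingfisher

### Installing Kingfisher

`aspera-cli` is the Aspera command-line tool developed by IBM, and `ascli` is the command it provides. The package also bundles tools such as `ascp` and curl, so installing it is enough to download data from Aspera servers on the command line. The key file used by the v4 release is named `aspera_bypass_dsa.pem`.
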
#### Using a conda virtual environment

```shell
# Create and activate the virtual environment
$ mamba create -n getraw
$ mamba activate getraw
# Install Kingfisher
(getraw)$ mamba install kingfisher

# Downloading via Aspera requires aspera-cli
(getraw)$ mamba install aspera-cli
# Downloading from the AWS cloud requires awscli
(getraw)$ mamba install awscli
# Downloading from Google Cloud requires an additional package
(getraw)$ mamba
```

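!!! tip "Notes"

If the conda environment has the bioconda and conda-forge channels configured by default, installing `aspera-cli` resolves to v4, in which IBM rewrote the control interface in Ruby. Downloads can be started with the `ascli` tool, or directly with the `ascp` program. The earlier v3 release has to be installed from the `hcc` channel (`mamba install -c hcc aspera-cli`).

The biggest difference between v3 and v4 is the name of the key file they use: `aspera_bypass_dsa.pem` (v4) versus `asperaweb_id_dsa.openssh` (v3). Kingfisher ships the `.openssh` file directly inside its own package.
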
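
Because the key file name depends on the installed version, it can help to check which key files are actually present in the environment. The following is a small sketch under the assumption that the files live somewhere below the active conda environment (`$CONDA_PREFIX`); the function name `find_aspera_keys` is hypothetical.

```shell
# List Aspera key files below a directory tree.
# Defaults to the active conda environment, or the current directory
# when no environment is active. Only the two file names mentioned
# in the note above are searched for.
find_aspera_keys() {
  local root="${1:-${CONDA_PREFIX:-.}}"
  find "$root" -type f \
    \( -name 'aspera_bypass_dsa.pem' -o -name 'asperaweb_id_dsa.openssh' \) \
    2>/dev/null
}

# Example usage after `mamba activate getraw`:
#   find_aspera_keys
```

The path printed here is the one to pass to `ascp -i` when calling `ascp` directly.
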
#### Using Docker

```shell
#
$

```

### Downloading with Kingfisher

```shell
# Download the run with SRA accession SRR17840141 from ENA using Aspera
(getraw)$ kingfisher get -r SRR17840141 -m ena-ascp
```

### Kingfisher Issues and Limitations

1. When downloading in batch with `--run-identifiers-list` in `ena-ascp` mode, errors reported on STDERR occur easily and abort the whole download. Therefore

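
One possible way to keep a single failed run from stopping a whole batch is to call `kingfisher get` once per accession and retry each run on its own. The sketch below is only an illustration of that idea, not necessarily the approach the author had in mind: `download_one` and `download_list` are hypothetical helper names, and the retry count and wait time are arbitrary defaults.

```shell
# Download a single run, using the same command shown earlier in this section.
download_one() {
  kingfisher get -r "$1" -m ena-ascp
}

# Download every accession listed in a file (one per line).
# Each run is retried up to max_tries times, and a run that keeps failing
# is reported on stderr without aborting the remaining runs.
# Returns 1 if any run failed, 0 otherwise.
download_list() {
  local list="$1" max_tries="${2:-3}" wait_s="${3:-60}"
  local acc try failed=0
  while IFS= read -r acc || [ -n "$acc" ]; do
    [ -z "$acc" ] && continue          # skip blank lines
    try=1
    until download_one "$acc" </dev/null; do
      if [ "$try" -ge "$max_tries" ]; then
        echo "FAILED $acc" >&2
        failed=$((failed + 1))
        break
      fi
      try=$((try + 1))
      sleep "$wait_s"                  # give the server time to recover
    done
  done < "$list"
  return "$((failed > 0))"
}

# Example usage:
#   download_list runs.txt 3 60
```

The accessions printed as `FAILED` can be collected into a new list and retried later, which fits the observation that server-side problems often clear up after a while.
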
---

## nf-core/fetchngs
@@ -1,2 +1,49 @@
# Downloading BioSample Information

!!! Abstract "Overview"

After obtaining a genome, you often want the strain's metadata, which means looking up the strain record linked to that genome in the BioSample database.

Start by retrieving the BioSample records of a known strain.

```bash
# Search BioSample for strain ATCC25922
$ esearch -db biosample -query "ATCC25922"
<ENTREZ_DIRECT>
  <Db>biosample</Db>
  <WebEnv>MCID_661d37307a915242cb1612a2</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>42</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>
```

The search returns 42 records. Print them in plain-text format to inspect them.

```bash
# Get plain-text output
$ esearch -db biosample -query "ATCC25922" | efetch -format txt
1: OneHealth
Identifiers: BioSample: SAMN40301290; Sample name: WVDA_M07713_Ecoli_ATCC25922; SRA: SRS20722933
Organism: Escherichia coli
Attributes:
    /strain="Escherichia coli_Lot 705489"
    /collected by="West Virginia Department of Agriculture"
    /collection date="2024-02-27"
    /geographic location="USA:WV"
    /isolation source="new culture swab"
    /source type="other"
    /purpose of sampling="research [GENEPIO:0100003]"
    /project name="GenomeTrakr; LFFM-FY4"
    /Interagency Food Safety Analytics Collaboration (IFSAC) category="clinical/research"
    /sequenced by="West Virginia Department of Agriculture"
    /LexMaprStandardizedIsolationSource="new culture swab"
Accession: SAMN40301290 ID: 40301290
...
```

Select the genome of interest and download its BioSample information.

```bash
$ esearch
```
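
To compare records more easily, the plain-text output shown earlier can be flattened into a table. The sketch below assumes the text layout printed above (attribute lines of the form `/key="value"` followed by an `Accession:` line that closes each record); `biosample_txt_to_tsv` is a hypothetical helper name, and the parsing has not been checked against every record type.

```bash
# Convert efetch plain-text BioSample records (stdin) into TSV rows:
#   <accession> <TAB> <attribute name> <TAB> <attribute value>
# Attributes are buffered until the record's "Accession:" line is seen.
biosample_txt_to_tsv() {
  awk '
    /^[[:space:]]*\// {
      line = $0
      sub(/^[[:space:]]*\//, "", line)     # drop indentation and leading "/"
      sub(/[[:space:]]+$/, "", line)       # drop trailing whitespace
      eq = index(line, "=\"")
      if (eq == 0) next
      key = substr(line, 1, eq - 1)
      val = substr(line, eq + 2)
      sub(/"$/, "", val)                   # drop the closing quote
      attrs[++n] = key "\t" val
      next
    }
    /^[[:space:]]*Accession:/ {
      for (i = 1; i <= n; i++) print $2 "\t" attrs[i]
      n = 0
    }'
}

# Example usage (requires EDirect and network access):
#   esearch -db biosample -query "ATCC25922" | efetch -format txt | biosample_txt_to_tsv > atcc25922.tsv
```

The resulting three-column table can be filtered with `grep` or `awk`, for example to list the `collection date` of every matching sample before deciding which genome to keep.
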