[Bioinformatics] Sequence alignment with DIAMOND

Latest recommended article published on 2024-11-14 16:24:03

Eagle_Data

License: CC 4.0 BY-SA

Article link: https://blog.csdn.net/missinghead/article/details/135703047

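This article introduces DIAMOND, a fast sequence aligner designed for large-scale sequence data analysis. It walks through installation, building a database, and running blastp for protein sequence alignment, with particular attention to choosing a sensitivity mode and reading the alignment output.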

DIAMOND is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of large sequence data sets.

Official documentation: Home · bbuchfink/diamond Wiki (github.com)

1 Installing DIAMOND

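```
# Create a conda environment named diamond and install DIAMOND into it
# (the conda-forge and bioconda channels are given explicitly in case they are not already configured)
conda create --name diamond -c conda-forge -c bioconda diamond
# Activate the diamond environment
conda activate diamond
# Check the DIAMOND version
diamond --version
```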

2 Protein alignment

Download the example data set, a FASTA file containing 14,323 protein sequences:

```
wget https://scop.berkeley.edu/downloads/scopeseq-2.07/astral-scopedom-seqres-gd-sel-gs-bib-40-2.07.fa
```
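
Before building a database it can be worth confirming that the download is complete. The sketch below counts FASTA records by counting header lines (lines starting with `>`). The helper name `count_fasta_records` and the tiny demo file are made up for illustration; they are not part of DIAMOND.

```shell
# Hypothetical helper: count FASTA records by counting lines that start with ">".
count_fasta_records() {
    grep -c '^>' "$1"
}

# A tiny demo file standing in for the real download (illustration only).
demo_fa="$(mktemp)"
cat > "$demo_fa" <<'EOF'
>seq1 demo
MKV
>seq2 demo
GAT
LLP
>seq3 demo
MNQ
EOF

# Three header lines, so this prints 3.
count_fasta_records "$demo_fa"
```

Run against the downloaded file (`count_fasta_records astral-scopedom-seqres-gd-sel-gs-bib-40-2.07.fa`), the count should match the 14,323 sequences mentioned above if the download is intact.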

Next, use diamond makedb to convert the downloaded file into a DIAMOND database file, which is used for the alignment steps that follow.
```
diamond makedb --in astral-scopedom-seqres-gd-sel-gs-bib-40-2.07.fa -d astral40
```
Then run the search, using the same file as the query:
```
diamond blastp -q astral-scopedom-seqres-gd-sel-gs-bib-40-2.07.fa -d astral40 -o out.tsv --very-sensitive
```
Parameters:

-q  the query file (FASTA)

-d  the database file generated in the previous step

-o  the output file that receives the search results
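
A small pre-flight check can catch a wrong path before a long search starts. The sketch below assumes the database was built with `-d astral40` as above, in which case makedb writes `astral40.dmnd`. The function name `check_inputs` is hypothetical.

```shell
# Hypothetical helper: verify the query file and the DIAMOND database exist.
# $1 = query FASTA file, $2 = database name as passed to -d (without .dmnd)
check_inputs() {
    query="$1"
    db="$2"
    if [ ! -s "$query" ]; then
        echo "missing or empty query file: $query" >&2
        return 1
    fi
    if [ ! -f "${db}.dmnd" ]; then
        echo "missing database file: ${db}.dmnd" >&2
        return 1
    fi
    return 0
}

# File names follow the tutorial; adjust them to your own data.
if check_inputs astral-scopedom-seqres-gd-sel-gs-bib-40-2.07.fa astral40; then
    echo "inputs found, ready to run diamond blastp"
else
    echo "inputs not found, download the data and run diamond makedb first"
fi
```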

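DIAMOND offers several sensitivity settings for different applications. The default mode is the fastest and is tailored to finding homologs with >70% sequence identity; --sensitive mode is tailored to hits with >40% identity; --very-sensitive and --ultra-sensitive provide high sensitivity across the whole range of pairwise alignments. Higher sensitivity recovers more distant homologs, at the cost of a longer run time.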
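
One way to see the effect of these modes is to run the same search once per mode and compare the number of reported alignments. The sketch below only prints the commands by default (a dry run); the variable `DRY_RUN`, the function `run_mode`, and the `out_<mode>.tsv` file names are assumptions made for this example.

```shell
# File and database names follow the tutorial above.
QUERY="astral-scopedom-seqres-gd-sel-gs-bib-40-2.07.fa"
DB="astral40"

# DRY_RUN=1 (the default here) only prints each command; set DRY_RUN=0 to run DIAMOND.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: build and run (or print) one search for the given mode.
run_mode() {
    mode="$1"
    out="out_${mode}.tsv"
    set -- diamond blastp -q "$QUERY" -d "$DB" -o "$out"
    # The default mode takes no flag; the others are passed as --<mode>.
    if [ "$mode" != "default" ]; then
        set -- "$@" "--$mode"
    fi
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@" && echo "$mode: $(wc -l < "$out") alignments"
    fi
}

for m in default sensitive very-sensitive ultra-sensitive; do
    run_mode "$m"
done
```

With `DRY_RUN=0`, each mode writes its own output file, so the line counts can be compared directly.
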
Interpreting the results

Partial output:
```
d1dlwa_ d1dlwa_ 100     116     0       0       1       116     1       116     6.42e-77        220
d1dlwa_ d2gkma_ 35.4    113     73      0       1       113     13      125     1.43e-21        80.9
d1dlwa_ d4i0va_ 31.9    119     75      2       1       113     2       120     9.11e-13        58.2
d2gkma_ d2gkma_ 100     127     0       0       1       127     1       127     1.51e-87        248
d2gkma_ d1dlwa_ 34.8    115     75      0       13      127     1       115     6.90e-23        84.3
d2gkma_ d4i0va_ 33.6    110     69      1       13      118     2       111     1.35e-18        73.6
d2gkma_ d6bmea_ 35.5    110     67      1       13      118     2       111     1.32e-16        68.6
d2gkma_ d2bkma_ 37.3    67      38      2       13      76      5       70      5.18e-06        40.8
d1ngka_ d1ngka_ 100     126     0       0       1       126     1       126     4.34e-91        257
d1ngka_ d2bkma_ 38.4    125     73      2       1       125     4       124     1.42e-24        89.0
```
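
The output file has no header row. A minimal sketch for adding one is shown below, using the standard field names of the BLAST tabular format (`qseqid`, `sseqid`, `pident`, and so on) for the 12 default columns. The helper name `add_header` and the one-line demo file are made up for illustration.

```shell
# Hypothetical helper: print a header row followed by the contents of the TSV file.
add_header() {
    printf 'qseqid\tsseqid\tpident\tlength\tmismatch\tgapopen\tqstart\tqend\tsstart\tsend\tevalue\tbitscore\n'
    cat "$1"
}

# One row copied from the output above, standing in for out.tsv.
demo_hits="$(mktemp)"
printf 'd1dlwa_\td2gkma_\t35.4\t113\t73\t0\t1\t113\t13\t125\t1.43e-21\t80.9\n' > "$demo_hits"

add_header "$demo_hits"
```

For the real file, `add_header out.tsv > out_with_header.tsv` keeps the original output untouched.
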
Meaning of each column:
1. Query accession: the accession of the sequence that was the search query against the database, as specified in the input FASTA file after the > character until the first blank.
2. Target accession: the accession of the target database sequence (also called subject) that the query was aligned against.
3. Sequence identity: The percentage of identical amino acid residues that were aligned against each other in the local alignment.
4. Length: The total length of the local alignment, which includes matching and mismatching positions of query and subject, as well as gap positions in the query and subject.
5. Mismatches: The number of non-identical amino acid residues aligned against each other.
6. Gap openings: The number of gap openings.
7. Query start: The starting coordinate of the local alignment in the query (1-based).
8. Query end: The ending coordinate of the local alignment in the query (1-based).
9. Target start: The starting coordinate of the local alignment in the target (1-based).
10. Target end: The ending coordinate of the local alignment in the target (1-based).
11. E-value: The expected value of the hit quantifies the number of alignments of similar or better quality that you expect to find searching this query against a database of random sequences the same size as the actual target database. This number is most useful for measuring the significance of a hit. By default, DIAMOND will report all alignments with e-value < 0.001, meaning that a hit of this quality will be found by chance on average once per 1,000 queries.
12. Bit score: The bit score is a scoring matrix independent measure of the (local) similarity of the two aligned sequences, with higher numbers meaning more similar. It is always >= 0 for local Smith Waterman alignments.
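
The column positions above make it straightforward to filter the output with awk. The sketch below drops self-hits (column 1 equal to column 2) and keeps only alignments above an identity threshold (column 3) and below an e-value threshold (column 11). The function name `filter_hits` and the thresholds are arbitrary choices for illustration; the demo rows are copied from the output shown earlier.

```shell
# Hypothetical helper.
# $1 = TSV file, $2 = minimum percent identity, $3 = maximum e-value
filter_hits() {
    awk -F '\t' -v min_id="$2" -v max_e="$3" '
        # keep non-self hits that pass both thresholds (+0 forces numeric comparison)
        $1 != $2 && $3 + 0 >= min_id + 0 && $11 + 0 <= max_e + 0
    ' "$1"
}

# Four rows from the output above, written as a tab-separated demo file.
demo_tsv="$(mktemp)"
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    d1dlwa_ d1dlwa_ 100  116 0  0 1  116 1  116 6.42e-77 220 \
    d1dlwa_ d2gkma_ 35.4 113 73 0 1  113 13 125 1.43e-21 80.9 \
    d1dlwa_ d4i0va_ 31.9 119 75 2 1  113 2  120 9.11e-13 58.2 \
    d2gkma_ d2bkma_ 37.3 67  38 2 13 76  5  70  5.18e-06 40.8 \
    > "$demo_tsv"

# Keep non-self hits with identity >= 35% and e-value <= 1e-10.
filter_hits "$demo_tsv" 35 1e-10
```

Of the four demo rows, only the d1dlwa_ versus d2gkma_ alignment passes: the first row is a self-hit, the third falls below 35% identity, and the fourth has an e-value above 1e-10.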