FASTQ format

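Each sequence in a FASTQ file normally takes four lines:
1: Begins with a '@' character, followed by a sequence identifier and an optional description (like a FASTA title line).
2: The raw sequence.
3: Begins with a '+' character, optionally followed by the same sequence identifier and description.
4: The quality values for the sequence on line 2, one symbol per base.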
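
The four-line layout above can be checked mechanically. The following is a minimal sketch, assuming strictly four-line records (no wrapped sequence or quality lines); fastq_check is a hypothetical helper name, not part of any existing tool. Lines are classified by position rather than by first character, because a quality line may itself begin with '@'.

```shell
# fastq_check (hypothetical helper): reads FASTQ on stdin, prints the record
# count, and returns non-zero on the first structural problem it finds.
# Assumption: every record is exactly four lines long.
fastq_check() {
  awk '
    NR % 4 == 1 && substr($0, 1, 1) != "@" { print "line " NR ": expected @"; bad = 1; exit }
    NR % 4 == 2 { seqlen = length($0) }
    NR % 4 == 3 && substr($0, 1, 1) != "+" { print "line " NR ": expected +"; bad = 1; exit }
    NR % 4 == 0 && length($0) != seqlen { print "line " NR ": quality length differs from sequence length"; bad = 1; exit }
    END {
      if (bad) exit 1
      if (NR % 4 != 0) { print "truncated record at end of input"; exit 1 }
      printf "%d records\n", NR / 4
    }
  '
}

# Example call on a two-record inline sample.
printf '@r1\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n' | fastq_check
```

For a compressed file the same function can sit at the end of a pipe, for example zcat input_file.fastq.gz | fastq_check.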

Illumina sequence identifiers

@HWUSI-EAS100R:6:73:941:1973#0/1

sequence identifier    description
HWUSI-EAS100R          the unique instrument name
6                      flowcell lane
73                     tile number within the flowcell lane
941                    'x'-coordinate of the cluster within the tile
1973                   'y'-coordinate of the cluster within the tile
#0                     index number for a multiplexed sample (0 for no indexing)
/1                     the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
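
The fields in the table can be pulled apart with awk by splitting on ':', '#', and '/'. This is a rough sketch, assuming the identifier always carries both the '#index' and '/pair' suffixes; parse_illumina_id and the key names it prints are illustrative choices, not an established convention.

```shell
# parse_illumina_id (hypothetical helper): prints one key=value line per field
# of an old-style Illumina identifier passed as the first argument.
# Assumption: the identifier has the form @instrument:lane:tile:x:y#index/pair.
parse_illumina_id() {
  printf '%s\n' "$1" | awk -F'[#/:]' '{
    printf "instrument=%s\n", substr($1, 2)   # drop the leading "@"
    printf "lane=%s\n", $2
    printf "tile=%s\n", $3
    printf "x=%s\n", $4
    printf "y=%s\n", $5
    printf "index=%s\n", $6
    printf "pair=%s\n", $7
  }'
}

parse_illumina_id '@HWUSI-EAS100R:6:73:941:1973#0/1'
```

Because the index field is printed as a string, the same split works whether it holds 0 or a tag sequence.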

Versions of the Illumina pipeline since 1.4 appear to use #NNNNNN instead of #0 for the multiplex ID, where NNNNNN is the sequence of the multiplex tag.

With Casava 1.8 the format of the '@' line has changed:

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

sequence identifier    description
EAS139                 the unique instrument name
136                    the run id
FC706VJ                the flowcell id
2                      flowcell lane
2104                   tile number within the flowcell lane
15343                  'x'-coordinate of the cluster within the tile
197393                 'y'-coordinate of the cluster within the tile
1                      the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
Y                      Y if the read is filtered, N otherwise
18                     0 when none of the control bits are on, otherwise it is an even number
ATCACG                 index sequence
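
The Casava 1.8 header has two space-separated groups, each colon-delimited, so it can be split in two passes. A minimal sketch, assuming exactly one space between the groups and all eleven fields present; parse_casava18_id and its key names are hypothetical.

```shell
# parse_casava18_id (hypothetical helper): prints one key=value line per field
# of a Casava 1.8 style header passed as the first argument.
# Assumption: header looks like
#   @instrument:run:flowcell:lane:tile:x:y pair:filtered:control:index
parse_casava18_id() {
  printf '%s\n' "$1" | awk '{
    split($1, a, ":")   # first group: instrument through y-coordinate
    split($2, b, ":")   # second group: pair member through index sequence
    printf "instrument=%s\nrun=%s\nflowcell=%s\nlane=%s\n", substr(a[1], 2), a[2], a[3], a[4]
    printf "tile=%s\nx=%s\ny=%s\n", a[5], a[6], a[7]
    printf "pair=%s\nfiltered=%s\ncontrol=%s\nindex=%s\n", b[1], b[2], b[3], b[4]
  }'
}

parse_casava18_id '@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG'
```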

Converting FASTQ to FASTA format:

zcat input_file.fastq.gz | awk 'NR%4==1{printf ">%s\n", substr($0,2)}NR%4==2{print}' > output_file.fa


# printf syntax: format-string is the format control string, and arguments is the list of values it formats.
printf  format-string  [arguments...]


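# substr(s,p) returns the suffix of string s that starts at position p (positions are 1-based in awk).
# substr(s,p,n) returns the substring of s that starts at position p and is n characters long.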
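
To see printf and both forms of substr at work on a small input, the sketch below wraps the conversion in a function that reads stdin, so it can be tried without a compressed file. It assumes four-line records; fastq_to_fasta and id_prefix are hypothetical names introduced only for this illustration.

```shell
# fastq_to_fasta (hypothetical name): FASTQ on stdin, FASTA on stdout.
# Assumption: every record is exactly four lines long.
fastq_to_fasta() {
  awk '
    NR % 4 == 1 { printf ">%s\n", substr($0, 2) }   # two-argument substr drops the "@"
    NR % 4 == 2 { print }                           # sequence line is copied unchanged
  '
}

# id_prefix (hypothetical name): three-argument substr, skipping the "@"
# and keeping the next n characters of the header given as $1.
id_prefix() {
  printf '%s\n' "$1" | awk -v n="$2" '{ print substr($0, 2, n) }'
}

printf '@r1 sample\nACGT\n+\nIIII\n' | fastq_to_fasta
id_prefix '@HWUSI-EAS100R:6:73:941:1973#0/1' 13
```

With a gzipped input the function slots into the original pipeline as zcat input_file.fastq.gz | fastq_to_fasta > output_file.fa.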

Reposted from: https://www.cnblogs.com/adawong/p/8032871.html
