CUT&Tag
一篇鸡肋的跟做
一、数据下载
数据来源: Kaya-Okur et al., 2020
可从GEO下载
projPath="~/Cut/"
wget -O $projPath/data/IgG_rep2/IgG_rep2_R1_001.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR875/001/SRR8754611/SRR8754611_1.fastq.gz
wget -O $projPath/data/IgG_rep2/IgG_rep2_R2_001.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR875/001/SRR8754611/SRR8754611_2.fastq.gz
wget -O $projPath/data/IgG_rep2/IgG_rep2_R1_002.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR875/002/SRR8754612/SRR8754612_1.fastq.gz
wget -O $projPath/data/IgG_rep2/IgG_rep2_R2_002.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR875/002/SRR8754612/SRR8754612_2.fastq.gz
二、数据预处理
## 1、安装FastQC
mkdir -p $projPath/tools
wget -P $projPath/tools https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip
cd $projPath/tools
unzip fastqc_v0.11.9.zip
##2、运行 FASTQC 进行质量检查
mkdir -p ${projPath}/fastqFileQC/${histName}
$projPath/tools/FastQC/fastqc -o ${projPath}/fastqFileQC/${histName} -f fastq ${projPath}/fastq/${histName}_R1.fastq.gz
$projPath/tools/FastQC/fastqc -o ${projPath}/fastqFileQC/${histName} -f fastq ${projPath}/fastq/${histName}_R2.fastq.gz
三、比对
标准流程是对一个 HiSeq 2500 flowcell 上的 90 个汇总样本进行单指数 25x25 PE Illumina 测序,每个样本都有一个独特的 PCR 引物条形码。每个文库的数量调整为提供约 500 万对双端 reads,这为丰富的染色质特征提供了高质量的分析,具有特异性和高产的抗体。较少丰富的特征通常需要较少的 reads 量,而较低质量的抗体可能会增加产生稳健染色质谱所需的 reads 量。一个深入讨论的 feature recall 和测序深度的 CUT&Tag 已经发表(Kaya-Okur et al 2020)
#比对到hg38基因组上
##== linux 命令 ==##
cores=8
ref="/public/workspace/liincs/ref/Mus_musculus.GRCm38.dna.primary_assembly.fa"
mkdir -p ${projPath}/alignment/sam/bowtie2_summary
mkdir -p ${projPath}/alignment/bam
mkdir -p ${projPath}/alignment/bed
mkdir -p ${projPath}/alignment/bedgraph
## Build the bowtie2 reference genome index if needed:
## bowtie2-build path/to/hg38/fasta/hg38.fa /path/to/bowtie2Index/hg38
bowtie2 --end-to-end --very-sensitive --no-mixed --no-discordant --phred33 -I 10 -X 700 -p ${cores} -x ${ref} -1 ${projPath}/fastq/${histName}_R1.fastq.gz -2 ${projPath}/fastq/${histName}_R2.fastq.gz -S ${projPath}/alignment/sam/${histName}_bowtie2.sam &> ${projPath}/alignment/sam/bowtie2_summary/${histName}_bowtie2.txt
使用 Bowtie2 进行双端比对参数应设置为:--end-to-end --very-sensitive --no-mixed --no-discordant --phred33 -I 10 -X 700,用于回帖长度为 10-700 bp 的插入。
关键步骤:没有必要修剪从标准 25x25 PE 测序输出的测序结果,因为接头序列将不包括在 ** 插入 >25 bp** 的 reads。然而,对于执行 ** 较长测序 的用户, reads** 将需要通过 Cutadapt 进行修剪,并且在比对步骤中,比对参数应该为 --local --very-sensitive --no-mixed --no-discordant --phred33 -I 10 -X 700 以便忽略 reads 的 3' 末端 的任何剩余接头序列。
四、spike-in校准
spikeInRef="~/shared/ngs/illumina/henikoff/Bowtie2/Ecoli"
chromSize="~/fh/fast/gottardo_r/yezheng_working/SupplementaryData/hg38/chromSize/hg38.chrom.size"
## bowtie2-build path/to/Ecoli/fasta/Ecoli.fa /path/to/bowtie2Index/Ecoli
bowtie2 --end-to-end --very-sensitive --no-mixed --no-discordant --phred33 -I 10 -X 700 -p ${cores} -x ${spikeInRef} -1 ${projPath}/fastq/${histName}_R1.fastq.gz -2 ${projPath}/fastq/${histName}_R2.fastq.gz -S $projPath/alignment/sam/${histName}_bowtie2_spikeIn.sam &> $projPath/alignment/sam/bowtie2_summary/${histName}_bowtie2_spikeIn.txt
seqDepthDouble=`samtools view -F 0x04`
seqDepth=$((seqDepthDouble/2))
echo $seqDepth >$projPath/alignment/sam/bowtie2_summary/${histName}_bowtie2_spikeIn.seqDepth
##2984630 reads; of these:
2984630 (100.00%) were paired; of these:
125110 (4.19%) aligned concordantly 0 times
2360430 (79.09%) aligned concordantly exactly 1 time
499090 (16.72%) aligned concordantly >1 times
95.81% overall alignment rate
五、peak calling
SEACR包
seacr="/fh/fast/gottardo_r/yezheng_working/Software/SEACR/SEACR_1.3.sh"
histControl=$2
mkdir -p $projPath/peakCalling/SEACR
bash $seacr $projPath/alignment/bedgraph/${histName}_bowtie2.fragments.normalized.bedgraph \
$projPath/alignment/bedgraph/${histControl}_bowtie2.fragments.normalized.bedgraph \
non stringent $projPath/peakCalling/SEACR/${histName}_seacr_control.peaks
bash $seacr $projPath/alignment/bedgraph/${histName}_bowtie2.fragments.normalized.bedgraph 0.01 non stringent $projPath/peakCalling/SEACR/${histName}_seacr_top0.01.peaks
called peak的数目
peakN = c()
peakWidth = c()
peakType = c("control", "top0.01")
for(hist in sampleList){
histInfo = strsplit(hist, "_")[[1]]
if(histInfo[1] != "IgG"){
for(type in peakType){
peakInfo = read.table(paste0(projPath, "/peakCalling/SEACR/", hist, "_seacr_", type, ".peaks.stringent.bed"), header = FALSE, fill = TRUE) %>% mutate(width = abs(V3-V2))
peakN = data.frame(peakN = nrow(peakInfo), peakType = type, Histone = histInfo[1], Replicate = histInfo[2]) %>% rbind(peakN, .)
peakWidth = data.frame(width = peakInfo$width, peakType = type, Histone = histInfo[1], Replicate = histInfo[2]) %>% rbind(peakWidth, .)
}
}
}
peakN %>% select(Histone, Replicate, peakType, peakN)
六、可视化
特定区域热图可视化(还没做!!!)
cores=8
computeMatrix scale-regions -S $projPath/alignment/bigwig/K27me3_rep1_raw.bw \
$projPath/alignment/bigwig/K27me3_rep2_raw.bw \
$projPath/alignment/bigwig/K4me3_rep1_raw.bw \
$projPath/alignment/bigwig/K4me3_rep2_raw.bw \
-R $projPath/data/hg38_gene/hg38_gene.tsv \
--beforeRegionStartLength 3000 \
--regionBodyLength 5000 \
--afterRegionStartLength 3000 \
--skipZeros -o $projPath/data/hg38_gene/matrix_gene.mat.gz -p $cores
plotHeatmap -m $projPath/data/hg38_gene/matrix_gene.mat.gz -out $projPath/data/hg38_gene/Histone_gene.png --sortUsing sum
借用一下