任務:完成常規的重測序分析(下機數據質量分析和裝配),並用R語言完成PCA的分析和構建進化樹。
better to follow those directory and do each command in each step directory, keeping it clear what are those input and outputs.
In this case,
Reference species: Streptococcus pyogenes GENOME( fna.gz) and GFF (.gff.gz) file from NCBI
Samples: SRR6844817 - SRR6844837, 21 sample data in all. Downloaded from NCBI SRA
1. data preparation [bwa, sratoolkit, prefetch, fastq-dump]
#build index
$ bwa index GCF_000006*_genomic.fna.gz
#download SRA data (should be paired end),each run represent one sample
$ prefetch SRR6844817
#PE data, unzip data, *_1.fastq.gz & *_2.fastq.gz produced
$ fastq-dump -I SRR6844817 --split-files --gzip --origfmt.
2. mapping [bwa, samtools]
# do 4 commands for each sample (Runs number), get genome of each sample.
#bwa mem [-p] ref.fa read1.fq [paired-end_read2.fq] -R set the reads format
$ bwa mem -R @RG ID:s SM:37 LB:s GCF_000006*.fa.gz ../01.data/02.reads/ SRR2214817_1.fastq.gz ../01.data/02.reads/ SRR2214817_2.fastq.gz|samtools fixmate -O bam - SRR2214817.fixmate.bam
#sort and remove the original .remove fixmate file to save space
$ samtools sort -O bam -o SRR2214817.sorted.bam SRR2214817.fixmate.bam
$ rm SRR2214817.fixmate.bam
$ samtools index SRR2214817.sorted.bam
3. check genome quality (mapping result) [samtools, plot-bamstats under samtools, gnuplot]
#Summary of mapping including: mapping rate, coverage, depth, reads quality, duplication rate etc. Output files are saved under directory SRR6844817
$ samtools stats ../02.mapping/SRR6844817.sorted.bam >SRR6844817.stats.txt
$ plot-bamstats -p SRR6844817/ SRR6844817.stats.txt
# take one sample (SRR6844817) as an example, learn how to analysis output graphs.
RESULT:
- reads quality :reverse reads are relatively lower than forward reads. Overall QS higher than 25 is good and low quality at the beginning because of the adaptor.
- coverage :mapped in all region of reference sequence. Good result.
- GC Content :all fragment have similar GC content, peak at 41.
- Base content:shows base content in total sequence. Since it is PE data, A% equal to T% and G% equal C%, and AT% is higher than GC% in this sample.
4. variation calling [samtools, java, bcftools]
#First build index and dict for the reference, which was required by GATK
$ samtools faidx ../01.data/01.reference/GCF_000006*.fna
$ samtools dict ../01.data/01.reference/GCF_000006*.fna -o ../01.data/01.reference/GCF_000006*.dict
# start the indel realignment, skip BSQR because we do not have a set of know variations
for (( i = 17; i <= 37; i++ )); do
java -Xmx2g -jar ../00.bin/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ../01.data/01.reference/GCF_000006*.fna -I ../02.mapping/SRR68448$i.sorted.bam -o SRR68448$i.intervals
java -Xmx4g -jar ../00.bin/GenomeAnalysisTK.jar -T IndelRealigner -R ../01.data/01.reference/GCF_000006*.fna -I ../02.mapping/SRR68448$i.sorted.bam -targetIntervals SRR68448$i.intervals -o SRR68448$i.realigned.bam
done
# combine all result
$bcftools mpileup -Ou -f ../01.data/01.reference/GCF_000006*.fna SRR6844817.realigned.bam SRR6844818.realigned.bam SRR6844819.realigned.bam SRR6844820.realigned.bam SRR6844821.realigned.bam SRR6844822.realigned.bam SRR6844823.realigned.bam SRR6844824.realigned.bam SRR6844825.realigned.bam SRR6844826.realigned.bam SRR6844827.realigned.bam SRR6844828.realigned.bam SRR6844829.realigned.bam SRR6844830.realigned.bam SRR6844831.realigned.bam SRR6844832.realigned.bam SRR6844833.realigned.bam SRR6844834.realigned.bam SRR6844835.realigned.bam SRR6844836.realigned.bam SRR6844837.realigned.bam | bcftools call -vmO z -o study.raw.vcf.gz
$bcftools filter -O z -o study.vcf.gz -s LOWQUAL -i%QUAL>10 study.raw.vcf.gz
ps: file GenomeAnalysisTK.jar & snpEff.jar is saved in 00.bin dirctory.
5. further analysis [java, vcftools]
#.variation_annotation
#build the database
$gzip -cd GCF_000006*.gff.gz >java species_name/genes.gff
#download or add the gff to data directory
$cp GCF_000006*.fna species_name /sequences.fa
$java -jar ../../00.bin/snpEff/snpEff.jar build -gff3 -v species_name
# Annotation
$java -jar ../../00.bin/snpEff/snpEff.jar species_name ../../04.variation_calling/study.vcf.gz >study.annotated.vcf
#.diversity_calculation
$vcftools --gzvcf ../../04.variation_calling/study.vcf.gz --window-pi 10000
$vcftools --gzvcf ../../04.variation_calling/study.vcf.gz --TajimaD 10000
$vcftools --gzvcf ../../04.variation_calling/study.vcf.gz --SNPdensity 10000
$vcftools --gzvcf ../../04.variation_calling/study.vcf.gz --ld-window-bp 1000000
RESULT
- variation annotation
Summary
Ps: showing all sites are error is because the chromosome name is not standard one, it won』t not change the other result.
- calculate diversity
After filtering, kept 21 out of 21 Individuals
After filtering, kept 57834 out of a possible 57834 Sites
SNP diversity: General the variants rate in this species is 14-30%. In three region (10000bp), the variants rate up to 100%, probably have some large fragment loss or insert.
Tajima D test: to test if the DNA sequence is random evolution or not, how the direction selection effected. When T= 0, then no selection. T>0 means there is sudden population contraction, larger number means more effection from selection. When T<0 indicates population expansion after a recent bottleneck, a large number of low ferquence alleles in this region.
PI: PI is to test nucleotide divergency of sites
6. PCA [IN R]
#install R package SNPRelate, GWASTools,
source("http://bioconductor.org/biocLite.R")
biocLite("gdsfmt")
biocLite("SNPRelate")
#setting working path
setwd("~/Desktop/PCA")
library(gdsfmt)
library(SNPRelate)
#convert vcf file to gds file and read in gds file
vcf.t <- "study.vcf"
snpgdsVCF2GDS(vcf.t, "new_geno.gds", method = "biallelic.only")
genofile <- snpgdsOpen("new_geno.gds")
#set autosome.only=False since chromosome name is not standard
pca <- snpgdsPCA(genofile,autosome.only=FALSE)
pc.percent <- pca$varprop * 100
tab <- data.frame(sample.id = pca$sample.id, EV1 = pca$eigenvect[,1], EV2=pca$eigenvect[,2],stringsAsFactors =F)
plot(tab$EV2, tab$EV1, xlab="PC2", ylab="PC1")
21 samples are clustered into 4 groups. Samples in one group have similar characteristics. Since sample size(N) is too small, not enough to confirm whether it is a valid cluster. Better to larger the sample size for further analysis.
7. Phylogenetic tree based on VCF file
#install SNPRelate and SNPhylo http://chibba.pgml.uga.edu/snphylo/
library("gdsfmt")
library("SNPRelate")
snpgdsSummary("new_geno.gds")
genofile <- openfn.gds("new_geno.gds")
#build phylogenetic tree
set.seed(100)
ibs.hc <- snpgdsHCluster(snpgdsIBS(genofile, num.thread=2,autosome.only = FALSE))
rv <- snpgdsCutTree(ibs.hc)
plot(rv$dendrogram, leaflab="perpendicular", main="Streptococcus pyogenes")
37, 48, 40 are outgroup, length of branch stand for p-distance, shorter branch between two sample then closer relationship they have.
附上:裝配命令 /assembly [SOAPdenovo]
#assembly,注意徐轉換成自己環境下的地址和路徑
../bin/SOAPfilter_v2.2 -M 2 -i 500 -y -z -p fq.list fq.filter.stat 1>log 2>err
ls ../00data/*.clean.gz >clean_fq.list
../bin/kmerfreq -k 15 clean_fq.list 2>15mer.log
../bin/correct_error -k 15 -l 13 ../01kmer/output.freq.cz ../01kmer/clean_fq.list 1>log 2>err
../bin/SOAPdenovo-63mer pregraph -s soapdenovo.cfg -p 1 -K 63 -R -o Training > pregraph.log
../bin/SOAPdenovo-63mer contig -R -g Training > contig.log
../bin/SOAPdenovo-63mer map -s soapdenovo.cfg -g Training > map.log
../bin/SOAPdenovo-63mer scaff -g Training -F > scaff.log
../bin/GapCloser_v1.10_gz -o Training_seq.fill -b gap_closer.cfg -a ../03_soapdenovo_assembly/Training.scafSeq 1>log 2>err
../bin/CalculateN50 Training_seq.fill >Training_seq.fill.N50
../bin/PBcR/PBcR -length 500 -partitions 200 -l E.Coli -s E.coli.spec -fastq ecoli_p6_25x.filtered.fastq genomeSize=6000000 javaPath="/ifs1/ST_PLANT/USER/xushanyun/bin/jdk1.8.0_66/bin"