任務:完成常規的重測序分析(下機數據質量分析和裝配),並用R語言完成PCA的分析和構建進化樹。

Study on re-sequencing analysis

better to follow those directory and do each command in each step directory, keeping it clear what are those input and outputs.

In this case,

Reference species: Streptococcus pyogenes GENOME( fna.gz) and GFF (.gff.gz) file from NCBI

Samples: SRR6844817 - SRR6844837, 21 sample data in all. Downloaded from NCBI SRA

1. data preparation [bwa, sratoolkit, prefetch, fastq-dump]

#build index

$ bwa index GCF_000006*_genomic.fna.gz

#download SRA data (should be paired end),each run represent one sample

$ prefetch SRR6844817

#PE data, unzip data, *_1.fastq.gz & *_2.fastq.gz produced

$ fastq-dump -I SRR6844817 --split-files --gzip --origfmt.

2. mapping [bwa, samtools]

# do 4 commands for each sample (Runs number), get genome of each sample.

#bwa mem [-p] ref.fa read1.fq [paired-end_read2.fq] -R set the reads format

$ bwa mem -R @RG ID:s SM:37 LB:s GCF_000006*.fa.gz ../01.data/02.reads/ SRR2214817_1.fastq.gz ../01.data/02.reads/ SRR2214817_2.fastq.gz|samtools fixmate -O bam - SRR2214817.fixmate.bam

#sort and remove the original .remove fixmate file to save space

$ samtools sort -O bam -o SRR2214817.sorted.bam SRR2214817.fixmate.bam

$ rm SRR2214817.fixmate.bam

$ samtools index SRR2214817.sorted.bam

3. check genome quality (mapping result) [samtools, plot-bamstats under samtools, gnuplot]

#Summary of mapping including: mapping rate, coverage, depth, reads quality, duplication rate etc. Output files are saved under directory SRR6844817

$ samtools stats ../02.mapping/SRR6844817.sorted.bam >SRR6844817.stats.txt

$ plot-bamstats -p SRR6844817/ SRR6844817.stats.txt

# take one sample (SRR6844817) as an example, learn how to analysis output graphs.

RESULT:

- reads quality :reverse reads are relatively lower than forward reads. Overall QS higher than 25 is good and low quality at the beginning because of the adaptor.

- coverage :mapped in all region of reference sequence. Good result.

- GC Content :all fragment have similar GC content, peak at 41.

- Base content:shows base content in total sequence. Since it is PE data, A% equal to T% and G% equal C%, and AT% is higher than GC% in this sample.

4. variation calling [samtools, java, bcftools]

#First build index and dict for the reference, which was required by GATK

$ samtools faidx ../01.data/01.reference/GCF_000006*.fna

$ samtools dict ../01.data/01.reference/GCF_000006*.fna -o ../01.data/01.reference/GCF_000006*.dict

# start the indel realignment, skip BSQR because we do not have a set of know variations

for (( i = 17; i <= 37; i++ )); do

java -Xmx2g -jar ../00.bin/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ../01.data/01.reference/GCF_000006*.fna -I ../02.mapping/SRR68448$i.sorted.bam -o SRR68448$i.intervals

java -Xmx4g -jar ../00.bin/GenomeAnalysisTK.jar -T IndelRealigner -R ../01.data/01.reference/GCF_000006*.fna -I ../02.mapping/SRR68448$i.sorted.bam -targetIntervals SRR68448$i.intervals -o SRR68448$i.realigned.bam

done

# combine all result

$bcftools mpileup -Ou -f ../01.data/01.reference/GCF_000006*.fna SRR6844817.realigned.bam SRR6844818.realigned.bam SRR6844819.realigned.bam SRR6844820.realigned.bam SRR6844821.realigned.bam SRR6844822.realigned.bam SRR6844823.realigned.bam SRR6844824.realigned.bam SRR6844825.realigned.bam SRR6844826.realigned.bam SRR6844827.realigned.bam SRR6844828.realigned.bam SRR6844829.realigned.bam SRR6844830.realigned.bam SRR6844831.realigned.bam SRR6844832.realigned.bam SRR6844833.realigned.bam SRR6844834.realigned.bam SRR6844835.realigned.bam SRR6844836.realigned.bam SRR6844837.realigned.bam | bcftools call -vmO z -o study.raw.vcf.gz

$bcftools filter -O z -o study.vcf.gz -s LOWQUAL -i%QUAL>10 study.raw.vcf.gz

ps: file GenomeAnalysisTK.jar & snpEff.jar is saved in 00.bin dirctory.

5. further analysis [java, vcftools]

#.variation_annotation

#build the database

$gzip -cd GCF_000006*.gff.gz >java species_name/genes.gff

#download or add the gff to data directory

$cp GCF_000006*.fna species_name /sequences.fa

$java -jar ../../00.bin/snpEff/snpEff.jar build -gff3 -v species_name

# Annotation

$java -jar ../../00.bin/snpEff/snpEff.jar species_name ../../04.variation_calling/study.vcf.gz >study.annotated.vcf

#.diversity_calculation

$vcftools --gzvcf ../../04.variation_calling/study.vcf.gz --window-pi 10000

$vcftools --gzvcf ../../04.variation_calling/study.vcf.gz --TajimaD 10000

$vcftools --gzvcf ../../04.variation_calling/study.vcf.gz --SNPdensity 10000

$vcftools --gzvcf ../../04.variation_calling/study.vcf.gz --ld-window-bp 1000000

RESULT

- variation annotation

Summary

Ps: showing all sites are error is because the chromosome name is not standard one, it won』t not change the other result.

- calculate diversity

After filtering, kept 21 out of 21 Individuals

After filtering, kept 57834 out of a possible 57834 Sites

SNP diversity: General the variants rate in this species is 14-30%. In three region (10000bp), the variants rate up to 100%, probably have some large fragment loss or insert.

Tajima D test: to test if the DNA sequence is random evolution or not, how the direction selection effected. When T= 0, then no selection. T>0 means there is sudden population contraction, larger number means more effection from selection. When T<0 indicates population expansion after a recent bottleneck, a large number of low ferquence alleles in this region.

PI: PI is to test nucleotide divergency of sites

6. PCA [IN R]

#install R package SNPRelate, GWASTools,

source("bioconductor.org/biocLi")

biocLite("gdsfmt")

biocLite("SNPRelate")

#setting working path

setwd("~/Desktop/PCA")

library(gdsfmt)

library(SNPRelate)

#convert vcf file to gds file and read in gds file

vcf.t <- "study.vcf"

snpgdsVCF2GDS(vcf.t, "new_geno.gds", method = "biallelic.only")

genofile <- snpgdsOpen("new_geno.gds")

#set autosome.only=False since chromosome name is not standard

pca <- snpgdsPCA(genofile,autosome.only=FALSE)

pc.percent <- pca$varprop * 100

tab <- data.frame(sample.id = pca$sample.id, EV1 = pca$eigenvect[,1], EV2=pca$eigenvect[,2],stringsAsFactors =F)

plot(tab$EV2, tab$EV1, xlab="PC2", ylab="PC1")

RESULT

21 samples are clustered into 4 groups. Samples in one group have similar characteristics. Since sample size(N) is too small, not enough to confirm whether it is a valid cluster. Better to larger the sample size for further analysis.

7. Phylogenetic tree based on VCF file

#install SNPRelate and SNPhylo chibba.pgml.uga.edu/snp

library("gdsfmt")

library("SNPRelate")

snpgdsSummary("new_geno.gds")

genofile <- openfn.gds("new_geno.gds")

#build phylogenetic tree

set.seed(100)

ibs.hc <- snpgdsHCluster(snpgdsIBS(genofile, num.thread=2,autosome.only = FALSE))

rv <- snpgdsCutTree(ibs.hc)

plot(rv$dendrogram, leaflab="perpendicular", main="Streptococcus pyogenes")

RESULT

37, 48, 40 are outgroup, length of branch stand for p-distance, shorter branch between two sample then closer relationship they have.

附上:裝配命令 /assembly [SOAPdenovo]

#assembly,注意徐轉換成自己環境下的地址和路徑

../bin/SOAPfilter_v2.2 -M 2 -i 500 -y -z -p fq.list fq.filter.stat 1>log 2>err

ls ../00data/*.clean.gz >clean_fq.list

../bin/kmerfreq -k 15 clean_fq.list 2>15mer.log

../bin/correct_error -k 15 -l 13 ../01kmer/output.freq.cz ../01kmer/clean_fq.list 1>log 2>err

../bin/SOAPdenovo-63mer pregraph -s soapdenovo.cfg -p 1 -K 63 -R -o Training > pregraph.log

../bin/SOAPdenovo-63mer contig -R -g Training > contig.log

../bin/SOAPdenovo-63mer map -s soapdenovo.cfg -g Training > map.log

../bin/SOAPdenovo-63mer scaff -g Training -F > scaff.log

../bin/GapCloser_v1.10_gz -o Training_seq.fill -b gap_closer.cfg -a ../03_soapdenovo_assembly/Training.scafSeq 1>log 2>err

../bin/CalculateN50 Training_seq.fill >Training_seq.fill.N50

../bin/PBcR/PBcR -length 500 -partitions 200 -l E.Coli -s E.coli.spec -fastq ecoli_p6_25x.filtered.fastq genomeSize=6000000 javaPath="/ifs1/ST_PLANT/USER/xushanyun/bin/jdk1.8.0_66/bin"


推薦閱讀:
相关文章