Intro: sequencing and the data deluge

Report
MCB3895-004 Lecture #20
Nov 18/14
Reference alignments
Today:
1. Align reads to a reference genome
2. Correct for misalignments
3. Analyze variants between reads and the
reference genome
• (i.e., differences between the sequenced genome
and the reference)
• We will roughly follow the default samtools
protocol: http://www.htslib.org/workflow/
Record keeping
• As you will see, a huge amount of this work is
converting between formats so that different
software will work
• It is therefore CRUCIAL that you keep records
of all the commands that you use
• BEWARE: different versions of all of these
software have different syntaxes
Read mapping
• Align reads to a reference genome and
determine SNPs
• Note aligning reads, not contigs as with
nucmer
• Computationally more efficient than doing de novo
assembly first
Read mapping tools
• Many different flavors, but overall dominated
by two programs:
• bwa (we will use today)
• bowtie
• Note: early versions of bowtie did not align
reads containing indels, whereas bwa did
• Some debate about which is best, trade-offs
between sensitivity (ability to map everything)
and specificity (are mappings correct)
• Also speed and memory considerations
Mapping using bwa
• Create an index of the reference genome
nucleotide fasta for the alignment software to
use for read mapping
• syntax: $ bwa index [ref.fasta file]
• e.g.: $ bwa index E_coli.fasta
• note: use ".fasta" file ending for a later step
• Creates 5 output files: [ref.fasta].amb,
.ann, .bwt, .pac, .sa
• "Index": special computer data structure that
allows fast searching; software-specific
Mapping using bwa
• bwa mem does the actual mapping step
• syntax: $ bwa mem -R
[email protected]\tID:[name1]\tSM:[name2]\tLB:libra
ry1' [ref.fasta file] [read file 1]
[read file 2] > [outfile]
• -R: indexes "read groups", required for GATK in later steps
• e.g.: $ bwa mem -R
[email protected]\tID:all\tSM:all\tLB:library'
E_coli.fasta SRR826450_1.fastq
SRR826450_2.fastq > align.sam
samtools: convert .sam to
.bam, clean up names
• .sam is the plain text output format of most
sequence alignment programs
• Because these can be large, most subsequent
programs use the compressed ".bam" format
instead
• bwa sometimes does odd things to read pairing
information, can clean up during conversion
samtools: convert .sam to
.bam, clean up names
• syntax: $ samtools fixmate -O bam
[input .sam file] [output .bam
file]
• -O: output file type
• e.g.: $ samtools fixmate -O bam
align.sam align_fixmate.bam
samtools: sort .bam file
• samtools and related software use .bam files
that are sorted by ascending genomic position
• i.e., starts from position #1 on the reference genome
and goes to the end
• syntax: $ samtools sort -O bam -o [output
sorted .bam] -T [temp file location]
[input unsorted .bam]
• -O: output file type
• -o: output file name
• -T: location for temporary files (required)
• e.g.: $ samtools sort -O bam -o
align_sorted.bam -T temp
align_fixmate.bam
GATK: realign indels
• bwa sometimes misaligns indels in reads
• One way to get rid of these is to use the
realignment functions in the GATK package
• More generally: GATK does much of the same thing
as samtools, strong focus on diploid genomes
• Unfortunately, GATK uses java (silly command
line syntax)
• Unfortunately, GATK needs its own file formats
Picard: index reference
• syntax: $ java -jar /export/apps/picard-tools1.124/picard.jar CreateSequenceDictionary
REFERENCE=[ref.fasta file] OUTPUT=[output
.dict file]
• REFERENCE: reference file name
• OUTPUT: output index file name, must be ".dict"
• e.g.: $ java -jar /export/apps/picard-tools1.124/picard.jar CreateSequenceDictionary
REFERENCE=E_coli.fasta OUTPUT=E_coli.dict
samtools: index reference
• syntax: $ samtools faidx [ref.fasta
file]
• e.g.: $ samtools faidx E_coli.fasta
• outputs [ref.fasta].fai index file
samtools: index .bam file
• syntax: $ samtools index [sorted .bam
file]
• e.g.: $ samtools index
align_sorted.bam
• outputs [sorted bam].bai output file
GATK: prepare reads for indel
realignment
• syntax: $ java -Xmx2g -jar
/opt/bioinformatics/GATK/GenomeAnalysisTK
.jar -T RealignerTargetCreator -R
[ref.fasta file] -I [sorted & indexed
.bam file] -o [output file name]
• -R: reference .fasta file name
• -I: sorted and indexed .bam file
• -o: output intervals file name
• e.g.: $ java -Xmx2g -jar
/opt/bioinformatics
GATK/GenomeAnalysisTK.jar -T
RealignerTargetCreator -R E_coli.fasta -I
align_sorted.bam -o
align_sorted.intervals
GATK: perform indel
realignment
• syntax: $ java -Xmx4g -jar
/opt/bioinformatics/GATK/GenomeAnalysisTK.j
ar -T IndelRealigner -R [ref.fasta file] -I
[sorted & indexed .bam file] targetIntervals [intervals file] -o [output
.bam file]
•
•
•
•
-T: Program function to use
-R: Reference .fasta file
-I: Intervals file from last step
-o: output .bam file name
• e.g.: $ java -Xmx2g -jar
/opt/bioinformatics/GATK/GenomeAnalysisTK.j
ar -T IndelRealigner -R E_coli.fasta -I
align_sorted.bam -targetIntervals
align_sorted.intervals -o
align_realigned.bam
Picard: remove duplicates
• Duplicate reads can arise because of PCR
artifacts during sequencing
• Because duplicate reads to not provide
additional information, it is best to remove them
for computational efficiency
• Identified by having identical start and end
mapping positions
Picard: remove duplicates
• syntax: $ java -Xmx2g -jar
/export/apps/picard-tools1.124/picard.jar MarkDuplicates
INPUT=[input bam file] OUTPUT=[output
bam file] REMOVE_DUPLICATES=true
METRICS_FILE=[metrics output file]
• INPUT: input .bam file from GATK
• OUTPUT: output .bam file lacking duplicates
• METRICS_FILE: summary file of duplicate reads removed
• e.g.: $ java -Xmx2g -jar
/export/apps/picard-tools1.124/picard.jar MarkDuplicates
INPUT=align_realigned.bam
OUTPUT=align_nodups.bam
REMOVE_DUPLICATES=true
METRICS_FILE=nodups.metrics
samtools: index .bam file
• samtools requires that the new .bam file be
indexed before variant calling
• syntax: $ samtools index [.bam file
name]
• e.g.: $ samtools index align_nodups.bam
samtools: create .bcf file
for variant calling
• bcftools is a package very similar to
samtools that handles variant calling
• Of course, it requires its own file format
• syntax: $ samtools mpileup -go
[output .bcf] -f [ref.fasta] [1 or
more indexed .bam]
• -go: specify output file name and .bcf format
• -f: reference .fasta file name
• e.g.: $ samtools mpileup -go
E_coli.bcf -f E_coli.fasta
align_nodups.bam
bcftools: call variants
• The actual variant calling step uses the call
function in bcftools
• syntax: $ bcftools call -vmO z -o
[output .vcf.gz file] [input .bcf
file]
•
•
•
•
-v: only output variant sites
-m: specify variant calling algorithm (multiallelic)
-O: specify output format, z = .vcf.gz
-o: output file name
• e.g.: $ bcftools call -vmO z -o
E_coli.vcf.gz E_coli.bcf
bcftools: index .vcf.gz
file
• tabix is a program included in bcftools that
indexes a .vcf.gz file
• syntax: $ tabix -p vcf [input .vcf.gz
file]
• -p: specifies file type
• e.g.: $ tabix -p vcf E_coli.vcf.gz
bcftools: analyze .vcf.gz
file
• bcftools has handy software to analyze the
variants that it has identified
• syntax: $ bcftools stats -F
[ref.fasta] -s - [input .vcf.gz
file] > [output file]
• -F: faidx indexed reference .fasta sequence
• -s: list of samples to analyze, "-" = all samples
• e.g.: $ bcftools stats -F
E_coli.fasta -s - E_coli.vcf.gz >
E_coli.vcf.gz.stats
Summary stats
Indel stats
Quality stats
Indel types
Substitution types
bcftools: filter variants
based on quality score
• Generally, one wants to mark low quality variants.
• How to draw a cutoff line is somewhat subjective
• syntax: $ bcftools filter -O z -o
[output .vcf.gz file] -s LOWQUAL -i
'%QUAL>10' [input .vcf.gz file]
•
•
•
•
-O: output type, "z" = .vcf.gz
-o: output file name
-s: label to mark failed variants
-i: condition under which sequences pass
• e.g.: $ bcftools filter -O z -o
E_coli_filtered.vcf.gz -s LOWQUAL -i
'%QUAL>10' E_coli.vcf.gz
bcftools: calculate stats
based on filtered variants
• You can tell bcftools stats to only analyze
variants that pass the filter
• syntax: $ bcftools stats -F
[ref.fasta] -f PASS -s - [input
filtered .vcf.gz file]
• -F: faidx indexed reference .fasta sequence
• -f: how sequences to include are marked
• -s: list of samples to analyze, "-" = all samples
• e.g.: $ bcftools stats -F E_coli.fasta
-f PASS -s - E_coli_filtered.vcf.gz
Assignment
• How do our MiSeq and HiSeq E.coli datasets
differ from the reference K-12 genome?
• Submit:
1. Number of SNP and indel differences compared
to the reference genome
2. Justification for the filtering parameters used
3. Lab notebook file listing all of the exact
parameters used
4. Output of the bcftools stats analysis

similar documents