SNP detection presentation

Cataloging polymorphisms in
genomes with NGS
Graham Etherington
[email protected]
Outline
• Single-nucleotide polymorphisms (SNPs) –
what, why and where.
• Insertions and deletions (Indels)
• Mapping reads to a genome
• SNP analysis
• Indel analysis
• Filtering – making rules
• Tutorial
What is a SNP?
Why look for SNPs?
• SNPs may lead to a change in function or
expression of a gene.
– premature stop codon
– different fold in a protein
– higher or lower expression of gene
• Genetic markers
– SNP may be linked to a gene for a given trait
• response to a pathogen (susceptible or resistant)
Types of SNPs
• Effects of SNPs vary depending on location.
– Intergenic regions
• may alter the sequence of regulatory RNAs
– Non-coding regions
• alteration of promoter and enhancer sequences may
change expression of gene
– Coding regions
• Substitutions
– synonymous: no change in the amino acid
– non-synonymous: change in amino acid
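The synonymous/non-synonymous distinction above can be checked mechanically by translating the codon before and after the substitution. The following is a minimal sketch, assuming the standard genetic code; the function name `classify_snp` and the example codons are illustrative and not part of the original slides.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_snp(codon, offset, alt_base):
    """Classify a substitution at position `offset` (0, 1 or 2) of `codon`."""
    mutated = codon[:offset] + alt_base + codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutated]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "non-synonymous (premature stop)"
    return "non-synonymous"


# Hypothetical examples: GCC (Ala) -> GCT (Ala), GCC -> GTC (Val), TGG (Trp) -> TGA (stop)
for codon, offset, alt in [("GCC", 2, "T"), ("GCC", 1, "T"), ("TGG", 2, "A")]:
    print(codon, offset, alt, classify_snp(codon, offset, alt))
```

In a real analysis the codon and offset would come from a gene annotation, which this sketch leaves out.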
Indels
• Insertion/deletion
• Differ from SNPs by having at least one
nucleotide extra or missing when compared to a
reference sequence.
• Can cause frameshifts – when the indel length is not divisible by 3, the
downstream codons shift, creating a different protein sequence after the indel.
• Indels of a length divisible by 3 cause whole
amino acid insertions/deletions, but not frameshifts.
Indels
• Insertions
– an ‘extra’ nucleotide appears
(inserted) between 2
neighbouring positions
• Deletions
– a nucleotide is missing
(deleted) at a given position
Insertion:
Reference  ATCGCGTTTGCCATGGCC   I A F A M A
Query      ATCGCGTTTCGCCATGGCC  I A F R H G
Deletion:
Reference  ATCGCGTTTGCCATGGCC   I A F A M A
Query      ATCGCGTTGCCATGGCC    I A L P W P
• Sometimes a matter of perspective – does the
reference have an insertion, or does the
query (e.g. a read sequence) have a deletion?
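The frameshift effect can be reproduced by translating the example sequences from the indel slides. This is a rough sketch, assuming the standard genetic code and translation from the first base; the in-frame example (a whole codon deleted) is a hypothetical sequence added for illustration. Trailing incomplete codons are ignored, so the final P shown for the deletion (from the partial codon CC) is not emitted here.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate complete codons starting at the first base."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


reference = "ATCGCGTTTGCCATGGCC"
insertion = "ATCGCGTTTCGCCATGGCC"  # one extra C after position 9
deletion = "ATCGCGTTGCCATGGCC"     # one T of the TTT run is missing
in_frame = "ATCGCGGCCATGGCC"       # hypothetical: the whole TTT codon deleted

for label, seq in [("reference", reference), ("insertion", insertion),
                   ("deletion", deletion), ("in-frame", in_frame)]:
    print(f"{label:10s} {translate(seq)}")
```

The single-base indels change every amino acid after the indel, while the 3-base deletion only removes one residue (F) and leaves the rest of the protein unchanged.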
Analysis - the basic steps
• Polymorphism analysis in Galaxy involves a
number of steps:
– Getting data
– Quality Control and filtering
– Aligning reads to reference
– SNP analysis
– Indel analysis
Paired-end reads
• Sequences can be paired-end
– sequences occur as ‘pairs’ with one left-hand (forward) read and one right-hand
(reverse) read.
– a given distance (insert-size) between the start and end of pairs.
[Figure: paired-end reads from a 500 nt DNA fragment (~500 nt ‘insert size’): a 76-nucleotide left (forward) read and a 76-nucleotide right (reverse) read, separated by a ~350 nt gap]
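The numbers in the paired-end figure are related by simple arithmetic: the unsequenced gap is the fragment length minus the two read lengths. A small sketch, with the hypothetical helper name `inner_gap`:

```python
def inner_gap(fragment_length, read_length):
    """Unsequenced distance between the two reads of a pair, in nucleotides."""
    return fragment_length - 2 * read_length


# Values from the figure: 500 nt fragment, 76 nt reads.
print(inner_gap(500, 76))  # 348, i.e. the ~350 nt gap
```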
FASTQ Format
• Illumina sequences are stored in FASTQ
format.
FASTA format:
>AY0038453
CCTCGGAGTGGAAGGGTGAAGCTAGATTCGTGGACGAATCTATGTTAGTGGGGGAG
FASTQ format:
@HWI-EAS396_0001:6:1:11659:1311#0/1
CCTCGGAGTGGAAGGGTGAAGCTAGATTCGTGGACGAATCTATGTTAGTGGGGGAG
+HWI-EAS396_0001:6:1:11659:1311#0/1
`_Z_`WU\R\ddadafcfafcdWdca[facdW[[^W^ca^W^ac^fcdcfab]_X^
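Each FASTQ record is four lines: an `@` header, the sequence, a `+` separator, and one quality character per base. The sketch below reads the example record and converts the quality characters to Phred scores. The offset of 64 is an assumption, inferred from the characters in this example (they range up to `f`); more recent Illumina data uses an offset of 33. The function names are illustrative.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], seq, qual


def phred_scores(quality, offset=64):
    """Convert a quality string to integer Phred scores (offset is an assumption)."""
    return [ord(ch) - offset for ch in quality]


example = "\n".join([
    "@HWI-EAS396_0001:6:1:11659:1311#0/1",
    "CCTCGGAGTGGAAGGGTGAAGCTAGATTCGTGGACGAATCTATGTTAGTGGGGGAG",
    "+HWI-EAS396_0001:6:1:11659:1311#0/1",
    r"`_Z_`WU\R\ddadafcfafcdWdca[facdW[[^W^ca^W^ac^fcdcfab]_X^",
]) + "\n"

for name, seq, qual in read_fastq(io.StringIO(example)):
    scores = phred_scores(qual)
    print(name, len(seq), min(scores), max(scores))
```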
Quality Control
• Illumina sequencing
• Not perfect, contains errors
– wrongly called bases
– N’s
– reads shorter or longer than the expected read length
– low quality
• Quality-control the reads to select those of highest quality
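The error types listed above translate directly into a per-read filter. This is a minimal sketch only: the function name and all thresholds (expected length 76, mean quality 20, no N's allowed) are illustrative assumptions, not the settings used by the Galaxy tools in the tutorial.

```python
def passes_qc(seq, quals, expected_length=76, min_mean_quality=20, max_n=0):
    """Return True if a read has the expected length, few N's and good mean quality."""
    if len(seq) != expected_length or len(quals) != len(seq):
        return False  # read shorter or longer than expected
    if seq.upper().count("N") > max_n:
        return False  # too many uncalled bases
    return sum(quals) / len(quals) >= min_mean_quality


# Hypothetical 4-base reads, with expected_length lowered to match.
print(passes_qc("ACGT", [30, 30, 30, 30], expected_length=4))  # True
print(passes_qc("ACNT", [30, 30, 30, 30], expected_length=4))  # False
print(passes_qc("ACGT", [10, 10, 10, 10], expected_length=4))  # False
```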
Quality Control
• Quality scores
– look at the per-base quality scores
[Figure: plot of quality scores (y-axis) against position (x-axis)]
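A per-base quality plot like this one is built from the mean quality at each read position. The sketch below computes those means from a list of quality strings; the function name is hypothetical and the Phred offset of 64 is an assumption (33 for more recent Illumina data).

```python
def mean_quality_per_position(quality_strings, offset=64):
    """Return the mean Phred score at each read position (reads may differ in length)."""
    totals, counts = [], []
    for quality in quality_strings:
        for i, ch in enumerate(quality):
            if i == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]


# Hypothetical quality strings: 'h' is Phred 40 and 'd' is Phred 36 at offset 64.
print(mean_quality_per_position(["hh", "d"]))  # [38.0, 40.0]
```

A drop in the mean towards the end of the reads is a common reason to trim them before mapping.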
Mapping
• Align sequenced reads to a reference genome
using a next-generation sequence aligner.
• The alignment program writes its output in Sequence
Alignment/Map (SAM) format.
• The SAM format describes the alignment of
sequenced reads to a reference sequence.
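Each alignment line in a SAM file has 11 mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL). The following sketch splits one line into those fields; the example line is hypothetical, and optional tag fields after the eleventh column are ignored.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag (0x10 = read maps to the reverse strand)
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description, e.g. 76M or 9M1I66M
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_sam_line(line):
    """Parse the 11 mandatory fields of a SAM alignment line."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10])


def is_reverse_strand(record):
    return bool(record.flag & 0x10)


# Hypothetical alignment line.
line = "read1\t16\tchr1\t412\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII"
record = parse_sam_line(line)
print(record.rname, record.pos, record.cigar, is_reverse_strand(record))
```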
Map analysis - SNPs
• SNP finding follows a number of steps:
– calculating the individual read bases that cover
each base in a reference
– counting how many read bases are the same as
the reference and how many are different
– calculating the depth of coverage and quality of
the read bases
Map analysis - SNPs
• The mpileup format
• A multi-column representation of one or more alignments, with per-line, base-by-base information on chromosome, position, reference base,
coverage, read bases, quality, etc.
First 3 columns are information about the
reference sequence:
1. Reference name
2. Position
3. Reference base
Then 3 columns per sample:
4. Coverage
5. Read bases at that position
6. Read quality at that position
1     2    3  4  5     6
chr1  412  A  2  .,    II
chr1  413  G  4  ..t,  IIIH
chr1  414  C  4  ...a  III2
chr1  415  C  4  TTTt  III7

Col 5 Char        Meaning
.                 Identical base on the same strand
,                 Identical base on the opposite strand
^                 Read starts at this position
$                 Read ends at this position
Uppercase letter  Different base on the same strand
Lowercase letter  Different base on the opposite strand
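Counting how many read bases match or differ from the reference comes down to decoding column 5. The sketch below is a simplified decoder, not the one VarScan uses: it also skips two details of the format not shown in the table, namely that `^` is followed by a mapping-quality character and that indels appear as `+N` or `-N` followed by N bases. The function name is hypothetical.

```python
from collections import Counter


def count_read_bases(ref_base, read_bases):
    """Count the bases observed at one position from the mpileup read-bases column."""
    counts = Counter()
    i = 0
    while i < len(read_bases):
        ch = read_bases[i]
        if ch == "^":  # read start; the next character encodes mapping quality
            i += 2
            continue
        if ch in "+-":  # indel: sign, length digits, then that many bases
            j = i + 1
            while j < len(read_bases) and read_bases[j].isdigit():
                j += 1
            i = j + int(read_bases[i + 1:j])
            continue
        if ch in ".,":  # match to the reference on either strand
            counts[ref_base.upper()] += 1
        elif ch.upper() in "ACGTN":  # mismatch on either strand
            counts[ch.upper()] += 1
        # '$' (read end) and any other symbol are skipped
        i += 1
    return counts


# Rows from the example table above.
for pos, ref, bases in [(412, "A", ".,"), (413, "G", "..t,"),
                        (414, "C", "...a"), (415, "C", "TTTt")]:
    print(pos, ref, dict(count_read_bases(ref, bases)))
```

At position 415 all four reads carry T instead of the reference C, which is the pattern expected for a homozygous SNP, whereas positions 413 and 414 each have a single differing read.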
Identifying SNPs
• Use a statistical/heuristic approach to identify
homozygosity/heterozygosity in mpileups
• VarScan
– calls SNPs, indels, and consensus genotypes.
– genotypes
• homozygous to reference base
• homozygous to alternative base
• heterozygous (combination of reference and alternative
bases)
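The three genotype classes can be illustrated with a simple frequency-based heuristic. This sketch is not VarScan's actual algorithm (which also applies a statistical test and quality thresholds); the function name and the cut-offs of 0.2 and 0.75 are illustrative assumptions.

```python
def call_genotype(ref_depth, alt_depth, min_var_freq=0.2, min_hom_freq=0.75):
    """Call a genotype from reference- and variant-supporting read depths."""
    depth = ref_depth + alt_depth
    if depth == 0:
        return "./."  # no data, not called
    alt_freq = alt_depth / depth
    if alt_freq < min_var_freq:
        return "0/0"  # homozygous for the reference base
    if alt_freq >= min_hom_freq:
        return "1/1"  # homozygous for the alternative base
    return "0/1"      # heterozygous


# Hypothetical depths: (reference-supporting, variant-supporting)
for ref_depth, alt_depth in [(20, 1), (14, 9), (0, 9)]:
    print(ref_depth, alt_depth, call_genotype(ref_depth, alt_depth))
```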
Variant Call Format (VCF)
##fileformat=VCFv4.1
##source=VarScan2
##INFO=<ID=ADP,Number=1,Type=Integer,Description="Average per-sample depth of bases with Phred score >= 15">
##INFO=<ID=WT,Number=1,Type=Integer,Description="Number of samples called reference (wild-type)">
##INFO=<ID=HET,Number=1,Type=Integer,Description="Number of samples called heterozygous-variant">
##INFO=<ID=HOM,Number=1,Type=Integer,Description="Number of samples called homozygous-variant">
##INFO=<ID=NC,Number=1,Type=Integer,Description="Number of samples not called">
##FILTER=<ID=str10,Description="Less than 10% or more than 90% of variant supporting reads on one strand">
##FILTER=<ID=indelError,Description="Likely artifact due to indel reads at this position">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=SDP,Number=1,Type=Integer,Description="Raw Read Depth as reported by SAMtools">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Quality Read Depth of bases with Phred score >= 15">
##FORMAT=<ID=RD,Number=1,Type=Integer,Description="Depth of reference-supporting bases (reads1)">
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Depth of variant-supporting bases (reads2)">
##FORMAT=<ID=FREQ,Number=1,Type=String,Description="Variant allele frequency">
##FORMAT=<ID=PVAL,Number=1,Type=String,Description="P-value from Fisher's Exact Test">
##FORMAT=<ID=RBQ,Number=1,Type=Integer,Description="Average quality of reference-supporting bases (qual1)">
##FORMAT=<ID=ABQ,Number=1,Type=Integer,Description="Average quality of variant-supporting bases (qual2)">
##FORMAT=<ID=RDF,Number=1,Type=Integer,Description="Depth of reference-supporting bases on forward strand (reads1plus)">
##FORMAT=<ID=RDR,Number=1,Type=Integer,Description="Depth of reference-supporting bases on reverse strand (reads1minus)">
##FORMAT=<ID=ADF,Number=1,Type=Integer,Description="Depth of variant-supporting bases on forward strand (reads2plus)">
##FORMAT=<ID=ADR,Number=1,Type=Integer,Description="Depth of variant-supporting bases on reverse strand (reads2minus)">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1
chr01 37 . A T . PASS ADP=27;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:3:30:23:14:9:39.13%:7.4174E-4:39:37:14:0:9:0
chr01 55 . A G . PASS ADP=16;WT=0;HET=0;HOM=1;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 1/1:4:36:12:0:9:75%:2.0568E-5:0:36:0:0:9:0
chr01 85 . A G . PASS ADP=34;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:5:37:35:19:16:45.71%:1.637E-6:35:36:19:0:16:0
chr01 88 . A T . PASS ADP=30;WT=0;HET=1;HOM=0;NC=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR 0/1:2:33:31:23:8:25.81%:2.3332E-3:36:19:23:0:8:0
INFO
ADP=27;WT=0;HET=2;HOM=0;NC=0
ADP=16;WT=0;HET=0;HOM=2;NC=0
ADP=34;WT=1;HET=1;HOM=0;NC=0
ADP=30;WT=1;HET=1;HOM=0;NC=0
ADP=61;WT=1;HET=1;HOM=0;NC=0
ADP=80;WT=1;HET=1;HOM=0;NC=0
ADP=88;WT=1;HET=1;HOM=0;NC=0
ADP=101;WT=1;HET=1;HOM=0;NC=0
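The FORMAT column names the colon-separated values in each sample column, and the INFO column holds semicolon-separated KEY=value pairs. The sketch below pairs them up for the first variant in the VCF example (chr01, position 37). The function names are illustrative, and `parse_info` assumes every INFO entry is KEY=integer, which holds for these VarScan examples but not for VCF files in general.

```python
def parse_sample(format_field, sample_field):
    """Pair each FORMAT key with the matching value of one sample column."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def parse_info(info_field):
    """Parse KEY=integer pairs from the INFO column (assumption: no flag entries)."""
    return {key: int(value)
            for key, value in (item.split("=") for item in info_field.split(";"))}


fmt = "GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR"
sample = parse_sample(fmt, "0/1:3:30:23:14:9:39.13%:7.4174E-4:39:37:14:0:9:0")
info = parse_info("ADP=27;WT=0;HET=1;HOM=0;NC=0")

# FREQ is the variant-supporting depth (AD) divided by the quality depth (DP).
alt_freq = int(sample["AD"]) / int(sample["DP"])
print(sample["GT"], sample["DP"], sample["AD"], f"{alt_freq:.2%}", info["HET"])
```

Recomputing AD/DP gives 9/23, which matches the 39.13% reported in the FREQ field.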
Identifying SNPs
• Filter SNPs based on some rules
– Coverage: what minimum depth of coverage do
you require?
– Genotype: what genotypes, or combination of
genotypes, are you looking for?
– Alternative allele frequency: 0.5? 0.33?
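The three rules above can be combined into a single filter. This is a sketch under stated assumptions: the function name and the default thresholds (depth 10, heterozygous or homozygous-variant genotypes, alternative allele frequency 0.33) are illustrative choices, and the sample values are the GT, DP and FREQ fields of the four variants from the VCF example.

```python
def keep_variant(sample, min_depth=10, genotypes=("0/1", "1/1"), min_alt_freq=0.33):
    """Apply coverage, genotype and alternative-allele-frequency rules to one sample."""
    depth = int(sample["DP"])
    alt_freq = float(sample["FREQ"].rstrip("%")) / 100
    return (depth >= min_depth
            and sample["GT"] in genotypes
            and alt_freq >= min_alt_freq)


variants = {
    37: {"GT": "0/1", "DP": "23", "FREQ": "39.13%"},
    55: {"GT": "1/1", "DP": "12", "FREQ": "75%"},
    85: {"GT": "0/1", "DP": "35", "FREQ": "45.71%"},
    88: {"GT": "0/1", "DP": "31", "FREQ": "25.81%"},
}

kept = [pos for pos, sample in variants.items() if keep_variant(sample)]
print(kept)  # [37, 55, 85]
```

With these settings the variant at position 88 is dropped because its alternative allele frequency (25.81%) is below 0.33; raising `min_depth` to 20 would also drop position 55.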
Identifying Indels
• The SAM file has information about how the reads map to the
genome, including insertions and deletions
• Indel extraction output in Galaxy allows us to visualise
indels
• VarScan also has a tool to extract indels
• As with SNP calling, provide some parameters to filter for
the most probable indels.
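In a SAM file, insertions and deletions are recorded in the CIGAR string as `I` and `D` operations. The sketch below walks a CIGAR string and reports where each indel falls on the reference. The function name is hypothetical, and the two CIGAR strings are one possible alignment of the insertion and deletion reads from the earlier indel example, assumed to start at reference position 1.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indels_from_cigar(pos, cigar):
    """Return (type, reference_position, length) for each I/D operation.

    `pos` is the 1-based leftmost mapping position of the read. An insertion is
    reported at the reference base it follows; a deletion at its first deleted base.
    """
    ref_pos = pos
    indels = []
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "I":
            indels.append(("insertion", ref_pos - 1, length))
        elif op == "D":
            indels.append(("deletion", ref_pos, length))
        if op in "MDN=X":  # only these operations consume reference bases
            ref_pos += length
    return indels


print(indels_from_cigar(1, "9M1I9M"))  # [('insertion', 9, 1)]
print(indels_from_cigar(1, "8M1D9M"))  # [('deletion', 9, 1)]
```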
Visualisation
• Galaxy Trackster
– view read alignments
– view SNPs
– view Indels
Visualisation
[Figure: example of an A>T SNP]
Tutorial
• Time to put what I’ve said into practice.
• Use Galaxy: galaxy.tsl.ac.uk
• Go through the Tutorial. Don’t rush things.
Take your time and make sure you understand
what each step does and how you are creating
a pipeline of analysis.
