RNA-Seq - Babraham Bioinformatics

Report
RNA-Seq Analysis
Simon Andrews
[email protected]
@simon_andrews
v2.3
RNA-Seq Libraries
rRNA depleted mRNA
Fragment
Random prime + RT
NNNN
A
T
A
A
u
u
u
u
u
u
u
u
A
A-tailing
u
u
u
u
A
T
Adapter Ligation
T
(U strand degradation)
2nd strand synthesis (+ U)
Sequencing
RNA-Seq Analysis
QC
(Trimming)
Mapping
Statistical
Analysis
Quantitation
Mapped QC
QC: Raw Data
• Sequence call quality
QC: Raw Data
• Sequence bias
QC: Raw Data
• Duplication level
Mapping
Exon 1
Exon 2
Exon 3
Genome
Simple mapping within exons
Mapping between exons
Spliced mapping
Can simplify by aligning first to a transcriptome
and then translate back to genomic coordinates.
Can map unmatched reads to the whole genome.
Spliced Mapping Software
• Tophat (http://tophat.cbcb.umd.edu/)
• Star (http://code.google.com/p/rna-star/)
TopHat pipeline
Reference FastQ files
Indexed Genome
Reference GTF Models
Indexed Transcriptome
Reads
Maps to transcriptome?
Yes
Translate coords and report
No
Maps to genome?
Yes
Report
No
Split map to genome
No
Discard
Yes
Build consensus and report
Post Mapping QC
•
•
•
•
•
Mapping statistics
Proportion of reads which are in transcripts
Proportion of reads in transcripts in exons
Strand specificity
Consistency of coverage
SeqMonk (RNA-Seq QC plot)
RNASeqQC (easiest through GenePattern)
SeqMonk Mapping QC (good)
SeqMonk Mapping QC (bad)
Quantitation
Exon 1
Exon 1
Exon 2
Exon 3
Splice form 1
Exon 3
Splice form 2
Definitely splice form 1
Definitely splice form 2
Ambiguous
Options for handling splice variants
• Ignore them – combine exons and analyse at
gene level
– Simple, powerful, inaccurate in some cases
– DE-Seq, SeqMonk
• Assign ambiguous reads based on unique ones –
quantitate transcripts and optionally merge to
gene level
– Potentially cleaner more powerful signal
– High degree of uncertainty
– Cufflinks, bitSeq, RSEM
Read counting
• Simple (exon or transcript)
– HTSeq (htseq-count)
– BEDTools (multicov)
– featureCounts
– SeqMonk (graphical)
• Complex (re-assignment)
– Cufflinks, bitSeq, RSEM
Count Corrections
• Size of library
• Length of transcript
• Other factors
RPKM / FPKM / TPM
• RPKM (Reads per kilobase of transcript per million reads of library)
– Corrects for total library coverage
– Corrects for gene length
– Comparable between different genes within the same dataset
• FPKM (Fragments per kilobase of transcript per million reads of library)
– Only relevant for paired end libraries
– Pairs are not independent observations
– RPKM/2
• TPM (transcripts per million)
– Normalises to transcript copies instead of reads
– Corrects for cases where the average transcript length differs between
samples
Normalisation
Normalisation
Filtering Genes
• Remove things which are uninteresting or
shouldn’t be measured
• Reduces noise – easier to achieve significance
–
–
–
–
–
Non-coding (miRNA, snoRNA etc) in RNA-Seq
Known mis-spliced forms (exon skipping etc)
Mitochonidrial genes
X/Y chr genes in mixed sex populations
Unknown/Unannotated genes
Filtering Mouse mRNAs
Non coding
ESTs
Predicted genes
Good transcripts
Visualising Expression
• Comparing the same gene in different samples
– Normalised log2 RPM values
• Comparing different genes in the same sample
– Normalised log2 RPKM values
Linear
Log2
CD74
Eef1a1
Actb
Lars2
Eef2
Differential Expression
• Microarrays traditionally used continuous
statistical tests (t-test ANOVA etc)
• RNA-Seq differs in that it is count based data,
so continuous tests fail at low counts
• Most differential tests use count based
distribution tests, usually based on a negative
binomial distribution
Negative binomial tests (DE-Seq)
• Are the counts we see for gene X in condition
1 consistent with those for gene X in condition
2?
• Initially modelled using simple Poisson
distribution using mean expression as the only
parameter
• Doesn’t model real data very well
Poisson vs Negative binomial
Poisson
DESeq
Parameters
• Size factors
– Estimator of library sampling depth
– More stable measure than total coverage
– Based on median ratio between conditions
• Variance – required for NB distribution
– Custom variance distribution fitted to real data
– Insufficient observation to allow direct measure
– Smooth distribution assumed to allow fitting
Dispersion shrinkage
• Plot observed per gene dispersion
• Calculate average dispersion for
genes with similar observation
• Individual dispersions regressed
towards the mean. Weighted by
– Distance from mean
– Number of observations
• Points more than 2SD above the
mean are not regressed
Other filters
• Cook’s Distance – Identifies high variance
–
–
–
–
–
Effect on mean of removal of one replicate
For n<3 test not performed
For n = 3-6 failures are removed
For n>6 outliers removed to make trimmed mean
Disable with cooksCutoff=FALSE
• Hit count optimisation
– Low intensity reads are removed
– Limits multiple-testing to give max significant hits
– Disable with independentFiltering=FALSE
Replicates
• Compared to arrays, RNA-Seq is a very clean
technical measure of expression
– Generally don’t run technical replicates
• Some statistics can be run on single replicates,
but they can only tell you about technical noise
(how likely is it that this change is due to a
technical issue)
• Assessing biological variation requires biological
replicates
Replicates
• Traditional statistics require min 3x3
• DESeq can operate at 2x2, but this is minimum, not recommended
• True number of replicates required will depend on your biology and
requirements
• 4x4 design is fairly common
• Always expect at least one sample to fail
• Randomise samples during sample prep
The problem of power…
Gene A (1kb)
Gene B (8kb)
• In a library Gene B is much better observed for
the same copy number
• Power to detect DE is proportional to length
5x5 Replicates
5,000 out of 22,000 genes
(23%) identified as DE using
DESeq (p<0.05)
Intensity difference test
• Different approach to differential expression
• Doesn’t aim to find every differentially
expressed gene
• Conservative test
• Guaranteed to never return large numbers of
hits
Assumptions
• Noise is related to observation level
– Similar to DESeq
• Differences between conditions are either
– A direct response to stimulus
– Noise, either technical or biological
• Find points whose differences aren’t explained
by general disruption
Method
Results
Exercises
• Look at raw QC
• Mapping with tophat
– Small test data
• Quantitation and visualisation with SeqMonk
– Larger replicated data
• Differential expression with DESeq
• Review in SeqMonk
Useful links
•
•
•
•
•
•
•
FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Tophat http://tophat.cbcb.umd.edu/
SeqMonk http://www.bioinformatics.babraham.ac.uk/projects/seqmonk/
Cufflinks http://cufflinks.cbcb.umd.edu/
DESeq http://www-huber.embl.de/users/anders/DESeq/
Bioconductor http://www.bioconductor.org/
RSEM http://deweylab.biostat.wisc.edu/rsem/

similar documents