March 22 - Mouse Genome Informatics

last time
• pbx1 assignment: find the location of the probes
in another one of the probesets for zebrafish.
• Read limma documentation
• Run limma on your data set
• Be sure you have your Galaxy account set up
UCSC Genome Browser on Zebrafish Jul. 2010 (Zv9/danRer7) Assembly
From gene list to interpretation
• limma will generate a list of probeset ids for
differentially expressed genes
– What next?
• Convert the probeset ids to gene symbols
• Look for enrichment of functional terms
associated with the genes in your list
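The enrichment step above can be illustrated with a one-sided hypergeometric test. This is a minimal sketch of one common way to score enrichment, not necessarily the statistic any particular tool reports. The function name and the numbers in the usage line are hypothetical.

```python
from math import comb


def enrichment_p_value(n_genome, n_term, n_list, n_overlap):
    """Probability of seeing at least n_overlap term-annotated genes in a
    list of n_list genes drawn at random from n_genome genes, of which
    n_term carry the functional term (one-sided hypergeometric test)."""
    total = comb(n_genome, n_list)
    upper = min(n_term, n_list)
    tail = sum(
        comb(n_term, k) * comb(n_genome - n_term, n_list - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20000 genes on the array, 200 annotated with the term,
# 100 differentially expressed genes, 8 of which carry the term.
p = enrichment_p_value(20000, 200, 100, 8)
print(f"enrichment p-value: {p:.3g}")
```

A small p-value suggests the term appears in the gene list more often than chance would predict. When many terms are tested, the p-values would still need a multiple-testing correction.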
• Use of next-generation sequencing technology (NGS)
to measure RNA levels
• RNA Seq advantages:
– Wider dynamic range compared to microarray technology
– Not dependent on known genome annotations
– Higher throughput compared to microarray technology
• RNA Seq challenges:
– Specificity versus completeness of alignments, especially
for short sequence reads
– Manipulation and analysis of large files
– Data storage costs
RNA Seq Library Prep
Sequencing Technologies
Sequence “Space”
• Roche 454 – Flow space
– Measure pyrophosphate released by a nucleotide when it is added to a growing
DNA chain
– Flow space describes sequence in terms of these base incorporations
• AB SOLiD – Color space
– Sequencing by DNA ligation via synthetic DNA molecules that contain two nested
known bases with a fluorescent dye
– Each base sequenced twice
• Illumina/Solexa – Base space
– Single-base extensions of fluorescent-labeled nucleotides with protected 3′ OH
– Sequencing via cycles of base addition/detection followed by deprotection of the 3′ OH
• GenomeTV – Next Generation Sequencing (lecture)
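To make "color space" concrete, here is a minimal sketch of the two-base encoding usually described for SOLiD reads. Each color summarizes a pair of adjacent bases, so every base contributes to two colors. The function names are hypothetical, and the sketch assumes a non-empty, uppercase sequence containing only A, C, G and T.

```python
# Two-bit codes chosen so that the color of a dinucleotide is the XOR of its
# two bases: identical bases give color 0, and so on.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_color_space(seq):
    """Return the first base followed by one color (0-3) per adjacent base pair."""
    colors = [
        str(BASE_TO_BITS[a] ^ BASE_TO_BITS[b]) for a, b in zip(seq, seq[1:])
    ]
    return seq[0] + "".join(colors)


def decode_color_space(cs):
    """Rebuild the base sequence from the leading base and the color calls."""
    bases = [cs[0]]
    for color in cs[1:]:
        previous = BASE_TO_BITS[bases[-1]]
        bases.append(BITS_TO_BASE[previous ^ int(color)])
    return "".join(bases)


print(encode_color_space("ATGGC"))   # A3103
print(decode_color_space("A3103"))   # ATGGC
```

Because decoding depends on every preceding color, a single miscalled color changes all downstream bases. This is one reason color-space reads are normally aligned in color space rather than translated first.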
Further Reading
• Metzker, ML. (2010) Sequencing technologies - the next generation.
Nature Reviews Genetics 11:31-46.
Short Read Archive
Short Read Archive Handbook
Aspera Connect
High performance file
transfer for getting data from
the Short Read Archive
SRA Toolkit
RNA Seq Workflow
• RNA Seq
– FASTQ file format
• Alignment
– SAM file format
• Annotation
– GTF, BED file format
• Alignment Counts
• Statistical analysis
FASTQ: Data Format
– Text based
– Encodes sequence calls and quality scores with ASCII
– Stores minimal information about the sequence read
– 4 lines per sequence
• Line 1: begins with @; followed by sequence identifier and optional description
• Line 2: the sequence
• Line 3: begins with “+”, optionally followed by the sequence identifier
and description
• Line 4: encoding of quality scores for the sequence in line 2
• References/Documentation
– Cock et al. (2009). Nuc Acids Res 38:1767-1771.
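The four-line record layout described above can be read with a few lines of Python. This is a minimal sketch, assuming each sequence and quality string sits on a single line, as the "4 lines per sequence" description implies. The function name and the example record are made up for illustration.

```python
import io


def read_fastq(handle):
    """Yield (title, sequence, quality) tuples from a 4-line-per-record FASTQ file."""
    while True:
        header = handle.readline()
        if not header:              # end of file
            return
        header = header.rstrip("\r\n")
        if not header:              # tolerate blank lines between records
            continue
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near: {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in: {header!r}")
        yield header[1:], seq, qual


# Made-up record, used only to exercise the parser.
example = io.StringIO("@read1 example description\nGATTACA\n+\nIIIIIII\n")
for title, seq, qual in read_fastq(example):
    print(title, seq, qual)
```

For real files, `open("reads.fastq")` (a hypothetical file name) would replace the `io.StringIO` object.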
FASTQ Example
For analysis, it may be necessary to convert to the Sanger form of FASTQ.
For example, Illumina stores quality scores ranging from 0-62; Sanger
quality scores range from 0-93. Solexa quality scores have to be
converted to PHRED quality scores.
FASTQ example from: Cock et al. (2009). Nuc Acids Res 38:1767-1771.
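The quality score conversions mentioned above can be sketched as follows. The sketch assumes the conventions described by Cock et al.: Sanger FASTQ stores PHRED scores with an ASCII offset of 33, Illumina 1.3+ stores PHRED scores with an offset of 64, and early Solexa files store Solexa scores with an offset of 64. The function names are hypothetical.

```python
import math


def illumina13_to_sanger(qual):
    """Re-encode PHRED scores from ASCII offset 64 (Illumina 1.3+) to offset 33 (Sanger)."""
    return "".join(chr(ord(c) - 64 + 33) for c in qual)


def solexa_to_phred(q_solexa):
    """Convert one Solexa quality score to the PHRED scale."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def solexa_string_to_sanger(qual):
    """Convert a Solexa-encoded quality string (offset 64) to a Sanger string (offset 33)."""
    converted = []
    for c in qual:
        phred = round(solexa_to_phred(ord(c) - 64))
        converted.append(chr(phred + 33))
    return "".join(converted)


print(illumina13_to_sanger("hhhh"))      # IIII
print(round(solexa_to_phred(0), 2))      # 3.01
```

The two scales nearly coincide for high-quality bases and differ mainly at low scores, which is why only the Solexa variant needs a numeric conversion rather than a simple offset shift.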
Example Data
Data deposited in GEO with accession id GSE20846
SRP002119 (study/project)
SRX017794 (experiment)
SRS025246 (source)
SRR037945 (run)
SRR037946 (run)
• NCBI’s SRA Tools contains utilities to convert
SRA format to FASTQ
– fastq-dump
• If the utilities and the SRA-formatted file are in the
same directory, the command line is:
fastq-dump <name of sra formatted file>
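When several runs need converting, the same command can be driven from a script. This is a minimal sketch that calls fastq-dump with only the file name, exactly as in the command above. It assumes fastq-dump is on the PATH, and the function names and the file name in the usage comment are hypothetical.

```python
import shutil
import subprocess


def build_fastq_dump_command(sra_file):
    """Return the argument list for the plain 'fastq-dump <file>' invocation."""
    return ["fastq-dump", sra_file]


def convert_sra_to_fastq(sra_file):
    """Run fastq-dump on one SRA-formatted file, failing loudly on errors."""
    if shutil.which("fastq-dump") is None:
        raise FileNotFoundError(
            "fastq-dump not found on PATH; install the SRA Toolkit first"
        )
    subprocess.run(build_fastq_dump_command(sra_file), check=True)


# Usage (hypothetical file name, not run here):
#   convert_sra_to_fastq("SRR037945.sra")
```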
NOTE: Downloading and working with next generation sequence data will
very quickly exceed the capacity of a typical desktop or laptop computer. You
will need appropriate infrastructure in place to work with these files, or
consider scalable Cloud storage and compute services!
TopHat is better suited to aligning RNA Seq data
than general-purpose short-read aligners
(Maq, BWA) because it takes
splicing into account during
the alignment process.
Figure from: Trapnell et al. (2010). Nature Biotechnology 28:511-515.
Trapnell et al. (2009). Bioinformatics 25:1105-1111.
TopHat is built on the Bowtie aligner.
Trapnell et al. (2009). Bioinformatics 25:1105-1111.
SAM (Sequence Alignment/Map)
• It may not be necessary to align reads from
scratch; you can instead use existing
alignments in SAM format
– SAM is the output of aligners that map reads to a
reference genome
– Tab-delimited, with a header section and an alignment section
• Header lines begin with @ (the header section is optional)
• Alignment section has 11 mandatory fields
– BAM is the binary format of SAM
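The layout of a SAM alignment line can be shown with a short parser. This is a minimal sketch that uses the field names from the current SAM specification (older versions of the specification name some of these fields differently). The function name and the example line are made up.

```python
from collections import namedtuple

# The 11 mandatory fields, in column order; optional tags follow them.
SAM_FIELDS = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]
Alignment = namedtuple("Alignment", [f.lower() for f in SAM_FIELDS] + ["tags"])


def parse_sam_line(line):
    """Parse one SAM line; return None for header lines (those starting with @)."""
    if line.startswith("@"):
        return None
    cols = line.rstrip("\r\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("alignment line has fewer than 11 mandatory fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = cols[:11]
    return Alignment(
        qname, int(flag), rname, int(pos), int(mapq), cigar,
        rnext, int(pnext), int(tlen), seq, qual, cols[11:],
    )


# Made-up alignment line, used only to exercise the parser.
line = "read1\t0\tchr1\t100\t255\t7M\t*\t0\t0\tGATTACA\tIIIIIII\tNM:i:0"
aln = parse_sam_line(line)
print(aln.rname, aln.pos, aln.cigar, aln.tags)
```

BAM files hold the same records in binary form, so they would need a dedicated library or a conversion to SAM before a text parser like this could read them.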
Mandatory Alignment Fields
Alignment Examples
Alignments in SAM format
Cufflinks
• Assembles transcripts
• Estimates their abundances
• Tests for differential expression and regulation
in RNA-Seq samples
Trapnell et al. (2010). Nature Biotechnology 28:511-515.
Cufflinks Output
• Gene expression
• Transcript expression
• Assembled transcripts
• Mapping reads to specific transcripts/genes
Data Visualization
• UCSC Browser (accessible from Galaxy)
• Trackster (native to Galaxy)
External visualization tools:
• Genome Workbench
• Integrative Genomics Viewer (IGV)
Statistical Analysis
• Once the mapping and genome summarization are
done, the data can be analyzed just like any other
count data
• Bullard, et al. (2010). Evaluation of statistical
methods for normalization and differential
expression in mRNA-Seq experiments. BMC
Bioinformatics 11:94.
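Before counts from different samples are compared, they are usually normalized for sequencing depth. The sketch below shows the simplest option, scaling by total count (counts per million). It is only an illustration, and the cited paper compares several alternatives. The function name and the gene names are hypothetical.

```python
def counts_per_million(counts):
    """Scale raw per-gene read counts so that each sample sums to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no mapped reads")
    return {gene: 1e6 * n / total for gene, n in counts.items()}


# Hypothetical raw counts for one sample.
sample = {"geneA": 1, "geneB": 3}
print(counts_per_million(sample))
```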
Typical RNA Seq Project Workflow
Tissue Sample
Total RNA
FASTQ file
JAX Computational Sciences Service
See Tutorial 1
Build and share data and analysis workflows
No programming experience required
Strong and growing development and user community
RNA Seq Workflow
• Convert data to FASTQ
• Upload files to Galaxy
• Quality Control
– Throw out low quality sequence reads, etc.
• Map reads to a reference genome
– Many algorithms available
– Trade off between speed and sensitivity
• Data summarization
– Associating alignments with genome annotations
– Counts
• Data Visualization
• Statistical Analysis
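The data summarization step in the workflow above, associating alignments with genome annotations and counting, can be sketched as an interval-overlap count. This is a minimal illustration that treats each alignment as a single unspliced interval with 1-based inclusive coordinates and scans every gene for every read. Production tools use indexed lookups and handle spliced reads. The function name and all coordinates are made up.

```python
from collections import Counter


def count_reads_per_gene(alignments, genes):
    """Count alignments overlapping each gene.

    alignments: iterable of (chrom, start, end) tuples.
    genes: dict mapping gene name -> (chrom, start, end).
    A read overlapping several genes is counted once for each of them.
    """
    counts = Counter({name: 0 for name in genes})
    for chrom, start, end in alignments:
        for name, (g_chrom, g_start, g_end) in genes.items():
            if chrom == g_chrom and start <= g_end and end >= g_start:
                counts[name] += 1
    return dict(counts)


# Made-up annotation and alignments.
genes = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 500, 600)}
reads = [("chr1", 150, 185), ("chr1", 190, 225), ("chr1", 590, 625), ("chr2", 150, 185)]
print(count_reads_per_gene(reads, genes))   # {'geneA': 2, 'geneB': 1}
```

The resulting table of counts per gene is the input to the statistical analysis step.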
Dialog/Parameter Selection
Uploading Data to Galaxy
Because of the size of most sequence files, it is necessary to use FTP
to get files to Galaxy.
Select the appropriate reference genome at the time of data upload.
You can upload compressed files, and they will be uncompressed
upon loading into Galaxy.
Tutorial Web Site
This site will be accessible
after the meeting. Check
back for updates and new
next time
• Analyze project data with DAVID
– Convert probeset ids to genes
– Look for enrichment of functional terms
• Try the first part of Tutorial 5 in Galaxy