NGS Bioinformatics Workshop
1.3 Tutorial - Sequence Alignment and
March 22nd, 2012
IRMACS 10900
Facilitator: Richard Bruskiewich
Adjunct Professor, MBB
Learning Objectives
A few more Linux tips
FASTA and BLAST on the web
Local BLAST
Local installation of BLAST
Making a blast database
Running a local blast query
Parsing search results using a script
First, a few more Linux tips…
 “cat”, “more” or “less”: list contents of a (text) file
 “head” or “tail”: display the start or end of a file
 One can ‘redirect’ the contents of a program into or out of
a file:
 ‘<‘ for input (‘<<TAG‘ appends from keyboard)
 ‘>’ for output (‘>>’ appends rather than overwrites)
 TRY:
cat >myfile.txt <<EOF
This is my file!
Another line
 “more” or “less”: like “cat” but controllable
 “wget”: downloads an internet file
 “which”: displays where a program is located on the system
 “mkdir”: makes a directory
 “cp”: copies files or directories
 “mv”: moves a file or directory
 “ln”: creates alias names/locations to files
 ln –s source target # -s link “symbolic”
 e.g. ln –s ncbi-blast-2.2.26+ ncbi
 “export”: exposes an environment variable, e.g.
 export BLAST=/usr/local/ncbi
Environment variables provide general configuration and global
contextual information that operating system scripts and
computer programs can read if they need to know about such
File archives (and programs)
 “.gz” files: gzip, gunzip archive commands
 “.tar” files: tar command
-x # extract
-f filename # file to extract
-v # verbose output
-z # uncompress gz file on the fly…
 tar –xvf file.tar
 Tar –zxvf file.tar.gz
 “.bz2”: bzip2 and bunzip2 archive commands
 “.zip”: zip and unzip archive commands
FASTA Walkthrough
Database: Knowledgebase
(complete source for protein
information, combines other
databases), Swiss-Prot (only
manually-curated proteins)
Paste your sequence here!!!!
Program: fasta (DNA or protein
sequence against a like
Matrix: scoring matrix used
(first click More options… )
Results: interactive or email
Don’t forget to push SUBMIT!
BLAST Walkthrough
Note: the program used;
nucleotide blast (blastn), protein
blast (blastp), etc. is chosen at
an earlier screen
Paste your sequence here!!!
Database: nr, swissprot, etc.
Algorithm: blastp (protein blast)
To configure blastp options,
click Algorithm parameters.
BLAST Walkthrough
Algorithm paramters
Expect value: can be changed
Filtering: Off by default, can turn
Don’t forget to push BLAST!
BLAST Results
Protein domains in your
Graphical representation of
the alignment results; each
line represents an
alignment; color indicates
List of the hits from the
database you searched
against (ID, name, E-value
of top HSP)
 click on score to jump
down to textual alignment
Individual display for each
alignment (HSP)
You can also use a Script to directly do the
BLAST’ing – e.g. Biopython
Example 1:
If you have a nucleotide sequence you want to search against the
nucleotide database (nt) using BLASTN, and you know the GI
number of your query sequence, you can use:
from Bio.Blast import NCBIWWW
result_handle = NCBIWWW.qblast("blastn", "nt", "8332116")
Then, save the result…
save_file = open("my_blast.xml", "w")
# save as XML format
More BLASTing…
Example 2:
Alternatively, if we have our query sequence already in a FASTA
formatted file, we just need to open the file and read in this
record as a string, and use that as the query argument:
from Bio.Blast import NCBIWWW
fasta_string = open("m_cold.fasta").read()
result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string)
See http://biopython.org/DIST/docs/tutorial/Tutorial.html for
more sample BLAST query code (or see the equivalent sections
of other open-bio toolkits)
Locally installed BLAST - Advantages
Search many input sequences at once
Customizable databases
No need for internet access
Faster for small to medium-sized databases
Integrate BLAST searches into a larger, automated
bioinformatics analysis
Can also run local BLAST through open-bio scripts (e.g. see
Parts of the standalone
BLAST equation
FASTA file
containing sequences
to build database to
BLAST against (NCBI or
your own file)
BLAST Programs
File downloaded from
NCBI contains:
Let’s try this ourselves….
We will:
Obtain and install the BLAST executables (Linux)
Set up a BLAST database
Copy an archive
Use ‘makeblastdb’ to create a novel database to search
Use the ‘blastp’ program to carry out a BLAST
analysis over the command-line
Output your BLAST results into a more flexible
Use a small BioPython script to parse the output
Step 1 – Installing BLAST tools
 Go to http://blast.ncbi.nlm.nih.gov and follow
links to latest release and follow instructions for
favorite operating system:
Windows 32/64: .exe installer
Linux 32/64: compiled binaries (RPM or tar.gz)
Other Unix: compiled binaries (in tar.gz)
 Apply platform-specific configuration details for
your operating system
 Read the good documentation:
Installing from Linux .tar.gz archive
tar -zxvf ncbi-blast-2.2.26+-x64-linux.tar.gz
sudo mv ncbi-blast-2.2.26+ /usr/local
cd /usr/local/
sudo ln -s ncbi-blast-2.2.26+ ncbi
export BLAST=/usr/local/ncbi
export PATH=$BLAST/bin:$PATH
Step 2 – Getting a BLAST database
Option A:
Pre-formatted databases: download archives
from NCBI: ftp://ftp.ncbi.nlm.nih.gov/blast/db/
Option B:
“Roll your own”: construct de novo from a file of
custom FASTA sequences using makeblastdb
Option A: Obtain a existing NCBI database... (Linux)
sudo mkdir $BLAST/db
tar -zxvf swissprot.tar.gz $BLAST/db
Option B: Create a simple BLAST database
from a local FASTA sequence file
The makeblastdb application produces BLAST databases from
FASTA files. In the simplest case the FASTA definition lines are not
parsed by makeblastdb and may be completely unstructured
(but can only be BLAST’ed and not be directly retrieved)
makeblastdb -in mydb.fsa -dbtype nucl
creates a BLAST database from a nucleotide FASTA sequences
which can be put into the “db” directory for searching. Of course,
like all blast programs, there are a rich set of parameters which
can be used to customize the generation of the database (see the
BLAST manual).
With either option, still need some
configuration details
# back to home directory
# need to point to the database…
cat >.ncbirc <<EOF
Step 3 – Executing a BLAST operation
Command line programs (only) but
parameters are generally equivalent to (or a
superset of) the NCBI web BLAST application
Sample run:
Retrieve a sequence from the database:
blastcmd –db swissprot –entry Q9MAH0 –outform "%f" –out test_query.txt
Blast it back against the same database:
blastp –query text_query.txt –db swissprot –out output.txt –outfmt 5
Different BLAST programs
blastn – search nucleotide database using a nucleotide query
blastp – search protein database using a protein query
blastx – search protein database using a translated nucleotide
tblastn – search translated nucleotide database using a protein
tblastx – search translated nucleotide database using a translated
nucleotide query
psiblast – Position-Specific Iterated BLAST
rpsblast – Reversed Position Specific BLAST
See BLAST documentation for related utility programs
Command line parameters: statistics
-evalue: expect value, normally set to 10
-word_size: “k-tuple” size; increase for speed, decrease
for sensitivity
-gapopen: cost to open a gap; increase for stringency
-gapextend: cost to extend a gap; increase for stringency
-matrix: substitution scoring matrix (default BLOSUM62);
change if sequences too related or too distant
To get more information use option “-help”
Command line parameters: input/output
-query in.txt: specify input file
-out out.txt: specify output file
-db nr: which database (created with makeblastdb)
-dust yes/no: filter low complexity regions in nucleotide
sequence search yes/no (default is yes)
-seg yes/no: filter low complexity regions in protein sequence
search yes/no (default is no)
-html: format output as HTML
-outfmt: specify output format, e.g. 5 = XML blast output
(use –help flag to see other options)
Additional useful program options
Depending on program:
-num_threads: use multiple CPUs (speeds up search)
-subject: specify a second input sequence instead of a database (former
-task megablast: much faster for highly similar nucleotide sequences
-task blastn_short: find similar short sequences (e.g. primer sequences)
Step 4 – Parse the output
If you just have one query sequence, simply
view the BLAST text file
If you are doing a lot of queries on the
database and looking for “best hits”, you may
wish to use a parsing script (e.g. biopython or
Parsing BLAST output with Biopython
 Good to use the BLAST XML format for this…
result_handle = open("my_blast.xml")
 Now that we’ve got a handle, we are ready to parse the
output. The code to parse it is really quite small.
1. If you expect a single BLAST result (i.e. you used a
single query):
from Bio.Blast import NCBIXML
blast_record = NCBIXML.read(result_handle)
More parsing…
2. or, if you have lots of results (i.e. multiple
query sequences):
from Bio.Blast import NCBIXML
blast_records = NCBIXML.parse(result_handle)
for blast_record in blast_records:
... # Do something with blast_record
What’s in a BLAST record?
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
if hsp.expect < E_VALUE_THRESH:
print '****Alignment****‘
print 'sequence:', alignment.title
print 'length:', alignment.length
print 'e value:', hsp.expect
print hsp.query[0:75] + '...‘
print hsp.match[0:75] + '...‘
print hsp.sbjct[0:75] + '...'
Gives output something like this…
sequence: >gb|AF283004.1|AF283004 Arabidopsis thaliana cold acclimation
protein WCOR413-like protein
alpha form mRNA, complete cds
length: 783
e value: 0.034
||||||||| | ||||||||||| || |||| || || |||||||| |||||| | | |||||||| | ||| ||...
Again, see http://biopython.org/DIST/docs/tutorial/Tutorial.html
for more details...

similar documents