NGS Bioinformatics Workshop
2.2 Tutorial – Whole Genome Assembly
Part I
May 9th, 2012
IRMACS 10900
Facilitator: Richard Bruskiewich
Adjunct Professor, MBB
Workflow for Today
Generate a synthetic NGS read data set
Genome assembly
Generate synthetic NGS read data for assembly
• Try out a new program called “ART” from Boston College
Huang W, Li L, Myers JR, Marth GT. 2012. ART: a next-generation
sequencing read simulator. Bioinformatics. 28(4):593-4
• Available as open source and as binary programs for 32- or 64-bit
Windows, Mac and Linux
• Notes:
  - The binary archive names are a bit strange: each is really a .tar.gz in disguise (you need
    to do a gunzip followed by a tar -xvf)
  - The FASTQ sequence lines are *lower case*, which some
    software (e.g. ABySS) does not expect
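The unpacking note above can also be handled from Python, whose standard tarfile module reads a gzip-compressed tar archive whatever its file name claims. This is a minimal sketch; the function name and the archive path you pass in are placeholders, not anything ART prescribes.

```python
import tarfile

def unpack_tgz(archive_path, dest_dir="."):
    """Extract a gzip-compressed tar archive, ignoring its file extension."""
    # "r:gz" forces gzip decompression instead of guessing from the name
    with tarfile.open(archive_path, mode="r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)
    return names
```

Calling unpack_tgz on the downloaded file returns the list of extracted member names, which is a quick way to see where the binaries ended up.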
Simulated Illumina Paired End Reads
Using rice chloroplast genome (~134kb)
art_illumina -i Chloroplast.fasta \
    -p -l 50 -f 20 -m 200 \
    -s 10 -o Chloroplast -sam
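As a rough sanity check on the command above, the number of reads to expect follows from the fold coverage (-f 20), the read length (-l 50) and the genome length. This sketch assumes coverage is defined as total read bases divided by genome length, and uses the approximate ~134 kb genome size quoted on this slide, so ART's exact count may differ somewhat.

```python
def expected_reads(genome_length, fold_coverage, read_length):
    """Approximate read count needed to reach a target fold coverage."""
    # coverage = (reads * read_length) / genome_length, solved for reads
    return fold_coverage * genome_length // read_length

genome_length = 134_000   # ~134 kb rice chloroplast genome (approximate)
reads = expected_reads(genome_length, fold_coverage=20, read_length=50)
pairs = reads // 2        # paired-end: two reads per fragment
print(f"about {reads} reads, i.e. about {pairs} read pairs")
```

Comparing this estimate with the line count of the FASTQ files (four lines per read) is a quick way to confirm the simulation ran as intended.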
Generates files:
ART (Q Version 1.3.6)
Copyright(c) 2008-2012, Weichun Huang, Jason Myers. All Rights Reserved.
Paired-end Simulation
Total CPU time used: 2.48
Parameters used during run
Read Length:
Fold Coverage:
Mean Fragment Length:
Standard Deviation:
Profile Type:
ID Tag:
Quality Profile(s)
First Read: EMP50R1 (built-in profile)
Second Read: EMP50R2 (built-in profile)
Output files
FASTQ Sequence Files:
the 1st reads: Chloroplast1.fq
the 2nd reads: Chloroplast2.fq
ALN Alignment Files:
the 1st reads: Chloroplast1.aln
the 2nd reads: Chloroplast2.aln
SAM Alignment File:
The ART program generates peculiar IDs
(it doesn’t mark the paired-end reads…) and
lower-case sequence letters, which cause
some headaches…
So, I wrote a small Python script to fix this…
# Fixes the output of the ART program
# art_illumina -i reference.fa -p -l 50 -f 20 -m 200 -s 10 -o outFile_prefix -sam
# Reads FASTQ on standard input and writes the fixed FASTQ to standard output.
from sys import stdin

seq = False   # True when the next line is a sequence line
qual = False  # True when the next line is a quality line

if __name__ == '__main__':
    for line in stdin:
        line = line.strip()
        if qual:
            qual = False  # to avoid treating rare quality score lines that start with '@' as IDs
        elif line.startswith('+'):
            qual = True
        elif not seq and line.startswith('@'):
            # massage the ID
            part1 = line.split('|')
            part2 = part1[1].split('-')
            line = part1[0] + '_' + part2[0] + '-' + part2[1] + '/' + part2[2]
            seq = True
        elif seq:
            # convert sequence all to upper case to avoid downstream confusion...
            line = line.upper()
            seq = False
        print(line)
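For reference, the same clean-up can be written as a Python 3 function that is easy to try on a single record. This is only a sketch: the function name is made up here, and the read ID in the example is hypothetical, shaped to match what the script above expects (one '|' followed by three '-'-separated fields) rather than copied from real ART output.

```python
def fix_art_fastq(lines):
    """Yield FASTQ lines with reshaped IDs and upper-case sequences."""
    expect_seq = False   # the next line is a sequence line
    expect_qual = False  # the next line is a quality line
    for line in lines:
        line = line.strip()
        if expect_qual:
            # quality strings may start with '@' or '+', so pass them through
            expect_qual = False
        elif expect_seq:
            line = line.upper()
            expect_seq = False
        elif line.startswith('+'):
            expect_qual = True
        elif line.startswith('@'):
            parts = line.split('|')
            fields = parts[1].split('-')
            line = f"{parts[0]}_{fields[0]}-{fields[1]}/{fields[2]}"
            expect_seq = True
        yield line

# Hypothetical record, for illustration only
record = ["@chl|ref-1021-1", "acgtacgt", "+", "@IIIIIII"]
print(list(fix_art_fastq(record)))
```

Because the function takes any iterable of lines, it can be fed an open file handle just as well as a list, and the quality line starting with '@' in the example shows why the state flags are needed.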
Getting ABySS
For Ubuntu, sudo apt-get install abyss
Or visit BCGSC and download the tar.gz source, then
run configure and make (possibly more up to date?)
Perhaps put the abyss bin directory on your path…
To test run ABySS:
abyss-pe k=25 name=test 
se=https://raw.github.com/dzerbino/ 
Try our test PE read data set
abyss-pe name=Chloroplast31 k=31 \
    ABYSS_OPTIONS=--no-trim-masked \
    in='Chloroplast1.fastq Chloroplast2.fastq'
The --no-trim-masked option is needed because the default
behaviour of ABySS is to trim lower-case letters
in sequences (which designate identified vector
sequences in 454 outputs…)
Try with other k-mer sizes…
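Trying several k-mer sizes is easy to script. The sketch below only builds and prints one abyss-pe command line per k, reusing the options from the command above; the list of k values is an arbitrary choice for illustration, and each assembly gets its own name so that outputs do not overwrite one another.

```python
import shlex

def abyss_command(k, reads=("Chloroplast1.fastq", "Chloroplast2.fastq")):
    """Build the abyss-pe argument list for one k-mer size."""
    return [
        "abyss-pe",
        f"name=Chloroplast{k}",
        f"k={k}",
        "ABYSS_OPTIONS=--no-trim-masked",
        "in=" + " ".join(reads),
    ]

for k in (21, 25, 31, 35, 41):   # arbitrary example values
    # shlex.quote protects the space inside the in=... argument
    print(" ".join(shlex.quote(arg) for arg in abyss_command(k)))
```

To actually run the assemblies, each argument list could be passed to subprocess.run(cmd, check=True) instead of being printed.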
For more info about ABySS
Active list service to troubleshoot issues:
[email protected]
download & tar -zxvf
sudo make install
put velvet directory on your $PATH
Run velveth:
velveth outputdir k_mer -fastq readfile
Run velvetg:
velvetg outputdir -ins_length 200 -exp_cov 20
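One point worth checking when choosing -exp_cov: the Velvet manual expresses coverage in k-mers rather than nucleotides, related by C_k = C (L - k + 1) / L for nucleotide coverage C and read length L. The sketch below converts the 20-fold, 50 bp simulated data set for a few k values; the k values are arbitrary examples, and whether to pass the converted figure is a judgment call, not something these slides state.

```python
def kmer_coverage(nucleotide_coverage, read_length, k):
    """Convert nucleotide coverage C to k-mer coverage C_k = C * (L - k + 1) / L."""
    return nucleotide_coverage * (read_length - k + 1) / read_length

for k in (21, 25, 31):   # arbitrary example k-mer sizes
    print(f"k={k}: k-mer coverage about {kmer_coverage(20, 50, k):.1f}")
```

For k=31 this gives a k-mer coverage of 8, noticeably lower than the nucleotide coverage of 20, so the value given to -exp_cov may need adjusting as k changes.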
download and tar -zxvf
sudo make install
Execute the program:
PrepareAllPathsInputs.pl # needs some config files…
