Using Unix Shell Scripts to Manage Large Data
What is a Unix shell script?
• A collection of Unix commands may be stored in a
file, and csh/bash can be invoked to execute the
commands in that file.
• Like other programming languages, it has
variables and flow control statements, e.g.,
the foreach loops in the scripts below.
• You can run any shell simply by typing its name.
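A minimal bash sketch of variables and flow control. The names (prefix, total) and the values are hypothetical, chosen only for illustration; the for loop is the bash counterpart of csh's foreach.

```shell
#!/bin/bash
# Hypothetical example: names and values are illustrative only.
prefix="batch"          # a string variable
total=0                 # an integer accumulator

# Loop over a list of numbers (bash counterpart of csh's foreach)
for m in 1 2 3; do
    name="${prefix}${m}.txt"
    # Conditional: treat the first file differently from the rest
    if [ "$m" -eq 1 ]; then
        echo "$name: keep header"
    else
        echo "$name: drop header"
    fi
    total=$((total + m))   # integer arithmetic
done
echo "sum of indices = $total"
```

Saved as, say, demo.sh, this can be run with "bash demo.sh".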
Useful Unix commands
• grep: globally searches for regular expressions in files
and prints all lines that contain the expression
• cut: selects fields or characters from each line of a file
• head/tail: print the first/last n lines of a file
• wc: counts the characters/words/lines of a file
• split: reads a file and writes it in n-line pieces into a set
of output files
• cat/paste: join files by rows or columns
• join: merges two files on a common field (both files must be
sorted on that field)
• awk: a POWERFUL pattern scanning and processing
language
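A short bash sketch that tries several of these commands on a tiny tab-delimited table. The file demo.txt and its values are made up for illustration and are written to a temporary directory.

```shell
#!/bin/bash
# Illustrative data only: a tiny tab-delimited table in a temporary directory.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'ID\tsite1\tsite2\n'  > demo.txt
printf 'S1\t0.10\t0.90\n'   >> demo.txt
printf 'S2\t0.20\tNA\n'     >> demo.txt
printf 'S3\t0.30\t0.70\n'   >> demo.txt

head -n 1 demo.txt          # header line
tail -n 2 demo.txt          # last two lines
cut -f 1,3 demo.txt         # columns 1 and 3 (tab is cut's default delimiter)
grep -c 'NA' demo.txt       # number of lines containing "NA"
split -l 2 demo.txt part_   # 2-line pieces: part_aa, part_ab

# wc -l counts lines (header included); tr strips padding some systems add
nrow=$(wc -l < demo.txt | tr -d ' ')
# awk: mean of column 2, skipping the header (NR > 1)
mean=$(awk -F'\t' 'NR > 1 { s += $2; n++ } END { printf "%.2f", s / n }' demo.txt)
echo "lines = $nrow, mean(site1) = $mean"
```

For a comma-delimited file, cut needs -d ',' and awk needs -F','.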
Motivating example
• Genome-wide DNA methylation data
– ~3000 samples (rows)
– ~485,000 sites (columns)
– Data came in batches (~300 samples per file, ~1 GB)
– For our analysis, we would like to:
• Pool all samples together
• Then split the pooled data into ~50,000 sites per file
– Load into R? That would take ~14 GB of memory, and R takes
hours to read each file
– Using csh scripts, it takes only ~10 minutes
csh script: pool samples
cd /dir
rm -f cpg.txt
# start from the first batch, header included
cp -f All_Beta_Values1.txt cpg.txt
foreach m (`seq 2 9`)
    # count the number of samples (lines minus the header)
    @ l = `wc -l < All_Beta_Values${m}.txt` - 1
    echo "file = ${m}, nrow = $l"
    rm -f test.txt
    # remove the header by keeping only the last $l lines
    tail -n $l All_Beta_Values${m}.txt > test.txt
    cat test.txt >> cpg.txt
end
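The same pooling step can be sketched in bash. This is an alternative, not the script from the slide: it uses tail -n +2, which prints from line 2 onward, so no line count is needed. The three toy batch files and their values are made up so the sketch runs on its own.

```shell
#!/bin/bash
# Toy stand-ins for the batch files (3 files, 2 samples each); values are made up.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
for m in 1 2 3; do
    printf 'ID\tcg1\tcg2\n'          >  "All_Beta_Values${m}.txt"
    printf 'S%da\t0.1\t0.2\n' "$m"  >> "All_Beta_Values${m}.txt"
    printf 'S%db\t0.3\t0.4\n' "$m"  >> "All_Beta_Values${m}.txt"
done

# The first batch is copied whole, so the pooled file keeps exactly one header
cp -f All_Beta_Values1.txt cpg.txt
for m in 2 3; do
    # tail -n +2 prints from line 2 onward, i.e. everything but the header
    tail -n +2 "All_Beta_Values${m}.txt" >> cpg.txt
done

pooled=$(wc -l < cpg.txt | tr -d ' ')
echo "pooled lines (header + samples) = $pooled"
```

This assumes every batch file lists the sites in the same column order; the sketch does not check that.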
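csh script: split by sites
cd /dir
# input is cpg.txt.gz (presumably the pooled cpg.txt, gzip-compressed)
foreach n (`seq 1 9`)
    rm -f beta2950_${n}of10.txt
    # start column (column 1 is kept in every file)
    @ l = ($n - 1) * 50000 + 2
    # end column
    @ r = $n * 50000 + 1
    zcat cpg.txt.gz | cut -f 1,$l-$r > beta2950_${n}of10.txt
end
# the last file takes all remaining columns
zcat cpg.txt.gz | cut -f 1,450002- > beta2950_10of10.txt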
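A bash sketch of the same column arithmetic on a toy file, with the chunk size as a variable. The variable names (chunk, nsites, nfiles), the output names beta_*of*.txt, and the data are hypothetical; gzip -dc is used as a portable equivalent of zcat.

```shell
#!/bin/bash
# Toy pooled file: one ID column plus 5 site columns; values are made up.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'ID\tcg1\tcg2\tcg3\tcg4\tcg5\n'  > cpg.txt
printf 'S1\t0.1\t0.2\t0.3\t0.4\t0.5\n' >> cpg.txt
gzip -f cpg.txt                               # produces cpg.txt.gz

chunk=2                                       # sites per output file (50000 in the slide)
nsites=5                                      # number of site columns
nfiles=$(( (nsites + chunk - 1) / chunk ))    # ceiling division
for n in $(seq 1 "$nfiles"); do
    l=$(( (n - 1) * chunk + 2 ))              # first site column of this chunk
    r=$(( n * chunk + 1 ))                    # last site column of this chunk
    # cut stops at the last existing column, so the final chunk may be shorter
    gzip -dc cpg.txt.gz | cut -f "1,${l}-${r}" > "beta_${n}of${nfiles}.txt"
done
head -n 1 "beta_${nfiles}of${nfiles}.txt"
```

Because cut ignores columns past the end of the line, the last chunk needs no special case here. Each pass decompresses the whole file again, which is simple but repeats the I/O once per output file.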
Some tips
• To check whether a data file contains a header,
and whether it is tab- or comma-delimited
> head -n 1 filename
• To check a selected variable/column (e.g., to
see how missing values were coded)
> head -n 10 filename | cut -f #,#
• To get a subset of samples by matching ID
> grep -f ID.txt filename
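A small bash sketch of these tips on toy data (the file contents are made up). It also illustrates one caveat: plain grep -f treats each ID as a pattern that may match anywhere in a line, so S1 also matches S10. Adding -F (fixed strings) and -w (whole words) is one way to tighten the match.

```shell
#!/bin/bash
# Toy data file and ID list; values are made up.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'ID\tcg1\nS1\t0.1\nS10\tNA\nS2\t0.3\n' > filename
printf 'S1\n' > ID.txt

# Make tabs visible in the header line by replacing each with <TAB>
head -n 1 filename | awk '{ gsub(/\t/, "<TAB>"); print }'

# Plain grep -f: the pattern S1 matches both the S1 and the S10 line
loose=$(grep -c -f ID.txt filename)
# -F: fixed strings, -w: whole words only, so S1 no longer matches S10
strict=$(grep -c -F -w -f ID.txt filename)
echo "loose = $loose, strict = $strict"
```

Note that -w still matches the ID anywhere in the line, not only in the first column, and ID.txt should not contain empty lines, since an empty pattern matches every line.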

