Ptyhon_Beagle_2013

Report
Python for High Performance
Computing
Who is this for?
Know basics
of SciPy ?
Interested in
HPC
applications?
Know basics
of NumPy?
You already
know
Python?
Know basics
of Matplotlib
?
Facts about Python
Pros
Cons
For most users,
computer speed is
not as important as
the human
investing in coding!
Is Python good fit for your problem?
The CPU-intensive
parts of your
application may be
migrated to C or
Fortran
Prototyping
new
algorithms/ide
as
Communication is
NOT frequent in your
Rapid
development is
programs ( if it is you
need to spend time
writing Python MPI code)
If your application can make
good use of proper
python libraries:
NumPy, SciPy…
your primary concern
Yes
When you need to
easily and quickly
glue existing
components of
your program
written in different
languages
Python in interactive shell on Beagle
In the same path as
the running script
Python looks for
MODULES
In the system-wide
modules folder:
site-packages
In paths specified
in $PYTHONPATH
[email protected]:~> module avail python
python/2.7.1(default)
python/2.7.3 python/2.7.3-vanilla python/3.3.0-vanilla
[email protected]:~> module load python/2.7.3
Python version 2.7.3 loaded
[email protected]:~> export PYTHONPATH=${PYTHONPATH}:/soft/python/2.7/2.7.3/modules/numpy/1.7.0/lib/python2.7/site-packages
[email protected]:~> export PYTHONPATH=${PYTHONPATH}:/soft/python/2.7/2.7.3/modules/scipy/0.12.0/lib/python2.7/site-packages
[email protected]:~> python
Python 2.7.3 (default, Jul 10 2013, 13:27:48)
[GCC 4.7.2 20120920 (Cray Inc.)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> import scipy
PYTHONPATH in PBS script on Beagle
…
#PBS -l mppwidth=480
#PBS -j oe
. /opt/modules/3.2.6.6/init/bash
module swap PrgEnv-cray PrgEnv-gnu
module load numpy
export PATH=/soft/python/2.7/2.7.3/python/bin/:$PATH
export LD_LIBRARY_PATH=/soft/python/2.7/2.7.3/python/lib/:$LD_LIBRARY_PATH
export PYTHONPATH=${PYTHONPATH}:/soft/python/2.7/2.7.3/modules/numpy/1.7.0/lib/python2.7/site-packages
export PYTHONPATH=$PYTHONPATH:/soft/python/2.7/2.7.3/modules/mpi4py/1.3/lib/python2.7/site-packages/
cd $PBS_O_WORKDIR
aprun -n 20 -N 1 python myampijob.py
NumPy
NymPy
maps
large
regions
of
memory
to a
single
object
A powerful N-dimensional array object with fast calculations
(in an “element-by-element fashion”)
Supports all native data
types (int, floats…)
You can describe
z = np.arange(3, dtype=np.float)
custom objects
>>> z
array([ 0., 1., 2.])
Homogeneous collection of exactly the same
data-type, every item takes up the same size block of
memory, and each block of memory in the array is
interpreted in exactly the same way.
Linear algebra
Basic scientific toolkit:
Fourier transform
Random number capabilities
Basic statistics
Shape of the array:
Is a tuple of N integers
(one for each dimension)
using a data type object
which describes how the
bytes in the fixed-size
block of memory should be
interpreted.
Array Data Structure
Strides: a tuple which shows you how many bytes you have to move to get from first row to second row.
Setting this attribute to another tuple will change the way the memory is viewed.
>>> a=np.arange(9).reshape(3,3)
>>> a
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
>>> a.dtype
dtype('int64')
>>> a.strides
(24, 8)
NumPy Data Types
NumPy Data Types comparaison
import numpy as np
import timeit
setup = """
import numpy as np
A = np.ones((1000,1000,3), dtype=datatype)
"""
datatypes = "np.uint8", "np.uint16", "np.uint32", "np.uint64", "np.float16",
"np.float32", "np.float64"
stmt1 = """
A = A * 255 / 255 - 1 + 1
"""
stmt2 = """
A[:,:,:2] = A[:,:,:2] * A[:,:,:2]
"””
for stmt in [stmt1, stmt2]:
print stmt
for d in datatypes:
s = setup.replace("datatype", d)
T = timeit.Timer(stmt=stmt, setup=s)
print d,":", min(T.repeat(number=30))
print
print
[email protected]:~> python datatest.py
A = A * 255 / 255 - 1 + 1
np.uint8 : 1.2955300808
np.uint16 : 1.58471012115
np.uint32 : 1.67583012581
np.uint64 : 1.88479685783
np.float16 : 9.77974891663
np.float32 : 1.0303478241
np.float64 : 1.50705814362
A[:,:,:2] = A[:,:,:2] * A[:,:,:2]
np.uint8 : 1.17167806625
np.uint16 : 1.38199210167
np.uint32 : 1.75739693642
np.uint64 : 1.88689804077
np.float16 : 2.96481609344
np.float32 : 1.71590399742
np.float64 : 1.86970305443
“Structured” Arrays
# "Data structure" (dtype) that describes the fields and
# type of the items in each array element.
Elements of an array can
be any fixed-size data
structure!
>>> from numpy import *
>>> particle_dtype = dtype([('mass','float32'),
('velocity', 'float32')])
# This must be a list of tuples
>>> fmt = dtype([('name', 'S10'),
('age', int), …
('weight’ float) ... ])
>>> a = empty((3,4), dtype=fmt
>>> a.itemsize
26
>>> particles = array([(1,1), (1,2), (2,1), (1,3)],
dtype=particle_dtype)
>>> print particles
[(1.0, 1.0) (1.0, 2.0) (2.0, 1.0) (1.0, 3.0)]
>>> print particles['mass']
[ 1. 1. 2. 1.]
>>> print particles['velocity']
[ 1. 2. 1. 3.]
>>> particles[0]
(1.0, 1.0)
>>> particles.sort(order=('velocity','mass'))
>>> print particles
[(1.0, 1.0) (2.0, 1.0) (1.0, 2.0) (1.0, 3.0)]
“element-by-element fashion”
>>> a = np.array([[1, 2], [3, 4]])
>>> b = np.array([[2, 3], [4, 5]])
>>> a + b
array([[3, 5],[7, 9]])
>>> a * b
array([[ 2, 6],[12, 20]])
>>> np.multiply(a, b)
array([[ 2, 6],[12, 20]])
>>> a ** b
array([[ 1, 8],[ 81, 1024]])
>>> np.dot(a,b)
array([[10, 13], [22, 29]])
Numpy arrays properties
>>>import numpy as np
>>> import numpy as np.
>>> b = np.array([4,5,6])
>>> a=np.array([[ 0, 1],
...
[30, 3],
...
[ 4, 5],
...
[ 6, 7]])
>>> b.dtype
dtype('int64')
>>> b[2] = 12.6
>>> b
array([ 4, 5, 12])
>>> b.nbytes
24
>>> b>2
array([ True, True, True], dtype=bool)
>>> c=np.where(b>5)
>>>c
(array([2]),)
>>>b[c]
array([6])
a=np.arange(8).reshape(4,2)
>>> a.shape
(4, 2)
>>> d=a.T
>>> d
array([[ 0, 30, 4, 6],
[ 1, 3, 5, 7]])
>>> a.strides
(16, 8)
>>> a.T.strides
(8, 16)
Transpose doesn’t move
values in the memory it only
changes the order of
strides in array
Broadcasting: combine different dimensionality
>>> a=np.arange(12).reshape(4,3)
>>> a
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11]])
>>> b=np.arange(3)
>>> b
array([0, 1, 2])
>>> b.shape
(3,)
>>> a+b
array([[ 0, 2, 4],
[ 3, 5, 7],
[ 6, 8, 10],
[ 9, 11, 13]])
>>> a*b
array([[ 0, 1, 4],
[ 0, 4, 10],
[ 0, 7, 16],
[ 0, 10, 22]])
The smaller array broadcasts
across the larger array so that
they have compatible shapes!
WITHOUT COPYNG DATA
Broadcasting rule
>>> a=np.arange(12).reshape(4,3)
>>> a
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11]])
>>> b=np.arange(4)
>>> b
array([0, 1, 2, 3])
>>> a+b
If both your arrays are
twodimensional then
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: operands could not be broadcast together with shapes (4,3) (4)
>>> b=np.arange(4).reshape(4,1)
>>> b
array([[0],
[1],
[2],
[3]])
>>> a+b
array([[ 0, 1, 2],
[ 4, 5, 6],
[ 8, 9, 10],
[12, 13, 14]])
their corresponding
sizes have to be either
equal or one of them
has to be 1!
If you have two arrays with different
dimensions number, say one 1x2x3
and other 2x3, then you compare
only the trailing common
dimensions, in this case 2x3
Instantiating NumPy arrays
>>> import numpy as np
>>> a = np.array([1, 2, 3])
>>> a
array([1, 2, 3])
>>> b = np.ones((3,2))
>>> b
array([[ 1.,1.],[ 1.,1.],[ 1.,1.]])
>>> b.shape
(3,2)
>>> c = np.zeros((1,3), int)
>>> c
array([[0, 0, 0]])
>>> c.dtype
dtype('int64')
>>> d = np.linspace(1,5,11)
>>> d
array([ 1. , 1.4, 1.8, 2.2, 2.6, 3. , 3.4, 3.8, 4.2, 4.6, 5. ])
NumPy & Slicing
>>>a=np.arange(36).reshape(6,6)
>>> a[:3]
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17]])
>>> a[::2]
array([[ 0, 1, 2, 3, 4, 5],
[12, 13, 14, 15, 16, 17],
[24, 25, 26, 27, 28, 29]])
>>> a[0,3:5]
array([3, 4])
>>> a[4:,4:]
array([[28, 29],
[34, 35]])
>>> a[:,2]
array([ 2, 8, 14, 20, 26, 32])
>>> a[2::2,::2]
array([[12, 14, 16],
[24, 26, 28]])
var[lower:upper:step]
Slices are References!
Slices are references to memory in the
original array.
Changing values in a slice also changes
the original array.
>>> a = np.array((0,1,2,3,4))
>>> b = a[2:4]
>>> b
array([2, 3])
>>> b[0] = 10
>>> a
array([ 0, 1, 10, 3, 4])
Searching and Indexing
>>> a = np.array([1, 3, 0, -5, 0], float)
>>> np.where(a != 0)
(array([0, 1, 3]),)
>>> a[a != 0]
array([ 1., 3., -5.])
>>> len(a[np.where(a > 0)])
2
>>> x = np.arange(9.).reshape(3, 3)
>>> x
array([[ 0., 1., 2.],
[ 3., 4., 5.],
[ 6., 7., 8.]])
>>> np.where( x > 5 )
(array([2, 2, 2]), array([0, 1, 2]))
>>> b = np.array([10,20,30,40,50])
>>> i = np.where(a > 0)
>>> i
(array([0, 1]),)
>>> b[i]
array([ 10, 20])
>>> b
array([10, 20, 30, 40, 50])
Fast way to search
(and extract)
individual elements
of a NumPy array.
Much faster than using
a for loop or list
comprehension
Fancy Indexing
creates copies
instead of references!
Indexing with newaxis
newaxis is a special index that inserts a new axis in the array at the specified
location. Each newaxis increases the array’s dimensionality by 1.
>>> a = array((0,1,2))
>>> b = array((0,10,20,30))
>>> shape(a)
(3,)
>>> y = a[newaxis,:]
>>> shape(y)
(1, 3)
>>> y = a[:, newaxis, newaxis]
>>> shape(y)
(3, 1, 1)
>>> y = a[:, newaxis]
>>> shape(y)
(3, 1)
>>> y = a+ b[:, newaxis]
>>> y
array([[ 0, 1, 2],
[10, 11, 12],
[20, 21, 22],
[30, 31, 32]])
Example: Arrays from/to ASCII files
vi data.txt
-- BEGINNING OF THE FILE
% Day, Month, Year, Skip, Avg
01, 01, 2014, x876, 13 % crazy day!
% we don’t have Jan 03rd
04, 11, 2013, xfed, 55
vi fromto.py
from numpy import *
# loadtxt() automatically generates an array from txt file
file = loadtxt('data.txt', skiprows=1, dtype=int, delimiter=",",usecols = (0,1,2,4), comments = "%")
savetxt('fromto.txt', file)
Fancy Indexing in 2D
>>> a=np.arange(0,60).reshape(6,10)
>>> a[0:6,0:6]
>>> a[(0,1,2,3,4),(1,2,3,4,5)]
array([ 1, 12, 23, 34, 45])
>>> a[3:,[0, 2, 5]]
array([[30, 32, 35],
[40, 42, 45],
[50, 52, 55]])
mask = np.array([1,0,1,0,0,1], dtype=bool)
>>> a[mask,2]
array([ 2, 22, 52])
Example: Broadcasting & Fancy Indexing
“””
Calculate the sinc function: sin(r)/r. Use a Cartesian x,y grid
and calculate ``r = sqrt(x**2+y**2)`` with 0 in the center of the grid.
Calculate the function for -15,15 for both x and y.
"""
from numpy import linspace, sin, sqrt, newaxis
from pylab import imshow, gray, show
x = linspace(-15,15,101)
# flip y up so that it is a "column" vector.
y = linspace(-15,15,101)[:,newaxis]
# because of broadcasting rules, r is 2D.
r = sqrt(x**2+y**2)
# calculate our function.
sinc = sin(r)/r
# replace any location where r is 0 with 1.0
sinc[r==0] = 1.0
imshow(sinc, extent=[-15,15,-15,15])
gray()
show()
Array function methods: op.reduce (a, axis)
# multidimensional array, example
# 1D example
>>> a = array([1,2,3,4])
>>> add.reduce(a)
10
>>> a = array(['ab','cd','ef'],dtype=object)
>>> add.reduce(a)
'abcdef’
>>> a = array([1,1,0,1])
>>> logical_and.reduce(a)
False
>>> logical_or.reduce(a)
True
>>> a=arange(40).reshape(8,5)
>>> a
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24],
[25, 26, 27, 28, 29],
[30, 31, 32, 33, 34],
[35, 36, 37, 38, 39]])
>>> b=a[::2,0:3]
>>>b
array([[ 0, 1, 2],
[10, 11, 12],
[20, 21, 22],
[30, 31, 32]])
# default axis is 0: sum columns
>>>add.reduce(b)
array([60, 64, 68])
# axis a: sum rows
>>> add.reduce(b,1)
array([ 3, 33, 63, 93])
Array function methods: op.accumulate(a)
>>> a = array([1,2,3,4])
>>> add.accumulate(a)
array([ 1, 3, 6, 10])
>>> a = array(['ab','cd','ef'],dtype=object)
op.accumulate(a)
creates a
new array containing the
intermediate
results of the reduce operation
at
each element in a.
>>> add.accumulate(a)
array(['ab', 'abcd', 'abcdef'], dtype=object)
>>> a = array([1,1,0])
>>> logical_and.accumulate(a)
array([ True, True, False], dtype=bool)
>>> logical_or.accumulate(a)
array([ True, True, True], dtype=bool)
Array function methods: op.reduceat()
>>> a = array([0,10,20,30,40,50])
>>> indices = array([1,4])
>>> add.reduceat(a,indices)
array([60, 90])
op.reduceat (a,indices)
applies op to ranges in the 1-D
array a
defined by the values in
indices.
The resulting array has
the same length
as indices.
NOTE: For multidimensional arrays,
reduceat() is always applied along the
last axis (sum of rows for 2-D arrays).
This is different from the default for
reduce() and accumulate().
Array function methods: op.outer()
op.outer(a,b) forms all possible combinations of elements between
a and b using op. The shape of the resulting array results from
concatenating the shapes of a and b. (Order matters!)
Array function methods: op.choose()
# clip lower values up 10
# clip lower and upper values
>>> b
array([[ 0, 1, 2],
[10, 11, 12],
[20, 21, 22],
[30, 31, 32]])
>>> lt = b < 10
>>> gt = b > 15
>>> b<10
array([[ True, True, True],
[False, False, False],
[False, False, False],
[False, False, False]], dtype=bool)
>>> choose(b<10,(b,10))
array([[10, 10, 10],
[10, 11, 12],
[20, 21, 22],
[30, 31, 32]])
>>> choice = lt + 2 * gt
>>> choice
array([[1, 1, 1],
[0, 0, 0],
[2, 2, 2],
[2, 2, 2]])
>>> choose(choice,(b,10,15))
array([[10, 10, 10],
[10, 11, 12],
[15, 15, 15],
[15, 15, 15]])
Array function methods: op.concatenate()
concatenate((a0,a1,…,aN),axis=0)
The input arrays (a0,a1,…,aN) are concatenated along the given axis.
They must have the same shape along every axis except the one given.
>>>concatenate((x,y)) >>> concatenate((x,y),1)
>>> array((x,y))
NumPy & speed
[email protected]:~> python -m timeit -s "s=0" "for i in range(1000000): s+=i"
10 loops, best of 3: 106 msec per loop
[email protected]:~> python -m timeit "sum(range(1000000))"
10 loops, best of 3: 41.6 msec per loop
[email protected]:~> python -m timeit -s "import numpy" "numpy.sum(numpy.arange(1000000))"
100 loops, best of 3: 4.16 msec per loop
Use NumPy array
functions instead of
loops , because they’ve
been optimized to work
on arrays
NumPy “slows down”
If you are only going to iterate one item at the time with NumPy over a
set of data (array) expect slowdown because Numpy adds that
additional layers of abstraction when calling undelaying C.
There is a way
around it!
DEREFERENCING
zi=complex(z[i])
qi=complex(q[i])
for iteration in range(maxiter):
zi = zi*zi + qi
Don’t use NumPy as a container!
Adding two elements from a list is much faster than adding two scalars in array.
Python lists are optimized to be containers of different objects. Using Python
functions (sum) on a NumPy array will be slower than using it on a list of floats.
[email protected]:~> python -m timeit -s "import numpy" "numpy.sum((numpy.linspace(10,100,1000)).tolist())"
10000 loops, best of 3: 196 usec per loop
[email protected]:~> python -m timeit -s "import numpy" "sum((numpy.linspace(10,100,1000)).tolist())"
10000 loops, best of 3: 55 usec per loop
[email protected]:~> python -m timeit -s "import numpy" "sum((numpy.linspace(10,100,1000)))"
1000 loops, best of 3: 441 usec per loop
Vectorization = get rid of for loops
Vector function is any function that operates on NumPy array.
import numpy as np
from time import time
# 1.store data set into Numpy array
myarray=np.linspace(1,5,1000000)
t1=time()
for i in range(len(myarray)):
myarray[i]+=1
print " time for loop\n" + str(time() - t1)
t2=time()
# 2. replace loop with equivalent Vector Functions (sum)
myarray+=1
print " time for vectorized version:" + str(time() - t2)
[email protected]:~/presentation> python vectorsimple.py
time for loop 1.765157938
time for vectorized version: 0.00224208831787
Vectorized functions
module load python vanilla
# img_0001.jpg shows two different proteins , expressed differently over time
# you can measure red and green chanels
# Where vectorization really counts is in multidimensional arrays, like images.
################ Loop ##########
time3 = time.time()
import time
import matplotlib.pyplot as plt
for i in xrange(0, image.shape[0]):
for k in xrange(0, image.shape[1]):
if image[i,k,1] < 40:
image[i,k,1] = 0
image = plt.imread(”protein.jpg")
time4 = time.time()
# We think the low green values are actually non-specific, background noise.
# Make anything in the green channel less than 40 be a zero
timeloop = (time4 - time3)
print 'time for loops: ', timeloop
time1 = time.time()
print 'vectorized is', (timeloop / timevec), 'times faster'
## Built in vectorized functions
image[ image[:,:,1] < 40 ] = 0
plt.imshow(image)
plt.show()
time2 = time.time()
timevec = (time2 - time1)
print 'time for vectorized: ', timevec
# thats the difference in your code taking a month to run, or a few
hours.
There is np.vectorize but the purpose of it is to transform functions which are not numpy-aware (e.g. take floats
as input and return floats as output) into functions that can operate on (and return) numpy arrays.
Techniques for improving Python performance
Remarks on efficiency: xrange vs range
XRANGE
Less memory
consuming: doesn't
generate the whole
list, it just generates
one element at the
time as they are
needed.
RANGE
Has to generate a
list of the numbers
in memory before
enumeration starts.
python xrangeinstead.py
:::Results of test - range v. xrange:::
Adding 1 1000000 times
Runtime using range: 0.236502885818
Runtime using xrange: 0.193695068359
To iterate over
a list you must
use range
function!
Remarks on Efficiency: List vs Dictionary
List is sequential, use
dictionary!
How much of a
difference are
these two as far as
performance goes?
[email protected]:~/presentation> python –mtimeit 'tmp=[]; tmp.append(True); x=tmp[0]'
1000000 loops, best of 3: 0.326 usec per loop
[email protected]:~/presentation> python –mtimeit 'tmp={}; tmp[0]=True; x=tmp[0]'
1000000 loops, best of 3: 0.259 usec per loop
Remarks on efficiency: if-else vs try-except
The rule of thumb: “exceptions should never happen”!
///the except block is costly///
python ifelse.py
:::Results of test - if/else v. try/except:::
Square root of 10000000 numbers
Runtime using if/else no errors:
0.748867034912
Runtime using try/except no errors: 0.665756940842
Runtime using if/else with errors: 0.545713186264
Runtime using try/except with errors: 2.7115380764
if-else is
faster than
try-except!
Remarks on efficiency: resizing NumPy array?
Avoid resizing
NumPy
arrays!!!
For arrays that need to grow / shrink, convert
them to list with the tolist method,
add/remove list elements, and convert back to
NumPy array again.
python noresize.py
:::Results of test - numpy.resize v. convert to list:::
grow array 1000000 times
Runtime using numpy.resize:
1.68510603905
Runtime using list conversion: 0.388118982315
Remarks on efficiency: avoid “.”
Avoid module prefix in
frequently called
functions!
Instead create local
aliases.
fastsin = math.sin
x = range(1000000)
for i in x:
x[i] = fastsin(x[i])
python avoidprefix.py
:::Results of test - module prefixes:::
Sin of integers 0 to 1000000
Runtime using module prefix: 0.611773014069
Runtime pre-loading function: 0.488955020905
x = range(1000000)
for i in x:
x[i] = math.sin(x[i])
Remarks on efficiency: inlining functions
def myfunc(x):
return x + 1
x=0
for i in xrange(1000000):
x = myfunc(x)
# inline version
x=0
for i in xrange(1000000):
x=x+1
python inlinefunctions.py
:::Results of test - inline vs. user-defined functions:::
Adding 1 1000000 times
Runtime using user-defined function: 0.373387098312
Runtime inline function:
0.207398891449
Function calls are in general
expensive in Python so for small
functions a performance gain can be
obtained by inlining functions
manually.
Remarks on efficiency: eval & exac
Let eval and exec
work on compiled
code!
python eoncompiled.py
:::Results of test - strings v. compiled code for exec:::
Adding 1 10000 times
Runtime using string:
0.162658929825
Runtime using compiled code: 0.00853800773621
Unless the eval and exec is called
only once with the same argument,
you should compile the argument
first and thereafter work with the
compiled version.
Remarks on efficiency: math module vs NumPy
Avoid using NumPy
functions with scalar
arguments!
Use math module for calculations on
one number.
Use NumPy for calculations on all
numbers in an array.
python numpynoscalar.py
:::Results of test - math vs. numpy for scalar arguments:::
Sin of integers 0 to 100000
Runtime using math.sin: 0.0568699836731
Runtime using numpy.sin: 0.659817934036
NOTE: These efficiency considerations are significant only when the mathematical functions are
called a large number of times!
Case study on Numerical Efficiency
Pure Python loops
Reduce/map loop
We express here the
matrix-vector product
as a sum of the scalar
products between
each row and the
vector. Uses for loop
just in one dimension.
Map/reduce
Dot product
We relay here on
NumPy matrix
multiplication operator
to perform the scalar
product.
python casestudy.py
plain loops: 7.81488e+00,
reduce/map loop: 7.77240e+00,
map/reduce: 3.36382e+00,
dot: 1.10629e-02
>>> foo = [2, 18, 9, 22, 17, 24, 8, 12]
Running
matrix-vector
product with
1000x1000
matrix
>>> print filter(lambda x: x % 3 == 0, foo)
[18, 9, 24, 12]
>>> print map(lambda x: x * 2 + 10, foo)
[14, 46, 28, 54, 44, 58, 26, 34]
>>> print reduce(lambda x, y: x + y, foo)
112
Pickling
Makes a byte
stream
representation of any
python object that can be
stored, transmitted and
unpicked.
Byte stream can be used to
send objects across
a network (MPI)
instead of storing them in a
file.
In Python, you can define functions inside other functions, and classes inside
other classes. But because of the way the "pickle”protocol works, you can only
pickle/unpickle (and in the context of mpi4py, send/receive) functions and
classes defined at the top level of a *.py file, but not any function or
class defined inside an outer one.
Pickle vs Cpickle
import pickle,time,cPickle
pick_size=2000000
print 'creating pickles %d items big' %(pick_size)
r=range(pick_size)
t=time.time()
for n in xrange(10):
d=pickle.dumps(r)
The pickle
module is rather
slow for large
data structures.
print '%d sec pickle' %(time.time()-t)
t=time.time()
for n in xrange(10):
d=cPickle.dumps(r)
print '%d sec cPickle' %(time.time()-t)
t=time.time()
for n in xrange(10):
d=cPickle.dumps(r,protocol=pickle.HIGHEST_PROTOCOL)
print '%d sec cPickle latest protocol' %(time.time()-t)
python picklevscpickle.py
creating pickles 2000000 items big
58 sec pickle
5 sec cPickle
1 sec cPickle latest protocol
Other useful scientific Python libraries
Cython
Translates
Pythonlooking
code into
C-code
that
compiles
to
reasonably
fast Ccode.
Works across all platforms (Win/Mac/Lin)
Cython is an extension-module writing language that looks a lot like
Python except for optional type declarations for variables.
Type declarations allow Cython compiler to replace generic, highly
dynamic Python code with specific and very fast compiled code that is
then able to be loaded into the Python run-time dynamically.
Even NumPy arrays can be declared with Cython
Cython “step by step”
# Load the proper environment
[email protected]:~> module load python/2.7.3-vanilla
Python version 2.7.3-vanilla loaded
# Extract python my_function to a new file .pyx
vi my_function.pyx
# "cimport" is used to import special compile-time information about the numpy module
# we have to tell cython where to find Numpy libraries
# on the python side it needs to know where Numpy is
cimport numpy as np
import numpy as np
# annotate cython types
cdef unsigned int i,j
…
# In Python file (mandelbrot_final.py) reference your function which you are going to compile in C
import my_function
Cython “step by step”…Cython Types
# Help Cython by adding annotations (they tell what the objects are)
list q z
int
unsigned int
complex
complex double
# no negative indices with for loop
# In my_function.pyx add those types
cdef unsigned int i
cdef int iteration
cdef complex zi, qi
cdef list output
Cython setup.py
# create setup.py file in order to produce a compiled-module
vi setup.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
import numpy
ext = Extension(”my_function", [”my_function.pyx"],
include_dirs = [numpy.get_include()])
setup(ext_modules=[ext],
cmdclass = {'build_ext': build_ext})
# to compile the extension module in this in this directory
# you will get my_function.so
python setup.py build_ext –-inplace
Matplotlib
Python 2D
plotting
library
which
produces
publication
quality
figures
Line plots
Scatter plots
Polynomial Fitting (np.polyfit)
Random Sampling
3D surface plots
Histograms
Bar charts
Scatterplots
…
Matplotlib on Beagle
It is not recommended to execute Matolotlib commands on compute nodes.
Instead a good approach would be to pickle your result data and make a
separate script which with Matplotlib commands and execute that script on
login nodes.
Generating and plotting a
Mandelbrot Set using
MPI4PY and Matplotlib.
[email protected]:/lustre/beagle/ams/mandel> qsub mandel.pbs
[email protected]:/lustre/beagle/ams/mandel> vi mandel_final.py
….
import pickle
outfile = open('/lustre/beagle/ams/mandel/mandelbrot_data.dat', 'wb')
pickle.dump(C, outfile )
…
[email protected]:/lustre/beagle/ams/mandel> python mandelplot.py
SciPy
Collection of
algorithms
and
mathematical
tools. Built on
the top of
NumPy gives
you structure
for handling
your arrays of
data.
Special functions (scipy.special)
Integration (scipy.integrate)
Optimization (scipy.optimize)
Interpolation (scipy.interpolate)
Fourier Transorms (scipy.fftpack)
Signal Processing (scipy.signal)
Linear Algebra (scipy.linalg)
Sparse Eigenvalue Problems with ARPACK
Statistics (scipy.stats)
Weave (scipy.weave)…..
Special Functions
>>> from scipy import special
>>> special?
Airy Functions
Elliptic Functions and Integrals
Bessel Functions
Zeros of Bessel Functions
Integrals of Bessel Functions
Derivatives of Bessel Functions
Spherical Bessel Functions
Ricatti-Bessel Functions
Struve Functions
Raw Statistical Functions (Friendly versions in scipy.stats)
Gamma and Related Functions
Error Function and Fresnel Integrals
Legendre Functions
Orthogonal polynomials --- 15 types
HyperGeometric Functions
All of your favorite
special functions
for any occasion!
Building N-dimensional arrays
>>> import numpy as np
>>> import scipy as sp
NumPy
methods
SciPy
methods
mgrid[]
arange
zeros
ones
linespace
• Get equally spaced points in N output
arrays for an N-dimensional (mesh) grid.
ogrid[]
• Construct an "open" grid of
points (not filled in but correctly
shaped for math operations to
be broadcast correctly).
>>> sp.mgrid[0:5]
array([0, 1, 2, 3, 4])
>>> sp.mgrid[0:1:5j]
array([ 0., 0.25, 0.5, 0.75, 1.])
>>> sp.mgrid[0:3,0:3]
array([[[0, 0, 0],
[1, 1, 1],
[2, 2, 2]],
[[0, 1, 2],
[0, 1, 2],
[0, 1, 2]]])
>>> np.mgrid[0:3,0:3].shape
(2, 3, 3)
>>> sp.ogrid[:3,:3]
[array([[0],
[1],
[2]]), array([[0, 1, 2]])]
Weave: Inline C Code
wtest.py
#! /usr/bin/env python
import numpy as np
import scipy
from scipy import weave
x = np.arange(1000)/1000.
y = np.empty(1000,dtype=np.double)
n=len(x)
code = """
int i;
for (i=0;i<n;i++) {
if (*x<0.5) *y = exp(-(*x)*2);
else *y = exp(-(*x));
x++; y++;
} """
weave.inline(code,['n','x','y'])
print x, y
You can run C code
In Python through the
weave module!
It understands NumPy arrays.
weave.inline() :
Includes C/C++ code directly in
Python code on_the_fly execution
Inline C Code
#!/bin/bash
#PBS -N myjob
#PBS -l walltime=00:30:00
#PBS -l mppwidth=24
#PBS -j oe
. /opt/modules/3.2.6.7/init/bash
module swap PrgEnv-cray PrgEnv-gnu
module load python/2.7.3
From
/lustre/beagle/’whoami’
do
qsub mypbs.pbs
export HOME=/lustre/beagle/$LOGNAME
export PATH=/soft/python/2.7/2.7.3/python/bin/:$PATH
export LD_LIBRARY_PATH=/soft/python/2.7/2.7.3/python/lib/:$LD_LIBRARY_PATH
export PYTHONPATH=${PYTHONPATH}:/soft/python/2.7/2.7.3/modules/numpy/1.7.0/lib/python2.7/site-packages
export PYTHONPATH=${PYTHONPATH}:/soft/python/2.7/2.7.3/modules/scipy/0.12.0/lib/python2.7/site-packages
# Location of the libraries for python 2.7
export LDFLAGS="-L/soft/python/2.7/2.7.3/lib”
## This is necessary when using weave because otherwise python will attempt to work on /home
export PYTHONCOMPILED="/lustre/beagle/ams/mypythonlibs"
cd $PBS_O_WORKDIR
aprun -n 1 -d 24 python wtest.py
Weave Speed Comparison
First set the proper
environment.
Remember on Beagle Cary
Env. is default!
vi setup.sh
export PYTHONPATH=${PYTHONPATH}:/soft/python/2.7/2.7.3/modules/numpy/1.7.0/lib/python2.7/site-packages
export PYTHONPATH=${PYTHONPATH}:/soft/python/2.7/2.7.3/modules/scipy/0.12.0/lib/python2.7/site-packages
module swap PrgEnv-cray PrgEnv-gnu
module swap gcc/4.8.1 gcc/4.7.2
module load python/2.7.3
[email protected]:~> . setup.sh
Python version 2.7.3 loaded
Weave Speed Comparison
vi weavetimeit.py
#! /usr/bin/env python
import timeit
import numpy as np
import scipy
from scipy import weave
def foo():
x = np.arange(100)/100.
y = np.empty(100,dtype=np.double)
n=len(x)
code = """
int i;
for (i=0;i<n;i++) {
if (*x<0.5) *y= exp(-(*x)*2);
else *y= exp(-(*x));
x++; y++;
} """
>>> import timeit
>>> setup=""”
... import numpy as np
...
... def my_function(x,y):
... lx=len(x)
... for k in xrange(lx):
...
if (x[k]<0.5):
...
y[k]=np.exp(-x[k]*2)
...
else:
...
y[k]=np.exp(-x[k])
... ""”
>>>
t=timeit.timeit('my_function(np.arange(100)/100.,np.empty(100,dty
pe=np.double))',setup=setup)
>>> print 'Pure Python loops', t
Pure Python loops 531.699547052
WOW
weave.inline(code,['n','x','y'])
if __name__=='__main__':
from timeit import Timer
t = Timer("foo()", "from __main__ import foo")
print t.timeit()
python weavetimeit.py
18.4321630001
Python modules for parallel computations
Parallel Programming
You can
parallelize your
code splitting
Function
Data set
Your parallel code is going to run as
fast as the slowest process
Communication is the slowest part of
parallel program
Communication between processes on
different nodes is the most expensive
Distributed Memory Model
High Speed
Network
Processor/Memory
Processor/Memory
Processor/Memory
Python & Global Interpreter Lock (GIL)
Python does
contain
threads by
itself but does
not use
multiple
processors
due to the
GIL!
GIL ensures that only one thread runs at the time.
One thread is always main: the one that started the program.
GIL is realized on I/O.
When thread is running is holds on GIL.
When we run two threads on different cores they compete for the GIL, so
running on two cores is almost twice as slow than on one core due to huge
amount of signal handling.
GIL is not enforced if you use C extension modules
in Python because you are side stepping
interpreter.
Python & Multithreading
When to use
multithreading?
Threads are only good for I/O bound tasks
You have a thread which needs to talk with server
While waiting for server to respond the other thread can do
stuff
qsub threadedfinal.pbs
('Lapse for thread with threads(', 2, ': ', 9.686963081359863)
('Lapse with multithreading (', 10, ': ', 9.939857006072998)
qsub sequentialfinal.pbs
('Lapse for sequential : ', 8.240514993667603)
Performance
is worse than
in sequential
case!
Python & Multiprocessing
When to use
multiprocessing?
For non numerical calculations.
Two processes can run at the exact same time, they don’t need to interrupt each
other. No interprocess communication. Therad safe! Memory safe!
We know how to divide up equally your job among the processes, and how
to coordinate them and bring them all back together again.
Each process do its job and puts in into a queue.
It can spawn multiple processes but they will still be bound within a single node.
Downside: memory overhead
qsub multiprocfinal.pbs
('Lapse for multiprocessing (', 10, ': ', 1.2027549743652344)
('Lapse for multiprocessing (', 2, ': ', 4.594444990158081)
MPI: Message Passing Interface
When to use
MPI?
For numerical calculations.
For communication among processes, which have separate address
spaces
Once you want to scale to a cluster or supercomputer
Most useful on Distributed Memory Machines
Every process has a copy of the same code:
aprun -n 5 python hello.py #starts all 5 instances of that code
You don't move executables around you are just passing messages-pure data
export PYTHONPATH=$PYTHONPATH:/soft/python/2.7/2.7.3/modules/mpi4py/1.3/lib/python2.7/site-packagesc
MPI4PY quick review
Allows a user to develop, wrap or interface with MPI interface from Python.
Uses pickling to serialize objects (any Python object) and pass them between
different processors or broadcasting arrays directly.
from mpi4py import MPI #i mport MPI module
comm = MPI.COMM_WORLD # to initialize MPI, refers to all processes (numbers them from 0 to n-1)
A communicator is a logical unit that defines which processes are
allowed to send and receive messages
size=comm.Get_size()
# Returns the number of processes in the communicator.
rank = comm.Get_rank() # Gives a number of a process. Each process will have different rank
MPI4PY quick review
Optimized to pass
any Python object
but you have to
specify data type.
MPI4Py
supports two
methods of
communication:
Python’s pickling method only
for general Python objects (list,
tuples…) because it uses auto
detection of data type. Your
data is serialized in buffer, or
you get a bucket of bytes with
a size.
randNum = numpy.zeros(1)
e.g. Bcast(...)
e.g. bcast(...)
MPI4Py natively
understands NumPy
arrays and performance
will be faster if you use
them, but you have to
initialize arrays!
SPMD
Single
Program
Multiple
Data
There is only one file of code
That the code is written to assign certain
lines to certain processors
Most commonly used
MPI standard talks about
moving around buffers;
raw blocks of data.
We use NumPy module to
get access to those
arrays, we move around
numpy arrays!
It is usually bad practice to perform I/O (e.g., call print) from any process
besides the root process, though it can oftentimes be a useful tool for
debugging!!!
MPI Communication: Point-Point
On a Recv you do not always need to specify the source. You can allow the calling process to accept
a message from any process that happen to be sending to the receiving process.
from mpi4py.MPI import ANY_SOURCE
or
source=MPI.ANY_SOURCE
Send/Recv:
are blocking functions,
means wait until receive the
message
Involves
sender & receiver
Point-Point
Only two processes
participate
Isend/Irecv:
are non blocking functions,
will do some work until
message is received
MPI Communication: Collective
All processes in a
communicator participate
Collective
communication
Functions:
barrier, reduce, gather, scatter…
Principles to remember:
Load Balancing:
Example on
20 nodes
• A program is inefficient if it is not using all of the available
resources (e.g., processes are idling because they are waiting
on another)
Communication is Expensive:
• Broadcast and Reduce are designed to optimize
communication among the whole communicator. However, any
sort of message passing is extremely expensive and one of
the main obstacles to obtaining speedups when parallelizing
an algorithm.
qsub loadmpi.pbs
qsub non_loadbalanced_mpi.pbs
0.467461109161
0.46592092514
0.468319892883
0.491802930832
0.471796989441
….
0.539043903351
Efficiency comparison
[email protected]:/lustre/beagle/ams/threading/matprint> export PYTHONPATH=${PYTHONPATH}:/soft/python/2.7/2.7.3/modules/numpy/1.7.0/lib/python2.7/sitepackages
[email protected]:/lustre/beagle/ams/threading/matprint> export PYTHONPATH=${PYTHONPATH}:/soft/python/2.7/2.7.3/modules/scipy/0.12.0/lib/python2.7/sitepackages
[email protected]:/lustre/beagle/ams/threading/matprint> export PYTHONPATH=$PYTHONPATH:/soft/python/2.7/2.7.3/modules/matplotlib/1.2.1/lib/python2.7/sitepackages
[email protected]:/lustre/beagle/ams/threading/matprint> export PYTHONPATH=$PYTHONPATH:/soft/python/2.7/2.7.3/modules/mpi4py/1.3/lib/python2.7/sitepackagesc
[email protected]:/lustre/beagle/ams/threading/matprint> module load python/2.7.3
Switching to GNU compiler environment
Python version 2.7.3 loaded
[email protected]:/lustre/beagle/ams/threading/matprint> python matplotlib_efficiancy.py
Efficiency comparison: calculate prime numbers
Efficiency comparison: Mandelbrot Set
qsub mandel_mpi_numpy.pbs
# using 10 nodes:
all tasks, mean: 0.27 seconds
all tasks, min:
0.23 seconds
all tasks, max:
0.36 seconds
all tasks, sum:
2.72 seconds
qsub mandel_mpi_numpy_cython.pbs
# using 10 nodes:
all tasks, mean: 0.08 seconds
all tasks, min:
0.04 seconds
all tasks, max:
0.23 seconds
all tasks, sum:
0.84 seconds
Conclusion
• When it comes to doing a lot of heavy number crunching, Pure Python is not
really an option.
• NumPy can get you most of the way to compiled speeds through
vectorization (use of NumPy functions instead of for loops).
• In situations where you still need the last ounce of speed in a critical section
and you already know C/C++, then Weave is a simple and speedy solution.
• If, however, you are not already familiar with C then you may find Cython
to be exactly what you are looking for to get the speed you need out of
Python.
• Multiprocessing
• MPI4Py
Further reading:
Matplotlib example gallery:
http://matplotlib.org/gallery.html
Great tutorial on mpi4py:
http://www.bu.edu/pasi/files/2011/01/Lisandro-Dalcin-mpi4py.pdf
NumPy Example List with Doc:
http://www.biochem-caflisch.uzh.ch/zhou/Numpy_Example_List_With_Doc.html
NumPy / SciPy documentation
http://docs.scipy.org/doc/
Questions
Please write to:
Ana Marija Sokovic (creator of this presentation)
[email protected]
or
Beagle Support
[email protected]
All scripts mentioned in this presentation can be found at:
http://wiki.ci.uchicago.edu/Beagle/PYTHON

similar documents