### MATLAB

```Outline
Speeding up Matlab Computations
 Symmetric Multi-Processing with Matlab
 Accelerating Matlab computations with GPUs
 Running Matlab in distributed memory environments
 Using the Parallel Computing Toolbox
 Using the Matlab Distributed Computing Server
 Using pMatlab
Mixing Matlab and Fortran/C code
 Compiling MEX code from C/Fortran
 Turning Matlab routines into C code
Symmetric Multi-Processing
 By default Matlab uses all cores on a given node for operations that can be threaded
Examples of threaded operations on arrays and matrices:
• Basic information: ISFINITE, ISINF, ISNAN, MAX, MIN
• Operators: +, -, .*, ./, .\, .^, *, ^, \ (MLDIVIDE), / (MRDIVIDE)
• Array op...
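 As a small sketch, you can query and cap the thread count from within
Matlab using maxNumCompThreads (a standard function, though deprecated in
newer releases):
n_threads = maxNumCompThreads      % query the current number of computational threads
maxNumCompThreads(1);              % restrict Matlab to a single thread
 This is useful when benchmarking, or when sharing a node with other jobs.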
Symmetric Multi-Processing
 To be sure you only use the resources you request, you should either
request an entire node and all of its CPUs...
salloc -t 60 -c 24
srun matlab
 Or request a single CPU and specify that Matlab should only use a single
computational thread.
salloc -t 60
srun matlab -singleCompThread
Using GPUs with Matlab
 Matlab can use GPUs to do calculations, provided a GPU is available on
the node Matlab is running on.
salloc -t 60 --gres=gpu:1
matlab
 We can query the connected GPUs from within Matlab using
gpuDeviceCount()
gpuDevice()
 And obtain a list of GPU supported functions using
methods('gpuArray')
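 Since methods('gpuArray') returns a cell array of function names, one way
to check for a particular function is (a small sketch):
ismember('fft2', methods('gpuArray'))   % true if fft2 is GPU-enabled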
Using GPUs with Matlab
 So there is a 2D FFT, but no hilb() function on the GPU...
H=hilb(1000);
H_=gpuArray(H);    % distribute data to the GPU
Z_=fft2(H_);       % FFT performed on the GPU
Z=gather(Z_);      % gather data from the GPU back onto the CPU
imagesc(log(abs(Z)));
 We could do the log and abs functions on the GPU as well.
H=hilb(1000);
H_=gpuArray(H);
Z_=fft2(H_);
imagesc(gather(log(abs(Z_))));
Using GPUs with Matlab
 For our example, doing the FFT on the GPU is about 7x faster (4x if you
include moving the data to the GPU and back).
>> H=hilb(5000);
>> tic; A=gather(gpuArray(H)); toc
Elapsed time is 0.161166 seconds.
>> tic; A=gather(fft2(gpuArray(H))); toc
Elapsed time is 0.348159 seconds.
>> tic; A=fft2(H); toc
Elapsed time is 1.210464 seconds.
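 tic/toc can under-measure GPU work because kernels run asynchronously; in
newer Matlab versions gputimeit (and timeit on the CPU) synchronizes the
device and averages several runs. A sketch:
H=hilb(5000);
H_=gpuArray(H);
t_gpu  = gputimeit(@() fft2(H_));                    % GPU compute only
t_xfer = gputimeit(@() gather(fft2(gpuArray(H))));   % including transfers
t_cpu  = timeit(@() fft2(H));                        % CPU baseline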
Using GPUs with Matlab
 Matlab has no built-in hilb() function that can run on the GPU, but we
can write our own kernel in CUDA and save it to hilbert.cu
__global__ void HilbertKernel( double * const out, size_t const numRows, size_t const numCols)
{
const int rowIdx = blockIdx.x * blockDim.x + threadIdx.x;
const int colIdx = blockIdx.y * blockDim.y + threadIdx.y;
if ( rowIdx >= numRows ) return;
if ( colIdx >= numCols ) return;
size_t linearIdx = rowIdx + colIdx*numRows;
out[linearIdx] = 1.0 / (double)(1+rowIdx+colIdx) ;
}
 And compile it with nvcc to generate a Parallel Thread eXecution file
nvcc -ptx hilbert.cu
Using GPUs with Matlab
 We have to initialize the kernel and specify the grid size before executing
the kernel.
H_=gpuArray.zeros(1000);
hilbert_kernel=parallel.gpu.CUDAKernel('hilbert.ptx','hilbert.cu');
hilbert_kernel.GridSize=size(H_);
H_=feval(hilbert_kernel, H_, 1000,1000);
Z_=fft2(H_);
imagesc(gather(log(abs(Z_))));
 The default for Matlab kernels is 1 thread per block, but we could create
fewer blocks that are each 10 x 10 threads.
hilbert_kernel.ThreadBlockSize=[10,10];
hilbert_kernel.GridSize=[100,100];
Parallel Computing Toolbox
 As an alternative you can also use the Parallel Computing Toolbox, which
supports parallelism via MPI.
https://info.circ.rochester.edu/BlueHive/Software/Mathematics/matlab.html
 You should create a pool that is the same size as the number of processors
you requested in your job submission. MathWorks also sells licenses for the
Matlab Distributed Computing Server, which allows pools that span more than
just the local node.
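 One way to size the pool to match the allocation (assuming a Slurm job
where SLURM_CPUS_PER_TASK is set, e.g. by salloc -c 8):
ncpus = str2double(getenv('SLURM_CPUS_PER_TASK'));
matlabpool(ncpus);
% ... parallel work ...
matlabpool close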
Parallel Computing Toolbox
• You can achieve parallelism in several ways:
• parfor loops - execute for loops in parallel
• spmd - execute instructions in parallel (using 'labindex' or 'drange')
• pmode - an interactive version of spmd
• distributed arrays - very similar to gpuArrays
Parallel Computing Toolbox
 Example: a parfor loop
matlabpool(4)
parfor n=1:100
H=hilb(n);
Z=fft2(H);
f=figure('Visible','off'); imagesc(log(abs(Z)));
print('-dpdf','-r300', sprintf('%s%03d%s','fig1-batch_',n,'.pdf'));
end
matlabpool close
Parallel Computing Toolbox
 Example: spmd, using drange or an explicit labindex loop
matlabpool(4)
spmd
for n=drange(1:100)
H=hilb(n);
Z=fft2(H);
f=figure('Visible','off');
imagesc(log(abs(Z)));
end
end
matlabpool close
matlabpool(4)
spmd
for n=labindex:numlabs:100
H=hilb(n);
Z=fft2(H);
f=figure('Visible','off');
imagesc(log(abs(Z)));
end
end
matlabpool close
Parallel Computing Toolbox
 Example: pmode, the interactive version of spmd
pmode start 4
n=labindex;
H=hilb(n);
Z=fft2(H);
f=figure('Visible','off'); imagesc(log(abs(Z)));
print('-dpdf','-r300', sprintf('%s%03d%s','fig1-batch_',n,'.pdf'));
pmode lab2client H 3 H3
H3
pmode close
Parallel Computing Toolbox
 Example using gpuArray:
H=hilb(1000);
H_=gpuArray(H);
Z_=fft2(H_);
Z=gather(Z_);
imagesc(log(abs(Z)));
 Example using distributed arrays:
matlabpool(8)
H=hilb(1000);
H_=distributed(H);
Z_=fft(fft(H_,[],1),[],2);
Z=gather(Z_);
imagesc(log(abs(Z)));
matlabpool close
Parallel Computing Toolbox
 What about building the Hilbert matrix in parallel?
matlabpool(4)
spmd
  codist=codistributor1d(1,[250,250,250,250],[1000,1000]);  % define the partition
  [i_lo, i_hi]=codist.globalIndices(1);      % get local indices in the x-direction
  H_local=zeros(250,1000);                   % allocate space for the local part
  for i=i_lo:i_hi
    for j=1:1000
      H_local(i-i_lo+1,j)=1/(i+j-1);         % initialize local array with Hilbert values
    end
  end
  H_ = codistributed.build(H_local, codist); % assemble the codistributed array
end
Z_=fft(fft(H_,[],1),[],2);                   % now it's distributed like before!
Z=gather(Z_);
imagesc(log(abs(Z)));
matlabpool close
Using pMatlab
 pMatlab is an alternative method to get distributed matlab functionality
without relying on Matlab’s Distributed Computing Server.
 It is built on top of MatlabMPI (an MPI implementation for Matlab,
written in Matlab, that uses file I/O for communication)
 It supports various operations on distributed arrays (up to 4D):
• Remapping, aggregating, finding non-zero entries, transposing, ghosting
• Elementary math functions (trig, exponential, complex, remainder/rounding)
• 2D convolutions, FFTs, discrete cosine transforms (FFTs need to be
properly mapped - they cannot be distributed along the transform dimension)
 It does not have as much functionality as the Parallel Computing Toolbox,
but it does support ghosting and more flexible partitioning!
Using pMatlab
 Since pMatlab works by launching other Matlab instances, we need them to
start up with pMatlab functionality. To do so we add a few lines to the
startup.m file in our Matlab path.
rehash;
pMatlabGlobalsInit;
Running pMatlab in Batch Mode
 To submit a job in batch mode we need to create a batch script
#!/bin/bash
#SBATCH -J Matlab
#SBATCH -p standard
#SBATCH -t 60
#SBATCH -N 2
matlab -nodisplay -r "pmatlab_launcher"
sample_script.pbs
 And a Matlab script to launch the pMatlab script
[sreturn, machines]=system('nodelist');
machines=regexp(machines, '\n', 'split');
machines=machines(1:size(machines,2)-1);
nProcs=size(machines,2);  % assumption: one Matlab process per entry in the node list
eval(pRUN('pmatlab_script',nProcs,machines));
pmatlab_launcher.m
Running pMatlab in Batch Mode
 And finally we have our pMatlab script.
Xmap=map([Np 1],{},0:Np-1);                  % map for distributing the array (Np is set by pRUN)
H_=zeros(1000,1000,Xmap);                    % distributed matrix constructor
[I1,I2]=global_block_range(H_);              % indices for the local portion of the array
H_local=zeros(I1(2)-I1(1)+1,I2(2)-I2(1)+1);  % allocate the local portion
for i=I1(1):I1(2)
  for j=I2(1):I2(2)
    H_local(i-I1(1)+1,j-I2(1)+1)=1/(i+j-1);  % populate with Hilbert matrix values
  end
end
H_=put_local(H_,H_local);                    % copy values into the distributed array
Z_=fft(fft(H_,[],2),[],1);                   % do the y-fft, then the x-fft (Z_ has a different map)
Z=agg(Z_);                                   % collect the resulting matrix
f=figure('Visible','off');
imagesc(log(abs(Z)));
print('-dpdf','-r300', 'fig1.pdf');
pmatlab_script.m
% Under the hood, the distributed FFT amounts to something like:
%   X = put_local(X, fft(local(X),[],2));
%   X = transpose_grid(X);
%   Z = put_local(Z, fft(local(Z),[],1));
Compiling Mex Code
 C, C++, or Fortran routines can be called from within Matlab.
#include "fintrf.h"
subroutine mexfunction(nlhs, plhs, nrhs, prhs)
  mwPointer :: plhs(*), prhs(*)
  integer :: nlhs, nrhs
  mwPointer :: mxGetPr
  mwPointer :: mxCreateDoubleMatrix
  real(8) :: mxGetScalar
  mwPointer :: pr_out
  integer :: n
  n = nint(mxGetScalar(prhs(1)))
  plhs(1) = mxCreateDoubleMatrix(n, n, 0)
  pr_out = mxGetPr(plhs(1))
  call compute(%VAL(pr_out), n)
end subroutine mexfunction

subroutine compute(h, n)
  integer :: n
  real(8) :: h(n,n)
  integer :: i,j
  do i=1,n
    do j=1,n
      h(i,j)=1d0/(i+j-1d0)
    end do
  end do
end subroutine compute
mex hilbert.F90
>> H=hilbert(10)
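 It is worth sanity-checking the compiled mex routine against Matlab's
built-in hilb():
>> max(max(abs(hilbert(10)-hilb(10))))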
Turning Matlab code into C
 First we create a log_abs_fft_hilb.m function
function result = log_abs_fft_hilb(n)
result=log(abs(fft2(hilb(n))));
 And then we run
>> codegen log_abs_fft_hilb.m -args {uint32(0)}
 This will produce a mex file that we can test.
>> A=log_abs_fft_hilb_mex(uint32(16));
>> B=log_abs_fft_hilb(16);
>> max(max(abs(A-B)))
ans = 8.8818e-16
 We could have specified the type of 'n' in our matlab function
function result = log_abs_fft_hilb(n)
assert(isa(n,'uint32'));
result=log(abs(fft2(hilb(n))));
Turning Matlab code into C
 Now we can also export a static library that we can link to:
>> codegen log_abs_fft_hilb.m -config coder.config('lib') -args {uint32(0)}
 This will create a subdirectory codegen/lib/log_abs_fft_hilb containing the
source files ('.c' and '.h'), compiled object files ('.o'), and a static
library 'log_abs_fft_hilb.a'.
 The source files are portable to any platform with a C compiler (e.g.
BlueStreak). We can rebuild the library on BlueStreak by running
mpixlc -c *.c
ar rcs log_abs_fft_hilb.a *.o
Turning Matlab code into C
 To use the function, we still need to write a main routine that links to it.
This requires working with Matlab's variable types (which include dynamically
resizable arrays).
#include "stdio.h"
#include "rtwtypes.h"                /* Matlab type definitions */
#include "log_abs_fft_hilb_types.h"

void main() {
  uint32_T n=64;                     /* argument to the Matlab function */
  emxArray_real_T *result;           /* return value of the Matlab function */
  int32_T i,j;
  emxInit_real_T(&result, 2);        /* initialize Matlab array to have rank 2 */
  log_abs_fft_hilb(n, result);       /* call the Matlab function */
  for(i=0;i<result->size[0];i++) {   /* output result in column-major order */
    for(j=0;j<result->size[1];j++) {
      printf("%f ",result->data[i+result->size[0]*j]);
    }
    printf("\n");
  }
  emxFree_real_T(&result);           /* free memory associated with the return array */
}
Turning Matlab code into C
 And here is another example of calling 2D fft's on real data
void main() {
  int32_T q0;
  int32_T i,j;
  int32_T n=8;
  emxArray_creal_T *result;
  emxArray_real_T *input;
  emxInit_creal_T(&result, 2);
  emxInit_real_T(&input, 2);
  q0 = input->size[0] * input->size[1];
  input->size[0]=n;
  input->size[1]=n;
  emxEnsureCapacity((emxArray__common *)input, q0, (int32_T)sizeof(real_T));
  for(j=0;j<input->size[1];j++) {
    for(i=0;i<input->size[0];i++) {
      input->data[i+input->size[0]*j]=1.0 / (real_T)(i+j+1);
    }
  }
  my_fft(input, result);
  for(i=0;i<result->size[0];i++) {
    for(j=0;j<result->size[1];j++) {
      printf("[% 10.4f,% 10.4f] ",
             result->data[i+result->size[0]*j].re,
             result->data[i+result->size[0]*j].im);
    }
    printf("\n");
  }
  emxFree_creal_T(&result);
  emxFree_real_T(&input);
}
Turning Matlab code into C
 Exported FFTs only work on vectors whose length is a power of two (2^N)
 Error checking is disabled in exported C code
 Mex code will have the same functionality as the exported C code, but will
also have error checking. It will warn about doing FFTs on arbitrary-length
vectors, etc...
 Always test your mex code!
Matlab code is not that different from C code
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <complex.h>
#include <fftw3.h>

int main() {
  int n=4096;
  int i,j;
  /* n x n complex arrays are far too large for the stack; use the heap */
  double complex *input = fftw_malloc(sizeof(double complex)*n*n);
  double complex *temp  = fftw_malloc(sizeof(double complex)*n*n);
  double *result        = malloc(sizeof(double)*n*n);
  fftw_plan p;
  p=fftw_plan_dft_2d(n, n, input, temp, FFTW_FORWARD, FFTW_ESTIMATE);
  for (i=0;i<n;i++){
    for(j=0;j<n;j++) {
      input[i*n+j]=(double complex)(1.0/(double)(i+j+1));
    }
  }
  fftw_execute(p);
  for (i=0;i<n;i++){
    for(j=0;j<n;j++) {
      result[i*n+j]=log(cabs(temp[i*n+j]));
    }
  }
  for (i=0;i<n;i++){
    for(j=0;j<n;j++) {
      printf("%f ",result[i*n+j]);
    }
  }
  fftw_destroy_plan(p);
  fftw_free(input); fftw_free(temp); free(result);
  return 0;
}
 Or you can write your own C code that uses open-source mathematical
libraries (e.g., FFTW).
```