Co-clustering using CUDA

Co-Clustering Explained

Problem:
Large binary matrix of samples (rows) and features (columns)
 - What samples should be grouped together? Why?
 - What are shared features?


Co-clustering provides you the “why” explicitly

[Figure: correlated sample/feature pair. Row cluster: s1 and s3 are in a group. Column cluster: distinguishing features are 2, 3, and 5.]
Co-Clustering - Details

Using Information Theoretic Co-clustering, as parallelized for Hadoop architecture in:
DisCo: Distributed Co-clustering with Map-Reduce: A Case-Study towards Petabyte-Scale End-to-End Mining, Papadimitriou et al., IEEE International Conference on Data Mining (ICDM) 2008

Partition entire matrix into row groups, col groups
Minimize length of encoding of resulting partitioned matrix
Competing code length factors: number of row groups & col groups, homogeneity of clusters
Iterate over rows, rearrange and sub-partition to find better encoding using heuristic
Repeat for columns, then rows again, until local optimum is found
Complexity: O(n * fp * (row_groups + col_groups)^2 * iters)
Credit: Chakrabarti et al., KDD 2004
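
To make "length of encoding" concrete, here is a minimal C++ sketch of a per-block code cost. It assumes the cross-association style cost, where a block with n cells and k ones costs n * H(k/n) bits (H is the binary entropy). The function names are hypothetical, and the bits needed to describe the group structure itself are left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Binary Shannon entropy in bits; H(0) = H(1) = 0 by convention.
double binary_entropy(double p) {
    if (p <= 0.0 || p >= 1.0) return 0.0;
    return -p * std::log2(p) - (1.0 - p) * std::log2(1.0 - p);
}

// Assumed cost of one (row group, column group) block: cells * H(ones / cells).
double block_cost_bits(long long ones, long long cells) {
    assert(cells >= 0 && ones >= 0 && ones <= cells);
    if (cells == 0) return 0.0;
    return static_cast<double>(cells) *
           binary_entropy(static_cast<double>(ones) / static_cast<double>(cells));
}

// Data cost of a whole partition: the sum of the block costs.
// ones[i] and cells[i] describe the i-th block.
double total_cost_bits(const std::vector<long long>& ones,
                       const std::vector<long long>& cells) {
    assert(ones.size() == cells.size());
    double total = 0.0;
    for (std::size_t i = 0; i < ones.size(); ++i) {
        total += block_cost_bits(ones[i], cells[i]);
    }
    return total;
}
```

Homogeneous blocks (all 0's or all 1's) cost nothing under this measure, while a half-full block costs one bit per cell. That is the tension with the number of groups listed above.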
Implementation - Basics



Initial matrix generation : CPU
Initial random row/column group assignment: CPU
Memory structures very simple, arrays of ints
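
A minimal host-side sketch of the setup, assuming a row-major flat int array for the matrix and one int per row (or column) for its group index. The function names, the `density` parameter and the seeded generator are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Random 0/1 matrix stored row-major in a flat int array (rows * cols entries).
// `density` is the assumed probability that a cell is 1.
std::vector<int> random_binary_matrix(int rows, int cols, double density, unsigned seed) {
    assert(rows > 0 && cols > 0 && density >= 0.0 && density <= 1.0);
    std::mt19937 rng(seed);
    std::bernoulli_distribution bit(density);
    std::vector<int> matrix(static_cast<std::size_t>(rows) * cols);
    for (int& cell : matrix) cell = bit(rng) ? 1 : 0;
    return matrix;
}

// group[i] is the group index of row (or column) i, drawn uniformly
// from [0, num_groups).
std::vector<int> random_group_assignment(int count, int num_groups, unsigned seed) {
    assert(count > 0 && num_groups > 0);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> pick(0, num_groups - 1);
    std::vector<int> group(count);
    for (int& g : group) g = pick(rng);
    return group;
}
```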
Implementation – Stats step 1

Statistics calculations:
 - Calculates a statistic for each row within each column group
 - The statistic is the number of 1's the row has in that column group
 - Straightforward parallelization (each thread works on one row at a time), using global memory
[Figure: example matrix with its column groups and row groups labeled; Stat(Row 3, ColumnGroup 3) = 1]
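
A sequential C++ sketch of this step, assuming a row-major 0/1 matrix and a `col_group` array that gives each column's group index (both names are hypothetical). The outer loop body is what a single GPU thread would do for its row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// stats[r * num_col_groups + g] = number of 1's that row r has in column group g.
std::vector<int> row_stats(const std::vector<int>& matrix, int rows, int cols,
                           const std::vector<int>& col_group, int num_col_groups) {
    assert(matrix.size() == static_cast<std::size_t>(rows) * cols);
    assert(col_group.size() == static_cast<std::size_t>(cols));
    std::vector<int> stats(static_cast<std::size_t>(rows) * num_col_groups, 0);
    for (int r = 0; r < rows; ++r) {      // GPU version: one thread per row
        for (int c = 0; c < cols; ++c) {
            stats[r * num_col_groups + col_group[c]] += matrix[r * cols + c];
        }
    }
    return stats;
}
```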
Room For Improvement

Calculate row statistics according to the histogram algorithm from the textbook
 - Block columns
 - Assign one thread block to each block
 - Compute shared memory histograms within block
 - Merge back to global memory when finished
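
The blocked histogram idea can be emulated sequentially. This is only a sketch: each column tile stands in for one thread block, the `local` vector stands in for a shared memory histogram, and the merge loop stands in for atomic adds to global memory. `tile_width` and the function name are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Per-row, per-column-group counts of 1's, accumulated tile by tile.
std::vector<int> row_stats_tiled(const std::vector<int>& matrix, int rows, int cols,
                                 const std::vector<int>& col_group, int num_col_groups,
                                 int tile_width) {
    assert(tile_width > 0);
    assert(matrix.size() == static_cast<std::size_t>(rows) * cols);
    assert(col_group.size() == static_cast<std::size_t>(cols));
    std::vector<int> global_stats(static_cast<std::size_t>(rows) * num_col_groups, 0);
    for (int r = 0; r < rows; ++r) {
        for (int tile_start = 0; tile_start < cols; tile_start += tile_width) {
            const int tile_end = std::min(cols, tile_start + tile_width);
            // Private histogram for this tile (shared memory on the GPU).
            std::vector<int> local(num_col_groups, 0);
            for (int c = tile_start; c < tile_end; ++c) {
                local[col_group[c]] += matrix[r * cols + c];
            }
            // Merge the private histogram into the global one
            // (an atomic add per bin on the GPU).
            for (int g = 0; g < num_col_groups; ++g) {
                global_stats[r * num_col_groups + g] += local[g];
            }
        }
    }
    return global_stats;
}
```

The result does not depend on `tile_width`; only the number of merges into the global array changes.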
Implementation – Stats step 2

Calculates cost for each row group of each column group
 - Essentially a reduce on the per-row data
 - Block the rows, assign block to thread block
 - Use shared memory and atomics to build histogram of all rows in a given row group
Merge shared histogram with global histogram for that row group
Iterate over all row groups
[Figure: example matrix with its column groups and row groups labeled; Stat(RowGroup 1, ColumnGroup 3) = 2]
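
A sequential sketch of the reduce, assuming the per-row statistics are stored as `stats[r * num_col_groups + g]` and `row_group[r]` gives each row's group (names hypothetical). On the GPU the accumulation would go through shared memory and atomics instead of a plain loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// group_stats[rg * num_col_groups + g] = total number of 1's in block (rg, g),
// obtained by summing the per-row statistics of every row assigned to rg.
std::vector<int> row_group_stats(const std::vector<int>& stats, int rows,
                                 const std::vector<int>& row_group,
                                 int num_row_groups, int num_col_groups) {
    assert(stats.size() == static_cast<std::size_t>(rows) * num_col_groups);
    assert(row_group.size() == static_cast<std::size_t>(rows));
    std::vector<int> group_stats(
        static_cast<std::size_t>(num_row_groups) * num_col_groups, 0);
    for (int r = 0; r < rows; ++r) {
        const int rg = row_group[r];
        assert(rg >= 0 && rg < num_row_groups);
        for (int g = 0; g < num_col_groups; ++g) {
            group_stats[rg * num_col_groups + g] += stats[r * num_col_groups + g];
        }
    }
    return group_stats;
}
```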
Implementation – Row/Col Group Optimization

For each row, find the optimal group it could belong to
 - Parallelized straightforwardly: one row per thread, loop and stride to get all rows
 - Each row calculation goes through all row groups and determines the global cost of moving to that row group
Move all rows to their optimal group
Recompute statistics
Repeat for column groups
Continue alternating row/column groupings until convergence
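
A sketch of one row pass. The slides do not spell out the cost function, so this assumes a cross-entropy style cost: the bits needed to code a row using the block densities of a candidate row group. The names and the clamping constant are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Assumed cost (bits) of coding row r with the densities of row group rg:
// per column group, ones * -log2(p) + zeros * -log2(1 - p), where p is the
// density of block (rg, g), clamped away from 0 and 1 to keep the cost finite.
double row_cost_bits(const std::vector<int>& stats, int r, int rg,
                     const std::vector<int>& group_stats,
                     const std::vector<int>& rows_in_group,
                     const std::vector<int>& cols_in_group) {
    const int ncg = static_cast<int>(cols_in_group.size());
    double bits = 0.0;
    for (int g = 0; g < ncg; ++g) {
        const double cells = static_cast<double>(rows_in_group[rg]) * cols_in_group[g];
        double p = cells > 0.0 ? group_stats[rg * ncg + g] / cells : 0.5;
        p = std::min(std::max(p, 1e-6), 1.0 - 1e-6);
        const int ones = stats[r * ncg + g];
        const int zeros = cols_in_group[g] - ones;
        bits += -ones * std::log2(p) - zeros * std::log2(1.0 - p);
    }
    return bits;
}

// Each row picks the row group with the lowest assumed cost.
std::vector<int> best_row_groups(const std::vector<int>& stats, int rows,
                                 const std::vector<int>& group_stats,
                                 const std::vector<int>& rows_in_group,
                                 const std::vector<int>& cols_in_group) {
    const int nrg = static_cast<int>(rows_in_group.size());
    assert(nrg > 0);
    std::vector<int> best(rows, 0);
    for (int r = 0; r < rows; ++r) {      // GPU version: one row per thread
        double best_cost = std::numeric_limits<double>::infinity();
        for (int rg = 0; rg < nrg; ++rg) {
            const double cost =
                row_cost_bits(stats, r, rg, group_stats, rows_in_group, cols_in_group);
            if (cost < best_cost) { best_cost = cost; best[r] = rg; }
        }
    }
    return best;
}
```

After this pass the rows are moved, the statistics are recomputed, and the same procedure runs on the columns.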
Room For Improvement

Parallelization could be more sophisticated
 - Could block the rows and compute the cost of the row joining each row group in parallel
 - Using shared memory atomics to identify minimum cost
In practice, this algorithm heavily favors a small number of row and column groups
 - The parallelization would therefore be small
Implementation – Outer Loop

After local minimum is found, change initial number of row and column groups and retry
 - Change number of row groups or number of column groups, up or down
 - Continue changing number of row or column groups in that direction until cost fails to decrease
 - Try both directions in both dimensions before stopping
Outer loop performed on CPU
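
A sketch of the outer search. `inner_cost(k, l)` is a hypothetical callback that stands in for "run the inner loop to a local minimum with k row groups and l column groups and return the resulting cost". The step size of one group at a time is also an assumption.

```cpp
#include <cassert>
#include <functional>

struct GroupCounts {
    int row_groups;
    int col_groups;
    double cost;
};

GroupCounts search_group_counts(int k0, int l0,
                                const std::function<double(int, int)>& inner_cost) {
    assert(k0 >= 1 && l0 >= 1);
    GroupCounts best{k0, l0, inner_cost(k0, l0)};
    // Row groups up, row groups down, column groups up, column groups down.
    const int steps[4][2] = {{+1, 0}, {-1, 0}, {0, +1}, {0, -1}};
    bool improved = true;
    while (improved) {                    // stop once no direction helps
        improved = false;
        for (const auto& step : steps) {
            // Keep moving in this direction until the cost fails to decrease.
            while (true) {
                const int k = best.row_groups + step[0];
                const int l = best.col_groups + step[1];
                if (k < 1 || l < 1) break;
                const double cost = inner_cost(k, l);
                if (cost >= best.cost) break;
                best = {k, l, cost};
                improved = true;
            }
        }
    }
    return best;
}
```

Each `inner_cost` call is independent of the others in a given direction sweep, which is what makes the search a candidate for spreading over several GPUs.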
Room for Improvement

Outer loop could parallelize inner loop actions over different GPUs
 - Each could explore the different dimensions and directions in parallel
Implementation – CPU + Validation


CPU implementation performed all steps described earlier, but sequentially
Validation
 - Used CPU implementation of statistics calculations to validate GPU stats calculations
 - CPU and GPU log implementations differ, so validated cost calculations by allowing for a tolerance of 5% between results
 - Did not have time to validate the overall algorithm or to visualize its outputs to see if the co-clusters produced were reasonable
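
A sketch of the two kinds of check, assuming the integer statistics are compared exactly and the floating-point costs are compared with a 5% relative tolerance. The helper names, and measuring the tolerance against the larger magnitude, are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Integer statistics (counts of 1's) should agree exactly.
bool stats_match(const std::vector<int>& cpu_stats, const std::vector<int>& gpu_stats) {
    return cpu_stats == gpu_stats;
}

// Costs involve log(), so allow a relative difference of up to `tol`.
bool costs_match(double cpu_cost, double gpu_cost, double tol = 0.05) {
    assert(tol >= 0.0);
    const double scale = std::max(std::fabs(cpu_cost), std::fabs(gpu_cost));
    if (scale == 0.0) return true;        // both costs are exactly zero
    return std::fabs(cpu_cost - gpu_cost) <= tol * scale;
}
```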
Timing Measurements


Time was measured by clock_t/CLOCKS_PER_SEC under the CPU implementation
Measured by CUDA events under the GPU implementation
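
A sketch of the CPU-side measurement, with a hypothetical helper name. Note that `std::clock` reports processor time rather than wall-clock time. The GPU-side counterpart would bracket the kernel launches with `cudaEventRecord` and read the interval with `cudaEventElapsedTime`, which reports milliseconds.

```cpp
#include <cassert>
#include <ctime>
#include <functional>

// Returns the processor time consumed by work(), in seconds.
double cpu_seconds(const std::function<void()>& work) {
    assert(static_cast<bool>(work));
    const std::clock_t start = std::clock();
    work();
    const std::clock_t stop = std::clock();
    return static_cast<double>(stop - start) / CLOCKS_PER_SEC;
}
```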
Development Lessons Learned

CUDA and structured data is a bad idea
 - Even structs of arrays are impossible to deal with
 - Host-side pointer math on device pointers does not work
CUDA API has REALLY unfriendly error messages
 - Take care to do very, very little through that API
__device__ variables declared globally must be passed to kernels
 - Runtime errors otherwise
You can malloc and free memory in device code as of CUDA 3.2 (allocations come from a heap in global memory, not shared memory)
Development Lessons Learned (Cont.)

Visual Studio CUDA integration leaves a lot to be desired
 - All optimizations removed, still can’t set breakpoints everywhere
 - Many variables show as freed
 - No in-IDE, real-time, in-editor compile errors
But, Visual Studio does give nice auto-complete, auto-definition navigation
No CUDA linker => separate files must be directly #include’d
Experiment - Environment

Float.cs.drexel.edu
 - CPU: 4 quad-core Intel Xeon L5360 processors @ 2.13 GHz
 - GPU: 2 Nvidia GeForce GTX 580 GPUs @ 1544 MHz
Experiment - Description

Sequential (CPU) and Parallel (GPU) tested on square matrices of order 100, 1000, and 10000
 - Larger matrices caused memory problems
GPU tested with varying block and thread counts
 - Num blocks: 10, 100, 5000
 - Num threads: 10, 100, 1024 (max)
Resulting co-clusters usually stayed in the 50-200 row/column group range, regardless of matrix order
 - Row and column groupings are important in the calculation of matrix statistics, since rows and columns are blocked by these
Experiment Results
[Figure: Speedup - 10 Blocks. Speedup (axis 0 to 80) versus matrix order (100, 1000, 10000), one series per thread count (10, 100, 1024).]
Experiment Results

For a small number of blocks, 100-thread performance peaks at num_blocks * num_threads = matrix_order
 - I would expect this to be the optimal configuration, when num_blocks ~= num_row_groups ~= num_col_groups
 - Slowdown occurs when matrix order exceeds total number of threads and more must be done serially
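
The "more must be done serially" effect can be quantified with a small sketch. It assumes the loop-and-stride scheme with one row per thread per pass; the function name is hypothetical.

```cpp
#include <cassert>

// Number of rows each thread must process one after another when
// num_blocks * num_threads threads stride over matrix_order rows.
long long rows_per_thread(long long matrix_order, long long num_blocks,
                          long long num_threads) {
    assert(matrix_order > 0 && num_blocks > 0 && num_threads > 0);
    const long long total_threads = num_blocks * num_threads;
    return (matrix_order + total_threads - 1) / total_threads;  // ceiling division
}
```

With 10 blocks of 100 threads, a matrix of order 1000 needs a single pass per thread, while order 10000 needs 10 passes per thread.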
Experiment - Results
[Figure: Speedup - 100 Blocks. Speedup (axis 0 to 80) versus matrix order (100, 1000, 10000), one series per thread count (10, 100, 1024).]
Experiment Results
[Figure: Speedup - 5000 Blocks. Speedup (axis 0 to 80) versus matrix order (100, 1000, 10000), one series per thread count (10, 100, 1024).]
Experiment Results

Interestingly, the maximum speedup was the same for all block counts
 - Roughly speaking, as long as num_blocks * num_threads >= matrix order, max speedup of ~70 is achieved
 - 10 threads never got there, due to block scheduling overhead?
 - Possibly cost of copying to shared memory for block processing was not recouped in 10 thread case?
Maxing out thread count is counter-productive in smaller matrices
 - Hypothesis: When block count is excessive (as for small matrices), scheduling of large blocks of threads that return immediately is costly
Experiment Results
[Figure: Efficiency - 10 Blocks. Efficiency (axis 0 to 0.08) versus matrix order (100, 1000, 10000), one series per thread count (10, 100, 1024).]
Experiment Results
[Figure: Efficiency - 100 Blocks. Efficiency (axis 0 to 0.05) versus matrix order (100, 1000, 10000), one series per thread count (10, 100, 1024).]
Experiment Results
[Figure: Efficiency - 5000 Blocks. Efficiency (axis 0 to 0.0012) versus matrix order (100, 1000, 10000), one series per thread count (10, 100, 1024).]
Experiment Results

Efficiency is consistently highest for the smaller numbers of blocks and smaller numbers of threads within those blocks
 - Hypothesis: Overhead of starting blocks and threads must be high enough to result in diminishing returns when adding blocks and threads
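
The slides do not define efficiency. The sketch below assumes the usual parallel efficiency, speedup divided by the number of threads launched (num_blocks * num_threads). That reading fits the 10-block chart, whose axis runs to 0.08 while the speedup tops out near 70 (70 / 1000 = 0.07), but it remains an assumption.

```cpp
#include <cassert>

// Speedup of the GPU run relative to the sequential CPU run.
double speedup(double cpu_seconds, double gpu_seconds) {
    assert(cpu_seconds >= 0.0 && gpu_seconds > 0.0);
    return cpu_seconds / gpu_seconds;
}

// Assumed definition: speedup per launched thread.
double efficiency(double speedup_value, int num_blocks, int num_threads) {
    assert(num_blocks > 0 && num_threads > 0);
    return speedup_value / (static_cast<double>(num_blocks) * num_threads);
}
```

Under this definition a fixed maximum speedup necessarily yields lower efficiency as blocks and threads are added.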
