Overview of Ocelot - CompArch - Georgia Institute of Technology

OVERVIEW OF OCELOT: ARCHITECTURE
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Overview
• GPU Ocelot overview
• Building, configuring, and executing Ocelot programs
• Ocelot Device Interface and CUDA Runtime API
• Ocelot PTX Internal Representation
• PTX Pass Manager
Ocelot: Multiplatform Dynamic Compilation
[Figure labels: Language Front-End; Data Parallel IR. Credits: R. Domingo & D. Kaeli (NEU); esd.lbl.gov]
• Just-in-time code generation and optimization for data-intensive applications
• Environment for i) compiler research, ii) architecture research, and iii) productivity tools
NVIDIA's Compute Unified Device Architecture (CUDA)
• Integrate the concept of a compute kernel called from standard languages
  • Multithreaded host programs
• The compute kernel specifies data-parallel computation as thousands of threads
• An accelerator model of computing
  • Explicit functions for off-loading computation to GPUs
  • Data movement explicitly managed by the programmer

NVIDIA’s Compute Unified Device Architecture (CUDA)
[Figure: Host and GPU]
• For access to CUDA tutorials: http://developer.nvidia.com/cuda-education-training
Structure of a Compute Kernel
• Parallel Thread Execution (PTX) instruction set architecture
• Arrays of (data parallel) thread blocks called cooperative thread arrays (CTAs)
• Barrier synchronization
• Mapped to single instruction stream multiple data stream (SIMD) processor
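The CTA and barrier bullets above can be pictured on a CPU. The following is only a minimal sketch: `runCta` is a hypothetical helper (it is neither CUDA nor Ocelot API) that runs the threads of one CTA one after another and models the barrier as the boundary between two loops over thread ids.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of Ocelot or CUDA: emulates one CTA with
// data.size() threads on a single CPU thread. The toy kernel reverses
// `data` through a shared buffer. The barrier between the two phases is
// modeled by finishing phase 1 for every thread before phase 2 starts.
std::vector<int> runCta(const std::vector<int>& data) {
    const std::size_t ctaSize = data.size();
    std::vector<int> shared(ctaSize);  // stands in for .shared memory
    std::vector<int> out(ctaSize);

    // Phase 1: thread tid stores its own element into shared memory.
    for (std::size_t tid = 0; tid < ctaSize; ++tid) {
        shared[tid] = data[tid];
    }

    // --- barrier: every store above is complete for every thread ---

    // Phase 2: thread tid reads an element written by a different thread.
    for (std::size_t tid = 0; tid < ctaSize; ++tid) {
        out[tid] = shared[ctaSize - 1 - tid];
    }
    return out;
}
```

Without the barrier, a thread in phase 2 could read a slot of `shared` that its partner thread has not written yet; splitting the loops is what rules that out in this serialized model.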
NVIDIA Fermi GF 100
• 4 Graphics Processing Clusters (GPCs) containing 4 SMs each
• Each streaming multiprocessor (SM) has 32 ALUs, 4 SFUs, and 16 LS units
• Each ALU has access to 1024 32-bit registers (total of 128 kB per SM)
• Each SM has its own Shared Memory/L1 cache (64 kB total)
• Unified L2 cache (768 kB)
• Six 64-bit Memory Controllers (total 384 bits wide)
[Figure: streaming multiprocessor (SM) and its ALUs]
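The totals quoted above follow from the per-unit figures. As a quick check, the arithmetic can be written as compile-time constants; the constant and function names below are made up for this sketch, and the only inputs are the numbers on the slide.

```cpp
#include <cassert>

// Per-unit figures quoted on the slide for Fermi GF100.
constexpr int kGpcs = 4;
constexpr int kSmsPerGpc = 4;
constexpr int kAlusPerSm = 32;
constexpr int kRegsPerAlu = 1024;       // 32-bit registers per ALU
constexpr int kBytesPerReg = 4;         // 32 bits
constexpr int kMemControllers = 6;
constexpr int kBitsPerController = 64;

constexpr int totalSms() { return kGpcs * kSmsPerGpc; }
constexpr int totalAlus() { return totalSms() * kAlusPerSm; }
constexpr int registerFileBytesPerSm() {
    return kAlusPerSm * kRegsPerAlu * kBytesPerReg;
}
constexpr int memoryBusBits() { return kMemControllers * kBitsPerController; }

// 32 ALUs x 1024 registers x 4 bytes = 128 kB per SM, as stated.
static_assert(registerFileBytesPerSm() == 128 * 1024, "register file size");
// 6 controllers x 64 bits = 384-bit memory interface, as stated.
static_assert(memoryBusBits() == 384, "memory bus width");
```

The same constants give 16 SMs and 512 ALUs for the whole chip.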
Ocelot Structure [1]
[Figure labels: CUDA Application; nvcc; PTX Kernel]
• Ocelot is built with nvcc and the LLVM backend
  • Structured around a PTX IR to LLVM IR translator
• Compile stock CUDA applications without modification
[1] G. Diamos, A. Kerr, S. Yalamanchili, and N. Clark, "Ocelot: A Dynamic Optimizing Compiler for Bulk Synchronous Applications in Heterogeneous Systems," PACT, September 2010.
CUDA to PTX

• PTX modules stored as string literals in fat binary
• We ignore the accompanying binary image (GPU native binary)
Dependencies



• Software
  • C++ Compiler (GCC 4.5.x)
  • Lex Lexer Generator (Flex 2.5.35)
  • YACC Parser Generator (Bison 2.4.1)
  • SCons (Python 2.7)
  • LLVM (3.1)
• Libraries
  • boost_system (1.46)
  • boost_filesystem (1.46)
  • boost_serialization (1.46)
  • GLEW (optional for GL interop) (1.5)
  • GL (for NVIDIA GPU Devices)
• Library headers
  • Boost (1.46)
http://code.google.com/p/gpuocelot/wiki/Installation
Ocelot Source Code
• Freely available via Google Code project site (New BSD License)
http://code.google.com/p/gpuocelot/
• ocelot/
  • analysis/     -- analysis passes
  • api/          -- Ocelot-specific API extensions
  • cuda/         -- implements CUDA runtime
  • executive/    -- Device interface and backend implementations
  • ir/           -- internal representations (PTX, LLVM, AMD IL)
  • parser/       -- parser (to PTX)
  • tools/        -- standalone applications using Ocelot
  • trace/        -- trace generation and analysis tools
  • translator/   -- translators from PTX to LLVM and AMD IL
  • transforms/   -- program transformations
svn checkout http://gpuocelot.googlecode.com/svn/trunk/ gpuocelot-read-only
Building GPU Ocelot

• Obtain source code
  svn checkout http://gpuocelot.googlecode.com/svn/trunk/ gpuocelot-read-only
• Compile with SCons
  sudo ./build.py --install
• Build and execute unit tests
  sudo ./build.py --test=full
• Output appears in .release_build
  • libocelot.so
  • OcelotConfig
  • Tests
• Installation directory:
  • /usr/local/include/ocelot
  • /usr/local/lib
http://code.google.com/p/gpuocelot/wiki/Installation
Configuring Ocelot


• configure.ocelot
  • Controls Ocelot's initial state
  • Located in application's startup directory
• trace specifies which trace generators are initially attached
• executive controls device properties
• trace:
  • memoryChecker - ensures
  • raceDetector - enforces synchronized access to .shared
  • debugger - interactive debugger
• executive:
  • devices: list of Ocelot backend devices that are enabled
    • nvidia - NVIDIA GPU backend
    • emulated - Ocelot PTX emulator (trace generators)
    • llvm - efficient execution of PTX on multicore CPU
    • amd - translation to AMD IL for PTX on AMD RADEON GPU

Example configure.ocelot:

{
    trace: {
        memoryChecker: {
            enabled: true,
            checkInitialization: false
        },
        raceDetector: {
            enabled: false,
            ignoreIrrelevantWrites: true
        },
        debugger: {
            enabled: false,
            kernelFilter: "_Z13scalarProdGPUPfS_S_ii",
            alwaysAttach: true
        },
    },
    executive: {
        devices: [ "emulated" ],
    }
}
Building and Executing CUDA Programs
• nvcc -c example.cu -arch sm_23
• g++ -o example example.o `OcelotConfig -l`
  • `OcelotConfig -l` expands to '-locelot'
• libocelot.so replaces libcudart.so
CUDA Runtime API

• Ocelot implements CUDA Runtime API
  • Transparent hooks into existing CUDA applications
  • override methods of cuda::CudaDeviceInterface
• Maps CUDA RT onto Ocelot device interface abstraction
  • cuda::CudaRuntime
• Extended through custom Ocelot API
  • e.g. ocelot::registerPTXModule( );
Ocelot CUDA Runtime Overview
• A reimplementation of the CUDA Runtime API
• Compatible with existing applications
  • Link against libocelot.so instead of libcudart
• Kernels execute anywhere → Key to portability!
[Figure credit: R. Domingo & D. Kaeli (NEU)]
Ocelot CUDA Runtime
• Clean device abstraction
  • All back-ends implement same interface
• Ocelot API Extensions
  • Add/remove trace generators
  • Compile/launch kernels directly in PTX
  • Device memory sharing among host threads
  • Device switching

Ocelot Source Code: CUDA Runtime API
• ocelot/
  • analysis/     -- analysis passes
  • api/          -- Ocelot-specific API extensions
  • cuda/         -- implements CUDA runtime
    • interface/CudaRuntimeInterface.h
    • interface/CudaRuntime.h
    • interface/CudaRuntimeContext.h
    • interface/FatBinaryContext.h
    • interface/CudaDriverFrontend.h
  • executive/    -- Device interface and backend implementations
  • ir/           -- internal representations (PTX, LLVM, AMD IL)
  • parser/       -- parser (to PTX)
  • tools/        -- standalone applications using Ocelot
  • trace/        -- trace generation and analysis tools
  • translator/   -- translators from PTX to LLVM and AMD IL
  • transforms/   -- program transformations
Ocelot CUDA Runtime API Implementation



• Implement interface defined by cuda::CudaRuntimeInterface
  • ocelot/cuda/interface/CudaRuntime.h
  • ocelot/cuda/implementation/CudaRuntime.cpp
  • class cuda::CudaRuntime
• cuda::CudaRuntime members
  • Host thread contexts
  • Ocelot devices
  • Registered modules, textures, kernels
  • Fat binaries
  • Global mutex
• CUDA Runtime API functions
  • e.g. cudaMemcpy, cudaLaunch, __cudaRegisterModule()
• Additional functions
  • e.g. _lock(), _unlock(), _registerModule()
Ocelot Source Code: Device Interface
• ocelot/
  • executive/    -- Device interface and backend implementations
    • interface/Device.h
    • interface/EmulatorDevice.h
    • interface/NVIDIAGPUDevice.h
    • interface/MulticoreCPUDevice.h
    • interface/ATIGPUDevice.h
Ocelot Device Interface

• class executive::Device
• Succinct interface for device objects
  • Module registration
  • Memory management
  • Kernel configuration and launching
  • Global variable and texture management
  • OpenGL interoperability
  • Streams and Events
  • Trace generators
• Minimal set of APIs for device-oriented programming model
  • Capture device state:
    • Memory allocations, global variables, textures, graphics interoperability
  • 57 functions (versus CUDA Runtime's 120+)
• Facilitate creation of backend execution targets
  • Implement Device interface
• Enable multiple API front ends
  • Implement front ends targeting Device interface
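The idea of one abstract device with several backends and front ends can be sketched in a few lines. Everything below is hypothetical: `ToyDevice`, `ToyEmulatedDevice`, and `roundTrip` are invented for illustration and do not reproduce the members or signatures of executive::Device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical, heavily reduced stand-in for a device interface. The real
// executive::Device has many more members and different signatures.
class ToyDevice {
public:
    virtual ~ToyDevice() = default;
    virtual std::string name() const = 0;
    // Memory management: returns an opaque handle to a new allocation.
    virtual std::size_t allocate(std::size_t bytes) = 0;
    virtual void write(std::size_t handle, const void* src, std::size_t bytes) = 0;
    virtual void read(std::size_t handle, void* dst, std::size_t bytes) const = 0;
};

// One toy backend that keeps "device" allocations in host memory.
class ToyEmulatedDevice : public ToyDevice {
public:
    std::string name() const override { return "emulated"; }
    std::size_t allocate(std::size_t bytes) override {
        buffers_.emplace_back(bytes);
        return buffers_.size() - 1;
    }
    void write(std::size_t h, const void* src, std::size_t bytes) override {
        assert(h < buffers_.size() && bytes <= buffers_[h].size());
        std::memcpy(buffers_[h].data(), src, bytes);
    }
    void read(std::size_t h, void* dst, std::size_t bytes) const override {
        assert(h < buffers_.size() && bytes <= buffers_[h].size());
        std::memcpy(dst, buffers_[h].data(), bytes);
    }
private:
    std::vector<std::vector<unsigned char>> buffers_;
};

// A toy "front end": it only talks to the abstract interface, so any
// backend implementing ToyDevice can be substituted.
int roundTrip(ToyDevice& device, int value) {
    const std::size_t handle = device.allocate(sizeof(int));
    device.write(handle, &value, sizeof(int));
    int result = 0;
    device.read(handle, &result, sizeof(int));
    return result;
}
```

Because `roundTrip` depends only on `ToyDevice`, adding a second backend would not change the front end, which is the property the slide attributes to the Device interface.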
Ocelot PTX Intermediate Representation (IR)
• Backend compiler framework for PTX
• Full-featured PTX IR
  • Class hierarchy for PTX instructions/directives
  • PTX control flow graph
  • Static single-assignment form
  • Dataflow/dominance analysis
  • Enables PTX optimization
• IR to IR translation
  • From PTX to other IRs
    • LLVM (x86/PowerPC/ARM)
    • CAL (AMD GPUs)
[Figure label: PTX Kernel]
Ocelot Source Code: Intermediate Representation
• ocelot/
  • ir/           -- internal representations (PTX, LLVM, AMD IL)
    • interface/Module.h
    • interface/PTXInstruction.h
    • interface/PTXOperand.h
    • interface/PTXKernel.h
    • interface/ControlFlowGraph.h
    • interface/ILInstruction.h
    • interface/LLVMInstruction.h
  • parser/       -- parser (to PTX)
    • interface/PTXParser.h
Ocelot PTX Internal Representation


• C++ classes representing PTX module
  • ir::PTXModule
  • ir::PTXKernel
  • ir::PTXInstruction
  • ir::PTXOperand
  • ir::GlobalVariable
  • ir::LocalVariable
  • ir::Parameter
• Ocelot PTX Parser target, Emitter source
  • ir::PTXInstruction::valid( )
• Translator source
  • PTX to LLVM
  • PTX to AMD IL
• Suitable for analysis and transformation
• Executable representation
  • PTX Emulator
Ocelot PTX IR: Kernels
// The whole listing is an ir::Module
.global .f32 globalVariable;                  // ir::Global
.entry sequence (                             // ir::Kernel
    .param .u64 __cudaparm_sequence_A,        // ir::Parameter
    .param .s32 __cudaparm_sequence_N)
{
    .reg .u32 %r<11>;
    .reg .u64 %rd<6>;
    .local .u32 %rp0;                         // ir::Local
    ...
$LDWbegin_sequence:                           // ir::BasicBlock
    ld.param.s32 %r6, [__cudaparm_sequence_N];
    setp.le.s32 %p1, %r6, %r5;
    @%p1 bra $Lt_0_1026;
    ...
$Lt_0_1026:
    exit;
$LDWend_sequence:
} // sequence
Ocelot PTX IR: Instructions
An ir::BasicBlock holds ir::PTXInstruction objects; each instruction holds ir::PTXOperand fields (d, a, ...).

add.s32 %r7, %r5, 1;                          // 1: addressMode immediate
ld.param.u64 %rd1, [__cudaparm_sequence_A];   // ld: opcode, .param: addressSpace, .u64: dataType
                                              // d = %rd1: addressMode register
                                              // a = [__cudaparm_sequence_A]: addressMode address
cvt.s64.s32 %rd2, %r5;
mul.wide.s32 %rd3, %r5, 4;
add.u64 %rd4, %rd1, %rd3;
st.global.s32 [ %rd4 + 0 ], %r7;              // [ %rd4 + 0 ]: addressMode indirect
@%p1 bra $Lt_0_6146;                          // @%p1: guard predicate
                                              // $Lt_0_6146: addressMode label
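To make the instruction/operand split concrete, here is a minimal sketch of an instruction record with an emitter that prints PTX-like text. `ToyOperand`, `ToyInstruction`, and `makeAdd` are hypothetical names for this illustration; the field names loosely follow the slide (opcode, dataType, d, a, addressMode) but this is not the layout of ir::PTXInstruction or ir::PTXOperand.

```cpp
#include <cassert>
#include <string>

// Hypothetical miniature operand: how it prints depends on its address mode.
struct ToyOperand {
    enum AddressMode { Register, Immediate, Indirect, Label };
    AddressMode addressMode;
    std::string identifier;  // register or label name
    long value;              // immediate value, or offset for Indirect

    std::string toString() const {
        switch (addressMode) {
        case Register:  return identifier;
        case Immediate: return std::to_string(value);
        case Indirect:  return "[" + identifier + " + " + std::to_string(value) + "]";
        case Label:     return identifier;
        }
        return "";
    }
};

// Hypothetical three-operand instruction: destination d, sources a and b.
struct ToyInstruction {
    std::string opcode;    // e.g. "add"
    std::string dataType;  // e.g. "s32"
    ToyOperand d, a, b;

    // Emits a parseable string such as "add.s32 %r7, %r5, 1;".
    std::string toString() const {
        return opcode + "." + dataType + " " + d.toString() + ", " +
               a.toString() + ", " + b.toString() + ";";
    }
};

// Builds "add.s32 dst, src, imm" from a register, a register, and an immediate.
ToyInstruction makeAdd(const std::string& dst, const std::string& src, long imm) {
    ToyInstruction inst;
    inst.opcode = "add";
    inst.dataType = "s32";
    inst.d = ToyOperand{ToyOperand::Register, dst, 0};
    inst.a = ToyOperand{ToyOperand::Register, src, 0};
    inst.b = ToyOperand{ToyOperand::Immediate, "", imm};
    return inst;
}
```

Calling `makeAdd("%r7", "%r5", 1).toString()` reproduces the first instruction shown on the slide.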
Control and Data-Flow Graphs
• Data structure for representing kernels
• Basic blocks
  • fall-through and branch edges
  • instruction vector
  • label
• Traversals:
  • pre-order, topological, post-order
  • iterator visits blocks
• Data-flow graph overlays CFG
  • definition-use chains explicit
  • to and from SSA form
• CFG Transformations:
  • split blocks, edges
• DFG Transformations:
  • insert and remove values
  • iterate over def-use
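As an illustration of the traversal bullet, the sketch below computes a post-order over a toy control-flow graph. `ToyCfg`, `postOrder`, and `makeDiamond` are hypothetical and much simpler than ir::ControlFlowGraph: blocks are indices and edges carry no fall-through/branch distinction.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy CFG: blocks are indices, successors are stored per block.
struct ToyCfg {
    std::vector<std::string> labels;
    std::vector<std::vector<std::size_t>> successors;

    std::size_t addBlock(const std::string& label) {
        labels.push_back(label);
        successors.emplace_back();
        return labels.size() - 1;
    }
    void addEdge(std::size_t from, std::size_t to) {
        assert(from < labels.size() && to < labels.size());
        successors[from].push_back(to);
    }
};

// Depth-first helper: a block is emitted only after all of its successors.
void visitPostOrder(const ToyCfg& cfg, std::size_t block,
                    std::vector<bool>& seen, std::vector<std::string>& order) {
    if (seen[block]) return;
    seen[block] = true;
    for (std::size_t next : cfg.successors[block]) {
        visitPostOrder(cfg, next, seen, order);
    }
    order.push_back(cfg.labels[block]);
}

// Post-order of the blocks reachable from `entry`. Reversing the result
// gives a topological order when the graph has no cycles.
std::vector<std::string> postOrder(const ToyCfg& cfg, std::size_t entry) {
    assert(entry < cfg.labels.size());
    std::vector<bool> seen(cfg.labels.size(), false);
    std::vector<std::string> order;
    visitPostOrder(cfg, entry, seen, order);
    return order;
}

// Builds the classic if/else diamond: entry -> {then, else} -> exit.
ToyCfg makeDiamond() {
    ToyCfg cfg;
    const std::size_t entry = cfg.addBlock("entry");
    const std::size_t thenB = cfg.addBlock("then");
    const std::size_t elseB = cfg.addBlock("else");
    const std::size_t exitB = cfg.addBlock("exit");
    cfg.addEdge(entry, thenB);
    cfg.addEdge(entry, elseB);
    cfg.addEdge(thenB, exitB);
    cfg.addEdge(elseB, exitB);
    return cfg;
}
```

For the diamond, `postOrder(makeDiamond(), 0)` visits exit, then, else, entry, in that order.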
Example: Control-Flow Graphs
// example: splits basic blocks containing barriers
//
for (ir::ControlFlowGraph::iterator bb_it = kernel->cfg()->begin();
     bb_it != kernel->cfg()->end();
     ++bb_it) {
    // iterate over basic blocks
    unsigned int n = 0;
    ir::BasicBlock::InstructionList::iterator inst_it;
    for (inst_it = (bb_it)->instructions.begin();
         inst_it != (bb_it)->instructions.end();
         ++inst_it, n++) {
        // iterate over instructions in *bb_it
        const ir::PTXInstruction *inst = static_cast<
            const ir::PTXInstruction *>(*inst_it);
        if (inst->opcode == ir::PTXInstruction::Bar) {
            // split block containing bar.sync so that it's always
            // the last instruction in a block
            if (n + 1 < (unsigned int)(bb_it)->instructions.size()) {
                std::string label = (bb_it)->label + "_bar";
                kernel->cfg()->split_block(bb_it, n+1,
                    ir::BasicBlock::Edge::FallThrough, label);
            }
            break;
        }
    } // end for (inst_it)
} // end for (bb_it)
Example: Spilling Live Values
// ocelot/analysis/implementation/RemoveBarrierPass.cpp
//
void RemoveBarrierPass::_addSpillCode( DataflowGraph::iterator block,
    const DataflowGraph::Block::RegisterSet& alive )
{
    unsigned int bytes = 0;

    // load the address of the spill stack into a fresh register
    ir::PTXInstruction move ( ir::PTXInstruction::Mov );
    move.type = ir::PTXOperand::u64;
    move.a.identifier = "__ocelot_remove_barrier_pass_stack";
    move.a.addressMode = ir::PTXOperand::Address;
    move.a.type = ir::PTXOperand::u64;
    move.d.reg = _kernel->dfg()->newRegister();
    move.d.addressMode = ir::PTXOperand::Register;
    move.d.type = ir::PTXOperand::u64;
    _kernel->dfg()->insert( block, move, block->instructions().size() - 1 );
    ...
    // store each live register at the next offset in the spill stack
    for( DataflowGraph::Block::RegisterSet::const_iterator
        reg = alive.begin(); reg != alive.end(); ++reg ) {
        ir::PTXInstruction save( ir::PTXInstruction::St );
        save.type = reg->type;
        save.addressSpace = ir::PTXInstruction::Local;
        save.d.addressMode = ir::PTXOperand::Indirect;
        save.d.reg = move.d.reg;
        save.d.type = ir::PTXOperand::u64;
        save.d.offset = bytes;
        bytes += ir::PTXOperand::bytes( save.type );
        save.a.addressMode = ir::PTXOperand::Register;
        save.a.type = reg->type;
        save.a.reg = reg->id;
        _kernel->dfg()->insert( block, save, block->instructions().size() - 1 );
    }
    _spillBytes = std::max( bytes, _spillBytes );
}
IR for AMD and LLVM

• LLVM IR
  • Implements all of the LLVM instruction set
  • Decouples the translator from the LLVM project
  • Easier to construct than LLVM's actual IR
• AMD IL
  • Supports translation from PTX to AMD interface
• Emitters construct parseable string representations of modules
AMD Backend: R. Domingo & D. Kaeli (NEU)
PTX PassManager

• Orchestrates analysis and transformation passes
• Derived from LLVM model
• Analysis Passes generate meta-data
• Meta-data consumed by transformations
• Transformation Passes modify the IR
Using the Pass Manager

• Passes added to a manager
  • Schedules execution
• Manages analysis meta-data
  • Ensures meta-data available
  • Up to date; not redundantly computed
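The "up to date; not redundantly computed" behavior can be sketched with a small cache that is cleared whenever a pass reports that it changed the IR. This is a toy model under stated assumptions: `ToyKernel`, `ToyPassManager`, and `demoComputations` are invented names, and the real PassManager tracks per-analysis dependencies rather than clearing everything.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a kernel's IR.
struct ToyKernel {
    std::vector<int> values;
};

// Hypothetical toy pass manager; not Ocelot's PassManager interface.
class ToyPassManager {
public:
    // An analysis computes an integer summary ("meta-data") of the kernel.
    using Analysis = std::function<int(const ToyKernel&)>;
    // A transformation may modify the kernel and returns true if it did.
    using Transform = std::function<bool(ToyKernel&, ToyPassManager&)>;

    void registerAnalysis(const std::string& name, Analysis a) { analyses_[name] = a; }
    void addPass(Transform t) { passes_.push_back(t); }

    // Returns cached meta-data, computing it only when it is missing.
    int getAnalysis(const std::string& name, const ToyKernel& k) {
        auto cached = cache_.find(name);
        if (cached != cache_.end()) return cached->second;
        ++computations;
        const int result = analyses_.at(name)(k);
        cache_[name] = result;
        return result;
    }

    // Runs passes in the order they were added.
    void run(ToyKernel& k) {
        for (Transform& pass : passes_) {
            if (pass(k, *this)) cache_.clear();  // IR changed: meta-data is stale
        }
    }

    int computations = 0;  // number of times any analysis actually ran

private:
    std::map<std::string, Analysis> analyses_;
    std::map<std::string, int> cache_;
    std::vector<Transform> passes_;
};

// Demo: two read-only passes share one computation of "sum"; a modifying
// pass invalidates it, so the final reader triggers a second computation.
int demoComputations() {
    ToyKernel k{{1, 2, 3}};
    ToyPassManager pm;
    pm.registerAnalysis("sum", [](const ToyKernel& kk) -> int {
        int s = 0;
        for (int v : kk.values) s += v;
        return s;
    });
    auto reader = [](ToyKernel& kk, ToyPassManager& m) -> bool {
        m.getAnalysis("sum", kk);
        return false;
    };
    auto writer = [](ToyKernel& kk, ToyPassManager& m) -> bool {
        kk.values.push_back(m.getAnalysis("sum", kk));
        return true;
    };
    pm.addPass(reader);
    pm.addPass(reader);
    pm.addPass(writer);
    pm.addPass(reader);
    pm.run(k);
    return pm.computations;
}
```

In the demo, four passes request the analysis but it is computed only twice: once before the modification and once after.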
Analysis Passes

• Analysis runs over the PTX IR
  • Generates meta-data
  • Modifies PTX IR
  • Possibly updates or invalidates existing meta-data
• Examples
  • Data-flow graph
  • Dominator and Post-dominator trees
  • Thread frontiers
Analysis Passes - Supported Analysis Structures
• Control Flow Graph
  • ir/interface/ControlFlowGraph.h
• Data Flow Graph
  • analysis/interface/DataflowGraph.h
• Dominator and Post-Dominator Trees
  • analysis/interface/DominatorTree.h
  • analysis/interface/PostDominatorTree.h
• Superblock Analysis
  • analysis/interface/SuperblockAnalysis.h
• Divergence Graph
  • analysis/interface/DivergenceGraph.h
• Thread Frontiers
  • analysis/interface/ThreadFrontiers.h
Transformation Passes

• Modify the PTX IR
  • Consume meta-data
• Examples:
  • Dead-code elimination
    • transforms/interface/DeadCodeEliminationPass.h
  • Control-flow structuring
    • transforms/interface/StructuralTransform.h
  • Sync elimination
    • transforms/interface/SyncElimination.h
  • Dynamic instrumentation
Example: Dead Code Elimination Transformation Pass
Dead Code Elimination

• Approach
  • Run once on each kernel
  • Consume data-flow analysis meta-data
  • Delete instructions producing values with no users
• Implementation
  • transforms/interface/DeadCodeEliminationPass.h
  • transforms/implementation/DeadCodeEliminationPass.cpp
Dead Code Elimination (1 of 5)

• Set up pass dependencies

DeadCodeEliminationPass::DeadCodeEliminationPass()
    : KernelPass(Analysis::DataflowGraphAnalysis
                 | Analysis::StaticSingleAssignment,
                 "DeadCodeEliminationPass")
{
}
Dead Code Elimination (2 of 5)

• Run pass

void DeadCodeEliminationPass::runOnKernel(ir::IRKernel& k)
{

• Get analysis metadata

    Analysis* dfgAnalysis = getAnalysis(Analysis::DataflowGraphAnalysis);
    assert(dfgAnalysis != 0);
    // downcast from the generic Analysis to the data-flow graph
    analysis::DataflowGraph& dfg =
        *static_cast<analysis::DataflowGraph*>(dfgAnalysis);
    // the pass requires the graph to be in SSA form
    assert(dfg.ssa());
Dead Code Elimination (3 of 5)

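• Loop until no block changes (worklist is empty)

    // start with every block on the worklist
    BlockSet blocks;
    for (iterator block = dfg.begin(); block != dfg.end(); ++block)
    {
        report(" Queueing up BB_" << block->id());
        blocks.insert(block);
    }
    // process blocks until none remain; the set is passed along so
    // that blocks can be queued again
    while(!blocks.empty())
    {
        iterator block = *blocks.begin();
        blocks.erase(blocks.begin());
        eliminateDeadInstructions(dfg, blocks, block);
    }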
Dead Code Elimination (4 of 5)

• Remove unused live-out values

    // collect first, erase afterwards, so the set is not modified
    // while it is being iterated
    AliveKillList aliveOutKillList;
    for (RegisterSet::iterator aliveOut = block->aliveOut().begin();
         aliveOut != block->aliveOut().end(); ++aliveOut)
    {
        if (canRemoveAliveOut(dfg, block, *aliveOut))
        {
            report(" removed " << aliveOut->id);
            aliveOutKillList.push_back(aliveOut);
        }
    }
    for (AliveKillList::iterator killed = aliveOutKillList.begin();
         killed != aliveOutKillList.end(); ++killed)
    {
        block->aliveOut().erase(*killed);
    }
Dead Code Elimination (5 of 5)

• Check if an instruction can be removed

    if (ptx.hasSideEffects()) return false;
    for (RegisterPointerVector::iterator reg = instruction->d.begin();
         reg != instruction->d.end(); ++reg) {
        // the reg is alive outside the block
        if (block->aliveOut().count(*reg) != 0) return false;
        InstructionVector::iterator next = instruction;
        for (++next; next != block->instructions().end(); ++next) {
            for (RegisterPointerVector::iterator source = next->s.begin();
                 source != next->s.end(); ++source) {
                // found a user in the block
                if (*source->pointer == *reg->pointer) return false;
            }
        }
    }
Dead Code Elimination

• Repeat for
  • phi instructions
  • Other instructions
  • alive-in values
• Ensures meta-data is valid
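The overall approach (delete side-effect-free instructions whose results have no users, and repeat until nothing changes) can be sketched on a toy straight-line block. `ToyInst` and `eliminateDeadCode` are hypothetical; the sketch assumes each register is defined once and ignores phi instructions, alive-in sets, and multiple blocks, all of which the real pass handles through the data-flow graph.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical straight-line IR: each instruction defines at most one
// register and reads some others. Not Ocelot's DataflowGraph.
struct ToyInst {
    std::string dest;                  // register defined ("" if none)
    std::vector<std::string> sources;  // registers read
    bool hasSideEffects;               // e.g. a store; never removed
};

// Removes instructions that have no side effects and whose result is
// neither read by a remaining instruction nor listed in `liveOut`.
// Removing one instruction can make its inputs dead, so the scan repeats
// until a full pass removes nothing.
std::vector<ToyInst> eliminateDeadCode(std::vector<ToyInst> block,
                                       const std::set<std::string>& liveOut) {
    bool changed = true;
    while (changed) {
        changed = false;

        // Registers that still have a user.
        std::set<std::string> used(liveOut);
        for (const ToyInst& inst : block) {
            used.insert(inst.sources.begin(), inst.sources.end());
        }

        std::vector<ToyInst> kept;
        for (const ToyInst& inst : block) {
            const bool dead = !inst.hasSideEffects && used.count(inst.dest) == 0;
            if (dead) {
                changed = true;
            } else {
                kept.push_back(inst);
            }
        }
        block.swap(kept);
    }
    return block;
}
```

For a block where `%c` is computed from `%b`, `%b` from `%a`, and only `%a` feeds a store, the first scan removes the definition of `%c`, the second removes `%b`, and the third finds nothing left to delete.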
Running Passes on PTX

• Static optimizer
  • PTXOptimizer
  • Runs passes on PTX assembly files
  • ocelot/tools/PTXOptimizer.cpp
• JIT optimization
  • Runs passes before kernels are launched
  • ocelot/api/implementation/OcelotRuntime.cpp
Questions


• GPU Ocelot
  • Google Code site: http://code.google.com/p/gpuocelot
  • Research Project site: http://gpuocelot.gatech.edu
  • Mailing list: [email protected]
• Contributors
  • Gregory Diamos, Rodrigo Dominguez, Naila Farooqui, Andrew Kerr, Ashwin Lele, Si Li, Tri Pho, Jin Wang, Haicheng Wu, Sudhakar Yalamanchili
• Sponsors
  • AMD, IBM, Intel, LogicBlox, NSF, NVIDIA