Indirect Branching in the
Transmeta Efficeon Processor
Naveen Kumar and Naveen Neelakantam
Intel Corporation
Introduction
 Transmeta Efficeon processor
– HW/SW co-designed processor marketed in 2003
– Binary translation of x86 to underlying VLIW hardware
– Focus on how Efficeon handles indirect branches
– Indirect branches are particularly difficult for binary translation
– Efficeon provided a number of unique solutions
– Many interesting HW/SW solutions to improve efficiency
– Our hope is that we can use and build upon these ideas
Disclaimer and Acknowledgement
 A review of past work, not original research by the authors
 Efficeon was implemented by Transmeta, but its details were rarely published
 Acknowledgement and thanks to the original
Transmeta team
 We continue further advancement of these ideas*
* Intel purchased Transmeta IP
Transmeta Efficeon Processor
 6-issue VLIW, in-order, 10-stage pipeline
 Provides x86 compatibility
 Co-designed with a software system
– The Code Morphing Software (CMS)
[Figure: system stack, top to bottom. x86 application and x86 OS; x86 ISA; CMS (dynamic binary translation); RISC ISA; VLIW processor]
Dynamic Binary Translation
 Intercept executing app
 Interpret and profile
 Dynamically compile “hot” code to host ISA
 Cache and execute
 Compiled code fragments are “chained” together
– Difficult to chain across an indirect branch
– Branch target unknown until runtime
[Figure: x86 Code; Interpret; Translate; Translation Cache; Host Processor]
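The interpret, translate, and chain flow can be illustrated with a small Python model. This is only a sketch under stated assumptions: the `Translator` and `Fragment` classes, the `HOT_THRESHOLD` value, and the block representation are all hypothetical and are not taken from CMS. It shows why a fragment ending in a direct branch can be chained to its successor, while a fragment ending in an indirect branch has no static successor to chain to.

```python
# Hypothetical model of hot-code detection and fragment chaining in a
# dynamic binary translator. Names and the threshold are assumptions.
HOT_THRESHOLD = 3  # executions before a block is considered "hot"


class Fragment:
    def __init__(self, pc, kind, targets):
        self.pc = pc
        self.kind = kind        # "direct" or "indirect"
        self.targets = targets  # statically known successor PCs
        self.links = {}         # successor PC -> chained Fragment


class Translator:
    def __init__(self, program):
        self.program = program  # pc -> (kind, static targets)
        self.counts = {}        # profile: pc -> execution count
        self.cache = {}         # translation cache: pc -> Fragment

    def execute_block(self, pc):
        """Run one block; report whether it ran interpreted or translated."""
        if pc in self.cache:
            return "translated"
        self.counts[pc] = self.counts.get(pc, 0) + 1
        if self.counts[pc] >= HOT_THRESHOLD:
            self.translate(pc)
        return "interpreted"

    def translate(self, pc):
        kind, targets = self.program[pc]
        self.cache[pc] = Fragment(pc, kind, targets)
        # Chain every fragment to any statically known successor that is
        # already in the cache. Indirect branches have no static targets,
        # so they are never chained here.
        for frag in self.cache.values():
            for target in frag.targets:
                if target in self.cache:
                    frag.links[target] = self.cache[target]
```

In this model a block ending in an indirect branch always leaves `links` empty, which is the gap the rest of the talk addresses.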
Indirect Branch Translation
    pushf                                        ; save state
    push eax                                     ; save scratch registers
    push ebx
    mov eax, DWORD PTR [eax]                     ; move target to eax
    mov ebx,eax                                  ; save target
    and eax,TABLE_MASK
    lea eax, DWORD PTR [eax*8+IBTC_TABLE_START]
    cmp ebx,DWORD PTR [eax]                      ; compare tag (app. address)
    jne L2                                       ; jump if miss
L1:                                              ; IBTC hit
    mov eax,DWORD PTR [eax+4]                    ; load code cache address
    mov [&bta_loc], eax                          ; store in memory
    pop ebx                                      ; restore context
    pop eax
    popf
    jmp [bta_loc]                                ; jump via memory
L2:                                              ; IBTC miss, return to translator
    pop ebx                                      ; restore minimal context
    pop eax
    popf
    pusha                                        ; save full context
    pushf
    push DWORD PTR [eax]                         ; pass target PC
    push fragment_address                        ; pass from fragment address
    push reenter_code_cache                      ; tail call optimization
    jmp buildFrag                                ; call buildFrag(targPC,fragAddr)
1. Save state
2. Save scratch registers
3. Save target address
4. Compare tag
5. If miss, return to translator
Hit:
6. Load code cache address
7. Restore context
8. Jump to target address in code cache
Table: A software-only translation of an indirect branch
 Several proposals to improve translation efficiency
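The table lookup in the listing above can be modeled in Python as a direct-mapped table indexed by the masked target address. This is a sketch, not the translator's actual code: the `IBTC` class, the 16-entry table size, and the `build_fragment` callback are hypothetical, and filling the table on a miss is an assumption (the listing only shows the jump to `buildFrag`).

```python
# Hypothetical model of an indirect branch translation cache (IBTC).
TABLE_BITS = 4                      # assumed: 16-entry table
TABLE_MASK = (1 << TABLE_BITS) - 1


class IBTC:
    def __init__(self, build_fragment):
        self.tags = [None] * (TABLE_MASK + 1)  # application addresses
        self.code = [None] * (TABLE_MASK + 1)  # code cache addresses
        self.build_fragment = build_fragment   # stand-in for buildFrag
        self.hits = 0
        self.misses = 0

    def lookup(self, target_pc):
        index = target_pc & TABLE_MASK         # and eax, TABLE_MASK
        if self.tags[index] == target_pc:      # compare tag
            self.hits += 1
            return self.code[index]            # load code cache address
        # Miss: return to the translator, then (assumed) fill the entry.
        self.misses += 1
        code_addr = self.build_fragment(target_pc)
        self.tags[index] = target_pc
        self.code[index] = code_addr
        return code_addr
```

Two targets that share the same low bits evict each other in this model, so an indirect branch alternating between them misses every time.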
Indirect Branch Translation
 System level translators
– Branch target can change by a page-table/segment update
  – Page permission changes
  – Page table entry changes (LPN → PPN mappings)
  – Segment limit and permissions
– Sharing translations across processes is possible, but additional checks are needed
 Bottom line: indirect branch translation is expensive in traditional BT systems
Indirect Branch Prediction
 Traditional processors often use a BTB
 Insufficient in a BT system: the indirect branch is translated into a sequence containing conditional direct branches
 Conditional branches in an indirect branch translation
– Multiple conditional branches in an indirect branch translation
– Data-dependent on indirect branch target
– These branches also become difficult to predict in hardware
 Bottom line: indirect branches lead to poor branch prediction in traditional BT systems
Indirect Branching in Efficeon
 Efficeon uses HW/SW co-design to address:
– Efficient translation of indirect branches
– Better branch prediction than in other BT systems
 Next, we discuss how Efficeon handles:
– x86 return emulation
– x86 indirect branch emulation
– Native indirect branches and returns
x86 Return Example
foo:
    call bar
foo+2:
    …
    call bar
foo+8:
    …
bar:
    …
    ret
[Figure: Return Address Stack, with entries foo+2, foo+8, and baz]
 Conventional hardware has near-perfect return
target prediction
– Front-end typically implements a return address stack
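A return address stack can be modeled in a few lines of Python. This is a generic sketch of the technique rather than any specific processor's design: the `ReturnAddressStack` class, its depth of 16, and the drop-oldest overflow policy are assumptions. The string labels follow the foo/bar example on this slide.

```python
# Hypothetical model of a hardware return address stack (RAS).
class ReturnAddressStack:
    def __init__(self, depth=16):  # depth is an assumption
        self.depth = depth
        self.entries = []

    def on_call(self, return_address):
        """A call pushes the address of the instruction after it."""
        if len(self.entries) == self.depth:
            self.entries.pop(0)    # assumed policy: drop the oldest entry
        self.entries.append(return_address)

    def predict_return(self):
        """A return pops the most recent entry as the predicted target."""
        return self.entries.pop() if self.entries else None
```

Because calls and returns nest, the popped entry matches the actual return target as long as the stack neither overflows nor is disturbed, which is why the prediction is close to perfect.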
x86 Return Translation
x86 code:
foo:
    call bar
foo+2:
    …
    call bar
foo+8:
    …
bar:
    …
    ret

Translation:
foo’:
    mov [esp], foo+2
    sub esp, esp, 4
    br bar’
foo+2’:
    …
    mov [esp], foo+8
    sub esp, esp, 4
    br bar’
foo+8’:
    …
bar’:
    …
    add esp, esp, 4
    br lookup_ibtc(esp)

 Return is emulated using an indirect branch, which is difficult to predict
– Inlining doesn’t help
Hardware support: Flook Stack
[Figure: flook stack entry. Tag: x86 EIP, x86 context, x86 CS limit. Target: translated target]
 16-entry flook stack is explicitly managed by CMS
– Intended for emulating call/return in a translation
– Flook stack enables RAS-like target prediction
– Includes “tag” validation of an entry before consumption
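The tag-then-target behavior described above can be sketched as a Python model. Only the 16-entry size and the tag fields (x86 EIP, context, CS limit) come from the slide; the `FlookStack` class, its method names, the drop-oldest overflow policy, popping the entry on a tag mismatch, and the `slow_lookup` fallback are assumptions made for illustration.

```python
# Hypothetical model of a flook stack with tag validation.
from collections import namedtuple

FlookEntry = namedtuple("FlookEntry", "x86_eip context cs_limit target")


class FlookStack:
    SIZE = 16  # entry count stated on the slide

    def __init__(self):
        self.entries = []

    def precall(self, x86_eip, context, cs_limit, target):
        """At a translated call: push the tag and the translated return target."""
        if len(self.entries) == self.SIZE:
            self.entries.pop(0)  # assumed overflow policy
        self.entries.append(FlookEntry(x86_eip, context, cs_limit, target))

    def ret(self, x86_eip, context, cs_limit, slow_lookup):
        """At a translated return: use the top entry only if its tag matches."""
        if self.entries:
            entry = self.entries.pop()
            if (entry.x86_eip, entry.context, entry.cs_limit) == (
                x86_eip, context, cs_limit
            ):
                return entry.target
        # Empty stack or tag mismatch: fall back to a slower lookup.
        return slow_lookup(x86_eip)
```

The tag check is what separates this from a plain return address stack: a stale or mismatched entry is never consumed as a branch target.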
Translation using Flook Stack
foo’:
    mov rtemp, <foo+2>
    mov flook_x86_eip, rtemp
    st rtemp, [esp-4]
    sub esp, esp, 4
    precall <foo+2’>
    br <bar’>
foo+2’:
    …
bar’:
    …
    ld rtemp, [esp]
    mov flook_x86_eip, rtemp
    add esp, esp, 4
    ret
[Figure: flook stack entry with x86 EIP foo+2 and target foo+2’]
x86 Indirect Branch Emulation
 Translation similar to the one shown before
– Additional architectural registers significantly reduce
translation size
– Multiple “inlined” comparisons with known targets
– Monitoring and update of predicted targets in SW
– Compare translation “context” with runtime “context”
 Enhance branch prediction by co-design
– Software inserts target address in a “link” register
– Perform “other” computation
– Pipeline front-end fetches instructions at predicted target
– Actual branching happens later via a “brl” instruction
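The inlined comparisons and the software monitoring of predicted targets can be sketched as follows. This is a hypothetical Python model, not CMS code: the `InlineTargetCache` class, the limit of two inlined targets, and the most-recent-first replacement policy are assumptions, and `slow_lookup` stands in for the full table lookup.

```python
# Hypothetical model of inlined target comparisons for an indirect branch.
class InlineTargetCache:
    def __init__(self, slow_lookup, max_inline=2):
        self.slow_lookup = slow_lookup
        self.max_inline = max_inline  # assumed number of inlined compares
        self.predicted = []           # (x86 target, translated target)
        self.inline_hits = 0
        self.slow_paths = 0

    def dispatch(self, x86_target):
        # Inlined comparisons against known targets, checked in order.
        for known, translated in self.predicted:
            if known == x86_target:
                self.inline_hits += 1
                return translated
        # No inlined target matched: take the slow lookup, then update
        # the predicted targets (assumed policy: keep the most recent).
        self.slow_paths += 1
        translated = self.slow_lookup(x86_target)
        self.predicted.insert(0, (x86_target, translated))
        del self.predicted[self.max_inline:]
        return translated
```

A branch with one or two dominant targets resolves in the inlined compares in this model, while a branch with many targets keeps falling through to the slow path.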
Native Indirect Branches
 Translation dispatch and interpreter
– Both are frequent users of indirect branches
– Poor branch prediction
– Software can aid in branch prediction
– Link pipe
– Push target addresses onto a hardware structure
– Do “other” computation
– Frontend can fetch the branch target in the meantime
– Branch to the top of link pipe using “brlp”
 Native subroutines
– Link stack
– Analogous to a traditional call stack
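The link pipe usage pattern (push a target early, do other work, then branch) can be sketched in Python. The slide does not give the structure's capacity or ordering, so the `LinkPipe` class, its capacity of 4, and the first-in, first-out ordering are all assumptions; the `prefetched` list only records which targets the front end could have started fetching.

```python
# Hypothetical model of a link pipe consumed by a "brlp" branch.
from collections import deque


class LinkPipe:
    def __init__(self, capacity=4):  # capacity is an assumption
        self.capacity = capacity
        self.pipe = deque()
        self.prefetched = []         # targets visible to the front end

    def push(self, target):
        """Software pushes a branch target well before the branch executes."""
        if len(self.pipe) == self.capacity:
            raise OverflowError("link pipe full")
        self.pipe.append(target)
        self.prefetched.append(target)  # front end may begin fetching here

    def brlp(self):
        """Branch to the next queued target (FIFO order is assumed)."""
        return self.pipe.popleft()
```

The point of the pattern is the gap between `push` and `brlp`: the earlier software announces the target, the more time the front end has to fetch it.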
Summary and Future Work
 Indirect branches are particularly expensive in BT systems
 Several techniques to speed up indirect branches
– Flook stack
– Link register and brl
– Link pipe and brlp
– Link stack
 Future work: since Efficeon, there have been other proposals to enhance indirect branch handling in BT systems
– Hiser et al., Kim et al.
– It would be interesting to combine some of these ideas
References
 Bala et al., “Transparent Dynamic Optimization: The Design and Implementation of Dynamo”, 1999.
 Banning et al., “Link pipe system for storage and retrieval of sequences of branch addresses”, 2003.
 Banning et al., “Fast look-up of indirect branch destination in a dynamic translation system”, 2006.
 Hiser et al., “Evaluating indirect branch handling mechanisms in software dynamic translation systems”, 2007.
 Kim et al., “Hardware Support for Control Transfers in Code Caches”, 2003.
 Kevin Krewell, “Transmeta gets more Efficeon”, Microprocessor Report, 2003.