Virtualizing I/O Devices on VMware Workstation's Hosted Virtual Machine Monitor

Paper by Jeremy Sugerman, Ganesh Venkitachalam,
Beng-Hong Lim
Presented by Kit Cischke
Presentation Overview
Section the First: Basic Background and
the Problems Virtualizing the IA-32
 Section the Second: A Hosted Virtual
Machine Architecture
 Especially Virtualizing Network I/O
Section the Third: Performance Metrics and
 Including Optimizations
Section the Fourth: Future Performance
A Little Story
The year is 1997-ish. A young, idealistic
college student wants to try this “Linux” thing.
He gets a hold of some Red Hat installation
discs and goes to it (with much help).
 Well, mostly.
His network card isn’t supported by the distro
on the disc.
 He can get an updated driver online, but he can’t get
online because his network card doesn’t have a driver.
This poor student goes back to playing Tomb Raider.
What, according to popular “informed”
opinion, makes the Wintel platform so
popular, yet so unstable (relatively speaking)?
 Support for lots and lots of hardware!
The Point
Both of those stories illustrate the same
point, and the driving force behind this paper.
 Namely, if you’re writing VM software for
the PC, you either have to:
 A.) Write device drivers for a vast array of
devices that work with your VMM
 B.) Come up with a way to use the existing drivers.
Enter: The “Hosted” VM Model.
Hosted vs. Native
We’ll see later that “world switches” are very costly.
Other PC Virtualization Problems
Technically, the IA-32 is not naturally virtualizable.
 So sayeth Popek and Goldberg: “An
architecture can support virtual machines
only if all instructions that can inspect or
modify privileged machine state will trap
when executed from any but the most
privileged mode.”
Lots of pre-existing PC software people
won’t get rid of.
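The Popek and Goldberg condition can be read as a set-inclusion test: every sensitive instruction must also be one that traps. Here is a minimal Python sketch of that reading. The instruction lists are a tiny illustrative sample chosen for this example, not a full IA-32 inventory; POPF is included because it is a well-known IA-32 case that can touch the interrupt flag yet does not trap in user mode.

```python
def is_virtualizable(sensitive, trapping):
    """Popek-Goldberg test: every sensitive instruction must trap."""
    return set(sensitive) <= set(trapping)


def offenders(sensitive, trapping):
    """Sensitive instructions that execute without trapping."""
    return sorted(set(sensitive) - set(trapping))


# Illustrative sample only (an assumption for this sketch, not a full ISA list).
SENSITIVE = {"IN", "OUT", "LGDT", "POPF"}
TRAPPING = {"IN", "OUT", "LGDT"}  # POPF is missing: it does not trap in user mode

print(is_virtualizable(SENSITIVE, TRAPPING))  # False
print(offenders(SENSITIVE, TRAPPING))         # ['POPF']
```

A single offender is enough to fail the test, which is why the VMM needs extra machinery beyond plain trap-and-emulate.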
A Hosted VM Architecture
VMApp is what the user
sees, and runs in normal
user space.
VMDriver is essentially the
VMM, and residing as a
driver, it gets direct access
to the hardware.
The physical processor is
either executing in the host
world or the VMM world.
“World switches” mean
restoring all of the user- and system-visible state,
making them more
heavyweight than “normal”
process switches.
Strictly computational applications run just like any
other VM.
When I/O occurs, a world switch occurs, VMApp
performs the I/O request on behalf of the guest OS,
captures the result, then hands it to VMDriver.
 We’ll look at this in more detail later.
Obviously, there is opportunity for significant
performance degradation from this, as world
switches are expensive.
Additionally, since the host OS is scheduling
everything, performance of the VM can’t be guaranteed.
The question is, “Is performance good enough, given
device support?”
Basic I/O Virtualization
On PCs, I/O access is usually done using
privileged IN and OUT instructions.
 A decision is made: Can the VMM handle it?
 If it is I/O stuff that doesn’t need to actually talk
to the hardware, then yes.
 Otherwise, cause a world switch and ask the
host to do it.
This is totally adequate and appropriate for
devices with low sustained throughput or
high latency.
 For example, a keyboard.
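The decision described on this slide can be sketched as a small dispatcher. This is a toy model in Python, not VMware's code: the class name `Vmm`, the port numbers, and the `host_io` callback (a stand-in for VMApp performing the real I/O) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Vmm:
    # Ports whose accesses only update virtual device state (hypothetical).
    emulated_ports: set = field(default_factory=set)
    port_state: dict = field(default_factory=dict)
    world_switches: int = 0

    def handle_out(self, port, value, host_io):
        """Handle a trapped OUT: emulate it locally or ask the host."""
        if port in self.emulated_ports:
            self.port_state[port] = value  # no real hardware involved
            return "vmm"
        self.world_switches += 1           # the expensive path
        host_io(port, value)               # stand-in for VMApp doing the I/O
        return "host"


# Usage: one access is faked by the VMM, one needs the host.
performed = []
vmm = Vmm(emulated_ports={0x10})
print(vmm.handle_out(0x10, 7, lambda p, v: performed.append((p, v))))
print(vmm.handle_out(0x20, 9, lambda p, v: performed.append((p, v))))
print(vmm.world_switches, performed)
```

Only the second access pays for a world switch, which is the cost the later optimizations try to avoid.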
Virtualizing a Network Card
A NIC is everything a keyboard isn’t: high
sustained throughput and low latency.
 Therefore, its performance can be
indicative of the paradigm as a whole.
 In VMware Workstation, there is a virtual
NIC presented to the guest, which is
largely indistinguishable from a “real” PCI NIC.
 There are two different ways we can make
the connection from the physical NIC to the
virtual NIC.
Virtual NIC Details
Implemented partially in the VMM and
partially in VMApp.
 VMM exports virtual I/O ports and a
virtual IRQ for the device in the VM.
 VMApp catches and honors the
requests made to the virtual NIC.
 The modeled NIC is an AMD Lance.
Sending Packets from a Guest
The guest OS issues OUT
instructions to the Lance.
 The VMM sees these, and
hands control to the VMDriver.
This requires the world switch.
VMDriver pushes the
requests to VMApp.
This happens a number of times.
VMApp makes a system call
to the physical NIC, which
actually sends the packet.
 When finished, we switch
back to the VMM and raise an
IRQ indicating the packet has
been launched.
This stuff only
needs to be
done because
of the hosted architecture.
Receiving Packets in a Guest
The physical NIC gets
the data and raises a
real IRQ.
The NIC delivers the
packets to our virtual
bridge, and VMApp
sees them when it
issues a select syscall.
VMApp moves the
packets to a shared
memory location and
tells the VMM it’s
Receiving Packets in a Guest
World switch to the
VMM, and the VMM
raises a virtual IRQ.
The Guest OS
recognizes this and
issues appropriate
instructions to read
the incoming data.
The VMM switches
back to the host world
to let the physical NIC
receive some packet
Section the 3rd – Performance
Where will performance be affected?
1. A world switch from the VMM to the host for every
real hardware access.
 Recall there are some we can successfully fake.
2. I/O interrupt handling might mean traveling
through ISRs in the VMM, host OS and guest OS.
3. Packet transmission by the guest involves the
drivers for the virtual and physical NICs.
4. Data that is in kernel buffers needs to go to the
guest OS, and that’s just one more copy.
Consequence: An app that would otherwise saturate
the network interface might become CPU bound by
all this extra work.
So does it?
Experimental Setup
Two Intel-based PCs, physically connected via Ethernet over a
cross-over cable.
 PC-350: A 350 MHz Pentium II with 128 MB RAM running Linux
 PC-733: A 733 MHz Pentium III with 256 MB RAM running
Virtual machines were configured with that Lance NIC
bridged to the physical Intel EtherExpress NICs. The VM
runs Red Hat 6.2 plus the 2.2.17-14 kernel update.
 Each physical machine ran one instance of the VM with
half the physical RAM of the host.
 An in-house test program called nettest was written to
stress the network interface.
 nettest attempts to eliminate or mitigate any machine-
dependent influences aside from network performance.
Packet Transmit Overheads
Experiment 1:
 VM/PC-733 sends 100 MB to PC-350 in
4096-byte chunks.
Result 1:
 The workload is CPU-bound with an
average throughput of only 64 Mb/s.
Analysis 1:
 On the next slide!
Experiment Analysis
From the start of the OUT instruction to when the packet hits the wire, it’s 23.8
μs. It’s 31.63 μs to when control is returned to the VMM for the next instruction.
Of that time, 30.65 μs are spent in world switches and in the host. (That’s
96.9%.) If we assume the 17.55 μs in the driver are spent physically
transmitting the packet, that leaves 13.10 μs of overhead.
That’s considerable, but doesn’t explain how we became CPU-bound.
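The arithmetic behind those figures can be checked directly. The Python snippet below uses only the numbers quoted on this slide; the variable names are ours.

```python
# Figures quoted on the slide, in microseconds.
total_us = 31.63            # OUT instruction until control returns to the VMM
host_and_switch_us = 30.65  # time in world switches and in the host
driver_us = 17.55           # assumed to be physical transmission time

share = host_and_switch_us / total_us            # fraction spent outside the VMM
overhead_us = host_and_switch_us - driver_us     # hosted-architecture overhead
vmm_only_us = total_us - host_and_switch_us      # what is left for the VMM itself

print(f"{share:.1%}")        # 96.9%
print(f"{overhead_us:.2f}")  # 13.10
print(f"{vmm_only_us:.2f}")  # 0.98
```

Less than one microsecond of each transmit is spent in the VMM itself; the rest is the host and the switches to reach it.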
Further Analysis
More than 25% of the
time in the VMM is
spent just preparing to
transfer control to
Then, each of those
transfers requires an
8.90 μs world switch.
This is almost two
orders of magnitude
slower than native I/O
Further Further Analysis
The other significant cost isn’t in just one row of the
chart, but spread around: IRQ processing.
Every packet sent raises an IRQ on both the virtual
and physical NIC. For network-intensive ops, the
interrupt rate is high.
So what?
 Each IRQ in the VMM world runs the VMM ISR, then
there’s a world switch to the host world. Then the host
ISR runs. If the packet is destined for the guest, VMApp
needs to send an IRQ to the guest OS.
 That virtual IRQ requires another world switch, delivering
the IRQ and then running the guest OS’s ISR.
 That’s a lot of code for a simple packet IRQ.
 Furthermore, the guest ISRs run expensive privileged
instructions to handle the interrupt. Even more cost.
One More Overhead
VMApp and the VMM can’t recognize a
packet destined for the host or guest.
Only the host OS can do that.
 This means there’s a world switch in
store, which takes time.
 Running the select syscall too
frequently is wasteful and running it too
infrequently could miss important, time-critical deadlines.
Optimizations – I/O Handling
If it’s not a real packet transmittal, don’t
switch into the host world. The VMM
can do it.
 Additionally, we can take advantage of
the memory semantics of the Lance
address register to not use privileged
instructions to do the accesses.
 Just use a simple MOV instruction instead.
Send Combining
If the world switch rate is too high (as
monitored by the VMM), a transmitted
packet won’t actually cause a world switch.
 Instead it goes into a ring buffer until an
interrupt-induced world switch occurs.
 When the world switch occurs, off go the packets.
If the ring buffer gets too full (3 buffered
packets), we force the world switch.
 Bonus: Other interrupts might (and do) get
processed when we’re already in the host
world, saving more world switching.
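Send combining can be sketched as a small queue with two flush triggers. This is a toy Python model, not the VMM's implementation: the class `SendCombiner`, the `rate_threshold` knob, and the way the switch rate is passed in are assumptions made for illustration. Only the limit of 3 buffered packets comes from the slide.

```python
class SendCombiner:
    MAX_BUFFERED = 3  # from the slide: force a world switch at 3 packets

    def __init__(self, rate_threshold):
        self.rate_threshold = rate_threshold  # hypothetical tuning knob
        self.ring = []           # packets waiting for the next world switch
        self.sent = []           # packets handed to the host, in order
        self.world_switches = 0

    def _flush(self):
        """One world switch carries every queued packet to the host."""
        self.world_switches += 1
        self.sent.extend(self.ring)
        self.ring.clear()

    def transmit(self, packet, switch_rate):
        """Queue a packet; switch now only if the rate is low or the ring is full."""
        self.ring.append(packet)
        if switch_rate <= self.rate_threshold or len(self.ring) >= self.MAX_BUFFERED:
            self._flush()

    def on_interrupt_world_switch(self):
        """An IRQ forces a switch anyway, so queued packets ride along."""
        self._flush()


c = SendCombiner(rate_threshold=100)
c.transmit("a", switch_rate=10)      # low rate: sent immediately
c.transmit("b", switch_rate=1000)    # high rate: buffered
c.transmit("c", switch_rate=1000)    # buffered
c.transmit("d", switch_rate=1000)    # third buffered packet forces the switch
print(c.world_switches, c.sent)
```

In this run four packets cost two world switches instead of four.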
Cheating on IRQ Notification
Rather than using the select syscall, let
VMNet and VMApp use some shared
memory to communicate when packets
are available.
 Now we just check a bit vector and
automatically return to the VMM, saving
at least one run through an ISR.
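A bit vector of this kind might look like the sketch below. It is plain Python standing in for real shared memory: the class name and the one-bit-per-virtual-NIC layout are assumptions, and an actual implementation would need atomic updates between the writer and the reader.

```python
class SharedPacketFlags:
    """Stand-in for a shared-memory bit vector, one bit per virtual NIC."""

    def __init__(self):
        self.bits = 0

    def mark_ready(self, nic_index):
        """Writer side (VMNet on the slide): flag that packets are waiting."""
        self.bits |= 1 << nic_index

    def take_ready(self):
        """Reader side (VMApp on the slide): list flagged NICs, then clear."""
        ready = [i for i in range(self.bits.bit_length()) if (self.bits >> i) & 1]
        self.bits = 0
        return ready


flags = SharedPacketFlags()
flags.mark_ready(0)
flags.mark_ready(3)
print(flags.take_ready())  # [0, 3]
print(flags.take_ready())  # []
```

Checking the vector is a memory read rather than a system call, which is where the saving comes from.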
Optimized Results
Additional Results
Future Performance
Reducing CPU Virtualization Overhead
 The authors cop out and say, “However, a discussion
of [delivering virtual IRQs, handling IRET instructions
and the MMU overheads associated with context
switches] requires an understanding of VMware
Workstation’s core virtualization technology and is
beyond the scope of this paper.”
They do mention an “easy” optimization
regarding the interrupt controller.
Every packet in TCP requires an ACK, and that
means 5 accesses to the IC. One can be treated
like the Lance address register and be
replaced by a MOV instruction, which saves
costly virtualization of privileged execution.
Modifying the Guest OS
First Idea: Make the Guest OS stop using
privileged instructions!
 Second Idea: Make the Guest share
intelligence with the VMM.
 For example: Don’t do page table switching
when starting the idle task.
 This is meaningful: VM/PC-733 spends 8.5% of
its time virtualizing page table switches!
A quick and dirty prototype of this halved
the MMU-derived virtualization overhead,
and all that saved time becomes CPU idle time.
Optimizing the Guest Driver
Create an idealized version of the device
you’re trying to support.
 E.g., a virtual NIC that only uses a single OUT
instruction and skips the transmit IRQ, instead of
12 I/O instructions and a transmit IRQ.
That’s all well and good (and VMware
actually does this in their server products),
but it means supporting a bunch of drivers,
which is what we were trying to avoid in the
first place.
Changing the Host
If we can change the host’s behavior, we can save
time or memory (or both).
For example, network operations in Linux make
heavy use of sk_buffs.
 sk_buffs are the buffers in which the Linux kernel
handles network packets. The packet is received by the
network card, put into a sk_buff and then passed to the
network stack, which then uses the sk_buff.
What would be great is to allocate a big chunk of
memory up front to fill with sk_buffs instead of constantly
allocating fresh memory.
Two potential problems:
 Memory Leaks
 Inaccessible OS code
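The pre-allocation idea is the general buffer-pool technique. Here is a minimal Python sketch of it, not the Linux sk_buff API: the class `BufferPool` and its methods are hypothetical, and the buffer count and size are arbitrary.

```python
class BufferPool:
    """Pre-allocates fixed-size buffers once and recycles them."""

    def __init__(self, count, size):
        self._free = [bytearray(size) for _ in range(count)]
        self.outstanding = 0  # buffers handed out and not yet returned

    def acquire(self):
        if not self._free:
            # A buffer that is never released (a leak) eventually lands here.
            raise MemoryError("pool exhausted")
        self.outstanding += 1
        return self._free.pop()

    def release(self, buf):
        self.outstanding -= 1
        self._free.append(buf)


pool = BufferPool(count=2, size=1536)
first = pool.acquire()
second = pool.acquire()
pool.release(first)
again = pool.acquire()   # reuses the buffer that was just released
print(again is first, pool.outstanding)
```

The memory-leak concern shows up directly here: forget one `release` and the pool shrinks for good.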
Bypass the Host
In other words, ditch the whole “hosted
VM” thing and write real drivers for the VMM.
 Of course, this takes us back to the root
of the problem – too many devices to support.
Summing Up
We can make our VMMs simpler and more widely
available using a hosted VM.
I/O performance takes a hit though.
There are various optimizations that can be made:
 Reducing world switches by paying attention to what
we’re trying to accomplish.
 Reduce overhead during world switches by send
combining (and other similar things).
 Cheat on checking for available data from the driver.
With these optimizations, we can nearly match native
performance, even on sub-standard machines.
 The 733 MHz machines the tests were run on were below
the corporate standard, even for 2001.
