Paper by Jeremy Sugerman, Ganesh Venkitachalam, Beng-Hong Lim
Presented by Kit Cischke

Presentation Overview
Section the First: Basic Background and the Problems of Virtualizing the IA-32
Section the Second: A Hosted Virtual Machine Architecture, Especially Virtualizing Network I/O
Section the Third: Performance Metrics and Analysis, Including Optimizations
Section the Fourth: Future Performance Enhancements

A Little Story
The year is 1997-ish. A young, idealistic college student wants to try this “Linux” thing. He gets hold of some Red Hat installation discs and goes to it (with much help). Success! Well, mostly. His network card isn’t supported by the distro on the disk. He can get an updated driver online, but he can’t get online because his network card doesn’t have a driver. The poor student goes back to playing Tomb Raider.

Alternatively…
What, according to popular “informed” opinion, makes the Wintel platform so popular, yet so unstable (relatively speaking)? Support for lots and lots of hardware!

The Point
Both of those stories illustrate the same point, which is the driving force behind this paper. Namely, if you’re writing VM software for the PC, you either have to:
A.) Write device drivers for a vast array of devices that work with your VMM, or
B.) Come up with a way to use the existing drivers.
Enter: the “hosted” VM model.

Hosted vs. Native
We’ll see later that “world switches” are very costly.

Other PC Virtualization Problems
Technically, the IA-32 is not naturally virtualizable. So sayeth Popek and Goldberg: “An architecture can support virtual machines only if all instructions that can inspect or modify privileged machine state will trap when executed from any but the most privileged mode.” On top of that, there is lots of pre-existing PC software that people won’t get rid of.

A Hosted VM Architecture
VMApp is what the user sees, and it runs in normal user space. VMDriver is essentially the VMM; residing as a kernel driver, it gets direct access to the hardware.
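To make the hosted split concrete, here is a toy Python sketch of the control flow among the pieces. All class and function names here are mine, not VMware's, and the real components are vastly richer; this only models the decision of emulating an I/O access in the VMM versus world-switching out so the user-level VMApp can use the host's own drivers.

```python
# Toy model of the hosted VM control flow (all names hypothetical).

class VMApp:
    """User-level piece: issues ordinary syscalls on the guest's behalf."""
    def service(self, request):
        # In reality: a read()/write()/ioctl() against a real host driver.
        return f"host handled {request}"

class VMDriver:
    """Kernel-resident piece: owns the expensive world switch."""
    def __init__(self, vmapp):
        self.vmapp = vmapp
        self.world_switches = 0

    def world_switch_to_host(self, request):
        # Save VMM-world state, restore host-world state (heavyweight),
        # then let VMApp service the request via the host OS.
        self.world_switches += 1
        return self.vmapp.service(request)

def vmm_handle_privileged_io(driver, request, needs_hardware):
    """VMM-world trap handler for a guest IN/OUT instruction."""
    if not needs_hardware:
        return "emulated in VMM"          # no world switch needed
    return driver.world_switch_to_host(request)
```

The point of the sketch is the asymmetry: the cheap path stays entirely in the VMM world, while any access that must reach real hardware pays for a full world switch.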
The physical processor is either executing in the host world or the VMM world. A “world switch” means saving and restoring all of the user- and system-visible state, making it more heavyweight than a “normal” process switch.

Consequences
Strictly computational applications run just like any other VM workload. When I/O occurs, a world switch occurs: VMApp performs the I/O request on behalf of the guest OS, captures the result, and then hands it to VMDriver. We’ll look at this in more detail later. Obviously, there is opportunity for significant performance degradation here, since world switches are expensive. Additionally, since the host OS is scheduling everything, the performance of the VM can’t be guaranteed. The question is, “Is performance good enough, given the device support we gain?”

Basic I/O Virtualization
On PCs, I/O access is usually done using privileged IN and OUT instructions. For each access, a decision is made: can the VMM handle it? If it is I/O that doesn’t need to actually talk to the hardware, then yes. Otherwise, cause a world switch and ask the host to do it. This is totally adequate and appropriate for devices with low sustained throughput or high latency. A keyboard, for example.

Virtualizing a Network Card
A NIC is everything a keyboard isn’t: high sustained throughput and low latency. Therefore, its performance can be indicative of the paradigm as a whole. In VMware Workstation, a virtual NIC is presented to the guest, and it is largely indistinguishable from a “real” PCI NIC. There are two different ways we can make the connection from the physical NIC to the virtual NIC.

Connections…

Virtual NIC Details
The virtual NIC is implemented partially in the VMM and partially in VMApp. The VMM exports virtual I/O ports and a virtual IRQ for the device in the VM. VMApp catches and honors the requests made to the virtual NIC. The modeled NIC is an AMD Lance.

Sending Packets from a Guest
The guest OS issues an OUT instruction to the Lance. The VMM sees these and hands control to the VMDriver.
This requires the world switch. VMDriver pushes the requests to VMApp, and this happens a number of times per packet. VMApp makes a system call to the physical NIC, which actually sends the packet. When finished, we switch back to the VMM and raise an IRQ indicating the packet has been launched. All of this extra work is needed only because of the hosted VM design.

Receiving Packets in a Guest
The physical NIC gets the data and raises a real IRQ. The NIC delivers the packets to our virtual bridge, and VMApp sees them when it issues a select syscall. VMApp moves the packets to a shared memory location and tells the VMM they’re ready.

Receiving Packets in a Guest (continued)
More extra work! We world-switch to the VMM, and the VMM raises a virtual IRQ. The guest OS recognizes this and issues the appropriate instructions to read the incoming data. The VMM then switches back to the host world to let the physical NIC receive some packet acknowledgements.

Section the Third: Performance
Where will performance be affected?
1. A world switch from the VMM to the host for every real hardware access. (Recall there are some we can successfully fake.)
2. I/O interrupt handling might mean traveling through ISRs in the VMM, the host OS, and the guest OS.
3. Packet transmission by the guest involves the drivers for both the virtual and physical NICs.
4. Data that arrives in kernel buffers needs to go to the guest OS, and that’s one more copy.
Consequence: an app that would otherwise saturate the network interface might become CPU-bound from all this extra work. So does it?

Experimental Setup
Two Intel-based PCs, physically connected via Ethernet and a crossover cable:
PC-350: a 350 MHz Pentium II with 128 MB RAM running Linux 2.2.13.
PC-733: a 733 MHz Pentium III with 256 MB RAM running Linux 2.2.17.
The virtual machines were configured with the Lance NIC bridged to the physical Intel EtherExpress NICs. Each VM runs Red Hat 6.2 plus the 2.2.17-14 kernel update. Each physical machine ran one instance of the VM with half the physical RAM of the host.
An in-house test program called nettest was written to stress the network interface. nettest attempts to eliminate or mitigate any machine-dependent influences aside from network performance.

Packet Transmit Overheads
Experiment 1: VM/PC-733 sends 100 MB to PC-350 in 4096-byte chunks.
Result 1: The workload is CPU-bound, with an average throughput of only 64 Mb/s.
Analysis 1: On the next slide!

Experiment Analysis
From the start of the OUT instruction to when the packet hits the wire is 23.8 μs, and it is 31.63 μs until control is returned to the VMM for the next instruction. Of that time, 30.65 μs are spent in world switches and in the host (that’s 96.9%). If we assume the 17.55 μs in the driver are spent physically transmitting the packet, that leaves 13.10 μs of overhead. That’s considerable, but it doesn’t explain how we became CPU-bound.

Further Analysis
More than 25% of the time in the VMM is spent just preparing to transfer control to VMApp. Then, each of those transfers requires an 8.90 μs world switch. This is almost two orders of magnitude slower than native I/O calls.

Further Further Analysis
The other significant cost isn’t in just one row of the chart, but spread around: IRQ processing. Every packet sent raises an IRQ on both the virtual and the physical NIC, and for network-intensive operations the interrupt rate is high. So what? Each IRQ in the VMM world runs the VMM’s ISR, then there’s a world switch to the host world, and then the host’s ISR runs. If the packet is destined for the guest, VMApp needs to send an IRQ to the guest OS. That virtual IRQ requires another world switch, delivering the IRQ and then running the guest OS’s ISR. That’s a lot of code for a simple packet IRQ. Furthermore, the guest’s ISRs run expensive privileged instructions to handle the interrupt. Even more cost.

One More Overhead
Neither VMApp nor the VMM can recognize whether a packet is destined for the host or the guest; only the host OS can do that. This means there’s a world switch in store, which takes time.
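To make the receive-side polling concrete, here is a minimal sketch of a select-based check. The file descriptors and function names are mine for illustration (the real VMApp selects on the VMNet device); a pipe stands in for the virtual bridge.

```python
# Minimal sketch of select-based packet polling (names hypothetical).
import os
import select

def poll_for_packets(fd, timeout=0.0):
    """One polling pass: return a packet if the bridge has data, else None.

    A zero timeout makes select return immediately, like a periodic check.
    """
    readable, _, _ = select.select([fd], [], [], timeout)
    if fd in readable:
        # VMApp would copy this into shared memory for the VMM/guest.
        return os.read(fd, 2048)
    return None
```

For example, with `r, w = os.pipe()` and a `os.write(w, b"pkt")` standing in for an arriving packet, `poll_for_packets(r)` returns the data, and a second call returns None. How often to run a pass like this is exactly the tension the next point raises.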
Running the select syscall too frequently is wasteful, and running it too infrequently could miss important, time-critical deadlines.

Optimizations: I/O Handling
If it’s not a real packet transmittal, don’t switch into the host world; the VMM can handle it. Additionally, we can take advantage of the memory semantics of the Lance address register and avoid privileged instructions for those accesses: just use a simple MOV instruction instead.

Send Combining
If the world switch rate is too high (as monitored by the VMM), a transmitted packet won’t actually cause a world switch. Instead, it goes into a ring buffer until an interrupt-induced world switch occurs. When the world switch occurs, off go the packets. If the ring buffer gets too full (3 buffered packets), we force the world switch. Bonus: other interrupts might (and do) get processed while we’re already in the host world, saving even more world switching.

Cheating on IRQ Notification
Rather than using the select syscall, let VMNet and VMApp use some shared memory to communicate when packets are available. Now we just check a bit vector and automatically return to the VMM, saving at least one run through an ISR.

Optimized Results

Additional Results

Additional Results

Future Performance Enhancements

Reducing CPU Virtualization
The authors cop out and say, “However, a discussion of [delivering virtual IRQs, handling IRET instructions and the MMU overheads associated with context switches] requires an understanding of VMware Workstation’s core virtualization technology and is beyond the scope of this paper.” They do mention an “easy” optimization regarding the interrupt controller: every TCP packet requires an ACK, which means five accesses to the interrupt controller. One of those can be treated like the Lance address register and replaced by a MOV instruction, which saves costly virtualization of privileged execution.

Modifying the Guest OS
First Idea: Make the guest OS stop using privileged instructions!
Second Idea: Make the guest share intelligence with the VMM. For example: don’t do page table switching when starting the idle task. This is meaningful: VM/PC-733 spends 8.5% of its time virtualizing page table switches! A quick-and-dirty prototype of this halved the MMU-derived virtualization overhead, and all of that saved time becomes CPU idle time.

Optimizing the Guest Driver
Create an idealized version of the device you’re trying to support, e.g., a virtual NIC that only uses a single OUT instruction and skips the transmit IRQ, instead of 12 I/O instructions and a transmit IRQ. That’s all well and good (and VMware actually does this in their server products), but it means supplying a bunch of guest drivers, which is what we were trying to avoid in the first place.

Changing the Host
If we can change the host’s behavior, we can save time or memory (or both). For example, network operations in Linux make heavy use of sk_buffs, the buffers in which the Linux kernel handles network packets. A packet is received by the network card, put into an sk_buff, and then passed to the network stack, which uses the sk_buff. What would be great is to allocate one big chunk of memory to fill with sk_buffs instead of constantly allocating fresh memory. Two potential problems: memory leaks, and inaccessible OS code.

Bypass the Host
In other words, ditch the whole “hosted VM” thing and write real drivers for the VMM. Of course, this takes us back to the root of the problem: too many devices to support.

Summing Up
We can make our VMMs simpler and more widely available by using a hosted VM, though I/O performance takes a hit. There are various optimizations that can be made:
Reduce world switches by paying attention to what we’re trying to accomplish.
Reduce overhead during world switches by send combining (and other similar things).
Cheat on checking for available data from the driver.
With these optimizations, we can nearly match native performance, even on sub-standard machines.
The 733 MHz machines the tests were run on were below the corporate standard, even for 2001.
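As a closing illustration, the send-combining optimization summarized above can be sketched as a small buffer of deferred packets. The 3-packet force-flush threshold comes from the paper; everything else (names, the `wire` list standing in for the physical NIC) is a toy model of mine, not VMware's implementation.

```python
# Toy model of send combining: defer world switches, flush packets in batches.

class SendCombiner:
    FORCE_FLUSH_AT = 3          # paper: force a world switch at 3 buffered packets

    def __init__(self):
        self.ring = []          # packets awaiting transmission
        self.wire = []          # stands in for packets actually sent by the host
        self.world_switches = 0 # world switches charged to transmission

    def transmit(self, packet):
        """Guest sent a packet: buffer it instead of world-switching."""
        self.ring.append(packet)
        if len(self.ring) >= self.FORCE_FLUSH_AT:
            self._flush(charge=True)   # buffer full: pay for a switch now

    def on_world_switch(self):
        """Piggyback on a switch that happens anyway (e.g. for an interrupt)."""
        self._flush(charge=False)

    def _flush(self, charge):
        if charge:
            self.world_switches += 1
        self.wire.extend(self.ring)
        self.ring = []
```

The win is visible in the model: packets buffered before an interrupt-induced switch ride along for free, and only a full ring forces the VM to pay for a switch of its own.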