OS Hacking and more.

10.05.2008

Hypervisor To Go

Over the last few months I've been able to assemble a Hypervisor “To Go” based on OpenSolaris with either Xen (Solaris xVM) or VirtualBox. There are still a few outstanding issues, but I have completed a fully functional prototype (say, a proof of concept) for this project. This presentation, which I gave at Sun Labs in August 2008, discusses the purpose, design, and implementation of Hypervisor “To Go”.




Hypervisor “To Go” was an idea that first occurred to me when I visited Sun Labs for their Open House event last year. It was really a simple concept: what if you could have a hypervisor installation on a USB stick that was incredibly portable and independent of the host hardware, from which you could boot your favorite operating system, with any or all of your applications and customizations?




This concept evolved into something a bit more concrete, with a couple of key components. The "system" could be based on what is arguably the most portable device around: a USB flash drive. A bootable USB drive can be taken anywhere and used in pretty much any system, so we'd be able to set up the user's world there in a portable fashion. But now, the fun part: only a limited number of OS distributions actually support Live USB boot with data persistence (i.e. preserve modifications and new files across reboots). So in order to allow the user to run his or her favorite OS in combination with his or her choice of additional software on arbitrary hardware, virtualization (a hypervisor) would be the answer. We'd package a hypervisor with a stripped-down base OS as a bootable image for a USB flash drive. The user can boot such an image anywhere, and then choose his or her guest OS and additional applications.



What we ultimately want is a situation where the user can plug a USB drive into any machine, boot from the fully configured USB drive, and have a working environment out of the box. The user should be able to copy the preconfigured setup to this machine with the click of a button, deploy the "appliance" instantly, and make use of local machine hardware resources, such as disks and networking. Since the “appliance” is a virtual machine, it can be run with the hypervisor regardless of the host. Such a setup has real-world utility in a data center environment.



Perhaps the greatest hurdle we encountered was creating a USB-bootable distribution of OpenSolaris, which I have detailed in earlier posts. In short, the OpenSolaris boot process is currently tied to the identity of the boot media, which doesn't change on a single system. Across many systems, however, the BIOS or a disk controller may identify a single disk in a multitude of ways, which constrains interoperability. To get a USB live boot, we needed to change the boot process itself.


Our changes to the boot process allow us to boot from a USB flash drive (or a similar device) plugged into an arbitrary slot (i.e. without prior knowledge of the bootpath), discover storage devices, then let ZFS discover and mount the root pools dynamically, and finally chroot to the normally configured root filesystem and run the real /sbin/init. The ramdisk is used as a trampoline only, but having this trampoline allows us to discover the root pool dynamically, without having the Solaris device name hardcoded in menu.lst or in a disk label.
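As a rough illustration of that sequence (not the project's actual /sbin/init script), the trampoline can be written down as an ordered, dry-run list of commands. The pool name, root dataset name, mount point, and the exact flags are all assumptions made for this sketch; nothing here is executed.

```python
# Dry-run sketch of the ramdisk "trampoline" sequence. Nothing is executed.
# Pool name, dataset name, mount point and flags are assumptions.

def trampoline_plan(pool="rpool",
                    root_dataset="rpool/ROOT/opensolaris",
                    newroot="/newroot"):
    """Return the ordered commands a ramdisk init script might run."""
    return [
        # 1. Rebuild the device tree for whatever hardware we booted on.
        ["devfsadm"],
        # 2. Let ZFS scan the discovered disks and import the root pool
        #    under an alternate root, instead of relying on a device name.
        ["zpool", "import", "-f", "-R", newroot, pool],
        # 3. Mount the root dataset itself (hypothetical dataset name).
        ["zfs", "mount", root_dataset],
        # 4. Hand control to the real init on the disk-based root.
        ["chroot", newroot, "/sbin/init"],
    ]


if __name__ == "__main__":
    for command in trampoline_plan():
        print(" ".join(command))
```

The point of the sketch is the ordering: device discovery comes first, ZFS then finds the pool by its on-disk labels, and only the last step depends on the contents of the real root filesystem.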






Here we see the code from our boot discovery process. All essential changes are confined to the /sbin/init script on the ramdisk (miniroot). The disk file system contains the result of a normal installation from the LiveCD, without any additional tricks.



Here we see the OpenSolaris kernel code that normally mounts the rootfs and the associated virtual filesystems.



This is our script which mimics the original root mount process for a USB drive.


Finally, the real root device is located and init is started.



The whole concept of booting to a ramdisk first, and then chrooting to a larger file system backed by stable storage, resembles the Linux boot process. There are a few differences in the Solaris case, though. We had to work around some of them and tried to exploit the others to our benefit.

1. The key remaining problem is the size of the ramdisk image. GRUB uses BIOS calls to load it, and many BIOSes are inefficient. It is essential to detect the optimal block size (typically 8KB, sometimes 4KB or 16KB) and use that size; 512B reads are very slow. This is a major issue when the ramdisk is dozens or hundreds of MB.

2. With a Linux ramdisk the kernel unpacks a cpio archive directly to the ramdisk. These files are deleted before chrooting to the real root. There is no such option in Solaris, so all the memory used for the ramdisk is gone forever.

3. Most of the ramdisk space is taken by kernel modules. David Minor suggested splitting the ramdisk image into separate pieces. We also looked into detecting necessary modules and stripping the rest (sound, networking, etc). We couldn't find anybody who knows/remembers the complete philosophy of how Solaris determines what to load. Without deeper understanding it is difficult to estimate how much space this could save.

4. There are issues running Solaris in a chrooted environment.
- The default module search path is /kernel, /usr/kernel, /platform, /usr/platform. If you have more modules in the real root file system than on the ramdisk, you have to either loopback mount /newroot/[dir] on /[dir] before chrooting, or alter the kernel search path.
- Even though the content of /etc/mnttab is generated by the kernel, it does not reflect the fact that in a chrooted environment /newroot/proc is /proc. (Workaround: run "chroot /newroot mount" instead of "mount" when mimicking vfs_mountroot in the /sbin/init script.) This may also be the case with the share command.
- The devicefs used for /devices cannot be mounted a second time. (Workaround: loopback mount it as /newroot/devices.)
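For the module search path issue, one way to decide which directories need a loopback mount is to compare the file lists on the ramdisk and on the real root. The sketch below is only an illustration of that idea, not what our script does: the comparison heuristic and the function names are assumptions, and the commands are returned as a list rather than executed.

```python
import os

# Default module search path from the text, relative to a root directory.
SEARCH_PATH = ("kernel", "usr/kernel", "platform", "usr/platform")


def files_under(root):
    """Return the set of file paths under root, relative to root.

    A missing directory simply yields an empty set.
    """
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def lofs_mounts(ramdisk_root, newroot, search_path=SEARCH_PATH):
    """List loopback mounts for dirs where newroot has files the ramdisk lacks."""
    commands = []
    for rel in search_path:
        on_ramdisk = files_under(os.path.join(ramdisk_root, rel))
        on_disk = files_under(os.path.join(newroot, rel))
        if on_disk - on_ramdisk:
            # Mount /newroot/[dir] over /[dir] so the kernel sees the full set.
            commands.append(["mount", "-F", "lofs",
                             os.path.join(newroot, rel),
                             os.path.join(ramdisk_root, rel)])
    return commands
```

Called as lofs_mounts("/", "/newroot") before the chroot, this would return one mount command per directory in which the real root carries extra modules, and an empty list if the ramdisk already has everything.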

5. There are advantages to using a ramdisk with Solaris:
- ZFS does not require additional tools to reconstruct its view of storage devices, and the ZFS-enabled GRUB does not depend on the BIOS's view of which disk drive is bootable.
- Solaris can use "reconfigure" during boot, rather than information injected by an installer or Live Boot scripts, to discover all the devices.

6. Other thoughts:
- Debugging stuff in the /sbin/init script is pretty nasty because the console is normally configured by the real init program.
- If you miss something needed by the paravirtualized kernel, the xpv module does not load, but the output goes by default to the paravirtualized console, i.e. you have a completely black screen.
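To put numbers on the block-size problem from point 1, the sketch below counts how many read calls are needed to load a ramdisk image at different block sizes, and picks the best size from measured throughput. It is a simplified model and not GRUB code: the measure callback is a hypothetical probe supplied by the caller, and the 100 MiB image size is just an example.

```python
def bios_calls(image_bytes, block_bytes):
    """Number of read calls needed to load an image, rounding up."""
    return (image_bytes + block_bytes - 1) // block_bytes


def pick_block_size(measure, candidates=(512, 4096, 8192, 16384)):
    """Return the candidate block size with the highest measured throughput.

    `measure(block_bytes)` is a caller-supplied (hypothetical) probe that
    returns a throughput figure for reads of that size.
    """
    return max(candidates, key=measure)


if __name__ == "__main__":
    image = 100 * 1024 * 1024  # a 100 MiB ramdisk, for illustration
    for block in (512, 4096, 8192, 16384):
        print(block, bios_calls(image, block))
    # prints:
    # 512 204800
    # 4096 25600
    # 8192 12800
    # 16384 6400
```

Going from 512B to 8KB reads cuts the number of calls by a factor of 16, which is why a slow per-call BIOS path hurts so much with a large ramdisk. The call count alone does not pick the winner, though, since some BIOSes handle 16KB requests worse than 8KB ones; that is the reason for probing instead of simply choosing the largest size.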


1 comment:

tomagig said...

Very nice, I would greatly enjoy this.