Understanding the Firecracker VMM

written by James Larisch on March 31, 2021.


In this post I describe Firecracker, the minimal, Rust-based virtual machine monitor used by Amazon to power AWS Lambda. I first discuss the motivation behind Firecracker, including security requirements and performance characteristics of serverless platforms today. I then discuss Firecracker in the context of existing virtualization solutions such as containers, traditional virtual machines, KVM, and QEMU.

Introduction & Background

Serverless platforms such as Lambda use a pool of machines to service a wide array of serverless applications across multiple tenants. The basic unit of computation is the function instance—one execution of a particular user’s Lambda function.

Function-as-a-Service platforms automatically start and terminate function instances as request volume changes; this enables efficient statistical multiplexing of hardware and allows tenants to pay only for resources directly used to service requests. Serverless platforms forward each new request to a running, idle function instance in what is known as a “warm start”; if no such instance is available, platforms “cold start” a fresh instance to handle the request.
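The warm/cold start decision can be sketched in a few lines. This is a toy model, not how Lambda is implemented: the `InstancePool` class, its method names, and the `resize-image` function name are all hypothetical and exist only to illustrate the dispatch rule described above.

```python
from collections import defaultdict, deque


class InstancePool:
    """Toy dispatcher: reuse an idle instance if one exists, else start a new one."""

    def __init__(self):
        # Maps a function name to a queue of idle instance ids.
        self.idle = defaultdict(deque)
        self.next_id = 0

    def dispatch(self, function_name):
        """Return (instance_id, start_kind) for one incoming request."""
        if self.idle[function_name]:
            return self.idle[function_name].popleft(), "warm"
        # No idle instance for this function: a fresh one must be booted.
        instance_id = self.next_id
        self.next_id += 1
        return instance_id, "cold"

    def release(self, function_name, instance_id):
        """Mark an instance idle again once its request has finished."""
        self.idle[function_name].append(instance_id)


pool = InstancePool()
first = pool.dispatch("resize-image")    # no idle instance yet: cold start
pool.release("resize-image", first[0])
second = pool.dispatch("resize-image")   # reuses the idle instance: warm start
print(first, second)
```

Every cold start in this model corresponds to booting a new containment unit, which is why boot time matters so much in the rest of this post.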

To use hardware as efficiently as possible, multiple function instances—belonging to many different tenants—execute simultaneously on each machine. As a result, the strong isolation of function instances is paramount for service providers, as the Firecracker authors themselves state [1].

Beyond just security, Function-as-a-Service platforms have unique performance and feature requirements which distinguish them from traditional virtualization platforms that provide access to a “fully featured” cloud virtual machine (e.g., EC2, DigitalOcean). For example, function instances have minimal reliable access to disks and other devices: they can write data to a local filesystem, but (1) a subsequent invocation of the same function by the same user may execute on a different instance, with a different disk; and (2) the platform is free to completely reset function instances between invocations, since serverless functions should generally be stateless. Similarly, operating system configuration features such as adding device drivers, installing kernel modules, and interacting with peripherals such as keyboards and monitors are typically either explicitly or implicitly out of scope for function instances.

Historically, FaaS providers such as Google Cloud Functions, Azure Functions, and AWS Lambda have isolated function instances using either containers or virtual machines [9]. Containers such as those provided by Docker share access to the OS kernel when colocated. This provides efficient multiplexing, since containers are basically just collections of processes isolated via Linux cgroups and namespaces. However, the security and strong isolation provided by containers are (qualitatively) questionable. The TCB encompasses the entire kernel and is considered large, so (as the Firecracker authors discuss) the only defense is minimizing the number of system calls containers are allowed to make, which trades usability for security. Furthermore, “containment escape” attacks on Docker have been discovered [5,7].

Virtual machines, on the other hand, are (qualitatively) considered more secure than containers: each VM (and thus each function instance) runs in an isolated environment with its own virtual hardware, page tables, and kernel. Unfortunately, traditional virtual machines are heavyweight: each VM runs its own kernel, which takes up memory on the host (this can reach hundreds of MB), and startup time is on the order of seconds (I have witnessed this). Part of this overhead is due to the size of guest OSes: off-the-shelf Linux contains over 5 million lines of code (70%) dedicated solely to device drivers [6]. Overhead also comes from the VMM itself: Xen [2], VMware ESX [8], and QEMU [3] were simply not designed for the requirements of serverless workloads (low overhead, minimal guest size, and low boot times).

This overhead is unacceptable for serverless platforms like Lambda, despite the strong isolation guarantees. Such platforms wish to maximize the share of memory used by user code rather than by the containment unit (to minimize the number of machines required to service Lambda as a whole) and to minimize the boot time of function instances (so that end-user latency remains low and machines spend less precious CPU time booting instances in response to load). According to the Firecracker authors, containers, which typically have lower overhead and boot times, do not provide sufficient isolation between instances, and isolation is at least as important as overhead and performance.
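To make the memory argument concrete, here is a back-of-the-envelope sketch. All numbers are hypothetical and chosen only for illustration (a 256 GiB host, 128 MiB functions, and per-instance overheads of 128 MiB versus 4 MiB); they are not measurements from the Firecracker paper.

```python
def instances_per_host(host_mib, function_mib, overhead_mib):
    """How many instances fit when each needs function memory plus isolation overhead."""
    return host_mib // (function_mib + overhead_mib)


HOST_MIB = 256 * 1024   # hypothetical host with 256 GiB of RAM
FUNCTION_MIB = 128      # hypothetical memory given to user code per instance

# Hypothetical per-instance overhead of a heavyweight VM vs. a lightweight one.
heavy = instances_per_host(HOST_MIB, FUNCTION_MIB, overhead_mib=128)
light = instances_per_host(HOST_MIB, FUNCTION_MIB, overhead_mib=4)
print(heavy, light)  # 1024 1985
```

Under these assumed numbers, shrinking the per-instance overhead nearly doubles the number of function instances a single machine can hold.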

Firecracker

Firecracker [1] is designed to be a small, safe virtual machine monitor (VMM). It uses KVM [4], a Linux kernel module and API which allows userspace programs to use CPU virtualization features. Serverless functions are deployed atop Firecracker inside microVMs, which are guests that run a highly reduced version of Linux. The Firecracker VMM is explicitly not designed to fully emulate a wide array of hardware: it only provides guest OSes (and thus function instances) access to the CPU, memory, a block device, and a network device. Firecracker thus reduces guest OS size and uses a minimal VMM designed explicitly for serverless workloads. In particular, one of the primary goals of Firecracker is to reduce the boot time of microVMs (and thus function instances). Before discussing how the authors evaluated Firecracker’s boot time performance, we first describe how Firecracker fits in with existing virtualization technologies.

As stated, Firecracker is a virtual machine monitor. It runs in Linux as a standard process and creates virtual machines which run as standard Linux processes themselves in a new “guest mode” provided by KVM. The kernel module KVM provides this functionality in the form of an API. It allows userspace programs to construct virtual machines which may take advantage of hardware-assisted virtualization. In particular, these virtual machine processes may “pass through” directly to the underlying CPU for direct execution. KVM also virtualizes guest memory, either in software with shadow page tables or by leveraging hardware support for nested paging (e.g., Intel EPT or AMD NPT). However, KVM is just an API and may be used differently by different VMMs, and Firecracker is one such VMM.
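As a small illustration of KVM being "just an API", the sketch below opens `/dev/kvm` and asks for the API version, which is the first call a VMM typically makes. The helper names `_io` and `kvm_api_version` are my own. The ioctl numbers assume the common Linux encoding used on x86 and ARM, where `_IO(type, nr)` is `(type << 8) | nr`. The function returns `None` when KVM is not available to the current process.

```python
import fcntl
import os

KVMIO = 0xAE  # ioctl "type" byte used by the KVM API


def _io(ioctl_type, nr):
    """Equivalent of the C _IO(type, nr) macro on x86/ARM: no direction, no size."""
    return (ioctl_type << 8) | nr


KVM_GET_API_VERSION = _io(KVMIO, 0x00)
KVM_CREATE_VM = _io(KVMIO, 0x01)  # the next call a VMM would issue on the same fd


def kvm_api_version(path="/dev/kvm"):
    """Return the KVM API version, or None if KVM is unavailable to this process."""
    try:
        fd = os.open(path, os.O_RDWR | os.O_CLOEXEC)
    except OSError:
        return None
    try:
        return fcntl.ioctl(fd, KVM_GET_API_VERSION, 0)
    except OSError:
        return None
    finally:
        os.close(fd)


print(kvm_api_version())
```

A real VMM continues from here by creating a VM, registering guest memory, creating vCPUs, and repeatedly asking KVM to run them. None of that is shown here.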

KVM by itself virtualizes only the CPU and MMU; it does not virtualize I/O devices such as disks and network devices. KVM-based VMMs must define their own userspace hooks (event handlers), which run when the guest OS makes an I/O call. As a result, using KVM by itself to execute Linux virtual machines is insufficient: the VMM must also virtualize devices. kvm-qemu is a popular KVM-based VMM which uses QEMU. QEMU [3] was originally designed for full system emulation and emulated the CPU and MMU (which may be of completely different architectures than the host), disk, network, peripherals, monitors, etc. It can thus be used to execute virtual machines without kernel or hardware support, since it performs dynamic binary translation of guest architecture instructions to host architecture instructions. Unsurprisingly, this translation is slow. However, when combined with KVM (as kvm-qemu), which provides much faster CPU and MMU access due to hardware-assisted virtualization, QEMU is responsible only for virtualizing I/O devices such as disks and network devices, which significantly improves performance.
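The division of labor between KVM and the VMM can be sketched as a dispatch loop. This is a pure simulation: a real VMM would ask KVM to run a vCPU and inspect the reported exit reason, whereas here the "exits" are just a hard-coded list. The `SerialConsole` class, the `run_guest` function, and the tuple format are hypothetical. Port `0x3F8` is the conventional address of the first PC serial port.

```python
class SerialConsole:
    """Hypothetical emulated device: collects bytes the guest writes to its port."""

    def __init__(self):
        self.output = bytearray()

    def write(self, data):
        self.output.extend(data)


def run_guest(exits, port_handlers):
    """Dispatch each simulated VM exit to the userspace device that owns the port."""
    unhandled = []
    for reason, port, data in exits:
        if reason == "hlt":
            break  # guest halted: stop running it
        handler = port_handlers.get(port)
        if reason == "io_out" and handler is not None:
            handler.write(data)
        else:
            # No device is emulated at this port, so the VMM has nothing to do.
            unhandled.append((reason, port))
    return unhandled


serial = SerialConsole()
exits = [
    ("io_out", 0x3F8, b"h"),
    ("io_out", 0x3F8, b"i"),
    ("io_out", 0x60, b"\x00"),   # a port with no emulated device behind it
    ("hlt", None, b""),
    ("io_out", 0x3F8, b"!"),     # never reached: the guest has halted
]
unhandled = run_guest(exits, {0x3F8: serial})
print(bytes(serial.output), unhandled)
```

The fewer devices a VMM registers in a table like `port_handlers`, the less emulation code it has to carry, which is the lever Firecracker pulls.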

But QEMU is designed to emulate entire systems. As the Firecracker paper describes, it consists of over 1.4 million lines of code since it aims to support all types of I/O devices. As far as Firecracker is concerned, this has two implications: (1) the TCB for the QEMU VMM is large: it can require up to 270 system calls; (2) it is designed to support a wide array of hardware, so minimal overhead and boot times are not necessarily first priorities.

Firecracker replaces the QEMU VMM as the KVM-based hypervisor. It utilizes the KVM API to produce Linux-based userspace virtual machines which run in the new “guest mode”, and it defines I/O device hooks. Firecracker makes the following critical design choices which distinguish it from QEMU and other general-purpose hypervisors:

Minimal size. The Firecracker VMM consists of 50K lines of Rust including tests and auto-generated bindings. This keeps the TCB small, and Rust is (apparently) safe.

Minimal I/O support. Because of the limited scope of serverless workloads, the Firecracker VMM supports only limited emulated devices: network and block devices, serial ports, and partial keyboard support. They use the virtio [10] API for both network and block devices. Guests are provided with a single TUN/TAP network interface and a single block device, and the total amount of code written to support these devices appears to be less than 2K lines.

Minimal VM images. Because the VMM only supports a small number of device types, a guest kernel which includes all potential device drivers will introduce unneeded memory and boot-time overhead. In particular, Firecracker removes all kernel modules and all device drivers that are not needed for disk or network access to produce a microVM kernel. Such microVMs would be unacceptable for general workloads but are ideal for serverless platforms.
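The minimal device model described above (one kernel, one block device, one TAP-backed network interface) is small enough to write down as a single configuration object. The sketch below builds one as JSON. The field names follow Firecracker's JSON configuration format as I understand it and should be checked against the API specification of the release you use. The `microvm_config` helper, the file names, and the `tap0` device are hypothetical.

```python
import json


def microvm_config(kernel, rootfs, tap_device, vcpus=1, mem_mib=128):
    """Build a config with exactly one block device and one network interface."""
    return {
        "boot-source": {
            "kernel_image_path": kernel,
            "boot_args": "console=ttyS0 reboot=k panic=1",
        },
        "drives": [{
            "drive_id": "rootfs",
            "path_on_host": rootfs,
            "is_root_device": True,
            "is_read_only": False,
        }],
        "network-interfaces": [{
            "iface_id": "eth0",
            "host_dev_name": tap_device,
        }],
        "machine-config": {"vcpu_count": vcpus, "mem_size_mib": mem_mib},
    }


# Hypothetical paths: an uncompressed guest kernel and an ext4 root filesystem image.
config = microvm_config("vmlinux.bin", "rootfs.ext4", "tap0")
print(json.dumps(config, indent=2))
```

There is nothing here about display adapters, USB controllers, or sound cards, because the VMM has no such devices to configure.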

To sum up, Firecracker achieves its goals by isolating guests using KVM-based virtual machines, building a small VMM in Rust which supports only the minimum features required for function instance guests, and removing all unnecessary kernel modules and drivers from the deployed guest kernel. Much of this work was done in the hopes of reducing boot-time. I have done a bit of work replicating the boot time experiments from the NSDI Firecracker paper. I may present them in a future post!


[1] Alexandru Agache, Marc Brooker, Alexandra Iordache, Anthony Liguori, Rolf Neugebauer, Phil Piwonka, and Diana-Maria Popa. 2020. Firecracker: Lightweight Virtualization for Serverless Applications. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20), 419–434. Retrieved from https://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-137.pdf

[2] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. 2003. Xen and the Art of Virtualization. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP ’03), ACM, New York, NY, USA, 164–177. Retrieved from https://www.cl.cam.ac.uk/research/srg/netos/papers/2003-xensosp.pdf

[3] Fabrice Bellard. 2005. QEMU, a fast and portable dynamic translator. In USENIX Annual Technical Conference, FREENIX Track, California, USA, 46. Retrieved from https://www.usenix.org/legacy/event/usenix05/tech/freenix/full_papers/bellard/bellard.pdf

[4] Avi Kivity, Yaniv Kamay, Dor Laor, Uri Lublin, and Anthony Liguori. 2007. kvm: the Linux Virtual Machine Monitor. In Proceedings of the Linux Symposium, Ottawa, Ontario, Canada, 225–230. Retrieved from https://www.kernel.org/doc/ols/2007/ols2007v1-pages-225-230.pdf

[5] G. Lawrence. 2016. Dirty COW (CVE-2016-5195): Docker Container Escape. Retrieved from https://blog.paranoidsoftware.com/dirty-cow-cve-2016-5195-docker-container-escape/

[6] Thorsten Leemhuis. 2012. Kernel Log: 15,000,000 lines, 3.0 promoted to long-term kernel. (2012). Retrieved from http://www.h-online.com/open/features/Kernel-Log-15-000-000-lines-of-code-3-0-promoted-to-long-term-kernel-1408062.html

[7] D. Shapira. 2017. Escaping Docker container using waitid(): CVE-2017-5123. Retrieved from https://www.twistlock.com/2017/12/27/escaping-docker-container-using-waitid-cve-2017-5123/

[8] Carl A Waldspurger. 2002. Memory resource management in VMware ESX server. ACM SIGOPS Operating Systems Review 36, SI (2002), 181–194. Retrieved from https://www.waldspurger.org/carl/papers/esx-mem-osdi02.pdf

[9] Liang Wang, Mengyuan Li, Yinqian Zhang, Thomas Ristenpart, and Michael Swift. 2018. Peeking Behind the Curtains of Serverless Platforms. In 2018 USENIX Annual Technical Conference (USENIX ATC 18), USENIX Association, Boston, MA, 133–146. Retrieved from https://www.usenix.org/conference/atc18/presentation/wang-liang

[10] virtio. Retrieved from https://www.linux-kvm.org/page/Virtio