# Architecture

The principal characteristics of crosvm are:

- A process per virtual device, made using fork
- Each process is sandboxed using [minijail]
- Takes full advantage of KVM and low-level Linux syscalls, and so only runs
  on Linux
- Written in Rust for security and safety

A typical session of crosvm starts in `main.rs` where command line parsing is
done to build up a `Config` structure. The `Config` is used by `run_config` in
`linux.rs` to set up and execute a VM. Broken down into rough steps:

1. Load the Linux kernel from an ELF file.
1. Create a handful of control sockets used by the virtual devices.
1. Invoke the architecture specific VM builder `Arch::build_vm` (located in
   `x86_64/src/lib.rs` or `aarch64/src/lib.rs`).
1. `Arch::build_vm` will itself invoke the provided `create_devices` function
   from `linux.rs`.
1. `create_devices` creates every PCI device, including the virtio devices,
   that were configured in `Config`, along with matching [minijail] configs for
   each.
1. `Arch::generate_pci_root`, using a list of every PCI device with optional
   `Minijail`, will finally jail the PCI devices and construct a `PciRoot` that
   communicates with them.
1. Once the VM has been built, it's contained within a `RunnableLinuxVm` object
   that is used by the VCPUs and control loop to service requests until
   shutdown.

## Forking

During the device creation routine, each device will be created and then wrapped
in a `ProxyDevice` which will internally `fork` (but not `exec`) and [minijail]
the device, while dropping it for the main process. The only interaction that
the device is capable of having with the main process is via the proxied trait
methods of `BusDevice`, shared memory mappings such as the guest memory, and
file descriptors that were specifically allowed by that device's security
policy. This can lead to surprising behavior, such as file descriptors that
were once valid becoming invalid inside the device process.

## Sandboxing Policy

Every sandbox is made with [minijail] and starts with `create_base_minijail` in
`linux.rs`, which applies some very restrictive settings. Linux namespaces and
seccomp filters are used extensively. Each seccomp policy can be found under
`seccomp/{arch}/{device}.policy` and should start by `@include`-ing the
`common_device.policy`. With the exception of architecture specific devices
(such as `Pl030` on ARM or `I8042` on x86_64), every device will need a
different policy for each supported architecture.

## The VM Control Sockets

For the operations that devices need to perform on the global VM state, such as
mapping into guest memory address space, there are the VM control sockets. There
are a few kinds, split by the type of request and response that the socket will
process. This also provides basic security privilege separation in case a device
becomes compromised by a malicious guest. For example, a rogue device that is
able to allocate MSI routes would not be able to use the same socket to
(de)register guest memory. During the device initialization stage, each device
that requires some aspect of VM control will have a constructor that requires
the corresponding control socket. The control socket will get preserved when the
device is sandboxed and the other side of the socket will be waited on in
the main process's control loop.

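As a rough illustration of that privilege separation, the sketch below gives
each kind of socket its own request type, with `std::sync::mpsc` channels
standing in for the real sockets. The names `IrqRequest`, `MemoryRequest`,
`MsiOnlyDevice`, and `drain_irq_requests` are hypothetical and are not crosvm's
actual types.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical requests a device may send over an interrupt-routing socket.
#[derive(Debug, PartialEq)]
pub enum IrqRequest {
    AllocateMsiRoute { vector: u32 },
}

/// Hypothetical requests a device may send over a memory socket.
#[derive(Debug, PartialEq)]
pub enum MemoryRequest {
    Register { guest_addr: u64, size: u64 },
    Unregister { guest_addr: u64 },
}

/// A device that is only handed the IRQ end of a socket pair. Its constructor
/// takes no memory socket, so it has no way to send a `MemoryRequest`.
pub struct MsiOnlyDevice {
    irq_socket: Sender<IrqRequest>,
}

impl MsiOnlyDevice {
    pub fn new(irq_socket: Sender<IrqRequest>) -> Self {
        MsiOnlyDevice { irq_socket }
    }

    /// Returns false if the main process has hung up.
    pub fn request_route(&self, vector: u32) -> bool {
        self.irq_socket
            .send(IrqRequest::AllocateMsiRoute { vector })
            .is_ok()
    }
}

/// Main-process side: drain whatever IRQ requests are currently pending.
pub fn drain_irq_requests(main_end: &Receiver<IrqRequest>) -> Vec<IrqRequest> {
    main_end.try_iter().collect()
}

/// Wires one device to the main process and returns what the main side saw.
pub fn demo() -> Vec<IrqRequest> {
    let (device_end, main_end) = channel();
    let device = MsiOnlyDevice::new(device_end);
    device.request_route(3);
    drain_irq_requests(&main_end)
}
```

Splitting the request types this way means the restriction is enforced by which
socket a device was given, not by checks inside a shared handler.
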
The socket exposed by crosvm with the `--socket` command line argument is
another form of the VM control socket. Because the protocol of the control
socket is internal and unstable, the only supported way of using that resulting
named unix domain socket is via crosvm command line subcommands such as
`crosvm stop`.

## GuestMemory

`GuestMemory` and its friends `VolatileMemory`, `VolatileSlice`,
`MemoryMapping`, and `SharedMemory`, are common types used throughout crosvm to
interact with guest memory. Use the following guidelines to choose between
them:

- `GuestMemory` is for sending around references to all of the guest memory.
  It can be cloned freely, but the underlying guest memory is always the same.
  Internally, it's implemented using `MemoryMapping` and `SharedMemory`. Note
  that `GuestMemory` is mapped into the host address space, but it is
  non-contiguous. Device memory, such as mapped DMA-Bufs, is not present in
  `GuestMemory`.
- `SharedMemory` wraps a `memfd` and can be mapped using `MemoryMapping` to
  access its data. `SharedMemory` can't be cloned.
- `VolatileMemory` is a trait that exposes generic access to non-contiguous
  memory. `GuestMemory` implements this trait. Use this trait for functions
  that operate on a memory space but don't necessarily need it to be guest
  memory.
- `VolatileSlice` is analogous to a Rust slice, but unlike a Rust slice, a
  `VolatileSlice` has data that can be changed asynchronously by anyone who
  holds a reference to it. Exclusive mutability and data synchronization are
  not available when it comes to a `VolatileSlice`. This type is useful for
  functions that operate on contiguous shared memory, such as a single entry
  from a scatter gather table, or for safe wrappers around functions which
  operate on pointers, such as a `read` or `write` syscall.
- `MemoryMapping` is a safe wrapper around anonymous and file mappings. Access
  via Rust references is forbidden, but indirect reading and writing is
  available via `VolatileSlice` and several convenience functions. This type
  is most useful for mapping memory unrelated to `GuestMemory`.

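To make the "no exclusive mutability" point concrete, here is a minimal sketch
of a volatile-slice-like type built only on `std::ptr::read_volatile` and
`write_volatile`. `SketchVolatileSlice` and its methods are hypothetical and
much simpler than crosvm's real `VolatileSlice`.

```rust
use std::marker::PhantomData;
use std::ptr::{read_volatile, write_volatile};

/// Hypothetical, simplified stand-in for a volatile slice: a pointer and a
/// length, with every access going through volatile reads and writes.
#[derive(Clone, Copy)]
pub struct SketchVolatileSlice<'a> {
    addr: *mut u8,
    len: usize,
    // Ties the slice to the lifetime of the backing memory.
    _backing: PhantomData<&'a [u8]>,
}

impl<'a> SketchVolatileSlice<'a> {
    /// Borrowing the buffer mutably for `'a` keeps ordinary Rust references
    /// away from the memory while volatile slices to it exist.
    pub fn new(buf: &'a mut [u8]) -> Self {
        SketchVolatileSlice {
            addr: buf.as_mut_ptr(),
            len: buf.len(),
            _backing: PhantomData,
        }
    }

    /// Writes one byte; takes `&self` because writers are never exclusive.
    /// Returns false if `offset` is out of bounds.
    pub fn write_u8(&self, offset: usize, value: u8) -> bool {
        if offset >= self.len {
            return false;
        }
        // Safe because `offset` was bounds-checked and the memory outlives 'a.
        unsafe { write_volatile(self.addr.add(offset), value) };
        true
    }

    /// Reads one byte, or returns `None` if `offset` is out of bounds.
    pub fn read_u8(&self, offset: usize) -> Option<u8> {
        if offset >= self.len {
            return None;
        }
        // Safe for the same reasons as in `write_u8`.
        Some(unsafe { read_volatile(self.addr.add(offset)) })
    }
}

/// Two copies of the slice alias the same bytes; a write through one is
/// visible through the other.
pub fn demo() -> (Option<u8>, Option<u8>) {
    let mut backing = [0u8; 4];
    let a = SketchVolatileSlice::new(&mut backing);
    let b = a;
    a.write_u8(1, 0xAB);
    (b.read_u8(1), b.read_u8(4))
}
```

Because both methods take `&self`, the type never promises that a value read a
moment ago is still there, which matches how guest-visible memory behaves.
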
## Device Model

### `Bus`/`BusDevice`

The root of the crosvm device model is the `Bus` structure and its friend the
`BusDevice` trait. The `Bus` structure is a virtual computer bus used to emulate
the memory-mapped I/O bus and also I/O ports for x86 VMs. On a read or write to
an address on a VM's bus, the corresponding `Bus` object is queried for a
`BusDevice` that occupies that address. `Bus` will then forward the read/write
to the `BusDevice`. Because of this behavior, only one `BusDevice` may exist at
any given address. However, a `BusDevice` may be placed at more than one address
range. Depending on how a `BusDevice` was inserted into the `Bus`, the forwarded
read/write will be relative to 0 or to the start of the address range that the
`BusDevice` occupies (which would be ambiguous if the `BusDevice` occupied more
than one range).

Only the base address of a multi-byte read/write is used to search for a device,
so a device implementation should be aware that the last address of a single
read/write may be outside its address range. For example, if a `BusDevice` was
inserted at base address 0x1000 with a length of 0x40, a 4-byte read by a VCPU
at 0x103E would be forwarded to that `BusDevice`, even though the last two
bytes of the read fall outside its range.

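The lookup rule can be sketched with a `BTreeMap` keyed by range base. This is
a simplified model under stated assumptions: `SketchBus`, `SketchBusDevice`,
and `EchoOffset` are hypothetical names, ranges are assumed not to overlap, and
offsets are always forwarded relative to the start of the range.

```rust
use std::collections::BTreeMap;

/// Hypothetical, cut-down device trait; the real `BusDevice` has more methods.
pub trait SketchBusDevice {
    fn read(&mut self, offset: u64, data: &mut [u8]);
}

/// Fills every read with the low byte of the offset it was handed, which
/// makes the forwarded offset easy to observe.
pub struct EchoOffset;

impl SketchBusDevice for EchoOffset {
    fn read(&mut self, offset: u64, data: &mut [u8]) {
        for byte in data.iter_mut() {
            *byte = offset as u8;
        }
    }
}

pub struct SketchBus {
    // Maps range base -> (range length, device).
    devices: BTreeMap<u64, (u64, Box<dyn SketchBusDevice>)>,
}

impl SketchBus {
    pub fn new() -> Self {
        SketchBus { devices: BTreeMap::new() }
    }

    pub fn insert(&mut self, base: u64, len: u64, device: Box<dyn SketchBusDevice>) {
        self.devices.insert(base, (len, device));
    }

    /// Forwards the read to the device whose range contains `addr`. Only the
    /// address of the first byte is checked, so the access may run past the
    /// end of the range. Returns false if no device claims `addr`.
    pub fn read(&mut self, addr: u64, data: &mut [u8]) -> bool {
        // The candidate is the last range that starts at or before `addr`.
        if let Some((base, (len, device))) = self.devices.range_mut(..=addr).next_back() {
            if addr - *base < *len {
                device.read(addr - *base, data);
                return true;
            }
        }
        false
    }
}

pub fn demo() -> (bool, [u8; 4], bool) {
    let mut bus = SketchBus::new();
    bus.insert(0x1000, 0x40, Box::new(EchoOffset));
    let mut data = [0u8; 4];
    // Covers 0x103e..=0x1041; the last two bytes lie outside the range.
    let hit = bus.read(0x103e, &mut data);
    // The first byte is already past the end of the range, so nothing matches.
    let miss = bus.read(0x1040, &mut [0u8; 4]);
    (hit, data, miss)
}
```

In this sketch the device is responsible for deciding what to do with the bytes
that fall outside its range; the bus does not truncate or split the access.
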
Each `BusDevice` is reference counted and wrapped in a mutex, so implementations
of `BusDevice` need not worry about synchronizing their access across multiple
VCPUs and threads. Each VCPU will get a complete copy of the `Bus`, so there is
no contention for querying the `Bus` about an address. Once the `BusDevice` is
found, the `Bus` will acquire an exclusive lock to the device and forward the
VCPU's read/write. The implementation of the `BusDevice` will block execution of
the VCPU that invoked it, as well as any other VCPU attempting access, until it
returns from its method.

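A minimal sketch of that sharing model, using `Arc<Mutex<_>>` and plain threads
in place of VCPUs, might look like the following. `SharedBus` and `Counter` are
hypothetical names chosen for illustration.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical device that counts how many accesses it has served.
pub struct Counter {
    pub accesses: u64,
}

/// Hypothetical, simplified bus. Cloning it copies the lookup table but not
/// the devices, which stay shared behind `Arc<Mutex<_>>`.
#[derive(Clone)]
pub struct SharedBus {
    devices: Vec<(u64, Arc<Mutex<Counter>>)>,
}

impl SharedBus {
    pub fn access(&self, base: u64) {
        if let Some((_, device)) = self.devices.iter().find(|(b, _)| *b == base) {
            // Exclusive lock: any other thread accessing this device waits
            // here until the current access has returned.
            device.lock().unwrap().accesses += 1;
        }
    }
}

pub fn demo() -> u64 {
    let device = Arc::new(Mutex::new(Counter { accesses: 0 }));
    let bus = SharedBus { devices: vec![(0x1000, device.clone())] };

    let vcpus: Vec<_> = (0..4)
        .map(|_| {
            // Each "VCPU" thread gets its own copy of the bus.
            let bus = bus.clone();
            thread::spawn(move || {
                for _ in 0..100 {
                    bus.access(0x1000);
                }
            })
        })
        .collect();
    for vcpu in vcpus {
        vcpu.join().unwrap();
    }

    let total = device.lock().unwrap().accesses;
    total
}
```

Looking up the device touches only the thread's own copy of the table; the only
shared, contended step is the lock on the device itself.
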
Most devices in crosvm do not implement `BusDevice` directly, but some
examples that do are `i8042` and `Serial`. With the exception of PCI devices,
all devices are inserted by architecture specific code (which may call into the
architecture-neutral `arch` crate). A `BusDevice` can be proxied to a sandboxed
process using `ProxyDevice`, which will create the second process using a fork,
with no exec.

### `PciConfigIo`/`PciConfigMmio`

In order to use the more complex PCI bus, there are a couple of adapters that
implement `BusDevice` and call into a `PciRoot` with higher level calls to
`config_space_read`/`config_space_write`. The `PciConfigMmio` is a `BusDevice`
for insertion into the MMIO `Bus` for ARM devices. For x86_64, `PciConfigIo` is
inserted into the I/O port `Bus`. There is only one implementation of `PciRoot`
that is used by either of the `PciConfig*` structures. Because these devices are
very simple, they have very little code or state. They aren't sandboxed and are
run as part of the main process.

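For context on what an I/O port adapter has to do, the sketch below decodes the
32-bit value that an x86 guest writes to the standard `CONFIG_ADDRESS` port
(0xCF8) into bus, device, function, and register numbers. The bit layout is the
standard PCI configuration mechanism; the names `ConfigAddress` and
`decode_config_address` are hypothetical and this is not taken from crosvm's
`PciConfigIo` code.

```rust
/// Location of a PCI configuration register, decoded from the value written
/// to the x86 CONFIG_ADDRESS port.
#[derive(Debug, PartialEq)]
pub struct ConfigAddress {
    pub bus: u8,
    pub device: u8,
    pub function: u8,
    /// Index of the 32-bit register, not a byte offset.
    pub register: u8,
}

/// Decodes a CONFIG_ADDRESS value. Returns `None` when the enable bit
/// (bit 31) is clear, in which case no configuration access is requested.
pub fn decode_config_address(value: u32) -> Option<ConfigAddress> {
    if (value & 0x8000_0000) == 0 {
        return None;
    }
    Some(ConfigAddress {
        bus: ((value >> 16) & 0xff) as u8,      // bits 23..=16
        device: ((value >> 11) & 0x1f) as u8,   // bits 15..=11
        function: ((value >> 8) & 0x07) as u8,  // bits 10..=8
        register: ((value >> 2) & 0x3f) as u8,  // bits 7..=2
    })
}
```

An adapter along these lines would pass the decoded fields to the root's
`config_space_read` or `config_space_write` when the guest then touches the
data port.
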
### `PciRoot`/`PciDevice`/`VirtioPciDevice`

The `PciRoot` contains all the `PciDevice` trait objects, much as a `Bus`
contains `BusDevice`s. Because of a shortcut (or hack), the `ProxyDevice` only
supports jailing `BusDevice` traits. Therefore, `PciRoot` only contains
`BusDevice`s, even though they also implement `PciDevice`. In fact, every
`PciDevice` also implements `BusDevice` because of a blanket implementation
(`impl<T: PciDevice> BusDevice for T { … }`). There are a few PCI related
methods in `BusDevice` to allow the `PciRoot` to still communicate with the
underlying `PciDevice` (yes, this abstraction is very leaky). Most devices will
not implement `PciDevice` directly, instead using the `VirtioPciDevice`
implementation for virtio devices, but the xHCI (USB) controller is an example
that implements `PciDevice` directly. The `VirtioPciDevice` is an
implementation of `PciDevice` that wraps a `VirtioDevice`, which is how the
virtio specified PCI transport is adapted to a transport agnostic
`VirtioDevice` implementation.

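The blanket implementation pattern can be sketched with two cut-down traits.
The trait and method names below (`SketchPciDevice`, `SketchBusDevice`,
`read_config_register`, `config_register_read`) and the `FakeXhci` type are
assumptions made for illustration, not the real crosvm definitions.

```rust
/// Hypothetical, cut-down PCI device trait.
pub trait SketchPciDevice {
    fn read_config_register(&self, reg_idx: usize) -> u32;
}

/// Hypothetical, cut-down bus device trait.
pub trait SketchBusDevice {
    fn debug_label(&self) -> String;

    /// PCI related hook. The default suits devices that are not PCI devices.
    fn config_register_read(&self, _reg_idx: usize) -> u32 {
        0
    }
}

/// Blanket implementation: every PCI device is automatically a bus device,
/// and the PCI related hook is routed to the PCI trait.
impl<T: SketchPciDevice> SketchBusDevice for T {
    fn debug_label(&self) -> String {
        "pci device".to_string()
    }

    fn config_register_read(&self, reg_idx: usize) -> u32 {
        self.read_config_register(reg_idx)
    }
}

/// A device that implements the PCI trait directly.
pub struct FakeXhci;

impl SketchPciDevice for FakeXhci {
    fn read_config_register(&self, reg_idx: usize) -> u32 {
        // Arbitrary value for register 0, chosen only for the demo.
        if reg_idx == 0 { 0x1234_5678 } else { 0 }
    }
}

pub fn demo() -> u32 {
    // The container only knows about `dyn SketchBusDevice`, yet the call
    // still reaches the PCI-specific method through the blanket impl.
    let devices: Vec<Box<dyn SketchBusDevice>> = vec![Box::new(FakeXhci)];
    devices[0].config_register_read(0)
}
```

This shows why the abstraction is leaky: the bus-level trait has to carry
PCI-flavored methods so that a container of bus devices can still act as a PCI
root.
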
### `VirtioDevice`

The `VirtioDevice` is the most widely implemented trait among the device traits.
Each of the different virtio devices (block, rng, net, etc.) implements this
trait directly and they follow a similar pattern. Most of the trait methods are
easily filled in with basic information about the specific device, but
`activate` will be the heart of the implementation. It's called by the virtio
transport after the guest's driver has indicated the device has been configured
and is ready to run. The virtio device implementation will receive the run time
related resources (`GuestMemory`, `Interrupt`, etc.) for processing virtio
queues and associated interrupts via the arguments to `activate`, but `activate`
can't spend its time actually processing the queues. A VCPU will be blocked as
long as `activate` is running. Every device uses `activate` to launch a worker
thread that takes ownership of run time resources to do the actual processing.
There is some subtlety in dealing with virtio queues, so the smart thing to do
is copy a simpler device and adapt it, such as the rng device (`rng.rs`).

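The "activate only launches a worker" pattern can be sketched as follows.
Channels stand in for the virtio queue and the interrupt, and `Resources` and
`SketchDevice` are hypothetical names; the real `activate` signature and
resource types are different.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for the run time resources handed to `activate`.
pub struct Resources {
    pub queue: Receiver<u32>,   // stands in for a virtio queue
    pub interrupt: Sender<u32>, // stands in for the interrupt to the guest
}

pub struct SketchDevice {
    worker: Option<JoinHandle<()>>,
}

impl SketchDevice {
    pub fn new() -> Self {
        SketchDevice { worker: None }
    }

    /// Returns promptly so the calling VCPU is not blocked; all queue
    /// processing happens on the worker thread, which owns the resources.
    pub fn activate(&mut self, resources: Resources) {
        let Resources { queue, interrupt } = resources;
        self.worker = Some(thread::spawn(move || {
            // Runs until the sending side of the queue is dropped.
            for request in queue {
                // "Process" the request, then signal completion.
                if interrupt.send(request * 2).is_err() {
                    break;
                }
            }
        }));
    }

    /// Waits for the worker to exit, if one was started.
    pub fn join(&mut self) {
        if let Some(worker) = self.worker.take() {
            worker.join().unwrap();
        }
    }
}

pub fn demo() -> Vec<u32> {
    let (queue_tx, queue_rx) = channel();
    let (irq_tx, irq_rx) = channel();
    let mut device = SketchDevice::new();
    device.activate(Resources { queue: queue_rx, interrupt: irq_tx });

    for n in 1..=3 {
        queue_tx.send(n).unwrap();
    }
    drop(queue_tx); // closing the queue lets the worker exit
    device.join();
    irq_rx.try_iter().collect()
}
```

Moving the resources into the thread closure mirrors the ownership transfer
described above: after `activate` returns, only the worker can touch them.
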
## Communication Framework

Because of the multi-process nature of crosvm, communication is done over
several IPC primitives. The common ones are shared memory pages, unix sockets,
anonymous pipes, and various other file descriptor variants (DMA-buf, eventfd,
etc.). Standard methods (`read`/`write`) of using these primitives may be used,
but crosvm has developed some helpers which should be used where applicable.

### `PollContext`/`EpollContext`

Most threads in crosvm will have a wait loop using a `PollContext`, which is a
wrapper around Linux's `epoll` primitive for selecting over file descriptors.
`EpollContext` is very similar and has slightly fewer features, but is usable
by multiple threads at once. In either case, each FD is added to the context
along with an associated token, whose type is the type parameter of
`PollContext`. This token must be convertible to and from a `u64`, which is a
limitation imposed by how `epoll` works. There is a custom derive
`#[derive(PollToken)]` which can be applied to an `enum` declaration that makes
it easy to use your own enum in a `PollContext`.

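The derive takes care of the conversion, so the hand-written sketch below is
only meant to show the kind of mapping a token needs. The trait
`SketchPollToken`, the `Token` enum, and the bit layout are all assumptions
made for illustration; the encoding produced by the real derive may differ.

```rust
/// Hypothetical version of the conversion a token type must support.
pub trait SketchPollToken: Sized {
    fn as_raw_token(&self) -> u64;
    fn from_raw_token(raw: u64) -> Option<Self>;
}

#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Token {
    QueueAvailable,
    Kill,
    /// A variant may carry a small payload as long as it fits in the `u64`.
    ChildExited(u32),
}

impl SketchPollToken for Token {
    fn as_raw_token(&self) -> u64 {
        // The low 8 bits select the variant; any payload sits above them.
        match *self {
            Token::QueueAvailable => 0,
            Token::Kill => 1,
            Token::ChildExited(pid) => 2 | ((pid as u64) << 8),
        }
    }

    fn from_raw_token(raw: u64) -> Option<Self> {
        match raw & 0xff {
            0 => Some(Token::QueueAvailable),
            1 => Some(Token::Kill),
            2 => Some(Token::ChildExited((raw >> 8) as u32)),
            _ => None,
        }
    }
}
```

A wait loop would then convert each raw value reported by `epoll` back into a
`Token` and `match` on it, which keeps the loop body readable.
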
Note that the limitations of `PollContext` are the same as the limitations of
`epoll`. The same FD cannot be inserted more than once, and the FD will be
automatically removed if the process runs out of references to that FD. A
`dup`/`fork` call will increment that reference count, so closing the original
FD will not actually remove it from the `PollContext`. It is possible to receive
tokens from `PollContext` for an FD that was closed because of a race condition
in which an event was registered in the background before the `close` happened.
Best practice is to remove an FD before closing it so that events associated
with it can be reliably eliminated.

### `serde` with Descriptors

Using raw sockets and pipes to communicate is very inconvenient for rich data
types. To help make this easier and less error prone, crosvm uses the `serde`
crate. To allow transmitting types with embedded descriptors (FDs on Linux or
HANDLEs on Windows), a module is provided for sending and receiving descriptors
alongside the plain old bytes that serde consumes.

[minijail]: https://android.googlesource.com/platform/external/minijail

## Code Map

Source code is organized into crates, each with their own unit tests.

- `./src/` - The top-level binary front-end for using crosvm.
- `aarch64` - Support code specific to 64 bit ARM architectures.
- `base` - Safe wrappers for small system facilities which provide
  cross-platform-compatible interfaces. For Linux, this is basically a thin
  wrapper of `sys_util`.
- `bin` - Scripts for code health such as wrappers of `rustfmt` and `clippy`.
- `ci` - Scripts for continuous integration.
- `cros_async` - Runtime for async/await programming. This crate provides a
  `Future` executor based on `io_uring` and one based on `epoll`.
- `devices` - Virtual devices exposed to the guest OS.
- `disk` - Library to create and manipulate several types of disks such as raw
  disk, [qcow], etc.
- `hypervisor` - Abstract layer to interact with hypervisors. For Linux, this
  crate is a wrapper of `kvm`.
- `integration_tests` - End-to-end tests that run a crosvm VM.
- `kernel_loader` - Loads elf64 kernel files to a slice of memory.
- `kvm_sys` - Low-level (mostly) auto-generated structures and constants for
  using KVM.
- `kvm` - Unsafe, low-level wrapper code for using `kvm_sys`.
- `libvda` - Safe wrapper of [libvda], a Chrome OS HW-accelerated video
  decoding/encoding library.
- `net_sys` - Low-level (mostly) auto-generated structures and constants for
  creating TUN/TAP devices.
- `net_util` - Wrapper for creating TUN/TAP devices.
- `qcow_util` - A library and a binary to manipulate [qcow] disks.
- `seccomp` - Contains minijail seccomp policy files for each sandboxed
  device. Because some syscalls vary by architecture, the seccomp policies are
  split by architecture.
- `sync` - Our version of `std::sync::Mutex` and `std::sync::Condvar`.
- `sys_util` - Mostly safe wrappers for small system facilities such as
  `eventfd` or `syslog`.
- `third_party` - Third-party libraries which we are maintaining on the Chrome
  OS tree or the AOSP tree.
- `vfio_sys` - Low-level (mostly) auto-generated structures, constants and
  ioctls for [VFIO].
- `vhost` - Wrappers for creating vhost based devices.
- `virtio_sys` - Low-level (mostly) auto-generated structures and constants
  for interfacing with kernel vhost support.
- `vm_control` - IPC for the VM.
- `vm_memory` - VM-specific memory objects.
- `x86_64` - Support code specific to 64 bit Intel machines.

[qcow]: https://en.wikipedia.org/wiki/Qcow
[libvda]: https://chromium.googlesource.com/chromiumos/platform2/+/refs/heads/main/arc/vm/libvda/
[VFIO]: https://www.kernel.org/doc/html/latest/driver-api/vfio.html