Chirantan Ekbote dcbf1652a4 sync: Don't transfer waiters from Condvar -> Mutex
A performance optimization should never change the observable behavior,
and yet that's what this one did. Canceling a `cv.wait()` call after
the waiter was already transferred to the Mutex's wait list should
still result in waking up the next waiter in the Condvar's wait list.
Instead, the `cancel_after_transfer` test was checking for the opposite
behavior.

Additionally, the transfer was racy with concurrent cancellation.
Consider the following sequence of events:

Thread A                            Thread B
--------                            --------

drop WaitFuture                     cv.notify_all()
waiter.cancel.lock()                raw_mutex.transfer_waiters()
c = cancel.c
data = cancel.data
waiter.cancel.unlock()
                                    waiter.cancel.lock()
                                    cancel.c = mu_cancel_waiter
                                    cancel.data = mutex_ptr
                                    waiter.cancel.unlock()
                                    waiter.is_waiting_for = Mutex
                                    mu.unlock_slow()
                                    get_wake_list()
                                    waiter.is_waiting_for = None
                                    waiter.wake()
c(data, waiter, false)
cancel_waiter(cv, waiter, false)
waiter.is_waiting_for == None
get_wake_list()

There are 2 issues in the above sequence:

1. Thread A has stale information about the state of the waiter.  Since
   the waiter was woken, it needs to set `wake_next` in the cancel
   function to true but instead incorrectly sets it to false.  By
   itself, this isn't that big of an issue because the cancel function
   also checks if the waiter was already removed from the wait
   list (i.e., it was woken up) but that check is problematic because of
   the next issue.
2. The Condvar's cancel function can detect when a waiter has been moved
   to the Mutex's wait list (waiter.is_waiting_for == Mutex) and can
   request to retry the cancellation.  However, when
   waiter.is_waiting_for == None (which means it was removed from the
   wait list), it doesn't know whether the waiter was woken up from the
   Mutex's wait list or the Condvar's wait list.  It incorrectly assumes
   that the waiter was in the Condvar's wait list and does not retry the
   cancel.  As a result, the Mutex's cancel function is never called,
   which means any waiters still in the Mutex's wait list will never get
   woken up.
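To make issue 2 concrete, here is a minimal, self-contained Rust model
of the Condvar-side cancel decision. The names (`WaitingFor`, `Waiter`,
`cancel_from_condvar`) are illustrative stand-ins, not the actual
cros_async types:

    #[derive(Clone, Copy, PartialEq, Debug)]
    enum WaitingFor {
        Condvar,
        Mutex,
        None,
    }

    struct Waiter {
        is_waiting_for: WaitingFor,
    }

    // Models the Condvar-side cancel decision: it can retry against the
    // Mutex only while it still sees the transfer. Once the state is
    // None, it cannot tell which wait list the waiter was woken from.
    fn cancel_from_condvar(waiter: &Waiter) -> &'static str {
        match waiter.is_waiting_for {
            WaitingFor::Condvar => "remove waiter from the Condvar's wait list",
            WaitingFor::Mutex => "retry the cancellation against the Mutex",
            // Ambiguous: woken from the Condvar, or from the Mutex after
            // a transfer? Assuming the former means the Mutex's cancel
            // function never runs, stranding any remaining Mutex waiters.
            WaitingFor::None => "assume woken from the Condvar (possibly wrong)",
        }
    }

    fn main() {
        for state in [WaitingFor::Condvar, WaitingFor::Mutex, WaitingFor::None] {
            let waiter = Waiter { is_waiting_for: state };
            println!("{:?}: {}", state, cancel_from_condvar(&waiter));
        }
    }

In the interleaving above, Thread A hits the `WaitingFor::None` arm and
never retries against the Mutex, which is exactly how waiters get
stranded.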

I haven't been able to come up with a way to fix these issues without
making everything much more complicated, so for now let's just drop the
transfer optimization.

The initial motivation for this optimization was to avoid making a
FUTEX_WAKE syscall for every thread that needs to be woken up and to
avoid a thundering herd problem where the newly woken threads all
contend on the mutex.  However, waking up a future tends to be cheaper
than waking up a whole thread.  If no executor threads are blocked, it
doesn't even involve making a syscall, as the executor simply adds the
future to its ready list.  Additionally, multi-threaded executors are
unlikely to have more threads than the number of CPUs on the system,
which should also reduce contention on the mutex.
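As a rough sketch of why waking a future is cheap: with a toy executor,
"waking" just pushes a task id onto a ready queue under a short-lived
lock, and a syscall is needed only if a worker thread is actually
parked. This is illustrative only, not cros_async's real executor:

    use std::collections::VecDeque;
    use std::sync::{Arc, Mutex};

    struct ReadyList {
        queue: Mutex<VecDeque<usize>>,
    }

    impl ReadyList {
        // Waking a task is just a queue push; a real executor would only
        // unpark a worker thread (a potential FUTEX_WAKE) if one were
        // actually blocked waiting for work.
        fn wake(&self, task_id: usize) {
            self.queue.lock().unwrap().push_back(task_id);
        }
    }

    fn main() {
        let ready = Arc::new(ReadyList {
            queue: Mutex::new(VecDeque::new()),
        });
        ready.wake(7);
        ready.wake(8);
        println!("ready tasks: {:?}", ready.queue.lock().unwrap());
    }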

If this code starts showing up as a hotspot in perf traces then we
should consider figuring out a way to re-enable this optimization.

BUG=chromium:1157860
TEST=unit tests.  Also ran the tests in a loop for an hour on a kukui
     and verified that they didn't hang

Cq-Depend: chromium:2793844
Change-Id: Iee3861a40c8d9a45d3a01863d804efc82d4467ac
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/platform/crosvm/+/2804867
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Dylan Reid <dgreid@chromium.org>
Reviewed-by: Daniel Verkamp <dverkamp@chromium.org>
Commit-Queue: Chirantan Ekbote <chirantan@chromium.org>
2021-04-06 09:20:22 +00:00
aarch64 arch: rewrite FDT writer in native rust 2021-03-03 01:29:15 +00:00
acpi_tables acpi_tables: Add test case for reading SDT data from files 2020-10-27 05:22:34 +00:00
arch replace all usage of MsgOnSocket derives 2021-04-02 15:40:41 +00:00
assertions edition: Remove extern crate lines 2019-04-15 02:06:08 -07:00
base base: add tube module 2021-04-01 01:59:29 +00:00
bin uprev rust-toolchain and fix clippy warnings 2021-03-22 21:41:07 +00:00
bit_field uprev rust-toolchain and fix clippy warnings 2021-03-22 21:41:07 +00:00
ci uprev rust-toolchain and fix clippy warnings 2021-03-22 21:41:07 +00:00
cros_async sync: Don't transfer waiters from Condvar -> Mutex 2021-04-06 09:20:22 +00:00
crosvm_plugin uprev rust-toolchain and fix clippy warnings 2021-03-22 21:41:07 +00:00
data_model [linux_input_sys/data_model]: signed input_event 2021-03-23 18:49:33 +00:00
devices vm_memory: Allow GuestMemory to be backed by multiple FDs 2021-04-06 04:02:26 +00:00
disk uprev rust-toolchain and fix clippy warnings 2021-03-22 21:41:07 +00:00
docs replace all usage of MsgOnSocket derives 2021-04-02 15:40:41 +00:00
enumn crosvm: Fix clippy::needless_doctest_main 2020-07-21 13:18:10 +00:00
fuse uprev rust-toolchain and fix clippy warnings 2021-03-22 21:41:07 +00:00
fuzz Add fuzz to crosvm's workspace 2021-03-24 06:23:34 +00:00
gpu_display [linux_input_sys/data_model]: signed input_event 2021-03-23 18:49:33 +00:00
hypervisor vm_memory: Allow GuestMemory to be backed by multiple FDs 2021-04-06 04:02:26 +00:00
integration_tests Run integration_tests by calling crosvm binary 2021-03-19 20:35:53 +00:00
io_uring Fixup Cargo.toml for cros_async and io_uring 2021-04-01 03:32:58 +00:00
kernel_cmdline Fix new clippy warning for potential matches! uses 2020-11-05 06:27:17 +00:00
kernel_loader base: Add shared memory layer to base. 2020-09-30 19:44:40 +00:00
kvm vm_memory: Allow GuestMemory to be backed by multiple FDs 2021-04-06 04:02:26 +00:00
kvm_sys Enable KVM_CAP_ARM_PROTECTED_VM when --protected-vm is passed. 2021-03-02 19:04:43 +00:00
linux_input_sys [linux_input_sys/data_model]: signed input_event 2021-03-23 18:49:33 +00:00
net_sys Add "base" crate and transition crosvm usages to it from sys_util 2020-08-06 18:19:44 +00:00
net_util Final major RawDescriptor transition. 2020-11-13 02:38:47 +00:00
power_monitor crosvm: power_monitor: Populate more battery fields. 2021-02-09 04:41:52 +00:00
protos protos: add arch = x86 guards around CPUID helpers 2020-08-18 05:30:38 +00:00
qcow_utils Add "base" crate and transition crosvm usages to it from sys_util 2020-08-06 18:19:44 +00:00
rand_ish rand_ish: Generate random string from SimpleRng 2020-06-24 06:44:56 +00:00
resources replace all usage of MsgOnSocket derives 2021-04-02 15:40:41 +00:00
rutabaga_gfx rutabaga_gfx: convert to SafeDescriptor 2021-03-23 00:44:10 +00:00
seccomp crosvm: sandbox changes for udmabuf 2021-03-30 16:42:00 +00:00
src linux: reorder video devices after gpu 2021-04-06 03:36:35 +00:00
sync sync: add wait_timeout method to condvar wrapper 2019-09-16 17:18:28 +00:00
sys_util sys_util: Migrate code from libchromeos::linux. 2021-04-05 21:22:49 +00:00
syscall_defines Update x86 and x86_64 syscalls to Linux v5.6-rc5, avoiding duplicates. 2020-06-25 14:34:32 +00:00
tempfile tempfile: add tempfile() and NamedTempFile 2020-08-27 00:39:02 +00:00
tests Framework for extended integration tests 2021-01-20 17:48:10 +00:00
tpm2 crosvm: add license blurb to all files 2019-04-24 15:51:38 -07:00
tpm2-sys tpm: Update libtpm2 to master 2020-07-24 08:08:21 +00:00
usb_sys Add "base" crate and transition crosvm usages to it from sys_util 2020-08-06 18:19:44 +00:00
usb_util replace all usage of MsgOnSocket derives 2021-04-02 15:40:41 +00:00
vfio_sys crosvm-direct: interrupt passthrough kernel interface. 2021-03-31 02:12:55 +00:00
vhost vm_memory: Allow GuestMemory to be backed by multiple FDs 2021-04-06 04:02:26 +00:00
virtio_sys base: First steps towards universal RawDescriptor 2020-10-31 07:12:34 +00:00
vm_control replace all usage of MsgOnSocket derives 2021-04-02 15:40:41 +00:00
vm_memory vm_memory: Allow GuestMemory to be backed by multiple FDs 2021-04-06 04:02:26 +00:00
x86_64 replace all usage of MsgOnSocket derives 2021-04-02 15:40:41 +00:00
.dockerignore add docker supported builds and tests 2019-05-15 13:36:19 -07:00
.gitignore Kokoro: Extensive polishing and bugfixing 2021-02-10 22:04:43 +00:00
.gitmodules tpm: Add tpm2-sys crate 2019-01-13 03:23:13 -08:00
.rustfmt.toml rustfmt.toml: Use 2018 edition 2021-02-10 11:54:06 +00:00
Cargo.lock replace all usage of MsgOnSocket derives 2021-04-02 15:40:41 +00:00
Cargo.toml Set Cargo.toml's default-run to crsovm 2021-04-06 00:21:56 +00:00
CONTRIBUTING.md crosvm: Remove old test infrastructure 2021-03-03 07:05:03 +00:00
LICENSE add LICENSE and README 2017-04-17 14:06:21 -07:00
navbar.md docs: Add note about rust-vmm integration 2020-10-01 20:43:41 +00:00
OWNERS crosvm: Remove owners wildcard 2021-03-19 01:57:41 +00:00
PRESUBMIT.cfg crosvm: Add a pre-upload hook to run clippy 2020-07-28 16:29:06 +00:00
README.md crosvm: Remove old test infrastructure 2021-03-03 07:05:03 +00:00
run_tests replace all usage of MsgOnSocket derives 2021-04-02 15:40:41 +00:00
rust-toolchain native and aarch64 cross-compile containers 2021-01-20 17:41:27 +00:00
test_all Kokoro: Extensive polishing and bugfixing 2021-02-10 22:04:43 +00:00
unblocked_terms.txt Add COIL presubmit to crosvm 2021-02-04 02:47:03 +00:00

crosvm - The Chrome OS Virtual Machine Monitor

This component, known as crosvm, runs untrusted operating systems along with virtualized devices. It runs VMs only through Linux's KVM interface. What makes crosvm unique is its focus on safety within the programming language and a sandbox around the virtual devices that protects the kernel from attack in case of an exploit in the devices.

IRC

The channel #crosvm on freenode is used for technical discussion related to crosvm development and integration.

Getting started

Building for CrOS

crosvm on Chromium OS is built with Portage, so it follows the same general workflow as any cros_workon package. The full package name is chromeos-base/crosvm.

See the Chromium OS developer guide for more on how to build and deploy with Portage.

Building with Docker

See the README from the ci subdirectory to learn how to build and test crosvm in environments outside of the Chrome OS chroot.

Building for Linux

NOTE: Building for Linux natively is new and not fully supported.

First, set up depot_tools and use repo to sync down the crosvm source tree. This is a subset of the entire Chromium OS manifest with just enough repos to build crosvm.

mkdir crosvm
cd crosvm
repo init -g crosvm -u https://chromium.googlesource.com/chromiumos/manifest.git --repo-url=https://chromium.googlesource.com/external/repo.git
repo sync

A basic crosvm build links against libcap. On a Debian-based system, you can install libcap-dev.

Handy Debian one-liner for all build and runtime deps, particularly if you're running Crostini:

sudo apt install build-essential libcap-dev libgbm-dev libvirglrenderer-dev libwayland-bin libwayland-dev pkg-config protobuf-compiler python wayland-protocols

Known issues:

  • Seccomp policy files have hardcoded absolute paths. You can either fix up the paths locally, or set up an awesome hacky symlink: sudo mkdir /usr/share/policy && sudo ln -s /path/to/crosvm/seccomp/x86_64 /usr/share/policy/crosvm. We'll eventually build the precompiled policies into the crosvm binary.
  • Devices can't be jailed if /var/empty doesn't exist. sudo mkdir -p /var/empty to work around this for now.
  • You need read/write permissions for /dev/kvm to run tests or other crosvm instances. Usually it's owned by the kvm group, so sudo usermod -a -G kvm $USER and then log out and back in again to fix this.
  • Some other features (networking) require CAP_NET_ADMIN so those usually need to be run as root.

And that's it! You should be able to cargo build/run/test.

Usage

To see the usage information for your version of crosvm, run crosvm or crosvm run --help.

Boot a Kernel

To run a very basic VM with just a kernel and default devices:

$ crosvm run "${KERNEL_PATH}"

The uncompressed kernel image, also known as vmlinux, can be found in your kernel build directory; in the case of x86, it is at arch/x86/boot/compressed/vmlinux.

Rootfs

With a disk image

In most cases, you will want to give the VM a virtual block device to use as a root file system:

$ crosvm run -r "${ROOT_IMAGE}" "${KERNEL_PATH}"

The root image must be a path to a disk image formatted in a way that the kernel can read. Typically this is a squashfs image made with mksquashfs or an ext4 image made with mkfs.ext4. The -r argument automatically tells the kernel to use that image as the root, and it may therefore only be given once. More disks can be given with -d or --rwdisk if a writable disk is desired.

To run crosvm with a writable rootfs:

WARNING: Writable disks are at risk of corruption by a malicious or malfunctioning guest OS.

crosvm run --rwdisk "${ROOT_IMAGE}" -p "root=/dev/vda" vmlinux

NOTE: If more disk arguments are added prior to the desired rootfs image, the root=/dev/vda parameter must be adjusted to the appropriate letter.

With virtiofs

Linux kernel 5.4+ is required for using virtiofs. This is convenient for testing. The file system must be named "mtd*" or "ubi*": the kernel mounts root device names starting with "mtd" or "ubi" directly by name instead of resolving them to a block device, which is what lets virtiofs serve as the rootfs.

crosvm run --shared-dir "/:mtdfake:type=fs:cache=always" \
    -p "rootfstype=virtiofs root=mtdfake" vmlinux

Control Socket

If the control socket was enabled with -s, the main process can be controlled while crosvm is running. To tell crosvm to stop and exit, for example:

NOTE: If the socket path given is for a directory, a socket name underneath that path will be generated based on crosvm's PID.

$ crosvm run -s /run/crosvm.sock ${USUAL_CROSVM_ARGS}
    <in another shell>
$ crosvm stop /run/crosvm.sock

WARNING: The guest OS will not be notified or gracefully shutdown.

This will cause the original crosvm process to exit in an orderly fashion, allowing it to clean up any OS resources that might have stuck around if crosvm were terminated early.

Multiprocess Mode

By default crosvm runs in multiprocess mode. Each device that supports running inside of a sandbox will run in a jailed child process of crosvm. The appropriate minijail seccomp policy files must be present either in /usr/share/policy/crosvm or in the path specified by the --seccomp-policy-dir argument. The sandbox can be disabled for testing with the --disable-sandbox option.
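For example, an illustrative invocation pointing at a local checkout's policy files (the path here is a placeholder for wherever your checkout lives):

$ crosvm run --seccomp-policy-dir "/path/to/crosvm/seccomp/x86_64" ${USUAL_CROSVM_ARGS} vmlinux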

Virtio Wayland

Virtio Wayland support requires special support on the part of the guest and as such is unlikely to work out of the box unless you are using a Chrome OS kernel along with a termina rootfs.

To use it, ensure that the XDG_RUNTIME_DIR environment variable is set and that the path $XDG_RUNTIME_DIR/wayland-0 points to the socket of the Wayland compositor you would like the guest to use.

GDB Support

crosvm supports the GDB Remote Serial Protocol, which allows developers to debug the guest kernel via GDB.

You can enable the feature with the --gdb flag:

# Use uncompressed vmlinux
$ crosvm run --gdb <port> ${USUAL_CROSVM_ARGS} vmlinux

Then, you can start GDB in another shell.

$ gdb vmlinux
(gdb) target remote :<port>
(gdb) hbreak start_kernel
(gdb) c
<start booting in the other shell>

For general techniques for debugging the Linux kernel via GDB, see this kernel documentation.

Defaults

The following are crosvm's default arguments and how to override them.

  • 256MB of memory (set with -m)
  • 1 virtual CPU (set with -c)
  • no block devices (set with -r, -d, or --rwdisk)
  • no network (set with --host_ip, --netmask, and --mac)
  • virtio wayland support if the XDG_RUNTIME_DIR environment variable is set (disable with --no-wl)
  • only the kernel arguments necessary to run with the supported devices (add more with -p)
  • run in multiprocess mode (run in single process mode with --disable-sandbox)
  • no control socket (set with -s)
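As an illustrative sketch (flag values are arbitrary), the following overrides several of these defaults at once, booting with 2 virtual CPUs, 1024 MB of memory, a control socket, and a writable root disk:

$ crosvm run -c 2 -m 1024 -s /run/crosvm.sock --rwdisk "${ROOT_IMAGE}" -p "root=/dev/vda" vmlinux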

System Requirements

A Linux kernel with KVM support (check for /dev/kvm) is required to run crosvm. In order to run certain devices, there are additional system requirements:

  • virtio-wayland - The memfd_create syscall, introduced in Linux 3.17, and a Wayland compositor.
  • vsock - Host Linux kernel with vhost-vsock support, introduced in Linux 4.8.
  • multiprocess - Host Linux kernel with seccomp-bpf and Linux namespacing support.
  • virtio-net - Host Linux kernel with TUN/TAP support (check for /dev/net/tun) and running with CAP_NET_ADMIN privileges.

Emulated Devices

  • CMOS/RTC - Used to get the current calendar time.
  • i8042 - Used by the guest kernel to exit crosvm.
  • serial - x86 I/O port driven serial devices that print to stdout and take input from stdin.
  • virtio-block - Basic read/write block device.
  • virtio-net - Device to interface the host and guest networks.
  • virtio-rng - Entropy source used to seed the guest OS's entropy pool.
  • virtio-vsock - Enables VSOCK sockets for the guest.
  • virtio-wayland - Allows the guest to use the host's Wayland socket.

Contributing

Code Health

test_all

Crosvm provides docker containers to build and run tests for both x86_64 and aarch64, which can be run with the ./test_all script. See ci/README.md for more details on how to use the containers for local development.

rustfmt

All code should be formatted with rustfmt. We have a script that applies rustfmt to all Rust code in the crosvm repo: please run bin/fmt before checking in a change. This is different from cargo fmt --all, which formats multiple crates but only a single workspace; crosvm consists of multiple workspaces.

clippy

The clippy linter is used to check for common Rust problems. The crosvm project uses a specific set of clippy checks; please run bin/clippy before checking in a change.

Dependencies

With a few exceptions, external dependencies inside the Cargo.toml files are not allowed, because community-made crates tend to explode the binary size by pulling in dozens of transitive dependencies. Every dependency must also be reviewed to ensure its suitability for the crosvm project. Currently allowed crates are:

  • cc - Build time dependency needed to build C source code used in crosvm.
  • libc - Required to use the standard library, this crate is a simple wrapper around libc's symbols.

Code Overview

The crosvm source code is written in Rust and C. To build, crosvm generally requires the most recent stable version of rustc.

Source code is organized into crates, each with their own unit tests. These crates are:

  • crosvm - The top-level binary front-end for using crosvm.
  • devices - Virtual devices exposed to the guest OS.
  • kernel_loader - Loads elf64 kernel files to a slice of memory.
  • kvm_sys - Low-level (mostly) auto-generated structures and constants for using KVM.
  • kvm - Unsafe, low-level wrapper code for using kvm_sys.
  • net_sys - Low-level (mostly) auto-generated structures and constants for creating TUN/TAP devices.
  • net_util - Wrapper for creating TUN/TAP devices.
  • sys_util - Mostly safe wrappers for small system facilities such as eventfd or syslog.
  • syscall_defines - Lists of syscall numbers in each architecture used to make syscalls not supported in libc.
  • vhost - Wrappers for creating vhost based devices.
  • virtio_sys - Low-level (mostly) auto-generated structures and constants for interfacing with kernel vhost support.
  • vm_control - IPC for the VM.
  • x86_64 - Support code specific to 64-bit Intel machines.

The seccomp folder contains minijail seccomp policy files for each sandboxed device. Because some syscalls vary by architecture, the seccomp policies are split by architecture.