crosvm

mirror of https://chromium.googlesource.com/crosvm/crosvm synced 2025-02-05 18:20:34 +00:00

History

Chirantan Ekbote dcbf1652a4 sync: Don't transfer waiters from Condvar -> Mutex A performance optimization should never change the observable behavior and yet that's what this one did. Canceling a `cv.wait()` call after the waiter was already transferred to the Mutex's wait list should still result in us waking up the next waiter in the Condvar's wait list. Instead, the `cancel_after_transfer` test was checking for the opposite behavior. Additionally, the transfer was racy with concurrent cancellation. Consider the following sequence of events: Thread A Thread B -------- -------- drop WaitFuture cv.notify_all() waiter.cancel.lock() raw_mutex.transfer_waiters() c = cancel.c data = cancel.data waiter.cancel.unlock() waiter.cancel.lock() cancel.c = mu_cancel_waiter cancel.data = mutex_ptr waiter.cancel.unlock() waiter.is_waiting_for = Mutex mu.unlock_slow() get_wake_list() waiter.is_waiting_for = None waiter.wake() c(data, waiter, false) cancel_waiter(cv, waiter, false) waiter.is_waiting_for == None get_wake_list There are 2 issues in the above sequence: 1. Thread A has stale information about the state of the waiter. Since the waiter was woken, it needs to set `wake_next` in the cancel function to true but instead incorrectly sets it to false. By itself, this isn't that big of an issue because the cancel function also checks if the waiter was already removed from the wait list (i.e., it was woken up) but that check is problematic because of the next issue. 2. The Condvar's cancel function can detect when a waiter has been moved to the Mutex's wait list (waiter.is_waiting_for == Mutex) and can request to retry the cancellation. However, when waiter.is_waiting_for == None (which means it was removed from the wait list), it doesn't know whether the waiter was woken up from the Mutex's wait list or the Condvar's wait list. It incorrectly assumes that the waiter was in the Condvar's wait list and does not retry the cancel. As a result, the Mutex's cancel function is never called, which means any waiters still in the Mutex's wait list will never get woken up. I haven't been able to come up with a way to fix these issues without making everything way more complicated so for now let's just drop the transfer optimization. The initial motivation for this optimization was to avoid having to make a FUTEX_WAKE syscall for every thread that needs to be woken up and to avoid a thundering herd problem where the newly woken up threads all cause a bunch of contention on the mutex. However, waking up futures tends to be cheaper than waking up a whole thread. If no executor threads are blocked then it doesn't even involve making a syscall as the executor will simply add the future to its ready list. Additionally, it's unlikely that multi-threaded executors will have more threads than the # of cpus on the system so that should also reduce the amount of contention on the mutex. If this code starts showing up as a hotspot in perf traces then we should consider figuring out a way to re-enable this optimization. BUG=chromium:1157860 TEST=unit tests. Also run the tests in a loop for an hour on a kukui and see that it didn't hang Cq-Depend: chromium:2793844 Change-Id: Iee3861a40c8d9a45d3a01863d804efc82d4467ac Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/platform/crosvm/+/2804867 Tested-by: kokoro <noreply+kokoro@google.com> Reviewed-by: Dylan Reid <dgreid@chromium.org> Reviewed-by: Daniel Verkamp <dverkamp@chromium.org> Commit-Queue: Chirantan Ekbote <chirantan@chromium.org>		2021-04-06 09:20:22 +00:00
..
src	sync: Don't transfer waiters from Condvar -> Mutex	2021-04-06 09:20:22 +00:00
.build_test_skip	build_test: misc options, improvements, amd bug fixes	2020-10-15 13:40:32 +00:00
Cargo.toml	Fixup Cargo.toml for cros_async and io_uring	2021-04-01 03:32:58 +00:00