crosvm/cros_async
Chirantan Ekbote dcbf1652a4 sync: Don't transfer waiters from Condvar -> Mutex
A performance optimization should never change the observable behavior
and yet that's what this one did. Canceling a `cv.wait()` call after
the waiter was already transferred to the Mutex's wait list should still
result in us waking up the next waiter in the Condvar's wait list.
Instead, the `cancel_after_transfer` test was checking for the opposite
behavior.

Additionally, the transfer was racy with concurrent cancellation.
Consider the following sequence of events:

Thread A                            Thread B
--------                            --------

drop WaitFuture                     cv.notify_all()
waiter.cancel.lock()                raw_mutex.transfer_waiters()
c = cancel.c
data = cancel.data
waiter.cancel.unlock()
                                    waiter.cancel.lock()
                                    cancel.c = mu_cancel_waiter
                                    cancel.data = mutex_ptr
                                    waiter.cancel.unlock()
                                    waiter.is_waiting_for = Mutex
                                    mu.unlock_slow()
                                    get_wake_list()
                                    waiter.is_waiting_for = None
                                    waiter.wake()
c(data, waiter, false)
cancel_waiter(cv, waiter, false)
waiter.is_waiting_for == None
get_wake_list

There are 2 issues in the above sequence:

1. Thread A has stale information about the state of the waiter.  Since
   the waiter was woken, it needs to set `wake_next` in the cancel
   function to true but instead incorrectly sets it to false.  By
   itself, this isn't that big of an issue because the cancel function
   also checks if the waiter was already removed from the wait
   list (i.e., it was woken up) but that check is problematic because of
   the next issue.
2. The Condvar's cancel function can detect when a waiter has been moved
   to the Mutex's wait list (waiter.is_waiting_for == Mutex) and can
   request to retry the cancellation.  However, when
   waiter.is_waiting_for == None (which means it was removed from the
   wait list), it doesn't know whether the waiter was woken up from the
   Mutex's wait list or the Condvar's wait list.  It incorrectly assumes
   that the waiter was in the Condvar's wait list and does not retry the
   cancel.  As a result, the Mutex's cancel function is never called,
   which means any waiters still in the Mutex's wait list will never get
   woken up.

I haven't been able to come up with a way to fix these issues without
making everything way more complicated so for now let's just drop the
transfer optimization.

The initial motivation for this optimization was to avoid having to make
a FUTEX_WAKE syscall for every thread that needs to be woken up and to
avoid a thundering herd problem where the newly woken up threads all
cause a bunch of contention on the mutex.  However, waking up futures
tends to be cheaper than waking up a whole thread.  If no executor
threads are blocked then it doesn't even involve making a syscall as the
executor will simply add the future to its ready list.  Additionally,
it's unlikely that multi-threaded executors will have more threads than
the # of cpus on the system so that should also reduce the amount of
contention on the mutex.

If this code starts showing up as a hotspot in perf traces then we
should consider figuring out a way to re-enable this optimization.

BUG=chromium:1157860
TEST=unit tests.  Also run the tests in a loop for an hour on a kukui
     and see that it didn't hang

Cq-Depend: chromium:2793844
Change-Id: Iee3861a40c8d9a45d3a01863d804efc82d4467ac
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/platform/crosvm/+/2804867
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Dylan Reid <dgreid@chromium.org>
Reviewed-by: Daniel Verkamp <dverkamp@chromium.org>
Commit-Queue: Chirantan Ekbote <chirantan@chromium.org>
2021-04-06 09:20:22 +00:00
..
src sync: Don't transfer waiters from Condvar -> Mutex 2021-04-06 09:20:22 +00:00
.build_test_skip build_test: misc options, improvements, amd bug fixes 2020-10-15 13:40:32 +00:00
Cargo.toml Fixup Cargo.toml for cros_async and io_uring 2021-04-01 03:32:58 +00:00