Message ID | 20210818001210.4073390-5-seanjc@google.com |
---|---|
State | New |
Headers | show |
Series | KVM: rseq: Fix and a test for a KVM+rseq bug | expand |
----- On Aug 17, 2021, at 8:12 PM, Sean Christopherson seanjc@google.com wrote: > Add a test to verify an rseq's CPU ID is updated correctly if the task is > migrated while the kernel is handling KVM_RUN. This is a regression test > for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer > to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM > without updating rseq, leading to a stale CPU ID and other badness. > > Signed-off-by: Sean Christopherson <seanjc@google.com> > --- [...] > + > +static void *migration_worker(void *ign) > +{ > + cpu_set_t allowed_mask; > + int r, i, nr_cpus, cpu; > + > + CPU_ZERO(&allowed_mask); > + > + nr_cpus = CPU_COUNT(&possible_mask); > + > + for (i = 0; i < 20000; i++) { > + cpu = i % nr_cpus; > + if (!CPU_ISSET(cpu, &possible_mask)) > + continue; > + > + CPU_SET(cpu, &allowed_mask); > + > + r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask); > + TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)", errno, > + strerror(errno)); > + > + CPU_CLR(cpu, &allowed_mask); > + > + usleep(10); > + } > + done = true; > + return NULL; > +} > + > +int main(int argc, char *argv[]) > +{ > + struct kvm_vm *vm; > + u32 cpu, rseq_cpu; > + int r; > + > + /* Tell stdout not to buffer its content */ > + setbuf(stdout, NULL); > + > + r = sched_getaffinity(0, sizeof(possible_mask), &possible_mask); > + TEST_ASSERT(!r, "sched_getaffinity failed, errno = %d (%s)", errno, > + strerror(errno)); > + > + if (CPU_COUNT(&possible_mask) < 2) { > + print_skip("Only one CPU, task migration not possible\n"); > + exit(KSFT_SKIP); > + } > + > + sys_rseq(0); > + > + /* > + * Create and run a dummy VM that immediately exits to userspace via > + * GUEST_SYNC, while concurrently migrating the process by setting its > + * CPU affinity. > + */ > + vm = vm_create_default(VCPU_ID, 0, guest_code); > + > + pthread_create(&migration_thread, NULL, migration_worker, 0); > + > + while (!done) { > + vcpu_run(vm, VCPU_ID); > + TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC, > + "Guest failed?"); > + > + cpu = sched_getcpu(); > + rseq_cpu = READ_ONCE(__rseq.cpu_id); > + > + /* > + * Verify rseq's CPU matches sched's CPU, and that sched's CPU > + * is stable. This doesn't handle the case where the task is > + * migrated between sched_getcpu() and reading rseq, and again > + * between reading rseq and sched_getcpu(), but in practice no > + * false positives have been observed, while on the other hand > + * blocking migration while this thread reads CPUs messes with > + * the timing and prevents hitting failures on a buggy kernel. > + */ I think you could get a stable cpu id between sched_getcpu and __rseq_abi.cpu_id if you add a pthread mutex to protect: sched_getcpu and __rseq_abi.cpu_id reads vs sched_setaffinity calls within the migration thread. Thoughts ? Thanks, Mathieu > + TEST_ASSERT(rseq_cpu == cpu || cpu != sched_getcpu(), > + "rseq CPU = %d, sched CPU = %d\n", rseq_cpu, cpu); > + } > + > + pthread_join(migration_thread, NULL); > + > + kvm_vm_free(vm); > + > + sys_rseq(RSEQ_FLAG_UNREGISTER); > + > + return 0; > +} > -- > 2.33.0.rc1.237.g0d66db33f3-goog -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
On Thu, Aug 19, 2021, Mathieu Desnoyers wrote: > ----- On Aug 17, 2021, at 8:12 PM, Sean Christopherson seanjc@google.com wrote: > > > Add a test to verify an rseq's CPU ID is updated correctly if the task is > > migrated while the kernel is handling KVM_RUN. This is a regression test > > for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer > > to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM > > without updating rseq, leading to a stale CPU ID and other badness. > > > > Signed-off-by: Sean Christopherson <seanjc@google.com> > > --- > > [...] > > > + while (!done) { > > + vcpu_run(vm, VCPU_ID); > > + TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC, > > + "Guest failed?"); > > + > > + cpu = sched_getcpu(); > > + rseq_cpu = READ_ONCE(__rseq.cpu_id); > > + > > + /* > > + * Verify rseq's CPU matches sched's CPU, and that sched's CPU > > + * is stable. This doesn't handle the case where the task is > > + * migrated between sched_getcpu() and reading rseq, and again > > + * between reading rseq and sched_getcpu(), but in practice no > > + * false positives have been observed, while on the other hand > > + * blocking migration while this thread reads CPUs messes with > > + * the timing and prevents hitting failures on a buggy kernel. > > + */ > > I think you could get a stable cpu id between sched_getcpu and __rseq_abi.cpu_id > if you add a pthread mutex to protect: > > sched_getcpu and __rseq_abi.cpu_id reads > > vs > > sched_setaffinity calls within the migration thread. > > Thoughts ? I tried that and couldn't reproduce the bug. That's what I attempted to call out in the blurb "blocking migration while this thread reads CPUs ... prevents hitting failures on a buggy kernel". I considered adding arbitrary delays around the mutex to try and hit the bug, but I was worried that even if I got it "working" for this bug, the test would be too tailored to this bug and potentially miss future regression. Letting the two threads run wild seemed like it would provide the best coverage, at the cost of potentially causing to false failures.
----- On Aug 19, 2021, at 7:33 PM, Sean Christopherson seanjc@google.com wrote: > On Thu, Aug 19, 2021, Mathieu Desnoyers wrote: >> ----- On Aug 17, 2021, at 8:12 PM, Sean Christopherson seanjc@google.com wrote: >> >> > Add a test to verify an rseq's CPU ID is updated correctly if the task is >> > migrated while the kernel is handling KVM_RUN. This is a regression test >> > for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer >> > to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM >> > without updating rseq, leading to a stale CPU ID and other badness. >> > >> > Signed-off-by: Sean Christopherson <seanjc@google.com> >> > --- >> >> [...] >> >> > + while (!done) { >> > + vcpu_run(vm, VCPU_ID); >> > + TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC, >> > + "Guest failed?"); >> > + >> > + cpu = sched_getcpu(); >> > + rseq_cpu = READ_ONCE(__rseq.cpu_id); >> > + >> > + /* >> > + * Verify rseq's CPU matches sched's CPU, and that sched's CPU >> > + * is stable. This doesn't handle the case where the task is >> > + * migrated between sched_getcpu() and reading rseq, and again >> > + * between reading rseq and sched_getcpu(), but in practice no >> > + * false positives have been observed, while on the other hand >> > + * blocking migration while this thread reads CPUs messes with >> > + * the timing and prevents hitting failures on a buggy kernel. >> > + */ >> >> I think you could get a stable cpu id between sched_getcpu and __rseq_abi.cpu_id >> if you add a pthread mutex to protect: >> >> sched_getcpu and __rseq_abi.cpu_id reads >> >> vs >> >> sched_setaffinity calls within the migration thread. >> >> Thoughts ? > > I tried that and couldn't reproduce the bug. That's what I attempted to call > out > in the blurb "blocking migration while this thread reads CPUs ... prevents > hitting > failures on a buggy kernel". > > I considered adding arbitrary delays around the mutex to try and hit the bug, > but > I was worried that even if I got it "working" for this bug, the test would be > too > tailored to this bug and potentially miss future regression. Letting the two > threads run wild seemed like it would provide the best coverage, at the cost of > potentially causing to false failures. OK, so your point is that using mutual exclusion to ensure stability of the cpu id changes the timings too much, to a point where the issues don't reproduce. I understand that this mutex ties the migration thread timings to the vcpu thread's use of the mutex, which will reduce timings randomness, which is unwanted here. I still really hate flakiness in tests, because then people stop caring when they fail once in a while. And with the nature of rseq, a once-in-a-while failure is a big deal. Let's see if we can use other tricks to ensure stability of the cpu id without changing timings too much. One idea would be to use a seqcount lock. But even if we use that, I'm concerned that the very long writer critical section calling sched_setaffinity would need to be alternated with a sleep to ensure the read-side progresses. The sleep delay could be relatively small compared to the duration of the sched_setaffinity call, e.g. ratio 1:10. static volatile uint64_t seqcnt; The thread responsible for setting the affinity would do something like: for (;;) { atomic_inc_seq_cst(&seqcnt); sched_setaffinity(..., n++ % nr_cpus); atomic_inc_seq_cst(&seqcnt); usleep(1); /* this is where read-side is allowed to progress. */ } And the thread reading the rseq cpu id and calling sched_getcpu(): uint64_t snapshot; do { snapshot = atomic_load(&seqcnt) & ~1; /* force retry if odd */ smp_rmb(); cpu = sched_getcpu(); rseq_cpu = READ_ONCE(__rseq.cpu_id); smp_rmb(); } while (snapshot != atomic_load(&seqcnt)); So the reader retry the cpu id reads whenever sched_setaffinity is being called by the migration thread, and whenever it is preempted for more than one migration thread loop. That should achieve our goal of providing cpu id stability without significantly changing the timings of the migration thread, given that it never blocks waiting for the reader. Thoughts ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
On Fri, Aug 20, 2021, Mathieu Desnoyers wrote: > I still really hate flakiness in tests, because then people stop caring when they > fail once in a while. And with the nature of rseq, a once-in-a-while failure is a > big deal. Let's see if we can use other tricks to ensure stability of the cpu id > without changing timings too much. Yeah, zero agrument regarding flaky tests. > One idea would be to use a seqcount lock. A sequence counter did the trick! Thanks much! > But even if we use that, I'm concerned that the very long writer critical > section calling sched_setaffinity would need to be alternated with a sleep to > ensure the read-side progresses. The sleep delay could be relatively small > compared to the duration of the sched_setaffinity call, e.g. ratio 1:10. I already had an arbitrary usleep(10) to let the reader make progress between sched_setaffinity() calls. Dropping it down to 1us didn't affect reproducibility, so I went with that to shave those precious cycles :-) Eliminating the delay entirely did result in no repro, which was a nice confirmation that it's needed to let the reader get back into KVM_RUN. Thanks again!
diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore index 0709af0144c8..6d031ff6b68e 100644 --- a/tools/testing/selftests/kvm/.gitignore +++ b/tools/testing/selftests/kvm/.gitignore @@ -47,6 +47,7 @@ /kvm_page_table_test /memslot_modification_stress_test /memslot_perf_test +/rseq_test /set_memory_region_test /steal_time /kvm_binary_stats_test diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index 5832f510a16c..0756e79cb513 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -80,6 +80,7 @@ TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus TEST_GEN_PROGS_x86_64 += kvm_page_table_test TEST_GEN_PROGS_x86_64 += memslot_modification_stress_test TEST_GEN_PROGS_x86_64 += memslot_perf_test +TEST_GEN_PROGS_x86_64 += rseq_test TEST_GEN_PROGS_x86_64 += set_memory_region_test TEST_GEN_PROGS_x86_64 += steal_time TEST_GEN_PROGS_x86_64 += kvm_binary_stats_test @@ -92,6 +93,7 @@ TEST_GEN_PROGS_aarch64 += dirty_log_test TEST_GEN_PROGS_aarch64 += dirty_log_perf_test TEST_GEN_PROGS_aarch64 += kvm_create_max_vcpus TEST_GEN_PROGS_aarch64 += kvm_page_table_test +TEST_GEN_PROGS_aarch64 += rseq_test TEST_GEN_PROGS_aarch64 += set_memory_region_test TEST_GEN_PROGS_aarch64 += steal_time TEST_GEN_PROGS_aarch64 += kvm_binary_stats_test @@ -103,6 +105,7 @@ TEST_GEN_PROGS_s390x += demand_paging_test TEST_GEN_PROGS_s390x += dirty_log_test TEST_GEN_PROGS_s390x += kvm_create_max_vcpus TEST_GEN_PROGS_s390x += kvm_page_table_test +TEST_GEN_PROGS_s390x += rseq_test TEST_GEN_PROGS_s390x += set_memory_region_test TEST_GEN_PROGS_s390x += kvm_binary_stats_test diff --git a/tools/testing/selftests/kvm/rseq_test.c b/tools/testing/selftests/kvm/rseq_test.c new file mode 100644 index 000000000000..90ed535eded7 --- /dev/null +++ b/tools/testing/selftests/kvm/rseq_test.c @@ -0,0 +1,131 @@ +// SPDX-License-Identifier: GPL-2.0-only +#define _GNU_SOURCE /* for program_invocation_short_name */ +#include <errno.h> +#include <fcntl.h> +#include <pthread.h> +#include <sched.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <signal.h> +#include <syscall.h> +#include <sys/ioctl.h> +#include <linux/rseq.h> +#include <linux/unistd.h> + +#include "kvm_util.h" +#include "processor.h" +#include "test_util.h" + +#define VCPU_ID 0 + +static __thread volatile struct rseq __rseq = { + .cpu_id = RSEQ_CPU_ID_UNINITIALIZED, +}; + +#define RSEQ_SIG 0xdeadbeef + +static pthread_t migration_thread; +static cpu_set_t possible_mask; +static bool done; + +static void guest_code(void) +{ + for (;;) + GUEST_SYNC(0); +} + +static void sys_rseq(int flags) +{ + int r; + + r = syscall(__NR_rseq, &__rseq, sizeof(__rseq), flags, RSEQ_SIG); + TEST_ASSERT(!r, "rseq failed, errno = %d (%s)", errno, strerror(errno)); +} + +static void *migration_worker(void *ign) +{ + cpu_set_t allowed_mask; + int r, i, nr_cpus, cpu; + + CPU_ZERO(&allowed_mask); + + nr_cpus = CPU_COUNT(&possible_mask); + + for (i = 0; i < 20000; i++) { + cpu = i % nr_cpus; + if (!CPU_ISSET(cpu, &possible_mask)) + continue; + + CPU_SET(cpu, &allowed_mask); + + r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask); + TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)", errno, + strerror(errno)); + + CPU_CLR(cpu, &allowed_mask); + + usleep(10); + } + done = true; + return NULL; +} + +int main(int argc, char *argv[]) +{ + struct kvm_vm *vm; + u32 cpu, rseq_cpu; + int r; + + /* Tell stdout not to buffer its content */ + setbuf(stdout, NULL); + + r = sched_getaffinity(0, sizeof(possible_mask), &possible_mask); + TEST_ASSERT(!r, "sched_getaffinity failed, errno = %d (%s)", errno, + strerror(errno)); + + if (CPU_COUNT(&possible_mask) < 2) { + print_skip("Only one CPU, task migration not possible\n"); + exit(KSFT_SKIP); + } + + sys_rseq(0); + + /* + * Create and run a dummy VM that immediately exits to userspace via + * GUEST_SYNC, while concurrently migrating the process by setting its + * CPU affinity. + */ + vm = vm_create_default(VCPU_ID, 0, guest_code); + + pthread_create(&migration_thread, NULL, migration_worker, 0); + + while (!done) { + vcpu_run(vm, VCPU_ID); + TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC, + "Guest failed?"); + + cpu = sched_getcpu(); + rseq_cpu = READ_ONCE(__rseq.cpu_id); + + /* + * Verify rseq's CPU matches sched's CPU, and that sched's CPU + * is stable. This doesn't handle the case where the task is + * migrated between sched_getcpu() and reading rseq, and again + * between reading rseq and sched_getcpu(), but in practice no + * false positives have been observed, while on the other hand + * blocking migration while this thread reads CPUs messes with + * the timing and prevents hitting failures on a buggy kernel. + */ + TEST_ASSERT(rseq_cpu == cpu || cpu != sched_getcpu(), + "rseq CPU = %d, sched CPU = %d\n", rseq_cpu, cpu); + } + + pthread_join(migration_thread, NULL); + + kvm_vm_free(vm); + + sys_rseq(RSEQ_FLAG_UNREGISTER); + + return 0; +}
Add a test to verify an rseq's CPU ID is updated correctly if the task is migrated while the kernel is handling KVM_RUN. This is a regression test for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM without updating rseq, leading to a stale CPU ID and other badness. Signed-off-by: Sean Christopherson <seanjc@google.com> --- tools/testing/selftests/kvm/.gitignore | 1 + tools/testing/selftests/kvm/Makefile | 3 + tools/testing/selftests/kvm/rseq_test.c | 131 ++++++++++++++++++++++++ 3 files changed, 135 insertions(+) create mode 100644 tools/testing/selftests/kvm/rseq_test.c