* [PATCH] KVM: x86/mmu: Add capability to zap only sptes for the affected memslot
@ 2020-07-03  2:50 Sean Christopherson
  2020-07-08 16:08 ` Paolo Bonzini
  0 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2020-07-03  2:50 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Xiong Zhang, Wayne Boyer,
	Zhenyu Wang, Jun Nakajima

Introduce a new capability, KVM_CAP_MEMSLOT_ZAP_CONTROL, to allow
userspace to control the memslot zapping behavior on a per-VM basis.
x86's default behavior is to zap all SPTEs, including the root shadow
page, across all memslots.  While effective, the nuke and pave approach
isn't exactly performant, especially for large VMs and/or VMs that
heavily utilize RO memslots for MMIO devices, e.g. option ROMs.

On a vanilla VM with 6gb of RAM, the targeted zap reduces the number of
EPT violations during boot by ~14% with THP enabled in the host, and by
~7% with THP disabled in the host.  On a much more custom VM with 32gb
and a significant amount of memslot zapping, this can reduce the number
of EPT violations by 50% during guest boot, and improve boot time by
as much as 25%.

Keep the current x86 memslot zapping behavior as the default, as there's
an unresolved bug that pops up when zapping only the affected memslot,
and the exact conditions that trigger the bug are not fully known.  See
https://patchwork.kernel.org/patch/10798453 for details.

Implement the capability as a set of flags so that other architectures
might be able to use the capability without having to conform to x86's
semantics.

Cc: Xiong Zhang <xiong.y.zhang@intel.com>
Cc: Wayne Boyer <wayne.boyer@intel.com>
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 Documentation/virt/kvm/api.rst  | 21 +++++++++++++++++++++
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/mmu/mmu.c          | 21 ++++++++++++++++++++-
 arch/x86/kvm/x86.c              | 10 ++++++++++
 include/uapi/linux/kvm.h        |  4 ++++
 5 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 426f94582b7a..4b7b48e9a376 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5843,6 +5843,27 @@ controlled by the kvm module parameter halt_poll_ns. This capability allows
 the maximum halt time to specified on a per-VM basis, effectively overriding
 the module parameter for the target VM.
 
+7.21 KVM_CAP_MEMSLOT_ZAP_CONTROL
+--------------------------------
+
+:Architectures: x86
+:Target: VM
+:Parameters: args[0] controls which flags are enabled/disabled
+:Returns: 0 on success; -1 on error
+
+Valid flags are::
+
+  #define KVM_ZAP_ONLY_MEMSLOT_SPTES   (1 << 0)
+
+This capability allows userspace to control the shadow PTE zapping behavior
+when a memslot is deleted or moved via KVM_SET_USER_MEMORY_REGION.  By default,
+x86 zaps all SPTEs across all memslots, which can negatively impact performance
+but may be necessary for functional correctness for certain configurations.
+
+If KVM_ZAP_ONLY_MEMSLOT_SPTES is set, KVM zaps only the leaf/last SPTEs for the
+deleted/moved memslot.  Upper level SPTEs are retained, as are SPTEs for other
+memslots.
+
 8. Other capabilities.
 ======================
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f852ee350beb..6803681238f5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1002,6 +1002,8 @@ struct kvm_arch {
 	bool guest_can_read_msr_platform_info;
 	bool exception_payload_enabled;
 
+	bool zap_only_memslot_sptes;
+
 	struct kvm_pmu_event_filter *pmu_event_filter;
 	struct task_struct *nx_lpage_recovery_thread;
 };
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3dd0af7e7515..45edcf5dcd50 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5806,11 +5806,30 @@ static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
 	return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
 }
 
+static void kvm_mmu_zap_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+	bool flush;
+
+	/*
+	 * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
+	 * case scenario we'll have unused shadow pages lying around until they
+	 * are recycled due to age or when the VM is destroyed.
+	 */
+	spin_lock(&kvm->mmu_lock);
+	flush = slot_handle_all_level(kvm, slot, kvm_zap_rmapp, true);
+	if (flush)
+		kvm_flush_remote_tlbs(kvm);
+	spin_unlock(&kvm->mmu_lock);
+}
+
 static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
 			struct kvm_memory_slot *slot,
 			struct kvm_page_track_notifier_node *node)
 {
-	kvm_mmu_zap_all_fast(kvm);
+	if (kvm->arch.zap_only_memslot_sptes)
+		kvm_mmu_zap_memslot(kvm, slot);
+	else
+		kvm_mmu_zap_all_fast(kvm);
 }
 
 void kvm_mmu_init_vm(struct kvm *kvm)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 00c88c2f34e4..3e07c9e4daac 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3471,6 +3471,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_MSR_PLATFORM_INFO:
 	case KVM_CAP_EXCEPTION_PAYLOAD:
 	case KVM_CAP_SET_GUEST_DEBUG:
+	case KVM_CAP_MEMSLOT_ZAP_CONTROL:
 		r = 1;
 		break;
 	case KVM_CAP_SYNC_REGS:
@@ -4984,6 +4985,15 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		kvm->arch.exception_payload_enabled = cap->args[0];
 		r = 0;
 		break;
+	case KVM_CAP_MEMSLOT_ZAP_CONTROL:
+		r = -EINVAL;
+		if (cap->args[0] & ~(u64)KVM_ZAP_ONLY_MEMSLOT_SPTES)
+			break;
+
+		kvm->arch.zap_only_memslot_sptes = cap->args[0] &
+						   KVM_ZAP_ONLY_MEMSLOT_SPTES;
+		r = 0;
+		break;
 	default:
 		r = -EINVAL;
 		break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 4fdf30316582..5f75c348ceed 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1031,6 +1031,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PPC_SECURE_GUEST 181
 #define KVM_CAP_HALT_POLL 182
 #define KVM_CAP_ASYNC_PF_INT 183
+#define KVM_CAP_MEMSLOT_ZAP_CONTROL 184
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1685,4 +1686,7 @@ struct kvm_hyperv_eventfd {
 #define KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE    (1 << 0)
 #define KVM_DIRTY_LOG_INITIALLY_SET            (1 << 1)
 
+/* Flags for KVM_CAP_MEMSLOT_ZAP_CONTROL */
+#define KVM_ZAP_ONLY_MEMSLOT_SPTES	(1 << 0)
+
 #endif /* __LINUX_KVM_H */
-- 
2.26.0
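
For context, a VMM would opt in to the proposed behavior through the
standard per-VM KVM_ENABLE_CAP path.  A minimal sketch in C, assuming the
UAPI additions from the patch above (the helper name, vm_fd handling and
error checks are illustrative, not part of the patch):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Hypothetical helper: ask KVM to zap only the affected memslot's SPTEs. */
static int enable_memslot_zap_control(int vm_fd)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_MEMSLOT_ZAP_CONTROL,
		.args[0] = KVM_ZAP_ONLY_MEMSLOT_SPTES,
	};

	/* Probe first; kernels without the patch won't report the cap. */
	if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_MEMSLOT_ZAP_CONTROL) <= 0)
		return -1;

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}

This mirrors how other per-VM capabilities handled in the same switch
statement, e.g. KVM_CAP_MSR_PLATFORM_INFO and KVM_CAP_EXCEPTION_PAYLOAD,
are toggled by userspace.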



* Re: [PATCH] KVM: x86/mmu: Add capability to zap only sptes for the affected memslot
  2020-07-03  2:50 [PATCH] KVM: x86/mmu: Add capability to zap only sptes for the affected memslot Sean Christopherson
@ 2020-07-08 16:08 ` Paolo Bonzini
  2020-07-09 21:12   ` Sean Christopherson
  0 siblings, 1 reply; 12+ messages in thread
From: Paolo Bonzini @ 2020-07-08 16:08 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Xiong Zhang, Wayne Boyer, Zhenyu Wang,
	Jun Nakajima

On 03/07/20 04:50, Sean Christopherson wrote:
> Introduce a new capability, KVM_CAP_MEMSLOT_ZAP_CONTROL, to allow
> userspace to control the memslot zapping behavior on a per-VM basis.
> x86's default behavior is to zap all SPTEs, including the root shadow
> page, across all memslots.  While effective, the nuke and pave approach
> isn't exactly performant, especially for large VMs and/or VMs that
> heavily utilize RO memslots for MMIO devices, e.g. option ROMs.
> 
> On a vanilla VM with 6gb of RAM, the targeted zap reduces the number of
> EPT violations during boot by ~14% with THP enabled in the host, and by
> ~7% with THP disabled in the host.  On a much more custom VM with 32gb
> and a significant amount of memslot zapping, this can reduce the number
> of EPT violations by 50% during guest boot, and improve boot time by
> as much as 25%.
> 
> Keep the current x86 memslot zapping behavior as the default, as there's
> an unresolved bug that pops up when zapping only the affected memslot,
> and the exact conditions that trigger the bug are not fully known.  See
> https://patchwork.kernel.org/patch/10798453 for details.
> 
> Implement the capability as a set of flags so that other architectures
> might be able to use the capability without having to conform to x86's
> semantics.

It's bad that we have no clue what's causing the bad behavior, but I
don't think it's wise to have a bug that is known to happen when you
enable the capability. :/

Paolo



* Re: [PATCH] KVM: x86/mmu: Add capability to zap only sptes for the affected memslot
  2020-07-08 16:08 ` Paolo Bonzini
@ 2020-07-09 21:12   ` Sean Christopherson
  2020-07-09 22:18     ` Paolo Bonzini
  0 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2020-07-09 21:12 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Xiong Zhang, Wayne Boyer, Zhenyu Wang,
	Jun Nakajima

On Wed, Jul 08, 2020 at 06:08:24PM +0200, Paolo Bonzini wrote:
> On 03/07/20 04:50, Sean Christopherson wrote:
> > Introduce a new capability, KVM_CAP_MEMSLOT_ZAP_CONTROL, to allow
> > userspace to control the memslot zapping behavior on a per-VM basis.
> > x86's default behavior is to zap all SPTEs, including the root shadow
> > page, across all memslots.  While effective, the nuke and pave approach
> > isn't exactly performant, especially for large VMs and/or VMs that
> > heavily utilize RO memslots for MMIO devices, e.g. option ROMs.
> > 
> > On a vanilla VM with 6gb of RAM, the targeted zap reduces the number of
> > EPT violations during boot by ~14% with THP enabled in the host, and by
> > ~7% with THP disabled in the host.  On a much more custom VM with 32gb
> > and a significant amount of memslot zapping, this can reduce the number
> > of EPT violations by 50% during guest boot, and improve boot time by
> > as much as 25%.
> > 
> > Keep the current x86 memslot zapping behavior as the default, as there's
> > an unresolved bug that pops up when zapping only the affected memslot,
> > and the exact conditions that trigger the bug are not fully known.  See
> > https://patchwork.kernel.org/patch/10798453 for details.
> > 
> > Implement the capability as a set of flags so that other architectures
> > might be able to use the capability without having to conform to x86's
> > semantics.
> 
> It's bad that we have no clue what's causing the bad behavior, but I
> don't think it's wise to have a bug that is known to happen when you
> enable the capability. :/

I don't necessarily disagree, but at the same time it's entirely possible
it's a Qemu bug.  If the bad behavior doesn't occur with other VMMs, those
other VMMs shouldn't be penalized because we can't figure out what Qemu is
getting wrong.

Even if this is a kernel bug, I'm fairly confident at this point that it's
not a KVM bug.  Or rather, if it's a KVM "bug", then there's a fundamental
dependency in memslot management that needs to be rooted out and documented.

And we're kind of in a catch-22; it'll be extremely difficult to narrow down
exactly who is breaking what without being able to easily test the optimized
zapping with other VMMs and/or setups.


* Re: [PATCH] KVM: x86/mmu: Add capability to zap only sptes for the affected memslot
  2020-07-09 21:12   ` Sean Christopherson
@ 2020-07-09 22:18     ` Paolo Bonzini
  2020-07-10  4:29       ` Sean Christopherson
  0 siblings, 1 reply; 12+ messages in thread
From: Paolo Bonzini @ 2020-07-09 22:18 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Xiong Zhang, Wayne Boyer, Zhenyu Wang,
	Jun Nakajima

On 09/07/20 23:12, Sean Christopherson wrote:
>> It's bad that we have no clue what's causing the bad behavior, but I
>> don't think it's wise to have a bug that is known to happen when you
>> enable the capability. :/

(Note that this wasn't a NACK, though subtly so).

> I don't necessarily disagree, but at the same time it's entirely possible
> it's a Qemu bug.

No, it cannot be.  QEMU is not doing anything but
KVM_SET_USER_MEMORY_REGION, and it's doing that synchronously with
writes to the PCI configuration space BARs.

> Even if this is a kernel bug, I'm fairly confident at this point that it's
> not a KVM bug.  Or rather, if it's a KVM "bug", then there's a fundamental
> dependency in memslot management that needs to be rooted out and documented.

Heh, here my surmise is that it cannot be anything but a KVM bug,
because memslots are not used by anything outside KVM...  But maybe I'm
missing something.

> And we're kind of in a catch-22; it'll be extremely difficult to narrow down
> exactly who is breaking what without being able to easily test the optimized
> zapping with other VMMs and/or setups.

I agree with this, and we could have a config symbol that depends on
BROKEN and enables it unconditionally.  However a capability is the
wrong tool.

Paolo



* Re: [PATCH] KVM: x86/mmu: Add capability to zap only sptes for the affected memslot
  2020-07-09 22:18     ` Paolo Bonzini
@ 2020-07-10  4:29       ` Sean Christopherson
  2020-07-13 18:22         ` Alex Williamson
  0 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2020-07-10  4:29 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Xiong Zhang, Wayne Boyer, Zhenyu Wang,
	Jun Nakajima, Alex Williamson

[-- Attachment #1: Type: text/plain, Size: 4238 bytes --]

+Alex, whom I completely spaced on Cc'ing.

Alex, this is related to the dreaded VFIO memslot zapping issue from last
year.  Start of thread: https://patchwork.kernel.org/patch/11640719/.

The TL;DR of below: can you try the attached patch with your reproducer
from the original bug[*]?  I honestly don't know whether it has a legitimate
chance of working, but it's the one thing in all of this that I know was
definitely a bug.  I'd like to test it out if only to sate my curiosity.
Absolutely no rush.

[*] https://patchwork.kernel.org/patch/10798453/#22817321

On Fri, Jul 10, 2020 at 12:18:17AM +0200, Paolo Bonzini wrote:
> On 09/07/20 23:12, Sean Christopherson wrote:
> >> It's bad that we have no clue what's causing the bad behavior, but I
> >> don't think it's wise to have a bug that is known to happen when you
> >> enable the capability. :/
> 
> (Note that this wasn't a NACK, though subtly so).
> 
> > I don't necessarily disagree, but at the same time it's entirely possible
> > it's a Qemu bug.
> 
> No, it cannot be.  QEMU is not doing anything but
> KVM_SET_USER_MEMORY_REGION, and it's doing that synchronously with
> writes to the PCI configuration space BARs.

I'm not saying it's likely, but it's certainly possible.  The failure
went away when KVM zapped SPTEs for seemingly unrelated addresses, i.e. the
error likely goes beyond just the memslot aspect.

> > Even if this is a kernel bug, I'm fairly confident at this point that it's
> > not a KVM bug.  Or rather, if it's a KVM "bug", then there's a fundamental
> > dependency in memslot management that needs to be rooted out and documented.
> 
> Heh, here my surmise is that it cannot be anything but a KVM bug,
> because memslots are not used by anything outside KVM...  But maybe I'm
> missing something.

As above, it's not really a memslot issue, it's more of a paging/TLB issue,
or possibly none of the above.  E.g. it could be a timing bug that goes away
simply because zapping and rebuilding slows things down to the point where
the timing window is closed.

I should have qualified "fairly confident ... that it's not a KVM bug" as
"not a KVM bug related to removing SPTEs for the deleted/moved memslot _as
implemented in this patch_".

Digging back through the old thread, I don't think we ever tried passing
%true for @lock_flush_tlb when zapping rmaps.  And a comment from Alex also
caught my eye, where he said of the following: "If anything, removing this
chunk seems to make things worse."

	if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
		kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);
		flush = false;
		cond_resched_lock(&kvm->mmu_lock);
	}

A somewhat far fetched theory is that passing %false to kvm_zap_rmapp()
via slot_handle_all_level() created a window where a vCPU could have both
the old stale entry and the new memslot entry in its TLB if the equivalent
to above lock dropping in slot_handle_level_range() triggered.

Removing the above intermediate flush would exacerbate the theoretical
problem by further delaying the flush, i.e. would create a bigger window
for the guest to access the stale SPTE.
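
For reference, the lock-dropping in question happens in
slot_handle_level_range().  A simplified sketch of that era's logic, from
memory (the exact flush helper and bookkeeping may differ):

	for_each_slot_rmap_range(memslot, start_level, end_level, start_gfn,
				 end_gfn, &iterator) {
		if (iterator.rmap)
			flush |= fn(kvm, iterator.rmap);

		if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
			/*
			 * With @lock_flush_tlb == false, mmu_lock is dropped
			 * without flushing, i.e. a vCPU can still have the
			 * zapped translation in its TLB while another task
			 * grabs mmu_lock.
			 */
			if (flush && lock_flush_tlb) {
				kvm_flush_remote_tlbs(kvm);
				flush = false;
			}
			cond_resched_lock(&kvm->mmu_lock);
		}
	}

Passing %true for @lock_flush_tlb, as the attached patch does via
slot_handle_all_level(), forces the remote TLB flush before mmu_lock is
dropped, which is what would close the theoretical window described above.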

Where things get really far fetched is how zapping seemingly random SPTEs
fits in.  Best crazy guess is that zapping enough random things while holding
MMU lock would eventually zap a SPTE that caused the guest to block in the
EPT violation handler.

I'm not exactly confident that the correct zapping approach will actually
resolve the VFIO issue, but I think it's worth trying since not flushing
during rmap zapping was definitely a bug.

> > And we're kind of in a catch-22; it'll be extremely difficult to narrow down
> > exactly who is breaking what without being able to easily test the optimized
> > zapping with other VMMs and/or setups.
> 
> I agree with this, and we could have a config symbol that depends on
> BROKEN and enables it unconditionally.  However a capability is the
> wrong tool.

Ya, a capability is a bad idea.  I was coming at it from the angle that, if
there is a fundamental requirement with e.g. GPU passthrough that requires
zapping all SPTEs, then enabling the precise capability on a per-VM basis
would make sense.  But adding something to the ABI on pure speculation is
silly.

[-- Attachment #2: 0001-KVM-x86-mmu-Zap-only-relevant-last-leaf-sptes-when-r.patch --]
[-- Type: text/x-diff, Size: 1249 bytes --]

From b68a2e6095d76574322ce7cf6e63406436fef36d Mon Sep 17 00:00:00 2001
From: Sean Christopherson <sean.j.christopherson@intel.com>
Date: Thu, 9 Jul 2020 21:25:11 -0700
Subject: [PATCH] KVM: x86/mmu: Zap only relevant last/leaf sptes when removing
 a memslot

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3dd0af7e75151..9f468337f832c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5810,7 +5810,18 @@ static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
 			struct kvm_memory_slot *slot,
 			struct kvm_page_track_notifier_node *node)
 {
-	kvm_mmu_zap_all_fast(kvm);
+	bool flush;
+
+	/*
+	 * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
+	 * case scenario we'll have unused shadow pages lying around until they
+	 * are recycled due to age or when the VM is destroyed.
+	 */
+	spin_lock(&kvm->mmu_lock);
+	flush = slot_handle_all_level(kvm, slot, kvm_zap_rmapp, true);
+	if (flush)
+		kvm_flush_remote_tlbs(kvm);
+	spin_unlock(&kvm->mmu_lock);
 }
 
 void kvm_mmu_init_vm(struct kvm *kvm)
-- 
2.26.0



* Re: [PATCH] KVM: x86/mmu: Add capability to zap only sptes for the affected memslot
  2020-07-10  4:29       ` Sean Christopherson
@ 2020-07-13 18:22         ` Alex Williamson
  2020-07-13 19:06           ` Sean Christopherson
  0 siblings, 1 reply; 12+ messages in thread
From: Alex Williamson @ 2020-07-13 18:22 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Xiong Zhang, Wayne Boyer,
	Zhenyu Wang, Jun Nakajima

On Thu, 9 Jul 2020 21:29:22 -0700
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> +Alex, whom I completely spaced on Cc'ing.
> 
> Alex, this is related to the dreaded VFIO memslot zapping issue from last
> year.  Start of thread: https://patchwork.kernel.org/patch/11640719/.
> 
> The TL;DR of below: can you try the attached patch with your reproducer
> from the original bug[*]?  I honestly don't know whether it has a legitimate
> chance of working, but it's the one thing in all of this that I know was
> definitely a bug.  I'd like to test it out if only to sate my curiosity.
> Absolutely no rush.

Mixed results, maybe you can provide some guidance.  Running this
against v5.8-rc4, I haven't reproduced the glitch.  But it's been a
long time since I tested this previously, so I went back to v5.3-rc5 to
make sure I still have a recipe to trigger it.  I can still get the
failure there as the selective flush commit was reverted in rc6.  Then
I wondered, can I take broken v5.3-rc5 and apply this fix to prove that
it works?  No, v5.3-rc5 + this patch still glitches.  So I thought
maybe I could make v5.8-rc4 break by s/true/false/ in this patch.
Nope.  Then I applied the original patch from[1] to try to break it.
Nope.  So if anything, I think the evidence suggests this was broken
elsewhere and is now fixed, or maybe it is a timing issue that I can't
trigger on newer kernels.  If the reproducer wasn't so touchy and time
consuming, I'd try to bisect, but I don't have that sort of bandwidth.
Thanks,

Alex

[1] https://patchwork.kernel.org/patch/10798453/



* Re: [PATCH] KVM: x86/mmu: Add capability to zap only sptes for the affected memslot
  2020-07-13 18:22         ` Alex Williamson
@ 2020-07-13 19:06           ` Sean Christopherson
  2020-07-21  3:03             ` Sean Christopherson
  0 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2020-07-13 19:06 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Xiong Zhang, Wayne Boyer,
	Zhenyu Wang, Jun Nakajima

On Mon, Jul 13, 2020 at 12:22:26PM -0600, Alex Williamson wrote:
> On Thu, 9 Jul 2020 21:29:22 -0700
> Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> > +Alex, whom I completely spaced on Cc'ing.
> > 
> > Alex, this is related to the dreaded VFIO memslot zapping issue from last
> > year.  Start of thread: https://patchwork.kernel.org/patch/11640719/.
> > 
> > The TL;DR of below: can you try the attached patch with your reproducer
> > from the original bug[*]?  I honestly don't know whether it has a legitimate
> > chance of working, but it's the one thing in all of this that I know was
> > definitely a bug.  I'd like to test it out if only to sate my curiosity.
> > Absolutely no rush.
> 
> Mixed results, maybe you can provide some guidance.  Running this
> against v5.8-rc4, I haven't reproduced the glitch.  But it's been a
> long time since I tested this previously, so I went back to v5.3-rc5 to
> make sure I still have a recipe to trigger it.  I can still get the
> failure there as the selective flush commit was reverted in rc6.  Then
> I wondered, can I take broken v5.3-rc5 and apply this fix to prove that
> it works?  No, v5.3-rc5 + this patch still glitches.  So I thought
> maybe I could make v5.8-rc4 break by s/true/false/ in this patch.
> Nope.  Then I applied the original patch from[1] to try to break it.
> Nope.  So if anything, I think the evidence suggests this was broken
> elsewhere and is now fixed, or maybe it is a timing issue that I can't
> trigger on newer kernels.  If the reproducer wasn't so touchy and time
> consuming, I'd try to bisect, but I don't have that sort of bandwidth.

Ow.  That manages to be both a best case and worst case scenario.  I can't
think of any clever way to avoid bisecting.  There have been a number of
fixes in tangentially related code since 5.3, e.g. memslots, MMU, TLB,
etc..., but trying to isolate which one, if any of them, fixed the bug has
a high probability of being a wild goose chase.

The only ideas I have going forward are to:

  a) Reproduce the bug outside of your environment and find a resource that
     can go through the painful bisection.

  b) Add a module param to toggle the new behavior and see if anything
     breaks.

I can ask internally if it's possible to get a resource on my end to go
after (a).  (b) is a question for Paolo.
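
A rough sketch of (b), reusing kvm_mmu_zap_memslot() from the original
patch (the parameter name and permissions are placeholders, not a
concrete proposal):

	/* arch/x86/kvm/mmu/mmu.c */
	static bool __read_mostly zap_memslot_only;
	module_param(zap_memslot_only, bool, 0444);

	static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
				struct kvm_memory_slot *slot,
				struct kvm_page_track_notifier_node *node)
	{
		if (zap_memslot_only)
			kvm_mmu_zap_memslot(kvm, slot);
		else
			kvm_mmu_zap_all_fast(kvm);
	}

That would let testers flip the behavior on any VMM, e.g. with
kvm.zap_memslot_only=1 on the kernel command line, without committing to
any ABI.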

Thanks much for testing!

> Thanks,
> 
> Alex
> 
> [1] https://patchwork.kernel.org/patch/10798453/
> 


* Re: [PATCH] KVM: x86/mmu: Add capability to zap only sptes for the affected memslot
  2020-07-13 19:06           ` Sean Christopherson
@ 2020-07-21  3:03             ` Sean Christopherson
  2020-07-21 16:00               ` Alex Williamson
  0 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2020-07-21  3:03 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Xiong Zhang, Wayne Boyer,
	Zhenyu Wang, Jun Nakajima, Weijiang Yang

+Weijiang

On Mon, Jul 13, 2020 at 12:06:50PM -0700, Sean Christopherson wrote:
> The only ideas I have going forward are to:
> 
>   a) Reproduce the bug outside of your environment and find a resource that
>      can go through the painful bisection.

We're trying to reproduce the original issue in the hopes of bisecting, but
have not yet discovered the secret sauce.  A few questions:

  - Are there any known hardware requirements, e.g. specific flavor of GPU?

  - What's the average time to failure when running FurMark/PassMark?  E.g.
    what's a reasonable time to wait before rebooting to rerun the tests (I
    assume this is what you meant when you said you sometimes needed to
    reboot to observe failure).

Thanks!


* Re: [PATCH] KVM: x86/mmu: Add capability to zap only sptes for the affected memslot
  2020-07-21  3:03             ` Sean Christopherson
@ 2020-07-21 16:00               ` Alex Williamson
  2020-07-23 15:57                 ` Sean Christopherson
  0 siblings, 1 reply; 12+ messages in thread
From: Alex Williamson @ 2020-07-21 16:00 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Xiong Zhang, Wayne Boyer,
	Zhenyu Wang, Jun Nakajima, Weijiang Yang

On Mon, 20 Jul 2020 20:03:19 -0700
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> +Weijiang
> 
> On Mon, Jul 13, 2020 at 12:06:50PM -0700, Sean Christopherson wrote:
> > The only ideas I have going forward are to:
> > 
> >   a) Reproduce the bug outside of your environment and find a resource that
> >      can go through the painful bisection.  
> 
> We're trying to reproduce the original issue in the hopes of biesecting, but
> have not yet discovered the secret sauce.  A few questions:
> 
>   - Are there any known hardware requirements, e.g. specific flavor of GPU?

I'm using an old GeForce GT635, I don't think there's anything special
about this card.

>   - What's the average time to failure when running FurMark/PassMark?  E.g.
>     what's a reasonable time to wait before rebooting to rerun the tests (I
>     assume this is what you meant when you said you sometimes needed to
>     reboot to observe failure).

The failure mode ranges from graphics glitches, ex. vectors drawn
across a window during the test or stray lines when interacting with
the Windows menu button, to graphics driver failures triggering an
error dialog, usually from PassMark.  I usually start FurMark, run the
stress test for ~10s, kill it, then run a PassMark benchmark.  If I
don't observe any glitching during the run, I'll trigger the Windows
menu a few times, then reboot and try again.  The graphics tests within
PassMark are generally when the glitches are triggered, both 2D and 3D,
sometimes it's sufficient to only run those tests rather than the full
system benchmark.  That's largely the trouble with this bisect: the
test is very interactive and requires observation.  Sometimes when
it fails it snowballs into worse and worse errors and there's high
confidence that it's bad, but other times you'll be suspicious that
something occurred and need to repeat the testing.  Thanks,

Alex



* Re: [PATCH] KVM: x86/mmu: Add capability to zap only sptes for the affected memslot
  2020-07-21 16:00               ` Alex Williamson
@ 2020-07-23 15:57                 ` Sean Christopherson
  2020-07-23 18:35                   ` Alex Williamson
  0 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2020-07-23 15:57 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Xiong Zhang, Wayne Boyer,
	Zhenyu Wang, Jun Nakajima, Weijiang Yang

On Tue, Jul 21, 2020 at 10:00:36AM -0600, Alex Williamson wrote:
> On Mon, 20 Jul 2020 20:03:19 -0700
> Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> > +Weijiang
> > 
> > On Mon, Jul 13, 2020 at 12:06:50PM -0700, Sean Christopherson wrote:
> > > The only ideas I have going forward are to:
> > > 
> > >   a) Reproduce the bug outside of your environment and find a resource that
> > >      can go through the painful bisection.  
> > 
> > We're trying to reproduce the original issue in the hopes of bisecting, but
> > have not yet discovered the secret sauce.  A few questions:
> > 
> >   - Are there any known hardware requirements, e.g. specific flavor of GPU?
> 
> I'm using an old GeForce GT635, I don't think there's anything special
> about this card.

Would you be able to provide your QEMU command line?  Or at least any
potentially relevant bits?  Still no luck reproducing this on our end.

Thanks again!


* Re: [PATCH] KVM: x86/mmu: Add capability to zap only sptes for the affected memslot
  2020-07-23 15:57                 ` Sean Christopherson
@ 2020-07-23 18:35                   ` Alex Williamson
  2020-09-14 16:49                     ` Sean Christopherson
  0 siblings, 1 reply; 12+ messages in thread
From: Alex Williamson @ 2020-07-23 18:35 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Xiong Zhang, Wayne Boyer,
	Zhenyu Wang, Jun Nakajima, Weijiang Yang

On Thu, 23 Jul 2020 08:57:11 -0700
Sean Christopherson <sean.j.christopherson@intel.com> wrote:

> On Tue, Jul 21, 2020 at 10:00:36AM -0600, Alex Williamson wrote:
> > On Mon, 20 Jul 2020 20:03:19 -0700
> > Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> >   
> > > +Weijiang
> > > 
> > > On Mon, Jul 13, 2020 at 12:06:50PM -0700, Sean Christopherson wrote:  
> > > > The only ideas I have going forward are to:
> > > > 
> > > >   a) Reproduce the bug outside of your environment and find a resource that
> > > >      can go through the painful bisection.    
> > > 
> > > We're trying to reproduce the original issue in the hopes of bisecting, but
> > > have not yet discovered the secret sauce.  A few questions:
> > > 
> > >   - Are there any known hardware requirements, e.g. specific flavor of GPU?  
> > 
> > I'm using an old GeForce GT635, I don't think there's anything special
> > about this card.  
> 
> Would you be able to provide your QEMU command line?  Or at least any
> potentially relevant bits?  Still no luck reproducing this on our end.

XML:

<domain type='kvm'>
  <name>GeForce</name>
  <uuid>2b417d4b-f25b-4522-a5be-e105f032f99c</uuid>
  <memory unit='KiB'>4194304</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
  <memoryBacking>
    <hugepages/>
  </memoryBacking>
  <vcpu placement='static'>4</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='3'/>
    <vcpupin vcpu='1' cpuset='7'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='6'/>
    <emulatorpin cpuset='0,4'/>
  </cputune>
  <os>
    <type arch='x86_64' machine='pc-i440fx-5.0'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/edk2/ovmf/OVMF_CODE.fd</loader>
    <nvram template='/usr/share/edk2/ovmf/OVMF_VARS.fd'>/var/lib/libvirt/qemu/nvram/GeForce_VARS.fd</nvram>
    <bootmenu enable='yes'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <pae/>
    <hyperv>
      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='8191'/>
      <vendor_id state='on' value='KeenlyKVM'/>
    </hyperv>
    <kvm>
      <hidden state='on'/>
    </kvm>
    <vmport state='off'/>
  </features>
  <cpu mode='custom' match='exact' check='none'>
    <model fallback='allow'>IvyBridge-IBRS</model>
    <topology sockets='1' dies='1' cores='4' threads='1'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
    <timer name='hypervclock' present='yes'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/local/bin/qemu-system-x86_64</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'/>
      <source file='/mnt/ssd/GeForce.qcow2'/>
      <target dev='sda' bus='scsi'/>
      <boot order='2'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <controller type='scsi' index='0' model='virtio-scsi'>
      <driver queues='4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'/>
    <controller type='usb' index='0' model='nec-xhci'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </controller>
    <interface type='direct'>
      <mac address='52:54:00:60:ef:ac'/>
      <source dev='enp4s0' mode='bridge'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
      </source>
      <rom bar='on'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x1'/>
      </source>
      <rom bar='off'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </hostdev>
    <memballoon model='none'/>
  </devices>
</domain>

From libvirt log:

LC_ALL=C \
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin \
HOME=/var/lib/libvirt/qemu/domain-1-GeForce \
XDG_DATA_HOME=/var/lib/libvirt/qemu/domain-1-GeForce/.local/share \
XDG_CACHE_HOME=/var/lib/libvirt/qemu/domain-1-GeForce/.cache \
XDG_CONFIG_HOME=/var/lib/libvirt/qemu/domain-1-GeForce/.config \
QEMU_AUDIO_DRV=none \
/usr/local/bin/qemu-system-x86_64 \
-name guest=GeForce,debug-threads=on \
-S \
-object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-GeForce/master-key.aes \
-blockdev '{"driver":"file","filename":"/usr/share/edk2/ovmf/OVMF_CODE.fd","node-name":"libvirt-pflash0-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-pflash0-format","read-only":true,"driver":"raw","file":"libvirt-pflash0-storage"}' \
-blockdev '{"driver":"file","filename":"/var/lib/libvirt/qemu/nvram/GeForce_VARS.fd","node-name":"libvirt-pflash1-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-pflash1-format","read-only":false,"driver":"raw","file":"libvirt-pflash1-storage"}' \
-machine pc-i440fx-5.0,accel=kvm,usb=off,vmport=off,dump-guest-core=off,pflash0=libvirt-pflash0-format,pflash1=libvirt-pflash1-format \
-cpu IvyBridge-IBRS,hv-time,hv-relaxed,hv-vapic,hv-spinlocks=0x1fff,hv-vendor-id=KeenlyKVM,kvm=off \
-m 4096 \
-mem-prealloc \
-mem-path /dev/hugepages/libvirt/qemu/1-GeForce \
-overcommit mem-lock=off \
-smp 4,sockets=1,dies=1,cores=4,threads=1 \
-uuid 2b417d4b-f25b-4522-a5be-e105f032f99c \
-display none \
-no-user-config \
-nodefaults \
-chardev socket,id=charmonitor,fd=36,server,nowait \
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=localtime,driftfix=slew \
-global kvm-pit.lost_tick_policy=delay \
-no-hpet \
-no-shutdown \
-boot menu=on,strict=on \
-device nec-usb-xhci,id=usb,bus=pci.0,addr=0x8 \
-device virtio-scsi-pci,id=scsi0,num_queues=4,bus=pci.0,addr=0x5 \
-blockdev '{"driver":"file","filename":"/mnt/ssd/GeForce-2019-08-02.img","node-name":"libvirt-2-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-2-format","read-only":true,"cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-2-storage"}' \
-blockdev '{"driver":"file","filename":"/mnt/ssd/Geforce.qcow2","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"qcow2","file":"libvirt-1-storage","backing":"libvirt-2-format"}' \
-device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,device_id=drive-scsi0-0-0-0,drive=libvirt-1-format,id=scsi0-0-0-0,bootindex=2,write-cache=on \
-netdev tap,fd=38,id=hostnet0,vhost=on,vhostfd=40 \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:60:ef:ac,bus=pci.0,addr=0x3 \
-device vfio-pci,host=0000:01:00.0,id=hostdev0,bus=pci.0,addr=0x4,rombar=1 \
-device vfio-pci,host=0000:01:00.1,id=hostdev1,bus=pci.0,addr=0x6,rombar=0 \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg timestamp=on



* Re: [PATCH] KVM: x86/mmu: Add capability to zap only sptes for the affected memslot
  2020-07-23 18:35                   ` Alex Williamson
@ 2020-09-14 16:49                     ` Sean Christopherson
  0 siblings, 0 replies; 12+ messages in thread
From: Sean Christopherson @ 2020-09-14 16:49 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Xiong Zhang, Wayne Boyer,
	Zhenyu Wang, Jun Nakajima, Weijiang Yang

On Thu, Jul 23, 2020 at 12:35:44PM -0600, Alex Williamson wrote:
> On Thu, 23 Jul 2020 08:57:11 -0700
> Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> > On Tue, Jul 21, 2020 at 10:00:36AM -0600, Alex Williamson wrote:
> > > On Mon, 20 Jul 2020 20:03:19 -0700
> > > Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> > >   
> > > > +Weijiang
> > > > 
> > > > On Mon, Jul 13, 2020 at 12:06:50PM -0700, Sean Christopherson wrote:  
> > > > > The only ideas I have going forward are to:
> > > > > 
> > > > >   a) Reproduce the bug outside of your environment and find a resource that
> > > > >      can go through the painful bisection.    
> > > > 
> > > > We're trying to reproduce the original issue in the hopes of bisecting, but
> > > > have not yet discovered the secret sauce.  A few questions:
> > > > 
> > > >   - Are there any known hardware requirements, e.g. specific flavor of GPU?  
> > > 
> > > I'm using an old GeForce GT635, I don't think there's anything special
> > > about this card.  
> > 
> > Would you be able to provide your QEMU command line?  Or at least any
> > potentially relevant bits?  Still no luck reproducing this on our end.

*sigh*

The "good" news is that we were able to reproduce and bisect the "fix".

The bad news is that the "fix" is the fracturing of large pages for the
iTLB multi-hit bug, added by commit b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT
mitigation").  The GPU pass-through failures can be reproduced by loading
KVM with kvm.nx_huge_pages=0.

So, we have another data point, but still no clear explanation of exactly
what is broken.
