Message ID | 20200914134321.958079-1-pizhenwei@bytedance.com |
---|---|
Series | add MEMORY_FAILURE event |
Hi,

A patchset about handling 'MCE' might have been ignored. Can anyone tell
me whether the purpose is reasonable?

https://patchwork.kernel.org/cover/11773795/

On 9/14/20 9:43 PM, zhenwei pi wrote:
> Although QEMU can catch SIGBUS to handle a hardware memory
> corruption event, sadly, QEMU just prints a short log message and tries
> to fix it silently.
>
> These patches introduce a 'MEMORY_FAILURE' event with 4 detailed
> QEMU actions, so that the upper layer can know what situation QEMU hit
> and what it did. As a further step: if a host server hits
> 'hypervisor-ignore' or 'guest-mce', the scheduler could migrate the VM to
> another host; if it hits 'hypervisor-stop' or 'guest-triple-fault', the
> scheduler could select other healthy servers to launch the VM.
>
> zhenwei pi (3):
>   target-i386: seperate MCIP & MCE_MASK error reason
>   iqapi/run-state.json: introduce memory failure event
>   target-i386: post memory failure event to uplayer
>
>  qapi/run-state.json  | 46 ++++++++++++++++++++++++++++++++++++++++++++++
>  target/i386/helper.c | 30 +++++++++++++++++++++++-------
>  target/i386/kvm.c    |  5 ++++-
>  3 files changed, 73 insertions(+), 8 deletions(-)
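The scheduler policy described in the cover letter could be sketched roughly as below. This is only an illustration and not part of the patchset: the four action names come from the cover letter, while the event layout (an action name under `data.action`), the `decide` helper, and the two response labels are assumptions made for this sketch.

```python
import json

# Hypothetical labels for the two scheduler responses named in the
# cover letter; they are not QEMU or QMP identifiers.
MIGRATE = "migrate-vm"
RELAUNCH = "relaunch-on-healthy-host"

# Action names as listed in the cover letter, mapped to a response.
POLICY = {
    "hypervisor-ignore": MIGRATE,
    "guest-mce": MIGRATE,
    "hypervisor-stop": RELAUNCH,
    "guest-triple-fault": RELAUNCH,
}


def decide(qmp_line):
    """Return the scheduler response for one QMP event line, or None."""
    msg = json.loads(qmp_line)
    if msg.get("event") != "MEMORY_FAILURE":
        return None
    # Assumed layout: the action name is stored under data.action.
    action = msg.get("data", {}).get("action")
    return POLICY.get(action)
```

A management layer would feed each line read from the QMP socket to `decide` and act only when the result is not `None`.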
On 21/09/20 04:22, zhenwei pi wrote:
> Hi,
>
> A patchset about handling 'MCE' might have been ignored, can anyone tell
> me whether the purpose is reasonable?
>
> https://patchwork.kernel.org/cover/11773795/

Yes, it's very useful. Just one thing: "guest-mce" can be reported for
both AR (action required) and AO (action optional) faults. Is it worth
adding a 'type' field to distinguish the two?

Paolo
On 9/21/20 8:09 PM, Paolo Bonzini wrote:
> Yes, it's very useful. Just one thing, "guest-mce" can be reported for
> both AR and AO faults. Is it worth adding a 'type' field to distinguish
> the two?
>
> Paolo

Sure. How about adding a 'flags' structure with a field named
'action-required' to indicate whether the fault is AO or AR?

--
zhenwei pi
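One way to picture the proposed 'flags' structure is shown below. It is a sketch under assumptions, not code from the patches: the `make_flags` helper is hypothetical, and it assumes the AR/AO distinction is derived from the `si_code` of the SIGBUS that Linux delivers for memory errors (`BUS_MCEERR_AR` or `BUS_MCEERR_AO`).

```python
# Linux si_code values for SIGBUS memory errors
# (from include/uapi/asm-generic/siginfo.h).
BUS_MCEERR_AR = 4  # action required
BUS_MCEERR_AO = 5  # action optional


def make_flags(si_code):
    """Build the proposed 'flags' structure from a SIGBUS si_code."""
    if si_code not in (BUS_MCEERR_AR, BUS_MCEERR_AO):
        raise ValueError("not a memory-error si_code: %r" % (si_code,))
    # 'action-required' is the field name suggested in the reply above.
    return {"action-required": si_code == BUS_MCEERR_AR}
```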
On 21/09/20 15:10, zhenwei pi wrote:
> Right, to make architecture-neutral, how about these changes:
> 'PC-RAM' -> 'guest-memory'
> 'guest-mce' -> 'guest-mce-inject'
> 'guest-triple-fault' -> 'guest-mce-fault'

Perhaps we should have three fields

1) recipient: 'hypervisor' or 'guest'
2) action: 'ignore', 'inject', 'fatal'
3) kind: 'action-optional' or 'action-required'

And possibly:

4) recursive: true or false

On x86 "recursive" would be set if MCIP=1.

Paolo
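The field layout proposed above might be modelled as follows. This is a sketch, not the schema that was merged: the `MemoryFailure` class and `from_legacy` helper are hypothetical, and the mapping from the four original action names to (recipient, action) pairs is one plausible reading of the thread rather than something stated in it.

```python
from dataclasses import dataclass

RECIPIENTS = ("hypervisor", "guest")
ACTIONS = ("ignore", "inject", "fatal")
KINDS = ("action-optional", "action-required")


@dataclass(frozen=True)
class MemoryFailure:
    recipient: str
    action: str
    kind: str
    recursive: bool = False  # on x86, set when MCIP=1

    def __post_init__(self):
        # Reject values outside the sets listed in the proposal.
        if self.recipient not in RECIPIENTS:
            raise ValueError("bad recipient: " + self.recipient)
        if self.action not in ACTIONS:
            raise ValueError("bad action: " + self.action)
        if self.kind not in KINDS:
            raise ValueError("bad kind: " + self.kind)


# Assumed mapping from the original four action names.
LEGACY = {
    "hypervisor-ignore": ("hypervisor", "ignore"),
    "hypervisor-stop": ("hypervisor", "fatal"),
    "guest-mce": ("guest", "inject"),
    "guest-triple-fault": ("guest", "fatal"),
}


def from_legacy(name, kind, mcip=False):
    """Translate an original action name into the proposed fields."""
    recipient, action = LEGACY[name]
    return MemoryFailure(recipient, action, kind, recursive=mcip)
```

Splitting the single action string into independent fields lets a consumer match on `recipient` or `kind` alone, without enumerating every combined name.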