diff mbox

ARM64: KVM: Fix coherent_icache_guest_page() for host with external L3-cache.

Message ID da83e967f5a266ca28ee54638a4a9be9@www.loen.fr
State New
Headers show

Commit Message

Marc Zyngier Aug. 15, 2013, 4:52 a.m. UTC
Hi Anup,

On 2013-08-14 15:22, Anup Patel wrote:
> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier <marc.zyngier@arm.com> 
> wrote:
>> Hi Pranav,
>>
>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
>>> Systems with large external L3-cache (few MBs), might have dirty
>>> content belonging to the guest page in L3-cache. To tackle this,
>>> we need to flush such dirty content from d-cache so that guest
>>> will see correct contents of guest page when guest MMU is disabled.
>>>
>>> The patch fixes coherent_icache_guest_page() for external L3-cache.
>>>
>>> Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
>>> Signed-off-by: Anup Patel <anup.patel@linaro.org>
>>> ---
>>>  arch/arm64/include/asm/kvm_mmu.h |    2 ++
>>>  1 file changed, 2 insertions(+)
>>>
>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h
>>> b/arch/arm64/include/asm/kvm_mmu.h
>>> index efe609c..5129038 100644
>>> --- a/arch/arm64/include/asm/kvm_mmu.h
>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>>> @@ -123,6 +123,8 @@ static inline void
>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
>>>       if (!icache_is_aliasing()) {            /* PIPT */
>>>               unsigned long hva = gfn_to_hva(kvm, gfn);
>>>               flush_icache_range(hva, hva + PAGE_SIZE);
>>> +             /* Flush d-cache for systems with external caches. */
>>> +             __flush_dcache_area((void *) hva, PAGE_SIZE);
>>>       } else if (!icache_is_aivivt()) {       /* non ASID-tagged 
>>> VIVT */
>>>               /* any kind of VIPT cache */
>>>               __flush_icache_all();
>>
>> [adding Will to the discussion as we talked about this in the past]
>>
>> That's definitely an issue, but I'm not sure the fix is to hit the 
>> data
>> cache on each page mapping. It looks overkill.
>>
>> Wouldn't it be enough to let userspace do the cache cleaning? 
>> kvmtools
>> knows which bits of the guest memory have been touched, and can do a 
>> "DC
>> DVAC" on this region.
>
> It seems a bit unnatural to have cache cleaning is user-space. I am 
> sure
> other architectures don't have such cache cleaning in user-space for 
> KVM.
>
>>
>> The alternative is do it in the kernel before running any vcpu - but
>> that's not very nice either (have to clean the whole of the guest
>> memory, which makes a full dcache clean more appealing).
>
> Actually, cleaning full d-cache by set/way upon first run of VCPU was
> our second option but current approach seemed very simple hence
> we went for this.
>
> If more people vote for full d-cache clean upon first run of VCPU 
> then
> we should revise this patch.

Can you please give the attached patch a spin on your HW? I've 
boot-tested it on a model, but of course I can't really verify that it 
fixes your issue.

As far as I can see, it should fix it without any additional flushing.

Please let me know how it goes.

Thanks,

         M.

Comments

Anup Patel Aug. 15, 2013, 6:26 a.m. UTC | #1
Hi Marc,

On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier <marc.zyngier@arm.com> wrote:
> Hi Anup,
>
>
> On 2013-08-14 15:22, Anup Patel wrote:
>>
>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier <marc.zyngier@arm.com>
>> wrote:
>>>
>>> Hi Pranav,
>>>
>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
>>>>
>>>> Systems with large external L3-cache (few MBs), might have dirty
>>>> content belonging to the guest page in L3-cache. To tackle this,
>>>> we need to flush such dirty content from d-cache so that guest
>>>> will see correct contents of guest page when guest MMU is disabled.
>>>>
>>>> The patch fixes coherent_icache_guest_page() for external L3-cache.
>>>>
>>>> Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org>
>>>> ---
>>>>  arch/arm64/include/asm/kvm_mmu.h |    2 ++
>>>>  1 file changed, 2 insertions(+)
>>>>
>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h
>>>> b/arch/arm64/include/asm/kvm_mmu.h
>>>> index efe609c..5129038 100644
>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>>>> @@ -123,6 +123,8 @@ static inline void
>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
>>>>       if (!icache_is_aliasing()) {            /* PIPT */
>>>>               unsigned long hva = gfn_to_hva(kvm, gfn);
>>>>               flush_icache_range(hva, hva + PAGE_SIZE);
>>>> +             /* Flush d-cache for systems with external caches. */
>>>> +             __flush_dcache_area((void *) hva, PAGE_SIZE);
>>>>       } else if (!icache_is_aivivt()) {       /* non ASID-tagged VIVT */
>>>>               /* any kind of VIPT cache */
>>>>               __flush_icache_all();
>>>
>>>
>>> [adding Will to the discussion as we talked about this in the past]
>>>
>>> That's definitely an issue, but I'm not sure the fix is to hit the data
>>> cache on each page mapping. It looks overkill.
>>>
>>> Wouldn't it be enough to let userspace do the cache cleaning? kvmtools
>>> knows which bits of the guest memory have been touched, and can do a "DC
>>> DVAC" on this region.
>>
>>
>> It seems a bit unnatural to have cache cleaning is user-space. I am sure
>> other architectures don't have such cache cleaning in user-space for KVM.
>>
>>>
>>> The alternative is do it in the kernel before running any vcpu - but
>>> that's not very nice either (have to clean the whole of the guest
>>> memory, which makes a full dcache clean more appealing).
>>
>>
>> Actually, cleaning full d-cache by set/way upon first run of VCPU was
>> our second option but current approach seemed very simple hence
>> we went for this.
>>
>> If more people vote for full d-cache clean upon first run of VCPU then
>> we should revise this patch.
>
>
> Can you please give the attached patch a spin on your HW? I've boot-tested
> it on a model, but of course I can't really verify that it fixes your issue.
>
> As far as I can see, it should fix it without any additional flushing.
>
> Please let me know how it goes.

HCR_EL2.DC=1 means all memory access for Stage1 MMU off are
treated as "Normal Non-shareable, Inner Write-Back Write-Allocate,
Outer Write-Back Write-Allocate memory"

HCR_EL2.DC=0 means all memory access for Stage1 MMU off are
treated as "Strongly-ordered device memory"

Now if Guest/VM access hardware MMIO devices directly (such as
VGIC CPU interface) with MMU off then MMIO devices will be
accessed as normal memory when HCR_EL2.DC=1.

The HCR_EL2.DC=1 makes sense only if we have all software
emulated devices for Guest/VM which is not true for KVM ARM or
KVM ARM64 because we use VGIC.

IMHO, this patch enforces incorrect memory attribute for Guest/VM
when Stage1 MMU is off.

Nevertheless, we will still try your patch.

>
> Thanks,
>
>
>         M.
> --
> Fast, cheap, reliable. Pick two.

Regards,
Anup
Marc Zyngier Aug. 15, 2013, 8:31 a.m. UTC | #2
On 2013-08-15 07:26, Anup Patel wrote:
> Hi Marc,
>
> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier <marc.zyngier@arm.com> 
> wrote:
>> Hi Anup,
>>
>>
>> On 2013-08-14 15:22, Anup Patel wrote:
>>>
>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier 
>>> <marc.zyngier@arm.com>
>>> wrote:
>>>>
>>>> Hi Pranav,
>>>>
>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
>>>>>
>>>>> Systems with large external L3-cache (few MBs), might have dirty
>>>>> content belonging to the guest page in L3-cache. To tackle this,
>>>>> we need to flush such dirty content from d-cache so that guest
>>>>> will see correct contents of guest page when guest MMU is 
>>>>> disabled.
>>>>>
>>>>> The patch fixes coherent_icache_guest_page() for external 
>>>>> L3-cache.
>>>>>
>>>>> Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org>
>>>>> ---
>>>>>  arch/arm64/include/asm/kvm_mmu.h |    2 ++
>>>>>  1 file changed, 2 insertions(+)
>>>>>
>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h
>>>>> b/arch/arm64/include/asm/kvm_mmu.h
>>>>> index efe609c..5129038 100644
>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>>>>> @@ -123,6 +123,8 @@ static inline void
>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
>>>>>       if (!icache_is_aliasing()) {            /* PIPT */
>>>>>               unsigned long hva = gfn_to_hva(kvm, gfn);
>>>>>               flush_icache_range(hva, hva + PAGE_SIZE);
>>>>> +             /* Flush d-cache for systems with external caches. 
>>>>> */
>>>>> +             __flush_dcache_area((void *) hva, PAGE_SIZE);
>>>>>       } else if (!icache_is_aivivt()) {       /* non ASID-tagged 
>>>>> VIVT */
>>>>>               /* any kind of VIPT cache */
>>>>>               __flush_icache_all();
>>>>
>>>>
>>>> [adding Will to the discussion as we talked about this in the 
>>>> past]
>>>>
>>>> That's definitely an issue, but I'm not sure the fix is to hit the 
>>>> data
>>>> cache on each page mapping. It looks overkill.
>>>>
>>>> Wouldn't it be enough to let userspace do the cache cleaning? 
>>>> kvmtools
>>>> knows which bits of the guest memory have been touched, and can do 
>>>> a "DC
>>>> DVAC" on this region.
>>>
>>>
>>> It seems a bit unnatural to have cache cleaning is user-space. I am 
>>> sure
>>> other architectures don't have such cache cleaning in user-space 
>>> for KVM.
>>>
>>>>
>>>> The alternative is do it in the kernel before running any vcpu - 
>>>> but
>>>> that's not very nice either (have to clean the whole of the guest
>>>> memory, which makes a full dcache clean more appealing).
>>>
>>>
>>> Actually, cleaning full d-cache by set/way upon first run of VCPU 
>>> was
>>> our second option but current approach seemed very simple hence
>>> we went for this.
>>>
>>> If more people vote for full d-cache clean upon first run of VCPU 
>>> then
>>> we should revise this patch.
>>
>>
>> Can you please give the attached patch a spin on your HW? I've 
>> boot-tested
>> it on a model, but of course I can't really verify that it fixes 
>> your issue.
>>
>> As far as I can see, it should fix it without any additional 
>> flushing.
>>
>> Please let me know how it goes.
>
> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are
> treated as "Normal Non-shareable, Inner Write-Back Write-Allocate,
> Outer Write-Back Write-Allocate memory"
>
> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are
> treated as "Strongly-ordered device memory"
>
> Now if Guest/VM access hardware MMIO devices directly (such as
> VGIC CPU interface) with MMU off then MMIO devices will be
> accessed as normal memory when HCR_EL2.DC=1.

I don't think so. Stage-2 still applies, and should force MMIO to be 
accessed as device memory.

> The HCR_EL2.DC=1 makes sense only if we have all software
> emulated devices for Guest/VM which is not true for KVM ARM or
> KVM ARM64 because we use VGIC.
>
> IMHO, this patch enforces incorrect memory attribute for Guest/VM
> when Stage1 MMU is off.

See above. My understanding is that HCR.DC controls the default output 
of Stage-1, and Stage-2 overrides still apply.

> Nevertheless, we will still try your patch.

Thanks.

         M.
PranavkumarSawargaonkar Aug. 15, 2013, 8:39 a.m. UTC | #3
Hi Marc,

On 15 August 2013 10:22, Marc Zyngier <marc.zyngier@arm.com> wrote:
> Hi Anup,
>
>
> On 2013-08-14 15:22, Anup Patel wrote:
>>
>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier <marc.zyngier@arm.com>
>> wrote:
>>>
>>> Hi Pranav,
>>>
>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
>>>>
>>>> Systems with large external L3-cache (few MBs), might have dirty
>>>> content belonging to the guest page in L3-cache. To tackle this,
>>>> we need to flush such dirty content from d-cache so that guest
>>>> will see correct contents of guest page when guest MMU is disabled.
>>>>
>>>> The patch fixes coherent_icache_guest_page() for external L3-cache.
>>>>
>>>> Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org>
>>>> ---
>>>>  arch/arm64/include/asm/kvm_mmu.h |    2 ++
>>>>  1 file changed, 2 insertions(+)
>>>>
>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h
>>>> b/arch/arm64/include/asm/kvm_mmu.h
>>>> index efe609c..5129038 100644
>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>>>> @@ -123,6 +123,8 @@ static inline void
>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
>>>>       if (!icache_is_aliasing()) {            /* PIPT */
>>>>               unsigned long hva = gfn_to_hva(kvm, gfn);
>>>>               flush_icache_range(hva, hva + PAGE_SIZE);
>>>> +             /* Flush d-cache for systems with external caches. */
>>>> +             __flush_dcache_area((void *) hva, PAGE_SIZE);
>>>>       } else if (!icache_is_aivivt()) {       /* non ASID-tagged VIVT */
>>>>               /* any kind of VIPT cache */
>>>>               __flush_icache_all();
>>>
>>>
>>> [adding Will to the discussion as we talked about this in the past]
>>>
>>> That's definitely an issue, but I'm not sure the fix is to hit the data
>>> cache on each page mapping. It looks overkill.
>>>
>>> Wouldn't it be enough to let userspace do the cache cleaning? kvmtools
>>> knows which bits of the guest memory have been touched, and can do a "DC
>>> DVAC" on this region.
>>
>>
>> It seems a bit unnatural to have cache cleaning is user-space. I am sure
>> other architectures don't have such cache cleaning in user-space for KVM.
>>
>>>
>>> The alternative is do it in the kernel before running any vcpu - but
>>> that's not very nice either (have to clean the whole of the guest
>>> memory, which makes a full dcache clean more appealing).
>>
>>
>> Actually, cleaning full d-cache by set/way upon first run of VCPU was
>> our second option but current approach seemed very simple hence
>> we went for this.
>>
>> If more people vote for full d-cache clean upon first run of VCPU then
>> we should revise this patch.
>
>
> Can you please give the attached patch a spin on your HW? I've boot-tested
> it on a model, but of course I can't really verify that it fixes your issue.
>
> As far as I can see, it should fix it without any additional flushing.
>
> Please let me know how it goes.
>
> Thanks,
>
>
>         M.
> --
> Fast, cheap, reliable. Pick two.
>
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm

Surely we will try this patch.

Just to add few more points to this discussion :

Actually we are already flushing d-cache in "coherent_icache_guest_page"
by calling flush_icache_range.

Now in my patch i am doing same thing one more time by calling
__flush_dcache_area just below flush_icache_range.

Main difference between d-cache flushing in above two routines is :
flush_icache_range is flushing d-cache by point of unification (dc     cvau) and
 __flush_dcache_area is flushing that by point of coherency (dc      civac).

Now since second routine is doing it by coherency it is making L3C to
be coherent and sync changes with immediate effect and solving the
issue.

Thanks,
Pranav
Anup Patel Aug. 15, 2013, 1:31 p.m. UTC | #4
On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
> On 2013-08-15 07:26, Anup Patel wrote:
>>
>> Hi Marc,
>>
>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier <marc.zyngier@arm.com>
>> wrote:
>>>
>>> Hi Anup,
>>>
>>>
>>> On 2013-08-14 15:22, Anup Patel wrote:
>>>>
>>>>
>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier <marc.zyngier@arm.com>
>>>> wrote:
>>>>>
>>>>>
>>>>> Hi Pranav,
>>>>>
>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
>>>>>>
>>>>>>
>>>>>> Systems with large external L3-cache (few MBs), might have dirty
>>>>>> content belonging to the guest page in L3-cache. To tackle this,
>>>>>> we need to flush such dirty content from d-cache so that guest
>>>>>> will see correct contents of guest page when guest MMU is disabled.
>>>>>>
>>>>>> The patch fixes coherent_icache_guest_page() for external L3-cache.
>>>>>>
>>>>>> Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org>
>>>>>> ---
>>>>>>  arch/arm64/include/asm/kvm_mmu.h |    2 ++
>>>>>>  1 file changed, 2 insertions(+)
>>>>>>
>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h
>>>>>> b/arch/arm64/include/asm/kvm_mmu.h
>>>>>> index efe609c..5129038 100644
>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>>>>>> @@ -123,6 +123,8 @@ static inline void
>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
>>>>>>       if (!icache_is_aliasing()) {            /* PIPT */
>>>>>>               unsigned long hva = gfn_to_hva(kvm, gfn);
>>>>>>               flush_icache_range(hva, hva + PAGE_SIZE);
>>>>>> +             /* Flush d-cache for systems with external caches. */
>>>>>> +             __flush_dcache_area((void *) hva, PAGE_SIZE);
>>>>>>       } else if (!icache_is_aivivt()) {       /* non ASID-tagged VIVT
>>>>>> */
>>>>>>               /* any kind of VIPT cache */
>>>>>>               __flush_icache_all();
>>>>>
>>>>>
>>>>>
>>>>> [adding Will to the discussion as we talked about this in the past]
>>>>>
>>>>> That's definitely an issue, but I'm not sure the fix is to hit the data
>>>>> cache on each page mapping. It looks overkill.
>>>>>
>>>>> Wouldn't it be enough to let userspace do the cache cleaning? kvmtools
>>>>> knows which bits of the guest memory have been touched, and can do a
>>>>> "DC
>>>>> DVAC" on this region.
>>>>
>>>>
>>>>
>>>> It seems a bit unnatural to have cache cleaning is user-space. I am sure
>>>> other architectures don't have such cache cleaning in user-space for
>>>> KVM.
>>>>
>>>>>
>>>>> The alternative is do it in the kernel before running any vcpu - but
>>>>> that's not very nice either (have to clean the whole of the guest
>>>>> memory, which makes a full dcache clean more appealing).
>>>>
>>>>
>>>>
>>>> Actually, cleaning full d-cache by set/way upon first run of VCPU was
>>>> our second option but current approach seemed very simple hence
>>>> we went for this.
>>>>
>>>> If more people vote for full d-cache clean upon first run of VCPU then
>>>> we should revise this patch.
>>>
>>>
>>>
>>> Can you please give the attached patch a spin on your HW? I've
>>> boot-tested
>>> it on a model, but of course I can't really verify that it fixes your
>>> issue.
>>>
>>> As far as I can see, it should fix it without any additional flushing.
>>>
>>> Please let me know how it goes.
>>
>>
>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are
>> treated as "Normal Non-shareable, Inner Write-Back Write-Allocate,
>> Outer Write-Back Write-Allocate memory"
>>
>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are
>> treated as "Strongly-ordered device memory"
>>
>> Now if Guest/VM access hardware MMIO devices directly (such as
>> VGIC CPU interface) with MMU off then MMIO devices will be
>> accessed as normal memory when HCR_EL2.DC=1.
>
>
> I don't think so. Stage-2 still applies, and should force MMIO to be
> accessed as device memory.
>
>
>> The HCR_EL2.DC=1 makes sense only if we have all software
>> emulated devices for Guest/VM which is not true for KVM ARM or
>> KVM ARM64 because we use VGIC.
>>
>> IMHO, this patch enforces incorrect memory attribute for Guest/VM
>> when Stage1 MMU is off.
>
>
> See above. My understanding is that HCR.DC controls the default output of
> Stage-1, and Stage-2 overrides still apply.

You had mentioned that PAGE_S2_DEVICE attribute was redundant
and wanted guest to decide the memory attribute. In other words, you
did not want to enforce any memory attribute in Stage2.

Please refer to this patch https://patchwork.kernel.org/patch/2543201/

>
>
>> Nevertheless, we will still try your patch.
>
>
> Thanks.
>
>
>         M.
> --
> Fast, cheap, reliable. Pick two.

Regards,
Anup
Marc Zyngier Aug. 15, 2013, 2:47 p.m. UTC | #5
On 2013-08-15 14:31, Anup Patel wrote:
> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier <marc.zyngier@arm.com> 
> wrote:
>> On 2013-08-15 07:26, Anup Patel wrote:
>>>
>>> Hi Marc,
>>>
>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier 
>>> <marc.zyngier@arm.com>
>>> wrote:
>>>>
>>>> Hi Anup,
>>>>
>>>>
>>>> On 2013-08-14 15:22, Anup Patel wrote:
>>>>>
>>>>>
>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier 
>>>>> <marc.zyngier@arm.com>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> Hi Pranav,
>>>>>>
>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
>>>>>>>
>>>>>>>
>>>>>>> Systems with large external L3-cache (few MBs), might have 
>>>>>>> dirty
>>>>>>> content belonging to the guest page in L3-cache. To tackle 
>>>>>>> this,
>>>>>>> we need to flush such dirty content from d-cache so that guest
>>>>>>> will see correct contents of guest page when guest MMU is 
>>>>>>> disabled.
>>>>>>>
>>>>>>> The patch fixes coherent_icache_guest_page() for external 
>>>>>>> L3-cache.
>>>>>>>
>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar 
>>>>>>> <pranavkumar@linaro.org>
>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org>
>>>>>>> ---
>>>>>>>  arch/arm64/include/asm/kvm_mmu.h |    2 ++
>>>>>>>  1 file changed, 2 insertions(+)
>>>>>>>
>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h
>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h
>>>>>>> index efe609c..5129038 100644
>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>>>>>>> @@ -123,6 +123,8 @@ static inline void
>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
>>>>>>>       if (!icache_is_aliasing()) {            /* PIPT */
>>>>>>>               unsigned long hva = gfn_to_hva(kvm, gfn);
>>>>>>>               flush_icache_range(hva, hva + PAGE_SIZE);
>>>>>>> +             /* Flush d-cache for systems with external 
>>>>>>> caches. */
>>>>>>> +             __flush_dcache_area((void *) hva, PAGE_SIZE);
>>>>>>>       } else if (!icache_is_aivivt()) {       /* non 
>>>>>>> ASID-tagged VIVT
>>>>>>> */
>>>>>>>               /* any kind of VIPT cache */
>>>>>>>               __flush_icache_all();
>>>>>>
>>>>>>
>>>>>>
>>>>>> [adding Will to the discussion as we talked about this in the 
>>>>>> past]
>>>>>>
>>>>>> That's definitely an issue, but I'm not sure the fix is to hit 
>>>>>> the data
>>>>>> cache on each page mapping. It looks overkill.
>>>>>>
>>>>>> Wouldn't it be enough to let userspace do the cache cleaning? 
>>>>>> kvmtools
>>>>>> knows which bits of the guest memory have been touched, and can 
>>>>>> do a
>>>>>> "DC
>>>>>> DVAC" on this region.
>>>>>
>>>>>
>>>>>
>>>>> It seems a bit unnatural to have cache cleaning is user-space. I 
>>>>> am sure
>>>>> other architectures don't have such cache cleaning in user-space 
>>>>> for
>>>>> KVM.
>>>>>
>>>>>>
>>>>>> The alternative is do it in the kernel before running any vcpu - 
>>>>>> but
>>>>>> that's not very nice either (have to clean the whole of the 
>>>>>> guest
>>>>>> memory, which makes a full dcache clean more appealing).
>>>>>
>>>>>
>>>>>
>>>>> Actually, cleaning full d-cache by set/way upon first run of VCPU 
>>>>> was
>>>>> our second option but current approach seemed very simple hence
>>>>> we went for this.
>>>>>
>>>>> If more people vote for full d-cache clean upon first run of VCPU 
>>>>> then
>>>>> we should revise this patch.
>>>>
>>>>
>>>>
>>>> Can you please give the attached patch a spin on your HW? I've
>>>> boot-tested
>>>> it on a model, but of course I can't really verify that it fixes 
>>>> your
>>>> issue.
>>>>
>>>> As far as I can see, it should fix it without any additional 
>>>> flushing.
>>>>
>>>> Please let me know how it goes.
>>>
>>>
>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are
>>> treated as "Normal Non-shareable, Inner Write-Back Write-Allocate,
>>> Outer Write-Back Write-Allocate memory"
>>>
>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are
>>> treated as "Strongly-ordered device memory"
>>>
>>> Now if Guest/VM access hardware MMIO devices directly (such as
>>> VGIC CPU interface) with MMU off then MMIO devices will be
>>> accessed as normal memory when HCR_EL2.DC=1.
>>
>>
>> I don't think so. Stage-2 still applies, and should force MMIO to be
>> accessed as device memory.
>>
>>
>>> The HCR_EL2.DC=1 makes sense only if we have all software
>>> emulated devices for Guest/VM which is not true for KVM ARM or
>>> KVM ARM64 because we use VGIC.
>>>
>>> IMHO, this patch enforces incorrect memory attribute for Guest/VM
>>> when Stage1 MMU is off.
>>
>>
>> See above. My understanding is that HCR.DC controls the default 
>> output of
>> Stage-1, and Stage-2 overrides still apply.
>
> You had mentioned that PAGE_S2_DEVICE attribute was redundant
> and wanted guest to decide the memory attribute. In other words, you
> did not want to enforce any memory attribute in Stage2.
>
> Please refer to this patch 
> https://patchwork.kernel.org/patch/2543201/

This patch has never been merged. If you carry on following the 
discussion, you will certainly notice it was dropped for a very good 
reason:
https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html

So Stage-2 memory attributes are used, they are not going away, and 
they are essential to the patch I sent this morning.

         M.
Anup Patel Aug. 15, 2013, 3:13 p.m. UTC | #6
Hi Marc,

On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
> On 2013-08-15 14:31, Anup Patel wrote:
>>
>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier <marc.zyngier@arm.com>
>> wrote:
>>>
>>> On 2013-08-15 07:26, Anup Patel wrote:
>>>>
>>>>
>>>> Hi Marc,
>>>>
>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier <marc.zyngier@arm.com>
>>>> wrote:
>>>>>
>>>>>
>>>>> Hi Anup,
>>>>>
>>>>>
>>>>> On 2013-08-14 15:22, Anup Patel wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier <marc.zyngier@arm.com>
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hi Pranav,
>>>>>>>
>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Systems with large external L3-cache (few MBs), might have dirty
>>>>>>>> content belonging to the guest page in L3-cache. To tackle this,
>>>>>>>> we need to flush such dirty content from d-cache so that guest
>>>>>>>> will see correct contents of guest page when guest MMU is disabled.
>>>>>>>>
>>>>>>>> The patch fixes coherent_icache_guest_page() for external L3-cache.
>>>>>>>>
>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
>>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org>
>>>>>>>> ---
>>>>>>>>  arch/arm64/include/asm/kvm_mmu.h |    2 ++
>>>>>>>>  1 file changed, 2 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h
>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h
>>>>>>>> index efe609c..5129038 100644
>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>>>>>>>> @@ -123,6 +123,8 @@ static inline void
>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
>>>>>>>>       if (!icache_is_aliasing()) {            /* PIPT */
>>>>>>>>               unsigned long hva = gfn_to_hva(kvm, gfn);
>>>>>>>>               flush_icache_range(hva, hva + PAGE_SIZE);
>>>>>>>> +             /* Flush d-cache for systems with external caches. */
>>>>>>>> +             __flush_dcache_area((void *) hva, PAGE_SIZE);
>>>>>>>>       } else if (!icache_is_aivivt()) {       /* non ASID-tagged
>>>>>>>> VIVT
>>>>>>>> */
>>>>>>>>               /* any kind of VIPT cache */
>>>>>>>>               __flush_icache_all();
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> [adding Will to the discussion as we talked about this in the past]
>>>>>>>
>>>>>>> That's definitely an issue, but I'm not sure the fix is to hit the
>>>>>>> data
>>>>>>> cache on each page mapping. It looks overkill.
>>>>>>>
>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning?
>>>>>>> kvmtools
>>>>>>> knows which bits of the guest memory have been touched, and can do a
>>>>>>> "DC
>>>>>>> DVAC" on this region.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> It seems a bit unnatural to have cache cleaning is user-space. I am
>>>>>> sure
>>>>>> other architectures don't have such cache cleaning in user-space for
>>>>>> KVM.
>>>>>>
>>>>>>>
>>>>>>> The alternative is do it in the kernel before running any vcpu - but
>>>>>>> that's not very nice either (have to clean the whole of the guest
>>>>>>> memory, which makes a full dcache clean more appealing).
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Actually, cleaning full d-cache by set/way upon first run of VCPU was
>>>>>> our second option but current approach seemed very simple hence
>>>>>> we went for this.
>>>>>>
>>>>>> If more people vote for full d-cache clean upon first run of VCPU then
>>>>>> we should revise this patch.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Can you please give the attached patch a spin on your HW? I've
>>>>> boot-tested
>>>>> it on a model, but of course I can't really verify that it fixes your
>>>>> issue.
>>>>>
>>>>> As far as I can see, it should fix it without any additional flushing.
>>>>>
>>>>> Please let me know how it goes.
>>>>
>>>>
>>>>
>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are
>>>> treated as "Normal Non-shareable, Inner Write-Back Write-Allocate,
>>>> Outer Write-Back Write-Allocate memory"
>>>>
>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are
>>>> treated as "Strongly-ordered device memory"
>>>>
>>>> Now if Guest/VM access hardware MMIO devices directly (such as
>>>> VGIC CPU interface) with MMU off then MMIO devices will be
>>>> accessed as normal memory when HCR_EL2.DC=1.
>>>
>>>
>>>
>>> I don't think so. Stage-2 still applies, and should force MMIO to be
>>> accessed as device memory.
>>>
>>>
>>>> The HCR_EL2.DC=1 makes sense only if we have all software
>>>> emulated devices for Guest/VM which is not true for KVM ARM or
>>>> KVM ARM64 because we use VGIC.
>>>>
>>>> IMHO, this patch enforces incorrect memory attribute for Guest/VM
>>>> when Stage1 MMU is off.
>>>
>>>
>>>
>>> See above. My understanding is that HCR.DC controls the default output of
>>> Stage-1, and Stage-2 overrides still apply.
>>
>>
>> You had mentioned that PAGE_S2_DEVICE attribute was redundant
>> and wanted guest to decide the memory attribute. In other words, you
>> did not want to enforce any memory attribute in Stage2.
>>
>> Please refer to this patch https://patchwork.kernel.org/patch/2543201/
>
>
> This patch has never been merged. If you carry on following the discussion,
> you will certainly notice it was dropped for a very good reason:
> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html
>
> So Stage-2 memory attributes are used, they are not going away, and they are
> essential to the patch I sent this morning.
>
>
>         M.
> --
> Fast, cheap, reliable. Pick two.

HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM is
provided a DMA-capable device in pass-through mode. The reason being
bootloader/firmware typically don't enable MMU and such bootloader/firmware
will programme a pass-through DMA-capable device without any flushes to
guest RAM (because it has not enabled MMU).

A good example of such a device would be SATA AHCI controller given to a
Guest/VM as direct access (using SystemMMU) and Guest bootloader/firmware
accessing this SATA AHCI controller to load kernel images from SATA disk.
In this situation, we will have to hack Guest bootloader/firmware AHCI driver to
explicitly flushes to Guest RAM (because have HCR_EL2.DC=1).

Regards,
Anup
Marc Zyngier Aug. 15, 2013, 3:37 p.m. UTC | #7
On 2013-08-15 16:13, Anup Patel wrote:
> Hi Marc,
>
> On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com> 
> wrote:
>> On 2013-08-15 14:31, Anup Patel wrote:
>>>
>>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier 
>>> <marc.zyngier@arm.com>
>>> wrote:
>>>>
>>>> On 2013-08-15 07:26, Anup Patel wrote:
>>>>>
>>>>>
>>>>> Hi Marc,
>>>>>
>>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier 
>>>>> <marc.zyngier@arm.com>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> Hi Anup,
>>>>>>
>>>>>>
>>>>>> On 2013-08-14 15:22, Anup Patel wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier 
>>>>>>> <marc.zyngier@arm.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Pranav,
>>>>>>>>
>>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Systems with large external L3-cache (few MBs), might have 
>>>>>>>>> dirty
>>>>>>>>> content belonging to the guest page in L3-cache. To tackle 
>>>>>>>>> this,
>>>>>>>>> we need to flush such dirty content from d-cache so that 
>>>>>>>>> guest
>>>>>>>>> will see correct contents of guest page when guest MMU is 
>>>>>>>>> disabled.
>>>>>>>>>
>>>>>>>>> The patch fixes coherent_icache_guest_page() for external 
>>>>>>>>> L3-cache.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar 
>>>>>>>>> <pranavkumar@linaro.org>
>>>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org>
>>>>>>>>> ---
>>>>>>>>>  arch/arm64/include/asm/kvm_mmu.h |    2 ++
>>>>>>>>>  1 file changed, 2 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h
>>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h
>>>>>>>>> index efe609c..5129038 100644
>>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
>>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>>>>>>>>> @@ -123,6 +123,8 @@ static inline void
>>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
>>>>>>>>>       if (!icache_is_aliasing()) {            /* PIPT */
>>>>>>>>>               unsigned long hva = gfn_to_hva(kvm, gfn);
>>>>>>>>>               flush_icache_range(hva, hva + PAGE_SIZE);
>>>>>>>>> +             /* Flush d-cache for systems with external 
>>>>>>>>> caches. */
>>>>>>>>> +             __flush_dcache_area((void *) hva, PAGE_SIZE);
>>>>>>>>>       } else if (!icache_is_aivivt()) {       /* non 
>>>>>>>>> ASID-tagged
>>>>>>>>> VIVT
>>>>>>>>> */
>>>>>>>>>               /* any kind of VIPT cache */
>>>>>>>>>               __flush_icache_all();
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> [adding Will to the discussion as we talked about this in the 
>>>>>>>> past]
>>>>>>>>
>>>>>>>> That's definitely an issue, but I'm not sure the fix is to hit 
>>>>>>>> the
>>>>>>>> data
>>>>>>>> cache on each page mapping. It looks overkill.
>>>>>>>>
>>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning?
>>>>>>>> kvmtools
>>>>>>>> knows which bits of the guest memory have been touched, and 
>>>>>>>> can do a
>>>>>>>> "DC
>>>>>>>> DVAC" on this region.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> It seems a bit unnatural to have cache cleaning is user-space. 
>>>>>>> I am
>>>>>>> sure
>>>>>>> other architectures don't have such cache cleaning in 
>>>>>>> user-space for
>>>>>>> KVM.
>>>>>>>
>>>>>>>>
>>>>>>>> The alternative is do it in the kernel before running any vcpu 
>>>>>>>> - but
>>>>>>>> that's not very nice either (have to clean the whole of the 
>>>>>>>> guest
>>>>>>>> memory, which makes a full dcache clean more appealing).
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Actually, cleaning full d-cache by set/way upon first run of 
>>>>>>> VCPU was
>>>>>>> our second option but current approach seemed very simple hence
>>>>>>> we went for this.
>>>>>>>
>>>>>>> If more people vote for full d-cache clean upon first run of 
>>>>>>> VCPU then
>>>>>>> we should revise this patch.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Can you please give the attached patch a spin on your HW? I've
>>>>>> boot-tested
>>>>>> it on a model, but of course I can't really verify that it fixes 
>>>>>> your
>>>>>> issue.
>>>>>>
>>>>>> As far as I can see, it should fix it without any additional 
>>>>>> flushing.
>>>>>>
>>>>>> Please let me know how it goes.
>>>>>
>>>>>
>>>>>
>>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are
>>>>> treated as "Normal Non-shareable, Inner Write-Back 
>>>>> Write-Allocate,
>>>>> Outer Write-Back Write-Allocate memory"
>>>>>
>>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are
>>>>> treated as "Strongly-ordered device memory"
>>>>>
>>>>> Now if Guest/VM access hardware MMIO devices directly (such as
>>>>> VGIC CPU interface) with MMU off then MMIO devices will be
>>>>> accessed as normal memory when HCR_EL2.DC=1.
>>>>
>>>>
>>>>
>>>> I don't think so. Stage-2 still applies, and should force MMIO to 
>>>> be
>>>> accessed as device memory.
>>>>
>>>>
>>>>> The HCR_EL2.DC=1 makes sense only if we have all software
>>>>> emulated devices for Guest/VM which is not true for KVM ARM or
>>>>> KVM ARM64 because we use VGIC.
>>>>>
>>>>> IMHO, this patch enforces incorrect memory attribute for Guest/VM
>>>>> when Stage1 MMU is off.
>>>>
>>>>
>>>>
>>>> See above. My understanding is that HCR.DC controls the default 
>>>> output of
>>>> Stage-1, and Stage-2 overrides still apply.
>>>
>>>
>>> You had mentioned that PAGE_S2_DEVICE attribute was redundant
>>> and wanted guest to decide the memory attribute. In other words, 
>>> you
>>> did not want to enforce any memory attribute in Stage2.
>>>
>>> Please refer to this patch 
>>> https://patchwork.kernel.org/patch/2543201/
>>
>>
>> This patch has never been merged. If you carry on following the 
>> discussion,
>> you will certainly notice it was dropped for a very good reason:
>> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html
>>
>> So Stage-2 memory attributes are used, they are not going away, and 
>> they are
>> essential to the patch I sent this morning.
>>
>>
>>         M.
>> --
>> Fast, cheap, reliable. Pick two.
>
> HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM 
> is
> provided a DMA-capable device in pass-through mode. The reason being
> bootloader/firmware typically don't enable MMU and such 
> bootloader/firmware
> will programme a pass-through DMA-capable device without any flushes 
> to
> guest RAM (because it has not enabled MMU).
>
> A good example of such a device would be SATA AHCI controller given 
> to a
> Guest/VM as direct access (using SystemMMU) and Guest 
> bootloader/firmware
> accessing this SATA AHCI controller to load kernel images from SATA 
> disk.
> In this situation, we will have to hack Guest bootloader/firmware
> AHCI driver to
> explicitly flushes to Guest RAM (because have HCR_EL2.DC=1).

OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of 
curiosity: is that a made up example or something you actually have?

Back to square one:
Can you please benchmark the various cache cleaning options (global at 
startup time, per-page on S2 translation fault, and user-space)?

Thanks,

         M.
Marc Zyngier Aug. 15, 2013, 3:42 p.m. UTC | #8
Hi Pranav,

On 2013-08-15 09:39, Pranavkumar Sawargaonkar wrote:

[...]

> Just to add few more points to this discussion :
>
> Actually we are already flushing d-cache in 
> "coherent_icache_guest_page"
> by calling flush_icache_range.
>
> Now in my patch i am doing same thing one more time by calling
> __flush_dcache_area just below flush_icache_range.
>
> Main difference between d-cache flushing in above two routines is :
> flush_icache_range is flushing d-cache by point of unification (dc
> cvau) and
>  __flush_dcache_area is flushing that by point of coherency (dc      
> civac).
>
> Now since second routine is doing it by coherency it is making L3C to
> be coherent and sync changes with immediate effect and solving the
> issue.

I understand that. What I object to is that always cleaning to PoC is 
expensive (both in terms of latency and energy), and doing so for each 
page, long after the MMU has been enabled feels like a great loss of 
performance.

My gut feeling is that a single clean operation at startup time 
(whether it is from the kernel or userspace) would be a lot better.

Thanks,

         M.
Anup Patel Aug. 15, 2013, 3:45 p.m. UTC | #9
On Thu, Aug 15, 2013 at 9:07 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
> On 2013-08-15 16:13, Anup Patel wrote:
>>
>> Hi Marc,
>>
>> On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com>
>> wrote:
>>>
>>> On 2013-08-15 14:31, Anup Patel wrote:
>>>>
>>>>
>>>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier <marc.zyngier@arm.com>
>>>> wrote:
>>>>>
>>>>>
>>>>> On 2013-08-15 07:26, Anup Patel wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi Marc,
>>>>>>
>>>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier <marc.zyngier@arm.com>
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hi Anup,
>>>>>>>
>>>>>>>
>>>>>>> On 2013-08-14 15:22, Anup Patel wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier <marc.zyngier@arm.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Pranav,
>>>>>>>>>
>>>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Systems with large external L3-cache (few MBs), might have dirty
>>>>>>>>>> content belonging to the guest page in L3-cache. To tackle this,
>>>>>>>>>> we need to flush such dirty content from d-cache so that guest
>>>>>>>>>> will see correct contents of guest page when guest MMU is
>>>>>>>>>> disabled.
>>>>>>>>>>
>>>>>>>>>> The patch fixes coherent_icache_guest_page() for external
>>>>>>>>>> L3-cache.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
>>>>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org>
>>>>>>>>>> ---
>>>>>>>>>>  arch/arm64/include/asm/kvm_mmu.h |    2 ++
>>>>>>>>>>  1 file changed, 2 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h
>>>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h
>>>>>>>>>> index efe609c..5129038 100644
>>>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
>>>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>>>>>>>>>> @@ -123,6 +123,8 @@ static inline void
>>>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
>>>>>>>>>>       if (!icache_is_aliasing()) {            /* PIPT */
>>>>>>>>>>               unsigned long hva = gfn_to_hva(kvm, gfn);
>>>>>>>>>>               flush_icache_range(hva, hva + PAGE_SIZE);
>>>>>>>>>> +             /* Flush d-cache for systems with external caches.
>>>>>>>>>> */
>>>>>>>>>> +             __flush_dcache_area((void *) hva, PAGE_SIZE);
>>>>>>>>>>       } else if (!icache_is_aivivt()) {       /* non ASID-tagged
>>>>>>>>>> VIVT
>>>>>>>>>> */
>>>>>>>>>>               /* any kind of VIPT cache */
>>>>>>>>>>               __flush_icache_all();
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [adding Will to the discussion as we talked about this in the past]
>>>>>>>>>
>>>>>>>>> That's definitely an issue, but I'm not sure the fix is to hit the
>>>>>>>>> data
>>>>>>>>> cache on each page mapping. It looks overkill.
>>>>>>>>>
>>>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning?
>>>>>>>>> kvmtools
>>>>>>>>> knows which bits of the guest memory have been touched, and can do
>>>>>>>>> a
>>>>>>>>> "DC
>>>>>>>>> DVAC" on this region.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> It seems a bit unnatural to have cache cleaning is user-space. I am
>>>>>>>> sure
>>>>>>>> other architectures don't have such cache cleaning in user-space for
>>>>>>>> KVM.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> The alternative is do it in the kernel before running any vcpu -
>>>>>>>>> but
>>>>>>>>> that's not very nice either (have to clean the whole of the guest
>>>>>>>>> memory, which makes a full dcache clean more appealing).
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Actually, cleaning full d-cache by set/way upon first run of VCPU
>>>>>>>> was
>>>>>>>> our second option but current approach seemed very simple hence
>>>>>>>> we went for this.
>>>>>>>>
>>>>>>>> If more people vote for full d-cache clean upon first run of VCPU
>>>>>>>> then
>>>>>>>> we should revise this patch.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Can you please give the attached patch a spin on your HW? I've
>>>>>>> boot-tested
>>>>>>> it on a model, but of course I can't really verify that it fixes your
>>>>>>> issue.
>>>>>>>
>>>>>>> As far as I can see, it should fix it without any additional
>>>>>>> flushing.
>>>>>>>
>>>>>>> Please let me know how it goes.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are
>>>>>> treated as "Normal Non-shareable, Inner Write-Back Write-Allocate,
>>>>>> Outer Write-Back Write-Allocate memory"
>>>>>>
>>>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are
>>>>>> treated as "Strongly-ordered device memory"
>>>>>>
>>>>>> Now if Guest/VM access hardware MMIO devices directly (such as
>>>>>> VGIC CPU interface) with MMU off then MMIO devices will be
>>>>>> accessed as normal memory when HCR_EL2.DC=1.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> I don't think so. Stage-2 still applies, and should force MMIO to be
>>>>> accessed as device memory.
>>>>>
>>>>>
>>>>>> The HCR_EL2.DC=1 makes sense only if we have all software
>>>>>> emulated devices for Guest/VM which is not true for KVM ARM or
>>>>>> KVM ARM64 because we use VGIC.
>>>>>>
>>>>>> IMHO, this patch enforces incorrect memory attribute for Guest/VM
>>>>>> when Stage1 MMU is off.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> See above. My understanding is that HCR.DC controls the default output
>>>>> of
>>>>> Stage-1, and Stage-2 overrides still apply.
>>>>
>>>>
>>>>
>>>> You had mentioned that PAGE_S2_DEVICE attribute was redundant
>>>> and wanted guest to decide the memory attribute. In other words, you
>>>> did not want to enforce any memory attribute in Stage2.
>>>>
>>>> Please refer to this patch https://patchwork.kernel.org/patch/2543201/
>>>
>>>
>>>
>>> This patch has never been merged. If you carry on following the
>>> discussion,
>>> you will certainly notice it was dropped for a very good reason:
>>> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html
>>>
>>> So Stage-2 memory attributes are used, they are not going away, and they
>>> are
>>> essential to the patch I sent this morning.
>>>
>>>
>>>         M.
>>> --
>>> Fast, cheap, reliable. Pick two.
>>
>>
>> HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM is
>> provided a DMA-capable device in pass-through mode. The reason being
>> bootloader/firmware typically don't enable MMU and such
>> bootloader/firmware
>> will programme a pass-through DMA-capable device without any flushes to
>> guest RAM (because it has not enabled MMU).
>>
>> A good example of such a device would be SATA AHCI controller given to a
>> Guest/VM as direct access (using SystemMMU) and Guest bootloader/firmware
>> accessing this SATA AHCI controller to load kernel images from SATA disk.
>> In this situation, we will have to hack Guest bootloader/firmware
>> AHCI driver to
>> explicitly flushes to Guest RAM (because have HCR_EL2.DC=1).
>
>
> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of
> curiosity: is that a made up example or something you actually have?

Actually, this would be a typical use-case in embedded virtualization.
Other devices that fit this use-case would be network controllers,
security engines, or some proprietary HW not having drivers in Linux.

>
> Back to square one:
> Can you please benchmark the various cache cleaning options (global at
> startup time, per-page on S2 translation fault, and user-space)?
>
> Thanks,
>
>
>         M.
> --
> Fast, cheap, reliable. Pick two.

--Anup
Christoffer Dall Aug. 15, 2013, 4:53 p.m. UTC | #10
On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote:
> On 2013-08-15 16:13, Anup Patel wrote:
> > Hi Marc,
> >
> > On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com> 
> > wrote:
> >> On 2013-08-15 14:31, Anup Patel wrote:
> >>>
> >>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier 
> >>> <marc.zyngier@arm.com>
> >>> wrote:
> >>>>
> >>>> On 2013-08-15 07:26, Anup Patel wrote:
> >>>>>
> >>>>>
> >>>>> Hi Marc,
> >>>>>
> >>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier 
> >>>>> <marc.zyngier@arm.com>
> >>>>> wrote:
> >>>>>>
> >>>>>>
> >>>>>> Hi Anup,
> >>>>>>
> >>>>>>
> >>>>>> On 2013-08-14 15:22, Anup Patel wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier 
> >>>>>>> <marc.zyngier@arm.com>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Pranav,
> >>>>>>>>
> >>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Systems with large external L3-cache (few MBs), might have 
> >>>>>>>>> dirty
> >>>>>>>>> content belonging to the guest page in L3-cache. To tackle 
> >>>>>>>>> this,
> >>>>>>>>> we need to flush such dirty content from d-cache so that 
> >>>>>>>>> guest
> >>>>>>>>> will see correct contents of guest page when guest MMU is 
> >>>>>>>>> disabled.
> >>>>>>>>>
> >>>>>>>>> The patch fixes coherent_icache_guest_page() for external 
> >>>>>>>>> L3-cache.
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar 
> >>>>>>>>> <pranavkumar@linaro.org>
> >>>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org>
> >>>>>>>>> ---
> >>>>>>>>>  arch/arm64/include/asm/kvm_mmu.h |    2 ++
> >>>>>>>>>  1 file changed, 2 insertions(+)
> >>>>>>>>>
> >>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h
> >>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h
> >>>>>>>>> index efe609c..5129038 100644
> >>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
> >>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
> >>>>>>>>> @@ -123,6 +123,8 @@ static inline void
> >>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
> >>>>>>>>>       if (!icache_is_aliasing()) {            /* PIPT */
> >>>>>>>>>               unsigned long hva = gfn_to_hva(kvm, gfn);
> >>>>>>>>>               flush_icache_range(hva, hva + PAGE_SIZE);
> >>>>>>>>> +             /* Flush d-cache for systems with external 
> >>>>>>>>> caches. */
> >>>>>>>>> +             __flush_dcache_area((void *) hva, PAGE_SIZE);
> >>>>>>>>>       } else if (!icache_is_aivivt()) {       /* non 
> >>>>>>>>> ASID-tagged
> >>>>>>>>> VIVT
> >>>>>>>>> */
> >>>>>>>>>               /* any kind of VIPT cache */
> >>>>>>>>>               __flush_icache_all();
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> [adding Will to the discussion as we talked about this in the 
> >>>>>>>> past]
> >>>>>>>>
> >>>>>>>> That's definitely an issue, but I'm not sure the fix is to hit 
> >>>>>>>> the
> >>>>>>>> data
> >>>>>>>> cache on each page mapping. It looks overkill.
> >>>>>>>>
> >>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning?
> >>>>>>>> kvmtools
> >>>>>>>> knows which bits of the guest memory have been touched, and 
> >>>>>>>> can do a
> >>>>>>>> "DC
> >>>>>>>> DVAC" on this region.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> It seems a bit unnatural to have cache cleaning is user-space. 
> >>>>>>> I am
> >>>>>>> sure
> >>>>>>> other architectures don't have such cache cleaning in 
> >>>>>>> user-space for
> >>>>>>> KVM.
> >>>>>>>
> >>>>>>>>
> >>>>>>>> The alternative is do it in the kernel before running any vcpu 
> >>>>>>>> - but
> >>>>>>>> that's not very nice either (have to clean the whole of the 
> >>>>>>>> guest
> >>>>>>>> memory, which makes a full dcache clean more appealing).
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Actually, cleaning full d-cache by set/way upon first run of 
> >>>>>>> VCPU was
> >>>>>>> our second option but current approach seemed very simple hence
> >>>>>>> we went for this.
> >>>>>>>
> >>>>>>> If more people vote for full d-cache clean upon first run of 
> >>>>>>> VCPU then
> >>>>>>> we should revise this patch.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Can you please give the attached patch a spin on your HW? I've
> >>>>>> boot-tested
> >>>>>> it on a model, but of course I can't really verify that it fixes 
> >>>>>> your
> >>>>>> issue.
> >>>>>>
> >>>>>> As far as I can see, it should fix it without any additional 
> >>>>>> flushing.
> >>>>>>
> >>>>>> Please let me know how it goes.
> >>>>>
> >>>>>
> >>>>>
> >>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are
> >>>>> treated as "Normal Non-shareable, Inner Write-Back 
> >>>>> Write-Allocate,
> >>>>> Outer Write-Back Write-Allocate memory"
> >>>>>
> >>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are
> >>>>> treated as "Strongly-ordered device memory"
> >>>>>
> >>>>> Now if Guest/VM access hardware MMIO devices directly (such as
> >>>>> VGIC CPU interface) with MMU off then MMIO devices will be
> >>>>> accessed as normal memory when HCR_EL2.DC=1.
> >>>>
> >>>>
> >>>>
> >>>> I don't think so. Stage-2 still applies, and should force MMIO to 
> >>>> be
> >>>> accessed as device memory.
> >>>>
> >>>>
> >>>>> The HCR_EL2.DC=1 makes sense only if we have all software
> >>>>> emulated devices for Guest/VM which is not true for KVM ARM or
> >>>>> KVM ARM64 because we use VGIC.
> >>>>>
> >>>>> IMHO, this patch enforces incorrect memory attribute for Guest/VM
> >>>>> when Stage1 MMU is off.
> >>>>
> >>>>
> >>>>
> >>>> See above. My understanding is that HCR.DC controls the default 
> >>>> output of
> >>>> Stage-1, and Stage-2 overrides still apply.
> >>>
> >>>
> >>> You had mentioned that PAGE_S2_DEVICE attribute was redundant
> >>> and wanted guest to decide the memory attribute. In other words, 
> >>> you
> >>> did not want to enforce any memory attribute in Stage2.
> >>>
> >>> Please refer to this patch 
> >>> https://patchwork.kernel.org/patch/2543201/
> >>
> >>
> >> This patch has never been merged. If you carry on following the 
> >> discussion,
> >> you will certainly notice it was dropped for a very good reason:
> >> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html
> >>
> >> So Stage-2 memory attributes are used, they are not going away, and 
> >> they are
> >> essential to the patch I sent this morning.
> >>
> >>
> >>         M.
> >> --
> >> Fast, cheap, reliable. Pick two.
> >
> > HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM 
> > is
> > provided a DMA-capable device in pass-through mode. The reason being
> > bootloader/firmware typically don't enable MMU and such 
> > bootloader/firmware
> > will programme a pass-through DMA-capable device without any flushes 
> > to
> > guest RAM (because it has not enabled MMU).
> >
> > A good example of such a device would be SATA AHCI controller given 
> > to a
> > Guest/VM as direct access (using SystemMMU) and Guest 
> > bootloader/firmware
> > accessing this SATA AHCI controller to load kernel images from SATA 
> > disk.
> > In this situation, we will have to hack Guest bootloader/firmware
> > AHCI driver to
> > explicitly flushes to Guest RAM (because have HCR_EL2.DC=1).
> 
> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of 
> curiosity: is that a made up example or something you actually have?
> 
> Back to square one:
> Can you please benchmark the various cache cleaning options (global at 
> startup time, per-page on S2 translation fault, and user-space)?
> 
Eh, why is this a more valid argument than the vgic?  The device
passthrough Stage-2 mappings would still have the Stage-2 memory
attributes to configure that memory region as device memory.  Why is it
relevant if the device is DMA-capable in this context?

-Christoffer
Anup Patel Aug. 16, 2013, 5:02 a.m. UTC | #11
On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall
<christoffer.dall@linaro.org> wrote:
> On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote:
>> On 2013-08-15 16:13, Anup Patel wrote:
>> > Hi Marc,
>> >
>> > On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com>
>> > wrote:
>> >> On 2013-08-15 14:31, Anup Patel wrote:
>> >>>
>> >>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier
>> >>> <marc.zyngier@arm.com>
>> >>> wrote:
>> >>>>
>> >>>> On 2013-08-15 07:26, Anup Patel wrote:
>> >>>>>
>> >>>>>
>> >>>>> Hi Marc,
>> >>>>>
>> >>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier
>> >>>>> <marc.zyngier@arm.com>
>> >>>>> wrote:
>> >>>>>>
>> >>>>>>
>> >>>>>> Hi Anup,
>> >>>>>>
>> >>>>>>
>> >>>>>> On 2013-08-14 15:22, Anup Patel wrote:
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier
>> >>>>>>> <marc.zyngier@arm.com>
>> >>>>>>> wrote:
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Hi Pranav,
>> >>>>>>>>
>> >>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> Systems with large external L3-cache (few MBs), might have
>> >>>>>>>>> dirty
>> >>>>>>>>> content belonging to the guest page in L3-cache. To tackle
>> >>>>>>>>> this,
>> >>>>>>>>> we need to flush such dirty content from d-cache so that
>> >>>>>>>>> guest
>> >>>>>>>>> will see correct contents of guest page when guest MMU is
>> >>>>>>>>> disabled.
>> >>>>>>>>>
>> >>>>>>>>> The patch fixes coherent_icache_guest_page() for external
>> >>>>>>>>> L3-cache.
>> >>>>>>>>>
>> >>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar
>> >>>>>>>>> <pranavkumar@linaro.org>
>> >>>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org>
>> >>>>>>>>> ---
>> >>>>>>>>>  arch/arm64/include/asm/kvm_mmu.h |    2 ++
>> >>>>>>>>>  1 file changed, 2 insertions(+)
>> >>>>>>>>>
>> >>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h
>> >>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h
>> >>>>>>>>> index efe609c..5129038 100644
>> >>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
>> >>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>> >>>>>>>>> @@ -123,6 +123,8 @@ static inline void
>> >>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
>> >>>>>>>>>       if (!icache_is_aliasing()) {            /* PIPT */
>> >>>>>>>>>               unsigned long hva = gfn_to_hva(kvm, gfn);
>> >>>>>>>>>               flush_icache_range(hva, hva + PAGE_SIZE);
>> >>>>>>>>> +             /* Flush d-cache for systems with external
>> >>>>>>>>> caches. */
>> >>>>>>>>> +             __flush_dcache_area((void *) hva, PAGE_SIZE);
>> >>>>>>>>>       } else if (!icache_is_aivivt()) {       /* non
>> >>>>>>>>> ASID-tagged
>> >>>>>>>>> VIVT
>> >>>>>>>>> */
>> >>>>>>>>>               /* any kind of VIPT cache */
>> >>>>>>>>>               __flush_icache_all();
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> [adding Will to the discussion as we talked about this in the
>> >>>>>>>> past]
>> >>>>>>>>
>> >>>>>>>> That's definitely an issue, but I'm not sure the fix is to hit
>> >>>>>>>> the
>> >>>>>>>> data
>> >>>>>>>> cache on each page mapping. It looks overkill.
>> >>>>>>>>
>> >>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning?
>> >>>>>>>> kvmtools
>> >>>>>>>> knows which bits of the guest memory have been touched, and
>> >>>>>>>> can do a
>> >>>>>>>> "DC
>> >>>>>>>> DVAC" on this region.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> It seems a bit unnatural to have cache cleaning is user-space.
>> >>>>>>> I am
>> >>>>>>> sure
>> >>>>>>> other architectures don't have such cache cleaning in
>> >>>>>>> user-space for
>> >>>>>>> KVM.
>> >>>>>>>
>> >>>>>>>>
>> >>>>>>>> The alternative is do it in the kernel before running any vcpu
>> >>>>>>>> - but
>> >>>>>>>> that's not very nice either (have to clean the whole of the
>> >>>>>>>> guest
>> >>>>>>>> memory, which makes a full dcache clean more appealing).
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Actually, cleaning full d-cache by set/way upon first run of
>> >>>>>>> VCPU was
>> >>>>>>> our second option but current approach seemed very simple hence
>> >>>>>>> we went for this.
>> >>>>>>>
>> >>>>>>> If more people vote for full d-cache clean upon first run of
>> >>>>>>> VCPU then
>> >>>>>>> we should revise this patch.
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> Can you please give the attached patch a spin on your HW? I've
>> >>>>>> boot-tested
>> >>>>>> it on a model, but of course I can't really verify that it fixes
>> >>>>>> your
>> >>>>>> issue.
>> >>>>>>
>> >>>>>> As far as I can see, it should fix it without any additional
>> >>>>>> flushing.
>> >>>>>>
>> >>>>>> Please let me know how it goes.
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are
>> >>>>> treated as "Normal Non-shareable, Inner Write-Back
>> >>>>> Write-Allocate,
>> >>>>> Outer Write-Back Write-Allocate memory"
>> >>>>>
>> >>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are
>> >>>>> treated as "Strongly-ordered device memory"
>> >>>>>
>> >>>>> Now if Guest/VM access hardware MMIO devices directly (such as
>> >>>>> VGIC CPU interface) with MMU off then MMIO devices will be
>> >>>>> accessed as normal memory when HCR_EL2.DC=1.
>> >>>>
>> >>>>
>> >>>>
>> >>>> I don't think so. Stage-2 still applies, and should force MMIO to
>> >>>> be
>> >>>> accessed as device memory.
>> >>>>
>> >>>>
>> >>>>> The HCR_EL2.DC=1 makes sense only if we have all software
>> >>>>> emulated devices for Guest/VM which is not true for KVM ARM or
>> >>>>> KVM ARM64 because we use VGIC.
>> >>>>>
>> >>>>> IMHO, this patch enforces incorrect memory attribute for Guest/VM
>> >>>>> when Stage1 MMU is off.
>> >>>>
>> >>>>
>> >>>>
>> >>>> See above. My understanding is that HCR.DC controls the default
>> >>>> output of
>> >>>> Stage-1, and Stage-2 overrides still apply.
>> >>>
>> >>>
>> >>> You had mentioned that PAGE_S2_DEVICE attribute was redundant
>> >>> and wanted guest to decide the memory attribute. In other words,
>> >>> you
>> >>> did not want to enforce any memory attribute in Stage2.
>> >>>
>> >>> Please refer to this patch
>> >>> https://patchwork.kernel.org/patch/2543201/
>> >>
>> >>
>> >> This patch has never been merged. If you carry on following the
>> >> discussion,
>> >> you will certainly notice it was dropped for a very good reason:
>> >> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html
>> >>
>> >> So Stage-2 memory attributes are used, they are not going away, and
>> >> they are
>> >> essential to the patch I sent this morning.
>> >>
>> >>
>> >>         M.
>> >> --
>> >> Fast, cheap, reliable. Pick two.
>> >
>> > HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM
>> > is
>> > provided a DMA-capable device in pass-through mode. The reason being
>> > bootloader/firmware typically don't enable MMU and such
>> > bootloader/firmware
>> > will programme a pass-through DMA-capable device without any flushes
>> > to
>> > guest RAM (because it has not enabled MMU).
>> >
>> > A good example of such a device would be SATA AHCI controller given
>> > to a
>> > Guest/VM as direct access (using SystemMMU) and Guest
>> > bootloader/firmware
>> > accessing this SATA AHCI controller to load kernel images from SATA
>> > disk.
>> > In this situation, we will have to hack Guest bootloader/firmware
>> > AHCI driver to
>> > explicitly flushes to Guest RAM (because have HCR_EL2.DC=1).
>>
>> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of
>> curiosity: is that a made up example or something you actually have?
>>
>> Back to square one:
>> Can you please benchmark the various cache cleaning options (global at
>> startup time, per-page on S2 translation fault, and user-space)?
>>
> Eh, why is this a more valid argument than the vgic?  The device
> passthrough Stage-2 mappings would still have the Stage-2 memory
> attributes to configure that memory region as device memory.  Why is it
> relevant if the device is DMA-capable in this context?
>
> -Christoffer

Most ARM bootloader/firmware run with MMU off hence, they will not do
explicit cache flushes when programming DMA-capable device. Now If
HCR_EL2.DC=1 then Guest RAM updates done by bootloader/firmware
will not be visible to DMA-capable device.

--Anup
Anup Patel Aug. 16, 2013, 6:57 a.m. UTC | #12
On Fri, Aug 16, 2013 at 10:32 AM, Anup Patel <anup@brainfault.org> wrote:
> On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall
> <christoffer.dall@linaro.org> wrote:
>> On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote:
>>> On 2013-08-15 16:13, Anup Patel wrote:
>>> > Hi Marc,
>>> >
>>> > On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com>
>>> > wrote:
>>> >> On 2013-08-15 14:31, Anup Patel wrote:
>>> >>>
>>> >>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier
>>> >>> <marc.zyngier@arm.com>
>>> >>> wrote:
>>> >>>>
>>> >>>> On 2013-08-15 07:26, Anup Patel wrote:
>>> >>>>>
>>> >>>>>
>>> >>>>> Hi Marc,
>>> >>>>>
>>> >>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier
>>> >>>>> <marc.zyngier@arm.com>
>>> >>>>> wrote:
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Hi Anup,
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On 2013-08-14 15:22, Anup Patel wrote:
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier
>>> >>>>>>> <marc.zyngier@arm.com>
>>> >>>>>>> wrote:
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> Hi Pranav,
>>> >>>>>>>>
>>> >>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> Systems with large external L3-cache (few MBs), might have
>>> >>>>>>>>> dirty
>>> >>>>>>>>> content belonging to the guest page in L3-cache. To tackle
>>> >>>>>>>>> this,
>>> >>>>>>>>> we need to flush such dirty content from d-cache so that
>>> >>>>>>>>> guest
>>> >>>>>>>>> will see correct contents of guest page when guest MMU is
>>> >>>>>>>>> disabled.
>>> >>>>>>>>>
>>> >>>>>>>>> The patch fixes coherent_icache_guest_page() for external
>>> >>>>>>>>> L3-cache.
>>> >>>>>>>>>
>>> >>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar
>>> >>>>>>>>> <pranavkumar@linaro.org>
>>> >>>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org>
>>> >>>>>>>>> ---
>>> >>>>>>>>>  arch/arm64/include/asm/kvm_mmu.h |    2 ++
>>> >>>>>>>>>  1 file changed, 2 insertions(+)
>>> >>>>>>>>>
>>> >>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h
>>> >>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h
>>> >>>>>>>>> index efe609c..5129038 100644
>>> >>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
>>> >>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>>> >>>>>>>>> @@ -123,6 +123,8 @@ static inline void
>>> >>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
>>> >>>>>>>>>       if (!icache_is_aliasing()) {            /* PIPT */
>>> >>>>>>>>>               unsigned long hva = gfn_to_hva(kvm, gfn);
>>> >>>>>>>>>               flush_icache_range(hva, hva + PAGE_SIZE);
>>> >>>>>>>>> +             /* Flush d-cache for systems with external
>>> >>>>>>>>> caches. */
>>> >>>>>>>>> +             __flush_dcache_area((void *) hva, PAGE_SIZE);
>>> >>>>>>>>>       } else if (!icache_is_aivivt()) {       /* non
>>> >>>>>>>>> ASID-tagged
>>> >>>>>>>>> VIVT
>>> >>>>>>>>> */
>>> >>>>>>>>>               /* any kind of VIPT cache */
>>> >>>>>>>>>               __flush_icache_all();
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> [adding Will to the discussion as we talked about this in the
>>> >>>>>>>> past]
>>> >>>>>>>>
>>> >>>>>>>> That's definitely an issue, but I'm not sure the fix is to hit
>>> >>>>>>>> the
>>> >>>>>>>> data
>>> >>>>>>>> cache on each page mapping. It looks overkill.
>>> >>>>>>>>
>>> >>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning?
>>> >>>>>>>> kvmtools
>>> >>>>>>>> knows which bits of the guest memory have been touched, and
>>> >>>>>>>> can do a
>>> >>>>>>>> "DC
>>> >>>>>>>> DVAC" on this region.
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> It seems a bit unnatural to have cache cleaning is user-space.
>>> >>>>>>> I am
>>> >>>>>>> sure
>>> >>>>>>> other architectures don't have such cache cleaning in
>>> >>>>>>> user-space for
>>> >>>>>>> KVM.
>>> >>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> The alternative is do it in the kernel before running any vcpu
>>> >>>>>>>> - but
>>> >>>>>>>> that's not very nice either (have to clean the whole of the
>>> >>>>>>>> guest
>>> >>>>>>>> memory, which makes a full dcache clean more appealing).
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> Actually, cleaning full d-cache by set/way upon first run of
>>> >>>>>>> VCPU was
>>> >>>>>>> our second option but current approach seemed very simple hence
>>> >>>>>>> we went for this.
>>> >>>>>>>
>>> >>>>>>> If more people vote for full d-cache clean upon first run of
>>> >>>>>>> VCPU then
>>> >>>>>>> we should revise this patch.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Can you please give the attached patch a spin on your HW? I've
>>> >>>>>> boot-tested
>>> >>>>>> it on a model, but of course I can't really verify that it fixes
>>> >>>>>> your
>>> >>>>>> issue.
>>> >>>>>>
>>> >>>>>> As far as I can see, it should fix it without any additional
>>> >>>>>> flushing.
>>> >>>>>>
>>> >>>>>> Please let me know how it goes.
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are
>>> >>>>> treated as "Normal Non-shareable, Inner Write-Back
>>> >>>>> Write-Allocate,
>>> >>>>> Outer Write-Back Write-Allocate memory"
>>> >>>>>
>>> >>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are
>>> >>>>> treated as "Strongly-ordered device memory"
>>> >>>>>
>>> >>>>> Now if Guest/VM access hardware MMIO devices directly (such as
>>> >>>>> VGIC CPU interface) with MMU off then MMIO devices will be
>>> >>>>> accessed as normal memory when HCR_EL2.DC=1.
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> I don't think so. Stage-2 still applies, and should force MMIO to
>>> >>>> be
>>> >>>> accessed as device memory.
>>> >>>>
>>> >>>>
>>> >>>>> The HCR_EL2.DC=1 makes sense only if we have all software
>>> >>>>> emulated devices for Guest/VM which is not true for KVM ARM or
>>> >>>>> KVM ARM64 because we use VGIC.
>>> >>>>>
>>> >>>>> IMHO, this patch enforces incorrect memory attribute for Guest/VM
>>> >>>>> when Stage1 MMU is off.
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> See above. My understanding is that HCR.DC controls the default
>>> >>>> output of
>>> >>>> Stage-1, and Stage-2 overrides still apply.
>>> >>>
>>> >>>
>>> >>> You had mentioned that PAGE_S2_DEVICE attribute was redundant
>>> >>> and wanted guest to decide the memory attribute. In other words,
>>> >>> you
>>> >>> did not want to enforce any memory attribute in Stage2.
>>> >>>
>>> >>> Please refer to this patch
>>> >>> https://patchwork.kernel.org/patch/2543201/
>>> >>
>>> >>
>>> >> This patch has never been merged. If you carry on following the
>>> >> discussion,
>>> >> you will certainly notice it was dropped for a very good reason:
>>> >> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html
>>> >>
>>> >> So Stage-2 memory attributes are used, they are not going away, and
>>> >> they are
>>> >> essential to the patch I sent this morning.
>>> >>
>>> >>
>>> >>         M.
>>> >> --
>>> >> Fast, cheap, reliable. Pick two.
>>> >
>>> > HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM
>>> > is
>>> > provided a DMA-capable device in pass-through mode. The reason being
>>> > bootloader/firmware typically don't enable MMU and such
>>> > bootloader/firmware
>>> > will programme a pass-through DMA-capable device without any flushes
>>> > to
>>> > guest RAM (because it has not enabled MMU).
>>> >
>>> > A good example of such a device would be SATA AHCI controller given
>>> > to a
>>> > Guest/VM as direct access (using SystemMMU) and Guest
>>> > bootloader/firmware
>>> > accessing this SATA AHCI controller to load kernel images from SATA
>>> > disk.
>>> > In this situation, we will have to hack Guest bootloader/firmware
>>> > AHCI driver to
>>> > explicitly flushes to Guest RAM (because have HCR_EL2.DC=1).
>>>
>>> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of
>>> curiosity: is that a made up example or something you actually have?
>>>
>>> Back to square one:
>>> Can you please benchmark the various cache cleaning options (global at
>>> startup time, per-page on S2 translation fault, and user-space)?
>>>
>> Eh, why is this a more valid argument than the vgic?  The device
>> passthrough Stage-2 mappings would still have the Stage-2 memory
>> attributes to configure that memory region as device memory.  Why is it
>> relevant if the device is DMA-capable in this context?
>>
>> -Christoffer
>
> Most ARM bootloader/firmware run with MMU off hence, they will not do
> explicit cache flushes when programming DMA-capable device. Now If
> HCR_EL2.DC=1 then Guest RAM updates done by bootloader/firmware
> will not be visible to DMA-capable device.
>
> --Anup

The approach of flushing d-cache by set/way upon first run of VCPU will
not work because for set/way operations ARM ARM says: "For set/way
operations, and for All (entire cache) operations, the point is defined to be
to the next level of caching". In other words, set/way operations work upto
point of unification.

Also, Guest Linux already does __flush_dcache_all() from __cpu_setup()
at bootup time which does not work for us when L3-cache is enabled so,
there is no point is adding one more __flush_dcache_all() upon first run of
VCPU in KVM ARM64.

IMHO, we are left with following options:
1. Flush all RAM regions of VCPU using __flush_dcache_range()
    upon first run of VCPU
2. Implement outer-cache framework for ARM64 and flush all
    caches + outer cache (i.e. L3-cache) upon first run of VCPU
3. Use an alternate version of flush_icache_range() which will
    flush d-cache by PoC instead of PoU. We can also ensure
    that coherent_icache_guest_page() function will be called
    upon Stage2 prefetch aborts only.

Regards,
Anup
Christoffer Dall Aug. 16, 2013, 5:14 p.m. UTC | #13
On Fri, Aug 16, 2013 at 10:32:30AM +0530, Anup Patel wrote:
> On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall
> <christoffer.dall@linaro.org> wrote:
> > On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote:
> >> On 2013-08-15 16:13, Anup Patel wrote:
> >> > Hi Marc,
> >> >
> >> > On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com>
> >> > wrote:
> >> >> On 2013-08-15 14:31, Anup Patel wrote:
> >> >>>
> >> >>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier
> >> >>> <marc.zyngier@arm.com>
> >> >>> wrote:
> >> >>>>
> >> >>>> On 2013-08-15 07:26, Anup Patel wrote:
> >> >>>>>
> >> >>>>>
> >> >>>>> Hi Marc,
> >> >>>>>
> >> >>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier
> >> >>>>> <marc.zyngier@arm.com>
> >> >>>>> wrote:
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> Hi Anup,
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> On 2013-08-14 15:22, Anup Patel wrote:
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier
> >> >>>>>>> <marc.zyngier@arm.com>
> >> >>>>>>> wrote:
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> Hi Pranav,
> >> >>>>>>>>
> >> >>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> Systems with large external L3-cache (few MBs), might have
> >> >>>>>>>>> dirty
> >> >>>>>>>>> content belonging to the guest page in L3-cache. To tackle
> >> >>>>>>>>> this,
> >> >>>>>>>>> we need to flush such dirty content from d-cache so that
> >> >>>>>>>>> guest
> >> >>>>>>>>> will see correct contents of guest page when guest MMU is
> >> >>>>>>>>> disabled.
> >> >>>>>>>>>
> >> >>>>>>>>> The patch fixes coherent_icache_guest_page() for external
> >> >>>>>>>>> L3-cache.
> >> >>>>>>>>>
> >> >>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar
> >> >>>>>>>>> <pranavkumar@linaro.org>
> >> >>>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org>
> >> >>>>>>>>> ---
> >> >>>>>>>>>  arch/arm64/include/asm/kvm_mmu.h |    2 ++
> >> >>>>>>>>>  1 file changed, 2 insertions(+)
> >> >>>>>>>>>
> >> >>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h
> >> >>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h
> >> >>>>>>>>> index efe609c..5129038 100644
> >> >>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
> >> >>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
> >> >>>>>>>>> @@ -123,6 +123,8 @@ static inline void
> >> >>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
> >> >>>>>>>>>       if (!icache_is_aliasing()) {            /* PIPT */
> >> >>>>>>>>>               unsigned long hva = gfn_to_hva(kvm, gfn);
> >> >>>>>>>>>               flush_icache_range(hva, hva + PAGE_SIZE);
> >> >>>>>>>>> +             /* Flush d-cache for systems with external
> >> >>>>>>>>> caches. */
> >> >>>>>>>>> +             __flush_dcache_area((void *) hva, PAGE_SIZE);
> >> >>>>>>>>>       } else if (!icache_is_aivivt()) {       /* non
> >> >>>>>>>>> ASID-tagged
> >> >>>>>>>>> VIVT
> >> >>>>>>>>> */
> >> >>>>>>>>>               /* any kind of VIPT cache */
> >> >>>>>>>>>               __flush_icache_all();
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> [adding Will to the discussion as we talked about this in the
> >> >>>>>>>> past]
> >> >>>>>>>>
> >> >>>>>>>> That's definitely an issue, but I'm not sure the fix is to hit
> >> >>>>>>>> the
> >> >>>>>>>> data
> >> >>>>>>>> cache on each page mapping. It looks overkill.
> >> >>>>>>>>
> >> >>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning?
> >> >>>>>>>> kvmtools
> >> >>>>>>>> knows which bits of the guest memory have been touched, and
> >> >>>>>>>> can do a
> >> >>>>>>>> "DC
> >> >>>>>>>> DVAC" on this region.
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> It seems a bit unnatural to have cache cleaning is user-space.
> >> >>>>>>> I am
> >> >>>>>>> sure
> >> >>>>>>> other architectures don't have such cache cleaning in
> >> >>>>>>> user-space for
> >> >>>>>>> KVM.
> >> >>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> The alternative is do it in the kernel before running any vcpu
> >> >>>>>>>> - but
> >> >>>>>>>> that's not very nice either (have to clean the whole of the
> >> >>>>>>>> guest
> >> >>>>>>>> memory, which makes a full dcache clean more appealing).
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> Actually, cleaning full d-cache by set/way upon first run of
> >> >>>>>>> VCPU was
> >> >>>>>>> our second option but current approach seemed very simple hence
> >> >>>>>>> we went for this.
> >> >>>>>>>
> >> >>>>>>> If more people vote for full d-cache clean upon first run of
> >> >>>>>>> VCPU then
> >> >>>>>>> we should revise this patch.
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> Can you please give the attached patch a spin on your HW? I've
> >> >>>>>> boot-tested
> >> >>>>>> it on a model, but of course I can't really verify that it fixes
> >> >>>>>> your
> >> >>>>>> issue.
> >> >>>>>>
> >> >>>>>> As far as I can see, it should fix it without any additional
> >> >>>>>> flushing.
> >> >>>>>>
> >> >>>>>> Please let me know how it goes.
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are
> >> >>>>> treated as "Normal Non-shareable, Inner Write-Back
> >> >>>>> Write-Allocate,
> >> >>>>> Outer Write-Back Write-Allocate memory"
> >> >>>>>
> >> >>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are
> >> >>>>> treated as "Strongly-ordered device memory"
> >> >>>>>
> >> >>>>> Now if Guest/VM access hardware MMIO devices directly (such as
> >> >>>>> VGIC CPU interface) with MMU off then MMIO devices will be
> >> >>>>> accessed as normal memory when HCR_EL2.DC=1.
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> I don't think so. Stage-2 still applies, and should force MMIO to
> >> >>>> be
> >> >>>> accessed as device memory.
> >> >>>>
> >> >>>>
> >> >>>>> The HCR_EL2.DC=1 makes sense only if we have all software
> >> >>>>> emulated devices for Guest/VM which is not true for KVM ARM or
> >> >>>>> KVM ARM64 because we use VGIC.
> >> >>>>>
> >> >>>>> IMHO, this patch enforces incorrect memory attribute for Guest/VM
> >> >>>>> when Stage1 MMU is off.
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> See above. My understanding is that HCR.DC controls the default
> >> >>>> output of
> >> >>>> Stage-1, and Stage-2 overrides still apply.
> >> >>>
> >> >>>
> >> >>> You had mentioned that PAGE_S2_DEVICE attribute was redundant
> >> >>> and wanted guest to decide the memory attribute. In other words,
> >> >>> you
> >> >>> did not want to enforce any memory attribute in Stage2.
> >> >>>
> >> >>> Please refer to this patch
> >> >>> https://patchwork.kernel.org/patch/2543201/
> >> >>
> >> >>
> >> >> This patch has never been merged. If you carry on following the
> >> >> discussion,
> >> >> you will certainly notice it was dropped for a very good reason:
> >> >> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html
> >> >>
> >> >> So Stage-2 memory attributes are used, they are not going away, and
> >> >> they are
> >> >> essential to the patch I sent this morning.
> >> >>
> >> >>
> >> >>         M.
> >> >> --
> >> >> Fast, cheap, reliable. Pick two.
> >> >
> >> > HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM
> >> > is
> >> > provided a DMA-capable device in pass-through mode. The reason being
> >> > bootloader/firmware typically don't enable MMU and such
> >> > bootloader/firmware
> >> > will programme a pass-through DMA-capable device without any flushes
> >> > to
> >> > guest RAM (because it has not enabled MMU).
> >> >
> >> > A good example of such a device would be SATA AHCI controller given
> >> > to a
> >> > Guest/VM as direct access (using SystemMMU) and Guest
> >> > bootloader/firmware
> >> > accessing this SATA AHCI controller to load kernel images from SATA
> >> > disk.
> >> > In this situation, we will have to hack Guest bootloader/firmware
> >> > AHCI driver to
> >> > explicitly flushes to Guest RAM (because have HCR_EL2.DC=1).
> >>
> >> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of
> >> curiosity: is that a made up example or something you actually have?
> >>
> >> Back to square one:
> >> Can you please benchmark the various cache cleaning options (global at
> >> startup time, per-page on S2 translation fault, and user-space)?
> >>
> > Eh, why is this a more valid argument than the vgic?  The device
> > passthrough Stage-2 mappings would still have the Stage-2 memory
> > attributes to configure that memory region as device memory.  Why is it
> > relevant if the device is DMA-capable in this context?
> >
> 
> Most ARM bootloader/firmware run with MMU off hence, they will not do
> explicit cache flushes when programming DMA-capable device. Now If
> HCR_EL2.DC=1 then Guest RAM updates done by bootloader/firmware
> will not be visible to DMA-capable device.
>
Read my question again: The bootloaders running with the MMU off in a VM
will only disable the MMU for Stage-1 translations.  Stage-2
translations are still enforced using hte Stage-2 page tables and their
attributes for all mappings to devices will still enforce
strongly-ordered device type memory.

Again, what does it matter if the device is DMA-capable or not?

-Christoffer
Anup Patel Aug. 16, 2013, 5:18 p.m. UTC | #14
On 16 August 2013 22:44, Christoffer Dall <christoffer.dall@linaro.org>wrote:

> On Fri, Aug 16, 2013 at 10:32:30AM +0530, Anup Patel wrote:
> > On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall
> > <christoffer.dall@linaro.org> wrote:
> > > On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote:
> > >> On 2013-08-15 16:13, Anup Patel wrote:
> > >> > Hi Marc,
> > >> >
> > >> > On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com
> >
> > >> > wrote:
> > >> >> On 2013-08-15 14:31, Anup Patel wrote:
> > >> >>>
> > >> >>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier
> > >> >>> <marc.zyngier@arm.com>
> > >> >>> wrote:
> > >> >>>>
> > >> >>>> On 2013-08-15 07:26, Anup Patel wrote:
> > >> >>>>>
> > >> >>>>>
> > >> >>>>> Hi Marc,
> > >> >>>>>
> > >> >>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier
> > >> >>>>> <marc.zyngier@arm.com>
> > >> >>>>> wrote:
> > >> >>>>>>
> > >> >>>>>>
> > >> >>>>>> Hi Anup,
> > >> >>>>>>
> > >> >>>>>>
> > >> >>>>>> On 2013-08-14 15:22, Anup Patel wrote:
> > >> >>>>>>>
> > >> >>>>>>>
> > >> >>>>>>>
> > >> >>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier
> > >> >>>>>>> <marc.zyngier@arm.com>
> > >> >>>>>>> wrote:
> > >> >>>>>>>>
> > >> >>>>>>>>
> > >> >>>>>>>>
> > >> >>>>>>>> Hi Pranav,
> > >> >>>>>>>>
> > >> >>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
> > >> >>>>>>>>>
> > >> >>>>>>>>>
> > >> >>>>>>>>>
> > >> >>>>>>>>> Systems with large external L3-cache (few MBs), might have
> > >> >>>>>>>>> dirty
> > >> >>>>>>>>> content belonging to the guest page in L3-cache. To tackle
> > >> >>>>>>>>> this,
> > >> >>>>>>>>> we need to flush such dirty content from d-cache so that
> > >> >>>>>>>>> guest
> > >> >>>>>>>>> will see correct contents of guest page when guest MMU is
> > >> >>>>>>>>> disabled.
> > >> >>>>>>>>>
> > >> >>>>>>>>> The patch fixes coherent_icache_guest_page() for external
> > >> >>>>>>>>> L3-cache.
> > >> >>>>>>>>>
> > >> >>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar
> > >> >>>>>>>>> <pranavkumar@linaro.org>
> > >> >>>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org>
> > >> >>>>>>>>> ---
> > >> >>>>>>>>>  arch/arm64/include/asm/kvm_mmu.h |    2 ++
> > >> >>>>>>>>>  1 file changed, 2 insertions(+)
> > >> >>>>>>>>>
> > >> >>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h
> > >> >>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h
> > >> >>>>>>>>> index efe609c..5129038 100644
> > >> >>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
> > >> >>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
> > >> >>>>>>>>> @@ -123,6 +123,8 @@ static inline void
> > >> >>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
> > >> >>>>>>>>>       if (!icache_is_aliasing()) {            /* PIPT */
> > >> >>>>>>>>>               unsigned long hva = gfn_to_hva(kvm, gfn);
> > >> >>>>>>>>>               flush_icache_range(hva, hva + PAGE_SIZE);
> > >> >>>>>>>>> +             /* Flush d-cache for systems with external
> > >> >>>>>>>>> caches. */
> > >> >>>>>>>>> +             __flush_dcache_area((void *) hva, PAGE_SIZE);
> > >> >>>>>>>>>       } else if (!icache_is_aivivt()) {       /* non
> > >> >>>>>>>>> ASID-tagged
> > >> >>>>>>>>> VIVT
> > >> >>>>>>>>> */
> > >> >>>>>>>>>               /* any kind of VIPT cache */
> > >> >>>>>>>>>               __flush_icache_all();
> > >> >>>>>>>>
> > >> >>>>>>>>
> > >> >>>>>>>>
> > >> >>>>>>>>
> > >> >>>>>>>> [adding Will to the discussion as we talked about this in the
> > >> >>>>>>>> past]
> > >> >>>>>>>>
> > >> >>>>>>>> That's definitely an issue, but I'm not sure the fix is to
> hit
> > >> >>>>>>>> the
> > >> >>>>>>>> data
> > >> >>>>>>>> cache on each page mapping. It looks overkill.
> > >> >>>>>>>>
> > >> >>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning?
> > >> >>>>>>>> kvmtools
> > >> >>>>>>>> knows which bits of the guest memory have been touched, and
> > >> >>>>>>>> can do a
> > >> >>>>>>>> "DC
> > >> >>>>>>>> DVAC" on this region.
> > >> >>>>>>>
> > >> >>>>>>>
> > >> >>>>>>>
> > >> >>>>>>>
> > >> >>>>>>> It seems a bit unnatural to have cache cleaning is user-space.
> > >> >>>>>>> I am
> > >> >>>>>>> sure
> > >> >>>>>>> other architectures don't have such cache cleaning in
> > >> >>>>>>> user-space for
> > >> >>>>>>> KVM.
> > >> >>>>>>>
> > >> >>>>>>>>
> > >> >>>>>>>> The alternative is do it in the kernel before running any
> vcpu
> > >> >>>>>>>> - but
> > >> >>>>>>>> that's not very nice either (have to clean the whole of the
> > >> >>>>>>>> guest
> > >> >>>>>>>> memory, which makes a full dcache clean more appealing).
> > >> >>>>>>>
> > >> >>>>>>>
> > >> >>>>>>>
> > >> >>>>>>>
> > >> >>>>>>> Actually, cleaning full d-cache by set/way upon first run of
> > >> >>>>>>> VCPU was
> > >> >>>>>>> our second option but current approach seemed very simple
> hence
> > >> >>>>>>> we went for this.
> > >> >>>>>>>
> > >> >>>>>>> If more people vote for full d-cache clean upon first run of
> > >> >>>>>>> VCPU then
> > >> >>>>>>> we should revise this patch.
> > >> >>>>>>
> > >> >>>>>>
> > >> >>>>>>
> > >> >>>>>>
> > >> >>>>>> Can you please give the attached patch a spin on your HW? I've
> > >> >>>>>> boot-tested
> > >> >>>>>> it on a model, but of course I can't really verify that it
> fixes
> > >> >>>>>> your
> > >> >>>>>> issue.
> > >> >>>>>>
> > >> >>>>>> As far as I can see, it should fix it without any additional
> > >> >>>>>> flushing.
> > >> >>>>>>
> > >> >>>>>> Please let me know how it goes.
> > >> >>>>>
> > >> >>>>>
> > >> >>>>>
> > >> >>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are
> > >> >>>>> treated as "Normal Non-shareable, Inner Write-Back
> > >> >>>>> Write-Allocate,
> > >> >>>>> Outer Write-Back Write-Allocate memory"
> > >> >>>>>
> > >> >>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are
> > >> >>>>> treated as "Strongly-ordered device memory"
> > >> >>>>>
> > >> >>>>> Now if Guest/VM access hardware MMIO devices directly (such as
> > >> >>>>> VGIC CPU interface) with MMU off then MMIO devices will be
> > >> >>>>> accessed as normal memory when HCR_EL2.DC=1.
> > >> >>>>
> > >> >>>>
> > >> >>>>
> > >> >>>> I don't think so. Stage-2 still applies, and should force MMIO to
> > >> >>>> be
> > >> >>>> accessed as device memory.
> > >> >>>>
> > >> >>>>
> > >> >>>>> The HCR_EL2.DC=1 makes sense only if we have all software
> > >> >>>>> emulated devices for Guest/VM which is not true for KVM ARM or
> > >> >>>>> KVM ARM64 because we use VGIC.
> > >> >>>>>
> > >> >>>>> IMHO, this patch enforces incorrect memory attribute for
> Guest/VM
> > >> >>>>> when Stage1 MMU is off.
> > >> >>>>
> > >> >>>>
> > >> >>>>
> > >> >>>> See above. My understanding is that HCR.DC controls the default
> > >> >>>> output of
> > >> >>>> Stage-1, and Stage-2 overrides still apply.
> > >> >>>
> > >> >>>
> > >> >>> You had mentioned that PAGE_S2_DEVICE attribute was redundant
> > >> >>> and wanted guest to decide the memory attribute. In other words,
> > >> >>> you
> > >> >>> did not want to enforce any memory attribute in Stage2.
> > >> >>>
> > >> >>> Please refer to this patch
> > >> >>> https://patchwork.kernel.org/patch/2543201/
> > >> >>
> > >> >>
> > >> >> This patch has never been merged. If you carry on following the
> > >> >> discussion,
> > >> >> you will certainly notice it was dropped for a very good reason:
> > >> >>
> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html
> > >> >>
> > >> >> So Stage-2 memory attributes are used, they are not going away, and
> > >> >> they are
> > >> >> essential to the patch I sent this morning.
> > >> >>
> > >> >>
> > >> >>         M.
> > >> >> --
> > >> >> Fast, cheap, reliable. Pick two.
> > >> >
> > >> > HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM
> > >> > is
> > >> > provided a DMA-capable device in pass-through mode. The reason being
> > >> > bootloader/firmware typically don't enable MMU and such
> > >> > bootloader/firmware
> > >> > will programme a pass-through DMA-capable device without any flushes
> > >> > to
> > >> > guest RAM (because it has not enabled MMU).
> > >> >
> > >> > A good example of such a device would be SATA AHCI controller given
> > >> > to a
> > >> > Guest/VM as direct access (using SystemMMU) and Guest
> > >> > bootloader/firmware
> > >> > accessing this SATA AHCI controller to load kernel images from SATA
> > >> > disk.
> > >> > In this situation, we will have to hack Guest bootloader/firmware
> > >> > AHCI driver to
> > >> > explicitly flushes to Guest RAM (because have HCR_EL2.DC=1).
> > >>
> > >> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out
> of
> > >> curiosity: is that a made up example or something you actually have?
> > >>
> > >> Back to square one:
> > >> Can you please benchmark the various cache cleaning options (global at
> > >> startup time, per-page on S2 translation fault, and user-space)?
> > >>
> > > Eh, why is this a more valid argument than the vgic?  The device
> > > passthrough Stage-2 mappings would still have the Stage-2 memory
> > > attributes to configure that memory region as device memory.  Why is it
> > > relevant if the device is DMA-capable in this context?
> > >
> >
> > Most ARM bootloader/firmware run with MMU off hence, they will not do
> > explicit cache flushes when programming DMA-capable device. Now If
> > HCR_EL2.DC=1 then Guest RAM updates done by bootloader/firmware
> > will not be visible to DMA-capable device.
> >
> Read my question again: The bootloaders running with the MMU off in a VM
> will only disable the MMU for Stage-1 translations.  Stage-2
> translations are still enforced using hte Stage-2 page tables and their
> attributes for all mappings to devices will still enforce
> strongly-ordered device type memory.
>

Please read my reply again. Also try to read-up SATA AHCI spec.


>
> Again, what does it matter if the device is DMA-capable or not?
>
> -Christoffer
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm
>
Christoffer Dall Aug. 16, 2013, 5:19 p.m. UTC | #15
On Fri, Aug 16, 2013 at 12:27:55PM +0530, Anup Patel wrote:
> On Fri, Aug 16, 2013 at 10:32 AM, Anup Patel <anup@brainfault.org> wrote:
> > On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall
> > <christoffer.dall@linaro.org> wrote:
> >> On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote:
> >>> On 2013-08-15 16:13, Anup Patel wrote:
> >>> > Hi Marc,
> >>> >
> >>> > On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com>
> >>> > wrote:
> >>> >> On 2013-08-15 14:31, Anup Patel wrote:
> >>> >>>
> >>> >>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier
> >>> >>> <marc.zyngier@arm.com>
> >>> >>> wrote:
> >>> >>>>
> >>> >>>> On 2013-08-15 07:26, Anup Patel wrote:
> >>> >>>>>
> >>> >>>>>
> >>> >>>>> Hi Marc,
> >>> >>>>>
> >>> >>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier
> >>> >>>>> <marc.zyngier@arm.com>
> >>> >>>>> wrote:
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>> Hi Anup,
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>> On 2013-08-14 15:22, Anup Patel wrote:
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier
> >>> >>>>>>> <marc.zyngier@arm.com>
> >>> >>>>>>> wrote:
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>> Hi Pranav,
> >>> >>>>>>>>
> >>> >>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
> >>> >>>>>>>>>
> >>> >>>>>>>>>
> >>> >>>>>>>>>
> >>> >>>>>>>>> Systems with large external L3-cache (few MBs), might have
> >>> >>>>>>>>> dirty
> >>> >>>>>>>>> content belonging to the guest page in L3-cache. To tackle
> >>> >>>>>>>>> this,
> >>> >>>>>>>>> we need to flush such dirty content from d-cache so that
> >>> >>>>>>>>> guest
> >>> >>>>>>>>> will see correct contents of guest page when guest MMU is
> >>> >>>>>>>>> disabled.
> >>> >>>>>>>>>
> >>> >>>>>>>>> The patch fixes coherent_icache_guest_page() for external
> >>> >>>>>>>>> L3-cache.
> >>> >>>>>>>>>
> >>> >>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar
> >>> >>>>>>>>> <pranavkumar@linaro.org>
> >>> >>>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org>
> >>> >>>>>>>>> ---
> >>> >>>>>>>>>  arch/arm64/include/asm/kvm_mmu.h |    2 ++
> >>> >>>>>>>>>  1 file changed, 2 insertions(+)
> >>> >>>>>>>>>
> >>> >>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h
> >>> >>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h
> >>> >>>>>>>>> index efe609c..5129038 100644
> >>> >>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
> >>> >>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
> >>> >>>>>>>>> @@ -123,6 +123,8 @@ static inline void
> >>> >>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
> >>> >>>>>>>>>       if (!icache_is_aliasing()) {            /* PIPT */
> >>> >>>>>>>>>               unsigned long hva = gfn_to_hva(kvm, gfn);
> >>> >>>>>>>>>               flush_icache_range(hva, hva + PAGE_SIZE);
> >>> >>>>>>>>> +             /* Flush d-cache for systems with external
> >>> >>>>>>>>> caches. */
> >>> >>>>>>>>> +             __flush_dcache_area((void *) hva, PAGE_SIZE);
> >>> >>>>>>>>>       } else if (!icache_is_aivivt()) {       /* non
> >>> >>>>>>>>> ASID-tagged
> >>> >>>>>>>>> VIVT
> >>> >>>>>>>>> */
> >>> >>>>>>>>>               /* any kind of VIPT cache */
> >>> >>>>>>>>>               __flush_icache_all();
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>> [adding Will to the discussion as we talked about this in the
> >>> >>>>>>>> past]
> >>> >>>>>>>>
> >>> >>>>>>>> That's definitely an issue, but I'm not sure the fix is to hit
> >>> >>>>>>>> the
> >>> >>>>>>>> data
> >>> >>>>>>>> cache on each page mapping. It looks overkill.
> >>> >>>>>>>>
> >>> >>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning?
> >>> >>>>>>>> kvmtools
> >>> >>>>>>>> knows which bits of the guest memory have been touched, and
> >>> >>>>>>>> can do a
> >>> >>>>>>>> "DC
> >>> >>>>>>>> DVAC" on this region.
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>>> It seems a bit unnatural to have cache cleaning is user-space.
> >>> >>>>>>> I am
> >>> >>>>>>> sure
> >>> >>>>>>> other architectures don't have such cache cleaning in
> >>> >>>>>>> user-space for
> >>> >>>>>>> KVM.
> >>> >>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>> The alternative is do it in the kernel before running any vcpu
> >>> >>>>>>>> - but
> >>> >>>>>>>> that's not very nice either (have to clean the whole of the
> >>> >>>>>>>> guest
> >>> >>>>>>>> memory, which makes a full dcache clean more appealing).
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>>> Actually, cleaning full d-cache by set/way upon first run of
> >>> >>>>>>> VCPU was
> >>> >>>>>>> our second option but current approach seemed very simple hence
> >>> >>>>>>> we went for this.
> >>> >>>>>>>
> >>> >>>>>>> If more people vote for full d-cache clean upon first run of
> >>> >>>>>>> VCPU then
> >>> >>>>>>> we should revise this patch.
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>> Can you please give the attached patch a spin on your HW? I've
> >>> >>>>>> boot-tested
> >>> >>>>>> it on a model, but of course I can't really verify that it fixes
> >>> >>>>>> your
> >>> >>>>>> issue.
> >>> >>>>>>
> >>> >>>>>> As far as I can see, it should fix it without any additional
> >>> >>>>>> flushing.
> >>> >>>>>>
> >>> >>>>>> Please let me know how it goes.
> >>> >>>>>
> >>> >>>>>
> >>> >>>>>
> >>> >>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are
> >>> >>>>> treated as "Normal Non-shareable, Inner Write-Back
> >>> >>>>> Write-Allocate,
> >>> >>>>> Outer Write-Back Write-Allocate memory"
> >>> >>>>>
> >>> >>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are
> >>> >>>>> treated as "Strongly-ordered device memory"
> >>> >>>>>
> >>> >>>>> Now if Guest/VM access hardware MMIO devices directly (such as
> >>> >>>>> VGIC CPU interface) with MMU off then MMIO devices will be
> >>> >>>>> accessed as normal memory when HCR_EL2.DC=1.
> >>> >>>>
> >>> >>>>
> >>> >>>>
> >>> >>>> I don't think so. Stage-2 still applies, and should force MMIO to
> >>> >>>> be
> >>> >>>> accessed as device memory.
> >>> >>>>
> >>> >>>>
> >>> >>>>> The HCR_EL2.DC=1 makes sense only if we have all software
> >>> >>>>> emulated devices for Guest/VM which is not true for KVM ARM or
> >>> >>>>> KVM ARM64 because we use VGIC.
> >>> >>>>>
> >>> >>>>> IMHO, this patch enforces incorrect memory attribute for Guest/VM
> >>> >>>>> when Stage1 MMU is off.
> >>> >>>>
> >>> >>>>
> >>> >>>>
> >>> >>>> See above. My understanding is that HCR.DC controls the default
> >>> >>>> output of
> >>> >>>> Stage-1, and Stage-2 overrides still apply.
> >>> >>>
> >>> >>>
> >>> >>> You had mentioned that PAGE_S2_DEVICE attribute was redundant
> >>> >>> and wanted guest to decide the memory attribute. In other words,
> >>> >>> you
> >>> >>> did not want to enforce any memory attribute in Stage2.
> >>> >>>
> >>> >>> Please refer to this patch
> >>> >>> https://patchwork.kernel.org/patch/2543201/
> >>> >>
> >>> >>
> >>> >> This patch has never been merged. If you carry on following the
> >>> >> discussion,
> >>> >> you will certainly notice it was dropped for a very good reason:
> >>> >> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html
> >>> >>
> >>> >> So Stage-2 memory attributes are used, they are not going away, and
> >>> >> they are
> >>> >> essential to the patch I sent this morning.
> >>> >>
> >>> >>
> >>> >>         M.
> >>> >> --
> >>> >> Fast, cheap, reliable. Pick two.
> >>> >
> >>> > HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM
> >>> > is
> >>> > provided a DMA-capable device in pass-through mode. The reason being
> >>> > bootloader/firmware typically don't enable MMU and such
> >>> > bootloader/firmware
> >>> > will programme a pass-through DMA-capable device without any flushes
> >>> > to
> >>> > guest RAM (because it has not enabled MMU).
> >>> >
> >>> > A good example of such a device would be SATA AHCI controller given
> >>> > to a
> >>> > Guest/VM as direct access (using SystemMMU) and Guest
> >>> > bootloader/firmware
> >>> > accessing this SATA AHCI controller to load kernel images from SATA
> >>> > disk.
> >>> > In this situation, we will have to hack Guest bootloader/firmware
> >>> > AHCI driver to
> >>> > explicitly flushes to Guest RAM (because have HCR_EL2.DC=1).
> >>>
> >>> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of
> >>> curiosity: is that a made up example or something you actually have?
> >>>
> >>> Back to square one:
> >>> Can you please benchmark the various cache cleaning options (global at
> >>> startup time, per-page on S2 translation fault, and user-space)?
> >>>
> >> Eh, why is this a more valid argument than the vgic?  The device
> >> passthrough Stage-2 mappings would still have the Stage-2 memory
> >> attributes to configure that memory region as device memory.  Why is it
> >> relevant if the device is DMA-capable in this context?
> >>
> >> -Christoffer
> >
> > Most ARM bootloader/firmware run with MMU off hence, they will not do
> > explicit cache flushes when programming DMA-capable device. Now If
> > HCR_EL2.DC=1 then Guest RAM updates done by bootloader/firmware
> > will not be visible to DMA-capable device.
> >
> > --Anup
> 
> The approach of flushing d-cache by set/way upon first run of VCPU will
> not work because for set/way operations ARM ARM says: "For set/way
> operations, and for All (entire cache) operations, the point is defined to be
> to the next level of caching". In other words, set/way operations work upto
> point of unification.
> 
> Also, Guest Linux already does __flush_dcache_all() from __cpu_setup()
> at bootup time which does not work for us when L3-cache is enabled so,
> there is no point is adding one more __flush_dcache_all() upon first run of
> VCPU in KVM ARM64.

When did anyone suggest using a cache cleaning method that doesn't apply
to this case.  I'm a little confused about your comment here.

> 
> IMHO, we are left with following options:
> 1. Flush all RAM regions of VCPU using __flush_dcache_range()
>     upon first run of VCPU

We could do that, but we have to ensure that no other memory regions can
be added later.  Is this the case?  I don't remember.

> 2. Implement outer-cache framework for ARM64 and flush all
>     caches + outer cache (i.e. L3-cache) upon first run of VCPU

What's the difference between (2) and (1)?

> 3. Use an alternate version of flush_icache_range() which will
>     flush d-cache by PoC instead of PoU. We can also ensure
>     that coherent_icache_guest_page() function will be called
>     upon Stage2 prefetch aborts only.
> 

I'm sorry, but I'm not following this discussion.  Are we not mixing a
discussion along two different axis here?  As I see it there are two
separate issues:

 (A) When/Where to flush the memory regions
 (B) How to flush the memory regions, including the hardware method and
     the kernel software engineering aspect.

-Christoffer
Christoffer Dall Aug. 16, 2013, 5:28 p.m. UTC | #16
On Fri, Aug 16, 2013 at 10:48:39PM +0530, Anup Patel wrote:
> On 16 August 2013 22:44, Christoffer Dall <christoffer.dall@linaro.org>wrote:
> 
> > On Fri, Aug 16, 2013 at 10:32:30AM +0530, Anup Patel wrote:
> > > On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall
> > > <christoffer.dall@linaro.org> wrote:
> > > > On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote:
> > > >> On 2013-08-15 16:13, Anup Patel wrote:
> > > >> > Hi Marc,
> > > >> >
> > > >> > On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com
> > >
> > > >> > wrote:
> > > >> >> On 2013-08-15 14:31, Anup Patel wrote:
> > > >> >>>
> > > >> >>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier
> > > >> >>> <marc.zyngier@arm.com>
> > > >> >>> wrote:
> > > >> >>>>
> > > >> >>>> On 2013-08-15 07:26, Anup Patel wrote:
> > > >> >>>>>
> > > >> >>>>>
> > > >> >>>>> Hi Marc,
> > > >> >>>>>
> > > >> >>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier
> > > >> >>>>> <marc.zyngier@arm.com>
> > > >> >>>>> wrote:
> > > >> >>>>>>
> > > >> >>>>>>
> > > >> >>>>>> Hi Anup,
> > > >> >>>>>>
> > > >> >>>>>>
> > > >> >>>>>> On 2013-08-14 15:22, Anup Patel wrote:
> > > >> >>>>>>>
> > > >> >>>>>>>
> > > >> >>>>>>>
> > > >> >>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier
> > > >> >>>>>>> <marc.zyngier@arm.com>
> > > >> >>>>>>> wrote:
> > > >> >>>>>>>>
> > > >> >>>>>>>>
> > > >> >>>>>>>>
> > > >> >>>>>>>> Hi Pranav,
> > > >> >>>>>>>>
> > > >> >>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
> > > >> >>>>>>>>>
> > > >> >>>>>>>>>
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> Systems with large external L3-cache (few MBs), might have
> > > >> >>>>>>>>> dirty
> > > >> >>>>>>>>> content belonging to the guest page in L3-cache. To tackle
> > > >> >>>>>>>>> this,
> > > >> >>>>>>>>> we need to flush such dirty content from d-cache so that
> > > >> >>>>>>>>> guest
> > > >> >>>>>>>>> will see correct contents of guest page when guest MMU is
> > > >> >>>>>>>>> disabled.
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> The patch fixes coherent_icache_guest_page() for external
> > > >> >>>>>>>>> L3-cache.
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar
> > > >> >>>>>>>>> <pranavkumar@linaro.org>
> > > >> >>>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org>
> > > >> >>>>>>>>> ---
> > > >> >>>>>>>>>  arch/arm64/include/asm/kvm_mmu.h |    2 ++
> > > >> >>>>>>>>>  1 file changed, 2 insertions(+)
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h
> > > >> >>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h
> > > >> >>>>>>>>> index efe609c..5129038 100644
> > > >> >>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
> > > >> >>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
> > > >> >>>>>>>>> @@ -123,6 +123,8 @@ static inline void
> > > >> >>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
> > > >> >>>>>>>>>       if (!icache_is_aliasing()) {            /* PIPT */
> > > >> >>>>>>>>>               unsigned long hva = gfn_to_hva(kvm, gfn);
> > > >> >>>>>>>>>               flush_icache_range(hva, hva + PAGE_SIZE);
> > > >> >>>>>>>>> +             /* Flush d-cache for systems with external
> > > >> >>>>>>>>> caches. */
> > > >> >>>>>>>>> +             __flush_dcache_area((void *) hva, PAGE_SIZE);
> > > >> >>>>>>>>>       } else if (!icache_is_aivivt()) {       /* non
> > > >> >>>>>>>>> ASID-tagged
> > > >> >>>>>>>>> VIVT
> > > >> >>>>>>>>> */
> > > >> >>>>>>>>>               /* any kind of VIPT cache */
> > > >> >>>>>>>>>               __flush_icache_all();
> > > >> >>>>>>>>
> > > >> >>>>>>>>
> > > >> >>>>>>>>
> > > >> >>>>>>>>
> > > >> >>>>>>>> [adding Will to the discussion as we talked about this in the
> > > >> >>>>>>>> past]
> > > >> >>>>>>>>
> > > >> >>>>>>>> That's definitely an issue, but I'm not sure the fix is to
> > hit
> > > >> >>>>>>>> the
> > > >> >>>>>>>> data
> > > >> >>>>>>>> cache on each page mapping. It looks overkill.
> > > >> >>>>>>>>
> > > >> >>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning?
> > > >> >>>>>>>> kvmtools
> > > >> >>>>>>>> knows which bits of the guest memory have been touched, and
> > > >> >>>>>>>> can do a
> > > >> >>>>>>>> "DC
> > > >> >>>>>>>> DVAC" on this region.
> > > >> >>>>>>>
> > > >> >>>>>>>
> > > >> >>>>>>>
> > > >> >>>>>>>
> > > >> >>>>>>> It seems a bit unnatural to have cache cleaning is user-space.
> > > >> >>>>>>> I am
> > > >> >>>>>>> sure
> > > >> >>>>>>> other architectures don't have such cache cleaning in
> > > >> >>>>>>> user-space for
> > > >> >>>>>>> KVM.
> > > >> >>>>>>>
> > > >> >>>>>>>>
> > > >> >>>>>>>> The alternative is do it in the kernel before running any
> > vcpu
> > > >> >>>>>>>> - but
> > > >> >>>>>>>> that's not very nice either (have to clean the whole of the
> > > >> >>>>>>>> guest
> > > >> >>>>>>>> memory, which makes a full dcache clean more appealing).
> > > >> >>>>>>>
> > > >> >>>>>>>
> > > >> >>>>>>>
> > > >> >>>>>>>
> > > >> >>>>>>> Actually, cleaning full d-cache by set/way upon first run of
> > > >> >>>>>>> VCPU was
> > > >> >>>>>>> our second option but current approach seemed very simple
> > hence
> > > >> >>>>>>> we went for this.
> > > >> >>>>>>>
> > > >> >>>>>>> If more people vote for full d-cache clean upon first run of
> > > >> >>>>>>> VCPU then
> > > >> >>>>>>> we should revise this patch.
> > > >> >>>>>>
> > > >> >>>>>>
> > > >> >>>>>>
> > > >> >>>>>>
> > > >> >>>>>> Can you please give the attached patch a spin on your HW? I've
> > > >> >>>>>> boot-tested
> > > >> >>>>>> it on a model, but of course I can't really verify that it
> > fixes
> > > >> >>>>>> your
> > > >> >>>>>> issue.
> > > >> >>>>>>
> > > >> >>>>>> As far as I can see, it should fix it without any additional
> > > >> >>>>>> flushing.
> > > >> >>>>>>
> > > >> >>>>>> Please let me know how it goes.
> > > >> >>>>>
> > > >> >>>>>
> > > >> >>>>>
> > > >> >>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are
> > > >> >>>>> treated as "Normal Non-shareable, Inner Write-Back
> > > >> >>>>> Write-Allocate,
> > > >> >>>>> Outer Write-Back Write-Allocate memory"
> > > >> >>>>>
> > > >> >>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are
> > > >> >>>>> treated as "Strongly-ordered device memory"
> > > >> >>>>>
> > > >> >>>>> Now if Guest/VM access hardware MMIO devices directly (such as
> > > >> >>>>> VGIC CPU interface) with MMU off then MMIO devices will be
> > > >> >>>>> accessed as normal memory when HCR_EL2.DC=1.
> > > >> >>>>
> > > >> >>>>
> > > >> >>>>
> > > >> >>>> I don't think so. Stage-2 still applies, and should force MMIO to
> > > >> >>>> be
> > > >> >>>> accessed as device memory.
> > > >> >>>>
> > > >> >>>>
> > > >> >>>>> The HCR_EL2.DC=1 makes sense only if we have all software
> > > >> >>>>> emulated devices for Guest/VM which is not true for KVM ARM or
> > > >> >>>>> KVM ARM64 because we use VGIC.
> > > >> >>>>>
> > > >> >>>>> IMHO, this patch enforces incorrect memory attribute for
> > Guest/VM
> > > >> >>>>> when Stage1 MMU is off.
> > > >> >>>>
> > > >> >>>>
> > > >> >>>>
> > > >> >>>> See above. My understanding is that HCR.DC controls the default
> > > >> >>>> output of
> > > >> >>>> Stage-1, and Stage-2 overrides still apply.
> > > >> >>>
> > > >> >>>
> > > >> >>> You had mentioned that PAGE_S2_DEVICE attribute was redundant
> > > >> >>> and wanted guest to decide the memory attribute. In other words,
> > > >> >>> you
> > > >> >>> did not want to enforce any memory attribute in Stage2.
> > > >> >>>
> > > >> >>> Please refer to this patch
> > > >> >>> https://patchwork.kernel.org/patch/2543201/
> > > >> >>
> > > >> >>
> > > >> >> This patch has never been merged. If you carry on following the
> > > >> >> discussion,
> > > >> >> you will certainly notice it was dropped for a very good reason:
> > > >> >>
> > https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html
> > > >> >>
> > > >> >> So Stage-2 memory attributes are used, they are not going away, and
> > > >> >> they are
> > > >> >> essential to the patch I sent this morning.
> > > >> >>
> > > >> >>
> > > >> >>         M.
> > > >> >> --
> > > >> >> Fast, cheap, reliable. Pick two.
> > > >> >
> > > >> > HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM
> > > >> > is
> > > >> > provided a DMA-capable device in pass-through mode. The reason being
> > > >> > bootloader/firmware typically don't enable MMU and such
> > > >> > bootloader/firmware
> > > >> > will programme a pass-through DMA-capable device without any flushes
> > > >> > to
> > > >> > guest RAM (because it has not enabled MMU).
> > > >> >
> > > >> > A good example of such a device would be SATA AHCI controller given
> > > >> > to a
> > > >> > Guest/VM as direct access (using SystemMMU) and Guest
> > > >> > bootloader/firmware
> > > >> > accessing this SATA AHCI controller to load kernel images from SATA
> > > >> > disk.
> > > >> > In this situation, we will have to hack Guest bootloader/firmware
> > > >> > AHCI driver to
> > > >> > explicitly flushes to Guest RAM (because have HCR_EL2.DC=1).
> > > >>
> > > >> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out
> > of
> > > >> curiosity: is that a made up example or something you actually have?
> > > >>
> > > >> Back to square one:
> > > >> Can you please benchmark the various cache cleaning options (global at
> > > >> startup time, per-page on S2 translation fault, and user-space)?
> > > >>
> > > > Eh, why is this a more valid argument than the vgic?  The device
> > > > passthrough Stage-2 mappings would still have the Stage-2 memory
> > > > attributes to configure that memory region as device memory.  Why is it
> > > > relevant if the device is DMA-capable in this context?
> > > >
> > >
> > > Most ARM bootloader/firmware run with MMU off hence, they will not do
> > > explicit cache flushes when programming DMA-capable device. Now If
> > > HCR_EL2.DC=1 then Guest RAM updates done by bootloader/firmware
> > > will not be visible to DMA-capable device.
> > >
> > Read my question again: The bootloaders running with the MMU off in a VM
> > will only disable the MMU for Stage-1 translations.  Stage-2
> > translations are still enforced using hte Stage-2 page tables and their
> > attributes for all mappings to devices will still enforce
> > strongly-ordered device type memory.
> >
> 
> Please read my reply again. Also try to read-up SATA AHCI spec.
> 
>

Can you please explain how the specs realate to my question?

The only case I can see is if the guest puts data in a DMA region that
it expects the device to read.  Is this the case?

Just so we're clear, this is quite different from programming the device
through the device memory.

-Christoffer
Christoffer Dall Aug. 16, 2013, 5:42 p.m. UTC | #17
On Fri, Aug 16, 2013 at 10:28:06AM -0700, Christoffer Dall wrote:
> On Fri, Aug 16, 2013 at 10:48:39PM +0530, Anup Patel wrote:
> > On 16 August 2013 22:44, Christoffer Dall <christoffer.dall@linaro.org>wrote:
> > 
> > > On Fri, Aug 16, 2013 at 10:32:30AM +0530, Anup Patel wrote:
> > > > On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall
> > > > <christoffer.dall@linaro.org> wrote:
> > > > > On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote:
> > > > >> On 2013-08-15 16:13, Anup Patel wrote:
> > > > >> > Hi Marc,
> > > > >> >
> > > > >> > On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com
> > > >
> > > > >> > wrote:
> > > > >> >> On 2013-08-15 14:31, Anup Patel wrote:
> > > > >> >>>
> > > > >> >>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier
> > > > >> >>> <marc.zyngier@arm.com>
> > > > >> >>> wrote:
> > > > >> >>>>
> > > > >> >>>> On 2013-08-15 07:26, Anup Patel wrote:
> > > > >> >>>>>
> > > > >> >>>>>
> > > > >> >>>>> Hi Marc,
> > > > >> >>>>>
> > > > >> >>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier
> > > > >> >>>>> <marc.zyngier@arm.com>
> > > > >> >>>>> wrote:
> > > > >> >>>>>>
> > > > >> >>>>>>
> > > > >> >>>>>> Hi Anup,
> > > > >> >>>>>>
> > > > >> >>>>>>
> > > > >> >>>>>> On 2013-08-14 15:22, Anup Patel wrote:
> > > > >> >>>>>>>
> > > > >> >>>>>>>
> > > > >> >>>>>>>
> > > > >> >>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier
> > > > >> >>>>>>> <marc.zyngier@arm.com>
> > > > >> >>>>>>> wrote:
> > > > >> >>>>>>>>
> > > > >> >>>>>>>>
> > > > >> >>>>>>>>
> > > > >> >>>>>>>> Hi Pranav,
> > > > >> >>>>>>>>
> > > > >> >>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>> Systems with large external L3-cache (few MBs), might have
> > > > >> >>>>>>>>> dirty
> > > > >> >>>>>>>>> content belonging to the guest page in L3-cache. To tackle
> > > > >> >>>>>>>>> this,
> > > > >> >>>>>>>>> we need to flush such dirty content from d-cache so that
> > > > >> >>>>>>>>> guest
> > > > >> >>>>>>>>> will see correct contents of guest page when guest MMU is
> > > > >> >>>>>>>>> disabled.
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>> The patch fixes coherent_icache_guest_page() for external
> > > > >> >>>>>>>>> L3-cache.
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar
> > > > >> >>>>>>>>> <pranavkumar@linaro.org>
> > > > >> >>>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org>
> > > > >> >>>>>>>>> ---
> > > > >> >>>>>>>>>  arch/arm64/include/asm/kvm_mmu.h |    2 ++
> > > > >> >>>>>>>>>  1 file changed, 2 insertions(+)
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h
> > > > >> >>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h
> > > > >> >>>>>>>>> index efe609c..5129038 100644
> > > > >> >>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
> > > > >> >>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
> > > > >> >>>>>>>>> @@ -123,6 +123,8 @@ static inline void
> > > > >> >>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
> > > > >> >>>>>>>>>       if (!icache_is_aliasing()) {            /* PIPT */
> > > > >> >>>>>>>>>               unsigned long hva = gfn_to_hva(kvm, gfn);
> > > > >> >>>>>>>>>               flush_icache_range(hva, hva + PAGE_SIZE);
> > > > >> >>>>>>>>> +             /* Flush d-cache for systems with external
> > > > >> >>>>>>>>> caches. */
> > > > >> >>>>>>>>> +             __flush_dcache_area((void *) hva, PAGE_SIZE);
> > > > >> >>>>>>>>>       } else if (!icache_is_aivivt()) {       /* non
> > > > >> >>>>>>>>> ASID-tagged
> > > > >> >>>>>>>>> VIVT
> > > > >> >>>>>>>>> */
> > > > >> >>>>>>>>>               /* any kind of VIPT cache */
> > > > >> >>>>>>>>>               __flush_icache_all();
> > > > >> >>>>>>>>
> > > > >> >>>>>>>>
> > > > >> >>>>>>>>
> > > > >> >>>>>>>>
> > > > >> >>>>>>>> [adding Will to the discussion as we talked about this in the
> > > > >> >>>>>>>> past]
> > > > >> >>>>>>>>
> > > > >> >>>>>>>> That's definitely an issue, but I'm not sure the fix is to
> > > hit
> > > > >> >>>>>>>> the
> > > > >> >>>>>>>> data
> > > > >> >>>>>>>> cache on each page mapping. It looks overkill.
> > > > >> >>>>>>>>
> > > > >> >>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning?
> > > > >> >>>>>>>> kvmtools
> > > > >> >>>>>>>> knows which bits of the guest memory have been touched, and
> > > > >> >>>>>>>> can do a
> > > > >> >>>>>>>> "DC
> > > > >> >>>>>>>> DVAC" on this region.
> > > > >> >>>>>>>
> > > > >> >>>>>>>
> > > > >> >>>>>>>
> > > > >> >>>>>>>
> > > > >> >>>>>>> It seems a bit unnatural to have cache cleaning is user-space.
> > > > >> >>>>>>> I am
> > > > >> >>>>>>> sure
> > > > >> >>>>>>> other architectures don't have such cache cleaning in
> > > > >> >>>>>>> user-space for
> > > > >> >>>>>>> KVM.
> > > > >> >>>>>>>
> > > > >> >>>>>>>>
> > > > >> >>>>>>>> The alternative is do it in the kernel before running any
> > > vcpu
> > > > >> >>>>>>>> - but
> > > > >> >>>>>>>> that's not very nice either (have to clean the whole of the
> > > > >> >>>>>>>> guest
> > > > >> >>>>>>>> memory, which makes a full dcache clean more appealing).
> > > > >> >>>>>>>
> > > > >> >>>>>>>
> > > > >> >>>>>>>
> > > > >> >>>>>>>
> > > > >> >>>>>>> Actually, cleaning full d-cache by set/way upon first run of
> > > > >> >>>>>>> VCPU was
> > > > >> >>>>>>> our second option but current approach seemed very simple
> > > hence
> > > > >> >>>>>>> we went for this.
> > > > >> >>>>>>>
> > > > >> >>>>>>> If more people vote for full d-cache clean upon first run of
> > > > >> >>>>>>> VCPU then
> > > > >> >>>>>>> we should revise this patch.
> > > > >> >>>>>>
> > > > >> >>>>>>
> > > > >> >>>>>>
> > > > >> >>>>>>
> > > > >> >>>>>> Can you please give the attached patch a spin on your HW? I've
> > > > >> >>>>>> boot-tested
> > > > >> >>>>>> it on a model, but of course I can't really verify that it
> > > fixes
> > > > >> >>>>>> your
> > > > >> >>>>>> issue.
> > > > >> >>>>>>
> > > > >> >>>>>> As far as I can see, it should fix it without any additional
> > > > >> >>>>>> flushing.
> > > > >> >>>>>>
> > > > >> >>>>>> Please let me know how it goes.
> > > > >> >>>>>
> > > > >> >>>>>
> > > > >> >>>>>
> > > > >> >>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are
> > > > >> >>>>> treated as "Normal Non-shareable, Inner Write-Back
> > > > >> >>>>> Write-Allocate,
> > > > >> >>>>> Outer Write-Back Write-Allocate memory"
> > > > >> >>>>>
> > > > >> >>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are
> > > > >> >>>>> treated as "Strongly-ordered device memory"
> > > > >> >>>>>
> > > > >> >>>>> Now if Guest/VM access hardware MMIO devices directly (such as
> > > > >> >>>>> VGIC CPU interface) with MMU off then MMIO devices will be
> > > > >> >>>>> accessed as normal memory when HCR_EL2.DC=1.
> > > > >> >>>>
> > > > >> >>>>
> > > > >> >>>>
> > > > >> >>>> I don't think so. Stage-2 still applies, and should force MMIO to
> > > > >> >>>> be
> > > > >> >>>> accessed as device memory.
> > > > >> >>>>
> > > > >> >>>>
> > > > >> >>>>> The HCR_EL2.DC=1 makes sense only if we have all software
> > > > >> >>>>> emulated devices for Guest/VM which is not true for KVM ARM or
> > > > >> >>>>> KVM ARM64 because we use VGIC.
> > > > >> >>>>>
> > > > >> >>>>> IMHO, this patch enforces incorrect memory attribute for
> > > Guest/VM
> > > > >> >>>>> when Stage1 MMU is off.
> > > > >> >>>>
> > > > >> >>>>
> > > > >> >>>>
> > > > >> >>>> See above. My understanding is that HCR.DC controls the default
> > > > >> >>>> output of
> > > > >> >>>> Stage-1, and Stage-2 overrides still apply.
> > > > >> >>>
> > > > >> >>>
> > > > >> >>> You had mentioned that PAGE_S2_DEVICE attribute was redundant
> > > > >> >>> and wanted guest to decide the memory attribute. In other words,
> > > > >> >>> you
> > > > >> >>> did not want to enforce any memory attribute in Stage2.
> > > > >> >>>
> > > > >> >>> Please refer to this patch
> > > > >> >>> https://patchwork.kernel.org/patch/2543201/
> > > > >> >>
> > > > >> >>
> > > > >> >> This patch has never been merged. If you carry on following the
> > > > >> >> discussion,
> > > > >> >> you will certainly notice it was dropped for a very good reason:
> > > > >> >>
> > > https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html
> > > > >> >>
> > > > >> >> So Stage-2 memory attributes are used, they are not going away, and
> > > > >> >> they are
> > > > >> >> essential to the patch I sent this morning.
> > > > >> >>
> > > > >> >>
> > > > >> >>         M.
> > > > >> >> --
> > > > >> >> Fast, cheap, reliable. Pick two.
> > > > >> >
> > > > >> > HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM
> > > > >> > is
> > > > >> > provided a DMA-capable device in pass-through mode. The reason being
> > > > >> > bootloader/firmware typically don't enable MMU and such
> > > > >> > bootloader/firmware
> > > > >> > will programme a pass-through DMA-capable device without any flushes
> > > > >> > to
> > > > >> > guest RAM (because it has not enabled MMU).
> > > > >> >
> > > > >> > A good example of such a device would be SATA AHCI controller given
> > > > >> > to a
> > > > >> > Guest/VM as direct access (using SystemMMU) and Guest
> > > > >> > bootloader/firmware
> > > > >> > accessing this SATA AHCI controller to load kernel images from SATA
> > > > >> > disk.
> > > > >> > In this situation, we will have to hack Guest bootloader/firmware
> > > > >> > AHCI driver to
> > > > >> > explicitly flushes to Guest RAM (because have HCR_EL2.DC=1).
> > > > >>
> > > > >> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out
> > > of
> > > > >> curiosity: is that a made up example or something you actually have?
> > > > >>
> > > > >> Back to square one:
> > > > >> Can you please benchmark the various cache cleaning options (global at
> > > > >> startup time, per-page on S2 translation fault, and user-space)?
> > > > >>
> > > > > Eh, why is this a more valid argument than the vgic?  The device
> > > > > passthrough Stage-2 mappings would still have the Stage-2 memory
> > > > > attributes to configure that memory region as device memory.  Why is it
> > > > > relevant if the device is DMA-capable in this context?
> > > > >
> > > >
> > > > Most ARM bootloader/firmware run with MMU off hence, they will not do
> > > > explicit cache flushes when programming DMA-capable device. Now If
> > > > HCR_EL2.DC=1 then Guest RAM updates done by bootloader/firmware
> > > > will not be visible to DMA-capable device.
> > > >
> > > Read my question again: The bootloaders running with the MMU off in a VM
> > > will only disable the MMU for Stage-1 translations.  Stage-2
> > > translations are still enforced using hte Stage-2 page tables and their
> > > attributes for all mappings to devices will still enforce
> > > strongly-ordered device type memory.
> > >
> > 
> > Please read my reply again. Also try to read-up SATA AHCI spec.
> > 
> >
> 
> Can you please explain how the specs realate to my question?
> 
> The only case I can see is if the guest puts data in a DMA region that
> it expects the device to read.  Is this the case?
> 
> Just so we're clear, this is quite different from programming the device
> through the device memory.
> 

ok, I see from the specs.  It's not just DMA-capable, but actually
DMA-programmable.  In that case setting DC=1 would be problematic.

Thanks for being so helpful in explaining this ;)

-Christoffer
Anup Patel Aug. 16, 2013, 5:42 p.m. UTC | #18
On Fri, Aug 16, 2013 at 10:49 PM, Christoffer Dall
<christoffer.dall@linaro.org> wrote:
> On Fri, Aug 16, 2013 at 12:27:55PM +0530, Anup Patel wrote:
>> On Fri, Aug 16, 2013 at 10:32 AM, Anup Patel <anup@brainfault.org> wrote:
>> > On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall
>> > <christoffer.dall@linaro.org> wrote:
>> >> On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote:
>> >>> On 2013-08-15 16:13, Anup Patel wrote:
>> >>> > Hi Marc,
>> >>> >
>> >>> > On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com>
>> >>> > wrote:
>> >>> >> On 2013-08-15 14:31, Anup Patel wrote:
>> >>> >>>
>> >>> >>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier
>> >>> >>> <marc.zyngier@arm.com>
>> >>> >>> wrote:
>> >>> >>>>
>> >>> >>>> On 2013-08-15 07:26, Anup Patel wrote:
>> >>> >>>>>
>> >>> >>>>>
>> >>> >>>>> Hi Marc,
>> >>> >>>>>
>> >>> >>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier
>> >>> >>>>> <marc.zyngier@arm.com>
>> >>> >>>>> wrote:
>> >>> >>>>>>
>> >>> >>>>>>
>> >>> >>>>>> Hi Anup,
>> >>> >>>>>>
>> >>> >>>>>>
>> >>> >>>>>> On 2013-08-14 15:22, Anup Patel wrote:
>> >>> >>>>>>>
>> >>> >>>>>>>
>> >>> >>>>>>>
>> >>> >>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier
>> >>> >>>>>>> <marc.zyngier@arm.com>
>> >>> >>>>>>> wrote:
>> >>> >>>>>>>>
>> >>> >>>>>>>>
>> >>> >>>>>>>>
>> >>> >>>>>>>> Hi Pranav,
>> >>> >>>>>>>>
>> >>> >>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
>> >>> >>>>>>>>>
>> >>> >>>>>>>>>
>> >>> >>>>>>>>>
>> >>> >>>>>>>>> Systems with large external L3-cache (few MBs), might have
>> >>> >>>>>>>>> dirty
>> >>> >>>>>>>>> content belonging to the guest page in L3-cache. To tackle
>> >>> >>>>>>>>> this,
>> >>> >>>>>>>>> we need to flush such dirty content from d-cache so that
>> >>> >>>>>>>>> guest
>> >>> >>>>>>>>> will see correct contents of guest page when guest MMU is
>> >>> >>>>>>>>> disabled.
>> >>> >>>>>>>>>
>> >>> >>>>>>>>> The patch fixes coherent_icache_guest_page() for external
>> >>> >>>>>>>>> L3-cache.
>> >>> >>>>>>>>>
>> >>> >>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar
>> >>> >>>>>>>>> <pranavkumar@linaro.org>
>> >>> >>>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org>
>> >>> >>>>>>>>> ---
>> >>> >>>>>>>>>  arch/arm64/include/asm/kvm_mmu.h |    2 ++
>> >>> >>>>>>>>>  1 file changed, 2 insertions(+)
>> >>> >>>>>>>>>
>> >>> >>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h
>> >>> >>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h
>> >>> >>>>>>>>> index efe609c..5129038 100644
>> >>> >>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
>> >>> >>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>> >>> >>>>>>>>> @@ -123,6 +123,8 @@ static inline void
>> >>> >>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
>> >>> >>>>>>>>>       if (!icache_is_aliasing()) {            /* PIPT */
>> >>> >>>>>>>>>               unsigned long hva = gfn_to_hva(kvm, gfn);
>> >>> >>>>>>>>>               flush_icache_range(hva, hva + PAGE_SIZE);
>> >>> >>>>>>>>> +             /* Flush d-cache for systems with external
>> >>> >>>>>>>>> caches. */
>> >>> >>>>>>>>> +             __flush_dcache_area((void *) hva, PAGE_SIZE);
>> >>> >>>>>>>>>       } else if (!icache_is_aivivt()) {       /* non
>> >>> >>>>>>>>> ASID-tagged
>> >>> >>>>>>>>> VIVT
>> >>> >>>>>>>>> */
>> >>> >>>>>>>>>               /* any kind of VIPT cache */
>> >>> >>>>>>>>>               __flush_icache_all();
>> >>> >>>>>>>>
>> >>> >>>>>>>>
>> >>> >>>>>>>>
>> >>> >>>>>>>>
>> >>> >>>>>>>> [adding Will to the discussion as we talked about this in the
>> >>> >>>>>>>> past]
>> >>> >>>>>>>>
>> >>> >>>>>>>> That's definitely an issue, but I'm not sure the fix is to hit
>> >>> >>>>>>>> the
>> >>> >>>>>>>> data
>> >>> >>>>>>>> cache on each page mapping. It looks overkill.
>> >>> >>>>>>>>
>> >>> >>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning?
>> >>> >>>>>>>> kvmtools
>> >>> >>>>>>>> knows which bits of the guest memory have been touched, and
>> >>> >>>>>>>> can do a
>> >>> >>>>>>>> "DC
>> >>> >>>>>>>> DVAC" on this region.
>> >>> >>>>>>>
>> >>> >>>>>>>
>> >>> >>>>>>>
>> >>> >>>>>>>
>> >>> >>>>>>> It seems a bit unnatural to have cache cleaning is user-space.
>> >>> >>>>>>> I am
>> >>> >>>>>>> sure
>> >>> >>>>>>> other architectures don't have such cache cleaning in
>> >>> >>>>>>> user-space for
>> >>> >>>>>>> KVM.
>> >>> >>>>>>>
>> >>> >>>>>>>>
>> >>> >>>>>>>> The alternative is do it in the kernel before running any vcpu
>> >>> >>>>>>>> - but
>> >>> >>>>>>>> that's not very nice either (have to clean the whole of the
>> >>> >>>>>>>> guest
>> >>> >>>>>>>> memory, which makes a full dcache clean more appealing).
>> >>> >>>>>>>
>> >>> >>>>>>>
>> >>> >>>>>>>
>> >>> >>>>>>>
>> >>> >>>>>>> Actually, cleaning full d-cache by set/way upon first run of
>> >>> >>>>>>> VCPU was
>> >>> >>>>>>> our second option but current approach seemed very simple hence
>> >>> >>>>>>> we went for this.
>> >>> >>>>>>>
>> >>> >>>>>>> If more people vote for full d-cache clean upon first run of
>> >>> >>>>>>> VCPU then
>> >>> >>>>>>> we should revise this patch.
>> >>> >>>>>>
>> >>> >>>>>>
>> >>> >>>>>>
>> >>> >>>>>>
>> >>> >>>>>> Can you please give the attached patch a spin on your HW? I've
>> >>> >>>>>> boot-tested
>> >>> >>>>>> it on a model, but of course I can't really verify that it fixes
>> >>> >>>>>> your
>> >>> >>>>>> issue.
>> >>> >>>>>>
>> >>> >>>>>> As far as I can see, it should fix it without any additional
>> >>> >>>>>> flushing.
>> >>> >>>>>>
>> >>> >>>>>> Please let me know how it goes.
>> >>> >>>>>
>> >>> >>>>>
>> >>> >>>>>
>> >>> >>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are
>> >>> >>>>> treated as "Normal Non-shareable, Inner Write-Back
>> >>> >>>>> Write-Allocate,
>> >>> >>>>> Outer Write-Back Write-Allocate memory"
>> >>> >>>>>
>> >>> >>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are
>> >>> >>>>> treated as "Strongly-ordered device memory"
>> >>> >>>>>
>> >>> >>>>> Now if Guest/VM access hardware MMIO devices directly (such as
>> >>> >>>>> VGIC CPU interface) with MMU off then MMIO devices will be
>> >>> >>>>> accessed as normal memory when HCR_EL2.DC=1.
>> >>> >>>>
>> >>> >>>>
>> >>> >>>>
>> >>> >>>> I don't think so. Stage-2 still applies, and should force MMIO to
>> >>> >>>> be
>> >>> >>>> accessed as device memory.
>> >>> >>>>
>> >>> >>>>
>> >>> >>>>> The HCR_EL2.DC=1 makes sense only if we have all software
>> >>> >>>>> emulated devices for Guest/VM which is not true for KVM ARM or
>> >>> >>>>> KVM ARM64 because we use VGIC.
>> >>> >>>>>
>> >>> >>>>> IMHO, this patch enforces incorrect memory attribute for Guest/VM
>> >>> >>>>> when Stage1 MMU is off.
>> >>> >>>>
>> >>> >>>>
>> >>> >>>>
>> >>> >>>> See above. My understanding is that HCR.DC controls the default
>> >>> >>>> output of
>> >>> >>>> Stage-1, and Stage-2 overrides still apply.
>> >>> >>>
>> >>> >>>
>> >>> >>> You had mentioned that PAGE_S2_DEVICE attribute was redundant
>> >>> >>> and wanted guest to decide the memory attribute. In other words,
>> >>> >>> you
>> >>> >>> did not want to enforce any memory attribute in Stage2.
>> >>> >>>
>> >>> >>> Please refer to this patch
>> >>> >>> https://patchwork.kernel.org/patch/2543201/
>> >>> >>
>> >>> >>
>> >>> >> This patch has never been merged. If you carry on following the
>> >>> >> discussion,
>> >>> >> you will certainly notice it was dropped for a very good reason:
>> >>> >> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html
>> >>> >>
>> >>> >> So Stage-2 memory attributes are used, they are not going away, and
>> >>> >> they are
>> >>> >> essential to the patch I sent this morning.
>> >>> >>
>> >>> >>
>> >>> >>         M.
>> >>> >> --
>> >>> >> Fast, cheap, reliable. Pick two.
>> >>> >
>> >>> > HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM
>> >>> > is
>> >>> > provided a DMA-capable device in pass-through mode. The reason being
>> >>> > bootloader/firmware typically don't enable MMU and such
>> >>> > bootloader/firmware
>> >>> > will programme a pass-through DMA-capable device without any flushes
>> >>> > to
>> >>> > guest RAM (because it has not enabled MMU).
>> >>> >
>> >>> > A good example of such a device would be SATA AHCI controller given
>> >>> > to a
>> >>> > Guest/VM as direct access (using SystemMMU) and Guest
>> >>> > bootloader/firmware
>> >>> > accessing this SATA AHCI controller to load kernel images from SATA
>> >>> > disk.
>> >>> > In this situation, we will have to hack Guest bootloader/firmware
>> >>> > AHCI driver to
>> >>> > explicitly flushes to Guest RAM (because have HCR_EL2.DC=1).
>> >>>
>> >>> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of
>> >>> curiosity: is that a made up example or something you actually have?
>> >>>
>> >>> Back to square one:
>> >>> Can you please benchmark the various cache cleaning options (global at
>> >>> startup time, per-page on S2 translation fault, and user-space)?
>> >>>
>> >> Eh, why is this a more valid argument than the vgic?  The device
>> >> passthrough Stage-2 mappings would still have the Stage-2 memory
>> >> attributes to configure that memory region as device memory.  Why is it
>> >> relevant if the device is DMA-capable in this context?
>> >>
>> >> -Christoffer
>> >
>> > Most ARM bootloader/firmware run with MMU off hence, they will not do
>> > explicit cache flushes when programming DMA-capable device. Now If
>> > HCR_EL2.DC=1 then Guest RAM updates done by bootloader/firmware
>> > will not be visible to DMA-capable device.
>> >
>> > --Anup
>>
>> The approach of flushing d-cache by set/way upon first run of VCPU will
>> not work because for set/way operations ARM ARM says: "For set/way
>> operations, and for All (entire cache) operations, the point is defined to be
>> to the next level of caching". In other words, set/way operations work upto
>> point of unification.
>>
>> Also, Guest Linux already does __flush_dcache_all() from __cpu_setup()
>> at bootup time which does not work for us when L3-cache is enabled so,
>> there is no point is adding one more __flush_dcache_all() upon first run of
>> VCPU in KVM ARM64.
>
> When did anyone suggest using a cache cleaning method that doesn't apply
> to this case.  I'm a little confused about your comment here.

Please refer last reply from Marc Z where he says:
"Can you please benchmark the various cache cleaning options (global at
startup time, per-page on S2 translation fault, and user-space)?".

Rather doing __flush_dcache_range() on each page in
coherent_icache_guest_page() we could also flush entire d-cache upon
first VCPU run. This only issue in flushing d-cache upon first VCPU run
is that we cannot flush d-cache by set/way because as per ARM ARM
all operations by set/way are upto PoU and not PoC. In presence of
L3-Cache we need to flush d-cache upto PoC so we have to use
__flush_dache_range()

>
>>
>> IMHO, we are left with following options:
>> 1. Flush all RAM regions of VCPU using __flush_dcache_range()
>>     upon first run of VCPU
>
> We could do that, but we have to ensure that no other memory regions can
> be added later.  Is this the case?  I don't remember.

Yes, memory regions being added later be problem for this option.

>
>> 2. Implement outer-cache framework for ARM64 and flush all
>>     caches + outer cache (i.e. L3-cache) upon first run of VCPU
>
> What's the difference between (2) and (1)?

Linux ARM (i.e. 32-bit port) has a framework for having outer
caches (i.e. caches that are not part of the CPU but exist as
separate entity between CPU and Memory for performance
improvement). Using this framework we can flush all CPU caches
and outer cache.

>
>> 3. Use an alternate version of flush_icache_range() which will
>>     flush d-cache by PoC instead of PoU. We can also ensure
>>     that coherent_icache_guest_page() function will be called
>>     upon Stage2 prefetch aborts only.
>>
>
> I'm sorry, but I'm not following this discussion.  Are we not mixing a
> discussion along two different axis here?  As I see it there are two
> separate issues:
>
>  (A) When/Where to flush the memory regions
>  (B) How to flush the memory regions, including the hardware method and
>      the kernel software engineering aspect.

Discussion here is about getting KVM ARM64 working in-presence
of an external L3-cache (i.e. not part of CPU). Before starting a VCPU
user-space typically loads images to guest RAM so, in-presence of
huge L3-cache (few MBs). When the VCPU starts running some of the
contents guest RAM will be still in L3-cache and VCPU runs with
MMU off (i.e. cacheing off) hence VCPU will bypass L3-cache and
see incorrect contents. To solve this problem we need to flush the
guest RAM contents before they are accessed by first time by VCPU.

>
> -Christoffer

--Anup
Christoffer Dall Aug. 16, 2013, 5:50 p.m. UTC | #19
On Fri, Aug 16, 2013 at 11:12:08PM +0530, Anup Patel wrote:
> On Fri, Aug 16, 2013 at 10:49 PM, Christoffer Dall
> <christoffer.dall@linaro.org> wrote:
> > On Fri, Aug 16, 2013 at 12:27:55PM +0530, Anup Patel wrote:
> >> On Fri, Aug 16, 2013 at 10:32 AM, Anup Patel <anup@brainfault.org> wrote:
> >> > On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall
> >> > <christoffer.dall@linaro.org> wrote:
> >> >> On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote:
> >> >>> On 2013-08-15 16:13, Anup Patel wrote:
> >> >>> > Hi Marc,
> >> >>> >
> >> >>> > On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com>
> >> >>> > wrote:
> >> >>> >> On 2013-08-15 14:31, Anup Patel wrote:
> >> >>> >>>
> >> >>> >>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier
> >> >>> >>> <marc.zyngier@arm.com>
> >> >>> >>> wrote:
> >> >>> >>>>
> >> >>> >>>> On 2013-08-15 07:26, Anup Patel wrote:
> >> >>> >>>>>
> >> >>> >>>>>
> >> >>> >>>>> Hi Marc,
> >> >>> >>>>>
> >> >>> >>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier
> >> >>> >>>>> <marc.zyngier@arm.com>
> >> >>> >>>>> wrote:
> >> >>> >>>>>>
> >> >>> >>>>>>
> >> >>> >>>>>> Hi Anup,
> >> >>> >>>>>>
> >> >>> >>>>>>
> >> >>> >>>>>> On 2013-08-14 15:22, Anup Patel wrote:
> >> >>> >>>>>>>
> >> >>> >>>>>>>
> >> >>> >>>>>>>
> >> >>> >>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier
> >> >>> >>>>>>> <marc.zyngier@arm.com>
> >> >>> >>>>>>> wrote:
> >> >>> >>>>>>>>
> >> >>> >>>>>>>>
> >> >>> >>>>>>>>
> >> >>> >>>>>>>> Hi Pranav,
> >> >>> >>>>>>>>
> >> >>> >>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
> >> >>> >>>>>>>>>
> >> >>> >>>>>>>>>
> >> >>> >>>>>>>>>
> >> >>> >>>>>>>>> Systems with large external L3-cache (few MBs), might have
> >> >>> >>>>>>>>> dirty
> >> >>> >>>>>>>>> content belonging to the guest page in L3-cache. To tackle
> >> >>> >>>>>>>>> this,
> >> >>> >>>>>>>>> we need to flush such dirty content from d-cache so that
> >> >>> >>>>>>>>> guest
> >> >>> >>>>>>>>> will see correct contents of guest page when guest MMU is
> >> >>> >>>>>>>>> disabled.
> >> >>> >>>>>>>>>
> >> >>> >>>>>>>>> The patch fixes coherent_icache_guest_page() for external
> >> >>> >>>>>>>>> L3-cache.
> >> >>> >>>>>>>>>
> >> >>> >>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar
> >> >>> >>>>>>>>> <pranavkumar@linaro.org>
> >> >>> >>>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org>
> >> >>> >>>>>>>>> ---
> >> >>> >>>>>>>>>  arch/arm64/include/asm/kvm_mmu.h |    2 ++
> >> >>> >>>>>>>>>  1 file changed, 2 insertions(+)
> >> >>> >>>>>>>>>
> >> >>> >>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h
> >> >>> >>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h
> >> >>> >>>>>>>>> index efe609c..5129038 100644
> >> >>> >>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
> >> >>> >>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
> >> >>> >>>>>>>>> @@ -123,6 +123,8 @@ static inline void
> >> >>> >>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
> >> >>> >>>>>>>>>       if (!icache_is_aliasing()) {            /* PIPT */
> >> >>> >>>>>>>>>               unsigned long hva = gfn_to_hva(kvm, gfn);
> >> >>> >>>>>>>>>               flush_icache_range(hva, hva + PAGE_SIZE);
> >> >>> >>>>>>>>> +             /* Flush d-cache for systems with external
> >> >>> >>>>>>>>> caches. */
> >> >>> >>>>>>>>> +             __flush_dcache_area((void *) hva, PAGE_SIZE);
> >> >>> >>>>>>>>>       } else if (!icache_is_aivivt()) {       /* non
> >> >>> >>>>>>>>> ASID-tagged
> >> >>> >>>>>>>>> VIVT
> >> >>> >>>>>>>>> */
> >> >>> >>>>>>>>>               /* any kind of VIPT cache */
> >> >>> >>>>>>>>>               __flush_icache_all();
> >> >>> >>>>>>>>
> >> >>> >>>>>>>>
> >> >>> >>>>>>>>
> >> >>> >>>>>>>>
> >> >>> >>>>>>>> [adding Will to the discussion as we talked about this in the
> >> >>> >>>>>>>> past]
> >> >>> >>>>>>>>
> >> >>> >>>>>>>> That's definitely an issue, but I'm not sure the fix is to hit
> >> >>> >>>>>>>> the
> >> >>> >>>>>>>> data
> >> >>> >>>>>>>> cache on each page mapping. It looks overkill.
> >> >>> >>>>>>>>
> >> >>> >>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning?
> >> >>> >>>>>>>> kvmtools
> >> >>> >>>>>>>> knows which bits of the guest memory have been touched, and
> >> >>> >>>>>>>> can do a
> >> >>> >>>>>>>> "DC
> >> >>> >>>>>>>> DVAC" on this region.
> >> >>> >>>>>>>
> >> >>> >>>>>>>
> >> >>> >>>>>>>
> >> >>> >>>>>>>
> >> >>> >>>>>>> It seems a bit unnatural to have cache cleaning is user-space.
> >> >>> >>>>>>> I am
> >> >>> >>>>>>> sure
> >> >>> >>>>>>> other architectures don't have such cache cleaning in
> >> >>> >>>>>>> user-space for
> >> >>> >>>>>>> KVM.
> >> >>> >>>>>>>
> >> >>> >>>>>>>>
> >> >>> >>>>>>>> The alternative is do it in the kernel before running any vcpu
> >> >>> >>>>>>>> - but
> >> >>> >>>>>>>> that's not very nice either (have to clean the whole of the
> >> >>> >>>>>>>> guest
> >> >>> >>>>>>>> memory, which makes a full dcache clean more appealing).
> >> >>> >>>>>>>
> >> >>> >>>>>>>
> >> >>> >>>>>>>
> >> >>> >>>>>>>
> >> >>> >>>>>>> Actually, cleaning full d-cache by set/way upon first run of
> >> >>> >>>>>>> VCPU was
> >> >>> >>>>>>> our second option but current approach seemed very simple hence
> >> >>> >>>>>>> we went for this.
> >> >>> >>>>>>>
> >> >>> >>>>>>> If more people vote for full d-cache clean upon first run of
> >> >>> >>>>>>> VCPU then
> >> >>> >>>>>>> we should revise this patch.
> >> >>> >>>>>>
> >> >>> >>>>>>
> >> >>> >>>>>>
> >> >>> >>>>>>
> >> >>> >>>>>> Can you please give the attached patch a spin on your HW? I've
> >> >>> >>>>>> boot-tested
> >> >>> >>>>>> it on a model, but of course I can't really verify that it fixes
> >> >>> >>>>>> your
> >> >>> >>>>>> issue.
> >> >>> >>>>>>
> >> >>> >>>>>> As far as I can see, it should fix it without any additional
> >> >>> >>>>>> flushing.
> >> >>> >>>>>>
> >> >>> >>>>>> Please let me know how it goes.
> >> >>> >>>>>
> >> >>> >>>>>
> >> >>> >>>>>
> >> >>> >>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are
> >> >>> >>>>> treated as "Normal Non-shareable, Inner Write-Back
> >> >>> >>>>> Write-Allocate,
> >> >>> >>>>> Outer Write-Back Write-Allocate memory"
> >> >>> >>>>>
> >> >>> >>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are
> >> >>> >>>>> treated as "Strongly-ordered device memory"
> >> >>> >>>>>
> >> >>> >>>>> Now if Guest/VM access hardware MMIO devices directly (such as
> >> >>> >>>>> VGIC CPU interface) with MMU off then MMIO devices will be
> >> >>> >>>>> accessed as normal memory when HCR_EL2.DC=1.
> >> >>> >>>>
> >> >>> >>>>
> >> >>> >>>>
> >> >>> >>>> I don't think so. Stage-2 still applies, and should force MMIO to
> >> >>> >>>> be
> >> >>> >>>> accessed as device memory.
> >> >>> >>>>
> >> >>> >>>>
> >> >>> >>>>> The HCR_EL2.DC=1 makes sense only if we have all software
> >> >>> >>>>> emulated devices for Guest/VM which is not true for KVM ARM or
> >> >>> >>>>> KVM ARM64 because we use VGIC.
> >> >>> >>>>>
> >> >>> >>>>> IMHO, this patch enforces incorrect memory attribute for Guest/VM
> >> >>> >>>>> when Stage1 MMU is off.
> >> >>> >>>>
> >> >>> >>>>
> >> >>> >>>>
> >> >>> >>>> See above. My understanding is that HCR.DC controls the default
> >> >>> >>>> output of
> >> >>> >>>> Stage-1, and Stage-2 overrides still apply.
> >> >>> >>>
> >> >>> >>>
> >> >>> >>> You had mentioned that PAGE_S2_DEVICE attribute was redundant
> >> >>> >>> and wanted guest to decide the memory attribute. In other words,
> >> >>> >>> you
> >> >>> >>> did not want to enforce any memory attribute in Stage2.
> >> >>> >>>
> >> >>> >>> Please refer to this patch
> >> >>> >>> https://patchwork.kernel.org/patch/2543201/
> >> >>> >>
> >> >>> >>
> >> >>> >> This patch has never been merged. If you carry on following the
> >> >>> >> discussion,
> >> >>> >> you will certainly notice it was dropped for a very good reason:
> >> >>> >> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html
> >> >>> >>
> >> >>> >> So Stage-2 memory attributes are used, they are not going away, and
> >> >>> >> they are
> >> >>> >> essential to the patch I sent this morning.
> >> >>> >>
> >> >>> >>
> >> >>> >>         M.
> >> >>> >> --
> >> >>> >> Fast, cheap, reliable. Pick two.
> >> >>> >
> >> >>> > HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM
> >> >>> > is
> >> >>> > provided a DMA-capable device in pass-through mode. The reason being
> >> >>> > bootloader/firmware typically don't enable MMU and such
> >> >>> > bootloader/firmware
> >> >>> > will programme a pass-through DMA-capable device without any flushes
> >> >>> > to
> >> >>> > guest RAM (because it has not enabled MMU).
> >> >>> >
> >> >>> > A good example of such a device would be SATA AHCI controller given
> >> >>> > to a
> >> >>> > Guest/VM as direct access (using SystemMMU) and Guest
> >> >>> > bootloader/firmware
> >> >>> > accessing this SATA AHCI controller to load kernel images from SATA
> >> >>> > disk.
> >> >>> > In this situation, we will have to hack Guest bootloader/firmware
> >> >>> > AHCI driver to
> >> >>> > explicitly flushes to Guest RAM (because have HCR_EL2.DC=1).
> >> >>>
> >> >>> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of
> >> >>> curiosity: is that a made up example or something you actually have?
> >> >>>
> >> >>> Back to square one:
> >> >>> Can you please benchmark the various cache cleaning options (global at
> >> >>> startup time, per-page on S2 translation fault, and user-space)?
> >> >>>
> >> >> Eh, why is this a more valid argument than the vgic?  The device
> >> >> passthrough Stage-2 mappings would still have the Stage-2 memory
> >> >> attributes to configure that memory region as device memory.  Why is it
> >> >> relevant if the device is DMA-capable in this context?
> >> >>
> >> >> -Christoffer
> >> >
> >> > Most ARM bootloader/firmware run with MMU off hence, they will not do
> >> > explicit cache flushes when programming DMA-capable device. Now If
> >> > HCR_EL2.DC=1 then Guest RAM updates done by bootloader/firmware
> >> > will not be visible to DMA-capable device.
> >> >
> >> > --Anup
> >>
> >> The approach of flushing d-cache by set/way upon first run of VCPU will
> >> not work because for set/way operations ARM ARM says: "For set/way
> >> operations, and for All (entire cache) operations, the point is defined to be
> >> to the next level of caching". In other words, set/way operations work upto
> >> point of unification.
> >>
> >> Also, Guest Linux already does __flush_dcache_all() from __cpu_setup()
> >> at bootup time which does not work for us when L3-cache is enabled so,
> >> there is no point is adding one more __flush_dcache_all() upon first run of
> >> VCPU in KVM ARM64.
> >
> > When did anyone suggest using a cache cleaning method that doesn't apply
> > to this case.  I'm a little confused about your comment here.
> 
> Please refer last reply from Marc Z where he says:
> "Can you please benchmark the various cache cleaning options (global at
> startup time, per-page on S2 translation fault, and user-space)?".
> 
> Rather doing __flush_dcache_range() on each page in
> coherent_icache_guest_page() we could also flush entire d-cache upon
> first VCPU run. This only issue in flushing d-cache upon first VCPU run
> is that we cannot flush d-cache by set/way because as per ARM ARM
> all operations by set/way are upto PoU and not PoC. In presence of
> L3-Cache we need to flush d-cache upto PoC so we have to use
> __flush_dache_range()
> 
> >
> >>
> >> IMHO, we are left with following options:
> >> 1. Flush all RAM regions of VCPU using __flush_dcache_range()
> >>     upon first run of VCPU
> >
> > We could do that, but we have to ensure that no other memory regions can
> > be added later.  Is this the case?  I don't remember.
> 
> Yes, memory regions being added later be problem for this option.
> 
> >
> >> 2. Implement outer-cache framework for ARM64 and flush all
> >>     caches + outer cache (i.e. L3-cache) upon first run of VCPU
> >
> > What's the difference between (2) and (1)?
> 
> Linux ARM (i.e. 32-bit port) has a framework for having outer
> caches (i.e. caches that are not part of the CPU but exist as
> separate entity between CPU and Memory for performance
> improvement). Using this framework we can flush all CPU caches
> and outer cache.
> 

And we don't have such a framework in arm64?  But __flush_dcache_range
does nevertheless flush outer caches as well?

> >
> >> 3. Use an alternate version of flush_icache_range() which will
> >>     flush d-cache by PoC instead of PoU. We can also ensure
> >>     that coherent_icache_guest_page() function will be called
> >>     upon Stage2 prefetch aborts only.
> >>
> >
> > I'm sorry, but I'm not following this discussion.  Are we not mixing a
> > discussion along two different axis here?  As I see it there are two
> > separate issues:
> >
> >  (A) When/Where to flush the memory regions
> >  (B) How to flush the memory regions, including the hardware method and
> >      the kernel software engineering aspect.
> 
> Discussion here is about getting KVM ARM64 working in-presence
> of an external L3-cache (i.e. not part of CPU). Before starting a VCPU
> user-space typically loads images to guest RAM so, in-presence of
> huge L3-cache (few MBs). When the VCPU starts running some of the
> contents guest RAM will be still in L3-cache and VCPU runs with
> MMU off (i.e. cacheing off) hence VCPU will bypass L3-cache and
> see incorrect contents. To solve this problem we need to flush the
> guest RAM contents before they are accessed by first time by VCPU.
> 
ok, I'm with you that far.

But is it also not true that we need to decide between:

  A.1: Flush the entire guest RAM before running the VCPU
  A.2: Flush the pages as we fault them in

And (independently):

  B.1: Use __flush_dcache_range
  B.2: Use something else + outer cache framework for arm64
  B.3: Use flush_icache_range

Or do these all interleave somehow?  If so, I don't understand why.  Can
you help?

-Christoffer
Christoffer Dall Aug. 16, 2013, 6:06 p.m. UTC | #20
On Fri, Aug 16, 2013 at 10:50:34AM -0700, Christoffer Dall wrote:
> On Fri, Aug 16, 2013 at 11:12:08PM +0530, Anup Patel wrote:
> > On Fri, Aug 16, 2013 at 10:49 PM, Christoffer Dall
> > <christoffer.dall@linaro.org> wrote:
> > > On Fri, Aug 16, 2013 at 12:27:55PM +0530, Anup Patel wrote:
> > >> On Fri, Aug 16, 2013 at 10:32 AM, Anup Patel <anup@brainfault.org> wrote:
> > >> > On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall
> > >> > <christoffer.dall@linaro.org> wrote:
> > >> >> On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote:
> > >> >>> On 2013-08-15 16:13, Anup Patel wrote:
> > >> >>> > Hi Marc,
> > >> >>> >
> > >> >>> > On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com>
> > >> >>> > wrote:
> > >> >>> >> On 2013-08-15 14:31, Anup Patel wrote:
> > >> >>> >>>
> > >> >>> >>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier
> > >> >>> >>> <marc.zyngier@arm.com>
> > >> >>> >>> wrote:
> > >> >>> >>>>
> > >> >>> >>>> On 2013-08-15 07:26, Anup Patel wrote:
> > >> >>> >>>>>
> > >> >>> >>>>>
> > >> >>> >>>>> Hi Marc,
> > >> >>> >>>>>
> > >> >>> >>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier
> > >> >>> >>>>> <marc.zyngier@arm.com>
> > >> >>> >>>>> wrote:
> > >> >>> >>>>>>
> > >> >>> >>>>>>
> > >> >>> >>>>>> Hi Anup,
> > >> >>> >>>>>>
> > >> >>> >>>>>>
> > >> >>> >>>>>> On 2013-08-14 15:22, Anup Patel wrote:
> > >> >>> >>>>>>>
> > >> >>> >>>>>>>
> > >> >>> >>>>>>>
> > >> >>> >>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier
> > >> >>> >>>>>>> <marc.zyngier@arm.com>
> > >> >>> >>>>>>> wrote:
> > >> >>> >>>>>>>>
> > >> >>> >>>>>>>>
> > >> >>> >>>>>>>>
> > >> >>> >>>>>>>> Hi Pranav,
> > >> >>> >>>>>>>>
> > >> >>> >>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
> > >> >>> >>>>>>>>>
> > >> >>> >>>>>>>>>
> > >> >>> >>>>>>>>>
> > >> >>> >>>>>>>>> Systems with large external L3-cache (few MBs), might have
> > >> >>> >>>>>>>>> dirty
> > >> >>> >>>>>>>>> content belonging to the guest page in L3-cache. To tackle
> > >> >>> >>>>>>>>> this,
> > >> >>> >>>>>>>>> we need to flush such dirty content from d-cache so that
> > >> >>> >>>>>>>>> guest
> > >> >>> >>>>>>>>> will see correct contents of guest page when guest MMU is
> > >> >>> >>>>>>>>> disabled.
> > >> >>> >>>>>>>>>
> > >> >>> >>>>>>>>> The patch fixes coherent_icache_guest_page() for external
> > >> >>> >>>>>>>>> L3-cache.
> > >> >>> >>>>>>>>>
> > >> >>> >>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar
> > >> >>> >>>>>>>>> <pranavkumar@linaro.org>
> > >> >>> >>>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org>
> > >> >>> >>>>>>>>> ---
> > >> >>> >>>>>>>>>  arch/arm64/include/asm/kvm_mmu.h |    2 ++
> > >> >>> >>>>>>>>>  1 file changed, 2 insertions(+)
> > >> >>> >>>>>>>>>
> > >> >>> >>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h
> > >> >>> >>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h
> > >> >>> >>>>>>>>> index efe609c..5129038 100644
> > >> >>> >>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
> > >> >>> >>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
> > >> >>> >>>>>>>>> @@ -123,6 +123,8 @@ static inline void
> > >> >>> >>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
> > >> >>> >>>>>>>>>       if (!icache_is_aliasing()) {            /* PIPT */
> > >> >>> >>>>>>>>>               unsigned long hva = gfn_to_hva(kvm, gfn);
> > >> >>> >>>>>>>>>               flush_icache_range(hva, hva + PAGE_SIZE);
> > >> >>> >>>>>>>>> +             /* Flush d-cache for systems with external
> > >> >>> >>>>>>>>> caches. */
> > >> >>> >>>>>>>>> +             __flush_dcache_area((void *) hva, PAGE_SIZE);
> > >> >>> >>>>>>>>>       } else if (!icache_is_aivivt()) {       /* non
> > >> >>> >>>>>>>>> ASID-tagged
> > >> >>> >>>>>>>>> VIVT
> > >> >>> >>>>>>>>> */
> > >> >>> >>>>>>>>>               /* any kind of VIPT cache */
> > >> >>> >>>>>>>>>               __flush_icache_all();
> > >> >>> >>>>>>>>
> > >> >>> >>>>>>>>
> > >> >>> >>>>>>>>
> > >> >>> >>>>>>>>
> > >> >>> >>>>>>>> [adding Will to the discussion as we talked about this in the
> > >> >>> >>>>>>>> past]
> > >> >>> >>>>>>>>
> > >> >>> >>>>>>>> That's definitely an issue, but I'm not sure the fix is to hit
> > >> >>> >>>>>>>> the
> > >> >>> >>>>>>>> data
> > >> >>> >>>>>>>> cache on each page mapping. It looks overkill.
> > >> >>> >>>>>>>>
> > >> >>> >>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning?
> > >> >>> >>>>>>>> kvmtools
> > >> >>> >>>>>>>> knows which bits of the guest memory have been touched, and
> > >> >>> >>>>>>>> can do a
> > >> >>> >>>>>>>> "DC
> > >> >>> >>>>>>>> DVAC" on this region.
> > >> >>> >>>>>>>
> > >> >>> >>>>>>>
> > >> >>> >>>>>>>
> > >> >>> >>>>>>>
> > >> >>> >>>>>>> It seems a bit unnatural to have cache cleaning is user-space.
> > >> >>> >>>>>>> I am
> > >> >>> >>>>>>> sure
> > >> >>> >>>>>>> other architectures don't have such cache cleaning in
> > >> >>> >>>>>>> user-space for
> > >> >>> >>>>>>> KVM.
> > >> >>> >>>>>>>
> > >> >>> >>>>>>>>
> > >> >>> >>>>>>>> The alternative is do it in the kernel before running any vcpu
> > >> >>> >>>>>>>> - but
> > >> >>> >>>>>>>> that's not very nice either (have to clean the whole of the
> > >> >>> >>>>>>>> guest
> > >> >>> >>>>>>>> memory, which makes a full dcache clean more appealing).
> > >> >>> >>>>>>>
> > >> >>> >>>>>>>
> > >> >>> >>>>>>>
> > >> >>> >>>>>>>
> > >> >>> >>>>>>> Actually, cleaning full d-cache by set/way upon first run of
> > >> >>> >>>>>>> VCPU was
> > >> >>> >>>>>>> our second option but current approach seemed very simple hence
> > >> >>> >>>>>>> we went for this.
> > >> >>> >>>>>>>
> > >> >>> >>>>>>> If more people vote for full d-cache clean upon first run of
> > >> >>> >>>>>>> VCPU then
> > >> >>> >>>>>>> we should revise this patch.
> > >> >>> >>>>>>
> > >> >>> >>>>>>
> > >> >>> >>>>>>
> > >> >>> >>>>>>
> > >> >>> >>>>>> Can you please give the attached patch a spin on your HW? I've
> > >> >>> >>>>>> boot-tested
> > >> >>> >>>>>> it on a model, but of course I can't really verify that it fixes
> > >> >>> >>>>>> your
> > >> >>> >>>>>> issue.
> > >> >>> >>>>>>
> > >> >>> >>>>>> As far as I can see, it should fix it without any additional
> > >> >>> >>>>>> flushing.
> > >> >>> >>>>>>
> > >> >>> >>>>>> Please let me know how it goes.
> > >> >>> >>>>>
> > >> >>> >>>>>
> > >> >>> >>>>>
> > >> >>> >>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are
> > >> >>> >>>>> treated as "Normal Non-shareable, Inner Write-Back
> > >> >>> >>>>> Write-Allocate,
> > >> >>> >>>>> Outer Write-Back Write-Allocate memory"
> > >> >>> >>>>>
> > >> >>> >>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are
> > >> >>> >>>>> treated as "Strongly-ordered device memory"
> > >> >>> >>>>>
> > >> >>> >>>>> Now if Guest/VM access hardware MMIO devices directly (such as
> > >> >>> >>>>> VGIC CPU interface) with MMU off then MMIO devices will be
> > >> >>> >>>>> accessed as normal memory when HCR_EL2.DC=1.
> > >> >>> >>>>
> > >> >>> >>>>
> > >> >>> >>>>
> > >> >>> >>>> I don't think so. Stage-2 still applies, and should force MMIO to
> > >> >>> >>>> be
> > >> >>> >>>> accessed as device memory.
> > >> >>> >>>>
> > >> >>> >>>>
> > >> >>> >>>>> The HCR_EL2.DC=1 makes sense only if we have all software
> > >> >>> >>>>> emulated devices for Guest/VM which is not true for KVM ARM or
> > >> >>> >>>>> KVM ARM64 because we use VGIC.
> > >> >>> >>>>>
> > >> >>> >>>>> IMHO, this patch enforces incorrect memory attribute for Guest/VM
> > >> >>> >>>>> when Stage1 MMU is off.
> > >> >>> >>>>
> > >> >>> >>>>
> > >> >>> >>>>
> > >> >>> >>>> See above. My understanding is that HCR.DC controls the default
> > >> >>> >>>> output of
> > >> >>> >>>> Stage-1, and Stage-2 overrides still apply.
> > >> >>> >>>
> > >> >>> >>>
> > >> >>> >>> You had mentioned that PAGE_S2_DEVICE attribute was redundant
> > >> >>> >>> and wanted guest to decide the memory attribute. In other words,
> > >> >>> >>> you
> > >> >>> >>> did not want to enforce any memory attribute in Stage2.
> > >> >>> >>>
> > >> >>> >>> Please refer to this patch
> > >> >>> >>> https://patchwork.kernel.org/patch/2543201/
> > >> >>> >>
> > >> >>> >>
> > >> >>> >> This patch has never been merged. If you carry on following the
> > >> >>> >> discussion,
> > >> >>> >> you will certainly notice it was dropped for a very good reason:
> > >> >>> >> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html
> > >> >>> >>
> > >> >>> >> So Stage-2 memory attributes are used, they are not going away, and
> > >> >>> >> they are
> > >> >>> >> essential to the patch I sent this morning.
> > >> >>> >>
> > >> >>> >>
> > >> >>> >>         M.
> > >> >>> >> --
> > >> >>> >> Fast, cheap, reliable. Pick two.
> > >> >>> >
> > >> >>> > HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM
> > >> >>> > is
> > >> >>> > provided a DMA-capable device in pass-through mode. The reason being
> > >> >>> > bootloader/firmware typically don't enable MMU and such
> > >> >>> > bootloader/firmware
> > >> >>> > will programme a pass-through DMA-capable device without any flushes
> > >> >>> > to
> > >> >>> > guest RAM (because it has not enabled MMU).
> > >> >>> >
> > >> >>> > A good example of such a device would be SATA AHCI controller given
> > >> >>> > to a
> > >> >>> > Guest/VM as direct access (using SystemMMU) and Guest
> > >> >>> > bootloader/firmware
> > >> >>> > accessing this SATA AHCI controller to load kernel images from SATA
> > >> >>> > disk.
> > >> >>> > In this situation, we will have to hack Guest bootloader/firmware
> > >> >>> > AHCI driver to
> > >> >>> > explicitly flushes to Guest RAM (because have HCR_EL2.DC=1).
> > >> >>>
> > >> >>> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of
> > >> >>> curiosity: is that a made up example or something you actually have?
> > >> >>>
> > >> >>> Back to square one:
> > >> >>> Can you please benchmark the various cache cleaning options (global at
> > >> >>> startup time, per-page on S2 translation fault, and user-space)?
> > >> >>>
> > >> >> Eh, why is this a more valid argument than the vgic?  The device
> > >> >> passthrough Stage-2 mappings would still have the Stage-2 memory
> > >> >> attributes to configure that memory region as device memory.  Why is it
> > >> >> relevant if the device is DMA-capable in this context?
> > >> >>
> > >> >> -Christoffer
> > >> >
> > >> > Most ARM bootloader/firmware run with MMU off hence, they will not do
> > >> > explicit cache flushes when programming DMA-capable device. Now If
> > >> > HCR_EL2.DC=1 then Guest RAM updates done by bootloader/firmware
> > >> > will not be visible to DMA-capable device.
> > >> >
> > >> > --Anup
> > >>
> > >> The approach of flushing d-cache by set/way upon first run of VCPU will
> > >> not work because for set/way operations ARM ARM says: "For set/way
> > >> operations, and for All (entire cache) operations, the point is defined to be
> > >> to the next level of caching". In other words, set/way operations work upto
> > >> point of unification.
> > >>
> > >> Also, Guest Linux already does __flush_dcache_all() from __cpu_setup()
> > >> at bootup time which does not work for us when L3-cache is enabled so,
> > >> there is no point is adding one more __flush_dcache_all() upon first run of
> > >> VCPU in KVM ARM64.
> > >
> > > When did anyone suggest using a cache cleaning method that doesn't apply
> > > to this case.  I'm a little confused about your comment here.
> > 
> > Please refer last reply from Marc Z where he says:
> > "Can you please benchmark the various cache cleaning options (global at
> > startup time, per-page on S2 translation fault, and user-space)?".
> > 
> > Rather doing __flush_dcache_range() on each page in
> > coherent_icache_guest_page() we could also flush entire d-cache upon
> > first VCPU run. This only issue in flushing d-cache upon first VCPU run
> > is that we cannot flush d-cache by set/way because as per ARM ARM
> > all operations by set/way are upto PoU and not PoC. In presence of
> > L3-Cache we need to flush d-cache upto PoC so we have to use
> > __flush_dache_range()
> > 
> > >
> > >>
> > >> IMHO, we are left with following options:
> > >> 1. Flush all RAM regions of VCPU using __flush_dcache_range()
> > >>     upon first run of VCPU
> > >
> > > We could do that, but we have to ensure that no other memory regions can
> > > be added later.  Is this the case?  I don't remember.
> > 
> > Yes, memory regions being added later be problem for this option.
> > 
> > >
> > >> 2. Implement outer-cache framework for ARM64 and flush all
> > >>     caches + outer cache (i.e. L3-cache) upon first run of VCPU
> > >
> > > What's the difference between (2) and (1)?
> > 
> > Linux ARM (i.e. 32-bit port) has a framework for having outer
> > caches (i.e. caches that are not part of the CPU but exist as
> > separate entity between CPU and Memory for performance
> > improvement). Using this framework we can flush all CPU caches
> > and outer cache.
> > 
> 
> And we don't have such a framework in arm64?  But __flush_dcache_range
> does nevertheless flush outer caches as well?
> 
> > >
> > >> 3. Use an alternate version of flush_icache_range() which will
> > >>     flush d-cache by PoC instead of PoU. We can also ensure
> > >>     that coherent_icache_guest_page() function will be called
> > >>     upon Stage2 prefetch aborts only.
> > >>
> > >
> > > I'm sorry, but I'm not following this discussion.  Are we not mixing a
> > > discussion along two different axis here?  As I see it there are two
> > > separate issues:
> > >
> > >  (A) When/Where to flush the memory regions
> > >  (B) How to flush the memory regions, including the hardware method and
> > >      the kernel software engineering aspect.
> > 
> > Discussion here is about getting KVM ARM64 working in-presence
> > of an external L3-cache (i.e. not part of CPU). Before starting a VCPU
> > user-space typically loads images to guest RAM so, in-presence of
> > huge L3-cache (few MBs). When the VCPU starts running some of the
> > contents guest RAM will be still in L3-cache and VCPU runs with
> > MMU off (i.e. cacheing off) hence VCPU will bypass L3-cache and
> > see incorrect contents. To solve this problem we need to flush the
> > guest RAM contents before they are accessed by first time by VCPU.
> > 
> ok, I'm with you that far.
> 
> But is it also not true that we need to decide between:
> 
>   A.1: Flush the entire guest RAM before running the VCPU
>   A.2: Flush the pages as we fault them in
> 
> And (independently):
> 
>   B.1: Use __flush_dcache_range
>   B.2: Use something else + outer cache framework for arm64
>   B.3: Use flush_icache_range
> 
> Or do these all interleave somehow?  If so, I don't understand why.  Can
> you help?
> 
Oh, I think I understand your point now.  You mean if we use
flush_cache_all before we run the vcpu, then it's not sufficient?  I
assumed we would simply be using __flush_dcache_area for the guest RAM
regions which would flush to PoC.

For the record, I think we need a solution that also covers the case
where a memory region is registered later, and I therefore prefer having
this functionality in the stage-2 fault handler, where we are already
taking care of similar issues (like your patch suggested).

Especially if we can limit ourselves to doing so when the guest MMU is
disabled, then I really think this is going to be the least overhead and
measuring the performance of this penalty vs. at first CPU execution is
a bit overkill IMHO, since we are only talking about boot time for any
reasonable guest (which would run with the MMU enabled for real
workloads presumeably).

The only caveat is the previously discussed issue if user space loads
code after the first VCPU execution, and only user space would know if
that happens, which would argue for user space doing the cleaning...
Hmmm.

I also still have my worries about swapping, since user space is free to
map guest RAM as non-executable.

Did I miss anything here?

-Christoffer
Anup Patel Aug. 16, 2013, 6:11 p.m. UTC | #21
On Fri, Aug 16, 2013 at 11:20 PM, Christoffer Dall
<christoffer.dall@linaro.org> wrote:
> On Fri, Aug 16, 2013 at 11:12:08PM +0530, Anup Patel wrote:
>> On Fri, Aug 16, 2013 at 10:49 PM, Christoffer Dall
>> <christoffer.dall@linaro.org> wrote:
>> > On Fri, Aug 16, 2013 at 12:27:55PM +0530, Anup Patel wrote:
>> >> On Fri, Aug 16, 2013 at 10:32 AM, Anup Patel <anup@brainfault.org> wrote:
>> >> > On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall
>> >> > <christoffer.dall@linaro.org> wrote:
>> >> >> On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote:
>> >> >>> On 2013-08-15 16:13, Anup Patel wrote:
>> >> >>> > Hi Marc,
>> >> >>> >
>> >> >>> > On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com>
>> >> >>> > wrote:
>> >> >>> >> On 2013-08-15 14:31, Anup Patel wrote:
>> >> >>> >>>
>> >> >>> >>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier
>> >> >>> >>> <marc.zyngier@arm.com>
>> >> >>> >>> wrote:
>> >> >>> >>>>
>> >> >>> >>>> On 2013-08-15 07:26, Anup Patel wrote:
>> >> >>> >>>>>
>> >> >>> >>>>>
>> >> >>> >>>>> Hi Marc,
>> >> >>> >>>>>
>> >> >>> >>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier
>> >> >>> >>>>> <marc.zyngier@arm.com>
>> >> >>> >>>>> wrote:
>> >> >>> >>>>>>
>> >> >>> >>>>>>
>> >> >>> >>>>>> Hi Anup,
>> >> >>> >>>>>>
>> >> >>> >>>>>>
>> >> >>> >>>>>> On 2013-08-14 15:22, Anup Patel wrote:
>> >> >>> >>>>>>>
>> >> >>> >>>>>>>
>> >> >>> >>>>>>>
>> >> >>> >>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier
>> >> >>> >>>>>>> <marc.zyngier@arm.com>
>> >> >>> >>>>>>> wrote:
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>> Hi Pranav,
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
>> >> >>> >>>>>>>>>
>> >> >>> >>>>>>>>>
>> >> >>> >>>>>>>>>
>> >> >>> >>>>>>>>> Systems with large external L3-cache (few MBs), might have
>> >> >>> >>>>>>>>> dirty
>> >> >>> >>>>>>>>> content belonging to the guest page in L3-cache. To tackle
>> >> >>> >>>>>>>>> this,
>> >> >>> >>>>>>>>> we need to flush such dirty content from d-cache so that
>> >> >>> >>>>>>>>> guest
>> >> >>> >>>>>>>>> will see correct contents of guest page when guest MMU is
>> >> >>> >>>>>>>>> disabled.
>> >> >>> >>>>>>>>>
>> >> >>> >>>>>>>>> The patch fixes coherent_icache_guest_page() for external
>> >> >>> >>>>>>>>> L3-cache.
>> >> >>> >>>>>>>>>
>> >> >>> >>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar
>> >> >>> >>>>>>>>> <pranavkumar@linaro.org>
>> >> >>> >>>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org>
>> >> >>> >>>>>>>>> ---
>> >> >>> >>>>>>>>>  arch/arm64/include/asm/kvm_mmu.h |    2 ++
>> >> >>> >>>>>>>>>  1 file changed, 2 insertions(+)
>> >> >>> >>>>>>>>>
>> >> >>> >>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h
>> >> >>> >>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h
>> >> >>> >>>>>>>>> index efe609c..5129038 100644
>> >> >>> >>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
>> >> >>> >>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>> >> >>> >>>>>>>>> @@ -123,6 +123,8 @@ static inline void
>> >> >>> >>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
>> >> >>> >>>>>>>>>       if (!icache_is_aliasing()) {            /* PIPT */
>> >> >>> >>>>>>>>>               unsigned long hva = gfn_to_hva(kvm, gfn);
>> >> >>> >>>>>>>>>               flush_icache_range(hva, hva + PAGE_SIZE);
>> >> >>> >>>>>>>>> +             /* Flush d-cache for systems with external
>> >> >>> >>>>>>>>> caches. */
>> >> >>> >>>>>>>>> +             __flush_dcache_area((void *) hva, PAGE_SIZE);
>> >> >>> >>>>>>>>>       } else if (!icache_is_aivivt()) {       /* non
>> >> >>> >>>>>>>>> ASID-tagged
>> >> >>> >>>>>>>>> VIVT
>> >> >>> >>>>>>>>> */
>> >> >>> >>>>>>>>>               /* any kind of VIPT cache */
>> >> >>> >>>>>>>>>               __flush_icache_all();
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>> [adding Will to the discussion as we talked about this in the
>> >> >>> >>>>>>>> past]
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>> That's definitely an issue, but I'm not sure the fix is to hit
>> >> >>> >>>>>>>> the
>> >> >>> >>>>>>>> data
>> >> >>> >>>>>>>> cache on each page mapping. It looks overkill.
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning?
>> >> >>> >>>>>>>> kvmtools
>> >> >>> >>>>>>>> knows which bits of the guest memory have been touched, and
>> >> >>> >>>>>>>> can do a
>> >> >>> >>>>>>>> "DC
>> >> >>> >>>>>>>> DVAC" on this region.
>> >> >>> >>>>>>>
>> >> >>> >>>>>>>
>> >> >>> >>>>>>>
>> >> >>> >>>>>>>
>> >> >>> >>>>>>> It seems a bit unnatural to have cache cleaning is user-space.
>> >> >>> >>>>>>> I am
>> >> >>> >>>>>>> sure
>> >> >>> >>>>>>> other architectures don't have such cache cleaning in
>> >> >>> >>>>>>> user-space for
>> >> >>> >>>>>>> KVM.
>> >> >>> >>>>>>>
>> >> >>> >>>>>>>>
>> >> >>> >>>>>>>> The alternative is do it in the kernel before running any vcpu
>> >> >>> >>>>>>>> - but
>> >> >>> >>>>>>>> that's not very nice either (have to clean the whole of the
>> >> >>> >>>>>>>> guest
>> >> >>> >>>>>>>> memory, which makes a full dcache clean more appealing).
>> >> >>> >>>>>>>
>> >> >>> >>>>>>>
>> >> >>> >>>>>>>
>> >> >>> >>>>>>>
>> >> >>> >>>>>>> Actually, cleaning full d-cache by set/way upon first run of
>> >> >>> >>>>>>> VCPU was
>> >> >>> >>>>>>> our second option but current approach seemed very simple hence
>> >> >>> >>>>>>> we went for this.
>> >> >>> >>>>>>>
>> >> >>> >>>>>>> If more people vote for full d-cache clean upon first run of
>> >> >>> >>>>>>> VCPU then
>> >> >>> >>>>>>> we should revise this patch.
>> >> >>> >>>>>>
>> >> >>> >>>>>>
>> >> >>> >>>>>>
>> >> >>> >>>>>>
>> >> >>> >>>>>> Can you please give the attached patch a spin on your HW? I've
>> >> >>> >>>>>> boot-tested
>> >> >>> >>>>>> it on a model, but of course I can't really verify that it fixes
>> >> >>> >>>>>> your
>> >> >>> >>>>>> issue.
>> >> >>> >>>>>>
>> >> >>> >>>>>> As far as I can see, it should fix it without any additional
>> >> >>> >>>>>> flushing.
>> >> >>> >>>>>>
>> >> >>> >>>>>> Please let me know how it goes.
>> >> >>> >>>>>
>> >> >>> >>>>>
>> >> >>> >>>>>
>> >> >>> >>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are
>> >> >>> >>>>> treated as "Normal Non-shareable, Inner Write-Back
>> >> >>> >>>>> Write-Allocate,
>> >> >>> >>>>> Outer Write-Back Write-Allocate memory"
>> >> >>> >>>>>
>> >> >>> >>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are
>> >> >>> >>>>> treated as "Strongly-ordered device memory"
>> >> >>> >>>>>
>> >> >>> >>>>> Now if Guest/VM access hardware MMIO devices directly (such as
>> >> >>> >>>>> VGIC CPU interface) with MMU off then MMIO devices will be
>> >> >>> >>>>> accessed as normal memory when HCR_EL2.DC=1.
>> >> >>> >>>>
>> >> >>> >>>>
>> >> >>> >>>>
>> >> >>> >>>> I don't think so. Stage-2 still applies, and should force MMIO to
>> >> >>> >>>> be
>> >> >>> >>>> accessed as device memory.
>> >> >>> >>>>
>> >> >>> >>>>
>> >> >>> >>>>> The HCR_EL2.DC=1 makes sense only if we have all software
>> >> >>> >>>>> emulated devices for Guest/VM which is not true for KVM ARM or
>> >> >>> >>>>> KVM ARM64 because we use VGIC.
>> >> >>> >>>>>
>> >> >>> >>>>> IMHO, this patch enforces incorrect memory attribute for Guest/VM
>> >> >>> >>>>> when Stage1 MMU is off.
>> >> >>> >>>>
>> >> >>> >>>>
>> >> >>> >>>>
>> >> >>> >>>> See above. My understanding is that HCR.DC controls the default
>> >> >>> >>>> output of
>> >> >>> >>>> Stage-1, and Stage-2 overrides still apply.
>> >> >>> >>>
>> >> >>> >>>
>> >> >>> >>> You had mentioned that PAGE_S2_DEVICE attribute was redundant
>> >> >>> >>> and wanted guest to decide the memory attribute. In other words,
>> >> >>> >>> you
>> >> >>> >>> did not want to enforce any memory attribute in Stage2.
>> >> >>> >>>
>> >> >>> >>> Please refer to this patch
>> >> >>> >>> https://patchwork.kernel.org/patch/2543201/
>> >> >>> >>
>> >> >>> >>
>> >> >>> >> This patch has never been merged. If you carry on following the
>> >> >>> >> discussion,
>> >> >>> >> you will certainly notice it was dropped for a very good reason:
>> >> >>> >> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html
>> >> >>> >>
>> >> >>> >> So Stage-2 memory attributes are used, they are not going away, and
>> >> >>> >> they are
>> >> >>> >> essential to the patch I sent this morning.
>> >> >>> >>
>> >> >>> >>
>> >> >>> >>         M.
>> >> >>> >> --
>> >> >>> >> Fast, cheap, reliable. Pick two.
>> >> >>> >
>> >> >>> > HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM
>> >> >>> > is
>> >> >>> > provided a DMA-capable device in pass-through mode. The reason being
>> >> >>> > bootloader/firmware typically don't enable MMU and such
>> >> >>> > bootloader/firmware
>> >> >>> > will programme a pass-through DMA-capable device without any flushes
>> >> >>> > to
>> >> >>> > guest RAM (because it has not enabled MMU).
>> >> >>> >
>> >> >>> > A good example of such a device would be SATA AHCI controller given
>> >> >>> > to a
>> >> >>> > Guest/VM as direct access (using SystemMMU) and Guest
>> >> >>> > bootloader/firmware
>> >> >>> > accessing this SATA AHCI controller to load kernel images from SATA
>> >> >>> > disk.
>> >> >>> > In this situation, we will have to hack Guest bootloader/firmware
>> >> >>> > AHCI driver to
>> >> >>> > explicitly flushes to Guest RAM (because have HCR_EL2.DC=1).
>> >> >>>
>> >> >>> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of
>> >> >>> curiosity: is that a made up example or something you actually have?
>> >> >>>
>> >> >>> Back to square one:
>> >> >>> Can you please benchmark the various cache cleaning options (global at
>> >> >>> startup time, per-page on S2 translation fault, and user-space)?
>> >> >>>
>> >> >> Eh, why is this a more valid argument than the vgic?  The device
>> >> >> passthrough Stage-2 mappings would still have the Stage-2 memory
>> >> >> attributes to configure that memory region as device memory.  Why is it
>> >> >> relevant if the device is DMA-capable in this context?
>> >> >>
>> >> >> -Christoffer
>> >> >
>> >> > Most ARM bootloader/firmware run with MMU off hence, they will not do
>> >> > explicit cache flushes when programming DMA-capable device. Now If
>> >> > HCR_EL2.DC=1 then Guest RAM updates done by bootloader/firmware
>> >> > will not be visible to DMA-capable device.
>> >> >
>> >> > --Anup
>> >>
>> >> The approach of flushing d-cache by set/way upon first run of VCPU will
>> >> not work because for set/way operations ARM ARM says: "For set/way
>> >> operations, and for All (entire cache) operations, the point is defined to be
>> >> to the next level of caching". In other words, set/way operations work upto
>> >> point of unification.
>> >>
>> >> Also, Guest Linux already does __flush_dcache_all() from __cpu_setup()
>> >> at bootup time which does not work for us when L3-cache is enabled so,
>> >> there is no point is adding one more __flush_dcache_all() upon first run of
>> >> VCPU in KVM ARM64.
>> >
>> > When did anyone suggest using a cache cleaning method that doesn't apply
>> > to this case.  I'm a little confused about your comment here.
>>
>> Please refer last reply from Marc Z where he says:
>> "Can you please benchmark the various cache cleaning options (global at
>> startup time, per-page on S2 translation fault, and user-space)?".
>>
>> Rather doing __flush_dcache_range() on each page in
>> coherent_icache_guest_page() we could also flush entire d-cache upon
>> first VCPU run. This only issue in flushing d-cache upon first VCPU run
>> is that we cannot flush d-cache by set/way because as per ARM ARM
>> all operations by set/way are upto PoU and not PoC. In presence of
>> L3-Cache we need to flush d-cache upto PoC so we have to use
>> __flush_dache_range()
>>
>> >
>> >>
>> >> IMHO, we are left with following options:
>> >> 1. Flush all RAM regions of VCPU using __flush_dcache_range()
>> >>     upon first run of VCPU
>> >
>> > We could do that, but we have to ensure that no other memory regions can
>> > be added later.  Is this the case?  I don't remember.
>>
>> Yes, memory regions being added later be problem for this option.
>>
>> >
>> >> 2. Implement outer-cache framework for ARM64 and flush all
>> >>     caches + outer cache (i.e. L3-cache) upon first run of VCPU
>> >
>> > What's the difference between (2) and (1)?
>>
>> Linux ARM (i.e. 32-bit port) has a framework for having outer
>> caches (i.e. caches that are not part of the CPU but exist as
>> separate entity between CPU and Memory for performance
>> improvement). Using this framework we can flush all CPU caches
>> and outer cache.
>>
>
> And we don't have such a framework in arm64?  But __flush_dcache_range

Yes, for now we don't have outer-cache framework in arm64.

> does nevertheless flush outer caches as well?

Yes, __flush_dcache_range() flushes the outer cache too.

The __flush_dcache_range() works for us because it flushes (i.e.
clean & invalidate) by VA upto PoC. All operation upto PoC on a
cache line will force eviction of the cache line from all CPU caches
(i.e. both L1 & L2) and also from external L3-cache to ensure that
RAM has updated contents of cache line.

Now, the outer caches (such as L3-cache) typically provide a
mechanism to flush entire L3-cache using L3-cache registers. If
we have an outer-cache framework then we can flush all caches
using __flush_dcache_all() + outer cache flush all.

>
>> >
>> >> 3. Use an alternate version of flush_icache_range() which will
>> >>     flush d-cache by PoC instead of PoU. We can also ensure
>> >>     that coherent_icache_guest_page() function will be called
>> >>     upon Stage2 prefetch aborts only.
>> >>
>> >
>> > I'm sorry, but I'm not following this discussion.  Are we not mixing a
>> > discussion along two different axis here?  As I see it there are two
>> > separate issues:
>> >
>> >  (A) When/Where to flush the memory regions
>> >  (B) How to flush the memory regions, including the hardware method and
>> >      the kernel software engineering aspect.
>>
>> Discussion here is about getting KVM ARM64 working in-presence
>> of an external L3-cache (i.e. not part of CPU). Before starting a VCPU
>> user-space typically loads images to guest RAM so, in-presence of
>> huge L3-cache (few MBs). When the VCPU starts running some of the
>> contents guest RAM will be still in L3-cache and VCPU runs with
>> MMU off (i.e. cacheing off) hence VCPU will bypass L3-cache and
>> see incorrect contents. To solve this problem we need to flush the
>> guest RAM contents before they are accessed by first time by VCPU.
>>
> ok, I'm with you that far.
>
> But is it also not true that we need to decide between:
>
>   A.1: Flush the entire guest RAM before running the VCPU
>   A.2: Flush the pages as we fault them in

Yes, thats the decision we have to make.

>
> And (independently):
>
>   B.1: Use __flush_dcache_range
>   B.2: Use something else + outer cache framework for arm64

This would be __flush_dcache_all() + outer cache flush all.

>   B.3: Use flush_icache_range

Actually modified version of flush_icache_range() which uses
"dc cvac" and not "dc cvau".

>
> Or do these all interleave somehow?  If so, I don't understand why.  Can
> you help?
>
> -Christoffer

--Anup
Christoffer Dall Aug. 16, 2013, 6:20 p.m. UTC | #22
On Fri, Aug 16, 2013 at 11:41:51PM +0530, Anup Patel wrote:
> On Fri, Aug 16, 2013 at 11:20 PM, Christoffer Dall
> <christoffer.dall@linaro.org> wrote:
> > On Fri, Aug 16, 2013 at 11:12:08PM +0530, Anup Patel wrote:
> >> On Fri, Aug 16, 2013 at 10:49 PM, Christoffer Dall
> >> <christoffer.dall@linaro.org> wrote:
> >> > On Fri, Aug 16, 2013 at 12:27:55PM +0530, Anup Patel wrote:
> >> >> On Fri, Aug 16, 2013 at 10:32 AM, Anup Patel <anup@brainfault.org> wrote:
> >> >> > On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall
> >> >> > <christoffer.dall@linaro.org> wrote:
> >> >> >> On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote:
> >> >> >>> On 2013-08-15 16:13, Anup Patel wrote:
> >> >> >>> > Hi Marc,
> >> >> >>> >
> >> >> >>> > On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com>
> >> >> >>> > wrote:
> >> >> >>> >> On 2013-08-15 14:31, Anup Patel wrote:
> >> >> >>> >>>
> >> >> >>> >>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier
> >> >> >>> >>> <marc.zyngier@arm.com>
> >> >> >>> >>> wrote:
> >> >> >>> >>>>
> >> >> >>> >>>> On 2013-08-15 07:26, Anup Patel wrote:
> >> >> >>> >>>>>
> >> >> >>> >>>>>
> >> >> >>> >>>>> Hi Marc,
> >> >> >>> >>>>>
> >> >> >>> >>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier
> >> >> >>> >>>>> <marc.zyngier@arm.com>
> >> >> >>> >>>>> wrote:
> >> >> >>> >>>>>>
> >> >> >>> >>>>>>
> >> >> >>> >>>>>> Hi Anup,
> >> >> >>> >>>>>>
> >> >> >>> >>>>>>
> >> >> >>> >>>>>> On 2013-08-14 15:22, Anup Patel wrote:
> >> >> >>> >>>>>>>
> >> >> >>> >>>>>>>
> >> >> >>> >>>>>>>
> >> >> >>> >>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier
> >> >> >>> >>>>>>> <marc.zyngier@arm.com>
> >> >> >>> >>>>>>> wrote:
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>> Hi Pranav,
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
> >> >> >>> >>>>>>>>>
> >> >> >>> >>>>>>>>>
> >> >> >>> >>>>>>>>>
> >> >> >>> >>>>>>>>> Systems with large external L3-cache (few MBs), might have
> >> >> >>> >>>>>>>>> dirty
> >> >> >>> >>>>>>>>> content belonging to the guest page in L3-cache. To tackle
> >> >> >>> >>>>>>>>> this,
> >> >> >>> >>>>>>>>> we need to flush such dirty content from d-cache so that
> >> >> >>> >>>>>>>>> guest
> >> >> >>> >>>>>>>>> will see correct contents of guest page when guest MMU is
> >> >> >>> >>>>>>>>> disabled.
> >> >> >>> >>>>>>>>>
> >> >> >>> >>>>>>>>> The patch fixes coherent_icache_guest_page() for external
> >> >> >>> >>>>>>>>> L3-cache.
> >> >> >>> >>>>>>>>>
> >> >> >>> >>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar
> >> >> >>> >>>>>>>>> <pranavkumar@linaro.org>
> >> >> >>> >>>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org>
> >> >> >>> >>>>>>>>> ---
> >> >> >>> >>>>>>>>>  arch/arm64/include/asm/kvm_mmu.h |    2 ++
> >> >> >>> >>>>>>>>>  1 file changed, 2 insertions(+)
> >> >> >>> >>>>>>>>>
> >> >> >>> >>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h
> >> >> >>> >>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h
> >> >> >>> >>>>>>>>> index efe609c..5129038 100644
> >> >> >>> >>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
> >> >> >>> >>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
> >> >> >>> >>>>>>>>> @@ -123,6 +123,8 @@ static inline void
> >> >> >>> >>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
> >> >> >>> >>>>>>>>>       if (!icache_is_aliasing()) {            /* PIPT */
> >> >> >>> >>>>>>>>>               unsigned long hva = gfn_to_hva(kvm, gfn);
> >> >> >>> >>>>>>>>>               flush_icache_range(hva, hva + PAGE_SIZE);
> >> >> >>> >>>>>>>>> +             /* Flush d-cache for systems with external
> >> >> >>> >>>>>>>>> caches. */
> >> >> >>> >>>>>>>>> +             __flush_dcache_area((void *) hva, PAGE_SIZE);
> >> >> >>> >>>>>>>>>       } else if (!icache_is_aivivt()) {       /* non
> >> >> >>> >>>>>>>>> ASID-tagged
> >> >> >>> >>>>>>>>> VIVT
> >> >> >>> >>>>>>>>> */
> >> >> >>> >>>>>>>>>               /* any kind of VIPT cache */
> >> >> >>> >>>>>>>>>               __flush_icache_all();
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>> [adding Will to the discussion as we talked about this in the
> >> >> >>> >>>>>>>> past]
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>> That's definitely an issue, but I'm not sure the fix is to hit
> >> >> >>> >>>>>>>> the
> >> >> >>> >>>>>>>> data
> >> >> >>> >>>>>>>> cache on each page mapping. It looks overkill.
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning?
> >> >> >>> >>>>>>>> kvmtools
> >> >> >>> >>>>>>>> knows which bits of the guest memory have been touched, and
> >> >> >>> >>>>>>>> can do a
> >> >> >>> >>>>>>>> "DC
> >> >> >>> >>>>>>>> DVAC" on this region.
> >> >> >>> >>>>>>>
> >> >> >>> >>>>>>>
> >> >> >>> >>>>>>>
> >> >> >>> >>>>>>>
> >> >> >>> >>>>>>> It seems a bit unnatural to have cache cleaning is user-space.
> >> >> >>> >>>>>>> I am
> >> >> >>> >>>>>>> sure
> >> >> >>> >>>>>>> other architectures don't have such cache cleaning in
> >> >> >>> >>>>>>> user-space for
> >> >> >>> >>>>>>> KVM.
> >> >> >>> >>>>>>>
> >> >> >>> >>>>>>>>
> >> >> >>> >>>>>>>> The alternative is do it in the kernel before running any vcpu
> >> >> >>> >>>>>>>> - but
> >> >> >>> >>>>>>>> that's not very nice either (have to clean the whole of the
> >> >> >>> >>>>>>>> guest
> >> >> >>> >>>>>>>> memory, which makes a full dcache clean more appealing).
> >> >> >>> >>>>>>>
> >> >> >>> >>>>>>>
> >> >> >>> >>>>>>>
> >> >> >>> >>>>>>>
> >> >> >>> >>>>>>> Actually, cleaning full d-cache by set/way upon first run of
> >> >> >>> >>>>>>> VCPU was
> >> >> >>> >>>>>>> our second option but current approach seemed very simple hence
> >> >> >>> >>>>>>> we went for this.
> >> >> >>> >>>>>>>
> >> >> >>> >>>>>>> If more people vote for full d-cache clean upon first run of
> >> >> >>> >>>>>>> VCPU then
> >> >> >>> >>>>>>> we should revise this patch.
> >> >> >>> >>>>>>
> >> >> >>> >>>>>>
> >> >> >>> >>>>>>
> >> >> >>> >>>>>>
> >> >> >>> >>>>>> Can you please give the attached patch a spin on your HW? I've
> >> >> >>> >>>>>> boot-tested
> >> >> >>> >>>>>> it on a model, but of course I can't really verify that it fixes
> >> >> >>> >>>>>> your
> >> >> >>> >>>>>> issue.
> >> >> >>> >>>>>>
> >> >> >>> >>>>>> As far as I can see, it should fix it without any additional
> >> >> >>> >>>>>> flushing.
> >> >> >>> >>>>>>
> >> >> >>> >>>>>> Please let me know how it goes.
> >> >> >>> >>>>>
> >> >> >>> >>>>>
> >> >> >>> >>>>>
> >> >> >>> >>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are
> >> >> >>> >>>>> treated as "Normal Non-shareable, Inner Write-Back
> >> >> >>> >>>>> Write-Allocate,
> >> >> >>> >>>>> Outer Write-Back Write-Allocate memory"
> >> >> >>> >>>>>
> >> >> >>> >>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are
> >> >> >>> >>>>> treated as "Strongly-ordered device memory"
> >> >> >>> >>>>>
> >> >> >>> >>>>> Now if Guest/VM access hardware MMIO devices directly (such as
> >> >> >>> >>>>> VGIC CPU interface) with MMU off then MMIO devices will be
> >> >> >>> >>>>> accessed as normal memory when HCR_EL2.DC=1.
> >> >> >>> >>>>
> >> >> >>> >>>>
> >> >> >>> >>>>
> >> >> >>> >>>> I don't think so. Stage-2 still applies, and should force MMIO to
> >> >> >>> >>>> be
> >> >> >>> >>>> accessed as device memory.
> >> >> >>> >>>>
> >> >> >>> >>>>
> >> >> >>> >>>>> The HCR_EL2.DC=1 makes sense only if we have all software
> >> >> >>> >>>>> emulated devices for Guest/VM which is not true for KVM ARM or
> >> >> >>> >>>>> KVM ARM64 because we use VGIC.
> >> >> >>> >>>>>
> >> >> >>> >>>>> IMHO, this patch enforces incorrect memory attribute for Guest/VM
> >> >> >>> >>>>> when Stage1 MMU is off.
> >> >> >>> >>>>
> >> >> >>> >>>>
> >> >> >>> >>>>
> >> >> >>> >>>> See above. My understanding is that HCR.DC controls the default
> >> >> >>> >>>> output of
> >> >> >>> >>>> Stage-1, and Stage-2 overrides still apply.
> >> >> >>> >>>
> >> >> >>> >>>
> >> >> >>> >>> You had mentioned that PAGE_S2_DEVICE attribute was redundant
> >> >> >>> >>> and wanted guest to decide the memory attribute. In other words,
> >> >> >>> >>> you
> >> >> >>> >>> did not want to enforce any memory attribute in Stage2.
> >> >> >>> >>>
> >> >> >>> >>> Please refer to this patch
> >> >> >>> >>> https://patchwork.kernel.org/patch/2543201/
> >> >> >>> >>
> >> >> >>> >>
> >> >> >>> >> This patch has never been merged. If you carry on following the
> >> >> >>> >> discussion,
> >> >> >>> >> you will certainly notice it was dropped for a very good reason:
> >> >> >>> >> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html
> >> >> >>> >>
> >> >> >>> >> So Stage-2 memory attributes are used, they are not going away, and
> >> >> >>> >> they are
> >> >> >>> >> essential to the patch I sent this morning.
> >> >> >>> >>
> >> >> >>> >>
> >> >> >>> >>         M.
> >> >> >>> >> --
> >> >> >>> >> Fast, cheap, reliable. Pick two.
> >> >> >>> >
> >> >> >>> > HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM
> >> >> >>> > is
> >> >> >>> > provided a DMA-capable device in pass-through mode. The reason being
> >> >> >>> > bootloader/firmware typically don't enable MMU and such
> >> >> >>> > bootloader/firmware
> >> >> >>> > will programme a pass-through DMA-capable device without any flushes
> >> >> >>> > to
> >> >> >>> > guest RAM (because it has not enabled MMU).
> >> >> >>> >
> >> >> >>> > A good example of such a device would be SATA AHCI controller given
> >> >> >>> > to a
> >> >> >>> > Guest/VM as direct access (using SystemMMU) and Guest
> >> >> >>> > bootloader/firmware
> >> >> >>> > accessing this SATA AHCI controller to load kernel images from SATA
> >> >> >>> > disk.
> >> >> >>> > In this situation, we will have to hack Guest bootloader/firmware
> >> >> >>> > AHCI driver to
> >> >> >>> > explicitly flushes to Guest RAM (because have HCR_EL2.DC=1).
> >> >> >>>
> >> >> >>> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of
> >> >> >>> curiosity: is that a made up example or something you actually have?
> >> >> >>>
> >> >> >>> Back to square one:
> >> >> >>> Can you please benchmark the various cache cleaning options (global at
> >> >> >>> startup time, per-page on S2 translation fault, and user-space)?
> >> >> >>>
> >> >> >> Eh, why is this a more valid argument than the vgic?  The device
> >> >> >> passthrough Stage-2 mappings would still have the Stage-2 memory
> >> >> >> attributes to configure that memory region as device memory.  Why is it
> >> >> >> relevant if the device is DMA-capable in this context?
> >> >> >>
> >> >> >> -Christoffer
> >> >> >
> >> >> > Most ARM bootloader/firmware run with MMU off hence, they will not do
> >> >> > explicit cache flushes when programming DMA-capable device. Now If
> >> >> > HCR_EL2.DC=1 then Guest RAM updates done by bootloader/firmware
> >> >> > will not be visible to DMA-capable device.
> >> >> >
> >> >> > --Anup
> >> >>
> >> >> The approach of flushing d-cache by set/way upon first run of VCPU will
> >> >> not work because for set/way operations ARM ARM says: "For set/way
> >> >> operations, and for All (entire cache) operations, the point is defined to be
> >> >> to the next level of caching". In other words, set/way operations work upto
> >> >> point of unification.
> >> >>
> >> >> Also, Guest Linux already does __flush_dcache_all() from __cpu_setup()
> >> >> at bootup time which does not work for us when L3-cache is enabled so,
> >> >> there is no point is adding one more __flush_dcache_all() upon first run of
> >> >> VCPU in KVM ARM64.
> >> >
> >> > When did anyone suggest using a cache cleaning method that doesn't apply
> >> > to this case.  I'm a little confused about your comment here.
> >>
> >> Please refer last reply from Marc Z where he says:
> >> "Can you please benchmark the various cache cleaning options (global at
> >> startup time, per-page on S2 translation fault, and user-space)?".
> >>
> >> Rather doing __flush_dcache_range() on each page in
> >> coherent_icache_guest_page() we could also flush entire d-cache upon
> >> first VCPU run. This only issue in flushing d-cache upon first VCPU run
> >> is that we cannot flush d-cache by set/way because as per ARM ARM
> >> all operations by set/way are upto PoU and not PoC. In presence of
> >> L3-Cache we need to flush d-cache upto PoC so we have to use
> >> __flush_dache_range()
> >>
> >> >
> >> >>
> >> >> IMHO, we are left with following options:
> >> >> 1. Flush all RAM regions of VCPU using __flush_dcache_range()
> >> >>     upon first run of VCPU
> >> >
> >> > We could do that, but we have to ensure that no other memory regions can
> >> > be added later.  Is this the case?  I don't remember.
> >>
> >> Yes, memory regions being added later be problem for this option.
> >>
> >> >
> >> >> 2. Implement outer-cache framework for ARM64 and flush all
> >> >>     caches + outer cache (i.e. L3-cache) upon first run of VCPU
> >> >
> >> > What's the difference between (2) and (1)?
> >>
> >> Linux ARM (i.e. 32-bit port) has a framework for having outer
> >> caches (i.e. caches that are not part of the CPU but exist as
> >> separate entity between CPU and Memory for performance
> >> improvement). Using this framework we can flush all CPU caches
> >> and outer cache.
> >>
> >
> > And we don't have such a framework in arm64?  But __flush_dcache_range
> 
> Yes, for now we don't have outer-cache framework in arm64.
> 
> > does nevertheless flush outer caches as well?
> 
> Yes, __flush_dcache_range() flushes the outer cache too.
> 
> The __flush_dcache_range() works for us because it flushes (i.e.
> clean & invalidate) by VA upto PoC. All operation upto PoC on a
> cache line will force eviction of the cache line from all CPU caches
> (i.e. both L1 & L2) and also from external L3-cache to ensure that
> RAM has updated contents of cache line.
> 
> Now, the outer caches (such as L3-cache) typically provide a
> mechanism to flush entire L3-cache using L3-cache registers. If
> we have an outer-cache framework then we can flush all caches
> using __flush_dcache_all() + outer cache flush all.
> 
> >
> >> >
> >> >> 3. Use an alternate version of flush_icache_range() which will
> >> >>     flush d-cache by PoC instead of PoU. We can also ensure
> >> >>     that coherent_icache_guest_page() function will be called
> >> >>     upon Stage2 prefetch aborts only.
> >> >>
> >> >
> >> > I'm sorry, but I'm not following this discussion.  Are we not mixing a
> >> > discussion along two different axis here?  As I see it there are two
> >> > separate issues:
> >> >
> >> >  (A) When/Where to flush the memory regions
> >> >  (B) How to flush the memory regions, including the hardware method and
> >> >      the kernel software engineering aspect.
> >>
> >> Discussion here is about getting KVM ARM64 working in-presence
> >> of an external L3-cache (i.e. not part of CPU). Before starting a VCPU
> >> user-space typically loads images to guest RAM so, in-presence of
> >> huge L3-cache (few MBs). When the VCPU starts running some of the
> >> contents guest RAM will be still in L3-cache and VCPU runs with
> >> MMU off (i.e. cacheing off) hence VCPU will bypass L3-cache and
> >> see incorrect contents. To solve this problem we need to flush the
> >> guest RAM contents before they are accessed by first time by VCPU.
> >>
> > ok, I'm with you that far.
> >
> > But is it also not true that we need to decide between:
> >
> >   A.1: Flush the entire guest RAM before running the VCPU
> >   A.2: Flush the pages as we fault them in
> 
> Yes, thats the decision we have to make.
> 
> >
> > And (independently):
> >
> >   B.1: Use __flush_dcache_range
> >   B.2: Use something else + outer cache framework for arm64
> 
> This would be __flush_dcache_all() + outer cache flush all.
> 
> >   B.3: Use flush_icache_range
> 
> Actually modified version of flush_icache_range() which uses
> "dc cvac" and not "dc cvau".
> 
> >
> > Or do these all interleave somehow?  If so, I don't understand why.  Can
> > you help?
> >

Hi Anup,

Thanks for the explanation, it's crystal now.

-Christoffer
Anup Patel Aug. 16, 2013, 6:20 p.m. UTC | #23
On Fri, Aug 16, 2013 at 11:36 PM, Christoffer Dall
<christoffer.dall@linaro.org> wrote:
> On Fri, Aug 16, 2013 at 10:50:34AM -0700, Christoffer Dall wrote:
>> On Fri, Aug 16, 2013 at 11:12:08PM +0530, Anup Patel wrote:
>> > On Fri, Aug 16, 2013 at 10:49 PM, Christoffer Dall
>> > <christoffer.dall@linaro.org> wrote:
>> > > On Fri, Aug 16, 2013 at 12:27:55PM +0530, Anup Patel wrote:
>> > >> On Fri, Aug 16, 2013 at 10:32 AM, Anup Patel <anup@brainfault.org> wrote:
>> > >> > On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall
>> > >> > <christoffer.dall@linaro.org> wrote:
>> > >> >> On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote:
>> > >> >>> On 2013-08-15 16:13, Anup Patel wrote:
>> > >> >>> > Hi Marc,
>> > >> >>> >
>> > >> >>> > On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com>
>> > >> >>> > wrote:
>> > >> >>> >> On 2013-08-15 14:31, Anup Patel wrote:
>> > >> >>> >>>
>> > >> >>> >>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier
>> > >> >>> >>> <marc.zyngier@arm.com>
>> > >> >>> >>> wrote:
>> > >> >>> >>>>
>> > >> >>> >>>> On 2013-08-15 07:26, Anup Patel wrote:
>> > >> >>> >>>>>
>> > >> >>> >>>>>
>> > >> >>> >>>>> Hi Marc,
>> > >> >>> >>>>>
>> > >> >>> >>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier
>> > >> >>> >>>>> <marc.zyngier@arm.com>
>> > >> >>> >>>>> wrote:
>> > >> >>> >>>>>>
>> > >> >>> >>>>>>
>> > >> >>> >>>>>> Hi Anup,
>> > >> >>> >>>>>>
>> > >> >>> >>>>>>
>> > >> >>> >>>>>> On 2013-08-14 15:22, Anup Patel wrote:
>> > >> >>> >>>>>>>
>> > >> >>> >>>>>>>
>> > >> >>> >>>>>>>
>> > >> >>> >>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier
>> > >> >>> >>>>>>> <marc.zyngier@arm.com>
>> > >> >>> >>>>>>> wrote:
>> > >> >>> >>>>>>>>
>> > >> >>> >>>>>>>>
>> > >> >>> >>>>>>>>
>> > >> >>> >>>>>>>> Hi Pranav,
>> > >> >>> >>>>>>>>
>> > >> >>> >>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
>> > >> >>> >>>>>>>>>
>> > >> >>> >>>>>>>>>
>> > >> >>> >>>>>>>>>
>> > >> >>> >>>>>>>>> Systems with large external L3-cache (few MBs), might have
>> > >> >>> >>>>>>>>> dirty
>> > >> >>> >>>>>>>>> content belonging to the guest page in L3-cache. To tackle
>> > >> >>> >>>>>>>>> this,
>> > >> >>> >>>>>>>>> we need to flush such dirty content from d-cache so that
>> > >> >>> >>>>>>>>> guest
>> > >> >>> >>>>>>>>> will see correct contents of guest page when guest MMU is
>> > >> >>> >>>>>>>>> disabled.
>> > >> >>> >>>>>>>>>
>> > >> >>> >>>>>>>>> The patch fixes coherent_icache_guest_page() for external
>> > >> >>> >>>>>>>>> L3-cache.
>> > >> >>> >>>>>>>>>
>> > >> >>> >>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar
>> > >> >>> >>>>>>>>> <pranavkumar@linaro.org>
>> > >> >>> >>>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org>
>> > >> >>> >>>>>>>>> ---
>> > >> >>> >>>>>>>>>  arch/arm64/include/asm/kvm_mmu.h |    2 ++
>> > >> >>> >>>>>>>>>  1 file changed, 2 insertions(+)
>> > >> >>> >>>>>>>>>
>> > >> >>> >>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h
>> > >> >>> >>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h
>> > >> >>> >>>>>>>>> index efe609c..5129038 100644
>> > >> >>> >>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
>> > >> >>> >>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>> > >> >>> >>>>>>>>> @@ -123,6 +123,8 @@ static inline void
>> > >> >>> >>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
>> > >> >>> >>>>>>>>>       if (!icache_is_aliasing()) {            /* PIPT */
>> > >> >>> >>>>>>>>>               unsigned long hva = gfn_to_hva(kvm, gfn);
>> > >> >>> >>>>>>>>>               flush_icache_range(hva, hva + PAGE_SIZE);
>> > >> >>> >>>>>>>>> +             /* Flush d-cache for systems with external
>> > >> >>> >>>>>>>>> caches. */
>> > >> >>> >>>>>>>>> +             __flush_dcache_area((void *) hva, PAGE_SIZE);
>> > >> >>> >>>>>>>>>       } else if (!icache_is_aivivt()) {       /* non
>> > >> >>> >>>>>>>>> ASID-tagged
>> > >> >>> >>>>>>>>> VIVT
>> > >> >>> >>>>>>>>> */
>> > >> >>> >>>>>>>>>               /* any kind of VIPT cache */
>> > >> >>> >>>>>>>>>               __flush_icache_all();
>> > >> >>> >>>>>>>>
>> > >> >>> >>>>>>>>
>> > >> >>> >>>>>>>>
>> > >> >>> >>>>>>>>
>> > >> >>> >>>>>>>> [adding Will to the discussion as we talked about this in the
>> > >> >>> >>>>>>>> past]
>> > >> >>> >>>>>>>>
>> > >> >>> >>>>>>>> That's definitely an issue, but I'm not sure the fix is to hit
>> > >> >>> >>>>>>>> the
>> > >> >>> >>>>>>>> data
>> > >> >>> >>>>>>>> cache on each page mapping. It looks overkill.
>> > >> >>> >>>>>>>>
>> > >> >>> >>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning?
>> > >> >>> >>>>>>>> kvmtools
>> > >> >>> >>>>>>>> knows which bits of the guest memory have been touched, and
>> > >> >>> >>>>>>>> can do a
>> > >> >>> >>>>>>>> "DC
>> > >> >>> >>>>>>>> DVAC" on this region.
>> > >> >>> >>>>>>>
>> > >> >>> >>>>>>>
>> > >> >>> >>>>>>>
>> > >> >>> >>>>>>>
>> > >> >>> >>>>>>> It seems a bit unnatural to have cache cleaning is user-space.
>> > >> >>> >>>>>>> I am
>> > >> >>> >>>>>>> sure
>> > >> >>> >>>>>>> other architectures don't have such cache cleaning in
>> > >> >>> >>>>>>> user-space for
>> > >> >>> >>>>>>> KVM.
>> > >> >>> >>>>>>>
>> > >> >>> >>>>>>>>
>> > >> >>> >>>>>>>> The alternative is do it in the kernel before running any vcpu
>> > >> >>> >>>>>>>> - but
>> > >> >>> >>>>>>>> that's not very nice either (have to clean the whole of the
>> > >> >>> >>>>>>>> guest
>> > >> >>> >>>>>>>> memory, which makes a full dcache clean more appealing).
>> > >> >>> >>>>>>>
>> > >> >>> >>>>>>>
>> > >> >>> >>>>>>>
>> > >> >>> >>>>>>>
>> > >> >>> >>>>>>> Actually, cleaning full d-cache by set/way upon first run of
>> > >> >>> >>>>>>> VCPU was
>> > >> >>> >>>>>>> our second option but current approach seemed very simple hence
>> > >> >>> >>>>>>> we went for this.
>> > >> >>> >>>>>>>
>> > >> >>> >>>>>>> If more people vote for full d-cache clean upon first run of
>> > >> >>> >>>>>>> VCPU then
>> > >> >>> >>>>>>> we should revise this patch.
>> > >> >>> >>>>>>
>> > >> >>> >>>>>>
>> > >> >>> >>>>>>
>> > >> >>> >>>>>>
>> > >> >>> >>>>>> Can you please give the attached patch a spin on your HW? I've
>> > >> >>> >>>>>> boot-tested
>> > >> >>> >>>>>> it on a model, but of course I can't really verify that it fixes
>> > >> >>> >>>>>> your
>> > >> >>> >>>>>> issue.
>> > >> >>> >>>>>>
>> > >> >>> >>>>>> As far as I can see, it should fix it without any additional
>> > >> >>> >>>>>> flushing.
>> > >> >>> >>>>>>
>> > >> >>> >>>>>> Please let me know how it goes.
>> > >> >>> >>>>>
>> > >> >>> >>>>>
>> > >> >>> >>>>>
>> > >> >>> >>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are
>> > >> >>> >>>>> treated as "Normal Non-shareable, Inner Write-Back
>> > >> >>> >>>>> Write-Allocate,
>> > >> >>> >>>>> Outer Write-Back Write-Allocate memory"
>> > >> >>> >>>>>
>> > >> >>> >>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are
>> > >> >>> >>>>> treated as "Strongly-ordered device memory"
>> > >> >>> >>>>>
>> > >> >>> >>>>> Now if Guest/VM access hardware MMIO devices directly (such as
>> > >> >>> >>>>> VGIC CPU interface) with MMU off then MMIO devices will be
>> > >> >>> >>>>> accessed as normal memory when HCR_EL2.DC=1.
>> > >> >>> >>>>
>> > >> >>> >>>>
>> > >> >>> >>>>
>> > >> >>> >>>> I don't think so. Stage-2 still applies, and should force MMIO to
>> > >> >>> >>>> be
>> > >> >>> >>>> accessed as device memory.
>> > >> >>> >>>>
>> > >> >>> >>>>
>> > >> >>> >>>>> The HCR_EL2.DC=1 makes sense only if we have all software
>> > >> >>> >>>>> emulated devices for Guest/VM which is not true for KVM ARM or
>> > >> >>> >>>>> KVM ARM64 because we use VGIC.
>> > >> >>> >>>>>
>> > >> >>> >>>>> IMHO, this patch enforces incorrect memory attribute for Guest/VM
>> > >> >>> >>>>> when Stage1 MMU is off.
>> > >> >>> >>>>
>> > >> >>> >>>>
>> > >> >>> >>>>
>> > >> >>> >>>> See above. My understanding is that HCR.DC controls the default
>> > >> >>> >>>> output of
>> > >> >>> >>>> Stage-1, and Stage-2 overrides still apply.
>> > >> >>> >>>
>> > >> >>> >>>
>> > >> >>> >>> You had mentioned that PAGE_S2_DEVICE attribute was redundant
>> > >> >>> >>> and wanted guest to decide the memory attribute. In other words,
>> > >> >>> >>> you
>> > >> >>> >>> did not want to enforce any memory attribute in Stage2.
>> > >> >>> >>>
>> > >> >>> >>> Please refer to this patch
>> > >> >>> >>> https://patchwork.kernel.org/patch/2543201/
>> > >> >>> >>
>> > >> >>> >>
>> > >> >>> >> This patch has never been merged. If you carry on following the
>> > >> >>> >> discussion,
>> > >> >>> >> you will certainly notice it was dropped for a very good reason:
>> > >> >>> >> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html
>> > >> >>> >>
>> > >> >>> >> So Stage-2 memory attributes are used, they are not going away, and
>> > >> >>> >> they are
>> > >> >>> >> essential to the patch I sent this morning.
>> > >> >>> >>
>> > >> >>> >>
>> > >> >>> >>         M.
>> > >> >>> >> --
>> > >> >>> >> Fast, cheap, reliable. Pick two.
>> > >> >>> >
>> > >> >>> > HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM
>> > >> >>> > is
>> > >> >>> > provided a DMA-capable device in pass-through mode. The reason being
>> > >> >>> > bootloader/firmware typically don't enable MMU and such
>> > >> >>> > bootloader/firmware
>> > >> >>> > will programme a pass-through DMA-capable device without any flushes
>> > >> >>> > to
>> > >> >>> > guest RAM (because it has not enabled MMU).
>> > >> >>> >
>> > >> >>> > A good example of such a device would be SATA AHCI controller given
>> > >> >>> > to a
>> > >> >>> > Guest/VM as direct access (using SystemMMU) and Guest
>> > >> >>> > bootloader/firmware
>> > >> >>> > accessing this SATA AHCI controller to load kernel images from SATA
>> > >> >>> > disk.
>> > >> >>> > In this situation, we will have to hack Guest bootloader/firmware
>> > >> >>> > AHCI driver to
>> > >> >>> > explicitly flushes to Guest RAM (because have HCR_EL2.DC=1).
>> > >> >>>
>> > >> >>> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of
>> > >> >>> curiosity: is that a made up example or something you actually have?
>> > >> >>>
>> > >> >>> Back to square one:
>> > >> >>> Can you please benchmark the various cache cleaning options (global at
>> > >> >>> startup time, per-page on S2 translation fault, and user-space)?
>> > >> >>>
>> > >> >> Eh, why is this a more valid argument than the vgic?  The device
>> > >> >> passthrough Stage-2 mappings would still have the Stage-2 memory
>> > >> >> attributes to configure that memory region as device memory.  Why is it
>> > >> >> relevant if the device is DMA-capable in this context?
>> > >> >>
>> > >> >> -Christoffer
>> > >> >
>> > >> > Most ARM bootloader/firmware run with MMU off hence, they will not do
>> > >> > explicit cache flushes when programming DMA-capable device. Now If
>> > >> > HCR_EL2.DC=1 then Guest RAM updates done by bootloader/firmware
>> > >> > will not be visible to DMA-capable device.
>> > >> >
>> > >> > --Anup
>> > >>
>> > >> The approach of flushing d-cache by set/way upon first run of VCPU will
>> > >> not work because for set/way operations ARM ARM says: "For set/way
>> > >> operations, and for All (entire cache) operations, the point is defined to be
>> > >> to the next level of caching". In other words, set/way operations work upto
>> > >> point of unification.
>> > >>
>> > >> Also, Guest Linux already does __flush_dcache_all() from __cpu_setup()
>> > >> at bootup time which does not work for us when L3-cache is enabled so,
>> > >> there is no point is adding one more __flush_dcache_all() upon first run of
>> > >> VCPU in KVM ARM64.
>> > >
>> > > When did anyone suggest using a cache cleaning method that doesn't apply
>> > > to this case.  I'm a little confused about your comment here.
>> >
>> > Please refer last reply from Marc Z where he says:
>> > "Can you please benchmark the various cache cleaning options (global at
>> > startup time, per-page on S2 translation fault, and user-space)?".
>> >
>> > Rather doing __flush_dcache_range() on each page in
>> > coherent_icache_guest_page() we could also flush entire d-cache upon
>> > first VCPU run. This only issue in flushing d-cache upon first VCPU run
>> > is that we cannot flush d-cache by set/way because as per ARM ARM
>> > all operations by set/way are upto PoU and not PoC. In presence of
>> > L3-Cache we need to flush d-cache upto PoC so we have to use
>> > __flush_dache_range()
>> >
>> > >
>> > >>
>> > >> IMHO, we are left with following options:
>> > >> 1. Flush all RAM regions of VCPU using __flush_dcache_range()
>> > >>     upon first run of VCPU
>> > >
>> > > We could do that, but we have to ensure that no other memory regions can
>> > > be added later.  Is this the case?  I don't remember.
>> >
>> > Yes, memory regions being added later be problem for this option.
>> >
>> > >
>> > >> 2. Implement outer-cache framework for ARM64 and flush all
>> > >>     caches + outer cache (i.e. L3-cache) upon first run of VCPU
>> > >
>> > > What's the difference between (2) and (1)?
>> >
>> > Linux ARM (i.e. 32-bit port) has a framework for having outer
>> > caches (i.e. caches that are not part of the CPU but exist as
>> > separate entity between CPU and Memory for performance
>> > improvement). Using this framework we can flush all CPU caches
>> > and outer cache.
>> >
>>
>> And we don't have such a framework in arm64?  But __flush_dcache_range
>> does nevertheless flush outer caches as well?
>>
>> > >
>> > >> 3. Use an alternate version of flush_icache_range() which will
>> > >>     flush d-cache by PoC instead of PoU. We can also ensure
>> > >>     that coherent_icache_guest_page() function will be called
>> > >>     upon Stage2 prefetch aborts only.
>> > >>
>> > >
>> > > I'm sorry, but I'm not following this discussion.  Are we not mixing a
>> > > discussion along two different axis here?  As I see it there are two
>> > > separate issues:
>> > >
>> > >  (A) When/Where to flush the memory regions
>> > >  (B) How to flush the memory regions, including the hardware method and
>> > >      the kernel software engineering aspect.
>> >
>> > Discussion here is about getting KVM ARM64 working in-presence
>> > of an external L3-cache (i.e. not part of CPU). Before starting a VCPU
>> > user-space typically loads images to guest RAM so, in-presence of
>> > huge L3-cache (few MBs). When the VCPU starts running some of the
>> > contents guest RAM will be still in L3-cache and VCPU runs with
>> > MMU off (i.e. cacheing off) hence VCPU will bypass L3-cache and
>> > see incorrect contents. To solve this problem we need to flush the
>> > guest RAM contents before they are accessed by first time by VCPU.
>> >
>> ok, I'm with you that far.
>>
>> But is it also not true that we need to decide between:
>>
>>   A.1: Flush the entire guest RAM before running the VCPU
>>   A.2: Flush the pages as we fault them in
>>
>> And (independently):
>>
>>   B.1: Use __flush_dcache_range
>>   B.2: Use something else + outer cache framework for arm64
>>   B.3: Use flush_icache_range
>>
>> Or do these all interleave somehow?  If so, I don't understand why.  Can
>> you help?
>>
> Oh, I think I understand your point now.  You mean if we use
> flush_cache_all before we run the vcpu, then it's not sufficient?  I
> assumed we would simply be using __flush_dcache_area for the guest RAM
> regions which would flush to PoC.
>
> For the record, I think we need a solution that also covers the case
> where a memory region is registered later, and I therefore prefer having
> this functionality in the stage-2 fault handler, where we are already
> taking care of similar issues (like your patch suggested).
>
> Especially if we can limit ourselves to doing so when the guest MMU is
> disabled, then I really think this is going to be the least overhead and
> measuring the performance of this penalty vs. at first CPU execution is
> a bit overkill IMHO, since we are only talking about boot time for any
> reasonable guest (which would run with the MMU enabled for real
> workloads presumeably).

Yes, currently we call coherent_icache_guest_page() upon all
Stage2 translation faults but, we can be improve a bit here just like
your suggestion. May be we can call coherent_icache_guest_page()
only when VCPU MMU is off and we get Stage2 instruction prefetch
translation fault.

>
> The only caveat is the previously discussed issue if user space loads
> code after the first VCPU execution, and only user space would know if
> that happens, which would argue for user space doing the cleaning...
> Hmmm.
>
> I also still have my worries about swapping, since user space is free to
> map guest RAM as non-executable.
>
> Did I miss anything here?

I don't know much about Linux swapping but for the rest part we are in
sync.

>
> -Christoffer

--Anup
Catalin Marinas Aug. 29, 2013, 10:52 a.m. UTC | #24
Hi Anup,

Jumping late into this thread (holidays).

On Fri, Aug 16, 2013 at 07:57:55AM +0100, Anup Patel wrote:
> The approach of flushing d-cache by set/way upon first run of VCPU will
> not work because for set/way operations ARM ARM says: "For set/way
> operations, and for All (entire cache) operations, the point is defined to be
> to the next level of caching". In other words, set/way operations work upto
> point of unification.

I don't understand where you got the idea that set/way operations work
up to the point of unification. This is incorrect, the set/way
operations work on the level of cache specified by bits 3:1 in the
register passed to the DC CISW instruction. For your L3 cache, those
bits would be 2 (and __flush_dcache_all() implementation does this
dynamically).

For the I-cache all operation, that's correct, it only invalidates to
the PoU but it doesn't need to go any further, you do the D-cache
maintenance for this (no point in duplicating functionality).

> Also, Guest Linux already does __flush_dcache_all() from __cpu_setup()
> at bootup time which does not work for us when L3-cache is enabled so,
> there is no point is adding one more __flush_dcache_all() upon first run of
> VCPU in KVM ARM64.

Do you mean the CPU is not aware of the L3 cache? Does CLIDR_EL1 report
an L3 cache on your implementation?

It's not clear to me whether your L3 cache is inner or outer (or a mix).
You say that D-cache maintenance to PoC flushes the L3 which looks to me
like an inner cache, in which cache it should be reported in the LoC
bits in CLIDR_EL1.

> IMHO, we are left with following options:
> 1. Flush all RAM regions of VCPU using __flush_dcache_range()
>     upon first run of VCPU
> 2. Implement outer-cache framework for ARM64 and flush all
>     caches + outer cache (i.e. L3-cache) upon first run of VCPU

Do you have specific instructions for flushing the L3 cache only? It's
not clear to me what an outer-cache framework would to on AArch64. It
was added on AArch32 for the L2x0/PL310 which need separate instructions
by physical address for flushing the cache. I really hope we don't get
these again on ARMv8 hardware.

> 3. Use an alternate version of flush_icache_range() which will
>     flush d-cache by PoC instead of PoU. We can also ensure
>     that coherent_icache_guest_page() function will be called
>     upon Stage2 prefetch aborts only.

flush_icache_range() is meant to flush only to the PoU to ensure the I-D
cache coherency. I don't think we should change this.
Anup Patel Aug. 29, 2013, 12:31 p.m. UTC | #25
On Thu, Aug 29, 2013 at 4:22 PM, Catalin Marinas
<catalin.marinas@arm.com> wrote:
> Hi Anup,
>
> Jumping late into this thread (holidays).
>
> On Fri, Aug 16, 2013 at 07:57:55AM +0100, Anup Patel wrote:
>> The approach of flushing d-cache by set/way upon first run of VCPU will
>> not work because for set/way operations ARM ARM says: "For set/way
>> operations, and for All (entire cache) operations, the point is defined to be
>> to the next level of caching". In other words, set/way operations work upto
>> point of unification.
>
> I don't understand where you got the idea that set/way operations work
> up to the point of unification. This is incorrect, the set/way
> operations work on the level of cache specified by bits 3:1 in the
> register passed to the DC CISW instruction. For your L3 cache, those
> bits would be 2 (and __flush_dcache_all() implementation does this
> dynamically).

The L3-cache is not visible to CPU. It is totally independent and transparent
to CPU.

>
> For the I-cache all operation, that's correct, it only invalidates to
> the PoU but it doesn't need to go any further, you do the D-cache
> maintenance for this (no point in duplicating functionality).
>
>> Also, Guest Linux already does __flush_dcache_all() from __cpu_setup()
>> at bootup time which does not work for us when L3-cache is enabled so,
>> there is no point is adding one more __flush_dcache_all() upon first run of
>> VCPU in KVM ARM64.
>
> Do you mean the CPU is not aware of the L3 cache? Does CLIDR_EL1 report
> an L3 cache on your implementation?

Yes, CPU is not aware of L3-cache and its state.

In our case, CLIDR_EL1 does not report L3-cache.

>
> It's not clear to me whether your L3 cache is inner or outer (or a mix).
> You say that D-cache maintenance to PoC flushes the L3 which looks to me
> like an inner cache, in which cache it should be reported in the LoC
> bits in CLIDR_EL1.
>
>> IMHO, we are left with following options:
>> 1. Flush all RAM regions of VCPU using __flush_dcache_range()
>>     upon first run of VCPU
>> 2. Implement outer-cache framework for ARM64 and flush all
>>     caches + outer cache (i.e. L3-cache) upon first run of VCPU
>
> Do you have specific instructions for flushing the L3 cache only? It's
> not clear to me what an outer-cache framework would to on AArch64. It
> was added on AArch32 for the L2x0/PL310 which need separate instructions
> by physical address for flushing the cache. I really hope we don't get
> these again on ARMv8 hardware.
>
>> 3. Use an alternate version of flush_icache_range() which will
>>     flush d-cache by PoC instead of PoU. We can also ensure
>>     that coherent_icache_guest_page() function will be called
>>     upon Stage2 prefetch aborts only.
>
> flush_icache_range() is meant to flush only to the PoU to ensure the I-D
> cache coherency. I don't think we should change this.

We did not mean to change flush_icache_range() instead we suggest
to have separate flush_icache_range_kvm() for KVM ARM64 which flushes
d-cache to PoC.

>
> --
> Catalin

--Anup
Catalin Marinas Aug. 29, 2013, 12:53 p.m. UTC | #26
On Thu, Aug 29, 2013 at 01:31:43PM +0100, Anup Patel wrote:
> On Thu, Aug 29, 2013 at 4:22 PM, Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > On Fri, Aug 16, 2013 at 07:57:55AM +0100, Anup Patel wrote:
> >> The approach of flushing d-cache by set/way upon first run of VCPU will
> >> not work because for set/way operations ARM ARM says: "For set/way
> >> operations, and for All (entire cache) operations, the point is defined to be
> >> to the next level of caching". In other words, set/way operations work upto
> >> point of unification.
> >
> > I don't understand where you got the idea that set/way operations work
> > up to the point of unification. This is incorrect, the set/way
> > operations work on the level of cache specified by bits 3:1 in the
> > register passed to the DC CISW instruction. For your L3 cache, those
> > bits would be 2 (and __flush_dcache_all() implementation does this
> > dynamically).
> 
> The L3-cache is not visible to CPU. It is totally independent and transparent
> to CPU.

OK. But you say that operations like DC CIVAC actually flush the L3? So
I don't see it as completely transparent to the CPU.

Do you have any configuration bits which would make the L3 completely
transparent like always caching even when accesses are non-cacheable and
DC ops to PoC ignoring it?

Now, back to the idea of outer_cache framework for arm64. Does your CPU
have separate instructions for flushing this L3 cache?
Anup Patel Aug. 29, 2013, 4:02 p.m. UTC | #27
On Thu, Aug 29, 2013 at 6:23 PM, Catalin Marinas
<catalin.marinas@arm.com> wrote:
> On Thu, Aug 29, 2013 at 01:31:43PM +0100, Anup Patel wrote:
>> On Thu, Aug 29, 2013 at 4:22 PM, Catalin Marinas
>> <catalin.marinas@arm.com> wrote:
>> > On Fri, Aug 16, 2013 at 07:57:55AM +0100, Anup Patel wrote:
>> >> The approach of flushing d-cache by set/way upon first run of VCPU will
>> >> not work because for set/way operations ARM ARM says: "For set/way
>> >> operations, and for All (entire cache) operations, the point is defined to be
>> >> to the next level of caching". In other words, set/way operations work upto
>> >> point of unification.
>> >
>> > I don't understand where you got the idea that set/way operations work
>> > up to the point of unification. This is incorrect, the set/way
>> > operations work on the level of cache specified by bits 3:1 in the
>> > register passed to the DC CISW instruction. For your L3 cache, those
>> > bits would be 2 (and __flush_dcache_all() implementation does this
>> > dynamically).
>>
>> The L3-cache is not visible to CPU. It is totally independent and transparent
>> to CPU.
>
> OK. But you say that operations like DC CIVAC actually flush the L3? So
> I don't see it as completely transparent to the CPU.

It is transparent from CPU perspective. In other words, there is nothing in
CPU for controlling/monitoring L3-cache.

>
> Do you have any configuration bits which would make the L3 completely
> transparent like always caching even when accesses are non-cacheable and
> DC ops to PoC ignoring it?

Actually, L3-cache monitors the types of read/write generated by CPU (i.e.
whether the request is cacheable/non-cacheable or whether the request is
due to DC ops to PoC, or ...).

To answer your query, there is no configuration to have L3 caching when
accesses are non-cacheable and DC ops to PoC.

>
> Now, back to the idea of outer_cache framework for arm64. Does your CPU
> have separate instructions for flushing this L3 cache?

No, CPU does not have separate instruction for flushing L3-cache. On the
other hand, L3-cache has MMIO registers which can be use to explicitly
flush L3-cache.

>
> --
> Catalin

--Anup
Catalin Marinas Aug. 30, 2013, 9:44 a.m. UTC | #28
On Thu, Aug 29, 2013 at 05:02:50PM +0100, Anup Patel wrote:
> On Thu, Aug 29, 2013 at 6:23 PM, Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > On Thu, Aug 29, 2013 at 01:31:43PM +0100, Anup Patel wrote:
> >> On Thu, Aug 29, 2013 at 4:22 PM, Catalin Marinas
> >> <catalin.marinas@arm.com> wrote:
> >> > On Fri, Aug 16, 2013 at 07:57:55AM +0100, Anup Patel wrote:
> >> >> The approach of flushing d-cache by set/way upon first run of VCPU will
> >> >> not work because for set/way operations ARM ARM says: "For set/way
> >> >> operations, and for All (entire cache) operations, the point is defined to be
> >> >> to the next level of caching". In other words, set/way operations work upto
> >> >> point of unification.
> >> >
> >> > I don't understand where you got the idea that set/way operations work
> >> > up to the point of unification. This is incorrect, the set/way
> >> > operations work on the level of cache specified by bits 3:1 in the
> >> > register passed to the DC CISW instruction. For your L3 cache, those
> >> > bits would be 2 (and __flush_dcache_all() implementation does this
> >> > dynamically).
> >>
> >> The L3-cache is not visible to CPU. It is totally independent and transparent
> >> to CPU.
> >
> > OK. But you say that operations like DC CIVAC actually flush the L3? So
> > I don't see it as completely transparent to the CPU.
> 
> It is transparent from CPU perspective. In other words, there is nothing in
> CPU for controlling/monitoring L3-cache.

We probably have a different understanding of "transparent". It doesn't
look to me like any more transparent than the L1 or L2 cache. Basically,
from a software perspective, it needs maintenance. Whether the CPU
explicitly asks the L3 cache for this or the L3 cache figures it on its
own based on the L1/L2 operations is irrelevant.

It would have been transparent if the software didn't need to know about
it at all, but it's not the case.

> > Do you have any configuration bits which would make the L3 completely
> > transparent like always caching even when accesses are non-cacheable and
> > DC ops to PoC ignoring it?
> 
> Actually, L3-cache monitors the types of read/write generated by CPU (i.e.
> whether the request is cacheable/non-cacheable or whether the request is
> due to DC ops to PoC, or ...).
> 
> To answer your query, there is no configuration to have L3 caching when
> accesses are non-cacheable and DC ops to PoC.

So it's an outer cache with some "improvements" to handle DC ops to PoC.
I think it was a pretty bad decision on the hardware side as we really
try to get rid of outer caches for many reasons:

1. Non-standard cache flushing operations (MMIO-based)
2. It may require cache maintenance by physical address - something
   hard to get in a guest OS (unless you virtualise L3 cache
   maintenance)
3. Are barriers like DSB propagated correctly? Does a DC op to PoC
   followed by DSB ensure that the L3 drained the cachelines to RAM?

I think point 2 isn't required because your L3 detects DC ops to PoC. I
hope point 3 is handled correctly (otherwise look how "nice" the mb()
macro on arm is to cope with L2x0).

If only 1 is left, we don't need the full outer_cache framework but it
still needs to be addressed since the assumption is that flush_cache_all
(or __flush_dcache_all) flushes all cache levels. These are not used in
generic code but are used during kernel booting, KVM and cpuidle
drivers.

> > Now, back to the idea of outer_cache framework for arm64. Does your CPU
> > have separate instructions for flushing this L3 cache?
> 
> No, CPU does not have separate instruction for flushing L3-cache. On the
> other hand, L3-cache has MMIO registers which can be use to explicitly
> flush L3-cache.

I guess you use those in your firmware or boot loader since Linux
requires clean/invalidated caches at boot (and I plan to push a patch
which removes kernel D-cache cleaning during boot to spot such problems
early). A cpuidle driver would probably need this as well.
Catalin Marinas Aug. 30, 2013, 9:52 a.m. UTC | #29
On Fri, Aug 16, 2013 at 07:11:51PM +0100, Anup Patel wrote:
> On Fri, Aug 16, 2013 at 11:20 PM, Christoffer Dall
> <christoffer.dall@linaro.org> wrote:
> > On Fri, Aug 16, 2013 at 11:12:08PM +0530, Anup Patel wrote:
> >> Discussion here is about getting KVM ARM64 working in-presence
> >> of an external L3-cache (i.e. not part of CPU). Before starting a VCPU
> >> user-space typically loads images to guest RAM so, in-presence of
> >> huge L3-cache (few MBs). When the VCPU starts running some of the
> >> contents guest RAM will be still in L3-cache and VCPU runs with
> >> MMU off (i.e. cacheing off) hence VCPU will bypass L3-cache and
> >> see incorrect contents. To solve this problem we need to flush the
> >> guest RAM contents before they are accessed by first time by VCPU.
> >>
> > ok, I'm with you that far.
> >
> > But is it also not true that we need to decide between:
> >
> >   A.1: Flush the entire guest RAM before running the VCPU
> >   A.2: Flush the pages as we fault them in
> 
> Yes, thats the decision we have to make.
> 
> >
> > And (independently):
> >
> >   B.1: Use __flush_dcache_range
> >   B.2: Use something else + outer cache framework for arm64
> 
> This would be __flush_dcache_all() + outer cache flush all.

We need to be careful here since the __flush_dcache_all() operation uses
cache maintenance by set/way and these are *local* to a CPU (IOW not
broadcast). Do you have any guarantee that dirty cache lines don't
migrate between CPUs and __flush_dcache_all() wouldn't miss them?
Architecturally we don't, so this is not a safe operation that would
guarantee L1 cache flushing (we probably need to revisit some of the
__flush_dcache_all() calls in KVM, I haven't looked into this).

So I think we are left to the range operation where the DC ops to PoC
would be enough for your L3.

An outer cache flush all is probably only needed for cpuidle/suspend
(the booting part should be handled by the boot loader).
Anup Patel Aug. 30, 2013, 10:36 a.m. UTC | #30
On Fri, Aug 30, 2013 at 3:14 PM, Catalin Marinas
<catalin.marinas@arm.com> wrote:
> On Thu, Aug 29, 2013 at 05:02:50PM +0100, Anup Patel wrote:
>> On Thu, Aug 29, 2013 at 6:23 PM, Catalin Marinas
>> <catalin.marinas@arm.com> wrote:
>> > On Thu, Aug 29, 2013 at 01:31:43PM +0100, Anup Patel wrote:
>> >> On Thu, Aug 29, 2013 at 4:22 PM, Catalin Marinas
>> >> <catalin.marinas@arm.com> wrote:
>> >> > On Fri, Aug 16, 2013 at 07:57:55AM +0100, Anup Patel wrote:
>> >> >> The approach of flushing d-cache by set/way upon first run of VCPU will
>> >> >> not work because for set/way operations ARM ARM says: "For set/way
>> >> >> operations, and for All (entire cache) operations, the point is defined to be
>> >> >> to the next level of caching". In other words, set/way operations work upto
>> >> >> point of unification.
>> >> >
>> >> > I don't understand where you got the idea that set/way operations work
>> >> > up to the point of unification. This is incorrect, the set/way
>> >> > operations work on the level of cache specified by bits 3:1 in the
>> >> > register passed to the DC CISW instruction. For your L3 cache, those
>> >> > bits would be 2 (and __flush_dcache_all() implementation does this
>> >> > dynamically).
>> >>
>> >> The L3-cache is not visible to CPU. It is totally independent and transparent
>> >> to CPU.
>> >
>> > OK. But you say that operations like DC CIVAC actually flush the L3? So
>> > I don't see it as completely transparent to the CPU.
>>
>> It is transparent from CPU perspective. In other words, there is nothing in
>> CPU for controlling/monitoring L3-cache.
>
> We probably have a different understanding of "transparent". It doesn't
> look to me like any more transparent than the L1 or L2 cache. Basically,
> from a software perspective, it needs maintenance. Whether the CPU
> explicitly asks the L3 cache for this or the L3 cache figures it on its
> own based on the L1/L2 operations is irrelevant.

Yes, we have a different understanding regarding "transparent". The goal
behind our L3 is to have external cache such that CPU is not aware of it
and CPU would work fine even if L3 is turned-off on-the-fly. Further such
L3 has to be smarter than normal outer-caches because it will be
maintained automatically by observing CPU operations.

>
> It would have been transparent if the software didn't need to know about
> it at all, but it's not the case.

Currently, KVM is the only place where we need L3 maintenance because
Guest starts with MMU turned-off otherwise we don't need L3 maintenance
in kernel for any kind of IO or Subsystem.

>
>> > Do you have any configuration bits which would make the L3 completely
>> > transparent like always caching even when accesses are non-cacheable and
>> > DC ops to PoC ignoring it?
>>
>> Actually, L3-cache monitors the types of read/write generated by CPU (i.e.
>> whether the request is cacheable/non-cacheable or whether the request is
>> due to DC ops to PoC, or ...).
>>
>> To answer your query, there is no configuration to have L3 caching when
>> accesses are non-cacheable and DC ops to PoC.
>
> So it's an outer cache with some "improvements" to handle DC ops to PoC.
> I think it was a pretty bad decision on the hardware side as we really
> try to get rid of outer caches for many reasons:

Getting rid off outer-caches (such as in this context) may make sense for
Embedded/Low-end systems but for Servers L3-cache makes lot of sense.

Claiming this to be a bad decision all depends on what end application
you are looking at.

>
> 1. Non-standard cache flushing operations (MMIO-based)
> 2. It may require cache maintenance by physical address - something
>    hard to get in a guest OS (unless you virtualise L3 cache
>    maintenance)

We don't have cache maintenance by physical address in our L3-cache.

> 3. Are barriers like DSB propagated correctly? Does a DC op to PoC
>    followed by DSB ensure that the L3 drained the cachelines to RAM?

Yes, DSB works perfectly fine for us with L3.
Yes, DC op to PoC followed by DSB works fine with L3 too.

>
> I think point 2 isn't required because your L3 detects DC ops to PoC. I
> hope point 3 is handled correctly (otherwise look how "nice" the mb()
> macro on arm is to cope with L2x0).
>
> If only 1 is left, we don't need the full outer_cache framework but it
> still needs to be addressed since the assumption is that flush_cache_all
> (or __flush_dcache_all) flushes all cache levels. These are not used in
> generic code but are used during kernel booting, KVM and cpuidle
> drivers.
>
>> > Now, back to the idea of outer_cache framework for arm64. Does your CPU
>> > have separate instructions for flushing this L3 cache?
>>
>> No, CPU does not have separate instruction for flushing L3-cache. On the
>> other hand, L3-cache has MMIO registers which can be use to explicitly
>> flush L3-cache.
>
> I guess you use those in your firmware or boot loader since Linux
> requires clean/invalidated caches at boot (and I plan to push a patch
> which removes kernel D-cache cleaning during boot to spot such problems
> early). A cpuidle driver would probably need this as well.

Yes, cpuidle would be another place where we may need L3-cache
maintenance. In other words, if we need to power-off L3-cache from
kernel then we need to first flush it.

>
> --
> Catalin

--Anup
Anup Patel Aug. 30, 2013, 10:44 a.m. UTC | #31
On Fri, Aug 30, 2013 at 3:22 PM, Catalin Marinas
<catalin.marinas@arm.com> wrote:
> On Fri, Aug 16, 2013 at 07:11:51PM +0100, Anup Patel wrote:
>> On Fri, Aug 16, 2013 at 11:20 PM, Christoffer Dall
>> <christoffer.dall@linaro.org> wrote:
>> > On Fri, Aug 16, 2013 at 11:12:08PM +0530, Anup Patel wrote:
>> >> Discussion here is about getting KVM ARM64 working in-presence
>> >> of an external L3-cache (i.e. not part of CPU). Before starting a VCPU
>> >> user-space typically loads images to guest RAM so, in-presence of
>> >> huge L3-cache (few MBs). When the VCPU starts running some of the
>> >> contents guest RAM will be still in L3-cache and VCPU runs with
>> >> MMU off (i.e. cacheing off) hence VCPU will bypass L3-cache and
>> >> see incorrect contents. To solve this problem we need to flush the
>> >> guest RAM contents before they are accessed by first time by VCPU.
>> >>
>> > ok, I'm with you that far.
>> >
>> > But is it also not true that we need to decide between:
>> >
>> >   A.1: Flush the entire guest RAM before running the VCPU
>> >   A.2: Flush the pages as we fault them in
>>
>> Yes, thats the decision we have to make.
>>
>> >
>> > And (independently):
>> >
>> >   B.1: Use __flush_dcache_range
>> >   B.2: Use something else + outer cache framework for arm64
>>
>> This would be __flush_dcache_all() + outer cache flush all.
>
> We need to be careful here since the __flush_dcache_all() operation uses
> cache maintenance by set/way and these are *local* to a CPU (IOW not
> broadcast). Do you have any guarantee that dirty cache lines don't
> migrate between CPUs and __flush_dcache_all() wouldn't miss them?
> Architecturally we don't, so this is not a safe operation that would
> guarantee L1 cache flushing (we probably need to revisit some of the
> __flush_dcache_all() calls in KVM, I haven't looked into this).
>
> So I think we are left to the range operation where the DC ops to PoC
> would be enough for your L3.

If __flush_dcache_all() is *local" to a CPU then I guess DC ops to PoC
by range would be the only option.

>
> An outer cache flush all is probably only needed for cpuidle/suspend
> (the booting part should be handled by the boot loader).

Yes, cpuidle/suspend would definitely require outer cache maintenance.

For KVM, we can avoid flushing d-cache to PoC every time in
coherent_icache_guest_page() by only doing it when Guest MMU is
turned-off. This may reduce the performance penalty.

>
> --
> Catalin

--Anup
Catalin Marinas Aug. 30, 2013, 12:52 p.m. UTC | #32
On Fri, Aug 30, 2013 at 11:36:30AM +0100, Anup Patel wrote:
> On Fri, Aug 30, 2013 at 3:14 PM, Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > On Thu, Aug 29, 2013 at 05:02:50PM +0100, Anup Patel wrote:
> >> Actually, L3-cache monitors the types of read/write generated by CPU (i.e.
> >> whether the request is cacheable/non-cacheable or whether the request is
> >> due to DC ops to PoC, or ...).
> >>
> >> To answer your query, there is no configuration to have L3 caching when
> >> accesses are non-cacheable and DC ops to PoC.
> >
> > So it's an outer cache with some "improvements" to handle DC ops to PoC.
> > I think it was a pretty bad decision on the hardware side as we really
> > try to get rid of outer caches for many reasons:
> 
> Getting rid off outer-caches (such as in this context) may make sense for
> Embedded/Low-end systems but for Servers L3-cache makes lot of sense.
> 
> Claiming this to be a bad decision all depends on what end application
> you are looking at.

It's not a bad decision to have big L3 cache, that's a very *good*
decision for server applications. The bad part is that it is not fully
integrated with the CPU (like allowing set/way operations to flush this
cache level).

> > 1. Non-standard cache flushing operations (MMIO-based)
> > 2. It may require cache maintenance by physical address - something
> >    hard to get in a guest OS (unless you virtualise L3 cache
> >    maintenance)
> 
> We don't have cache maintenance by physical address in our L3-cache.

Great.

> > 3. Are barriers like DSB propagated correctly? Does a DC op to PoC
> >    followed by DSB ensure that the L3 drained the cachelines to RAM?
> 
> Yes, DSB works perfectly fine for us with L3.
> Yes, DC op to PoC followed by DSB works fine with L3 too.

Even better.

See my other comment on flush_dcache_all() use in the kernel/KVM and why
I don't think it is suitable (which leaves us with DC ops to PoC in
KVM).

> > I think point 2 isn't required because your L3 detects DC ops to PoC. I
> > hope point 3 is handled correctly (otherwise look how "nice" the mb()
> > macro on arm is to cope with L2x0).
> >
> > If only 1 is left, we don't need the full outer_cache framework but it
> > still needs to be addressed since the assumption is that flush_cache_all
> > (or __flush_dcache_all) flushes all cache levels. These are not used in
> > generic code but are used during kernel booting, KVM and cpuidle
> > drivers.
> >
> >> > Now, back to the idea of outer_cache framework for arm64. Does your CPU
> >> > have separate instructions for flushing this L3 cache?
> >>
> >> No, CPU does not have separate instruction for flushing L3-cache. On the
> >> other hand, L3-cache has MMIO registers which can be use to explicitly
> >> flush L3-cache.
> >
> > I guess you use those in your firmware or boot loader since Linux
> > requires clean/invalidated caches at boot (and I plan to push a patch
> > which removes kernel D-cache cleaning during boot to spot such problems
> > early). A cpuidle driver would probably need this as well.
> 
> Yes, cpuidle would be another place where we may need L3-cache
> maintenance. In other words, if we need to power-off L3-cache from
> kernel then we need to first flush it.

The problem is if you need to disable the MMU in the kernel, you would
need to flush the L3 cache first. Normally this would be done in the
firmware with PSCI but most likely not the case for the Applied
hardware.
Catalin Marinas Aug. 30, 2013, 1:02 p.m. UTC | #33
On Fri, Aug 30, 2013 at 11:44:30AM +0100, Anup Patel wrote:
> On Fri, Aug 30, 2013 at 3:22 PM, Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > On Fri, Aug 16, 2013 at 07:11:51PM +0100, Anup Patel wrote:
> >> On Fri, Aug 16, 2013 at 11:20 PM, Christoffer Dall
> >> <christoffer.dall@linaro.org> wrote:
> >> > On Fri, Aug 16, 2013 at 11:12:08PM +0530, Anup Patel wrote:
> >> >> Discussion here is about getting KVM ARM64 working in-presence
> >> >> of an external L3-cache (i.e. not part of CPU). Before starting a VCPU
> >> >> user-space typically loads images to guest RAM so, in-presence of
> >> >> huge L3-cache (few MBs). When the VCPU starts running some of the
> >> >> contents guest RAM will be still in L3-cache and VCPU runs with
> >> >> MMU off (i.e. cacheing off) hence VCPU will bypass L3-cache and
> >> >> see incorrect contents. To solve this problem we need to flush the
> >> >> guest RAM contents before they are accessed by first time by VCPU.
> >> >>
> >> > ok, I'm with you that far.
> >> >
> >> > But is it also not true that we need to decide between:
> >> >
> >> >   A.1: Flush the entire guest RAM before running the VCPU
> >> >   A.2: Flush the pages as we fault them in
> >>
> >> Yes, thats the decision we have to make.
> >>
> >> >
> >> > And (independently):
> >> >
> >> >   B.1: Use __flush_dcache_range
> >> >   B.2: Use something else + outer cache framework for arm64
> >>
> >> This would be __flush_dcache_all() + outer cache flush all.
> >
> > We need to be careful here since the __flush_dcache_all() operation uses
> > cache maintenance by set/way and these are *local* to a CPU (IOW not
> > broadcast). Do you have any guarantee that dirty cache lines don't
> > migrate between CPUs and __flush_dcache_all() wouldn't miss them?
> > Architecturally we don't, so this is not a safe operation that would
> > guarantee L1 cache flushing (we probably need to revisit some of the
> > __flush_dcache_all() calls in KVM, I haven't looked into this).
> >
> > So I think we are left to the range operation where the DC ops to PoC
> > would be enough for your L3.
> 
> If __flush_dcache_all() is *local" to a CPU then I guess DC ops to PoC
> by range would be the only option.

Yes. In the (upcoming) ARM ARMv8 there is a clear note that set/way
operations to flush the whole cache must not be used for the maintenance
of large buffer but only during power-down/power-up code sequences.

> > An outer cache flush all is probably only needed for cpuidle/suspend
> > (the booting part should be handled by the boot loader).
> 
> Yes, cpuidle/suspend would definitely require outer cache maintenance.
> 
> For KVM, we can avoid flushing d-cache to PoC every time in
> coherent_icache_guest_page() by only doing it when Guest MMU is
> turned-off. This may reduce the performance penalty.

That's for the KVM guys to decide ;)
Marc Zyngier Aug. 30, 2013, 1:21 p.m. UTC | #34
On 2013-08-30 11:44, Anup Patel wrote:
> On Fri, Aug 30, 2013 at 3:22 PM, Catalin Marinas
> <catalin.marinas@arm.com> wrote:
>> On Fri, Aug 16, 2013 at 07:11:51PM +0100, Anup Patel wrote:
>>> On Fri, Aug 16, 2013 at 11:20 PM, Christoffer Dall
>>> <christoffer.dall@linaro.org> wrote:
>>> > On Fri, Aug 16, 2013 at 11:12:08PM +0530, Anup Patel wrote:
>>> >> Discussion here is about getting KVM ARM64 working in-presence
>>> >> of an external L3-cache (i.e. not part of CPU). Before starting 
>>> a VCPU
>>> >> user-space typically loads images to guest RAM so, in-presence 
>>> of
>>> >> huge L3-cache (few MBs). When the VCPU starts running some of 
>>> the
>>> >> contents guest RAM will be still in L3-cache and VCPU runs with
>>> >> MMU off (i.e. cacheing off) hence VCPU will bypass L3-cache and
>>> >> see incorrect contents. To solve this problem we need to flush 
>>> the
>>> >> guest RAM contents before they are accessed by first time by 
>>> VCPU.
>>> >>
>>> > ok, I'm with you that far.
>>> >
>>> > But is it also not true that we need to decide between:
>>> >
>>> >   A.1: Flush the entire guest RAM before running the VCPU
>>> >   A.2: Flush the pages as we fault them in
>>>
>>> Yes, thats the decision we have to make.
>>>
>>> >
>>> > And (independently):
>>> >
>>> >   B.1: Use __flush_dcache_range
>>> >   B.2: Use something else + outer cache framework for arm64
>>>
>>> This would be __flush_dcache_all() + outer cache flush all.
>>
>> We need to be careful here since the __flush_dcache_all() operation 
>> uses
>> cache maintenance by set/way and these are *local* to a CPU (IOW not
>> broadcast). Do you have any guarantee that dirty cache lines don't
>> migrate between CPUs and __flush_dcache_all() wouldn't miss them?
>> Architecturally we don't, so this is not a safe operation that would
>> guarantee L1 cache flushing (we probably need to revisit some of the
>> __flush_dcache_all() calls in KVM, I haven't looked into this).
>>
>> So I think we are left to the range operation where the DC ops to 
>> PoC
>> would be enough for your L3.
>
> If __flush_dcache_all() is *local" to a CPU then I guess DC ops to 
> PoC
> by range would be the only option.
>
>>
>> An outer cache flush all is probably only needed for cpuidle/suspend
>> (the booting part should be handled by the boot loader).
>
> Yes, cpuidle/suspend would definitely require outer cache 
> maintenance.
>
> For KVM, we can avoid flushing d-cache to PoC every time in
> coherent_icache_guest_page() by only doing it when Guest MMU is
> turned-off. This may reduce the performance penalty.

What about the I and C bits in SCTLR_EL1? Does L3 also honour these 
bits?

         M.
Catalin Marinas Aug. 30, 2013, 2:04 p.m. UTC | #35
On Fri, Aug 30, 2013 at 02:21:57PM +0100, Marc Zyngier wrote:
> On 2013-08-30 11:44, Anup Patel wrote:
> > For KVM, we can avoid flushing d-cache to PoC every time in
> > coherent_icache_guest_page() by only doing it when Guest MMU is
> > turned-off. This may reduce the performance penalty.
> 
> What about the I and C bits in SCTLR_EL1? Does L3 also honour these 
> bits?

I would think so, it probably cares about how the transactions are
presented at the bus level by the CPU.
Marc Zyngier Aug. 30, 2013, 2:22 p.m. UTC | #36
On 2013-08-30 15:04, Catalin Marinas wrote:
> On Fri, Aug 30, 2013 at 02:21:57PM +0100, Marc Zyngier wrote:
>> On 2013-08-30 11:44, Anup Patel wrote:
>> > For KVM, we can avoid flushing d-cache to PoC every time in
>> > coherent_icache_guest_page() by only doing it when Guest MMU is
>> > turned-off. This may reduce the performance penalty.
>>
>> What about the I and C bits in SCTLR_EL1? Does L3 also honour these
>> bits?
>
> I would think so, it probably cares about how the transactions are
> presented at the bus level by the CPU.

That'd make sense indeed. So we need to track both SCTLR_EL1.{I,C,M} in 
order to find out whether or not we need to clean the cache.

Mumble mumble...

         M.
Will Deacon Aug. 30, 2013, 2:30 p.m. UTC | #37
On Fri, Aug 30, 2013 at 03:22:11PM +0100, Marc Zyngier wrote:
> On 2013-08-30 15:04, Catalin Marinas wrote:
> > On Fri, Aug 30, 2013 at 02:21:57PM +0100, Marc Zyngier wrote:
> >> On 2013-08-30 11:44, Anup Patel wrote:
> >> > For KVM, we can avoid flushing d-cache to PoC every time in
> >> > coherent_icache_guest_page() by only doing it when Guest MMU is
> >> > turned-off. This may reduce the performance penalty.
> >>
> >> What about the I and C bits in SCTLR_EL1? Does L3 also honour these
> >> bits?
> >
> > I would think so, it probably cares about how the transactions are
> > presented at the bus level by the CPU.
> 
> That'd make sense indeed. So we need to track both SCTLR_EL1.{I,C,M} in 
> order to find out whether or not we need to clean the cache.

<horrible hack>
Could you use a dummy ATS_* operation, then read the cacheability attributes
out of the PAR?
</horrible hack>

Will
Anup Patel Aug. 30, 2013, 2:52 p.m. UTC | #38
Hi Will,

On Fri, Aug 30, 2013 at 8:00 PM, Will Deacon <will.deacon@arm.com> wrote:
> On Fri, Aug 30, 2013 at 03:22:11PM +0100, Marc Zyngier wrote:
>> On 2013-08-30 15:04, Catalin Marinas wrote:
>> > On Fri, Aug 30, 2013 at 02:21:57PM +0100, Marc Zyngier wrote:
>> >> On 2013-08-30 11:44, Anup Patel wrote:
>> >> > For KVM, we can avoid flushing d-cache to PoC every time in
>> >> > coherent_icache_guest_page() by only doing it when Guest MMU is
>> >> > turned-off. This may reduce the performance penalty.
>> >>
>> >> What about the I and C bits in SCTLR_EL1? Does L3 also honour these
>> >> bits?
>> >
>> > I would think so, it probably cares about how the transactions are
>> > presented at the bus level by the CPU.
>>
>> That'd make sense indeed. So we need to track both SCTLR_EL1.{I,C,M} in
>> order to find out whether or not we need to clean the cache.
>
> <horrible hack>
> Could you use a dummy ATS_* operation, then read the cacheability attributes
> out of the PAR?
> </horrible hack>

We will have to use ATS12xxx operation which can only be executed from EL2
or EL3. The host kernel runs at EL1 hence we will need to use HVC call to
execute ATS12xxx operation in EL2 mode which in-turn means a world-switch
overhead just to execute ATS12xxx operation.

>
> Will

--Anup
Marc Zyngier Aug. 30, 2013, 3:12 p.m. UTC | #39
Hi Will,

On 2013-08-30 15:30, Will Deacon wrote:
> On Fri, Aug 30, 2013 at 03:22:11PM +0100, Marc Zyngier wrote:
>> On 2013-08-30 15:04, Catalin Marinas wrote:
>> > On Fri, Aug 30, 2013 at 02:21:57PM +0100, Marc Zyngier wrote:
>> >> On 2013-08-30 11:44, Anup Patel wrote:
>> >> > For KVM, we can avoid flushing d-cache to PoC every time in
>> >> > coherent_icache_guest_page() by only doing it when Guest MMU is
>> >> > turned-off. This may reduce the performance penalty.
>> >>
>> >> What about the I and C bits in SCTLR_EL1? Does L3 also honour 
>> these
>> >> bits?
>> >
>> > I would think so, it probably cares about how the transactions are
>> > presented at the bus level by the CPU.
>>
>> That'd make sense indeed. So we need to track both SCTLR_EL1.{I,C,M} 
>> in
>> order to find out whether or not we need to clean the cache.
>
> <horrible hack>
> Could you use a dummy ATS_* operation, then read the cacheability 
> attributes
> out of the PAR?
> </horrible hack>

That'd be doable, and maybe not that expensive (using ATS1, the Stage-1 
translation is probably still cached after the translation fault). 
Actually, I suspect this is more correct than just looking at SCTLR_EL1, 
as it doesn't assume that the guest uses always cacheable memory when 
enables the caches.

Shame HPFAR_EL2 doesn't report this information though...

         M.
diff mbox

Patch

diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index a5f28e2..a71a9bf 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -60,10 +60,12 @@ 
 /*
  * The bits we set in HCR:
  * RW:		64bit by default, can be overriden for 32bit VMs
+ * TVM:		Trap writes to VM registers (until MMU is on)
  * TAC:		Trap ACTLR
  * TSC:		Trap SMC
  * TSW:		Trap cache operations by set/way
  * TWI:		Trap WFI
+ * DC:		Default Cacheable (until MMU is on)
  * TIDCP:	Trap L2CTLR/L2ECTLR
  * BSU_IS:	Upgrade barriers to the inner shareable domain
  * FB:		Force broadcast of all maintainance operations
@@ -74,7 +76,7 @@ 
  */
 #define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWI | HCR_VM | HCR_BSU_IS | \
 			 HCR_FB | HCR_TAC | HCR_AMO | HCR_IMO | HCR_FMO | \
-			 HCR_SWIO | HCR_TIDCP | HCR_RW)
+			 HCR_DC | HCR_TVM | HCR_SWIO | HCR_TIDCP | HCR_RW)
 #define HCR_VIRT_EXCP_MASK (HCR_VA | HCR_VI | HCR_VF)
 
 /* Hyp System Control Register (SCTLR_EL2) bits */
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 9492360..6d81d87 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -121,6 +121,42 @@  done:
 }
 
 /*
+ * Generic accessor for VM registers. Only called as long as HCR_TVM
+ * is set.
+ */
+static bool access_vm_reg(struct kvm_vcpu *vcpu,
+			  const struct sys_reg_params *p,
+			  const struct sys_reg_desc *r)
+{
+	BUG_ON(!p->is_write);
+
+	vcpu_sys_reg(vcpu, r->reg) = *vcpu_reg(vcpu, p->Rt);
+	return true;
+}
+
+/*
+ * SCTLR_EL1 accessor. Only called as long as HCR_TVM is set.  If the
+ * guest enables the MMU, we stop trapping the VM sys_regs and leave
+ * it in complete control of the caches.
+ */
+static bool access_sctlr_el1(struct kvm_vcpu *vcpu,
+			     const struct sys_reg_params *p,
+			     const struct sys_reg_desc *r)
+{
+	unsigned long val;
+
+	BUG_ON(!p->is_write);
+
+	val = *vcpu_reg(vcpu, p->Rt);
+	vcpu_sys_reg(vcpu, r->reg) = val;
+
+	if (val & (1 << 0))	/* MMU Enabled? */
+		vcpu->arch.hcr_el2 &= ~(HCR_DC | HCR_TVM);
+
+	return true;
+}
+
+/*
  * We could trap ID_DFR0 and tell the guest we don't support performance
  * monitoring.  Unfortunately the patch to make the kernel check ID_DFR0 was
  * NAKed, so it will read the PMCR anyway.
@@ -185,32 +221,32 @@  static const struct sys_reg_desc sys_reg_descs[] = {
 	  NULL, reset_mpidr, MPIDR_EL1 },
 	/* SCTLR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b0001), CRm(0b0000), Op2(0b000),
-	  NULL, reset_val, SCTLR_EL1, 0x00C50078 },
+	  access_sctlr_el1, reset_val, SCTLR_EL1, 0x00C50078 },
 	/* CPACR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b0001), CRm(0b0000), Op2(0b010),
 	  NULL, reset_val, CPACR_EL1, 0 },
 	/* TTBR0_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b0010), CRm(0b0000), Op2(0b000),
-	  NULL, reset_unknown, TTBR0_EL1 },
+	  access_vm_reg, reset_unknown, TTBR0_EL1 },
 	/* TTBR1_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b0010), CRm(0b0000), Op2(0b001),
-	  NULL, reset_unknown, TTBR1_EL1 },
+	  access_vm_reg, reset_unknown, TTBR1_EL1 },
 	/* TCR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b0010), CRm(0b0000), Op2(0b010),
-	  NULL, reset_val, TCR_EL1, 0 },
+	  access_vm_reg, reset_val, TCR_EL1, 0 },
 
 	/* AFSR0_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b0101), CRm(0b0001), Op2(0b000),
-	  NULL, reset_unknown, AFSR0_EL1 },
+	  access_vm_reg, reset_unknown, AFSR0_EL1 },
 	/* AFSR1_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b0101), CRm(0b0001), Op2(0b001),
-	  NULL, reset_unknown, AFSR1_EL1 },
+	  access_vm_reg, reset_unknown, AFSR1_EL1 },
 	/* ESR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b0101), CRm(0b0010), Op2(0b000),
-	  NULL, reset_unknown, ESR_EL1 },
+	  access_vm_reg, reset_unknown, ESR_EL1 },
 	/* FAR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b0110), CRm(0b0000), Op2(0b000),
-	  NULL, reset_unknown, FAR_EL1 },
+	  access_vm_reg, reset_unknown, FAR_EL1 },
 
 	/* PMINTENSET_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b1001), CRm(0b1110), Op2(0b001),
@@ -221,17 +257,17 @@  static const struct sys_reg_desc sys_reg_descs[] = {
 
 	/* MAIR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b1010), CRm(0b0010), Op2(0b000),
-	  NULL, reset_unknown, MAIR_EL1 },
+	  access_vm_reg, reset_unknown, MAIR_EL1 },
 	/* AMAIR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b1010), CRm(0b0011), Op2(0b000),
-	  NULL, reset_amair_el1, AMAIR_EL1 },
+	  access_vm_reg, reset_amair_el1, AMAIR_EL1 },
 
 	/* VBAR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b1100), CRm(0b0000), Op2(0b000),
 	  NULL, reset_val, VBAR_EL1, 0 },
 	/* CONTEXTIDR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b1101), CRm(0b0000), Op2(0b001),
-	  NULL, reset_val, CONTEXTIDR_EL1, 0 },
+	  access_vm_reg, reset_val, CONTEXTIDR_EL1, 0 },
 	/* TPIDR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b1101), CRm(0b0000), Op2(0b100),
 	  NULL, reset_unknown, TPIDR_EL1 },