[RFC,00/19] mm/gup: remove FOLL_FORCE usage from drivers (reliable R/O long-term pinning)

Message ID 20221107161740.144456-1-david@redhat.com

David Hildenbrand Nov. 7, 2022, 4:17 p.m. UTC
So far, we have not supported reliable R/O long-term pinning in COW mappings.
That means that if we triggered R/O long-term pinning in a MAP_PRIVATE
mapping, we could end up pinning the (R/O-mapped) shared zeropage or a
pagecache page.

The next write access would trigger a write fault and replace the pinned
page with an exclusive anonymous page in the process page table; whatever the
process would write to that private page copy would not be visible to the
owner of the previous page pin: for example, RDMA could read stale data.
The end result is essentially an unexpected and hard-to-debug memory
corruption.
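
The divergence described above can be observed from user space without any
pinning. The following is a minimal sketch, not part of the series: it maps
a temporary file MAP_PRIVATE, reads through the mapping (so the pagecache
page gets mapped R/O), writes through the mapping (so COW replaces it with an
anonymous copy), and then checks that the file still holds the old data. The
function name and the temporary file path are made up for illustration. An
R/O pin taken before the write would keep referencing the old page, just like
the file read does here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only.
 *
 * Returns 1 if a write through a MAP_PRIVATE file mapping left the file
 * content (the pagecache page) unchanged, 0 if not, and -1 on error.
 */
int private_write_diverges_from_pagecache(void)
{
	char path[] = "/tmp/cow-demo-XXXXXX";
	char from_file[3];
	long page_size = sysconf(_SC_PAGESIZE);
	int fd, ret = -1;
	char *map;

	assert(page_size > 0);

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);

	if (ftruncate(fd, page_size) || pwrite(fd, "old", 3, 0) != 3)
		goto out_close;

	map = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	/* Read access: the pagecache page gets mapped R/O. */
	if (memcmp(map, "old", 3))
		goto out_unmap;

	/* Write access: write fault, COW installs an anonymous copy. */
	memcpy(map, "new", 3);

	/* The file, and thereby the pagecache page, still holds the old data. */
	if (pread(fd, from_file, 3, 0) != 3)
		goto out_unmap;
	ret = !memcmp(from_file, "old", 3) && !memcmp(map, "new", 3);

out_unmap:
	munmap(map, page_size);
out_close:
	close(fd);
	return ret;
}
```

Calling private_write_diverges_from_pagecache() is expected to return 1 on a
system where /tmp is writable.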

Some drivers have tried working around that limitation by using
"FOLL_FORCE|FOLL_WRITE|FOLL_LONGTERM" for R/O long-term pinning.
FOLL_WRITE would trigger a write fault, if required, and break COW before
pinning the page. FOLL_FORCE is required because the VMA might lack write
permissions, and drivers wanted to make that work as well, just like
one would expect (no write access, but still triggering a write access to
break COW).

However, that is not a practical solution, because
(1) Drivers that don't stick to that undocumented and debatable pattern
    would still run into that issue. For example, VFIO only uses
    FOLL_LONGTERM for R/O long-term pinning.
(2) Using FOLL_WRITE just to work around a COW mapping + page pinning
    limitation is unintuitive. FOLL_WRITE would, for example, mark the
    page softdirty or trigger uffd-wp, even though there isn't actually
    going to be any write access.
(3) The purpose of FOLL_FORCE is debug access, not access despite a lack of
    VMA permissions by arbitrary drivers.

So instead, make R/O long-term pinning work as expected, by breaking COW
in a COW mapping early, such that we can remove any FOLL_FORCE usage from
drivers. More details in patch #8.
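
To make "breaking COW early" more concrete, here is a rough user-space model
of the decision, not the code from this series: an R/O long-term pin in a COW
mapping must not end up on a page that a later write fault would replace. All
names (model_must_unshare(), struct model_page, the MODEL_* constants) and
the flag values are made up for illustration; the real check also has to deal
with cases such as GUP-fast, which this sketch ignores.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; the values are arbitrary, not the kernel's. */
enum {
	MODEL_FOLL_WRITE    = 1 << 0,
	MODEL_FOLL_PIN      = 1 << 1,
	MODEL_FOLL_LONGTERM = 1 << 2,
};

enum {
	MODEL_VM_SHARED   = 1 << 0,
	MODEL_VM_MAYWRITE = 1 << 1,
};

/* Hypothetical, heavily reduced view of a page. */
struct model_page {
	bool anon;           /* anonymous page? */
	bool anon_exclusive; /* exclusive to this process (anon pages only) */
};

/* A COW mapping is a private mapping that may become writable. */
static bool model_is_cow_mapping(unsigned int vm_flags)
{
	return (vm_flags & (MODEL_VM_SHARED | MODEL_VM_MAYWRITE)) ==
	       MODEL_VM_MAYWRITE;
}

/*
 * Returns true if the page must first be replaced by an exclusive anonymous
 * copy ("unshared") before the pin may be taken.
 */
bool model_must_unshare(unsigned int vm_flags, unsigned int gup_flags,
			const struct model_page *page)
{
	/* Only R/O pins are of interest; FOLL_WRITE breaks COW on its own. */
	if ((gup_flags & (MODEL_FOLL_WRITE | MODEL_FOLL_PIN)) != MODEL_FOLL_PIN)
		return false;

	/* Zeropage or pagecache page: unshare for long-term pins in COW mappings. */
	if (!page->anon)
		return (gup_flags & MODEL_FOLL_LONGTERM) &&
		       model_is_cow_mapping(vm_flags);

	/* Anonymous page: only an exclusive one may be pinned as is. */
	return !page->anon_exclusive;
}
```

In this model, a pagecache page in a private mapping needs unsharing for
MODEL_FOLL_PIN|MODEL_FOLL_LONGTERM, while the same page in a shared mapping,
or with a short-term pin, does not.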

Patches #1--#3 add COW tests for non-anonymous pages.
Patches #4--#7 prepare core MM for extended FAULT_FLAG_UNSHARE support in
COW mappings.
Patch #8 implements reliable R/O long-term pinning in COW mappings.
Patches #9--#19 remove any FOLL_FORCE usage from drivers.

I'm refraining from CCing all driver maintainers on the whole patch set.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Lucas Stach <l.stach@pengutronix.de>
Cc: David Airlie <airlied@gmail.com>
Cc: Oded Gabbay <ogabbay@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>

David Hildenbrand (19):
  selftests/vm: anon_cow: prepare for non-anonymous COW tests
  selftests/vm: cow: basic COW tests for non-anonymous pages
  selftests/vm: cow: R/O long-term pinning reliability tests for
    non-anon pages
  mm: add early FAULT_FLAG_UNSHARE consistency checks
  mm: add early FAULT_FLAG_WRITE consistency checks
  mm: rework handling in do_wp_page() based on private vs. shared
    mappings
  mm: don't call vm_ops->huge_fault() in wp_huge_pmd()/wp_huge_pud() for
    private mappings
  mm: extend FAULT_FLAG_UNSHARE support to anything in a COW mapping
  mm/gup: reliable R/O long-term pinning in COW mappings
  RDMA/umem: remove FOLL_FORCE usage
  RDMA/usnic: remove FOLL_FORCE usage
  RDMA/siw: remove FOLL_FORCE usage
  media: videobuf-dma-sg: remove FOLL_FORCE usage
  drm/etnaviv: remove FOLL_FORCE usage
  media: pci/ivtv: remove FOLL_FORCE usage
  mm/frame-vector: remove FOLL_FORCE usage
  drm/exynos: remove FOLL_FORCE usage
  RDMA/hw/qib/qib_user_pages: remove FOLL_FORCE usage
  habanalabs: remove FOLL_FORCE usage

 drivers/gpu/drm/etnaviv/etnaviv_gem.c         |   8 +-
 drivers/gpu/drm/exynos/exynos_drm_g2d.c       |   2 +-
 drivers/infiniband/core/umem.c                |   8 +-
 drivers/infiniband/hw/qib/qib_user_pages.c    |   2 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c      |   9 +-
 drivers/infiniband/sw/siw/siw_mem.c           |   9 +-
 drivers/media/common/videobuf2/frame_vector.c |   2 +-
 drivers/media/pci/ivtv/ivtv-udma.c            |   2 +-
 drivers/media/pci/ivtv/ivtv-yuv.c             |   5 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c     |  14 +-
 drivers/misc/habanalabs/common/memory.c       |   3 +-
 include/linux/mm.h                            |  27 +-
 include/linux/mm_types.h                      |   8 +-
 mm/gup.c                                      |  10 +-
 mm/huge_memory.c                              |   5 +-
 mm/hugetlb.c                                  |  12 +-
 mm/memory.c                                   |  97 +++--
 tools/testing/selftests/vm/.gitignore         |   2 +-
 tools/testing/selftests/vm/Makefile           |  10 +-
 tools/testing/selftests/vm/check_config.sh    |   4 +-
 .../selftests/vm/{anon_cow.c => cow.c}        | 387 +++++++++++++++++-
 tools/testing/selftests/vm/run_vmtests.sh     |   8 +-
 22 files changed, 516 insertions(+), 118 deletions(-)
 rename tools/testing/selftests/vm/{anon_cow.c => cow.c} (74%)

Comments

Nadav Amit Nov. 7, 2022, 7:03 p.m. UTC | #1
On Nov 7, 2022, at 8:17 AM, David Hildenbrand <david@redhat.com> wrote:

> Let's catch abuse of FAULT_FLAG_WRITE early, such that we don't have to
> care in all other handlers and might get "surprises" if we forget to do
> so.
> 
> Write faults without VM_MAYWRITE don't make any sense, and our
> maybe_mkwrite() logic could have hidden such abuse for now.
> 
> Write faults without VM_WRITE on something that is not a COW mapping is
> similarly broken, and e.g., do_wp_page() could end up placing an
> anonymous page into a shared mapping, which would be bad.
> 
> This is a preparation for reliable R/O long-term pinning of pages in
> private mappings, whereby we want to make sure that we will never break
> COW in a read-only private mapping.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> mm/memory.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index fe131273217a..826353da7b23 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5159,6 +5159,14 @@ static vm_fault_t sanitize_fault_flags(struct vm_area_struct *vma,
>                 */
>                if (!is_cow_mapping(vma->vm_flags))
>                        *flags &= ~FAULT_FLAG_UNSHARE;
> +       } else if (*flags & FAULT_FLAG_WRITE) {
> +               /* Write faults on read-only mappings are impossible ... */
> +               if (WARN_ON_ONCE(!(vma->vm_flags & VM_MAYWRITE)))
> +                       return VM_FAULT_SIGSEGV;
> +               /* ... and FOLL_FORCE only applies to COW mappings. */
> +               if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE) &&
> +                                !is_cow_mapping(vma->vm_flags)))
> +                       return VM_FAULT_SIGSEGV;

Not sure about the WARN_*(). Seems as if it might trigger in benign even if
rare scenarios, e.g., mprotect() racing with page-fault.

David Hildenbrand Nov. 7, 2022, 7:27 p.m. UTC | #2
On 07.11.22 20:03, Nadav Amit wrote:
> On Nov 7, 2022, at 8:17 AM, David Hildenbrand <david@redhat.com> wrote:
> 
>> Let's catch abuse of FAULT_FLAG_WRITE early, such that we don't have to
>> care in all other handlers and might get "surprises" if we forget to do
>> so.
>>
>> Write faults without VM_MAYWRITE don't make any sense, and our
>> maybe_mkwrite() logic could have hidden such abuse for now.
>>
>> Write faults without VM_WRITE on something that is not a COW mapping is
>> similarly broken, and e.g., do_wp_page() could end up placing an
>> anonymous page into a shared mapping, which would be bad.
>>
>> This is a preparation for reliable R/O long-term pinning of pages in
>> private mappings, whereby we want to make sure that we will never break
>> COW in a read-only private mapping.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>> mm/memory.c | 8 ++++++++
>> 1 file changed, 8 insertions(+)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index fe131273217a..826353da7b23 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -5159,6 +5159,14 @@ static vm_fault_t sanitize_fault_flags(struct vm_area_struct *vma,
>>                  */
>>                 if (!is_cow_mapping(vma->vm_flags))
>>                         *flags &= ~FAULT_FLAG_UNSHARE;
>> +       } else if (*flags & FAULT_FLAG_WRITE) {
>> +               /* Write faults on read-only mappings are impossible ... */
>> +               if (WARN_ON_ONCE(!(vma->vm_flags & VM_MAYWRITE)))
>> +                       return VM_FAULT_SIGSEGV;
>> +               /* ... and FOLL_FORCE only applies to COW mappings. */
>> +               if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE) &&
>> +                                !is_cow_mapping(vma->vm_flags)))
>> +                       return VM_FAULT_SIGSEGV;
> 
> Not sure about the WARN_*(). Seems as if it might trigger in benign even if
> rare scenarios, e.g., mprotect() racing with page-fault.
> 

We most certainly would want to catch any such broken/racy cases. There 
are no benign cases I could possibly think of.

Page faults need the mmap lock in read. mprotect() / VMA changes need 
the mmap lock in write. Whoever calls handle_mm_fault() is supposed to 
properly check VMA permissions.
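
For readers following along, the two checks from the quoted hunk can be
modeled in user space to see which VMA flag combinations a write fault may
legitimately arrive with. This is only a sketch under stated assumptions: the
MODEL_* constants and model_write_fault_allowed() are made-up names with
arbitrary values, and the function returns a bool where the patch warns once
and returns VM_FAULT_SIGSEGV.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; the values are arbitrary, not the kernel's. */
enum {
	MODEL_VM_WRITE    = 1 << 0,
	MODEL_VM_MAYWRITE = 1 << 1,
	MODEL_VM_SHARED   = 1 << 2,
};

/* A COW mapping is a private mapping that may become writable. */
static bool model_is_cow_mapping(unsigned int vm_flags)
{
	return (vm_flags & (MODEL_VM_SHARED | MODEL_VM_MAYWRITE)) ==
	       MODEL_VM_MAYWRITE;
}

/* Returns true if a write fault on a VMA with these flags passes both checks. */
bool model_write_fault_allowed(unsigned int vm_flags)
{
	/* Write faults on mappings that can never be writable are rejected. */
	if (!(vm_flags & MODEL_VM_MAYWRITE))
		return false;

	/* Without VM_WRITE, only COW mappings may see a (forced) write fault. */
	if (!(vm_flags & MODEL_VM_WRITE) && !model_is_cow_mapping(vm_flags))
		return false;

	return true;
}
```

In this model, a private mapping that is currently read-only but has
MODEL_VM_MAYWRITE set passes (the FOLL_FORCE case), whereas a shared mapping
without MODEL_VM_WRITE does not.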

Nadav Amit Nov. 7, 2022, 7:50 p.m. UTC | #3
On Nov 7, 2022, at 11:27 AM, David Hildenbrand <david@redhat.com> wrote:

> On 07.11.22 20:03, Nadav Amit wrote:
>> On Nov 7, 2022, at 8:17 AM, David Hildenbrand <david@redhat.com> wrote:
>> 
>>> Let's catch abuse of FAULT_FLAG_WRITE early, such that we don't have to
>>> care in all other handlers and might get "surprises" if we forget to do
>>> so.
>>> 
>>> Write faults without VM_MAYWRITE don't make any sense, and our
>>> maybe_mkwrite() logic could have hidden such abuse for now.
>>> 
>>> Write faults without VM_WRITE on something that is not a COW mapping is
>>> similarly broken, and e.g., do_wp_page() could end up placing an
>>> anonymous page into a shared mapping, which would be bad.
>>> 
>>> This is a preparation for reliable R/O long-term pinning of pages in
>>> private mappings, whereby we want to make sure that we will never break
>>> COW in a read-only private mapping.
>>> 
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>> mm/memory.c | 8 ++++++++
>>> 1 file changed, 8 insertions(+)
>>> 
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index fe131273217a..826353da7b23 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -5159,6 +5159,14 @@ static vm_fault_t sanitize_fault_flags(struct vm_area_struct *vma,
>>>                 */
>>>                if (!is_cow_mapping(vma->vm_flags))
>>>                        *flags &= ~FAULT_FLAG_UNSHARE;
>>> +       } else if (*flags & FAULT_FLAG_WRITE) {
>>> +               /* Write faults on read-only mappings are impossible ... */
>>> +               if (WARN_ON_ONCE(!(vma->vm_flags & VM_MAYWRITE)))
>>> +                       return VM_FAULT_SIGSEGV;
>>> +               /* ... and FOLL_FORCE only applies to COW mappings. */
>>> +               if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE) &&
>>> +                                !is_cow_mapping(vma->vm_flags)))
>>> +                       return VM_FAULT_SIGSEGV;
>> 
>> Not sure about the WARN_*(). Seems as if it might trigger in benign even if
>> rare scenarios, e.g., mprotect() racing with page-fault.
> 
> We most certainly would want to catch any such broken/racy cases. There
> are no benign cases I could possibly think of.
> 
> Page faults need the mmap lock in read. mprotect() / VMA changes need
> the mmap lock in write. Whoever calls handle_mm_fault() is supposed to
> properly check VMA permissions.

My bad. I now see it. Thanks for explaining.

David Hildenbrand Nov. 8, 2022, 9:29 a.m. UTC | #4
On 07.11.22 18:27, Linus Torvalds wrote:
> On Mon, Nov 7, 2022 at 8:18 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> So instead, make R/O long-term pinning work as expected, by breaking COW
>> in a COW mapping early, such that we can remove any FOLL_FORCE usage from
>> drivers.
> 
> Nothing makes me unhappy from a quick scan through these patches.
> 
> And I'd really love to just have this long saga ended, and FOLL_FORCE
> finally relegated to purely ptrace accesses.
> 
> So an enthusiastic Ack from me.

Thanks Linus! My hope is that we can remove it from all drivers and not 
have to leave it in for some corner cases; so far it looks promising.

Christoph Hellwig Nov. 14, 2022, 6:03 a.m. UTC | #5
On Mon, Nov 07, 2022 at 09:27:23AM -0800, Linus Torvalds wrote:
> And I'd really love to just have this long saga ended, and FOLL_FORCE
> finally relegated to purely ptrace accesses.

At that point we should also rename it to FOLL_PTRACE to make that
very clear, and also break anything in-flight accidentally readding it,
which I'd otherwise expect to happen.

David Hildenbrand Nov. 14, 2022, 8:07 a.m. UTC | #6
On 14.11.22 07:03, Christoph Hellwig wrote:
> On Mon, Nov 07, 2022 at 09:27:23AM -0800, Linus Torvalds wrote:
>> And I'd really love to just have this long saga ended, and FOLL_FORCE
>> finally relegated to purely ptrace accesses.
> 
> At that point we should also rename it to FOLL_PTRACE to make that
> very clear, and also break anything in-flight accidentally readding it,
> which I'd otherwise expect to happen.

Good idea; I'll include a patch in v1.