diff mbox series

[v12,2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs

Message ID 20230406074005.1784728-3-usama.anjum@collabora.com
State New
Headers show
Series Implement IOCTL to get and optionally clear info about PTEs | expand

Commit Message

Muhammad Usama Anjum April 6, 2023, 7:40 a.m. UTC
This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
the info about page table entries. The following operations are supported
in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
  file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
  (PAGE_IS_SWAPPED).
- Find pages which have been written-to and write protect the pages
  (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)

This IOCTL can be extended to get information about more PTE bits.

Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
---
Changes in v12:
- Add hugetlb support to cover all memory types
- Merge "userfaultfd: Define dummy uffd_wp_range()" with this patch
- Review updates to the code

Changes in v11:
- Find written pages in a better way
- Fix a corner case (thanks Paul)
- Improve the code/comments
- remove ENGAGE_WP + ! GET operation
- shorten the commit message in favour of moving documentation to
  pagemap.rst

Changes in v10:
- move changes in tools/include/uapi/linux/fs.h to separate patch
- update commit message

Change in v8:
- Correct is_pte_uffd_wp()
- Improve readability and error checks
- Remove some un-needed code

Changes in v7:
- Rebase on top of latest next
- Fix some corner cases
- Base soft-dirty on the uffd wp async
- Update the terminologies
- Optimize the memory usage inside the ioctl

Changes in v6:
- Rename variables and update comments
- Make IOCTL independent of soft_dirty config
- Change masks and bitmap type to _u64
- Improve code quality

Changes in v5:
- Remove tlb flushing even for clear operation

Changes in v4:
- Update the interface and implementation

Changes in v3:
- Tighten the user-kernel interface by using explicit types and add more
  error checking

Changes in v2:
- Convert the interface from syscall to ioctl
- Remove pidfd support as it doesn't make sense in ioctl
---
 fs/proc/task_mmu.c            | 426 ++++++++++++++++++++++++++++++++++
 include/linux/userfaultfd_k.h |   8 +
 include/uapi/linux/fs.h       |  53 +++++
 3 files changed, 487 insertions(+)

Comments

Michał Mirosław April 6, 2023, 8:12 p.m. UTC | #1
On Thu, 6 Apr 2023 at 09:40, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
[...]
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
[...]
> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> +                                 unsigned long end, struct mm_walk *walk)
> +{
> +       struct pagemap_scan_private *p = walk->private;
> +       bool is_written, is_file, is_present, is_swap;
> +       struct vm_area_struct *vma = walk->vma;
> +       unsigned long addr = end;
> +       spinlock_t *ptl;
> +       int ret = 0;
> +       pte_t *pte;
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +       ptl = pmd_trans_huge_lock(pmd, vma);
> +       if (ptl) {
[...]
> +               return ret;
> +       }
> +process_smaller_pages:
> +       if (pmd_trans_unstable(pmd))
> +               return 0;

Why pmd_trans_unstable() is needed here and not only after split_huge_pmd()?

Best regards
Michał Mirosław
Muhammad Usama Anjum April 6, 2023, 9:03 p.m. UTC | #2
On 4/7/23 1:00 AM, Michał Mirosław wrote:
> On Thu, 6 Apr 2023 at 19:58, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
>> Hello,
>>
>> Thank you so much for the review. Do you have any thoughts on the build
>> error on arc architecture?
>> https://lore.kernel.org/all/e3c82373-256a-6297-bcb4-5e1179a2cbe2@collabora.com
> 
> Maybe copy HPAGE_* defines from x86 and key on CONFIG_PGTABLE_LEVELS >
> 2? I don't know much about arc arch, though.
> 
>> On 4/6/23 8:52 PM, Michał Mirosław wrote:
>>> On Thu, 6 Apr 2023 at 09:40, Muhammad Usama Anjum
>>> <usama.anjum@collabora.com> wrote:>
> [...]
>>>> +#define PM_SCAN_BITMAP(wt, file, present, swap)        \
>>>> +       (wt | file << 1 | present << 2 | swap << 3)
>>> Please parenthesize macro arguments ("(wt)", "(file)", etc.) to not
>>> have to worry about operator precedence when passed a complex
>>> expression.
>> Like this?
>> #define PM_SCAN_BITMAP(wt, file, present, swap) \
>>         ((wt) | (file << 1) | (present << 2) | (swap << 3))
> 
> The value would be:
>  ( (wt) | ((file) << 1) | ... )
> IOW, each parameter should have parentheses around its name.
Will do.

> 
> [...]
>>>> +               cur->len += n_pages;
>>>> +               p->found_pages += n_pages;
>>>> +
>>>> +               if (p->max_pages && (p->found_pages == p->max_pages))
>>>> +                       return PM_SCAN_FOUND_MAX_PAGES;
>>>> +
>>>> +               return 0;
>>>> +       }
>>>> +
>>>> +       if (!p->vec_index || ((p->vec_index + 1) < p->vec_len)) {
>>>
>>> It looks that `if (p->vec_index < p->vec_len)` is enough here - if we
>>> have vec_len == 0 here, then we'd not fit the entry in the userspace
>>> buffer anyway. Am I missing something?
>> No. I'd explained it with diagram last time:
>> https://lore.kernel.org/all/3c8d9ea0-1382-be0c-8dd2-d490eedd3b55@collabora.com
>>
>> I'll add a concise comment here.
> 
> So it seems, but I think the code changed a bit and maybe could be
> simplified now? Since p->vec_len == 0 is currently not valid, the
> field could count only the entries available in p->vec[] -- IOW: not
> include p->cur in the count.
I see. But this'll not work as we need to count p->cur to don't go above
the maximum count, p->vec_size.

> 
> BTW, `if (no space) return -ENOSPC` will avoid additional indentation
> for the non-merging case.
I'll update.

> 
> [...]
>>>> +static inline int pagemap_scan_deposit(struct pagemap_scan_private *p,
>>>> +                                      struct page_region __user *vec,
>>>> +                                      unsigned long *vec_index)
>>>
>>> ..._deposit() is used only in single place - please inline.
>> It is already inline.
> 
> Sorry. I mean: please paste the code in place of the single call.
I've made it a separate function to make the code look better in the caller
function and logically easier to understand. This function is ugly.
do_pagemap_scan() is also already very long function with lots of things
happening. If you still insist, I'll remove this function.

> 
> [...]
>>>> +               /*
>>>> +                * Break huge page into small pages if the WP operation need to
>>>> +                * be performed is on a portion of the huge page or if max_pages
>>>> +                * pages limit would exceed.
>>>
>>> BTW, could the `max_pages` limit be relaxed a bit (in that it would be
>>> possible to return more pages if they all merge into the last vector
>>> entry) so that it would not need to split otherwise-matching huge
>>> page? It would remove the need for this special handling in the kernel
>>> and splitting the page by this read-only-appearing ioctl?
>> No, this cannot be done. Otherwise we'll not be able to emulate Windows
>> syscall GetWriteWatch() which specifies the exact number of pages. Usually
>> in most of cases, either user will not use THP or not perform the operation
>> on partial huge page. So this part is only there to keep things correct for
>> those users who do use THP and partial pagemap_scan operations.
> 
> I see that `GetWriteWatch` returns a list of pages not ranges of
> pages. That makes sense (more or less). (BTW, It could be emulated in
> userspace by caching the last not-fully-consumed range.)
First of all, caching is avoided as then state maintained is needed. This
is probably not accepted in Wine upstream later. Secondly, even if we have
cache, Get + WP operation would not be accurate when we ask only N pages,
but it gets N + X pages where X pages will not be consumed by the
application at this time.

> 
>>>> +                */
>>>> +               if (is_written && PM_SCAN_OP_IS_WP(p) &&
>>>> +                   ((end - start < HPAGE_SIZE) ||
>>>> +                    (p->max_pages &&
>>>> +                     (p->max_pages - p->found_pages) < n_pages))) {
>>>> +
>>>> +                       split_huge_pmd(vma, pmd, start);
>>>> +                       goto process_smaller_pages;
>>>> +               }
>>>> +
>>>> +               if (p->max_pages &&
>>>> +                   p->found_pages + n_pages > p->max_pages)
>>>> +                       n_pages = p->max_pages - p->found_pages;
>>>> +
>>>> +               ret = pagemap_scan_output(is_written, is_file, is_present,
>>>> +                                         is_swap, p, start, n_pages);
>>>> +               if (ret < 0)
>>>> +                       return ret;
> 
> So let's simplify this:
> 
> if (p->max_pages && n_pages > max_pages - found_pages)
>   n_pages = max_pages - found_pages;
> 
> if (is_written && DO_WP && n_pages != HPAGE_SIZE / PAGE_SIZE) {
>   split_thp();
>   goto process_smaller_pages;
> }
Clever!! This looks very sleek.

> 
> BTW, THP handling could be extracted to a function that would return
> -EAGAIN if it has split the page or it wasn't a THP -- and that would
> mean `goto process_smaller_pages`.
Other functions in this file handle the THP in this same way. So it feels
like more intuitive that we follow to same pattern in this file.

> 
>>> Why not propagate the error from uffd_wp_range()?
>> uffd_wp_range() returns status in long variable. We cannot return long in
>> this function. So intead of type casting long to int and then return I've
>> used -EINVAL. Would following be more suitable?
>>
>> long ret2 = uffd_wp_range(vma, start, HPAGE_SIZE, true);
>> if (ret2 < 0)
>>         return (int)ret2;
> 
> I think it's ok, since negative values are expected to be error codes.
> And since you can't overflow int with HPAGE_SIZE pages, then I
> wouldn't use `ret2` but cast the return and add a comment why it's
> safe.
I'll update.

> 
> [...]
>>>> +       start = (unsigned long)untagged_addr(arg.start);
>>>> +       vec = (struct page_region *)(unsigned long)untagged_addr(arg.vec);
>>>
>>> Is the inner cast needed?
>> arg.vec remains 64-bit on 32-bit systems. So casting 64bit value directly
>> to struct page_region pointer errors out. So I've added specific casting to
>> unsigned long first before casting to pointers.
> 
> I see. So to convey the intention, the `arg.start` and `arg.vec`
> should be casted to unsigned long, not the `untagged_addr()` return
> values.
I'll update.

> 
>>>> +       ret = pagemap_scan_args_valid(&arg, start, vec);
>>>> +       if (ret)
>>>> +               return ret;
>>>> +
>>>> +       end = start + arg.len;
>>>> +       p.max_pages = arg.max_pages;
>>>> +       p.found_pages = 0;
>>>> +       p.flags = arg.flags;
>>>> +       p.required_mask = arg.required_mask;
>>>> +       p.anyof_mask = arg.anyof_mask;
>>>> +       p.excluded_mask = arg.excluded_mask;
>>>> +       p.return_mask = arg.return_mask;
>>>> +       p.cur.len = 0;
>>>> +       p.cur.start = 0;
>>>> +       p.vec = NULL;
>>>> +       p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
>>>
>>> Nit: parentheses are not needed here, please remove.
>> Will do.
>>
>>>
>>>> +
>>>> +       /*
>>>> +        * Allocate smaller buffer to get output from inside the page walk
>>>> +        * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
>>>> +        * we want to return output to user in compact form where no two
>>>> +        * consecutive regions should be continuous and have the same flags.
>>>> +        * So store the latest element in p.cur between different walks and
>>>> +        * store the p.cur at the end of the walk to the user buffer.
>>>> +        */
>>>> +       p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
>>>> +                             GFP_KERNEL);
>>>> +       if (!p.vec)
>>>> +               return -ENOMEM;
>>>> +
>>>> +       walk_start = walk_end = start;
>>>> +       while (walk_end < end && !ret) {
>>>
>>> The loop will stop if a previous iteration returned ENOSPC (and the
>>> error will be lost) - is it intended?
>> It is intentional. -ENOSPC means that the user buffer is full even though
>> there was more memory to walk over. We don't treat this error. So when
>> buffer gets full, we stop walking over further as user buffer has gotten
>> full and return as success.
> 
> Thanks. What's the difference between -ENOSPC and
> PM_SCAN_FOUND_MAX_PAGES? They seem to result in the same effect (code
> flow).
-ENOSPC --> user buffer has been filled completely
PM_SCAN_FOUND_MAX_PAGES --> max_pages have been found, user buffer may
			    still have more space

> 
> [...]
>>>> --- a/include/linux/userfaultfd_k.h
>>>> +++ b/include/linux/userfaultfd_k.h
>>>> @@ -210,6 +210,14 @@ extern bool userfaultfd_wp_async(struct vm_area_struct *vma);
>>>>
>>>>  #else /* CONFIG_USERFAULTFD */
>>>>
>>>> +static inline long uffd_wp_range(struct mm_struct *dst_mm,
>>>> +                                struct vm_area_struct *vma,
>>>> +                                unsigned long start, unsigned long len,
>>>> +                                bool enable_wp)
>>>> +{
>>>> +       return 0;
>>>> +}
> [...]
>>> Shouldn't this part be in the patch introducing uffd_wp_range()?
>> We have not added uffd_wp_range() in previous patches. We just need this
>> stub for this patch for the case when CONFIG_USERFAULTFD isn't enabled.
>>
>> I'd this as separate patch before this patch. Mike asked me to merge it
>> with this patch in order not to break bisectability.
>> [1] https://lore.kernel.org/all/ZBK+86eMMazwfhdx@kernel.org
> 
> I would understand the reply [1] to mean that the uffd_wp_range() stub
> should go in the same patch where uffd_wp_range() is implemented. But
> uffd_wp_range() is already in (since f369b07c86140) so I don't see how
> having the stub in a separate commit sequenced before this one could
> break bisect?
Sorry, I mis-interpreted it. I'll make it a separate patch.

> 
> Best Regards
> Michał Mirosław
Muhammad Usama Anjum April 6, 2023, 9:12 p.m. UTC | #3
On 4/7/23 1:12 AM, Michał Mirosław wrote:
> On Thu, 6 Apr 2023 at 09:40, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
> [...]
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
> [...]
>> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
>> +                                 unsigned long end, struct mm_walk *walk)
>> +{
>> +       struct pagemap_scan_private *p = walk->private;
>> +       bool is_written, is_file, is_present, is_swap;
>> +       struct vm_area_struct *vma = walk->vma;
>> +       unsigned long addr = end;
>> +       spinlock_t *ptl;
>> +       int ret = 0;
>> +       pte_t *pte;
>> +
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +       ptl = pmd_trans_huge_lock(pmd, vma);
>> +       if (ptl) {
> [...]
>> +               return ret;
>> +       }
>> +process_smaller_pages:
>> +       if (pmd_trans_unstable(pmd))
>> +               return 0;
> 
> Why pmd_trans_unstable() is needed here and not only after split_huge_pmd()?
I'm not entirely sure. But the idea is if THP is unstable, we should
return. As it doesn't seem like after splitting THP can be unstable, we
should not check it. Do you agree with the following?


--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1957,11 +1957,11 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd,
unsigned long start,

 		return ret;
 	}
-process_smaller_pages:
+
 	if (pmd_trans_unstable(pmd))
 		return 0;
 #endif
-
+process_smaller_pages:
 	for (addr = start; addr < end && !ret; pte++, addr += PAGE_SIZE)
Michał Mirosław April 7, 2023, 7:23 a.m. UTC | #4
On Thu, 6 Apr 2023 at 23:12, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
> On 4/7/23 1:12 AM, Michał Mirosław wrote:
> > On Thu, 6 Apr 2023 at 09:40, Muhammad Usama Anjum
> > <usama.anjum@collabora.com> wrote:
> > [...]
> >> --- a/fs/proc/task_mmu.c
> >> +++ b/fs/proc/task_mmu.c
> > [...]
> >> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> >> +                                 unsigned long end, struct mm_walk *walk)
> >> +{
[...]
> >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >> +       ptl = pmd_trans_huge_lock(pmd, vma);
> >> +       if (ptl) {
> > [...]
> >> +               return ret;
> >> +       }
> >> +process_smaller_pages:
> >> +       if (pmd_trans_unstable(pmd))
> >> +               return 0;
> >
> > Why pmd_trans_unstable() is needed here and not only after split_huge_pmd()?
> I'm not entirely sure. But the idea is if THP is unstable, we should
> return. As it doesn't seem like after splitting THP can be unstable, we
> should not check it. Do you agree with the following?

The description of pmd_trans_unstable() [1] seems to indicate that it
is needed only after split_huge_pmd().

[1] https://elixir.bootlin.com/linux/v6.3-rc5/source/include/linux/pgtable.h#L1394

Best Regards
Michał Mirosław
Michał Mirosław April 7, 2023, 7:34 a.m. UTC | #5
On Thu, 6 Apr 2023 at 23:04, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
> On 4/7/23 1:00 AM, Michał Mirosław wrote:
> > On Thu, 6 Apr 2023 at 19:58, Muhammad Usama Anjum
> > <usama.anjum@collabora.com> wrote:
[...]
> >>>> +               cur->len += n_pages;
> >>>> +               p->found_pages += n_pages;
> >>>> +
> >>>> +               if (p->max_pages && (p->found_pages == p->max_pages))
> >>>> +                       return PM_SCAN_FOUND_MAX_PAGES;
> >>>> +
> >>>> +               return 0;
> >>>> +       }
> >>>> +
> >>>> +       if (!p->vec_index || ((p->vec_index + 1) < p->vec_len)) {
> >>>
> >>> It looks that `if (p->vec_index < p->vec_len)` is enough here - if we
> >>> have vec_len == 0 here, then we'd not fit the entry in the userspace
> >>> buffer anyway. Am I missing something?
> >> No. I'd explained it with diagram last time:
> >> https://lore.kernel.org/all/3c8d9ea0-1382-be0c-8dd2-d490eedd3b55@collabora.com
> >>
> >> I'll add a concise comment here.
> >
> > So it seems, but I think the code changed a bit and maybe could be
> > simplified now? Since p->vec_len == 0 is currently not valid, the
> > field could count only the entries available in p->vec[] -- IOW: not
> > include p->cur in the count.
> I see. But this'll not work as we need to count p->cur to don't go above
> the maximum count, p->vec_size.

You can subtract 1 from p->vec_size before the page walk to account
for the buffer in `cur`.

[...]
> >>>> +static inline int pagemap_scan_deposit(struct pagemap_scan_private *p,
> >>>> +                                      struct page_region __user *vec,
> >>>> +                                      unsigned long *vec_index)
> >>>
> >>> ..._deposit() is used only in single place - please inline.
> >> It is already inline.
> >
> > Sorry. I mean: please paste the code in place of the single call.
> I've made it a separate function to make the code look better in the caller
> function and logically easier to understand. This function is ugly.
> do_pagemap_scan() is also already very long function with lots of things
> happening. If you still insist, I'll remove this function.

Please do remove - it will make the copy to userspace code all neatly together.

[...]
> >>>> +                */
> >>>> +               if (is_written && PM_SCAN_OP_IS_WP(p) &&
> >>>> +                   ((end - start < HPAGE_SIZE) ||
> >>>> +                    (p->max_pages &&
> >>>> +                     (p->max_pages - p->found_pages) < n_pages))) {
> >>>> +
> >>>> +                       split_huge_pmd(vma, pmd, start);
> >>>> +                       goto process_smaller_pages;
> >>>> +               }
> >>>> +
> >>>> +               if (p->max_pages &&
> >>>> +                   p->found_pages + n_pages > p->max_pages)
> >>>> +                       n_pages = p->max_pages - p->found_pages;
> >>>> +
> >>>> +               ret = pagemap_scan_output(is_written, is_file, is_present,
> >>>> +                                         is_swap, p, start, n_pages);
> >>>> +               if (ret < 0)
> >>>> +                       return ret;
> >
> > So let's simplify this:
> >
> > if (p->max_pages && n_pages > max_pages - found_pages)
> >   n_pages = max_pages - found_pages;
> >
> > if (is_written && DO_WP && n_pages != HPAGE_SIZE / PAGE_SIZE) {
> >   split_thp();
> >   goto process_smaller_pages;
> > }
> Clever!! This looks very sleek.
>
> >
> > BTW, THP handling could be extracted to a function that would return
> > -EAGAIN if it has split the page or it wasn't a THP -- and that would
> > mean `goto process_smaller_pages`.
> Other functions in this file handle the THP in this same way. So it feels
> like more intuitive that we follow to same pattern in this file.

I'll leave it to you. Extracting THP support would avoid a goto and
#ifdef inside a function, though (and make the function smaller).

> >>>> +       /*
> >>>> +        * Allocate smaller buffer to get output from inside the page walk
> >>>> +        * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
> >>>> +        * we want to return output to user in compact form where no two
> >>>> +        * consecutive regions should be continuous and have the same flags.
> >>>> +        * So store the latest element in p.cur between different walks and
> >>>> +        * store the p.cur at the end of the walk to the user buffer.
> >>>> +        */
> >>>> +       p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
> >>>> +                             GFP_KERNEL);
> >>>> +       if (!p.vec)
> >>>> +               return -ENOMEM;
> >>>> +
> >>>> +       walk_start = walk_end = start;
> >>>> +       while (walk_end < end && !ret) {
> >>>
> >>> The loop will stop if a previous iteration returned ENOSPC (and the
> >>> error will be lost) - is it intended?
> >> It is intentional. -ENOSPC means that the user buffer is full even though
> >> there was more memory to walk over. We don't treat this error. So when
> >> buffer gets full, we stop walking over further as user buffer has gotten
> >> full and return as success.
> >
> > Thanks. What's the difference between -ENOSPC and
> > PM_SCAN_FOUND_MAX_PAGES? They seem to result in the same effect (code
> > flow).
> -ENOSPC --> user buffer has been filled completely
> PM_SCAN_FOUND_MAX_PAGES --> max_pages have been found, user buffer may
>                             still have more space

What is the difference in code behaviour when those two cases are
compared? (I'd expect none.)

Best Regards
Michał Mirosław
Muhammad Usama Anjum April 7, 2023, 9:35 a.m. UTC | #6
On 4/7/23 12:23 PM, Michał Mirosław wrote:
> On Thu, 6 Apr 2023 at 23:12, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
>> On 4/7/23 1:12 AM, Michał Mirosław wrote:
>>> On Thu, 6 Apr 2023 at 09:40, Muhammad Usama Anjum
>>> <usama.anjum@collabora.com> wrote:
>>> [...]
>>>> --- a/fs/proc/task_mmu.c
>>>> +++ b/fs/proc/task_mmu.c
>>> [...]
>>>> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
>>>> +                                 unsigned long end, struct mm_walk *walk)
>>>> +{
> [...]
>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>> +       ptl = pmd_trans_huge_lock(pmd, vma);
>>>> +       if (ptl) {
>>> [...]
>>>> +               return ret;
>>>> +       }
>>>> +process_smaller_pages:
>>>> +       if (pmd_trans_unstable(pmd))
>>>> +               return 0;
>>>
>>> Why pmd_trans_unstable() is needed here and not only after split_huge_pmd()?
>> I'm not entirely sure. But the idea is if THP is unstable, we should
>> return. As it doesn't seem like after splitting THP can be unstable, we
>> should not check it. Do you agree with the following?
> 
> The description of pmd_trans_unstable() [1] seems to indicate that it
> is needed only after split_huge_pmd().
> 
> [1] https://elixir.bootlin.com/linux/v6.3-rc5/source/include/linux/pgtable.h#L1394
Sorry, yeah pmd_trans_unstable() is need after split. But it is also needed
in normal case when ptl is NULL to rule out the case if pmd is unstable
before performing operation on normal pages:

ptl = pmd_trans_huge_lock(pmd, vma);
if (ptl) {
...
}
if (pmd_trans_unstable(pmd))
	return 0;

This file has usage examples of pmd_trans_unstable():

https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L634
https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1195
https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1543
https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1887

So we are good with what we have in this patch.

> 
> Best Regards
> Michał Mirosław
Michał Mirosław April 7, 2023, 10:04 a.m. UTC | #7
On Fri, 7 Apr 2023 at 11:35, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
> On 4/7/23 12:23 PM, Michał Mirosław wrote:
> > On Thu, 6 Apr 2023 at 23:12, Muhammad Usama Anjum
> > <usama.anjum@collabora.com> wrote:
> >> On 4/7/23 1:12 AM, Michał Mirosław wrote:
> >>> On Thu, 6 Apr 2023 at 09:40, Muhammad Usama Anjum
> >>> <usama.anjum@collabora.com> wrote:
> >>> [...]
> >>>> --- a/fs/proc/task_mmu.c
> >>>> +++ b/fs/proc/task_mmu.c
> >>> [...]
> >>>> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> >>>> +                                 unsigned long end, struct mm_walk *walk)
> >>>> +{
> > [...]
> >>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>>> +       ptl = pmd_trans_huge_lock(pmd, vma);
> >>>> +       if (ptl) {
> >>> [...]
> >>>> +               return ret;
> >>>> +       }
> >>>> +process_smaller_pages:
> >>>> +       if (pmd_trans_unstable(pmd))
> >>>> +               return 0;
> >>>
> >>> Why pmd_trans_unstable() is needed here and not only after split_huge_pmd()?
> >> I'm not entirely sure. But the idea is if THP is unstable, we should
> >> return. As it doesn't seem like after splitting THP can be unstable, we
> >> should not check it. Do you agree with the following?
> >
> > The description of pmd_trans_unstable() [1] seems to indicate that it
> > is needed only after split_huge_pmd().
> >
> > [1] https://elixir.bootlin.com/linux/v6.3-rc5/source/include/linux/pgtable.h#L1394
> Sorry, yeah pmd_trans_unstable() is need after split. But it is also needed
> in normal case when ptl is NULL to rule out the case if pmd is unstable
> before performing operation on normal pages:
>
> ptl = pmd_trans_huge_lock(pmd, vma);
> if (ptl) {
> ...
> }
> if (pmd_trans_unstable(pmd))
>         return 0;
>
> This file has usage examples of pmd_trans_unstable():
>
> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L634
> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1195
> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1543
> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1887
>
> So we are good with what we have in this patch.

Shouldn't we signal ACTION_AGAIN then in order to call .pte_hole?

Best Regards
Michał Mirosław
Muhammad Usama Anjum April 7, 2023, 10:04 a.m. UTC | #8
On 4/7/23 12:34 PM, Michał Mirosław wrote:
> On Thu, 6 Apr 2023 at 23:04, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
>> On 4/7/23 1:00 AM, Michał Mirosław wrote:
>>> On Thu, 6 Apr 2023 at 19:58, Muhammad Usama Anjum
>>> <usama.anjum@collabora.com> wrote:
> [...]
>>>>>> +               cur->len += n_pages;
>>>>>> +               p->found_pages += n_pages;
>>>>>> +
>>>>>> +               if (p->max_pages && (p->found_pages == p->max_pages))
>>>>>> +                       return PM_SCAN_FOUND_MAX_PAGES;
>>>>>> +
>>>>>> +               return 0;
>>>>>> +       }
>>>>>> +
>>>>>> +       if (!p->vec_index || ((p->vec_index + 1) < p->vec_len)) {
>>>>>
>>>>> It looks that `if (p->vec_index < p->vec_len)` is enough here - if we
>>>>> have vec_len == 0 here, then we'd not fit the entry in the userspace
>>>>> buffer anyway. Am I missing something?
>>>> No. I'd explained it with diagram last time:
>>>> https://lore.kernel.org/all/3c8d9ea0-1382-be0c-8dd2-d490eedd3b55@collabora.com
>>>>
>>>> I'll add a concise comment here.
>>>
>>> So it seems, but I think the code changed a bit and maybe could be
>>> simplified now? Since p->vec_len == 0 is currently not valid, the
>>> field could count only the entries available in p->vec[] -- IOW: not
>>> include p->cur in the count.
>> I see. But this'll not work as we need to count p->cur to don't go above
>> the maximum count, p->vec_size.
> 
> You can subtract 1 from p->vec_size before the page walk to account
> for the buffer in `cur`.
Yeah, it can be done. But I've thought about it. It would look more uglier.
I'll have to subtract 1 from vec_len in the start and then add 1 in the
end. The current algorithm seems simpler.

> 
> [...]
>>>>>> +static inline int pagemap_scan_deposit(struct pagemap_scan_private *p,
>>>>>> +                                      struct page_region __user *vec,
>>>>>> +                                      unsigned long *vec_index)
>>>>>
>>>>> ..._deposit() is used only in single place - please inline.
>>>> It is already inline.
>>>
>>> Sorry. I mean: please paste the code in place of the single call.
>> I've made it a separate function to make the code look better in the caller
>> function and logically easier to understand. This function is ugly.
>> do_pagemap_scan() is also already very long function with lots of things
>> happening. If you still insist, I'll remove this function.
> 
> Please do remove - it will make the copy to userspace code all neatly together.
Will remove it.

> 
> [...]
>>>>>> +                */
>>>>>> +               if (is_written && PM_SCAN_OP_IS_WP(p) &&
>>>>>> +                   ((end - start < HPAGE_SIZE) ||
>>>>>> +                    (p->max_pages &&
>>>>>> +                     (p->max_pages - p->found_pages) < n_pages))) {
>>>>>> +
>>>>>> +                       split_huge_pmd(vma, pmd, start);
>>>>>> +                       goto process_smaller_pages;
>>>>>> +               }
>>>>>> +
>>>>>> +               if (p->max_pages &&
>>>>>> +                   p->found_pages + n_pages > p->max_pages)
>>>>>> +                       n_pages = p->max_pages - p->found_pages;
>>>>>> +
>>>>>> +               ret = pagemap_scan_output(is_written, is_file, is_present,
>>>>>> +                                         is_swap, p, start, n_pages);
>>>>>> +               if (ret < 0)
>>>>>> +                       return ret;
>>>
>>> So let's simplify this:
>>>
>>> if (p->max_pages && n_pages > max_pages - found_pages)
>>>   n_pages = max_pages - found_pages;
>>>
>>> if (is_written && DO_WP && n_pages != HPAGE_SIZE / PAGE_SIZE) {
>>>   split_thp();
>>>   goto process_smaller_pages;
>>> }
>> Clever!! This looks very sleek.
>>
>>>
>>> BTW, THP handling could be extracted to a function that would return
>>> -EAGAIN if it has split the page or it wasn't a THP -- and that would
>>> mean `goto process_smaller_pages`.
>> Other functions in this file handle the THP in this same way. So it feels
>> like more intuitive that we follow to same pattern in this file.
> 
> I'll leave it to you. Extracting THP support would avoid a goto and
> #ifdef inside a function, though (and make the function smaller).
> 
>>>>>> +       /*
>>>>>> +        * Allocate smaller buffer to get output from inside the page walk
>>>>>> +        * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
>>>>>> +        * we want to return output to user in compact form where no two
>>>>>> +        * consecutive regions should be continuous and have the same flags.
>>>>>> +        * So store the latest element in p.cur between different walks and
>>>>>> +        * store the p.cur at the end of the walk to the user buffer.
>>>>>> +        */
>>>>>> +       p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
>>>>>> +                             GFP_KERNEL);
>>>>>> +       if (!p.vec)
>>>>>> +               return -ENOMEM;
>>>>>> +
>>>>>> +       walk_start = walk_end = start;
>>>>>> +       while (walk_end < end && !ret) {
>>>>>
>>>>> The loop will stop if a previous iteration returned ENOSPC (and the
>>>>> error will be lost) - is it intended?
>>>> It is intentional. -ENOSPC means that the user buffer is full even though
>>>> there was more memory to walk over. We don't treat this error. So when
>>>> buffer gets full, we stop walking over further as user buffer has gotten
>>>> full and return as success.
>>>
>>> Thanks. What's the difference between -ENOSPC and
>>> PM_SCAN_FOUND_MAX_PAGES? They seem to result in the same effect (code
>>> flow).
>> -ENOSPC --> user buffer has been filled completely
>> PM_SCAN_FOUND_MAX_PAGES --> max_pages have been found, user buffer may
>>                             still have more space
> 
> What is the difference in code behaviour when those two cases are
> compared? (I'd expect none.)
There is difference:
We add data to user buffer. If it succeeds with return code 0, we engage
the WP. If it succeeds with PM_SCAN_FOUND_MAX_PAGES, we still engage the
WP. But if we get -ENOSPC, we don't perform engage as the data wasn't added
to the user buffer.

> 
> Best Regards
> Michał Mirosław
Michał Mirosław April 7, 2023, 10:14 a.m. UTC | #9
On Fri, 7 Apr 2023 at 12:04, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
> On 4/7/23 12:34 PM, Michał Mirosław wrote:
> > On Thu, 6 Apr 2023 at 23:04, Muhammad Usama Anjum
> > <usama.anjum@collabora.com> wrote:
> >> On 4/7/23 1:00 AM, Michał Mirosław wrote:
> >>> On Thu, 6 Apr 2023 at 19:58, Muhammad Usama Anjum
> >>> <usama.anjum@collabora.com> wrote:
[...]
> >>>>>> +       /*
> >>>>>> +        * Allocate smaller buffer to get output from inside the page walk
> >>>>>> +        * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
> >>>>>> +        * we want to return output to user in compact form where no two
> >>>>>> +        * consecutive regions should be continuous and have the same flags.
> >>>>>> +        * So store the latest element in p.cur between different walks and
> >>>>>> +        * store the p.cur at the end of the walk to the user buffer.
> >>>>>> +        */
> >>>>>> +       p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
> >>>>>> +                             GFP_KERNEL);
> >>>>>> +       if (!p.vec)
> >>>>>> +               return -ENOMEM;
> >>>>>> +
> >>>>>> +       walk_start = walk_end = start;
> >>>>>> +       while (walk_end < end && !ret) {
> >>>>>
> >>>>> The loop will stop if a previous iteration returned ENOSPC (and the
> >>>>> error will be lost) - is it intended?
> >>>> It is intentional. -ENOSPC means that the user buffer is full even though
> >>>> there was more memory to walk over. We don't treat this error. So when
> >>>> buffer gets full, we stop walking over further as user buffer has gotten
> >>>> full and return as success.
> >>>
> >>> Thanks. What's the difference between -ENOSPC and
> >>> PM_SCAN_FOUND_MAX_PAGES? They seem to result in the same effect (code
> >>> flow).
> >> -ENOSPC --> user buffer has been filled completely
> >> PM_SCAN_FOUND_MAX_PAGES --> max_pages have been found, user buffer may
> >>                             still have more space
> >
> > What is the difference in code behaviour when those two cases are
> > compared? (I'd expect none.)
> There is difference:
> We add data to user buffer. If it succeeds with return code 0, we engage
> the WP. If it succeeds with PM_SCAN_FOUND_MAX_PAGES, we still engage the
> WP. But if we get -ENOSPC, we don't perform engage as the data wasn't added
> to the user buffer.

Thanks! I see it now. I see a few more corner cases here:
1. If we did engage WP but fail to copy the vector we return -EFAULT
but the WP is already engaged. I'm not sure this is something worth
guarding against, but documenting that would be helpful I think.
2. If uffd_wp_range() fails, but we have already processed pages
earlier, we should treat the error like ENOSPC and back out the failed
range (the earier changes would be lost otherwise).

Best Regards
Michał Mirosław
Muhammad Usama Anjum April 7, 2023, 10:14 a.m. UTC | #10
On 4/7/23 3:04 PM, Michał Mirosław wrote:
> On Fri, 7 Apr 2023 at 11:35, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
>> On 4/7/23 12:23 PM, Michał Mirosław wrote:
>>> On Thu, 6 Apr 2023 at 23:12, Muhammad Usama Anjum
>>> <usama.anjum@collabora.com> wrote:
>>>> On 4/7/23 1:12 AM, Michał Mirosław wrote:
>>>>> On Thu, 6 Apr 2023 at 09:40, Muhammad Usama Anjum
>>>>> <usama.anjum@collabora.com> wrote:
>>>>> [...]
>>>>>> --- a/fs/proc/task_mmu.c
>>>>>> +++ b/fs/proc/task_mmu.c
>>>>> [...]
>>>>>> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
>>>>>> +                                 unsigned long end, struct mm_walk *walk)
>>>>>> +{
>>> [...]
>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>>> +       ptl = pmd_trans_huge_lock(pmd, vma);
>>>>>> +       if (ptl) {
>>>>> [...]
>>>>>> +               return ret;
>>>>>> +       }
>>>>>> +process_smaller_pages:
>>>>>> +       if (pmd_trans_unstable(pmd))
>>>>>> +               return 0;
>>>>>
>>>>> Why pmd_trans_unstable() is needed here and not only after split_huge_pmd()?
>>>> I'm not entirely sure. But the idea is if THP is unstable, we should
>>>> return. As it doesn't seem like after splitting THP can be unstable, we
>>>> should not check it. Do you agree with the following?
>>>
>>> The description of pmd_trans_unstable() [1] seems to indicate that it
>>> is needed only after split_huge_pmd().
>>>
>>> [1] https://elixir.bootlin.com/linux/v6.3-rc5/source/include/linux/pgtable.h#L1394
>> Sorry, yeah pmd_trans_unstable() is need after split. But it is also needed
>> in normal case when ptl is NULL to rule out the case if pmd is unstable
>> before performing operation on normal pages:
>>
>> ptl = pmd_trans_huge_lock(pmd, vma);
>> if (ptl) {
>> ...
>> }
>> if (pmd_trans_unstable(pmd))
>>         return 0;
>>
>> This file has usage examples of pmd_trans_unstable():
>>
>> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L634
>> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1195
>> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1543
>> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1887
>>
>> So we are good with what we have in this patch.
> 
> Shouldn't we signal ACTION_AGAIN then in order to call .pte_hole?
I'm not sure. I've not done research on it if we need to signal
ACTION_AGAIN as this function pagemap_scan_pmd_entry() mimics how
pagemap_pmd_range() handles reads to the pagemap file. pagemap_pmd_range()
isn't doing anything if pmd is unstable. Hence we also not doing anything.

> 
> Best Regards
> Michał Mirosław
Michał Mirosław April 7, 2023, 10:21 a.m. UTC | #11
On Fri, 7 Apr 2023 at 12:15, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
> On 4/7/23 3:04 PM, Michał Mirosław wrote:
> > On Fri, 7 Apr 2023 at 11:35, Muhammad Usama Anjum
> > <usama.anjum@collabora.com> wrote:
> >> On 4/7/23 12:23 PM, Michał Mirosław wrote:
> >>> On Thu, 6 Apr 2023 at 23:12, Muhammad Usama Anjum
> >>> <usama.anjum@collabora.com> wrote:
> >>>> On 4/7/23 1:12 AM, Michał Mirosław wrote:
> >>>>> On Thu, 6 Apr 2023 at 09:40, Muhammad Usama Anjum
> >>>>> <usama.anjum@collabora.com> wrote:
> >>>>> [...]
> >>>>>> --- a/fs/proc/task_mmu.c
> >>>>>> +++ b/fs/proc/task_mmu.c
> >>>>> [...]
> >>>>>> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> >>>>>> +                                 unsigned long end, struct mm_walk *walk)
> >>>>>> +{
> >>> [...]
> >>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>>>>> +       ptl = pmd_trans_huge_lock(pmd, vma);
> >>>>>> +       if (ptl) {
> >>>>> [...]
> >>>>>> +               return ret;
> >>>>>> +       }
> >>>>>> +process_smaller_pages:
> >>>>>> +       if (pmd_trans_unstable(pmd))
> >>>>>> +               return 0;
> >>>>>
> >>>>> Why pmd_trans_unstable() is needed here and not only after split_huge_pmd()?
> >>>> I'm not entirely sure. But the idea is if THP is unstable, we should
> >>>> return. As it doesn't seem like after splitting THP can be unstable, we
> >>>> should not check it. Do you agree with the following?
> >>>
> >>> The description of pmd_trans_unstable() [1] seems to indicate that it
> >>> is needed only after split_huge_pmd().
> >>>
> >>> [1] https://elixir.bootlin.com/linux/v6.3-rc5/source/include/linux/pgtable.h#L1394
> >> Sorry, yeah pmd_trans_unstable() is need after split. But it is also needed
> >> in normal case when ptl is NULL to rule out the case if pmd is unstable
> >> before performing operation on normal pages:
> >>
> >> ptl = pmd_trans_huge_lock(pmd, vma);
> >> if (ptl) {
> >> ...
> >> }
> >> if (pmd_trans_unstable(pmd))
> >>         return 0;
> >>
> >> This file has usage examples of pmd_trans_unstable():
> >>
> >> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L634
> >> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1195
> >> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1543
> >> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1887
> >>
> >> So we are good with what we have in this patch.
> >
> > Shouldn't we signal ACTION_AGAIN then in order to call .pte_hole?
> I'm not sure. I've not done research on it if we need to signal
> ACTION_AGAIN as this function pagemap_scan_pmd_entry() mimics how
> pagemap_pmd_range() handles reads to the pagemap file. pagemap_pmd_range()
> isn't doing anything if pmd is unstable. Hence we also not doing anything.

Doesn't this mean that if we scan a file-backed vma we would miss
non-present parts of the mapping in the output?

Best Regards
Michał Mirosław
Muhammad Usama Anjum April 7, 2023, 10:50 a.m. UTC | #12
On 4/7/23 3:21 PM, Michał Mirosław wrote:
> On Fri, 7 Apr 2023 at 12:15, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
>> On 4/7/23 3:04 PM, Michał Mirosław wrote:
>>> On Fri, 7 Apr 2023 at 11:35, Muhammad Usama Anjum
>>> <usama.anjum@collabora.com> wrote:
>>>> On 4/7/23 12:23 PM, Michał Mirosław wrote:
>>>>> On Thu, 6 Apr 2023 at 23:12, Muhammad Usama Anjum
>>>>> <usama.anjum@collabora.com> wrote:
>>>>>> On 4/7/23 1:12 AM, Michał Mirosław wrote:
>>>>>>> On Thu, 6 Apr 2023 at 09:40, Muhammad Usama Anjum
>>>>>>> <usama.anjum@collabora.com> wrote:
>>>>>>> [...]
>>>>>>>> --- a/fs/proc/task_mmu.c
>>>>>>>> +++ b/fs/proc/task_mmu.c
>>>>>>> [...]
>>>>>>>> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
>>>>>>>> +                                 unsigned long end, struct mm_walk *walk)
>>>>>>>> +{
>>>>> [...]
>>>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>>>>> +       ptl = pmd_trans_huge_lock(pmd, vma);
>>>>>>>> +       if (ptl) {
>>>>>>> [...]
>>>>>>>> +               return ret;
>>>>>>>> +       }
>>>>>>>> +process_smaller_pages:
>>>>>>>> +       if (pmd_trans_unstable(pmd))
>>>>>>>> +               return 0;
>>>>>>>
>>>>>>> Why pmd_trans_unstable() is needed here and not only after split_huge_pmd()?
>>>>>> I'm not entirely sure. But the idea is if THP is unstable, we should
>>>>>> return. As it doesn't seem like after splitting THP can be unstable, we
>>>>>> should not check it. Do you agree with the following?
>>>>>
>>>>> The description of pmd_trans_unstable() [1] seems to indicate that it
>>>>> is needed only after split_huge_pmd().
>>>>>
>>>>> [1] https://elixir.bootlin.com/linux/v6.3-rc5/source/include/linux/pgtable.h#L1394
>>>> Sorry, yeah pmd_trans_unstable() is need after split. But it is also needed
>>>> in normal case when ptl is NULL to rule out the case if pmd is unstable
>>>> before performing operation on normal pages:
>>>>
>>>> ptl = pmd_trans_huge_lock(pmd, vma);
>>>> if (ptl) {
>>>> ...
>>>> }
>>>> if (pmd_trans_unstable(pmd))
>>>>         return 0;
>>>>
>>>> This file has usage examples of pmd_trans_unstable():
>>>>
>>>> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L634
>>>> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1195
>>>> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1543
>>>> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1887
>>>>
>>>> So we are good with what we have in this patch.
>>>
>>> Shouldn't we signal ACTION_AGAIN then in order to call .pte_hole?
>> I'm not sure. I've not done research on it if we need to signal
>> ACTION_AGAIN as this function pagemap_scan_pmd_entry() mimics how
>> pagemap_pmd_range() handles reads to the pagemap file. pagemap_pmd_range()
>> isn't doing anything if pmd is unstable. Hence we also not doing anything.
> 
> Doesn't this mean that if we scan a file-backed vma we would miss
> non-present parts of the mapping in the output?
I'm trying to mimic the same information through ioctl which is attained by
reading the file. We'll only miss the unstable VMA here. I'm don't know
about how often the PMD is unstable and its effects.


> 
> Best Regards
> Michał Mirosław
Muhammad Usama Anjum April 7, 2023, 11:10 a.m. UTC | #13
On 4/7/23 3:14 PM, Michał Mirosław wrote:
> On Fri, 7 Apr 2023 at 12:04, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
>> On 4/7/23 12:34 PM, Michał Mirosław wrote:
>>> On Thu, 6 Apr 2023 at 23:04, Muhammad Usama Anjum
>>> <usama.anjum@collabora.com> wrote:
>>>> On 4/7/23 1:00 AM, Michał Mirosław wrote:
>>>>> On Thu, 6 Apr 2023 at 19:58, Muhammad Usama Anjum
>>>>> <usama.anjum@collabora.com> wrote:
> [...]
>>>>>>>> +       /*
>>>>>>>> +        * Allocate smaller buffer to get output from inside the page walk
>>>>>>>> +        * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
>>>>>>>> +        * we want to return output to user in compact form where no two
>>>>>>>> +        * consecutive regions should be continuous and have the same flags.
>>>>>>>> +        * So store the latest element in p.cur between different walks and
>>>>>>>> +        * store the p.cur at the end of the walk to the user buffer.
>>>>>>>> +        */
>>>>>>>> +       p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
>>>>>>>> +                             GFP_KERNEL);
>>>>>>>> +       if (!p.vec)
>>>>>>>> +               return -ENOMEM;
>>>>>>>> +
>>>>>>>> +       walk_start = walk_end = start;
>>>>>>>> +       while (walk_end < end && !ret) {
>>>>>>>
>>>>>>> The loop will stop if a previous iteration returned ENOSPC (and the
>>>>>>> error will be lost) - is it intended?
>>>>>> It is intentional. -ENOSPC means that the user buffer is full even though
>>>>>> there was more memory to walk over. We don't treat this error. So when
>>>>>> buffer gets full, we stop walking over further as user buffer has gotten
>>>>>> full and return as success.
>>>>>
>>>>> Thanks. What's the difference between -ENOSPC and
>>>>> PM_SCAN_FOUND_MAX_PAGES? They seem to result in the same effect (code
>>>>> flow).
>>>> -ENOSPC --> user buffer has been filled completely
>>>> PM_SCAN_FOUND_MAX_PAGES --> max_pages have been found, user buffer may
>>>>                             still have more space
>>>
>>> What is the difference in code behaviour when those two cases are
>>> compared? (I'd expect none.)
>> There is difference:
>> We add data to user buffer. If it succeeds with return code 0, we engage
>> the WP. If it succeeds with PM_SCAN_FOUND_MAX_PAGES, we still engage the
>> WP. But if we get -ENOSPC, we don't perform engage as the data wasn't added
>> to the user buffer.
> 
> Thanks! I see it now. I see a few more corner cases here:
> 1. If we did engage WP but fail to copy the vector we return -EFAULT
> but the WP is already engaged. I'm not sure this is something worth
> guarding against, but documenting that would be helpful I think.
Sure.

> 2. If uffd_wp_range() fails, but we have already processed pages
> earlier, we should treat the error like ENOSPC and back out the failed
> range (the earier changes would be lost otherwise).
Backing out is easier to do for hugepages. But for normal pages, we'll have
to write some code to find where the current data was added (in cur or in
vec) and back out from that. I'll have to write some more code to avoid the
side-effects as well.

But aren't we going over-engineering here? Error occurred and we are trying
to keep the previously generated correct data and returning successfully
still to the user? I don't think we should do this. An error is error. We
should return the error simply even if the memory flags would get lost. We
don't know what caused the error in uffd_wp_range(). Under normal
situation, we there shouldn't have had error.


> 
> Best Regards
> Michał Mirosław
Muhammad Usama Anjum April 17, 2023, 11:11 a.m. UTC | #14
On 4/11/23 2:29 PM, Michał Mirosław wrote:
> On Fri, 7 Apr 2023 at 13:11, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
>> On 4/7/23 3:14 PM, Michał Mirosław wrote:
>>> On Fri, 7 Apr 2023 at 12:04, Muhammad Usama Anjum
>>> <usama.anjum@collabora.com> wrote:
>>>> On 4/7/23 12:34 PM, Michał Mirosław wrote:
>>>>> On Thu, 6 Apr 2023 at 23:04, Muhammad Usama Anjum
>>>>> <usama.anjum@collabora.com> wrote:
>>>>>> On 4/7/23 1:00 AM, Michał Mirosław wrote:
>>>>>>> On Thu, 6 Apr 2023 at 19:58, Muhammad Usama Anjum
>>>>>>> <usama.anjum@collabora.com> wrote:
>>> [...]
>>>>>>>>>> +       /*
>>>>>>>>>> +        * Allocate smaller buffer to get output from inside the page walk
>>>>>>>>>> +        * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
>>>>>>>>>> +        * we want to return output to user in compact form where no two
>>>>>>>>>> +        * consecutive regions should be continuous and have the same flags.
>>>>>>>>>> +        * So store the latest element in p.cur between different walks and
>>>>>>>>>> +        * store the p.cur at the end of the walk to the user buffer.
>>>>>>>>>> +        */
>>>>>>>>>> +       p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
>>>>>>>>>> +                             GFP_KERNEL);
>>>>>>>>>> +       if (!p.vec)
>>>>>>>>>> +               return -ENOMEM;
>>>>>>>>>> +
>>>>>>>>>> +       walk_start = walk_end = start;
>>>>>>>>>> +       while (walk_end < end && !ret) {
>>>>>>>>>
>>>>>>>>> The loop will stop if a previous iteration returned ENOSPC (and the
>>>>>>>>> error will be lost) - is it intended?
>>>>>>>> It is intentional. -ENOSPC means that the user buffer is full even though
>>>>>>>> there was more memory to walk over. We don't treat this error. So when
>>>>>>>> buffer gets full, we stop walking over further as user buffer has gotten
>>>>>>>> full and return as success.
>>>>>>>
>>>>>>> Thanks. What's the difference between -ENOSPC and
>>>>>>> PM_SCAN_FOUND_MAX_PAGES? They seem to result in the same effect (code
>>>>>>> flow).
>>>>>> -ENOSPC --> user buffer has been filled completely
>>>>>> PM_SCAN_FOUND_MAX_PAGES --> max_pages have been found, user buffer may
>>>>>>                             still have more space
>>>>>
>>>>> What is the difference in code behaviour when those two cases are
>>>>> compared? (I'd expect none.)
>>>> There is difference:
>>>> We add data to user buffer. If it succeeds with return code 0, we engage
>>>> the WP. If it succeeds with PM_SCAN_FOUND_MAX_PAGES, we still engage the
>>>> WP. But if we get -ENOSPC, we don't perform engage as the data wasn't added
>>>> to the user buffer.
>>>
>>> Thanks! I see it now. I see a few more corner cases here:
>>> 1. If we did engage WP but fail to copy the vector we return -EFAULT
>>> but the WP is already engaged. I'm not sure this is something worth
>>> guarding against, but documenting that would be helpful I think.
>> Sure.
>>
>>> 2. If uffd_wp_range() fails, but we have already processed pages
>>> earlier, we should treat the error like ENOSPC and back out the failed
>>> range (the earier changes would be lost otherwise).
>> Backing out is easier to do for hugepages. But for normal pages, we'll have
>> to write some code to find where the current data was added (in cur or in
>> vec) and back out from that. I'll have to write some more code to avoid the
>> side-effects as well.
> 
> If I read the code correctly, the last page should always be in `cur`
> and on failure only a single page is needed to be backed out. Did I
> miss something?
I'm leaving using uffd_wp_range() in next revision as it is costing
performance. This will not be needed in next revision.

> 
>> But aren't we going over-engineering here? Error occurred and we are trying
>> to keep the previously generated correct data and returning successfully
>> still to the user? I don't think we should do this. An error is error. We
>> should return the error simply even if the memory flags would get lost. We
>> don't know what caused the error in uffd_wp_range(). Under normal
>> situation, we there shouldn't have had error.
> 
> In this case it means that on (intermittent) allocation error we get
> inconsistent or non-deterministic results. I wouldn't want to be the
> one debugging this later - I'd prefer either the syscall be
> "exception-safe" (give consistent and predictable output) or kill the
> process.
> 
> Best Regards
> Michał Mirosław
Muhammad Usama Anjum June 7, 2023, 5:45 a.m. UTC | #15
On 4/6/23 5:56 PM, Muhammad Usama Anjum wrote:
> On 4/6/23 4:40 PM, kernel test robot wrote:
>> Hi Muhammad,
>>
>> kernel test robot noticed the following build errors:
>>
>> [auto build test ERROR on akpm-mm/mm-everything]
>> [also build test ERROR on next-20230406]
>> [cannot apply to linus/master v6.3-rc5]
>> [If your patch is applied to the wrong git tree, kindly drop us a note.
>> And when submitting patch, we suggest to use '--base' as documented in
>> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>>
>> url:    https://github.com/intel-lab-lkp/linux/commits/Muhammad-Usama-Anjum/userfaultfd-UFFD_FEATURE_WP_ASYNC/20230406-154314
>> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
>> patch link:    https://lore.kernel.org/r/20230406074005.1784728-3-usama.anjum%40collabora.com
>> patch subject: [PATCH v12 2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs
>> config: arc-randconfig-r023-20230405 (https://download.01.org/0day-ci/archive/20230406/202304061914.N1Hmx12N-lkp@intel.com/config)
>> compiler: arceb-elf-gcc (GCC) 12.1.0
>> reproduce (this is a W=1 build):
>>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>>         chmod +x ~/bin/make.cross
>>         # https://github.com/intel-lab-lkp/linux/commit/f13abb36f64c77913509da8ca157512d2fb9f031
>>         git remote add linux-review https://github.com/intel-lab-lkp/linux
>>         git fetch --no-tags linux-review Muhammad-Usama-Anjum/userfaultfd-UFFD_FEATURE_WP_ASYNC/20230406-154314
>>         git checkout f13abb36f64c77913509da8ca157512d2fb9f031
>>         # save the config file
>>         mkdir build_dir && cp config build_dir/.config
>>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arc olddefconfig
>>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arc SHELL=/bin/bash fs/proc/
>>
>> If you fix the issue, kindly add following tag where applicable
>> | Reported-by: kernel test robot <lkp@intel.com>
>> | Link: https://lore.kernel.org/oe-kbuild-all/202304061914.N1Hmx12N-lkp@intel.com/
>>
>> All error/warnings (new ones prefixed by >>):
>>
>>    fs/proc/task_mmu.c: In function 'pagemap_scan_pmd_entry':
>>>> fs/proc/task_mmu.c:1921:37: error: 'HPAGE_SIZE' undeclared (first use in this function); did you mean 'PAGE_SIZE'?
>>     1921 |                     ((end - start < HPAGE_SIZE) ||
>>          |                                     ^~~~~~~~~~
>>          |                                     PAGE_SIZE
> It seems arc architecture supports CONFIG_TRANSPARENT_HUGEPAGE, but it
> doesn't define HPAGE_SIZE. It only defines PAGE_SIZE in
> include/uapi/asm/page.h? AFAIK HPAGE_SIZE must be defined when
> CONFIG_TRANSPARENT_HUGEPAGE is enabled. What can be the solution here for arc?
I'm still looking for solution. Vineet do you have some thoughts?

> 
> The remaining build failures are because the wrong tree. I base my patches
> on latest next, while the bot has based patches on mm-everything. I guess
> today's next would have latest mm stuff, a rebase would make things correct
> or I'll shift to mm-everything.
> 
>
Muhammad Usama Anjum June 13, 2023, 10:35 a.m. UTC | #16
Hi Vineet,

It seems arc architecture supports CONFIG_TRANSPARENT_HUGEPAGE, but it
doesn't define HPAGE_SIZE. It only defines PAGE_SIZE in
include/uapi/asm/page.h? AFAIK HPAGE_SIZE must be defined when
CONFIG_TRANSPARENT_HUGEPAGE is enabled. What can be the solution here for arc?

Should I just compile out this code for arc architecture specifically?

Thanks,
Usama


On 6/7/23 10:45 AM, Muhammad Usama Anjum wrote:
> On 4/6/23 5:56 PM, Muhammad Usama Anjum wrote:
>> On 4/6/23 4:40 PM, kernel test robot wrote:
>>> Hi Muhammad,
>>>
>>> kernel test robot noticed the following build errors:
>>>
>>> [auto build test ERROR on akpm-mm/mm-everything]
>>> [also build test ERROR on next-20230406]
>>> [cannot apply to linus/master v6.3-rc5]
>>> [If your patch is applied to the wrong git tree, kindly drop us a note.
>>> And when submitting patch, we suggest to use '--base' as documented in
>>> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>>>
>>> url:    https://github.com/intel-lab-lkp/linux/commits/Muhammad-Usama-Anjum/userfaultfd-UFFD_FEATURE_WP_ASYNC/20230406-154314
>>> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
>>> patch link:    https://lore.kernel.org/r/20230406074005.1784728-3-usama.anjum%40collabora.com
>>> patch subject: [PATCH v12 2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs
>>> config: arc-randconfig-r023-20230405 (https://download.01.org/0day-ci/archive/20230406/202304061914.N1Hmx12N-lkp@intel.com/config)
>>> compiler: arceb-elf-gcc (GCC) 12.1.0
>>> reproduce (this is a W=1 build):
>>>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>>>         chmod +x ~/bin/make.cross
>>>         # https://github.com/intel-lab-lkp/linux/commit/f13abb36f64c77913509da8ca157512d2fb9f031
>>>         git remote add linux-review https://github.com/intel-lab-lkp/linux
>>>         git fetch --no-tags linux-review Muhammad-Usama-Anjum/userfaultfd-UFFD_FEATURE_WP_ASYNC/20230406-154314
>>>         git checkout f13abb36f64c77913509da8ca157512d2fb9f031
>>>         # save the config file
>>>         mkdir build_dir && cp config build_dir/.config
>>>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arc olddefconfig
>>>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arc SHELL=/bin/bash fs/proc/
>>>
>>> If you fix the issue, kindly add following tag where applicable
>>> | Reported-by: kernel test robot <lkp@intel.com>
>>> | Link: https://lore.kernel.org/oe-kbuild-all/202304061914.N1Hmx12N-lkp@intel.com/
>>>
>>> All error/warnings (new ones prefixed by >>):
>>>
>>>    fs/proc/task_mmu.c: In function 'pagemap_scan_pmd_entry':
>>>>> fs/proc/task_mmu.c:1921:37: error: 'HPAGE_SIZE' undeclared (first use in this function); did you mean 'PAGE_SIZE'?
>>>     1921 |                     ((end - start < HPAGE_SIZE) ||
>>>          |                                     ^~~~~~~~~~
>>>          |                                     PAGE_SIZE
>> It seems arc architecture supports CONFIG_TRANSPARENT_HUGEPAGE, but it
>> doesn't define HPAGE_SIZE. It only defines PAGE_SIZE in
>> include/uapi/asm/page.h? AFAIK HPAGE_SIZE must be defined when
>> CONFIG_TRANSPARENT_HUGEPAGE is enabled. What can be the solution here for arc?
> I'm still looking for solution. Vineet do you have some thoughts?
> 
>>
>> The remaining build failures are because the wrong tree. I base my patches
>> on latest next, while the bot has based patches on mm-everything. I guess
>> today's next would have latest mm stuff, a rebase would make things correct
>> or I'll shift to mm-everything.
>>
>>
>
diff mbox series

Patch

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6134f0df4164..36531b04bcd6 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -19,6 +19,7 @@ 
 #include <linux/shmem_fs.h>
 #include <linux/uaccess.h>
 #include <linux/pkeys.h>
+#include <linux/minmax.h>
 
 #include <asm/elf.h>
 #include <asm/tlb.h>
@@ -1135,6 +1136,26 @@  static inline void clear_soft_dirty(struct vm_area_struct *vma,
 }
 #endif
 
+static inline bool is_pte_uffd_wp(pte_t pte)
+{
+	return ((pte_present(pte) && pte_uffd_wp(pte)) ||
+		pte_swp_uffd_wp_any(pte));
+}
+
+#ifdef CONFIG_HUGETLB_PAGE
+static inline bool is_huge_pte_uffd_wp(pte_t pte)
+{
+	return ((pte_present(pte) && huge_pte_uffd_wp(pte)) ||
+		pte_swp_uffd_wp_any(pte));
+}
+#endif
+
+static inline bool is_pmd_uffd_wp(pmd_t pmd)
+{
+	return ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
+		(is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)));
+}
+
 #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
 		unsigned long addr, pmd_t *pmdp)
@@ -1768,11 +1789,416 @@  static int pagemap_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
+#define PM_SCAN_FOUND_MAX_PAGES	(1)
+#define PM_SCAN_BITS_ALL	(PAGE_IS_WRITTEN | PAGE_IS_FILE |	\
+				 PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
+#define PM_SCAN_OPS		(PM_SCAN_OP_GET | PM_SCAN_OP_WP)
+#define PM_SCAN_OP_IS_WP(a)	(a->flags & PM_SCAN_OP_WP)
+#define PM_SCAN_BITMAP(wt, file, present, swap)	\
+	(wt | file << 1 | present << 2 | swap << 3)
+
+struct pagemap_scan_private {
+	struct page_region *vec;
+	struct page_region cur;
+	unsigned long vec_len, vec_index;
+	unsigned int max_pages, found_pages, flags;
+	unsigned long required_mask, anyof_mask, excluded_mask, return_mask;
+};
+
+static inline bool pagemap_scan_is_written_set(struct pagemap_scan_private *p)
+{
+	return	((p->required_mask | p->anyof_mask | p->excluded_mask) &
+		 PAGE_IS_WRITTEN);
+}
+
+static int pagemap_scan_test_walk(unsigned long start, unsigned long end,
+				  struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+
+	if (pagemap_scan_is_written_set(p) && (!userfaultfd_wp(vma) ||
+	    !userfaultfd_wp_async(vma)))
+		return -EPERM;
+
+	if (vma->vm_flags & VM_PFNMAP)
+		return 1;
+
+	return 0;
+}
+
+static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
+			       struct pagemap_scan_private *p,
+			       unsigned long addr, unsigned int n_pages)
+{
+	unsigned long bitmap = PM_SCAN_BITMAP(wt, file, pres, swap);
+	struct page_region *cur = &p->cur;
+
+	if (!n_pages)
+		return -EINVAL;
+
+	if ((p->required_mask & bitmap) != p->required_mask)
+		return 0;
+	if (p->anyof_mask && !(p->anyof_mask & bitmap))
+		return 0;
+	if (p->excluded_mask & bitmap)
+		return 0;
+
+	bitmap &= p->return_mask;
+	if (!bitmap)
+		return 0;
+
+	if ((cur->start + cur->len * PAGE_SIZE == addr) &&
+	    (cur->bitmap == bitmap)) {
+		cur->len += n_pages;
+		p->found_pages += n_pages;
+
+		if (p->max_pages && (p->found_pages == p->max_pages))
+			return PM_SCAN_FOUND_MAX_PAGES;
+
+		return 0;
+	}
+
+	if (!p->vec_index || ((p->vec_index + 1) < p->vec_len)) {
+
+		if (cur->len) {
+			memcpy(&p->vec[p->vec_index], cur,
+			       sizeof(struct page_region));
+			p->vec_index++;
+		}
+		cur->start = addr;
+		cur->len = n_pages;
+		cur->bitmap = bitmap;
+		p->found_pages += n_pages;
+
+		if (p->max_pages && (p->found_pages == p->max_pages))
+			return PM_SCAN_FOUND_MAX_PAGES;
+
+		return 0;
+	}
+
+	return -ENOSPC;
+}
+
+static inline int pagemap_scan_deposit(struct pagemap_scan_private *p,
+				       struct page_region __user *vec,
+				       unsigned long *vec_index)
+{
+	struct page_region *cur = &p->cur;
+
+	if (!cur->len)
+		return 0;
+
+	if (copy_to_user(&vec[*vec_index], cur, sizeof(struct page_region)))
+		return -EFAULT;
+
+	p->vec_index++;
+	(*vec_index)++;
+
+	return 0;
+}
+
+static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
+				  unsigned long end, struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	bool is_written, is_file, is_present, is_swap;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long addr = end;
+	spinlock_t *ptl;
+	int ret = 0;
+	pte_t *pte;
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		unsigned long n_pages = (end - start)/PAGE_SIZE;
+
+		is_written = !is_pmd_uffd_wp(*pmd);
+		is_file = vma->vm_file;
+		is_present = pmd_present(*pmd);
+		is_swap = is_swap_pmd(*pmd);
+
+		spin_unlock(ptl);
+
+		/*
+		 * Break huge page into small pages if the WP operation need to
+		 * be performed is on a portion of the huge page or if max_pages
+		 * pages limit would exceed.
+		 */
+		if (is_written && PM_SCAN_OP_IS_WP(p) &&
+		    ((end - start < HPAGE_SIZE) ||
+		     (p->max_pages &&
+		      (p->max_pages - p->found_pages) < n_pages))) {
+
+			split_huge_pmd(vma, pmd, start);
+			goto process_smaller_pages;
+		}
+
+		if (p->max_pages &&
+		    p->found_pages + n_pages > p->max_pages)
+			n_pages = p->max_pages - p->found_pages;
+
+		ret = pagemap_scan_output(is_written, is_file, is_present,
+					  is_swap, p, start, n_pages);
+		if (ret < 0)
+			return ret;
+
+		if (is_written && PM_SCAN_OP_IS_WP(p) &&
+		    uffd_wp_range(vma, start, HPAGE_SIZE, true) < 0)
+			ret = -EINVAL;
+
+		return ret;
+	}
+process_smaller_pages:
+	if (pmd_trans_unstable(pmd))
+		return 0;
+#endif
+
+	for (addr = start; addr < end && !ret; pte++, addr += PAGE_SIZE) {
+		pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+
+		is_written = !is_pte_uffd_wp(*pte);
+		is_file = vma->vm_file;
+		is_present = pte_present(*pte);
+		is_swap = is_swap_pte(*pte);
+
+		pte_unmap_unlock(pte, ptl);
+
+		ret = pagemap_scan_output(is_written, is_file, is_present,
+					  is_swap, p, addr, 1);
+		if (ret < 0)
+			return ret;
+
+		if (is_written && PM_SCAN_OP_IS_WP(p) &&
+		    uffd_wp_range(vma, addr, PAGE_SIZE, true) < 0)
+			return -EINVAL;
+	}
+
+	cond_resched();
+	return ret;
+}
+
+#ifdef CONFIG_HUGETLB_PAGE
+static int pagemap_scan_hugetlb_entry(pte_t *ptep, unsigned long hmask,
+				      unsigned long start, unsigned long end,
+				      struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	bool is_written;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long n_pages = (end - start)/PAGE_SIZE;
+	pte_t pte;
+	int ret;
+
+	pte = huge_ptep_get(ptep);
+	is_written = !is_huge_pte_uffd_wp(pte);
+
+	/*
+	 * Partial hugetlb page clear isn't supported
+	 */
+	if (is_written && PM_SCAN_OP_IS_WP(p) &&
+	    ((end - start < HPAGE_SIZE) ||
+	     (p->max_pages &&
+	      (p->max_pages - p->found_pages) < n_pages)))
+		return -EPERM;
+
+	if (p->max_pages &&
+	    p->found_pages + n_pages > p->max_pages)
+		n_pages = p->max_pages - p->found_pages;
+
+	ret = pagemap_scan_output(is_written, vma->vm_file, pte_present(pte),
+				  is_swap_pte(pte), p, start, n_pages);
+	if (ret < 0)
+		return ret;
+
+	if (is_written && PM_SCAN_OP_IS_WP(p) &&
+	    uffd_wp_range(vma, start, HPAGE_SIZE, true) < 0)
+		ret = -EINVAL;
+
+	return ret;
+}
+#else
+#define pagemap_scan_hugetlb_entry NULL
+#endif
+
+static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end,
+				 int depth, struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long n_pages;
+	int ret = 0;
+
+	if (!vma)
+		return 0;
+
+	n_pages = (end - addr)/PAGE_SIZE;
+	if (p->max_pages && p->found_pages + n_pages > p->max_pages)
+		n_pages = p->max_pages - p->found_pages;
+
+	ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr,
+				  n_pages);
+
+	return ret;
+}
+
+static const struct mm_walk_ops pagemap_scan_ops = {
+	.test_walk = pagemap_scan_test_walk,
+	.pmd_entry = pagemap_scan_pmd_entry,
+	.pte_hole = pagemap_scan_pte_hole,
+	.hugetlb_entry = pagemap_scan_hugetlb_entry,
+};
+
+static int pagemap_scan_args_valid(struct pm_scan_arg *arg, unsigned long start,
+				   struct page_region __user *vec)
+{
+	/* Detect illegal size, flags, len and masks */
+	if (arg->size != sizeof(struct pm_scan_arg))
+		return -EINVAL;
+	if (arg->flags & ~PM_SCAN_OPS)
+		return -EINVAL;
+	if (!arg->len)
+		return -EINVAL;
+	if ((arg->required_mask | arg->anyof_mask | arg->excluded_mask |
+	     arg->return_mask) & ~PM_SCAN_BITS_ALL)
+		return -EINVAL;
+	if (!arg->required_mask && !arg->anyof_mask &&
+	    !arg->excluded_mask)
+		return -EINVAL;
+	if (!arg->return_mask)
+		return -EINVAL;
+
+	/* Validate memory ranges */
+	if (!(arg->flags & PM_SCAN_OP_GET))
+		return -EINVAL;
+	if (!arg->vec)
+		return -EINVAL;
+	if (arg->vec_len == 0)
+		return -EINVAL;
+
+	if (!IS_ALIGNED(start, PAGE_SIZE))
+		return -EINVAL;
+	if (!access_ok((void __user *)start, arg->len))
+		return -EFAULT;
+
+	if (!PM_SCAN_OP_IS_WP(arg))
+		return 0;
+
+	if ((arg->required_mask | arg->anyof_mask | arg->excluded_mask) &
+	    ~PAGE_IS_WRITTEN)
+		return -EINVAL;
+
+	return 0;
+}
+
+static long do_pagemap_cmd(struct mm_struct *mm,
+			   struct pm_scan_arg __user *uarg)
+{
+	unsigned long start, end, walk_start, walk_end;
+	unsigned long empty_slots, vec_index = 0;
+	struct page_region __user *vec;
+	struct pagemap_scan_private p;
+	struct pm_scan_arg arg;
+	int ret = 0;
+
+	if (copy_from_user(&arg, uarg, sizeof(arg)))
+		return -EFAULT;
+
+	start = (unsigned long)untagged_addr(arg.start);
+	vec = (struct page_region *)(unsigned long)untagged_addr(arg.vec);
+
+	ret = pagemap_scan_args_valid(&arg, start, vec);
+	if (ret)
+		return ret;
+
+	end = start + arg.len;
+	p.max_pages = arg.max_pages;
+	p.found_pages = 0;
+	p.flags = arg.flags;
+	p.required_mask = arg.required_mask;
+	p.anyof_mask = arg.anyof_mask;
+	p.excluded_mask = arg.excluded_mask;
+	p.return_mask = arg.return_mask;
+	p.cur.len = 0;
+	p.cur.start = 0;
+	p.vec = NULL;
+	p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
+
+	/*
+	 * Allocate smaller buffer to get output from inside the page walk
+	 * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
+	 * we want to return output to user in compact form where no two
+	 * consecutive regions should be continuous and have the same flags.
+	 * So store the latest element in p.cur between different walks and
+	 * store the p.cur at the end of the walk to the user buffer.
+	 */
+	p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
+			      GFP_KERNEL);
+	if (!p.vec)
+		return -ENOMEM;
+
+	walk_start = walk_end = start;
+	while (walk_end < end && !ret) {
+		p.vec_index = 0;
+
+		empty_slots = arg.vec_len - vec_index;
+		p.vec_len = min(p.vec_len, empty_slots);
+
+		walk_end = (walk_start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
+		if (walk_end > end)
+			walk_end = end;
+
+		mmap_read_lock(mm);
+		ret = walk_page_range(mm, walk_start, walk_end,
+				      &pagemap_scan_ops, &p);
+		mmap_read_unlock(mm);
+
+		if (ret && ret != -ENOSPC && ret != PM_SCAN_FOUND_MAX_PAGES)
+			goto free_data;
+
+		walk_start = walk_end;
+		if (p.vec_index) {
+			if (copy_to_user(&vec[vec_index], p.vec,
+					 p.vec_index *
+					 sizeof(struct page_region))) {
+				ret = -EFAULT;
+				goto free_data;
+			}
+			vec_index += p.vec_index;
+		}
+	}
+	ret = pagemap_scan_deposit(&p, vec, &vec_index);
+	if (!ret)
+		ret = vec_index;
+free_data:
+	kfree(p.vec);
+
+	return ret;
+}
+
+static long pagemap_scan_ioctl(struct file *file, unsigned int cmd,
+			       unsigned long arg)
+{
+	struct pm_scan_arg __user *uarg = (struct pm_scan_arg __user *)arg;
+	struct mm_struct *mm = file->private_data;
+
+	switch (cmd) {
+	case PAGEMAP_SCAN:
+		return do_pagemap_cmd(mm, uarg);
+
+	default:
+		return -EINVAL;
+	}
+}
+
 const struct file_operations proc_pagemap_operations = {
 	.llseek		= mem_lseek, /* borrow this */
 	.read		= pagemap_read,
 	.open		= pagemap_open,
 	.release	= pagemap_release,
+	.unlocked_ioctl = pagemap_scan_ioctl,
+	.compat_ioctl	= pagemap_scan_ioctl,
 };
 #endif /* CONFIG_PROC_PAGE_MONITOR */
 
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index ce4f9d18de3e..325672cd6d6f 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -210,6 +210,14 @@  extern bool userfaultfd_wp_async(struct vm_area_struct *vma);
 
 #else /* CONFIG_USERFAULTFD */
 
+static inline long uffd_wp_range(struct mm_struct *dst_mm,
+				 struct vm_area_struct *vma,
+				 unsigned long start, unsigned long len,
+				 bool enable_wp)
+{
+	return 0;
+}
+
 /* mm helpers */
 static inline vm_fault_t handle_userfault(struct vm_fault *vmf,
 				unsigned long reason)
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b7b56871029c..47879c38ce2f 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -305,4 +305,57 @@  typedef int __bitwise __kernel_rwf_t;
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
 			 RWF_APPEND)
 
+/* Pagemap ioctl */
+#define PAGEMAP_SCAN	_IOWR('f', 16, struct pm_scan_arg)
+
+/* Bits are set in the bitmap of the page_region and masks in pm_scan_args */
+#define PAGE_IS_WRITTEN		(1 << 0)
+#define PAGE_IS_FILE		(1 << 1)
+#define PAGE_IS_PRESENT		(1 << 2)
+#define PAGE_IS_SWAPPED		(1 << 3)
+
+/*
+ * struct page_region - Page region with bitmap flags
+ * @start:	Start of the region
+ * @len:	Length of the region in pages
+ * bitmap:	Bits sets for the region
+ */
+struct page_region {
+	__u64 start;
+	__u64 len;
+	__u64 bitmap;
+};
+
+/*
+ * struct pm_scan_arg - Pagemap ioctl argument
+ * @size:		Size of the structure
+ * @flags:		Flags for the IOCTL
+ * @start:		Starting address of the region
+ * @len:		Length of the region (All the pages in this length are included)
+ * @vec:		Address of page_region struct array for output
+ * @vec_len:		Length of the page_region struct array
+ * @max_pages:		Optional max return pages
+ * @required_mask:	Required mask - All of these bits have to be set in the PTE
+ * @anyof_mask:		Any mask - Any of these bits are set in the PTE
+ * @excluded_mask:	Exclude mask - None of these bits are set in the PTE
+ * @return_mask:	Bits that are to be reported in page_region
+ */
+struct pm_scan_arg {
+	__u64 size;
+	__u64 flags;
+	__u64 start;
+	__u64 len;
+	__u64 vec;
+	__u64 vec_len;
+	__u64 max_pages;
+	__u64 required_mask;
+	__u64 anyof_mask;
+	__u64 excluded_mask;
+	__u64 return_mask;
+};
+
+/* Supported flags */
+#define PM_SCAN_OP_GET	(1 << 0)
+#define PM_SCAN_OP_WP	(1 << 1)
+
 #endif /* _UAPI_LINUX_FS_H */