Message ID: 20230309135718.1490461-1-usama.anjum@collabora.com
Series: Implement IOCTL to get and optionally clear info about PTEs
On Thu, 9 Mar 2023 at 14:58, Muhammad Usama Anjum <usama.anjum@collabora.com> wrote:
>
> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
> the info about page table entries. The following operations are supported
> in this ioctl:
> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
>   file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
>   (PAGE_IS_SWAPPED).
> - Find pages which have been written-to and write protect the pages
>   (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
[...]
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -19,6 +19,7 @@
>  #include <linux/shmem_fs.h>
>  #include <linux/uaccess.h>
>  #include <linux/pkeys.h>
> +#include <linux/minmax.h>
>
>  #include <asm/elf.h>
>  #include <asm/tlb.h>
> @@ -1132,6 +1133,18 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
>  }
>  #endif
>
> +static inline bool is_pte_uffd_wp(pte_t pte)
> +{
> +        return ((pte_present(pte) && pte_uffd_wp(pte)) ||
> +                (pte_swp_uffd_wp_any(pte)));

Parentheses around pte_swp_uffd_wp_any() are redundant. Please remove
here and in all following if()s. (Nit: those extra parentheses are
used inconsistently in the patch anyway.)

[...]

> +static inline bool pagemap_scan_is_wt_required(struct pagemap_scan_private *p)

This seems to check if the PAGE_IS_WRITTEN flag is tested, so
"pagemap_scan_needs_wp_checks()"? Or maybe document/expand the "wt"
acronym as it seems used also in the following code.

> +{
> +        return ((p->required_mask & PAGE_IS_WRITTEN) ||
> +                (p->anyof_mask & PAGE_IS_WRITTEN) ||
> +                (p->excluded_mask & PAGE_IS_WRITTEN));

Nit: It looks like it should answer "do any of the masks contain
PAGE_IS_WRITTEN?" so maybe:

        return (p->required_mask | p->anyof_mask | p->excluded_mask) & PAGE_IS_WRITTEN;

[...]
> +static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
> +                               struct pagemap_scan_private *p,
> +                               unsigned long addr, unsigned int n_pages)
> +{
> +        unsigned long bitmap = PM_SCAN_BITMAP(wt, file, pres, swap);
> +        struct page_region *cur = &p->cur;
> +        bool cpy = true;
> +
> +        if (p->max_pages && (p->found_pages == p->max_pages))
> +                return -ENOSPC;
> +
> +        if (!n_pages)
> +                return -EINVAL;
> +
> +        if (p->required_mask)
> +                cpy = ((p->required_mask & bitmap) == p->required_mask);
> +        if (cpy && p->anyof_mask)
> +                cpy = (p->anyof_mask & bitmap);
> +        if (cpy && p->excluded_mask)
> +                cpy = !(p->excluded_mask & bitmap);

Since the rest of the code is executed only when `cpy` is true, this
could just return early for easier understanding.

BTW, some of the tests are redundant. Eg: if required_mask == 0, then
`required_mask & x == required_mask` will always hold. Same for
`excluded_mask & x == 0`.

> +
> +        bitmap = bitmap & p->return_mask;

Nit: bitmap &= p->return_mask;

> +        if (cpy && bitmap) {

Assuming early returns on `!cpy` are done earlier:

        if (!bitmap)
                return 0;

> +                if ((cur->len) && (cur->bitmap == bitmap) &&
> +                    (cur->start + cur->len * PAGE_SIZE == addr)) {

I'd recommend removing the extra parentheses as they make the code
less readable for me (too many parentheses to match visually).

The `cur->len` test seems redundant: is it possible to have
`cur->start == addr` in that case (I guess it would have to get
`n_pages == 0` in an earlier invocation)?

> +
> +                        cur->len += n_pages;
> +                        p->found_pages += n_pages;

Please add an early return so that 'else' chaining won't be necessary.

> +                } else if ((!p->vec_index) ||
> +                           ((p->vec_index + 1) < p->vec_len)) {

Can you explain this test? Why not just `p->vec_index < p->vec_len`?
Or better:

        if (vec_index >= p->vec_len)
                return -ENOSPC;

> +                        if (cur->len) {
> +                                memcpy(&p->vec[p->vec_index], cur,
> +                                       sizeof(struct page_region));
> +                                p->vec_index++;
> +                        }
> +
> +                        cur->start = addr;
> +                        cur->len = n_pages;
> +                        cur->bitmap = bitmap;
> +                        p->found_pages += n_pages;
> +                } else {
> +                        return -ENOSPC;
> +                }
> +        }
> +
> +        return 0;
> +}

[...]

> +static int pagemap_scan_deposit(struct pagemap_scan_private *p,
> +                                struct page_region __user *vec,
> +                                unsigned long *vec_index)
> +{
> +        struct page_region *cur = &p->cur;
> +
> +        if (cur->len) {

        if (!cur->len)
                return 0;

> +                if (copy_to_user(&vec[*vec_index], cur,
> +                                 sizeof(struct page_region)))
> +                        return -EFAULT;
> +
> +                p->vec_index++;
> +                (*vec_index)++;
> +        }
> +
> +        return 0;
> +}

> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> +                                  unsigned long end, struct mm_walk *walk)
> +{
> +        struct pagemap_scan_private *p = walk->private;
> +        struct vm_area_struct *vma = walk->vma;
> +        bool is_writ, is_file, is_pres, is_swap;
> +        unsigned long addr = end;
> +        spinlock_t *ptl;
> +        int ret = 0;
> +        pte_t *pte;
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE

Is the `#ifdef` needed? `pmd_trans_huge_lock()` will always return
NULL if transparent hugepages are not compiled in. OTOH I see
BUILD_BUG() is possible in HPAGE_SIZE definition (irrelevant in this
case), so that would need to be worked around first.

> +        ptl = pmd_trans_huge_lock(pmd, vma);
> +        if (ptl) {
> +                unsigned long n_pages;
> +
> +                is_writ = !is_pmd_uffd_wp(*pmd);

`is_written`?

> +                /*
> +                 * Break huge page into small pages if operation needs to be
> +                 * performed is on a portion of the huge page.
> +                 */
> +                if (is_writ && PM_SCAN_OP_IS_WP(p) &&
> +                    (end - start < HPAGE_SIZE)) {
> +                        spin_unlock(ptl);
> +
> +                        split_huge_pmd(vma, pmd, start);
> +                        goto process_smaller_pages;
> +                }
> +
> +                n_pages = (end - start)/PAGE_SIZE;
> +                if (p->max_pages &&
> +                    p->found_pages + n_pages >= p->max_pages)

Nit: greater-than is also correct and avoids no-op assignment.

> +                        n_pages = p->max_pages - p->found_pages;
> +
> +                ret = pagemap_scan_output(is_writ, vma->vm_file,
> +                                          pmd_present(*pmd), is_swap_pmd(*pmd),
> +                                          p, start, n_pages);
> +                spin_unlock(ptl);

        if (ret || !is_written)
                return ret;

This will avoid those tests in the following if().

> +
> +                if (!ret && is_writ && PM_SCAN_OP_IS_WP(p) &&
> +                    uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true) < 0)
> +                        ret = -EINVAL;
> +
> +                return ret;

After above early returns, this will be always `return 0;`.

> +        }
> +process_smaller_pages:
> +        if (pmd_trans_unstable(pmd))
> +                return 0;
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> +        for (addr = start; !ret && addr < end; pte++, addr += PAGE_SIZE) {

The `!ret` can be removed if the EINVAL case was to `break` by itself.
> +                pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +
> +                is_writ = !is_pte_uffd_wp(*pte);
> +                is_file = vma->vm_file;
> +                is_pres = pte_present(*pte);
> +                is_swap = is_swap_pte(*pte);
> +
> +                pte_unmap_unlock(pte, ptl);
> +
> +                ret = pagemap_scan_output(is_writ, is_file, is_pres, is_swap,
> +                                          p, addr, 1);
> +                if (ret)
> +                        break;
> +
> +                if (PM_SCAN_OP_IS_WP(p) && is_writ &&
> +                    uffd_wp_range(walk->mm, vma, addr, PAGE_SIZE, true) < 0)
> +                        ret = -EINVAL;
> +        }
> +
> +        cond_resched();
> +        return ret;
> +}
> +
> +static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end,
> +                                 int depth, struct mm_walk *walk)
> +{
> +        struct pagemap_scan_private *p = walk->private;
> +        struct vm_area_struct *vma = walk->vma;
> +        unsigned long n_pages;
> +        int ret = 0;
> +
> +        if (vma) {

        if (!vma)
                return 0;

> +                n_pages = (end - addr)/PAGE_SIZE;
> +                if (p->max_pages &&
> +                    p->found_pages + n_pages >= p->max_pages)
> +                        n_pages = p->max_pages - p->found_pages;
> +
> +                ret = pagemap_scan_output(false, vma->vm_file, false, false, p,
> +                                          addr, n_pages);
> +        }
> +
> +        return ret;
> +}

> +/* No hugetlb support is present. */

"FIXME: hugetlb support is not implemented."? (There seems to be no
#ifdef CONFIG_HUGETLB or similar, so I guess the comment is about the
current implementation.)
> +static const struct mm_walk_ops pagemap_scan_ops = {
> +        .test_walk = pagemap_scan_test_walk,
> +        .pmd_entry = pagemap_scan_pmd_entry,
> +        .pte_hole = pagemap_scan_pte_hole,
> +};
> +
> +static bool pagemap_scan_args_valid(struct pm_scan_arg *arg,
> +                                    struct page_region __user *vec,
> +                                    unsigned long start)
> +{
> +        /* Detect illegal size, flags and masks */
> +        if (arg->size != sizeof(struct pm_scan_arg))
> +                return false;
> +        if (arg->flags & ~PM_SCAN_OPS)
> +                return false;
> +        if ((arg->required_mask | arg->anyof_mask | arg->excluded_mask |
> +             arg->return_mask) & ~PM_SCAN_BITS_ALL)
> +                return false;
> +        if (!arg->required_mask && !arg->anyof_mask &&
> +            !arg->excluded_mask)
> +                return false;

Is there an assumption in the code that those checks are needed? I'd
expect that no selection criteria makes a valid page set?

> +        if (!arg->return_mask)
> +                return false;
> +
> +        /* Validate memory ranges */
> +        if (!(arg->flags & PM_SCAN_OP_GET))
> +                return false;
> +        if (!arg->vec)
> +                return false;
> +        if (arg->vec_len == 0)
> +                return false;
> +        if (!access_ok((void __user *)vec,
> +                       arg->vec_len * sizeof(struct page_region)))
> +                return false;

Is there a provision that userspace threads are all blocked from
manipulating mmaps during this ioctl()? If not, this is a TOCTOU bug
and the writes should be checked each time as another userspace
thread could remap the memory while the ioctl() is working. Anyway,
the return should be EFAULT for this case.

> +        if (!IS_ALIGNED(start, PAGE_SIZE))
> +                return false;
> +        if (!access_ok((void __user *)start, arg->len))
> +                return false;

This I guess wants to check if the range to be scanned is mapped -
but isn't this what the ioctl() should do during the scan? (But, also
see above.)
> +        if (PM_SCAN_OP_IS_WP(arg)) {

        if (!...IS_WP)
                return true;

> +                if (arg->required_mask & PM_SCAN_NON_WT_BITS)
> +                        return false;
> +                if (arg->anyof_mask & PM_SCAN_NON_WT_BITS)
> +                        return false;
> +                if (arg->excluded_mask & PM_SCAN_NON_WT_BITS)
> +                        return false;

Please see: pagemap_scan_is_wt_required comment. Also, it seems this
constant is used only here, so ~PAGE_IS_WRITTEN might be enough?

[...]

> +static long do_pagemap_cmd(struct mm_struct *mm, struct pm_scan_arg *arg)
> +{
> +        unsigned long start, end, walk_start, walk_end;
> +        unsigned long empty_slots, vec_index = 0;
> +        struct page_region __user *vec;
> +        struct pagemap_scan_private p;
> +        int ret = 0;
> +
> +        start = (unsigned long)untagged_addr(arg->start);
> +        vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec);
> +
> +        if (!pagemap_scan_args_valid(arg, vec, start))
> +                return -EINVAL;
> +
> +        end = start + arg->len;
> +        p.max_pages = arg->max_pages;
> +        p.found_pages = 0;
> +        p.flags = arg->flags;
> +        p.required_mask = arg->required_mask;
> +        p.anyof_mask = arg->anyof_mask;
> +        p.excluded_mask = arg->excluded_mask;
> +        p.return_mask = arg->return_mask;
> +        p.cur.len = 0;
> +        p.vec = NULL;
> +        p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
> +
> +        /*
> +         * Allocate smaller buffer to get output from inside the page walk
> +         * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
> +         * we want to return output to user in compact form where no two
> +         * consecutive regions should be continuous and have the same flags.
> +         * So store the latest element in p.cur between different walks and
> +         * store the p.cur at the end of the walk to the user buffer.
> +         */
> +        p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
> +                              GFP_KERNEL);
> +        if (!p.vec)
> +                return -ENOMEM;
> +
> +        walk_start = walk_end = start;
> +        while (walk_end < end) {
> +                p.vec_index = 0;
> +
> +                empty_slots = arg->vec_len - vec_index;
> +                p.vec_len = min(p.vec_len, empty_slots);
> +
> +                walk_end = (walk_start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
> +                if (walk_end > end)
> +                        walk_end = end;
> +
> +                mmap_read_lock(mm);
> +                ret = walk_page_range(mm, walk_start, walk_end,
> +                                      &pagemap_scan_ops, &p);
> +                mmap_read_unlock(mm);
> +
> +                if (!(!ret || ret == -ENOSPC))

        if (ret && ret != -ENOSPC)

> +                        goto free_data;
> +
> +                walk_start = walk_end;
> +                if (p.vec_index) {
> +                        if (copy_to_user(&vec[vec_index], p.vec,
> +                                         p.vec_index *
> +                                         sizeof(struct page_region))) {
> +                                ret = -EFAULT;
> +                                goto free_data;
> +                        }
> +                        vec_index += p.vec_index;
> +                }
> +        }
> +        ret = pagemap_scan_deposit(&p, vec, &vec_index);
> +        if (!ret)
> +                ret = vec_index;
> +free_data:
> +        kfree(p.vec);
> +
> +        return ret;
> +}
> +
> +static long pagemap_scan_ioctl(struct file *file, unsigned int cmd,
> +                               unsigned long arg)
> +{
> +        struct pm_scan_arg __user *uarg = (struct pm_scan_arg __user *)arg;
> +        struct mm_struct *mm = file->private_data;
> +        struct pm_scan_arg argument;
> +
> +        if (cmd == PAGEMAP_SCAN) {

switch() for easier expansion later?

> +                if (copy_from_user(&argument, uarg,
> +                                   sizeof(struct pm_scan_arg)))

sizeof(*argument);

Could you push this to do_pagemap_cmd()? In case this file gets more
ioctl() commands there won't be need to add more command-specific
structures in this function.

> +                        return -EFAULT;
> +                return do_pagemap_cmd(mm, &argument);
> +        }
> +
> +        return -EINVAL;
> +}
> +
>  const struct file_operations proc_pagemap_operations = {
>          .llseek         = mem_lseek, /* borrow this */
>          .read           = pagemap_read,
>          .open           = pagemap_open,
>          .release        = pagemap_release,
> +        .unlocked_ioctl = pagemap_scan_ioctl,
> +        .compat_ioctl = pagemap_scan_ioctl,

Is this correct?
Would the code need a different userspace pointer handling for 32-bit
userspace on a 64-bit kernel?

> };
> #endif /* CONFIG_PROC_PAGE_MONITOR */
On Thu, Mar 09, 2023 at 06:57:15PM +0500, Muhammad Usama Anjum wrote:
> +	for (addr = start; !ret && addr < end; pte++, addr += PAGE_SIZE) {
> +		pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +
> +		is_writ = !is_pte_uffd_wp(*pte);
> +		is_file = vma->vm_file;
> +		is_pres = pte_present(*pte);
> +		is_swap = is_swap_pte(*pte);
> +
> +		pte_unmap_unlock(pte, ptl);
> +
> +		ret = pagemap_scan_output(is_writ, is_file, is_pres, is_swap,
> +					  p, addr, 1);
> +		if (ret)
> +			break;
> +
> +		if (PM_SCAN_OP_IS_WP(p) && is_writ &&
> +		    uffd_wp_range(walk->mm, vma, addr, PAGE_SIZE, true) < 0)
> +			ret = -EINVAL;
> +	}

This is not really atomic.

Taking the spinlock for each pte is not only overkill but wrong in
atomicity because the pte can change right after the spinlock is unlocked.

Unfortunately you also cannot reuse uffd_wp_range() because that's not
atomic either, my fault here. Probably I was thinking mostly from the
soft-dirty pov on batching the collect+reset.

You need to take the spin lock, collect whatever bits, set/clear whatever
bits, and only then release the spin lock.

"Not atomic" means you can have some page get dirtied but you could miss
it. Depending on how strict you want to be, I think it'll break apps like
CRIU if strict atomicity is needed for migrating a process. If we want to
have a new interface anyway, IMHO we'd better do it in the strict way.

Same comment applies to the THP handling (where I cut from the context).
On 3/15/23 8:55 PM, Peter Xu wrote:
> On Thu, Mar 09, 2023 at 06:57:15PM +0500, Muhammad Usama Anjum wrote:
>> +	for (addr = start; !ret && addr < end; pte++, addr += PAGE_SIZE) {
>> +		pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>> +
>> +		is_writ = !is_pte_uffd_wp(*pte);
>> +		is_file = vma->vm_file;
>> +		is_pres = pte_present(*pte);
>> +		is_swap = is_swap_pte(*pte);
>> +
>> +		pte_unmap_unlock(pte, ptl);
>> +
>> +		ret = pagemap_scan_output(is_writ, is_file, is_pres, is_swap,
>> +					  p, addr, 1);
>> +		if (ret)
>> +			break;
>> +
>> +		if (PM_SCAN_OP_IS_WP(p) && is_writ &&
>> +		    uffd_wp_range(walk->mm, vma, addr, PAGE_SIZE, true) < 0)
>> +			ret = -EINVAL;
>> +	}
>
> This is not really atomic.
>
> Taking the spinlock for each pte is not only overkill but wrong in
> atomicity because the pte can change right after the spinlock is unlocked.

Let me explain. It seems wrong, but it isn't. In my rigorous testing, it
didn't show any side-effect. Here we are finding out if a page has been
written to. Only if the page has been written to do we clear it. Let's
look at the different possibilities here:
- If a page isn't written to, we'll not clear it.
- If a page is written to and there isn't any race, we'll clear the
  written-to flag by write protecting it.
- If a page is written to, but data is written again to the page before
  we clear it, the page remains written to and we'll clear it.
- If a page is written to, but it gets write protected before we clear
  it, we'll still write protect it. There is double write protection
  here, but no side-effect.

Let's turn this into a truth table for easier understanding. Here the
first and third columns represent the code above; the second column
represents any other thread interacting with the page.

        If page is written/dirty    some other task interacts    wp_page
        no                          does nothing                 no
        no                          writes to page               no
        no                          wp the page                  no
        yes                         does nothing                 yes
        yes                         write to page                yes
        yes                         wp the page                  yes

As you can see there isn't any side-effect happening.
We aren't over-doing or under-doing the write protection.

Even if we were doing something wrong here and I brought the lock over all
of this, the pages could become written to or write protected just after
unlocking. That is expected. The current implementation doesn't seem to be
breaking this.

Is my understanding wrong somewhere here? Can you point it out?

Designs previous to the current locking design were either buggy or slower
when multiple threads were working on the same pages. The current
implementation removes those limitations:
- The memcpy inside pagemap_scan_output happens with the pte unlocked.
- We only wp a page if we have noted this page to be dirty.
- No mm write lock is required. A read lock works fine, just like
  userfaultfd_writeprotect() takes only a read lock.

There is only one con here: we are locking and unlocking the pte lock
again and again.

Please have a look at my explanation and let me know what you think.

>
> Unfortunately you also cannot reuse uffd_wp_range() because that's not
> atomic either, my fault here. Probably I was thinking mostly from the
> soft-dirty pov on batching the collect+reset.
>
> You need to take the spin lock, collect whatever bits, set/clear whatever
> bits, and only then release the spin lock.
>
> "Not atomic" means you can have some page get dirtied but you could miss
> it. Depending on how strict you want to be, I think it'll break apps like
> CRIU if strict atomicity is needed for migrating a process. If we want to
> have a new interface anyway, IMHO we'd better do it in the strict way.

In my rigorous multi-threaded testing, where lots of threads are working
on the same set of pages, we aren't losing even a single update. I can
share the test if you want.

>
> Same comment applies to the THP handling (where I cut from the context).
>
On Wed, Mar 15, 2023 at 09:54:40PM +0500, Muhammad Usama Anjum wrote:
> On 3/15/23 8:55 PM, Peter Xu wrote:
> > On Thu, Mar 09, 2023 at 06:57:15PM +0500, Muhammad Usama Anjum wrote:
> >> +	for (addr = start; !ret && addr < end; pte++, addr += PAGE_SIZE) {
> >> +		pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> >> +
> >> +		is_writ = !is_pte_uffd_wp(*pte);
> >> +		is_file = vma->vm_file;
> >> +		is_pres = pte_present(*pte);
> >> +		is_swap = is_swap_pte(*pte);
> >> +
> >> +		pte_unmap_unlock(pte, ptl);
> >> +
> >> +		ret = pagemap_scan_output(is_writ, is_file, is_pres, is_swap,
> >> +					  p, addr, 1);
> >> +		if (ret)
> >> +			break;
> >> +
> >> +		if (PM_SCAN_OP_IS_WP(p) && is_writ &&
> >> +		    uffd_wp_range(walk->mm, vma, addr, PAGE_SIZE, true) < 0)
> >> +			ret = -EINVAL;
> >> +	}
> >
> > This is not really atomic.
> >
> > Taking the spinlock for each pte is not only overkill but wrong in
> > atomicity because the pte can change right after the spinlock is unlocked.
> Let me explain. It seems wrong, but it isn't. In my rigorous testing, it
> didn't show any side-effect. Here we are finding out if a page has been
> written to. Only if the page has been written to do we clear it. Let's
> look at the different possibilities here:
> - If a page isn't written to, we'll not clear it.
> - If a page is written to and there isn't any race, we'll clear the
>   written-to flag by write protecting it.
> - If a page is written to, but data is written again to the page before
>   we clear it, the page remains written to and we'll clear it.
> - If a page is written to, but it gets write protected before we clear
>   it, we'll still write protect it. There is double write protection
>   here, but no side-effect.
>
> Let's turn this into a truth table for easier understanding. Here the
> first and third columns represent the code above; the second column
> represents any other thread interacting with the page.
>
>         If page is written/dirty    some other task interacts    wp_page
>         no                          does nothing                 no
>         no                          writes to page               no
>         no                          wp the page                  no
>         yes                         does nothing                 yes
>         yes                         write to page                yes
>         yes                         wp the page                  yes
>
> As you can see there isn't any side-effect happening. We aren't over-doing
> or under-doing the write protection.
>
> Even if we were doing something wrong here and I brought the lock over all
> of this, the pages could become written to or write protected just after
> unlocking. That is expected. The current implementation doesn't seem to be
> breaking this.
>
> Is my understanding wrong somewhere here? Can you point it out?

Yes you're right. With the is_writ check it looks all fine.

>
> Designs previous to the current locking design were either buggy or slower
> when multiple threads were working on the same pages. The current
> implementation removes those limitations:
> - The memcpy inside pagemap_scan_output happens with the pte unlocked.

Why is this anything to worry about? Isn't that memcpy only applied to a
page_region struct?

> - We only wp a page if we have noted this page to be dirty.
> - No mm write lock is required. A read lock works fine, just like
>   userfaultfd_writeprotect() takes only a read lock.

I didn't even notice you used to use the write lock. Yes I think the read
lock suffices here.

>
> There is only one con here: we are locking and unlocking the pte lock
> again and again.
>
> Please have a look at my explanation and let me know what you think.

I think this is fine as long as the semantics are correct, which I believe
is the case. The spinlock usage can be optimized, but it can be done on
top if it needs more involved changes.

>
> > Unfortunately you also cannot reuse uffd_wp_range() because that's not
> > atomic either, my fault here. Probably I was thinking mostly from the
> > soft-dirty pov on batching the collect+reset.
> > > > "Not atomic" means you can have some page got dirtied but you could miss > > it. Depending on how strict you want, I think it'll break apps like CRIU > > if strict atomicity needed for migrating a process. If we want to have a > > new interface anyway, IMHO we'd better do that in the strict way. > In my rigorous multi-threaded testing where a lots of threads are working > on same set of pages, we aren't losing even a single update. I can share > the test if you want. Good to have tests covering that. I'd say you can add the test into selftests along with the series when you repost if it's convenient. It can be part of an existing test or it can be a new one under mm/. Thanks,
On 3/16/23 12:53 AM, Peter Xu wrote:
> On Wed, Mar 15, 2023 at 09:54:40PM +0500, Muhammad Usama Anjum wrote:
>> On 3/15/23 8:55 PM, Peter Xu wrote:
>>> On Thu, Mar 09, 2023 at 06:57:15PM +0500, Muhammad Usama Anjum wrote:
>>>> +	for (addr = start; !ret && addr < end; pte++, addr += PAGE_SIZE) {
>>>> +		pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>>>> +
>>>> +		is_writ = !is_pte_uffd_wp(*pte);
>>>> +		is_file = vma->vm_file;
>>>> +		is_pres = pte_present(*pte);
>>>> +		is_swap = is_swap_pte(*pte);
>>>> +
>>>> +		pte_unmap_unlock(pte, ptl);
>>>> +
>>>> +		ret = pagemap_scan_output(is_writ, is_file, is_pres, is_swap,
>>>> +					  p, addr, 1);
>>>> +		if (ret)
>>>> +			break;
>>>> +
>>>> +		if (PM_SCAN_OP_IS_WP(p) && is_writ &&
>>>> +		    uffd_wp_range(walk->mm, vma, addr, PAGE_SIZE, true) < 0)
>>>> +			ret = -EINVAL;
>>>> +	}
>>>
>>> This is not really atomic.
>>>
>>> Taking the spinlock for each pte is not only overkill but wrong in
>>> atomicity because the pte can change right after the spinlock is unlocked.
>> Let me explain. It seems wrong, but it isn't. In my rigorous testing, it
>> didn't show any side-effect. Here we are finding out if a page has been
>> written to. Only if the page has been written to do we clear it. Let's
>> look at the different possibilities here:
>> - If a page isn't written to, we'll not clear it.
>> - If a page is written to and there isn't any race, we'll clear the
>>   written-to flag by write protecting it.
>> - If a page is written to, but data is written again to the page before
>>   we clear it, the page remains written to and we'll clear it.
>> - If a page is written to, but it gets write protected before we clear
>>   it, we'll still write protect it. There is double write protection
>>   here, but no side-effect.
>>
>> Let's turn this into a truth table for easier understanding. Here the
>> first and third columns represent the code above; the second column
>> represents any other thread interacting with the page.
>>
>>         If page is written/dirty    some other task interacts    wp_page
>>         no                          does nothing                 no
>>         no                          writes to page               no
>>         no                          wp the page                  no
>>         yes                         does nothing                 yes
>>         yes                         write to page                yes
>>         yes                         wp the page                  yes
>>
>> As you can see there isn't any side-effect happening. We aren't over-doing
>> or under-doing the write protection.
>>
>> Even if we were doing something wrong here and I brought the lock over all
>> of this, the pages could become written to or write protected just after
>> unlocking. That is expected. The current implementation doesn't seem to be
>> breaking this.
>>
>> Is my understanding wrong somewhere here? Can you point it out?
>
> Yes you're right. With the is_writ check it looks all fine.
>
>> Designs previous to the current locking design were either buggy or slower
>> when multiple threads were working on the same pages. The current
>> implementation removes those limitations:
>> - The memcpy inside pagemap_scan_output happens with the pte unlocked.
>
> Why is this anything to worry about? Isn't that memcpy only applied to a
> page_region struct?

Yeah, correct. I'm just saying that a memcpy without the pte lock held is
better than a memcpy with the pte locked. :)

>
>> - We only wp a page if we have noted this page to be dirty.
>> - No mm write lock is required. A read lock works fine, just like
>>   userfaultfd_writeprotect() takes only a read lock.
>
> I didn't even notice you used to use the write lock. Yes I think the read
> lock suffices here.
>
>> There is only one con here: we are locking and unlocking the pte lock
>> again and again.
>>
>> Please have a look at my explanation and let me know what you think.
>
> I think this is fine as long as the semantics are correct, which I believe
> is the case. The spinlock usage can be optimized, but it can be done on
> top if it needs more involved changes.
>
>>> Unfortunately you also cannot reuse uffd_wp_range() because that's not
>>> atomic either, my fault here. Probably I was thinking mostly from the
>>> soft-dirty pov on batching the collect+reset.
>>>
>>> You need to take the spin lock, collect whatever bits, set/clear whatever
>>> bits, and only then release the spin lock.
>>>
>>> "Not atomic" means you can have some page get dirtied but you could miss
>>> it. Depending on how strict you want to be, I think it'll break apps like
>>> CRIU if strict atomicity is needed for migrating a process. If we want to
>>> have a new interface anyway, IMHO we'd better do it in the strict way.
>> In my rigorous multi-threaded testing, where lots of threads are working
>> on the same set of pages, we aren't losing even a single update. I can
>> share the test if you want.
>
> Good to have tests covering that. I'd say you can add the test into
> selftests along with the series when you repost, if it's convenient. It
> can be part of an existing test or it can be a new one under mm/.

Sure, I'll add it to the selftests. Thank you for reviewing and asking the
questions.

>
> Thanks,
>
Hi,

Thank you so much for reviewing.

On 3/13/23 9:02 PM, Michał Mirosław wrote:
> On Thu, 9 Mar 2023 at 14:58, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
>>
>> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
>> the info about page table entries. The following operations are supported
>> in this ioctl:
>> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
>>   file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
>>   (PAGE_IS_SWAPPED).
>> - Find pages which have been written-to and write protect the pages
>>   (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
> [...]
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -19,6 +19,7 @@
>>  #include <linux/shmem_fs.h>
>>  #include <linux/uaccess.h>
>>  #include <linux/pkeys.h>
>> +#include <linux/minmax.h>
>>
>>  #include <asm/elf.h>
>>  #include <asm/tlb.h>
>> @@ -1132,6 +1133,18 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
>>  }
>>  #endif
>>
>> +static inline bool is_pte_uffd_wp(pte_t pte)
>> +{
>> +        return ((pte_present(pte) && pte_uffd_wp(pte)) ||
>> +                (pte_swp_uffd_wp_any(pte)));
>
> Parentheses around pte_swp_uffd_wp_any() are redundant. Please remove
> here and in all following if()s. (Nit: those extra parentheses are
> used inconsistently in the patch anyway.)

I'll remove these in the next version.

>
> [...]
>> +static inline bool pagemap_scan_is_wt_required(struct pagemap_scan_private *p)
>
> This seems to check if the PAGE_IS_WRITTEN flag is tested, so
> "pagemap_scan_needs_wp_checks()"? Or maybe document/expand the "wt"
> acronym as it seems used also in the following code.

I'll expand "wt".

>
>> +{
>> +        return ((p->required_mask & PAGE_IS_WRITTEN) ||
>> +                (p->anyof_mask & PAGE_IS_WRITTEN) ||
>> +                (p->excluded_mask & PAGE_IS_WRITTEN));
>
> Nit: It looks like it should answer "do any of the masks contain
> PAGE_IS_WRITTEN?" so maybe:
>
>         return (p->required_mask | p->anyof_mask | p->excluded_mask) & PAGE_IS_WRITTEN;

I'll update.
>
> [...]
>
>> +static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
>> +                               struct pagemap_scan_private *p,
>> +                               unsigned long addr, unsigned int n_pages)
>> +{
>> +        unsigned long bitmap = PM_SCAN_BITMAP(wt, file, pres, swap);
>> +        struct page_region *cur = &p->cur;
>> +        bool cpy = true;
>> +
>> +        if (p->max_pages && (p->found_pages == p->max_pages))
>> +                return -ENOSPC;
>> +
>> +        if (!n_pages)
>> +                return -EINVAL;
>> +
>> +        if (p->required_mask)
>> +                cpy = ((p->required_mask & bitmap) == p->required_mask);
>> +        if (cpy && p->anyof_mask)
>> +                cpy = (p->anyof_mask & bitmap);
>> +        if (cpy && p->excluded_mask)
>> +                cpy = !(p->excluded_mask & bitmap);
>
> Since the rest of the code is executed only when `cpy` is true, this
> could just return early for easier understanding.

Hmm... I'll do the following:

        if (!cpy || !bitmap)
                return 0;

>
> BTW, some of the tests are redundant. Eg: if required_mask == 0, then
> `required_mask & x == required_mask` will always hold. Same for
> `excluded_mask & x == 0`.

Correct. This is why I'm checking if required_mask is set and only then
comparing the bitmap with it. required_mask may be 0 if not set. This if
will skip the subsequent check:

        if (p->required_mask)
                cpy = ((p->required_mask & bitmap) == p->required_mask);

I don't see any redundancy here. Please let me know otherwise?

>
>> +
>> +        bitmap = bitmap & p->return_mask;
>
> Nit: bitmap &= p->return_mask;

Sure. Will do. Just for my knowledge, what does "Nit" signify when a
comment is marked with it?

>
>> +        if (cpy && bitmap) {
>
> Assuming early returns on `!cpy` are done earlier:
>
>         if (!bitmap)
>                 return 0;

I've posted a condition above which would suit better here.

>
>> +                if ((cur->len) && (cur->bitmap == bitmap) &&
>> +                    (cur->start + cur->len * PAGE_SIZE == addr)) {
>
> I'd recommend removing the extra parentheses as they make the code
> less readable for me (too many parentheses to match visually).

I'll remove the parentheses.
> The `cur->len` test seems redundant: is it possible to have
> `cur->start == addr` in that case (I guess it would have to get
> `n_pages == 0` in an earlier invocation)?

No, neither would work. cur->len == 0 means that cur holds only garbage.
It is essential to check validity through cur->len before performing the
other checks. Also, cur->start can never be equal to addr, as we are
walking over the page addresses serially. What we want to see here is
whether the current address continues the previously stored data, by
computing the ending address of the last stored region
(cur->start + cur->len * PAGE_SIZE).

>
>> +
>> +                        cur->len += n_pages;
>> +                        p->found_pages += n_pages;
>
> Please add an early return so that 'else' chaining won't be necessary.

I'll do it.

>
>> +                } else if ((!p->vec_index) ||
>> +                           ((p->vec_index + 1) < p->vec_len)) {
>
> Can you explain this test? Why not just `p->vec_index < p->vec_len`? Or better:
>
>         if (vec_index >= p->vec_len)
>                 return -ENOSPC;

No, that'll not work. Let's leave it as it is. :) It has gotten somewhat
complex, but I don't have any other way to make it simpler that works.
First note the following points:
1) We walk over 512 pages or 1 THP at a time so as not to over-allocate
   memory in the kernel (p->vec).
2) We also want to merge consecutive pages with the same flags into one
   struct page_region. p->vec of the current walk may merge with the next
   walk. So we cannot write to user memory until we find the results of
   the next walk.

So the most recent data is put into p->cur. When data which cannot be
merged is found, we move p->cur to p->vec[p->vec_index] inside the page
walk. After the page walk, p->vec[0 to p->vec_index] is moved to arg->vec.
After all the walks are over, we move p->cur to arg->vec. That completes
the data transfer to the user buffer.

        --------------
        |   p->cur   |
        --------------
              |
              V
        --------------
        |            |
        |   p->vec   |
        |            |
        --------------
              |
              V
        --------------
        |            |
        |  arg->vec  |
        |            |
        --------------

I'm so sorry that it has gotten this complex.
It was way simpler when we were walking over all the memory in one go. But then we needed an unbounded memory from the kernel which we don't want. > >> + if (cur->len) { >> + memcpy(&p->vec[p->vec_index], cur, >> + sizeof(struct page_region)); >> + p->vec_index++; >> + } >> + >> + cur->start = addr; >> + cur->len = n_pages; >> + cur->bitmap = bitmap; >> + p->found_pages += n_pages; >> + } else { >> + return -ENOSPC; >> + } >> + } >> + >> + return 0; >> +} > [...] > >> +static int pagemap_scan_deposit(struct pagemap_scan_private *p, >> + struct page_region __user *vec, >> + unsigned long *vec_index) >> +{ >> + struct page_region *cur = &p->cur; >> + >> + if (cur->len) { > > if (!cur->len) > return 0; Sure. > >> + if (copy_to_user(&vec[*vec_index], cur, >> + sizeof(struct page_region))) >> + return -EFAULT; >> + >> + p->vec_index++; >> + (*vec_index)++; >> + } >> + >> + return 0; >> +} > >> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start, >> + unsigned long end, struct mm_walk *walk) >> +{ >> + struct pagemap_scan_private *p = walk->private; >> + struct vm_area_struct *vma = walk->vma; >> + bool is_writ, is_file, is_pres, is_swap; >> + unsigned long addr = end; >> + spinlock_t *ptl; >> + int ret = 0; >> + pte_t *pte; >> + >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > > Is the `#ifdef` needed? `pmd_trans_huge_lock()` will always return > NULL if transparent hugepages are not compiled in. OTOH I see > BUILD_BUG() is possible in HPAGE_SIZE definition (irrelevant in this > case), so that would need to be worked around first. I'd got the build error [1]. So I'd added these. I've tested it again with the same config. We don't need these #ifdef now. I'll remove these. [1] https://lore.kernel.org/all/202211120107.cYLiq2cH-lkp@intel.com > >> + ptl = pmd_trans_huge_lock(pmd, vma); >> + if (ptl) { >> + unsigned long n_pages; >> + >> + is_writ = !is_pmd_uffd_wp(*pmd); > > `is_written`? I'd kept it is_writ to match the pattern of is_file, is_pres and is_swap. 
I'll update it to is_written and is_pres to is_present. > >> + /* >> + * Break huge page into small pages if operation needs to be >> + * performed is on a portion of the huge page. >> + */ >> + if (is_writ && PM_SCAN_OP_IS_WP(p) && >> + (end - start < HPAGE_SIZE)) { >> + spin_unlock(ptl); >> + >> + split_huge_pmd(vma, pmd, start); >> + goto process_smaller_pages; >> + } >> + >> + n_pages = (end - start)/PAGE_SIZE; >> + if (p->max_pages && >> + p->found_pages + n_pages >= p->max_pages) > > Nit: greater-than is also correct and avoids no-op assignment. Ohh... I'll update. > >> + n_pages = p->max_pages - p->found_pages; >> + >> + ret = pagemap_scan_output(is_writ, vma->vm_file, >> + pmd_present(*pmd), is_swap_pmd(*pmd), >> + p, start, n_pages); >> + spin_unlock(ptl); > > if (ret || !is_written) > return ret; > > This will avoid those tests in the following if(). Done. > >> + >> + if (!ret && is_writ && PM_SCAN_OP_IS_WP(p) && >> + uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true) < 0) >> + ret = -EINVAL; >> + >> + return ret; > > After above early returns, this will be always `return 0;`. Sure. > >> + } >> +process_smaller_pages: >> + if (pmd_trans_unstable(pmd)) >> + return 0; >> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ >> + >> + for (addr = start; !ret && addr < end; pte++, addr += PAGE_SIZE) { > > The `!ret` can be removed if the EINVAL case was to `break` by itself. Sure. Will do. 
> >> + pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); >> + >> + is_writ = !is_pte_uffd_wp(*pte); >> + is_file = vma->vm_file; >> + is_pres = pte_present(*pte); >> + is_swap = is_swap_pte(*pte); >> + >> + pte_unmap_unlock(pte, ptl); >> + >> + ret = pagemap_scan_output(is_writ, is_file, is_pres, is_swap, >> + p, addr, 1); >> + if (ret) >> + break; >> + >> + if (PM_SCAN_OP_IS_WP(p) && is_writ && >> + uffd_wp_range(walk->mm, vma, addr, PAGE_SIZE, true) < 0) >> + ret = -EINVAL; >> + } >> + >> + cond_resched(); >> + return ret; >> +} >> + >> +static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, >> + int depth, struct mm_walk *walk) >> +{ >> + struct pagemap_scan_private *p = walk->private; >> + struct vm_area_struct *vma = walk->vma; >> + unsigned long n_pages; >> + int ret = 0; >> + >> + if (vma) { > > if (!vma) return 0; Will do. > >> + n_pages = (end - addr)/PAGE_SIZE; >> + if (p->max_pages && >> + p->found_pages + n_pages >= p->max_pages) >> + n_pages = p->max_pages - p->found_pages; >> + >> + ret = pagemap_scan_output(false, vma->vm_file, false, false, p, >> + addr, n_pages); >> + } >> + >> + return ret; >> +} > > >> +/* No hugetlb support is present. */ > > "FIXME: hugetlb support is not implemented."? (There seems to be no > #ifdef CONFIG_HUGETLB or similar, so I guess the comment is about the > current implementation.) I'm working on adding hugetlb support. I'll remove this comment. 
>
>> +static const struct mm_walk_ops pagemap_scan_ops = {
>> +	.test_walk = pagemap_scan_test_walk,
>> +	.pmd_entry = pagemap_scan_pmd_entry,
>> +	.pte_hole = pagemap_scan_pte_hole,
>> +};
>> +
>> +static bool pagemap_scan_args_valid(struct pm_scan_arg *arg,
>> +				    struct page_region __user *vec,
>> +				    unsigned long start)
>> +{
>> +	/* Detect illegal size, flags and masks */
>> +	if (arg->size != sizeof(struct pm_scan_arg))
>> +		return false;
>> +	if (arg->flags & ~PM_SCAN_OPS)
>> +		return false;
>> +	if ((arg->required_mask | arg->anyof_mask | arg->excluded_mask |
>> +	     arg->return_mask) & ~PM_SCAN_BITS_ALL)
>> +		return false;
>
>> +	if (!arg->required_mask && !arg->anyof_mask &&
>> +	    !arg->excluded_mask)
>> +		return false;
>
> Is there an assumption in the code that those checks are needed? I'd
> expect that no selection criteria makes a valid page set?
In my view, a selection criterion must be specified for the ioctl to work.
If there is no criterion, the user should go and read the pagemap file
directly. So the assumption is that at least one selection criterion must
be specified.

>
>> +	if (!arg->return_mask)
>> +		return false;
>> +
>> +	/* Validate memory ranges */
>> +	if (!(arg->flags & PM_SCAN_OP_GET))
>> +		return false;
>> +	if (!arg->vec)
>> +		return false;
>> +	if (arg->vec_len == 0)
>> +		return false;
>
>> +	if (!access_ok((void __user *)vec,
>> +		       arg->vec_len * sizeof(struct page_region)))
>> +		return false;
>
> Is there a provision that userspace threads are all blocked from
> manipulating mmaps during this ioctl()? If not, this is a TOCTOU bug
> and the writes should be checked each time as another userspace thread
> could remap the memory while the ioctl() is working.
The mincore() syscall does it the same way: it checks the validity at the
start only. What provision should I add? Isn't it obvious that the user
should not remap such memory?

> Anyway, the
> return should be EFAULT for this case.
I'll update.
>
>> +	if (!IS_ALIGNED(start, PAGE_SIZE))
>> +		return false;
>> +	if (!access_ok((void __user *)start, arg->len))
>> +		return false;
>
> This I guess wants to check if the range to be scanned is mapped -
> but isn't this what the ioctl() should do during the scan? (But, also
> see above.)
No, start represents the memory which the user wants to watch. The user
must allocate this memory first and then pass the address to this ioctl to
find out the flags per page.

>
>> +	if (PM_SCAN_OP_IS_WP(arg)) {
>
> if (!...IS_WP) return true;
I liked it this way. Anyway, I'll update.

>
>> +		if (arg->required_mask & PM_SCAN_NON_WT_BITS)
>> +			return false;
>> +		if (arg->anyof_mask & PM_SCAN_NON_WT_BITS)
>> +			return false;
>> +		if (arg->excluded_mask & PM_SCAN_NON_WT_BITS)
>> +			return false;
>
> Please see: pagemap_scan_is_wt_required comment. Also, it seems this
> constant is used only here, so ~PAGE_IS_WRITTEN might be enough?
Yup, I'll update.

>
> [...]
>> +static long do_pagemap_cmd(struct mm_struct *mm, struct pm_scan_arg *arg)
>> +{
>> +	unsigned long start, end, walk_start, walk_end;
>> +	unsigned long empty_slots, vec_index = 0;
>> +	struct page_region __user *vec;
>> +	struct pagemap_scan_private p;
>> +	int ret = 0;
>> +
>> +	start = (unsigned long)untagged_addr(arg->start);
>> +	vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec);
>> +
>> +	if (!pagemap_scan_args_valid(arg, vec, start))
>> +		return -EINVAL;
>> +
>> +	end = start + arg->len;
>> +	p.max_pages = arg->max_pages;
>> +	p.found_pages = 0;
>> +	p.flags = arg->flags;
>> +	p.required_mask = arg->required_mask;
>> +	p.anyof_mask = arg->anyof_mask;
>> +	p.excluded_mask = arg->excluded_mask;
>> +	p.return_mask = arg->return_mask;
>> +	p.cur.len = 0;
>> +	p.vec = NULL;
>> +	p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
>> +
>> +	/*
>> +	 * Allocate smaller buffer to get output from inside the page walk
>> +	 * functions and walk page range in PAGEMAP_WALK_SIZE size chunks.
As >> + * we want to return output to user in compact form where no two >> + * consecutive regions should be continuous and have the same flags. >> + * So store the latest element in p.cur between different walks and >> + * store the p.cur at the end of the walk to the user buffer. >> + */ >> + p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region), >> + GFP_KERNEL); >> + if (!p.vec) >> + return -ENOMEM; >> + >> + walk_start = walk_end = start; >> + while (walk_end < end) { >> + p.vec_index = 0; >> + >> + empty_slots = arg->vec_len - vec_index; >> + p.vec_len = min(p.vec_len, empty_slots); >> + >> + walk_end = (walk_start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK; >> + if (walk_end > end) >> + walk_end = end; >> + >> + mmap_read_lock(mm); >> + ret = walk_page_range(mm, walk_start, walk_end, >> + &pagemap_scan_ops, &p); >> + mmap_read_unlock(mm); >> + >> + if (!(!ret || ret == -ENOSPC)) > > if (ret && ret != -ENOSPC) Sorry, I should have thought of this one. Thanks. > >> + goto free_data; >> + >> + walk_start = walk_end; >> + if (p.vec_index) { >> + if (copy_to_user(&vec[vec_index], p.vec, >> + p.vec_index * >> + sizeof(struct page_region))) { >> + ret = -EFAULT; >> + goto free_data; >> + } >> + vec_index += p.vec_index; >> + } >> + } >> + ret = pagemap_scan_deposit(&p, vec, &vec_index); >> + if (!ret) >> + ret = vec_index; >> +free_data: >> + kfree(p.vec); >> + >> + return ret; >> +} >> + >> +static long pagemap_scan_ioctl(struct file *file, unsigned int cmd, >> + unsigned long arg) >> +{ >> + struct pm_scan_arg __user *uarg = (struct pm_scan_arg __user *)arg; >> + struct mm_struct *mm = file->private_data; >> + struct pm_scan_arg argument; >> + >> + if (cmd == PAGEMAP_SCAN) { > > switch() for easier expansion later? I'd switch here once. I'll add it again. > >> + if (copy_from_user(&argument, uarg, >> + sizeof(struct pm_scan_arg))) > > sizeof(*argument); > > Could you push this to do_pagemap_cmd()? 
In case this file gets more > ioctl() commands there won't be need to add more command-specific > structures in this function. Sure, I'll update. > >> + return -EFAULT; >> + return do_pagemap_cmd(mm, &argument); >> + } >> + >> + return -EINVAL; >> +} >> + >> const struct file_operations proc_pagemap_operations = { >> .llseek = mem_lseek, /* borrow this */ >> .read = pagemap_read, >> .open = pagemap_open, >> .release = pagemap_release, >> + .unlocked_ioctl = pagemap_scan_ioctl, >> + .compat_ioctl = pagemap_scan_ioctl, > > Is this correct? Would the code need a different userspace pointer > handling for 32-bit userspace on 64-bit kernel? Yeah, it is needed for 32-bit application to run on 64-bit kernel. > >> }; >> #endif /* CONFIG_PROC_PAGE_MONITOR */
Hello, Muhammad,

On Thu, Mar 09, 2023 at 06:57:12PM +0500, Muhammad Usama Anjum wrote:
> Add new WP Async mode (UFFD_FEATURE_WP_ASYNC) which resolves the page
> faults on its own. It can be used to track which pages have been
> written-to from the time the pages were write-protected. It is a very
> efficient way to track the changes as uffd is by nature pte/pmd based.
>
> UFFD synchronous WP sends the page faults to the userspace where the
> pages which have been written-to can be tracked. But it is not efficient.
> This is why this asynchronous version is being added. After setting the
> WP Async, the pages which have been written to can be found in the pagemap
> file or information can be obtained from the PAGEMAP_IOCTL.
>
> Suggested-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>

Here's the patch that can enable WP_ASYNC for all kinds of memory (as I
promised..). Currently I only tested btrfs (besides the common three),
which is the major fs I use locally, but I guess it'll also enable the
rest no matter what's underneath, just like soft-dirty.

As I mentioned, I just find it very unfortunate to have a lot of suffixes
for the UFFD_FEATURE_* on types of memory, and I hope we get rid of that
for this WP_ASYNC from the start, because the workflow should really be
similar to anon/shmem handling for most of the rest, just a few tweaks
here and there. I had a feeling that some type of special VMA would work
weirdly, but let's see.. so far I haven't come up with any.

If the patch looks fine to you, please consider replacing this patch with
patch 1 of mine, attached. Then patch 1 can be reviewed alongside your
series. Logically patch 1 can be reviewed separately too, because it
works perfectly afaiu without the atomic version of pagemap already.
But on my side I don't think it justifies anything that really matters, so
unless someone thinks it a good idea to post / review / merge it
separately, you can keep that with your new pagemap ioctl.

Patch 2 is only for your reference. It's not of merge quality, so please
don't put it into your series. I do plan to clean up the userfaultfd
selftests in the near future first (as I write this I am even more eager
to do so..). I also think your final pagemap test cases can cover quite a
bit.

Thanks,
On Thu, 16 Mar 2023 at 18:53, Muhammad Usama Anjum <usama.anjum@collabora.com> wrote: > > Hi, > > Thank you so much for reviewing. > > On 3/13/23 9:02 PM, Michał Mirosław wrote: > > On Thu, 9 Mar 2023 at 14:58, Muhammad Usama Anjum > > <usama.anjum@collabora.com> wrote: [...] > >> --- a/fs/proc/task_mmu.c > >> +++ b/fs/proc/task_mmu.c [...] > >> +static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap, > >> + struct pagemap_scan_private *p, > >> + unsigned long addr, unsigned int n_pages) > >> +{ > >> + unsigned long bitmap = PM_SCAN_BITMAP(wt, file, pres, swap); > >> + struct page_region *cur = &p->cur; > >> + bool cpy = true; > >> + > >> + if (p->max_pages && (p->found_pages == p->max_pages)) > >> + return -ENOSPC; > >> + > >> + if (!n_pages) > >> + return -EINVAL; > >> + > >> + if (p->required_mask) > >> + cpy = ((p->required_mask & bitmap) == p->required_mask); > >> + if (cpy && p->anyof_mask) > >> + cpy = (p->anyof_mask & bitmap); > >> + if (cpy && p->excluded_mask) > >> + cpy = !(p->excluded_mask & bitmap); > > > > Since the rest of the code is executed only when `cpy` is true, this > > could just return early for easier understanding. > Hmm... I'll do the following: > if (!cpy || !bitmap) > return 0; > > BTW, some of the tests are redundant. Eg: if required_mask == 0, then > > `required_mask & x == required_mask` will always hold. Same for > > `excluded_mask & x == 0`. > Correct. This is why I'm checking if required_mask is set and then > comparing bitmap with it. required_mask may be 0 if not set. This if will > ignore the subsequent check. > > if (p->required_mask) > cpy = ((p->required_mask & bitmap) == p->required_mask); > > I don't see any redundancy here. Please let me know otherwise? [...] > >> + if (cpy && bitmap) { > > > > Assuming early returns on `!cpy` are done earlier: > > > > if (!bitmap) > > return 0; > I've posted condition above which would better suit here. [...] 
Since the `cpy` condition is updated and passed to each new branch (IOW: after setting cpy = 0 for whatever reason all the further code is skipped) you can drop the variable and do early returns everywhere. E.g.: if ((bitmap & p->required_mask) != p->required_mask) return 0; if (p->anyof_mask && !(bitmap & p->anyof_mask)) return 0; if (bitmap & p->excluded_mask) return 0; if (!bitmap) return 0; Also you can take the "special" effect of masking with zero to be always zero (and in C - false) to avoid testing for an empty mask separately in most cases. > Just for my knowledge, what does "Nit" signify if a comment is marked with it? A low priority / cosmetic item that you might consider ignoring if a fix is too expensive or controversial. >> + if ((cur->len) && (cur->bitmap == bitmap) && >> + (cur->start + cur->len * PAGE_SIZE == addr)) { > > I'd recommend removing the extra parentheses as they make the code > less readable for me (too many parentheses to match visually). I'll remove parenthesis. [...] >> The `cur->len` test seems redundant: is it possible to have >> `cur->start == addr` in that case (I guess it would have to get >> `n_pages == 0` in an earlier invocation)? > No, both wouldn't work. cur->len == 0 means that it has only garbage. It is > essential to check the validity from cur->len before performing other > checks. Also cur->start can never be equal to addr as we are walking over > page addressing in serial manner. We want to see here if the current > address matches the previous data by finding the ending address of last > stored data (cur->start + cur->len * PAGE_SIZE). If cur->len == 0, then it doesn't matter if it gets merged or not - it can be filtered out during the flush (see below). [...] > >> + } else if ((!p->vec_index) || > >> + ((p->vec_index + 1) < p->vec_len)) { > > > > Can you explain this test? Why not just `p->vec_index < p->vec_len`? Or better: > > > > if (vec_index >= p->vec_len) > > return -ENOSPC; > > No, it'll not work. 
Lets leave it as it is. :) > > It has gotten somewhat complex, but I don't have any other way to make it > simpler which works. First note the following points: > 1) We walk over 512 page or 1 thp at a time to not over allocate memory in > kernel (p->vec). > 2) We also want to merge the consecutive pages with the same flags into one > struct page_region. p->vec of current walk may merge with next walk. So we > cannot write to user memory until we find the results of the next walk. > > So most recent data is put into p->cur. When non-intersecting or mergeable > data is found, we move p->cur to p->vec[p->index] inside the page walk. > After the page walk, p->vec[0 to p->index] is moved to arg->vec. After all > the walks are over. We move the p->cur to arg->vec. It completes the data > transfer to user buffer. [...] > I'm so sorry that it has gotten this much complex. It was way simpler when > we were walking over all the memory in one go. But then we needed an > unbounded memory from the kernel which we don't want. [...] I've gone through and hopefully understood the code. I'm not sure this needs to be so complicated: when traversing a single PMD you can always copy p->cur to p->vec[p->vec_index++] because you can have at most pages_per_PMD non-merges (in the worst case the last page always is left in p->cur and whole p->vec is used). After each PMD p->vec needs a flush if p->vec_index > 0, skipping the dummy entry at front (len == 0; if present). (This is mostly how it is implemented now, but I propose to remove the "overflow" check and do the starting guard removal only every PMD.) BTW, the pagemap_scan_deposit() got me a bit confused: it seems that it is just a copy of the p->vec flush to userspace. Please either use it for both p->vec and p->cur flushing or inline. BTW#2, I think the ENOSPC return in pagemap_scan_output() should happen later - only if the pages would match and that caused the count to exceed the limit. 
For THP n_pages should be truncated to the limit (and ENOSPC returned right away) only after the pages were verified to match. [...] > >> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start, > >> + unsigned long end, struct mm_walk *walk) > >> +{ > >> + struct pagemap_scan_private *p = walk->private; > >> + struct vm_area_struct *vma = walk->vma; > >> + bool is_writ, is_file, is_pres, is_swap; > >> + unsigned long addr = end; > >> + spinlock_t *ptl; > >> + int ret = 0; > >> + pte_t *pte; > >> + > >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > > > > Is the `#ifdef` needed? `pmd_trans_huge_lock()` will always return > > NULL if transparent hugepages are not compiled in. OTOH I see > > BUILD_BUG() is possible in HPAGE_SIZE definition (irrelevant in this > > case), so that would need to be worked around first. > I'd got the build error [1]. So I'd added these. I've tested it again with > the same config. We don't need these #ifdef now. I'll remove these. I mean that there are cases like [1] that actually need the #ifdef at least to wrap HPAGE_SIZE usage. But maybe just this constant can be wrapped so that we keep the code always compile-tested? [1] https://elixir.bootlin.com/linux/v6.3-rc2/source/arch/mips/include/asm/page.h#L66 [...] > >> + if (!arg->required_mask && !arg->anyof_mask && > >> + !arg->excluded_mask) > >> + return false; > > > > Is there an assumption in the code that those checks are needed? I'd > > expect that no selection criteria makes a valid page set? > In my view, selection criterion must be specified for the ioctl to work. If > there is no criterio, user should go and read pagemap file directly. So the > assumption is that at least one selection criterion must be specified. Yes. I'm not sure we need to prevent multiple ways of doing the same thing. But doesn't pagemap reading lack the range aggregation feature? [...] 
> >> + if (!access_ok((void __user *)vec, > >> + arg->vec_len * sizeof(struct page_region))) > >> + return false; > > > > Is there a provision that userspace threads are all blocked from > > manipulating mmaps during this ioctl()? If not, this is a TOCTOU bug > > and the writes should be checked each time as another userspace thread > > could remap the memory while the ioctl() is working. > mincore() syscall is doing in the same way. It checks the validity in the > start only. What provision should I add? Isn't it obvious that the user > should not remap such memory? On the second look, I think the code already checks that while doing copy_to_user(), so this check is redundant and can be removed. > > > >> + if (!IS_ALIGNED(start, PAGE_SIZE)) > >> + return false; > >> + if (!access_ok((void __user *)start, arg->len)) > >> + return false; > > > > This I guess wants to check if the range to be scanned is mapped - > > but isn't this what the ioctl() should do during the scan? (But, also > > see above.) > No, start represents the memory which the user wants to watch. User must > allocate this memory first and then pass the address to this ioctl to find > out the flags per page. From: + * struct pm_scan_arg - Pagemap ioctl argument + * @size: Size of the structure + * @flags: Flags for the IOCTL + * @start: Starting address of the region + * @len: Length of the region (All the pages in this length are included) ... I'd expect the `start` field to just be a virtual address to start scanning from. Does it need to be mapped? For CRIU usecase I'd start with "start = 0" to find out all mappings, but 0 is (always) not mapped. Is this supposed to only work on already discovered page ranges? Anyway, I'd expect the code should be tolerant of another thread changing the mappings while this ioctl() is walking the page tables - is it so? If yes, then this check serves at most as an optimization used only for an invalid call. 
> >> const struct file_operations proc_pagemap_operations = { > >> .llseek = mem_lseek, /* borrow this */ > >> .read = pagemap_read, > >> .open = pagemap_open, > >> .release = pagemap_release, > >> + .unlocked_ioctl = pagemap_scan_ioctl, > >> + .compat_ioctl = pagemap_scan_ioctl, > > > > Is this correct? Would the code need a different userspace pointer > > handling for 32-bit userspace on 64-bit kernel? > Yeah, it is needed for 32-bit application to run on 64-bit kernel. I mean is using the same function for both entry points correct? Don't the pointers to userspace memory (e.g. arg->vec) need to be mapped for 32-bit process? Best Regards Michał Mirosław
On 3/17/23 2:28 AM, Michał Mirosław wrote: > On Thu, 16 Mar 2023 at 18:53, Muhammad Usama Anjum > <usama.anjum@collabora.com> wrote: >> >> Hi, >> >> Thank you so much for reviewing. >> >> On 3/13/23 9:02 PM, Michał Mirosław wrote: >>> On Thu, 9 Mar 2023 at 14:58, Muhammad Usama Anjum >>> <usama.anjum@collabora.com> wrote: > [...] >>>> --- a/fs/proc/task_mmu.c >>>> +++ b/fs/proc/task_mmu.c > [...] >>>> +static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap, >>>> + struct pagemap_scan_private *p, >>>> + unsigned long addr, unsigned int n_pages) >>>> +{ >>>> + unsigned long bitmap = PM_SCAN_BITMAP(wt, file, pres, swap); >>>> + struct page_region *cur = &p->cur; >>>> + bool cpy = true; >>>> + >>>> + if (p->max_pages && (p->found_pages == p->max_pages)) >>>> + return -ENOSPC; >>>> + >>>> + if (!n_pages) >>>> + return -EINVAL; >>>> + >>>> + if (p->required_mask) >>>> + cpy = ((p->required_mask & bitmap) == p->required_mask); >>>> + if (cpy && p->anyof_mask) >>>> + cpy = (p->anyof_mask & bitmap); >>>> + if (cpy && p->excluded_mask) >>>> + cpy = !(p->excluded_mask & bitmap); >>> >>> Since the rest of the code is executed only when `cpy` is true, this >>> could just return early for easier understanding. >> Hmm... I'll do the following: >> if (!cpy || !bitmap) >> return 0; >>> BTW, some of the tests are redundant. Eg: if required_mask == 0, then >>> `required_mask & x == required_mask` will always hold. Same for >>> `excluded_mask & x == 0`. >> Correct. This is why I'm checking if required_mask is set and then >> comparing bitmap with it. required_mask may be 0 if not set. This if will >> ignore the subsequent check. >> >> if (p->required_mask) >> cpy = ((p->required_mask & bitmap) == p->required_mask); >> >> I don't see any redundancy here. Please let me know otherwise? > [...] 
>>>> + if (cpy && bitmap) { >>> >>> Assuming early returns on `!cpy` are done earlier: >>> >>> if (!bitmap) >>> return 0; >> I've posted condition above which would better suit here. > [...] > > Since the `cpy` condition is updated and passed to each new branch > (IOW: after setting cpy = 0 for whatever reason all the further code > is skipped) you can drop the variable and do early returns everywhere. > E.g.: > > if ((bitmap & p->required_mask) != p->required_mask) > return 0; > if (p->anyof_mask && !(bitmap & p->anyof_mask)) > return 0; > if (bitmap & p->excluded_mask) > return 0; > if (!bitmap) > return 0; Clever. Will do. > > Also you can take the "special" effect of masking with zero to be > always zero (and in C - false) to avoid testing for an empty mask > separately in most cases. Done. > >> Just for my knowledge, what does "Nit" signify if a comment is marked with it? > > A low priority / cosmetic item that you might consider ignoring if a > fix is too expensive or controversial. > >>> + if ((cur->len) && (cur->bitmap == bitmap) && >>> + (cur->start + cur->len * PAGE_SIZE == addr)) { >> >> I'd recommend removing the extra parentheses as they make the code >> less readable for me (too many parentheses to match visually). > I'll remove parenthesis. > > [...] >>> The `cur->len` test seems redundant: is it possible to have >>> `cur->start == addr` in that case (I guess it would have to get >>> `n_pages == 0` in an earlier invocation)? >> No, both wouldn't work. cur->len == 0 means that it has only garbage. It is >> essential to check the validity from cur->len before performing other >> checks. Also cur->start can never be equal to addr as we are walking over >> page addressing in serial manner. We want to see here if the current >> address matches the previous data by finding the ending address of last >> stored data (cur->start + cur->len * PAGE_SIZE). 
> > If cur->len == 0, then it doesn't matter if it gets merged or not - it > can be filtered out during the flush (see below). > > [...] >>>> + } else if ((!p->vec_index) || >>>> + ((p->vec_index + 1) < p->vec_len)) { >>> >>> Can you explain this test? Why not just `p->vec_index < p->vec_len`? Or better: >>> >>> if (vec_index >= p->vec_len) >>> return -ENOSPC; >> >> No, it'll not work. Lets leave it as it is. :) >> >> It has gotten somewhat complex, but I don't have any other way to make it >> simpler which works. First note the following points: >> 1) We walk over 512 page or 1 thp at a time to not over allocate memory in >> kernel (p->vec). >> 2) We also want to merge the consecutive pages with the same flags into one >> struct page_region. p->vec of current walk may merge with next walk. So we >> cannot write to user memory until we find the results of the next walk. >> >> So most recent data is put into p->cur. When non-intersecting or mergeable >> data is found, we move p->cur to p->vec[p->index] inside the page walk. >> After the page walk, p->vec[0 to p->index] is moved to arg->vec. After all >> the walks are over. We move the p->cur to arg->vec. It completes the data >> transfer to user buffer. > [...] >> I'm so sorry that it has gotten this much complex. It was way simpler when >> we were walking over all the memory in one go. But then we needed an >> unbounded memory from the kernel which we don't want. > [...] > > I've gone through and hopefully understood the code. I'm not sure this > needs to be so complicated: when traversing a single PMD you can > always copy p->cur to p->vec[p->vec_index++] because you can have at > most pages_per_PMD non-merges (in the worst case the last page always > is left in p->cur and whole p->vec is used). After each PMD p->vec > needs a flush if p->vec_index > 0, skipping the dummy entry at front > (len == 0; if present). 
(This is mostly how it is implemented now, but > I propose to remove the "overflow" check and do the starting guard > removal only every PMD.) Sorry, unable to understand where to remove the guard? > > BTW, the pagemap_scan_deposit() got me a bit confused: it seems that > it is just a copy of the p->vec flush to userspace. Please either use > it for both p->vec and p->cur flushing or inline. I can inline this function if you say so, now that you understand all the logic. I don't see what else can be done here. > > BTW#2, I think the ENOSPC return in pagemap_scan_output() should > happen later - only if the pages would match and that caused the count > to exceed the limit. For THP n_pages should be truncated to the limit > (and ENOSPC returned right away) only after the pages were verified to > match. We have 2 counters here: * the p->max_pages optionally can be set to find out only N pages of interest. So p->found_pages is counting this. We need to return early if the page limit is complete. * the p->vec_index keeps track of output buffer array size > > [...] >>>> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start, >>>> + unsigned long end, struct mm_walk *walk) >>>> +{ >>>> + struct pagemap_scan_private *p = walk->private; >>>> + struct vm_area_struct *vma = walk->vma; >>>> + bool is_writ, is_file, is_pres, is_swap; >>>> + unsigned long addr = end; >>>> + spinlock_t *ptl; >>>> + int ret = 0; >>>> + pte_t *pte; >>>> + >>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE >>> >>> Is the `#ifdef` needed? `pmd_trans_huge_lock()` will always return >>> NULL if transparent hugepages are not compiled in. OTOH I see >>> BUILD_BUG() is possible in HPAGE_SIZE definition (irrelevant in this >>> case), so that would need to be worked around first. >> I'd got the build error [1]. So I'd added these. I've tested it again with >> the same config. We don't need these #ifdef now. I'll remove these. 
>
> I mean that there are cases like [1] that actually need the #ifdef at
> least to wrap HPAGE_SIZE usage. But maybe just this constant can be
> wrapped so that we keep the code always compile-tested?
Some arches define HPAGE_SIZE even if the huge page config isn't enabled
and some don't. Let's just add #ifdef CONFIG_TRANSPARENT_HUGEPAGE as-is,
just like the similar code in this same file uses this same #ifdef.

>
> [1] https://elixir.bootlin.com/linux/v6.3-rc2/source/arch/mips/include/asm/page.h#L66
>
> [...]
>>>> +	if (!arg->required_mask && !arg->anyof_mask &&
>>>> +	    !arg->excluded_mask)
>>>> +		return false;
>>>
>>> Is there an assumption in the code that those checks are needed? I'd
>>> expect that no selection criteria makes a valid page set?
>> In my view, selection criterion must be specified for the ioctl to work. If
>> there is no criterio, user should go and read pagemap file directly. So the
>> assumption is that at least one selection criterion must be specified.
>
> Yes. I'm not sure we need to prevent multiple ways of doing the same
> thing. But doesn't pagemap reading lack the range aggregation feature?
Yeah, correct. But note that we are supporting only 4 selected flags in
this ioctl, not all the pagemap flags. So it is useful only for those
users who depend only on these 4 flags. Our pagemap_ioctl interface is not
generic enough to cater to everyone. Its interface is specific and we are
adding only those cases which are of interest to us. So if someone wants
range aggregation from pagemap_ioctl, he'll need to add that flag to the
IOCTL first. When IOCTL support is added, he can specify the selection
criterion etc.

>
> [...]
>>>> +	if (!access_ok((void __user *)vec,
>>>> +		       arg->vec_len * sizeof(struct page_region)))
>>>> +		return false;
>>>
>>> Is there a provision that userspace threads are all blocked from
>>> manipulating mmaps during this ioctl()?
If not, this is a TOCTOU bug >>> and the writes should be checked each time as another userspace thread >>> could remap the memory while the ioctl() is working. >> mincore() syscall is doing in the same way. It checks the validity in the >> start only. What provision should I add? Isn't it obvious that the user >> should not remap such memory? > > On the second look, I think the code already checks that while doing > copy_to_user(), so this check is redundant and can be removed. I'll remove. > >>> >>>> + if (!IS_ALIGNED(start, PAGE_SIZE)) >>>> + return false; >>>> + if (!access_ok((void __user *)start, arg->len)) >>>> + return false; >>> >>> This I guess wants to check if the range to be scanned is mapped - >>> but isn't this what the ioctl() should do during the scan? (But, also >>> see above.) >> No, start represents the memory which the user wants to watch. User must >> allocate this memory first and then pass the address to this ioctl to find >> out the flags per page. > > From: > + * struct pm_scan_arg - Pagemap ioctl argument > + * @size: Size of the structure > + * @flags: Flags for the IOCTL > + * @start: Starting address of the region > + * @len: Length of the region (All the pages in this > length are included) > ... > > I'd expect the `start` field to just be a virtual address to start > scanning from. Does it need to be mapped? For CRIU usecase I'd start > with "start = 0" to find out all mappings, but 0 is (always) not > mapped. Is this supposed to only work on already discovered page > ranges? Anyway, I'd expect the code should be tolerant of another > thread changing the mappings while this ioctl() is walking the page > tables - is it so? If yes, then this check serves at most as an > optimization used only for an invalid call. Ohh... Ignore my previous comment. Yeah, any valid memory range can be passed to view the page flags. This check just verifies if the memory range is valid. 
> >>>> const struct file_operations proc_pagemap_operations = { >>>> .llseek = mem_lseek, /* borrow this */ >>>> .read = pagemap_read, >>>> .open = pagemap_open, >>>> .release = pagemap_release, >>>> + .unlocked_ioctl = pagemap_scan_ioctl, >>>> + .compat_ioctl = pagemap_scan_ioctl, >>> >>> Is this correct? Would the code need a different userspace pointer >>> handling for 32-bit userspace on 64-bit kernel? >> Yeah, it is needed for 32-bit applications to run on a 64-bit kernel. > > I mean is using the same function for both entry points correct? Don't > the pointers to userspace memory (e.g. arg->vec) need to be mapped for > 32-bit process? No, every member in our argument structure is 64-bit, which keeps the memory layout the same. So we don't need any specific conversion here. (Even if we had a 32-bit variable, we would just need to make sure that the layout remains the same in memory.) Thanks, Usama > > Best Regards > > Michał Mirosław
On Fri, 17 Mar 2023 at 13:44, Muhammad Usama Anjum <usama.anjum@collabora.com> wrote: > On 3/17/23 2:28 AM, Michał Mirosław wrote: > > On Thu, 16 Mar 2023 at 18:53, Muhammad Usama Anjum > > <usama.anjum@collabora.com> wrote: > >> On 3/13/23 9:02 PM, Michał Mirosław wrote: > >>> On Thu, 9 Mar 2023 at 14:58, Muhammad Usama Anjum > >>> <usama.anjum@collabora.com> wrote: > > [...] > >>>> --- a/fs/proc/task_mmu.c > >>>> +++ b/fs/proc/task_mmu.c > > [...] > >>>> +static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap, > > [...] > >>> The `cur->len` test seems redundant: is it possible to have > >>> `cur->start == addr` in that case (I guess it would have to get > >>> `n_pages == 0` in an earlier invocation)? > >> No, both wouldn't work. cur->len == 0 means that it has only garbage. It is > >> essential to check the validity from cur->len before performing other > >> checks. Also cur->start can never be equal to addr as we are walking over > >> page addressing in serial manner. We want to see here if the current > >> address matches the previous data by finding the ending address of last > >> stored data (cur->start + cur->len * PAGE_SIZE). > > > > If cur->len == 0, then it doesn't matter if it gets merged or not - it > > can be filtered out during the flush (see below). > > [...] > >>>> + } else if ((!p->vec_index) || > >>>> + ((p->vec_index + 1) < p->vec_len)) { > >>> > >>> Can you explain this test? Why not just `p->vec_index < p->vec_len`? Or better: > >>> > >>> if (vec_index >= p->vec_len) > >>> return -ENOSPC; > >> > >> No, it'll not work. Lets leave it as it is. :) > >> > >> It has gotten somewhat complex, but I don't have any other way to make it > >> simpler which works. First note the following points: > >> 1) We walk over 512 page or 1 thp at a time to not over allocate memory in > >> kernel (p->vec). > >> 2) We also want to merge the consecutive pages with the same flags into one > >> struct page_region. 
p->vec of current walk may merge with next walk. So we > >> cannot write to user memory until we find the results of the next walk. > >> > >> So most recent data is put into p->cur. When non-intersecting or mergeable > >> data is found, we move p->cur to p->vec[p->index] inside the page walk. > >> After the page walk, p->vec[0 to p->index] is moved to arg->vec. After all > >> the walks are over. We move the p->cur to arg->vec. It completes the data > >> transfer to user buffer. > > [...] > >> I'm so sorry that it has gotten this much complex. It was way simpler when > >> we were walking over all the memory in one go. But then we needed an > >> unbounded memory from the kernel which we don't want. > > [...] > > > > I've gone through and hopefully understood the code. I'm not sure this > > needs to be so complicated: when traversing a single PMD you can > > always copy p->cur to p->vec[p->vec_index++] because you can have at > > most pages_per_PMD non-merges (in the worst case the last page always > > is left in p->cur and whole p->vec is used). After each PMD p->vec > > needs a flush if p->vec_index > 0, skipping the dummy entry at front > > (len == 0; if present). (This is mostly how it is implemented now, but > > I propose to remove the "overflow" check and do the starting guard > > removal only every PMD.) > Sorry, unable to understand where to remove the guard? Instead of checking for it in pagemap_scan_output() for each page you can skip it in do_pagemap_cmd() when doing the flush. > > BTW#2, I think the ENOSPC return in pagemap_scan_output() should > > happen later - only if the pages would match and that caused the count > > to exceed the limit. For THP n_pages should be truncated to the limit > > (and ENOSPC returned right away) only after the pages were verified to > > match. > We have 2 counters here: > * the p->max_pages optionally can be set to find out only N pages of > interest. So p->found_pages is counting this. 
We need to return early if > the page limit is complete. > * the p->vec_index keeps track of output buffer array size I think I get how the limits are supposed to work, but I also think the implementation is not optimal. An example (assuming max_pages = 1 and vec_len = 1): - a matching page has been found - a second - non-matching - is tried but results in immediate -ENOSPC. -> In this case I'd expect the early return to happen just after the first page is found so that non A similar problem occurs for hugepage: when the limit is hit (we found >= max_pages, n_pages is possibly truncated), but the scan continues until next page / PMD. [...] > >>>> + if (!arg->required_mask && !arg->anyof_mask && > >>>> + !arg->excluded_mask) > >>>> + return false; > >>> > >>> Is there an assumption in the code that those checks are needed? I'd > >>> expect that no selection criteria makes a valid page set? > >> In my view, selection criterion must be specified for the ioctl to work. If > >> there is no criterio, user should go and read pagemap file directly. So the > >> assumption is that at least one selection criterion must be specified. > > > > Yes. I'm not sure we need to prevent multiple ways of doing the same > > thing. But doesn't pagemap reading lack the range aggregation feature? > Yeah, correct. But note that we are supporting only selective 4 flags in > this ioctl, not all pagemap flags. So it is useful for only those users who > depend only on these 4 flags. Out pagemap_ioctl interface is not so much > generic that we can cater anyone. Its interface is specific and we are > adding only those cases which are of our interest. So if someone wants > range aggregation from pagemap_ioctl, he'll need to add that flag in the > IOCTL first. When IOCTL support is added, he can specify the selection > criterion etc. The available flag set is not a problem. 
An example use case: dumping the memory state for debugging: ioctl(return_mask=ALL) returns a conveniently compact vector of ranges of pages that are actually used by the process (not only having reserved the virtual space). This is actually something that helps when dumping processes that use tools like AddressSanitizer, which create huge sparse mappings. Best Regards Michał Mirosław
On 3/17/23 7:15 PM, Michał Mirosław wrote: > On Fri, 17 Mar 2023 at 13:44, Muhammad Usama Anjum > <usama.anjum@collabora.com> wrote: >> On 3/17/23 2:28 AM, Michał Mirosław wrote: >>> On Thu, 16 Mar 2023 at 18:53, Muhammad Usama Anjum >>> <usama.anjum@collabora.com> wrote: >>>> On 3/13/23 9:02 PM, Michał Mirosław wrote: >>>>> On Thu, 9 Mar 2023 at 14:58, Muhammad Usama Anjum >>>>> <usama.anjum@collabora.com> wrote: >>> [...] >>>>>> --- a/fs/proc/task_mmu.c >>>>>> +++ b/fs/proc/task_mmu.c >>> [...] >>>>>> +static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap, >>> [...] >>>>> The `cur->len` test seems redundant: is it possible to have >>>>> `cur->start == addr` in that case (I guess it would have to get >>>>> `n_pages == 0` in an earlier invocation)? >>>> No, both wouldn't work. cur->len == 0 means that it has only garbage. It is >>>> essential to check the validity from cur->len before performing other >>>> checks. Also cur->start can never be equal to addr as we are walking over >>>> page addressing in serial manner. We want to see here if the current >>>> address matches the previous data by finding the ending address of last >>>> stored data (cur->start + cur->len * PAGE_SIZE). >>> >>> If cur->len == 0, then it doesn't matter if it gets merged or not - it >>> can be filtered out during the flush (see below). >>> [...] >>>>>> + } else if ((!p->vec_index) || >>>>>> + ((p->vec_index + 1) < p->vec_len)) { >>>>> >>>>> Can you explain this test? Why not just `p->vec_index < p->vec_len`? Or better: >>>>> >>>>> if (vec_index >= p->vec_len) >>>>> return -ENOSPC; >>>> >>>> No, it'll not work. Lets leave it as it is. :) >>>> >>>> It has gotten somewhat complex, but I don't have any other way to make it >>>> simpler which works. First note the following points: >>>> 1) We walk over 512 page or 1 thp at a time to not over allocate memory in >>>> kernel (p->vec). 
>>>> 2) We also want to merge the consecutive pages with the same flags into one >>>> struct page_region. p->vec of current walk may merge with next walk. So we >>>> cannot write to user memory until we find the results of the next walk. >>>> >>>> So most recent data is put into p->cur. When non-intersecting or mergeable >>>> data is found, we move p->cur to p->vec[p->index] inside the page walk. >>>> After the page walk, p->vec[0 to p->index] is moved to arg->vec. After all >>>> the walks are over. We move the p->cur to arg->vec. It completes the data >>>> transfer to user buffer. >>> [...] >>>> I'm so sorry that it has gotten this much complex. It was way simpler when >>>> we were walking over all the memory in one go. But then we needed an >>>> unbounded memory from the kernel which we don't want. >>> [...] >>> >>> I've gone through and hopefully understood the code. I'm not sure this >>> needs to be so complicated: when traversing a single PMD you can >>> always copy p->cur to p->vec[p->vec_index++] because you can have at >>> most pages_per_PMD non-merges (in the worst case the last page always >>> is left in p->cur and whole p->vec is used). After each PMD p->vec >>> needs a flush if p->vec_index > 0, skipping the dummy entry at front >>> (len == 0; if present). (This is mostly how it is implemented now, but >>> I propose to remove the "overflow" check and do the starting guard >>> removal only every PMD.) >> Sorry, unable to understand where to remove the guard? > > Instead of checking for it in pagemap_scan_output() for each page you > can skip it in do_pagemap_cmd() when doing the flush. No, this cannot be done, because in do_pagemap_cmd() we don't know whether we have space for new pages in the output buffer, since the next page may be aggregated into the already present data. > >>> BTW#2, I think the ENOSPC return in pagemap_scan_output() should >>> happen later - only if the pages would match and that caused the count >>> to exceed the limit.
For THP n_pages should be truncated to the limit >>> (and ENOSPC returned right away) only after the pages were verified to >>> match. >> We have 2 counters here: >> * the p->max_pages optionally can be set to find out only N pages of >> interest. So p->found_pages is counting this. We need to return early if >> the page limit is complete. >> * the p->vec_index keeps track of output buffer array size > > I think I get how the limits are supposed to work, but I also think > the implementation is not optimal. An example (assuming max_pages = 1 > and vec_len = 1): > - a matching page has been found > - a second - non-matching - is tried but results in immediate -ENOSPC. > -> In this case I'd expect the early return to happen just after the > first page is found so that non > A similar problem occurs for hugepage: when the limit is hit (we found >> = max_pages, n_pages is possibly truncated), but the scan continues > until next page / PMD. I'll try to check if I can optimize it. It seems like I should be able to update this pretty easily by returning a negative status/error which signifies that we have found the max_pages. Then just abort in a sane way. > > [...] >>>>>> + if (!arg->required_mask && !arg->anyof_mask && >>>>>> + !arg->excluded_mask) >>>>>> + return false; >>>>> >>>>> Is there an assumption in the code that those checks are needed? I'd >>>>> expect that no selection criteria makes a valid page set? >>>> In my view, selection criterion must be specified for the ioctl to work. If >>>> there is no criterio, user should go and read pagemap file directly. So the >>>> assumption is that at least one selection criterion must be specified. >>> >>> Yes. I'm not sure we need to prevent multiple ways of doing the same >>> thing. But doesn't pagemap reading lack the range aggregation feature? >> Yeah, correct. But note that we are supporting only selective 4 flags in >> this ioctl, not all pagemap flags.
So it is useful for only those users who >> depend only on these 4 flags. Out pagemap_ioctl interface is not so much >> generic that we can cater anyone. Its interface is specific and we are >> adding only those cases which are of our interest. So if someone wants >> range aggregation from pagemap_ioctl, he'll need to add that flag in the >> IOCTL first. When IOCTL support is added, he can specify the selection >> criterion etc. > > The available flag set is not a problem. An example usecase: dumping > the memory state for debugging: ioctl(return_mask=ALL) returns a > conveniently compact vector of ranges of pages that are actually used > by the process (not only having reserved the virtual space). This is > actually something that helps dumping processes with using tools like > AddressSanitizer that create huge sparse mappings. I don't know; we are adding more and more use cases as people notice this. I hadn't thought about this use case, so I need to understand it better: how should I identify "which pages are used"? Does "used" mean both present and swapped? If we want to find present or swapped pages (in other words, !pte_none pages) and return them in compact form, it can already be done by ioctl(anyof_mask=PRESENT | SWAPPED, return_mask=ALL). > > Best Regards > Michał Mirosław
On 3/17/23 12:20 AM, Peter Xu wrote: > Hello, Muhammad, > > On Thu, Mar 09, 2023 at 06:57:12PM +0500, Muhammad Usama Anjum wrote: >> Add new WP Async mode (UFFD_FEATURE_WP_ASYNC) which resolves the page >> faults on its own. It can be used to track that which pages have been >> written-to from the time the pages were write-protected. It is very >> efficient way to track the changes as uffd is by nature pte/pmd based. >> >> UFFD synchronous WP sends the page faults to the userspace where the >> pages which have been written-to can be tracked. But it is not efficient. >> This is why this asynchronous version is being added. After setting the >> WP Async, the pages which have been written to can be found in the pagemap >> file or information can be obtained from the PAGEMAP_IOCTL. >> >> Suggested-by: Peter Xu <peterx@redhat.com> >> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> > > Here's the patch that can enable WP_ASYNC for all kinds of memories (as I > promised..). Currently I only tested btrfs (besides the common three) > which is the major fs I use locally, but I guess it'll also enable the rest > no matter what's underneath, just like soft-dirty. > > As I mentioned, I just feel it very unfortunate to have a lot of suffixes > for the UFFD_FEATURE_* on types of memory, and I hope we get rid of it for > this WP_ASYNC from the start because the workflow should really be similar > to anon/shmem handling for most of the rest, just a few tweaks here and > there. > > I had a feeling that some type of special VMA will work weirdly, but let's > see.. so far I don't come up with any. > > If the patch looks fine to you, please consider replace this patch with > patch 1 of mine where I attached. Then patch 1 can be reviewed alongside > with your series. > > Logically patch 1 can be reviewed separately too, because it works > perfectly afaiu without the atomic version of pagemap already. 
But on my > side I don't think it justifies anything really matters, so unless someone > thinks it a good idea to post / review / merge it separately, you can keep > that with your new pagemap ioctl. > > Patch 2 is only for your reference. It's not for merging quality so please > don't put it into your series. I do plan to cleanup the userfaultfd > selftests in the near future first (when I wrote this I am more eager to do > so..). I also think your final pagemap test cases can cover quite a bit. > > Thanks,

Thank you so much for the patch. I've tested hugetlb mem. This patch is working fine for hugetlb shmem:

*shmid = shmget(2, size, SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);
mem = shmat(*shmid, 0, 0);

I've found a slight issue with hugetlb memory which has been mmaped:

mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_HUGETLB | MAP_PRIVATE, -1, 0);

The issue is that even after writing to this memory, the wp flag is still present and the memory doesn't appear to be dirty when it should have been. The temporary fix is to write to the memory and write-protect it one extra time. Here is how I'm checking if the WP flag is set or not:

static inline bool is_huge_pte_uffd_wp(pte_t pte)
{
	return ((pte_present(pte) && huge_pte_uffd_wp(pte)) ||
		pte_swp_uffd_wp_any(pte));
}

I've isolated the reproducer inside kselftests by commenting out the unrelated code. Please have a look at the attached kselftest and follow from main or search `//#define tmpfix` in the code.