mbox series

[v4,0/7] use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY

Message ID 20250604231151.799834-1-surenb@google.com
Headers show
Series use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY | expand

Message

Suren Baghdasaryan June 4, 2025, 11:11 p.m. UTC
Reading /proc/pid/maps requires read-locking mmap_lock which prevents any
other task from concurrently modifying the address space. This guarantees
coherent reporting of virtual address ranges, however it can block
important updates from happening. Oftentimes /proc/pid/maps readers are
low priority monitoring tasks and them blocking high priority tasks
results in priority inversion.

Locking the entire address space is required to present fully coherent
picture of the address space, however even current implementation does not
strictly guarantee that by outputting vmas in page-size chunks and
dropping mmap_lock in between each chunk. Address space modifications are
possible while mmap_lock is dropped and userspace reading the content is
expected to deal with possible concurrent address space modifications.
Considering these relaxed rules, holding mmap_lock is not strictly needed
as long as we can guarantee that a concurrently modified vma is reported
either in its original form or after it was modified.

This patchset switches from holding mmap_lock while reading /proc/pid/maps
to taking per-vma locks as we walk the vma tree. This reduces the
contention with tasks modifying the address space because they would have
to contend for the same vma as opposed to the entire address space. Same
is done for PROCMAP_QUERY ioctl which locks only the vma that fell into
the requested range instead of the entire address space. Previous version
of this patchset [1] tried to perform /proc/pid/maps reading under RCU,
however its implementation is quite complex and the results are worse than
the new version because it still relied on mmap_lock speculation which
retries if any part of the address space gets modified. New implementaion
is both simpler and results in less contention. Note that similar approach
would not work for /proc/pid/smaps reading as it also walks the page table
and that's not RCU-safe.

Paul McKenney's designed a test [2] to measure mmap/munmap latencies while
concurrently reading /proc/pid/maps. The test has a pair of processes
scanning /proc/PID/maps, and another process unmapping and remapping 4K
pages from a 128MB range of anonymous memory.  At the end of each 10
second run, the latency of each mmap() or munmap() operation is measured,
and for each run the maximum and mean latency is printed. The map/unmap
process is started first, its PID is passed to the scanners, and then the
map/unmap process waits until both scanners are running before starting
its timed test.  The scanners keep scanning until the specified
/proc/PID/maps file disappears. This test registered close to 10x
improvement in update latencies:

Before the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
    0.011     0.008     0.455
    0.011     0.008     0.472
    0.011     0.008     0.535
    0.011     0.009     0.545
    ...
    0.011     0.014     2.875
    0.011     0.014     2.913
    0.011     0.014     3.007
    0.011     0.015     3.018

After the change:
./run-proc-vs-map.sh --nsamples 100 --rawdata -- --busyduration 2
    0.006     0.005     0.036
    0.006     0.005     0.039
    0.006     0.005     0.039
    0.006     0.005     0.039
    ...
    0.006     0.006     0.403
    0.006     0.006     0.474
    0.006     0.006     0.479
    0.006     0.006     0.498

The patchset also adds a number of tests to check for /proc/pid/maps data
coherency. They are designed to detect any unexpected data tearing while
performing some common address space modifications (vma split, resize and
remap). Even before these changes, reading /proc/pid/maps might have
inconsistent data because the file is read page-by-page with mmap_lock
being dropped between the pages. An example of user-visible inconsistency
can be that the same vma is printed twice: once before it was modified and
then after the modifications. For example if vma was extended, it might be
found and reported twice. What is not expected is to see a gap where there
should have been a vma both before and after modification. This patchset
increases the chances of such tearing, therefore it's event more important
now to test for unexpected inconsistencies.

[1] https://lore.kernel.org/all/20250418174959.1431962-1-surenb@google.com/
[2] https://github.com/paulmckrcu/proc-mmap_sem-test

Suren Baghdasaryan (7):
  selftests/proc: add /proc/pid/maps tearing from vma split test
  selftests/proc: extend /proc/pid/maps tearing test to include vma
    resizing
  selftests/proc: extend /proc/pid/maps tearing test to include vma
    remapping
  selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently
    modified
  selftests/proc: add verbose more for tests to facilitate debugging
  mm/maps: read proc/pid/maps under per-vma lock
  mm/maps: execute PROCMAP_QUERY ioctl under per-vma locks

 fs/proc/internal.h                         |   6 +
 fs/proc/task_mmu.c                         | 233 +++++-
 tools/testing/selftests/proc/proc-pid-vm.c | 793 ++++++++++++++++++++-
 3 files changed, 1011 insertions(+), 21 deletions(-)


base-commit: 2d0c297637e7d59771c1533847c666cdddc19884

Comments

Lorenzo Stoakes June 13, 2025, 3:01 p.m. UTC | #1
Hi Suren,

I promised I'd share VMA merging scenarios so we can be absolutely sure we have
all cases covered, I share that below. I also included information on split.

Hopefully this is useful! And maybe we can somehow put in a comment or commit
msg or something somewhere? Not sure if a bit much for that though :)

Note that in all of the below we hold exclusive mmap, vma + rmap write locks.

## Merge with change to EXISTING VMA

### Merge both

                      start    end
                         |<---->|
                 |-------********-------|
                   prev   middle   next
                  extend  delete  delete

1. Set prev VMA range [prev->vm_start, next->vmend)
2. Overwrite prev, middle, next nodes in maple tree with prev
3. Detach middle VMA
4. Free middle VMA
5. Detach next VMA
6. Free next VMA

### Merge left full

                       start        end
                         |<--------->|
                 |-------*************
                   prev     middle
                  extend    delete

1. Set prev VMA range [prev->vm_start, end)
2. Overwrite prev, middle nodes in maple tree with prev
3. Detach middle VMA
4. Free middle VMA

### Merge left partial

                       start   end
		         |<---->|
		 |-------*************
		   prev     middle
		  extend  partial overwrite

1. Set prev VMA range [prev->vm_start, end)
2. Set middle range [end, middle->vm_end)
3. Overwrite prev, middle (partial) nodes in maple tree with prev

### Merge right full

               start        end
		 |<--------->|
		 *************-------|
		    middle     next
		    delete    extend

1. Set next range [start, next->vm_end)
2. Overwrite middle, next nodes in maple tree with next
3. Detach middle VMA
4. Free middle VMA

### Merge right partial

                   start    end
		     |<----->|
		 *************-------|
		    middle     next
		    shrink    extend

1. Set middle range [middle->vm_start, start)
2. Set next range [start, next->vm_end)
3. Overwrite middle (partial), next nodes in maple tree with next

## Merge due to introduction of proposed NEW VMA

These cases are easier as there's no existing VMA to either remove or partially
adjust.

### Merge both

                       start     end
		         |<------>|
		 |-------..........-------|
		   prev  (proposed)  next
		  extend            delete

1. Set prev VMA range [prev->vm_start, next->vm_end)
2. Overwrite prev, next nodes in maple tree with prev
3. Detach next VMA
4. Delete next VMA

### Merge left

                       start     end
		         |<------>|
		 |-------..........
		   prev  (proposed)
		  extend

1. Set prev VMA range [prev->vm_start, end)
2. Overwrite prev node in maple tree with newly extended prev

(This is what's used for brk() and bprm_mm_init() stack relocation in
relocate_vma_down() too)

### Merge right

                       start     end
		         |<------>|
		         ..........-------|
		         (proposed)  next
		                    extend

1. Set next VMA range [start, next->vm_end)
2. Overwrite next node in maple tree with newly extended next

## Split VMA

If new below:

                    addr
                |-----.-----|
                | new .     |
                |-----.-----|
                     vma
Otherwise:

                    addr
                |-----.-----|
                |     . new |
                |-----.-----|
		     vma

1. Duplicate vma
2. If new below, set new range to [vma-vm_start, addr)
3. Otherwise, set new range to [addr, vma->vm_end)
4. If new below, Set vma range to [addr, vma->vm_end)
5. Otherwise, set vma range to [vma->vm_start, addr)
6. Partially overwrite vma node in maple tree with new

Cheers, Lorenzo
Suren Baghdasaryan June 13, 2025, 7:11 p.m. UTC | #2
On Fri, Jun 13, 2025 at 8:01 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> Hi Suren,
>
> I promised I'd share VMA merging scenarios so we can be absolutely sure we have
> all cases covered, I share that below. I also included information on split.

Thanks Lorenzo! This is great and very helpful.

>
> Hopefully this is useful! And maybe we can somehow put in a comment or commit
> msg or something somewhere? Not sure if a bit much for that though :)

I'll see if I can add a short version into my next cover letter.

>
> Note that in all of the below we hold exclusive mmap, vma + rmap write locks.
>
> ## Merge with change to EXISTING VMA
>
> ### Merge both
>
>                       start    end
>                          |<---->|
>                  |-------********-------|
>                    prev   middle   next
>                   extend  delete  delete
>
> 1. Set prev VMA range [prev->vm_start, next->vmend)
> 2. Overwrite prev, middle, next nodes in maple tree with prev
> 3. Detach middle VMA
> 4. Free middle VMA
> 5. Detach next VMA
> 6. Free next VMA

This case should be fine with per-vma locks while reading
/proc/pid/maps. In the worst case we will report some of the original
vmas before the merge and then the final merged vma, so prev might be
seen twice but no gaps should be observed.

>
> ### Merge left full
>
>                        start        end
>                          |<--------->|
>                  |-------*************
>                    prev     middle
>                   extend    delete
>
> 1. Set prev VMA range [prev->vm_start, end)
> 2. Overwrite prev, middle nodes in maple tree with prev
> 3. Detach middle VMA
> 4. Free middle VMA

Same as the previous case. Worst case we report prev twice - once
before the merge, once after the merge.

>
> ### Merge left partial
>
>                        start   end
>                          |<---->|
>                  |-------*************
>                    prev     middle
>                   extend  partial overwrite
>
> 1. Set prev VMA range [prev->vm_start, end)
> 2. Set middle range [end, middle->vm_end)
> 3. Overwrite prev, middle (partial) nodes in maple tree with prev

We might report prev twice here and this might cause us to retry if we
see a temporary gap between old prev and new middle vma. But retry
should handle this case, so I think we are good here.

>
> ### Merge right full
>
>                start        end
>                  |<--------->|
>                  *************-------|
>                     middle     next
>                     delete    extend
>
> 1. Set next range [start, next->vm_end)
> 2. Overwrite middle, next nodes in maple tree with next
> 3. Detach middle VMA
> 4. Free middle VMA

Worst case we report middle twice.

>
> ### Merge right partial
>
>                    start    end
>                      |<----->|
>                  *************-------|
>                     middle     next
>                     shrink    extend
>
> 1. Set middle range [middle->vm_start, start)
> 2. Set next range [start, next->vm_end)
> 3. Overwrite middle (partial), next nodes in maple tree with next

Worse case we retry and report middle twice.

>
> ## Merge due to introduction of proposed NEW VMA
>
> These cases are easier as there's no existing VMA to either remove or partially
> adjust.
>
> ### Merge both
>
>                        start     end
>                          |<------>|
>                  |-------..........-------|
>                    prev  (proposed)  next
>                   extend            delete
>
> 1. Set prev VMA range [prev->vm_start, next->vm_end)
> 2. Overwrite prev, next nodes in maple tree with prev
> 3. Detach next VMA
> 4. Delete next VMA

Worst case we report prev twice after retry.

>
> ### Merge left
>
>                        start     end
>                          |<------>|
>                  |-------..........
>                    prev  (proposed)
>                   extend
>
> 1. Set prev VMA range [prev->vm_start, end)
> 2. Overwrite prev node in maple tree with newly extended prev

Worst case we report prev twice.

>
> (This is what's used for brk() and bprm_mm_init() stack relocation in
> relocate_vma_down() too)
>
> ### Merge right
>
>                        start     end
>                          |<------>|
>                          ..........-------|
>                          (proposed)  next
>                                     extend
>
> 1. Set next VMA range [start, next->vm_end)
> 2. Overwrite next node in maple tree with newly extended next

This will show either a legit gap + original next or the extended next
with no gap. Both ways we are fine.

>
> ## Split VMA
>
> If new below:
>
>                     addr
>                 |-----.-----|
>                 | new .     |
>                 |-----.-----|
>                      vma
> Otherwise:
>
>                     addr
>                 |-----.-----|
>                 |     . new |
>                 |-----.-----|
>                      vma
>
> 1. Duplicate vma
> 2. If new below, set new range to [vma-vm_start, addr)
> 3. Otherwise, set new range to [addr, vma->vm_end)
> 4. If new below, Set vma range to [addr, vma->vm_end)
> 5. Otherwise, set vma range to [vma->vm_start, addr)
> 6. Partially overwrite vma node in maple tree with new

These are fine too. We will either report before-split view or after-split view.
Thanks,
Suren.

>
> Cheers, Lorenzo
Lorenzo Stoakes June 13, 2025, 7:26 p.m. UTC | #3
On Fri, Jun 13, 2025 at 12:11:43PM -0700, Suren Baghdasaryan wrote:
> On Fri, Jun 13, 2025 at 8:01 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > Hi Suren,
> >
> > I promised I'd share VMA merging scenarios so we can be absolutely sure we have
> > all cases covered, I share that below. I also included information on split.
>
> Thanks Lorenzo! This is great and very helpful.

No problem! I do intend to look at the tests here too, but I just didn't
have time to get to that this week.

I don't think this should block the respin should it? Anyway hopefully I'll
be able to take a look next week.

>
> >
> > Hopefully this is useful! And maybe we can somehow put in a comment or commit
> > msg or something somewhere? Not sure if a bit much for that though :)
>
> I'll see if I can add a short version into my next cover letter.

Thanks!

Liam suggested somehow integrating this into our VMA userland testing or at
least as documentation there, I will put on my todo :)

Your replies below honestly make me feel more relaxed about this change
overall - it helps us really identify the known cases (Donald Rumsfeld of
course would tell us to fear the unknown unknowns but we do what we can) -
and if they are clearly thought out and confirmed to be safe then happy
days.

I wonder if we ought to have the tests explicitly try to trigger each case?
I'm not sure how practical/useful that would be however.

>
> >
> > Note that in all of the below we hold exclusive mmap, vma + rmap write locks.
> >
> > ## Merge with change to EXISTING VMA
> >
> > ### Merge both
> >
> >                       start    end
> >                          |<---->|
> >                  |-------********-------|
> >                    prev   middle   next
> >                   extend  delete  delete
> >
> > 1. Set prev VMA range [prev->vm_start, next->vmend)
> > 2. Overwrite prev, middle, next nodes in maple tree with prev
> > 3. Detach middle VMA
> > 4. Free middle VMA
> > 5. Detach next VMA
> > 6. Free next VMA
>
> This case should be fine with per-vma locks while reading
> /proc/pid/maps. In the worst case we will report some of the original
> vmas before the merge and then the final merged vma, so prev might be
> seen twice but no gaps should be observed.
>
> >
> > ### Merge left full
> >
> >                        start        end
> >                          |<--------->|
> >                  |-------*************
> >                    prev     middle
> >                   extend    delete
> >
> > 1. Set prev VMA range [prev->vm_start, end)
> > 2. Overwrite prev, middle nodes in maple tree with prev
> > 3. Detach middle VMA
> > 4. Free middle VMA
>
> Same as the previous case. Worst case we report prev twice - once
> before the merge, once after the merge.
>
> >
> > ### Merge left partial
> >
> >                        start   end
> >                          |<---->|
> >                  |-------*************
> >                    prev     middle
> >                   extend  partial overwrite
> >
> > 1. Set prev VMA range [prev->vm_start, end)
> > 2. Set middle range [end, middle->vm_end)
> > 3. Overwrite prev, middle (partial) nodes in maple tree with prev
>
> We might report prev twice here and this might cause us to retry if we
> see a temporary gap between old prev and new middle vma. But retry
> should handle this case, so I think we are good here.
>
> >
> > ### Merge right full
> >
> >                start        end
> >                  |<--------->|
> >                  *************-------|
> >                     middle     next
> >                     delete    extend
> >
> > 1. Set next range [start, next->vm_end)
> > 2. Overwrite middle, next nodes in maple tree with next
> > 3. Detach middle VMA
> > 4. Free middle VMA
>
> Worst case we report middle twice.
>
> >
> > ### Merge right partial
> >
> >                    start    end
> >                      |<----->|
> >                  *************-------|
> >                     middle     next
> >                     shrink    extend
> >
> > 1. Set middle range [middle->vm_start, start)
> > 2. Set next range [start, next->vm_end)
> > 3. Overwrite middle (partial), next nodes in maple tree with next
>
> Worse case we retry and report middle twice.
>
> >
> > ## Merge due to introduction of proposed NEW VMA
> >
> > These cases are easier as there's no existing VMA to either remove or partially
> > adjust.
> >
> > ### Merge both
> >
> >                        start     end
> >                          |<------>|
> >                  |-------..........-------|
> >                    prev  (proposed)  next
> >                   extend            delete
> >
> > 1. Set prev VMA range [prev->vm_start, next->vm_end)
> > 2. Overwrite prev, next nodes in maple tree with prev
> > 3. Detach next VMA
> > 4. Delete next VMA
>
> Worst case we report prev twice after retry.
>
> >
> > ### Merge left
> >
> >                        start     end
> >                          |<------>|
> >                  |-------..........
> >                    prev  (proposed)
> >                   extend
> >
> > 1. Set prev VMA range [prev->vm_start, end)
> > 2. Overwrite prev node in maple tree with newly extended prev
>
> Worst case we report prev twice.
>
> >
> > (This is what's used for brk() and bprm_mm_init() stack relocation in
> > relocate_vma_down() too)
> >
> > ### Merge right
> >
> >                        start     end
> >                          |<------>|
> >                          ..........-------|
> >                          (proposed)  next
> >                                     extend
> >
> > 1. Set next VMA range [start, next->vm_end)
> > 2. Overwrite next node in maple tree with newly extended next
>
> This will show either a legit gap + original next or the extended next
> with no gap. Both ways we are fine.
>
> >
> > ## Split VMA
> >
> > If new below:
> >
> >                     addr
> >                 |-----.-----|
> >                 | new .     |
> >                 |-----.-----|
> >                      vma
> > Otherwise:
> >
> >                     addr
> >                 |-----.-----|
> >                 |     . new |
> >                 |-----.-----|
> >                      vma
> >
> > 1. Duplicate vma
> > 2. If new below, set new range to [vma-vm_start, addr)
> > 3. Otherwise, set new range to [addr, vma->vm_end)
> > 4. If new below, Set vma range to [addr, vma->vm_end)
> > 5. Otherwise, set vma range to [vma->vm_start, addr)
> > 6. Partially overwrite vma node in maple tree with new
>
> These are fine too. We will either report before-split view or after-split view.
> Thanks,
> Suren.
>
> >
> > Cheers, Lorenzo