[v1,00/11] mm: COW fixes part 1: fix the COW security issue for THP and hugetlb

Message ID 20211217113049.23850-1-david@redhat.com

Message

David Hildenbrand Dec. 17, 2021, 11:30 a.m. UTC
Hi everybody,

as discussed in the linux-mm alignment session on Wednesday, this is part 1
of the COW fixes: fix the COW security issue using GUP-triggered
unsharing of shared anonymous pages (ordinary, THP, hugetlb). In the
meeting slides, this approach was referred to as "Copy On Read". If anybody
wants to have access to the slides, please feel free to reach out.


The patches are based on v5.16-rc5 and available at:
  https://github.com/davidhildenbrand/linux/pull/new/unshare_v1


It is currently again possible for a child process to observe modifications
of anonymous pages performed by the parent process after fork() in some
cases, which is not only a violation of the POSIX semantics of MAP_PRIVATE,
but more importantly a real security issue.

This issue, including other related COW issues, has been summarized at [1]:
"
  1. Observing Memory Modifications of Private Pages From A Child Process
  
  Long story short: process-private memory might not be as private as you
  think once you fork(): successive modifications of private memory
  regions in the parent process can still be observed by the child
  process, for example, by smart use of vmsplice()+munmap().
  
  The core problem is that pinning pages readable in a child process, such
  as done via the vmsplice system call, can result in a child process
  observing memory modifications done in the parent process the child is
  not supposed to observe. [1] contains an excellent summary and [2]
  contains further details. This issue was assigned CVE-2020-29374 [9].
  
  For this to trigger, it's required to use a fork() without subsequent
  exec(), for example, as used under Android zygote. Without further
  details about an application that forks less-privileged child processes,
  one cannot really say what's actually affected and what's not -- see the
  details section at the end of this mail for a short sshd/openssh analysis.
  
  While commit 17839856fd58 ("gup: document and work around "COW can break
  either way" issue") fixed this issue and resulted in other problems
  (e.g., ptrace on pmem), commit 09854ba94c6a ("mm: do_wp_page()
  simplification") re-introduced part of the problem unfortunately.
  
  The original reproducer can be modified quite easily to use THP [3] and
  make the issue appear again on upstream kernels. I modified it to use
  hugetlb [4] and it triggers as well. The problem is certainly less
  severe with hugetlb than with THP; it merely highlights that we still
  have plenty of open holes we should be closing/fixing.
  
  Regarding vmsplice(), the only known workaround is to disallow the
  vmsplice() system call ... or disable THP and hugetlb. But who knows
  what else can be abused to achieve the same (RDMA? O_DIRECT?) -- in
  the end, it's a more generic issue.
"

This security issue was first reported by Jann Horn on 27 May 2020 and it
currently affects anonymous THP and hugetlb again. The "security issue"
part for hugetlb might be less important than for THP. However, with this
approach it is easy to obtain consistent MAP_PRIVATE semantics for any
anonymous pages in that regard and to avoid any such information leaks
without much added complexity.

Ordinary anonymous pages are currently not affected, because the COW logic
was changed in commit 09854ba94c6a ("mm: do_wp_page() simplification")
for them to COW on "page_count() != 1" instead of "mapcount > 1", which
unfortunately results in other COW issues, some of them documented in [1]
as well.

To fix this COW issue once and for all, introduce GUP-triggered unsharing
that can be conditionally triggered via FAULT_FLAG_UNSHARE. In contrast to
traditional COW, unsharing will leave the copied page mapped
write-protected in the page table, not having the semantics of a write
fault.
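
Conceptually, where a traditional COW fault maps the copy writable, the
unshare fault path keeps it write-protected. A minimal sketch (not the
actual patch code; fault_is_unshare stands in for the FAULT_FLAG_UNSHARE
check):

	pte_t entry = mk_pte(new_page, vma->vm_page_prot);

	if (fault_is_unshare)
		/* Unshare: map the copy R/O, without write-fault semantics. */
		entry = pte_wrprotect(entry);
	else
		/* Traditional COW: map the copy writable and dirty. */
		entry = maybe_mkwrite(pte_mkdirty(entry), vma);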

Logically, unsharing is triggered "early": as soon as GUP performs the
action that could otherwise result in a COW getting missed later and the
security issue triggering. Unlike before, unsharing is not triggered via
a write fault with its undesired side effects.

Long story short, GUP triggers unsharing if all of the following conditions
are met (see the sketch after the list):
* The page is mapped R/O
* We have an anonymous page, excluding KSM
* We want to read (!FOLL_WRITE)
* Unsharing is not disabled (!FOLL_NOUNSHARE)
* We want to take a reference (FOLL_GET or FOLL_PIN)
* The page is a shared anonymous page: mapcount > 1
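
As a rough sketch -- gup_must_unshare() is a made-up helper name here,
and FOLL_NOUNSHARE is introduced by this series -- the check boils down
to:

	static bool gup_must_unshare(struct page *page, unsigned int flags,
				     bool mapped_writable)
	{
		if (mapped_writable)			/* must be mapped R/O */
			return false;
		if (!PageAnon(page) || PageKsm(page))	/* anon only, no KSM */
			return false;
		if (flags & FOLL_WRITE)			/* only for reads */
			return false;
		if (flags & FOLL_NOUNSHARE)		/* unsharing disabled */
			return false;
		if (!(flags & (FOLL_GET | FOLL_PIN)))	/* taking a reference */
			return false;
		return page_mapcount(page) > 1;		/* shared anon page */
	}

For THP, the final mapcount check is where the atomic mapcount read
described below comes in.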

To reliably detect shared anonymous THP without heavy locking, introduce
a mapcount_seqcount seqlock that protects the mapcount of a THP and can
be used to read an atomic mapcount value. The mapcount_seqcount is stored
inside the memmap of the compound page -- to keep it simple, factor out
a raw_seqcount_t from the seqcount_t.
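
The reader side then follows the usual seqcount pattern; roughly (a
sketch only -- the mapcount_seqcount() accessor is made up for
illustration):

	unsigned int seq;
	int mapcount;

	do {
		seq = raw_read_seqcount_begin(mapcount_seqcount(page));
		mapcount = total_mapcount(page);
	} while (raw_read_seqcount_retry(mapcount_seqcount(page), seq));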

As this patch series introduces the same unsharing logic for any
anonymous pages, it also paves the way to fix other COW issues, e.g.,
documented in [1], without reintroducing the security issue or
reintroducing other issues we observed in the past (e.g., broken ptrace on
pmem).

All reproducers for this COW issue have been consolidated in the selftest
included in this series. Hopefully we'll get this fixed for good.

Future work:

* get_user_pages_fast_only() can currently spin on the mapcount_seqcount
  when reading the mapcount, which might be a rare event. While this is
  fine even when done from get_user_pages_fast_only() in IRQ context, we
  might want to just fail fast in get_user_pages_fast_only(). We already
  have patches prepared that add page_anon_maybe_shared() and
  page_trans_huge_anon_maybe_shared() that will return "true" in case
  spinning would be required and make get_user_pages_fast_only() fail fast.
  I'm excluding them for simplicity.

  ... even better would be finding a way to just not need the
  mapcount_seqcount, but THP splitting and PageDoubleMap() give us a
  hard time -- maybe we'll eventually find a way someday :)

* Part 2 will tackle the other user-space visible breakages / COW issues
  raised in [1]. This series is the basis for adjusting the COW logic once
  again without re-introducing the COW issue fixed in this series and
  without reintroducing the issues we saw with the original CVE fix
  (e.g., breaking ptrace on pmem). There might be further parts to improve
  the GUP long-term <-> MM synchronicity and to optimize some things
  around that.


The idea is Andrea's, and some patches are rewritten versions of his
prototype patches. I cross-compiled and tested as thoroughly as possible.

I'll CC locking+selftest folks only on the relevant patch and the cover
letter to minimize the noise. I'll put everyone on CC who was either
involved with the COW issues in the past or attended the linux-mm alignment
session on Wednesday. Apologies if I forget anyone :)

[1] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com


David Hildenbrand (11):
  seqlock: provide lockdep-free raw_seqcount_t variant
  mm: thp: consolidate mapcount logic on THP split
  mm: simplify hugetlb and file-THP handling in __page_mapcount()
  mm: thp: simplify total_mapcount()
  mm: thp: allow for reading the THP mapcount atomically via a
    raw_seqcount_t
  mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)
  mm: gup: trigger unsharing via FAULT_FLAG_UNSHARE when required
    (!hugetlb)
  mm: hugetlb: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE
  mm: gup: trigger unsharing via FAULT_FLAG_UNSHARE when required
    (hugetlb)
  mm: thp: introduce and use page_trans_huge_anon_shared()
  selftests/vm: add tests for the known COW security issues

 Documentation/locking/seqlock.rst         |  50 ++++
 include/linux/huge_mm.h                   |  72 +++++
 include/linux/mm.h                        |  14 +
 include/linux/mm_types.h                  |   9 +
 include/linux/seqlock.h                   | 145 +++++++---
 mm/gup.c                                  |  89 +++++-
 mm/huge_memory.c                          | 120 +++++++--
 mm/hugetlb.c                              | 129 +++++++--
 mm/memory.c                               | 136 ++++++++--
 mm/rmap.c                                 |  40 +--
 mm/swapfile.c                             |  35 ++-
 mm/util.c                                 |  24 +-
 tools/testing/selftests/vm/Makefile       |   1 +
 tools/testing/selftests/vm/gup_cow.c      | 312 ++++++++++++++++++++++
 tools/testing/selftests/vm/run_vmtests.sh |  16 ++
 15 files changed, 1044 insertions(+), 148 deletions(-)
 create mode 100644 tools/testing/selftests/vm/gup_cow.c

Comments

Nadav Amit Dec. 17, 2021, 5:02 p.m. UTC | #1
> On Dec 17, 2021, at 3:30 AM, David Hildenbrand <david@redhat.com> wrote:
> 
> Sometimes it is required to have a seqcount implementation that uses
> a structure with a fixed and minimal size -- just a bare unsigned int --
> independent of the kernel configuration. This is especially valuable when
> the raw_ variants of the seqlock functions are used and the additional
> lockdep part of the seqcount_t structure remains essentially unused.
> 
> Let's provide a lockdep-free raw_seqcount_t variant that can be used via
> the raw functions to have a basic seqlock.
> 
> The target use case is embedding a raw_seqcount_t in the "struct page",
> where we really want a minimal size and cannot tolerate a sudden growth of
> the seqcount_t structure resulting in a significant "struct page"
> increase or even a layout change.
> 
> Provide raw_read_seqcount_retry(), to make it easy to match to
> raw_read_seqcount_begin() in the code.
> 
> Let's add short documentation as well.
> 
> Note: There might be other possible users for raw_seqcount_t where the
>      lockdep part might be completely unused and just wastes memory --
>      essentially any users that only use the raw_ function variants.
> 

Is it possible to force some policy when raw_seqcount_t is used to
prevent its abuse? For instance not to allow to acquire other (certain?)
locks when it is held?

[ snip ]

> +/**
> + * raw_seqcount_init() - runtime initializer for raw_seqcount_t
> + * @s: Pointer to the raw_seqcount_t instance
> + */
> +# define raw_seqcount_init(s) __raw_seqcount_init(s)
> +
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
> 
> # define SEQCOUNT_DEP_MAP_INIT(lockname)				\
> @@ -111,11 +129,16 @@ static inline void seqcount_lockdep_reader_access(const seqcount_t *s)
> # define seqcount_lockdep_reader_access(x)
> #endif
> 
> +/**
> + * RAW_SEQCNT_ZERO() - static initializer for raw_seqcount_t
> + */
> +#define RAW_SEQCNT_ZERO() 0

I am not sure why RAW_SEQCNT_ZERO() should be a function-like macro.

Moreover, the documentation showed:

+	/* static */
+	static raw_seqcount_t foo_seqcount = RAW_SEQCNT_ZERO(foo_seqcount);
+

But RAW_SEQCNT_ZERO does not have an argument?
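
(Presumably a plain

	#define RAW_SEQCNT_ZERO 0

	static raw_seqcount_t foo_seqcount = RAW_SEQCNT_ZERO;

would work for both.)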
Nadav Amit Dec. 17, 2021, 5:16 p.m. UTC | #2
> On Dec 17, 2021, at 3:30 AM, David Hildenbrand <david@redhat.com> wrote:
> 
> Let's return early for hugetlb, which really only relies on the compound
> mapcount so far and does not support PageDoubleMap() yet. Use the chance
> to cleanup the file-THP case to make it easier to grasp. While at it, use
> head_compound_mapcount().
> 
> This is a preparation for further changes.

It would be useful to add “no functional change intended” or something.

> 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> mm/util.c | 11 +++++++----
> 1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/util.c b/mm/util.c
> index 741ba32a43ac..3239e75c148d 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -732,15 +732,18 @@ int __page_mapcount(struct page *page)
> {
> 	int ret;
> 
> -	ret = atomic_read(&page->_mapcount) + 1;
> +	if (PageHuge(page))
> +		return compound_mapcount(page);

Before you return, perhaps you can add an assertion like:

	VM_BUG_ON(PageDoubleMap(page));

This would make the code clearer and would ease debugging in the
future (if support for double-map is expanded).
David Hildenbrand Dec. 17, 2021, 5:29 p.m. UTC | #3
On 17.12.21 18:02, Nadav Amit wrote:
> 
> 
>> On Dec 17, 2021, at 3:30 AM, David Hildenbrand <david@redhat.com> wrote:
>>
>> Sometimes it is required to have a seqcount implementation that uses
>> a structure with a fixed and minimal size -- just a bare unsigned int --
>> independent of the kernel configuration. This is especially valuable when
>> the raw_ variants of the seqlock functions are used and the additional
>> lockdep part of the seqcount_t structure remains essentially unused.
>>
>> Let's provide a lockdep-free raw_seqcount_t variant that can be used via
>> the raw functions to have a basic seqlock.
>>
>> The target use case is embedding a raw_seqcount_t in the "struct page",
>> where we really want a minimal size and cannot tolerate a sudden growth of
>> the seqcount_t structure resulting in a significant "struct page"
>> increase or even a layout change.
>>
>> Provide raw_read_seqcount_retry(), to make it easy to match to
>> raw_read_seqcount_begin() in the code.
>>
>> Let's add short documentation as well.
>>
>> Note: There might be other possible users for raw_seqcount_t where the
>>      lockdep part might be completely unused and just wastes memory --
>>      essentially any users that only use the raw_ function variants.
>>
> 
> Is it possible to force some policy when raw_seqcount_t is used to
> prevent its abuse? For instance not to allow to acquire other (certain?)
> locks when it is held?
> 

Good question ... in this series we won't be taking additional locks on
the reader or the writer side. Something like lockdep_forbid() /
lockdep_allow() to disallow any kind of locking. I haven't heard of
anything like that, maybe someone reading along has a clue?

The writer side might be easy to handle, but some seqcount operations
that don't do the full read()->retry() cycle are problematic
(->raw_read_seqcount).

> [ snip ]
> 
>> +/**
>> + * raw_seqcount_init() - runtime initializer for raw_seqcount_t
>> + * @s: Pointer to the raw_seqcount_t instance
>> + */
>> +# define raw_seqcount_init(s) __raw_seqcount_init(s)
>> +
>> #ifdef CONFIG_DEBUG_LOCK_ALLOC
>>
>> # define SEQCOUNT_DEP_MAP_INIT(lockname)				\
>> @@ -111,11 +129,16 @@ static inline void seqcount_lockdep_reader_access(const seqcount_t *s)
>> # define seqcount_lockdep_reader_access(x)
>> #endif
>>
>> +/**
>> + * RAW_SEQCNT_ZERO() - static initializer for raw_seqcount_t
>> + */
>> +#define RAW_SEQCNT_ZERO() 0
> 
> I am not sure why RAW_SEQCNT_ZERO() should be a function-like macro.
> 

I think I just went for consistency with SEQCNT_ZERO() -- but I agree,
that can just be simplified!

Thanks!
David Hildenbrand Dec. 17, 2021, 5:30 p.m. UTC | #4
On 17.12.21 18:16, Nadav Amit wrote:
> 
>> On Dec 17, 2021, at 3:30 AM, David Hildenbrand <david@redhat.com> wrote:
>>
>> Let's return early for hugetlb, which really only relies on the compound
>> mapcount so far and does not support PageDoubleMap() yet. Use the chance
>> to cleanup the file-THP case to make it easier to grasp. While at it, use
>> head_compound_mapcount().
>>
>> This is a preparation for further changes.
> 
> It would be useful to add “no functional change intended” or something.

Absolutely, same applies to other "simplification" patches.

> 
>>
>> Reviewed-by: Peter Xu <peterx@redhat.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>> mm/util.c | 11 +++++++----
>> 1 file changed, 7 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/util.c b/mm/util.c
>> index 741ba32a43ac..3239e75c148d 100644
>> --- a/mm/util.c
>> +++ b/mm/util.c
>> @@ -732,15 +732,18 @@ int __page_mapcount(struct page *page)
>> {
>> 	int ret;
>>
>> -	ret = atomic_read(&page->_mapcount) + 1;
>> +	if (PageHuge(page))
>> +		return compound_mapcount(page);
> 
> Before you return, perhaps you can add an assertion like:
> 
> 	VM_BUG_ON(PageDoubleMap(page));
> 
> This would make the code clearer and would ease debugging in the
> future (if support for double-map is expanded).
> 

I'd probably have to add this to a couple of places -- and I assume
anybody working on that has to grep the kernel for use of PageDoubleMap
already.

Thanks!
David Hildenbrand Dec. 17, 2021, 5:49 p.m. UTC | #5
On 17.12.21 18:29, David Hildenbrand wrote:
> On 17.12.21 18:02, Nadav Amit wrote:
>>
>>
>>> On Dec 17, 2021, at 3:30 AM, David Hildenbrand <david@redhat.com> wrote:
>>>
>>> Sometimes it is required to have a seqcount implementation that uses
>>> a structure with a fixed and minimal size -- just a bare unsigned int --
>>> independent of the kernel configuration. This is especially valuable when
>>> the raw_ variants of the seqlock functions are used and the additional
>>> lockdep part of the seqcount_t structure remains essentially unused.
>>>
>>> Let's provide a lockdep-free raw_seqcount_t variant that can be used via
>>> the raw functions to have a basic seqlock.
>>>
>>> The target use case is embedding a raw_seqcount_t in the "struct page",
>>> where we really want a minimal size and cannot tolerate a sudden growth of
>>> the seqcount_t structure resulting in a significant "struct page"
>>> increase or even a layout change.
>>>
>>> Provide raw_read_seqcount_retry(), to make it easy to match to
>>> raw_read_seqcount_begin() in the code.
>>>
>>> Let's add short documentation as well.
>>>
>>> Note: There might be other possible users for raw_seqcount_t where the
>>>      lockdep part might be completely unused and just wastes memory --
>>>      essentially any users that only use the raw_ function variants.
>>>
>>
>> Is it possible to force some policy when raw_seqcount_t is used to
>> prevent its abuse? For instance not to allow to acquire other (certain?)
>> locks when it is held?
>>
> 
> Good question ... in this series we won't be taking additional locks on
> the reader or the writer side. Something like lockdep_forbid() /
> lockdep_allow() to disallow any kind of locking comes to mind. I haven't
> heard of anything like that; maybe someone reading along has a clue?
> 
> The writer side might be easy to handle, but some seqcount operations
> that don't do the full read()->retry() cycle are problematic
> (->raw_read_seqcount).

Sorry, I forgot to mention an important point: the raw_seqcount_t
doesn't give you any additional "power" to abuse.

You can just use the ordinary seqcount_t with the raw_ functions. One
example is mm->write_protect_seq. So whatever we would want to "invent"
should also apply to the raw_ functions in general --  which might be
undesired or impossible (IIRC IRQ context).
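
For reference, the gup-fast path already does something like this today
with a plain seqcount_t (paraphrased from mm/gup.c):

	if (gup_flags & FOLL_PIN) {
		seq = raw_read_seqcount(&current->mm->write_protect_seq);
		if (seq & 1)
			return -EAGAIN;
	}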
Nadav Amit Dec. 17, 2021, 6:01 p.m. UTC | #6
> On Dec 17, 2021, at 9:49 AM, David Hildenbrand <david@redhat.com> wrote:
> 
> On 17.12.21 18:29, David Hildenbrand wrote:
>> On 17.12.21 18:02, Nadav Amit wrote:
>>> 
>>> 
>>>> On Dec 17, 2021, at 3:30 AM, David Hildenbrand <david@redhat.com> wrote:
>>>> 
>>>> Sometimes it is required to have a seqcount implementation that uses
>>>> a structure with a fixed and minimal size -- just a bare unsigned int --
>>>> independent of the kernel configuration. This is especially valuable when
>>>> the raw_ variants of the seqlock functions are used and the additional
>>>> lockdep part of the seqcount_t structure remains essentially unused.
>>>> 
>>>> Let's provide a lockdep-free raw_seqcount_t variant that can be used via
>>>> the raw functions to have a basic seqlock.
>>>> 
>>>> The target use case is embedding a raw_seqcount_t in the "struct page",
>>>> where we really want a minimal size and cannot tolerate a sudden growth of
>>>> the seqcount_t structure resulting in a significant "struct page"
>>>> increase or even a layout change.
>>>> 
>>>> Provide raw_read_seqcount_retry(), to make it easy to match to
>>>> raw_read_seqcount_begin() in the code.
>>>> 
>>>> Let's add short documentation as well.
>>>> 
>>>> Note: There might be other possible users for raw_seqcount_t where the
>>>>     lockdep part might be completely unused and just wastes memory --
>>>>     essentially any users that only use the raw_ function variants.
>>>> 
>>> 
>>> Is it possible to force some policy when raw_seqcount_t is used to
>>> prevent its abuse? For instance not to allow to acquire other (certain?)
>>> locks when it is held?
>>> 
>> 
>> Good question ... in this series we won't be taking additional locks on
>> the reader or the writer side. Something like lockdep_forbid() /
>> lockdep_allow() to disallow any kind of locking comes to mind. I haven't
>> heard of anything like that; maybe someone reading along has a clue?
>> 
>> The writer side might be easy to handle, but some seqcount operations
>> that don't do the full read()->retry() cycle are problematic
>> (->raw_read_seqcount).
> 
> Sorry, I forgot to mention an important point: the raw_seqcount_t
> doesn't give you any additional "power" to abuse.
> 
> You can just use the ordinary seqcount_t with the raw_ functions. One
> example is mm->write_protect_seq. So whatever we would want to "invent"
> should also apply to the raw_ functions in general --  which might be
> undesired or impossible (IIRC IRQ context).
> 

Thanks for the clarification. I was unfamiliar with
raw_read_seqcount_begin() (and friends). Indeed it is very very rarely
used.
Mike Kravetz Dec. 17, 2021, 6:06 p.m. UTC | #7
On 12/17/21 03:30, David Hildenbrand wrote:
> Let's return early for hugetlb, which really only relies on the compound
> mapcount so far and does not support PageDoubleMap() yet. Use the chance

It is too early to say if hugetlb double mapping will use PageDoubleMap().
I do not think (hope) it will be necessary.  So, I think you can drop mention
of it here.

> to cleanup the file-THP case to make it easier to grasp. While at it, use
> head_compound_mapcount().
> 
> This is a preparation for further changes.
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
David Hildenbrand Dec. 17, 2021, 6:11 p.m. UTC | #8
On 17.12.21 19:06, Mike Kravetz wrote:
> On 12/17/21 03:30, David Hildenbrand wrote:
>> Let's return early for hugetlb, which really only relies on the compound
>> mapcount so far and does not support PageDoubleMap() yet. Use the chance
> 
> It is too early to say if hugetlb double mapping will use PageDoubleMap().
> I do not think (hope) it will be necessary.  So, I think you can drop mention
> of it here.

The desire to PTE-map huge pages has most certainly been expressed by a
couple of parties :) Hopefully we'll find a way to avoid PageDoubleMap,
I agree.

Dropping the comment!

> 
>> to cleanup the file-THP case to make it easier to grasp. While at it, use
>> head_compound_mapcount().
>>
>> This is a preparation for further changes.
>>
>> Reviewed-by: Peter Xu <peterx@redhat.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
> 

Thanks!
Yang Shi Dec. 17, 2021, 7:07 p.m. UTC | #9
On Fri, Dec 17, 2021 at 3:33 AM David Hildenbrand <david@redhat.com> wrote:
>
> Let's return early for hugetlb, which really only relies on the compound
> mapcount so far and does not support PageDoubleMap() yet. Use the chance
> to cleanup the file-THP case to make it easier to grasp. While at it, use
> head_compound_mapcount().
>
> This is a preparation for further changes.
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Yang Shi <shy828301@gmail.com>

> ---
>  mm/util.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/mm/util.c b/mm/util.c
> index 741ba32a43ac..3239e75c148d 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -732,15 +732,18 @@ int __page_mapcount(struct page *page)
>  {
>         int ret;
>
> -       ret = atomic_read(&page->_mapcount) + 1;
> +       if (PageHuge(page))
> +               return compound_mapcount(page);
>         /*
>          * For file THP page->_mapcount contains total number of mapping
>          * of the page: no need to look into compound_mapcount.
>          */
> -       if (!PageAnon(page) && !PageHuge(page))
> -               return ret;
> +       if (!PageAnon(page))
> +               return atomic_read(&page->_mapcount) + 1;
> +
> +       ret = atomic_read(&page->_mapcount) + 1;
>         page = compound_head(page);
> -       ret += atomic_read(compound_mapcount_ptr(page)) + 1;
> +       ret += head_compound_mapcount(page);
>         if (PageDoubleMap(page))
>                 ret--;
>         return ret;
> --
> 2.31.1
>
Thomas Gleixner Dec. 17, 2021, 9:28 p.m. UTC | #10
On Fri, Dec 17 2021 at 12:30, David Hildenbrand wrote:
> Sometimes it is required to have a seqcount implementation that uses
> a structure with a fixed and minimal size -- just a bare unsigned int --
> independent of the kernel configuration. This is especially valuable when
> the raw_ variants of the seqlock functions are used and the additional
> lockdep part of the seqcount_t structure remains essentially unused.
>
> Let's provide a lockdep-free raw_seqcount_t variant that can be used via
> the raw functions to have a basic seqlock.
>
> The target use case is embedding a raw_seqcount_t in the "struct page",
> where we really want a minimal size and cannot tolerate a sudden growth of
> the seqcount_t structure resulting in a significant "struct page"
> increase or even a layout change.

Cannot tolerate? Could you please provide a reason and not just a
statement?

> Provide raw_read_seqcount_retry(), to make it easy to match to
> raw_read_seqcount_begin() in the code.
>
> Let's add short documentation as well.
>
> Note: There might be other possible users for raw_seqcount_t where the
>       lockdep part might be completely unused and just wastes memory --
>       essentially any users that only use the raw_ function variants.

Even when the reader side uses raw_seqcount_begin/retry() the writer
side still can use the non-raw variant which validates that the
associated lock is held on write.

Aside from that, your proposed extra raw sequence count needs extra care
vs. PREEMPT_RT and this wants to be very clearly documented. Why?

The lock association has two purposes:

    1) Lockdep coverage which unearthed bugs already

    2) PREEMPT_RT livelock prevention

       Assume the following:

       spin_lock(wrlock);
       write_seqcount_begin(seq);

       ---> preemption by a high priority reader

       seqcount_begin(seq); <-- live lock

       The RT substitution does:

       seqcount_begin(seq)
           cnt = READ_ONCE(seq->sequence);

           if (cnt & 1) {
           	lock(s->lock);
                unlock(s->lock);
           }

       which prevents the deadlock because it makes the reader block on
       the associated lock, which allows the preempted writer to make
       progress.

       This applies to raw_seqcount_begin() as well.

I have no objections against the construct itself, but this has to be
properly documented vs. the restriction this imposes.

As you can see above, the writer side therefore has to ensure that it
cannot be preempted on PREEMPT_RT, which limits what you can do inside a
preemption (or interrupt) disabled section on RT enabled kernels. See
Documentation/locking/locktypes.rst for further information.

Thanks,

        tglx
David Hildenbrand Dec. 17, 2021, 10:02 p.m. UTC | #11
On 17.12.21 22:28, Thomas Gleixner wrote:
> On Fri, Dec 17 2021 at 12:30, David Hildenbrand wrote:
>> Sometimes it is required to have a seqcount implementation that uses
>> a structure with a fixed and minimal size -- just a bare unsigned int --
>> independent of the kernel configuration. This is especially valuable when
>> the raw_ variants of the seqlock functions are used and the additional
>> lockdep part of the seqcount_t structure remains essentially unused.
>>
>> Let's provide a lockdep-free raw_seqcount_t variant that can be used via
>> the raw functions to have a basic seqlock.
>>
>> The target use case is embedding a raw_seqcount_t in the "struct page",
>> where we really want a minimal size and cannot tolerate a sudden growth of
>> the seqcount_t structure resulting in a significant "struct page"
>> increase or even a layout change.
> 

Hi Thomas,

thanks for your feedback!

> Cannot tolerate? Could you please provide a reason and not just a
> statement?

Absolutely.

"struct page" is supposed to have a minimal size with a fixed layout.
Embedding something inside such a structure can change the fixed layout
in a way that completely breaks any assumptions about the location of
values.

Therefore, embedding a complex structure in it is usually avoided -- and
if we have to (spin_lock), we work around sudden size increases.

There are ways around it: allocate the lock and only store the pointer
in the struct page. But that most certainly adds complexity, which is
why I want to avoid it for now.


I'll extend that answer and add it to the patch description.

> 
>> Provide raw_read_seqcount_retry(), to make it easy to match to
>> raw_read_seqcount_begin() in the code.
>>
>> Let's add short documentation as well.
>>
>> Note: There might be other possible users for raw_seqcount_t where the
>>       lockdep part might be completely unused and just wastes memory --
>>       essentially any users that only use the raw_ function variants.
> 
> Even when the reader side uses raw_seqcount_begin/retry() the writer
> side still can use the non-raw variant which validates that the
> associated lock is held on write.

Yes, that's my understanding as well.

> 
> Aside from that, your proposed extra raw sequence count needs extra care
> vs. PREEMPT_RT and this wants to be very clearly documented. Why?
> 
> The lock association has two purposes:
> 
>     1) Lockdep coverage which unearthed bugs already

Yes, that's a real shame to lose.

> 
>     2) PREEMPT_RT livelock prevention
> 
>        Assume the following:
> 
>        spin_lock(wrlock);
>        write_seqcount_begin(seq);
> 
>        ---> preemption by a high priority reader
> 
>        seqcount_begin(seq); <-- live lock
> 
>        The RT substitution does:
> 
>        seqcount_begin(seq)
>            cnt = READ_ONCE(seq->sequence);
> 
>            if (cnt & 1) {
>            	lock(s->lock);
>                 unlock(s->lock);
>            }
> 
>        which prevents the deadlock because it makes the reader block on
>        the associated lock, which allows the preempted writer to make
>        progress.
> 
>        This applies to raw_seqcount_begin() as well.
> 
> I have no objections against the construct itself, but this has to be
> properly documented vs. the restriction this imposes.

Absolutely, any input is highly appreciated.

But to mention it again: whatever you can do with raw_seqcount_t, you
can do with seqcount_t, and there are already users relying completely
on the raw_ function variants (see my other reply).

So the documentation should most probably be extended to cover the raw_
functions and seqcount_t in general.

> 
> As you can see above, the writer side therefore has to ensure that it
> cannot be preempted on PREEMPT_RT, which limits what you can do inside a
> preemption (or interrupt) disabled section on RT enabled kernels. See
> Documentation/locking/locktypes.rst for further information.

It's going to be used for THP, which is currently incompatible with
PREEMPT_RT (disabled in the Kconfig). But preemption is also disabled
because we're using bit_spin_lock(), which disables preemption internally.

Certainly worth documenting!

Thanks for your input!
Kirill A. Shutemov Dec. 18, 2021, 2:31 p.m. UTC | #12
On Fri, Dec 17, 2021 at 12:30:41PM +0100, David Hildenbrand wrote:
> Let's return early for hugetlb, which really only relies on the compound
> mapcount so far and does not support PageDoubleMap() yet. Use the chance
> to cleanup the file-THP case to make it easier to grasp. While at it, use
> head_compound_mapcount().
> 
> This is a preparation for further changes.
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/util.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/util.c b/mm/util.c
> index 741ba32a43ac..3239e75c148d 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -732,15 +732,18 @@ int __page_mapcount(struct page *page)
>  {
>  	int ret;
>  
> -	ret = atomic_read(&page->_mapcount) + 1;
> +	if (PageHuge(page))
> +		return compound_mapcount(page);

It would be nice to make PageHuge() inlinable first. It's a shame that we
have to do a function call for the PageHuge() check.

Otherwise, looks good:

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>