[v4,08/10] mm/gup: limit number of gup migration failures, honor failures

Message ID 20201217185243.3288048-9-pasha.tatashin@soleen.com
State New
Series [v4,01/10] mm/gup: don't pin migrated cma pages in movable zone

Commit Message

Pasha Tatashin Dec. 17, 2020, 6:52 p.m. UTC
check_and_migrate_movable_pages() does not honor isolation errors, and also
retries migration failures indefinitely.

Fix both of the above issues: add a new function, check_and_unpin_pages(),
that checks a range of pages and unpins them all if any page is unpinnable.

Move the retry loop from check_and_migrate_movable_pages() to
__gup_longterm_locked().

Rename check_and_migrate_movable_pages() to migrate_movable_pages() and
make this function accept already unpinned pages. Also, track errors
during isolation, so they can be retried with a different maximum limit;
isolation errors should be ephemeral.

Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
---
 mm/gup.c | 179 ++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 111 insertions(+), 68 deletions(-)

Comments

Michal Hocko Dec. 18, 2020, 10:46 a.m. UTC | #1
On Thu 17-12-20 13:52:41, Pavel Tatashin wrote:
[...]
> +#define PINNABLE_MIGRATE_MAX	10
> +#define PINNABLE_ISOLATE_MAX	100

Why would we need to limit the isolation retries? Those should always be
temporary failures unless I am missing something. I am not sure about the
PINNABLE_MIGRATE_MAX either. Why do we want to limit that? migrate_pages
already implements its retry logic, so why do you want to count retries on
top of that? I do agree that the existing logic is suboptimal because
the migration failure might be ephemeral or permanent, but that should be
IMHO addressed at migrate_pages (resp. unmap_and_move) and simply report
failures that are permanent - e.g. any potential pre-existing long term
pin - if that is possible at all. If not, what would cause permanent
migration failure? OOM?
-- 
Michal Hocko
SUSE Labs
Pasha Tatashin Dec. 18, 2020, 12:43 p.m. UTC | #2
On Fri, Dec 18, 2020 at 5:46 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 17-12-20 13:52:41, Pavel Tatashin wrote:
> [...]
> > +#define PINNABLE_MIGRATE_MAX 10
> > +#define PINNABLE_ISOLATE_MAX 100
>
> Why would we need to limit the isolation retries? Those should always be
> temporary failures unless I am missing something.

Actually, during development, I was retrying isolation errors
infinitely, but during testing found a hang where, when FOLL_TOUCH
without FOLL_WRITE is passed (fault in kernel without write flag), the
zero page is faulted. The isolation of the zero page was failing every
time, therefore the process was hanging.

Since then, I fixed this problem by adding FOLL_WRITE unconditionally
to FOLL_LONGTERM, but I was worried about other possible bugs that
would cause hangs, so decided to limit isolation errors. If you think
it is not necessary, I can unlimit isolation retries.
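
Roughly, that fix is (a sketch only; the actual change is the gup_flags
handling in __gup_longterm_locked() in the diff below):

	/*
	 * Force a writable fault for longterm pins so the fault handler
	 * never installs the shared zero page, which cannot be isolated
	 * or migrated out of the movable zone.
	 */
	if (gup_flags & FOLL_LONGTERM)
		gup_flags |= FOLL_WRITE;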

> I am not sure about the
> PINNABLE_MIGRATE_MAX either. Why do we want to limit that? migrate_pages
> already implements its retry logic, so why do you want to count retries on
> top of that? I do agree that the existing logic is suboptimal because

True, but again, just recently, I worked on a race bug where pages can
end up on a per-cpu list after lru_add_drain_all() but before isolation,
so I think retry is necessary.

> the migration failure might be ephemeral or permanent, but that should be
> IMHO addressed at migrate_pages (resp. unmap_and_move) and simply report
> failures that are permanent - e.g. any potential pre-existing long term
> pin - if that is possible at all. If not, what would cause permanent
> migration failure? OOM?

Yes, OOM is the main cause of migration failures. There are also a few
cases described in the movable zone comment, where it is possible during
boot for some pages to be allocated by memblock in the movable zone due to
lack of memory resources (even if those resources were added later);
hardware page poisoning is another rare example.

> --
> Michal Hocko
> SUSE Labs
David Hildenbrand Dec. 18, 2020, 1:04 p.m. UTC | #3
> On 18.12.2020 at 13:43, Pavel Tatashin <pasha.tatashin@soleen.com> wrote:
>
> On Fri, Dec 18, 2020 at 5:46 AM Michal Hocko <mhocko@suse.com> wrote:
>>
>> On Thu 17-12-20 13:52:41, Pavel Tatashin wrote:
>> [...]
>>> +#define PINNABLE_MIGRATE_MAX 10
>>> +#define PINNABLE_ISOLATE_MAX 100
>>
>> Why would we need to limit the isolation retries? Those should always be
>> temporary failures unless I am missing something.
>
> Actually, during development, I was retrying isolation errors
> infinitely, but during testing found a hang where, when FOLL_TOUCH
> without FOLL_WRITE is passed (fault in kernel without write flag), the
> zero page is faulted. The isolation of the zero page was failing every
> time, therefore the process was hanging.
>
> Since then, I fixed this problem by adding FOLL_WRITE unconditionally
> to FOLL_LONGTERM, but I was worried about other possible bugs that
> would cause hangs, so decided to limit isolation errors. If you think
> it is not necessary, I can unlimit isolation retries.
>
>> I am not sure about the
>> PINNABLE_MIGRATE_MAX either. Why do we want to limit that? migrate_pages
>> already implements its retry logic, so why do you want to count retries on
>> top of that? I do agree that the existing logic is suboptimal because
>
> True, but again, just recently, I worked on a race bug where pages can
> end up on a per-cpu list after lru_add_drain_all() but before isolation,
> so I think retry is necessary.
>
>> the migration failure might be ephemeral or permanent, but that should be
>> IMHO addressed at migrate_pages (resp. unmap_and_move) and simply report
>> failures that are permanent - e.g. any potential pre-existing long term
>> pin - if that is possible at all. If not, what would cause permanent
>> migration failure? OOM?
>
> Yes, OOM is the main cause of migration failures. There are also a few
> cases described in the movable zone comment, where it is possible during
> boot for some pages to be allocated by memblock in the movable zone due to
> lack of memory resources (even if those resources were added later);
> hardware page poisoning is another rare example.
>

How is concurrent migration handled? Like memory offlining, compaction,
alloc_contig_range() while trying to pin?

>> --
>> Michal Hocko
>> SUSE Labs
>
Michal Hocko Dec. 18, 2020, 1:14 p.m. UTC | #4
On Fri 18-12-20 07:43:15, Pavel Tatashin wrote:
> On Fri, Dec 18, 2020 at 5:46 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 17-12-20 13:52:41, Pavel Tatashin wrote:
> > [...]
> > > +#define PINNABLE_MIGRATE_MAX 10
> > > +#define PINNABLE_ISOLATE_MAX 100
> >
> > Why would we need to limit the isolation retries? Those should always be
> > temporary failures unless I am missing something.
>
> Actually, during development, I was retrying isolation errors
> infinitely, but during testing found a hang where, when FOLL_TOUCH
> without FOLL_WRITE is passed (fault in kernel without write flag), the
> zero page is faulted. The isolation of the zero page was failing every
> time, therefore the process was hanging.

Why would you migrate the zero page in the first place? Simply instantiate
it.

> Since then, I fixed this problem by adding FOLL_WRITE unconditionally
> to FOLL_LONGTERM, but I was worried about other possible bugs that
> would cause hangs, so decided to limit isolation errors. If you think
> it is not necessary, I can unlimit isolation retries.

It should have a really good reason to exist. Worries about some corner
cases are definitely not a reason to add an awkward retry mechanism.
My historical experience is that these things are extremely hard to get
rid of later.

> > I am not sure about the
> > PINNABLE_MIGRATE_MAX either. Why do we want to limit that? migrate_pages
> > already implements its retry logic, so why do you want to count retries on
> > top of that? I do agree that the existing logic is suboptimal because
>
> True, but again, just recently, I worked on a race bug where pages can
> end up on a per-cpu list after lru_add_drain_all() but before isolation,
> so I think retry is necessary.

There are ways to make sure pages are not ending up on the pcp list. Have a
look at how hotplug does that.
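
For reference, a rough sketch of the hotplug-style pattern (simplified;
the actual code lives in offline_pages()/do_migrate_range()):

	/*
	 * Drain the pcp lists once, then keep retrying isolation,
	 * treating failures as transient.
	 */
	lru_add_drain_all();
	do {
		ret = isolate_lru_page(page);
		if (ret)
			cond_resched();	/* page may be transiently busy */
	} while (ret);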

> > the migration failure might be ephemeral or permanent, but that should be
> > IMHO addressed at migrate_pages (resp. unmap_and_move) and simply report
> > failures that are permanent - e.g. any potential pre-existing long term
> > pin - if that is possible at all. If not, what would cause permanent
> > migration failure? OOM?
>
> Yes, OOM is the main cause of migration failures.

Then you can treat ENOMEM as a permanent failure.
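
E.g. (a sketch of the caller side; the migrate_pages() call matches
migrate_movable_pages() in the diff below, the error handling here is
only illustrative):

	ret = migrate_pages(&movable_page_list, alloc_migration_target,
			    NULL, (unsigned long)&mtc, MIGRATE_SYNC,
			    MR_LONGTERM_PIN);
	if (ret == -ENOMEM)
		return ret;	/* permanent failure: report to the caller */
	if (ret)		/* ret > 0: some pages were not migrated */
		goto retry;	/* treat as ephemeral and try again */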

> There are also a few
> cases described in the movable zone comment, where it is possible during
> boot for some pages to be allocated by memblock in the movable zone due to
> lack of memory resources (even if those resources were added later);

Do you have any examples? I find it hard to follow that somebody would
be pinning early boot allocations.

> hardware page poisoning is another rare example.

Could you elaborate please?
-- 
Michal Hocko
SUSE Labs
Jason Gunthorpe Dec. 18, 2020, 2:19 p.m. UTC | #5
On Thu, Dec 17, 2020 at 05:02:03PM -0500, Pavel Tatashin wrote:
> Hi Jason,
>
> Thank you for your comments. My replies below.
>
> On Thu, Dec 17, 2020 at 3:50 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Thu, Dec 17, 2020 at 01:52:41PM -0500, Pavel Tatashin wrote:
> > > +/*
> > > + * Verify that there are no unpinnable (movable) pages; if so, return true.
> > > + * Otherwise an unpinnable page was found: return false, and unpin all pages.
> > > + */
> > > +static bool check_and_unpin_pages(unsigned long nr_pages, struct page **pages,
> > > +                               unsigned int gup_flags)
> > > +{
> > > +     unsigned long i, step;
> > > +
> > > +     for (i = 0; i < nr_pages; i += step) {
> > > +             struct page *head = compound_head(pages[i]);
> > > +
> > > +             step = compound_nr(head) - (pages[i] - head);
> >
> > You can't assume that all of a compound head is in the pages array,
> > this assumption would only work inside the page walkers if the page
> > was found in a PMD or something.
>
> I am not sure I understand your comment. The compound head is not
> taken from the pages array, and not assumed to be in it. It is exactly
> the same logic as what we currently have:
> https://soleen.com/source/xref/linux/mm/gup.c?r=a00cda3f#1565

Oh, that existing logic is wrong too :( Another bug.

You can't skip pages in the pages[] array under the assumption they
are contiguous, i.e. the i += step is wrong.

> >
> > > +     if (gup_flags & FOLL_PIN) {
> > > +             unpin_user_pages(pages, nr_pages);
> >
> > So we throw everything away? Why? That isn't how the old algorithm worked
>
> It is exactly how the old algorithm worked: if there are pages to be
> migrated (not pinnable pages) we unpinned everything.
> See here:
> https://soleen.com/source/xref/linux/mm/gup.c?r=a00cda3f#1603

Hmm, OK, but I'm not sure that is great either.

> cleaner, and handle errors. We must unpin everything because if we
> fail, no pages should stay pinned, and also if we migrated some pages,
> the pages array must be updated, so we need to call
> __get_user_pages_locked() to pin and repopulate the pages array.

However, the page can't be unpinned until it is put on the LRU (and I'm
hoping that the LRU is enough of a 'lock' to make that safe, no idea).

> > I don't like this at all. It shouldn't be so flakey
> >
> > Can you do migration without the LRU?
>
> I do not think it is possible, we must isolate pages before migration.

I don't like this at all :( Lots of stuff relies on GUP, introducing
random flakiness like this is not good.

Jason
Pasha Tatashin Jan. 13, 2021, 7:43 p.m. UTC | #6
On Fri, Dec 18, 2020 at 9:19 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Dec 17, 2020 at 05:02:03PM -0500, Pavel Tatashin wrote:
> > Hi Jason,
> >
> > Thank you for your comments. My replies below.
> >
> > On Thu, Dec 17, 2020 at 3:50 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Thu, Dec 17, 2020 at 01:52:41PM -0500, Pavel Tatashin wrote:
> > > > +/*
> > > > + * Verify that there are no unpinnable (movable) pages; if so, return true.
> > > > + * Otherwise an unpinnable page was found: return false, and unpin all pages.
> > > > + */
> > > > +static bool check_and_unpin_pages(unsigned long nr_pages, struct page **pages,
> > > > +                               unsigned int gup_flags)
> > > > +{
> > > > +     unsigned long i, step;
> > > > +
> > > > +     for (i = 0; i < nr_pages; i += step) {
> > > > +             struct page *head = compound_head(pages[i]);
> > > > +
> > > > +             step = compound_nr(head) - (pages[i] - head);
> > >
> > > You can't assume that all of a compound head is in the pages array,
> > > this assumption would only work inside the page walkers if the page
> > > was found in a PMD or something.
> >
> > I am not sure I understand your comment. The compound head is not
> > taken from the pages array, and not assumed to be in it. It is exactly
> > the same logic as what we currently have:
> > https://soleen.com/source/xref/linux/mm/gup.c?r=a00cda3f#1565
>
> Oh, that existing logic is wrong too :( Another bug.

I do not think there is a bug.

> You can't skip pages in the pages[] array under the assumption they
> are contiguous, i.e. the i += step is wrong.

If pages[i] is part of a compound page, the other parts of this page
must be sequential in this array for this compound page (it might
start in the middle, though). If they are not sequential, then the
translation will be broken, as these pages also correspond to virtual
addresses from [start, start + nr_pages) in __gup_longterm_locked.

For example, when __gup_longterm_locked() returns, the following
must be true:
PHYSICAL                           VIRTUAL
page_to_phys(pages[0]) -> start + 0 * PAGE_SIZE
page_to_phys(pages[1]) -> start + 1 * PAGE_SIZE
page_to_phys(pages[2]) -> start + 2 * PAGE_SIZE
page_to_phys(pages[3]) -> start + 3 * PAGE_SIZE
...
page_to_phys(pages[nr_pages - 1]) -> start + (nr_pages - 1) * PAGE_SIZE

If any pages[i] is part of a compound page (i.e. huge page), we can't
have other pages in the middle of that page in the array.

>
> > >
> > > > +     if (gup_flags & FOLL_PIN) {
> > > > +             unpin_user_pages(pages, nr_pages);
> > >
> > > So we throw everything away? Why? That isn't how the old algorithm worked
> >
> > It is exactly how the old algorithm worked: if there are pages to be
> > migrated (not pinnable pages) we unpinned everything.
> > See here:
> > https://soleen.com/source/xref/linux/mm/gup.c?r=a00cda3f#1603
>
> Hmm, OK, but I'm not sure that is great either.

I will send out a new series. We can discuss it there if you have
suggestions for improvement here.

>

> > cleaner, and handle errors. We must unpin everything because if we

> > fail, no pages should stay pinned, and also if we migrated some pages,

> > the pages array must be updated, so we need to call

> > __get_user_pages_locked() pin and repopulated pages array.

>

> However the page can't be unpinned until it is put on the LRU (and I'm

> hoping that the LRU is enough of a 'lock' to make that safe, no idea)

>

> > > I don't like this at all. It shouldn't be so flakey

> > >

> > > Can you do migration without the LRU?

> >

> > I do not think it is possible, we must isolate pages before migration.

>

> I don't like this at all :( Lots of stuff relies on GUP, introducing a

> random flakiness like this not good.


This is actually the standard migration procedure; elsewhere in the kernel
we migrate pages in exactly the same fashion: isolate first, migrate
later. The isolation works for LRU pages only.
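
For example, the two-phase pattern looks roughly like this (a sketch; it
mirrors the migrate_movable_pages() loop in the diff below, with mtc being
the migration_target_control from there):

	LIST_HEAD(movable_page_list);

	/* Phase 1: take the page off its LRU list onto a private list */
	if (!isolate_lru_page(head))
		list_add_tail(&head->lru, &movable_page_list);

	/* Phase 2: migrate everything that was successfully isolated */
	if (!list_empty(&movable_page_list))
		migrate_pages(&movable_page_list, alloc_migration_target, NULL,
			      (unsigned long)&mtc, MIGRATE_SYNC, MR_LONGTERM_PIN);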

>
> Jason
Pasha Tatashin Jan. 13, 2021, 7:49 p.m. UTC | #7
On Fri, Dec 18, 2020 at 8:14 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 18-12-20 07:43:15, Pavel Tatashin wrote:
> > On Fri, Dec 18, 2020 at 5:46 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 17-12-20 13:52:41, Pavel Tatashin wrote:
> > > [...]
> > > > +#define PINNABLE_MIGRATE_MAX 10
> > > > +#define PINNABLE_ISOLATE_MAX 100
> > >
> > > Why would we need to limit the isolation retries? Those should always be
> > > temporary failures unless I am missing something.
> >
> > Actually, during development, I was retrying isolation errors
> > infinitely, but during testing found a hang where, when FOLL_TOUCH
> > without FOLL_WRITE is passed (fault in kernel without write flag), the
> > zero page is faulted. The isolation of the zero page was failing every
> > time, therefore the process was hanging.
>
> Why would you migrate the zero page in the first place? Simply instantiate
> it.

This is exactly the idea behind FOLL_WRITE; it causes zero pages to be
created in the right zone right away, and no migration is necessary.

>
> > Since then, I fixed this problem by adding FOLL_WRITE unconditionally
> > to FOLL_LONGTERM, but I was worried about other possible bugs that
> > would cause hangs, so decided to limit isolation errors. If you think
> > it is not necessary, I can unlimit isolation retries.
>
> It should have a really good reason to exist. Worries about some corner
> cases are definitely not a reason to add an awkward retry mechanism.
> My historical experience is that these things are extremely hard to get
> rid of later.
>
> > > I am not sure about the
> > > PINNABLE_MIGRATE_MAX either. Why do we want to limit that? migrate_pages
> > > already implements its retry logic, so why do you want to count retries on
> > > top of that? I do agree that the existing logic is suboptimal because
> >
> > True, but again, just recently, I worked on a race bug where pages can
> > end up on a per-cpu list after lru_add_drain_all() but before isolation,
> > so I think retry is necessary.
>
> There are ways to make sure pages are not ending up on the pcp list. Have a
> look at how hotplug does that.

Sounds good to me, I will remove PINNABLE_MIGRATE_MAX, and leave only
PINNABLE_ISOLATE_MAX for transient isolation errors.

>
> > > the migration failure might be ephemeral or permanent, but that should be
> > > IMHO addressed at migrate_pages (resp. unmap_and_move) and simply report
> > > failures that are permanent - e.g. any potential pre-existing long term
> > > pin - if that is possible at all. If not, what would cause permanent
> > > migration failure? OOM?
> >
> > Yes, OOM is the main cause of migration failures.
>
> Then you can treat ENOMEM as a permanent failure.

Sounds good.
Jason Gunthorpe Jan. 13, 2021, 7:55 p.m. UTC | #8
On Wed, Jan 13, 2021 at 02:43:50PM -0500, Pavel Tatashin wrote:
> On Fri, Dec 18, 2020 at 9:19 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Thu, Dec 17, 2020 at 05:02:03PM -0500, Pavel Tatashin wrote:
> > > Hi Jason,
> > >
> > > Thank you for your comments. My replies below.
> > >
> > > On Thu, Dec 17, 2020 at 3:50 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > >
> > > > On Thu, Dec 17, 2020 at 01:52:41PM -0500, Pavel Tatashin wrote:
> > > > > +/*
> > > > > + * Verify that there are no unpinnable (movable) pages; if so, return true.
> > > > > + * Otherwise an unpinnable page was found: return false, and unpin all pages.
> > > > > + */
> > > > > +static bool check_and_unpin_pages(unsigned long nr_pages, struct page **pages,
> > > > > +                               unsigned int gup_flags)
> > > > > +{
> > > > > +     unsigned long i, step;
> > > > > +
> > > > > +     for (i = 0; i < nr_pages; i += step) {
> > > > > +             struct page *head = compound_head(pages[i]);
> > > > > +
> > > > > +             step = compound_nr(head) - (pages[i] - head);
> > > >
> > > > You can't assume that all of a compound head is in the pages array,
> > > > this assumption would only work inside the page walkers if the page
> > > > was found in a PMD or something.
> > >
> > > I am not sure I understand your comment. The compound head is not
> > > taken from the pages array, and not assumed to be in it. It is exactly
> > > the same logic as what we currently have:
> > > https://soleen.com/source/xref/linux/mm/gup.c?r=a00cda3f#1565
> >
> > Oh, that existing logic is wrong too :( Another bug.
>
> I do not think there is a bug.
>
> > You can't skip pages in the pages[] array under the assumption they
> > are contiguous, i.e. the i += step is wrong.
>
> If pages[i] is part of a compound page, the other parts of this page
> must be sequential in this array for this compound page

That is true only if the PMD points to the page. If the PTE points to
a tail page, then there is no requirement that other PTEs are
contiguous with the compound page.

At this point we have no idea if the GUP logic got this compound page
as a head page in a PMD or as a tail page from a PTE, so we can't
assume a contiguous run of addresses.

Look at split_huge_pmd() - it doesn't break up the compound page, it
just converts the PMD to a PTE array and scatters the tail pages to
the PTE.

I understand Matt is pushing on this idea more by having compound
pages in the page cache, but still mapping tail pages when required.

> This is actually the standard migration procedure; elsewhere in the kernel
> we migrate pages in exactly the same fashion: isolate first, migrate
> later. The isolation works for LRU pages only.

But do other places cause a userspace-visible random failure when LRU
isolation fails?

I don't like it at all, what is the user supposed to do?

Jason
Pasha Tatashin Jan. 13, 2021, 8:05 p.m. UTC | #9
> > > Oh, that existing logic is wrong too :( Another bug.
> >
> > I do not think there is a bug.
> >
> > > You can't skip pages in the pages[] array under the assumption they
> > > are contiguous, i.e. the i += step is wrong.
> >
> > If pages[i] is part of a compound page, the other parts of this page
> > must be sequential in this array for this compound page
>
> That is true only if the PMD points to the page. If the PTE points to
> a tail page, then there is no requirement that other PTEs are
> contiguous with the compound page.
>
> At this point we have no idea if the GUP logic got this compound page
> as a head page in a PMD or as a tail page from a PTE, so we can't
> assume a contiguous run of addresses.

I see, I will fix this bug upstream as a separate patch in my
series, and keep the fix when my changes are applied.

>
> Look at split_huge_pmd() - it doesn't break up the compound page, it
> just converts the PMD to a PTE array and scatters the tail pages to
> the PTE.

Got it, unfortunately the fix will deoptimize the code by having to
check, for every page, whether it is part of a previous compound page or not.
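
Something along these lines (a sketch only; is_pinnable_page() is the
helper added by this series):

	/*
	 * Check every element of pages[] instead of skipping ahead by
	 * compound_nr(), since adjacent entries are not guaranteed to
	 * belong to the same compound page.
	 */
	for (i = 0; i < nr_pages; i++) {
		if (!is_pinnable_page(compound_head(pages[i])))
			return false;
	}
	return true;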

>
> I understand Matt is pushing on this idea more by having compound
> pages in the page cache, but still mapping tail pages when required.
>
> > This is actually the standard migration procedure; elsewhere in the kernel
> > we migrate pages in exactly the same fashion: isolate first, migrate
> > later. The isolation works for LRU pages only.
>
> But do other places cause a userspace-visible random failure when LRU
> isolation fails?

Makes sense, I will remove maximum retries for isolation, and retry
indefinitely, the same as is done during memory hot-remove. So we
will fail only when migration fails.

>
> I don't like it at all, what is the user supposed to do?
>
> Jason
Jason Gunthorpe Jan. 13, 2021, 11:40 p.m. UTC | #10
On Wed, Jan 13, 2021 at 03:05:41PM -0500, Pavel Tatashin wrote:
> Makes sense, I will remove maximum retries for isolation, and retry
> indefinitely, the same as is done during memory hot-remove. So we
> will fail only when migration fails.

It would be good to make this a stand-alone pre-fix as well.

Thanks,
Jason
Pasha Tatashin Jan. 15, 2021, 6:10 p.m. UTC | #11
On Wed, Jan 13, 2021 at 3:05 PM Pavel Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> > > > Oh, that existing logic is wrong too :( Another bug.
> > >
> > > I do not think there is a bug.
> > >
> > > > You can't skip pages in the pages[] array under the assumption they
> > > > are contiguous, i.e. the i += step is wrong.
> > >
> > > If pages[i] is part of a compound page, the other parts of this page
> > > must be sequential in this array for this compound page
> >
> > That is true only if the PMD points to the page. If the PTE points to
> > a tail page, then there is no requirement that other PTEs are
> > contiguous with the compound page.
> >
> > At this point we have no idea if the GUP logic got this compound page
> > as a head page in a PMD or as a tail page from a PTE, so we can't
> > assume a contiguous run of addresses.
>
> I see, I will fix this bug upstream as a separate patch in my
> series, and keep the fix when my changes are applied.
>
> > Look at split_huge_pmd() - it doesn't break up the compound page, it
> > just converts the PMD to a PTE array and scatters the tail pages to
> > the PTE.

Hi Jason,

I've been thinking about this some more. Again, I am not sure this is
a bug. I understand split_huge_pmd() may split the PMD-sized page into
PTEs and leave the compound page intact. However, in order for pages[]
to have non-sequential addresses within a compound page, those PTEs must
also be migrated after split_huge_pmd(); however, when we migrate them
we will either migrate the whole compound page or do
split_huge_page_to_list(), which will in turn do ClearPageCompound().
Please let me know if I am missing something.

Thank you,
Pasha

>
> Got it, unfortunately the fix will deoptimize the code by having to
> check, for every page, whether it is part of a previous compound page
> or not.
>
> > I understand Matt is pushing on this idea more by having compound
> > pages in the page cache, but still mapping tail pages when required.
> >
> > > This is actually the standard migration procedure; elsewhere in the kernel
> > > we migrate pages in exactly the same fashion: isolate first, migrate
> > > later. The isolation works for LRU pages only.
> >
> > But do other places cause a userspace-visible random failure when LRU
> > isolation fails?
>
> Makes sense, I will remove maximum retries for isolation, and retry
> indefinitely, the same as is done during memory hot-remove. So we
> will fail only when migration fails.
>
> > I don't like it at all, what is the user supposed to do?
> >
> > Jason
Jason Gunthorpe Jan. 15, 2021, 6:40 p.m. UTC | #12
On Fri, Jan 15, 2021 at 01:10:27PM -0500, Pavel Tatashin wrote:
> I've been thinking about this some more. Again, I am not sure this is
> a bug. I understand split_huge_pmd() may split the PMD-sized page into
> PTEs and leave the compound page intact. However, in order for pages[]
> to have non-sequential addresses within a compound page, those PTEs must
> also be migrated after split_huge_pmd(),

Why focus on migrated? Anything could happen to those PTEs: they could
be COW'd, they could be mmap/munmap'd, etc.

Jason
Patch

diff --git a/mm/gup.c b/mm/gup.c
index 1ebb7cc2fbe4..70cc8b8f67c4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1550,27 +1550,57 @@  struct page *get_dump_page(unsigned long addr)
 }
 #endif /* CONFIG_ELF_CORE */
 
-static long check_and_migrate_movable_pages(struct mm_struct *mm,
-					    unsigned long start,
-					    unsigned long nr_pages,
-					    struct page **pages,
-					    struct vm_area_struct **vmas,
-					    unsigned int gup_flags)
-{
-	unsigned long i;
-	unsigned long step;
-	bool drain_allow = true;
-	bool migrate_allow = true;
+/*
+ * Verify that there are no unpinnable (movable) pages; if so, return true.
+ * Otherwise an unpinnable page was found: return false, and unpin all pages.
+ */
+static bool check_and_unpin_pages(unsigned long nr_pages, struct page **pages,
+				  unsigned int gup_flags)
+{
+	unsigned long i, step;
+
+	for (i = 0; i < nr_pages; i += step) {
+		struct page *head = compound_head(pages[i]);
+
+		step = compound_nr(head) - (pages[i] - head);
+		if (!is_pinnable_page(head))
+			break;
+	}
+
+	if (i >= nr_pages)
+		return true;
+
+	if (gup_flags & FOLL_PIN) {
+		unpin_user_pages(pages, nr_pages);
+	} else {
+		for (i = 0; i < nr_pages; i++)
+			put_page(pages[i]);
+	}
+
+	return false;
+}
+
+#define PINNABLE_MIGRATE_MAX	10
+#define PINNABLE_ISOLATE_MAX	100
+
+/*
+ * Migrate pages that cannot be pinned. Return zero on success and an error
+ * code on migration failure. If migration was successful but page isolation
+ * had failures, return the number of pages that failed to be isolated.
+ */
+static long migrate_movable_pages(unsigned long nr_pages, struct page **pages)
+{
+	unsigned long i, step;
 	LIST_HEAD(movable_page_list);
-	long ret = nr_pages;
+	long ret = 0;
+	long error_count = 0;
 	struct migration_target_control mtc = {
 		.nid = NUMA_NO_NODE,
 		.gfp_mask = GFP_USER | __GFP_NOWARN,
 	};
 
-check_again:
-	for (i = 0; i < nr_pages;) {
-
+	lru_add_drain_all();
+	for (i = 0; i < nr_pages; i += step) {
 		struct page *head = compound_head(pages[i]);
 
 		/*
@@ -1583,62 +1613,42 @@  static long check_and_migrate_movable_pages(struct mm_struct *mm,
 		 * these entries, try to move them out if possible.
 		 */
 		if (!is_pinnable_page(head)) {
-			if (PageHuge(head))
-				isolate_huge_page(head, &movable_page_list);
-			else {
-				if (!PageLRU(head) && drain_allow) {
-					lru_add_drain_all();
-					drain_allow = false;
-				}
-
+			if (PageHuge(head)) {
+				if (!isolate_huge_page(head, &movable_page_list))
+					error_count += step;
+			} else {
 				if (!isolate_lru_page(head)) {
 					list_add_tail(&head->lru, &movable_page_list);
 					mod_node_page_state(page_pgdat(head),
 							    NR_ISOLATED_ANON +
 							    page_is_file_lru(head),
 							    thp_nr_pages(head));
+				} else {
+					error_count += step;
 				}
 			}
 		}
-
-		i += step;
 	}
 
 	if (!list_empty(&movable_page_list)) {
-		/*
-		 * drop the above get_user_pages reference.
-		 */
-		if (gup_flags & FOLL_PIN)
-			unpin_user_pages(pages, nr_pages);
-		else
-			for (i = 0; i < nr_pages; i++)
-				put_page(pages[i]);
+		ret = migrate_pages(&movable_page_list, alloc_migration_target,
+				    NULL, (unsigned long)&mtc, MIGRATE_SYNC,
+				    MR_LONGTERM_PIN);
+		/* Assume -EBUSY failure if some pages were not migrated */
+		if (ret > 0)
+			ret = -EBUSY;
+	}
 
-		if (migrate_pages(&movable_page_list, alloc_migration_target, NULL,
-			(unsigned long)&mtc, MIGRATE_SYNC, MR_LONGTERM_PIN)) {
-			/*
-			 * some of the pages failed migration. Do get_user_pages
-			 * without migration.
-			 */
-			migrate_allow = false;
+	if (ret && !list_empty(&movable_page_list))
+		putback_movable_pages(&movable_page_list);
 
-			if (!list_empty(&movable_page_list))
-				putback_movable_pages(&movable_page_list);
-		}
-		/*
-		 * We did migrate all the pages, Try to get the page references
-		 * again migrating any pages which we failed to isolate earlier.
-		 */
-		ret = __get_user_pages_locked(mm, start, nr_pages,
-					      pages, vmas, NULL,
-					      gup_flags);
-
-		if ((ret > 0) && migrate_allow) {
-			nr_pages = ret;
-			drain_allow = true;
-			goto check_again;
-		}
-	}
+	/*
+	 * Check if there were isolation errors; if so, they should not be
+	 * counted toward PINNABLE_MIGRATE_MAX, so separate them by
+	 * returning the number of pages that failed to be isolated.
+	 */
+	if (!ret && error_count)
+		ret = error_count;
 
 	return ret;
 }
@@ -1654,22 +1664,55 @@  static long __gup_longterm_locked(struct mm_struct *mm,
 				  struct vm_area_struct **vmas,
 				  unsigned int gup_flags)
 {
-	unsigned long flags = 0;
+	int migrate_retry = 0;
+	int isolate_retry = 0;
+	unsigned int flags;
 	long rc;
 
-	if (gup_flags & FOLL_LONGTERM)
-		flags = memalloc_pin_save();
+	if (!(gup_flags & FOLL_LONGTERM))
+		return __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
+					       NULL, gup_flags);
 
-	rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas, NULL,
-				     gup_flags);
+	/*
+	 * Without FOLL_WRITE the fault handler may return the zero page, which
+	 * can be in a movable zone, and will also fail to isolate during
+	 * migration; thus the longterm pin will fail.
+	 */
+	gup_flags |= FOLL_WRITE;
 
-	if (gup_flags & FOLL_LONGTERM) {
-		if (rc > 0)
-			rc = check_and_migrate_movable_pages(mm, start, rc,
-							     pages, vmas,
-							     gup_flags);
-		memalloc_pin_restore(flags);
+	flags = memalloc_pin_save();
+	/*
+	 * Migration may fail; we retry before giving up. Because after
+	 * migration pages[] becomes outdated, we unpin and repin all pages
+	 * in the range, so the pages array is repopulated with new values.
+	 * Because of this we cannot retry migration failures in a loop
+	 * without pinning/unpinning pages.
+	 */
+	for (;;) {
+		rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
+					     NULL, gup_flags);
+
+		/* Return if error or if all pages are pinnable */
+		if (rc <= 0 || check_and_unpin_pages(rc, pages, gup_flags))
+			break;
+
+		/* Some pages are not pinnable, migrate them */
+		rc = migrate_movable_pages(rc, pages);
+
+		/*
+		 * If there is an error and we tried the maximum number of
+		 * times, bail out. Note: we return an error code, and all
+		 * pages are unpinned.
+		 */
+		if (rc < 0 && migrate_retry++ >= PINNABLE_MIGRATE_MAX) {
+			break;
+		} else if (rc > 0 && isolate_retry++ >= PINNABLE_ISOLATE_MAX) {
+			rc = -EBUSY;
+			break;
+		}
 	}
+	memalloc_pin_restore(flags);
+
 	return rc;
 }