
[1/1] mm: avoid re-using pfmemalloc page in page_frag_alloc()

Message ID 20201103193239.1807-1-dongli.zhang@oracle.com
State New
Series [1/1] mm: avoid re-using pfmemalloc page in page_frag_alloc()

Commit Message

Dongli Zhang Nov. 3, 2020, 7:32 p.m. UTC
The ethernet driver may allocate an skb (and skb->data) via napi_alloc_skb().
This ends up in page_frag_alloc(), which allocates skb->data from
page_frag_cache->va.

Under memory pressure, page_frag_cache->va may be allocated as a
pfmemalloc page. As a result, skb->pfmemalloc is always true because
skb->data comes from page_frag_cache->va. The skb will be dropped if the
sock (receiver) does not have SOCK_MEMALLOC. This is expected behaviour
under memory pressure.

However, once the kernel is no longer under memory pressure (suppose a large
number of pages have just been reclaimed), page_frag_alloc() may still
re-use the prior pfmemalloc page_frag_cache->va to allocate skb->data. As a
result, skb->pfmemalloc stays true until page_frag_cache->va is
re-allocated, even though the kernel is no longer under memory pressure.

Here is how the kernel runs into this issue.

1. The kernel is under memory pressure and the allocation of
PAGE_FRAG_CACHE_MAX_ORDER in __page_frag_cache_refill() fails. Instead, a
pfmemalloc page is allocated for page_frag_cache->va.

2. Every skb->data carved out of page_frag_cache->va (pfmemalloc) has
skb->pfmemalloc=true. Such skbs are always dropped by socks without
SOCK_MEMALLOC. This is expected behaviour.

3. Suppose a large number of pages are then reclaimed and the kernel is no
longer under memory pressure. We expect drops due to skb->pfmemalloc to stop.

4. Unfortunately, page_frag_alloc() does not proactively re-allocate
page_frag_cache->va and always re-uses the prior pfmemalloc page.
skb->pfmemalloc remains true even though the kernel is no longer under
memory pressure.

Therefore, this patch always checks for and avoids re-using a pfmemalloc
page for page_frag_cache->va.

Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Cc: Bert Barbe <bert.barbe@oracle.com>
Cc: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Cc: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Cc: Manjunath Patil <manjunath.b.patil@oracle.com>
Cc: Joe Jin <joe.jin@oracle.com>
Cc: SRINIVAS <srinivas.eeda@oracle.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 mm/page_alloc.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

Comments

Matthew Wilcox Nov. 3, 2020, 8:35 p.m. UTC | #1
On Tue, Nov 03, 2020 at 11:32:39AM -0800, Dongli Zhang wrote:
> The ethernet driver may allocates skb (and skb->data) via napi_alloc_skb().
> This ends up to page_frag_alloc() to allocate skb->data from
> page_frag_cache->va.
> 
> During the memory pressure, page_frag_cache->va may be allocated as
> pfmemalloc page. As a result, the skb->pfmemalloc is always true as
> skb->data is from page_frag_cache->va. The skb will be dropped if the
> sock (receiver) does not have SOCK_MEMALLOC. This is expected behaviour
> under memory pressure.
> 
> However, once kernel is not under memory pressure any longer (suppose large
> amount of memory pages are just reclaimed), the page_frag_alloc() may still
> re-use the prior pfmemalloc page_frag_cache->va to allocate skb->data. As a
> result, the skb->pfmemalloc is always true unless page_frag_cache->va is
> re-allocated, even the kernel is not under memory pressure any longer.
> 
> Here is how kernel runs into issue.
> 
> 1. The kernel is under memory pressure and allocation of
> PAGE_FRAG_CACHE_MAX_ORDER in __page_frag_cache_refill() will fail. Instead,
> the pfmemalloc page is allocated for page_frag_cache->va.
> 
> 2: All skb->data from page_frag_cache->va (pfmemalloc) will have
> skb->pfmemalloc=true. The skb will always be dropped by sock without
> SOCK_MEMALLOC. This is an expected behaviour.
> 
> 3. Suppose a large amount of pages are reclaimed and kernel is not under
> memory pressure any longer. We expect skb->pfmemalloc drop will not happen.
> 
> 4. Unfortunately, page_frag_alloc() does not proactively re-allocate
> page_frag_alloc->va and will always re-use the prior pfmemalloc page. The
> skb->pfmemalloc is always true even kernel is not under memory pressure any
> longer.
> 
> Therefore, this patch always checks and tries to avoid re-using the
> pfmemalloc page for page_frag_alloc->va.
> 
> Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
> Cc: Bert Barbe <bert.barbe@oracle.com>
> Cc: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
> Cc: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
> Cc: Manjunath Patil <manjunath.b.patil@oracle.com>
> Cc: Joe Jin <joe.jin@oracle.com>
> Cc: SRINIVAS <srinivas.eeda@oracle.com>
> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> ---
>  mm/page_alloc.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 23f5066bd4a5..291df2f9f8f3 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5075,6 +5075,16 @@ void *page_frag_alloc(struct page_frag_cache *nc,
>  	struct page *page;
>  	int offset;
>  
> +	/*
> +	 * Try to avoid re-using pfmemalloc page because kernel may already
> +	 * run out of the memory pressure situation at any time.
> +	 */
> +	if (unlikely(nc->va && nc->pfmemalloc)) {
> +		page = virt_to_page(nc->va);
> +		__page_frag_cache_drain(page, nc->pagecnt_bias);
> +		nc->va = NULL;
> +	}


I think this is the wrong way to solve this problem.  Instead, we should
use up this page, but refuse to recycle it.  How about something like this (not even compile tested):

+++ b/mm/page_alloc.c
@@ -5139,6 +5139,10 @@ void *page_frag_alloc(struct page_frag_cache *nc,
 
                if (!page_ref_sub_and_test(page, nc->pagecnt_bias))
                        goto refill;
+               if (nc->pfmemalloc) {
+                       free_the_page(page);
+                       goto refill;
+               }
 
 #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
                /* if size can vary use size else just use PAGE_SIZE */
Dongli Zhang Nov. 3, 2020, 8:57 p.m. UTC | #2
Hi Matthew,

On 11/3/20 12:35 PM, Matthew Wilcox wrote:
> On Tue, Nov 03, 2020 at 11:32:39AM -0800, Dongli Zhang wrote:
>> The ethernet driver may allocates skb (and skb->data) via napi_alloc_skb().
>> This ends up to page_frag_alloc() to allocate skb->data from
>> page_frag_cache->va.
>>
>> During the memory pressure, page_frag_cache->va may be allocated as
>> pfmemalloc page. As a result, the skb->pfmemalloc is always true as
>> skb->data is from page_frag_cache->va. The skb will be dropped if the
>> sock (receiver) does not have SOCK_MEMALLOC. This is expected behaviour
>> under memory pressure.
>>
>> However, once kernel is not under memory pressure any longer (suppose large
>> amount of memory pages are just reclaimed), the page_frag_alloc() may still
>> re-use the prior pfmemalloc page_frag_cache->va to allocate skb->data. As a
>> result, the skb->pfmemalloc is always true unless page_frag_cache->va is
>> re-allocated, even the kernel is not under memory pressure any longer.
>>
>> Here is how kernel runs into issue.
>>
>> 1. The kernel is under memory pressure and allocation of
>> PAGE_FRAG_CACHE_MAX_ORDER in __page_frag_cache_refill() will fail. Instead,
>> the pfmemalloc page is allocated for page_frag_cache->va.
>>
>> 2: All skb->data from page_frag_cache->va (pfmemalloc) will have
>> skb->pfmemalloc=true. The skb will always be dropped by sock without
>> SOCK_MEMALLOC. This is an expected behaviour.
>>
>> 3. Suppose a large amount of pages are reclaimed and kernel is not under
>> memory pressure any longer. We expect skb->pfmemalloc drop will not happen.
>>
>> 4. Unfortunately, page_frag_alloc() does not proactively re-allocate
>> page_frag_alloc->va and will always re-use the prior pfmemalloc page. The
>> skb->pfmemalloc is always true even kernel is not under memory pressure any
>> longer.
>>
>> Therefore, this patch always checks and tries to avoid re-using the
>> pfmemalloc page for page_frag_alloc->va.
>>
>> Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
>> Cc: Bert Barbe <bert.barbe@oracle.com>
>> Cc: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
>> Cc: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
>> Cc: Manjunath Patil <manjunath.b.patil@oracle.com>
>> Cc: Joe Jin <joe.jin@oracle.com>
>> Cc: SRINIVAS <srinivas.eeda@oracle.com>
>> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
>> ---
>>  mm/page_alloc.c | 10 ++++++++++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 23f5066bd4a5..291df2f9f8f3 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5075,6 +5075,16 @@ void *page_frag_alloc(struct page_frag_cache *nc,
>>  	struct page *page;
>>  	int offset;
>>  
>> +	/*
>> +	 * Try to avoid re-using pfmemalloc page because kernel may already
>> +	 * run out of the memory pressure situation at any time.
>> +	 */
>> +	if (unlikely(nc->va && nc->pfmemalloc)) {
>> +		page = virt_to_page(nc->va);
>> +		__page_frag_cache_drain(page, nc->pagecnt_bias);
>> +		nc->va = NULL;
>> +	}
> 
> I think this is the wrong way to solve this problem.  Instead, we should
> use up this page, but refuse to recycle it.  How about something like this (not even compile tested):

Thank you very much for the feedback. Yes, the option is to use the same page
until it is used up (offset < 0). Instead of recycling it, the kernel frees it
and allocates a new one.

This depends on whether we will tolerate packet drops until this page is used up.

For virtio-net, the payload (skb->data) is 128 bytes. Padding and alignment
finally make it 512 bytes.

Therefore, for virtio-net, we will have at most 4096/512 - 1 = 7 packets
dropped before the page is used up.

Dongli Zhang

> 
> +++ b/mm/page_alloc.c
> @@ -5139,6 +5139,10 @@ void *page_frag_alloc(struct page_frag_cache *nc,
>  
>                 if (!page_ref_sub_and_test(page, nc->pagecnt_bias))
>                         goto refill;
> +               if (nc->pfmemalloc) {
> +                       free_the_page(page);
> +                       goto refill;
> +               }
>  
>  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>                 /* if size can vary use size else just use PAGE_SIZE */
Matthew Wilcox Nov. 3, 2020, 9:15 p.m. UTC | #3
On Tue, Nov 03, 2020 at 12:57:33PM -0800, Dongli Zhang wrote:
> On 11/3/20 12:35 PM, Matthew Wilcox wrote:
> > On Tue, Nov 03, 2020 at 11:32:39AM -0800, Dongli Zhang wrote:
> >> However, once kernel is not under memory pressure any longer (suppose large
> >> amount of memory pages are just reclaimed), the page_frag_alloc() may still
> >> re-use the prior pfmemalloc page_frag_cache->va to allocate skb->data. As a
> >> result, the skb->pfmemalloc is always true unless page_frag_cache->va is
> >> re-allocated, even the kernel is not under memory pressure any longer.
> >> +	/*
> >> +	 * Try to avoid re-using pfmemalloc page because kernel may already
> >> +	 * run out of the memory pressure situation at any time.
> >> +	 */
> >> +	if (unlikely(nc->va && nc->pfmemalloc)) {
> >> +		page = virt_to_page(nc->va);
> >> +		__page_frag_cache_drain(page, nc->pagecnt_bias);
> >> +		nc->va = NULL;
> >> +	}
> > 
> > I think this is the wrong way to solve this problem.  Instead, we should
> > use up this page, but refuse to recycle it.  How about something like this (not even compile tested):
> 
> Thank you very much for the feedback. Yes, the option is to use the same page
> until it is used up (offset < 0). Instead of recycling it, the kernel free it
> and allocate new one.
> 
> This depends on whether we will tolerate the packet drop until this page is used up.
> 
> For virtio-net, the payload (skb->data) is of size 128-byte. The padding and
> alignment will finally make it as 512-byte.
> 
> Therefore, for virtio-net, we will have at most 4096/512-1=7 packets dropped
> before the page is used up.

My thinking is that if the kernel is under memory pressure then freeing
the page and allocating a new one is likely to put even more strain
on the memory allocator, so we want to do this "soon", rather than at
each allocation.

Thanks for providing the numbers.  Do you think that dropping (up to)
7 packets is acceptable?

We could also do something like ...

        if (unlikely(nc->pfmemalloc)) {
                page = alloc_page(GFP_NOWAIT | __GFP_NOWARN);
                if (page)
                        nc->pfmemalloc = 0;
                put_page(page);
        }

to test if the memory allocator has free pages at the moment.  Not sure
whether that's a good idea or not -- hopefully you have a test environment
set up where you can reproduce this condition on demand and determine
which of these three approaches is best!
Dongli Zhang Nov. 3, 2020, 9:37 p.m. UTC | #4
On 11/3/20 1:15 PM, Matthew Wilcox wrote:
> On Tue, Nov 03, 2020 at 12:57:33PM -0800, Dongli Zhang wrote:
>> On 11/3/20 12:35 PM, Matthew Wilcox wrote:
>>> On Tue, Nov 03, 2020 at 11:32:39AM -0800, Dongli Zhang wrote:
>>>> However, once kernel is not under memory pressure any longer (suppose large
>>>> amount of memory pages are just reclaimed), the page_frag_alloc() may still
>>>> re-use the prior pfmemalloc page_frag_cache->va to allocate skb->data. As a
>>>> result, the skb->pfmemalloc is always true unless page_frag_cache->va is
>>>> re-allocated, even the kernel is not under memory pressure any longer.
>>>> +	/*
>>>> +	 * Try to avoid re-using pfmemalloc page because kernel may already
>>>> +	 * run out of the memory pressure situation at any time.
>>>> +	 */
>>>> +	if (unlikely(nc->va && nc->pfmemalloc)) {
>>>> +		page = virt_to_page(nc->va);
>>>> +		__page_frag_cache_drain(page, nc->pagecnt_bias);
>>>> +		nc->va = NULL;
>>>> +	}
>>>
>>> I think this is the wrong way to solve this problem.  Instead, we should
>>> use up this page, but refuse to recycle it.  How about something like this (not even compile tested):
>>
>> Thank you very much for the feedback. Yes, the option is to use the same page
>> until it is used up (offset < 0). Instead of recycling it, the kernel free it
>> and allocate new one.
>>
>> This depends on whether we will tolerate the packet drop until this page is used up.
>>
>> For virtio-net, the payload (skb->data) is of size 128-byte. The padding and
>> alignment will finally make it as 512-byte.
>>
>> Therefore, for virtio-net, we will have at most 4096/512-1=7 packets dropped
>> before the page is used up.
> 
> My thinking is that if the kernel is under memory pressure then freeing
> the page and allocating a new one is likely to put even more strain
> on the memory allocator, so we want to do this "soon", rather than at
> each allocation.
> 
> Thanks for providing the numbers.  Do you think that dropping (up to)
> 7 packets is acceptable?
> 
> We could also do something like ...
> 
>         if (unlikely(nc->pfmemalloc)) {
>                 page = alloc_page(GFP_NOWAIT | __GFP_NOWARN);
>                 if (page)
>                         nc->pfmemalloc = 0;
>                 put_page(page);
>         }
> 
> to test if the memory allocator has free pages at the moment.  Not sure
> whether that's a good idea or not -- hopefully you have a test environment
> set up where you can reproduce this condition on demand and determine
> which of these three approaches is best!

From mm's perspective, we expect to reduce the number of page allocations
(especially under memory pressure).

From networking's perspective, we expect to reduce the number of skb drops.

That's why I CCed netdev folks (including David and Jakub), although the patch
is for mm/page_alloc.c. page_frag_alloc() is primarily used by networking
and nvme-tcp.

Unfortunately, so far I do not have an environment to reproduce this
naturally. I reproduced it with a patch that fails the page allocation and
sets nc->pfmemalloc on purpose.

From mm's perspective, I think using up the page is a good option. Indeed, it
is the system administrator's duty to avoid memory pressure, in order to avoid
the extra packet drops.

Dongli Zhang
Rama Nichanamatlu Nov. 4, 2020, 1:16 a.m. UTC | #5
> Thanks for providing the numbers.  Do you think that dropping (up to)
> 7 packets is acceptable?

net.ipv4.tcp_syn_retries = 6

tcp clients wouldn't even get that far, leading to connection-establishment issues.

-rama
Eric Dumazet Nov. 4, 2020, 8:50 a.m. UTC | #6
On 11/4/20 2:16 AM, Rama Nichanamatlu wrote:
>> Thanks for providing the numbers.  Do you think that dropping (up to)
>> 7 packets is acceptable?
> 
> net.ipv4.tcp_syn_retries = 6
> 
> tcp clients wouldn't even get that far leading to connect establish issues.

This does not really matter. If host was under memory pressure,
dropping a few packets is really not an issue.

Please do not add expensive checks in fast path, just to "not drop a packet"
even if the world is collapsing.

Also consider that NIC typically have thousands of pre-allocated page/frags
for their RX ring buffers, they might all have pfmemalloc set, so we are speaking
of thousands of packet drops before the RX-ring can be refilled with normal (non pfmemalloc) page/frags.

If we want to solve this issue more generically, we would have to try
to copy data into a non pfmemalloc frag instead of dropping skb that
had frags allocated minutes ago under memory pressure.

This copy could happen in core networking stack, but this seems adding
more pressure to mm layer under pressure.
Matthew Wilcox Nov. 4, 2020, 12:36 p.m. UTC | #7
On Wed, Nov 04, 2020 at 09:50:30AM +0100, Eric Dumazet wrote:
> On 11/4/20 2:16 AM, Rama Nichanamatlu wrote:
> >> Thanks for providing the numbers.  Do you think that dropping (up to)
> >> 7 packets is acceptable?
> > 
> > net.ipv4.tcp_syn_retries = 6
> > 
> > tcp clients wouldn't even get that far leading to connect establish issues.
> 
> This does not really matter. If host was under memory pressure,
> dropping a few packets is really not an issue.
> 
> Please do not add expensive checks in fast path, just to "not drop a packet"
> even if the world is collapsing.

Right, that was my first patch -- to only recheck if we're about to
reuse the page.  Do you think that's acceptable, or is that still too
close to the fast path?

> Also consider that NIC typically have thousands of pre-allocated page/frags
> for their RX ring buffers, they might all have pfmemalloc set, so we are speaking
> of thousands of packet drops before the RX-ring can be refilled with normal (non pfmemalloc) page/frags.
> 
> If we want to solve this issue more generically, we would have to try
> to copy data into a non pfmemalloc frag instead of dropping skb that
> had frags allocated minutes ago under memory pressure.

I don't think we need to copy anything.  We need to figure out if the
system is still under memory pressure, and if not, we can clear the
pfmemalloc bit on the frag, as in my second patch.  The 'least change'
way of doing that is to try to allocate a page, but the VM could export
a symbol that says "we're not under memory pressure any more".

Did you want to move checking that into the networking layer, or do you
want to keep it in the pagefrag allocator?
Eric Dumazet Nov. 4, 2020, 12:51 p.m. UTC | #8
On 11/4/20 1:36 PM, Matthew Wilcox wrote:
> On Wed, Nov 04, 2020 at 09:50:30AM +0100, Eric Dumazet wrote:
>> On 11/4/20 2:16 AM, Rama Nichanamatlu wrote:
>>>> Thanks for providing the numbers.  Do you think that dropping (up to)
>>>> 7 packets is acceptable?
>>>
>>> net.ipv4.tcp_syn_retries = 6
>>>
>>> tcp clients wouldn't even get that far leading to connect establish issues.
>>
>> This does not really matter. If host was under memory pressure,
>> dropping a few packets is really not an issue.
>>
>> Please do not add expensive checks in fast path, just to "not drop a packet"
>> even if the world is collapsing.
> 
> Right, that was my first patch -- to only recheck if we're about to
> reuse the page.  Do you think that's acceptable, or is that still too
> close to the fast path?

I think it is totally acceptable.

The same strategy is used in NIC drivers, before recycling a page.

If page_is_pfmemalloc() returns true, they simply release the 'problematic'page
and attempt a new allocation.

( git grep -n page_is_pfmemalloc -- drivers/net/ethernet/ )


> 
>> Also consider that NIC typically have thousands of pre-allocated page/frags
>> for their RX ring buffers, they might all have pfmemalloc set, so we are speaking
>> of thousands of packet drops before the RX-ring can be refilled with normal (non pfmemalloc) page/frags.
>>
>> If we want to solve this issue more generically, we would have to try
>> to copy data into a non pfmemalloc frag instead of dropping skb that
>> had frags allocated minutes ago under memory pressure.
> 
> I don't think we need to copy anything.  We need to figure out if the
> system is still under memory pressure, and if not, we can clear the
> pfmemalloc bit on the frag, as in my second patch.  The 'least change'
> way of doing that is to try to allocate a page, but the VM could export
> a symbol that says "we're not under memory pressure any more".
> 
> Did you want to move checking that into the networking layer, or do you
> want to keep it in the pagefrag allocator?

I think your proposal is fine, thanks !
Dongli Zhang Nov. 4, 2020, 4:41 p.m. UTC | #9
On 11/4/20 4:51 AM, Eric Dumazet wrote:
> 
> 
> On 11/4/20 1:36 PM, Matthew Wilcox wrote:
>> On Wed, Nov 04, 2020 at 09:50:30AM +0100, Eric Dumazet wrote:
>>> On 11/4/20 2:16 AM, Rama Nichanamatlu wrote:
>>>>> Thanks for providing the numbers.  Do you think that dropping (up to)
>>>>> 7 packets is acceptable?
>>>>
>>>> net.ipv4.tcp_syn_retries = 6
>>>>
>>>> tcp clients wouldn't even get that far leading to connect establish issues.
>>>
>>> This does not really matter. If host was under memory pressure,
>>> dropping a few packets is really not an issue.
>>>
>>> Please do not add expensive checks in fast path, just to "not drop a packet"
>>> even if the world is collapsing.
>>
>> Right, that was my first patch -- to only recheck if we're about to
>> reuse the page.  Do you think that's acceptable, or is that still too
>> close to the fast path?
> 
> I think it is totally acceptable.
> 
> The same strategy is used in NIC drivers, before recycling a page.
> 
> If page_is_pfmemalloc() returns true, they simply release the 'problematic'page
> and attempt a new allocation.
> 
> ( git grep -n page_is_pfmemalloc -- drivers/net/ethernet/ )

While the drivers may implement their own page_frag_cache to manage skb->frags ...

... the skb->data is usually allocated via __netdev_alloc_skb() or
napi_alloc_skb(), which end up to the global this_cpu_ptr(&netdev_alloc_cache)
or this_cpu_ptr(&napi_alloc_cache.page).

> 
> 
>>
>>> Also consider that NIC typically have thousands of pre-allocated page/frags
>>> for their RX ring buffers, they might all have pfmemalloc set, so we are speaking
>>> of thousands of packet drops before the RX-ring can be refilled with normal (non pfmemalloc) page/frags.
>>>
>>> If we want to solve this issue more generically, we would have to try
>>> to copy data into a non pfmemalloc frag instead of dropping skb that
>>> had frags allocated minutes ago under memory pressure.
>>
>> I don't think we need to copy anything.  We need to figure out if the
>> system is still under memory pressure, and if not, we can clear the
>> pfmemalloc bit on the frag, as in my second patch.  The 'least change'
>> way of doing that is to try to allocate a page, but the VM could export
>> a symbol that says "we're not under memory pressure any more".
>>
>> Did you want to move checking that into the networking layer, or do you
>> want to keep it in the pagefrag allocator?
> 
> I think your proposal is fine, thanks !

Hi Matthew, are you going to send out the patch to avoid pfmemalloc recycle?

Thank you very much!

Dongli Zhang

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 23f5066bd4a5..291df2f9f8f3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5075,6 +5075,16 @@ void *page_frag_alloc(struct page_frag_cache *nc,
 	struct page *page;
 	int offset;
 
+	/*
+	 * Try to avoid re-using pfmemalloc page because kernel may already
+	 * run out of the memory pressure situation at any time.
+	 */
+	if (unlikely(nc->va && nc->pfmemalloc)) {
+		page = virt_to_page(nc->va);
+		__page_frag_cache_drain(page, nc->pagecnt_bias);
+		nc->va = NULL;
+	}
+
 	if (unlikely(!nc->va)) {
 refill:
 		page = __page_frag_cache_refill(nc, gfp_mask);