[0/7,v4] Introduce a bulk order-0 page allocator with two in-tree users

Message ID 20210312154331.32229-1-mgorman@techsingularity.net
Series Introduce a bulk order-0 page allocator with two in-tree users

Message

Mel Gorman March 12, 2021, 3:43 p.m. UTC
This series is based on top of Matthew Wilcox's series "Rationalise
__alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
test and are not using Andrew's tree as a baseline, I suggest using the
following git tree

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v4r2

Note to Chuck and Jesper -- as this is a cross-subsystem series, you may
want to send the sunrpc and page_pool pre-requisites (patches 4 and 6)
directly to the subsystem maintainers. While sunrpc is low-risk, I'm
vaguely aware that there are other prototype series on netdev that affect
page_pool. The conflict should be obvious in linux-next.

Changelog since v3
o Rebase on top of Matthew's series consolidating the alloc_pages API
o Rename alloced to allocated
o Split out preparation patch for prepare_alloc_pages
o Defensive check for bulk allocation or <= 0 pages
o Call single page allocation path only if no pages were allocated
o Minor cosmetic cleanups
o Reorder patch dependencies by subsystem. As this is a cross-subsystem
  series, the mm patches have to be merged before the sunrpc and net
  users.

Changelog since v2
o Prep new pages with IRQs enabled
o Minor documentation update

Changelog since v1
o Parenthesise binary and boolean comparisons
o Add reviewed-bys
o Rebase to 5.12-rc2

This series introduces a bulk order-0 page allocator with sunrpc and
the network page pool being the first users. The implementation is not
particularly efficient; the intention is to iron out what the semantics
of the API should be for users. Once the semantics are ironed out, it
can be made more efficient. Despite that, this is a performance-related
series for users that require multiple pages for an operation without
making multiple round-trips to the page allocator. Quoting the last
patch for the high-speed networking use-case:

    For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
    redirecting xdp_frame packets into a veth, that does XDP_PASS to
    create an SKB from the xdp_frame, which then cannot return the page
    to the page_pool. In this case, we saw[1] an improvement of 18.8%
    from using the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).

Both users in this series are corner cases (NFS and high-speed networks)
so it is unlikely that most users will see any benefit in the short
term. Potential other users are batch allocations for page cache
readahead, fault around and SLUB allocations when high-order pages are
unavailable. It's unknown how much benefit would be seen by converting
multiple page allocation calls to a single batch or what difference it may
make to headline performance. It's a chicken and egg problem given that
the potential benefit cannot be investigated without an implementation
to test against.
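
To make the intended calling pattern concrete, a hypothetical caller of
the list-based API (alloc_pages_bulk() as used by the sunrpc patch in
this series) might look like the sketch below. grab_pages() is an
invented helper for illustration only; it is not part of the series and
has not been compile-tested, and the interface may still change.

	static unsigned int grab_pages(struct page **slots, unsigned int nr)
	{
		LIST_HEAD(list);
		unsigned int filled = 0;

		while (filled < nr) {
			struct page *page;

			/* One call may return fewer pages than requested. */
			if (!alloc_pages_bulk(GFP_KERNEL, nr - filled, &list))
				break;	/* caller decides whether to sleep and retry */

			/* Pages come back linked through page->lru. */
			while ((page = list_first_entry_or_null(&list, struct page, lru))) {
				list_del(&page->lru);
				slots[filled++] = page;
			}
		}

		return filled;
	}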

Light testing passed; I'm relying on Chuck and Jesper to test the target
users more aggressively, but both reported performance improvements with
the initial RFC.

Patch 1 moves GFP flag initialisation to prepare_alloc_pages

Patch 2 renames a variable whose name is particularly unpopular

Patch 3 adds a bulk page allocator

Patch 4 is a sunrpc cleanup that is a pre-requisite.

Patch 5 is the sunrpc user. Chuck also has a patch which further caches
	pages but is not included in this series. It's not directly
	related to the bulk allocator and as it caches pages, it might
	have other concerns (e.g. does it need a shrinker?)

Patch 6 is a preparation patch only for the network user

Patch 7 converts the net page pool to the bulk allocator for order-0 pages.

 include/linux/gfp.h   |  12 ++++
 mm/page_alloc.c       | 149 +++++++++++++++++++++++++++++++++++++-----
 net/core/page_pool.c  | 101 +++++++++++++++++-----------
 net/sunrpc/svc_xprt.c |  47 +++++++++----
 4 files changed, 240 insertions(+), 69 deletions(-)

Comments

Chuck Lever March 12, 2021, 7:22 p.m. UTC | #1
Mel, I can send you a tidied and tested update to this patch,
or you can drop the two NFSD patches and I can submit them via
the NFSD tree when alloc_pages_bulk() is merged.

> On Mar 12, 2021, at 1:44 PM, Alexander Duyck <alexander.duyck@gmail.com> wrote:
> 
> On Fri, Mar 12, 2021 at 7:43 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>> 
>> From: Chuck Lever <chuck.lever@oracle.com>
>> 
>> Reduce the rate at which nfsd threads hammer on the page allocator.
>> This improves throughput scalability by enabling the threads to run
>> more independently of each other.
>> 
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>> ---
>> net/sunrpc/svc_xprt.c | 43 +++++++++++++++++++++++++++++++------------
>> 1 file changed, 31 insertions(+), 12 deletions(-)
>> 
>> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
>> index cfa7e4776d0e..38a8d6283801 100644
>> --- a/net/sunrpc/svc_xprt.c
>> +++ b/net/sunrpc/svc_xprt.c
>> @@ -642,11 +642,12 @@ static void svc_check_conn_limits(struct svc_serv *serv)
>> static int svc_alloc_arg(struct svc_rqst *rqstp)
>> {
>>        struct svc_serv *serv = rqstp->rq_server;
>> +       unsigned long needed;
>>        struct xdr_buf *arg;
>> +       struct page *page;
>>        int pages;
>>        int i;
>> 
>> -       /* now allocate needed pages.  If we get a failure, sleep briefly */
>>        pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
>>        if (pages > RPCSVC_MAXPAGES) {
>>                pr_warn_once("svc: warning: pages=%u > RPCSVC_MAXPAGES=%lu\n",
>> @@ -654,19 +655,28 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
>>                /* use as many pages as possible */
>>                pages = RPCSVC_MAXPAGES;
>>        }
>> -       for (i = 0; i < pages ; i++)
>> -               while (rqstp->rq_pages[i] == NULL) {
>> -                       struct page *p = alloc_page(GFP_KERNEL);
>> -                       if (!p) {
>> -                               set_current_state(TASK_INTERRUPTIBLE);
>> -                               if (signalled() || kthread_should_stop()) {
>> -                                       set_current_state(TASK_RUNNING);
>> -                                       return -EINTR;
>> -                               }
>> -                               schedule_timeout(msecs_to_jiffies(500));
>> +
> 
>> +       for (needed = 0, i = 0; i < pages ; i++)
>> +               if (!rqstp->rq_pages[i])
>> +                       needed++;
> 
> I would use an opening and closing braces for the for loop since
> technically the if is a multiline statement. It will make this more
> readable.
> 
>> +       if (needed) {
>> +               LIST_HEAD(list);
>> +
>> +retry:
> 
> Rather than kind of open code a while loop why not just make this
> "while (needed)"? Then all you have to do is break out of the for loop
> and you will automatically return here instead of having to jump to
> two different labels.
> 
>> +               alloc_pages_bulk(GFP_KERNEL, needed, &list);
> 
> Rather than not using the return value would it make sense here to
> perhaps subtract it from needed? Then you would know if any of the
> allocation requests weren't fulfilled.
> 
>> +               for (i = 0; i < pages; i++) {
> 
> It is probably optimizing for the exception case, but I don't think
> you want the "i = 0" here. If you are having to stop because the list
> is empty it probably makes sense to resume where you left off. So you
> should probably be initializing i to 0 before we check for needed.
> 
>> +                       if (!rqstp->rq_pages[i]) {
> 
> It might be cleaner here to just do a "continue" if rq_pages[i] is populated.
> 
>> +                               page = list_first_entry_or_null(&list,
>> +                                                               struct page,
>> +                                                               lru);
>> +                               if (unlikely(!page))
>> +                                       goto empty_list;
> 
> I think I preferred the original code that wasn't jumping away from
> the loop here. With the change I suggested above that would switch the
> if(needed) to while(needed) you could have it just break out of the
> for loop to place itself back in the while loop.
> 
>> +                               list_del(&page->lru);
>> +                               rqstp->rq_pages[i] = page;
>> +                               needed--;
>>                        }
>> -                       rqstp->rq_pages[i] = p;
>>                }
>> +       }
>>        rqstp->rq_page_end = &rqstp->rq_pages[pages];
>>        rqstp->rq_pages[pages] = NULL; /* this might be seen in nfsd_splice_actor() */
>> 
>> @@ -681,6 +691,15 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
>>        arg->len = (pages-1)*PAGE_SIZE;
>>        arg->tail[0].iov_len = 0;
>>        return 0;
>> +
>> +empty_list:
>> +       set_current_state(TASK_INTERRUPTIBLE);
>> +       if (signalled() || kthread_should_stop()) {
>> +               set_current_state(TASK_RUNNING);
>> +               return -EINTR;
>> +       }
>> +       schedule_timeout(msecs_to_jiffies(500));
>> +       goto retry;
>> }
>> 
>> static bool
>> --
>> 2.26.2

--
Chuck Lever
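
For reference, a sketch of svc_alloc_arg()'s allocation loop with the
review suggestions above folded in: while (needed) instead of the retry
label, i preserved across refill rounds, and continue for slots that are
already populated. This is illustrative only, not compile-tested, and not
the version that was eventually posted:

	for (needed = 0, i = 0; i < pages; i++) {
		if (!rqstp->rq_pages[i])
			needed++;
	}

	if (needed) {
		LIST_HEAD(list);

		i = 0;
		while (needed) {
			alloc_pages_bulk(GFP_KERNEL, needed, &list);
			for (; i < pages; i++) {
				if (rqstp->rq_pages[i])
					continue;
				page = list_first_entry_or_null(&list,
								struct page,
								lru);
				if (unlikely(!page))
					break;	/* list exhausted, refill */
				list_del(&page->lru);
				rqstp->rq_pages[i] = page;
				needed--;
			}
			if (!needed)
				break;

			/* Allocation ran dry; sleep briefly before retrying. */
			set_current_state(TASK_INTERRUPTIBLE);
			if (signalled() || kthread_should_stop()) {
				set_current_state(TASK_RUNNING);
				return -EINTR;
			}
			schedule_timeout(msecs_to_jiffies(500));
		}
	}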
Mel Gorman March 13, 2021, 12:59 p.m. UTC | #2
On Fri, Mar 12, 2021 at 07:22:43PM +0000, Chuck Lever III wrote:
> Mel, I can send you a tidied and tested update to this patch,
> or you can drop the two NFSD patches and I can submit them via
> the NFSD tree when alloc_pages_bulk() is merged.

Send me a tidied version anyway. I'm happy enough to include them in the
series even if it ultimately gets merged via the NFSD tree. It'll need
to be kept as a separate pull request to avoid delaying unrelated NFSD
patches until Andrew merges the mm parts.

-- 
Mel Gorman
SUSE Labs
Alexander Lobakin March 17, 2021, 4:31 p.m. UTC | #3
From: Mel Gorman <mgorman@techsingularity.net>
Date: Fri, 12 Mar 2021 15:43:24 +0000

Hi there,

> This series is based on top of Matthew Wilcox's series "Rationalise
> __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> test and are not using Andrew's tree as a baseline, I suggest using the
> following git tree
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v4r2

I gave this series a go on my setup, it showed a bump of 10 Mbps on
UDP forwarding, but dropped TCP forwarding by almost 50 Mbps.

(4 core 1.2GHz MIPS32 R2, page size of 16 Kb, Page Pool order-0
allocations with MTU of 1508 bytes, linear frames via build_skb(),
GRO + TSO/USO)

I didn't have time to drill into the code, so for now can't provide
any additional details. You can request anything you need though and
I'll try to find a window to collect it.

Thanks,
Al
Jesper Dangaard Brouer March 17, 2021, 4:38 p.m. UTC | #4
On Wed, 17 Mar 2021 16:31:07 +0000
Alexander Lobakin <alobakin@pm.me> wrote:

> From: Mel Gorman <mgorman@techsingularity.net>
> Date: Fri, 12 Mar 2021 15:43:24 +0000
>
> Hi there,
>
> > This series is based on top of Matthew Wilcox's series "Rationalise
> > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > test and are not using Andrew's tree as a baseline, I suggest using the
> > following git tree
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v4r2
>
> I gave this series a go on my setup, it showed a bump of 10 Mbps on
> UDP forwarding, but dropped TCP forwarding by almost 50 Mbps.
>
> (4 core 1.2GHz MIPS32 R2, page size of 16 Kb, Page Pool order-0
> allocations with MTU of 1508 bytes, linear frames via build_skb(),
> GRO + TSO/USO)

What NIC driver is this?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Alexander Lobakin March 17, 2021, 4:52 p.m. UTC | #5
From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Wed, 17 Mar 2021 17:38:44 +0100

> On Wed, 17 Mar 2021 16:31:07 +0000
> Alexander Lobakin <alobakin@pm.me> wrote:
>
> > From: Mel Gorman <mgorman@techsingularity.net>
> > Date: Fri, 12 Mar 2021 15:43:24 +0000
> >
> > Hi there,
> >
> > > This series is based on top of Matthew Wilcox's series "Rationalise
> > > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > > test and are not using Andrew's tree as a baseline, I suggest using the
> > > following git tree
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v4r2
> >
> > I gave this series a go on my setup, it showed a bump of 10 Mbps on
> > UDP forwarding, but dropped TCP forwarding by almost 50 Mbps.
> >
> > (4 core 1.2GHz MIPS32 R2, page size of 16 Kb, Page Pool order-0
> > allocations with MTU of 1508 bytes, linear frames via build_skb(),
> > GRO + TSO/USO)
>
> What NIC driver is this?

Ah, forgot to mention. It's a WIP driver, not yet mainlined.
The NIC itself is basically on-SoC 1G chip.

Al
Jesper Dangaard Brouer March 17, 2021, 5:19 p.m. UTC | #6
On Wed, 17 Mar 2021 16:52:32 +0000
Alexander Lobakin <alobakin@pm.me> wrote:

> From: Jesper Dangaard Brouer <brouer@redhat.com>
> Date: Wed, 17 Mar 2021 17:38:44 +0100
>
> > On Wed, 17 Mar 2021 16:31:07 +0000
> > Alexander Lobakin <alobakin@pm.me> wrote:
> >
> > > From: Mel Gorman <mgorman@techsingularity.net>
> > > Date: Fri, 12 Mar 2021 15:43:24 +0000
> > >
> > > Hi there,
> > >
> > > > This series is based on top of Matthew Wilcox's series "Rationalise
> > > > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > > > test and are not using Andrew's tree as a baseline, I suggest using the
> > > > following git tree
> > > >
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v4r2
> > >
> > > I gave this series a go on my setup, it showed a bump of 10 Mbps on
> > > UDP forwarding, but dropped TCP forwarding by almost 50 Mbps.
> > >
> > > (4 core 1.2GHz MIPS32 R2, page size of 16 Kb, Page Pool order-0
> > > allocations with MTU of 1508 bytes, linear frames via build_skb(),
> > > GRO + TSO/USO)
> >
> > What NIC driver is this?
>
> Ah, forgot to mention. It's a WIP driver, not yet mainlined.
> The NIC itself is basically on-SoC 1G chip.

Hmm, then it is really hard to check if your driver is doing something
else that could cause this.

Well, can you try to lower the page_pool bulking size, to test the
theory from Wilcox that we should do smaller bulking to avoid pushing
cachelines into L2 when walking the LRU list.  You might have to go as
low as bulk=8 (for N-way associative level of L1 cache).

In function: __page_pool_alloc_pages_slow() adjust variable:
  const int bulk = PP_ALLOC_CACHE_REFILL;
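
For the experiment that would just mean hard-coding a smaller value there,
e.g. (illustrative only; 8 is a guess based on the L1 associativity, not a
recommendation):

  const int bulk = 8;	/* was PP_ALLOC_CACHE_REFILL */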


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Alexander Lobakin March 17, 2021, 10:25 p.m. UTC | #7
From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Wed, 17 Mar 2021 18:19:43 +0100

> On Wed, 17 Mar 2021 16:52:32 +0000
> Alexander Lobakin <alobakin@pm.me> wrote:
>
> > From: Jesper Dangaard Brouer <brouer@redhat.com>
> > Date: Wed, 17 Mar 2021 17:38:44 +0100
> >
> > > On Wed, 17 Mar 2021 16:31:07 +0000
> > > Alexander Lobakin <alobakin@pm.me> wrote:
> > >
> > > > From: Mel Gorman <mgorman@techsingularity.net>
> > > > Date: Fri, 12 Mar 2021 15:43:24 +0000
> > > >
> > > > Hi there,
> > > >
> > > > > This series is based on top of Matthew Wilcox's series "Rationalise
> > > > > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > > > > test and are not using Andrew's tree as a baseline, I suggest using the
> > > > > following git tree
> > > > >
> > > > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v4r2
> > > >
> > > > I gave this series a go on my setup, it showed a bump of 10 Mbps on
> > > > UDP forwarding, but dropped TCP forwarding by almost 50 Mbps.
> > > >
> > > > (4 core 1.2GHz MIPS32 R2, page size of 16 Kb, Page Pool order-0
> > > > allocations with MTU of 1508 bytes, linear frames via build_skb(),
> > > > GRO + TSO/USO)
> > >
> > > What NIC driver is this?
> >
> > Ah, forgot to mention. It's a WIP driver, not yet mainlined.
> > The NIC itself is basically on-SoC 1G chip.
>
> Hmm, then it is really hard to check if your driver is doing something
> else that could cause this.
>
> Well, can you try to lower the page_pool bulking size, to test the
> theory from Wilcox that we should do smaller bulking to avoid pushing
> cachelines into L2 when walking the LRU list.  You might have to go as
> low as bulk=8 (for N-way associative level of L1 cache).

Turned out it suffered from GCC's decisions.
All of the following was taken on GCC 10.2.0 with -O2 in dotconfig.

vmlinux differences between baseline and this series:

(I used your followup instead of the last patch from the tree)

Function                                     old     new   delta
__rmqueue_pcplist                              -    2024   +2024
__alloc_pages_bulk                             -    1456   +1456
__page_pool_alloc_pages_slow                 284     600    +316
page_pool_dma_map                              -     164    +164
get_page_from_freelist                      5676    3760   -1916

The uninlining of __rmqueue_pcplist() hurts a lot. It slightly slows
down the "regular" page allocator, but makes __alloc_pages_bulk()
much slower than per-page allocation (in my case at least) because the
function is now called out of line from inside the loop.

One possible solution is to mark __rmqueue_pcplist() and
rmqueue_bulk() as __always_inline. Only both and only with
__always_inline, or GCC will emit rmqueue_bulk.constprop and
make the numbers even poorer.
This nearly doubles the size of bulk allocator, but eliminates
all performance hits.

Function                                     old     new   delta
__alloc_pages_bulk                          1456    3512   +2056
get_page_from_freelist                      3760    5744   +1984
find_suitable_fallback.part                    -     160    +160
min_free_kbytes_sysctl_handler                96     128     +32
find_suitable_fallback                       164      28    -136
__rmqueue_pcplist                           2024       -   -2024

Between baseline and this series with __always_inline hints:

Function                                     old     new   delta
__alloc_pages_bulk                             -    3512   +3512
find_suitable_fallback.part                    -     160    +160
get_page_from_freelist                      5676    5744     +68
min_free_kbytes_sysctl_handler                96     128     +32
find_suitable_fallback                       164      28    -136

Another suboptimal place I've found is two functions in Page Pool
code which are marked as 'noinline'.
Maybe there's a reason behind this, but removing the annotations
and additionally marking page_pool_dma_map() as inline simplifies
the object code and in fact improves the performance (+15 Mbps on
my setup):

add/remove: 0/3 grow/shrink: 1/0 up/down: 1024/-1096 (-72)
Function                                     old     new   delta
page_pool_alloc_pages                        100    1124   +1024
page_pool_dma_map                            164       -    -164
page_pool_refill_alloc_cache                 332       -    -332
__page_pool_alloc_pages_slow                 600       -    -600

1124 is a normal size for a hotpath function.
These fragmentation and jumps between page_pool_alloc_pages(),
__page_pool_alloc_pages_slow() and page_pool_refill_alloc_cache()
are really excessive and unhealthy for performance, as well as
page_pool_dma_map() uninlined by GCC.

So the best results I got so far were with these additional changes:
 - mark __rmqueue_pcplist() as __always_inline;
 - mark rmqueue_bulk() as __always_inline;
 - drop 'noinline' from page_pool_refill_alloc_cache();
 - drop 'noinline' from __page_pool_alloc_pages_slow();
 - mark page_pool_dma_map() as inline.

(inlines in C files aren't generally recommended, but well, GCC
 is far from perfect)
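
As a standalone illustration of what those attributes do (a userspace
analogue only, mirroring the kernel's definitions of __always_inline and
noinline; this is not the actual mm/page_alloc.c or page_pool change):

	#include <stdio.h>

	#define __always_inline inline __attribute__((__always_inline__))
	#define noinline __attribute__((__noinline__))

	/* Always inlined, even when the optimiser would otherwise decline. */
	static __always_inline int add_fast(int a, int b) { return a + b; }

	/* Never inlined, even at -O2 and with a trivial body. */
	static noinline int add_slow(int a, int b) { return a + b; }

	int main(void)
	{
		printf("%d %d\n", add_fast(1, 2), add_slow(3, 4));
		return 0;
	}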

> In function: __page_pool_alloc_pages_slow() adjust variable:
>   const int bulk = PP_ALLOC_CACHE_REFILL;

Regarding bulk size, it makes no sense on my machine. I tried
{ 8, 16, 32, 64 } and they differed by 1-2 Mbps max / standard
deviation.
Most of the bulk operations I've seen usually take the value of
16 as a "golden ratio" though.

Thanks,
Al
Vlastimil Babka March 19, 2021, 4:11 p.m. UTC | #8
On 3/12/21 4:43 PM, Mel Gorman wrote:
> __alloc_pages updates GFP flags to enforce what flags are allowed
> during a global context such as booting or suspend. This patch moves the
> enforcement from __alloc_pages to prepare_alloc_pages so the code can be
> shared between the single page allocator and a new bulk page allocator.
>
> When moving, it is obvious that __alloc_pages() and __alloc_pages
> use different names for the same variable. This is an unnecessary
> complication so rename gfp_mask to gfp in prepare_alloc_pages() so the
> name is consistent.
>
> No functional change.

Hm, I have some doubts.

> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  mm/page_alloc.c | 25 +++++++++++++------------
>  1 file changed, 13 insertions(+), 12 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 00b67c47ad87..f0c1d74ead6f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4914,15 +4914,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	return page;
>  }
>  
> -static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
> +static inline bool prepare_alloc_pages(gfp_t gfp, unsigned int order,
>  		int preferred_nid, nodemask_t *nodemask,
>  		struct alloc_context *ac, gfp_t *alloc_gfp,
>  		unsigned int *alloc_flags)
>  {
> -	ac->highest_zoneidx = gfp_zone(gfp_mask);
> -	ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> +	gfp &= gfp_allowed_mask;
> +	*alloc_gfp = gfp;
> +

...

> @@ -4980,8 +4983,6 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
>  		return NULL;
>  	}
>  
> -	gfp &= gfp_allowed_mask;
> -	alloc_gfp = gfp;
>  	if (!prepare_alloc_pages(gfp, order, preferred_nid, nodemask, &ac,
>  			&alloc_gfp, &alloc_flags))
>  		return NULL;

As a result, "gfp" doesn't have the restrictions by gfp_allowed_mask applied,
only alloc_gfp does. But in case we go to slowpath, before
going there we throw away the current alloc_gfp:

    alloc_gfp = current_gfp_context(gfp);
    ...
    page = __alloc_pages_slowpath(alloc_gfp, ...);

So we lost the gfp_allowed_mask restrictions here?
Vlastimil Babka March 19, 2021, 4:22 p.m. UTC | #9
On 3/12/21 4:43 PM, Mel Gorman wrote:
> Review feedback of the bulk allocator twice found problems with "alloced"
> being a counter for pages allocated. The naming was based on the API name
> "alloc" and was based on the idea that verbal communication about malloc
> tends to use the fake word "malloced" instead of the fake word mallocated.
> To be consistent, this preparation patch renames alloced to allocated
> in rmqueue_bulk so the bulk allocator and per-cpu allocator use similar
> names when the bulk allocator is introduced.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/page_alloc.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f0c1d74ead6f..880b1d6368bd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2904,7 +2904,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  			unsigned long count, struct list_head *list,
>  			int migratetype, unsigned int alloc_flags)
>  {
> -	int i, alloced = 0;
> +	int i, allocated = 0;
>  
>  	spin_lock(&zone->lock);
>  	for (i = 0; i < count; ++i) {
> @@ -2927,7 +2927,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  		 * pages are ordered properly.
>  		 */
>  		list_add_tail(&page->lru, list);
> -		alloced++;
> +		allocated++;
>  		if (is_migrate_cma(get_pcppage_migratetype(page)))
>  			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
>  					      -(1 << order));
> @@ -2936,12 +2936,12 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  	/*
>  	 * i pages were removed from the buddy list even if some leak due
>  	 * to check_pcp_refill failing so adjust NR_FREE_PAGES based
> -	 * on i. Do not confuse with 'alloced' which is the number of
> +	 * on i. Do not confuse with 'allocated' which is the number of
>  	 * pages added to the pcp list.
>  	 */
>  	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
>  	spin_unlock(&zone->lock);
> -	return alloced;
> +	return allocated;
>  }
>  
>  #ifdef CONFIG_NUMA
>
Mel Gorman March 19, 2021, 5:49 p.m. UTC | #10
On Fri, Mar 19, 2021 at 05:11:39PM +0100, Vlastimil Babka wrote:
> On 3/12/21 4:43 PM, Mel Gorman wrote:
> > __alloc_pages updates GFP flags to enforce what flags are allowed
> > during a global context such as booting or suspend. This patch moves the
> > enforcement from __alloc_pages to prepare_alloc_pages so the code can be
> > shared between the single page allocator and a new bulk page allocator.
> >
> > When moving, it is obvious that __alloc_pages() and __alloc_pages
> > use different names for the same variable. This is an unnecessary
> > complication so rename gfp_mask to gfp in prepare_alloc_pages() so the
> > name is consistent.
> >
> > No functional change.
>
> Hm, I have some doubts.

And you were right, I'll drop the patch and apply the same mask to the
bulk allocator.
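
A sketch of what that could look like in __alloc_pages_bulk() (untested,
simply masking gfp before prepare_alloc_pages() the way __alloc_pages()
did before this patch):

	gfp &= gfp_allowed_mask;
	alloc_gfp = gfp;
	if (!prepare_alloc_pages(gfp, 0, preferred_nid, nodemask, &ac,
				 &alloc_gfp, &alloc_flags))
		return 0;
	gfp = alloc_gfp;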

-- 
Mel Gorman
SUSE Labs
Vlastimil Babka March 19, 2021, 6:18 p.m. UTC | #11
On 3/12/21 4:43 PM, Mel Gorman wrote:
> This patch adds a new page allocator interface via alloc_pages_bulk,
> and __alloc_pages_bulk_nodemask. A caller requests a number of pages
> to be allocated and added to a list. They can be freed in bulk using
> free_pages_bulk().
>
> The API is not guaranteed to return the requested number of pages and
> may fail if the preferred allocation zone has limited free memory, the
> cpuset changes during the allocation or page debugging decides to fail
> an allocation. It's up to the caller to request more pages in batch
> if necessary.
>
> Note that this implementation is not very efficient and could be improved
> but it would require refactoring. The intent is to make it available early
> to determine what semantics are required by different callers. Once the
> full semantics are nailed down, it can be refactored.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Although maybe premature, if it changes significantly due to the users'
performance feedback, let's see :)

Some nits below:

...

> @@ -4963,6 +4978,107 @@ static inline bool prepare_alloc_pages(gfp_t gfp, unsigned int order,
>  	return true;
>  }
>  
> +/*
> + * This is a batched version of the page allocator that attempts to
> + * allocate nr_pages quickly from the preferred zone and add them to list.
> + *
> + * Returns the number of pages allocated.
> + */
> +int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> +			nodemask_t *nodemask, int nr_pages,
> +			struct list_head *alloc_list)
> +{
> +	struct page *page;
> +	unsigned long flags;
> +	struct zone *zone;
> +	struct zoneref *z;
> +	struct per_cpu_pages *pcp;
> +	struct list_head *pcp_list;
> +	struct alloc_context ac;
> +	gfp_t alloc_gfp;
> +	unsigned int alloc_flags;
> +	int allocated = 0;
> +
> +	if (WARN_ON_ONCE(nr_pages <= 0))
> +		return 0;
> +
> +	if (nr_pages == 1)
> +		goto failed;
> +
> +	/* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
> +	if (!prepare_alloc_pages(gfp, 0, preferred_nid, nodemask, &ac,
> +	&alloc_gfp, &alloc_flags))

Unusual indentation here.

> +		return 0;
> +	gfp = alloc_gfp;
> +
> +	/* Find an allowed local zone that meets the high watermark. */
> +	for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, ac.highest_zoneidx, ac.nodemask) {
> +		unsigned long mark;
> +
> +		if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
> +		    !__cpuset_zone_allowed(zone, gfp)) {
> +			continue;
> +		}
> +
> +		if (nr_online_nodes > 1 && zone != ac.preferred_zoneref->zone &&
> +		    zone_to_nid(zone) != zone_to_nid(ac.preferred_zoneref->zone)) {
> +			goto failed;
> +		}
> +
> +		mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + nr_pages;
> +		if (zone_watermark_fast(zone, 0,  mark,
> +				zonelist_zone_idx(ac.preferred_zoneref),
> +				alloc_flags, gfp)) {
> +			break;
> +		}
> +	}
> +	if (!zone)
> +		return 0;

Why not also "goto failed;" here?
Mel Gorman March 22, 2021, 8:30 a.m. UTC | #12
On Fri, Mar 19, 2021 at 07:18:32PM +0100, Vlastimil Babka wrote:
> On 3/12/21 4:43 PM, Mel Gorman wrote:
> > This patch adds a new page allocator interface via alloc_pages_bulk,
> > and __alloc_pages_bulk_nodemask. A caller requests a number of pages
> > to be allocated and added to a list. They can be freed in bulk using
> > free_pages_bulk().
> >
> > The API is not guaranteed to return the requested number of pages and
> > may fail if the preferred allocation zone has limited free memory, the
> > cpuset changes during the allocation or page debugging decides to fail
> > an allocation. It's up to the caller to request more pages in batch
> > if necessary.
> >
> > Note that this implementation is not very efficient and could be improved
> > but it would require refactoring. The intent is to make it available early
> > to determine what semantics are required by different callers. Once the
> > full semantics are nailed down, it can be refactored.
> >
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>
> Although maybe premature, if it changes significantly due to the users'
> performance feedback, let's see :)

Indeed. The next version will have no users so that Jesper and Chuck
can check if an array-based or LRU based version is better. There were
also bugs such as broken accounting of stats that had to be fixed which
increases overhead.

> Some nits below:
>
> ...
>
> > @@ -4963,6 +4978,107 @@ static inline bool prepare_alloc_pages(gfp_t gfp, unsigned int order,
> >  	return true;
> >  }
> >  
> > +/*
> > + * This is a batched version of the page allocator that attempts to
> > + * allocate nr_pages quickly from the preferred zone and add them to list.
> > + *
> > + * Returns the number of pages allocated.
> > + */
> > +int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> > +			nodemask_t *nodemask, int nr_pages,
> > +			struct list_head *alloc_list)
> > +{
> > +	struct page *page;
> > +	unsigned long flags;
> > +	struct zone *zone;
> > +	struct zoneref *z;
> > +	struct per_cpu_pages *pcp;
> > +	struct list_head *pcp_list;
> > +	struct alloc_context ac;
> > +	gfp_t alloc_gfp;
> > +	unsigned int alloc_flags;
> > +	int allocated = 0;
> > +
> > +	if (WARN_ON_ONCE(nr_pages <= 0))
> > +		return 0;
> > +
> > +	if (nr_pages == 1)
> > +		goto failed;
> > +
> > +	/* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
> > +	if (!prepare_alloc_pages(gfp, 0, preferred_nid, nodemask, &ac,
> > +	&alloc_gfp, &alloc_flags))
>
> Unusual indentation here.

Fixed

> > +		return 0;
> > +	gfp = alloc_gfp;
> > +
> > +	/* Find an allowed local zone that meets the high watermark. */
> > +	for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, ac.highest_zoneidx, ac.nodemask) {
> > +		unsigned long mark;
> > +
> > +		if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
> > +		    !__cpuset_zone_allowed(zone, gfp)) {
> > +			continue;
> > +		}
> > +
> > +		if (nr_online_nodes > 1 && zone != ac.preferred_zoneref->zone &&
> > +		    zone_to_nid(zone) != zone_to_nid(ac.preferred_zoneref->zone)) {
> > +			goto failed;
> > +		}
> > +
> > +		mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + nr_pages;
> > +		if (zone_watermark_fast(zone, 0,  mark,
> > +				zonelist_zone_idx(ac.preferred_zoneref),
> > +				alloc_flags, gfp)) {
> > +			break;
> > +		}
> > +	}
> > +	if (!zone)
> > +		return 0;
>
> Why not also "goto failed;" here?

Good question. When first written, it was because the zone search for the
normal allocator was almost certainly going to fail to find a zone and
it was expected that callers would prefer to fail fast over blocking.
Now we know that sunrpc can sleep on a failing allocation and it would
be better to enter the single page allocator and reclaim pages instead of
"sleep and hope for the best".

-- 
Mel Gorman
SUSE Labs