ceph: fix excessive page cache usage

Message ID 20230504082510.247-1-sehuww@mail.scut.edu.cn
State New
Series ceph: fix excessive page cache usage

Commit Message

Hu Weiwen May 4, 2023, 8:25 a.m. UTC
Currently, `ceph_netfs_expand_readahead()` tries to align the read
request with the stripe_unit, which by default is 4MB.  This means
that even small files will occupy at least 4MB of page cache, leading
to inefficient use of the page cache.

Clamp `rreq->len` to the actual file size to restore the previous page
cache usage.

Fixes: 49870056005c ("ceph: convert ceph_readpages to ceph_readahead")
Signed-off-by: Hu Weiwen <sehuww@mail.scut.edu.cn>
---

We recently updated our kernel, and we are investigating a performance
regression in our machine learning jobs.  For example, one of our jobs
repeatedly reads a dataset of 62GB spread across 100k files.  I expected
all of these IO requests to hit the page cache, since we have more than
100GB of memory available for caching.  However, a lot of network IO is
observed, and our HDD ceph cluster is fully loaded, resulting in very
bad performance.

The regression is bisected to commit
49870056005c ("ceph: convert ceph_readpages to ceph_readahead").
This commit was merged in kernel 5.13.  After this commit, we would need
about 400GB of memory to fully cache these 100k files, which is
unacceptable.

The post-EOF page cache is populated at:
(gathered by `perf record -a -e filemap:mm_filemap_add_to_page_cache -g sleep 2`)

python 3619706 [005] 3103609.736344: filemap:mm_filemap_add_to_page_cache: dev 0:62 ino 1002245af9b page=0x7daf4c pfn=0x7daf4c ofs=1048576
        ffffffff9aca933a __add_to_page_cache_locked+0x2aa ([kernel.kallsyms])
        ffffffff9aca933a __add_to_page_cache_locked+0x2aa ([kernel.kallsyms])
        ffffffff9aca945d add_to_page_cache_lru+0x4d ([kernel.kallsyms])
        ffffffff9acb66d8 readahead_expand+0x128 ([kernel.kallsyms])
        ffffffffc0e68fbc netfs_rreq_expand+0x8c ([kernel.kallsyms])
        ffffffffc0e6a6c2 netfs_readahead+0xf2 ([kernel.kallsyms])
        ffffffffc104817c ceph_readahead+0xbc ([kernel.kallsyms])
        ffffffff9acb63c5 read_pages+0x95 ([kernel.kallsyms])
        ffffffff9acb6921 page_cache_ra_unbounded+0x161 ([kernel.kallsyms])
        ffffffff9acb6a1d do_page_cache_ra+0x3d ([kernel.kallsyms])
        ffffffff9acb6b67 ondemand_readahead+0x137 ([kernel.kallsyms])
        ffffffff9acb700f page_cache_sync_ra+0xcf ([kernel.kallsyms])
        ffffffff9acab80c filemap_get_pages+0xdc ([kernel.kallsyms])
        ffffffff9acabe4e filemap_read+0xbe ([kernel.kallsyms])
        ffffffff9acac285 generic_file_read_iter+0xe5 ([kernel.kallsyms])
        ffffffffc1041b82 ceph_read_iter+0x182 ([kernel.kallsyms])
        ffffffff9ad82bf0 new_sync_read+0x110 ([kernel.kallsyms])
        ffffffff9ad83432 vfs_read+0x102 ([kernel.kallsyms])
        ffffffff9ad858d7 ksys_read+0x67 ([kernel.kallsyms])
        ffffffff9ad8597a __x64_sys_read+0x1a ([kernel.kallsyms])
        ffffffff9b76563c do_syscall_64+0x5c ([kernel.kallsyms])
        ffffffff9b800099 entry_SYSCALL_64_after_hwframe+0x61 ([kernel.kallsyms])
            7fad6ca683cc __libc_read+0x4c (/lib/x86_64-linux-gnu/libpthread-2.31.so)

The readahead is expanded too much, well past EOF for these small files.


 fs/ceph/addr.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Xiubo Li May 4, 2023, 11:11 a.m. UTC | #1
Hi Weiwen,

As discussed in the other thread, I have folded your change into my fix in V3.

Thanks

- Xiubo

Patch

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 6bb251a4d613..d508901d3739 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -197,6 +197,8 @@  static void ceph_netfs_expand_readahead(struct netfs_io_request *rreq)
 
 	/* Now, round up the length to the next block */
 	rreq->len = roundup(rreq->len, lo->stripe_unit);
+	/* But do not exceed the file size */
+	rreq->len = min(rreq->len, (size_t)(rreq->i_size - rreq->start));
 }
 
 static bool ceph_netfs_clamp_length(struct netfs_io_subrequest *subreq)