diff mbox series

[v2] ceph: Fix kernel crash in generic/397 test

Message ID 20250117035044.23309-1-slava@dubeyko.com
State New
Headers show
Series [v2] ceph: Fix kernel crash in generic/397 test | expand

Commit Message

Viacheslav Dubeyko Jan. 17, 2025, 3:50 a.m. UTC
From: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>

The generic/397 test generate kernel crash for the case of
encrypted inode with unaligned file size (for example, 33K
or 1K):

Jan 3 12:34:40 ceph-testing-0001 root: run xfstest generic/397
Jan 3 12:34:40 ceph-testing-0001 kernel: [ 877.737811] run fstests generic/397 at 2025-01-03 12:34:40
Jan 3 12:34:40 ceph-testing-0001 systemd1: Started /usr/bin/bash c test -w /proc/self/oom_score_adj && echo 250 > /proc/self/oom_score_adj; exec ./tests/generic/397.
Jan 3 12:34:40 ceph-testing-0001 kernel: [ 877.875761] libceph: mon0 (2)127.0.0.1:40674 session established
Jan 3 12:34:40 ceph-testing-0001 kernel: [ 877.876130] libceph: client4614 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949
Jan 3 12:34:40 ceph-testing-0001 kernel: [ 877.991965] libceph: mon0 (2)127.0.0.1:40674 session established
Jan 3 12:34:40 ceph-testing-0001 kernel: [ 877.992334] libceph: client4617 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949
Jan 3 12:34:40 ceph-testing-0001 kernel: [ 878.017234] libceph: mon0 (2)127.0.0.1:40674 session established
Jan 3 12:34:40 ceph-testing-0001 kernel: [ 878.017594] libceph: client4620 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949
Jan 3 12:34:40 ceph-testing-0001 kernel: [ 878.031394] xfs_io (pid 18988) is setting deprecated v1 encryption policy; recommend upgrading to v2.
Jan 3 12:34:40 ceph-testing-0001 kernel: [ 878.054528] libceph: mon0 (2)127.0.0.1:40674 session established
Jan 3 12:34:40 ceph-testing-0001 kernel: [ 878.054892] libceph: client4623 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949
Jan 3 12:34:40 ceph-testing-0001 kernel: [ 878.070287] libceph: mon0 (2)127.0.0.1:40674 session established
Jan 3 12:34:40 ceph-testing-0001 kernel: [ 878.070704] libceph: client4626 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.264586] libceph: mon0 (2)127.0.0.1:40674 session established
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.265258] libceph: client4629 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.374578] -----------[ cut here ]------------
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.374586] kernel BUG at net/ceph/messenger.c:1070!
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.375150] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.378145] CPU: 2 UID: 0 PID: 4759 Comm: kworker/2:9 Not tainted 6.13.0-rc5+ #1
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.378969] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.380167] Workqueue: ceph-msgr ceph_con_workfn
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.381639] RIP: 0010:ceph_msg_data_cursor_init+0x42/0x50
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.382152] Code: 89 17 48 8b 46 70 55 48 89 47 08 c7 47 18 00 00 00 00 48 89 e5 e8 de cc ff ff 5d 31 c0 31 d2 31 f6 31 ff c3 cc cc cc cc 0f 0b <0f> 0b 0f 0b 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.383928] RSP: 0018:ffffb4ffc7cbbd28 EFLAGS: 00010287
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.384447] RAX: ffffffff82bb9ac0 RBX: ffff981390c2f1f8 RCX: 0000000000000000
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.385129] RDX: 0000000000009000 RSI: ffff981288232b58 RDI: ffff981390c2f378
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.385839] RBP: ffffb4ffc7cbbe18 R08: 0000000000000000 R09: 0000000000000000
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.386539] R10: 0000000000000000 R11: 0000000000000000 R12: ffff981390c2f030
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.387203] R13: ffff981288232b58 R14: 0000000000000029 R15: 0000000000000001
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.387877] FS: 0000000000000000(0000) GS:ffff9814b7900000(0000) knlGS:0000000000000000
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.388663] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.389212] CR2: 00005e106a0554e0 CR3: 0000000112bf0001 CR4: 0000000000772ef0
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.389921] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.390620] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.391307] PKRU: 55555554
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.391567] Call Trace:
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.391807] <TASK>
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.392021] ? show_regs+0x71/0x90
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.392391] ? die+0x38/0xa0
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.392667] ? do_trap+0xdb/0x100
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.392981] ? do_error_trap+0x75/0xb0
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.393372] ? ceph_msg_data_cursor_init+0x42/0x50
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.393842] ? exc_invalid_op+0x53/0x80
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.394232] ? ceph_msg_data_cursor_init+0x42/0x50
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.394694] ? asm_exc_invalid_op+0x1b/0x20
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.395099] ? ceph_msg_data_cursor_init+0x42/0x50
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.395583] ? ceph_con_v2_try_read+0xd16/0x2220
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.396027] ? _raw_spin_unlock+0xe/0x40
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.396428] ? raw_spin_rq_unlock+0x10/0x40
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.396842] ? finish_task_switch.isra.0+0x97/0x310
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.397338] ? __schedule+0x44b/0x16b0
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.397738] ceph_con_workfn+0x326/0x750
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.398121] process_one_work+0x188/0x3d0
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.398522] ? __pfx_worker_thread+0x10/0x10
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.398929] worker_thread+0x2b5/0x3c0
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.399310] ? __pfx_worker_thread+0x10/0x10
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.399727] kthread+0xe1/0x120
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.400031] ? __pfx_kthread+0x10/0x10
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.400431] ret_from_fork+0x43/0x70
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.400771] ? __pfx_kthread+0x10/0x10
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.401127] ret_from_fork_asm+0x1a/0x30
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.401543] </TASK>
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.401760] Modules linked in: hctr2 nhpoly1305_avx2 nhpoly1305_sse2 nhpoly1305 chacha_generic chacha_x86_64 libchacha adiantum libpoly1305 essiv authenc mptcp_diag xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag intel_rapl_msr intel_rapl_common intel_uncore_frequency_common skx_edac_common nfit kvm_intel kvm crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel joydev crypto_simd cryptd rapl input_leds psmouse sch_fq_codel serio_raw bochs i2c_piix4 floppy qemu_fw_cfg i2c_smbus mac_hid pata_acpi msr parport_pc ppdev lp parport efi_pstore ip_tables x_tables
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.407319] ---[ end trace 0000000000000000 ]---
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.407775] RIP: 0010:ceph_msg_data_cursor_init+0x42/0x50
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.408317] Code: 89 17 48 8b 46 70 55 48 89 47 08 c7 47 18 00 00 00 00 48 89 e5 e8 de cc ff ff 5d 31 c0 31 d2 31 f6 31 ff c3 cc cc cc cc 0f 0b <0f> 0b 0f 0b 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.410087] RSP: 0018:ffffb4ffc7cbbd28 EFLAGS: 00010287
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.410609] RAX: ffffffff82bb9ac0 RBX: ffff981390c2f1f8 RCX: 0000000000000000
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.411318] RDX: 0000000000009000 RSI: ffff981288232b58 RDI: ffff981390c2f378
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.412014] RBP: ffffb4ffc7cbbe18 R08: 0000000000000000 R09: 0000000000000000
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.412735] R10: 0000000000000000 R11: 0000000000000000 R12: ffff981390c2f030
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.413438] R13: ffff981288232b58 R14: 0000000000000029 R15: 0000000000000001
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.414121] FS: 0000000000000000(0000) GS:ffff9814b7900000(0000) knlGS:0000000000000000
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.414935] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.415516] CR2: 00005e106a0554e0 CR3: 0000000112bf0001 CR4: 0000000000772ef0
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.416211] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.416907] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jan 3 12:34:41 ceph-testing-0001 kernel: [ 878.417630] PKRU: 55555554

BUG_ON(length > msg->data_length) triggers the issue:

(gdb) l *ceph_msg_data_cursor_init+0x42
0xffffffff823b45a2 is in ceph_msg_data_cursor_init (net/ceph/messenger.c:1070).
1065
1066 void ceph_msg_data_cursor_init(struct ceph_msg_data_cursor *cursor,
1067 struct ceph_msg *msg, size_t length)
1068 {
1069 BUG_ON(!length);
1070 BUG_ON(length > msg->data_length);
1071 BUG_ON(!msg->num_data_items);
1072
1073 cursor->total_resid = length;
1074 cursor->data = msg->data;

The issue takes place because of this:
Jan 6 14:59:24 ceph-testing-0001 kernel: [ 202.628853] libceph: pid 144:net/ceph/messenger_v2.c:2034 prepare_sparse_read_data(): msg->data_length 33792, msg->sparse_read_total 36864
1070 BUG_ON(length > msg->data_length);
msg->sparse_read_total 36864 > msg->data_length 33792

The generic/397 test (xfstests) executes such steps:
(1) create encrypted files and directories;
(2) access the created files and folders with encryption key;
(3) access the created files and folders without encryption key.

The issue takes place in this portion of code:

    if (IS_ENCRYPTED(inode)) {
            struct page **pages;
            size_t page_off;

            err = iov_iter_get_pages_alloc2(&subreq->io_iter, &pages, len,
                                            &page_off);
            if (err < 0) {
                    doutc(cl, "%llx.%llx failed to allocate pages, %d\n",
                          ceph_vinop(inode), err);
                    goto out;
            }

            /* should always give us a page-aligned read */
            WARN_ON_ONCE(page_off);
            len = err;
            err = 0;

            osd_req_op_extent_osd_data_pages(req, 0, pages, len, 0, false,
                                             false);

The reason of the issue is that subreq->io_iter.count keeps
unaligned value of length:

Jan 16 12:46:56 ceph-testing-0001 kernel: [  347.751182] pid 8059:lib/iov_iter.c:1185 __iov_iter_get_pages_alloc(): maxsize 36864, maxpages 4294967295, start 18446659367320516064
Jan 16 12:46:56 ceph-testing-0001 kernel: [  347.752808] pid 8059:lib/iov_iter.c:1196 __iov_iter_get_pages_alloc(): maxsize 33792, maxpages 4294967295, start 18446659367320516064
Jan 16 12:46:56 ceph-testing-0001 kernel: [  347.754394] pid 8059:lib/iov_iter.c:1015 iter_folioq_get_pages(): maxsize 33792, maxpages 4294967295, extracted 0, _start_offset 18446659367320516064

This patch simply assigns the aligned value to
subreq->io_iter.count before calling iov_iter_get_pages_alloc2().

./check generic/397
FSTYP         -- ceph
PLATFORM      -- Linux/x86_64 ceph-testing-0001 6.13.0-rc7+ #58 SMP PREEMPT_DYNAMIC Wed Jan 15 00:07:06 UTC 2025
MKFS_OPTIONS  -- 127.0.0.1:40629:/scratch
MOUNT_OPTIONS -- -o name=fs,secret=<hidden>,ms_mode=crc,nowsync,copyfrom 127.0.0.1:<port>:/scratch /mnt/scratch

generic/397 1s ...  1s
Ran: generic/397
Passed all 1 tests

Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
---
 fs/ceph/addr.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

Comments

Alex Markuze Jan. 20, 2025, 9:47 a.m. UTC | #1
Easiest is to run xfstets. Ping me on slack I can show you, its simple.

On Mon, Jan 20, 2025 at 11:34 AM David Howells <dhowells@redhat.com> wrote:
>
> Is there a way for me to test this?  I have a ceph server set up and can mount
> a filesystem from it.  How do a make a file content-encrypted on ceph?
>
> David
>
Viacheslav Dubeyko Jan. 27, 2025, 6:40 p.m. UTC | #2
Hi David,

On Mon, 2025-01-20 at 11:47 +0200, Alex Markuze wrote:
> Easiest is to run xfstets. Ping me on slack I can show you, its simple.
> 
> On Mon, Jan 20, 2025 at 11:34 AM David Howells <dhowells@redhat.com> wrote:
> > 
> > Is there a way for me to test this?  I have a ceph server set up and can mount
> > a filesystem from it.  How do a make a file content-encrypted on ceph?
> > 
> > David
> > 
> 

So, is suggested fix correct or not? If it is not, then which solution
could be the right fix here?

Thanks,
Slava.
David Howells Jan. 28, 2025, 4:13 p.m. UTC | #3
Hi Viacheslav,

> So, is suggested fix correct or not? If it is not, then which solution
> could be the right fix here?

It'll probably do.  I can't reproduce the issue because I keep hitting a hang
(which I'm trying to debug).

David
David Howells Jan. 28, 2025, 4:57 p.m. UTC | #4
David Howells <dhowells@redhat.com> wrote:

> > So, is suggested fix correct or not? If it is not, then which solution
> > could be the right fix here?
> 
> It'll probably do.  I can't reproduce the issue because I keep hitting a hang
> (which I'm trying to debug).

Actually, the hang only seems to be happening with KASAN; non-KASAN seems to
go without crashing - except that the filename handling is screwed:

generic/397       - output mismatch (see /root/xfstests-dev/results//generic/397.out.bad)
    --- tests/generic/397.out   2024-09-12 12:36:14.167441927 +0100
    +++ /root/xfstests-dev/results//generic/397.out.bad 2025-01-28 16:51:03.583414909 +0000
    @@ -1,13 +1,27 @@
     QA output created by 397
    +Only in /xfstest.scratch/edir: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy>Xx���ׄ��z]�F¨h��ݕ�©M���UP³\�
    +Only in /xfstest.scratch/ref_dir: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
    +Only in /xfstest.scratch/edir: +mHXwVcuhV3fcBS7kmbPmA
    +Only in /xfstest.scratch/edir: 8eVfRGNiQlxQ7bFdyyE+AA
    +Only in /xfstest.scratch/edir: ANIxymzfodiFr5DMRbVVtw
    +Only in /xfstest.scratch/edir: Br7L2ajr28IeGcjUD+xeTKOV3BLG9+At3L7poBkNSOuo8SZ64W3ulcTKtnxy5ZqMwbQ9kUHd+4Y7VVG6zrxWMmfGOv2KQ4PHyx+FDY4B5NQileJawmBw65oCdofnbycn6Mm10+S2EpqErf30Cl0vgHNTwSn2jExBJhuMjGa2kg7G1DLdgs9dzJZyPzOyVvjCnlbTbMV1uQ29e5QhZBuZGwqIxecfweAmr,fD2R,QWh5W5reU
    ...

Mounting the volume manually and looking inside seems to show that the
filenames in the filesystem are looking correct:

	andromeda# ls /xfstest.scratch/scratch/ref_dir/
	a
	abcdefghijklmnopqrstuvwxyz
	empty
	subdir/
	symlink@
	symlink2@
	symlink3@
	yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy

but I can't tell if that's because it's correctly decrypting them there - or
whether they're actually stored unencrypted.

So it's entirely possible that due to the filename wrangling being screwed,
I'm not hitting the problem.

David
David Howells Jan. 28, 2025, 8:01 p.m. UTC | #5
I added some tracing to fs/ceph/addr.c and this highlights the bug causing the
hang that I'm seeing.

So what I see is ceph_writepages_start() being entered and getting a
collection of folios from filemap_get_folios_tag():

    netfs_ceph_writepages: i=10000004f52 ix=0
    netfs_ceph_wp_get_folios: i=10000004f52 oix=0 ix=8000000000000 nr=6

Then we get out the first dirty folio from the batch and attempt to lock it:

    netfs_folio: i=10000004f52 ix=00003-00003 ceph-wb-lock

which succeeds.  We then pass through a number of lines:

    netfs_ceph_wp_track: i=10000004f52 line=1218

which is the "/* shift unused page to beginning of fbatch */" comment, then:

    netfs_ceph_wp_track: i=10000004f52 line=1238

which is followed by "offset = ceph_fscrypt_page_offset(pages[0]);", then:

    netfs_ceph_wp_track: i=10000004f52 line=1264

which is the error handling path of:

		if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
			rc = -EIO;
			goto release_folios;
		}

and then:

    netfs_ceph_wp_track: i=10000004f52 line=1389

which is "release_folios:".

We then reenter ceph_writepages_start(), get the same batch of dirty folios
and try to lock them again:

    netfs_ceph_writepages: i=10000004f52 ix=0
    netfs_ceph_wp_get_folios: i=10000004f52 oix=0 ix=8000000000000 nr=6
    netfs_folio: i=10000004f52 ix=00003-00003 ceph-wb-lock

and that's where we hang.

I think the problem is that the error handling here:

		if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
			rc = -EIO;
			goto release_folios;
		}

is insufficient.  The folios are locked and can't just be released.

Why ceph_inc_osd_stopping_blocker() fails is also something that needs looking
at.

David
Viacheslav Dubeyko Jan. 28, 2025, 8:16 p.m. UTC | #6
On Tue, 2025-01-28 at 20:01 +0000, David Howells wrote:
> I added some tracing to fs/ceph/addr.c and this highlights the bug causing the
> hang that I'm seeing.
> 
> So what I see is ceph_writepages_start() being entered and getting a
> collection of folios from filemap_get_folios_tag():
> 
>     netfs_ceph_writepages: i=10000004f52 ix=0
>     netfs_ceph_wp_get_folios: i=10000004f52 oix=0 ix=8000000000000 nr=6
> 
> Then we get out the first dirty folio from the batch and attempt to lock it:
> 
>     netfs_folio: i=10000004f52 ix=00003-00003 ceph-wb-lock
> 
> which succeeds.  We then pass through a number of lines:
> 
>     netfs_ceph_wp_track: i=10000004f52 line=1218
> 
> which is the "/* shift unused page to beginning of fbatch */" comment, then:
> 
>     netfs_ceph_wp_track: i=10000004f52 line=1238
> 
> which is followed by "offset = ceph_fscrypt_page_offset(pages[0]);", then:
> 
>     netfs_ceph_wp_track: i=10000004f52 line=1264
> 
> which is the error handling path of:
> 
> 		if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
> 			rc = -EIO;
> 			goto release_folios;
> 		}
> 
> and then:
> 
>     netfs_ceph_wp_track: i=10000004f52 line=1389
> 
> which is "release_folios:".
> 
> We then reenter ceph_writepages_start(), get the same batch of dirty folios
> and try to lock them again:
> 
>     netfs_ceph_writepages: i=10000004f52 ix=0
>     netfs_ceph_wp_get_folios: i=10000004f52 oix=0 ix=8000000000000 nr=6
>     netfs_folio: i=10000004f52 ix=00003-00003 ceph-wb-lock
> 
> and that's where we hang.
> 
> I think the problem is that the error handling here:
> 
> 		if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
> 			rc = -EIO;
> 			goto release_folios;
> 		}
> 
> is insufficient.  The folios are locked and can't just be released.
> 
> Why ceph_inc_osd_stopping_blocker() fails is also something that needs looking
> at.
> 

Yeah, I am trying to solve this issue now. :) I am reproducing the issue for
generic/421.

It's only the first issue. Also this code [1] doesn't work because page is
already locked and it will be unlocked only in writepages_finish():

			if (folio_test_writeback(folio) ||
			    folio_test_private_2(folio) /* [DEPRECATED] */) {
				if (wbc->sync_mode == WB_SYNC_NONE) {
					doutc(cl, "%p under writeback\n",
folio);
					folio_unlock(folio);
					continue;
				}
				doutc(cl, "waiting on writeback %p\n", folio);
				folio_wait_writeback(folio);
				folio_wait_private_2(folio); /* [DEPRECATED] */
			}

It looks like we need to check it before the lock here [2].

And even after solving these two issues, I can see dirty memory pages after
unmount finish. Something wrong yet in ceph_writepages_start() logic. So, I am
trying to figure out what I am missing here yet.

[1] https://elixir.bootlin.com/linux/v6.13-rc3/source/fs/ceph/addr.c#L1101
[2] https://elixir.bootlin.com/linux/v6.13-rc3/source/fs/ceph/addr.c#L1059
David Howells Jan. 28, 2025, 10:34 p.m. UTC | #7
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:

> And even after solving these two issues, I can see dirty memory pages after
> unmount finish. Something wrong yet in ceph_writepages_start() logic. So, I am
> trying to figure out what I am missing here yet.

Do you want me to push a branch with my tracepoints that I'm using somewhere
that you can grab it?

David
Viacheslav Dubeyko Jan. 28, 2025, 10:37 p.m. UTC | #8
On Tue, 2025-01-28 at 22:34 +0000, David Howells wrote:
> Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> 
> > And even after solving these two issues, I can see dirty memory pages after
> > unmount finish. Something wrong yet in ceph_writepages_start() logic. So, I am
> > trying to figure out what I am missing here yet.
> 
> Do you want me to push a branch with my tracepoints that I'm using somewhere
> that you can grab it?
> 

Sounds good! Maybe it can help me. :)

Thanks,
Slava.
Alex Markuze Jan. 29, 2025, 10:39 a.m. UTC | #9
FYI, This the set of fscrypt of tests that keep failing, w/o this patch.
Many of these revoke keys mid I/O.
generic/397
generic/421  #Test revoking an encryption key during concurrent I/O.
generic/429. #Test revoking an encryption key during concurrent I/O.
And additional fscrypt races
generic/440. #Test revoking an encryption key during concurrent I/O.
generic/580  #Testing the different keyring policies - also revokes
keys on open files
generic/593  #Test adding a key to a filesystem's fscrypt keyring via
an "fscrypt-provisioning" keyring key.
generic/595  #Test revoking an encryption key during concurrent I/O.

p.s
generic/650 is yanking CPUs mid run so may also sporadically fail.
unrelated to fscrypt.

On Wed, Jan 29, 2025 at 12:37 AM Viacheslav Dubeyko
<Slava.Dubeyko@ibm.com> wrote:
>
> On Tue, 2025-01-28 at 22:34 +0000, David Howells wrote:
> > Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> >
> > > And even after solving these two issues, I can see dirty memory pages after
> > > unmount finish. Something wrong yet in ceph_writepages_start() logic. So, I am
> > > trying to figure out what I am missing here yet.
> >
> > Do you want me to push a branch with my tracepoints that I'm using somewhere
> > that you can grab it?
> >
>
> Sounds good! Maybe it can help me. :)
>
> Thanks,
> Slava.
>
>
David Howells Jan. 29, 2025, 1:41 p.m. UTC | #10
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:

> > Do you want me to push a branch with my tracepoints that I'm using somewhere
> > that you can grab it?
> 
> Sounds good! Maybe it can help me. :)

Take a look at:

   https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/

The "ceph-folio" branch has Willy's folio conversion patches plus a tracing
patch plus a patch that's an unsuccessful attempt by me to fix the hang I was
seeing.

The tracepoint I'm using (netfs_folio) takes a folio pointer, so it was easier
to do it on top of Willy's patches.

The "netfs-crypto" branch are my patches to implement content crypto in
netfslib.  I've tested them to some extent with AFS, but the test code I have
in AFS only supports crypto of files where the file is an exact multiple of
page size as AFS doesn't support any sort of xattr and so I can't store the
real EOF pointer so simply.

The "ceph-iter" branch are my patches on top of a merge of those two
(excluding the debugging patches) to try and convert ceph to fully using
netfslib and to pass an iterator all the way down to the socket, aiming to
reduce the number of data types to basically two.

David
David Howells Jan. 29, 2025, 1:42 p.m. UTC | #11
Alex Markuze <amarkuze@redhat.com> wrote:

> FYI, This the set of fscrypt of tests that keep failing, w/o this patch.
> Many of these revoke keys mid I/O.
> generic/397
> generic/421  #Test revoking an encryption key during concurrent I/O.
> generic/429. #Test revoking an encryption key during concurrent I/O.
> And additional fscrypt races
> generic/440. #Test revoking an encryption key during concurrent I/O.
> generic/580  #Testing the different keyring policies - also revokes
> keys on open files
> generic/593  #Test adding a key to a filesystem's fscrypt keyring via
> an "fscrypt-provisioning" keyring key.
> generic/595  #Test revoking an encryption key during concurrent I/O.

I presume I don't need to add a key and that these do it for themselves?

David
Alex Markuze Jan. 29, 2025, 1:54 p.m. UTC | #12
Yeah, all tests are fully independent.
Just make sure you see them being executed or you can just run them stand alone.
e.g., sudo ./check generic/429

On Wed, Jan 29, 2025 at 3:42 PM David Howells <dhowells@redhat.com> wrote:
>
> Alex Markuze <amarkuze@redhat.com> wrote:
>
> > FYI, This the set of fscrypt of tests that keep failing, w/o this patch.
> > Many of these revoke keys mid I/O.
> > generic/397
> > generic/421  #Test revoking an encryption key during concurrent I/O.
> > generic/429. #Test revoking an encryption key during concurrent I/O.
> > And additional fscrypt races
> > generic/440. #Test revoking an encryption key during concurrent I/O.
> > generic/580  #Testing the different keyring policies - also revokes
> > keys on open files
> > generic/593  #Test adding a key to a filesystem's fscrypt keyring via
> > an "fscrypt-provisioning" keyring key.
> > generic/595  #Test revoking an encryption key during concurrent I/O.
>
> I presume I don't need to add a key and that these do it for themselves?
>
> David
>
Viacheslav Dubeyko Jan. 29, 2025, 7:16 p.m. UTC | #13
On Wed, 2025-01-29 at 13:41 +0000, David Howells wrote:
> Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> 
> > > Do you want me to push a branch with my tracepoints that I'm using somewhere
> > > that you can grab it?
> > 
> > Sounds good! Maybe it can help me. :)
> 
> Take a look at:
> 
>    https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/  
> 
> The "ceph-folio" branch has Willy's folio conversion patches plus a tracing
> patch plus a patch that's an unsuccessful attempt by me to fix the hang I was
> seeing.
> 
> The tracepoint I'm using (netfs_folio) takes a folio pointer, so it was easier
> to do it on top of Willy's patches.
> 
> The "netfs-crypto" branch are my patches to implement content crypto in
> netfslib.  I've tested them to some extent with AFS, but the test code I have
> in AFS only supports crypto of files where the file is an exact multiple of
> page size as AFS doesn't support any sort of xattr and so I can't store the
> real EOF pointer so simply.
> 
> The "ceph-iter" branch are my patches on top of a merge of those two
> (excluding the debugging patches) to try and convert ceph to fully using
> netfslib and to pass an iterator all the way down to the socket, aiming to
> reduce the number of data types to basically two.
> 

Great! Thanks a lot.

I believe I have been found all current issues in ceph_writepages_start().
So, I need to clean up the current messy state of the fix and the method itself.
Let me make this clean up, test the fix (probably, I could have some issues with
the fix yet), and share the patch finally.

As far as I can see, there are several issues in ceph_writepages_start():
(1) We have double lock issue (reason of the hang);
(2) We have issue with not correct place for folio_wait_writeback();
(3) The ceph_inc_osd_stopping_blocker() could not provide guarantee of waiting
finishing all dirty memory pages flush. It's racy now, as far as I can see. But
I need to check it more accurately by testing.
(4) The folio_batch with found dirty pages by filemap_get_folios_tag() is not
processed properly. And this is why some number of dirty pages simply never
processed and we still have dirty pages after unmount.
(5) The whole method of ceph_writepages_start() is huge and messy for my taste
and this is the reason of all of these issues (it's hard to follow the logic of
the method in this unreasonable complexity).

Thanks,
Slava.
diff mbox series

Patch

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 85936f6d2bf7..5e6ba92219f3 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -396,6 +396,15 @@  static void ceph_netfs_issue_read(struct netfs_io_subrequest *subreq)
 		struct page **pages;
 		size_t page_off;
 
+		/*
+		 * The io_iter.count needs to be corrected to aligned length.
+		 * Otherwise, iov_iter_get_pages_alloc2() operates with
+		 * the initial unaligned length value. As a result,
+		 * ceph_msg_data_cursor_init() triggers BUG_ON() in the case
+		 * if msg->sparse_read_total > msg->data_length.
+		 */
+		subreq->io_iter.count = len;
+
 		err = iov_iter_get_pages_alloc2(&subreq->io_iter, &pages, len, &page_off);
 		if (err < 0) {
 			doutc(cl, "%llx.%llx failed to allocate pages, %d\n",
@@ -405,6 +414,7 @@  static void ceph_netfs_issue_read(struct netfs_io_subrequest *subreq)
 
 		/* should always give us a page-aligned read */
 		WARN_ON_ONCE(page_off);
+
 		len = err;
 		err = 0;