Message ID: 20250117035044.23309-1-slava@dubeyko.com
State: New
Series: [v2] ceph: Fix kernel crash in generic/397 test
Easiest is to run xfstests. Ping me on Slack and I can show you; it's simple.

On Mon, Jan 20, 2025 at 11:34 AM David Howells <dhowells@redhat.com> wrote:
>
> Is there a way for me to test this? I have a ceph server set up and can mount
> a filesystem from it. How do I make a file content-encrypted on ceph?
>
> David
Hi David,

On Mon, 2025-01-20 at 11:47 +0200, Alex Markuze wrote:
> Easiest is to run xfstests. Ping me on Slack and I can show you; it's simple.
>
> On Mon, Jan 20, 2025 at 11:34 AM David Howells <dhowells@redhat.com> wrote:
> >
> > Is there a way for me to test this? I have a ceph server set up and can mount
> > a filesystem from it. How do I make a file content-encrypted on ceph?
> >
> > David

So, is the suggested fix correct or not? If it is not, then which solution
would be the right fix here?

Thanks,
Slava.
Hi Viacheslav,

> So, is the suggested fix correct or not? If it is not, then which solution
> would be the right fix here?

It'll probably do. I can't reproduce the issue because I keep hitting a hang
(which I'm trying to debug).

David
David Howells <dhowells@redhat.com> wrote:

> > So, is the suggested fix correct or not? If it is not, then which solution
> > would be the right fix here?
>
> It'll probably do. I can't reproduce the issue because I keep hitting a hang
> (which I'm trying to debug).

Actually, the hang only seems to be happening with KASAN; non-KASAN seems to
go without crashing - except that the filename handling is screwed:

generic/397 - output mismatch (see /root/xfstests-dev/results//generic/397.out.bad)
    --- tests/generic/397.out	2024-09-12 12:36:14.167441927 +0100
    +++ /root/xfstests-dev/results//generic/397.out.bad	2025-01-28 16:51:03.583414909 +0000
    @@ -1,13 +1,27 @@
     QA output created by 397
    +Only in /xfstest.scratch/edir: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy>Xx���ׄ��z]�F¨h��ݕ�©M���UP³\�
    +Only in /xfstest.scratch/ref_dir: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
    +Only in /xfstest.scratch/edir: +mHXwVcuhV3fcBS7kmbPmA
    +Only in /xfstest.scratch/edir: 8eVfRGNiQlxQ7bFdyyE+AA
    +Only in /xfstest.scratch/edir: ANIxymzfodiFr5DMRbVVtw
    +Only in /xfstest.scratch/edir: Br7L2ajr28IeGcjUD+xeTKOV3BLG9+At3L7poBkNSOuo8SZ64W3ulcTKtnxy5ZqMwbQ9kUHd+4Y7VVG6zrxWMmfGOv2KQ4PHyx+FDY4B5NQileJawmBw65oCdofnbycn6Mm10+S2EpqErf30Cl0vgHNTwSn2jExBJhuMjGa2kg7G1DLdgs9dzJZyPzOyVvjCnlbTbMV1uQ29e5QhZBuZGwqIxecfweAmr,fD2R,QWh5W5reU
    ...
Mounting the volume manually and looking inside seems to show that the
filenames in the filesystem look correct:

    andromeda# ls /xfstest.scratch/scratch/ref_dir/
    a  abcdefghijklmnopqrstuvwxyz  empty  subdir/  symlink@  symlink2@  symlink3@
    yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy

but I can't tell whether that's because they're being correctly decrypted
there - or whether they're actually stored unencrypted. So it's entirely
possible that, due to the filename wrangling being screwed, I'm not hitting
the problem.

David
I added some tracing to fs/ceph/addr.c and this highlights the bug causing the
hang that I'm seeing.

So what I see is ceph_writepages_start() being entered and getting a
collection of folios from filemap_get_folios_tag():

   netfs_ceph_writepages: i=10000004f52 ix=0
   netfs_ceph_wp_get_folios: i=10000004f52 oix=0 ix=8000000000000 nr=6

Then we get out the first dirty folio from the batch and attempt to lock it:

   netfs_folio: i=10000004f52 ix=00003-00003 ceph-wb-lock

which succeeds. We then pass through a number of lines:

   netfs_ceph_wp_track: i=10000004f52 line=1218

which is the "/* shift unused page to beginning of fbatch */" comment, then:

   netfs_ceph_wp_track: i=10000004f52 line=1238

which is followed by "offset = ceph_fscrypt_page_offset(pages[0]);", then:

   netfs_ceph_wp_track: i=10000004f52 line=1264

which is the error handling path of:

	if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
		rc = -EIO;
		goto release_folios;
	}

and then:

   netfs_ceph_wp_track: i=10000004f52 line=1389

which is "release_folios:".

We then reenter ceph_writepages_start(), get the same batch of dirty folios
and try to lock them again:

   netfs_ceph_writepages: i=10000004f52 ix=0
   netfs_ceph_wp_get_folios: i=10000004f52 oix=0 ix=8000000000000 nr=6
   netfs_folio: i=10000004f52 ix=00003-00003 ceph-wb-lock

and that's where we hang.

I think the problem is that the error handling here:

	if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
		rc = -EIO;
		goto release_folios;
	}

is insufficient. The folios are locked and can't just be released.

Why ceph_inc_osd_stopping_blocker() fails is also something that needs
looking at.

David
On Tue, 2025-01-28 at 20:01 +0000, David Howells wrote:
> I added some tracing to fs/ceph/addr.c and this highlights the bug causing the
> hang that I'm seeing.
>
> So what I see is ceph_writepages_start() being entered and getting a
> collection of folios from filemap_get_folios_tag():
>
>    netfs_ceph_writepages: i=10000004f52 ix=0
>    netfs_ceph_wp_get_folios: i=10000004f52 oix=0 ix=8000000000000 nr=6
>
> Then we get out the first dirty folio from the batch and attempt to lock it:
>
>    netfs_folio: i=10000004f52 ix=00003-00003 ceph-wb-lock
>
> which succeeds. [...] We then reenter ceph_writepages_start(), get the same
> batch of dirty folios and try to lock them again:
>
>    netfs_ceph_writepages: i=10000004f52 ix=0
>    netfs_ceph_wp_get_folios: i=10000004f52 oix=0 ix=8000000000000 nr=6
>    netfs_folio: i=10000004f52 ix=00003-00003 ceph-wb-lock
>
> and that's where we hang.
>
> I think the problem is that the error handling here:
>
>	if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
>		rc = -EIO;
>		goto release_folios;
>	}
>
> is insufficient. The folios are locked and can't just be released.
>
> Why ceph_inc_osd_stopping_blocker() fails is also something that needs
> looking at.

Yeah, I am trying to solve this issue now. :) I am reproducing the issue for
generic/421. And it's only the first issue.
Also, this code [1] doesn't work because the page is already locked and it
will be unlocked only in writepages_finish():

	if (folio_test_writeback(folio) ||
	    folio_test_private_2(folio) /* [DEPRECATED] */) {
		if (wbc->sync_mode == WB_SYNC_NONE) {
			doutc(cl, "%p under writeback\n", folio);
			folio_unlock(folio);
			continue;
		}
		doutc(cl, "waiting on writeback %p\n", folio);
		folio_wait_writeback(folio);
		folio_wait_private_2(folio); /* [DEPRECATED] */
	}

It looks like we need to do this check before taking the lock here [2].

And even after solving these two issues, I can see dirty memory pages after
the unmount finishes. Something is still wrong in the ceph_writepages_start()
logic, so I am trying to figure out what I am missing.

[1] https://elixir.bootlin.com/linux/v6.13-rc3/source/fs/ceph/addr.c#L1101
[2] https://elixir.bootlin.com/linux/v6.13-rc3/source/fs/ceph/addr.c#L1059
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:

> And even after solving these two issues, I can see dirty memory pages after
> the unmount finishes. Something is still wrong in the ceph_writepages_start()
> logic, so I am trying to figure out what I am missing.

Do you want me to push a branch with the tracepoints I'm using somewhere that
you can grab it?

David
On Tue, 2025-01-28 at 22:34 +0000, David Howells wrote:
> Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
>
> > And even after solving these two issues, I can see dirty memory pages after
> > the unmount finishes. Something is still wrong in the ceph_writepages_start()
> > logic, so I am trying to figure out what I am missing.
>
> Do you want me to push a branch with the tracepoints I'm using somewhere that
> you can grab it?

Sounds good! Maybe it can help me. :)

Thanks,
Slava.
FYI, this is the set of fscrypt tests that keep failing without this patch.
Many of these revoke keys mid-I/O:

generic/397
generic/421  # Test revoking an encryption key during concurrent I/O.
generic/429  # Test revoking an encryption key during concurrent I/O, plus
             # additional fscrypt races.
generic/440  # Test revoking an encryption key during concurrent I/O.
generic/580  # Test the different keyring policies; also revokes keys on
             # open files.
generic/593  # Test adding a key to a filesystem's fscrypt keyring via an
             # "fscrypt-provisioning" keyring key.
generic/595  # Test revoking an encryption key during concurrent I/O.

P.S. generic/650 yanks CPUs mid-run, so it may also sporadically fail,
unrelated to fscrypt.

On Wed, Jan 29, 2025 at 12:37 AM Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
>
> On Tue, 2025-01-28 at 22:34 +0000, David Howells wrote:
> > Do you want me to push a branch with the tracepoints I'm using somewhere
> > that you can grab it?
>
> Sounds good! Maybe it can help me. :)
>
> Thanks,
> Slava.
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:

> > Do you want me to push a branch with the tracepoints I'm using somewhere
> > that you can grab it?
>
> Sounds good! Maybe it can help me. :)

Take a look at:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/

The "ceph-folio" branch has Willy's folio conversion patches plus a tracing
patch plus a patch that's an unsuccessful attempt by me to fix the hang I was
seeing.

The tracepoint I'm using (netfs_folio) takes a folio pointer, so it was easier
to do it on top of Willy's patches.

The "netfs-crypto" branch holds my patches to implement content crypto in
netfslib. I've tested them to some extent with AFS, but the test code I have
in AFS only supports crypto of files whose size is an exact multiple of the
page size, as AFS doesn't support any sort of xattr and so I can't store the
real EOF pointer so simply.

The "ceph-iter" branch holds my patches on top of a merge of those two
(excluding the debugging patches) to try to convert ceph to fully using
netfslib and to pass an iterator all the way down to the socket, aiming to
reduce the number of data types to basically two.

David
Alex Markuze <amarkuze@redhat.com> wrote:

> FYI, this is the set of fscrypt tests that keep failing without this patch.
> Many of these revoke keys mid-I/O:
>
> generic/397
> generic/421  # Test revoking an encryption key during concurrent I/O.
> generic/429  # Test revoking an encryption key during concurrent I/O, plus
>              # additional fscrypt races.
> generic/440  # Test revoking an encryption key during concurrent I/O.
> generic/580  # Test the different keyring policies; also revokes keys on
>              # open files.
> generic/593  # Test adding a key to a filesystem's fscrypt keyring via an
>              # "fscrypt-provisioning" keyring key.
> generic/595  # Test revoking an encryption key during concurrent I/O.

I presume I don't need to add a key and that these do it for themselves?

David
Yeah, all tests are fully independent. Just make sure you see them being
executed, or you can run them standalone, e.g.:

	sudo ./check generic/429

On Wed, Jan 29, 2025 at 3:42 PM David Howells <dhowells@redhat.com> wrote:
>
> Alex Markuze <amarkuze@redhat.com> wrote:
>
> > FYI, this is the set of fscrypt tests that keep failing without this patch.
> > Many of these revoke keys mid-I/O:
> > generic/397
> > generic/421  # Test revoking an encryption key during concurrent I/O.
> > generic/429  # Test revoking an encryption key during concurrent I/O, plus
> >              # additional fscrypt races.
> > generic/440  # Test revoking an encryption key during concurrent I/O.
> > generic/580  # Test the different keyring policies; also revokes keys on
> >              # open files.
> > generic/593  # Test adding a key to a filesystem's fscrypt keyring via an
> >              # "fscrypt-provisioning" keyring key.
> > generic/595  # Test revoking an encryption key during concurrent I/O.
>
> I presume I don't need to add a key and that these do it for themselves?
>
> David
On Wed, 2025-01-29 at 13:41 +0000, David Howells wrote:
> Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
>
> > > Do you want me to push a branch with the tracepoints I'm using somewhere
> > > that you can grab it?
> >
> > Sounds good! Maybe it can help me. :)
>
> Take a look at:
>
>	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/
>
> The "ceph-folio" branch has Willy's folio conversion patches plus a tracing
> patch plus a patch that's an unsuccessful attempt by me to fix the hang I was
> seeing.
>
> The tracepoint I'm using (netfs_folio) takes a folio pointer, so it was easier
> to do it on top of Willy's patches.
>
> The "netfs-crypto" branch holds my patches to implement content crypto in
> netfslib. I've tested them to some extent with AFS, but the test code I have
> in AFS only supports crypto of files whose size is an exact multiple of the
> page size, as AFS doesn't support any sort of xattr and so I can't store the
> real EOF pointer so simply.
>
> The "ceph-iter" branch holds my patches on top of a merge of those two
> (excluding the debugging patches) to try to convert ceph to fully using
> netfslib and to pass an iterator all the way down to the socket, aiming to
> reduce the number of data types to basically two.

Great! Thanks a lot.

I believe I have found all the current issues in ceph_writepages_start(). Now
I need to clean up the currently messy state of the fix and of the method
itself. Let me do this cleanup, test the fix (I could still have some issues
with it), and finally share the patch.

As far as I can see, there are several issues in ceph_writepages_start():

(1) We have a double-lock issue (the reason for the hang).

(2) folio_wait_writeback() is called in the wrong place.

(3) ceph_inc_osd_stopping_blocker() cannot guarantee that the flush of all
dirty memory pages has finished. It is racy now, as far as I can see, but I
need to check this more carefully by testing.
(4) The folio_batch of dirty pages found by filemap_get_folios_tag() is not
processed properly. This is why some dirty pages are simply never processed
and we still have dirty pages after unmount.

(5) The whole ceph_writepages_start() method is huge and messy for my taste,
and this is the root cause of all of these issues (it's hard to follow the
logic of the method at this unreasonable complexity).

Thanks,
Slava.
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 85936f6d2bf7..5e6ba92219f3 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -396,6 +396,15 @@ static void ceph_netfs_issue_read(struct netfs_io_subrequest *subreq)
 	struct page **pages;
 	size_t page_off;
 
+	/*
+	 * The io_iter.count needs to be corrected to the aligned length.
+	 * Otherwise, iov_iter_get_pages_alloc2() operates with
+	 * the initial unaligned length value. As a result,
+	 * ceph_msg_data_cursor_init() triggers BUG_ON() in the case
+	 * of msg->sparse_read_total > msg->data_length.
+	 */
+	subreq->io_iter.count = len;
+
 	err = iov_iter_get_pages_alloc2(&subreq->io_iter, &pages, len, &page_off);
 	if (err < 0) {
 		doutc(cl, "%llx.%llx failed to allocate pages, %d\n",
@@ -405,6 +414,7 @@ static void ceph_netfs_issue_read(struct netfs_io_subrequest *subreq)
 
 	/* should always give us a page-aligned read */
 	WARN_ON_ONCE(page_off);
+	len = err;
 	err = 0;