diff mbox series

[5.10.x] btrfs: fix crash after non-aligned direct IO write with O_DSYNC

Message ID 94663c8a2172dc96b760d356a538d45c36f46040.1613062764.git.fdmanana@suse.com
State New
Headers show
Series [5.10.x] btrfs: fix crash after non-aligned direct IO write with O_DSYNC | expand

Commit Message

fdmanana@kernel.org Feb. 16, 2021, 2:40 p.m. UTC
From: Filipe Manana <fdmanana@suse.com>

Whenever we attempt to do a non-aligned direct IO write with O_DSYNC, we
end up triggering an assertion and crashing. Example reproducer:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/sdj
  MNT=/mnt/sdj

  mkfs.btrfs -f $DEV > /dev/null
  mount $DEV $MNT

  # Do a direct IO write with O_DSYNC into a non-aligned range...
  xfs_io -f -d -s -c "pwrite -S 0xab -b 64K 1111 64K" $MNT/foobar

  umount $MNT

When running the reproducer an assertion fails and produces the following
trace:

  [ 2418.403134] assertion failed: !current->journal_info || flush != BTRFS_RESERVE_FLUSH_DATA, in fs/btrfs/space-info.c:1467
  [ 2418.403745] ------------[ cut here ]------------
  [ 2418.404306] kernel BUG at fs/btrfs/ctree.h:3286!
  [ 2418.404862] invalid opcode: 0000 [#2] PREEMPT SMP DEBUG_PAGEALLOC PTI
  [ 2418.405451] CPU: 1 PID: 64705 Comm: xfs_io Tainted: G      D           5.10.15-btrfs-next-87 #1
  [ 2418.406026] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  [ 2418.407228] RIP: 0010:assertfail.constprop.0+0x18/0x26 [btrfs]
  [ 2418.407835] Code: e6 48 c7 (...)
  [ 2418.409078] RSP: 0018:ffffb06080d13c98 EFLAGS: 00010246
  [ 2418.409696] RAX: 000000000000006c RBX: ffff994c1debbf08 RCX: 0000000000000000
  [ 2418.410302] RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff
  [ 2418.410904] RBP: ffff994c21770000 R08: 0000000000000000 R09: 0000000000000000
  [ 2418.411504] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000010000
  [ 2418.412111] R13: ffff994c22198400 R14: ffff994c21770000 R15: 0000000000000000
  [ 2418.412713] FS:  00007f54fd7aff00(0000) GS:ffff994d35200000(0000) knlGS:0000000000000000
  [ 2418.413326] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 2418.413933] CR2: 000056549596d000 CR3: 000000010b928003 CR4: 0000000000370ee0
  [ 2418.414528] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [ 2418.415109] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [ 2418.415669] Call Trace:
  [ 2418.416254]  btrfs_reserve_data_bytes.cold+0x22/0x22 [btrfs]
  [ 2418.416812]  btrfs_check_data_free_space+0x4c/0xa0 [btrfs]
  [ 2418.417380]  btrfs_buffered_write+0x1b0/0x7f0 [btrfs]
  [ 2418.418315]  btrfs_file_write_iter+0x2a9/0x770 [btrfs]
  [ 2418.418920]  new_sync_write+0x11f/0x1c0
  [ 2418.419430]  vfs_write+0x2bb/0x3b0
  [ 2418.419972]  __x64_sys_pwrite64+0x90/0xc0
  [ 2418.420486]  do_syscall_64+0x33/0x80
  [ 2418.420979]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [ 2418.421486] RIP: 0033:0x7f54fda0b986
  [ 2418.421981] Code: 48 c7 c0 (...)
  [ 2418.423019] RSP: 002b:00007ffc40569c38 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
  [ 2418.423547] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f54fda0b986
  [ 2418.424075] RDX: 0000000000010000 RSI: 000056549595e000 RDI: 0000000000000003
  [ 2418.424596] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000400
  [ 2418.425119] R10: 0000000000000400 R11: 0000000000000246 R12: 00000000ffffffff
  [ 2418.425644] R13: 0000000000000400 R14: 0000000000010000 R15: 0000000000000000
  [ 2418.426148] Modules linked in: btrfs blake2b_generic (...)
  [ 2418.429540] ---[ end trace ef2aeb44dc0afa34 ]---

1) At btrfs_file_write_iter() we set current->journal_info to
   BTRFS_DIO_SYNC_STUB;

2) We then call __btrfs_direct_write(), which calls btrfs_direct_IO();

3) We can't do the direct IO write because it starts at a non-aligned
   offset (1111). So at btrfs_direct_IO() we return -EINVAL (coming from
   check_direct_IO() which does the alignment check), but we leave
   current->journal_info set to BTRFS_DIO_SYNC_STUB - we only clear it
   at btrfs_dio_iomap_begin(), because we assume we always get there;

4) Then at __btrfs_direct_write() we see that the attempt to do the
   direct IO write was not successful, 0 bytes written, so we fallback
   to a buffered write by calling btrfs_buffered_write();

5) There we call btrfs_check_data_free_space() which in turn calls
   btrfs_alloc_data_chunk_ondemand() and that calls
   btrfs_reserve_data_bytes() with flush == BTRFS_RESERVE_FLUSH_DATA;

6) Then at btrfs_reserve_data_bytes() we have current->journal_info set to
   BTRFS_DIO_SYNC_STUB, therefore not NULL, and flush has the value
   BTRFS_RESERVE_FLUSH_DATA, triggering the second assertion:

  int btrfs_reserve_data_bytes(struct btrfs_fs_info *fs_info, u64 bytes,
                               enum btrfs_reserve_flush_enum flush)
  {
      struct btrfs_space_info *data_sinfo = fs_info->data_sinfo;
      int ret;

      ASSERT(flush == BTRFS_RESERVE_FLUSH_DATA ||
             flush == BTRFS_RESERVE_FLUSH_FREE_SPACE_INODE);
      ASSERT(!current->journal_info || flush != BTRFS_RESERVE_FLUSH_DATA);
  (...)

So fix that by setting the journal to NULL whenever check_direct_IO()
returns a failure.

This bug only affects 5.10 kernels, and the regression was introduced in
5.10-rc1 by commit 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround").
The bug does not exist in 5.11 kernels due to commit ecfdc08b8cc65d
("btrfs: remove dio iomap DSYNC workaround"), which depends on a large
patchset that went into the merge window for 5.11. So this is a fix only
for 5.10.x stable kernels, as there are people hitting this bug.

Fixes: 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround")
CC: stable@vger.kernel.org # 5.10 (and only 5.10)
CC: David Sterba <dsterba@suse.cz>
Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1181605
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/inode.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Comments

Filipe Manana Feb. 16, 2021, 2:52 p.m. UTC | #1
On 16/02/21 14:50, Greg KH wrote:
> On Tue, Feb 16, 2021 at 02:40:31PM +0000, fdmanana@kernel.org wrote:
>> From: Filipe Manana <fdmanana@suse.com>
>>
>> Whenever we attempt to do a non-aligned direct IO write with O_DSYNC, we
>> end up triggering an assertion and crashing. Example reproducer:
>>
>>   $ cat test.sh
>>   #!/bin/bash
>>
>>   DEV=/dev/sdj
>>   MNT=/mnt/sdj
>>
>>   mkfs.btrfs -f $DEV > /dev/null
>>   mount $DEV $MNT
>>
>>   # Do a direct IO write with O_DSYNC into a non-aligned range...
>>   xfs_io -f -d -s -c "pwrite -S 0xab -b 64K 1111 64K" $MNT/foobar
>>
>>   umount $MNT
>>
>> When running the reproducer an assertion fails and produces the following
>> trace:
>>
>>   [ 2418.403134] assertion failed: !current->journal_info || flush != BTRFS_RESERVE_FLUSH_DATA, in fs/btrfs/space-info.c:1467
>>   [ 2418.403745] ------------[ cut here ]------------
>>   [ 2418.404306] kernel BUG at fs/btrfs/ctree.h:3286!
>>   [ 2418.404862] invalid opcode: 0000 [#2] PREEMPT SMP DEBUG_PAGEALLOC PTI
>>   [ 2418.405451] CPU: 1 PID: 64705 Comm: xfs_io Tainted: G      D           5.10.15-btrfs-next-87 #1
>>   [ 2418.406026] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
>>   [ 2418.407228] RIP: 0010:assertfail.constprop.0+0x18/0x26 [btrfs]
>>   [ 2418.407835] Code: e6 48 c7 (...)
>>   [ 2418.409078] RSP: 0018:ffffb06080d13c98 EFLAGS: 00010246
>>   [ 2418.409696] RAX: 000000000000006c RBX: ffff994c1debbf08 RCX: 0000000000000000
>>   [ 2418.410302] RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff
>>   [ 2418.410904] RBP: ffff994c21770000 R08: 0000000000000000 R09: 0000000000000000
>>   [ 2418.411504] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000010000
>>   [ 2418.412111] R13: ffff994c22198400 R14: ffff994c21770000 R15: 0000000000000000
>>   [ 2418.412713] FS:  00007f54fd7aff00(0000) GS:ffff994d35200000(0000) knlGS:0000000000000000
>>   [ 2418.413326] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>   [ 2418.413933] CR2: 000056549596d000 CR3: 000000010b928003 CR4: 0000000000370ee0
>>   [ 2418.414528] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>   [ 2418.415109] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>   [ 2418.415669] Call Trace:
>>   [ 2418.416254]  btrfs_reserve_data_bytes.cold+0x22/0x22 [btrfs]
>>   [ 2418.416812]  btrfs_check_data_free_space+0x4c/0xa0 [btrfs]
>>   [ 2418.417380]  btrfs_buffered_write+0x1b0/0x7f0 [btrfs]
>>   [ 2418.418315]  btrfs_file_write_iter+0x2a9/0x770 [btrfs]
>>   [ 2418.418920]  new_sync_write+0x11f/0x1c0
>>   [ 2418.419430]  vfs_write+0x2bb/0x3b0
>>   [ 2418.419972]  __x64_sys_pwrite64+0x90/0xc0
>>   [ 2418.420486]  do_syscall_64+0x33/0x80
>>   [ 2418.420979]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>   [ 2418.421486] RIP: 0033:0x7f54fda0b986
>>   [ 2418.421981] Code: 48 c7 c0 (...)
>>   [ 2418.423019] RSP: 002b:00007ffc40569c38 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
>>   [ 2418.423547] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f54fda0b986
>>   [ 2418.424075] RDX: 0000000000010000 RSI: 000056549595e000 RDI: 0000000000000003
>>   [ 2418.424596] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000400
>>   [ 2418.425119] R10: 0000000000000400 R11: 0000000000000246 R12: 00000000ffffffff
>>   [ 2418.425644] R13: 0000000000000400 R14: 0000000000010000 R15: 0000000000000000
>>   [ 2418.426148] Modules linked in: btrfs blake2b_generic (...)
>>   [ 2418.429540] ---[ end trace ef2aeb44dc0afa34 ]---
>>
>> 1) At btrfs_file_write_iter() we set current->journal_info to
>>    BTRFS_DIO_SYNC_STUB;
>>
>> 2) We then call __btrfs_direct_write(), which calls btrfs_direct_IO();
>>
>> 3) We can't do the direct IO write because it starts at a non-aligned
>>    offset (1111). So at btrfs_direct_IO() we return -EINVAL (coming from
>>    check_direct_IO() which does the alignment check), but we leave
>>    current->journal_info set to BTRFS_DIO_SYNC_STUB - we only clear it
>>    at btrfs_dio_iomap_begin(), because we assume we always get there;
>>
>> 4) Then at __btrfs_direct_write() we see that the attempt to do the
>>    direct IO write was not successful, 0 bytes written, so we fallback
>>    to a buffered write by calling btrfs_buffered_write();
>>
>> 5) There we call btrfs_check_data_free_space() which in turn calls
>>    btrfs_alloc_data_chunk_ondemand() and that calls
>>    btrfs_reserve_data_bytes() with flush == BTRFS_RESERVE_FLUSH_DATA;
>>
>> 6) Then at btrfs_reserve_data_bytes() we have current->journal_info set to
>>    BTRFS_DIO_SYNC_STUB, therefore not NULL, and flush has the value
>>    BTRFS_RESERVE_FLUSH_DATA, triggering the second assertion:
>>
>>   int btrfs_reserve_data_bytes(struct btrfs_fs_info *fs_info, u64 bytes,
>>                                enum btrfs_reserve_flush_enum flush)
>>   {
>>       struct btrfs_space_info *data_sinfo = fs_info->data_sinfo;
>>       int ret;
>>
>>       ASSERT(flush == BTRFS_RESERVE_FLUSH_DATA ||
>>              flush == BTRFS_RESERVE_FLUSH_FREE_SPACE_INODE);
>>       ASSERT(!current->journal_info || flush != BTRFS_RESERVE_FLUSH_DATA);
>>   (...)
>>
>> So fix that by setting the journal to NULL whenever check_direct_IO()
>> returns a failure.
>>
>> This bug only affects 5.10 kernels, and the regression was introduced in
>> 5.10-rc1 by commit 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround").
>> The bug does not exist in 5.11 kernels due to commit ecfdc08b8cc65d
>> ("btrfs: remove dio iomap DSYNC workaround"), which depends on a large
>> patchset that went into the merge window for 5.11. So this is a fix only
>> for 5.10.x stable kernels, as there are people hitting this bug.
>>
>> Fixes: 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround")
>> CC: stable@vger.kernel.org # 5.10 (and only 5.10)
>> CC: David Sterba <dsterba@suse.cz>
>> Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1181605
>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>> ---
>>  fs/btrfs/inode.c | 6 +++++-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> As this is a one-off patch, I need the btrfs maintainers to ack this and
> really justify why we can't take the larger patch or patch series here
> instead, as that is almost always the correct thing to do instead.

David will likely reply soon, but from another thread:

https://lore.kernel.org/linux-btrfs/20210216142324.GO1993@twin.jikos.cz/

Thanks.
> 
> thanks,
> 
> greg k-h
>
Greg Kroah-Hartman Feb. 16, 2021, 3:34 p.m. UTC | #2
On Tue, Feb 16, 2021 at 04:15:46PM +0100, David Sterba wrote:
> On Tue, Feb 16, 2021 at 03:50:36PM +0100, Greg KH wrote:
> > On Tue, Feb 16, 2021 at 02:40:31PM +0000, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > > 
> > > Whenever we attempt to do a non-aligned direct IO write with O_DSYNC, we
> > > end up triggering an assertion and crashing. Example reproducer:
> > > 
> > >   $ cat test.sh
> > >   #!/bin/bash
> > > 
> > >   DEV=/dev/sdj
> > >   MNT=/mnt/sdj
> > > 
> > >   mkfs.btrfs -f $DEV > /dev/null
> > >   mount $DEV $MNT
> > > 
> > >   # Do a direct IO write with O_DSYNC into a non-aligned range...
> > >   xfs_io -f -d -s -c "pwrite -S 0xab -b 64K 1111 64K" $MNT/foobar
> > > 
> > >   umount $MNT
> > > 
> > > When running the reproducer an assertion fails and produces the following
> > > trace:
> > > 
> > >   [ 2418.403134] assertion failed: !current->journal_info || flush != BTRFS_RESERVE_FLUSH_DATA, in fs/btrfs/space-info.c:1467
> > >   [ 2418.403745] ------------[ cut here ]------------
> > >   [ 2418.404306] kernel BUG at fs/btrfs/ctree.h:3286!
> > >   [ 2418.404862] invalid opcode: 0000 [#2] PREEMPT SMP DEBUG_PAGEALLOC PTI
> > >   [ 2418.405451] CPU: 1 PID: 64705 Comm: xfs_io Tainted: G      D           5.10.15-btrfs-next-87 #1
> > >   [ 2418.406026] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> > >   [ 2418.407228] RIP: 0010:assertfail.constprop.0+0x18/0x26 [btrfs]
> > >   [ 2418.407835] Code: e6 48 c7 (...)
> > >   [ 2418.409078] RSP: 0018:ffffb06080d13c98 EFLAGS: 00010246
> > >   [ 2418.409696] RAX: 000000000000006c RBX: ffff994c1debbf08 RCX: 0000000000000000
> > >   [ 2418.410302] RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff
> > >   [ 2418.410904] RBP: ffff994c21770000 R08: 0000000000000000 R09: 0000000000000000
> > >   [ 2418.411504] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000010000
> > >   [ 2418.412111] R13: ffff994c22198400 R14: ffff994c21770000 R15: 0000000000000000
> > >   [ 2418.412713] FS:  00007f54fd7aff00(0000) GS:ffff994d35200000(0000) knlGS:0000000000000000
> > >   [ 2418.413326] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >   [ 2418.413933] CR2: 000056549596d000 CR3: 000000010b928003 CR4: 0000000000370ee0
> > >   [ 2418.414528] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > >   [ 2418.415109] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > >   [ 2418.415669] Call Trace:
> > >   [ 2418.416254]  btrfs_reserve_data_bytes.cold+0x22/0x22 [btrfs]
> > >   [ 2418.416812]  btrfs_check_data_free_space+0x4c/0xa0 [btrfs]
> > >   [ 2418.417380]  btrfs_buffered_write+0x1b0/0x7f0 [btrfs]
> > >   [ 2418.418315]  btrfs_file_write_iter+0x2a9/0x770 [btrfs]
> > >   [ 2418.418920]  new_sync_write+0x11f/0x1c0
> > >   [ 2418.419430]  vfs_write+0x2bb/0x3b0
> > >   [ 2418.419972]  __x64_sys_pwrite64+0x90/0xc0
> > >   [ 2418.420486]  do_syscall_64+0x33/0x80
> > >   [ 2418.420979]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > >   [ 2418.421486] RIP: 0033:0x7f54fda0b986
> > >   [ 2418.421981] Code: 48 c7 c0 (...)
> > >   [ 2418.423019] RSP: 002b:00007ffc40569c38 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
> > >   [ 2418.423547] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f54fda0b986
> > >   [ 2418.424075] RDX: 0000000000010000 RSI: 000056549595e000 RDI: 0000000000000003
> > >   [ 2418.424596] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000400
> > >   [ 2418.425119] R10: 0000000000000400 R11: 0000000000000246 R12: 00000000ffffffff
> > >   [ 2418.425644] R13: 0000000000000400 R14: 0000000000010000 R15: 0000000000000000
> > >   [ 2418.426148] Modules linked in: btrfs blake2b_generic (...)
> > >   [ 2418.429540] ---[ end trace ef2aeb44dc0afa34 ]---
> > > 
> > > 1) At btrfs_file_write_iter() we set current->journal_info to
> > >    BTRFS_DIO_SYNC_STUB;
> > > 
> > > 2) We then call __btrfs_direct_write(), which calls btrfs_direct_IO();
> > > 
> > > 3) We can't do the direct IO write because it starts at a non-aligned
> > >    offset (1111). So at btrfs_direct_IO() we return -EINVAL (coming from
> > >    check_direct_IO() which does the alignment check), but we leave
> > >    current->journal_info set to BTRFS_DIO_SYNC_STUB - we only clear it
> > >    at btrfs_dio_iomap_begin(), because we assume we always get there;
> > > 
> > > 4) Then at __btrfs_direct_write() we see that the attempt to do the
> > >    direct IO write was not successful, 0 bytes written, so we fallback
> > >    to a buffered write by calling btrfs_buffered_write();
> > > 
> > > 5) There we call btrfs_check_data_free_space() which in turn calls
> > >    btrfs_alloc_data_chunk_ondemand() and that calls
> > >    btrfs_reserve_data_bytes() with flush == BTRFS_RESERVE_FLUSH_DATA;
> > > 
> > > 6) Then at btrfs_reserve_data_bytes() we have current->journal_info set to
> > >    BTRFS_DIO_SYNC_STUB, therefore not NULL, and flush has the value
> > >    BTRFS_RESERVE_FLUSH_DATA, triggering the second assertion:
> > > 
> > >   int btrfs_reserve_data_bytes(struct btrfs_fs_info *fs_info, u64 bytes,
> > >                                enum btrfs_reserve_flush_enum flush)
> > >   {
> > >       struct btrfs_space_info *data_sinfo = fs_info->data_sinfo;
> > >       int ret;
> > > 
> > >       ASSERT(flush == BTRFS_RESERVE_FLUSH_DATA ||
> > >              flush == BTRFS_RESERVE_FLUSH_FREE_SPACE_INODE);
> > >       ASSERT(!current->journal_info || flush != BTRFS_RESERVE_FLUSH_DATA);
> > >   (...)
> > > 
> > > So fix that by setting the journal to NULL whenever check_direct_IO()
> > > returns a failure.
> > > 
> > > This bug only affects 5.10 kernels, and the regression was introduced in
> > > 5.10-rc1 by commit 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround").
> > > The bug does not exist in 5.11 kernels due to commit ecfdc08b8cc65d
> > > ("btrfs: remove dio iomap DSYNC workaround"), which depends on a large
> > > patchset that went into the merge window for 5.11. So this is a fix only
> > > for 5.10.x stable kernels, as there are people hitting this bug.
> > > 
> > > Fixes: 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround")
> > > CC: stable@vger.kernel.org # 5.10 (and only 5.10)
> > > CC: David Sterba <dsterba@suse.cz>
> > > Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1181605
> > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > > ---
> > >  fs/btrfs/inode.c | 6 +++++-
> > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > As this is a one-off patch, I need the btrfs maintainers to ack this and
> > really justify why we can't take the larger patch or patch series here
> > instead, as that is almost always the correct thing to do instead.
> 
> Acked-by: David Sterba <dsterba@suse.com>
> 
> The full backport would be patches
> 
> ecfdc08b8cc6 btrfs: remove dio iomap DSYNC workaround
> a42fa643169d btrfs: call iomap_dio_complete() without inode_lock
> 502756b38093 btrfs: remove btrfs_inode::dio_sem
> e9adabb9712e btrfs: use shared lock for direct writes within EOF
> c35237063340 btrfs: push inode locking and unlocking into buffered/direct write
> a14b78ad06ab btrfs: introduce btrfs_inode_lock()/unlock()
> b8d8e1fd570a btrfs: introduce btrfs_write_check()
> 
> and maybe more.
> 
> $ git diff b8d8e1fd570a^..ecfdc08b8cc6 | diffstat
>  btrfs_inode.h |   10 -
>  ctree.h       |    8 +
>  file.c        |  338 +++++++++++++++++++++++++++-------------------------------
>  inode.c       |   96 +++++++---------
>  transaction.h |    1 
>  5 files changed, 213 insertions(+), 240 deletions(-)
> 
> That seems too much for a backport, the fix Filipe implemented is
> simpler and IMO qualifies as the exceptional stable-only patch.

Why is that too much?  For 7 patches that's a small overall diffstat.
And you match identically what is upstream in Linus's tree.  That means
over time, backporting fixing is much easier, and understanding the code
for everyone is simpler.

It's almost always better to track what is in Linus's tree than to do
one-off patches as 95% of the time we do one-off patches they are buggy
and cause problems as no one else is running them.

So how about sending the above backported series instead please.

thanks,

greg k-h
David Sterba Feb. 16, 2021, 5:52 p.m. UTC | #3
On Tue, Feb 16, 2021 at 04:34:27PM +0100, Greg KH wrote:
> On Tue, Feb 16, 2021 at 04:15:46PM +0100, David Sterba wrote:
> > On Tue, Feb 16, 2021 at 03:50:36PM +0100, Greg KH wrote:
> > > On Tue, Feb 16, 2021 at 02:40:31PM +0000, fdmanana@kernel.org wrote:
> > > As this is a one-off patch, I need the btrfs maintainers to ack this and
> > > really justify why we can't take the larger patch or patch series here
> > > instead, as that is almost always the correct thing to do instead.
> > 
> > Acked-by: David Sterba <dsterba@suse.com>
> > 
> > The full backport would be patches
> > 
> > ecfdc08b8cc6 btrfs: remove dio iomap DSYNC workaround
> > a42fa643169d btrfs: call iomap_dio_complete() without inode_lock
> > 502756b38093 btrfs: remove btrfs_inode::dio_sem
> > e9adabb9712e btrfs: use shared lock for direct writes within EOF
> > c35237063340 btrfs: push inode locking and unlocking into buffered/direct write
> > a14b78ad06ab btrfs: introduce btrfs_inode_lock()/unlock()
> > b8d8e1fd570a btrfs: introduce btrfs_write_check()
> > 
> > and maybe more.
> > 
> > $ git diff b8d8e1fd570a^..ecfdc08b8cc6 | diffstat
> >  btrfs_inode.h |   10 -
> >  ctree.h       |    8 +
> >  file.c        |  338 +++++++++++++++++++++++++++-------------------------------
> >  inode.c       |   96 +++++++---------
> >  transaction.h |    1 
> >  5 files changed, 213 insertions(+), 240 deletions(-)
> > 
> > That seems too much for a backport, the fix Filipe implemented is
> > simpler and IMO qualifies as the exceptional stable-only patch.
> 
> Why is that too much?  For 7 patches that's a small overall diffstat.
> And you match identically what is upstream in Linus's tree.  That means
> over time, backporting fixing is much easier, and understanding the code
> for everyone is simpler.

The changes are not trivial and touch eg. inode locking and other
subsystems (iomap), so they're not self contained inside btrfs. And the
list of possibly related patches is not entirely known at this moment,
the above is an example that was obvious, but Filipe has expressed
doubts that it's complete and I agree.

Backporting them to 5.10.x would need same amount of testing and
validation that the 5.11 version got during the whole development cycle.

> It's almost always better to track what is in Linus's tree than to do
> one-off patches as 95% of the time we do one-off patches they are buggy
> and cause problems as no one else is running them.

While I understand that concern in general, in this case it's trading
changes by lots of code with a targeted fix with a reproducer, basically
fixing the buggy error handling path.

> So how about sending the above backported series instead please.

Considering the risk I don't want to do that.
Greg Kroah-Hartman Feb. 22, 2021, 11:07 a.m. UTC | #4
On Tue, Feb 16, 2021 at 06:52:21PM +0100, David Sterba wrote:
> On Tue, Feb 16, 2021 at 04:34:27PM +0100, Greg KH wrote:

> > On Tue, Feb 16, 2021 at 04:15:46PM +0100, David Sterba wrote:

> > > On Tue, Feb 16, 2021 at 03:50:36PM +0100, Greg KH wrote:

> > > > On Tue, Feb 16, 2021 at 02:40:31PM +0000, fdmanana@kernel.org wrote:

> > > > As this is a one-off patch, I need the btrfs maintainers to ack this and

> > > > really justify why we can't take the larger patch or patch series here

> > > > instead, as that is almost always the correct thing to do instead.

> > > 

> > > Acked-by: David Sterba <dsterba@suse.com>

> > > 

> > > The full backport would be patches

> > > 

> > > ecfdc08b8cc6 btrfs: remove dio iomap DSYNC workaround

> > > a42fa643169d btrfs: call iomap_dio_complete() without inode_lock

> > > 502756b38093 btrfs: remove btrfs_inode::dio_sem

> > > e9adabb9712e btrfs: use shared lock for direct writes within EOF

> > > c35237063340 btrfs: push inode locking and unlocking into buffered/direct write

> > > a14b78ad06ab btrfs: introduce btrfs_inode_lock()/unlock()

> > > b8d8e1fd570a btrfs: introduce btrfs_write_check()

> > > 

> > > and maybe more.

> > > 

> > > $ git diff b8d8e1fd570a^..ecfdc08b8cc6 | diffstat

> > >  btrfs_inode.h |   10 -

> > >  ctree.h       |    8 +

> > >  file.c        |  338 +++++++++++++++++++++++++++-------------------------------

> > >  inode.c       |   96 +++++++---------

> > >  transaction.h |    1 

> > >  5 files changed, 213 insertions(+), 240 deletions(-)

> > > 

> > > That seems too much for a backport, the fix Filipe implemented is

> > > simpler and IMO qualifies as the exceptional stable-only patch.

> > 

> > Why is that too much?  For 7 patches that's a small overall diffstat.

> > And you match identically what is upstream in Linus's tree.  That means

> > over time, backporting fixing is much easier, and understanding the code

> > for everyone is simpler.

> 

> The changes are not trivial and touch eg. inode locking and other

> subsystems (iomap), so they're not self contained inside btrfs. And the

> list of possibly related patches is not entirely known at this moment,

> the above is an example that was obvious, but Filipe has expressed

> doubts that it's complete and I agree.

> 

> Backporting them to 5.10.x would need same amount of testing and

> validation that the 5.11 version got during the whole development cycle.

> 

> > It's almost always better to track what is in Linus's tree than to do

> > one-off patches as 95% of the time we do one-off patches they are buggy

> > and cause problems as no one else is running them.

> 

> While I understand that concern in general, in this case it's trading

> changes by lots of code with a targeted fix with a reproducer, basically

> fixing the buggy error handling path.

> 

> > So how about sending the above backported series instead please.

> 

> Considering the risk I don't want to do that.


Ok, thanks, now queued up.

greg k-h
diff mbox series

Patch

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index acc47e2ffb46..b536d21541a9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8026,8 +8026,12 @@  ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	bool relock = false;
 	ssize_t ret;
 
-	if (check_direct_IO(fs_info, iter, offset))
+	if (check_direct_IO(fs_info, iter, offset)) {
+		ASSERT(current->journal_info == NULL ||
+		       current->journal_info == BTRFS_DIO_SYNC_STUB);
+		current->journal_info = NULL;
 		return 0;
+	}
 
 	count = iov_iter_count(iter);
 	if (iov_iter_rw(iter) == WRITE) {