[5.15,31/33] btrfs: fallback to blocking mode when doing async dio over multiple extents

From: Filipe Manana <fdmanana@suse.com>

From: Filipe Manana <fdmanana@suse.com>

commit ca93e44bfb5fd7996b76f0f544999171f647f93b upstream

Some users recently reported that MariaDB was getting a read corruption
when using io_uring on top of btrfs. This started to happen in 5.16,
after commit 51bd9563b6783d ("btrfs: fix deadlock due to page faults
during direct IO reads and writes"). That changed btrfs to use the new
iomap flag IOMAP_DIO_PARTIAL and to disable page faults before calling
iomap_dio_rw(). This was necessary to fix deadlocks when the iovector
corresponds to a memory mapped file region. That type of scenario is
exercised by test case generic/647 from fstests.

For this MariaDB scenario, we attempt to read 16K from file offset X
using IOCB_NOWAIT and io_uring. In that range we have 4 extents, each
with a size of 4K, and what happens is the following:

1) btrfs_direct_read() disables page faults and calls iomap_dio_rw();

2) iomap creates a struct iomap_dio object, its reference count is
   initialized to 1 and its ->size field is initialized to 0;

3) iomap calls btrfs_dio_iomap_begin() with file offset X, which finds
   the first 4K extent, and setups an iomap for this extent consisting
   of a single page;

4) At iomap_dio_bio_iter(), we are able to access the first page of the
   buffer (struct iov_iter) with bio_iov_iter_get_pages() without
   triggering a page fault;

5) iomap submits a bio for this 4K extent
   (iomap_dio_submit_bio() -> btrfs_submit_direct()) and increments
   the refcount on the struct iomap_dio object to 2; The ->size field
   of the struct iomap_dio object is incremented to 4K;

6) iomap calls btrfs_iomap_begin() again, this time with a file
   offset of X + 4K. There we setup an iomap for the next extent
   that also has a size of 4K;

7) Then at iomap_dio_bio_iter() we call bio_iov_iter_get_pages(),
   which tries to access the next page (2nd page) of the buffer.
   This triggers a page fault and returns -EFAULT;

8) At __iomap_dio_rw() we see the -EFAULT, but we reset the error
   to 0 because we passed the flag IOMAP_DIO_PARTIAL to iomap and
   the struct iomap_dio object has a ->size value of 4K (we submitted
   a bio for an extent already). The 'wait_for_completion' variable
   is not set to true, because our iocb has IOCB_NOWAIT set;

9) At the bottom of __iomap_dio_rw(), we decrement the reference count
   of the struct iomap_dio object from 2 to 1. Because we were not
   the only ones holding a reference on it and 'wait_for_completion' is
   set to false, -EIOCBQUEUED is returned to btrfs_direct_read(), which
   just returns it up the callchain, up to io_uring;

10) The bio submitted for the first extent (step 5) completes and its
    bio endio function, iomap_dio_bio_end_io(), decrements the last
    reference on the struct iomap_dio object, resulting in calling
    iomap_dio_complete_work() -> iomap_dio_complete().

11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K
    and return 4K (the amount of io done) to iomap_dio_complete_work();

12) iomap_dio_complete_work() calls the iocb completion callback,
    iocb->ki_complete() with a second argument value of 4K (total io
    done) and the iocb with the adjust ki_pos of X + 4K. This results
    in completing the read request for io_uring, leaving it with a
    result of 4K bytes read, and only the first page of the buffer
    filled in, while the remaining 3 pages, corresponding to the other
    3 extents, were not filled;

13) For the application, the result is unexpected because if we ask
    to read N bytes, it expects to get N bytes read as long as those
    N bytes don't cross the EOF (i_size).

MariaDB reports this as an error, as it's not expecting a short read,
since it knows it's asking for read operations fully within the i_size
boundary. This is typical in many applications, but it may also be
questionable if they should react to such short reads by issuing more
read calls to get the remaining data. Nevertheless, the short read
happened due to a change in btrfs regarding how it deals with page
faults while in the middle of a read operation, and there's no reason
why btrfs can't have the previous behaviour of returning the whole data
that was requested by the application.

The problem can also be triggered with the following simple program:

  /* Get O_DIRECT */
  #ifndef _GNU_SOURCE
  #define _GNU_SOURCE
  #endif

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <fcntl.h>
  #include <errno.h>
  #include <string.h>
  #include <liburing.h>

  int main(int argc, char *argv[])
  {
      char *foo_path;
      struct io_uring ring;
      struct io_uring_sqe *sqe;
      struct io_uring_cqe *cqe;
      struct iovec iovec;
      int fd;
      long pagesize;
      void *write_buf;
      void *read_buf;
      ssize_t ret;
      int i;

      if (argc != 2) {
          fprintf(stderr, "Use: %s <directory>\n", argv[0]);
          return 1;
      }

      foo_path = malloc(strlen(argv[1]) + 5);
      if (!foo_path) {
          fprintf(stderr, "Failed to allocate memory for file path\n");
          return 1;
      }
      strcpy(foo_path, argv[1]);
      strcat(foo_path, "/foo");

      /*
       * Create file foo with 2 extents, each with a size matching
       * the page size. Then allocate a buffer to read both extents
       * with io_uring, using O_DIRECT and IOCB_NOWAIT. Before doing
       * the read with io_uring, access the first page of the buffer
       * to fault it in, so that during the read we only trigger a
       * page fault when accessing the second page of the buffer.
       */
       fd = open(foo_path, O_CREAT | O_TRUNC | O_WRONLY |
                O_DIRECT, 0666);
       if (fd == -1) {
           fprintf(stderr,
                   "Failed to create file 'foo': %s (errno %d)",
                   strerror(errno), errno);
           return 1;
       }

       pagesize = sysconf(_SC_PAGE_SIZE);
       ret = posix_memalign(&write_buf, pagesize, 2 * pagesize);
       if (ret) {
           fprintf(stderr, "Failed to allocate write buffer\n");
           return 1;
       }

       memset(write_buf, 0xab, pagesize);
       memset(write_buf + pagesize, 0xcd, pagesize);

       /* Create 2 extents, each with a size matching page size. */
       for (i = 0; i < 2; i++) {
           ret = pwrite(fd, write_buf + i * pagesize, pagesize,
                        i * pagesize);
           if (ret != pagesize) {
               fprintf(stderr,
                     "Failed to write to file, ret = %ld errno %d (%s)\n",
                      ret, errno, strerror(errno));
               return 1;
           }
           ret = fsync(fd);
           if (ret != 0) {
               fprintf(stderr, "Failed to fsync file\n");
               return 1;
           }
       }

       close(fd);
       fd = open(foo_path, O_RDONLY | O_DIRECT);
       if (fd == -1) {
           fprintf(stderr,
                   "Failed to open file 'foo': %s (errno %d)",
                   strerror(errno), errno);
           return 1;
       }

       ret = posix_memalign(&read_buf, pagesize, 2 * pagesize);
       if (ret) {
           fprintf(stderr, "Failed to allocate read buffer\n");
           return 1;
       }

       /*
        * Fault in only the first page of the read buffer.
        * We want to trigger a page fault for the 2nd page of the
        * read buffer during the read operation with io_uring
        * (O_DIRECT and IOCB_NOWAIT).
        */
       memset(read_buf, 0, 1);

       ret = io_uring_queue_init(1, &ring, 0);
       if (ret != 0) {
           fprintf(stderr, "Failed to create io_uring queue\n");
           return 1;
       }

       sqe = io_uring_get_sqe(&ring);
       if (!sqe) {
           fprintf(stderr, "Failed to get io_uring sqe\n");
           return 1;
       }

       iovec.iov_base = read_buf;
       iovec.iov_len = 2 * pagesize;
       io_uring_prep_readv(sqe, fd, &iovec, 1, 0);

       ret = io_uring_submit_and_wait(&ring, 1);
       if (ret != 1) {
           fprintf(stderr,
                   "Failed at io_uring_submit_and_wait()\n");
           return 1;
       }

       ret = io_uring_wait_cqe(&ring, &cqe);
       if (ret < 0) {
           fprintf(stderr, "Failed at io_uring_wait_cqe()\n");
           return 1;
       }

       printf("io_uring read result for file foo:\n\n");
       printf("  cqe->res == %d (expected %d)\n", cqe->res, 2 * pagesize);
       printf("  memcmp(read_buf, write_buf) == %d (expected 0)\n",
              memcmp(read_buf, write_buf, 2 * pagesize));

       io_uring_cqe_seen(&ring, cqe);
       io_uring_queue_exit(&ring);

       return 0;
  }

When running it on an unpatched kernel:

  $ gcc io_uring_test.c -luring
  $ mkfs.btrfs -f /dev/sda
  $ mount /dev/sda /mnt/sda
  $ ./a.out /mnt/sda
  io_uring read result for file foo:

    cqe->res == 4096 (expected 8192)
    memcmp(read_buf, write_buf) == -205 (expected 0)

After this patch, the read always returns 8192 bytes, with the buffer
filled with the correct data. Although that reproducer always triggers
the bug in my test vms, it's possible that it will not be so reliable
on other environments, as that can happen if the bio for the first
extent completes and decrements the reference on the struct iomap_dio
object before we do the atomic_dec_and_test() on the reference at
__iomap_dio_rw().

Fix this in btrfs by having btrfs_dio_iomap_begin() return -EAGAIN
whenever we try to satisfy a non blocking IO request (IOMAP_NOWAIT flag
set) over a range that spans multiple extents (or a mix of extents and
holes). This avoids returning success to the caller when we only did
partial IO, which is not optimal for writes and for reads it's actually
incorrect, as the caller doesn't expect to get less bytes read than it has
requested (unless EOF is crossed), as previously mentioned. This is also
the type of behaviour that xfs follows (xfs_direct_write_iomap_begin()),
even though it doesn't use IOMAP_DIO_PARTIAL.

A test case for fstests will follow soon.

Link: https://lore.kernel.org/linux-btrfs/CABVffEM0eEWho+206m470rtM0d9J8ue85TtR-A_oVTuGLWFicA@mail.gmail.com/
Link: https://lore.kernel.org/linux-btrfs/CAHF2GV6U32gmqSjLe=XKgfcZAmLCiH26cJ2OnHGp5x=VAH4OHQ@mail.gmail.com/
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 fs/btrfs/inode.c |   28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

Message ID	20220429104053.239796659@linuxfoundation.org
State	New
Headers	show Return-Path: <stable-owner@kernel.org> X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 37FEEC433F5 for <stable@archiver.kernel.org>; Fri, 29 Apr 2022 10:45:48 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233440AbiD2KtC (ORCPT <rfc822;stable@archiver.kernel.org>); Fri, 29 Apr 2022 06:49:02 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60000 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1358088AbiD2KsS (ORCPT <rfc822;stable@vger.kernel.org>); Fri, 29 Apr 2022 06:48:18 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [IPv6:2604:1380:4641:c500::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E3C51C8A83; Fri, 29 Apr 2022 03:43:41 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 01DD660DE9; Fri, 29 Apr 2022 10:43:41 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id E8436C385A4; Fri, 29 Apr 2022 10:43:39 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=linuxfoundation.org; s=korg; t=1651229020; bh=Cc4bwEM5ZvakBvbRo2P3NpCCXdt4Zuwj1eVdDCGqlDE=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=kVzpLfeogDTz795soTtwXCNS+5OQgen93DhK+TIvabSMclW2qDnA6NSJm+Y4Ga61M 4s3vEEwx382KB3KLzM5Aq4JtRZvYa190Glo/IcHPSpAqCx6rQ5cv82NAN1y8p8wKbr XVIfm2p53sQsDiHWGVJr08K9NIRkjV+8NtPHt5w8= From: Greg Kroah-Hartman <gregkh@linuxfoundation.org> To: linux-kernel@vger.kernel.org, stable@vger.kernel.org Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>, Josef Bacik <josef@toxicpanda.com>, Filipe Manana <fdmanana@suse.com>, David Sterba <dsterba@suse.com>, Anand Jain <anand.jain@oracle.com> Subject: [PATCH 5.15 31/33] btrfs: fallback to blocking mode when doing async dio over multiple extents Date: Fri, 29 Apr 2022 12:42:18 +0200 Message-Id: <20220429104053.239796659@linuxfoundation.org> X-Mailer: git-send-email 2.36.0 In-Reply-To: <20220429104052.345760505@linuxfoundation.org> References: <20220429104052.345760505@linuxfoundation.org> User-Agent: quilt/0.66 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: <stable.vger.kernel.org> X-Mailing-List: stable@vger.kernel.org
Series	None \| expand [5.15,02/33] bpf: Introduce composable reg, ret and arg types. [5.15,03/33] bpf: Replace ARG_XXX_OR_NULL with ARG_XXX \| PTR_MAYBE_NULL [5.15,04/33] bpf: Replace RET_XXX_OR_NULL with RET_XXX \| PTR_MAYBE_NULL [5.15,05/33] bpf: Replace PTR_TO_XXX_OR_NULL with PTR_TO_XXX \| PTR_MAYBE_NULL [5.15,06/33] bpf: Introduce MEM_RDONLY flag [5.15,07/33] bpf: Convert PTR_TO_MEM_OR_NULL to composable types. [5.15,08/33] bpf: Make per_cpu_ptr return rdonly PTR_TO_MEM. [5.15,09/33] bpf: Add MEM_RDONLY for helper args that are pointers to rdonly mem. [5.15,10/33] bpf/selftests: Test PTR_TO_RDONLY_MEM [5.15,11/33] bpf: Fix crash due to out of bounds access into reg2btf_ids. [5.15,12/33] spi: cadence-quadspi: fix write completion support [5.15,13/33] ARM: dts: socfpga: change qspi to "intel,socfpga-qspi" [5.15,14/33] mm: kfence: fix objcgs vector allocation [5.15,15/33] gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable} [5.15,16/33] iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable [5.15,17/33] iov_iter: Introduce fault_in_iov_iter_writeable [5.15,18/33] gfs2: Add wrapper for iomap_file_buffered_write [5.15,19/33] gfs2: Clean up function may_grant [5.15,20/33] gfs2: Introduce flag for glock holder auto-demotion [5.15,21/33] gfs2: Move the inode glock locking to gfs2_file_buffered_write [5.15,22/33] gfs2: Eliminate ip->i_gh [5.15,23/33] gfs2: Fix mmap + page fault deadlocks for buffered I/O [5.15,24/33] iomap: Fix iomap_dio_rw return value for user copies [5.15,25/33] iomap: Support partial direct I/O on user copy failures [5.15,26/33] iomap: Add done_before argument to iomap_dio_rw [5.15,27/33] gup: Introduce FOLL_NOFAULT flag to disable page faults [5.15,28/33] iov_iter: Introduce nofault flag to disable page faults [5.15,29/33] gfs2: Fix mmap + page fault deadlocks for direct I/O [5.15,30/33] btrfs: fix deadlock due to page faults during direct IO reads and writes [5.15,31/33] btrfs: fallback to blocking mode when doing async dio over multiple extents [5.15,32/33] mm: gup: make fault_in_safe_writeable() use fixup_user_fault() [5.15,33/33] selftests/bpf: Add test for reg2btf_ids out of bounds access

[5.15,31/33] btrfs: fallback to blocking mode when doing async dio over multiple extents

Commit Message

Patch