From patchwork Wed Feb 5 00:02:47 2025
X-Patchwork-Submitter: Viacheslav Dubeyko
X-Patchwork-Id: 862171
From: Viacheslav Dubeyko
To: ceph-devel@vger.kernel.org
Cc: idryomov@gmail.com, dhowells@redhat.com, linux-fsdevel@vger.kernel.org,
    pdonnell@redhat.com, amarkuze@redhat.com, Slava.Dubeyko@ibm.com,
    slava@dubeyko.com
Subject: [RFC PATCH 2/4] ceph: introduce ceph_process_folio_batch() method
Date: Tue, 4 Feb 2025 16:02:47 -0800
Message-ID: <20250205000249.123054-3-slava@dubeyko.com>
In-Reply-To: <20250205000249.123054-1-slava@dubeyko.com>
References: <20250205000249.123054-1-slava@dubeyko.com>

The first step of the ceph_writepages_start() logic is finding the dirty
memory folios and processing them. This patch introduces the
ceph_process_folio_batch() method, which moves this logic into a
dedicated function.

ceph_writepages_start() currently has this logic:

    if (ceph_wbc.locked_pages == 0)
        lock_page(page);  /* first page */
    else if (!trylock_page(page))
        break;

    if (folio_test_writeback(folio) ||
        folio_test_private_2(folio) /* [DEPRECATED] */) {
        if (wbc->sync_mode == WB_SYNC_NONE) {
            doutc(cl, "%p under writeback\n", folio);
            folio_unlock(folio);
            continue;
        }
        doutc(cl, "waiting on writeback %p\n", folio);
        folio_wait_writeback(folio);
        folio_wait_private_2(folio); /* [DEPRECATED] */
    }

The problem here is that the folio/page is locked first, and only later
is it marked by set_page_writeback(page), before the write request is
submitted. The folio/page is then unlocked by writepages_finish() after
the write request completes. This means the folio_test_writeback() and
folio_wait_writeback() logic never triggers: a page under writeback is
already locked and cannot be locked again until the write request
completes. However, for the majority of folios/pages trylock_page() is
used. As a result, multiple threads can try to lock the same
folios/pages multiple times, even when they are already under writeback,
which makes this logic more compute-intensive than necessary.

This patch changes the logic to:

    if (folio_test_writeback(folio) ||
        folio_test_private_2(folio) /* [DEPRECATED] */) {
        if (wbc->sync_mode == WB_SYNC_NONE) {
            doutc(cl, "%p under writeback\n", folio);
            folio_unlock(folio);
            continue;
        }
        doutc(cl, "waiting on writeback %p\n", folio);
        folio_wait_writeback(folio);
        folio_wait_private_2(folio); /* [DEPRECATED] */
    }

    if (ceph_wbc.locked_pages == 0)
        lock_page(page);  /* first page */
    else if (!trylock_page(page))
        break;

This ordering no longer ignores the writeback state of folios/pages.
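To see why the original order wastes cycles, consider a minimal userspace
analogy (a compilable sketch, not kernel code; struct fake_folio and both
helpers below are invented for this illustration). A folio under writeback
stays locked until writepages_finish(), so a thread that takes the lock
first can never observe the writeback flag set; checking the flag first
lets threads skip busy folios without touching the lock at all:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>

    /* Invented stand-in for a folio: its lock plus a writeback flag. */
    struct fake_folio {
        pthread_mutex_t lock;        /* models the folio lock */
        atomic_bool under_writeback; /* models folio_test_writeback() */
    };

    /*
     * Old order: lock first, check writeback second. Because writeback
     * implies the lock is held until the write completes, the trylock
     * fails and the writeback branch below is effectively dead code.
     */
    static bool try_pick_old(struct fake_folio *f)
    {
        if (pthread_mutex_trylock(&f->lock) != 0)
            return false; /* contended: a wasted lock attempt */
        if (atomic_load(&f->under_writeback)) {
            /* unreachable while the writer holds the lock */
            pthread_mutex_unlock(&f->lock);
            return false;
        }
        return true;
    }

    /* New order: skip folios under writeback before touching the lock. */
    static bool try_pick_new(struct fake_folio *f)
    {
        if (atomic_load(&f->under_writeback))
            return false; /* cheap early exit, no lock traffic */
        return pthread_mutex_trylock(&f->lock) == 0;
    }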
Signed-off-by: Viacheslav Dubeyko
---
 fs/ceph/addr.c | 568 +++++++++++++++++++++++++++++++------------------
 1 file changed, 365 insertions(+), 203 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index d002ff62d867..739329846a07 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -978,6 +978,27 @@ static void writepages_finish(struct ceph_osd_request *req)
     ceph_dec_osd_stopping_blocker(fsc->mdsc);
 }
 
+static inline
+bool is_forced_umount(struct address_space *mapping)
+{
+    struct inode *inode = mapping->host;
+    struct ceph_inode_info *ci = ceph_inode(inode);
+    struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
+    struct ceph_client *cl = fsc->client;
+
+    if (ceph_inode_is_shutdown(inode)) {
+        if (ci->i_wrbuffer_ref > 0) {
+            pr_warn_ratelimited_client(cl,
+                "%llx.%llx %lld forced umount\n",
+                ceph_vinop(inode), ceph_ino(inode));
+        }
+        mapping_set_error(mapping, -EIO);
+        return true;
+    }
+
+    return false;
+}
+
 static inline
 unsigned int ceph_define_write_size(struct address_space *mapping)
 {
@@ -1046,6 +1067,334 @@ void ceph_init_writeback_ctl(struct address_space *mapping,
     ceph_wbc->data_pages = NULL;
 }
 
+static inline
+int ceph_define_writeback_range(struct address_space *mapping,
+                                struct writeback_control *wbc,
+                                struct ceph_writeback_ctl *ceph_wbc)
+{
+    struct inode *inode = mapping->host;
+    struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
+    struct ceph_client *cl = fsc->client;
+
+    /* find oldest snap context with dirty data */
+    ceph_wbc->snapc = get_oldest_context(inode, ceph_wbc, NULL);
+    if (!ceph_wbc->snapc) {
+        /* hmm, why does writepages get called when there
+           is no dirty data? */
+        doutc(cl, " no snap context with dirty data?\n");
+        return -ENODATA;
+    }
+
+    doutc(cl, " oldest snapc is %p seq %lld (%d snaps)\n",
+          ceph_wbc->snapc, ceph_wbc->snapc->seq,
+          ceph_wbc->snapc->num_snaps);
+
+    ceph_wbc->should_loop = false;
+
+    if (ceph_wbc->head_snapc && ceph_wbc->snapc != ceph_wbc->last_snapc) {
+        /* where to start/end? */
+        if (wbc->range_cyclic) {
+            ceph_wbc->index = ceph_wbc->start_index;
+            ceph_wbc->end = -1;
+            if (ceph_wbc->index > 0)
+                ceph_wbc->should_loop = true;
+            doutc(cl, " cyclic, start at %lu\n", ceph_wbc->index);
+        } else {
+            ceph_wbc->index = wbc->range_start >> PAGE_SHIFT;
+            ceph_wbc->end = wbc->range_end >> PAGE_SHIFT;
+            if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
+                ceph_wbc->range_whole = true;
+            doutc(cl, " not cyclic, %lu to %lu\n",
+                  ceph_wbc->index, ceph_wbc->end);
+        }
+    } else if (!ceph_wbc->head_snapc) {
+        /* Do not respect wbc->range_{start,end}. Dirty pages
+         * in that range can be associated with newer snapc.
+         * They are not writeable until all dirty pages
+         * associated with 'snapc' get written */
+        if (ceph_wbc->index > 0)
+            ceph_wbc->should_loop = true;
+        doutc(cl, " non-head snapc, range whole\n");
+    }
+
+    ceph_put_snap_context(ceph_wbc->last_snapc);
+    ceph_wbc->last_snapc = ceph_wbc->snapc;
+
+    return 0;
+}
+
+static inline
+bool has_writeback_done(struct ceph_writeback_ctl *ceph_wbc)
+{
+    return ceph_wbc->done && ceph_wbc->index > ceph_wbc->end;
+}
+
+static inline
+bool can_next_page_be_processed(struct ceph_writeback_ctl *ceph_wbc,
+                                unsigned index)
+{
+    return index < ceph_wbc->nr_folios &&
+            ceph_wbc->locked_pages < ceph_wbc->max_pages;
+}
+
+static
+int ceph_check_page_before_write(struct address_space *mapping,
+                                 struct writeback_control *wbc,
+                                 struct ceph_writeback_ctl *ceph_wbc,
+                                 struct folio *folio)
+{
+    struct inode *inode = mapping->host;
+    struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
+    struct ceph_client *cl = fsc->client;
+    struct ceph_snap_context *pgsnapc;
+    struct page *page = &folio->page;
+
+    /* only dirty pages, or our accounting breaks */
+    if (unlikely(!PageDirty(page)) || unlikely(page->mapping != mapping)) {
+        doutc(cl, "!dirty or !mapping %p\n", page);
+        return -ENODATA;
+    }
+
+    /* only if matching snap context */
+    pgsnapc = page_snap_context(page);
+    if (pgsnapc != ceph_wbc->snapc) {
+        doutc(cl, "page snapc %p %lld != oldest %p %lld\n",
+              pgsnapc, pgsnapc->seq,
+              ceph_wbc->snapc, ceph_wbc->snapc->seq);
+
+        if (!ceph_wbc->should_loop && !ceph_wbc->head_snapc &&
+            wbc->sync_mode != WB_SYNC_NONE)
+            ceph_wbc->should_loop = true;
+
+        return -ENODATA;
+    }
+
+    if (page_offset(page) >= ceph_wbc->i_size) {
+        doutc(cl, "folio at %lu beyond eof %llu\n",
+              folio->index, ceph_wbc->i_size);
+
+        if ((ceph_wbc->size_stable ||
+            folio_pos(folio) >= i_size_read(inode)) &&
+            folio_clear_dirty_for_io(folio))
+            folio_invalidate(folio, 0, folio_size(folio));
+
+        return -ENODATA;
+    }
+
+    if (ceph_wbc->strip_unit_end &&
+        (page->index > ceph_wbc->strip_unit_end)) {
+        doutc(cl, "end of strip unit %p\n", page);
+        return -E2BIG;
+    }
+
+    return 0;
+}
+
+static inline
+void __ceph_allocate_page_array(struct ceph_writeback_ctl *ceph_wbc,
+                                unsigned int max_pages)
+{
+    ceph_wbc->pages = kmalloc_array(max_pages,
+                                    sizeof(*ceph_wbc->pages),
+                                    GFP_NOFS);
+    if (!ceph_wbc->pages) {
+        ceph_wbc->from_pool = true;
+        ceph_wbc->pages = mempool_alloc(ceph_wb_pagevec_pool, GFP_NOFS);
+        BUG_ON(!ceph_wbc->pages);
+    }
+}
+
+static inline
+void ceph_allocate_page_array(struct address_space *mapping,
+                              struct ceph_writeback_ctl *ceph_wbc,
+                              struct page *page)
+{
+    struct inode *inode = mapping->host;
+    struct ceph_inode_info *ci = ceph_inode(inode);
+    u64 objnum;
+    u64 objoff;
+    u32 xlen;
+
+    /* prepare async write request */
+    ceph_wbc->offset = (u64)page_offset(page);
+    ceph_calc_file_object_mapping(&ci->i_layout,
+                                  ceph_wbc->offset, ceph_wbc->wsize,
+                                  &objnum, &objoff, &xlen);
+
+    ceph_wbc->num_ops = 1;
+    ceph_wbc->strip_unit_end = page->index + ((xlen - 1) >> PAGE_SHIFT);
+
+    BUG_ON(ceph_wbc->pages);
+    ceph_wbc->max_pages = calc_pages_for(0, (u64)xlen);
+    __ceph_allocate_page_array(ceph_wbc, ceph_wbc->max_pages);
+
+    ceph_wbc->len = 0;
+}
+
+static inline
+bool is_page_index_contiguous(struct ceph_writeback_ctl *ceph_wbc,
+                              struct page *page)
+{
+    return page->index == (ceph_wbc->offset + ceph_wbc->len) >> PAGE_SHIFT;
+}
+
+static inline
+bool is_num_ops_too_big(struct ceph_writeback_ctl *ceph_wbc)
+{
+    return ceph_wbc->num_ops >=
+        (ceph_wbc->from_pool ? CEPH_OSD_SLAB_OPS : CEPH_OSD_MAX_OPS);
+}
+
+static inline
+bool is_write_congestion_happened(struct ceph_fs_client *fsc)
+{
+    return atomic_long_inc_return(&fsc->writeback_count) >
+        CONGESTION_ON_THRESH(fsc->mount_options->congestion_kb);
+}
+
+static inline
+int ceph_move_dirty_page_in_page_array(struct address_space *mapping,
+                                       struct writeback_control *wbc,
+                                       struct ceph_writeback_ctl *ceph_wbc,
+                                       struct page *page)
+{
+    struct inode *inode = mapping->host;
+    struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
+    struct ceph_client *cl = fsc->client;
+    struct page **pages = ceph_wbc->pages;
+    unsigned int index = ceph_wbc->locked_pages;
+    gfp_t gfp_flags = ceph_wbc->locked_pages ? GFP_NOWAIT : GFP_NOFS;
+
+    if (IS_ENCRYPTED(inode)) {
+        pages[index] = fscrypt_encrypt_pagecache_blocks(page,
+                                                        PAGE_SIZE,
+                                                        0,
+                                                        gfp_flags);
+        if (IS_ERR(pages[index])) {
+            int err = PTR_ERR(pages[index]);
+
+            if (err == -EINVAL)
+                pr_err_client(cl, "inode->i_blkbits=%hhu\n",
+                              inode->i_blkbits);
+
+            /* better not fail on first page! */
+            BUG_ON(ceph_wbc->locked_pages == 0);
+            pages[index] = NULL;
+            return err;
+        }
+    } else {
+        pages[index] = page;
+    }
+
+    ceph_wbc->locked_pages++;
+
+    return 0;
+}
+
+static
+int ceph_process_folio_batch(struct address_space *mapping,
+                             struct writeback_control *wbc,
+                             struct ceph_writeback_ctl *ceph_wbc)
+{
+    struct inode *inode = mapping->host;
+    struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
+    struct ceph_client *cl = fsc->client;
+    struct folio *folio = NULL;
+    struct page *page = NULL;
+    unsigned i;
+    int rc = 0;
+
+    for (i = 0; can_next_page_be_processed(ceph_wbc, i); i++) {
+        folio = ceph_wbc->fbatch.folios[i];
+
+        if (!folio)
+            continue;
+
+        page = &folio->page;
+
+        doutc(cl, "? %p idx %lu, folio_test_writeback %#x, "
+              "folio_test_dirty %#x, folio_test_locked %#x\n",
+              page, page->index, folio_test_writeback(folio),
+              folio_test_dirty(folio),
+              folio_test_locked(folio));
+
+        if (folio_test_writeback(folio) ||
+            folio_test_private_2(folio) /* [DEPRECATED] */) {
+            doutc(cl, "waiting on writeback %p\n", folio);
+            folio_wait_writeback(folio);
+            folio_wait_private_2(folio); /* [DEPRECATED] */
+            continue;
+        }
+
+        if (ceph_wbc->locked_pages == 0)
+            lock_page(page); /* first page */
+        else if (!trylock_page(page))
+            break;
+
+        rc = ceph_check_page_before_write(mapping, wbc,
+                                          ceph_wbc, folio);
+        if (rc == -ENODATA) {
+            rc = 0;
+            unlock_page(page);
+            ceph_wbc->fbatch.folios[i] = NULL;
+            continue;
+        } else if (rc == -E2BIG) {
+            rc = 0;
+            unlock_page(page);
+            ceph_wbc->fbatch.folios[i] = NULL;
+            break;
+        }
+
+        if (!clear_page_dirty_for_io(page)) {
+            doutc(cl, "%p !clear_page_dirty_for_io\n", page);
+            unlock_page(page);
+            ceph_wbc->fbatch.folios[i] = NULL;
+            continue;
+        }
+
+        /*
+         * We have something to write.  If this is
+         * the first locked page this time through,
+         * calculate max possible write size and
+         * allocate a page array
+         */
+        if (ceph_wbc->locked_pages == 0) {
+            ceph_allocate_page_array(mapping, ceph_wbc, page);
+        } else if (!is_page_index_contiguous(ceph_wbc, page)) {
+            if (is_num_ops_too_big(ceph_wbc)) {
+                redirty_page_for_writepage(wbc, page);
+                unlock_page(page);
+                break;
+            }
+
+            ceph_wbc->num_ops++;
+            ceph_wbc->offset = (u64)page_offset(page);
+            ceph_wbc->len = 0;
+        }
+
+        /* note position of first page in fbatch */
+        doutc(cl, "%llx.%llx will write page %p idx %lu\n",
+              ceph_vinop(inode), page, page->index);
+
+        fsc->write_congested = is_write_congestion_happened(fsc);
+
+        rc = ceph_move_dirty_page_in_page_array(mapping, wbc,
+                                                ceph_wbc, page);
+        if (rc) {
+            redirty_page_for_writepage(wbc, page);
+            unlock_page(page);
+            break;
+        }
+
+        ceph_wbc->fbatch.folios[i] = NULL;
+        ceph_wbc->len += thp_size(page);
+    }
+
+    ceph_wbc->processed_in_fbatch = i;
+
+    return rc;
+}
+
 /*
  * initiate async writeback
  */
@@ -1057,7 +1406,6 @@ static int ceph_writepages_start(struct address_space *mapping,
     struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
     struct ceph_client *cl = fsc->client;
     struct ceph_vino vino = ceph_vino(inode);
-    struct ceph_snap_context *pgsnapc;
     struct ceph_writeback_ctl ceph_wbc;
     struct ceph_osd_request *req = NULL;
     int rc = 0;
@@ -1071,235 +1419,49 @@ static int ceph_writepages_start(struct address_space *mapping,
            wbc->sync_mode == WB_SYNC_NONE ? "NONE" :
            (wbc->sync_mode == WB_SYNC_ALL ? "ALL" : "HOLD"));
 
-    if (ceph_inode_is_shutdown(inode)) {
-        if (ci->i_wrbuffer_ref > 0) {
-            pr_warn_ratelimited_client(cl,
-                "%llx.%llx %lld forced umount\n",
-                ceph_vinop(inode), ceph_ino(inode));
-        }
-        mapping_set_error(mapping, -EIO);
-        return -EIO; /* we're in a forced umount, don't write! */
+    if (is_forced_umount(mapping)) {
+        /* we're in a forced umount, don't write! */
+        return -EIO;
     }
 
     ceph_init_writeback_ctl(mapping, wbc, &ceph_wbc);
 
 retry:
-    /* find oldest snap context with dirty data */
-    ceph_wbc.snapc = get_oldest_context(inode, &ceph_wbc, NULL);
-    if (!ceph_wbc.snapc) {
+    rc = ceph_define_writeback_range(mapping, wbc, &ceph_wbc);
+    if (rc == -ENODATA) {
         /* hmm, why does writepages get called when there
            is no dirty data? */
-        doutc(cl, " no snap context with dirty data?\n");
+        rc = 0;
         goto out;
     }
-    doutc(cl, " oldest snapc is %p seq %lld (%d snaps)\n",
-          ceph_wbc.snapc, ceph_wbc.snapc->seq,
-          ceph_wbc.snapc->num_snaps);
-
-    ceph_wbc.should_loop = false;
-    if (ceph_wbc.head_snapc && ceph_wbc.snapc != ceph_wbc.last_snapc) {
-        /* where to start/end? */
-        if (wbc->range_cyclic) {
-            ceph_wbc.index = ceph_wbc.start_index;
-            ceph_wbc.end = -1;
-            if (ceph_wbc.index > 0)
-                ceph_wbc.should_loop = true;
-            doutc(cl, " cyclic, start at %lu\n", ceph_wbc.index);
-        } else {
-            ceph_wbc.index = wbc->range_start >> PAGE_SHIFT;
-            ceph_wbc.end = wbc->range_end >> PAGE_SHIFT;
-            if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
-                ceph_wbc.range_whole = true;
-            doutc(cl, " not cyclic, %lu to %lu\n",
-                  ceph_wbc.index, ceph_wbc.end);
-        }
-    } else if (!ceph_wbc.head_snapc) {
-        /* Do not respect wbc->range_{start,end}. Dirty pages
-         * in that range can be associated with newer snapc.
-         * They are not writeable until we write all dirty pages
-         * associated with 'snapc' get written */
-        if (ceph_wbc.index > 0)
-            ceph_wbc.should_loop = true;
-        doutc(cl, " non-head snapc, range whole\n");
-    }
 
     if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
         tag_pages_for_writeback(mapping, ceph_wbc.index, ceph_wbc.end);
 
-    ceph_put_snap_context(ceph_wbc.last_snapc);
-    ceph_wbc.last_snapc = ceph_wbc.snapc;
-
-    while (!ceph_wbc.done && ceph_wbc.index <= ceph_wbc.end) {
+    while (!has_writeback_done(&ceph_wbc)) {
         unsigned i;
         struct page *page;
+
         ceph_wbc.locked_pages = 0;
         ceph_wbc.max_pages = ceph_wbc.wsize >> PAGE_SHIFT;
 
 get_more_pages:
+        ceph_folio_batch_reinit(&ceph_wbc);
+
         ceph_wbc.nr_folios = filemap_get_folios_tag(mapping,
                                                     &ceph_wbc.index,
                                                     ceph_wbc.end,
                                                     ceph_wbc.tag,
                                                     &ceph_wbc.fbatch);
-        doutc(cl, "pagevec_lookup_range_tag got %d\n",
-              ceph_wbc.nr_folios);
+        doutc(cl, "pagevec_lookup_range_tag for tag %#x got %d\n",
+              ceph_wbc.tag, ceph_wbc.nr_folios);
+
         if (!ceph_wbc.nr_folios && !ceph_wbc.locked_pages)
             break;
-        for (i = 0; i < ceph_wbc.nr_folios &&
-             ceph_wbc.locked_pages < ceph_wbc.max_pages; i++) {
-            struct folio *folio = ceph_wbc.fbatch.folios[i];
-
-            page = &folio->page;
-            doutc(cl, "? %p idx %lu\n", page, page->index);
-            if (ceph_wbc.locked_pages == 0)
-                lock_page(page); /* first page */
-            else if (!trylock_page(page))
-                break;
-
-            /* only dirty pages, or our accounting breaks */
-            if (unlikely(!PageDirty(page)) ||
-                unlikely(page->mapping != mapping)) {
-                doutc(cl, "!dirty or !mapping %p\n", page);
-                unlock_page(page);
-                continue;
-            }
-            /* only if matching snap context */
-            pgsnapc = page_snap_context(page);
-            if (pgsnapc != ceph_wbc.snapc) {
-                doutc(cl, "page snapc %p %lld != oldest %p %lld\n",
-                      pgsnapc, pgsnapc->seq,
-                      ceph_wbc.snapc, ceph_wbc.snapc->seq);
-                if (!ceph_wbc.should_loop &&
-                    !ceph_wbc.head_snapc &&
-                    wbc->sync_mode != WB_SYNC_NONE)
-                    ceph_wbc.should_loop = true;
-                unlock_page(page);
-                continue;
-            }
-            if (page_offset(page) >= ceph_wbc.i_size) {
-                doutc(cl, "folio at %lu beyond eof %llu\n",
-                      folio->index, ceph_wbc.i_size);
-                if ((ceph_wbc.size_stable ||
-                    folio_pos(folio) >= i_size_read(inode)) &&
-                    folio_clear_dirty_for_io(folio))
-                    folio_invalidate(folio, 0,
-                                     folio_size(folio));
-                folio_unlock(folio);
-                continue;
-            }
-            if (ceph_wbc.strip_unit_end &&
-                (page->index > ceph_wbc.strip_unit_end)) {
-                doutc(cl, "end of strip unit %p\n", page);
-                unlock_page(page);
-                break;
-            }
-            if (folio_test_writeback(folio) ||
-                folio_test_private_2(folio) /* [DEPRECATED] */) {
-                if (wbc->sync_mode == WB_SYNC_NONE) {
-                    doutc(cl, "%p under writeback\n", folio);
-                    folio_unlock(folio);
-                    continue;
-                }
-                doutc(cl, "waiting on writeback %p\n", folio);
-                folio_wait_writeback(folio);
-                folio_wait_private_2(folio); /* [DEPRECATED] */
-            }
-
-            if (!clear_page_dirty_for_io(page)) {
-                doutc(cl, "%p !clear_page_dirty_for_io\n", page);
-                unlock_page(page);
-                continue;
-            }
-
-            /*
-             * We have something to write.  If this is
-             * the first locked page this time through,
-             * calculate max possinle write size and
-             * allocate a page array
-             */
-            if (ceph_wbc.locked_pages == 0) {
-                u64 objnum;
-                u64 objoff;
-                u32 xlen;
-
-                /* prepare async write request */
-                ceph_wbc.offset = (u64)page_offset(page);
-                ceph_calc_file_object_mapping(&ci->i_layout,
-                                              ceph_wbc.offset,
-                                              ceph_wbc.wsize,
-                                              &objnum, &objoff,
-                                              &xlen);
-                ceph_wbc.len = xlen;
-
-                ceph_wbc.num_ops = 1;
-                ceph_wbc.strip_unit_end = page->index +
-                    ((ceph_wbc.len - 1) >> PAGE_SHIFT);
-
-                BUG_ON(ceph_wbc.pages);
-                ceph_wbc.max_pages =
-                    calc_pages_for(0, (u64)ceph_wbc.len);
-                ceph_wbc.pages = kmalloc_array(ceph_wbc.max_pages,
-                                               sizeof(*ceph_wbc.pages),
-                                               GFP_NOFS);
-                if (!ceph_wbc.pages) {
-                    ceph_wbc.from_pool = true;
-                    ceph_wbc.pages =
-                        mempool_alloc(ceph_wb_pagevec_pool,
-                                      GFP_NOFS);
-                    BUG_ON(!ceph_wbc.pages);
-                }
-                ceph_wbc.len = 0;
-            } else if (page->index !=
-                       (ceph_wbc.offset + ceph_wbc.len) >> PAGE_SHIFT) {
-                if (ceph_wbc.num_ops >=
-                    (ceph_wbc.from_pool ? CEPH_OSD_SLAB_OPS :
-                                          CEPH_OSD_MAX_OPS)) {
-                    redirty_page_for_writepage(wbc, page);
-                    unlock_page(page);
-                    break;
-                }
-
-                ceph_wbc.num_ops++;
-                ceph_wbc.offset = (u64)page_offset(page);
-                ceph_wbc.len = 0;
-            }
-
-            /* note position of first page in fbatch */
-            doutc(cl, "%llx.%llx will write page %p idx %lu\n",
-                  ceph_vinop(inode), page, page->index);
-
-            if (atomic_long_inc_return(&fsc->writeback_count) >
-                CONGESTION_ON_THRESH(
-                    fsc->mount_options->congestion_kb))
-                fsc->write_congested = true;
-
-            if (IS_ENCRYPTED(inode)) {
-                ceph_wbc.pages[ceph_wbc.locked_pages] =
-                    fscrypt_encrypt_pagecache_blocks(page,
-                                                     PAGE_SIZE, 0,
-                                                     ceph_wbc.locked_pages ?
-                                                     GFP_NOWAIT : GFP_NOFS);
-                if (IS_ERR(ceph_wbc.pages[ceph_wbc.locked_pages])) {
-                    if (PTR_ERR(ceph_wbc.pages[ceph_wbc.locked_pages]) == -EINVAL)
-                        pr_err_client(cl,
-                                      "inode->i_blkbits=%hhu\n",
-                                      inode->i_blkbits);
-                    /* better not fail on first page! */
-                    BUG_ON(ceph_wbc.locked_pages == 0);
-                    ceph_wbc.pages[ceph_wbc.locked_pages] = NULL;
-                    redirty_page_for_writepage(wbc, page);
-                    unlock_page(page);
-                    break;
-                }
-                ++ceph_wbc.locked_pages;
-            } else {
-                ceph_wbc.pages[ceph_wbc.locked_pages++] = page;
-            }
-
-            ceph_wbc.fbatch.folios[i] = NULL;
-            ceph_wbc.len += thp_size(page);
-        }
+        rc = ceph_process_folio_batch(mapping, wbc, &ceph_wbc);
+        if (rc)
+            goto release_folios;
 
         /* did we get anything? */
         if (!ceph_wbc.locked_pages)
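Taken together, the helpers above reduce the main loop of
ceph_writepages_start() to roughly the following shape (a simplified
sketch assembled from the diff above; retry/should_loop handling, folio
release, and OSD request submission are elided, so this is not the
literal kernel code):

    /* Sketch only: error paths and request submission omitted. */
    static int ceph_writepages_start_sketch(struct address_space *mapping,
                                            struct writeback_control *wbc)
    {
        struct ceph_writeback_ctl ceph_wbc;
        int rc;

        if (is_forced_umount(mapping))
            return -EIO;  /* forced umount: don't write */

        ceph_init_writeback_ctl(mapping, wbc, &ceph_wbc);

        rc = ceph_define_writeback_range(mapping, wbc, &ceph_wbc);
        if (rc == -ENODATA)
            return 0;     /* nothing dirty for the oldest snapc */

        while (!has_writeback_done(&ceph_wbc)) {
            /* grab the next batch of dirty folios for the tag ... */
            ceph_wbc.nr_folios = filemap_get_folios_tag(mapping,
                                    &ceph_wbc.index, ceph_wbc.end,
                                    ceph_wbc.tag, &ceph_wbc.fbatch);
            if (!ceph_wbc.nr_folios && !ceph_wbc.locked_pages)
                break;

            /* ... wait out writeback, lock, collect writeable pages ... */
            rc = ceph_process_folio_batch(mapping, wbc, &ceph_wbc);
            if (rc)
                break;  /* caller releases the unprocessed folios */

            /* ... then build and submit the OSD write request(s). */
        }

        return rc;
    }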
From patchwork Wed Feb 5 00:02:49 2025
X-Patchwork-Submitter: Viacheslav Dubeyko
X-Patchwork-Id: 862170
From: Viacheslav Dubeyko
To: ceph-devel@vger.kernel.org
Cc: idryomov@gmail.com, dhowells@redhat.com, linux-fsdevel@vger.kernel.org,
    pdonnell@redhat.com, amarkuze@redhat.com, Slava.Dubeyko@ibm.com,
    slava@dubeyko.com
Subject: [RFC PATCH 4/4] ceph: fix generic/421 test failure
Date: Tue, 4 Feb 2025 16:02:49 -0800
Message-ID: <20250205000249.123054-5-slava@dubeyko.com>
In-Reply-To: <20250205000249.123054-1-slava@dubeyko.com>
References: <20250205000249.123054-1-slava@dubeyko.com>

The generic/421 test fails to finish because of this issue:

Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.894678] INFO: task kworker/u48:0:11 blocked for more than 122 seconds.
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.895403]       Not tainted 6.13.0-rc5+ #1
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.895867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.896633] task:kworker/u48:0 state:D stack:0 pid:11 tgid:11 ppid:2 flags:0x00004000
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.896641] Workqueue: writeback wb_workfn (flush-ceph-24)
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897614] Call Trace:
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897620]  <TASK>
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897629]  __schedule+0x443/0x16b0
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897637]  schedule+0x2b/0x140
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897640]  io_schedule+0x4c/0x80
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897643]  folio_wait_bit_common+0x11b/0x310
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897646]  ? _raw_spin_unlock_irq+0xe/0x50
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897652]  ? __pfx_wake_page_function+0x10/0x10
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897655]  __folio_lock+0x17/0x30
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897658]  ceph_writepages_start+0xca9/0x1fb0
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897663]  ? fsnotify_remove_queued_event+0x2f/0x40
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897668]  do_writepages+0xd2/0x240
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897672]  __writeback_single_inode+0x44/0x350
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897675]  writeback_sb_inodes+0x25c/0x550
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897680]  wb_writeback+0x89/0x310
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897683]  ? finish_task_switch.isra.0+0x97/0x310
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897687]  wb_workfn+0xb5/0x410
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897689]  process_one_work+0x188/0x3d0
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897692]  worker_thread+0x2b5/0x3c0
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897694]  ? __pfx_worker_thread+0x10/0x10
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897696]  kthread+0xe1/0x120
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897699]  ? __pfx_kthread+0x10/0x10
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897701]  ret_from_fork+0x43/0x70
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897705]  ? __pfx_kthread+0x10/0x10
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897707]  ret_from_fork_asm+0x1a/0x30
Jan  3 14:25:27 ceph-testing-0001 kernel: [  369.897711]  </TASK>

There are several issues here:

(1) ceph_kill_sb() doesn't wait for the flushing of all dirty
    folios/pages to finish, because of the racy nature of
    mdsc->stopping_blockers. As a result, mdsc->stopping becomes
    CEPH_MDSC_STOPPING_FLUSHED too early.
(2) ceph_inc_osd_stopping_blocker(fsc->mdsc) fails to increment
    mdsc->stopping_blockers. As a result, already-locked folios/pages
    are never unlocked, and the logic tries to lock the same page a
    second time.
(3) The folio_batch of dirty pages found by filemap_get_folios_tag()
    is not processed properly. This is why some dirty pages are simply
    never processed and dirty folios/pages remain after unmount.

This patch fixes these issues by:

(1) introducing a dirty_folios counter and a flush_end_wq waiting
    queue in struct ceph_mds_client (see the sketch below);
(2) incrementing the dirty_folios counter in ceph_dirty_folio();
(3) decrementing the dirty_folios counter in writepages_finish() and
    waking up all waiters on the queue if the counter reaches zero or
    below;
(4) making ceph_kill_sb() check the dirty_folios counter and wait if
    it is bigger than zero;
(5) calling ceph_inc_osd_stopping_blocker() at the beginning of
    ceph_writepages_start() and ceph_dec_osd_stopping_blocker() at its
    end, to resolve the racy nature of mdsc->stopping_blockers.

sudo ./check generic/421
FSTYP         -- ceph
PLATFORM      -- Linux/x86_64 ceph-testing-0001 6.13.0+ #137 SMP PREEMPT_DYNAMIC Mon Feb  3 20:30:08 UTC 2025
MKFS_OPTIONS  -- 127.0.0.1:40551:/scratch
MOUNT_OPTIONS -- -o name=fs,secret=,ms_mode=crc,nowsync,copyfrom 127.0.0.1:40551:/scratch /mnt/scratch

generic/421 7s ...  4s
Ran: generic/421
Passed all 1 tests
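The counter-and-waitqueue scheme from fixes (1)-(4) can be modeled in
userspace (a compilable pthread-based sketch, not the kernel
implementation; all names below are invented for the illustration, and
pthread_cond_timedwait() stands in for wait_event_killable_timeout()):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <time.h>

    static atomic_long dirty_folios;  /* models mdsc->dirty_folios */
    static pthread_mutex_t flush_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t flush_end = PTHREAD_COND_INITIALIZER;
                                      /* models mdsc->flush_end_wq */

    static void dirty_folio(void)     /* ceph_dirty_folio() side */
    {
        atomic_fetch_add(&dirty_folios, 1);
    }

    static void writepages_finish(void) /* write-completion side */
    {
        /* fetch_sub returns the old value; old - 1 is the new one */
        if (atomic_fetch_sub(&dirty_folios, 1) - 1 <= 0) {
            pthread_mutex_lock(&flush_lock);
            pthread_cond_broadcast(&flush_end); /* wake_up_all() */
            pthread_mutex_unlock(&flush_lock);
        }
    }

    static int kill_sb_wait(long timeout_sec) /* ceph_kill_sb() side */
    {
        struct timespec ts;
        int rc = 0;

        clock_gettime(CLOCK_REALTIME, &ts);
        ts.tv_sec += timeout_sec;    /* models mount_timeout */

        pthread_mutex_lock(&flush_lock);
        while (atomic_load(&dirty_folios) > 0 && rc == 0)
            rc = pthread_cond_timedwait(&flush_end, &flush_lock, &ts);
        pthread_mutex_unlock(&flush_lock);
        return rc;  /* ETIMEDOUT models "umount timed out" */
    }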
Signed-off-by: Viacheslav Dubeyko
---
 fs/ceph/addr.c       | 20 +++++++++++++++++++-
 fs/ceph/mds_client.c |  2 ++
 fs/ceph/mds_client.h |  3 +++
 fs/ceph/super.c      | 11 +++++++++++
 4 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 02d20c000dc5..d82ce4867fca 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -82,6 +82,7 @@ static bool ceph_dirty_folio(struct address_space *mapping, struct folio *folio)
 {
     struct inode *inode = mapping->host;
     struct ceph_client *cl = ceph_inode_to_client(inode);
+    struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
     struct ceph_inode_info *ci;
     struct ceph_snap_context *snapc;
 
@@ -92,6 +93,8 @@ static bool ceph_dirty_folio(struct address_space *mapping, struct folio *folio)
         return false;
     }
 
+    atomic64_inc(&mdsc->dirty_folios);
+
     ci = ceph_inode(inode);
 
     /* dirty the head */
@@ -894,6 +897,7 @@ static void writepages_finish(struct ceph_osd_request *req)
     struct ceph_snap_context *snapc = req->r_snapc;
     struct address_space *mapping = inode->i_mapping;
     struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
+    struct ceph_mds_client *mdsc = ceph_sb_to_mdsc(inode->i_sb);
     unsigned int len = 0;
     bool remove_page;
 
@@ -949,6 +953,12 @@ static void writepages_finish(struct ceph_osd_request *req)
         ceph_put_snap_context(detach_page_private(page));
         end_page_writeback(page);
+
+        if (atomic64_dec_return(&mdsc->dirty_folios) <= 0) {
+            wake_up_all(&mdsc->flush_end_wq);
+            WARN_ON(atomic64_read(&mdsc->dirty_folios) < 0);
+        }
+
         doutc(cl, "unlocking %p\n", page);
 
         if (remove_page)
@@ -1660,13 +1670,18 @@ static int ceph_writepages_start(struct address_space *mapping,
 
     ceph_init_writeback_ctl(mapping, wbc, &ceph_wbc);
 
+    if (!ceph_inc_osd_stopping_blocker(fsc->mdsc)) {
+        rc = -EIO;
+        goto out;
+    }
+
 retry:
     rc = ceph_define_writeback_range(mapping, wbc, &ceph_wbc);
     if (rc == -ENODATA) {
         /* hmm, why does writepages get called when there
            is no dirty data? */
         rc = 0;
-        goto out;
+        goto dec_osd_stopping_blocker;
     }
 
     if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
@@ -1756,6 +1771,9 @@ static int ceph_writepages_start(struct address_space *mapping,
     if (wbc->range_cyclic || (ceph_wbc.range_whole && wbc->nr_to_write > 0))
         mapping->writeback_index = ceph_wbc.index;
 
+dec_osd_stopping_blocker:
+    ceph_dec_osd_stopping_blocker(fsc->mdsc);
+
 out:
     ceph_put_snap_context(ceph_wbc.last_snapc);
     doutc(cl, "%llx.%llx dend - startone, rc = %d\n", ceph_vinop(inode),
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 54b3421501e9..230e0c3f341f 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -5489,6 +5489,8 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc)
     spin_lock_init(&mdsc->stopping_lock);
     atomic_set(&mdsc->stopping_blockers, 0);
     init_completion(&mdsc->stopping_waiter);
+    atomic64_set(&mdsc->dirty_folios, 0);
+    init_waitqueue_head(&mdsc->flush_end_wq);
     init_waitqueue_head(&mdsc->session_close_wq);
     INIT_LIST_HEAD(&mdsc->waiting_for_map);
     mdsc->quotarealms_inodes = RB_ROOT;
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 7c9fee9e80d4..3e2a6fa7c19a 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -458,6 +458,9 @@ struct ceph_mds_client {
     atomic_t             stopping_blockers;
     struct completion    stopping_waiter;
 
+    atomic64_t           dirty_folios;
+    wait_queue_head_t    flush_end_wq;
+
     atomic64_t           quotarealms_count; /* # realms with quota */
     /*
      * We keep a list of inodes we don't see in the mountpoint but that we
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index 4344e1f11806..f3951253e393 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -1563,6 +1563,17 @@ static void ceph_kill_sb(struct super_block *s)
      */
     sync_filesystem(s);
 
+    if (atomic64_read(&mdsc->dirty_folios) > 0) {
+        wait_queue_head_t *wq = &mdsc->flush_end_wq;
+        long timeleft = wait_event_killable_timeout(*wq,
+                    atomic64_read(&mdsc->dirty_folios) <= 0,
+                    fsc->client->options->mount_timeout);
+        if (!timeleft) /* timed out */
+            pr_warn_client(cl, "umount timed out, %ld\n", timeleft);
+        else if (timeleft < 0) /* killed */
+            pr_warn_client(cl, "umount was killed, %ld\n", timeleft);
+    }
+
     spin_lock(&mdsc->stopping_lock);
     mdsc->stopping = CEPH_MDSC_STOPPING_FLUSHING;
     wait = !!atomic_read(&mdsc->stopping_blockers);