From patchwork Tue May 25 13:50:38 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jan Kara X-Patchwork-Id: 447395 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-18.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, URIBL_BLOCKED, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7904AC47087 for ; Tue, 25 May 2021 13:51:12 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 5D3F56141B for ; Tue, 25 May 2021 13:51:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233464AbhEYNwi (ORCPT ); Tue, 25 May 2021 09:52:38 -0400 Received: from mx2.suse.de ([195.135.220.15]:42592 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233281AbhEYNwe (ORCPT ); Tue, 25 May 2021 09:52:34 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_rsa; t=1621950662; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=6LMR/Ngf6lvtbuuF+m2yUFjG+uBCM2FsKQmwAkdHZe8=; b=uwPabqO1I4EDvDUjlntJHYfMbjaaTPOoyAuBTXQPKyzqe+fS+Dj6/JBTEjB2J7DJ3leAKN oZTqKNcbzHMIRIZfB0oC77snqI1QiAXhSF75Nwif94Q6EjLried6NH0r7cP/MxpIVtxmus h0oMlw0bz/UJyKT6HCsJQa3Q61z69B0= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_ed25519; t=1621950662; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=6LMR/Ngf6lvtbuuF+m2yUFjG+uBCM2FsKQmwAkdHZe8=; b=NQDHltGxstV+jhuCvBrHFwMNOTXUqX+YVK7vAHROsjcQPVoo45ln0AjwcLptGKOL/jn/hB gXfjK6yLg8Z7CzBA== Received: from relay2.suse.de (unknown [195.135.221.27]) by mx2.suse.de (Postfix) with ESMTP id B1561AEC2; Tue, 25 May 2021 13:51:01 +0000 (UTC) Received: by quack2.suse.cz (Postfix, from userid 1000) id 118FF1E0AEB; Tue, 25 May 2021 15:51:00 +0200 (CEST) From: Jan Kara To: Cc: Christoph Hellwig , Dave Chinner , ceph-devel@vger.kernel.org, Chao Yu , Damien Le Moal , "Darrick J. Wong" , Jaegeuk Kim , Jeff Layton , Johannes Thumshirn , linux-cifs@vger.kernel.org, , linux-f2fs-devel@lists.sourceforge.net, , , Miklos Szeredi , Steve French , Ted Tso , Matthew Wilcox , Jan Kara , Christoph Hellwig , Hugh Dickins Subject: [PATCH 01/13] mm: Fix comments mentioning i_mutex Date: Tue, 25 May 2021 15:50:38 +0200 Message-Id: <20210525135100.11221-1-jack@suse.cz> X-Mailer: git-send-email 2.26.2 In-Reply-To: <20210525125652.20457-1-jack@suse.cz> References: <20210525125652.20457-1-jack@suse.cz> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: ceph-devel@vger.kernel.org inode->i_mutex has been replaced with inode->i_rwsem long ago. Fix comments still mentioning i_mutex. Reviewed-by: Christoph Hellwig Acked-by: Hugh Dickins Signed-off-by: Jan Kara --- mm/filemap.c | 10 +++++----- mm/madvise.c | 2 +- mm/memory-failure.c | 2 +- mm/rmap.c | 6 +++--- mm/shmem.c | 20 ++++++++++---------- mm/truncate.c | 8 ++++---- 6 files changed, 24 insertions(+), 24 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index 66f7e9fdfbc4..ba1068a1837f 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -76,7 +76,7 @@ * ->swap_lock (exclusive_swap_page, others) * ->i_pages lock * - * ->i_mutex + * ->i_rwsem * ->i_mmap_rwsem (truncate->unmap_mapping_range) * * ->mmap_lock @@ -87,7 +87,7 @@ * ->mmap_lock * ->lock_page (access_process_vm) * - * ->i_mutex (generic_perform_write) + * ->i_rwsem (generic_perform_write) * ->mmap_lock (fault_in_pages_readable->do_page_fault) * * bdi->wb.list_lock @@ -3710,12 +3710,12 @@ EXPORT_SYMBOL(generic_perform_write); * modification times and calls proper subroutines depending on whether we * do direct IO or a standard buffered write. * - * It expects i_mutex to be grabbed unless we work on a block device or similar + * It expects i_rwsem to be grabbed unless we work on a block device or similar * object which does not need locking at all. * * This function does *not* take care of syncing data in case of O_SYNC write. * A caller has to handle it. This is mainly due to the fact that we want to - * avoid syncing under i_mutex. + * avoid syncing under i_rwsem. * * Return: * * number of bytes written, even for truncated writes @@ -3803,7 +3803,7 @@ EXPORT_SYMBOL(__generic_file_write_iter); * * This is a wrapper around __generic_file_write_iter() to be used by most * filesystems. It takes care of syncing the file in case of O_SYNC file - * and acquires i_mutex as needed. + * and acquires i_rwsem as needed. * Return: * * negative error code if no data has been written at all of * vfs_fsync_range() failed for a synchronous write diff --git a/mm/madvise.c b/mm/madvise.c index 63e489e5bfdb..a0137706b92a 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -853,7 +853,7 @@ static long madvise_remove(struct vm_area_struct *vma, + ((loff_t)vma->vm_pgoff << PAGE_SHIFT); /* - * Filesystem's fallocate may need to take i_mutex. We need to + * Filesystem's fallocate may need to take i_rwsem. We need to * explicitly grab a reference because the vma (and hence the * vma's reference to the file) can go away as soon as we drop * mmap_lock. diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 85ad98c00fd9..9dcc9bcea731 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -704,7 +704,7 @@ static int me_pagecache_clean(struct page *p, unsigned long pfn) /* * Truncation is a bit tricky. Enable it per file system for now. * - * Open: to take i_mutex or not for this? Right now we don't. + * Open: to take i_rwsem or not for this? Right now we don't. */ return truncate_error_page(p, pfn, mapping); } diff --git a/mm/rmap.c b/mm/rmap.c index 693a610e181d..a35cbbbded0d 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -20,9 +20,9 @@ /* * Lock ordering in mm: * - * inode->i_mutex (while writing or truncating, not reading or faulting) + * inode->i_rwsem (while writing or truncating, not reading or faulting) * mm->mmap_lock - * page->flags PG_locked (lock_page) * (see huegtlbfs below) + * page->flags PG_locked (lock_page) * (see hugetlbfs below) * hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share) * mapping->i_mmap_rwsem * hugetlb_fault_mutex (hugetlbfs specific page fault mutex) @@ -41,7 +41,7 @@ * in arch-dependent flush_dcache_mmap_lock, * within bdi.wb->list_lock in __sync_single_inode) * - * anon_vma->rwsem,mapping->i_mutex (memory_failure, collect_procs_anon) + * anon_vma->rwsem,mapping->i_mmap_rwsem (memory_failure, collect_procs_anon) * ->tasklist_lock * pte map lock * diff --git a/mm/shmem.c b/mm/shmem.c index a08cedefbfaa..056204b1f76a 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -96,7 +96,7 @@ static struct vfsmount *shm_mnt; /* * shmem_fallocate communicates with shmem_fault or shmem_writepage via - * inode->i_private (with i_mutex making sure that it has only one user at + * inode->i_private (with i_rwsem making sure that it has only one user at * a time): we would prefer not to enlarge the shmem inode just for that. */ struct shmem_falloc { @@ -774,7 +774,7 @@ static int shmem_free_swap(struct address_space *mapping, * Determine (in bytes) how many of the shmem object's pages mapped by the * given offsets are swapped out. * - * This is safe to call without i_mutex or the i_pages lock thanks to RCU, + * This is safe to call without i_rwsem or the i_pages lock thanks to RCU, * as long as the inode doesn't go away and racy results are not a problem. */ unsigned long shmem_partial_swap_usage(struct address_space *mapping, @@ -806,7 +806,7 @@ unsigned long shmem_partial_swap_usage(struct address_space *mapping, * Determine (in bytes) how many of the shmem object's pages mapped by the * given vma is swapped out. * - * This is safe to call without i_mutex or the i_pages lock thanks to RCU, + * This is safe to call without i_rwsem or the i_pages lock thanks to RCU, * as long as the inode doesn't go away and racy results are not a problem. */ unsigned long shmem_swap_usage(struct vm_area_struct *vma) @@ -1069,7 +1069,7 @@ static int shmem_setattr(struct user_namespace *mnt_userns, loff_t oldsize = inode->i_size; loff_t newsize = attr->ia_size; - /* protected by i_mutex */ + /* protected by i_rwsem */ if ((newsize < oldsize && (info->seals & F_SEAL_SHRINK)) || (newsize > oldsize && (info->seals & F_SEAL_GROW))) return -EPERM; @@ -2049,7 +2049,7 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf) /* * Trinity finds that probing a hole which tmpfs is punching can * prevent the hole-punch from ever completing: which in turn - * locks writers out with its hold on i_mutex. So refrain from + * locks writers out with its hold on i_rwsem. So refrain from * faulting pages into the hole while it's being punched. Although * shmem_undo_range() does remove the additions, it may be unable to * keep up, as each new page needs its own unmap_mapping_range() call, @@ -2060,7 +2060,7 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf) * we just need to make racing faults a rare case. * * The implementation below would be much simpler if we just used a - * standard mutex or completion: but we cannot take i_mutex in fault, + * standard mutex or completion: but we cannot take i_rwsem in fault, * and bloating every shmem inode for this unlikely case would be sad. */ if (unlikely(inode->i_private)) { @@ -2518,7 +2518,7 @@ shmem_write_begin(struct file *file, struct address_space *mapping, struct shmem_inode_info *info = SHMEM_I(inode); pgoff_t index = pos >> PAGE_SHIFT; - /* i_mutex is held by caller */ + /* i_rwsem is held by caller */ if (unlikely(info->seals & (F_SEAL_GROW | F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))) { if (info->seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE)) @@ -2618,7 +2618,7 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to) /* * We must evaluate after, since reads (unlike writes) - * are called without i_mutex protection against truncate + * are called without i_rwsem protection against truncate */ nr = PAGE_SIZE; i_size = i_size_read(inode); @@ -2688,7 +2688,7 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence) return -ENXIO; inode_lock(inode); - /* We're holding i_mutex so we can access i_size directly */ + /* We're holding i_rwsem so we can access i_size directly */ offset = mapping_seek_hole_data(mapping, offset, inode->i_size, whence); if (offset >= 0) offset = vfs_setpos(file, offset, MAX_LFS_FILESIZE); @@ -2717,7 +2717,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset, loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1; DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq); - /* protected by i_mutex */ + /* protected by i_rwsem */ if (info->seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE)) { error = -EPERM; goto out; diff --git a/mm/truncate.c b/mm/truncate.c index 95af244b112a..57a618c4a0d6 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -415,7 +415,7 @@ EXPORT_SYMBOL(truncate_inode_pages_range); * @mapping: mapping to truncate * @lstart: offset from which to truncate * - * Called under (and serialised by) inode->i_mutex. + * Called under (and serialised by) inode->i_rwsem. * * Note: When this function returns, there can be a page in the process of * deletion (inside __delete_from_page_cache()) in the specified range. Thus @@ -432,7 +432,7 @@ EXPORT_SYMBOL(truncate_inode_pages); * truncate_inode_pages_final - truncate *all* pages before inode dies * @mapping: mapping to truncate * - * Called under (and serialized by) inode->i_mutex. + * Called under (and serialized by) inode->i_rwsem. * * Filesystems have to use this in the .evict_inode path to inform the * VM that this is the final truncate and the inode is going away. @@ -753,7 +753,7 @@ EXPORT_SYMBOL(truncate_pagecache); * setattr function when ATTR_SIZE is passed in. * * Must be called with a lock serializing truncates and writes (generally - * i_mutex but e.g. xfs uses a different lock) and before all filesystem + * i_rwsem but e.g. xfs uses a different lock) and before all filesystem * specific block truncation has been performed. */ void truncate_setsize(struct inode *inode, loff_t newsize) @@ -782,7 +782,7 @@ EXPORT_SYMBOL(truncate_setsize); * * The function must be called after i_size is updated so that page fault * coming after we unlock the page will already see the new i_size. - * The function must be called while we still hold i_mutex - this not only + * The function must be called while we still hold i_rwsem - this not only * makes sure i_size is stable but also that userspace cannot observe new * i_size value before we are prepared to store mmap writes at new inode size. */ From patchwork Tue May 25 13:50:39 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jan Kara X-Patchwork-Id: 447396 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-18.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, URIBL_BLOCKED, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6B15CC47087 for ; Tue, 25 May 2021 13:51:06 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 4AB3361432 for ; Tue, 25 May 2021 13:51:06 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233393AbhEYNwd (ORCPT ); Tue, 25 May 2021 09:52:33 -0400 Received: from mx2.suse.de ([195.135.220.15]:42390 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233155AbhEYNwd (ORCPT ); Tue, 25 May 2021 09:52:33 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_rsa; t=1621950661; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=lk7KBT5ZiYvhccuNilA5BAF8HpqJCTgi83U7AEOzoXI=; b=q5jkwTxGdqeaGaDaqiqLIMLCpp5ptFXuGCcr1hJS/3Ipux9vO7RWjhItyRVSvv9CvGzf1V xLOTzWVgELW+6kBZRNOrt0um7pb3yGoABY1Je9rN0EfiEAsNiovwR/0bdvjxIp1OD3KuwB RdwlPMFCD29Qro9mh1HywNZCokn9tpw= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_ed25519; t=1621950661; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=lk7KBT5ZiYvhccuNilA5BAF8HpqJCTgi83U7AEOzoXI=; b=45NGqsBKUkiz2pmlZQ15hMWjdOMZh/W+62lRh1LP7ZIk+ibFNNxLmmMKvXMknkPqn2A7bv OKqIdEAkUaTkJWAA== Received: from relay2.suse.de (unknown [195.135.221.27]) by mx2.suse.de (Postfix) with ESMTP id AD081AE99; Tue, 25 May 2021 13:51:01 +0000 (UTC) Received: by quack2.suse.cz (Postfix, from userid 1000) id 16FE91F2CAB; Tue, 25 May 2021 15:51:00 +0200 (CEST) From: Jan Kara To: Cc: Christoph Hellwig , Dave Chinner , ceph-devel@vger.kernel.org, Chao Yu , Damien Le Moal , "Darrick J. Wong" , Jaegeuk Kim , Jeff Layton , Johannes Thumshirn , linux-cifs@vger.kernel.org, , linux-f2fs-devel@lists.sourceforge.net, , , Miklos Szeredi , Steve French , Ted Tso , Matthew Wilcox , Jan Kara Subject: [PATCH 02/13] documentation: Sync file_operations members with reality Date: Tue, 25 May 2021 15:50:39 +0200 Message-Id: <20210525135100.11221-2-jack@suse.cz> X-Mailer: git-send-email 2.26.2 In-Reply-To: <20210525125652.20457-1-jack@suse.cz> References: <20210525125652.20457-1-jack@suse.cz> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: ceph-devel@vger.kernel.org Sync listing of struct file_operations members with the real one in fs.h. Signed-off-by: Jan Kara --- Documentation/filesystems/locking.rst | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst index 1e894480115b..4ed2b22bd0a8 100644 --- a/Documentation/filesystems/locking.rst +++ b/Documentation/filesystems/locking.rst @@ -506,6 +506,7 @@ prototypes:: ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); ssize_t (*read_iter) (struct kiocb *, struct iov_iter *); ssize_t (*write_iter) (struct kiocb *, struct iov_iter *); + int (*iopoll) (struct kiocb *kiocb, bool spin); int (*iterate) (struct file *, struct dir_context *); int (*iterate_shared) (struct file *, struct dir_context *); __poll_t (*poll) (struct file *, struct poll_table_struct *); @@ -518,12 +519,6 @@ prototypes:: int (*fsync) (struct file *, loff_t start, loff_t end, int datasync); int (*fasync) (int, struct file *, int); int (*lock) (struct file *, int, struct file_lock *); - ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, - loff_t *); - ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, - loff_t *); - ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, - void __user *); ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int); unsigned long (*get_unmapped_area)(struct file *, unsigned long, @@ -536,6 +531,14 @@ prototypes:: size_t, unsigned int); int (*setlease)(struct file *, long, struct file_lock **, void **); long (*fallocate)(struct file *, int, loff_t, loff_t); + void (*show_fdinfo)(struct seq_file *m, struct file *f); + unsigned (*mmap_capabilities)(struct file *); + ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, + loff_t, size_t, unsigned int); + loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in, + struct file *file_out, loff_t pos_out, + loff_t len, unsigned int remap_flags); + int (*fadvise)(struct file *, loff_t, loff_t, int); locking rules: All may block. From patchwork Tue May 25 13:50:40 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jan Kara X-Patchwork-Id: 447394 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-18.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, URIBL_BLOCKED, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0673AC4708D for ; Tue, 25 May 2021 13:51:17 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id DD11161436 for ; Tue, 25 May 2021 13:51:16 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233193AbhEYNwn (ORCPT ); Tue, 25 May 2021 09:52:43 -0400 Received: from mx2.suse.de ([195.135.220.15]:42616 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233370AbhEYNwe (ORCPT ); Tue, 25 May 2021 09:52:34 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_rsa; t=1621950662; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=DAtyr1HbqsY8i6dz4CNYfKVV2QI/4xne4YQpWKeuH+4=; b=YmpuUMfgprYdJOgnGYaPCHbzN0/crZ/97/iLAcOkXGjR2TBbla/OncdG2VhXfd+Xdt8x+4 Qmxey1X2LwhRr3dn169B/sxxjoSQcqznxYEZlxLuY5j/tWUSSYHiD8AouSNr5RmofA703m 0aZYbwdigoGFP6KamxvWUDsYpEKxBCI= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_ed25519; t=1621950662; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=DAtyr1HbqsY8i6dz4CNYfKVV2QI/4xne4YQpWKeuH+4=; b=c2tDIS4wnXqMxhg3dY/+OaGYEOdRfspROVqM4PbzGhsqBILPk0r5c2XCdEymPr0Y2WBKgk FzCMQeV5zIICLFDQ== Received: from relay2.suse.de (unknown [195.135.221.27]) by mx2.suse.de (Postfix) with ESMTP id B7892AEC8; Tue, 25 May 2021 13:51:01 +0000 (UTC) Received: by quack2.suse.cz (Postfix, from userid 1000) id 1C26E1F2CAE; Tue, 25 May 2021 15:51:00 +0200 (CEST) From: Jan Kara To: Cc: Christoph Hellwig , Dave Chinner , ceph-devel@vger.kernel.org, Chao Yu , Damien Le Moal , "Darrick J. Wong" , Jaegeuk Kim , Jeff Layton , Johannes Thumshirn , linux-cifs@vger.kernel.org, , linux-f2fs-devel@lists.sourceforge.net, , , Miklos Szeredi , Steve French , Ted Tso , Matthew Wilcox , Jan Kara Subject: [PATCH 03/13] mm: Protect operations adding pages to page cache with invalidate_lock Date: Tue, 25 May 2021 15:50:40 +0200 Message-Id: <20210525135100.11221-3-jack@suse.cz> X-Mailer: git-send-email 2.26.2 In-Reply-To: <20210525125652.20457-1-jack@suse.cz> References: <20210525125652.20457-1-jack@suse.cz> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: ceph-devel@vger.kernel.org Currently, serializing operations such as page fault, read, or readahead against hole punching is rather difficult. The basic race scheme is like: fallocate(FALLOC_FL_PUNCH_HOLE) read / fault / .. truncate_inode_pages_range() Now the problem is in this way read / page fault / readahead can instantiate pages in page cache with potentially stale data (if blocks get quickly reused). Avoiding this race is not simple - page locks do not work because we want to make sure there are *no* pages in given range. inode->i_rwsem does not work because page fault happens under mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes the performance for mixed read-write workloads suffer. So create a new rw_semaphore in the address_space - invalidate_lock - that protects adding of pages to page cache for page faults / reads / readahead. Signed-off-by: Jan Kara --- Documentation/filesystems/locking.rst | 64 ++++++++++++++++++-------- fs/inode.c | 2 + include/linux/fs.h | 6 +++ mm/filemap.c | 65 ++++++++++++++++++++++----- mm/readahead.c | 2 + mm/rmap.c | 37 +++++++-------- mm/truncate.c | 3 +- 7 files changed, 129 insertions(+), 50 deletions(-) diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst index 4ed2b22bd0a8..af425bef55d3 100644 --- a/Documentation/filesystems/locking.rst +++ b/Documentation/filesystems/locking.rst @@ -271,19 +271,19 @@ prototypes:: locking rules: All except set_page_dirty and freepage may block -====================== ======================== ========= -ops PageLocked(page) i_rwsem -====================== ======================== ========= +====================== ======================== ========= =============== +ops PageLocked(page) i_rwsem invalidate_lock +====================== ======================== ========= =============== writepage: yes, unlocks (see below) -readpage: yes, unlocks +readpage: yes, unlocks shared writepages: set_page_dirty no -readahead: yes, unlocks -readpages: no +readahead: yes, unlocks shared +readpages: no shared write_begin: locks the page exclusive write_end: yes, unlocks exclusive bmap: -invalidatepage: yes +invalidatepage: yes exclusive releasepage: yes freepage: yes direct_IO: @@ -378,7 +378,10 @@ keep it that way and don't breed new callers. ->invalidatepage() is called when the filesystem must attempt to drop some or all of the buffers from the page when it is being truncated. It returns zero on success. If ->invalidatepage is zero, the kernel uses -block_invalidatepage() instead. +block_invalidatepage() instead. The filesystem should exclusively acquire +invalidate_lock before invalidating page cache in truncate / hole punch path +(and thus calling into ->invalidatepage) to block races between page cache +invalidation and page cache filling functions (fault, read, ...). ->releasepage() is called when the kernel is about to try to drop the buffers from the page in preparation for freeing it. It returns zero to @@ -573,6 +576,27 @@ in sys_read() and friends. the lease within the individual filesystem to record the result of the operation +->fallocate implementation must be really careful to maintain page cache +consistency when punching holes or performing other operations that invalidate +page cache contents. Usually the filesystem needs to call +truncate_inode_pages_range() to invalidate relevant range of the page cache. +However the filesystem usually also needs to update its internal (and on disk) +view of file offset -> disk block mapping. Until this update is finished, the +filesystem needs to block page faults and reads from reloading now-stale page +cache contents from the disk. VFS provides mapping->invalidate_lock for this +and acquires it in shared mode in paths loading pages from disk +(filemap_fault(), filemap_read(), readahead paths). The filesystem is +responsible for taking this lock in its fallocate implementation and generally +whenever the page cache contents needs to be invalidated because a block is +moving from under a page. + +->copy_file_range and ->remap_file_range implementations need to serialize +against modifications of file data while the operation is running. For +blocking changes through write(2) and similar operations inode->i_rwsem can be +used. For blocking changes through memory mapping, the filesystem can use +mapping->invalidate_lock provided it also acquires it in its ->page_mkwrite +implementation. + dquot_operations ================ @@ -630,11 +654,11 @@ pfn_mkwrite: yes access: yes ============= ========= =========================== -->fault() is called when a previously not present pte is about -to be faulted in. The filesystem must find and return the page associated -with the passed in "pgoff" in the vm_fault structure. If it is possible that -the page may be truncated and/or invalidated, then the filesystem must lock -the page, then ensure it is not already truncated (the page lock will block +->fault() is called when a previously not present pte is about to be faulted +in. The filesystem must find and return the page associated with the passed in +"pgoff" in the vm_fault structure. If it is possible that the page may be +truncated and/or invalidated, then the filesystem must lock invalidate_lock, +then ensure the page is not already truncated (invalidate_lock will block subsequent truncate), and then return with VM_FAULT_LOCKED, and the page locked. The VM will unlock the page. @@ -647,12 +671,14 @@ page table entry. Pointer to entry associated with the page is passed in "pte" field in vm_fault structure. Pointers to entries for other offsets should be calculated relative to "pte". -->page_mkwrite() is called when a previously read-only pte is -about to become writeable. The filesystem again must ensure that there are -no truncate/invalidate races, and then return with the page locked. If -the page has been truncated, the filesystem should not look up a new page -like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which -will cause the VM to retry the fault. +->page_mkwrite() is called when a previously read-only pte is about to become +writeable. The filesystem again must ensure that there are no +truncate/invalidate races or races with operations such as ->remap_file_range +or ->copy_file_range, and then return with the page locked. Usually +mapping->invalidate_lock is suitable for proper serialization. If the page has +been truncated, the filesystem should not look up a new page like the ->fault() +handler, but simply return with VM_FAULT_NOPAGE, which will cause the VM to +retry the fault. ->pfn_mkwrite() is the same as page_mkwrite but when the pte is VM_PFNMAP or VM_MIXEDMAP with a page-less entry. Expected return is diff --git a/fs/inode.c b/fs/inode.c index c93500d84264..84c528cd1955 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -190,6 +190,8 @@ int inode_init_always(struct super_block *sb, struct inode *inode) mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE); mapping->private_data = NULL; mapping->writeback_index = 0; + __init_rwsem(&mapping->invalidate_lock, "mapping.invalidate_lock", + &sb->s_type->invalidate_lock_key); inode->i_private = NULL; inode->i_mapping = mapping; INIT_HLIST_HEAD(&inode->i_dentry); /* buggered by rcu freeing */ diff --git a/include/linux/fs.h b/include/linux/fs.h index c3c88fdb9b2a..897238d9f1e0 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -436,6 +436,10 @@ int pagecache_write_end(struct file *, struct address_space *mapping, * struct address_space - Contents of a cacheable, mappable object. * @host: Owner, either the inode or the block_device. * @i_pages: Cached pages. + * @invalidate_lock: Guards coherency between page cache contents and + * file offset->disk block mappings in the filesystem during invalidates. + * It is also used to block modification of page cache contents through + * memory mappings. * @gfp_mask: Memory allocation flags to use for allocating pages. * @i_mmap_writable: Number of VM_SHARED mappings. * @nr_thps: Number of THPs in the pagecache (non-shmem only). @@ -453,6 +457,7 @@ int pagecache_write_end(struct file *, struct address_space *mapping, struct address_space { struct inode *host; struct xarray i_pages; + struct rw_semaphore invalidate_lock; gfp_t gfp_mask; atomic_t i_mmap_writable; #ifdef CONFIG_READ_ONLY_THP_FOR_FS @@ -2488,6 +2493,7 @@ struct file_system_type { struct lock_class_key i_lock_key; struct lock_class_key i_mutex_key; + struct lock_class_key invalidate_lock_key; struct lock_class_key i_mutex_dir_key; }; diff --git a/mm/filemap.c b/mm/filemap.c index ba1068a1837f..4d9ec4c6cc34 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -77,7 +77,8 @@ * ->i_pages lock * * ->i_rwsem - * ->i_mmap_rwsem (truncate->unmap_mapping_range) + * ->invalidate_lock (acquired by fs in truncate path) + * ->i_mmap_rwsem (truncate->unmap_mapping_range) * * ->mmap_lock * ->i_mmap_rwsem @@ -85,7 +86,8 @@ * ->i_pages lock (arch-dependent flush_dcache_mmap_lock) * * ->mmap_lock - * ->lock_page (access_process_vm) + * ->invalidate_lock (filemap_fault) + * ->lock_page (filemap_fault, access_process_vm) * * ->i_rwsem (generic_perform_write) * ->mmap_lock (fault_in_pages_readable->do_page_fault) @@ -2368,20 +2370,30 @@ static int filemap_update_page(struct kiocb *iocb, { int error; + if (iocb->ki_flags & IOCB_NOWAIT) { + if (!down_read_trylock(&mapping->invalidate_lock)) + return -EAGAIN; + } else { + down_read(&mapping->invalidate_lock); + } + if (!trylock_page(page)) { + error = -EAGAIN; if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO)) - return -EAGAIN; + goto unlock_mapping; if (!(iocb->ki_flags & IOCB_WAITQ)) { + up_read(&mapping->invalidate_lock); put_and_wait_on_page_locked(page, TASK_KILLABLE); return AOP_TRUNCATED_PAGE; } error = __lock_page_async(page, iocb->ki_waitq); if (error) - return error; + goto unlock_mapping; } + error = AOP_TRUNCATED_PAGE; if (!page->mapping) - goto truncated; + goto unlock; error = 0; if (filemap_range_uptodate(mapping, iocb->ki_pos, iter, page)) @@ -2392,15 +2404,13 @@ static int filemap_update_page(struct kiocb *iocb, goto unlock; error = filemap_read_page(iocb->ki_filp, mapping, page); - if (error == AOP_TRUNCATED_PAGE) - put_page(page); - return error; -truncated: - unlock_page(page); - put_page(page); - return AOP_TRUNCATED_PAGE; + goto unlock_mapping; unlock: unlock_page(page); +unlock_mapping: + up_read(&mapping->invalidate_lock); + if (error == AOP_TRUNCATED_PAGE) + put_page(page); return error; } @@ -2415,6 +2425,19 @@ static int filemap_create_page(struct file *file, if (!page) return -ENOMEM; + /* + * Protect against truncate / hole punch. Grabbing invalidate_lock here + * assures we cannot instantiate and bring uptodate new pagecache pages + * after evicting page cache during truncate and before actually + * freeing blocks. Note that we could release invalidate_lock after + * inserting the page into page cache as the locked page would then be + * enough to synchronize with hole punching. But there are code paths + * such as filemap_update_page() filling in partially uptodate pages or + * ->readpages() that need to hold invalidate_lock while mapping blocks + * for IO so let's hold the lock here as well to keep locking rules + * simple. + */ + down_read(&mapping->invalidate_lock); error = add_to_page_cache_lru(page, mapping, index, mapping_gfp_constraint(mapping, GFP_KERNEL)); if (error == -EEXIST) @@ -2426,9 +2449,11 @@ static int filemap_create_page(struct file *file, if (error) goto error; + up_read(&mapping->invalidate_lock); pagevec_add(pvec, page); return 0; error: + up_read(&mapping->invalidate_lock); put_page(page); return error; } @@ -2988,6 +3013,13 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT); ret = VM_FAULT_MAJOR; fpin = do_sync_mmap_readahead(vmf); + } + + /* + * See comment in filemap_create_page() why we need invalidate_lock + */ + down_read(&mapping->invalidate_lock); + if (!page) { retry_find: page = pagecache_get_page(mapping, offset, FGP_CREAT|FGP_FOR_MMAP, @@ -2995,6 +3027,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) if (!page) { if (fpin) goto out_retry; + up_read(&mapping->invalidate_lock); return VM_FAULT_OOM; } } @@ -3035,9 +3068,11 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) if (unlikely(offset >= max_off)) { unlock_page(page); put_page(page); + up_read(&mapping->invalidate_lock); return VM_FAULT_SIGBUS; } + up_read(&mapping->invalidate_lock); vmf->page = page; return ret | VM_FAULT_LOCKED; @@ -3056,6 +3091,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) if (!error || error == AOP_TRUNCATED_PAGE) goto retry_find; + up_read(&mapping->invalidate_lock); return VM_FAULT_SIGBUS; @@ -3067,6 +3103,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) */ if (page) put_page(page); + up_read(&mapping->invalidate_lock); if (fpin) fput(fpin); return ret | VM_FAULT_RETRY; @@ -3437,6 +3474,8 @@ static struct page *do_read_cache_page(struct address_space *mapping, * * If the page does not get brought uptodate, return -EIO. * + * The function expects mapping->invalidate_lock to be already held. + * * Return: up to date page on success, ERR_PTR() on failure. */ struct page *read_cache_page(struct address_space *mapping, @@ -3460,6 +3499,8 @@ EXPORT_SYMBOL(read_cache_page); * * If the page does not get brought uptodate, return -EIO. * + * The function expects mapping->invalidate_lock to be already held. + * * Return: up to date page on success, ERR_PTR() on failure. */ struct page *read_cache_page_gfp(struct address_space *mapping, diff --git a/mm/readahead.c b/mm/readahead.c index d589f147f4c2..9785c54107bb 100644 --- a/mm/readahead.c +++ b/mm/readahead.c @@ -192,6 +192,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl, */ unsigned int nofs = memalloc_nofs_save(); + down_read(&mapping->invalidate_lock); /* * Preallocate as many pages as we will need. */ @@ -236,6 +237,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl, * will then handle the error. */ read_pages(ractl, &page_pool, false); + up_read(&mapping->invalidate_lock); memalloc_nofs_restore(nofs); } EXPORT_SYMBOL_GPL(page_cache_ra_unbounded); diff --git a/mm/rmap.c b/mm/rmap.c index a35cbbbded0d..76d33c3b8ae6 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -22,24 +22,25 @@ * * inode->i_rwsem (while writing or truncating, not reading or faulting) * mm->mmap_lock - * page->flags PG_locked (lock_page) * (see hugetlbfs below) - * hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share) - * mapping->i_mmap_rwsem - * hugetlb_fault_mutex (hugetlbfs specific page fault mutex) - * anon_vma->rwsem - * mm->page_table_lock or pte_lock - * swap_lock (in swap_duplicate, swap_info_get) - * mmlist_lock (in mmput, drain_mmlist and others) - * mapping->private_lock (in __set_page_dirty_buffers) - * lock_page_memcg move_lock (in __set_page_dirty_buffers) - * i_pages lock (widely used) - * lruvec->lru_lock (in lock_page_lruvec_irq) - * inode->i_lock (in set_page_dirty's __mark_inode_dirty) - * bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty) - * sb_lock (within inode_lock in fs/fs-writeback.c) - * i_pages lock (widely used, in set_page_dirty, - * in arch-dependent flush_dcache_mmap_lock, - * within bdi.wb->list_lock in __sync_single_inode) + * mapping->invalidate_lock (in filemap_fault) + * page->flags PG_locked (lock_page) * (see hugetlbfs below) + * hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share) + * mapping->i_mmap_rwsem + * hugetlb_fault_mutex (hugetlbfs specific page fault mutex) + * anon_vma->rwsem + * mm->page_table_lock or pte_lock + * swap_lock (in swap_duplicate, swap_info_get) + * mmlist_lock (in mmput, drain_mmlist and others) + * mapping->private_lock (in __set_page_dirty_buffers) + * lock_page_memcg move_lock (in __set_page_dirty_buffers) + * i_pages lock (widely used) + * lruvec->lru_lock (in lock_page_lruvec_irq) + * inode->i_lock (in set_page_dirty's __mark_inode_dirty) + * bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty) + * sb_lock (within inode_lock in fs/fs-writeback.c) + * i_pages lock (widely used, in set_page_dirty, + * in arch-dependent flush_dcache_mmap_lock, + * within bdi.wb->list_lock in __sync_single_inode) * * anon_vma->rwsem,mapping->i_mmap_rwsem (memory_failure, collect_procs_anon) * ->tasklist_lock diff --git a/mm/truncate.c b/mm/truncate.c index 57a618c4a0d6..d0cc6588aba2 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -415,7 +415,8 @@ EXPORT_SYMBOL(truncate_inode_pages_range); * @mapping: mapping to truncate * @lstart: offset from which to truncate * - * Called under (and serialised by) inode->i_rwsem. + * Called under (and serialised by) inode->i_rwsem and + * mapping->invalidate_lock. * * Note: When this function returns, there can be a page in the process of * deletion (inside __delete_from_page_cache()) in the specified range. Thus From patchwork Tue May 25 13:50:44 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jan Kara X-Patchwork-Id: 447393 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-18.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, URIBL_BLOCKED, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id AFE5AC47242 for ; Tue, 25 May 2021 13:51:21 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 90CE661420 for ; Tue, 25 May 2021 13:51:21 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233555AbhEYNws (ORCPT ); Tue, 25 May 2021 09:52:48 -0400 Received: from mx2.suse.de ([195.135.220.15]:42684 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233396AbhEYNwf (ORCPT ); Tue, 25 May 2021 09:52:35 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_rsa; t=1621950662; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=1Zm3GVDAk5aDXe2oCWbogZ1CdxctGqDI9RHacxZ37OA=; b=UvfzAbk2FV6liQCbICuNLqokQTuq76a239lxr+CsPOrA0tPk4fALfbzAsdkBTPZittPII1 Nudohpq6Q6ZiexTpz3SJEoH2hrxmk6cmi/mwCWDajRc4gAHNgP7QZVIb599/jSRHH6gc0k pE+smzd6pi3Rrq3PFtg1K1y09N+Xdqw= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_ed25519; t=1621950662; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=1Zm3GVDAk5aDXe2oCWbogZ1CdxctGqDI9RHacxZ37OA=; b=/EJJwL07hO3T+1CGeSpg67JNe590u7yBl7pKd8Vg7NYw6XQCNpUforMSO/lJ0Eb5GKgeHR w1wrarlb74vlGFBw== Received: from relay2.suse.de (unknown [195.135.221.27]) by mx2.suse.de (Postfix) with ESMTP id EC3E6AED5; Tue, 25 May 2021 13:51:01 +0000 (UTC) Received: by quack2.suse.cz (Postfix, from userid 1000) id 2C3321F2CB3; Tue, 25 May 2021 15:51:00 +0200 (CEST) From: Jan Kara To: Cc: Christoph Hellwig , Dave Chinner , ceph-devel@vger.kernel.org, Chao Yu , Damien Le Moal , "Darrick J. Wong" , Jaegeuk Kim , Jeff Layton , Johannes Thumshirn , linux-cifs@vger.kernel.org, , linux-f2fs-devel@lists.sourceforge.net, , , Miklos Szeredi , Steve French , Ted Tso , Matthew Wilcox , Jan Kara , Christoph Hellwig Subject: [PATCH 07/13] xfs: Convert to use invalidate_lock Date: Tue, 25 May 2021 15:50:44 +0200 Message-Id: <20210525135100.11221-7-jack@suse.cz> X-Mailer: git-send-email 2.26.2 In-Reply-To: <20210525125652.20457-1-jack@suse.cz> References: <20210525125652.20457-1-jack@suse.cz> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: ceph-devel@vger.kernel.org Use invalidate_lock instead of XFS internal i_mmap_lock. The intended purpose of invalidate_lock is exactly the same. Note that the locking in __xfs_filemap_fault() slightly changes as filemap_fault() already takes invalidate_lock. Reviewed-by: Christoph Hellwig CC: CC: "Darrick J. Wong" Signed-off-by: Jan Kara --- fs/xfs/xfs_file.c | 12 ++++++----- fs/xfs/xfs_inode.c | 52 ++++++++++++++++++++++++++-------------------- fs/xfs/xfs_inode.h | 1 - fs/xfs/xfs_super.c | 2 -- 4 files changed, 36 insertions(+), 31 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 396ef36dcd0a..dc9cb5c20549 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1282,7 +1282,7 @@ xfs_file_llseek( * * mmap_lock (MM) * sb_start_pagefault(vfs, freeze) - * i_mmaplock (XFS - truncate serialisation) + * invalidate_lock (vfs/XFS_MMAPLOCK - truncate serialisation) * page_lock (MM) * i_lock (XFS - extent map serialisation) */ @@ -1303,24 +1303,26 @@ __xfs_filemap_fault( file_update_time(vmf->vma->vm_file); } - xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED); if (IS_DAX(inode)) { pfn_t pfn; + xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED); ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL, (write_fault && !vmf->cow_page) ? &xfs_direct_write_iomap_ops : &xfs_read_iomap_ops); if (ret & VM_FAULT_NEEDDSYNC) ret = dax_finish_sync_fault(vmf, pe_size, pfn); + xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED); } else { - if (write_fault) + if (write_fault) { + xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED); ret = iomap_page_mkwrite(vmf, &xfs_buffered_write_iomap_ops); - else + xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED); + } else ret = filemap_fault(vmf); } - xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED); if (write_fault) sb_end_pagefault(inode->i_sb); diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 0369eb22c1bb..53bb5fc33621 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -131,7 +131,7 @@ xfs_ilock_attr_map_shared( /* * In addition to i_rwsem in the VFS inode, the xfs inode contains 2 - * multi-reader locks: i_mmap_lock and the i_lock. This routine allows + * multi-reader locks: invalidate_lock and the i_lock. This routine allows * various combinations of the locks to be obtained. * * The 3 locks should always be ordered so that the IO lock is obtained first, @@ -139,23 +139,23 @@ xfs_ilock_attr_map_shared( * * Basic locking order: * - * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock + * i_rwsem -> invalidate_lock -> page_lock -> i_ilock * * mmap_lock locking order: * * i_rwsem -> page lock -> mmap_lock - * mmap_lock -> i_mmap_lock -> page_lock + * mmap_lock -> invalidate_lock -> page_lock * * The difference in mmap_lock locking order mean that we cannot hold the - * i_mmap_lock over syscall based read(2)/write(2) based IO. These IO paths can - * fault in pages during copy in/out (for buffered IO) or require the mmap_lock - * in get_user_pages() to map the user pages into the kernel address space for - * direct IO. Similarly the i_rwsem cannot be taken inside a page fault because - * page faults already hold the mmap_lock. + * invalidate_lock over syscall based read(2)/write(2) based IO. These IO paths + * can fault in pages during copy in/out (for buffered IO) or require the + * mmap_lock in get_user_pages() to map the user pages into the kernel address + * space for direct IO. Similarly the i_rwsem cannot be taken inside a page + * fault because page faults already hold the mmap_lock. * * Hence to serialise fully against both syscall and mmap based IO, we need to - * take both the i_rwsem and the i_mmap_lock. These locks should *only* be both - * taken in places where we need to invalidate the page cache in a race + * take both the i_rwsem and the invalidate_lock. These locks should *only* be + * both taken in places where we need to invalidate the page cache in a race * free manner (e.g. truncate, hole punch and other extent manipulation * functions). */ @@ -187,10 +187,13 @@ xfs_ilock( XFS_IOLOCK_DEP(lock_flags)); } - if (lock_flags & XFS_MMAPLOCK_EXCL) - mrupdate_nested(&ip->i_mmaplock, XFS_MMAPLOCK_DEP(lock_flags)); - else if (lock_flags & XFS_MMAPLOCK_SHARED) - mraccess_nested(&ip->i_mmaplock, XFS_MMAPLOCK_DEP(lock_flags)); + if (lock_flags & XFS_MMAPLOCK_EXCL) { + down_write_nested(&VFS_I(ip)->i_mapping->invalidate_lock, + XFS_MMAPLOCK_DEP(lock_flags)); + } else if (lock_flags & XFS_MMAPLOCK_SHARED) { + down_read_nested(&VFS_I(ip)->i_mapping->invalidate_lock, + XFS_MMAPLOCK_DEP(lock_flags)); + } if (lock_flags & XFS_ILOCK_EXCL) mrupdate_nested(&ip->i_lock, XFS_ILOCK_DEP(lock_flags)); @@ -239,10 +242,10 @@ xfs_ilock_nowait( } if (lock_flags & XFS_MMAPLOCK_EXCL) { - if (!mrtryupdate(&ip->i_mmaplock)) + if (!down_write_trylock(&VFS_I(ip)->i_mapping->invalidate_lock)) goto out_undo_iolock; } else if (lock_flags & XFS_MMAPLOCK_SHARED) { - if (!mrtryaccess(&ip->i_mmaplock)) + if (!down_read_trylock(&VFS_I(ip)->i_mapping->invalidate_lock)) goto out_undo_iolock; } @@ -257,9 +260,9 @@ xfs_ilock_nowait( out_undo_mmaplock: if (lock_flags & XFS_MMAPLOCK_EXCL) - mrunlock_excl(&ip->i_mmaplock); + up_write(&VFS_I(ip)->i_mapping->invalidate_lock); else if (lock_flags & XFS_MMAPLOCK_SHARED) - mrunlock_shared(&ip->i_mmaplock); + up_read(&VFS_I(ip)->i_mapping->invalidate_lock); out_undo_iolock: if (lock_flags & XFS_IOLOCK_EXCL) up_write(&VFS_I(ip)->i_rwsem); @@ -306,9 +309,9 @@ xfs_iunlock( up_read(&VFS_I(ip)->i_rwsem); if (lock_flags & XFS_MMAPLOCK_EXCL) - mrunlock_excl(&ip->i_mmaplock); + up_write(&VFS_I(ip)->i_mapping->invalidate_lock); else if (lock_flags & XFS_MMAPLOCK_SHARED) - mrunlock_shared(&ip->i_mmaplock); + up_read(&VFS_I(ip)->i_mapping->invalidate_lock); if (lock_flags & XFS_ILOCK_EXCL) mrunlock_excl(&ip->i_lock); @@ -334,7 +337,7 @@ xfs_ilock_demote( if (lock_flags & XFS_ILOCK_EXCL) mrdemote(&ip->i_lock); if (lock_flags & XFS_MMAPLOCK_EXCL) - mrdemote(&ip->i_mmaplock); + downgrade_write(&VFS_I(ip)->i_mapping->invalidate_lock); if (lock_flags & XFS_IOLOCK_EXCL) downgrade_write(&VFS_I(ip)->i_rwsem); @@ -355,8 +358,11 @@ xfs_isilocked( if (lock_flags & (XFS_MMAPLOCK_EXCL|XFS_MMAPLOCK_SHARED)) { if (!(lock_flags & XFS_MMAPLOCK_SHARED)) - return !!ip->i_mmaplock.mr_writer; - return rwsem_is_locked(&ip->i_mmaplock.mr_lock); + return !debug_locks || + lockdep_is_held_type( + &VFS_I(ip)->i_mapping->invalidate_lock, + 0); + return rwsem_is_locked(&VFS_I(ip)->i_mapping->invalidate_lock); } if (lock_flags & (XFS_IOLOCK_EXCL|XFS_IOLOCK_SHARED)) { diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h index ca826cfba91c..a0e4153efbbe 100644 --- a/fs/xfs/xfs_inode.h +++ b/fs/xfs/xfs_inode.h @@ -40,7 +40,6 @@ typedef struct xfs_inode { /* Transaction and locking information. */ struct xfs_inode_log_item *i_itemp; /* logging information */ mrlock_t i_lock; /* inode lock */ - mrlock_t i_mmaplock; /* inode mmap IO lock */ atomic_t i_pincount; /* inode pin count */ /* diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index a2dab05332ac..eeaf44910b5f 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -715,8 +715,6 @@ xfs_fs_inode_init_once( atomic_set(&ip->i_pincount, 0); spin_lock_init(&ip->i_flags_lock); - mrlock_init(&ip->i_mmaplock, MRLOCK_ALLOW_EQUAL_PRI|MRLOCK_BARRIER, - "xfsino", ip->i_ino); mrlock_init(&ip->i_lock, MRLOCK_ALLOW_EQUAL_PRI|MRLOCK_BARRIER, "xfsino", ip->i_ino); } From patchwork Tue May 25 13:50:45 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jan Kara X-Patchwork-Id: 447392 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-18.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, URIBL_BLOCKED, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id DC24AC47244 for ; Tue, 25 May 2021 13:51:25 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id BCDA261420 for ; Tue, 25 May 2021 13:51:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233609AbhEYNwy (ORCPT ); Tue, 25 May 2021 09:52:54 -0400 Received: from mx2.suse.de ([195.135.220.15]:42768 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233429AbhEYNwj (ORCPT ); Tue, 25 May 2021 09:52:39 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_rsa; t=1621950662; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=VG4deE2P8cYdaMa6Zotd3HxtIalbbn5ZHvE2Qp4EUmE=; b=iIv9F3TWePTWfY7Vf8W8XsK00ixNePK8YfWOUvmPhZuiOoWXUKbfzCrGrgPwGKp2s8EkOG SkpyX6pTl6dq8WZDKnCobFDqG9x8qcLnAjk9LVpPCzfZKTmg8aL6HMEXYj0d6i6fQhYPs9 6PF+4i2ag3a3JzJEV5RCdD9bOMHYeEM= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_ed25519; t=1621950662; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=VG4deE2P8cYdaMa6Zotd3HxtIalbbn5ZHvE2Qp4EUmE=; b=kqdKzYnKczjHbq4HcSGWHgKO4bpHccvc9gS8UeQPgoGM146zoBaS8Hhdu/c4r3RI8GRZRy e+3lf04b5B20hWBQ== Received: from relay2.suse.de (unknown [195.135.221.27]) by mx2.suse.de (Postfix) with ESMTP id 15E2EAEEE; Tue, 25 May 2021 13:51:02 +0000 (UTC) Received: by quack2.suse.cz (Postfix, from userid 1000) id 30C781F2CB6; Tue, 25 May 2021 15:51:00 +0200 (CEST) From: Jan Kara To: Cc: Christoph Hellwig , Dave Chinner , ceph-devel@vger.kernel.org, Chao Yu , Damien Le Moal , "Darrick J. Wong" , Jaegeuk Kim , Jeff Layton , Johannes Thumshirn , linux-cifs@vger.kernel.org, , linux-f2fs-devel@lists.sourceforge.net, , , Miklos Szeredi , Steve French , Ted Tso , Matthew Wilcox , Jan Kara Subject: [PATCH 08/13] xfs: Convert double locking of MMAPLOCK to use VFS helpers Date: Tue, 25 May 2021 15:50:45 +0200 Message-Id: <20210525135100.11221-8-jack@suse.cz> X-Mailer: git-send-email 2.26.2 In-Reply-To: <20210525125652.20457-1-jack@suse.cz> References: <20210525125652.20457-1-jack@suse.cz> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: ceph-devel@vger.kernel.org Convert places in XFS that take MMAPLOCK for two inodes to use helper VFS provides for it (filemap_invalidate_down_write_two()). Note that this changes lock ordering for MMAPLOCK from inode number based ordering to pointer based ordering VFS generally uses. Signed-off-by: Jan Kara Reviewed-by: Darrick J. Wong --- fs/xfs/xfs_bmap_util.c | 15 ++++++++------- fs/xfs/xfs_inode.c | 37 +++++++++++-------------------------- 2 files changed, 19 insertions(+), 33 deletions(-) diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c index a5e9d7d34023..8a5cede59f3f 100644 --- a/fs/xfs/xfs_bmap_util.c +++ b/fs/xfs/xfs_bmap_util.c @@ -1582,7 +1582,6 @@ xfs_swap_extents( struct xfs_bstat *sbp = &sxp->sx_stat; int src_log_flags, target_log_flags; int error = 0; - int lock_flags; uint64_t f; int resblks = 0; unsigned int flags = 0; @@ -1594,8 +1593,8 @@ xfs_swap_extents( * do the rest of the checks. */ lock_two_nondirectories(VFS_I(ip), VFS_I(tip)); - lock_flags = XFS_MMAPLOCK_EXCL; - xfs_lock_two_inodes(ip, XFS_MMAPLOCK_EXCL, tip, XFS_MMAPLOCK_EXCL); + filemap_invalidate_down_write_two(VFS_I(ip)->i_mapping, + VFS_I(tip)->i_mapping); /* Verify that both files have the same format */ if ((VFS_I(ip)->i_mode & S_IFMT) != (VFS_I(tip)->i_mode & S_IFMT)) { @@ -1667,7 +1666,6 @@ xfs_swap_extents( * or cancel will unlock the inodes from this point onwards. */ xfs_lock_two_inodes(ip, XFS_ILOCK_EXCL, tip, XFS_ILOCK_EXCL); - lock_flags |= XFS_ILOCK_EXCL; xfs_trans_ijoin(tp, ip, 0); xfs_trans_ijoin(tp, tip, 0); @@ -1786,13 +1784,16 @@ xfs_swap_extents( trace_xfs_swap_extent_after(ip, 0); trace_xfs_swap_extent_after(tip, 1); +out_unlock_ilock: + xfs_iunlock(ip, XFS_ILOCK_EXCL); + xfs_iunlock(tip, XFS_ILOCK_EXCL); out_unlock: - xfs_iunlock(ip, lock_flags); - xfs_iunlock(tip, lock_flags); + filemap_invalidate_up_write_two(VFS_I(ip)->i_mapping, + VFS_I(tip)->i_mapping); unlock_two_nondirectories(VFS_I(ip), VFS_I(tip)); return error; out_trans_cancel: xfs_trans_cancel(tp); - goto out_unlock; + goto out_unlock_ilock; } diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 53bb5fc33621..11616c9b37f8 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -537,12 +537,10 @@ xfs_lock_inodes( } /* - * xfs_lock_two_inodes() can only be used to lock one type of lock at a time - - * the mmaplock or the ilock, but not more than one type at a time. If we lock - * more than one at a time, lockdep will report false positives saying we have - * violated locking orders. The iolock must be double-locked separately since - * we use i_rwsem for that. We now support taking one lock EXCL and the other - * SHARED. + * xfs_lock_two_inodes() can only be used to lock ilock. The iolock and + * mmaplock must be double-locked separately since we use i_rwsem and + * invalidate_lock for that. We now support taking one lock EXCL and the + * other SHARED. */ void xfs_lock_two_inodes( @@ -560,15 +558,8 @@ xfs_lock_two_inodes( ASSERT(hweight32(ip1_mode) == 1); ASSERT(!(ip0_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL))); ASSERT(!(ip1_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL))); - ASSERT(!(ip0_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) || - !(ip0_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL))); - ASSERT(!(ip1_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) || - !(ip1_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL))); - ASSERT(!(ip1_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) || - !(ip0_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL))); - ASSERT(!(ip0_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) || - !(ip1_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL))); - + ASSERT(!(ip0_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL))); + ASSERT(!(ip1_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL))); ASSERT(ip0->i_ino != ip1->i_ino); if (ip0->i_ino > ip1->i_ino) { @@ -3731,11 +3722,8 @@ xfs_ilock2_io_mmap( ret = xfs_iolock_two_inodes_and_break_layout(VFS_I(ip1), VFS_I(ip2)); if (ret) return ret; - if (ip1 == ip2) - xfs_ilock(ip1, XFS_MMAPLOCK_EXCL); - else - xfs_lock_two_inodes(ip1, XFS_MMAPLOCK_EXCL, - ip2, XFS_MMAPLOCK_EXCL); + filemap_invalidate_down_write_two(VFS_I(ip1)->i_mapping, + VFS_I(ip2)->i_mapping); return 0; } @@ -3745,12 +3733,9 @@ xfs_iunlock2_io_mmap( struct xfs_inode *ip1, struct xfs_inode *ip2) { - bool same_inode = (ip1 == ip2); - - xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL); - if (!same_inode) - xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL); + filemap_invalidate_up_write_two(VFS_I(ip1)->i_mapping, + VFS_I(ip2)->i_mapping); inode_unlock(VFS_I(ip2)); - if (!same_inode) + if (ip1 != ip2) inode_unlock(VFS_I(ip1)); } From patchwork Tue May 25 13:50:48 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jan Kara X-Patchwork-Id: 447391 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-18.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, URIBL_BLOCKED, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6F914C2B9F8 for ; Tue, 25 May 2021 13:51:28 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 4B0136140F for ; Tue, 25 May 2021 13:51:28 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233465AbhEYNw4 (ORCPT ); Tue, 25 May 2021 09:52:56 -0400 Received: from mx2.suse.de ([195.135.220.15]:42782 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233434AbhEYNwj (ORCPT ); Tue, 25 May 2021 09:52:39 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_rsa; t=1621950662; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=bBZEAWXxK0cQSvtCjbZ2jRXsNeD7aGQ4wH0EwxRk+ck=; b=BVpginvTs6xxAq85PzoU16jz4hz30U3rEh/Jpo1AEYWjhK29bb3VcmKaLDg9A4+g8dRLY5 pUELkZMmRftBdClrllfYJyeFgSFJV7gyvhvLqWuKt6itx0nBqF+jSmFql1yurSJoSStjyh YshNQEsIeAwZxYoPNX4IqXNMlpE4bYs= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_ed25519; t=1621950662; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=bBZEAWXxK0cQSvtCjbZ2jRXsNeD7aGQ4wH0EwxRk+ck=; b=0vEo9pDYWRXGcHd5KxbaL50sFGz/6N2a0KacwAlZjpFJ4qlgt5PhlQHLJdFd+ycBWyddzY 4CyUhYacfoDOvHBQ== Received: from relay2.suse.de (unknown [195.135.221.27]) by mx2.suse.de (Postfix) with ESMTP id 2EEB2AEFF; Tue, 25 May 2021 13:51:02 +0000 (UTC) Received: by quack2.suse.cz (Postfix, from userid 1000) id 3CDA41F2CBA; Tue, 25 May 2021 15:51:00 +0200 (CEST) From: Jan Kara To: Cc: Christoph Hellwig , Dave Chinner , ceph-devel@vger.kernel.org, Chao Yu , Damien Le Moal , "Darrick J. Wong" , Jaegeuk Kim , Jeff Layton , Johannes Thumshirn , linux-cifs@vger.kernel.org, , linux-f2fs-devel@lists.sourceforge.net, , , Miklos Szeredi , Steve French , Ted Tso , Matthew Wilcox , Jan Kara Subject: [PATCH 11/13] fuse: Convert to using invalidate_lock Date: Tue, 25 May 2021 15:50:48 +0200 Message-Id: <20210525135100.11221-11-jack@suse.cz> X-Mailer: git-send-email 2.26.2 In-Reply-To: <20210525125652.20457-1-jack@suse.cz> References: <20210525125652.20457-1-jack@suse.cz> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: ceph-devel@vger.kernel.org Use invalidate_lock instead of fuse's private i_mmap_sem. The intended purpose is exactly the same. By this conversion we fix a long standing race between hole punching and read(2) / readahead(2) paths that can lead to stale page cache contents. CC: Miklos Szeredi Signed-off-by: Jan Kara --- fs/fuse/dax.c | 50 +++++++++++++++++++++++------------------------- fs/fuse/dir.c | 11 ++++++----- fs/fuse/file.c | 10 +++++----- fs/fuse/fuse_i.h | 7 ------- fs/fuse/inode.c | 1 - 5 files changed, 35 insertions(+), 44 deletions(-) diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c index ff99ab2a3c43..03e5477ee913 100644 --- a/fs/fuse/dax.c +++ b/fs/fuse/dax.c @@ -443,12 +443,12 @@ static int fuse_setup_new_dax_mapping(struct inode *inode, loff_t pos, /* * Can't do inline reclaim in fault path. We call * dax_layout_busy_page() before we free a range. And - * fuse_wait_dax_page() drops fi->i_mmap_sem lock and requires it. - * In fault path we enter with fi->i_mmap_sem held and can't drop - * it. Also in fault path we hold fi->i_mmap_sem shared and not - * exclusive, so that creates further issues with fuse_wait_dax_page(). - * Hence return -EAGAIN and fuse_dax_fault() will wait for a memory - * range to become free and retry. + * fuse_wait_dax_page() drops mapping->invalidate_lock and requires it. + * In fault path we enter with mapping->invalidate_lock held and can't + * drop it. Also in fault path we hold mapping->invalidate_lock shared + * and not exclusive, so that creates further issues with + * fuse_wait_dax_page(). Hence return -EAGAIN and fuse_dax_fault() + * will wait for a memory range to become free and retry. */ if (flags & IOMAP_FAULT) { alloc_dmap = alloc_dax_mapping(fcd); @@ -512,7 +512,7 @@ static int fuse_upgrade_dax_mapping(struct inode *inode, loff_t pos, down_write(&fi->dax->sem); node = interval_tree_iter_first(&fi->dax->tree, idx, idx); - /* We are holding either inode lock or i_mmap_sem, and that should + /* We are holding either inode lock or invalidate_lock, and that should * ensure that dmap can't be truncated. We are holding a reference * on dmap and that should make sure it can't be reclaimed. So dmap * should still be there in tree despite the fact we dropped and @@ -659,14 +659,12 @@ static const struct iomap_ops fuse_iomap_ops = { static void fuse_wait_dax_page(struct inode *inode) { - struct fuse_inode *fi = get_fuse_inode(inode); - - up_write(&fi->i_mmap_sem); + up_write(&inode->i_mapping->invalidate_lock); schedule(); - down_write(&fi->i_mmap_sem); + down_write(&inode->i_mapping->invalidate_lock); } -/* Should be called with fi->i_mmap_sem lock held exclusively */ +/* Should be called with mapping->invalidate_lock held exclusively */ static int __fuse_dax_break_layouts(struct inode *inode, bool *retry, loff_t start, loff_t end) { @@ -812,18 +810,18 @@ static vm_fault_t __fuse_dax_fault(struct vm_fault *vmf, * we do not want any read/write/mmap to make progress and try * to populate page cache or access memory we are trying to free. */ - down_read(&get_fuse_inode(inode)->i_mmap_sem); + down_read(&inode->i_mapping->invalidate_lock); ret = dax_iomap_fault(vmf, pe_size, &pfn, &error, &fuse_iomap_ops); if ((ret & VM_FAULT_ERROR) && error == -EAGAIN) { error = 0; retry = true; - up_read(&get_fuse_inode(inode)->i_mmap_sem); + up_read(&inode->i_mapping->invalidate_lock); goto retry; } if (ret & VM_FAULT_NEEDDSYNC) ret = dax_finish_sync_fault(vmf, pe_size, pfn); - up_read(&get_fuse_inode(inode)->i_mmap_sem); + up_read(&inode->i_mapping->invalidate_lock); if (write) sb_end_pagefault(sb); @@ -959,7 +957,7 @@ inode_inline_reclaim_one_dmap(struct fuse_conn_dax *fcd, struct inode *inode, int ret; struct interval_tree_node *node; - down_write(&fi->i_mmap_sem); + down_write(&inode->i_mapping->invalidate_lock); /* Lookup a dmap and corresponding file offset to reclaim. */ down_read(&fi->dax->sem); @@ -1020,7 +1018,7 @@ inode_inline_reclaim_one_dmap(struct fuse_conn_dax *fcd, struct inode *inode, out_write_dmap_sem: up_write(&fi->dax->sem); out_mmap_sem: - up_write(&fi->i_mmap_sem); + up_write(&inode->i_mapping->invalidate_lock); return dmap; } @@ -1049,10 +1047,10 @@ alloc_dax_mapping_reclaim(struct fuse_conn_dax *fcd, struct inode *inode) * had a reference or some other temporary failure, * Try again. We want to give up inline reclaim only * if there is no range assigned to this node. Otherwise - * if a deadlock is possible if we sleep with fi->i_mmap_sem - * held and worker to free memory can't make progress due - * to unavailability of fi->i_mmap_sem lock. So sleep - * only if fi->dax->nr=0 + * if a deadlock is possible if we sleep with + * mapping->invalidate_lock held and worker to free memory + * can't make progress due to unavailability of + * mapping->invalidate_lock. So sleep only if fi->dax->nr=0 */ if (retry) continue; @@ -1060,8 +1058,8 @@ alloc_dax_mapping_reclaim(struct fuse_conn_dax *fcd, struct inode *inode) * There are no mappings which can be reclaimed. Wait for one. * We are not holding fi->dax->sem. So it is possible * that range gets added now. But as we are not holding - * fi->i_mmap_sem, worker should still be able to free up - * a range and wake us up. + * mapping->invalidate_lock, worker should still be able to + * free up a range and wake us up. */ if (!fi->dax->nr && !(fcd->nr_free_ranges > 0)) { if (wait_event_killable_exclusive(fcd->range_waitq, @@ -1107,7 +1105,7 @@ static int lookup_and_reclaim_dmap_locked(struct fuse_conn_dax *fcd, /* * Free a range of memory. * Locking: - * 1. Take fi->i_mmap_sem to block dax faults. + * 1. Take mapping->invalidate_lock to block dax faults. * 2. Take fi->dax->sem to protect interval tree and also to make sure * read/write can not reuse a dmap which we might be freeing. */ @@ -1121,7 +1119,7 @@ static int lookup_and_reclaim_dmap(struct fuse_conn_dax *fcd, loff_t dmap_start = start_idx << FUSE_DAX_SHIFT; loff_t dmap_end = (dmap_start + FUSE_DAX_SZ) - 1; - down_write(&fi->i_mmap_sem); + down_write(&inode->i_mapping->invalidate_lock); ret = fuse_dax_break_layouts(inode, dmap_start, dmap_end); if (ret) { pr_debug("virtio_fs: fuse_dax_break_layouts() failed. err=%d\n", @@ -1133,7 +1131,7 @@ static int lookup_and_reclaim_dmap(struct fuse_conn_dax *fcd, ret = lookup_and_reclaim_dmap_locked(fcd, inode, start_idx); up_write(&fi->dax->sem); out_mmap_sem: - up_write(&fi->i_mmap_sem); + up_write(&inode->i_mapping->invalidate_lock); return ret; } diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 1b6c001a7dd1..0ab27170da11 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -1601,6 +1601,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr, struct fuse_mount *fm = get_fuse_mount(inode); struct fuse_conn *fc = fm->fc; struct fuse_inode *fi = get_fuse_inode(inode); + struct address_space *mapping = inode->i_mapping; FUSE_ARGS(args); struct fuse_setattr_in inarg; struct fuse_attr_out outarg; @@ -1625,11 +1626,11 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr, } if (FUSE_IS_DAX(inode) && is_truncate) { - down_write(&fi->i_mmap_sem); + down_write(&mapping->invalidate_lock); fault_blocked = true; err = fuse_dax_break_layouts(inode, 0, 0); if (err) { - up_write(&fi->i_mmap_sem); + up_write(&mapping->invalidate_lock); return err; } } @@ -1739,13 +1740,13 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr, if ((is_truncate || !is_wb) && S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) { truncate_pagecache(inode, outarg.attr.size); - invalidate_inode_pages2(inode->i_mapping); + invalidate_inode_pages2(mapping); } clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state); out: if (fault_blocked) - up_write(&fi->i_mmap_sem); + up_write(&mapping->invalidate_lock); return 0; @@ -1756,7 +1757,7 @@ int fuse_do_setattr(struct dentry *dentry, struct iattr *attr, clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state); if (fault_blocked) - up_write(&fi->i_mmap_sem); + up_write(&mapping->invalidate_lock); return err; } diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 09ef2a4d25ed..22be837169a7 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -243,7 +243,7 @@ int fuse_open_common(struct inode *inode, struct file *file, bool isdir) } if (dax_truncate) { - down_write(&get_fuse_inode(inode)->i_mmap_sem); + down_write(&inode->i_mapping->invalidate_lock); err = fuse_dax_break_layouts(inode, 0, 0); if (err) goto out; @@ -255,7 +255,7 @@ int fuse_open_common(struct inode *inode, struct file *file, bool isdir) out: if (dax_truncate) - up_write(&get_fuse_inode(inode)->i_mmap_sem); + up_write(&inode->i_mapping->invalidate_lock); if (is_wb_truncate | dax_truncate) { fuse_release_nowrite(inode); @@ -2920,7 +2920,7 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset, if (lock_inode) { inode_lock(inode); if (block_faults) { - down_write(&fi->i_mmap_sem); + down_write(&inode->i_mapping->invalidate_lock); err = fuse_dax_break_layouts(inode, 0, 0); if (err) goto out; @@ -2976,7 +2976,7 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset, clear_bit(FUSE_I_SIZE_UNSTABLE, &fi->state); if (block_faults) - up_write(&fi->i_mmap_sem); + up_write(&inode->i_mapping->invalidate_lock); if (lock_inode) inode_unlock(inode); @@ -3045,7 +3045,7 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in, * modifications. Yet this does give less guarantees than if the * copying was performed with write(2). * - * To fix this a i_mmap_sem style lock could be used to prevent new + * To fix this a mapping->invalidate_lock could be used to prevent new * faults while the copy is ongoing. */ err = fuse_writeback_range(inode_out, pos_out, pos_out + len - 1); diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h index 7e463e220053..5130e88f811e 100644 --- a/fs/fuse/fuse_i.h +++ b/fs/fuse/fuse_i.h @@ -149,13 +149,6 @@ struct fuse_inode { /** Lock to protect write related fields */ spinlock_t lock; - /** - * Can't take inode lock in fault path (leads to circular dependency). - * Introduce another semaphore which can be taken in fault path and - * then other filesystem paths can take this to block faults. - */ - struct rw_semaphore i_mmap_sem; - #ifdef CONFIG_FUSE_DAX /* * Dax specific inode data diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 393e36b74dc4..f73bab71a0a0 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -85,7 +85,6 @@ static struct inode *fuse_alloc_inode(struct super_block *sb) fi->orig_ino = 0; fi->state = 0; mutex_init(&fi->mutex); - init_rwsem(&fi->i_mmap_sem); spin_lock_init(&fi->lock); fi->forget = fuse_alloc_forget(); if (!fi->forget) From patchwork Tue May 25 13:50:49 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jan Kara X-Patchwork-Id: 447390 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-18.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, URIBL_BLOCKED, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 43B87C47251 for ; Tue, 25 May 2021 13:51:31 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 26C4A6140F for ; Tue, 25 May 2021 13:51:31 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233657AbhEYNw7 (ORCPT ); Tue, 25 May 2021 09:52:59 -0400 Received: from mx2.suse.de ([195.135.220.15]:42780 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233435AbhEYNwj (ORCPT ); Tue, 25 May 2021 09:52:39 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_rsa; t=1621950662; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=J9i26LHL2rlVmaJ49pwJamYptfR6NjvqlSUaLcXXdxM=; b=fVtuBhGBODJ/q4WtenoZlmsUFVoiJdvJk+GxugxWsyTz4LAtQiQS/Iu1ZvwKRd2g3kvS60 MV9ntfut1APeWOS+c7ZH57UOJzZ+5bXN+RW7JyZGFpt7D+YgH/LqFuYPEPP1t0cLyktyzO lgKL7Jn4+a+bkcNdXw+HL5Y2PWcL3po= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_ed25519; t=1621950662; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=J9i26LHL2rlVmaJ49pwJamYptfR6NjvqlSUaLcXXdxM=; b=6QDtzpbab4vaBOv4VCkDNUMcoimrVwb3/6BsjgWKLsuFNnENOviDrGEbqFZnFhIdR38G1M /U3ruX0XChlnCiAQ== Received: from relay2.suse.de (unknown [195.135.221.27]) by mx2.suse.de (Postfix) with ESMTP id 35461AF1F; Tue, 25 May 2021 13:51:02 +0000 (UTC) Received: by quack2.suse.cz (Postfix, from userid 1000) id 41D691F2CBD; Tue, 25 May 2021 15:51:00 +0200 (CEST) From: Jan Kara To: Cc: Christoph Hellwig , Dave Chinner , ceph-devel@vger.kernel.org, Chao Yu , Damien Le Moal , "Darrick J. Wong" , Jaegeuk Kim , Jeff Layton , Johannes Thumshirn , linux-cifs@vger.kernel.org, , linux-f2fs-devel@lists.sourceforge.net, , , Miklos Szeredi , Steve French , Ted Tso , Matthew Wilcox , Jan Kara Subject: [PATCH 12/13] ceph: Fix race between hole punch and page fault Date: Tue, 25 May 2021 15:50:49 +0200 Message-Id: <20210525135100.11221-12-jack@suse.cz> X-Mailer: git-send-email 2.26.2 In-Reply-To: <20210525125652.20457-1-jack@suse.cz> References: <20210525125652.20457-1-jack@suse.cz> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: ceph-devel@vger.kernel.org Ceph has a following race between hole punching and page fault: CPU1 CPU2 ceph_fallocate() ... ceph_zero_pagecache_range() ceph_filemap_fault() faults in page in the range being punched ceph_zero_objects() And now we have a page in punched range with invalid data. Fix the problem by using mapping->invalidate_lock similarly to other filesystems. Note that using invalidate_lock also fixes a similar race wrt ->readpage(). CC: Jeff Layton CC: ceph-devel@vger.kernel.org Reviewed-by: Jeff Layton Signed-off-by: Jan Kara --- fs/ceph/addr.c | 9 ++++++--- fs/ceph/file.c | 2 ++ 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c index c1570fada3d8..6d868faf97b5 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -1401,9 +1401,11 @@ static vm_fault_t ceph_filemap_fault(struct vm_fault *vmf) ret = VM_FAULT_SIGBUS; } else { struct address_space *mapping = inode->i_mapping; - struct page *page = find_or_create_page(mapping, 0, - mapping_gfp_constraint(mapping, - ~__GFP_FS)); + struct page *page; + + down_read(&mapping->invalidate_lock); + page = find_or_create_page(mapping, 0, + mapping_gfp_constraint(mapping, ~__GFP_FS)); if (!page) { ret = VM_FAULT_OOM; goto out_inline; @@ -1424,6 +1426,7 @@ static vm_fault_t ceph_filemap_fault(struct vm_fault *vmf) vmf->page = page; ret = VM_FAULT_MAJOR | VM_FAULT_LOCKED; out_inline: + up_read(&mapping->invalidate_lock); dout("filemap_fault %p %llu read inline data ret %x\n", inode, off, ret); } diff --git a/fs/ceph/file.c b/fs/ceph/file.c index 77fc037d5beb..91693d8b458e 100644 --- a/fs/ceph/file.c +++ b/fs/ceph/file.c @@ -2083,6 +2083,7 @@ static long ceph_fallocate(struct file *file, int mode, if (ret < 0) goto unlock; + down_write(&inode->i_mapping->invalidate_lock); ceph_zero_pagecache_range(inode, offset, length); ret = ceph_zero_objects(inode, offset, length); @@ -2095,6 +2096,7 @@ static long ceph_fallocate(struct file *file, int mode, if (dirty) __mark_inode_dirty(inode, dirty); } + up_write(&inode->i_mapping->invalidate_lock); ceph_put_cap_refs(ci, got); unlock: