[v20,49/71] ceph: add truncate size handling support for fscrypt

Message ID	20230613052424.254540-50-xiubli@redhat.com
State	New
Headers
Return-Path: <ceph-devel-owner@vger.kernel.org>
From: xiubli@redhat.com
To: idryomov@gmail.com, ceph-devel@vger.kernel.org
Cc: jlayton@kernel.org, vshankar@redhat.com, mchangir@redhat.com, lhenriques@suse.de, Xiubo Li <xiubli@redhat.com>
Subject: [PATCH v20 49/71] ceph: add truncate size handling support for fscrypt
Date: Tue, 13 Jun 2023 13:24:02 +0800
Message-Id: <20230613052424.254540-50-xiubli@redhat.com>
In-Reply-To: <20230613052424.254540-1-xiubli@redhat.com>
References: <20230613052424.254540-1-xiubli@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Precedence: bulk
Series	ceph+fscrypt: full support
[v20,00/71] ceph+fscrypt: full support
[v20,01/71] libceph: add spinlock around osd->o_requests
[v20,02/71] libceph: define struct ceph_sparse_extent and add some helpers
[v20,03/71] libceph: add sparse read support to msgr2 crc state machine
[v20,04/71] libceph: add sparse read support to OSD client
[v20,05/71] libceph: support sparse reads on msgr2 secure codepath
[v20,06/71] libceph: add sparse read support to msgr1
[v20,07/71] ceph: add new mount option to enable sparse reads
[v20,08/71] ceph: preallocate inode for ops that may create one
[v20,09/71] ceph: make ceph_msdc_build_path use ref-walk
[v20,10/71] libceph: add new iov_iter-based ceph_msg_data_type and ceph_osd_data_type
[v20,11/71] ceph: use osd_req_op_extent_osd_iter for netfs reads
[v20,12/71] ceph: fscrypt_auth handling for ceph
[v20,13/71] ceph: ensure that we accept a new context from MDS for new inodes
[v20,14/71] ceph: add support for fscrypt_auth/fscrypt_file to cap messages
[v20,15/71] ceph: implement -o test_dummy_encryption mount option
[v20,16/71] ceph: decode alternate_name in lease info
[v20,17/71] ceph: add fscrypt ioctls
[v20,18/71] ceph: make the ioctl cmd more readable in debug log
[v20,19/71] ceph: add base64 endcoding routines for encrypted names
[v20,20/71] ceph: add encrypted fname handling to ceph_mdsc_build_path
[v20,21/71] ceph: send altname in MClientRequest
[v20,22/71] ceph: encode encrypted name in dentry release
[v20,23/71] ceph: properly set DCACHE_NOKEY_NAME flag in lookup
[v20,24/71] ceph: set DCACHE_NOKEY_NAME in atomic open
[v20,25/71] ceph: make d_revalidate call fscrypt revalidator for encrypted dentries
[v20,26/71] ceph: add helpers for converting names for userland presentation
[v20,27/71] ceph: fix base64 encoded name's length check in ceph_fname_to_usr()
[v20,28/71] ceph: add fscrypt support to ceph_fill_trace
[v20,29/71] ceph: pass the request to parse_reply_info_readdir()
[v20,30/71] ceph: add ceph_encode_encrypted_dname() helper
[v20,31/71] ceph: add support to readdir for encrypted filenames
[v20,32/71] ceph: create symlinks with encrypted and base64-encoded targets
[v20,33/71] ceph: make ceph_get_name decrypt filenames
[v20,34/71] ceph: add a new ceph.fscrypt.auth vxattr
[v20,35/71] ceph: add some fscrypt guardrails
[v20,36/71] ceph: allow encrypting a directory while not having Ax caps
[v20,37/71] ceph: mark directory as non-complete after loading key
[v20,38/71] ceph: don't allow changing layout on encrypted files/directories
[v20,39/71] libceph: add CEPH_OSD_OP_ASSERT_VER support
[v20,40/71] ceph: size handling for encrypted inodes in cap updates
[v20,41/71] ceph: fscrypt_file field handling in MClientRequest messages
[v20,42/71] ceph: get file size from fscrypt_file when present in inode traces
[v20,43/71] ceph: handle fscrypt fields in cap messages from MDS
[v20,44/71] ceph: update WARN_ON message to pr_warn
[v20,45/71] ceph: add __ceph_get_caps helper support
[v20,46/71] ceph: add __ceph_sync_read helper support
[v20,47/71] ceph: add object version support for sync read
[v20,48/71] ceph: add infrastructure for file encryption and decryption
[v20,49/71] ceph: add truncate size handling support for fscrypt
[v20,50/71] libceph: allow ceph_osdc_new_request to accept a multi-op read
[v20,51/71] ceph: disable fallocate for encrypted inodes
[v20,52/71] ceph: disable copy offload on encrypted inodes
[v20,53/71] ceph: don't use special DIO path for encrypted inodes
[v20,54/71] ceph: align data in pages in ceph_sync_write
[v20,55/71] ceph: add read/modify/write to ceph_sync_write
[v20,56/71] ceph: plumb in decryption during sync reads
[v20,57/71] ceph: add fscrypt decryption support to ceph_netfs_issue_op
[v20,58/71] ceph: set i_blkbits to crypto block size for encrypted inodes
[v20,59/71] ceph: add encryption support to writepage
[v20,60/71] ceph: fscrypt support for writepages
[v20,61/71] ceph: invalidate pages when doing direct/sync writes
[v20,62/71] ceph: add support for encrypted snapshot names
[v20,63/71] ceph: add support for handling encrypted snapshot names
[v20,64/71] ceph: update documentation regarding snapshot naming limitations
[v20,65/71] ceph: prevent snapshots to be created in encrypted locked directories
[v20,66/71] ceph: report STATX_ATTR_ENCRYPTED on encrypted inodes
[v20,67/71] ceph: drop the messages from MDS when unmounting
[v20,68/71] ceph: just wait the osd requests' callbacks to finish when unmounting
[v20,69/71] ceph: fix updating the i_truncate_pagecache_size for fscrypt
[v20,70/71] ceph: switch ceph_lookup() to use new fscrypt helper
[v20,71/71] ceph: switch ceph_open_atomic() to use the new fscrypt helper

diff --git a/fs/ceph/crypto.h b/fs/ceph/crypto.h
index 887f191cc423..db6b399645ba 100644
--- a/fs/ceph/crypto.h
+++ b/fs/ceph/crypto.h
@@ -26,6 +26,27 @@ struct ceph_fname {
 	bool		no_copy;
 };
 
+/*
+ * Header for an encrypted file's truncate request. It is sent
+ * to the MDS, which updates the encrypted last block and then
+ * truncates the size.
+ */
+struct ceph_fscrypt_truncate_size_header {
+	__u8  ver;
+	__u8  compat;
+
+	/*
+	 * This is sizeof(assert_ver + file_offset + block_size) when
+	 * the last block is empty because it lies in a file hole;
+	 * otherwise it also includes CEPH_FSCRYPT_BLOCK_SIZE.
+	 */
+	__le32 data_len;
+
+	__le64 change_attr;
+	__le64 file_offset;
+	__le32 block_size;
+} __packed;
+
 struct ceph_fscrypt_auth {
 	__le32	cfa_version;
 	__le32	cfa_blob_len;
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index db54cc44a82f..50664f7b18e3 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -595,6 +595,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
 	ci->i_truncate_seq = 0;
 	ci->i_truncate_size = 0;
 	ci->i_truncate_pending = 0;
+	ci->i_truncate_pagecache_size = 0;
 
 	ci->i_max_size = 0;
 	ci->i_reported_size = 0;
@@ -766,6 +767,10 @@ int ceph_fill_file_size(struct inode *inode, int issued,
 		dout("truncate_size %lld -> %llu\n", ci->i_truncate_size,
 		     truncate_size);
 		ci->i_truncate_size = truncate_size;
+		if (IS_ENCRYPTED(inode))
+			ci->i_truncate_pagecache_size = size;
+		else
+			ci->i_truncate_pagecache_size = truncate_size;
 	}
 	return queue_trunc;
 }
@@ -2140,7 +2145,7 @@ void __ceph_do_pending_vmtruncate(struct inode *inode)
 	/* there should be no reader or writer */
 	WARN_ON_ONCE(ci->i_rd_ref || ci->i_wr_ref);
 
-	to = ci->i_truncate_size;
+	to = ci->i_truncate_pagecache_size;
 	wrbuffer_refs = ci->i_wrbuffer_ref;
 	dout("__do_pending_vmtruncate %p (%d) to %lld\n", inode,
 	     ci->i_truncate_pending, to);
@@ -2150,7 +2155,7 @@ void __ceph_do_pending_vmtruncate(struct inode *inode)
 	truncate_pagecache(inode, to);
 
 	spin_lock(&ci->i_ceph_lock);
-	if (to == ci->i_truncate_size) {
+	if (to == ci->i_truncate_pagecache_size) {
 		ci->i_truncate_pending = 0;
 		finish = 1;
 	}
@@ -2231,6 +2236,142 @@ static const struct inode_operations ceph_encrypted_symlink_iops = {
 	.listxattr = ceph_listxattr,
 };
 
+/*
+ * Transfer the encrypted last block to the MDS, which will
+ * update it when truncating to a smaller size.
+ *
+ * We don't support a PAGE_SIZE that is smaller than the
+ * CEPH_FSCRYPT_BLOCK_SIZE.
+ */
+static int fill_fscrypt_truncate(struct inode *inode,
+				 struct ceph_mds_request *req,
+				 struct iattr *attr)
+{
+	struct ceph_inode_info *ci = ceph_inode(inode);
+	int boff = attr->ia_size % CEPH_FSCRYPT_BLOCK_SIZE;
+	loff_t pos, orig_pos = round_down(attr->ia_size, CEPH_FSCRYPT_BLOCK_SIZE);
+	u64 block = orig_pos >> CEPH_FSCRYPT_BLOCK_SHIFT;
+	struct ceph_pagelist *pagelist = NULL;
+	struct kvec iov = {0};
+	struct iov_iter iter;
+	struct page *page = NULL;
+	struct ceph_fscrypt_truncate_size_header header;
+	int retry_op = 0;
+	int len = CEPH_FSCRYPT_BLOCK_SIZE;
+	loff_t i_size = i_size_read(inode);
+	int got, ret, issued;
+	u64 objver;
+
+	ret = __ceph_get_caps(inode, NULL, CEPH_CAP_FILE_RD, 0, -1, &got);
+	if (ret < 0)
+		return ret;
+
+	issued = __ceph_caps_issued(ci, NULL);
+
+	dout("%s size %lld -> %lld got cap refs on %s, issued %s\n", __func__,
+	     i_size, attr->ia_size, ceph_cap_string(got),
+	     ceph_cap_string(issued));
+
+	/* Write back any dirty page cache covering the last block */
+	if (issued & (CEPH_CAP_FILE_BUFFER)) {
+		loff_t lend = orig_pos + CEPH_FSCRYPT_BLOCK_SIZE - 1;
+		ret = filemap_write_and_wait_range(inode->i_mapping,
+						   orig_pos, lend);
+		if (ret < 0)
+			goto out;
+	}
+
+	page = __page_cache_alloc(GFP_KERNEL);
+	if (page == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	pagelist = ceph_pagelist_alloc(GFP_KERNEL);
+	if (!pagelist) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	iov.iov_base = kmap_local_page(page);
+	iov.iov_len = len;
+	iov_iter_kvec(&iter, READ, &iov, 1, len);
+
+	pos = orig_pos;
+	ret = __ceph_sync_read(inode, &pos, &iter, &retry_op, &objver);
+	if (ret < 0)
+		goto out;
+
+	/* Insert the header first */
+	header.ver = 1;
+	header.compat = 1;
+	header.change_attr = cpu_to_le64(inode_peek_iversion_raw(inode));
+
+	/*
+	 * Always set the block_size to CEPH_FSCRYPT_BLOCK_SIZE,
+	 * because the MDS may need it to do the truncate.
+	 */
+	header.block_size = cpu_to_le32(CEPH_FSCRYPT_BLOCK_SIZE);
+
+	/*
+	 * If we hit a hole here, we just skip filling in the
+	 * fscrypt data for the request. Once fscrypt is
+	 * enabled, the file is split into blocks of
+	 * CEPH_FSCRYPT_BLOCK_SIZE, so if there is a hole,
+	 * its size must be a multiple of the block
+	 * size.
+	 *
+	 * If the RADOS object doesn't exist, objver is set to 0.
+	 */
+	if (!objver) {
+		dout("%s hit hole, ppos %lld < size %lld\n", __func__,
+		     pos, i_size);
+
+		header.data_len = cpu_to_le32(8 + 8 + 4);
+		header.file_offset = 0;
+		ret = 0;
+	} else {
+		header.data_len = cpu_to_le32(8 + 8 + 4 + CEPH_FSCRYPT_BLOCK_SIZE);
+		header.file_offset = cpu_to_le64(orig_pos);
+
+		/* truncate and zero out the extra contents for the last block */
+		memset(iov.iov_base + boff, 0, PAGE_SIZE - boff);
+
+		/* encrypt the last block */
+		ret = ceph_fscrypt_encrypt_block_inplace(inode, page,
+						    CEPH_FSCRYPT_BLOCK_SIZE,
+						    0, block,
+						    GFP_KERNEL);
+		if (ret)
+			goto out;
+	}
+
+	/* Insert the header */
+	ret = ceph_pagelist_append(pagelist, &header, sizeof(header));
+	if (ret)
+		goto out;
+
+	if (header.block_size) {
+		/* Append the last block contents to pagelist */
+		ret = ceph_pagelist_append(pagelist, iov.iov_base,
+					   CEPH_FSCRYPT_BLOCK_SIZE);
+		if (ret)
+			goto out;
+	}
+	req->r_pagelist = pagelist;
+out:
+	dout("%s %p size dropping cap refs on %s\n", __func__,
+	     inode, ceph_cap_string(got));
+	ceph_put_cap_refs(ci, got);
+	if (iov.iov_base)
+		kunmap_local(iov.iov_base);
+	if (page)
+		__free_pages(page, 0);
+	if (ret && pagelist)
+		ceph_pagelist_release(pagelist);
+	return ret;
+}
+
 int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *cia)
 {
 	struct ceph_inode_info *ci = ceph_inode(inode);
@@ -2238,13 +2379,17 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
 	struct ceph_mds_request *req;
 	struct ceph_mds_client *mdsc = ceph_sb_to_client(inode->i_sb)->mdsc;
 	struct ceph_cap_flush *prealloc_cf;
+	loff_t isize = i_size_read(inode);
 	int issued;
 	int release = 0, dirtied = 0;
 	int mask = 0;
 	int err = 0;
 	int inode_dirty_flags = 0;
 	bool lock_snap_rwsem = false;
+	bool fill_fscrypt;
+	int truncate_retry = 20; /* The RMW will take around 50ms */
 
+retry:
 	prealloc_cf = ceph_alloc_cap_flush();
 	if (!prealloc_cf)
 		return -ENOMEM;
@@ -2256,6 +2401,7 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
 		return PTR_ERR(req);
 	}
 
+	fill_fscrypt = false;
 	spin_lock(&ci->i_ceph_lock);
 	issued = __ceph_caps_issued(ci, NULL);
 
@@ -2377,10 +2523,27 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
 		}
 	}
 	if (ia_valid & ATTR_SIZE) {
-		loff_t isize = i_size_read(inode);
-
 		dout("setattr %p size %lld -> %lld\n", inode, isize, attr->ia_size);
-		if ((issued & CEPH_CAP_FILE_EXCL) && attr->ia_size >= isize) {
+		/*
+		 * The RMW is needed only when the new size is smaller and
+		 * not aligned to CEPH_FSCRYPT_BLOCK_SIZE.
+		 */
+		if (IS_ENCRYPTED(inode) && attr->ia_size < isize &&
+		    (attr->ia_size % CEPH_FSCRYPT_BLOCK_SIZE)) {
+			mask |= CEPH_SETATTR_SIZE;
+			release |= CEPH_CAP_FILE_SHARED | CEPH_CAP_FILE_EXCL |
+				   CEPH_CAP_FILE_RD | CEPH_CAP_FILE_WR;
+			set_bit(CEPH_MDS_R_FSCRYPT_FILE, &req->r_req_flags);
+			mask |= CEPH_SETATTR_FSCRYPT_FILE;
+			req->r_args.setattr.size =
+				cpu_to_le64(round_up(attr->ia_size,
+						     CEPH_FSCRYPT_BLOCK_SIZE));
+			req->r_args.setattr.old_size =
+				cpu_to_le64(round_up(isize,
+						     CEPH_FSCRYPT_BLOCK_SIZE));
+			req->r_fscrypt_file = attr->ia_size;
+			fill_fscrypt = true;
+		} else if ((issued & CEPH_CAP_FILE_EXCL) && attr->ia_size >= isize) {
 			if (attr->ia_size > isize) {
 				i_size_write(inode, attr->ia_size);
 				inode->i_blocks = calc_inode_blocks(attr->ia_size);
@@ -2403,7 +2566,6 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
 				cpu_to_le64(round_up(isize,
 						     CEPH_FSCRYPT_BLOCK_SIZE));
 			req->r_fscrypt_file = attr->ia_size;
-			/* FIXME: client must zero out any partial blocks! */
 		} else {
 			req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
 			req->r_args.setattr.old_size = cpu_to_le64(isize);
@@ -2470,8 +2632,10 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
 
 	release &= issued;
 	spin_unlock(&ci->i_ceph_lock);
-	if (lock_snap_rwsem)
+	if (lock_snap_rwsem) {
 		up_read(&mdsc->snap_rwsem);
+		lock_snap_rwsem = false;
+	}
 
 	if (inode_dirty_flags)
 		__mark_inode_dirty(inode, inode_dirty_flags);
@@ -2483,7 +2647,27 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
 		req->r_args.setattr.mask = cpu_to_le32(mask);
 		req->r_num_caps = 1;
 		req->r_stamp = attr->ia_ctime;
+		if (fill_fscrypt) {
+			err = fill_fscrypt_truncate(inode, req, attr);
+			if (err)
+				goto out;
+		}
+
+		/*
+		 * The truncate request will return -EAGAIN when the
+		 * last block has been updated just before the MDS
+		 * successfully gets the xlock for the FILE lock. To
+		 * avoid corrupting the file contents we need to retry
+		 * it.
+		 */
 		err = ceph_mdsc_do_request(mdsc, NULL, req);
+		if (err == -EAGAIN && truncate_retry--) {
+			dout("setattr %p result=%d (%s locally, %d remote), retry it!\n",
+			     inode, err, ceph_cap_string(dirtied), mask);
+			ceph_mdsc_put_request(req);
+			ceph_free_cap_flush(prealloc_cf);
+			goto retry;
+		}
 	}
 out:
 	dout("setattr %p result=%d (%s locally, %d remote)\n", inode, err,
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 3addefa9575b..47d86068b92a 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -426,6 +426,11 @@ struct ceph_inode_info {
 	u32 i_truncate_seq;        /* last truncate to smaller size */
 	u64 i_truncate_size;       /*  and the size we last truncated down to */
 	int i_truncate_pending;    /*  still need to call vmtruncate */
+	/*
+	 * For the non-fscrypt case this equals i_truncate_size;
+	 * otherwise it equals fscrypt_file_size.
+	 */
+	u64 i_truncate_pagecache_size;
 	u64 i_max_size;            /* max file size authorized by mds */
 	u64 i_reported_size; /* (max_)size reported to or requested of mds */
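
The packed truncate header added to crypto.h is easy to misread, so here is a small userspace sketch (not kernel code) that mirrors its layout and the data_len values fill_fscrypt_truncate() computes. The names trunc_hdr, trunc_data_len(), put_le() and encode_trunc_hdr() are hypothetical, and the block size is assumed to be 4096 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace mirror of struct ceph_fscrypt_truncate_size_header;
 * fixed-width types stand in for the kernel's __u8/__le32/__le64. */
struct trunc_hdr {
	uint8_t  ver;
	uint8_t  compat;
	uint32_t data_len;    /* length of the payload that follows this field */
	uint64_t change_attr;
	uint64_t file_offset;
	uint32_t block_size;
} __attribute__((packed));

/* packing removes all padding: 1 + 1 + 4 + 8 + 8 + 4 = 26 bytes */
static_assert(sizeof(struct trunc_hdr) == 26, "unexpected header size");
static_assert(offsetof(struct trunc_hdr, change_attr) == 6, "unexpected offset");

#define FSCRYPT_BLOCK_SIZE 4096u /* assumed value of CEPH_FSCRYPT_BLOCK_SIZE */

/* data_len as set by the patch: change_attr + file_offset + block_size
 * (8 + 8 + 4), plus one encrypted block unless the last block is a hole. */
static uint32_t trunc_data_len(int has_block)
{
	uint32_t len = sizeof(uint64_t) + sizeof(uint64_t) + sizeof(uint32_t);

	return has_block ? len + FSCRYPT_BLOCK_SIZE : len;
}

/* Store the low n bytes of v in little-endian order, which is what
 * cpu_to_le32()/cpu_to_le64() guarantee regardless of host endianness. */
static void put_le(uint8_t *buf, uint64_t v, size_t n)
{
	for (size_t i = 0; i < n; i++)
		buf[i] = (uint8_t)(v >> (8 * i));
}

/* Serialize the 26-byte header field by field at its packed offsets;
 * like the patch, file_offset is 0 when the last block is a hole. */
static void encode_trunc_hdr(uint8_t out[26], uint64_t change_attr,
			     uint64_t file_offset, int has_block)
{
	out[0] = 1;                                    /* ver */
	out[1] = 1;                                    /* compat */
	put_le(out + 2, trunc_data_len(has_block), 4); /* data_len */
	put_le(out + 6, change_attr, 8);
	put_le(out + 14, has_block ? file_offset : 0, 8);
	put_le(out + 22, FSCRYPT_BLOCK_SIZE, 4);       /* block_size */
}
```

Under that assumption, data_len is 20 when the last block lies in a hole and 4116 when the encrypted last block follows the header.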

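The offsets used by the read-modify-write path can be checked with plain arithmetic. This hypothetical plan_truncate() helper mirrors how the patch derives boff, orig_pos, block, a writeback range covering the whole last block, and the block-rounded size/old_size sent to the MDS for an encrypted inode. It assumes CEPH_FSCRYPT_BLOCK_SHIFT is 12 (4096-byte blocks) and models only the encrypted case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FSCRYPT_BLOCK_SHIFT 12u /* assumed value of CEPH_FSCRYPT_BLOCK_SHIFT */
#define FSCRYPT_BLOCK_SIZE  (1u << FSCRYPT_BLOCK_SHIFT)

struct trunc_plan {
	bool     rmw;          /* last block must be read, zeroed, re-encrypted */
	uint32_t boff;         /* offset of the new EOF inside the last block */
	uint64_t block_start;  /* orig_pos: start of the last block */
	uint64_t block_index;  /* block number handed to the encryptor */
	uint64_t wb_end;       /* inclusive end of the writeback range */
	uint64_t req_size;     /* setattr.size sent to the MDS */
	uint64_t req_old_size; /* setattr.old_size sent to the MDS */
};

/* round_up() to the crypto block size (power of two) */
static uint64_t round_up_blk(uint64_t x)
{
	return (x + FSCRYPT_BLOCK_SIZE - 1) & ~(uint64_t)(FSCRYPT_BLOCK_SIZE - 1);
}

static struct trunc_plan plan_truncate(uint64_t new_size, uint64_t old_size)
{
	struct trunc_plan p;

	p.boff = new_size % FSCRYPT_BLOCK_SIZE;
	p.block_start = new_size - p.boff;             /* round_down() */
	p.block_index = p.block_start >> FSCRYPT_BLOCK_SHIFT;
	p.wb_end = p.block_start + FSCRYPT_BLOCK_SIZE - 1;
	/* RMW only for a shrinking truncate that is not block-aligned */
	p.rmw = new_size < old_size && p.boff != 0;
	p.req_size = round_up_blk(new_size);
	p.req_old_size = round_up_blk(old_size);
	return p;
}
```

For example, shrinking a 20000-byte file to 10000 bytes gives boff 1808 in block 2 (starting at offset 8192), so the MDS is sent size 12288 and old_size 20480. Truncating to 8192 is block-aligned and needs no RMW, and growing the file never does.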

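The -EAGAIN retry in __ceph_setattr() relies on a post-decrement, which makes its bound easy to misjudge. This sketch substitutes a hypothetical fake_do_request() for ceph_mdsc_do_request() and keeps the same `err == -EAGAIN && truncate_retry--` test. It omits the request and cap-flush rebuilding that the real function performs on each pass.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for ceph_mdsc_do_request(): returns -EAGAIN
 * for the first *fail_count calls, then 0. */
static int fake_do_request(int *fail_count)
{
	if (*fail_count > 0) {
		(*fail_count)--;
		return -EAGAIN;
	}
	return 0;
}

/* Same loop shape as the patch; *attempts counts requests sent. */
static int setattr_with_retry(int *fail_count, int *attempts)
{
	int truncate_retry = 20;
	int err;

	*attempts = 0;
retry:
	/* the real code reallocates the cap flush and MDS request here */
	(*attempts)++;
	err = fake_do_request(fail_count);
	/* tested before decrementing: 20, 19, ..., 1 allow a retry; 0 stops */
	if (err == -EAGAIN && truncate_retry--)
		goto retry;
	return err;
}
```

Because the counter is tested before it is decremented, a starting value of 20 permits 20 retries, so at most 21 requests are sent before -EAGAIN is returned to the caller.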