mbox series

[RFC,v10,00/48] ceph+fscrypt: full support

Message ID 20220111191608.88762-1-jlayton@kernel.org
Headers show
Series ceph+fscrypt: full support | expand

Message

Jeff Layton Jan. 11, 2022, 7:15 p.m. UTC
This patchset represents a (mostly) complete rough draft of fscrypt
support for cephfs. The context, filename and symlink support is more or
less the same as the versions posted before, and comprise the first half
of the patches.

The new bits here are the size handling changes and support for content
encryption, in buffered, direct and synchronous codepaths. Much of this
code is still very rough and needs a lot of cleanup work.

fscrypt support relies on some MDS changes that are being tracked here:

    https://github.com/ceph/ceph/pull/43588

In particular, this PR adds some new opaque fields in the inode that we
use to store fscrypt-specific information, like the context and the real
size of a file. That is slated to be merged for the upcoming Quincy
release (which is sometime this northern spring).

There are still some notable bugs:

1/ we've identified a few more potential races in truncate handling
which will probably necessitate a protocol change, as well as changes to
the MDS and kclient patchsets. The good news is that we think we have
an approach that will resolve this.

2/ the kclient doesn't handle reading sparse regions in OSD objects
properly yet. The client can end up writing to a non-zero offset in a
non-existent object. Then, if the client tries to read the written
region back later, it'll get back zeroes and give you garbage when you
try to decrypt them.

It turns out that the OSD already supports a SPARSE_READ operation, so
I'm working on implementing that in the kclient to make it not try to
decrypt the sparse regions.

Still, I was able to run xfstests on this set yesterday. Bug #2 above
prevented all of the tests from passing, but it didn't oops! I call that
progress! Given that, I figured this is a good time to post what I have
so far.

Note that the buffered I/O changes in this set are not suitable for
merge and will likely end up being discarded. We need to plumb the
encryption in at the netfs layer, so that we can store encrypted data
in fscache.

The non-buffered codepaths will likely also need substantial changes
before merging. It may be simpler to just move that into the netfs layer
too as cifs will need something similar anyway.

My goal is to get most of this into v5.18, but v5.19 might be more
realistiv. Hopefully I'll have a non-RFC patchset to send in a few
weeks.

Special thanks to Xiubo who came through with the MDS patches. Also,
thanks to everyone (especially Eric Biggers) for all of the previous
reviews. It's much appreciated!

Jeff Layton (43):
  vfs: export new_inode_pseudo
  fscrypt: export fscrypt_base64url_encode and fscrypt_base64url_decode
  fscrypt: export fscrypt_fname_encrypt and fscrypt_fname_encrypted_size
  fscrypt: add fscrypt_context_for_new_inode
  ceph: preallocate inode for ops that may create one
  ceph: crypto context handling for ceph
  ceph: parse new fscrypt_auth and fscrypt_file fields in inode traces
  ceph: add fscrypt_* handling to caps.c
  ceph: add ability to set fscrypt_auth via setattr
  ceph: implement -o test_dummy_encryption mount option
  ceph: decode alternate_name in lease info
  ceph: add fscrypt ioctls
  ceph: make ceph_msdc_build_path use ref-walk
  ceph: add encrypted fname handling to ceph_mdsc_build_path
  ceph: send altname in MClientRequest
  ceph: encode encrypted name in dentry release
  ceph: properly set DCACHE_NOKEY_NAME flag in lookup
  ceph: make d_revalidate call fscrypt revalidator for encrypted
    dentries
  ceph: add helpers for converting names for userland presentation
  ceph: add fscrypt support to ceph_fill_trace
  ceph: add support to readdir for encrypted filenames
  ceph: create symlinks with encrypted and base64-encoded targets
  ceph: make ceph_get_name decrypt filenames
  ceph: add a new ceph.fscrypt.auth vxattr
  ceph: add some fscrypt guardrails
  libceph: add CEPH_OSD_OP_ASSERT_VER support
  ceph: size handling for encrypted inodes in cap updates
  ceph: fscrypt_file field handling in MClientRequest messages
  ceph: get file size from fscrypt_file when present in inode traces
  ceph: handle fscrypt fields in cap messages from MDS
  ceph: add infrastructure for file encryption and decryption
  libceph: allow ceph_osdc_new_request to accept a multi-op read
  ceph: disable fallocate for encrypted inodes
  ceph: disable copy offload on encrypted inodes
  ceph: don't use special DIO path for encrypted inodes
  ceph: set encryption context on open
  ceph: align data in pages in ceph_sync_write
  ceph: add read/modify/write to ceph_sync_write
  ceph: plumb in decryption during sync reads
  ceph: set i_blkbits to crypto block size for encrypted inodes
  ceph: add fscrypt decryption support to ceph_netfs_issue_op
  ceph: add encryption support to writepage
  ceph: fscrypt support for writepages

Luis Henriques (1):
  ceph: don't allow changing layout on encrypted files/directories

Xiubo Li (4):
  ceph: add __ceph_get_caps helper support
  ceph: add __ceph_sync_read helper support
  ceph: add object version support for sync read
  ceph: add truncate size handling support for fscrypt

 fs/ceph/Makefile                |   1 +
 fs/ceph/acl.c                   |   4 +-
 fs/ceph/addr.c                  | 128 +++++--
 fs/ceph/caps.c                  | 211 ++++++++++--
 fs/ceph/crypto.c                | 374 +++++++++++++++++++++
 fs/ceph/crypto.h                | 237 +++++++++++++
 fs/ceph/dir.c                   | 209 +++++++++---
 fs/ceph/export.c                |  44 ++-
 fs/ceph/file.c                  | 476 +++++++++++++++++++++-----
 fs/ceph/inode.c                 | 576 +++++++++++++++++++++++++++++---
 fs/ceph/ioctl.c                 |  87 +++++
 fs/ceph/mds_client.c            | 349 ++++++++++++++++---
 fs/ceph/mds_client.h            |  24 +-
 fs/ceph/super.c                 |  90 ++++-
 fs/ceph/super.h                 |  43 ++-
 fs/ceph/xattr.c                 |  29 ++
 fs/crypto/fname.c               |  44 ++-
 fs/crypto/fscrypt_private.h     |   9 +-
 fs/crypto/hooks.c               |   6 +-
 fs/crypto/policy.c              |  35 +-
 fs/inode.c                      |   1 +
 include/linux/ceph/ceph_fs.h    |  21 +-
 include/linux/ceph/osd_client.h |   6 +-
 include/linux/ceph/rados.h      |   4 +
 include/linux/fscrypt.h         |  10 +
 net/ceph/osd_client.c           |  32 +-
 26 files changed, 2700 insertions(+), 350 deletions(-)
 create mode 100644 fs/ceph/crypto.c
 create mode 100644 fs/ceph/crypto.h

Comments

Jeff Layton Jan. 11, 2022, 7:26 p.m. UTC | #1
On Tue, 2022-01-11 at 14:15 -0500, Jeff Layton wrote:
> This patchset represents a (mostly) complete rough draft of fscrypt
> support for cephfs. The context, filename and symlink support is more or
> less the same as the versions posted before, and comprise the first half
> of the patches.
> 
> The new bits here are the size handling changes and support for content
> encryption, in buffered, direct and synchronous codepaths. Much of this
> code is still very rough and needs a lot of cleanup work.
> 
> fscrypt support relies on some MDS changes that are being tracked here:
> 
>     https://github.com/ceph/ceph/pull/43588
> 
> In particular, this PR adds some new opaque fields in the inode that we
> use to store fscrypt-specific information, like the context and the real
> size of a file. That is slated to be merged for the upcoming Quincy
> release (which is sometime this northern spring).
> 
> There are still some notable bugs:
> 
> 1/ we've identified a few more potential races in truncate handling
> which will probably necessitate a protocol change, as well as changes to
> the MDS and kclient patchsets. The good news is that we think we have
> an approach that will resolve this.
> 
> 2/ the kclient doesn't handle reading sparse regions in OSD objects
> properly yet. The client can end up writing to a non-zero offset in a
> non-existent object. Then, if the client tries to read the written
> region back later, it'll get back zeroes and give you garbage when you
> try to decrypt them.
> 
> It turns out that the OSD already supports a SPARSE_READ operation, so
> I'm working on implementing that in the kclient to make it not try to
> decrypt the sparse regions.
> 
> Still, I was able to run xfstests on this set yesterday. Bug #2 above
> prevented all of the tests from passing, but it didn't oops! I call that
> progress! Given that, I figured this is a good time to post what I have
> so far.
> 
> Note that the buffered I/O changes in this set are not suitable for
> merge and will likely end up being discarded. We need to plumb the
> encryption in at the netfs layer, so that we can store encrypted data
> in fscache.
> 
> The non-buffered codepaths will likely also need substantial changes
> before merging. It may be simpler to just move that into the netfs layer
> too as cifs will need something similar anyway.
> 
> My goal is to get most of this into v5.18, but v5.19 might be more
> realistiv. Hopefully I'll have a non-RFC patchset to send in a few
> weeks.
> 
> Special thanks to Xiubo who came through with the MDS patches. Also,
> thanks to everyone (especially Eric Biggers) for all of the previous
> reviews. It's much appreciated!
> 
> Jeff Layton (43):
>   vfs: export new_inode_pseudo
>   fscrypt: export fscrypt_base64url_encode and fscrypt_base64url_decode
>   fscrypt: export fscrypt_fname_encrypt and fscrypt_fname_encrypted_size
>   fscrypt: add fscrypt_context_for_new_inode
>   ceph: preallocate inode for ops that may create one
>   ceph: crypto context handling for ceph
>   ceph: parse new fscrypt_auth and fscrypt_file fields in inode traces
>   ceph: add fscrypt_* handling to caps.c
>   ceph: add ability to set fscrypt_auth via setattr
>   ceph: implement -o test_dummy_encryption mount option
>   ceph: decode alternate_name in lease info
>   ceph: add fscrypt ioctls
>   ceph: make ceph_msdc_build_path use ref-walk
>   ceph: add encrypted fname handling to ceph_mdsc_build_path
>   ceph: send altname in MClientRequest
>   ceph: encode encrypted name in dentry release
>   ceph: properly set DCACHE_NOKEY_NAME flag in lookup
>   ceph: make d_revalidate call fscrypt revalidator for encrypted
>     dentries
>   ceph: add helpers for converting names for userland presentation
>   ceph: add fscrypt support to ceph_fill_trace
>   ceph: add support to readdir for encrypted filenames
>   ceph: create symlinks with encrypted and base64-encoded targets
>   ceph: make ceph_get_name decrypt filenames
>   ceph: add a new ceph.fscrypt.auth vxattr
>   ceph: add some fscrypt guardrails
>   libceph: add CEPH_OSD_OP_ASSERT_VER support
>   ceph: size handling for encrypted inodes in cap updates
>   ceph: fscrypt_file field handling in MClientRequest messages
>   ceph: get file size from fscrypt_file when present in inode traces
>   ceph: handle fscrypt fields in cap messages from MDS
>   ceph: add infrastructure for file encryption and decryption
>   libceph: allow ceph_osdc_new_request to accept a multi-op read
>   ceph: disable fallocate for encrypted inodes
>   ceph: disable copy offload on encrypted inodes
>   ceph: don't use special DIO path for encrypted inodes
>   ceph: set encryption context on open
>   ceph: align data in pages in ceph_sync_write
>   ceph: add read/modify/write to ceph_sync_write
>   ceph: plumb in decryption during sync reads
>   ceph: set i_blkbits to crypto block size for encrypted inodes
>   ceph: add fscrypt decryption support to ceph_netfs_issue_op
>   ceph: add encryption support to writepage
>   ceph: fscrypt support for writepages
> 
> Luis Henriques (1):
>   ceph: don't allow changing layout on encrypted files/directories
> 
> Xiubo Li (4):
>   ceph: add __ceph_get_caps helper support
>   ceph: add __ceph_sync_read helper support
>   ceph: add object version support for sync read
>   ceph: add truncate size handling support for fscrypt
> 
>  fs/ceph/Makefile                |   1 +
>  fs/ceph/acl.c                   |   4 +-
>  fs/ceph/addr.c                  | 128 +++++--
>  fs/ceph/caps.c                  | 211 ++++++++++--
>  fs/ceph/crypto.c                | 374 +++++++++++++++++++++
>  fs/ceph/crypto.h                | 237 +++++++++++++
>  fs/ceph/dir.c                   | 209 +++++++++---
>  fs/ceph/export.c                |  44 ++-
>  fs/ceph/file.c                  | 476 +++++++++++++++++++++-----
>  fs/ceph/inode.c                 | 576 +++++++++++++++++++++++++++++---
>  fs/ceph/ioctl.c                 |  87 +++++
>  fs/ceph/mds_client.c            | 349 ++++++++++++++++---
>  fs/ceph/mds_client.h            |  24 +-
>  fs/ceph/super.c                 |  90 ++++-
>  fs/ceph/super.h                 |  43 ++-
>  fs/ceph/xattr.c                 |  29 ++
>  fs/crypto/fname.c               |  44 ++-
>  fs/crypto/fscrypt_private.h     |   9 +-
>  fs/crypto/hooks.c               |   6 +-
>  fs/crypto/policy.c              |  35 +-
>  fs/inode.c                      |   1 +
>  include/linux/ceph/ceph_fs.h    |  21 +-
>  include/linux/ceph/osd_client.h |   6 +-
>  include/linux/ceph/rados.h      |   4 +
>  include/linux/fscrypt.h         |  10 +
>  net/ceph/osd_client.c           |  32 +-
>  26 files changed, 2700 insertions(+), 350 deletions(-)
>  create mode 100644 fs/ceph/crypto.c
>  create mode 100644 fs/ceph/crypto.h
> 

I should also mention that I've pushed this series into a new
wip-fscrypt branch in the ceph-client tree for anyone that wants to
check it out.

    https://github.com/ceph/ceph-client/commits/wip-fscrypt

I can't recommend this for general use yet until the data corruption
bugs are fixed, of course.
Xiubo Li Jan. 12, 2022, 8:41 a.m. UTC | #2
Hi Jeff,

I have post the V8 for this patch by switching the 'header.objver' to 
'header.change_attr' to gate the truncate operation to fix the first 
notable bug you mentioned in the cover-letter.

Regards

-- Xiubo


On 1/12/22 3:15 AM, Jeff Layton wrote:
> From: Xiubo Li <xiubli@redhat.com>
>
> This will transfer the encrypted last block contents to the MDS
> along with the truncate request only when the new size is smaller
> and not aligned to the fscrypt BLOCK size. When the last block is
> located in the file hole, the truncate request will only contain
> the header.
>
> The MDS could fail to do the truncate if there has another client
> or process has already updated the RADOS object which contains
> the last block, and will return -EAGAIN, then the kclient needs
> to retry it. The RMW will take around 50ms, and will let it retry
> 20 times for now.
>
> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>   fs/ceph/crypto.h |  21 +++++
>   fs/ceph/inode.c  | 217 ++++++++++++++++++++++++++++++++++++++++++++---
>   fs/ceph/super.h  |   5 ++
>   3 files changed, 229 insertions(+), 14 deletions(-)
>
> diff --git a/fs/ceph/crypto.h b/fs/ceph/crypto.h
> index b5d360085fe8..3b7efffecbeb 100644
> --- a/fs/ceph/crypto.h
> +++ b/fs/ceph/crypto.h
> @@ -25,6 +25,27 @@ struct ceph_fname {
>   	u32		ctext_len;	// length of crypttext
>   };
>   
> +/*
> + * Header for the crypted file when truncating the size, this
> + * will be sent to MDS, and the MDS will update the encrypted
> + * last block and then truncate the size.
> + */
> +struct ceph_fscrypt_truncate_size_header {
> +       __u8  ver;
> +       __u8  compat;
> +
> +       /*
> +	* It will be sizeof(assert_ver + file_offset + block_size)
> +	* if the last block is empty when it's located in a file
> +	* hole. Or the data_len will plus CEPH_FSCRYPT_BLOCK_SIZE.
> +	*/
> +       __le32 data_len;
> +
> +       __le64 assert_ver;
> +       __le64 file_offset;
> +       __le32 block_size;
> +} __packed;
> +
>   struct ceph_fscrypt_auth {
>   	__le32	cfa_version;
>   	__le32	cfa_blob_len;
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index 2497306eef58..eecda0a73908 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -586,6 +586,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
>   	ci->i_truncate_seq = 0;
>   	ci->i_truncate_size = 0;
>   	ci->i_truncate_pending = 0;
> +	ci->i_truncate_pagecache_size = 0;
>   
>   	ci->i_max_size = 0;
>   	ci->i_reported_size = 0;
> @@ -759,6 +760,10 @@ int ceph_fill_file_size(struct inode *inode, int issued,
>   		dout("truncate_size %lld -> %llu\n", ci->i_truncate_size,
>   		     truncate_size);
>   		ci->i_truncate_size = truncate_size;
> +		if (IS_ENCRYPTED(inode))
> +			ci->i_truncate_pagecache_size = size;
> +		else
> +			ci->i_truncate_pagecache_size = truncate_size;
>   	}
>   	return queue_trunc;
>   }
> @@ -1015,7 +1020,7 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
>   
>   	if (new_version ||
>   	    (new_issued & (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR))) {
> -		u64 size = info->size;
> +		u64 size = le64_to_cpu(info->size);
>   		s64 old_pool = ci->i_layout.pool_id;
>   		struct ceph_string *old_ns;
>   
> @@ -1030,16 +1035,20 @@ int ceph_fill_inode(struct inode *inode, struct page *locked_page,
>   		pool_ns = old_ns;
>   
>   		if (IS_ENCRYPTED(inode) && size &&
> -		    (iinfo->fscrypt_file_len == sizeof(__le64))) {
> -			size = __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file);
> -			if (info->size != round_up(size, CEPH_FSCRYPT_BLOCK_SIZE))
> -				pr_warn("size=%llu fscrypt_file=%llu\n", info->size, size);
> +		    (iinfo->fscrypt_file_len >= sizeof(__le64))) {
> +			u64 fsize = __le64_to_cpu(*(__le64 *)iinfo->fscrypt_file);
> +			if (fsize) {
> +				size = fsize;
> +				if (le64_to_cpu(info->size) !=
> +				    round_up(size, CEPH_FSCRYPT_BLOCK_SIZE))
> +					pr_warn("size=%llu fscrypt_file=%llu\n",
> +						info->size, size);
> +			}
>   		}
>   
>   		queue_trunc = ceph_fill_file_size(inode, issued,
>   					le32_to_cpu(info->truncate_seq),
> -					le64_to_cpu(info->truncate_size),
> -					le64_to_cpu(size));
> +					le64_to_cpu(info->truncate_size), size);
>   		/* only update max_size on auth cap */
>   		if ((info->cap.flags & CEPH_CAP_FLAG_AUTH) &&
>   		    ci->i_max_size != le64_to_cpu(info->max_size)) {
> @@ -2153,7 +2162,7 @@ void __ceph_do_pending_vmtruncate(struct inode *inode)
>   	/* there should be no reader or writer */
>   	WARN_ON_ONCE(ci->i_rd_ref || ci->i_wr_ref);
>   
> -	to = ci->i_truncate_size;
> +	to = ci->i_truncate_pagecache_size;
>   	wrbuffer_refs = ci->i_wrbuffer_ref;
>   	dout("__do_pending_vmtruncate %p (%d) to %lld\n", inode,
>   	     ci->i_truncate_pending, to);
> @@ -2163,7 +2172,7 @@ void __ceph_do_pending_vmtruncate(struct inode *inode)
>   	truncate_pagecache(inode, to);
>   
>   	spin_lock(&ci->i_ceph_lock);
> -	if (to == ci->i_truncate_size) {
> +	if (to == ci->i_truncate_pagecache_size) {
>   		ci->i_truncate_pending = 0;
>   		finish = 1;
>   	}
> @@ -2244,6 +2253,143 @@ static const struct inode_operations ceph_encrypted_symlink_iops = {
>   	.listxattr = ceph_listxattr,
>   };
>   
> +/*
> + * Transfer the encrypted last block to the MDS and the MDS
> + * will help update it when truncating a smaller size.
> + *
> + * We don't support a PAGE_SIZE that is smaller than the
> + * CEPH_FSCRYPT_BLOCK_SIZE.
> + */
> +static int fill_fscrypt_truncate(struct inode *inode,
> +				 struct ceph_mds_request *req,
> +				 struct iattr *attr)
> +{
> +	struct ceph_inode_info *ci = ceph_inode(inode);
> +	int boff = attr->ia_size % CEPH_FSCRYPT_BLOCK_SIZE;
> +	loff_t pos, orig_pos = round_down(attr->ia_size, CEPH_FSCRYPT_BLOCK_SIZE);
> +	u64 block = orig_pos >> CEPH_FSCRYPT_BLOCK_SHIFT;
> +	struct ceph_pagelist *pagelist = NULL;
> +	struct kvec iov;
> +	struct iov_iter iter;
> +	struct page *page = NULL;
> +	struct ceph_fscrypt_truncate_size_header header;
> +	int retry_op = 0;
> +	int len = CEPH_FSCRYPT_BLOCK_SIZE;
> +	loff_t i_size = i_size_read(inode);
> +	int got, ret, issued;
> +	u64 objver;
> +
> +	ret = __ceph_get_caps(inode, NULL, CEPH_CAP_FILE_RD, 0, -1, &got);
> +	if (ret < 0)
> +		return ret;
> +
> +	issued = __ceph_caps_issued(ci, NULL);
> +
> +	dout("%s size %lld -> %lld got cap refs on %s, issued %s\n", __func__,
> +	     i_size, attr->ia_size, ceph_cap_string(got),
> +	     ceph_cap_string(issued));
> +
> +	/* Try to writeback the dirty pagecaches */
> +	if (issued & (CEPH_CAP_FILE_BUFFER))
> +		filemap_write_and_wait(inode->i_mapping);
> +
> +	page = __page_cache_alloc(GFP_KERNEL);
> +	if (page == NULL) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	pagelist = ceph_pagelist_alloc(GFP_KERNEL);
> +	if (!pagelist) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	iov.iov_base = kmap_local_page(page);
> +	iov.iov_len = len;
> +	iov_iter_kvec(&iter, READ, &iov, 1, len);
> +
> +	pos = orig_pos;
> +	ret = __ceph_sync_read(inode, &pos, &iter, &retry_op, &objver);
> +	ceph_put_cap_refs(ci, got);
> +	if (ret < 0)
> +		goto out;
> +
> +	/* Insert the header first */
> +	header.ver = 1;
> +	header.compat = 1;
> +
> +	/*
> +	 * Always set the block_size to CEPH_FSCRYPT_BLOCK_SIZE,
> +	 * because in MDS it may need this to do the truncate.
> +	 */
> +	header.block_size = cpu_to_le32(CEPH_FSCRYPT_BLOCK_SIZE);
> +
> +	/*
> +	 * If we hit a hole here, we should just skip filling
> +	 * the fscrypt for the request, because once the fscrypt
> +	 * is enabled, the file will be split into many blocks
> +	 * with the size of CEPH_FSCRYPT_BLOCK_SIZE, if there
> +	 * has a hole, the hole size should be multiple of block
> +	 * size.
> +	 *
> +	 * If the Rados object doesn't exist, it will be set 0.
> +	 */
> +	if (!objver) {
> +		dout("%s hit hole, ppos %lld < size %lld\n", __func__,
> +		     pos, i_size);
> +
> +		header.data_len = cpu_to_le32(8 + 8 + 4);
> +
> +		/*
> +		 * If the "assert_ver" is 0 means hitting a hole, and
> +		 * the MDS will use the it to check whether hitting a
> +		 * hole or not.
> +		 */
> +		header.assert_ver = 0;
> +		header.file_offset = 0;
> +		ret = 0;
> +	} else {
> +		header.data_len = cpu_to_le32(8 + 8 + 4 + CEPH_FSCRYPT_BLOCK_SIZE);
> +		header.assert_ver = cpu_to_le64(objver);
> +		header.file_offset = cpu_to_le64(orig_pos);
> +
> +		/* truncate and zero out the extra contents for the last block */
> +		memset(iov.iov_base + boff, 0, PAGE_SIZE - boff);
> +
> +		/* encrypt the last block */
> +		ret = ceph_fscrypt_encrypt_block_inplace(inode, page,
> +						    CEPH_FSCRYPT_BLOCK_SIZE,
> +						    0, block,
> +						    GFP_KERNEL);
> +		if (ret)
> +			goto out;
> +	}
> +
> +	/* Insert the header */
> +	ret = ceph_pagelist_append(pagelist, &header, sizeof(header));
> +	if (ret)
> +		goto out;
> +
> +	if (header.block_size) {
> +		/* Append the last block contents to pagelist */
> +		ret = ceph_pagelist_append(pagelist, iov.iov_base,
> +					   CEPH_FSCRYPT_BLOCK_SIZE);
> +		if (ret)
> +			goto out;
> +	}
> +	req->r_pagelist = pagelist;
> +out:
> +	dout("%s %p size dropping cap refs on %s\n", __func__,
> +	     inode, ceph_cap_string(got));
> +	kunmap_local(iov.iov_base);
> +	if (page)
> +		__free_pages(page, 0);
> +	if (ret && pagelist)
> +		ceph_pagelist_release(pagelist);
> +	return ret;
> +}
> +
>   int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *cia)
>   {
>   	struct ceph_inode_info *ci = ceph_inode(inode);
> @@ -2251,13 +2397,17 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>   	struct ceph_mds_request *req;
>   	struct ceph_mds_client *mdsc = ceph_sb_to_client(inode->i_sb)->mdsc;
>   	struct ceph_cap_flush *prealloc_cf;
> +	loff_t isize = i_size_read(inode);
>   	int issued;
>   	int release = 0, dirtied = 0;
>   	int mask = 0;
>   	int err = 0;
>   	int inode_dirty_flags = 0;
>   	bool lock_snap_rwsem = false;
> +	bool fill_fscrypt;
> +	int truncate_retry = 20; /* The RMW will take around 50ms */
>   
> +retry:
>   	prealloc_cf = ceph_alloc_cap_flush();
>   	if (!prealloc_cf)
>   		return -ENOMEM;
> @@ -2269,6 +2419,7 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>   		return PTR_ERR(req);
>   	}
>   
> +	fill_fscrypt = false;
>   	spin_lock(&ci->i_ceph_lock);
>   	issued = __ceph_caps_issued(ci, NULL);
>   
> @@ -2390,10 +2541,27 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>   		}
>   	}
>   	if (ia_valid & ATTR_SIZE) {
> -		loff_t isize = i_size_read(inode);
> -
>   		dout("setattr %p size %lld -> %lld\n", inode, isize, attr->ia_size);
> -		if ((issued & CEPH_CAP_FILE_EXCL) && attr->ia_size >= isize) {
> +		/*
> +		 * Only when the new size is smaller and not aligned to
> +		 * CEPH_FSCRYPT_BLOCK_SIZE will the RMW is needed.
> +		 */
> +		if (IS_ENCRYPTED(inode) && attr->ia_size < isize &&
> +		    (attr->ia_size % CEPH_FSCRYPT_BLOCK_SIZE)) {
> +			mask |= CEPH_SETATTR_SIZE;
> +			release |= CEPH_CAP_FILE_SHARED | CEPH_CAP_FILE_EXCL |
> +				   CEPH_CAP_FILE_RD | CEPH_CAP_FILE_WR;
> +			set_bit(CEPH_MDS_R_FSCRYPT_FILE, &req->r_req_flags);
> +			mask |= CEPH_SETATTR_FSCRYPT_FILE;
> +			req->r_args.setattr.size =
> +				cpu_to_le64(round_up(attr->ia_size,
> +						     CEPH_FSCRYPT_BLOCK_SIZE));
> +			req->r_args.setattr.old_size =
> +				cpu_to_le64(round_up(isize,
> +						     CEPH_FSCRYPT_BLOCK_SIZE));
> +			req->r_fscrypt_file = attr->ia_size;
> +			fill_fscrypt = true;
> +		} else if ((issued & CEPH_CAP_FILE_EXCL) && attr->ia_size >= isize) {
>   			if (attr->ia_size > isize) {
>   				i_size_write(inode, attr->ia_size);
>   				inode->i_blocks = calc_inode_blocks(attr->ia_size);
> @@ -2416,7 +2584,6 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>   					cpu_to_le64(round_up(isize,
>   							     CEPH_FSCRYPT_BLOCK_SIZE));
>   				req->r_fscrypt_file = attr->ia_size;
> -				/* FIXME: client must zero out any partial blocks! */
>   			} else {
>   				req->r_args.setattr.size = cpu_to_le64(attr->ia_size);
>   				req->r_args.setattr.old_size = cpu_to_le64(isize);
> @@ -2482,8 +2649,10 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>   
>   	release &= issued;
>   	spin_unlock(&ci->i_ceph_lock);
> -	if (lock_snap_rwsem)
> +	if (lock_snap_rwsem) {
>   		up_read(&mdsc->snap_rwsem);
> +		lock_snap_rwsem = false;
> +	}
>   
>   	if (inode_dirty_flags)
>   		__mark_inode_dirty(inode, inode_dirty_flags);
> @@ -2495,7 +2664,27 @@ int __ceph_setattr(struct inode *inode, struct iattr *attr, struct ceph_iattr *c
>   		req->r_args.setattr.mask = cpu_to_le32(mask);
>   		req->r_num_caps = 1;
>   		req->r_stamp = attr->ia_ctime;
> +		if (fill_fscrypt) {
> +			err = fill_fscrypt_truncate(inode, req, attr);
> +			if (err)
> +				goto out;
> +		}
> +
> +		/*
> +		 * The truncate request will return -EAGAIN when the
> +		 * last block has been updated just before the MDS
> +		 * successfully gets the xlock for the FILE lock. To
> +		 * avoid corrupting the file contents we need to retry
> +		 * it.
> +		 */
>   		err = ceph_mdsc_do_request(mdsc, NULL, req);
> +		if (err == -EAGAIN && truncate_retry--) {
> +			dout("setattr %p result=%d (%s locally, %d remote), retry it!\n",
> +			     inode, err, ceph_cap_string(dirtied), mask);
> +			ceph_mdsc_put_request(req);
> +			ceph_free_cap_flush(prealloc_cf);
> +			goto retry;
> +		}
>   	}
>   out:
>   	dout("setattr %p result=%d (%s locally, %d remote)\n", inode, err,
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index 4d2ccb51fe61..cd4a83fcbc0f 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -410,6 +410,11 @@ struct ceph_inode_info {
>   	u32 i_truncate_seq;        /* last truncate to smaller size */
>   	u64 i_truncate_size;       /*  and the size we last truncated down to */
>   	int i_truncate_pending;    /*  still need to call vmtruncate */
> +	/*
> +	 * For none fscrypt case it equals to i_truncate_size or it will
> +	 * equals to fscrypt_file_size
> +	 */
> +	u64 i_truncate_pagecache_size;
>   
>   	u64 i_max_size;            /* max file size authorized by mds */
>   	u64 i_reported_size; /* (max_)size reported to or requested of mds */
Eric Biggers Jan. 27, 2022, 2:14 a.m. UTC | #3
On Tue, Jan 11, 2022 at 02:15:20PM -0500, Jeff Layton wrote:
> Still, I was able to run xfstests on this set yesterday. Bug #2 above
> prevented all of the tests from passing, but it didn't oops! I call that
> progress! Given that, I figured this is a good time to post what I have
> so far.

One question: what sort of testing are you doing to show that the file contents
and filenames being stored (i.e., sent by the client to the server in this case)
have been encrypted correctly?  xfstests has tests that verify this for block
device based filesystems; are you doing any equivalent testing?

- Eric
Jeff Layton Jan. 27, 2022, 11:08 a.m. UTC | #4
On Wed, 2022-01-26 at 18:14 -0800, Eric Biggers wrote:
> On Tue, Jan 11, 2022 at 02:15:20PM -0500, Jeff Layton wrote:
> > Still, I was able to run xfstests on this set yesterday. Bug #2 above
> > prevented all of the tests from passing, but it didn't oops! I call that
> > progress! Given that, I figured this is a good time to post what I have
> > so far.
> 
> One question: what sort of testing are you doing to show that the file contents
> and filenames being stored (i.e., sent by the client to the server in this case)
> have been encrypted correctly?  xfstests has tests that verify this for block
> device based filesystems; are you doing any equivalent testing?
> 

I've been testing this pretty regularly with xfstests, and the filenames
portion all seems to be working correctly. Parts of the content
encryption also seem to work ok. I'm still working that piece, so I
haven't been able to validate that part yet.

At the moment I'm working on switching the ceph client over to doing
sparse reads, which is necessary in order to be able to handle sparse
writes without filling in unwritten holes.
Eric Biggers Jan. 28, 2022, 8:39 p.m. UTC | #5
On Thu, Jan 27, 2022 at 06:08:40AM -0500, Jeff Layton wrote:
> On Wed, 2022-01-26 at 18:14 -0800, Eric Biggers wrote:
> > On Tue, Jan 11, 2022 at 02:15:20PM -0500, Jeff Layton wrote:
> > > Still, I was able to run xfstests on this set yesterday. Bug #2 above
> > > prevented all of the tests from passing, but it didn't oops! I call that
> > > progress! Given that, I figured this is a good time to post what I have
> > > so far.
> > 
> > One question: what sort of testing are you doing to show that the file contents
> > and filenames being stored (i.e., sent by the client to the server in this case)
> > have been encrypted correctly?  xfstests has tests that verify this for block
> > device based filesystems; are you doing any equivalent testing?
> > 
> 
> I've been testing this pretty regularly with xfstests, and the filenames
> portion all seems to be working correctly. Parts of the content
> encryption also seem to work ok. I'm still working that piece, so I
> haven't been able to validate that part yet.
> 
> At the moment I'm working on switching the ceph client over to doing
> sparse reads, which is necessary in order to be able to handle sparse
> writes without filling in unwritten holes.

To clarify, I'm asking about the correctness of the ciphertext written to
"disk", not about the user-visible filesystem behavior which is something
different (but also super important as well, of course).  xfstests includes both
types of tests.

Grepping for _verify_ciphertext_for_encryption_policy in xfstests will show the
tests that verify the ciphertext written to disk.  I doubt that you're running
those, as they rely on a block device.  So you'll need to write some equivalent
tests.  In a pinch, you could simply check that the ciphertext is random rather
than correct (that would at least show that it's not plaintext) like what
generic/399 does.  But actually verifying its correctness would be ideal to
ensure that nothing went wrong along the way.

- Eric
Jeff Layton Jan. 28, 2022, 8:47 p.m. UTC | #6
On Fri, 2022-01-28 at 12:39 -0800, Eric Biggers wrote:
> On Thu, Jan 27, 2022 at 06:08:40AM -0500, Jeff Layton wrote:
> > On Wed, 2022-01-26 at 18:14 -0800, Eric Biggers wrote:
> > > On Tue, Jan 11, 2022 at 02:15:20PM -0500, Jeff Layton wrote:
> > > > Still, I was able to run xfstests on this set yesterday. Bug #2 above
> > > > prevented all of the tests from passing, but it didn't oops! I call that
> > > > progress! Given that, I figured this is a good time to post what I have
> > > > so far.
> > > 
> > > One question: what sort of testing are you doing to show that the file contents
> > > and filenames being stored (i.e., sent by the client to the server in this case)
> > > have been encrypted correctly?  xfstests has tests that verify this for block
> > > device based filesystems; are you doing any equivalent testing?
> > > 
> > 
> > I've been testing this pretty regularly with xfstests, and the filenames
> > portion all seems to be working correctly. Parts of the content
> > encryption also seem to work ok. I'm still working that piece, so I
> > haven't been able to validate that part yet.
> > 
> > At the moment I'm working on switching the ceph client over to doing
> > sparse reads, which is necessary in order to be able to handle sparse
> > writes without filling in unwritten holes.
> 
> To clarify, I'm asking about the correctness of the ciphertext written to
> "disk", not about the user-visible filesystem behavior which is something
> different (but also super important as well, of course).  xfstests includes both
> types of tests.
> 
> Grepping for _verify_ciphertext_for_encryption_policy in xfstests will show the
> tests that verify the ciphertext written to disk.  I doubt that you're running
> those, as they rely on a block device.  So you'll need to write some equivalent
> tests.  In a pinch, you could simply check that the ciphertext is random rather
> than correct (that would at least show that it's not plaintext) like what
> generic/399 does.  But actually verifying its correctness would be ideal to
> ensure that nothing went wrong along the way.
> 

Got it. Yes, that would be a good thing. I'll have to see what I can do
once I get to the point of a fully-functioning prototype.
Luis Henriques Feb. 16, 2022, 4:13 p.m. UTC | #7
Jeff Layton <jlayton@kernel.org> writes:

> On Mon, 2022-02-14 at 17:57 +0000, Luís Henriques wrote:
>> Jeff Layton <jlayton@kernel.org> writes:
>> 
>> > This patchset represents a (mostly) complete rough draft of fscrypt
>> > support for cephfs. The context, filename and symlink support is more or
>> > less the same as the versions posted before, and comprise the first half
>> > of the patches.
>> > 
>> > The new bits here are the size handling changes and support for content
>> > encryption, in buffered, direct and synchronous codepaths. Much of this
>> > code is still very rough and needs a lot of cleanup work.
>> > 
>> > fscrypt support relies on some MDS changes that are being tracked here:
>> > 
>> >     https://github.com/ceph/ceph/pull/43588
>> > 
>> 
>> Please correct me if I'm wrong (and I've a feeling that I *will* be
>> wrong): we're still missing some mechanism that prevents clients that do
>> not support fscrypt from creating new files in an encryption directory,
>> right?  I'm pretty sure I've discussed this "somewhere" with "someone",
>> but I can't remember anything else.
>> 
>> At this point, I can create an encrypted directory and, from a different
>> client (that doesn't support fscrypt), create a new non-encrypted file in
>> that directory.  The result isn't good, of course.
>> 
>> I guess that a new feature bit can be used so that the MDS won't allow any
>> sort of operations (or, at least, write/create operations) on encrypted
>> dirs from clients that don't have this bit set.
>> 
>> So, am I missing something or is this still on the TODO list?
>> 
>> (I can try to have a look at it if this is still missing.)
>> 
>> Cheers,
>
> It's still on the TODO list.
>
> Basically, I think we'll want to allow non-fscrypt-enabled clients to
> stat and readdir in an fscrypt-enabled directory tree, and unlink files
> and directories in it.
>
> They should have no need to do anything else. You can't run backups from
> such clients since you wouldn't have the real size or crypto context.
> -- 
> Jeff Layton <jlayton@kernel.org>

OK, I've looked at the code and I've a patch that works (sort of).  Here's
what I've done:

I'm blocking all the dangerous Ops (CEPH_MDS_OP_{CREATE,MKDIR,...}) early
in the client requests handling code.  I.e., returning -EROFS if the
client session doesn't have the feature *and* the inode has fscrypt_auth
set.

It sort of works (I still need to find if I need any locks, that's black
magic for me!), but it won't prevent a client from doing things like
appending garbage to an encrypted file.  Doing this will obviously make
that file useless, but it's not that much different from non-encrypted
files (sure, in this case it might be possible to recover some data).  But
I'm not seeing an easy way to caps into this mix.

Cheers,