[v2,03/20] fuse: Implement standard FUSE operations

Message ID	20200922104932.46384-4-mreitz@redhat.com
State	New
Headers	show Return-Path: <SRS0=0j61=C7=nongnu.org=qemu-devel-bounces+qemu-devel=archiver.kernel.org@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 826D42396D From: Max Reitz <mreitz@redhat.com> To: qemu-block@nongnu.org Subject: [PATCH v2 03/20] fuse: Implement standard FUSE operations Date: Tue, 22 Sep 2020 12:49:15 +0200 Message-Id: <20200922104932.46384-4-mreitz@redhat.com> In-Reply-To: <20200922104932.46384-1-mreitz@redhat.com> References: <20200922104932.46384-1-mreitz@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit Received-SPF: pass client-ip=63.128.21.124; envelope-from=mreitz@redhat.com; helo=us-smtp-delivery-124.mimecast.com Precedence: list Cc: Kevin Wolf <kwolf@redhat.com>, qemu-devel@nongnu.org, Stefan Hajnoczi <stefanha@redhat.com>, Max Reitz <mreitz@redhat.com> Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org>
Series	block/export: Allow exporting BDSs via FUSE \| expand [v2,00/20] block/export: Allow exporting BDSs via FUSE [v2,01/20] configure: Detect libfuse [v2,02/20] fuse: Allow exporting BDSs via FUSE [v2,03/20] fuse: Implement standard FUSE operations [v2,04/20] fuse: Allow growable exports [v2,05/20] fuse: (Partially) implement fallocate() [v2,06/20] fuse: Implement hole detection through lseek [v2,07/20] iotests: Do not needlessly filter _make_test_img [v2,08/20] iotests: Do not pipe _make_test_img [v2,09/20] iotests: Use convert -n in some cases [v2,10/20] iotests/046: Avoid renaming images [v2,11/20] iotests: Derive image names from $TEST_IMG [v2,12/20] iotests/091: Use _cleanup_qemu instad of "wait" [v2,13/20] iotests: Restrict some Python tests to file [v2,14/20] iotests: Let _make_test_img guess $TEST_IMG_FILE [v2,15/20] iotests/287: Clean up subshell test image [v2,16/20] storage-daemon: Call bdrv_close_all() on exit [v2,17/20] iotests: Give access to the qemu-storage-daemon [v2,18/20] iotests: Allow testing FUSE exports [v2,19/20] iotests: Enable fuse for many tests [v2,20/20] iotests/308: Add test for FUSE exports

Message ID

20200922104932.46384-4-mreitz@redhat.com

State

New

Headers

DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 826D42396D
From: Max Reitz <mreitz@redhat.com>
To: qemu-block@nongnu.org
Subject: [PATCH v2 03/20] fuse: Implement standard FUSE operations
Date: Tue, 22 Sep 2020 12:49:15 +0200
Message-Id: <20200922104932.46384-4-mreitz@redhat.com>
In-Reply-To: <20200922104932.46384-1-mreitz@redhat.com>
References: <20200922104932.46384-1-mreitz@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 8bit
Received-SPF: pass client-ip=63.128.21.124; envelope-from=mreitz@redhat.com; 
	helo=us-smtp-delivery-124.mimecast.com
X-Spam_score_int: -35
X-Spam_score: -3.6
X-Spam_bar: ---
X-Spam_report: (-3.6 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-1.455,
	DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1,
	DKIM_VALID_EF=-0.1, 
	RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H5=0.001,
	RCVD_IN_MSPIKE_WL=0.001, 
	SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no
X-Spam_action: no action
X-BeenThere: qemu-devel@nongnu.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <https://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
Cc: Kevin Wolf <kwolf@redhat.com>, qemu-devel@nongnu.org,
	Stefan Hajnoczi <stefanha@redhat.com>, Max Reitz <mreitz@redhat.com>
Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org
Sender: "Qemu-devel"
	<qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org>

Series

block/export: Allow exporting BDSs via FUSE | expand

Commit Message

Max Reitz Sept. 22, 2020, 10:49 a.m. UTC

This makes the export actually useful instead of only producing errors
whenever it is accessed.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/export/fuse.c | 226 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 226 insertions(+)

Comments

Kevin Wolf Oct. 15, 2020, 9:46 a.m. UTC | #1

Am 22.09.2020 um 12:49 hat Max Reitz geschrieben:
> This makes the export actually useful instead of only producing errors
> whenever it is accessed.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/export/fuse.c | 226 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 226 insertions(+)
> 
> diff --git a/block/export/fuse.c b/block/export/fuse.c
> index 75f11d2514..8fc667231d 100644
> --- a/block/export/fuse.c
> +++ b/block/export/fuse.c
> @@ -32,6 +32,10 @@
>  #include <fuse_lowlevel.h>
>  
>  
> +/* Prevent overly long bounce buffer allocations */
> +#define FUSE_MAX_BOUNCE_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 64 * 1024 * 1024))
> +
> +
>  typedef struct FuseExport {
>      BlockExport common;
>  
> @@ -241,7 +245,229 @@ static bool is_regular_file(const char *path, Error **errp)
>      return true;
>  }
>  
> +
> +/**
> + * Let clients look up files.  Always return ENOENT because we only
> + * care about the mountpoint itself.
> + */
> +static void fuse_lookup(fuse_req_t req, fuse_ino_t parent, const char *name)
> +{
> +    fuse_reply_err(req, ENOENT);
> +}
> +
> +/**
> + * Let clients get file attributes (i.e., stat() the file).
> + */
> +static void fuse_getattr(fuse_req_t req, fuse_ino_t inode,
> +                         struct fuse_file_info *fi)
> +{
> +    struct stat statbuf;
> +    int64_t length, allocated_blocks;
> +    time_t now = time(NULL);
> +    ImageInfo *info = NULL;
> +    FuseExport *exp = fuse_req_userdata(req);
> +    mode_t mode;
> +    Error *local_error = NULL;
> +
> +    length = blk_getlength(exp->common.blk);
> +    if (length < 0) {
> +        fuse_reply_err(req, -length);
> +        return;
> +    }
> +
> +    bdrv_query_image_info(blk_bs(exp->common.blk), &info, &local_error);
> +    if (local_error) {
> +        allocated_blocks = DIV_ROUND_UP(length, 512);
> +    } else {
> +        allocated_blocks = DIV_ROUND_UP(info->actual_size, 512);
> +    }
> +
> +    qapi_free_ImageInfo(info);
> +    error_free(local_error);
> +    local_error = NULL;

If you only use info->actual_size, why not directly call
bdrv_get_allocated_file_size()?

> +
> +    mode = S_IFREG | S_IRUSR;
> +    if (exp->writable) {
> +        mode |= S_IWUSR;
> +    }
> +
> +    statbuf = (struct stat) {
> +        .st_ino     = inode,
> +        .st_mode    = mode,
> +        .st_nlink   = 1,
> +        .st_uid     = getuid(),
> +        .st_gid     = getgid(),
> +        .st_size    = length,
> +        .st_blksize = blk_bs(exp->common.blk)->bl.request_alignment,
> +        .st_blocks  = allocated_blocks,
> +        .st_atime   = now,
> +        .st_mtime   = now,
> +        .st_ctime   = now,
> +    };
> +
> +    fuse_reply_attr(req, &statbuf, 1.);
> +}
> +
> +static int fuse_do_truncate(const FuseExport *exp, int64_t size,
> +                            PreallocMode prealloc)
> +{
> +    uint64_t blk_perm, blk_shared_perm;
> +    int ret;
> +
> +    blk_get_perm(exp->common.blk, &blk_perm, &blk_shared_perm);
> +
> +    ret = blk_set_perm(exp->common.blk, blk_perm | BLK_PERM_RESIZE,
> +                       blk_shared_perm, NULL);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    ret = blk_truncate(exp->common.blk, size, true, prealloc, 0, NULL);
> +
> +    /* Must succeed, because we are only giving up the RESIZE permission */
> +    blk_set_perm(exp->common.blk, blk_perm, blk_shared_perm, &error_abort);
> +
> +    return ret;
> +}
> +
> +/**
> + * Let clients set file attributes.  Only resizing is supported.
> + */
> +static void fuse_setattr(fuse_req_t req, fuse_ino_t inode, struct stat *statbuf,
> +                         int to_set, struct fuse_file_info *fi)
> +{
> +    FuseExport *exp = fuse_req_userdata(req);
> +    int ret;
> +
> +    if (!exp->writable) {
> +        fuse_reply_err(req, EACCES);
> +        return;
> +    }
> +
> +    if (to_set & ~FUSE_SET_ATTR_SIZE) {
> +        fuse_reply_err(req, ENOTSUP);
> +        return;
> +    }
> +
> +    ret = fuse_do_truncate(exp, statbuf->st_size, PREALLOC_MODE_OFF);
> +    if (ret < 0) {
> +        fuse_reply_err(req, -ret);
> +        return;
> +    }
> +
> +    fuse_getattr(req, inode, fi);
> +}
> +
> +/**
> + * Let clients open a file (i.e., the exported image).
> + */
> +static void fuse_open(fuse_req_t req, fuse_ino_t inode,
> +                      struct fuse_file_info *fi)
> +{
> +    fuse_reply_open(req, fi);
> +}
> +
> +/**
> + * Handle client reads from the exported image.
> + */
> +static void fuse_read(fuse_req_t req, fuse_ino_t inode,
> +                      size_t size, off_t offset, struct fuse_file_info *fi)
> +{
> +    FuseExport *exp = fuse_req_userdata(req);
> +    int64_t length;
> +    void *buf;
> +    int ret;
> +
> +    /**
> +     * Clients will expect short reads at EOF, so we have to limit
> +     * offset+size to the image length.
> +     */
> +    length = blk_getlength(exp->common.blk);
> +    if (length < 0) {
> +        fuse_reply_err(req, -length);
> +        return;
> +    }
> +
> +    size = MIN(size, FUSE_MAX_BOUNCE_BYTES);

"Read should send exactly the number of bytes requested except on EOF or
error, otherwise the rest of the data will be substituted with zeroes."

Do we corrupt huge reads with this, so that read() succeeds, but the
buffer is zeroed above 64M instead of containing the correct data? Maybe
we should return an error instead?

(It's kind of sad that we need a bounce buffer from which data is later
copied instead of being provided the right memory by the kernel.)

> +    if (offset + size > length) {
> +        size = length - offset;
> +    }
> +
> +    buf = qemu_try_blockalign(blk_bs(exp->common.blk), size);
> +    if (!buf) {
> +        fuse_reply_err(req, ENOMEM);
> +        return;
> +    }
> +
> +    ret = blk_pread(exp->common.blk, offset, buf, size);
> +    if (ret >= 0) {
> +        fuse_reply_buf(req, buf, size);
> +    } else {
> +        fuse_reply_err(req, -ret);
> +    }
> +
> +    qemu_vfree(buf);
> +}
> +
> +/**
> + * Handle client writes to the exported image.
> + */
> +static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
> +                       size_t size, off_t offset, struct fuse_file_info *fi)
> +{
> +    FuseExport *exp = fuse_req_userdata(req);
> +    int64_t length;
> +    int ret;
> +
> +    if (!exp->writable) {
> +        fuse_reply_err(req, EACCES);
> +        return;
> +    }
> +
> +    /**
> +     * Clients will expect short writes at EOF, so we have to limit
> +     * offset+size to the image length.
> +     */
> +    length = blk_getlength(exp->common.blk);
> +    if (length < 0) {
> +        fuse_reply_err(req, -length);
> +        return;
> +    }
> +
> +    size = MIN(size, BDRV_REQUEST_MAX_BYTES);

We're only supposed to do short writes on EOF, so this has a similar
problem as in fuse_read, except that BDRV_REQUEST_MAX_BYTES is much
higher and it's not specified what the resulting misbehaviour could be
(possibly the kernel not retrying write for the rest of the buffer?)

> +    if (offset + size > length) {
> +        size = length - offset;
> +    }
> +
> +    ret = blk_pwrite(exp->common.blk, offset, buf, size, 0);
> +    if (ret >= 0) {
> +        fuse_reply_write(req, size);
> +    } else {
> +        fuse_reply_err(req, -ret);
> +    }
> +}
> +
> +/**
> + * Let clients flush the exported image.
> + */
> +static void fuse_flush(fuse_req_t req, fuse_ino_t inode,
> +                       struct fuse_file_info *fi)
> +{
> +    FuseExport *exp = fuse_req_userdata(req);
> +    int ret;
> +
> +    ret = blk_flush(exp->common.blk);
> +    fuse_reply_err(req, ret < 0 ? -ret : 0);
> +}

This seems to be an implementation for .fsync rather than for .flush.

Hmm, or maybe actually for both? We usually do bdrv_flush() during
close, so it would be consistent to do the same here. It's our last
chance to report an error to the user before the file is closed.

>  static const struct fuse_lowlevel_ops fuse_ops = {
> +    .lookup     = fuse_lookup,
> +    .getattr    = fuse_getattr,
> +    .setattr    = fuse_setattr,
> +    .open       = fuse_open,
> +    .read       = fuse_read,
> +    .write      = fuse_write,
> +    .flush      = fuse_flush,
>  };

Kevin

Max Reitz Oct. 15, 2020, 3:18 p.m. UTC | #2

On 15.10.20 11:46, Kevin Wolf wrote:
> Am 22.09.2020 um 12:49 hat Max Reitz geschrieben:
>> This makes the export actually useful instead of only producing errors
>> whenever it is accessed.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>  block/export/fuse.c | 226 ++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 226 insertions(+)
>>
>> diff --git a/block/export/fuse.c b/block/export/fuse.c
>> index 75f11d2514..8fc667231d 100644
>> --- a/block/export/fuse.c
>> +++ b/block/export/fuse.c
>> @@ -32,6 +32,10 @@
>>  #include <fuse_lowlevel.h>
>>  
>>  
>> +/* Prevent overly long bounce buffer allocations */
>> +#define FUSE_MAX_BOUNCE_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 64 * 1024 * 1024))
>> +
>> +
>>  typedef struct FuseExport {
>>      BlockExport common;
>>  
>> @@ -241,7 +245,229 @@ static bool is_regular_file(const char *path, Error **errp)
>>      return true;
>>  }
>>  
>> +
>> +/**
>> + * Let clients look up files.  Always return ENOENT because we only
>> + * care about the mountpoint itself.
>> + */
>> +static void fuse_lookup(fuse_req_t req, fuse_ino_t parent, const char *name)
>> +{
>> +    fuse_reply_err(req, ENOENT);
>> +}
>> +
>> +/**
>> + * Let clients get file attributes (i.e., stat() the file).
>> + */
>> +static void fuse_getattr(fuse_req_t req, fuse_ino_t inode,
>> +                         struct fuse_file_info *fi)
>> +{
>> +    struct stat statbuf;
>> +    int64_t length, allocated_blocks;
>> +    time_t now = time(NULL);
>> +    ImageInfo *info = NULL;
>> +    FuseExport *exp = fuse_req_userdata(req);
>> +    mode_t mode;
>> +    Error *local_error = NULL;
>> +
>> +    length = blk_getlength(exp->common.blk);
>> +    if (length < 0) {
>> +        fuse_reply_err(req, -length);
>> +        return;
>> +    }
>> +
>> +    bdrv_query_image_info(blk_bs(exp->common.blk), &info, &local_error);
>> +    if (local_error) {
>> +        allocated_blocks = DIV_ROUND_UP(length, 512);
>> +    } else {
>> +        allocated_blocks = DIV_ROUND_UP(info->actual_size, 512);
>> +    }
>> +
>> +    qapi_free_ImageInfo(info);
>> +    error_free(local_error);
>> +    local_error = NULL;
> 
> If you only use info->actual_size, why not directly call
> bdrv_get_allocated_file_size()?

🤔

Sometimes one doesn’t think of the simplest things.

>> +
>> +    mode = S_IFREG | S_IRUSR;
>> +    if (exp->writable) {
>> +        mode |= S_IWUSR;
>> +    }
>> +
>> +    statbuf = (struct stat) {
>> +        .st_ino     = inode,
>> +        .st_mode    = mode,
>> +        .st_nlink   = 1,
>> +        .st_uid     = getuid(),
>> +        .st_gid     = getgid(),
>> +        .st_size    = length,
>> +        .st_blksize = blk_bs(exp->common.blk)->bl.request_alignment,
>> +        .st_blocks  = allocated_blocks,
>> +        .st_atime   = now,
>> +        .st_mtime   = now,
>> +        .st_ctime   = now,
>> +    };
>> +
>> +    fuse_reply_attr(req, &statbuf, 1.);
>> +}
>> +
>> +static int fuse_do_truncate(const FuseExport *exp, int64_t size,
>> +                            PreallocMode prealloc)
>> +{
>> +    uint64_t blk_perm, blk_shared_perm;
>> +    int ret;
>> +
>> +    blk_get_perm(exp->common.blk, &blk_perm, &blk_shared_perm);
>> +
>> +    ret = blk_set_perm(exp->common.blk, blk_perm | BLK_PERM_RESIZE,
>> +                       blk_shared_perm, NULL);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    ret = blk_truncate(exp->common.blk, size, true, prealloc, 0, NULL);
>> +
>> +    /* Must succeed, because we are only giving up the RESIZE permission */
>> +    blk_set_perm(exp->common.blk, blk_perm, blk_shared_perm, &error_abort);
>> +
>> +    return ret;
>> +}
>> +
>> +/**
>> + * Let clients set file attributes.  Only resizing is supported.
>> + */
>> +static void fuse_setattr(fuse_req_t req, fuse_ino_t inode, struct stat *statbuf,
>> +                         int to_set, struct fuse_file_info *fi)
>> +{
>> +    FuseExport *exp = fuse_req_userdata(req);
>> +    int ret;
>> +
>> +    if (!exp->writable) {
>> +        fuse_reply_err(req, EACCES);
>> +        return;
>> +    }
>> +
>> +    if (to_set & ~FUSE_SET_ATTR_SIZE) {
>> +        fuse_reply_err(req, ENOTSUP);
>> +        return;
>> +    }
>> +
>> +    ret = fuse_do_truncate(exp, statbuf->st_size, PREALLOC_MODE_OFF);
>> +    if (ret < 0) {
>> +        fuse_reply_err(req, -ret);
>> +        return;
>> +    }
>> +
>> +    fuse_getattr(req, inode, fi);
>> +}
>> +
>> +/**
>> + * Let clients open a file (i.e., the exported image).
>> + */
>> +static void fuse_open(fuse_req_t req, fuse_ino_t inode,
>> +                      struct fuse_file_info *fi)
>> +{
>> +    fuse_reply_open(req, fi);
>> +}
>> +
>> +/**
>> + * Handle client reads from the exported image.
>> + */
>> +static void fuse_read(fuse_req_t req, fuse_ino_t inode,
>> +                      size_t size, off_t offset, struct fuse_file_info *fi)
>> +{
>> +    FuseExport *exp = fuse_req_userdata(req);
>> +    int64_t length;
>> +    void *buf;
>> +    int ret;
>> +
>> +    /**
>> +     * Clients will expect short reads at EOF, so we have to limit
>> +     * offset+size to the image length.
>> +     */
>> +    length = blk_getlength(exp->common.blk);
>> +    if (length < 0) {
>> +        fuse_reply_err(req, -length);
>> +        return;
>> +    }
>> +
>> +    size = MIN(size, FUSE_MAX_BOUNCE_BYTES);
> 
> "Read should send exactly the number of bytes requested except on EOF or
> error, otherwise the rest of the data will be substituted with zeroes."

:(

> Do we corrupt huge reads with this, so that read() succeeds, but the
> buffer is zeroed above 64M instead of containing the correct data? Maybe
> we should return an error instead?

Hm.  It looks like there is a max_read option obeyed by the kernel
driver, and it appears it’s set by implementing .init() and setting
fuse_conn_info.max_read (also, “for the time being” it has to also set
in the mount options passed to fuse_session_new()).

I’m not sure whether that does anything, though.  It appears that
whenever I do a cached read, it caps out at 128k (which is what
cuse_prep_data() in libfuse sets; though increasing that number there
doesn’t change anything, so I think that’s just a coincidence), and with
O_DIRECT, it caps out at 1M.

But at least that would be grounds to return an error for >64M requests.
 (Limiting max_read to 64k does limit all read requests to 64k.)

Further investigation is needed.

> (It's kind of sad that we need a bounce buffer from which data is later
> copied instead of being provided the right memory by the kernel.)

Yes, it is.

>> +    if (offset + size > length) {
>> +        size = length - offset;
>> +    }
>> +
>> +    buf = qemu_try_blockalign(blk_bs(exp->common.blk), size);
>> +    if (!buf) {
>> +        fuse_reply_err(req, ENOMEM);
>> +        return;
>> +    }
>> +
>> +    ret = blk_pread(exp->common.blk, offset, buf, size);
>> +    if (ret >= 0) {
>> +        fuse_reply_buf(req, buf, size);
>> +    } else {
>> +        fuse_reply_err(req, -ret);
>> +    }
>> +
>> +    qemu_vfree(buf);
>> +}
>> +
>> +/**
>> + * Handle client writes to the exported image.
>> + */
>> +static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
>> +                       size_t size, off_t offset, struct fuse_file_info *fi)
>> +{
>> +    FuseExport *exp = fuse_req_userdata(req);
>> +    int64_t length;
>> +    int ret;
>> +
>> +    if (!exp->writable) {
>> +        fuse_reply_err(req, EACCES);
>> +        return;
>> +    }
>> +
>> +    /**
>> +     * Clients will expect short writes at EOF, so we have to limit
>> +     * offset+size to the image length.
>> +     */
>> +    length = blk_getlength(exp->common.blk);
>> +    if (length < 0) {
>> +        fuse_reply_err(req, -length);
>> +        return;
>> +    }
>> +
>> +    size = MIN(size, BDRV_REQUEST_MAX_BYTES);
> 
> We're only supposed to do short writes on EOF, so this has a similar
> problem as in fuse_read, except that BDRV_REQUEST_MAX_BYTES is much
> higher and it's not specified what the resulting misbehaviour could be
> (possibly the kernel not retrying write for the rest of the buffer?)

As for reading above, we can (and should) probably set max_write.

>> +    if (offset + size > length) {
>> +        size = length - offset;
>> +    }
>> +
>> +    ret = blk_pwrite(exp->common.blk, offset, buf, size, 0);
>> +    if (ret >= 0) {
>> +        fuse_reply_write(req, size);
>> +    } else {
>> +        fuse_reply_err(req, -ret);
>> +    }
>> +}
>> +
>> +/**
>> + * Let clients flush the exported image.
>> + */
>> +static void fuse_flush(fuse_req_t req, fuse_ino_t inode,
>> +                       struct fuse_file_info *fi)
>> +{
>> +    FuseExport *exp = fuse_req_userdata(req);
>> +    int ret;
>> +
>> +    ret = blk_flush(exp->common.blk);
>> +    fuse_reply_err(req, ret < 0 ? -ret : 0);
>> +}
> 
> This seems to be an implementation for .fsync rather than for .flush.

Wouldn’t fsync entail a drain?

Or is it the opposite, flush should just drain and not invoke
blk_flush()?  (Sorry, this all gets me confused every time.)

(Though I do think .fsync should both flush and drain.)

> Hmm, or maybe actually for both? We usually do bdrv_flush() during
> close, so it would be consistent to do the same here. It's our last
> chance to report an error to the user before the file is closed.

I don’t understand what you mean.  What is “the same”?  Closing the
image?  Or indeed having .flush() be implemented with blk_flush()?

Do you mean that other parties may do the same as qemu does, i.e. flush
files before they are closed, which is why we should anticipate the same
and give users a chance to see errors?

Max

Kevin Wolf Oct. 15, 2020, 3:58 p.m. UTC | #3

Am 15.10.2020 um 17:18 hat Max Reitz geschrieben:
> On 15.10.20 11:46, Kevin Wolf wrote:

> > Am 22.09.2020 um 12:49 hat Max Reitz geschrieben:

> >> This makes the export actually useful instead of only producing errors

> >> whenever it is accessed.

> >>

> >> Signed-off-by: Max Reitz <mreitz@redhat.com>


> >> +/**

> >> + * Handle client reads from the exported image.

> >> + */

> >> +static void fuse_read(fuse_req_t req, fuse_ino_t inode,

> >> +                      size_t size, off_t offset, struct fuse_file_info *fi)

> >> +{

> >> +    FuseExport *exp = fuse_req_userdata(req);

> >> +    int64_t length;

> >> +    void *buf;

> >> +    int ret;

> >> +

> >> +    /**

> >> +     * Clients will expect short reads at EOF, so we have to limit

> >> +     * offset+size to the image length.

> >> +     */

> >> +    length = blk_getlength(exp->common.blk);

> >> +    if (length < 0) {

> >> +        fuse_reply_err(req, -length);

> >> +        return;

> >> +    }

> >> +

> >> +    size = MIN(size, FUSE_MAX_BOUNCE_BYTES);

> > 

> > "Read should send exactly the number of bytes requested except on EOF or

> > error, otherwise the rest of the data will be substituted with zeroes."

> 

> :(

> 

> > Do we corrupt huge reads with this, so that read() succeeds, but the

> > buffer is zeroed above 64M instead of containing the correct data? Maybe

> > we should return an error instead?

> 

> Hm.  It looks like there is a max_read option obeyed by the kernel

> driver, and it appears it’s set by implementing .init() and setting

> fuse_conn_info.max_read (also, “for the time being” it has to also set

> in the mount options passed to fuse_session_new()).

> 

> I’m not sure whether that does anything, though.  It appears that

> whenever I do a cached read, it caps out at 128k (which is what

> cuse_prep_data() in libfuse sets; though increasing that number there

> doesn’t change anything, so I think that’s just a coincidence), and with

> O_DIRECT, it caps out at 1M.

> 

> But at least that would be grounds to return an error for >64M requests.

>  (Limiting max_read to 64k does limit all read requests to 64k.)


Yes, setting max_read and making larger requests an error seems like a
good solution.

> Further investigation is needed.


If you want :-)

> > (It's kind of sad that we need a bounce buffer from which data is later

> > copied instead of being provided the right memory by the kernel.)

> 

> Yes, it is.

> 

> >> +    if (offset + size > length) {

> >> +        size = length - offset;

> >> +    }

> >> +

> >> +    buf = qemu_try_blockalign(blk_bs(exp->common.blk), size);

> >> +    if (!buf) {

> >> +        fuse_reply_err(req, ENOMEM);

> >> +        return;

> >> +    }

> >> +

> >> +    ret = blk_pread(exp->common.blk, offset, buf, size);

> >> +    if (ret >= 0) {

> >> +        fuse_reply_buf(req, buf, size);

> >> +    } else {

> >> +        fuse_reply_err(req, -ret);

> >> +    }

> >> +

> >> +    qemu_vfree(buf);

> >> +}

> >> +

> >> +/**

> >> + * Handle client writes to the exported image.

> >> + */

> >> +static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,

> >> +                       size_t size, off_t offset, struct fuse_file_info *fi)

> >> +{

> >> +    FuseExport *exp = fuse_req_userdata(req);

> >> +    int64_t length;

> >> +    int ret;

> >> +

> >> +    if (!exp->writable) {

> >> +        fuse_reply_err(req, EACCES);

> >> +        return;

> >> +    }

> >> +

> >> +    /**

> >> +     * Clients will expect short writes at EOF, so we have to limit

> >> +     * offset+size to the image length.

> >> +     */

> >> +    length = blk_getlength(exp->common.blk);

> >> +    if (length < 0) {

> >> +        fuse_reply_err(req, -length);

> >> +        return;

> >> +    }

> >> +

> >> +    size = MIN(size, BDRV_REQUEST_MAX_BYTES);

> > 

> > We're only supposed to do short writes on EOF, so this has a similar

> > problem as in fuse_read, except that BDRV_REQUEST_MAX_BYTES is much

> > higher and it's not specified what the resulting misbehaviour could be

> > (possibly the kernel not retrying write for the rest of the buffer?)

> 

> As for reading above, we can (and should) probably set max_write.


Yes.

> >> +    if (offset + size > length) {

> >> +        size = length - offset;

> >> +    }

> >> +

> >> +    ret = blk_pwrite(exp->common.blk, offset, buf, size, 0);

> >> +    if (ret >= 0) {

> >> +        fuse_reply_write(req, size);

> >> +    } else {

> >> +        fuse_reply_err(req, -ret);

> >> +    }

> >> +}

> >> +

> >> +/**

> >> + * Let clients flush the exported image.

> >> + */

> >> +static void fuse_flush(fuse_req_t req, fuse_ino_t inode,

> >> +                       struct fuse_file_info *fi)

> >> +{

> >> +    FuseExport *exp = fuse_req_userdata(req);

> >> +    int ret;

> >> +

> >> +    ret = blk_flush(exp->common.blk);

> >> +    fuse_reply_err(req, ret < 0 ? -ret : 0);

> >> +}

> > 

> > This seems to be an implementation for .fsync rather than for .flush.

> 

> Wouldn’t fsync entail a drain?


Hm, the libfuse documentation doesn't say anything about draining. I
suppose this is because if requests need to be drained, it will do so in
the kernel. But I haven't checked the code.

So I expect that calling fsync() on the lower layer does everything that
is needed. Which is bdrv_flush() in QEMU.

> Or is it the opposite, flush should just drain and not invoke

> blk_flush()?  (Sorry, this all gets me confused every time.)


I'm just relying on the libfuse documentation there for flush:

"This is called on each close() of the opened file."

and

"NOTE: the name of the method is misleading, since (unlike fsync) the
filesystem is not forced to flush pending writes. One reason to flush
data is if the filesystem wants to return write errors during close.
However, such use is non-portable because POSIX does not require close
to wait for delayed I/O to complete."

> (Though I do think .fsync should both flush and drain.)

> 

> > Hmm, or maybe actually for both? We usually do bdrv_flush() during

> > close, so it would be consistent to do the same here. It's our last

> > chance to report an error to the user before the file is closed.

> 

> I don’t understand what you mean.  What is “the same”?  Closing the

> image?  Or indeed having .flush() be implemented with blk_flush()?


Implementing .flush(), which will be called when the image is closed,
with blk_flush().

And still doing blk_flush() for .fsync, of course.

> Do you mean that other parties may do the same as qemu does, i.e.

> flush files before they are closed, which is why we should anticipate

> the same and give users a chance to see errors?


I'm not exactly sure what failure of .flush() would look like for users.
Probably close() failing, so they don't have to do anything special,
just check their return values.

Kevin

Max Reitz Oct. 15, 2020, 4:04 p.m. UTC | #4

On 15.10.20 17:58, Kevin Wolf wrote:
> Am 15.10.2020 um 17:18 hat Max Reitz geschrieben:
>> On 15.10.20 11:46, Kevin Wolf wrote:
>>> Am 22.09.2020 um 12:49 hat Max Reitz geschrieben:
>>>> This makes the export actually useful instead of only producing errors
>>>> whenever it is accessed.
>>>>
>>>> Signed-off-by: Max Reitz <mreitz@redhat.com>
> 
>>>> +/**
>>>> + * Handle client reads from the exported image.
>>>> + */
>>>> +static void fuse_read(fuse_req_t req, fuse_ino_t inode,
>>>> +                      size_t size, off_t offset, struct fuse_file_info *fi)
>>>> +{
>>>> +    FuseExport *exp = fuse_req_userdata(req);
>>>> +    int64_t length;
>>>> +    void *buf;
>>>> +    int ret;
>>>> +
>>>> +    /**
>>>> +     * Clients will expect short reads at EOF, so we have to limit
>>>> +     * offset+size to the image length.
>>>> +     */
>>>> +    length = blk_getlength(exp->common.blk);
>>>> +    if (length < 0) {
>>>> +        fuse_reply_err(req, -length);
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    size = MIN(size, FUSE_MAX_BOUNCE_BYTES);
>>>
>>> "Read should send exactly the number of bytes requested except on EOF or
>>> error, otherwise the rest of the data will be substituted with zeroes."
>>
>> :(
>>
>>> Do we corrupt huge reads with this, so that read() succeeds, but the
>>> buffer is zeroed above 64M instead of containing the correct data? Maybe
>>> we should return an error instead?
>>
>> Hm.  It looks like there is a max_read option obeyed by the kernel
>> driver, and it appears it’s set by implementing .init() and setting
>> fuse_conn_info.max_read (also, “for the time being” it has to also set
>> in the mount options passed to fuse_session_new()).
>>
>> I’m not sure whether that does anything, though.  It appears that
>> whenever I do a cached read, it caps out at 128k (which is what
>> cuse_prep_data() in libfuse sets; though increasing that number there
>> doesn’t change anything, so I think that’s just a coincidence), and with
>> O_DIRECT, it caps out at 1M.
>>
>> But at least that would be grounds to return an error for >64M requests.
>>  (Limiting max_read to 64k does limit all read requests to 64k.)
> 
> Yes, setting max_read and making larger requests an error seems like a
> good solution.
> 
>> Further investigation is needed.
> 
> If you want :-)

Not really, but 128k per request is a bit sad.

[...]

>>>> +/**
>>>> + * Let clients flush the exported image.
>>>> + */
>>>> +static void fuse_flush(fuse_req_t req, fuse_ino_t inode,
>>>> +                       struct fuse_file_info *fi)
>>>> +{
>>>> +    FuseExport *exp = fuse_req_userdata(req);
>>>> +    int ret;
>>>> +
>>>> +    ret = blk_flush(exp->common.blk);
>>>> +    fuse_reply_err(req, ret < 0 ? -ret : 0);
>>>> +}
>>>
>>> This seems to be an implementation for .fsync rather than for .flush.
>>
>> Wouldn’t fsync entail a drain?
> 
> Hm, the libfuse documentation doesn't say anything about draining. I
> suppose this is because if requests need to be drained, it will do so in
> the kernel. But I haven't checked the code.

Hmm, well, yeah...  A sync doesn’t necessarily mean settling all
in-flight requests.

> So I expect that calling fsync() on the lower layer does everything that
> is needed. Which is bdrv_flush() in QEMU.
> 
>> Or is it the opposite, flush should just drain and not invoke
>> blk_flush()?  (Sorry, this all gets me confused every time.)
> 
> I'm just relying on the libfuse documentation there for flush:
> 
> "This is called on each close() of the opened file."

Ah, OK.

> and
> 
> "NOTE: the name of the method is misleading, since (unlike fsync) the
> filesystem is not forced to flush pending writes. One reason to flush
> data is if the filesystem wants to return write errors during close.
> However, such use is non-portable because POSIX does not require close
> to wait for delayed I/O to complete."
> 
>> (Though I do think .fsync should both flush and drain.)
>>
>>> Hmm, or maybe actually for both? We usually do bdrv_flush() during
>>> close, so it would be consistent to do the same here. It's our last
>>> chance to report an error to the user before the file is closed.
>>
>> I don’t understand what you mean.  What is “the same”?  Closing the
>> image?  Or indeed having .flush() be implemented with blk_flush()?
> 
> Implementing .flush(), which will be called when the image is closed,
> with blk_flush().
> 
> And still doing blk_flush() for .fsync, of course.

OK.

>> Do you mean that other parties may do the same as qemu does, i.e.
>> flush files before they are closed, which is why we should anticipate
>> the same and give users a chance to see errors?
> 
> I'm not exactly sure what failure of .flush() would look like for users.
> Probably close() failing, so they don't have to do anything special,
> just check their return values.

Checking return values on close()?  Sounds special to me. O:)

Max

diff --git a/block/export/fuse.c b/block/export/fuse.c
index 75f11d2514..8fc667231d 100644
--- a/block/export/fuse.c
+++ b/block/export/fuse.c
@@ -32,6 +32,10 @@ 
 #include <fuse_lowlevel.h>
 
 
+/* Prevent overly long bounce buffer allocations */
+#define FUSE_MAX_BOUNCE_BYTES (MIN(BDRV_REQUEST_MAX_BYTES, 64 * 1024 * 1024))
+
+
 typedef struct FuseExport {
     BlockExport common;
 
@@ -241,7 +245,229 @@  static bool is_regular_file(const char *path, Error **errp)
     return true;
 }
 
+
+/**
+ * Let clients look up files.  Always return ENOENT because we only
+ * care about the mountpoint itself.
+ */
+static void fuse_lookup(fuse_req_t req, fuse_ino_t parent, const char *name)
+{
+    fuse_reply_err(req, ENOENT);
+}
+
+/**
+ * Let clients get file attributes (i.e., stat() the file).
+ */
+static void fuse_getattr(fuse_req_t req, fuse_ino_t inode,
+                         struct fuse_file_info *fi)
+{
+    struct stat statbuf;
+    int64_t length, allocated_blocks;
+    time_t now = time(NULL);
+    ImageInfo *info = NULL;
+    FuseExport *exp = fuse_req_userdata(req);
+    mode_t mode;
+    Error *local_error = NULL;
+
+    length = blk_getlength(exp->common.blk);
+    if (length < 0) {
+        fuse_reply_err(req, -length);
+        return;
+    }
+
+    bdrv_query_image_info(blk_bs(exp->common.blk), &info, &local_error);
+    if (local_error) {
+        allocated_blocks = DIV_ROUND_UP(length, 512);
+    } else {
+        allocated_blocks = DIV_ROUND_UP(info->actual_size, 512);
+    }
+
+    qapi_free_ImageInfo(info);
+    error_free(local_error);
+    local_error = NULL;
+
+    mode = S_IFREG | S_IRUSR;
+    if (exp->writable) {
+        mode |= S_IWUSR;
+    }
+
+    statbuf = (struct stat) {
+        .st_ino     = inode,
+        .st_mode    = mode,
+        .st_nlink   = 1,
+        .st_uid     = getuid(),
+        .st_gid     = getgid(),
+        .st_size    = length,
+        .st_blksize = blk_bs(exp->common.blk)->bl.request_alignment,
+        .st_blocks  = allocated_blocks,
+        .st_atime   = now,
+        .st_mtime   = now,
+        .st_ctime   = now,
+    };
+
+    fuse_reply_attr(req, &statbuf, 1.);
+}
+
+static int fuse_do_truncate(const FuseExport *exp, int64_t size,
+                            PreallocMode prealloc)
+{
+    uint64_t blk_perm, blk_shared_perm;
+    int ret;
+
+    blk_get_perm(exp->common.blk, &blk_perm, &blk_shared_perm);
+
+    ret = blk_set_perm(exp->common.blk, blk_perm | BLK_PERM_RESIZE,
+                       blk_shared_perm, NULL);
+    if (ret < 0) {
+        return ret;
+    }
+
+    ret = blk_truncate(exp->common.blk, size, true, prealloc, 0, NULL);
+
+    /* Must succeed, because we are only giving up the RESIZE permission */
+    blk_set_perm(exp->common.blk, blk_perm, blk_shared_perm, &error_abort);
+
+    return ret;
+}
+
+/**
+ * Let clients set file attributes.  Only resizing is supported.
+ */
+static void fuse_setattr(fuse_req_t req, fuse_ino_t inode, struct stat *statbuf,
+                         int to_set, struct fuse_file_info *fi)
+{
+    FuseExport *exp = fuse_req_userdata(req);
+    int ret;
+
+    if (!exp->writable) {
+        fuse_reply_err(req, EACCES);
+        return;
+    }
+
+    if (to_set & ~FUSE_SET_ATTR_SIZE) {
+        fuse_reply_err(req, ENOTSUP);
+        return;
+    }
+
+    ret = fuse_do_truncate(exp, statbuf->st_size, PREALLOC_MODE_OFF);
+    if (ret < 0) {
+        fuse_reply_err(req, -ret);
+        return;
+    }
+
+    fuse_getattr(req, inode, fi);
+}
+
+/**
+ * Let clients open a file (i.e., the exported image).
+ */
+static void fuse_open(fuse_req_t req, fuse_ino_t inode,
+                      struct fuse_file_info *fi)
+{
+    fuse_reply_open(req, fi);
+}
+
+/**
+ * Handle client reads from the exported image.
+ */
+static void fuse_read(fuse_req_t req, fuse_ino_t inode,
+                      size_t size, off_t offset, struct fuse_file_info *fi)
+{
+    FuseExport *exp = fuse_req_userdata(req);
+    int64_t length;
+    void *buf;
+    int ret;
+
+    /**
+     * Clients will expect short reads at EOF, so we have to limit
+     * offset+size to the image length.
+     */
+    length = blk_getlength(exp->common.blk);
+    if (length < 0) {
+        fuse_reply_err(req, -length);
+        return;
+    }
+
+    size = MIN(size, FUSE_MAX_BOUNCE_BYTES);
+    if (offset + size > length) {
+        size = length - offset;
+    }
+
+    buf = qemu_try_blockalign(blk_bs(exp->common.blk), size);
+    if (!buf) {
+        fuse_reply_err(req, ENOMEM);
+        return;
+    }
+
+    ret = blk_pread(exp->common.blk, offset, buf, size);
+    if (ret >= 0) {
+        fuse_reply_buf(req, buf, size);
+    } else {
+        fuse_reply_err(req, -ret);
+    }
+
+    qemu_vfree(buf);
+}
+
+/**
+ * Handle client writes to the exported image.
+ */
+static void fuse_write(fuse_req_t req, fuse_ino_t inode, const char *buf,
+                       size_t size, off_t offset, struct fuse_file_info *fi)
+{
+    FuseExport *exp = fuse_req_userdata(req);
+    int64_t length;
+    int ret;
+
+    if (!exp->writable) {
+        fuse_reply_err(req, EACCES);
+        return;
+    }
+
+    /**
+     * Clients will expect short writes at EOF, so we have to limit
+     * offset+size to the image length.
+     */
+    length = blk_getlength(exp->common.blk);
+    if (length < 0) {
+        fuse_reply_err(req, -length);
+        return;
+    }
+
+    size = MIN(size, BDRV_REQUEST_MAX_BYTES);
+    if (offset + size > length) {
+        size = length - offset;
+    }
+
+    ret = blk_pwrite(exp->common.blk, offset, buf, size, 0);
+    if (ret >= 0) {
+        fuse_reply_write(req, size);
+    } else {
+        fuse_reply_err(req, -ret);
+    }
+}
+
+/**
+ * Let clients flush the exported image.
+ */
+static void fuse_flush(fuse_req_t req, fuse_ino_t inode,
+                       struct fuse_file_info *fi)
+{
+    FuseExport *exp = fuse_req_userdata(req);
+    int ret;
+
+    ret = blk_flush(exp->common.blk);
+    fuse_reply_err(req, ret < 0 ? -ret : 0);
+}
+
 static const struct fuse_lowlevel_ops fuse_ops = {
+    .lookup     = fuse_lookup,
+    .getattr    = fuse_getattr,
+    .setattr    = fuse_setattr,
+    .open       = fuse_open,
+    .read       = fuse_read,
+    .write      = fuse_write,
+    .flush      = fuse_flush,
 };
 
 const BlockExportDriver blk_exp_fuse = {

[v2,03/20] fuse: Implement standard FUSE operations

Commit Message

Comments

Patch