[RFC,0/5] fix races in CDC-WDM

Message ID: 20200812132034.14363-1-oneukum@suse.com

Message

Oliver Neukum Aug. 12, 2020, 1:20 p.m. UTC
CDC-WDM was not written with multithreaded users or
uncooperative devices in mind.
This leads to race conditions and hangs in flush().

Comments

Tetsuo Handa Sept. 10, 2020, 10:01 a.m. UTC | #1
On 2020/09/10 18:09, Oliver Neukum wrote:
>> Does it make sense to wait for response of someone else's usb_submit_urb() when
>> someone is calling close(), for there is no guarantee that failure notice received
>> via wdm_flush() via some file descriptor corresponds to usb_submit_urb() request from
>> wdm_write() call from that file descriptor?
> 
> Well, user space may do multithreading. Whether it makes sense is
> another question. We just need to return results conforming to the
> standards. You noticed bugs. I think the next version will fix them.

My question is: how do you handle it if App1 and App2 (not multithreading but
multiprocessing) share the "desc"? Unless

>> In patch "[RFC 3/5] CDC-WDM: making flush() interruptible", it is legal to return -EINTR
>>  from close(). But I think that returning -EINTR from close() is not recommended because
>> it can confuse multithreaded application (retrying close() upon -EINTR is not safe).
> 
> Well, but what is the alternative? Should we ignore signals?
> 

we return the error from the write() request (i.e. give up trying to report errors from
the close() event), we can't pass results to the intended recipients.
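
The point about retrying close() on -EINTR being unsafe can be made concrete with a userspace sketch (safe_close is an invented name; the behavior relies on the Linux guarantee that the descriptor is released even when close() fails with EINTR):

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Userspace sketch of why retrying close() on -EINTR is unsafe
 * (safe_close is an invented name): POSIX leaves the fd state
 * unspecified after close() fails with EINTR, but Linux guarantees the
 * descriptor is released anyway, so a retry could close an fd that
 * another thread has meanwhile reused.
 */
static int safe_close(int fd)
{
        int rv = close(fd);

        /* on Linux the fd is gone even on EINTR: never close() again */
        if (rv == -1 && errno == EINTR)
                return 0;
        return rv;
}
```

In a multithreaded program, the retried close() could hit a descriptor number that another thread has already reopened, which is exactly the confusion described above.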
Oliver Neukum Sept. 15, 2020, 9:14 a.m. UTC | #2
Am Donnerstag, den 10.09.2020, 19:01 +0900 schrieb Tetsuo Handa:
> On 2020/09/10 18:09, Oliver Neukum wrote:
> > > Does it make sense to wait for response of someone else's usb_submit_urb() when
> > > someone is calling close(), for there is no guarantee that failure notice received
> > > via wdm_flush() via some file descriptor corresponds to usb_submit_urb() request from
> > > wdm_write() call from that file descriptor?
> > 
> > Well, user space may do multithreading. Whether it makes sense is
> > another question. We just need to return results conforming to the
> > standards. You noticed bugs. I think the next version will fix them.
> 
> My question is: how do you handle it if App1 and App2 (not multithreading but
> multiprocessing) share the "desc"? Unless

Well, device locking is a function of user space. This is Unix. If you
have two uncoordinated writers to a device, you cannot expect sensible
results. You can expect that the driver does not hang, of course.

> > > In patch "[RFC 3/5] CDC-WDM: making flush() interruptible", it is legal to return -EINTR
> > >  from close(). But I think that returning -EINTR from close() is not recommended because
> > > it can confuse multithreaded application (retrying close() upon -EINTR is not safe).
> > 
> > Well, but what is the alternative? Should we ignore signals?
> > 
> 
> we return the error from the write() request (i.e. give up trying to report errors from
> the close() event), we can't pass results to the intended recipients.

That means

* harming the single threaded for the sake of the few multithreaded
* it would not work for O_NONBLOCK
* if you use a device from multiple threads or tasks, locking is your
problem

Is there something we can do in flush()?

	Regards
		Oliver
Tetsuo Handa Sept. 15, 2020, 10:30 a.m. UTC | #3
On 2020/09/15 18:14, Oliver Neukum wrote:
>>>> In patch "[RFC 3/5] CDC-WDM: making flush() interruptible", it is legal to return -EINTR
>>>>  from close(). But I think that returning -EINTR from close() is not recommended because
>>>> it can confuse multithreaded application (retrying close() upon -EINTR is not safe).
>>>
>>> Well, but what is the alternative? Should we ignore signals?
>>>
>>
>> we return the error from write() request (i.e. give up trying to report errors from
>> close() event), we can't pass results to the intended recipients.
> 
> That means
> 
> * harming the single threaded for the sake of the few multithreaded

What is wrong with the single threaded user? As I describe below, there is no guarantee
that wdm_write() can complete immediately (even if O_NONBLOCK flag is set).

> * it would not work for O_NONBLOCK

It does work for O_NONBLOCK. Please see my proposal at
https://lkml.kernel.org/r/b347b882-a986-24c6-2b37-0b1a092931b9@i-love.sakura.ne.jp .
Since my proposal calls mutex_unlock() before start waiting for response, my
proposal does not block mutex_lock_interruptible() from O_NONBLOCK write().

O_NONBLOCK cannot guarantee that "we do not wait for memory allocation in
memdup_user()/copy_from_user()/usb_submit_urb(GFP_KERNEL)". It is possible that
an O_NONBLOCK write() is blocked for many _minutes_ at (at least) these three locations.

O_NONBLOCK only guarantees that "it does not wait when we can't start immediately".
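
That distinction can be demonstrated from userspace with an ordinary pipe standing in for the device (demo_nonblock is an invented name for this sketch):

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Userspace sketch of the distinction drawn above, with a pipe standing
 * in for the device: O_NONBLOCK only prevents waiting when the
 * operation cannot start at all (here: the pipe buffer is full); it
 * says nothing about delays inside the kernel, e.g. in memory
 * allocation. demo_nonblock() is an invented name.
 */
static int demo_nonblock(void)
{
        int fds[2];
        char c = 'x';
        int rv;

        if (pipe(fds) != 0)
                return -1;
        fcntl(fds[1], F_SETFL, fcntl(fds[1], F_GETFL) | O_NONBLOCK);

        /* fill the pipe buffer until a write cannot start immediately */
        while (write(fds[1], &c, 1) == 1)
                ;
        /* instead of blocking, write() failed at once with EAGAIN */
        rv = (errno == EAGAIN) ? 0 : -1;

        close(fds[0]);
        close(fds[1]);
        return rv;
}
```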

> * if you use a device from multiple threads or tasks, locking is your
> problem

You mean "a device" as "struct wdm_device *desc" ? Then, it does not matter because
the mutex, buffer, status etc. are per "struct wdm_device *desc" objects. Nobody will
be disturbed by somebody else using different "struct wdm_device *desc".

> 
> Is there something we can do in flush()?

I consider that wdm_flush() is a wrong place to return an error. It is possible that
a userspace process reaches wdm_flush() due to being killed by SIGKILL (rather than
via calling close() syscall). Then, that userspace process will never receive the error
fetched from wdm_flush(). Also, if that userspace process is killed by the OOM killer,
being able to terminate and release resources as soon as possible is preferable to
trying to wait for a response.
Oliver Neukum Sept. 16, 2020, 10:18 a.m. UTC | #4
Am Dienstag, den 15.09.2020, 19:30 +0900 schrieb Tetsuo Handa:
> On 2020/09/15 18:14, Oliver Neukum wrote:
> > Is there something we can do in flush()?
> 
> I consider that wdm_flush() is a wrong place to return an error. It is possible that

I am afraid that is a basic problem we need to resolve. As I understand
it, flush() as a method exists precisely to report errors. Otherwise
you could implement it in release(). But this is not called for every
close().

Hence a driver is supposed to start IO upon write() and report the
result to the next call, which can be either write() or close(), the
latter corresponding to flush().
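
The reporting scheme described here can be sketched as a toy model (plain userspace C, not the cdc-wdm code; all names are invented for illustration):

```c
#include <assert.h>

/*
 * Toy model (not the cdc-wdm code; all names invented) of the
 * reporting scheme described above: write() starts the IO and reports
 * only the *previous* operation's result; flush(), called from
 * close(), drains whatever error nobody has collected yet.
 */
struct toy_dev {
        int werr;       /* error left behind by the last completed IO */
};

/* simulated completion callback recording the outcome of an IO */
static void toy_complete(struct toy_dev *d, int status)
{
        d->werr = status;
}

/* the Nth write reports the (N-1)th result, then starts new IO */
static int dev_write(struct toy_dev *d)
{
        int we = d->werr;

        d->werr = 0;
        if (we < 0)
                return we;      /* deferred error from the previous write */
        /* ... a real driver would submit the URB here ... */
        return 0;
}

/* flush, as called from close(): return any still-pending error */
static int dev_flush(struct toy_dev *d)
{
        int we = d->werr;

        d->werr = 0;
        return we;
}
```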

> a userspace process reaches wdm_flush() due to being killed by SIGKILL (rather than
> via calling close() syscall). Then, that userspace process will never receive the error

If you are killed by a signal you are in a race condition
anyway. It cannot be handled.

> fetched from wdm_flush(). Also, if that userspace process is killed by the OOM killer,
> being able to terminate and release resources as soon as possible is more preferable
> than try to wait for response.

Right, so should we just proceed in case of a dying task? How do we
do that?

	Regards
		Oliver
Tetsuo Handa Sept. 16, 2020, 11:14 a.m. UTC | #5
On 2020/09/16 19:18, Oliver Neukum wrote:
> Am Dienstag, den 15.09.2020, 19:30 +0900 schrieb Tetsuo Handa:
>> On 2020/09/15 18:14, Oliver Neukum wrote
> 
>>> Is there something we can do in flush()?
>>
>> I consider that wdm_flush() is a wrong place to return an error. It is possible that
> 
> I am afraid that is a basic problem we need to resolve. As I understand
>  it, flush() as a method exists precisely to report errors. Otherwise
> you could implement it in release(). But this is not called for every
> close().

I think fsync() or ioctl() is a better method for reporting errors.

But anyway I consider that any error from N'th write() request should be
returned by N'th write() request.

> 
> Hence a driver is supposed to start IO upon write() and report the
> result to the next call, which can be either write() or close(), the
> latter corresponding to flush().

Whether the N'th write() request succeeded remains unknown until the (N+1)'th
write() request or close() request is issued? That sounds like a strange design.

If there is nothing more to write(), how userspace process knows whether
N'th write() request succeeded? Wait for writability using poll() ?

How do we guarantee that N'th write() request already set desc->werr before
(N+1)'th next write() request is issued? If (N+1)'th write() request reached
memdup_user() before desc->werr is set by callback of N'th write() request,
(N+1)'th write() request will fail to report the error from N'th write() request.
Yes, that error would be reported by (N+2)'th write() request, but the userspace
process might have already discarded data needed for taking some actions (e.g.
print error messages, retry the write() request with same argument).
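
The race window described above can be replayed as a sequential toy model (all names invented; -71 stands in for -EPROTO):

```c
#include <assert.h>

/*
 * Sequential replay of the interleaving described above (all names
 * invented; -71 stands in for -EPROTO): write N+1 checks werr before
 * the completion callback of write N has stored its error, so the
 * error only surfaces at write N+2.
 */
static int werr;        /* the error slot each write() checks and clears */

static int check_and_clear(void)
{
        int e = werr;

        werr = 0;
        return e;
}

static int interleaved_demo(void)
{
        int seen;

        werr = 0;
        seen = check_and_clear();       /* write N+1 checks first ... */
        werr = -71;                     /* ... then write N's callback lands */
        assert(seen == 0);              /* so write N+1 missed the error */
        return check_and_clear();       /* only write N+2 observes it */
}
```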

> 
>> a userspace process reaches wdm_flush() due to being killed by SIGKILL (rather than
>> via calling close() syscall). Then, that userspace process will never receive the error
> 
> If you are killed by a signal you are in a race condition
> anyway. It cannot be handled.

What is the purpose of sending the error to the userspace process via write() or close()?
Isn't the purpose to allow userspace process to do something (e.g. print error messages,
retry the write() request with same argument) ? If close() returned an error, it might be
too late to retry the write() request with same argument.

> 
>> fetched from wdm_flush(). Also, if that userspace process is killed by the OOM killer,
>> being able to terminate and release resources as soon as possible is more preferable
>> than try to wait for response.
> 
> Right, so should we just proceed in case of a dying task? How do we
> do that?

We need to avoid waiting at wdm_flush() if killed.

And I think that wdm_flush() is a strange interface for reporting the error.
Oliver Neukum Sept. 17, 2020, 9:50 a.m. UTC | #6
Am Mittwoch, den 16.09.2020, 20:14 +0900 schrieb Tetsuo Handa:
> On 2020/09/16 19:18, Oliver Neukum wrote:
> > Am Dienstag, den 15.09.2020, 19:30 +0900 schrieb Tetsuo Handa:
> > > On 2020/09/15 18:14, Oliver Neukum wrote
> > > > Is there something we can do in flush()?
> > > 
> > > I consider that wdm_flush() is a wrong place to return an error. It is possible that
> > 
> > I am afraid that is a basic problem we need to resolve. As I understand
> >  it, flush() as a method exists precisely to report errors. Otherwise
> > you could implement it in release(). But this is not called for every
> > close().
> 
> I think fsync() or ioctl() is a better method for reporting errors.

Very well, I am implementing fsync(). However, I must say that the very
existence of fsync() tells us that write() is not expected to push
data out to devices and hence report results.

> Whether N'th write() request succeeded remains unknown until (N+1)'th
> write() request or close() request is issued? That sounds a strange design.

Welcome to the world of Unix. This is necessary if you want to block
as rarely as possible.

> If there is nothing more to write(), how userspace process knows whether
> N'th write() request succeeded? Wait for writability using poll() ?

Technically fsync(). poll() will tell you that you can write again.
That is a related but different concept.

> What is the purpose of sending the error to the userspace process via write() or close()?

Yes. However to do so, user space must be running. That the death
of a process interferes with error handling is independent from that.

> And I think that wdm_flush() is a strange interface for reporting the error.

Well, POSIX is what it is. The close() syscall is supposed to return
errors. Hence flush() must report them.

	Regards
		Oliver
Tetsuo Handa Sept. 17, 2020, 11:24 a.m. UTC | #7
On 2020/09/17 18:50, Oliver Neukum wrote:
> Am Mittwoch, den 16.09.2020, 20:14 +0900 schrieb Tetsuo Handa:
>> On 2020/09/16 19:18, Oliver Neukum wrote:
>>> Am Dienstag, den 15.09.2020, 19:30 +0900 schrieb Tetsuo Handa:
>>>> On 2020/09/15 18:14, Oliver Neukum wrote
>>>>> Is there something we can do in flush()?
>>>>
>>>> I consider that wdm_flush() is a wrong place to return an error. It is possible that
>>>
>>> I am afraid that is a basic problem we need to resolve. As I understand
>>>  it, flush() as a method exists precisely to report errors. Otherwise
>>> you could implement it in release(). But this is not called for every
>>> close().
>>
>> I think fsync() or ioctl() is a better method for reporting errors.
> 
> Very well, I am implementing fsync(). However, I must say that the very
> existence of fsync() tells us that write() is not expected to push
> data out to devices and hence report results.

If you ask userspace programs to be updated to call fsync(), we can ask
userspace programs to be updated to call ioctl().

I expected you to implement wdm_ioctl() for fetching last error code. Then,

> 
>> Whether N'th write() request succeeded remains unknown until (N+1)'th
>> write() request or close() request is issued? That sounds a strange design.
> 
> Welcome to the world of Unix. This is necessary if you want to block
> as rarely as possible.

we can apply my proposal (wait for the response at wdm_write() by default) as a baseline,
so as not to break existing userspace programs (except for latency), followed by a patch
which allows userspace programs to opt out of waiting for the response at wdm_write()
by calling that wdm_ioctl() (in order to recover the latency caused by waiting for the
response at wdm_write()).



>> What is the purpose of sending the error to the userspace process via write() or close()?
> 
> Yes. However to do so, user space must be running. That the death
> of a process interferes with error handling is independent from that.

Why do we need to send the error to the userspace process when that process was killed?
My question

  Isn't the purpose to allow userspace process to do something (e.g. print error messages,
  retry the write() request with same argument) ? If close() returned an error, it might be
  too late to retry the write() request with same argument.

which is unrelated to the userspace process being killed. My suggestion is that we can send
the error from wdm_write() (for synchronous mode) or wdm_ioctl() (for asynchronous mode).

> 
>> And I think that wdm_flush() is a strange interface for reporting the error.
> 
> Well, POSIX is what it is. The close() syscall is supposed to return
> errors. Hence flush() must report them.

If we check the error at wdm_write() or wdm_ioctl(), there is no error to report at
wdm_flush(). Therefore, we can remove wdm_flush() completely.



I can't read this series without squashing it into a single patch. Making changes which
will in the end be removed in [RFC 5/7] is sad. Please do [RFC 5/7] before [RFC 4/7].
Then, you won't need to make unneeded modifications. I'd like to see one cleanup
patch, one possible unsafe dereference fix patch, and one deadlock avoidance patch.



You did not answer to

  How do we guarantee that N'th write() request already set desc->werr before
  (N+1)'th next write() request is issued? If (N+1)'th write() request reached
  memdup_user() before desc->werr is set by callback of N'th write() request,
  (N+1)'th write() request will fail to report the error from N'th write() request.
  Yes, that error would be reported by (N+2)'th write() request, but the userspace
  process might have already discarded data needed for taking some actions (e.g.
  print error messages, retry the write() request with same argument).

. At least I think that

        spin_lock_irq(&desc->iuspin);
        we = desc->werr;
        desc->werr = 0;
        spin_unlock_irq(&desc->iuspin);
        if (we < 0)
                return usb_translate_errors(we);

in wdm_write() should be moved to after !test_bit(WDM_IN_USE, &desc->flags).



In [RFC 2/7], I think that

                /* in case flush() had timed out */
                usb_kill_urb(desc->command);

which is called only when desc->count == 0 in wdm_open() is pointless, for since
desc->count is incremented/decremented with wdm_mutex held, kill_urbs(desc) which
is called when desc->count == 0 in wdm_release() must have already called
usb_kill_urb(desc->command) before allowing wdm_open() to reach there.

In addition, is

        /* using write lock to protect desc->count */
        mutex_lock(&desc->wlock);

required? Isn't wdm_mutex that is actually protecting desc->count from modification?
If it is desc->wlock that is actually protecting desc->count, the !desc->count check
in wdm_release() and the desc->count == 1 check in wdm_open() have to be done with
desc->wlock held.



In [RFC 3/7], patch description says

  There is no need for flush() to be uninterruptible. close(2)
  is allowed to return -EAGAIN. Change it.

but the code does

	if (rv < 0)
		return -EINTR;

. Which error code do you want to use? (I still prefer removing wdm_flush() completely...)
Oliver Neukum Sept. 17, 2020, 2:17 p.m. UTC | #8
Am Donnerstag, den 17.09.2020, 20:24 +0900 schrieb Tetsuo Handa:
> On 2020/09/17 18:50, Oliver Neukum wrote:
> > Am Mittwoch, den 16.09.2020, 20:14 +0900 schrieb Tetsuo Handa:
> 
> If you ask userspace programs to be updated to call fsync(), we can ask
> userspace programs be updated to call ioctl().
> 
> I expected you to implement wdm_ioctl() for fetching last error code. Then,

Again, we are not redefining APIs. The APIs for character devices are
well defined by POSIX. Please see the man pages. Introducing a whole
new ioctl() is out of the question.

The API and its semantics are clear. Write schedules a write:

       A successful return from write() does not make any guarantee that data has
       been committed to disk. On some filesystems, including NFS, it does not even
       guarantee that space has successfully been reserved for the data. In this
       case, some errors might be delayed until a future write(2), fsync(2), or
       even close(2). The only way to be sure is to call fsync(2) after you are
       done writing all your data.

Fsync flushes data:

       fsync() transfers ("flushes") all modified in-core data of (i.e., modified
       buffer cache pages for) the file referred to by the file descriptor fd to
       the disk device (or other permanent storage device) so that all changed
       information can be retrieved even if the system crashes or is rebooted.
       This includes writing through or flushing a disk cache if present. The call
       blocks until the device reports that the transfer has completed.

If user space does not call fsync(), the error is supposed to be reported
by the next write() and if there is no next write(), close() shall report it.
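
From the user's side, the contract quoted above translates into the following pattern, sketched here against an ordinary file rather than a cdc-wdm device node (the helper name and the test path are invented):

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * The userspace side of the write()/fsync()/close() contract quoted
 * above, sketched against an ordinary file rather than a cdc-wdm node
 * (write_all_checked is an invented name): check write(), then fsync()
 * to force out any deferred error, and still check close() itself.
 */
static int write_all_checked(const char *path, const void *buf, size_t len)
{
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

        if (fd < 0)
                return -1;
        if (write(fd, buf, len) != (ssize_t)len || /* errors may be deferred */
            fsync(fd) != 0) {                      /* fsync() flushes them out */
                close(fd);
                return -1;
        }
        return close(fd); /* close() may still report a deferred error */
}
```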

> we can apply my proposal (wait for response at wdm_write() by default) as a baseline
> for not to break existing userspace programs (except latency), followed by a patch
> which allows userspace programs to use that wdm_ioctl() in order not to wait for
> response at wdm_write(), which is enabled by calling wdm_ioctl() (in order to
> recover latency caused by waiting for response at wdm_write()).

I am sorry, but the API is defined by POSIX.

> > > What is the purpose of sending the error to the userspace process via write() or close()?
> > 
> > Yes. However to do so, user space must be running. That the death
> > of a process interferes with error handling is independent from that.
> 
> Why need to send the error to the userspace process when that process was killed?
> My question
> 
>   Isn't the purpose to allow userspace process to do something (e.g. print error messages,
>   retry the write() request with same argument) ? If close() returned an error, it might be
>   too late to retry the write() request with same argument.

Yes. Technically you need to use fsync(). Hence I implemented it.

> If we check the error at wdm_write() or wdm_ioctl(), there is no error to report at
> wdm_flush(). Therefore, we can remove wdm_flush() completely.

Again, the API is defined by POSIX. If user space calls write() and
then close(), close() must report the error.

> I can't read this series without squashing into single patch. Making changes which
> will be after all removed in [RFC 5/7] is sad. Please do [RFC 5/7] before [RFC 4/7].

Done.

> Then, you won't need to make unneeded modifications. I'd like to see one cleanup
> patch, one possible unsafe dereference fix patch, and one deadlock avoidance patch.

This needs to partially go into stable. Hence fixes must come first.

> You did not answer to
> 
>   How do we guarantee that N'th write() request already set desc->werr before
>   (N+1)'th next write() request is issued? If (N+1)'th write() request reached
>   memdup_user() before desc->werr is set by callback of N'th write() request,
>   (N+1)'th write() request will fail to report the error from N'th write() request.
>   Yes, that error would be reported by (N+2)'th write() request, but the userspace
>   process might have already discarded data needed for taking some actions (e.g.
>   print error messages, retry the write() request with same argument).

We don't, nor do we have to. You are right, error reporting can be
improved. I implemented fsync() to do so.

> . At least I think that
> 
>         spin_lock_irq(&desc->iuspin);
>         we = desc->werr;
>         desc->werr = 0;
>         spin_unlock_irq(&desc->iuspin);
>         if (we < 0)
>                 return usb_translate_errors(we);
> 
> in wdm_write() should be moved to after !test_bit(WDM_IN_USE, &desc->flags).

Why?

> In [RFC 2/7], I think that
> 
>                 /* in case flush() had timed out */
>                 usb_kill_urb(desc->command);
> 
> which is called only when desc->count == 0 in wdm_open() is pointless, for since
> desc->count is incremented/decremented with wdm_mutex held, kill_urbs(desc) which
> is called when desc->count == 0 in wdm_release() must have already called
> usb_kill_urb(desc->command) before allowing wdm_open() to reach there.

You are right. I am going to remove it.

> In addition, is
> 
>         /* using write lock to protect desc->count */
>         mutex_lock(&desc->wlock);
> 
> required? Isn't wdm_mutex that is actually protecting desc->count from modification?
> If it is desc->wlock that is actually protecting desc->count, the !desc->count check
> in wdm_release() and the desc->count == 1 check in wdm_open() have to be done with
> desc->wlock held.

Correct. So should wdm_mutex be dropped earlier?

> In [RFC 3/7], patch description says
> 
>   There is no need for flush() to be uninterruptible. close(2)
>   is allowed to return -EAGAIN. Change it.
> 
> but the code does
> 
> 	if (rv < 0)
> 		return -EINTR;
> 
> . Which error code do you want to use? (I still prefer removing wdm_flush() completely...)

-EINTR. Sorry. I shall change the description.

	Regards
		Oliver
Tetsuo Handa Sept. 17, 2020, 4:17 p.m. UTC | #9
On 2020/09/17 23:17, Oliver Neukum wrote:
> The API and its semantics are clear. Write schedules a write:
> 
>        A  successful  return  from  write() does not make any guarantee that data has been committed to disk.  On some filesystems, including NFS, it does not even guarantee that space has successfully been reserved for the data.  In this case, some errors might be
>        delayed until a future write(2), fsync(2), or even close(2).  The only way to be sure is to call fsync(2) after you are done writing all your data.

But I think that this leaves room for allowing write() to imply fsync()
(i.e. write() is allowed to wait for data to be committed to disk).

> 
> Fsync flushes data:
> 
>        fsync()  transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even if
>        the system crashes or is rebooted.  This includes writing through or flushing a disk cache if present.  The call blocks until the device reports that the transfer has completed.
> 
> If user space does not call fsync(), the error is supposed to be reported
> by the next write() and if there is no next write(), close() shall report it.

Where does "the next" (and not "the next after the next") write() come from?

>> You did not answer to
>>
>>   How do we guarantee that N'th write() request already set desc->werr before
>>   (N+1)'th next write() request is issued? If (N+1)'th write() request reached
>>   memdup_user() before desc->werr is set by callback of N'th write() request,
>>   (N+1)'th write() request will fail to report the error from N'th write() request.
>>   Yes, that error would be reported by (N+2)'th write() request, but the userspace
>>   process might have already discarded data needed for taking some actions (e.g.
>>   print error messages, retry the write() request with same argument).
> 
> We don't, nor do we have to. You are right, error reporting can be
> improved. I implemented fsync() to do so.

You are saying that if user space does not call fsync(), the error is allowed to be
reported by the next after the next (in other words, (N+2)'th) write() ?

> 
>> . At least I think that
>>
>>         spin_lock_irq(&desc->iuspin);
>>         we = desc->werr;
>>         desc->werr = 0;
>>         spin_unlock_irq(&desc->iuspin);
>>         if (we < 0)
>>                 return usb_translate_errors(we);
>>
>> in wdm_write() should be moved to after !test_bit(WDM_IN_USE, &desc->flags).
> 
> Why?

Otherwise, we can't make sure (N+1)'th write() will report error from N'th write().

Since I don't know the characteristics of data passed via wdm_write() (I guess that
the data is some stateful controlling commands rather than meaningless byte stream),
I guess that (N+1)'th wdm_write() attempt should be made only after confirming that
N'th wdm_write() attempt received wdm_callback() response. To preserve state / data
used by N'th wdm_write() attempt, reporting the error from too late write() attempt
would be useless.



>> In addition, is
>>
>>         /* using write lock to protect desc->count */
>>         mutex_lock(&desc->wlock);
>>
>> required? Isn't wdm_mutex that is actually protecting desc->count from modification?
>> If it is desc->wlock that is actually protecting desc->count, the !desc->count check
>> in wdm_release() and the desc->count == 1 check in wdm_open() have to be done with
>> desc->wlock held.
> 
> Correct. So should wdm_mutex be dropped earlier?

If recover_from_urb_loss() can tolerate stale desc->count value, wdm_mutex already
protects desc->count. I don't know how this module works. I don't know whether
wdm_mutex and/or desc->wlock is held when recover_from_urb_loss() is called from
wdm_resume(). It seems that desc->wlock is held but wdm_mutex is not held when
recover_from_urb_loss() is called from wdm_post_reset().



By the way, after the fixes, we could replace

  spin_lock_irq(&desc->iuspin);
  rv = desc->werr;
  desc->werr = 0;
  spin_unlock_irq(&desc->iuspin);

with

  rv = xchg(&desc->werr, 0);

and avoid spin_lock_irq()/spin_unlock_irq() because there are many
locations which need to check and clear the error...
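
A userspace analogue of that replacement, using C11 atomics (names invented for this sketch): atomic_exchange() is sequentially consistent by default, and the kernel's xchg() is likewise documented as fully ordered, which bears on the memory-ordering question raised in reply.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Userspace analogue of the proposed xchg(&desc->werr, 0), using C11
 * atomics (all names invented for this sketch).
 */
static _Atomic int werr;

/* stands in for the URB completion callback recording an error */
static void record_err(int e)
{
        atomic_store(&werr, e);
}

/* atomically fetch the pending error and clear it in one step */
static int fetch_and_clear_err(void)
{
        return atomic_exchange(&werr, 0);
}
```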
Oliver Neukum Sept. 21, 2020, 10:52 a.m. UTC | #10
Am Freitag, den 18.09.2020, 01:17 +0900 schrieb Tetsuo Handa:
> On 2020/09/17 23:17, Oliver Neukum wrote:
> > The API and its semantics are clear. Write schedules a write:
> > 
> >        A  successful  return  from  write() does not make any guarantee that data has been committed to disk.  On some filesystems, including NFS, it does not even guarantee that space has successfully been reserved for the data.  In this case, some errors might be
> >        delayed until a future write(2), fsync(2), or even close(2).  The only way to be sure is to call fsync(2) after you are done writing all your data.
> 
> But I think that this leaves room for allowing write() to imply fsync()
> (i.e. write() is allowed to wait for data to be committed to disk).

That would be inferior and very bad for the non-blocking case.

> > If user space does not call fsync(), the error is supposed to be reported
> > by the next write() and if there is no next write(), close() shall report it.
> 
> Where does "the next" (and not "the next after the next") write() come from?

We would indeed be on spec. However, we perform best if we return an
error as soon as possible.

> You are saying that if user space does not call fsync(), the error is allowed to be
> reported by the next after the next (in other words, (N+2)'th) write() ?

Yes. The man page is clear on that.

> > > . At least I think that
> > > 
> > >         spin_lock_irq(&desc->iuspin);
> > >         we = desc->werr;
> > >         desc->werr = 0;
> > >         spin_unlock_irq(&desc->iuspin);
> > >         if (we < 0)
> > >                 return usb_translate_errors(we);
> > > 
> > > in wdm_write() should be moved to after !test_bit(WDM_IN_USE, &desc->flags).
> > 
> > Why?
> 
> Otherwise, we can't make sure (N+1)'th write() will report error from N'th write().

We should move the test for reporting errors later, so that it is sure
to be carried out? I am afraid I cannot follow that logic.

> Since I don't know the characteristics of data passed via wdm_write() (I guess that
> the data is some stateful controlling commands rather than meaningless byte stream),
> I guess that (N+1)'th wdm_write() attempt should be made only after confirming that
> N'th wdm_write() attempt received wdm_callback() response. To preserve state / data
> used by N'th wdm_write() attempt, reporting the error from too late write() attempt
> would be useless.

We cannot make assumptions on how user space uses the driver. Somebody
manually connecting and typing in commands letter by letter must also
work.

We can optimize for the common case, but we must operate according to
the specs.

> > > In addition, is
> > > 
> > >         /* using write lock to protect desc->count */
> > >         mutex_lock(&desc->wlock);
> > > 
> > > required? Isn't wdm_mutex that is actually protecting desc->count from modification?
> > > If it is desc->wlock that is actually protecting desc->count, the !desc->count check
> > > in wdm_release() and the desc->count == 1 check in wdm_open() have to be done with
> > > desc->wlock held.
> > 
> > Correct. So should wdm_mutex be dropped earlier?
> 
> If recover_from_urb_loss() can tolerate stale desc->count value, wdm_mutex already

It cannot.

> protects desc->count. I don't know how this module works. I don't know whether
> wdm_mutex and/or desc->wlock is held when recover_from_urb_loss() is called from
> wdm_resume(). It seems that desc->wlock is held but wdm_mutex is not held when
> recover_from_urb_loss() is called from wdm_post_reset().

Indeed.

> By the way, after the fixes, we could replace
> 
>   spin_lock_irq(&desc->iuspin);
>   rv = desc->werr;
>   desc->werr = 0;
>   spin_unlock_irq(&desc->iuspin);
> 
> with
> 
>   rv = xchg(&desc->werr, 0);
> 
> and avoid spin_lock_irq()/spin_unlock_irq() because there are many
> locations which need to check and clear the error...

Have you checked whether this has implications on memory ordering?

	Regards
		Oliver
Tetsuo Handa Sept. 22, 2020, 1:56 a.m. UTC | #11
On 2020/09/21 19:52, Oliver Neukum wrote:
>>> If user space does not call fsync(), the error is supposed to be reported
>>> by the next write() and if there is no next write(), close() shall report it.
>>
>> Where does "the next" (and not "the next after the next") write() come from?
> 
> We would indeed be on spec. However, we perform best if we return an
> error as soon as possible.
> 
>> You are saying that if user space does not call fsync(), the error is allowed to be
>> reported by the next after the next (in other words, (N+2)'th) write() ?
> 
> Yes. The man page is clear on that.

To understand it, I must understand why it is safe to defer error reporting.

> 
>>>> . At least I think that
>>>>
>>>>         spin_lock_irq(&desc->iuspin);
>>>>         we = desc->werr;
>>>>         desc->werr = 0;
>>>>         spin_unlock_irq(&desc->iuspin);
>>>>         if (we < 0)
>>>>                 return usb_translate_errors(we);
>>>>
>>>> in wdm_write() should be moved to after !test_bit(WDM_IN_USE, &desc->flags).
>>>
>>> Why?
>>
>> Otherwise, we can't make sure (N+1)'th write() will report error from N'th write().
> 
> We should move the test for reporting errors later, so that it is sure
> to be carried out? I am afraid I cannot follow that logic.
> 
>> Since I don't know the characteristics of data passed via wdm_write() (I guess that
>> the data is some stateful controlling commands rather than meaningless byte stream),
>> I guess that (N+1)'th wdm_write() attempt should be made only after confirming that
>> N'th wdm_write() attempt received wdm_callback() response. To preserve state / data
>> used by N'th wdm_write() attempt, reporting the error from too late write() attempt
>> would be useless.
> 
> We cannot make assumptions on how user space uses the driver. Somebody
> manually connecting and typing in commands letter by letter must also
> work.

I'm not making any assumptions on how user space uses the driver. So far,
your explanation makes me think that it is you who are making assumptions
about how user space uses the cdc-wdm driver.

I'm querying you about the characteristics of the data passed to wdm_write().
Without knowing the difference between writing to the cdc-wdm driver and to a normal file on
some filesystem, I can't judge whether it is acceptable to defer reporting errors.
When an error occurs while writing to a normal file on some filesystem, userspace
would simply treat that file as broken (e.g. delete the file). The userspace of this
cdc-wdm driver might want any error to be reported immediately, for I'm thinking that
each piece of data passed to wdm_write() is stateful, for

  /* CDC-WMC r1.1 requires wMaxCommand to be "at least 256 decimal (0x100)" */
  #define WDM_DEFAULT_BUFSIZE     256

and

        if (count > desc->wMaxCommand)
                count = desc->wMaxCommand;

makes me think that wdm_write() has to be able to handle a chunk of data in an
atomic manner.

> 
> We can optimize for the common case, but we must operate according to
> the specs.
>>
>>
>>
>>>> In addition, is
>>>>
>>>>         /* using write lock to protect desc->count */
>>>>         mutex_lock(&desc->wlock);
>>>>
>>>> required? Isn't wdm_mutex that is actually protecting desc->count from modification?
>>>> If it is desc->wlock that is actually protecting desc->count, the !desc->count check
>>>> in wdm_release() and the desc->count == 1 check in wdm_open() have to be done with
>>>> desc->wlock held.
>>>
>>> Correct. So should wdm_mutex be dropped earlier?
>>
>> If recover_from_urb_loss() can tolerate stale desc->count value, wdm_mutex already
> 
> It cannot.
> 
>> protects desc->count. I don't know how this module works. I don't know whether
>> wdm_mutex and/or desc->wlock is held when recover_from_urb_loss() is called from
>> wdm_resume(). It seems that desc->wlock is held but wdm_mutex is not held when
>> recover_from_urb_loss() is called from wdm_post_reset().
> 
> Indeed.

I don't like [RFC 8/8]. Please drop [RFC 8/8], because it is not only unrelated to
the hang-up problem syzbot is reporting, but the lock dependency is also still unclear.
The /* using write lock to protect desc->count */ comment is hardly helpful
because wdm_disconnect() accesses desc->count with only wdm_mutex held.
A detailed explanation of why releasing wdm_mutex early in wdm_open() is safe
with regard to e.g. wdm_disconnect()/recover_from_urb_loss()/wdm_release() will
be required for [RFC 8/8].

Also, the pointless

+               /* in case flush() had timed out */
+               usb_kill_urb(desc->command);

is still in [RFC 2/8].

Please squash these into one patch, and add a detailed/polite patch description like
https://lkml.kernel.org/r/1597188375-4787-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp .
The current series is too fragmented to understand, and nobody can review such a series.
Oliver Neukum Sept. 22, 2020, 7:33 a.m. UTC | #12
On Tuesday, 2020-09-22 at 10:56 +0900, Tetsuo Handa wrote:
> On 2020/09/21 19:52, Oliver Neukum wrote:
> 
> To understand it, I must understand why it is safe to defer error reporting.

It is not. There is nothing to understand here. If user space needs
a guarantee that data has been pushed out without an error, it will
have to call fsync().

Let me explain.
POSIX, as described in the man page, sets some rules about what
a driver must do and must not do. In other areas it makes
recommendations a good driver should follow. One of them is that
we block as little as possible, at the risk of delaying error
reporting. A good driver will report errors as soon as possible
while being compatible with doing IO asynchronously.

> I'm querying you about characteristics of data passed to wdm_write().
> Without knowing the difference between writing to cdc-wdm driver and normal file on
> some filesystem, I can't judge whether it is acceptable to defer reporting errors.

That is simply not a decision you or I make. The man page clearly
says that it is acceptable. If user space does not like that, it must
call fsync() after write().

> When an error occurred when writing to normal file on some filesystem, the userspace
> would simply treat that file as broken (e.g. delete such file). The userspace of this
> cdc-wdm driver might want that any error is immediately reported, for I'm thinking that
> each data passed to wdm_write() is stateful, for
> 
>   /* CDC-WMC r1.1 requires wMaxCommand to be "at least 256 decimal (0x100)" */
>   #define WDM_DEFAULT_BUFSIZE     256
> 
> and
> 
>         if (count > desc->wMaxCommand)
>                 count = desc->wMaxCommand;
> 
> makes me think that wdm_write() has to be able to handle chunk of data in atomic
> manner.

No. The WDM spec says the clear opposite, and the man page of write() says
so, too.

> I don't like [RFC 8/8]. Please drop [RFC 8/8] because not only it is unrelated to
> hang up problem syzbot is reporting but also lock dependency is still unclear.

Very well. I will make a patch set that clearly fixes bugs and a
secondary one with optimizations.

	Regards
		Oliver
Tetsuo Handa Sept. 22, 2020, 8:34 a.m. UTC | #13
On 2020/09/22 16:33, Oliver Neukum wrote:
> On Tuesday, 2020-09-22 at 10:56 +0900, Tetsuo Handa wrote:
>> On 2020/09/21 19:52, Oliver Neukum wrote:
>> To understand it, I must understand why it is safe to defer error reporting.
> 
> It is not. There is nothing to understand here. If user space needs
> a guarantee that data has been pushed out without an error, it will
> have to call fsync().

>> I'm querying you about characteristics of data passed to wdm_write().
>> Without knowing the difference between writing to cdc-wdm driver and normal file on
>> some filesystem, I can't judge whether it is acceptable to defer reporting errors.
> 
> That is simply not a decision you or I make. The man page clearly
> says that it is acceptable. If user space does not like that, it must
> call fsync() after write().

Then the fact that the cdc-wdm driver did not implement fsync() was a big problem.
Userspace needs to be careful not to give up upon -EINVAL when running on older
kernels that did not implement wdm_fsync().

The remaining concern would be how to handle unresponsive hardware, for the blocking
wdm_write()/wdm_read() and wdm_fsync() are using wait_event_interruptible(). If
the caller does not have a means to send a signal, the caller might hang forever
when the hardware stops responding. Please add a comment that userspace needs to
be careful when calling these functions.
Oliver Neukum Sept. 22, 2020, 9:45 a.m. UTC | #14
On Tuesday, 2020-09-22 at 17:34 +0900, Tetsuo Handa wrote:
> On 2020/09/22 16:33, Oliver Neukum wrote:
> > On Tuesday, 2020-09-22 at 10:56 +0900, Tetsuo Handa wrote:
> > > On 2020/09/21 19:52, Oliver Neukum wrote:
> > > To understand it, I must understand why it is safe to defer error reporting.
> > 
> > It is not. There is nothing to understand here. If user space needs
> > a guarantee that data has been pushed out without an error, it will
> > have to call fsync()
> > > I'm querying you about characteristics of data passed to wdm_write().
> > > Without knowing the difference between writing to cdc-wdm driver and normal file on
> > > some filesystem, I can't judge whether it is acceptable to defer reporting errors.
> > 
> > That is simply not a decision you or I make. The man page clearly
> > says that it is acceptable. If user space does not like that, it must
> > call fsync() after write().
> 
> Then the fact that the cdc-wdm driver did not implement fsync() was a big problem.
> Userspace needs to be careful not to give up upon -EINVAL when running on older
> kernels that did not implement wdm_fsync().

Very well. So I'll call the lack of fsync() a bug, which should be
fixed in stable.

> The remaining concern would be how to handle unresponsive hardware, for the blocking
> wdm_write()/wdm_read() and wdm_fsync() are using wait_event_interruptible(). If
> the caller does not have a means to send a signal, the caller might hang forever
> when the hardware stops responding. Please add a comment that userspace needs to
> be careful when calling these functions.

wdm_flush() has such a comment. Yet no driver can guarantee that
a device will make progress with I/O. The driver must, however, provide
a means of dealing with such cases. Usually that means handling
signals. Those are the normal semantics of a write() syscall.

I believe we are covered on that.

	Regards
		Oliver