diff mbox series

[1/1] fs: pipe: wakeup readers everytime new data written is to pipe

Message ID 20210729222635.2937453-2-sspatil@android.com
State New
Headers show
Series Revert change in pipe reader wakeup behavior | expand

Commit Message

Sandeep Patil July 29, 2021, 10:26 p.m. UTC
commit '1b6b26ae7053 ("pipe: fix and clarify pipe write wakeup logic")'
changed pipe_write() to wakeup readers only if the pipe was empty.
Prior to this change, threads waiting in epoll_wait(EPOLLET | EPOLLIN)
on non-empty pipes would get woken up on new data.

It meant an applications that,
   1. used pipe + epoll for notifications between threads / processes
   2. Didn't drain the pipe on each epoll wakeup unless the pipe was full
started to experience hang / timeouts in threads stuck in epoll_wait()

So restore the old behavior to wakeup all readers if any new data is
written to the pipe.

Fixes: 1b6b26ae7053 ("pipe: fix and clarify pipe write wakeup logic")
Signed-off-by: Sandeep Patil <sspatil@android.com>
---
 fs/pipe.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

Comments

Linus Torvalds July 29, 2021, 11:01 p.m. UTC | #1
On Thu, Jul 29, 2021 at 3:27 PM Sandeep Patil <sspatil@android.com> wrote:
>
> So restore the old behavior to wakeup all readers if any new data is
> written to the pipe.

Ah-hahh.

I've had this slightly smaller patch waiting for the better part of a year:

  https://lore.kernel.org/lkml/CAHk-=wgjR7Nd4CyDoi3SH9kPJp_Td9S-hhFJZMqvp6GS1Ww8eg@mail.gmail.com/

waiting to see if some broken program actually depends on the bogus
epollet semantics.

Can you verify that that patch fixes the realm-core brokenness too?

             Linus
Sandeep Patil July 30, 2021, 7:11 p.m. UTC | #2
On 7/29/21 11:01 PM, Linus Torvalds wrote:
> On Thu, Jul 29, 2021 at 3:27 PM Sandeep Patil <sspatil@android.com> wrote:

>>

>> So restore the old behavior to wakeup all readers if any new data is

>> written to the pipe.

> 

> Ah-hahh.

> 

> I've had this slightly smaller patch waiting for the better part of a year:

> 

>    https://lore.kernel.org/lkml/CAHk-=wgjR7Nd4CyDoi3SH9kPJp_Td9S-hhFJZMqvp6GS1Ww8eg@mail.gmail.com/

> 

> waiting to see if some broken program actually depends on the bogus

> epollet semantics.

> 

> Can you verify that that patch fixes the realm-core brokenness too?


Yes, your patch fixes all apps on Android I can test that include this 
library.

fwiw, the library seems to have been fixed. However, I am not sure
how long it will be for all apps to take that update :(.

- ssp
Linus Torvalds July 30, 2021, 7:23 p.m. UTC | #3
On Fri, Jul 30, 2021 at 12:11 PM Sandeep Patil <sspatil@android.com> wrote:
>

> Yes, your patch fixes all apps on Android I can test that include this

> library.


Ok, thanks for checking.

> fwiw, the library seems to have been fixed. However, I am not sure

> how long it will be for all apps to take that update :(.


I wonder if I could make the wakeup logic do this only for the epollet case.

I'll have to think about it, but maybe I'll just apply that simple
patch. I dislike the pointless wakeups, and as long as the only case I
knew of was only a test of broken behavior, it was fine. But now that
you've reported actual application breakage, this is in the "real
regression" category, and so I'll fix it one way or the other.

And on the other hand I also have a slight preference towards your
patch simply because you did the work of finding this out, so you
should get the credit.

I'll mull it over a bit more, but whatever I'll do I'll do before rc4
and mark it for stable.

Thanks for testing,

                 Linus
Sandeep Patil July 30, 2021, 7:47 p.m. UTC | #4
On 7/30/21 7:23 PM, Linus Torvalds wrote:
> On Fri, Jul 30, 2021 at 12:11 PM Sandeep Patil <sspatil@android.com> wrote:

>>

>> Yes, your patch fixes all apps on Android I can test that include this

>> library.

> 

> Ok, thanks for checking.

> 

>> fwiw, the library seems to have been fixed. However, I am not sure

>> how long it will be for all apps to take that update :(.

> 

> I wonder if I could make the wakeup logic do this only for the epollet case.


aren't we supposed to wakeup on each write in level-triggered (default) 
case though?

> 

> I'll have to think about it, but maybe I'll just apply that simple

> patch. I dislike the pointless wakeups, and as long as the only case I

> knew of was only a test of broken behavior, it was fine. But now that

> you've reported actual application breakage, this is in the "real

> regression" category, and so I'll fix it one way or the other.

> 

> And on the other hand I also have a slight preference towards your

> patch simply because you did the work of finding this out, so you

> should get the credit.


Ha, I can't really claim credit here. This was also reported to us
in Android that triggered the search. Plus, now that I see your thread 
with Michael Kerrisk, he was way ahead of us in finding this out.

> 

> I'll mull it over a bit more, but whatever I'll do I'll do before rc4

> and mark it for stable.


Thanks, I was actually going to suggest taking your patch cause it also 
  makes changes in pipe_read(). I am not sure if there are apps that do 
EPOLLET | EPOLLOUT (can't think of a reason why).

- ssp

> 

> Thanks for testing,

> 

>                   Linus

>
Linus Torvalds July 30, 2021, 10:06 p.m. UTC | #5
On Fri, Jul 30, 2021 at 12:47 PM Sandeep Patil <sspatil@android.com> wrote:
>

> aren't we supposed to wakeup on each write in level-triggered (default)

> case though?


No.

The thing about level triggered is that if the condition was already
true, it would not need a wakeup in the first place.

Put another way: select() and poll() are both fundamentally
level-triggered. If the condition was already true, they will return
success immediately, and don't need any extraneous wakeups.

This is literally an epoll() confusion about what an "edge" is.

An edge is not "somebody wrote more data". An edge is "there was no
data, now there is data".

And a level triggered event is *also* not "somebody wrote more data".
A level-triggered signal is simply "there is data".

Notice how neither edge nor level are about "more data". One is about
the edge of "no data" -> "some data", and the other is just a "data is
available".

Sadly, it seems that our old "we'll wake things up whether needed or
not" implementation ended up being something that people thought was
edge-triggered semantics.

But we have the policy that regressions aren't about documentation or
even sane behavior.

Regressions are about whether a user application broke in a noticeable way.

                     Linus
Linus Torvalds July 30, 2021, 10:53 p.m. UTC | #6
On Fri, Jul 30, 2021 at 12:23 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>

> I'll mull it over a bit more, but whatever I'll do I'll do before rc4

> and mark it for stable.


Ok, I ended up committing the minimal possible change (and fixing up
the comment above it).

It's very much *not* the original behavior either, but that original
behavior was truly insane ("wake up for each hunk written"), and I'm
trying to at least keep the kernel code from doing actively stupid
things.

Since that old patch of mine worked for your test-case, then clearly
that realm-core library didn't rely on _that_ kind of insane internal
kernel implementation details exposed as semantics. So The minimal
patch basically says "each write() system call wil do at least one
wake-up, whether really necessary or not".

I also intentionally kept the read side untouched, in that there
apparently still isn't a case that would need the confused semantics
for read events.

End result: the commit message is a lot bigger than the patch, with
most of it being trying to explain the background.

I've pushed it out as commit 3a34b13a88ca ("pipe: make pipe writes
always wake up readers"). Holler if you notice anything odd remaining.

              Linus
Linus Torvalds July 30, 2021, 10:55 p.m. UTC | #7
On Fri, Jul 30, 2021 at 3:53 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>

> I've pushed it out as commit 3a34b13a88ca ("pipe: make pipe writes

> always wake up readers"). Holler if you notice anything odd remaining.


.. and as I wrote that, I realized that I had forgotten to mark it for
stable even though that was the intent.

So stable team, please consider this the "that commit should probably
be added to your queue" notification,

             Linus
Greg Kroah-Hartman July 31, 2021, 5:32 a.m. UTC | #8
On Fri, Jul 30, 2021 at 03:55:17PM -0700, Linus Torvalds wrote:
> On Fri, Jul 30, 2021 at 3:53 PM Linus Torvalds

> <torvalds@linux-foundation.org> wrote:

> >

> > I've pushed it out as commit 3a34b13a88ca ("pipe: make pipe writes

> > always wake up readers"). Holler if you notice anything odd remaining.

> 

> .. and as I wrote that, I realized that I had forgotten to mark it for

> stable even though that was the intent.

> 

> So stable team, please consider this the "that commit should probably

> be added to your queue" notification,


Will do so, thanks!

greg k-h
Sandeep Patil Aug. 2, 2021, 6:59 p.m. UTC | #9
On 7/30/21 10:53 PM, Linus Torvalds wrote:
> On Fri, Jul 30, 2021 at 12:23 PM Linus Torvalds

> <torvalds@linux-foundation.org> wrote:

>>

>> I'll mull it over a bit more, but whatever I'll do I'll do before rc4

>> and mark it for stable.

> 

> Ok, I ended up committing the minimal possible change (and fixing up

> the comment above it).

> 

> It's very much *not* the original behavior either, but that original

> behavior was truly insane ("wake up for each hunk written"), and I'm

> trying to at least keep the kernel code from doing actively stupid

> things.

> 

> Since that old patch of mine worked for your test-case, then clearly

> that realm-core library didn't rely on _that_ kind of insane internal

> kernel implementation details exposed as semantics. So The minimal

> patch basically says "each write() system call wil do at least one

> wake-up, whether really necessary or not".

> 

> I also intentionally kept the read side untouched, in that there

> apparently still isn't a case that would need the confused semantics

> for read events.

> 

> End result: the commit message is a lot bigger than the patch, with

> most of it being trying to explain the background.

> 

> I've pushed it out as commit 3a34b13a88ca ("pipe: make pipe writes

> always wake up readers"). Holler if you notice anything odd remaining.


Since what you merged isn't different than what I tested, I don't
expect any surprises but I will test it regardless. I will come back
if I see anything unexpected.

Thanks for the explanation about the default behavior earlier
in the thread.

- ssp
diff mbox series

Patch

diff --git a/fs/pipe.c b/fs/pipe.c
index bfd946a9ad01..dda22a316bb3 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -406,7 +406,7 @@  pipe_write(struct kiocb *iocb, struct iov_iter *from)
 	ssize_t ret = 0;
 	size_t total_len = iov_iter_count(from);
 	ssize_t chars;
-	bool was_empty = false;
+	bool do_wakeup = false;
 	bool wake_next_writer = false;
 
 	/* Null write succeeds. */
@@ -429,10 +429,11 @@  pipe_write(struct kiocb *iocb, struct iov_iter *from)
 #endif
 
 	/*
-	 * Only wake up if the pipe started out empty, since
-	 * otherwise there should be no readers waiting.
+	 * Wake up readers if the pipe was written to. Regardless
+	 * of whether it was empty or not. Otherwise, threads
+	 * waiting with EPOLLET will hang until the pipe is emptied.
 	 *
-	 * If it wasn't empty we try to merge new data into
+	 * If pipe wasn't empty we try to merge new data into
 	 * the last buffer.
 	 *
 	 * That naturally merges small writes, but it also
@@ -440,9 +441,8 @@  pipe_write(struct kiocb *iocb, struct iov_iter *from)
 	 * spanning multiple pages.
 	 */
 	head = pipe->head;
-	was_empty = pipe_empty(head, pipe->tail);
 	chars = total_len & (PAGE_SIZE-1);
-	if (chars && !was_empty) {
+	if (chars && !pipe_empty(head, pipe->tail)) {
 		unsigned int mask = pipe->ring_size - 1;
 		struct pipe_buffer *buf = &pipe->bufs[(head - 1) & mask];
 		int offset = buf->offset + buf->len;
@@ -460,6 +460,7 @@  pipe_write(struct kiocb *iocb, struct iov_iter *from)
 			}
 
 			buf->len += ret;
+			do_wakeup = true;
 			if (!iov_iter_count(from))
 				goto out;
 		}
@@ -526,6 +527,7 @@  pipe_write(struct kiocb *iocb, struct iov_iter *from)
 			ret += copied;
 			buf->offset = 0;
 			buf->len = copied;
+			do_wakeup = true;
 
 			if (!iov_iter_count(from))
 				break;
@@ -553,13 +555,12 @@  pipe_write(struct kiocb *iocb, struct iov_iter *from)
 		 * become empty while we dropped the lock.
 		 */
 		__pipe_unlock(pipe);
-		if (was_empty) {
+		if (do_wakeup) {
 			wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
 			kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
 		}
 		wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));
 		__pipe_lock(pipe);
-		was_empty = pipe_empty(pipe->head, pipe->tail);
 		wake_next_writer = true;
 	}
 out:
@@ -576,7 +577,7 @@  pipe_write(struct kiocb *iocb, struct iov_iter *from)
 	 * how (for example) the GNU make jobserver uses small writes to
 	 * wake up pending jobs
 	 */
-	if (was_empty) {
+	if (do_wakeup) {
 		wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
 		kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
 	}