mbox series

[v7,0/7] hw/char/pl011: Implement TX (async) FIFO to avoid blocking the main loop

Message ID 20250310012825.79614-1-philmd@linaro.org
Headers show
Series hw/char/pl011: Implement TX (async) FIFO to avoid blocking the main loop | expand

Message

Philippe Mathieu-Daudé March 10, 2025, 1:28 a.m. UTC
Hi,

This series add support for (async) FIFO on the transmit path
of the PL011 UART.

Since v6:
- Incorporated Peter's fixes in corresponding patches

Since v5:
- Rebased (few patches already merged)
- Do not forbid disabled UART/receiver (Peter)
- Use fifo8_peek API for wrapped buffer (Mark)

Since v4:
- Rebased (loopback)
- Addressed Richard & Juan migration comments
- Split in smaller patches

Since v3:
- Document migration bits (Alex, Richard)
- Just check FIFO is not empty in pl011_xmit_fifo_state_needed (rth)
- In pl011_xmit check TX enabled first, and ignore < 8-bit TX (rth)

Since v2:
- Added R-b tags
- Addressed Richard comments on migration

Since v1:
- Restrict pl011_ops[] impl access_size,
- Do not check transmitter is enabled (Peter),
- Addressed Alex's review comments,
- Simplified migration trying to care about backward compat,
  but still unsure...

Philippe Mathieu-Daudé (7):
  hw/char/pl011: Add transmit FIFO to PL011State
  hw/char/pl011: Introduce pl011_xmit()
  hw/char/pl011: Factor pl011_xmit_cb() out as GSource
  hw/char/pl011: Trace FIFO enablement
  hw/char/pl011: Consider TX FIFO overrun error
  hw/char/pl011: Drain TX FIFO when no backend connected
  hw/char/pl011: Implement TX FIFO

 include/hw/char/pl011.h |   2 +
 hw/char/pl011.c         | 150 +++++++++++++++++++++++++++++++++++++---
 hw/char/trace-events    |   8 +++
 3 files changed, 150 insertions(+), 10 deletions(-)

Comments

Peter Maydell March 10, 2025, 2:42 p.m. UTC | #1
On Mon, 10 Mar 2025 at 01:28, Philippe Mathieu-Daudé <philmd@linaro.org> wrote:
>
> Hi,
>
> This series add support for (async) FIFO on the transmit path
> of the PL011 UART.

This hasn't made the last pre-softfreeze arm pullreq, but
I think we can reasonably call "don't do blocking I/O"
enough of a bugfix for it to be ok to go in early in the
freeze cycle for rc0.

I've applied it to target-arm.next.

thanks
-- PMM
Peter Maydell March 10, 2025, 5:28 p.m. UTC | #2
On Mon, 10 Mar 2025 at 14:42, Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Mon, 10 Mar 2025 at 01:28, Philippe Mathieu-Daudé <philmd@linaro.org> wrote:
> >
> > Hi,
> >
> > This series add support for (async) FIFO on the transmit path
> > of the PL011 UART.
>
> This hasn't made the last pre-softfreeze arm pullreq, but
> I think we can reasonably call "don't do blocking I/O"
> enough of a bugfix for it to be ok to go in early in the
> freeze cycle for rc0.
>
> I've applied it to target-arm.next.

...but it still fails 'make check-functional', though in a
less easy-to-reproduce way than it did. The problem turns out
to be that when the guest kernel is doing its earlycon
output (which is by polling, not interrupt driven) the output
can be corrupted, which makes the aarch64/test_arm_virt test
fail to find the "Kernel command line:" output it is looking for.

This seems to be because the pl011 code and the chardev
code disagree about how "couldn't write anything" is
reported. pl011 here is looking for "0 means wrote nothing",
but the chardev code reports it as "-1 and errno is EAGAIN".

I think the chardev code is actually what we need to fix here,
because it makes basically no effort to guarantee that the
errno from the underlying write is still in 'errno' by the
time qemu_chr_fe_write() returns. In particular it may
call qemu_chr_write_log() or replay_char_write_event_save(),
both of which will happily trash errno if something fails
during their execution.

So my long-term preference for fixing this is:
 * fix up any callsites that can't handle a 0 return for
   "wrote no bytes"
 * make (and document) qemu_chr_fe_write()'s return value be
    - 0 == wrote no bytes
    - >0 == wrote some bytes
    - <0 == a negative-errno indicating a definite error


I had planned in the meantime that we could deal with
this by squashing in this change to the last patch in
this series:

--- a/hw/char/pl011.c
+++ b/hw/char/pl011.c
@@ -275,6 +275,9 @@ static gboolean pl011_xmit_cb(void *do_not_use,
GIOCondition cond, void *opaque)
     /* Transmit as much data as we can. */
     bytes_consumed = qemu_chr_fe_write(&s->chr, buf, count);
     trace_pl011_fifo_tx_xmit_consumed(bytes_consumed);
+    if (bytes_consumed < 0 && errno == EAGAIN) {
+        bytes_consumed = 0;
+    }
     if (bytes_consumed < 0) {
         /* Error in back-end: drain the fifo. */
         printf("oops, bytes_consumed = %d errno = %d\n",
bytes_consumed, errno);


which makes the code handle both "returns 0" and "returns -1
with errno=EAGAIN" as "try again later".

But even with that I still see the check-functional
test failing on a clang sanitizer build, though without
any clear reason why. It's intermittent; running the
test like this:

(cd build/arm-clang/ ; PYTHONPATH=../../python:../../tests/functional
QEMU_TEST_QEMU_BINARY=./qemu-system-aarch64 ./pyvenv/bin/python3
../../tests/functional/test_arm_virt.py)

I got one pass once but mostly it hangs after printing
some of the early console output. A debug build seems
more reliable, oddly.

I'll try to continue investigating this this week, but
in the meantime I'm going to have to drop this series
from target-arm.next again, I'm afraid :-(

thanks
-- PMM
Philippe Mathieu-Daudé March 11, 2025, 10:33 a.m. UTC | #3
Hi Peter,

On 10/3/25 18:28, Peter Maydell wrote:
> On Mon, 10 Mar 2025 at 14:42, Peter Maydell <peter.maydell@linaro.org> wrote:
>>
>> On Mon, 10 Mar 2025 at 01:28, Philippe Mathieu-Daudé <philmd@linaro.org> wrote:
>>>
>>> Hi,
>>>
>>> This series add support for (async) FIFO on the transmit path
>>> of the PL011 UART.
>>
>> This hasn't made the last pre-softfreeze arm pullreq, but
>> I think we can reasonably call "don't do blocking I/O"
>> enough of a bugfix for it to be ok to go in early in the
>> freeze cycle for rc0.
>>
>> I've applied it to target-arm.next.
> 
> ...but it still fails 'make check-functional', though in a
> less easy-to-reproduce way than it did. The problem turns out
> to be that when the guest kernel is doing its earlycon
> output (which is by polling, not interrupt driven) the output
> can be corrupted, which makes the aarch64/test_arm_virt test
> fail to find the "Kernel command line:" output it is looking for.

Thanks for keeping investigating...

> This seems to be because the pl011 code and the chardev
> code disagree about how "couldn't write anything" is
> reported. pl011 here is looking for "0 means wrote nothing",
> but the chardev code reports it as "-1 and errno is EAGAIN".
> 
> I think the chardev code is actually what we need to fix here,
> because it makes basically no effort to guarantee that the
> errno from the underlying write is still in 'errno' by the
> time qemu_chr_fe_write() returns. In particular it may
> call qemu_chr_write_log() or replay_char_write_event_save(),
> both of which will happily trash errno if something fails
> during their execution.

IIUC when retrying qemu_chr_write_buffer(s, buf, len, ofs) could
write less than @len (but still writing few bytes, that information
is stored in @offset) and return -errno, discarding @offset partial
write len.

> So my long-term preference for fixing this is:
>   * fix up any callsites that can't handle a 0 return for
>     "wrote no bytes"
>   * make (and document) qemu_chr_fe_write()'s return value be
>      - 0 == wrote no bytes
>      - >0 == wrote some bytes
>      - <0 == a negative-errno indicating a definite error

This would be an improvement, but not fixing ignored partial
writes mentioned, is that right?

> I had planned in the meantime that we could deal with
> this by squashing in this change to the last patch in
> this series:
> 
> --- a/hw/char/pl011.c
> +++ b/hw/char/pl011.c
> @@ -275,6 +275,9 @@ static gboolean pl011_xmit_cb(void *do_not_use,
> GIOCondition cond, void *opaque)
>       /* Transmit as much data as we can. */
>       bytes_consumed = qemu_chr_fe_write(&s->chr, buf, count);
>       trace_pl011_fifo_tx_xmit_consumed(bytes_consumed);
> +    if (bytes_consumed < 0 && errno == EAGAIN) {
> +        bytes_consumed = 0;
> +    }
>       if (bytes_consumed < 0) {
>           /* Error in back-end: drain the fifo. */
>           printf("oops, bytes_consumed = %d errno = %d\n",
> bytes_consumed, errno);
> 
> 
> which makes the code handle both "returns 0" and "returns -1
> with errno=EAGAIN" as "try again later".
> 
> But even with that I still see the check-functional
> test failing on a clang sanitizer build, though without
> any clear reason why. It's intermittent; running the
> test like this:
> 
> (cd build/arm-clang/ ; PYTHONPATH=../../python:../../tests/functional
> QEMU_TEST_QEMU_BINARY=./qemu-system-aarch64 ./pyvenv/bin/python3
> ../../tests/functional/test_arm_virt.py)
> 
> I got one pass once but mostly it hangs after printing
> some of the early console output. A debug build seems
> more reliable, oddly.
> 
> I'll try to continue investigating this this week, but
> in the meantime I'm going to have to drop this series
> from target-arm.next again, I'm afraid :-(

No worry, I was prepared for another issue :)

Regards,

Phil.
Peter Maydell March 11, 2025, 6:36 p.m. UTC | #4
On Tue, 11 Mar 2025 at 10:33, Philippe Mathieu-Daudé <philmd@linaro.org> wrote:
> On 10/3/25 18:28, Peter Maydell wrote:
> > This seems to be because the pl011 code and the chardev
> > code disagree about how "couldn't write anything" is
> > reported. pl011 here is looking for "0 means wrote nothing",
> > but the chardev code reports it as "-1 and errno is EAGAIN".
> >
> > I think the chardev code is actually what we need to fix here,
> > because it makes basically no effort to guarantee that the
> > errno from the underlying write is still in 'errno' by the
> > time qemu_chr_fe_write() returns. In particular it may
> > call qemu_chr_write_log() or replay_char_write_event_save(),
> > both of which will happily trash errno if something fails
> > during their execution.
>
> IIUC when retrying qemu_chr_write_buffer(s, buf, len, ofs) could
> write less than @len (but still writing few bytes, that information
> is stored in @offset) and return -errno, discarding @offset partial
> write len.

I thought that too when I first read the code, because it's
been written in a way that didn't match what I was expecting
for a "write until you would block, then stop" loop, but on
second reading I decided it could not, at least in the case
where write_all is false. If we write any data at all on the
first cc->chr_write call, we will then break out of the loop
without trying to send any more, so there's no code path
that will then set res to a negative number. (If write_all
is true, then yeah I think we can return an errno rather
than a partial-write count if chr_write fails on the second
time through the loop, but all the callers of that version
of the function ignore errors of any kind anyway, and
certainly don't have any means to handle "only wrote half
the data", otherwise they'd be using the non-blocking
version.)

-- PMM