Message ID | 20250310012825.79614-1-philmd@linaro.org |
---|---|
Series | hw/char/pl011: Implement TX (async) FIFO to avoid blocking the main loop |
On Mon, 10 Mar 2025 at 01:28, Philippe Mathieu-Daudé <philmd@linaro.org> wrote:
>
> Hi,
>
> This series adds support for an (async) FIFO on the transmit path
> of the PL011 UART.

This hasn't made the last pre-softfreeze arm pullreq, but I think we can reasonably call "don't do blocking I/O" enough of a bugfix for it to be ok to go in early in the freeze cycle for rc0.

I've applied it to target-arm.next.

thanks
-- PMM
On Mon, 10 Mar 2025 at 14:42, Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Mon, 10 Mar 2025 at 01:28, Philippe Mathieu-Daudé <philmd@linaro.org> wrote:
> >
> > Hi,
> >
> > This series adds support for an (async) FIFO on the transmit path
> > of the PL011 UART.
>
> This hasn't made the last pre-softfreeze arm pullreq, but I think we can reasonably call "don't do blocking I/O" enough of a bugfix for it to be ok to go in early in the freeze cycle for rc0.
>
> I've applied it to target-arm.next.

...but it still fails 'make check-functional', though in a less easy-to-reproduce way than it did. The problem turns out to be that when the guest kernel is doing its earlycon output (which is done by polling, not interrupt-driven) the output can be corrupted, which makes the aarch64/test_arm_virt test fail to find the "Kernel command line:" output it is looking for.

This seems to be because the pl011 code and the chardev code disagree about how "couldn't write anything" is reported. pl011 here is looking for "0 means wrote nothing", but the chardev code reports it as "-1 and errno is EAGAIN".

I think the chardev code is actually what we need to fix here, because it makes basically no effort to guarantee that the errno from the underlying write is still in 'errno' by the time qemu_chr_fe_write() returns. In particular it may call qemu_chr_write_log() or replay_char_write_event_save(), both of which will happily trash errno if something fails during their execution.

So my long-term preference for fixing this is:
* fix up any callsites that can't handle a 0 return for "wrote no bytes"
* make (and document) qemu_chr_fe_write()'s return value be
  - 0 == wrote no bytes
  - >0 == wrote some bytes
  - <0 == a negative-errno indicating a definite error

I had planned in the meantime that we could deal with this by squashing in this change to the last patch in this series:

--- a/hw/char/pl011.c
+++ b/hw/char/pl011.c
@@ -275,6 +275,9 @@ static gboolean pl011_xmit_cb(void *do_not_use, GIOCondition cond, void *opaque)
     /* Transmit as much data as we can. */
     bytes_consumed = qemu_chr_fe_write(&s->chr, buf, count);
     trace_pl011_fifo_tx_xmit_consumed(bytes_consumed);
+    if (bytes_consumed < 0 && errno == EAGAIN) {
+        bytes_consumed = 0;
+    }
     if (bytes_consumed < 0) {
         /* Error in back-end: drain the fifo. */
         printf("oops, bytes_consumed = %d errno = %d\n", bytes_consumed, errno);

which makes the code handle both "returns 0" and "returns -1 with errno=EAGAIN" as "try again later".

But even with that I still see the check-functional test failing on a clang sanitizer build, though without any clear reason why. It's intermittent; running the test like this:

(cd build/arm-clang/ ; PYTHONPATH=../../python:../../tests/functional QEMU_TEST_QEMU_BINARY=./qemu-system-aarch64 ./pyvenv/bin/python3 ../../tests/functional/test_arm_virt.py)

I got one pass once but mostly it hangs after printing some of the early console output. A debug build seems more reliable, oddly.

I'll try to continue investigating this this week, but in the meantime I'm going to have to drop this series from target-arm.next again, I'm afraid :-(

thanks
-- PMM
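For illustration, the same normalization could be written as a small caller-side helper. This is only a sketch against the current qemu_chr_fe_write() behaviour; the helper name is invented and it is not an existing QEMU API:

/*
 * Hypothetical helper: fold "-1 with errno == EAGAIN" into the
 * "0 == wrote no bytes" convention that the pl011 FIFO code expects.
 */
#include "qemu/osdep.h"
#include "chardev/char-fe.h"

static int chr_fe_write_nonblock(CharBackend *be, const uint8_t *buf, int len)
{
    int ret = qemu_chr_fe_write(be, buf, len);

    if (ret < 0 && errno == EAGAIN) {
        return 0;   /* back-end would block: wrote nothing, try again later */
    }
    return ret;     /* > 0: bytes written; < 0: treat as a definite error */
}

With such a helper the pl011 xmit callback would only need to handle the documented "0 / >0 / <0" cases.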
Hi Peter,

On 10/3/25 18:28, Peter Maydell wrote:
> On Mon, 10 Mar 2025 at 14:42, Peter Maydell <peter.maydell@linaro.org> wrote:
>>
>> On Mon, 10 Mar 2025 at 01:28, Philippe Mathieu-Daudé <philmd@linaro.org> wrote:
>>>
>>> Hi,
>>>
>>> This series adds support for an (async) FIFO on the transmit path
>>> of the PL011 UART.
>>
>> This hasn't made the last pre-softfreeze arm pullreq, but I think we can reasonably call "don't do blocking I/O" enough of a bugfix for it to be ok to go in early in the freeze cycle for rc0.
>>
>> I've applied it to target-arm.next.
>
> ...but it still fails 'make check-functional', though in a less easy-to-reproduce way than it did. The problem turns out to be that when the guest kernel is doing its earlycon output (which is done by polling, not interrupt-driven) the output can be corrupted, which makes the aarch64/test_arm_virt test fail to find the "Kernel command line:" output it is looking for.

Thanks for continuing to investigate...

> This seems to be because the pl011 code and the chardev code disagree about how "couldn't write anything" is reported. pl011 here is looking for "0 means wrote nothing", but the chardev code reports it as "-1 and errno is EAGAIN".
>
> I think the chardev code is actually what we need to fix here, because it makes basically no effort to guarantee that the errno from the underlying write is still in 'errno' by the time qemu_chr_fe_write() returns. In particular it may call qemu_chr_write_log() or replay_char_write_event_save(), both of which will happily trash errno if something fails during their execution.

IIUC, when retrying, qemu_chr_write_buffer(s, buf, len, ofs) could write less than @len (still writing a few bytes, with that count recorded in @offset) and yet return -errno, discarding the partial write length in @offset.

> So my long-term preference for fixing this is:
> * fix up any callsites that can't handle a 0 return for "wrote no bytes"
> * make (and document) qemu_chr_fe_write()'s return value be
>   - 0 == wrote no bytes
>   - >0 == wrote some bytes
>   - <0 == a negative-errno indicating a definite error

This would be an improvement, but it wouldn't fix the ignored partial writes mentioned above, is that right?

> I had planned in the meantime that we could deal with this by squashing in this change to the last patch in this series:
>
> --- a/hw/char/pl011.c
> +++ b/hw/char/pl011.c
> @@ -275,6 +275,9 @@ static gboolean pl011_xmit_cb(void *do_not_use, GIOCondition cond, void *opaque)
>      /* Transmit as much data as we can. */
>      bytes_consumed = qemu_chr_fe_write(&s->chr, buf, count);
>      trace_pl011_fifo_tx_xmit_consumed(bytes_consumed);
> +    if (bytes_consumed < 0 && errno == EAGAIN) {
> +        bytes_consumed = 0;
> +    }
>      if (bytes_consumed < 0) {
>          /* Error in back-end: drain the fifo. */
>          printf("oops, bytes_consumed = %d errno = %d\n", bytes_consumed, errno);
>
> which makes the code handle both "returns 0" and "returns -1 with errno=EAGAIN" as "try again later".
>
> But even with that I still see the check-functional test failing on a clang sanitizer build, though without any clear reason why. It's intermittent; running the test like this:
>
> (cd build/arm-clang/ ; PYTHONPATH=../../python:../../tests/functional QEMU_TEST_QEMU_BINARY=./qemu-system-aarch64 ./pyvenv/bin/python3 ../../tests/functional/test_arm_virt.py)
>
> I got one pass once but mostly it hangs after printing some of the early console output. A debug build seems more reliable, oddly.
>
> I'll try to continue investigating this this week, but in the meantime I'm going to have to drop this series from target-arm.next again, I'm afraid :-(

No worries, I was prepared for another issue :)

Regards,

Phil.
On Tue, 11 Mar 2025 at 10:33, Philippe Mathieu-Daudé <philmd@linaro.org> wrote:
> On 10/3/25 18:28, Peter Maydell wrote:
> > This seems to be because the pl011 code and the chardev code disagree about how "couldn't write anything" is reported. pl011 here is looking for "0 means wrote nothing", but the chardev code reports it as "-1 and errno is EAGAIN".
> >
> > I think the chardev code is actually what we need to fix here, because it makes basically no effort to guarantee that the errno from the underlying write is still in 'errno' by the time qemu_chr_fe_write() returns. In particular it may call qemu_chr_write_log() or replay_char_write_event_save(), both of which will happily trash errno if something fails during their execution.
>
> IIUC, when retrying, qemu_chr_write_buffer(s, buf, len, ofs) could write less than @len (still writing a few bytes, with that count recorded in @offset) and yet return -errno, discarding the partial write length in @offset.

I thought that too when I first read the code, because it's been written in a way that didn't match what I was expecting for a "write until you would block, then stop" loop, but on second reading I decided it could not, at least in the case where write_all is false. If we write any data at all on the first cc->chr_write call, we will then break out of the loop without trying to send any more, so there's no code path that will then set res to a negative number.

(If write_all is true, then yeah I think we can return an errno rather than a partial-write count if chr_write fails on the second time through the loop, but all the callers of that version of the function ignore errors of any kind anyway, and certainly don't have any means to handle "only wrote half the data", otherwise they'd be using the non-blocking version.)

-- PMM
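To make the reasoning above concrete, here is a simplified paraphrase of the loop shape being discussed; it is not the verbatim chardev/char.c code and the function name is illustrative. The point it shows is that once the first cc->chr_write() moves any bytes and write_all is false, the loop exits before a later failure can overwrite the count.

/* Simplified paraphrase (illustrative only) of the write-buffer loop. */
#include "qemu/osdep.h"
#include "chardev/char.h"

static int write_buffer_sketch(Chardev *s, const uint8_t *buf, int len,
                               int *offset, bool write_all)
{
    ChardevClass *cc = CHARDEV_GET_CLASS(s);
    int res = 0;

    *offset = 0;
    while (*offset < len) {
        res = cc->chr_write(s, buf + *offset, len - *offset);
        if (res <= 0) {
            break;          /* would-block or error: stop here */
        }
        *offset += res;
        if (!write_all) {
            break;          /* non-blocking path: report the partial count */
        }
    }
    /* With write_all == false, res can only be negative while *offset is
     * still 0, so a partial write is never masked by a later -errno. */
    return res;
}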