Message ID | 20231211091346.14616-1-alex.bennee@linaro.org |
---|---|
Headers | show |
Series | record/replay fixes: attempting to get avocado green | expand |
Alex Bennée <alex.bennee@linaro.org> writes: > As I'm a glutton for punishment I thought I'd have a go at fixing the > slowly growing number of record/replay bugs. The two fixes are: > > replay: stop us hanging in rr_wait_io_event > chardev: force write all when recording replay logs > > I think we are beyond 8.2 material but it would be nice to get this > functionality stable again. We have a growing number of bugs under the > icount label on gitlab: > > https://gitlab.com/qemu-project/qemu/-/issues/?label_name%5B%5D=icount > > Changes > ------- > > v2 > > Apart from addressing tidy ups and tags I've been investigating the > failures in replay_linux.py which are the more exhaustive tests which > boot the kernel and user-space. The "fix": > > replay: report sync error when no exception in log (!DEBUG INVESTIGATION) > > triggers around the time of the hang in the logs and despite the > rather hairy EXCP->INT transitions around cpu_exec_loop() I think > points to a genuine problem. I added the tracing to cputlb to verify > the page tables are the same and started detecting divergence between > record and replay a lot earlier on that when the replay_sync_error() > catches things. I see patterns like this: > > 1878 tlb_fill 0x4770c000/1 1 2 tlb_fill 0x4770c000/1 1 2 > 1879 tlb_fill 0x4770d000/1 1 2 tlb_fill 0x4770d000/1 1 2 > 1880 tlb_fill 0x59000/1 0 2 tlb_fill 0x59000/1 0 2 > 1881 > tlb_fill 0x476dd116/1 0 2 > 1882 tlb_fill 0x4770e000/1 1 2 tlb_fill 0x4770e000/1 1 2 > 1883 tlb_fill 0x476dd527/1 0 2 | tlb_fill 0x476dfb17/1 0 2 > 1884 > tlb_fill 0x476de0fd/1 0 2 > 1885 > tlb_fill 0x476dce2e/1 0 2 > 1886 tlb_fill 0x4770f000/1 1 2 tlb_fill 0x4770f000/1 1 2 > 1887 tlb_fill 0x476df939/1 0 2 < > 1888 tlb_fill 0x47710000/1 1 2 tlb_fill 0x47710000/1 1 2 > 1889 tlb_fill 0x47711000/1 1 2 tlb_fill 0x47711000/1 1 2 > > These don't seem to affect the overall program flow but are concerning > because the memory access patterns should be the same. My > investigations with rr seem to indicate the difference is due to > behaviour of the victim_tlb_cache which again AFAICT should be > deterministic. > > Anyway I can't spend any time debugging it this week so I thought I'd > post the current state in case anyone is curious enough to want to go > diving into record/replay. Just for the actual failure report, on replay: [ 144.918551] audit: type=1130 audit(1702415469.960:45): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-tmpfiles-setup comm=" systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success' [ 145.579135] audit: type=1305 audit(1702415470.620:46): op=set audit_enabled=1 old=1 auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1 qemu-system-aarch64: Exception not in log (insn total 30592471098/1463 left, event 2848747 is EVENT_INSTRUCTION) fish: Job 1, './qemu-system-aarch64 -display …' terminated by signal SIGABRT (Abort) I've made some more fixes to the replay_dump script and the next event expected is a CLOCK_WARP followed by some more instructions before an exception. However there are a lot of exceptions/interrupts firing off in the stream: 2848736:EVENT_EXCEPTION(2) no data 2848737:EVENT_INSTRUCTION(0) + 835 -> 30592463170 2848738:EVENT_EXCEPTION(2) no data 2848739:EVENT_INSTRUCTION(0) + 1133 -> 30592464303 2848740:EVENT_EXCEPTION(2) no data 2848741:EVENT_INSTRUCTION(0) + 4739 -> 30592469042 2848742:EVENT_EXCEPTION(2) no data 2848743:EVENT_INSTRUCTION(0) + 865 -> 30592469907 2848744:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data 2848745:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data 2848746:EVENT_INTERRUPT(1) 2848747:EVENT_INSTRUCTION(0) + 2654 -> 30592472561 we are here... 2848748:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data 2848749:EVENT_INSTRUCTION(0) + 184 -> 30592472745 2848750:EVENT_EXCEPTION(2) no data 2848751:EVENT_INSTRUCTION(0) + 832 -> 30592473577 2848752:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data 2848753:EVENT_INSTRUCTION(0) + 6 -> 30592473583 2848754:EVENT_CP_CLOCK_WARP_ACCOUNT(31) no additional data 2848755:EVENT_INSTRUCTION(0) + 2960 -> 30592476543 2848756:EVENT_EXCEPTION(2) no data 2848757:EVENT_INSTRUCTION(0) + 1565 -> 30592478108 2848758:EVENT_EXCEPTION(2) no data 2848759:EVENT_INSTRUCTION(0) + 2996 -> 30592481104 2848760:EVENT_EXCEPTION(2) no data 2848761:EVENT_INSTRUCTION(0) + 832 -> 30592481936 2848762:EVENT_ASYNC(9) REPLAY_ASYNC_EVENT_NET(6) id 0 flags 0 305 bytes That said given the complexity of the exception/interrupt processing back and forth I can see there is potential for things getting confused. Of course I still haven't explained why there is a divergence in the tlb processing so early on despite seemingly having no effect on the replay until this very late stage.