diff mbox series

[PULL,12/13] page_unprotect(): handle calls to pages that are PAGE_WRITE

Message ID 20180123144807.5618-13-laurent@vivier.eu
State Accepted
Commit 9c4bbee9e3b83544257e82566342c29e15a88637
Headers show
Series [PULL,01/13] linux-user: Fix locking order in fork_start() | expand

Commit Message

Laurent Vivier Jan. 23, 2018, 2:48 p.m. UTC
From: Peter Maydell <peter.maydell@linaro.org>


If multiple guest threads in user-mode emulation write to a
page which QEMU has marked read-only because of cached TCG
translations, the threads can race in page_unprotect:

 * threads A & B both try to do a write to a page with code in it at
   the same time (ie which we've made non-writeable, so SEGV)
 * they race into the signal handler with this faulting address
 * thread A happens to get to page_unprotect() first and takes the
   mmap lock, so thread B sits waiting for it to be done
 * A then finds the page, marks it PAGE_WRITE and mprotect()s it writable
 * A can then continue OK (returns from signal handler to retry the
   memory access)
 * ...but when B gets the mmap lock it finds that the page is already
   PAGE_WRITE, and so it exits page_unprotect() via the "not due to
   protected translation" code path, and wrongly delivers the signal
   to the guest rather than just retrying the access

In particular, this meant that trying to run 'javac' in user-mode
emulation would fail with a spurious guest SIGSEGV.

Handle this by making page_unprotect() assume that a call for a page
which is already PAGE_WRITE is due to a race of this sort and return
a "fault handled" indication.

Since this would cause an infinite loop if we ever called
page_unprotect() for some other kind of fault than "write failed due
to bad access permissions", tighten the condition in
handle_cpu_signal() to check the signal number and si_code, and add a
comment so that if somebody does ever find themselves debugging an
infinite loop of faults they have some clue about why.

(The trick for identifying the correct setting for
current_tb_invalidated for thread B (needed to handle the precise-SMC
case) is due to Richard Henderson.  Paolo Bonzini suggested just
relying on si_code rather than trying anything more complicated.)

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

Message-Id: <1511879725-9576-3-git-send-email-peter.maydell@linaro.org>
Signed-off-by: Laurent Vivier <laurent@vivier.eu>

---
 accel/tcg/translate-all.c | 50 +++++++++++++++++++++++++++++------------------
 accel/tcg/user-exec.c     | 13 +++++++++++-
 2 files changed, 43 insertions(+), 20 deletions(-)

-- 
2.14.3

Comments

Laurent Vivier March 22, 2018, 1:52 a.m. UTC | #1
Le 23/01/2018 à 15:48, Laurent Vivier a écrit :
> From: Peter Maydell <peter.maydell@linaro.org>

> 

> If multiple guest threads in user-mode emulation write to a

> page which QEMU has marked read-only because of cached TCG

> translations, the threads can race in page_unprotect:

> 

>  * threads A & B both try to do a write to a page with code in it at

>    the same time (ie which we've made non-writeable, so SEGV)

>  * they race into the signal handler with this faulting address

>  * thread A happens to get to page_unprotect() first and takes the

>    mmap lock, so thread B sits waiting for it to be done

>  * A then finds the page, marks it PAGE_WRITE and mprotect()s it writable

>  * A can then continue OK (returns from signal handler to retry the

>    memory access)

>  * ...but when B gets the mmap lock it finds that the page is already

>    PAGE_WRITE, and so it exits page_unprotect() via the "not due to

>    protected translation" code path, and wrongly delivers the signal

>    to the guest rather than just retrying the access

> 

> In particular, this meant that trying to run 'javac' in user-mode

> emulation would fail with a spurious guest SIGSEGV.

> 

> Handle this by making page_unprotect() assume that a call for a page

> which is already PAGE_WRITE is due to a race of this sort and return

> a "fault handled" indication.

> 

> Since this would cause an infinite loop if we ever called

> page_unprotect() for some other kind of fault than "write failed due

> to bad access permissions", tighten the condition in

> handle_cpu_signal() to check the signal number and si_code, and add a

> comment so that if somebody does ever find themselves debugging an

> infinite loop of faults they have some clue about why.

> 

> (The trick for identifying the correct setting for

> current_tb_invalidated for thread B (needed to handle the precise-SMC

> case) is due to Richard Henderson.  Paolo Bonzini suggested just

> relying on si_code rather than trying anything more complicated.)

> 

> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

> Message-Id: <1511879725-9576-3-git-send-email-peter.maydell@linaro.org>

> Signed-off-by: Laurent Vivier <laurent@vivier.eu>

> ---

>  accel/tcg/translate-all.c | 50 +++++++++++++++++++++++++++++------------------

>  accel/tcg/user-exec.c     | 13 +++++++++++-

>  2 files changed, 43 insertions(+), 20 deletions(-)

> 


It seems this patch breaks something in linux-user mode emulation for
m68k (32bit BE) on ppc (32bit BE).

What I have:

  ~/chroot$ sudo QEMU_CPU=m68040 chroot m68k/sid/
  I have no name!@localhost:/# ls
  bin   debootstrap  etc	 lib   qemu-m68k  run	sys  usr
  boot  dev	   home  proc  root	  sbin	tmp  var
  qemu: uncaught target signal 11 (Segmentation fault) - core dumped
  ~/chroot$

It seems "bash" crashes on "ls" exit.

My chroot has been installed with:

  ARCH=m68k
  TARGET=sid
  CHROOT=$HOME/chroot/m68k/sid/
  REPOT=http://cdn-fastly.deb.debian.org/debian-ports/
  debootstrap --arch=$ARCH --foreign --variant=minbase \
              --no-check-gpg $TARGET $CHROOT $REPO

I didn't investigate more.

Thanks,
Laurent
Laurent Vivier March 22, 2018, 10:36 a.m. UTC | #2
Le 22/03/2018 à 02:52, Laurent Vivier a écrit :
> Le 23/01/2018 à 15:48, Laurent Vivier a écrit :

>> From: Peter Maydell <peter.maydell@linaro.org>

>>

>> If multiple guest threads in user-mode emulation write to a

>> page which QEMU has marked read-only because of cached TCG

>> translations, the threads can race in page_unprotect:

>>

>>  * threads A & B both try to do a write to a page with code in it at

>>    the same time (ie which we've made non-writeable, so SEGV)

>>  * they race into the signal handler with this faulting address

>>  * thread A happens to get to page_unprotect() first and takes the

>>    mmap lock, so thread B sits waiting for it to be done

>>  * A then finds the page, marks it PAGE_WRITE and mprotect()s it writable

>>  * A can then continue OK (returns from signal handler to retry the

>>    memory access)

>>  * ...but when B gets the mmap lock it finds that the page is already

>>    PAGE_WRITE, and so it exits page_unprotect() via the "not due to

>>    protected translation" code path, and wrongly delivers the signal

>>    to the guest rather than just retrying the access

>>

>> In particular, this meant that trying to run 'javac' in user-mode

>> emulation would fail with a spurious guest SIGSEGV.

>>

>> Handle this by making page_unprotect() assume that a call for a page

>> which is already PAGE_WRITE is due to a race of this sort and return

>> a "fault handled" indication.

>>

>> Since this would cause an infinite loop if we ever called

>> page_unprotect() for some other kind of fault than "write failed due

>> to bad access permissions", tighten the condition in

>> handle_cpu_signal() to check the signal number and si_code, and add a

>> comment so that if somebody does ever find themselves debugging an

>> infinite loop of faults they have some clue about why.

>>

>> (The trick for identifying the correct setting for

>> current_tb_invalidated for thread B (needed to handle the precise-SMC

>> case) is due to Richard Henderson.  Paolo Bonzini suggested just

>> relying on si_code rather than trying anything more complicated.)

>>

>> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

>> Message-Id: <1511879725-9576-3-git-send-email-peter.maydell@linaro.org>

>> Signed-off-by: Laurent Vivier <laurent@vivier.eu>

>> ---

>>  accel/tcg/translate-all.c | 50 +++++++++++++++++++++++++++++------------------

>>  accel/tcg/user-exec.c     | 13 +++++++++++-

>>  2 files changed, 43 insertions(+), 20 deletions(-)

>>

> 

> It seems this patch breaks something in linux-user mode emulation for

> m68k (32bit BE) on ppc (32bit BE).

> 

> What I have:

> 

>   ~/chroot$ sudo QEMU_CPU=m68040 chroot m68k/sid/

>   I have no name!@localhost:/# ls

>   bin   debootstrap  etc	 lib   qemu-m68k  run	sys  usr

>   boot  dev	   home  proc  root	  sbin	tmp  var

>   qemu: uncaught target signal 11 (Segmentation fault) - core dumped

>   ~/chroot$

> 

> It seems "bash" crashes on "ls" exit.

> 

> My chroot has been installed with:

> 

>   ARCH=m68k

>   TARGET=sid

>   CHROOT=$HOME/chroot/m68k/sid/

>   REPOT=http://cdn-fastly.deb.debian.org/debian-ports/

>   debootstrap --arch=$ARCH --foreign --variant=minbase \

>               --no-check-gpg $TARGET $CHROOT $REPO

> 

> I didn't investigate more.


It goes wrong in this part:

+     */
+    if (is_write && info->si_signo == SIGSEGV && info->si_code ==
SEGV_ACCERR &&
+        h2g_valid(address)) {

Because, on ppc, si_code is SEGV_MAPERR and not SEGV_ACCERR
(on x86_64, si_code is SEGV_ACCERR as expected)

Any idea?

Thanks,
Laurent
Peter Maydell March 22, 2018, 11:05 a.m. UTC | #3
On 22 March 2018 at 10:36, Laurent Vivier <laurent@vivier.eu> wrote:
> Le 22/03/2018 à 02:52, Laurent Vivier a écrit :

>> It seems this patch breaks something in linux-user mode emulation for

>> m68k (32bit BE) on ppc (32bit BE).

>>

>> What I have:

>>

>>   ~/chroot$ sudo QEMU_CPU=m68040 chroot m68k/sid/

>>   I have no name!@localhost:/# ls

>>   bin   debootstrap  etc       lib   qemu-m68k  run   sys  usr

>>   boot  dev      home  proc  root       sbin  tmp  var

>>   qemu: uncaught target signal 11 (Segmentation fault) - core dumped

>>   ~/chroot$

>>

>> It seems "bash" crashes on "ls" exit.

>>

>> My chroot has been installed with:

>>

>>   ARCH=m68k

>>   TARGET=sid

>>   CHROOT=$HOME/chroot/m68k/sid/

>>   REPOT=http://cdn-fastly.deb.debian.org/debian-ports/

>>   debootstrap --arch=$ARCH --foreign --variant=minbase \

>>               --no-check-gpg $TARGET $CHROOT $REPO

>>

>> I didn't investigate more.

>

> It goes wrong in this part:

>

> +     */

> +    if (is_write && info->si_signo == SIGSEGV && info->si_code ==

> SEGV_ACCERR &&

> +        h2g_valid(address)) {

>

> Because, on ppc, si_code is SEGV_MAPERR and not SEGV_ACCERR

> (on x86_64, si_code is SEGV_ACCERR as expected)


So on PPC if you have a page mapped, and you access it with
the wrong permissions, you get SEGV_MAPERR? This seems like
a host kernel bug to me.

thanks
-- PMM
Peter Maydell March 22, 2018, 11:07 a.m. UTC | #4
On 22 March 2018 at 11:05, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 22 March 2018 at 10:36, Laurent Vivier <laurent@vivier.eu> wrote:

>> It goes wrong in this part:

>>

>> +     */

>> +    if (is_write && info->si_signo == SIGSEGV && info->si_code ==

>> SEGV_ACCERR &&

>> +        h2g_valid(address)) {

>>

>> Because, on ppc, si_code is SEGV_MAPERR and not SEGV_ACCERR

>> (on x86_64, si_code is SEGV_ACCERR as expected)

>

> So on PPC if you have a page mapped, and you access it with

> the wrong permissions, you get SEGV_MAPERR? This seems like

> a host kernel bug to me.


...in particular, kernel commit ecb101aed86156e (dated Dec 2017)
fixes a regression introduced in commit c3350602e876 that broke
the ppc kernels so they started returning SEGV_MAPERR here
instead of SEGV_ACCERR. Presumably your host kernel is missing
this fix.

thanks
-- PMM
Laurent Vivier March 22, 2018, 11:07 a.m. UTC | #5
Le 22/03/2018 à 12:05, Peter Maydell a écrit :
> On 22 March 2018 at 10:36, Laurent Vivier <laurent@vivier.eu> wrote:

>> Le 22/03/2018 à 02:52, Laurent Vivier a écrit :

>>> It seems this patch breaks something in linux-user mode emulation for

>>> m68k (32bit BE) on ppc (32bit BE).

>>>

>>> What I have:

>>>

>>>   ~/chroot$ sudo QEMU_CPU=m68040 chroot m68k/sid/

>>>   I have no name!@localhost:/# ls

>>>   bin   debootstrap  etc       lib   qemu-m68k  run   sys  usr

>>>   boot  dev      home  proc  root       sbin  tmp  var

>>>   qemu: uncaught target signal 11 (Segmentation fault) - core dumped

>>>   ~/chroot$

>>>

>>> It seems "bash" crashes on "ls" exit.

>>>

>>> My chroot has been installed with:

>>>

>>>   ARCH=m68k

>>>   TARGET=sid

>>>   CHROOT=$HOME/chroot/m68k/sid/

>>>   REPOT=http://cdn-fastly.deb.debian.org/debian-ports/

>>>   debootstrap --arch=$ARCH --foreign --variant=minbase \

>>>               --no-check-gpg $TARGET $CHROOT $REPO

>>>

>>> I didn't investigate more.

>>

>> It goes wrong in this part:

>>

>> +     */

>> +    if (is_write && info->si_signo == SIGSEGV && info->si_code ==

>> SEGV_ACCERR &&

>> +        h2g_valid(address)) {

>>

>> Because, on ppc, si_code is SEGV_MAPERR and not SEGV_ACCERR

>> (on x86_64, si_code is SEGV_ACCERR as expected)

> 

> So on PPC if you have a page mapped, and you access it with

> the wrong permissions, you get SEGV_MAPERR? This seems like

> a host kernel bug to me.


Are we sure it is mapped? How to know?
otherwise yes, it sounds like a kernel bug.

Thanks,
Laurent
Peter Maydell March 22, 2018, 11:10 a.m. UTC | #6
On 22 March 2018 at 11:07, Laurent Vivier <laurent@vivier.eu> wrote:
> Le 22/03/2018 à 12:05, Peter Maydell a écrit :

>> On 22 March 2018 at 10:36, Laurent Vivier <laurent@vivier.eu> wrote:re.

>>> It goes wrong in this part:

>>>

>>> +     */

>>> +    if (is_write && info->si_signo == SIGSEGV && info->si_code ==

>>> SEGV_ACCERR &&

>>> +        h2g_valid(address)) {

>>>

>>> Because, on ppc, si_code is SEGV_MAPERR and not SEGV_ACCERR

>>> (on x86_64, si_code is SEGV_ACCERR as expected)

>>

>> So on PPC if you have a page mapped, and you access it with

>> the wrong permissions, you get SEGV_MAPERR? This seems like

>> a host kernel bug to me.

>

> Are we sure it is mapped? How to know?


We know it's mapped because the kernel doesn't give us the
SEGV_MAPERR code :-) Access to unmapped pages must be the
guest binary's problem -- the thing we're trying to detect
here is "is this a write access to a page that we mapped
read-only because we have a cache of code translated for it",
which is always going to be "mapped but not with the right
permissions".

thanks
-- PMM
Laurent Vivier March 22, 2018, 11:13 a.m. UTC | #7
Le 22/03/2018 à 12:07, Peter Maydell a écrit :
> On 22 March 2018 at 11:05, Peter Maydell <peter.maydell@linaro.org> wrote:

>> On 22 March 2018 at 10:36, Laurent Vivier <laurent@vivier.eu> wrote:

>>> It goes wrong in this part:

>>>

>>> +     */

>>> +    if (is_write && info->si_signo == SIGSEGV && info->si_code ==

>>> SEGV_ACCERR &&

>>> +        h2g_valid(address)) {

>>>

>>> Because, on ppc, si_code is SEGV_MAPERR and not SEGV_ACCERR

>>> (on x86_64, si_code is SEGV_ACCERR as expected)

>>

>> So on PPC if you have a page mapped, and you access it with

>> the wrong permissions, you get SEGV_MAPERR? This seems like

>> a host kernel bug to me.

> 

> ...in particular, kernel commit ecb101aed86156e (dated Dec 2017)

> fixes a regression introduced in commit c3350602e876 that broke

> the ppc kernels so they started returning SEGV_MAPERR here

> instead of SEGV_ACCERR. Presumably your host kernel is missing

> this fix.


Yes, you're right, my kernel is 4.14-rc1 (6e80ecd) with
c3350602e876 but without ecb101aed86156e.

I'm going to update it.

Thanks,
Laurent
Laurent Vivier March 22, 2018, 4:47 p.m. UTC | #8
Le 22/03/2018 à 12:13, Laurent Vivier a écrit :
> Le 22/03/2018 à 12:07, Peter Maydell a écrit :

>> On 22 March 2018 at 11:05, Peter Maydell <peter.maydell@linaro.org> wrote:

>>> On 22 March 2018 at 10:36, Laurent Vivier <laurent@vivier.eu> wrote:

>>>> It goes wrong in this part:

>>>>

>>>> +     */

>>>> +    if (is_write && info->si_signo == SIGSEGV && info->si_code ==

>>>> SEGV_ACCERR &&

>>>> +        h2g_valid(address)) {

>>>>

>>>> Because, on ppc, si_code is SEGV_MAPERR and not SEGV_ACCERR

>>>> (on x86_64, si_code is SEGV_ACCERR as expected)

>>>

>>> So on PPC if you have a page mapped, and you access it with

>>> the wrong permissions, you get SEGV_MAPERR? This seems like

>>> a host kernel bug to me.

>>

>> ...in particular, kernel commit ecb101aed86156e (dated Dec 2017)

>> fixes a regression introduced in commit c3350602e876 that broke

>> the ppc kernels so they started returning SEGV_MAPERR here

>> instead of SEGV_ACCERR. Presumably your host kernel is missing

>> this fix.

> 

> Yes, you're right, my kernel is 4.14-rc1 (6e80ecd) with

> c3350602e876 but without ecb101aed86156e.

> 

> I'm going to update it.


Re-tested with 4.16-rc6 on ppc32 and it works fine.

Thanks,
Laurent
diff mbox series

Patch

diff --git a/accel/tcg/translate-all.c b/accel/tcg/translate-all.c
index 7736257085..67795cd78c 100644
--- a/accel/tcg/translate-all.c
+++ b/accel/tcg/translate-all.c
@@ -2181,29 +2181,41 @@  int page_unprotect(target_ulong address, uintptr_t pc)
 
     /* if the page was really writable, then we change its
        protection back to writable */
-    if ((p->flags & PAGE_WRITE_ORG) && !(p->flags & PAGE_WRITE)) {
-        host_start = address & qemu_host_page_mask;
-        host_end = host_start + qemu_host_page_size;
-
-        prot = 0;
+    if (p->flags & PAGE_WRITE_ORG) {
         current_tb_invalidated = false;
-        for (addr = host_start ; addr < host_end ; addr += TARGET_PAGE_SIZE) {
-            p = page_find(addr >> TARGET_PAGE_BITS);
-            p->flags |= PAGE_WRITE;
-            prot |= p->flags;
-
-            /* and since the content will be modified, we must invalidate
-               the corresponding translated code. */
-            current_tb_invalidated |= tb_invalidate_phys_page(addr, pc);
-#ifdef CONFIG_USER_ONLY
-            if (DEBUG_TB_CHECK_GATE) {
-                tb_invalidate_check(addr);
+        if (p->flags & PAGE_WRITE) {
+            /* If the page is actually marked WRITE then assume this is because
+             * this thread raced with another one which got here first and
+             * set the page to PAGE_WRITE and did the TB invalidate for us.
+             */
+#ifdef TARGET_HAS_PRECISE_SMC
+            TranslationBlock *current_tb = tb_find_pc(pc);
+            if (current_tb) {
+                current_tb_invalidated = tb_cflags(current_tb) & CF_INVALID;
             }
 #endif
+        } else {
+            host_start = address & qemu_host_page_mask;
+            host_end = host_start + qemu_host_page_size;
+
+            prot = 0;
+            for (addr = host_start; addr < host_end; addr += TARGET_PAGE_SIZE) {
+                p = page_find(addr >> TARGET_PAGE_BITS);
+                p->flags |= PAGE_WRITE;
+                prot |= p->flags;
+
+                /* and since the content will be modified, we must invalidate
+                   the corresponding translated code. */
+                current_tb_invalidated |= tb_invalidate_phys_page(addr, pc);
+#ifdef CONFIG_USER_ONLY
+                if (DEBUG_TB_CHECK_GATE) {
+                    tb_invalidate_check(addr);
+                }
+#endif
+            }
+            mprotect((void *)g2h(host_start), qemu_host_page_size,
+                     prot & PAGE_BITS);
         }
-        mprotect((void *)g2h(host_start), qemu_host_page_size,
-                 prot & PAGE_BITS);
-
         mmap_unlock();
         /* If current TB was invalidated return to main loop */
         return current_tb_invalidated ? 2 : 1;
diff --git a/accel/tcg/user-exec.c b/accel/tcg/user-exec.c
index e8f26ff0cb..c973752562 100644
--- a/accel/tcg/user-exec.c
+++ b/accel/tcg/user-exec.c
@@ -104,7 +104,18 @@  static inline int handle_cpu_signal(uintptr_t pc, siginfo_t *info,
            pc, address, is_write, *(unsigned long *)old_set);
 #endif
     /* XXX: locking issue */
-    if (is_write && h2g_valid(address)) {
+    /* Note that it is important that we don't call page_unprotect() unless
+     * this is really a "write to nonwriteable page" fault, because
+     * page_unprotect() assumes that if it is called for an access to
+     * a page that's writeable this means we had two threads racing and
+     * another thread got there first and already made the page writeable;
+     * so we will retry the access. If we were to call page_unprotect()
+     * for some other kind of fault that should really be passed to the
+     * guest, we'd end up in an infinite loop of retrying the faulting
+     * access.
+     */
+    if (is_write && info->si_signo == SIGSEGV && info->si_code == SEGV_ACCERR &&
+        h2g_valid(address)) {
         switch (page_unprotect(h2g(address), pc)) {
         case 0:
             /* Fault not caused by a page marked unwritable to protect