Message ID: 20210115014638.15037-1-jarkko@kernel.org
State: New
Series: [v4] x86/sgx: Fix the call order of synchronize_srcu() in sgx_release()
On Fri, Jan 15, 2021 at 08:18:09AM +0100, Borislav Petkov wrote:
> On Fri, Jan 15, 2021 at 03:46:38AM +0200, jarkko@kernel.org wrote:
> > From: Jarkko Sakkinen <jarkko@kernel.org>
> >
> > The most trivial example of a race condition can be demonstrated with this
> > example where mm_list contains just one entry:
> >
> >         CPU A                           CPU B
> >         sgx_release()
> >                                         sgx_mmu_notifier_release()
> >                                         list_del_rcu()
> >         sgx_encl_release()
> >                                         synchronize_srcu()
> >         cleanup_srcu_struct()
> >
> > To fix this, call synchronize_srcu() before checking whether mm_list is
> > empty in sgx_release().
>
> To fix what?
>
> Lemme explain one more time: a commit message for a race condition
> fix needs to explain in *detail* what the race condition is. And that
> explanation needs to be understandable months and years from now.
>
> You have the function call order above, now you have to explain what can
> happen. Just how you did here:
>
> https://lkml.kernel.org/r/X/zoarV7gd/LNo4A@kernel.org

OK, I could recall the race from that, but that must be partly because
I've been proactively working on it, i.e. I get your point.

So let's say I add this after the sequence:

"The sequence demonstrates a scenario where CPU B starts a new grace
period, which goes unnoticed by CPU A in sgx_release(), because it did
not remove the final entry from the enclave's mm list."

Would this be sufficient or not?

> > Cc: stable@vger.kernel.org
> > Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer")
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Suggested-by: Haitao Huang <haitao.huang@linux.intel.com>
> > Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
> > ---
> > v4:
> > - Rewrite the commit message.
> > - Just change the call order. *_expedited() is out of scope for this
> >   bug fix.
> > v3: Fine-tuned tags, and added missing change log for v2.
> > v2: Switch to synchronize_srcu_expedited().
> >
> >  arch/x86/kernel/cpu/sgx/driver.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c
> > index f2eac41bb4ff..53056345f5f8 100644
> > --- a/arch/x86/kernel/cpu/sgx/driver.c
> > +++ b/arch/x86/kernel/cpu/sgx/driver.c
> > @@ -65,11 +65,16 @@ static int sgx_release(struct inode *inode, struct file *file)
> >
> >  	spin_unlock(&encl->mm_lock);
> >
> > +	/*
> > +	 * The call is need even if the list empty, because sgx_encl_mmu_notifier_release()
>
> "The call is need"?
>
> $ git grep sgx_encl_mmu_notifier_release
> $
>
> WTF?
>
> Please be more careful.

Note taken.

> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

/Jarkko
On Sat, Jan 16, 2021 at 07:12:54AM +0200, Jarkko Sakkinen wrote:
> > https://lkml.kernel.org/r/X/zoarV7gd/LNo4A@kernel.org
>
> OK, I could recall the race from that, but that must be partly because
> I've been proactively working on it, i.e. I get your point.
>
> So let's say I add this after the sequence:
>
> "The sequence demonstrates a scenario where CPU B starts a new
> grace period, which goes unnoticed by CPU A in sgx_release(),
> because it did not remove the final entry from the enclave's
> mm list."
>
> Would this be sufficient or not?

Not sure.

That link above says:

"Now, let's imagine that there is exactly one entry in the encl->mm_list,
and sgx_release() execution gets scheduled right after returning from
synchronize_srcu().

With some bad luck, some process comes and removes that last entry before
sgx_release() acquires mm_lock."

So, the last entry gets removed by some other process before
sgx_release() acquires mm_lock. When it does acquire that lock, the test

	if (list_empty(&encl->mm_list))

will be true because "some other process" has removed that last entry.

So why do you need the synchronize_srcu() call when this process sees an
empty mm_list already?

Thx.

--
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
On Mon, Jan 18, 2021 at 07:57:12PM +0100, Borislav Petkov wrote:
> On Sat, Jan 16, 2021 at 07:12:54AM +0200, Jarkko Sakkinen wrote:
> > > https://lkml.kernel.org/r/X/zoarV7gd/LNo4A@kernel.org
> >
> > OK, I could recall the race from that, but that must be partly because
> > I've been proactively working on it, i.e. I get your point.
> >
> > So let's say I add this after the sequence:
> >
> > "The sequence demonstrates a scenario where CPU B starts a new
> > grace period, which goes unnoticed by CPU A in sgx_release(),
> > because it did not remove the final entry from the enclave's
> > mm list."
> >
> > Would this be sufficient or not?
>
> Not sure.
>
> That link above says:
>
> "Now, let's imagine that there is exactly one entry in the encl->mm_list,
> and sgx_release() execution gets scheduled right after returning from
> synchronize_srcu().
>
> With some bad luck, some process comes and removes that last entry before
> sgx_release() acquires mm_lock."
>
> So, the last entry gets removed by some other process before
> sgx_release() acquires mm_lock. When it does acquire that lock, the test
>
> 	if (list_empty(&encl->mm_list))
>
> will be true because "some other process" has removed that last entry.
>
> So why do you need the synchronize_srcu() call when this process sees an
> empty mm_list already?
>
> Thx.

The other process, i.e. some process using the enclave, calls
list_del_rcu() (and synchronize_srcu()), which starts a new grace
period. If we don't do it, then cleanup_srcu_struct() will race with
that grace period.

> --
> Regards/Gruss,
>     Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

/Jarkko
On 1/20/21 6:43 AM, Jarkko Sakkinen wrote:
>> So why do you need the synchronize_srcu() call when this process sees an
>> empty mm_list already?
>>
>> Thx.
> The other process, i.e. some process using the enclave, calls
> list_del_rcu() (and synchronize_srcu()), which starts a new grace
> period. If we don't do it, then cleanup_srcu_struct() will race with
> that grace period.

To me, this is only a partial explanation.

The goal of synchronize_srcu() is to wait for the completion of a
*previous* grace period: one that might have observed the old state of
the list.

Could you explain the *actual* effects of the misplaced
synchronize_srcu()? If the race _occurs_, what is the side-effect?
On Fri, Jan 15, 2021, jarkko@kernel.org wrote:
> From: Jarkko Sakkinen <jarkko@kernel.org>
>
> The most trivial example of a race condition can be demonstrated with this
> example where mm_list contains just one entry:
>
>         CPU A                           CPU B
>         sgx_release()
>                                         sgx_mmu_notifier_release()
>                                         list_del_rcu()
>         sgx_encl_release()
>                                         synchronize_srcu()
>         cleanup_srcu_struct()
>
> To fix this, call synchronize_srcu() before checking whether mm_list is
> empty in sgx_release().

Why haven't you included the splat that Haitao provided? That would go a
long way to helping answer Boris' question about exactly what is
broken...

> Cc: stable@vger.kernel.org
> Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer")
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Suggested-by: Haitao Huang <haitao.huang@linux.intel.com>
> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
> ---
> v4:
> - Rewrite the commit message.
> - Just change the call order. *_expedited() is out of scope for this
>   bug fix.
> v3: Fine-tuned tags, and added missing change log for v2.
> v2: Switch to synchronize_srcu_expedited().
>
>  arch/x86/kernel/cpu/sgx/driver.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c
> index f2eac41bb4ff..53056345f5f8 100644
> --- a/arch/x86/kernel/cpu/sgx/driver.c
> +++ b/arch/x86/kernel/cpu/sgx/driver.c
> @@ -65,11 +65,16 @@ static int sgx_release(struct inode *inode, struct file *file)
>
>  	spin_unlock(&encl->mm_lock);
>
> +	/*
> +	 * The call is need even if the list empty, because sgx_encl_mmu_notifier_release()
> +	 * could have initiated a new grace period.
> +	 */
> +	synchronize_srcu(&encl->srcu);

I don't think this completely fixes the issue as sgx_release() isn't
guaranteed to trigger cleanup_srcu_struct(), e.g. the reclaimer can also
have a reference to the enclave.

> +
>  	/* The enclave is no longer mapped by any mm. */
>  	if (!encl_mm)
>  		break;
>
> -	synchronize_srcu(&encl->srcu);
>  	mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
>  	kfree(encl_mm);
>  }
> --
> 2.29.2
On Wed, Jan 20, 2021 at 09:34:10AM -0800, Dave Hansen wrote:
> On 1/20/21 6:43 AM, Jarkko Sakkinen wrote:
> >> So why do you need the synchronize_srcu() call when this process sees an
> >> empty mm_list already?
> >>
> >> Thx.
> > The other process, i.e. some process using the enclave, calls
> > list_del_rcu() (and synchronize_srcu()), which starts a new grace
> > period. If we don't do it, then cleanup_srcu_struct() will race with
> > that grace period.
>
> To me, this is only a partial explanation.
>
> The goal of synchronize_srcu() is to wait for the completion of a
> *previous* grace period: one that might have observed the old state of
> the list.
>
> Could you explain the *actual* effects of the misplaced
> synchronize_srcu()? If the race _occurs_, what is the side-effect?

As I haven't been able to reproduce this regression myself, I need to
take a step back and try to reproduce it with Graphene.

WARN_ON()'s trigger inside cleanup_srcu_struct(), which causes a memory
leak since free_percpu() never gets called. If I recall correctly, it
was srcu_readers_active(), but unfortunately I don't have a log
available. Perhaps Haitao could provide us one.

/Jarkko
On Wed, Jan 20, 2021 at 09:35:28AM -0800, Sean Christopherson wrote:
> On Fri, Jan 15, 2021, jarkko@kernel.org wrote:
> > From: Jarkko Sakkinen <jarkko@kernel.org>
> >
> > The most trivial example of a race condition can be demonstrated with this
> > example where mm_list contains just one entry:
> >
> >         CPU A                           CPU B
> >         sgx_release()
> >                                         sgx_mmu_notifier_release()
> >                                         list_del_rcu()
> >         sgx_encl_release()
> >                                         synchronize_srcu()
> >         cleanup_srcu_struct()
> >
> > To fix this, call synchronize_srcu() before checking whether mm_list is
> > empty in sgx_release().
>
> Why haven't you included the splat that Haitao provided? That would go a
> long way to helping answer Boris' question about exactly what is
> broken...

I've lost the klog.

/Jarkko
On 1/20/21 4:29 PM, Jarkko Sakkinen wrote:
> On Wed, Jan 20, 2021 at 09:35:28AM -0800, Sean Christopherson wrote:
>> On Fri, Jan 15, 2021, jarkko@kernel.org wrote:
>>> From: Jarkko Sakkinen <jarkko@kernel.org>
>>>
>>> The most trivial example of a race condition can be demonstrated with this
>>> example where mm_list contains just one entry:
>>>
>>>         CPU A                           CPU B
>>>         sgx_release()
>>>                                         sgx_mmu_notifier_release()
>>>                                         list_del_rcu()
>>>         sgx_encl_release()
>>>                                         synchronize_srcu()
>>>         cleanup_srcu_struct()
>>>
>>> To fix this, call synchronize_srcu() before checking whether mm_list is
>>> empty in sgx_release().
>> Why haven't you included the splat that Haitao provided? That would go a
>> long way to helping answer Boris' question about exactly what is
>> broken...
> I've lost the klog.

Haitao said he thought it was this:

> void cleanup_srcu_struct(struct srcu_struct *ssp)
> {
...
> 	if (WARN_ON(srcu_readers_active(ssp)))
> 		return; /* Just leak it! */

Which would indicate that an 'encl' kref reached 0 while some other
thread was inside a

	idx = srcu_read_lock(&encl->srcu);
	...
	srcu_read_unlock(&encl->srcu, idx);

critical section. A quick audit didn't turn up any obvious suspects,
though.

If that *is* it, it might be nice to try to catch the culprit at
srcu_read_{un}lock() time. If there's ever a 0 refcount at those sites,
it would be nice to spew a warning:

	idx = srcu_read_lock(&encl->srcu);
	WARN_ON(!atomic_read(&encl->refcount.refcount));
	...
	WARN_ON(!atomic_read(&encl->refcount.refcount));
	srcu_read_unlock(&encl->srcu, idx);

at each site.
On Wed, Jan 20, 2021 at 05:19:44PM -0800, Dave Hansen wrote:
> On 1/20/21 4:29 PM, Jarkko Sakkinen wrote:
> > On Wed, Jan 20, 2021 at 09:35:28AM -0800, Sean Christopherson wrote:
> >> On Fri, Jan 15, 2021, jarkko@kernel.org wrote:
> >>> From: Jarkko Sakkinen <jarkko@kernel.org>
> >>>
> >>> The most trivial example of a race condition can be demonstrated with
> >>> this example where mm_list contains just one entry:
> >>>
> >>>         CPU A                           CPU B
> >>>         sgx_release()
> >>>                                         sgx_mmu_notifier_release()
> >>>                                         list_del_rcu()
> >>>         sgx_encl_release()
> >>>                                         synchronize_srcu()
> >>>         cleanup_srcu_struct()
> >>>
> >>> To fix this, call synchronize_srcu() before checking whether mm_list is
> >>> empty in sgx_release().
> >> Why haven't you included the splat that Haitao provided? That would go
> >> a long way to helping answer Boris' question about exactly what is
> >> broken...
> > I've lost the klog.
>
> Haitao said he thought it was this:
>
> > void cleanup_srcu_struct(struct srcu_struct *ssp)
> > {
> ...
> > 	if (WARN_ON(srcu_readers_active(ssp)))
> > 		return; /* Just leak it! */
>
> Which would indicate that an 'encl' kref reached 0 while some other
> thread was inside a
>
> 	idx = srcu_read_lock(&encl->srcu);
> 	...
> 	srcu_read_unlock(&encl->srcu, idx);
>
> critical section. A quick audit didn't turn up any obvious suspects,
> though.
>
> If that *is* it, it might be nice to try to catch the culprit at
> srcu_read_{un}lock() time. If there's ever a 0 refcount at those sites,
> it would be nice to spew a warning:
>
> 	idx = srcu_read_lock(&encl->srcu);
> 	WARN_ON(!atomic_read(&encl->refcount.refcount));
> 	...
> 	WARN_ON(!atomic_read(&encl->refcount.refcount));
> 	srcu_read_unlock(&encl->srcu, idx);
>
> at each site.

The root cause is fully known already and audited.

An mm_list entry is kept up until the process exits *or* until VFS close
happens and sgx_release() executes and removes it.

It's entirely possible that the MMU notifier callback registered by a
process runs while sgx_release() is executing, which causes
list_del_rcu() to happen, unnoticed by sgx_release(). If that empties
the list, cleanup_srcu_struct() is executed unsynchronized in the middle
of an unfinished grace period.

/Jarkko
On 1/21/21 4:55 AM, Jarkko Sakkinen wrote:
> The root cause is fully known already and audited.
>
> An mm_list entry is kept up until the process exits *or* until VFS close
> happens and sgx_release() executes and removes it.
>
> It's entirely possible that the MMU notifier callback registered by a
> process runs while sgx_release() is executing, which causes
> list_del_rcu() to happen, unnoticed by sgx_release().

Just to be clear, are you talking about sgx_mmu_notifier_release()?

According to comments, mmu_notifier->release() can be called "before all
pages are freed." That means it can be called before VMAs are torn down,
which is where the filp->release() is called.

There is also nothing to *stop* a VMA from being torn down or an fd
closed at the point that mmu_notifier->release() is called.

	fd = open("/dev/sgx")	// filp count==1
	mmap(fd)		// filp count==2
	close(fd)		// filp count==1
	munmap()		// filp count==0, sgx_encl_release()
				//   cleanup_srcu_struct(&encl->srcu);
				//   kfree(encl);

Meanwhile, another thread is doing:

	exit()
	  ...
	  exit_mmap()
	    ...
	    mmu_notifier_release()

which is touching 'encl'. Shouldn't a kref be held on 'encl' to ensure
it isn't referenced after it is freed?

> If that empties the list, cleanup_srcu_struct() is executed
> unsynchronized in the middle of an unfinished grace period.

> +	/*
> +	 * Each sgx_mmu_notifier_release() starts a grace period. Therefore, an
> +	 * additional sync is required here.
> +	 */

Except that sgx_mmu_notifier_release() doesn't do any srcu locking. How
does sgx_mmu_notifier_release() start a new grace period?
On 1/20/21 9:35 AM, Sean Christopherson wrote:
> Why haven't you included the splat that Haitao provided? That would go a
> long way to helping answer Boris' question about exactly what is
> broken...

The bad news is that the original splat seems to be lost.

The good news is that this is hard to reproduce and *might* not occur on
what got merged in mainline.

We're going to circle back around and make sure we have a clean
reproduction before we try to fix this again.
On Wed, 20 Jan 2021 18:26:56 -0600, Jarkko Sakkinen <jarkko@kernel.org> wrote:
> On Wed, Jan 20, 2021 at 09:34:10AM -0800, Dave Hansen wrote:
>> On 1/20/21 6:43 AM, Jarkko Sakkinen wrote:
>>>> So why do you need the synchronize_srcu() call when this process
>>>> sees an empty mm_list already?
>>>>
>>>> Thx.
>>> The other process, i.e. some process using the enclave, calls
>>> list_del_rcu() (and synchronize_srcu()), which starts a new grace
>>> period. If we don't do it, then cleanup_srcu_struct() will race with
>>> that grace period.
>>
>> To me, this is only a partial explanation.
>>
>> The goal of synchronize_srcu() is to wait for the completion of a
>> *previous* grace period: one that might have observed the old state of
>> the list.
>>
>> Could you explain the *actual* effects of the misplaced
>> synchronize_srcu()? If the race _occurs_, what is the side-effect?
>
> As I haven't been able to reproduce this regression myself, I need to
> take a step back and try to reproduce it with Graphene.
>
> WARN_ON()'s trigger inside cleanup_srcu_struct(), which causes a memory
> leak since free_percpu() never gets called. If I recall correctly, it
> was srcu_readers_active(), but unfortunately I don't have a log
> available.
>
> Perhaps Haitao could provide us one.
>
> /Jarkko

Sorry, I lost those logs too, as our email server automatically cleans
up old emails. I have been re-running the tests but have not been able
to reproduce the same issue.

Haitao
On Fri, Jan 22, 2021 at 08:56:44AM -0800, Dave Hansen wrote:
> On 1/20/21 9:35 AM, Sean Christopherson wrote:
> > Why haven't you included the splat that Haitao provided? That would go a
> > long way to helping answer Boris' question about exactly what is
> > broken...
>
> The bad news is that the original splat seems to be lost.
>
> The good news is that this is hard to reproduce and *might* not occur on
> what got merged in mainline.
>
> We're going to circle back around and make sure we have a clean
> reproduction before we try to fix this again.

Yeah, this is a good opportunity to level up the QA.

/Jarkko
Haitao managed to create another splat over the weekend. It was, indeed:

> WARNING: CPU: 3 PID: 7620 at kernel/rcu/srcutree.c:374 cleanup_srcu_struct+0xed/0x100

which is:

> 	if (WARN_ON(!srcu_get_delay(ssp)))
> 		return; /* Just leak it! */

That check means that there is an outstanding "expedited" grace period.
The fact that it's expedited is not important. This:

	https://lwn.net/Articles/202847/

describes the reasoning behind the warning:

	If the struct srcu_struct is dynamically allocated, then
	cleanup_srcu_struct() must be called before it is freed ... the
	caller must take care to ensure that all SRCU read-side critical
	sections have completed (and that no more will commence) before
	calling cleanup_srcu_struct().

synchronize_srcu() will (obviously) wait for the grace period to
complete. Calling it will shut up the warning for sure, most of the
time.

The required sequence of events is in here:

> https://lore.kernel.org/lkml/1492472726-3841-4-git-send-email-paulmck@linux.vnet.ibm.com/

I suspect that the mmu notifier's synchronize_srcu() is run in parallel
very close to when cleanup_srcu_struct() is called. This violates the
"prevent any further calls to synchronize_srcu" rule.

So, while I suspect that adding a synchronize_srcu() is *part* of the
correct solution, I'm still not convinced that the
sgx_mmu_notifier_release() code is correct.
On Mon, Jan 25, 2021 at 07:49:04AM -0800, Dave Hansen wrote:
> Haitao managed to create another splat over the weekend. It was, indeed:
>
> > WARNING: CPU: 3 PID: 7620 at kernel/rcu/srcutree.c:374 cleanup_srcu_struct+0xed/0x100
>
> which is:
>
> > 	if (WARN_ON(!srcu_get_delay(ssp)))
> > 		return; /* Just leak it! */
>
> That check means that there is an outstanding "expedited" grace period.
> The fact that it's expedited is not important. This:
>
> 	https://lwn.net/Articles/202847/
>
> describes the reasoning behind the warning:
>
> 	If the struct srcu_struct is dynamically allocated, then
> 	cleanup_srcu_struct() must be called before it is freed ... the
> 	caller must take care to ensure that all SRCU read-side critical
> 	sections have completed (and that no more will commence) before
> 	calling cleanup_srcu_struct().
>
> synchronize_srcu() will (obviously) wait for the grace period to
> complete. Calling it will shut up the warning for sure, most of the
> time.
>
> The required sequence of events is in here:
>
> > https://lore.kernel.org/lkml/1492472726-3841-4-git-send-email-paulmck@linux.vnet.ibm.com/

I'll add a "Link:" tag for this to the final fix. It's a good reference.

> I suspect that the mmu notifier's synchronize_srcu() is run in parallel
> very close to when cleanup_srcu_struct() is called. This violates the
> "prevent any further calls to synchronize_srcu" rule.
>
> So, while I suspect that adding a synchronize_srcu() is *part* of the
> correct solution, I'm still not convinced that the
> sgx_mmu_notifier_release() code is correct.

What Sean suggested in private discussion, i.e. using kref_get() in the
MMU notifier, AFAIK should be enough to do the necessary serialization.

/Jarkko
diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c
index f2eac41bb4ff..53056345f5f8 100644
--- a/arch/x86/kernel/cpu/sgx/driver.c
+++ b/arch/x86/kernel/cpu/sgx/driver.c
@@ -65,11 +65,16 @@ static int sgx_release(struct inode *inode, struct file *file)
 
 		spin_unlock(&encl->mm_lock);
 
+		/*
+		 * The call is need even if the list empty, because sgx_encl_mmu_notifier_release()
+		 * could have initiated a new grace period.
+		 */
+		synchronize_srcu(&encl->srcu);
+
 		/* The enclave is no longer mapped by any mm. */
 		if (!encl_mm)
 			break;
 
-		synchronize_srcu(&encl->srcu);
 		mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
 		kfree(encl_mm);
 	}