Message ID: 20250227083717.4307-1-liuhangbin@gmail.com
Series: bond: fix xfrm offload issues
On Thu, Feb 27, 2025 at 03:31:01PM +0200, Nikolay Aleksandrov wrote:
> >> One more thing - note I'm not an xfrm expert by far, but it seems to me
> >> here you have to also call xdo_dev_state_free() with the old active slave
> >> dev, otherwise that will never get called with the original real_dev after
> >> the switch to a new active slave (or, more accurately, it might if the GC
> >> runs between the switching, but that is a race). Care must be taken wrt
> >> the sequence of events, because the XFRM
> >
> > Can we just call xs->xso.real_dev->xfrmdev_ops->xdo_dev_state_free(xs)
> > no matter whether xs->xso.real_dev == real_dev or not? I'm afraid calling
> > xdo_dev_state_free() everywhere may get us lost more easily.
> >
>
> You'd have to check all drivers that implement the callback to answer that,
> and even then I'd stick to the canonical way it's done in xfrm and make the
> bond just a passthrough. Any other games become dangerous and new code will
> have to be carefully reviewed every time; calling another device's free_sa
> when it wasn't added there before doesn't sound good.
>
> >> The GC may be running in parallel, which probably means that in
> >> bond_ipsec_free_sa() you'll have to take the mutex before calling
> >> xdo_dev_state_free() and check whether the entry is still linked in the
> >> bond's ipsec list before calling the free_sa callback. If it isn't, then
> >> del_sa_all got to it before the GC and there's nothing to do if it also
> >> called the dev's free_sa callback. The check for real_dev doesn't seem
> >> enough to protect against this race.
> >
> > I agree that we need to take the mutex before calling xdo_dev_state_free()
> > in bond_ipsec_free_sa(). Do you think that is enough? I'm a bit lost here.
> >
> > Thanks
> > Hangbin
>
> Well, the race is between the xfrm GC and del_sa_all. In bond's free_sa, if
> you walk the list under the mutex before calling real_dev's free callback
> and don't find the element that's being freed, then it was cleaned up by
> del_sa_all; otherwise del_sa_all is waiting to walk that list and clean up
> the entries. I think it should be fine as long as free_sa was called once
> with the proper device.

OK, so the free will be called either in del_sa_all() or free_sa().
Something like this?

@@ -620,6 +614,16 @@ static void bond_ipsec_del_sa_all(struct bonding *bond)
 		if (!ipsec->xs->xso.real_dev)
 			continue;

+		if (ipsec->xs->km.state == XFRM_STATE_DEAD) {
+			/* already dead, no need to delete again */
+			if (real_dev->xfrmdev_ops->xdo_dev_state_free)
+				real_dev->xfrmdev_ops->xdo_dev_state_free(ipsec->xs);
+			list_del(&ipsec->list);
+			kfree(ipsec);
+			continue;
+		}
+
 		if (!real_dev->xfrmdev_ops ||
 		    !real_dev->xfrmdev_ops->xdo_dev_state_delete ||
 		    netif_is_bond_master(real_dev)) {
@@ -659,11 +664,22 @@ static void bond_ipsec_free_sa(struct xfrm_state *xs)
 	if (!xs->xso.real_dev)
 		goto out;

-	WARN_ON(xs->xso.real_dev != real_dev);
+	mutex_lock(&bond->ipsec_lock);
+	list_for_each_entry(ipsec, &bond->ipsec_list, list) {
+		if (ipsec->xs == xs) {
+			if (real_dev && xs->xso.real_dev == real_dev &&

^^ It looks like we don't need this xs->xso.real_dev == real_dev check if
there is no race, do we? Or should we just keep the WARN_ON() in case of any
race?

+			    real_dev->xfrmdev_ops &&
+			    real_dev->xfrmdev_ops->xdo_dev_state_free)
+				real_dev->xfrmdev_ops->xdo_dev_state_free(xs);
+			list_del(&ipsec->list);
+			kfree(ipsec);
+			break;
+		}
+	}
+	mutex_unlock(&bond->ipsec_lock);

-	if (real_dev && real_dev->xfrmdev_ops &&
-	    real_dev->xfrmdev_ops->xdo_dev_state_free)
-		real_dev->xfrmdev_ops->xdo_dev_state_free(xs);
 out:
 	netdev_put(real_dev, &tracker);
 }
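To make the interleaving concrete, here is a sketch (a reviewer's aid, not
part of the patch, using the names from the diff above) of the GC vs
del_sa_all race that walking the list under the mutex is meant to close:

/*
 * Sketch of the race closed by the list walk under ipsec_lock in
 * bond_ipsec_free_sa():
 *
 *   xfrm GC (free path)               del_sa_all (active slave switch)
 *   -------------------               --------------------------------
 *   bond_ipsec_free_sa(xs)
 *   mutex_lock(&ipsec_lock) blocks    mutex_lock(&ipsec_lock)
 *                                     frees xs via real_dev's
 *                                     xdo_dev_state_free(), then
 *                                     list_del() + kfree() of the entry
 *                                     mutex_unlock(&ipsec_lock)
 *   walk list: xs not found
 *   -> skip the free callback, so xs is not freed twice
 *   mutex_unlock(&ipsec_lock)
 */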
On Fri, 2025-02-28 at 02:20 +0000, Hangbin Liu wrote:
> On Thu, Feb 27, 2025 at 03:31:01PM +0200, Nikolay Aleksandrov wrote:
> [...]
> OK, so the free will be called either in del_sa_all() or free_sa().
> Something like this?
> [...]

Unfortunately, after applying these changes and reasoning about them for a
bit, I don't think this will work. There are still races left. For example:

1. An xs is marked DEAD (in __xfrm_state_delete, with x->lock held) and,
before .xdo_dev_state_delete() is called on it, bond_ipsec_del_sa_all is
called in parallel. It doesn't call delete on xs (because it's dead), then
calls free (incorrect without delete first), then removes the list entry.
Later, xdo_dev_state_delete() (== bond_ipsec_del_sa) is called and calls
delete (incorrect, out of order with free). Finally, bond_ipsec_free_sa is
called, which fortunately doesn't do anything silly in the newly proposed
form, because xs is no longer in the list.

2. A more sinister form of the above race can happen when
bond_ipsec_del_sa_all() calls delete on real_dev, then, in parallel and
immediately after, __xfrm_state_delete marks xs as DEAD and calls
bond_ipsec_del_sa(), which happily calls delete on real_dev again.

In order to fix these races (and others like them), I think
bond_ipsec_del_sa_all and bond_ipsec_add_sa_all *need* to acquire x->lock
for each xs being processed. This would prevent xfrm from concurrently
initiating add/delete operations on the managed states.

Cosmin.
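For illustration, a minimal sketch of what this proposal could look like in
bond_ipsec_del_sa_all(), assuming the upstream names (bond->ipsec_lock,
bond->ipsec_list, struct bond_ipsec) and with the real_dev lookup and error
handling elided; a sketch of the idea, not the eventual patch:

static void bond_ipsec_del_sa_all_sketch(struct bonding *bond,
					 struct net_device *real_dev)
{
	struct bond_ipsec *ipsec;

	mutex_lock(&bond->ipsec_lock);
	list_for_each_entry(ipsec, &bond->ipsec_list, list) {
		struct xfrm_state *xs = ipsec->xs;

		/* Hold x->lock so __xfrm_state_delete() cannot concurrently
		 * mark the state DEAD and invoke bond_ipsec_del_sa() while
		 * we are calling down to real_dev.
		 */
		spin_lock_bh(&xs->lock);
		if (xs->km.state != XFRM_STATE_DEAD &&
		    real_dev->xfrmdev_ops &&
		    real_dev->xfrmdev_ops->xdo_dev_state_delete)
			real_dev->xfrmdev_ops->xdo_dev_state_delete(xs);
		spin_unlock_bh(&xs->lock);
	}
	mutex_unlock(&bond->ipsec_lock);
}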
On 2/28/25 13:07, Nikolay Aleksandrov wrote:
> On 2/28/25 12:31, Cosmin Ratiu wrote:
>> [...]
>> Unfortunately, after applying these changes and reasoning about them
>> for a bit, I don't think this will work. There are still races left.
>> [...]
>> In order to fix these races (and others like them), I think
>> bond_ipsec_del_sa_all and bond_ipsec_add_sa_all *need* to acquire
>> x->lock for each xs being processed. This would prevent xfrm from
>> concurrently initiating add/delete operations on the managed states.
>>
>> Cosmin.
>
> Duh, right you are. The state is protected by x->lock and cannot be
> trusted outside of it. If you take x->lock inside the list walk with
> the mutex held, you can deadlock.
>
> Cheers,
> Nik

Correction - I actually took a closer look at the xfrm code and it should
be fine. The x->lock is taken only in the delete path, and as long as the
mutex is not acquired by bond's del_sa callback it should be OK. Though
this must be very well documented.
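For reference, the lock ordering the thread converges on can be summarized
as follows (a sketch based on the discussion above; the invariant in the
last paragraph is exactly what must be well documented):

/*
 * Lock ordering between xfrm and the bond ipsec list:
 *
 *   xfrm delete path:  spin_lock_bh(&x->lock)       [__xfrm_state_delete]
 *                        -> bond_ipsec_del_sa()     (takes no ipsec_lock)
 *
 *   bond del_sa_all:   mutex_lock(&bond->ipsec_lock)
 *                        -> spin_lock_bh(&x->lock)  (per state)
 *
 * There is no ABBA inversion only because bond_ipsec_del_sa() never
 * acquires bond->ipsec_lock; taking the mutex from del_sa would deadlock
 * against del_sa_all, which holds it while taking x->lock.
 */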