
[net-next,1/3] flow_offload: allow user to offload tc action to net device

Message ID 20210722091938.12956-2-simon.horman@corigine.com
State New
Series flow_offload: hardware offload of TC actions

Commit Message

Simon Horman July 22, 2021, 9:19 a.m. UTC
From: Baowen Zheng <baowen.zheng@corigine.com>

Use flow_indr_dev_register/flow_indr_dev_setup_offload to
offload tc actions.

We offload tc actions mainly for OVS meter configuration.
Make some basic changes so that different vendors' drivers return
EOPNOTSUPP for offload requests they do not handle.

We need to call tc_cleanup_flow_action to clean up the tc action entry
since, in tc_setup_action, some actions may hold a dev refcnt, especially
the mirror action.

As noted in review of the RFC, the kernel test robot reported failures,
so we add CONFIG_NET_CLS_ACT guards around the action offload code.

Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c  |  2 +-
 .../ethernet/mellanox/mlx5/core/en/rep/tc.c   |  3 ++
 .../ethernet/netronome/nfp/flower/offload.c   |  3 ++
 include/linux/netdevice.h                     |  1 +
 include/net/flow_offload.h                    | 15 ++++++++
 include/net/pkt_cls.h                         | 15 ++++++++
 net/core/flow_offload.c                       | 26 +++++++++++++-
 net/sched/act_api.c                           | 33 +++++++++++++++++
 net/sched/cls_api.c                           | 36 ++++++++++++++++---
 9 files changed, 128 insertions(+), 6 deletions(-)
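For illustration, the per-driver change takes roughly the following
shape (a minimal sketch based on the mlx5 hunk quoted in the review
below; the foo_* names are hypothetical):

	static int foo_indr_setup_cb(struct net_device *netdev, struct Qdisc *sch,
				     void *cb_priv, enum tc_setup_type type,
				     void *type_data, void *data,
				     void (*cleanup)(struct flow_block_cb *block_cb))
	{
		/* Action offload requests arrive without a backing netdev,
		 * so drivers that do not support them bail out early.
		 */
		if (!netdev)
			return -EOPNOTSUPP;

		switch (type) {
		case TC_SETUP_BLOCK:
			return foo_indr_setup_block(netdev, sch, cb_priv, type_data,
						    data, cleanup);
		default:
			return -EOPNOTSUPP;
		}
	}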

Comments

Simon Horman July 27, 2021, 1:04 p.m. UTC | #1
On Thu, Jul 22, 2021 at 09:33:09AM -0400, Jamal Hadi Salim wrote:
> On 2021-07-22 9:29 a.m., Vlad Buslov wrote:

> > On Thu 22 Jul 2021 at 12:19, Simon Horman <simon.horman@corigine.com> wrote:

> > > From: Baowen Zheng <baowen.zheng@corigine.com>

> > > 

> > > Use flow_indr_dev_register/flow_indr_dev_setup_offload to

> > > offload tc action.

> > > 

> > > We offload the tc action mainly for ovs meter configuration.

> > > Make some basic changes for different vendors to return EOPNOTSUPP.

> > > 

> > > We need to call tc_cleanup_flow_action to clean up tc action entry since

> > > in tc_setup_action, some actions may hold dev refcnt, especially the mirror

> > > action.

> > > 

> > > As per review from the RFC, the kernel test robot will fail to run, so

> > > we add CONFIG_NET_CLS_ACT control for the action offload.

> > > 

> > > Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>

> > > Signed-off-by: Louis Peens <louis.peens@corigine.com>

> > > Signed-off-by: Simon Horman <simon.horman@corigine.com>

> > > ---

> > >   drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c  |  2 +-

> > >   .../ethernet/mellanox/mlx5/core/en/rep/tc.c   |  3 ++

> 

> > >   			    void *data,

> > >   			    void (*cleanup)(struct flow_block_cb *block_cb))

> > >   {

> > > +	if (!netdev)

> > > +		return -EOPNOTSUPP;

> > > +

> > >   	switch (type) {

> > >   	case TC_SETUP_BLOCK:

> > >   		return mlx5e_rep_indr_setup_block(netdev, sch, cb_priv, type_data,

> > > diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c

> 

> [..]

> 

> > > +	/* offload actions to hardware if possible */

> > > +	tcf_action_offload_cmd(actions, extack);

> > > +

> > 

> > I think this has already been suggested for RFC, but some sort of

> > visibility for offload status of action would be extremely welcome.

> > Perhaps "IN_HW" flag and counter, similar to what we have for offloaded

> > filters.

> > 

> 

> Also showing a tc command line in the cover letter on how one would

> ask for a specific action to be offloaded.


In practice actions are offloaded when a flow using them is offloaded.
So I think we need to consider what the meaning of IN_HW is.

Is it that:

* The driver (and potentially hardware, though not in our current
  implementation) has accepted the action for offload;
* That a classifier that uses the action has been offloaded;
* Or something else?

With regards to a counter, I'm not quite sure what this would be:

* The number of devices where the action has been offloaded (which ties
  into the question of what we mean by IN_HW)
* The number of offloaded classifier instances using the action
* Something else

Regarding a flag to control offload:

* For classifiers (at least the flower classifier) there is the skip_sw and
  skip_hw flags, which allow control of placement of a classifier in SW and
  HW.
* We could add similar flags for actions, which at least in my
  world view would have the net-effect of controlling which classifiers can
  be added to sw and hw - f.e. a classifier that uses an action marked
  skip_hw could not be added to HW.
* Doing so would add some extra complexity and it's not immediately apparent
  to me what the use-case would be, given that there are already flags for
  classifiers.
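
(For reference, the existing per-filter control mentioned above looks
like the following usage sketch; the device name and match are
illustrative:

	tc filter add dev eth0 ingress protocol ip flower skip_hw \
		dst_ip 192.0.2.1 action drop

A filter added with skip_hw stays software-only; skip_sw is the
converse.)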
Vlad Buslov July 27, 2021, 2:38 p.m. UTC | #2
On Tue 27 Jul 2021 at 16:04, Simon Horman <simon.horman@corigine.com> wrote:
> On Thu, Jul 22, 2021 at 09:33:09AM -0400, Jamal Hadi Salim wrote:

>> On 2021-07-22 9:29 a.m., Vlad Buslov wrote:

>> > On Thu 22 Jul 2021 at 12:19, Simon Horman <simon.horman@corigine.com> wrote:

>> > > From: Baowen Zheng <baowen.zheng@corigine.com>

>> > > 

>> > > Use flow_indr_dev_register/flow_indr_dev_setup_offload to

>> > > offload tc action.

>> > > 

>> > > We offload the tc action mainly for ovs meter configuration.

>> > > Make some basic changes for different vendors to return EOPNOTSUPP.

>> > > 

>> > > We need to call tc_cleanup_flow_action to clean up tc action entry since

>> > > in tc_setup_action, some actions may hold dev refcnt, especially the mirror

>> > > action.

>> > > 

>> > > As per review from the RFC, the kernel test robot will fail to run, so

>> > > we add CONFIG_NET_CLS_ACT control for the action offload.

>> > > 

>> > > Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>

>> > > Signed-off-by: Louis Peens <louis.peens@corigine.com>

>> > > Signed-off-by: Simon Horman <simon.horman@corigine.com>

>> > > ---

>> > >   drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c  |  2 +-

>> > >   .../ethernet/mellanox/mlx5/core/en/rep/tc.c   |  3 ++

>> 

>> > >   			    void *data,

>> > >   			    void (*cleanup)(struct flow_block_cb *block_cb))

>> > >   {

>> > > +	if (!netdev)

>> > > +		return -EOPNOTSUPP;

>> > > +

>> > >   	switch (type) {

>> > >   	case TC_SETUP_BLOCK:

>> > >   		return mlx5e_rep_indr_setup_block(netdev, sch, cb_priv, type_data,

>> > > diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c

>> 

>> [..]

>> 

>> > > +	/* offload actions to hardware if possible */

>> > > +	tcf_action_offload_cmd(actions, extack);

>> > > +

>> > 

>> > I think this has already been suggested for RFC, but some sort of

>> > visibility for offload status of action would be extremely welcome.

>> > Perhaps "IN_HW" flag and counter, similar to what we have for offloaded

>> > filters.

>> > 

>> 

>> Also showing a tc command line in the cover letter on how one would

>> ask for a specific action to be offloaded.

>

> In practice actions are offloaded when a flow using them is offloaded.

> So I think we need to consider what the meaning of IN_HW is.

>

> Is it that:

>

> * The driver (and potentially hardware, though not in our current

>   implementation) has accepted the action for offload;

> * That a classifier that uses the action has bee offloaded;

> * Or something else?


I think we have the same issue with filters - they might not be in
hardware after the driver callback returned "success" (due to neigh state
being invalid for tunnel_key encap, for example).

>

> With regards to a counter, I'm not quite sure what this would be:

>

> * The number of devices where the action has been offloaded (which ties

>   into the question of what we mean by IN_HW)

> * The number of offloaded classifier instances using the action

> * Something else


I would prefer to have semantics similar to filters:

1. Count the number of driver callbacks that returned "success".

2. If count > 0, then set in_hw flag.

3. Set in_hw_count to success count.

This would allow the user to immediately determine whether the action
passed driver validation.
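
A minimal sketch of that bookkeeping (the helper and field names here
are hypothetical, not defined by this series):

	/* Hypothetical: derive action offload visibility from the number
	 * of driver callbacks that accepted the action.
	 */
	static void foo_action_hw_update(struct tc_action *act, u32 ok_count)
	{
		act->in_hw_count = ok_count;	/* hypothetical field */
		if (ok_count > 0)
			act->in_hw = true;	/* hypothetical field */
	}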

>

> Regarding a flag to control offload:

>

> * For classifiers (at least the flower classifier) there is the skip_sw and

>   skip_hw flags, which allow control of placement of a classifier in SW and

>   HW.

> * We could add similar flags for actions, which at least in my

>   world view would have the net-effect of controlling which classifiers can

>   be added to sw and hw - f.e. a classifier that uses an action marked

>   skip_hw could not be added to HW.

> * Doing so would add some extra complexity and its not immediately apparent

>   to me what the use-case would be given that there are already flags for

>   classifiers.


Yeah, adding such a flag for action offload seems to complicate things.
Also, the "skip_sw" flag doesn't even make much sense for actions. I thought
that a "skip_hw" flag would be nice to have for users who would like to
avoid "spamming" their NIC drivers (potentially causing higher latency
and resource consumption) with filters/actions they have no intention of
offloading to hardware, but I'm not sure how useful that option really
is.
Jamal Hadi Salim July 27, 2021, 4:13 p.m. UTC | #3
On 2021-07-27 10:38 a.m., Vlad Buslov wrote:
> 

> On Tue 27 Jul 2021 at 16:04, Simon Horman <simon.horman@corigine.com> wrote:


>>>

>>> Also showing a tc command line in the cover letter on how one would

>>> ask for a specific action to be offloaded.

>>

>> In practice actions are offloaded when a flow using them is offloaded.

>> So I think we need to consider what the meaning of IN_HW is.

>>

>> Is it that:

>>

>> * The driver (and potentially hardware, though not in our current

>>    implementation) has accepted the action for offload;

>> * That a classifier that uses the action has bee offloaded;

>> * Or something else?

> 

> I think we have the same issue with filters - they might not be in

> hardware after driver callback returned "success" (due to neigh state

> being invalid for tunnel_key encap, for example).

> 


Sounds like we need another state for this. Otherwise, how do you debug
that something is sitting in the driver and not in hardware after you
issued a command to offload it? How do I tell today?
Also, knowing the reason why something is sitting in the driver would be
helpful.

>> With regards to a counter, I'm not quite sure what this would be:

>>

>> * The number of devices where the action has been offloaded (which ties

>>    into the question of what we mean by IN_HW)

>> * The number of offloaded classifier instances using the action

>> * Something else

> 

> I would prefer to have semantics similar to filters:

> 

> 1. Count number of driver callbacks that returned "success".

> 

> 2. If count > 0, then set in_hw flag.

> 

> 3. Set in_hw_count to success count.

> 

> This would allow user to immediately determine whether action passed

> driver validation.

>


I didn't follow this:
Are we referring to the "block" semantics (where a filter, for
example, applies to multiple devices)?

>>

>> Regarding a flag to control offload:

>>

>> * For classifiers (at least the flower classifier) there is the skip_sw and

>>    skip_hw flags, which allow control of placement of a classifier in SW and

>>    HW.

>> * We could add similar flags for actions, which at least in my

>>    world view would have the net-effect of controlling which classifiers can

>>    be added to sw and hw - f.e. a classifier that uses an action marked

>>    skip_hw could not be added to HW.


I guess it depends on the hardware implementation.
In S/W we have two modes:
Approach A: 1) create an action and then 2) bind it to a filter.
Approach B: 1) create a filter and then 2) bind it to an action.

And #2A can be repeated multiple times for the same action
(this would require some index as a reference for the action).
To Simon's comment above, that would mean allowing
"a classifier that uses an action marked skip_hw to be added to HW",
i.e. some hardware is capable of doing both option #A and #B.

Today's offload assumes #B - in which both filter and action are assumed
offloaded.

I am hoping whatever approach we end up agreeing on doesn't limit
either mode.

>> * Doing so would add some extra complexity and its not immediately apparent

>>    to me what the use-case would be given that there are already flags for

>>    classifiers.

> Yeah, adding such flag for action offload seems to complicate things.

> Also, "skip_sw" flag doesn't even make much sense for actions. I thought

> that "skip_hw" flag would be nice to have for users that would like to

> avoid "spamming" their NIC drivers (potentially causing higher latency

> and resource consumption) for filters/actions they have no intention to

> offload to hardware, but I'm not sure how useful is that option really

> is.


Hold on Vlad.
So you are looking at this mostly as an optimization to speed up h/w
control updates? ;->

I was looking at it more as a (currently missing) feature improvement.
We already have a use case that is implemented by s/w today. The feature
mimics it in h/w.

At minimum, all existing NICs should be able to support the counters
mapped to simple actions like drop. I understand if some, for example,
cannot support separately offloading tunnels.
So the syntax is something along the lines of:

tc actions add action drop index 15 skip_sw
tc filter add dev ...parent ... protocol ip prio X ..\
u32/flower skip_sw match ... flowid 1:10 action gact index 15

You get an error if counter index 15 is not offloaded or
if skip_sw was left out.

And then later on, if you support sharing of actions:
tc filter add dev ...parent ... protocol ip prio X2 ..\
u32/flower skip_sw match ... flowid 1:10 action gact index 15

cheers,
jamal
Vlad Buslov July 27, 2021, 4:47 p.m. UTC | #4
On Tue 27 Jul 2021 at 19:13, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On 2021-07-27 10:38 a.m., Vlad Buslov wrote:

>> On Tue 27 Jul 2021 at 16:04, Simon Horman <simon.horman@corigine.com> wrote:

>

>>>>

>>>> Also showing a tc command line in the cover letter on how one would

>>>> ask for a specific action to be offloaded.

>>>

>>> In practice actions are offloaded when a flow using them is offloaded.

>>> So I think we need to consider what the meaning of IN_HW is.

>>>

>>> Is it that:

>>>

>>> * The driver (and potentially hardware, though not in our current

>>>    implementation) has accepted the action for offload;

>>> * That a classifier that uses the action has bee offloaded;

>>> * Or something else?

>> I think we have the same issue with filters - they might not be in

>> hardware after driver callback returned "success" (due to neigh state

>> being invalid for tunnel_key encap, for example).

>> 

>

> Sounds like we need another state for this. Otherwise, how do you debug

> that something is sitting in the driver and not in hardware after you

> issued a command to offload it? How do i tell today?

> Also knowing reason why something is sitting in the driver would be

> helpful.


It is not just about adding another state. The issue is that there is no
way for drivers to change the state of a software filter dynamically.

>

>>> With regards to a counter, I'm not quite sure what this would be:

>>>

>>> * The number of devices where the action has been offloaded (which ties

>>>    into the question of what we mean by IN_HW)

>>> * The number of offloaded classifier instances using the action

>>> * Something else

>> I would prefer to have semantics similar to filters:

>> 1. Count number of driver callbacks that returned "success".

>> 2. If count > 0, then set in_hw flag.

>> 3. Set in_hw_count to success count.

>> This would allow user to immediately determine whether action passed

>> driver validation.

>>

>

> I didnt follow this:

> Are we refering to the the "block" semantics (where a filter for

> example applies to multiple devices)?


This uses the indirect offload infrastructure, which means all drivers
in flow_block_indr_dev_list will receive action offload requests.
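
(For background, a driver joins that list with flow_indr_dev_register();
a minimal sketch, with the foo_* names hypothetical:

	/* Register to receive indirect block/action offload requests. */
	err = flow_indr_dev_register(foo_indr_setup_cb, foo_priv);
	if (err)
		return err;

	/* ... and on teardown: */
	flow_indr_dev_unregister(foo_indr_setup_cb, foo_priv, foo_release);
)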

>

>>>

>>> Regarding a flag to control offload:

>>>

>>> * For classifiers (at least the flower classifier) there is the skip_sw and

>>>    skip_hw flags, which allow control of placement of a classifier in SW and

>>>    HW.

>>> * We could add similar flags for actions, which at least in my

>>>    world view would have the net-effect of controlling which classifiers can

>>>    be added to sw and hw - f.e. a classifier that uses an action marked

>>>    skip_hw could not be added to HW.

>

> I guess it depends on the hardware implementation.

> In S/W we have two modes:

> Approach A: create an action and then 2) bind it to a filter.

> Approach B: Create a filter and then bind it to an action.

>

> And #2A can be repeated multiple times for the same action

> (would require some index as a reference for the action)

> To Simon's comment above that would mean allowing

> "a classifier that uses an action marked skip_hw to be added to HW"

> i.e

> Some hardware is capable of doing both option #A and #B.

>

> Todays offload assumes #B - in which both filter and action are assumed

> offloaded.

>

> I am hoping whatever approach we end up agreeing on doesnt limit

> either mode.

>

>>> * Doing so would add some extra complexity and its not immediately apparent

>>>    to me what the use-case would be given that there are already flags for

>>>    classifiers.

>> Yeah, adding such flag for action offload seems to complicate things.

>> Also, "skip_sw" flag doesn't even make much sense for actions. I thought

>> that "skip_hw" flag would be nice to have for users that would like to

>> avoid "spamming" their NIC drivers (potentially causing higher latency

>> and resource consumption) for filters/actions they have no intention to

>> offload to hardware, but I'm not sure how useful is that option really

>> is.

>

> Hold on Vlad.

> So you are looking at this mostly as an optimization to speed up h/w

> control updates? ;->


No. How would adding more flags improve h/w update rate? I was just
thinking that it is strange that users who are not interested in
offloads would suddenly have higher memory usage for their actions just
because they happen to have an offload-capable driver loaded. But it is not
a major concern for me.

>

> I was looking at it more as a (currently missing) feature improvement.

> We already have a use case that is implemented by s/w today. The feature

> mimics it in h/w.

>

> At minimal all existing NICs should be able to support the counters

> as mapped to simple actions like drop. I understand for example if some

> cant support adding separately offloading of tunnels for example.

> So the syntax is something along the lines of:

>

> tc actions add action drop index 15 skip_sw

> tc filter add dev ...parent ... protocol ip prio X ..\

> u32/flower skip_sw match ... flowid 1:10 action gact index 15

>

> You get an error if counter index 15 is not offloaded or

> if skip_sw was left out..

>

> And then later on, if you support sharing of actions:

> tc filter add dev ...parent ... protocol ip prio X2 ..\

> u32/flower skip_sw match ... flowid 1:10 action gact index 15

>

> cheers,

> jamal
Simon Horman July 28, 2021, 7:46 a.m. UTC | #5
On Tue, Jul 27, 2021 at 07:47:43PM +0300, Vlad Buslov wrote:
> On Tue 27 Jul 2021 at 19:13, Jamal Hadi Salim <jhs@mojatatu.com> wrote:

> > On 2021-07-27 10:38 a.m., Vlad Buslov wrote:

> >> On Tue 27 Jul 2021 at 16:04, Simon Horman <simon.horman@corigine.com> wrote:

> >

> >>>>

> >>>> Also showing a tc command line in the cover letter on how one would

> >>>> ask for a specific action to be offloaded.

> >>>

> >>> In practice actions are offloaded when a flow using them is offloaded.

> >>> So I think we need to consider what the meaning of IN_HW is.

> >>>

> >>> Is it that:

> >>>

> >>> * The driver (and potentially hardware, though not in our current

> >>>    implementation) has accepted the action for offload;

> >>> * That a classifier that uses the action has bee offloaded;

> >>> * Or something else?

> >> I think we have the same issue with filters - they might not be in

> >> hardware after driver callback returned "success" (due to neigh state

> >> being invalid for tunnel_key encap, for example).

> >> 

> >

> > Sounds like we need another state for this. Otherwise, how do you debug

> > that something is sitting in the driver and not in hardware after you

> > issued a command to offload it? How do i tell today?

> > Also knowing reason why something is sitting in the driver would be

> > helpful.

> 

> It is not about just adding another state. The issue is that there is no

> way for drivers to change the state of software filter dynamically.


I think it might be worth considering enhancing things at some point.
But I agree that it's more than a matter of adding an extra flag. And
I think it's reasonable to implement something similar to the classifiers'
current IN_HW offload handling now and consider enhancements separately.

> >>> With regards to a counter, I'm not quite sure what this would be:

> >>>

> >>> * The number of devices where the action has been offloaded (which ties

> >>>    into the question of what we mean by IN_HW)

> >>> * The number of offloaded classifier instances using the action

> >>> * Something else

> >> I would prefer to have semantics similar to filters:

> >> 1. Count number of driver callbacks that returned "success".

> >> 2. If count > 0, then set in_hw flag.

> >> 3. Set in_hw_count to success count.

> >> This would allow user to immediately determine whether action passed

> >> driver validation.


Thanks, that makes sense to me.

> > I didnt follow this:

> > Are we refering to the the "block" semantics (where a filter for

> > example applies to multiple devices)?

> 

> This uses indirect offload infrastructure, which means all drivers

> in flow_block_indr_dev_list will receive action offload requests.

> 

> >>> Regarding a flag to control offload:

> >>>

> >>> * For classifiers (at least the flower classifier) there is the skip_sw and

> >>>    skip_hw flags, which allow control of placement of a classifier in SW and

> >>>    HW.

> >>> * We could add similar flags for actions, which at least in my

> >>>    world view would have the net-effect of controlling which classifiers can

> >>>    be added to sw and hw - f.e. a classifier that uses an action marked

> >>>    skip_hw could not be added to HW.

> >

> > I guess it depends on the hardware implementation.

> > In S/W we have two modes:

> > Approach A: create an action and then 2) bind it to a filter.

> > Approach B: Create a filter and then bind it to an action.

> >

> > And #2A can be repeated multiple times for the same action

> > (would require some index as a reference for the action)

> > To Simon's comment above that would mean allowing

> > "a classifier that uses an action marked skip_hw to be added to HW"

> > i.e

> > Some hardware is capable of doing both option #A and #B.

> >

> > Todays offload assumes #B - in which both filter and action are assumed

> > offloaded.

> >

> > I am hoping whatever approach we end up agreeing on doesnt limit

> > either mode.

> >

> >>> * Doing so would add some extra complexity and its not immediately apparent

> >>>    to me what the use-case would be given that there are already flags for

> >>>    classifiers.

> >> Yeah, adding such flag for action offload seems to complicate things.

> >> Also, "skip_sw" flag doesn't even make much sense for actions. I thought

> >> that "skip_hw" flag would be nice to have for users that would like to

> >> avoid "spamming" their NIC drivers (potentially causing higher latency

> >> and resource consumption) for filters/actions they have no intention to

> >> offload to hardware, but I'm not sure how useful is that option really

> >> is.

> >

> > Hold on Vlad.

> > So you are looking at this mostly as an optimization to speed up h/w

> > control updates? ;->

> 

> No. How would adding more flags improve h/w update rate? I was just

> thinking that it is strange that users that are not interested in

> offloads would suddenly have higher memory usage for their actions just

> because they happen to have offload-capable driver loaded. But it is not

> a major concern for me.


In that case can we rely on the global tc-offload on/off flag
provided by ethtool? (I understand it's not the same, but perhaps
it is sufficient in practice.)

> > I was looking at it more as a (currently missing) feature improvement.

> > We already have a use case that is implemented by s/w today. The feature

> > mimics it in h/w.

> >

> > At minimal all existing NICs should be able to support the counters

> > as mapped to simple actions like drop. I understand for example if some

> > cant support adding separately offloading of tunnels for example.

> > So the syntax is something along the lines of:

> >

> > tc actions add action drop index 15 skip_sw

> > tc filter add dev ...parent ... protocol ip prio X ..\

> > u32/flower skip_sw match ... flowid 1:10 action gact index 15

> >

> > You get an error if counter index 15 is not offloaded or

> > if skip_sw was left out..

> >

> > And then later on, if you support sharing of actions:

> > tc filter add dev ...parent ... protocol ip prio X2 ..\

> > u32/flower skip_sw match ... flowid 1:10 action gact index 15


Right, I understand that makes sense and is internally consistent.
But I think that in practice it only makes a difference for "Approach B"
implementations, none of which currently exist.

I would suggest we add this when the need arises, rather than
speculatively without hw/driver support. It's not precluded by the current
model AFAIK.
Vlad Buslov July 28, 2021, 8:05 a.m. UTC | #6
On Wed 28 Jul 2021 at 10:46, Simon Horman <simon.horman@corigine.com> wrote:
> On Tue, Jul 27, 2021 at 07:47:43PM +0300, Vlad Buslov wrote:

>> On Tue 27 Jul 2021 at 19:13, Jamal Hadi Salim <jhs@mojatatu.com> wrote:

>> > On 2021-07-27 10:38 a.m., Vlad Buslov wrote:

>> >> On Tue 27 Jul 2021 at 16:04, Simon Horman <simon.horman@corigine.com> wrote:

>> >

>> >>>>

>> >>>> Also showing a tc command line in the cover letter on how one would

>> >>>> ask for a specific action to be offloaded.

>> >>>

>> >>> In practice actions are offloaded when a flow using them is offloaded.

>> >>> So I think we need to consider what the meaning of IN_HW is.

>> >>>

>> >>> Is it that:

>> >>>

>> >>> * The driver (and potentially hardware, though not in our current

>> >>>    implementation) has accepted the action for offload;

>> >>> * That a classifier that uses the action has bee offloaded;

>> >>> * Or something else?

>> >> I think we have the same issue with filters - they might not be in

>> >> hardware after driver callback returned "success" (due to neigh state

>> >> being invalid for tunnel_key encap, for example).

>> >> 

>> >

>> > Sounds like we need another state for this. Otherwise, how do you debug

>> > that something is sitting in the driver and not in hardware after you

>> > issued a command to offload it? How do i tell today?

>> > Also knowing reason why something is sitting in the driver would be

>> > helpful.

>> 

>> It is not about just adding another state. The issue is that there is no

>> way for drivers to change the state of software filter dynamically.

>

> I think it might be worth considering enhancing things at some point.

> But I agree that its more than a matter of adding an extra flag. And

> I think it's reasonable to implement something similar to the classifier

> current offload handling of IN_HW now and consider enhancements separately.

>

>> >>> With regards to a counter, I'm not quite sure what this would be:

>> >>>

>> >>> * The number of devices where the action has been offloaded (which ties

>> >>>    into the question of what we mean by IN_HW)

>> >>> * The number of offloaded classifier instances using the action

>> >>> * Something else

>> >> I would prefer to have semantics similar to filters:

>> >> 1. Count number of driver callbacks that returned "success".

>> >> 2. If count > 0, then set in_hw flag.

>> >> 3. Set in_hw_count to success count.

>> >> This would allow user to immediately determine whether action passed

>> >> driver validation.

>

> Thanks, that makes sense to me.

>

>> > I didnt follow this:

>> > Are we refering to the the "block" semantics (where a filter for

>> > example applies to multiple devices)?

>> 

>> This uses indirect offload infrastructure, which means all drivers

>> in flow_block_indr_dev_list will receive action offload requests.

>> 

>> >>> Regarding a flag to control offload:

>> >>>

>> >>> * For classifiers (at least the flower classifier) there is the skip_sw and

>> >>>    skip_hw flags, which allow control of placement of a classifier in SW and

>> >>>    HW.

>> >>> * We could add similar flags for actions, which at least in my

>> >>>    world view would have the net-effect of controlling which classifiers can

>> >>>    be added to sw and hw - f.e. a classifier that uses an action marked

>> >>>    skip_hw could not be added to HW.

>> >

>> > I guess it depends on the hardware implementation.

>> > In S/W we have two modes:

>> > Approach A: create an action and then 2) bind it to a filter.

>> > Approach B: Create a filter and then bind it to an action.

>> >

>> > And #2A can be repeated multiple times for the same action

>> > (would require some index as a reference for the action)

>> > To Simon's comment above that would mean allowing

>> > "a classifier that uses an action marked skip_hw to be added to HW"

>> > i.e

>> > Some hardware is capable of doing both option #A and #B.

>> >

>> > Todays offload assumes #B - in which both filter and action are assumed

>> > offloaded.

>> >

>> > I am hoping whatever approach we end up agreeing on doesnt limit

>> > either mode.

>> >

>> >>> * Doing so would add some extra complexity and its not immediately apparent

>> >>>    to me what the use-case would be given that there are already flags for

>> >>>    classifiers.

>> >> Yeah, adding such flag for action offload seems to complicate things.

>> >> Also, "skip_sw" flag doesn't even make much sense for actions. I thought

>> >> that "skip_hw" flag would be nice to have for users that would like to

>> >> avoid "spamming" their NIC drivers (potentially causing higher latency

>> >> and resource consumption) for filters/actions they have no intention to

>> >> offload to hardware, but I'm not sure how useful is that option really

>> >> is.

>> >

>> > Hold on Vlad.

>> > So you are looking at this mostly as an optimization to speed up h/w

>> > control updates? ;->

>> 

>> No. How would adding more flags improve h/w update rate? I was just

>> thinking that it is strange that users that are not interested in

>> offloads would suddenly have higher memory usage for their actions just

>> because they happen to have offload-capable driver loaded. But it is not

>> a major concern for me.

>

> In that case can we rely on the global tc-offload on/off flag

> provided by ethtool? (I understand its not the same, but perhaps

> it is sufficient in practice.)


Yes, the ethtool flag should be sufficient. I didn't think about it
initially. Thanks!

>

>> > I was looking at it more as a (currently missing) feature improvement.

>> > We already have a use case that is implemented by s/w today. The feature

>> > mimics it in h/w.

>> >

>> > At minimal all existing NICs should be able to support the counters

>> > as mapped to simple actions like drop. I understand for example if some

>> > cant support adding separately offloading of tunnels for example.

>> > So the syntax is something along the lines of:

>> >

>> > tc actions add action drop index 15 skip_sw

>> > tc filter add dev ...parent ... protocol ip prio X ..\

>> > u32/flower skip_sw match ... flowid 1:10 action gact index 15

>> >

>> > You get an error if counter index 15 is not offloaded or

>> > if skip_sw was left out..

>> >

>> > And then later on, if you support sharing of actions:

>> > tc filter add dev ...parent ... protocol ip prio X2 ..\

>> > u32/flower skip_sw match ... flowid 1:10 action gact index 15

>

> Right, I understand that makes sense and is internally consistent.

> But I think that in practice it only makes a difference "Approach B"

> implementations, none of which currently exist.

>

> I would suggest we can add this when the need arises, rather than

> speculatively without hw/driver support. Its not precluded by the current

> model AFAIK.
Jamal Hadi Salim July 28, 2021, 1:51 p.m. UTC | #7
On 2021-07-28 3:46 a.m., Simon Horman wrote:
> On Tue, Jul 27, 2021 at 07:47:43PM +0300, Vlad Buslov wrote:

>> On Tue 27 Jul 2021 at 19:13, Jamal Hadi Salim <jhs@mojatatu.com> wrote:

>>> On 2021-07-27 10:38 a.m., Vlad Buslov wrote:

>>>> On Tue 27 Jul 2021 at 16:04, Simon Horman <simon.horman@corigine.com> wrote:


[..]

>>>> I think we have the same issue with filters - they might not be in

>>>> hardware after driver callback returned "success" (due to neigh state

>>>> being invalid for tunnel_key encap, for example).

>>>>

>>>

>>> Sounds like we need another state for this. Otherwise, how do you debug

>>> that something is sitting in the driver and not in hardware after you

>>> issued a command to offload it? How do i tell today?

>>> Also knowing reason why something is sitting in the driver would be

>>> helpful.

>>

>> It is not about just adding another state. The issue is that there is no

>> way for drivers to change the state of software filter dynamically.

> 

> I think it might be worth considering enhancing things at some point.

> But I agree that its more than a matter of adding an extra flag. And

> I think it's reasonable to implement something similar to the classifier

> current offload handling of IN_HW now and consider enhancements separately.

> 


Debuggability is very important. If we have such gotchas, we need the
admin to at least be able to tell if the driver returned "success"
but the request is still sitting in the driver for whatever reason.
At minimum there needs to be some indicator somewhere which says
"in progress" or "waiting for resolution", etc.
If the control plane (user-space app) starts making other decisions
based on the assumption that the filter was successfully installed, i.e.
packets are being treated in the hardware, then there could be
consequences when this assumption is wrong.

So if I understood the challenge correctly, it is: how do you relay
this info back so it is reflected in the filter details? Yes, that
would require some mechanism to exist, and possibly mapping state
between what's in the driver and in the cls layer.
If I am not mistaken, the switchdev folks handle this asynchronicity?
+Cc Ido, Jiri, Roopa

And it should be noted: yes, the filters have this
pre-existing condition, but that doesn't mean that, given the opportunity
with actions, we should replicate what they do.

[..]

> 

>>> I didnt follow this:

>>> Are we refering to the the "block" semantics (where a filter for

>>> example applies to multiple devices)?

>>

>> This uses indirect offload infrastructure, which means all drivers

>> in flow_block_indr_dev_list will receive action offload requests.

>>


Ok, understood.

[..]

>>

>> No. How would adding more flags improve h/w update rate? I was just

>> thinking that it is strange that users that are not interested in

>> offloads would suddenly have higher memory usage for their actions just

>> because they happen to have offload-capable driver loaded. But it is not

>> a major concern for me.

> 

> In that case can we rely on the global tc-offload on/off flag

> provided by ethtool? (I understand its not the same, but perhaps

> it is sufficient in practice.)

> 


ok.
So: I think I have seen what is probably the spamming referred to,
with the Intel (800?) driver ;-> Basically the driver was reacting to
all filters regardless of whether they needed to be offloaded or not.
I thought it was an oversight on their part and the driver needed
fixing. Are we invoking the offload regardless of whether h/w offload
is requested? In my naive view - at least when I looked at the Intel
code - it didn't seem hard to avoid the spamming.


>>> I was looking at it more as a (currently missing) feature improvement.

>>> We already have a use case that is implemented by s/w today. The feature

>>> mimics it in h/w.

>>>

>>> At minimal all existing NICs should be able to support the counters

>>> as mapped to simple actions like drop. I understand for example if some

>>> cant support adding separately offloading of tunnels for example.

>>> So the syntax is something along the lines of:

>>>

>>> tc actions add action drop index 15 skip_sw

>>> tc filter add dev ...parent ... protocol ip prio X ..\

>>> u32/flower skip_sw match ... flowid 1:10 action gact index 15

>>>

>>> You get an error if counter index 15 is not offloaded or

>>> if skip_sw was left out..

>>>

>>> And then later on, if you support sharing of actions:

>>> tc filter add dev ...parent ... protocol ip prio X2 ..\

>>> u32/flower skip_sw match ... flowid 1:10 action gact index 15

> 

> Right, I understand that makes sense and is internally consistent.

> But I think that in practice it only makes a difference "Approach B"

> implementations, none of which currently exist.

> 


At minimum:
Shouldn't counters (easily correlated to basic actions like drop or
accept) fit the scenario of:
tc actions add action drop index 15 skip_sw
tc filter add dev ...parent ... protocol ip prio X .. \
u32/flower skip_sw match ... flowid 1:10 action gact index 15

?

> I would suggest we can add this when the need arises, rather than

> speculatively without hw/driver support. Its not precluded by the current

> model AFAIK.

> 


We are going to work on a driver that would take the "B" approach.
I am hoping - whatever the consensus here - it doesn't require
surgery afterwards to make that work.

cheers,
jamal
Simon Horman July 28, 2021, 2:46 p.m. UTC | #8
On Wed, Jul 28, 2021 at 09:51:00AM -0400, Jamal Hadi Salim wrote:
> On 2021-07-28 3:46 a.m., Simon Horman wrote:

> > On Tue, Jul 27, 2021 at 07:47:43PM +0300, Vlad Buslov wrote:

> > > On Tue 27 Jul 2021 at 19:13, Jamal Hadi Salim <jhs@mojatatu.com> wrote:

> > > > On 2021-07-27 10:38 a.m., Vlad Buslov wrote:

> > > > > On Tue 27 Jul 2021 at 16:04, Simon Horman <simon.horman@corigine.com> wrote:

> 

> [..]

> 

> > > > > I think we have the same issue with filters - they might not be in

> > > > > hardware after driver callback returned "success" (due to neigh state

> > > > > being invalid for tunnel_key encap, for example).

> > > > 

> > > > Sounds like we need another state for this. Otherwise, how do you debug

> > > > that something is sitting in the driver and not in hardware after you

> > > > issued a command to offload it? How do i tell today?

> > > > Also knowing reason why something is sitting in the driver would be

> > > > helpful.

> > > 

> > > It is not about just adding another state. The issue is that there is no

> > > way for drivers to change the state of software filter dynamically.

> > 

> > I think it might be worth considering enhancing things at some point.

> > But I agree that its more than a matter of adding an extra flag. And

> > I think it's reasonable to implement something similar to the classifier

> > current offload handling of IN_HW now and consider enhancements separately.

> 

> Debugability is very important. If we have such gotchas we need to have

> the admin at least be able to tell if the driver returns "success"

> and the request is still sitting in the driver for whatever reason

> At minimal there needs to be some indicator somewhere which say

> "inprogress" or "waiting for resolution" etc.

> If the control plane(user space app) starts making other decisions

> based on assumptions that filter was successfully installed i.e

> packets are being treated in the hardware then there could be

> consequences when this assumption is wrong.

> 

> So if i undestood the challenge correctly it is: how do you relay

> this info back so it is reflected in the filter details. Yes that

> would require some mechanism to exist and possibly mapping state

> between whats in the driver and in the cls layer.

> If i am not mistaken, the switchdev folks handle this asynchronicty?

> +Cc Ido, Jiri, Roopa

> 

> And it should be noted that: Yes, the filters have this

> pre-existing condition but doesnt mean given the opportunity

> to do actions we should replicate what they do.


I'd prefer symmetry between the use of IN_HW for filters and actions,
which I believe is what Vlad has suggested.

If we wish to enhance things - f.e. for debugging, which I
agree is important - then I think that is a separate topic.

> [..]

> 

> > 

> > > > I didnt follow this:

> > > > Are we refering to the the "block" semantics (where a filter for

> > > > example applies to multiple devices)?

> > > 

> > > This uses indirect offload infrastructure, which means all drivers

> > > in flow_block_indr_dev_list will receive action offload requests.

> > > 

> 

> Ok, understood.

> 

> [..]

> 

> > > 

> > > No. How would adding more flags improve h/w update rate? I was just

> > > thinking that it is strange that users that are not interested in

> > > offloads would suddenly have higher memory usage for their actions just

> > > because they happen to have offload-capable driver loaded. But it is not

> > > a major concern for me.

> > 

> > In that case can we rely on the global tc-offload on/off flag

> > provided by ethtool? (I understand its not the same, but perhaps

> > it is sufficient in practice.)

> > 

> 

> ok.

> So: I think i have seen this what is probably the spamming refered

> with the intel (800?) driver ;-> Basically driver was reacting to

> all filters regardless of need to offload or not.

> I thought it was an oversight on their part and the driver needed

> fixing. Are we invoking the offload regardless of whether h/w offload

> is requested? In my naive view - at least when i looked at the intel

> code - it didnt seem hard to avoid the spamming.


There is a per-netdev (not global as I wrote above) flag to enable and
disable offload. And there is a per-classifier skip_hw flag. I can dig
through the code as easily as you can, but I'd be surprised if the
driver is seeing offload requests if either of those settings
is in effect.
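
(The per-netdev flag is the hw-tc-offload feature, toggled with e.g.:

	ethtool -K eth0 hw-tc-offload off

where eth0 is illustrative.)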

> > > > I was looking at it more as a (currently missing) feature improvement.

> > > > We already have a use case that is implemented by s/w today. The feature

> > > > mimics it in h/w.

> > > > 

> > > > At minimal all existing NICs should be able to support the counters

> > > > as mapped to simple actions like drop. I understand for example if some

> > > > cant support adding separately offloading of tunnels for example.

> > > > So the syntax is something along the lines of:

> > > > 

> > > > tc actions add action drop index 15 skip_sw

> > > > tc filter add dev ...parent ... protocol ip prio X ..\

> > > > u32/flower skip_sw match ... flowid 1:10 action gact index 15

> > > > 

> > > > You get an error if counter index 15 is not offloaded or

> > > > if skip_sw was left out..

> > > > 

> > > > And then later on, if you support sharing of actions:

> > > > tc filter add dev ...parent ... protocol ip prio X2 ..\

> > > > u32/flower skip_sw match ... flowid 1:10 action gact index 15

> > 

> > Right, I understand that makes sense and is internally consistent.

> > But I think that in practice it only makes a difference "Approach B"

> > implementations, none of which currently exist.

> > 

> 

> At minimal:

> Shouldnt counters (easily correlated to basic actions like drop or

> accept) fit the scenario of:

> tc actions add action drop index 15 skip_sw

> tc filter add dev ...parent ... protocol ip prio X .. \

> u32/flower skip_sw match ... flowid 1:10 action gact index 15

> 

> ?

> 

> > I would suggest we can add this when the need arises, rather than

> > speculatively without hw/driver support. Its not precluded by the current

> > model AFAIK.

> > 

> 

> We are going to work on a driver that would have the "B" approach.

> I am hoping - whatever the consensus here - it doesnt require a

> surgery afterwards to make that work.


You should be able to build on the work proposed here to add what you
suggest into the framework to meet these requirements for your driver work.
Jamal Hadi Salim July 30, 2021, 10:17 a.m. UTC | #9
On 2021-07-28 10:46 a.m., Simon Horman wrote:
> On Wed, Jul 28, 2021 at 09:51:00AM -0400, Jamal Hadi Salim wrote:

>> On 2021-07-28 3:46 a.m., Simon Horman wrote:

>>> On Tue, Jul 27, 2021 at 07:47:43PM +0300, Vlad Buslov wrote:

>>>> On Tue 27 Jul 2021 at 19:13, Jamal Hadi Salim <jhs@mojatatu.com> wrote:

>>>>> On 2021-07-27 10:38 a.m., Vlad Buslov wrote:

>>>>>> On Tue 27 Jul 2021 at 16:04, Simon Horman <simon.horman@corigine.com> wrote:

>>

>> [..]

>>

>>>>>> I think we have the same issue with filters - they might not be in

>>>>>> hardware after driver callback returned "success" (due to neigh state

>>>>>> being invalid for tunnel_key encap, for example).

>>>>>

>>>>> Sounds like we need another state for this. Otherwise, how do you debug

>>>>> that something is sitting in the driver and not in hardware after you

>>>>> issued a command to offload it? How do i tell today?

>>>>> Also knowing reason why something is sitting in the driver would be

>>>>> helpful.

>>>>

>>>> It is not about just adding another state. The issue is that there is no

>>>> way for drivers to change the state of software filter dynamically.

>>>

>>> I think it might be worth considering enhancing things at some point.

>>> But I agree that its more than a matter of adding an extra flag. And

>>> I think it's reasonable to implement something similar to the classifier

>>> current offload handling of IN_HW now and consider enhancements separately.

>>

>> Debugability is very important. If we have such gotchas we need to have

>> the admin at least be able to tell if the driver returns "success"

>> and the request is still sitting in the driver for whatever reason

>> At minimal there needs to be some indicator somewhere which say

>> "inprogress" or "waiting for resolution" etc.

>> If the control plane(user space app) starts making other decisions

>> based on assumptions that filter was successfully installed i.e

>> packets are being treated in the hardware then there could be

>> consequences when this assumption is wrong.

>>

>> So if i undestood the challenge correctly it is: how do you relay

>> this info back so it is reflected in the filter details. Yes that

>> would require some mechanism to exist and possibly mapping state

>> between whats in the driver and in the cls layer.

>> If i am not mistaken, the switchdev folks handle this asynchronicty?

>> +Cc Ido, Jiri, Roopa

>>

>> And it should be noted that: Yes, the filters have this

>> pre-existing condition but doesnt mean given the opportunity

>> to do actions we should replicate what they do.

> 

> I'd prefer symmetry between the use of IN_HW for filters and actions,

> which I believe is what Vlad has suggested.

> 


It's still not clear to me what it means from a command-line PoV.
How do I add a rule, and when I dump it, what does it show?

> If we wish to enhance things - f.e. for debugging, which I

> agree is important - then I think that is a separate topic.

> 


My only concern is not to repeat mistakes that are in filters
just for the sake of symmetry. For example, the fact that something
went wrong with insertion, or insertion is still in progress,
yet you get an indication that all went well.
Looking at mlnx (NIC) drivers it does seem that in the normal case
the insertion into hw is synchronous (for anything that is not sw
only). I didn't quite see what Vlad was referring to.
We have spent literally hours debugging issues where rules are being
offloaded, thinking it was the driver, so any extra info helps.


>>>>

>>>> No. How would adding more flags improve h/w update rate? I was just

>>>> thinking that it is strange that users that are not interested in

>>>> offloads would suddenly have higher memory usage for their actions just

>>>> because they happen to have offload-capable driver loaded. But it is not

>>>> a major concern for me.

>>>

>>> In that case can we rely on the global tc-offload on/off flag

>>> provided by ethtool? (I understand its not the same, but perhaps

>>> it is sufficient in practice.)

>>>

>>

>> ok.

>> So: I think i have seen this what is probably the spamming refered

>> with the intel (800?) driver ;-> Basically driver was reacting to

>> all filters regardless of need to offload or not.

>> I thought it was an oversight on their part and the driver needed

>> fixing. Are we invoking the offload regardless of whether h/w offload

>> is requested? In my naive view - at least when i looked at the intel

>> code - it didnt seem hard to avoid the spamming.

> 

> There is a per-netdev (not global as I wrote above) flag to enable and

> disable offload. And there is per-classifier skip_hw flag. I can dig

> through the code as easily as you can but I'd be surprised if the

> driver is seeing offload requests if either of those settings

> are in effect.

> 


For sure it was the 800 (the ice driver, if you want to dig into the code).
I think the ethtool flag was turned on but skip_sw was not set in the
policy. I don't have the card installed in any board right now - either
way I will go and dig into the logs (because it spews the message into
the logs). Again, not clear if this is what Vlad was calling spam.

>>>>> I was looking at it more as a (currently missing) feature improvement.

>>>>> We already have a use case that is implemented by s/w today. The feature

>>>>> mimics it in h/w.

>>>>>

>>>>> At minimal all existing NICs should be able to support the counters

>>>>> as mapped to simple actions like drop. I understand for example if some

>>>>> cant support adding separately offloading of tunnels for example.

>>>>> So the syntax is something along the lines of:

>>>>>

>>>>> tc actions add action drop index 15 skip_sw

>>>>> tc filter add dev ...parent ... protocol ip prio X ..\

>>>>> u32/flower skip_sw match ... flowid 1:10 action gact index 15

>>>>>

>>>>> You get an error if counter index 15 is not offloaded or

>>>>> if skip_sw was left out..

>>>>>

>>>>> And then later on, if you support sharing of actions:

>>>>> tc filter add dev ...parent ... protocol ip prio X2 ..\

>>>>> u32/flower skip_sw match ... flowid 1:10 action gact index 15

>>>

>>> Right, I understand that makes sense and is internally consistent.

>>> But I think that in practice it only makes a difference "Approach B"

>>> implementations, none of which currently exist.

>>>

>>

>> At minimal:

>> Shouldnt counters (easily correlated to basic actions like drop or

>> accept) fit the scenario of:

>> tc actions add action drop index 15 skip_sw

>> tc filter add dev ...parent ... protocol ip prio X .. \

>> u32/flower skip_sw match ... flowid 1:10 action gact index 15

>>

>> ?

>>

>>> I would suggest we can add this when the need arises, rather than

>>> speculatively without hw/driver support. Its not precluded by the current

>>> model AFAIK.

>>>

>>

>> We are going to work on a driver that would have the "B" approach.

>> I am hoping - whatever the consensus here - it doesnt require a

>> surgery afterwards to make that work.

> 

> You should be able to build on the work proposed here to add what you

> suggest into the framework to meet these requirements for your driver work.

> 


Then we are good. Are these the same patches you have here?

cheers,
jamal
Vlad Buslov July 30, 2021, 11:40 a.m. UTC | #10
On Fri 30 Jul 2021 at 13:17, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On 2021-07-28 10:46 a.m., Simon Horman wrote:

>> On Wed, Jul 28, 2021 at 09:51:00AM -0400, Jamal Hadi Salim wrote:

>>> On 2021-07-28 3:46 a.m., Simon Horman wrote:

>>>> On Tue, Jul 27, 2021 at 07:47:43PM +0300, Vlad Buslov wrote:

>>>>> On Tue 27 Jul 2021 at 19:13, Jamal Hadi Salim <jhs@mojatatu.com> wrote:

>>>>>> On 2021-07-27 10:38 a.m., Vlad Buslov wrote:

>>>>>>> On Tue 27 Jul 2021 at 16:04, Simon Horman <simon.horman@corigine.com> wrote:

>>>

>>> [..]

>>>

>>>>>>> I think we have the same issue with filters - they might not be in

>>>>>>> hardware after driver callback returned "success" (due to neigh state

>>>>>>> being invalid for tunnel_key encap, for example).

>>>>>>

>>>>>> Sounds like we need another state for this. Otherwise, how do you debug

>>>>>> that something is sitting in the driver and not in hardware after you

>>>>>> issued a command to offload it? How do i tell today?

>>>>>> Also knowing reason why something is sitting in the driver would be

>>>>>> helpful.

>>>>>

>>>>> It is not about just adding another state. The issue is that there is no

>>>>> way for drivers to change the state of software filter dynamically.

>>>>

>>>> I think it might be worth considering enhancing things at some point.

>>>> But I agree that its more than a matter of adding an extra flag. And

>>>> I think it's reasonable to implement something similar to the classifier

>>>> current offload handling of IN_HW now and consider enhancements separately.

>>>

>>> Debugability is very important. If we have such gotchas we need to have

>>> the admin at least be able to tell if the driver returns "success"

>>> and the request is still sitting in the driver for whatever reason

>>> At minimal there needs to be some indicator somewhere which say

>>> "inprogress" or "waiting for resolution" etc.

>>> If the control plane(user space app) starts making other decisions

>>> based on assumptions that filter was successfully installed i.e

>>> packets are being treated in the hardware then there could be

>>> consequences when this assumption is wrong.

>>>

>>> So if i undestood the challenge correctly it is: how do you relay

>>> this info back so it is reflected in the filter details. Yes that

>>> would require some mechanism to exist and possibly mapping state

>>> between whats in the driver and in the cls layer.

>>> If i am not mistaken, the switchdev folks handle this asynchronicty?

>>> +Cc Ido, Jiri, Roopa

>>>

>>> And it should be noted that: Yes, the filters have this

>>> pre-existing condition but doesnt mean given the opportunity

>>> to do actions we should replicate what they do.

>> I'd prefer symmetry between the use of IN_HW for filters and actions,

>> which I believe is what Vlad has suggested.

>> 

>

> It still not clear to me what it means from a command line pov.

> How do i add a rule and when i dump it what does it show?

>

>> If we wish to enhance things - f.e. for debugging, which I

>> agree is important - then I think that is a separate topic.

>> 

>

> My only concern is not to repeat mistakes that are in filters

> just for the sake of symmetry. Example the fact that something

> went wrong with insertion or insertion is still in progress

> and you get an indication that all went well.

> Looking at mlnx (NIC) ndrivers it does seem that in the normal case

> the insertion into hw is synchronous (for anything that is not sw

> only). I didnt quiet see what Vlad was referring to.


Filters with tunnel_key encap actions can be offloaded/unoffloaded
dynamically based on neigh state (see mlx5e_rep_neigh_update()) and fib
events (see mlx5e_tc_fib_event_work()).

[...]
Simon Horman July 30, 2021, 1:20 p.m. UTC | #11
On Fri, Jul 30, 2021 at 06:17:18AM -0400, Jamal Hadi Salim wrote:
> On 2021-07-28 10:46 a.m., Simon Horman wrote:

> > On Wed, Jul 28, 2021 at 09:51:00AM -0400, Jamal Hadi Salim wrote:

> > > On 2021-07-28 3:46 a.m., Simon Horman wrote:

> > > > On Tue, Jul 27, 2021 at 07:47:43PM +0300, Vlad Buslov wrote:

> > > > > On Tue 27 Jul 2021 at 19:13, Jamal Hadi Salim <jhs@mojatatu.com> wrote:

> > > > > > On 2021-07-27 10:38 a.m., Vlad Buslov wrote:

> > > > > > > On Tue 27 Jul 2021 at 16:04, Simon Horman <simon.horman@corigine.com> wrote:

> > > 

> > > [..]

> > > 

> > > > > > > I think we have the same issue with filters - they might not be in

> > > > > > > hardware after the driver callback returned "success" (due to neigh state

> > > > > > > being invalid for tunnel_key encap, for example).

> > > > > >

> > > > > > Sounds like we need another state for this. Otherwise, how do you debug

> > > > > > that something is sitting in the driver and not in hardware after you

> > > > > > issued a command to offload it? How do I tell today?

> > > > > > Also, knowing the reason why something is sitting in the driver would be

> > > > > > helpful.

> > > > >

> > > > > It is not about just adding another state. The issue is that there is no

> > > > > way for drivers to change the state of the software filter dynamically.

> > > >

> > > > I think it might be worth considering enhancing things at some point.

> > > > But I agree that it's more than a matter of adding an extra flag. And

> > > > I think it's reasonable to implement something similar to the classifier's

> > > > current offload handling of IN_HW now and consider enhancements separately.

> > >

> > > Debuggability is very important. If we have such gotchas we need to have

> > > the admin at least be able to tell if the driver returns "success"

> > > and the request is still sitting in the driver for whatever reason.

> > > At a minimum there needs to be some indicator somewhere which says

> > > "in progress" or "waiting for resolution" etc.

> > > If the control plane (user space app) starts making other decisions

> > > based on the assumption that the filter was successfully installed, i.e.

> > > packets are being treated in the hardware, then there could be

> > > consequences when this assumption is wrong.

> > >

> > > So if I understood the challenge correctly it is: how do you relay

> > > this info back so it is reflected in the filter details. Yes, that

> > > would require some mechanism to exist and possibly mapping state

> > > between what's in the driver and in the cls layer.

> > > If I am not mistaken, the switchdev folks handle this asynchronicity?

> > > +Cc Ido, Jiri, Roopa

> > >

> > > And it should be noted that: Yes, the filters have this

> > > pre-existing condition, but that doesn't mean that, given the opportunity

> > > to do actions, we should replicate what they do.

> > 

> > I'd prefer symmetry between the use of IN_HW for filters and actions,

> > which I believe is what Vlad has suggested.

> 

> It's still not clear to me what it means from a command line pov.

> How do i add a rule and when i dump it what does it show?


How about we confirm that once we've implemented the feature.

But I would assume that:

* Existing methods for adding rules work as before
* When one dumps an action (in a sufficiently verbose
  way) the in_hw and in_hw_counter fields are displayed as they are for
  filters.

Does that help?
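
As a sketch (hypothetical output, modelled on the in_hw/in_hw_count
fields iproute2 already prints for offloaded filters; actions do not
print them today):

---
# tc -s actions get action gact index 15
        action order 0: gact action drop
         index 15 ref 1 bind 0 installed 10 sec used 10 sec
        skip_sw in_hw in_hw_count 1
---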

> > If we wish to enhance things - f.e. for debugging, which I

> > agree is important - then I think that is a separate topic.

> > 

> 

> My only concern is not to repeat mistakes that are in filters

> just for the sake of symmetry. For example, the fact that something

> went wrong with insertion, or insertion is still in progress,

> and yet you get an indication that all went well.

> Looking at the mlnx (NIC) drivers it does seem that in the normal case

> the insertion into hw is synchronous (for anything that is not sw

> only). I didn't quite see what Vlad was referring to.

> We have spent literally hours debugging issues where rules are being

> offloaded thinking it was the driver so any extra info helps.


I do think there is value in symmetry between the APIs.
And I don't think doing so moves things in a bad direction.
But a separate discussion is needed on how to
improve debuggability.

...

> > > > > > I was looking at it more as a (currently missing) feature improvement.

> > > > > > We already have a use case that is implemented by s/w today. The feature

> > > > > > mimics it in h/w.

> > > > > > 

> > > > > > At minimal all existing NICs should be able to support the counters

> > > > > > as mapped to simple actions like drop. I understand for example if some

> > > > > > cant support adding separately offloading of tunnels for example.

> > > > > > So the syntax is something along the lines of:

> > > > > > 

> > > > > > tc actions add action drop index 15 skip_sw

> > > > > > tc filter add dev ...parent ... protocol ip prio X ..\

> > > > > > u32/flower skip_sw match ... flowid 1:10 action gact index 15

> > > > > > 

> > > > > > You get an error if counter index 15 is not offloaded or

> > > > > > if skip_sw was left out..

> > > > > > 

> > > > > > And then later on, if you support sharing of actions:

> > > > > > tc filter add dev ...parent ... protocol ip prio X2 ..\

> > > > > > u32/flower skip_sw match ... flowid 1:10 action gact index 15

> > > > 

> > > > Right, I understand that makes sense and is internally consistent.

> > > > But I think that in practice it only makes a difference to "Approach B"

> > > > implementations, none of which currently exist.

> > > > 

> > > 

> > > At a minimum:

> > > Shouldn't counters (easily correlated to basic actions like drop or

> > > accept) fit the scenario of:

> > > tc actions add action drop index 15 skip_sw

> > > tc filter add dev ...parent ... protocol ip prio X .. \

> > > u32/flower skip_sw match ... flowid 1:10 action gact index 15

> > > 

> > > ?

> > > 

> > > > I would suggest we can add this when the need arises, rather than

> > > > speculatively without hw/driver support. It's not precluded by the current

> > > > model AFAIK.

> > > > 

> > > 

> > > We are going to work on a driver that would have the "B" approach.

> > > I am hoping - whatever the consensus here - it doesn't require

> > > surgery afterwards to make that work.

> > 

> > You should be able to build on the work proposed here to add what you

> > suggest into the framework to meet these requirements for your driver work.

> > 

> 

> Then we are good. These are the same patches you have here?


Yes.
Jamal Hadi Salim Aug. 3, 2021, 9:57 a.m. UTC | #12
On 2021-07-30 7:40 a.m., Vlad Buslov wrote:
> On Fri 30 Jul 2021 at 13:17, Jamal Hadi Salim <jhs@mojatatu.com> wrote:

>> On 2021-07-28 10:46 a.m., Simon Horman wrote:


> 

> Filters with tunnel_key encap actions can be offloaded/unoffloaded

> dynamically based on neigh state (see mlx5e_rep_neigh_update()) and fib

> events (see mlx5e_tc_fib_event_work()).

>


Thanks. Will look and compare against the FIB case.

cheers,
jamal
Jamal Hadi Salim Aug. 3, 2021, 10:14 a.m. UTC | #13
On 2021-07-30 9:20 a.m., Simon Horman wrote:
> On Fri, Jul 30, 2021 at 06:17:18AM -0400, Jamal Hadi Salim wrote:

>> On 2021-07-28 10:46 a.m., Simon Horman wrote:


[..]

>> It's still not clear to me what it means from a command line pov.

>> How do i add a rule and when i dump it what does it show?

> 

> How about we confirm that once we've implemented the feature.

> 

> But I would assume that:

> 

> * Existing methods for adding rules work as before

> * When one dumps an action (in a sufficiently verbose

>    way) the in_hw and in_hw_counter fields are displayed as they are for

>    filters.

> 

> Does that help?

> 


I think it would help a lot more to say explicitly what it actually
means in the cover letter from a tc cli pov, since the subject
is about offloading actions _independently_ of filters.
I am assuming you have some more patches on top of these that
will actually make that work.

Example of something you could show is adding a policer,
like so:

tc actions add action ... skip_sw...

then show get or dump showing things in h/w.
And del..
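
Something along these lines, perhaps (hypothetical invocation; the
policer parameters are only illustrative, and skip_sw on an action is
precisely what these patches don't implement yet):

---
tc actions add action police rate 100mbit burst 16k index 1 skip_sw
tc -s actions get action police index 1
tc actions del action police index 1
---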

And I certainly hope that the above works and it is
not meant just for the consumption of some OVS use
case.

>>> If we wish to enhance things - f.e. for debugging, which I

>>> agree is important - then I think that is a separate topic.

>>>

>>

>> My only concern is not to repeat mistakes that are in filters

>> just for the sake of symmetry. For example, the fact that something

>> went wrong with insertion, or insertion is still in progress,

>> and yet you get an indication that all went well.

>> Looking at the mlnx (NIC) drivers it does seem that in the normal case

>> the insertion into hw is synchronous (for anything that is not sw

>> only). I didn't quite see what Vlad was referring to.

>> We have spent literally hours debugging issues where rules are being

>> offloaded thinking it was the driver so any extra info helps.

> 

> I do think there is value in symmetry between the APIs.

> And I don't think doing so moves things in a bad direction.

> But a separate discussion is needed on how to

> improve debuggability.

> 


Fair enough - let's have a separate discussion on debuggability.
Maybe a new thread after reviewing Vlad's pointer.

>>> You should be able to build on the work proposed here to add what you

>>> suggest into the framework to meet these requirements for your driver work.

>>>

>>

>> Then we are good. These are the same patches you have here?

> 

> Yes.


Thanks - will review with that in mind.

cheers,
jamal
Jamal Hadi Salim Aug. 3, 2021, 10:50 a.m. UTC | #14
On 2021-07-22 5:19 a.m., Simon Horman wrote:
> From: Baowen Zheng <baowen.zheng@corigine.com>

> 

> Use flow_indr_dev_register/flow_indr_dev_setup_offload to

> offload tc action.

> 

> We offload the tc action mainly for ovs meter configuration.

> Make some basic changes for different vendors to return EOPNOTSUPP.

> 

> We need to call tc_cleanup_flow_action to clean up tc action entry since

> in tc_setup_action, some actions may hold dev refcnt, especially the mirror

> action.

> 



[..]

> --- a/net/sched/act_api.c

> +++ b/net/sched/act_api.c

> @@ -1060,6 +1060,36 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,

>   	return ERR_PTR(err);

>   }

>   

> +/* offload the tc command after inserted */

> +int tcf_action_offload_cmd(struct tc_action *actions[],

> +			   struct netlink_ext_ack *extack)

> +{

> +	struct flow_offload_action *fl_act;

> +	int err = 0;

> +

> +	fl_act = flow_action_alloc(tcf_act_num_actions(actions));

> +	if (!fl_act)

> +		return -ENOMEM;

> +

> +	fl_act->extack = extack;

> +	err = tc_setup_action(&fl_act->action, actions);

> +	if (err) {

> +		NL_SET_ERR_MSG_MOD(extack,

> +				   "Failed to setup tc actions for offload\n");

> +		goto err_out;

> +	}

> +	fl_act->command = FLOW_ACT_REPLACE;

> +


The fn name is a bit misleading with the _cmd suffix when it is
only targeting one command: REPLACE (and not the other two).
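
Purely as an illustration (not part of the posted patches), the command
could be passed in by the caller - which is roughly where patch 2/3
already appears to be heading with tcf_action_offload_cmd_pre() taking
FLOW_ACT_DESTROY:

---
/* Illustrative sketch, condensed from this patch: let the caller pick
 * the command instead of hard-coding FLOW_ACT_REPLACE.
 */
static int tcf_action_offload(struct tc_action *actions[],
			      enum flow_act_command command,
			      struct netlink_ext_ack *extack)
{
	struct flow_offload_action *fl_act;
	int err;

	fl_act = flow_action_alloc(tcf_act_num_actions(actions));
	if (!fl_act)
		return -ENOMEM;

	fl_act->extack = extack;
	err = tc_setup_action(&fl_act->action, actions);
	if (err)
		goto err_out;

	fl_act->command = command;
	flow_indr_dev_setup_offload(NULL, NULL, TC_SETUP_ACT, fl_act,
				    NULL, NULL);
	tc_cleanup_flow_action(&fl_act->action);

err_out:
	kfree(fl_act);
	return err;
}
---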

cheers,
jamal
Jamal Hadi Salim Aug. 3, 2021, 11:05 a.m. UTC | #15
On 2021-07-22 5:19 a.m., Simon Horman wrote:

Triggered by my observation on 2/3, I went back and looked at this again to
see if we have the same problem with notification in the REPLACE case (I think
we do), but here's another comment:

> +EXPORT_SYMBOL(tcf_action_offload_cmd);

> +

>   /* Returns numbers of initialized actions or negative error. */

>   

>   int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,

> @@ -1514,6 +1544,9 @@ static int tcf_action_add(struct net *net, struct nlattr *nla,

>   		return ret;

>   	ret = tcf_add_notify(net, n, actions, portid, attr_size, extack);

>   

> +	/* offload actions to hardware if possible */

> +	tcf_action_offload_cmd(actions, extack);

> +


Above seems to be unconditional, whether a hw update is requested or not?
The comment says the right thing ("if possible") but shouldn't the code
have performed some sort of skip_sw check?

cheers,
jamal
Simon Horman Aug. 3, 2021, 11:31 a.m. UTC | #16
On Tue, Aug 03, 2021 at 07:05:52AM -0400, Jamal Hadi Salim wrote:
> On 2021-07-22 5:19 a.m., Simon Horman wrote:

> 

> Triggered by my observation on 2/3, I went back and looked at this again to

> see if we have the same problem with notification in the REPLACE case (I think

> we do), but here's another comment:

> 

> > +EXPORT_SYMBOL(tcf_action_offload_cmd);

> > +

> >   /* Returns numbers of initialized actions or negative error. */

> >   int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,

> > @@ -1514,6 +1544,9 @@ static int tcf_action_add(struct net *net, struct nlattr *nla,

> >   		return ret;

> >   	ret = tcf_add_notify(net, n, actions, portid, attr_size, extack);

> > +	/* offload actions to hardware if possible */

> > +	tcf_action_offload_cmd(actions, extack);

> > +

> 

> Above seems to be unconditional, whether a hw update is requested or not?

> The comment says the right thing ("if possible") but shouldn't the code

> have performed some sort of skip_sw check?


Jamal, we are going around in circles.

As we have already discussed, this patchset does not add support for
skip_sw (or skip_hw) for actions.
Simon Horman Aug. 3, 2021, 11:36 a.m. UTC | #17
On Tue, Aug 03, 2021 at 06:14:08AM -0400, Jamal Hadi Salim wrote:
> On 2021-07-30 9:20 a.m., Simon Horman wrote:

> > On Fri, Jul 30, 2021 at 06:17:18AM -0400, Jamal Hadi Salim wrote:

> > > On 2021-07-28 10:46 a.m., Simon Horman wrote:

> 

> [..]

> 

> > > It's still not clear to me what it means from a command line pov.

> > > How do i add a rule and when i dump it what does it show?

> > 

> > How about we confirm that once we've implemented the feature.

> > 

> > But I would assume that:

> > 

> > * Existing methods for adding rules work as before

> > * When one dumps an action (in a sufficiently verbose

> >    way) the in_hw and in_hw_counter fields are displayed as they are for

> >    filters.

> > 

> > Does that help?

> > 

> 

> I think it would help a lot more to say explicitly what it actually

> means in the cover letter from a tc cli pov, since the subject

> is about offloading actions _independently_ of filters.

> I am assuming you have some more patches on top of these that

> will actually make that work.

> 

> Example of something you could show is adding a policer,

> like so:

> 

> tc actions add action ... skip_sw...

> 

> then show get or dump showing things in h/w.

> And del..

> 

> And I certainly hope that the above works and it is

> not meant just for the consumption of some OVS use

> case.


I agree it would be useful to include a tc cli example in the cover letter.
I'll see about making that so in v2.
Jamal Hadi Salim Aug. 3, 2021, 11:45 a.m. UTC | #18
On 2021-08-03 7:36 a.m., Simon Horman wrote:
> On Tue, Aug 03, 2021 at 06:14:08AM -0400, Jamal Hadi Salim wrote:

>> On 2021-07-30 9:20 a.m., Simon Horman wrote:


[..]

>> Example of something you could show is adding a policer,

>> like so:

>>

>> tc actions add action ... skip_sw...

>>

>> then show get or dump showing things in h/w.

>> And del..

>>

>> And I certainly hope that the above works and it is

>> not meant just for the consumption of some OVS use

>> case.

> 

> I agree it would be useful to include a tc cli example in the cover letter.

> I'll see about making that so in v2.

> 


Thanks. That will remove 95% of my commentary.
Check my other comment on the user space notification.
I was just looking at the way classifiers handle things from this
perspective and they don't return to cls_api until completion, so
the notification gets sent after both h/w and software succeed...

cheers,
jamal
Jamal Hadi Salim Aug. 3, 2021, 12:02 p.m. UTC | #19
I just changed the subject line...

On 2021-08-03 5:57 a.m., Jamal Hadi Salim wrote:
> On 2021-07-30 7:40 a.m., Vlad Buslov wrote:
>> On Fri 30 Jul 2021 at 13:17, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>>> On 2021-07-28 10:46 a.m., Simon Horman wrote:
> 
>>
>> Filters with tunnel_key encap actions can be offloaded/unoffloaded
>> dynamically based on neigh state (see mlx5e_rep_neigh_update()) and fib
>> events (see mlx5e_tc_fib_event_work()).
>>
> 
> Thanks. Will look and compare against the FIB case.
> 

So unless I am mistaken, Vlad:
a) there is no way to reflect the details when someone dumps the rules.
b) no notifications are sent to the control plane (user space) when the
neighbor updates are offloaded.

My comments earlier are inspired by debugging tc offload and by this:

https://patches.linaro.org/cover/378345/

cheers,
jamal
Vlad Buslov Aug. 3, 2021, 12:14 p.m. UTC | #20
On Tue 03 Aug 2021 at 15:02, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> I just changed the subject line..

>

> On 2021-08-03 5:57 a.m., Jamal Hadi Salim wrote:

>> On 2021-07-30 7:40 a.m., Vlad Buslov wrote:

>>> On Fri 30 Jul 2021 at 13:17, Jamal Hadi Salim <jhs@mojatatu.com> wrote:

>>>> On 2021-07-28 10:46 a.m., Simon Horman wrote:

>> 

>>>

>>> Filters with tunnel_key encap actions can be offloaded/unoffloaded

>>> dynamically based on neigh state (see mlx5e_rep_neigh_update()) and fib

>>> events (see mlx5e_tc_fib_event_work()).

>>>

>> Thanks. Will look and compare against the FIB case.

>> 

>

> So unless I am mistaken, Vlad:

> a) there is no way to reflect the details when someone dumps the rules.

> b) no notifications are sent to the control plane (user space) when the

> neighbor updates are offloaded.


Correct.

>

> My comments earlier are inspired by debugging tc offload and by this:

>

> https://patches.linaro.org/cover/378345/

>

> cheers,

> jamal
Simon Horman Aug. 3, 2021, 12:31 p.m. UTC | #21
On Tue, Aug 03, 2021 at 07:45:13AM -0400, Jamal Hadi Salim wrote:
> On 2021-08-03 7:36 a.m., Simon Horman wrote:

> > On Tue, Aug 03, 2021 at 06:14:08AM -0400, Jamal Hadi Salim wrote:

> > > On 2021-07-30 9:20 a.m., Simon Horman wrote:

> 

> [..]

> 

> > > Example of something you could show was adding a policer,

> > > like so:

> > > 

> > > tc actions add action ... skip_sw...

> > > 

> > > then show get or dump showing things in h/w.

> > > And del..

> > > 

> > > And i certainly hope that the above works and it is

> > > not meant just for the consumption of some OVS use

> > > case.

> > 

> > I agree it would be useful to include a tc cli example in the cover letter.

> > I'll see about making that so in v2.

> > 

> 

> Thanks. That will remove 95% of my commentary.


:)

> Check my other comment on the user space notification.

> I was just looking at the way classifiers handle things from this

> perspective and they don't return to cls_api until completion, so

> the notification gets sent after both h/w and software succeed...


Thanks, I will look into this. But it would make my life slightly easier if

a) You could be more specific about what portion of cls_api you are
   referring to.
b) You could constrain comments on a topic to a single sub-thread.
Jamal Hadi Salim Aug. 3, 2021, 12:50 p.m. UTC | #22
On 2021-08-03 8:14 a.m., Vlad Buslov wrote:
> 

> On Tue 03 Aug 2021 at 15:02, Jamal Hadi Salim <jhs@mojatatu.com> wrote:


[..]

>>

>> So unless i am mistaken Vlad:

>> a) there is no way to reflect the  details when someone dumps the rules.

>> b) No notifications sent to the control plane (user space) when the

>> neighbor updates are offloaded.

> 

> Correct.

>


Feels like we can adopt the same mechanics. Although, unless I am
misreading, it seems Ido's patches cover a slightly different use
case: not totally synchronous in successfully pushing the rule to
hardware, i.e. it could be sitting somewhere in firmware on its way to
the ASIC (and at least your connectx driver seems to only be
relinquishing control after confirming the update succeeded).
Am I mistaken, Ido?


cheers,
jamal
Jamal Hadi Salim Aug. 3, 2021, 1:01 p.m. UTC | #23
On 2021-08-03 8:31 a.m., Simon Horman wrote:
> On Tue, Aug 03, 2021 at 07:45:13AM -0400, Jamal Hadi Salim wrote:



> Thanks, I will look into this. But it would make my life slightly easier if

> 

> a) You could be more specific about what portion of cls_api you are

>     referring to.

> b) You could constrain comments on a topic to a single sub-thread.

> 


Context: this was on the comment I made on 2/3 here:

-----
-        ret = tcf_del_notify(net, n, actions, portid, attr_size, extack);
+        tcf_action_offload_cmd_pre(actions, FLOW_ACT_DESTROY, extack, &fl_act);
+        ret = tcf_del_notify(net, n, actions, portid, attr_size, extack, &fallback_num);
+        tcf_action_offload_del_post(fl_act, actions, extack, fallback_num);
           if (ret)
               goto err;
----

where a notification goes to user space to say "success" but the hardware
update fails.

If you look at fl_change(), which does the offload, you'll see that it
returns an error on either sw or hw failure (depending on the request).
Notification of success is done in cls_api.c - an example for
creating/replacing is this snippet:

---
         err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,
                               flags, extack);
         if (err == 0) {
                 tfilter_notify(net, skb, n, tp, block, q, parent, fh,
                                RTM_NEWTFILTER, false, rtnl_held);
                 tfilter_put(tp, fh);
                 /* q pointer is NULL for shared blocks */
                 if (q)
                         q->flags &= ~TCQ_F_CAN_BYPASS;
         }
---

cheers,
jamal
Ido Schimmel Aug. 3, 2021, 1:34 p.m. UTC | #24
On Tue, Aug 03, 2021 at 08:50:27AM -0400, Jamal Hadi Salim wrote:
> On 2021-08-03 8:14 a.m., Vlad Buslov wrote:

> > 

> > On Tue 03 Aug 2021 at 15:02, Jamal Hadi Salim <jhs@mojatatu.com> wrote:

> 

> [..]

> 

> > > 

> > > So unless i am mistaken Vlad:

> > > a) there is no way to reflect the  details when someone dumps the rules.

> > > b) No notifications sent to the control plane (user space) when the

> > > neighbor updates are offloaded.

> > 

> > Correct.

> > 

> 

> Feels like we can adopt the same mechanics. Although, unless I am

> misreading, it seems Ido's patches cover a slightly different use

> case: not totally synchronous in successfully pushing the rule to

> hardware, i.e. it could be sitting somewhere in firmware on its way to

> the ASIC (and at least your connectx driver seems to only be

> relinquishing control after confirming the update succeeded).

> Am I mistaken, Ido?


It is simply that all routes are notified from an atomic context, which
means installation to hardware needs to be deferred. But even if we
solve this case, there are other cases that cannot be solved.

For example, in IPv6 routes can be installed in response to RA packets
from softIRQ context. As another example, you can have several routes
with the same key already installed in the kernel, but only the one with
the lowest metric will be installed in hardware. Once it is deleted (for
example, in response to a netdev event), the kernel will try to install
the route with the higher metric. As a user, you want to get a
notification if this operation failed.
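
A concrete way to see the metric case (standard iproute2 commands,
example addresses):

---
# Two routes for the same prefix; only the metric-100 one goes to hw.
ip route add 198.51.100.0/24 via 192.0.2.1 metric 100
ip route add 198.51.100.0/24 via 192.0.2.2 metric 200

# Deleting the preferred route makes the kernel try to install the
# metric-200 route in hardware - with no synchronous error returned
# to the user if that fails.
ip route del 198.51.100.0/24 via 192.0.2.1 metric 100
---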

The same problem exists with tc actions, not sure about classifiers. For
example, we can mirror to a gretap to emulate ERSPAN. The hardware
expects all the headers to be specified, which means the path towards
the remote IP must be resolved in the kernel. Due to FIB/neigh/FDB
events it can obviously change, which means that the validity of the
action in hardware (not the classifier) changes over time with zero
visibility to user space.
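
For reference, such an ERSPAN-style setup looks roughly like this
(illustrative device names and addresses):

---
# Mirror ingress traffic to a gretap device; to offload this, the
# driver must resolve the full encap headers (route, neigh, FDB),
# and their validity can change at any time.
ip link add gretap1 type gretap local 192.0.2.1 remote 198.51.100.1
tc filter add dev swp1 ingress protocol ip flower skip_sw \
        action mirred egress mirror dev gretap1
---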

So I believe that we need to be able to notify user space about the
state of the action. Whether it is in hardware / not in hardware /
failed to be installed to hardware.
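
In other words, something like a per-action tri-state (purely
illustrative; no such enum exists in the kernel today):

---
/* Purely illustrative - not existing kernel code. */
enum example_action_hw_state {
	EXAMPLE_ACT_NOT_IN_HW,	/* software only */
	EXAMPLE_ACT_IN_HW,	/* installed in hardware */
	EXAMPLE_ACT_HW_FAIL,	/* offload attempted and failed */
};
---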

> 

> 

> cheers,

> jamal
Simon Horman Aug. 3, 2021, 2:46 p.m. UTC | #25
On Tue, Aug 03, 2021 at 09:01:45AM -0400, Jamal Hadi Salim wrote:
> On 2021-08-03 8:31 a.m., Simon Horman wrote:

> > On Tue, Aug 03, 2021 at 07:45:13AM -0400, Jamal Hadi Salim wrote:

> 

> 

> > Thanks, I will look into this. But it would make my life slightly easier if

> > 

> > a) You could be more specific about what portion of cls_api you are

> >     referring to.

> > b) You could constrain comments on a topic to a single sub-thread.

> > 

> 

> Context: this was on the comment I made on 2/3 here:

> 

> -----

> -        ret = tcf_del_notify(net, n, actions, portid, attr_size, extack);

> +        tcf_action_offload_cmd_pre(actions, FLOW_ACT_DESTROY, extack, &fl_act);

> +        ret = tcf_del_notify(net, n, actions, portid, attr_size, extack, &fallback_num);

> +        tcf_action_offload_del_post(fl_act, actions, extack, fallback_num);

>           if (ret)

>               goto err;

> ----

> 

> where a notification goes to user space to say "success" but the hardware

> update fails.

> 

> If you look at fl_change(), which does the offload, you'll see that it

> returns an error on either sw or hw failure (depending on the request).

> Notification of success is done in cls_api.c - an example for

> creating/replacing is this snippet:

> 

> ---

>         err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,

>                               flags, extack);

>         if (err == 0) {

>                 tfilter_notify(net, skb, n, tp, block, q, parent, fh,

>                                RTM_NEWTFILTER, false, rtnl_held);

>                 tfilter_put(tp, fh);

>                 /* q pointer is NULL for shared blocks */

>                 if (q)

>                         q->flags &= ~TCQ_F_CAN_BYPASS;

>         }

> ---


Thanks for the context Jamal, much appreciated.
diff mbox series

Patch

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
index 5e4429b14b8c..edbbf7b4df77 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
@@ -1951,7 +1951,7 @@  static int bnxt_tc_setup_indr_cb(struct net_device *netdev, struct Qdisc *sch, v
 				 void *data,
 				 void (*cleanup)(struct flow_block_cb *block_cb))
 {
-	if (!bnxt_is_netdev_indr_offload(netdev))
+	if (!netdev || !bnxt_is_netdev_indr_offload(netdev))
 		return -EOPNOTSUPP;
 
 	switch (type) {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/rep/tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en/rep/tc.c
index 059799e4f483..111daacc4cc3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/rep/tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/rep/tc.c
@@ -486,6 +486,9 @@  int mlx5e_rep_indr_setup_cb(struct net_device *netdev, struct Qdisc *sch, void *
 			    void *data,
 			    void (*cleanup)(struct flow_block_cb *block_cb))
 {
+	if (!netdev)
+		return -EOPNOTSUPP;
+
 	switch (type) {
 	case TC_SETUP_BLOCK:
 		return mlx5e_rep_indr_setup_block(netdev, sch, cb_priv, type_data,
diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index 2406d33356ad..88bbc86347b4 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -1869,6 +1869,9 @@  nfp_flower_indr_setup_tc_cb(struct net_device *netdev, struct Qdisc *sch, void *
 			    void *data,
 			    void (*cleanup)(struct flow_block_cb *block_cb))
 {
+	if (!netdev)
+		return -EOPNOTSUPP;
+
 	if (!nfp_fl_is_netdev_to_offload(netdev))
 		return -EOPNOTSUPP;
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 42f6f866d5f3..b138219baf6f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -923,6 +923,7 @@  enum tc_setup_type {
 	TC_SETUP_QDISC_TBF,
 	TC_SETUP_QDISC_FIFO,
 	TC_SETUP_QDISC_HTB,
+	TC_SETUP_ACT,
 };
 
 /* These structures hold the attributes of bpf state that are being passed
diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
index 69c9eabf8325..26644596fd54 100644
--- a/include/net/flow_offload.h
+++ b/include/net/flow_offload.h
@@ -553,6 +553,21 @@  struct flow_cls_offload {
 	u32 classid;
 };
 
+enum flow_act_command {
+	FLOW_ACT_REPLACE,
+	FLOW_ACT_DESTROY,
+	FLOW_ACT_STATS,
+};
+
+struct flow_offload_action {
+	struct netlink_ext_ack *extack;
+	enum flow_act_command command;
+	struct flow_stats stats;
+	struct flow_action action;
+};
+
+struct flow_offload_action *flow_action_alloc(unsigned int num_actions);
+
 static inline struct flow_rule *
 flow_cls_offload_flow_rule(struct flow_cls_offload *flow_cmd)
 {
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index ec7823921bd2..cd4cf6b10f5d 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -266,6 +266,9 @@  static inline void tcf_exts_put_net(struct tcf_exts *exts)
 	for (; 0; (void)(i), (void)(a), (void)(exts))
 #endif
 
+#define tcf_act_for_each_action(i, a, actions) \
+	for (i = 0; i < TCA_ACT_MAX_PRIO && ((a) = actions[i]); i++)
+
 static inline void
 tcf_exts_stats_update(const struct tcf_exts *exts,
 		      u64 bytes, u64 packets, u64 drops, u64 lastuse,
@@ -536,8 +539,19 @@  tcf_match_indev(struct sk_buff *skb, int ifindex)
 	return ifindex == skb->skb_iif;
 }
 
+#ifdef CONFIG_NET_CLS_ACT
 int tc_setup_flow_action(struct flow_action *flow_action,
 			 const struct tcf_exts *exts);
+#else
+static inline int tc_setup_flow_action(struct flow_action *flow_action,
+				       const struct tcf_exts *exts)
+{
+		return 0;
+}
+#endif
+
+int tc_setup_action(struct flow_action *flow_action,
+		    struct tc_action *actions[]);
 void tc_cleanup_flow_action(struct flow_action *flow_action);
 
 int tc_setup_cb_call(struct tcf_block *block, enum tc_setup_type type,
@@ -558,6 +572,7 @@  int tc_setup_cb_reoffload(struct tcf_block *block, struct tcf_proto *tp,
 			  enum tc_setup_type type, void *type_data,
 			  void *cb_priv, u32 *flags, unsigned int *in_hw_count);
 unsigned int tcf_exts_num_actions(struct tcf_exts *exts);
+unsigned int tcf_act_num_actions(struct tc_action *actions[]);
 
 #ifdef CONFIG_NET_CLS_ACT
 int tcf_qevent_init(struct tcf_qevent *qe, struct Qdisc *sch,
diff --git a/net/core/flow_offload.c b/net/core/flow_offload.c
index 715b67f6c62f..0fa2f75cc9b3 100644
--- a/net/core/flow_offload.c
+++ b/net/core/flow_offload.c
@@ -27,6 +27,27 @@  struct flow_rule *flow_rule_alloc(unsigned int num_actions)
 }
 EXPORT_SYMBOL(flow_rule_alloc);
 
+struct flow_offload_action *flow_action_alloc(unsigned int num_actions)
+{
+	struct flow_offload_action *fl_action;
+	int i;
+
+	fl_action = kzalloc(struct_size(fl_action, action.entries, num_actions),
+			    GFP_KERNEL);
+	if (!fl_action)
+		return NULL;
+
+	fl_action->action.num_entries = num_actions;
+	/* Pre-fill each action hw_stats with DONT_CARE.
+	 * Caller can override this if it wants stats for a given action.
+	 */
+	for (i = 0; i < num_actions; i++)
+		fl_action->action.entries[i].hw_stats = FLOW_ACTION_HW_STATS_DONT_CARE;
+
+	return fl_action;
+}
+EXPORT_SYMBOL(flow_action_alloc);
+
 #define FLOW_DISSECTOR_MATCH(__rule, __type, __out)				\
 	const struct flow_match *__m = &(__rule)->match;			\
 	struct flow_dissector *__d = (__m)->dissector;				\
@@ -476,6 +497,9 @@  int flow_indr_dev_setup_offload(struct net_device *dev, struct Qdisc *sch,
 
 	mutex_unlock(&flow_indr_block_lock);
 
-	return list_empty(&bo->cb_list) ? -EOPNOTSUPP : 0;
+	if (bo)
+		return list_empty(&bo->cb_list) ? -EOPNOTSUPP : 0;
+	else
+		return 0;
 }
 EXPORT_SYMBOL(flow_indr_dev_setup_offload);
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 998a2374f7ae..185f17ea60d5 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -1060,6 +1060,36 @@  struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
 	return ERR_PTR(err);
 }
 
+/* offload the tc command after inserted */
+int tcf_action_offload_cmd(struct tc_action *actions[],
+			   struct netlink_ext_ack *extack)
+{
+	struct flow_offload_action *fl_act;
+	int err = 0;
+
+	fl_act = flow_action_alloc(tcf_act_num_actions(actions));
+	if (!fl_act)
+		return -ENOMEM;
+
+	fl_act->extack = extack;
+	err = tc_setup_action(&fl_act->action, actions);
+	if (err) {
+		NL_SET_ERR_MSG_MOD(extack,
+				   "Failed to setup tc actions for offload\n");
+		goto err_out;
+	}
+	fl_act->command = FLOW_ACT_REPLACE;
+
+	flow_indr_dev_setup_offload(NULL, NULL, TC_SETUP_ACT, fl_act, NULL, NULL);
+
+	tc_cleanup_flow_action(&fl_act->action);
+
+err_out:
+	kfree(fl_act);
+	return err;
+}
+EXPORT_SYMBOL(tcf_action_offload_cmd);
+
 /* Returns numbers of initialized actions or negative error. */
 
 int tcf_action_init(struct net *net, struct tcf_proto *tp, struct nlattr *nla,
@@ -1514,6 +1544,9 @@  static int tcf_action_add(struct net *net, struct nlattr *nla,
 		return ret;
 	ret = tcf_add_notify(net, n, actions, portid, attr_size, extack);
 
+	/* offload actions to hardware if possible */
+	tcf_action_offload_cmd(actions, extack);
+
 	/* only put existing actions */
 	for (i = 0; i < TCA_ACT_MAX_PRIO; i++)
 		if (init_res[i] == ACT_P_CREATED)
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index c8cb59a11098..9b9770dab5e8 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -3544,8 +3544,8 @@  static enum flow_action_hw_stats tc_act_hw_stats(u8 hw_stats)
 	return hw_stats;
 }
 
-int tc_setup_flow_action(struct flow_action *flow_action,
-			 const struct tcf_exts *exts)
+int tc_setup_action(struct flow_action *flow_action,
+		    struct tc_action *actions[])
 {
 	struct tc_action *act;
 	int i, j, k, err = 0;
@@ -3554,11 +3554,11 @@  int tc_setup_flow_action(struct flow_action *flow_action,
 	BUILD_BUG_ON(TCA_ACT_HW_STATS_IMMEDIATE != FLOW_ACTION_HW_STATS_IMMEDIATE);
 	BUILD_BUG_ON(TCA_ACT_HW_STATS_DELAYED != FLOW_ACTION_HW_STATS_DELAYED);
 
-	if (!exts)
+	if (!actions)
 		return 0;
 
 	j = 0;
-	tcf_exts_for_each_action(i, act, exts) {
+	tcf_act_for_each_action(i, act, actions) {
 		struct flow_action_entry *entry;
 
 		entry = &flow_action->entries[j];
@@ -3725,7 +3725,19 @@  int tc_setup_flow_action(struct flow_action *flow_action,
 	spin_unlock_bh(&act->tcfa_lock);
 	goto err_out;
 }
+EXPORT_SYMBOL(tc_setup_action);
+
+#ifdef CONFIG_NET_CLS_ACT
+int tc_setup_flow_action(struct flow_action *flow_action,
+			 const struct tcf_exts *exts)
+{
+	if (!exts)
+		return 0;
+
+	return tc_setup_action(flow_action, exts->actions);
+}
 EXPORT_SYMBOL(tc_setup_flow_action);
+#endif
 
 unsigned int tcf_exts_num_actions(struct tcf_exts *exts)
 {
@@ -3743,6 +3755,22 @@  unsigned int tcf_exts_num_actions(struct tcf_exts *exts)
 }
 EXPORT_SYMBOL(tcf_exts_num_actions);
 
+unsigned int tcf_act_num_actions(struct tc_action *actions[])
+{
+	unsigned int num_acts = 0;
+	struct tc_action *act;
+	int i;
+
+	tcf_act_for_each_action(i, act, actions) {
+		if (is_tcf_pedit(act))
+			num_acts += tcf_pedit_nkeys(act);
+		else
+			num_acts++;
+	}
+	return num_acts;
+}
+EXPORT_SYMBOL(tcf_act_num_actions);
+
 #ifdef CONFIG_NET_CLS_ACT
 static int tcf_qevent_parse_block_index(struct nlattr *block_index_attr,
 					u32 *p_block_index,