mbox series

[pull,request,net-next,00/10] mlx5 updates 2020-06-23

Message ID 20200623195229.26411-1-saeedm@mellanox.com
Headers show
Series mlx5 updates 2020-06-23 | expand

Message

Saeed Mahameed June 23, 2020, 7:52 p.m. UTC
Hi Dave, Jakub

This series adds misc updates and one small feature, Relaxed ordering, 
to mlx5 driver.

For more information please see tag log below.

Please pull and let me know if there is any problem.

Thanks,
Saeed.

---
The following changes since commit 8af7b4525acf5012b2f111a8b168b8647f2c8d60:

  Merge branch 'net-atlantic-additional-A2-features' (2020-06-22 21:10:22 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-updates-2020-06-23

for you to fetch changes up to 378dd789c6335191ca38f57b71c88a8ff4387335:

  net/mlx5e: Add support for PCI relaxed ordering (2020-06-23 12:49:14 -0700)

----------------------------------------------------------------
mlx5-updates-2020-06-23

1) Misc updates and cleanup
2) Use RCU instead of spinlock for vxlan table
3) Support for PCI relaxed ordering
    On some systems, especially ARM and AMD systems, with relaxed
    ordering set, traffic on the remote-numa is at the same
    level as when on the local numa. Running TCP single stream over
    ConnectX-4 LX, ARM CPU on remote-numa has 300% improvement in the
    bandwidth.
    With relaxed ordering turned off: BW:10 [GB/s]
    With relaxed ordering turned on:  BW:40 [GB/s]

----------------------------------------------------------------
Alaa Hleihel (1):
      net/mlx5e: Move including net/arp.h from en_rep.c to rep/neigh.c

Aya Levin (1):
      net/mlx5e: Add support for PCI relaxed ordering

Denis Efremov (1):
      net/mlx5: Use kfree(ft->g) in arfs_create_groups()

Hu Haowen (2):
      net/mlx5: FWTrace: Add missing space
      net/mlx5: Add a missing macro undefinition

Maxim Mikityanskiy (1):
      net/mlx5e: Remove unused mlx5e_xsk_first_unused_channel

Parav Pandit (1):
      net/mlx5: Avoid eswitch header inclusion in fs core layer

Saeed Mahameed (2):
      net/mlx5e: vxlan: Use RCU for vxlan table lookup
      net/mlx5e: vxlan: Return bool instead of opaque ptr in port_lookup()

Vlad Buslov (1):
      net/mlx5e: Move TC-specific function definitions into MLX5_CLS_ACT

 .../ethernet/mellanox/mlx5/core/diag/fw_tracer.c   |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  3 +
 .../net/ethernet/mellanox/mlx5/core/en/rep/neigh.c |  1 +
 .../net/ethernet/mellanox/mlx5/core/en/xsk/umem.c  | 13 -----
 .../net/ethernet/mellanox/mlx5/core/en/xsk/umem.h  |  2 -
 drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c  |  2 +-
 .../net/ethernet/mellanox/mlx5/core/en_common.c    | 67 ++++++++++++++++++++--
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   | 46 +++++++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 29 ++++++++--
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c   |  1 -
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.h    | 16 +++---
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.h  | 10 ----
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c  |  1 -
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.h  | 10 ++++
 .../net/ethernet/mellanox/mlx5/core/lib/vxlan.c    | 64 +++++++++------------
 .../net/ethernet/mellanox/mlx5/core/lib/vxlan.h    |  5 +-
 drivers/net/ethernet/mellanox/mlx5/core/main.c     |  2 +
 include/linux/mlx5/driver.h                        | 10 +++-
 18 files changed, 195 insertions(+), 89 deletions(-)

Comments

Aya Levin July 6, 2020, 1 p.m. UTC | #1
On 6/24/2020 11:30 PM, Jakub Kicinski wrote:
> On Wed, 24 Jun 2020 20:15:14 +0000 Saeed Mahameed wrote:

>> On Wed, 2020-06-24 at 10:22 -0700, Jakub Kicinski wrote:

>>> On Wed, 24 Jun 2020 10:34:40 +0300 Aya Levin wrote:

>>>>>> I think Michal will rightly complain that this does not belong

>>>>>> in

>>>>>> private flags any more. As (/if?) ARM deployments take a

>>>>>> foothold

>>>>>> in DC this will become a common setting for most NICs.

>>>>>

>>>>> Initially we used pcie_relaxed_ordering_enabled() to

>>>>>    programmatically enable this on/off on boot but this seems to

>>>>> introduce some degradation on some Intel CPUs since the Intel

>>>>> Faulty

>>>>> CPUs list is not up to date. Aya is discussing this with Bjorn.

>>>> Adding Bjorn Helgaas

>>>

>>> I see. Simply using pcie_relaxed_ordering_enabled() and blacklisting

>>> bad CPUs seems far nicer from operational perspective. Perhaps Bjorn

>>> will chime in. Pushing the validation out to the user is not a great

>>> solution IMHO.

>>

>> Can we move on with this patch for now ? since we are going to keep the

>> user knob anyway, what is missing is setting the default value

>> automatically but this can't be done until we

>> fix pcie_relaxed_ordering_enabled()

> 

> If this patch was just adding a chicken bit that'd be fine, but opt in

> I'm not hugely comfortable with. Seems like Bjorn has provided some

> assistance already on the defaults but there doesn't appear to be much

> progress being made.


Hi Jakub, Dave

Assuming the discussions with Bjorn will conclude in a well-trusted API 
that ensures relaxed ordering in enabled, I'd still like a method to 
turn off relaxed ordering for performance debugging sake.
Bjorn highlighted the fact that the PCIe sub system can only offer a 
query method. Even if theoretically a set API will be provided, this 
will not fit a netdev debugging - I wonder if CPU vendors even support 
relaxed ordering set/unset...
On the driver's side relaxed ordering is an attribute of the mkey and 
should be available for configuration (similar to number of CPU vs. 
number of channels).
Based on the above, and binding the driver's default relaxed ordering to 
the return value from pcie_relaxed_ordering_enabled(), may I continue 
with previous direction of a private-flag to control the client side (my 
driver) ?

Aya.
>
Jakub Kicinski July 6, 2020, 4:52 p.m. UTC | #2
On Mon, 6 Jul 2020 16:00:59 +0300 Aya Levin wrote:
> Assuming the discussions with Bjorn will conclude in a well-trusted API 

> that ensures relaxed ordering in enabled, I'd still like a method to 

> turn off relaxed ordering for performance debugging sake.

> Bjorn highlighted the fact that the PCIe sub system can only offer a 

> query method. Even if theoretically a set API will be provided, this 

> will not fit a netdev debugging - I wonder if CPU vendors even support 

> relaxed ordering set/unset...

> On the driver's side relaxed ordering is an attribute of the mkey and 

> should be available for configuration (similar to number of CPU vs. 

> number of channels).

> Based on the above, and binding the driver's default relaxed ordering to 

> the return value from pcie_relaxed_ordering_enabled(), may I continue 

> with previous direction of a private-flag to control the client side (my 

> driver) ?


That's fine with me, chicken bit seems reasonable as long as the
default is dictated by the PCI subsystem. I have no particularly 
strong feeling on the API used for the chicken bit, but others may.
David Miller July 6, 2020, 7:49 p.m. UTC | #3
From: Aya Levin <ayal@mellanox.com>

Date: Mon, 6 Jul 2020 16:00:59 +0300

> Assuming the discussions with Bjorn will conclude in a well-trusted

> API that ensures relaxed ordering in enabled, I'd still like a method

> to turn off relaxed ordering for performance debugging sake.

> Bjorn highlighted the fact that the PCIe sub system can only offer a

> query method. Even if theoretically a set API will be provided, this

> will not fit a netdev debugging - I wonder if CPU vendors even support

> relaxed ordering set/unset...

> On the driver's side relaxed ordering is an attribute of the mkey and

> should be available for configuration (similar to number of CPU

> vs. number of channels).

> Based on the above, and binding the driver's default relaxed ordering

> to the return value from pcie_relaxed_ordering_enabled(), may I

> continue with previous direction of a private-flag to control the

> client side (my driver) ?


I don't like this situation at all.

If RO is so dodgy that it potentially needs to be disabled, that is
going to be an issue not just with networking devices but also with
storage and other device types as well.

Will every device type have a custom way to disable RO, thus
inconsistently, in order to accomodate this?

That makes no sense and is a terrible user experience.

That's why the knob belongs generically in PCI or similar.
Bjorn Helgaas July 8, 2020, 11:16 p.m. UTC | #4
On Sun, Jul 08, 2040 at 11:22:12AM +0300, Aya Levin wrote:
> On 7/6/2020 10:49 PM, David Miller wrote:

> > From: Aya Levin <ayal@mellanox.com>

> > Date: Mon, 6 Jul 2020 16:00:59 +0300

> > 

> > > Assuming the discussions with Bjorn will conclude in a well-trusted

> > > API that ensures relaxed ordering in enabled, I'd still like a method

> > > to turn off relaxed ordering for performance debugging sake.

> > > Bjorn highlighted the fact that the PCIe sub system can only offer a

> > > query method. Even if theoretically a set API will be provided, this

> > > will not fit a netdev debugging - I wonder if CPU vendors even support

> > > relaxed ordering set/unset...

> > > On the driver's side relaxed ordering is an attribute of the mkey and

> > > should be available for configuration (similar to number of CPU

> > > vs. number of channels).

> > > Based on the above, and binding the driver's default relaxed ordering

> > > to the return value from pcie_relaxed_ordering_enabled(), may I

> > > continue with previous direction of a private-flag to control the

> > > client side (my driver) ?

> > 

> > I don't like this situation at all.

> > 

> > If RO is so dodgy that it potentially needs to be disabled, that is

> > going to be an issue not just with networking devices but also with

> > storage and other device types as well.

> > 

> > Will every device type have a custom way to disable RO, thus

> > inconsistently, in order to accomodate this?

> > 

> > That makes no sense and is a terrible user experience.

> > 

> > That's why the knob belongs generically in PCI or similar.

> > 

> Hi Bjorn,

> 

> Mellanox NIC supports relaxed ordering operation over DMA buffers.

> However for debug prepossess we must have a chicken bit to disable

> relaxed ordering on a specific system without effecting others in

> run-time. In order to meet this requirement, I added a netdev

> private-flag to ethtool for set RO API.

> 

> Dave raised a concern regarding embedding relaxed ordering set API

> per system (networking, storage and others). We need the ability to

> manage relaxed ordering in a unify manner. Could you please define a

> PCI sub-system solution to meet this requirement?


I agree, this is definitely a mess.  Let me just outline what I think
we have today and what we're missing.

  - On the hardware side, device use of Relaxed Ordering is controlled
    by the Enable Relaxed Ordering bit in the PCIe Device Control
    register (or the PCI-X Command register).  If set, the device is
    allowed but not required to set the Relaxed Ordering bit in
    transactions it initiates (PCIe r5.0, sec 7.5.3.4; PCI-X 2.0, sec
    7.2.3).

    I suspect there may be device-specific controls, too, because [1]
    claims to enable/disable Relaxed Ordering but doesn't touch the
    PCIe Device Control register.  Device-specific controls are
    certainly allowed, but of course it would be up to the driver, and
    the device cannot generate TLPs with Relaxed Ordering unless the
    architected PCIe Enable Relaxed Ordering bit is *also* set.

  - Platform firmware can enable Relaxed Ordering for a device either
    before handoff to the OS or via the _HPX ACPI method.

  - The PCI core never enables Relaxed Ordering itself except when
    applying _HPX.

  - At enumeration-time, the PCI core disables Relaxed Ordering in
    pci_configure_relaxed_ordering() if the device is below a Root
    Port that has a quirk indicating an erratum.  This quirk currently
    includes many Intel Root Ports, but not all, and is an ongoing
    maintenance problem.

  - The PCI core provides pcie_relaxed_ordering_enabled() which tells
    you whether Relaxed Ordering is enabled.  Only used by cxgb4 and
    csio, which use that information to fill in Ingress Queue
    Commands.

  - The PCI core does not provide a driver interface to enable or
    disable Relaxed Ordering.

  - Some drivers disable Relaxed Ordering themselves: mtip32xx,
    netup_unidvb, tg3, myri10ge (oddly, only if CONFIG_MYRI10GE_DCA),
    tsi721, kp2000_pcie.

  - Some drivers enable Relaxed Ordering themselves: niu, tegra.

What are we missing and what should the PCI core do?

  - Currently the Enable Relaxed Ordering bit depends on what firmware
    did.  Maybe the PCI core should always clear it during
    enumeration?

  - The PCI core should probably have a driver interface like
    pci_set_relaxed_ordering(dev, enable) that can set or clear the
    architected PCI-X or PCIe Enable Relaxed Ordering bit.

  - Maybe there should be a kernel command-line parameter like
    "pci=norelax" that disables Relaxed Ordering for every device and
    prevents pci_set_relaxed_ordering() from enabling it.

    I'm mixed on this because these tend to become folklore about how
    to "fix" problems and we end up with systems that don't work
    unless you happen to find the option on the web.  For debugging
    issues, it might be enough to disable Relaxed Ordering using
    setpci, e.g., "setpci -s02:00.0 CAP_EXP+8.w=0"

[1] https://lore.kernel.org/netdev/20200623195229.26411-11-saeedm@mellanox.com/
Jason Gunthorpe July 8, 2020, 11:26 p.m. UTC | #5
On Wed, Jul 08, 2020 at 06:16:30PM -0500, Bjorn Helgaas wrote:
>     I suspect there may be device-specific controls, too, because [1]

>     claims to enable/disable Relaxed Ordering but doesn't touch the

>     PCIe Device Control register.  Device-specific controls are

>     certainly allowed, but of course it would be up to the driver, and

>     the device cannot generate TLPs with Relaxed Ordering unless the

>     architected PCIe Enable Relaxed Ordering bit is *also* set.


Yes, at least on RDMA relaxed ordering can be set on a per transaction
basis and is something userspace can choose to use or not at a fine
granularity. This is because we have to support historical
applications that make assumptions that data arrives in certain
orders.

I've been thinking of doing the same as this patch but for RDMA kernel
ULPs and just globally turn it on if the PCI CAP is enabled as none of
our in-kernel uses have the legacy data ordering problem.

There are reports that using relaxed ordering is a *huge* speed up in
certain platforms/configurations/etc.

>     issues, it might be enough to disable Relaxed Ordering using

>     setpci, e.g., "setpci -s02:00.0 CAP_EXP+8.w=0"


For the purposes of occasional performance testing I think this should
be good enough?

Aya?

Jason
Jonathan Lemon July 9, 2020, 5:35 p.m. UTC | #6
On Wed, Jul 08, 2020 at 08:26:02PM -0300, Jason Gunthorpe wrote:
> On Wed, Jul 08, 2020 at 06:16:30PM -0500, Bjorn Helgaas wrote:

> >     I suspect there may be device-specific controls, too, because [1]

> >     claims to enable/disable Relaxed Ordering but doesn't touch the

> >     PCIe Device Control register.  Device-specific controls are

> >     certainly allowed, but of course it would be up to the driver, and

> >     the device cannot generate TLPs with Relaxed Ordering unless the

> >     architected PCIe Enable Relaxed Ordering bit is *also* set.

> 

> Yes, at least on RDMA relaxed ordering can be set on a per transaction

> basis and is something userspace can choose to use or not at a fine

> granularity. This is because we have to support historical

> applications that make assumptions that data arrives in certain

> orders.

> 

> I've been thinking of doing the same as this patch but for RDMA kernel

> ULPs and just globally turn it on if the PCI CAP is enabled as none of

> our in-kernel uses have the legacy data ordering problem.


If I'm following this correctly - there are two different controls being
discussed here:

    1) having the driver request PCI relaxed ordering, which may or may
       not be granted, based on other system settings, and

    2) having the driver set RO on the transactions it initiates, which
       are honored iff the PCI bit is set.

It seems that in addition to the PCI core changes, there still is a need
for driver controls?  Unless the driver always enables RO if it's capable?
-- 
Jonathan
Jason Gunthorpe July 9, 2020, 6:20 p.m. UTC | #7
On Thu, Jul 09, 2020 at 10:35:50AM -0700, Jonathan Lemon wrote:
> On Wed, Jul 08, 2020 at 08:26:02PM -0300, Jason Gunthorpe wrote:

> > On Wed, Jul 08, 2020 at 06:16:30PM -0500, Bjorn Helgaas wrote:

> > >     I suspect there may be device-specific controls, too, because [1]

> > >     claims to enable/disable Relaxed Ordering but doesn't touch the

> > >     PCIe Device Control register.  Device-specific controls are

> > >     certainly allowed, but of course it would be up to the driver, and

> > >     the device cannot generate TLPs with Relaxed Ordering unless the

> > >     architected PCIe Enable Relaxed Ordering bit is *also* set.

> > 

> > Yes, at least on RDMA relaxed ordering can be set on a per transaction

> > basis and is something userspace can choose to use or not at a fine

> > granularity. This is because we have to support historical

> > applications that make assumptions that data arrives in certain

> > orders.

> > 

> > I've been thinking of doing the same as this patch but for RDMA kernel

> > ULPs and just globally turn it on if the PCI CAP is enabled as none of

> > our in-kernel uses have the legacy data ordering problem.

> 

> If I'm following this correctly - there are two different controls being

> discussed here:

> 

>     1) having the driver request PCI relaxed ordering, which may or may

>        not be granted, based on other system settings, and


This is what Bjorn was thinking about, yes, it is some PCI layer
function to control the global config space bit.

>     2) having the driver set RO on the transactions it initiates, which

>        are honored iff the PCI bit is set.

>

> It seems that in addition to the PCI core changes, there still is a need

> for driver controls?  Unless the driver always enables RO if it's capable?


I think the PCI spec imagined that when the config space RO bit was
enabled the PCI device would just start using RO packets, in an
appropriate and device specific way.

So the fine grained control in #2 is something done extra by some
devices.

IMHO if the driver knows it is functionally correct with RO then it
should enable it fully on the device when the config space bit is set.

I'm not sure there is a reason to allow users to finely tune RO, at
least I haven't heard of cases where RO is a degredation depending on
workload.

If some platform doesn't work when RO is turned on then it should be
globally black listed like is already done in some cases.

If the devices has bugs and uses RO wrong, or the driver has bugs and
is only stable with !RO and Intel, then the driver shouldn't turn it
on at all.

In all of these cases it is not a user tunable.

Development and testing reasons, like 'is my crash from a RO bug?' to
tune should be met by the device global setpci, I think.

Jason
Jakub Kicinski July 9, 2020, 7:47 p.m. UTC | #8
On Thu, 9 Jul 2020 15:20:11 -0300 Jason Gunthorpe wrote:
> >     2) having the driver set RO on the transactions it initiates, which

> >        are honored iff the PCI bit is set.

> >

> > It seems that in addition to the PCI core changes, there still is a need

> > for driver controls?  Unless the driver always enables RO if it's capable?  

> 

> I think the PCI spec imagined that when the config space RO bit was

> enabled the PCI device would just start using RO packets, in an

> appropriate and device specific way.

> 

> So the fine grained control in #2 is something done extra by some

> devices.

> 

> IMHO if the driver knows it is functionally correct with RO then it

> should enable it fully on the device when the config space bit is set.

> 

> I'm not sure there is a reason to allow users to finely tune RO, at

> least I haven't heard of cases where RO is a degredation depending on

> workload.

> 

> If some platform doesn't work when RO is turned on then it should be

> globally black listed like is already done in some cases.

> 

> If the devices has bugs and uses RO wrong, or the driver has bugs and

> is only stable with !RO and Intel, then the driver shouldn't turn it

> on at all.

> 

> In all of these cases it is not a user tunable.

> 

> Development and testing reasons, like 'is my crash from a RO bug?' to

> tune should be met by the device global setpci, I think.


+1
Jonathan Lemon July 9, 2020, 8:33 p.m. UTC | #9
On Thu, Jul 09, 2020 at 03:20:11PM -0300, Jason Gunthorpe wrote:
> On Thu, Jul 09, 2020 at 10:35:50AM -0700, Jonathan Lemon wrote:

> > On Wed, Jul 08, 2020 at 08:26:02PM -0300, Jason Gunthorpe wrote:

> > > On Wed, Jul 08, 2020 at 06:16:30PM -0500, Bjorn Helgaas wrote:

> > > >     I suspect there may be device-specific controls, too, because [1]

> > > >     claims to enable/disable Relaxed Ordering but doesn't touch the

> > > >     PCIe Device Control register.  Device-specific controls are

> > > >     certainly allowed, but of course it would be up to the driver, and

> > > >     the device cannot generate TLPs with Relaxed Ordering unless the

> > > >     architected PCIe Enable Relaxed Ordering bit is *also* set.

> > > 

> > > Yes, at least on RDMA relaxed ordering can be set on a per transaction

> > > basis and is something userspace can choose to use or not at a fine

> > > granularity. This is because we have to support historical

> > > applications that make assumptions that data arrives in certain

> > > orders.

> > > 

> > > I've been thinking of doing the same as this patch but for RDMA kernel

> > > ULPs and just globally turn it on if the PCI CAP is enabled as none of

> > > our in-kernel uses have the legacy data ordering problem.

> > 

> > If I'm following this correctly - there are two different controls being

> > discussed here:

> > 

> >     1) having the driver request PCI relaxed ordering, which may or may

> >        not be granted, based on other system settings, and

> 

> This is what Bjorn was thinking about, yes, it is some PCI layer

> function to control the global config space bit.

> 

> >     2) having the driver set RO on the transactions it initiates, which

> >        are honored iff the PCI bit is set.

> >

> > It seems that in addition to the PCI core changes, there still is a need

> > for driver controls?  Unless the driver always enables RO if it's capable?

> 

> I think the PCI spec imagined that when the config space RO bit was

> enabled the PCI device would just start using RO packets, in an

> appropriate and device specific way.

> 

> So the fine grained control in #2 is something done extra by some

> devices.

> 

> IMHO if the driver knows it is functionally correct with RO then it

> should enable it fully on the device when the config space bit is set.


Sounds reasonable to me.
-- 
Jonathan
Saeed Mahameed July 10, 2020, 2:18 a.m. UTC | #10
On Thu, 2020-07-09 at 12:47 -0700, Jakub Kicinski wrote:
> On Thu, 9 Jul 2020 15:20:11 -0300 Jason Gunthorpe wrote:

> > >     2) having the driver set RO on the transactions it initiates,

> > > which

> > >        are honored iff the PCI bit is set.

> > > 

> > > It seems that in addition to the PCI core changes, there still is

> > > a need

> > > for driver controls?  Unless the driver always enables RO if it's

> > > capable?  

> > 

> > I think the PCI spec imagined that when the config space RO bit was

> > enabled the PCI device would just start using RO packets, in an

> > appropriate and device specific way.

> > 

> > So the fine grained control in #2 is something done extra by some

> > devices.

> > 

> > IMHO if the driver knows it is functionally correct with RO then it

> > should enable it fully on the device when the config space bit is

> > set.

> > 

> > I'm not sure there is a reason to allow users to finely tune RO, at

> > least I haven't heard of cases where RO is a degredation depending

> > on

> > workload.

> > 

> > If some platform doesn't work when RO is turned on then it should

> > be

> > globally black listed like is already done in some cases.

> > 

> > If the devices has bugs and uses RO wrong, or the driver has bugs

> > and

> > is only stable with !RO and Intel, then the driver shouldn't turn

> > it

> > on at all.

> > 

> > In all of these cases it is not a user tunable.

> > 

> > Development and testing reasons, like 'is my crash from a RO bug?'

> > to

> > tune should be met by the device global setpci, I think.

> 

> +1


Be careful though to load driver with RO on and then setpci RO off.. 
not sure what the side effects are, unstable driver maybe ?
And not sure what should be the procedure then ? reload driver ? FW
will get a notification from PCI ?
Jason Gunthorpe July 10, 2020, 12:21 p.m. UTC | #11
On Fri, Jul 10, 2020 at 02:18:02AM +0000, Saeed Mahameed wrote:

> Be careful though to load driver with RO on and then setpci RO off.. 

> not sure what the side effects are, unstable driver maybe ?


According to the PCI spec HW should stop doing RO immediately once the
config space bit is cleared.

In any event continuing to issue RO won't harm anything.

> And not sure what should be the procedure then ? reload driver ? FW

> will get a notification from PCI ? 


At worst you'd have to reload the driver - continuing to use RO if the
driver starts with RO off is seriously broken and probably won't work
with the quirks to disable RO on buggy platforms.

But as above, the RO config space bit should have immedaite effect on
the device and it should stop using RO. The device HW itself has to
enforce this to be spec compliant.

Jason
Aya Levin July 14, 2020, 10:47 a.m. UTC | #12
On 7/9/2020 2:16 AM, Bjorn Helgaas wrote:
> On Sun, Jul 08, 2040 at 11:22:12AM +0300, Aya Levin wrote:

>> On 7/6/2020 10:49 PM, David Miller wrote:

>>> From: Aya Levin <ayal@mellanox.com>

>>> Date: Mon, 6 Jul 2020 16:00:59 +0300

>>>

>>>> Assuming the discussions with Bjorn will conclude in a well-trusted

>>>> API that ensures relaxed ordering in enabled, I'd still like a method

>>>> to turn off relaxed ordering for performance debugging sake.

>>>> Bjorn highlighted the fact that the PCIe sub system can only offer a

>>>> query method. Even if theoretically a set API will be provided, this

>>>> will not fit a netdev debugging - I wonder if CPU vendors even support

>>>> relaxed ordering set/unset...

>>>> On the driver's side relaxed ordering is an attribute of the mkey and

>>>> should be available for configuration (similar to number of CPU

>>>> vs. number of channels).

>>>> Based on the above, and binding the driver's default relaxed ordering

>>>> to the return value from pcie_relaxed_ordering_enabled(), may I

>>>> continue with previous direction of a private-flag to control the

>>>> client side (my driver) ?

>>>

>>> I don't like this situation at all.

>>>

>>> If RO is so dodgy that it potentially needs to be disabled, that is

>>> going to be an issue not just with networking devices but also with

>>> storage and other device types as well.

>>>

>>> Will every device type have a custom way to disable RO, thus

>>> inconsistently, in order to accomodate this?

>>>

>>> That makes no sense and is a terrible user experience.

>>>

>>> That's why the knob belongs generically in PCI or similar.

>>>

>> Hi Bjorn,

>>

>> Mellanox NIC supports relaxed ordering operation over DMA buffers.

>> However for debug prepossess we must have a chicken bit to disable

>> relaxed ordering on a specific system without effecting others in

>> run-time. In order to meet this requirement, I added a netdev

>> private-flag to ethtool for set RO API.

>>

>> Dave raised a concern regarding embedding relaxed ordering set API

>> per system (networking, storage and others). We need the ability to

>> manage relaxed ordering in a unify manner. Could you please define a

>> PCI sub-system solution to meet this requirement?

> 

> I agree, this is definitely a mess.  Let me just outline what I think

> we have today and what we're missing.

> 

>    - On the hardware side, device use of Relaxed Ordering is controlled

>      by the Enable Relaxed Ordering bit in the PCIe Device Control

>      register (or the PCI-X Command register).  If set, the device is

>      allowed but not required to set the Relaxed Ordering bit in

>      transactions it initiates (PCIe r5.0, sec 7.5.3.4; PCI-X 2.0, sec

>      7.2.3).

> 

>      I suspect there may be device-specific controls, too, because [1]

>      claims to enable/disable Relaxed Ordering but doesn't touch the

>      PCIe Device Control register.  Device-specific controls are

>      certainly allowed, but of course it would be up to the driver, and

>      the device cannot generate TLPs with Relaxed Ordering unless the

>      architected PCIe Enable Relaxed Ordering bit is *also* set.

> 

>    - Platform firmware can enable Relaxed Ordering for a device either

>      before handoff to the OS or via the _HPX ACPI method.

> 

>    - The PCI core never enables Relaxed Ordering itself except when

>      applying _HPX.

> 

>    - At enumeration-time, the PCI core disables Relaxed Ordering in

>      pci_configure_relaxed_ordering() if the device is below a Root

>      Port that has a quirk indicating an erratum.  This quirk currently

>      includes many Intel Root Ports, but not all, and is an ongoing

>      maintenance problem.

> 

>    - The PCI core provides pcie_relaxed_ordering_enabled() which tells

>      you whether Relaxed Ordering is enabled.  Only used by cxgb4 and

>      csio, which use that information to fill in Ingress Queue

>      Commands.

> 

>    - The PCI core does not provide a driver interface to enable or

>      disable Relaxed Ordering.

> 

>    - Some drivers disable Relaxed Ordering themselves: mtip32xx,

>      netup_unidvb, tg3, myri10ge (oddly, only if CONFIG_MYRI10GE_DCA),

>      tsi721, kp2000_pcie.

> 

>    - Some drivers enable Relaxed Ordering themselves: niu, tegra.

> 

> What are we missing and what should the PCI core do?

> 

>    - Currently the Enable Relaxed Ordering bit depends on what firmware

>      did.  Maybe the PCI core should always clear it during

>      enumeration?

> 

>    - The PCI core should probably have a driver interface like

>      pci_set_relaxed_ordering(dev, enable) that can set or clear the

>      architected PCI-X or PCIe Enable Relaxed Ordering bit.

> 

>    - Maybe there should be a kernel command-line parameter like

>      "pci=norelax" that disables Relaxed Ordering for every device and

>      prevents pci_set_relaxed_ordering() from enabling it.

> 

>      I'm mixed on this because these tend to become folklore about how

>      to "fix" problems and we end up with systems that don't work

>      unless you happen to find the option on the web.  For debugging

>      issues, it might be enough to disable Relaxed Ordering using

>      setpci, e.g., "setpci -s02:00.0 CAP_EXP+8.w=0"

> 

> [1] https://lore.kernel.org/netdev/20200623195229.26411-11-saeedm@mellanox.com/

> 

Hi Bjorn,

Thanks for the detailed reply. From initial testing I can say that 
turning off the relaxed ordering on the PCI (setpci -s02:00.0 
CAP_EXP+8.w=0) is the chicken bit I was looking for.
This lower the risk of depending on pcie_relaxed_ordering_enabled(). I 
will update my patch and resubmit.

Thanks,
Aya
Alexander Duyck July 23, 2020, 9:03 p.m. UTC | #13
On 7/8/2040 1:22 AM, Aya Levin wrote:
> 

> 

> On 7/6/2020 10:49 PM, David Miller wrote:

>> From: Aya Levin <ayal@mellanox.com>

>> Date: Mon, 6 Jul 2020 16:00:59 +0300

>>

>>> Assuming the discussions with Bjorn will conclude in a well-trusted

>>> API that ensures relaxed ordering in enabled, I'd still like a method

>>> to turn off relaxed ordering for performance debugging sake.

>>> Bjorn highlighted the fact that the PCIe sub system can only offer a

>>> query method. Even if theoretically a set API will be provided, this

>>> will not fit a netdev debugging - I wonder if CPU vendors even support

>>> relaxed ordering set/unset...

>>> On the driver's side relaxed ordering is an attribute of the mkey and

>>> should be available for configuration (similar to number of CPU

>>> vs. number of channels).

>>> Based on the above, and binding the driver's default relaxed ordering

>>> to the return value from pcie_relaxed_ordering_enabled(), may I

>>> continue with previous direction of a private-flag to control the

>>> client side (my driver) ?

>>

>> I don't like this situation at all.

>>

>> If RO is so dodgy that it potentially needs to be disabled, that is

>> going to be an issue not just with networking devices but also with

>> storage and other device types as well.

>>

>> Will every device type have a custom way to disable RO, thus

>> inconsistently, in order to accomodate this?

>>

>> That makes no sense and is a terrible user experience.

>>

>> That's why the knob belongs generically in PCI or similar.

>>

> Hi Bjorn,

> 

> Mellanox NIC supports relaxed ordering operation over DMA buffers. 

> However for debug prepossess we must have a chicken bit to disable 

> relaxed ordering on a specific system without effecting others in 

> run-time. In order to meet this requirement, I added a netdev 

> private-flag to ethtool for set RO API.

> 

> Dave raised a concern regarding embedding relaxed ordering set API per 

> system (networking, storage and others). We need the ability to manage 

> relaxed ordering in a unify manner. Could you please define a PCI 

> sub-system solution to meet this requirement?

> 

> Aya.


Isn't there a relaxed ordering bit in the PCIe configuration space? 
Couldn't you use that as a global indication of if you can support 
relaxed ordering or not? Reading through the spec it seems like that is 
kind of the point of the config space bit in the Device Control 
register. If the bit is not set there then you shouldn't be able to use 
relaxed ordering in the device.

Then it is just a matter of using setpci to enable/disable it.

Thanks.

- Alex
Aya Levin July 8, 2040, 8:22 a.m. UTC | #14
On 7/6/2020 10:49 PM, David Miller wrote:
> From: Aya Levin <ayal@mellanox.com>

> Date: Mon, 6 Jul 2020 16:00:59 +0300

> 

>> Assuming the discussions with Bjorn will conclude in a well-trusted

>> API that ensures relaxed ordering in enabled, I'd still like a method

>> to turn off relaxed ordering for performance debugging sake.

>> Bjorn highlighted the fact that the PCIe sub system can only offer a

>> query method. Even if theoretically a set API will be provided, this

>> will not fit a netdev debugging - I wonder if CPU vendors even support

>> relaxed ordering set/unset...

>> On the driver's side relaxed ordering is an attribute of the mkey and

>> should be available for configuration (similar to number of CPU

>> vs. number of channels).

>> Based on the above, and binding the driver's default relaxed ordering

>> to the return value from pcie_relaxed_ordering_enabled(), may I

>> continue with previous direction of a private-flag to control the

>> client side (my driver) ?

> 

> I don't like this situation at all.

> 

> If RO is so dodgy that it potentially needs to be disabled, that is

> going to be an issue not just with networking devices but also with

> storage and other device types as well.

> 

> Will every device type have a custom way to disable RO, thus

> inconsistently, in order to accomodate this?

> 

> That makes no sense and is a terrible user experience.

> 

> That's why the knob belongs generically in PCI or similar.

> 

Hi Bjorn,

Mellanox NIC supports relaxed ordering operation over DMA buffers. 
However for debug prepossess we must have a chicken bit to disable 
relaxed ordering on a specific system without effecting others in 
run-time. In order to meet this requirement, I added a netdev 
private-flag to ethtool for set RO API.

Dave raised a concern regarding embedding relaxed ordering set API per 
system (networking, storage and others). We need the ability to manage 
relaxed ordering in a unify manner. Could you please define a PCI 
sub-system solution to meet this requirement?

Aya.