mbox series

[v4,00/23] Add Intel Ethernet Protocol Driver for RDMA (irdma)

Message ID 20210406210125.241-1-shiraz.saleem@intel.com
Headers show
Series Add Intel Ethernet Protocol Driver for RDMA (irdma) | expand

Message

Saleem, Shiraz April 6, 2021, 9:01 p.m. UTC
The following patch series introduces a unified Intel Ethernet Protocol Driver
for RDMA (irdma) for the X722 iWARP device and a new E810 device which supports
iWARP and RoCEv2. The irdma module replaces the legacy i40iw module for X722
and extends the ABI already defined for i40iw. It is backward compatible with
legacy X722 rdma-core provider (libi40iw).

X722 and E810 are PCI network devices that are RDMA capable. The RDMA block of
this parent device is represented via an auxiliary device exported to 'irdma'
using the core auxiliary bus infrastructure recently added for 5.11 kernel.
The parent PCI netdev drivers 'i40e' and 'ice' register auxiliary RDMA devices
with private data/ops encapsulated that bind to auxiliary drivers registered
in irdma module.

This patchset was initially submitted as an RFC where in we got feedback
to come up with a generic scheme for RDMA drivers to attach to a PCI device
owned by netdev PCI driver [1]. Solutions using platform bus and MFD were explored
but rejected by the community and the consensus was to add a new bus infrastructure
to support this usage model.

Further revisions of this series along with the auxiliary bus were submitted
[2]. At this point, Greg KH requested that we take the auxiliary bus review and
revision process to an internal mailing list and garner the buy-in of a respected
kernel contributor, along with consensus of all major stakeholders including
Nvidia (for mlx5 sub-function use-case) and Intel sound driver. This process
took a while and stalled further development/review of this netdev/irdma series.
The auxiliary bus was eventually merged in 5.11.

Between v1 to v2 of this submission, the IIDC went through a major re-write
based on the feedback and we hope it is now more in alignment with what the
community wants.

This series is built against rdma for-next and currently includes the netdev
patches for ease of review. This includes updates to 'ice' driver to provide
RDMA support and converts 'i40e' driver to use the auxiliary bus infrastructure.
Once the patches are closer to merging, a shared pull request will be submitted.

v3-->v4:
* Fixup W=1 warnings in ice patches
* Fix issues uncovered by pyverbs for create user AH and multicast
* Fix descriptor set issue for fast register introduced during port to FIELD_PREP in v2 submission 

v2-->v3:
* rebase rdma for-next. Adapt to core change '1fb7f8973f51 ("RDMA: Support more than 255 rdma ports")'
* irdma Kconfig updates to conform with linux coding style.
* Fix a 0-day build issue
* Remove rdma_resource_limits selector devlink param. Follow on patch to be submitted
for it with suggestion from Parav to use devlink resource.
* Capitalize abbreviations in ice idc. e.g. 'aux' to 'AUX'

v1-->v2:
* Remove IIDC channel OPs - open, close, peer_register and peer_unregister.
  And all its associated FSM in ice PCI core driver.
* Use device_lock in ice PCI core driver while issuing IIDC ops callbacks.
* Remove peer_* verbiage from shared IIDC header and rename the structs and channel ops
  with iidc_core*/iidc_auxiliary*.
* Allocate ib_device at start and register it at the end of drv.probe() in irdma gen2 auxiliary driver.
* Use ibdev_* printing extensively throughout most of the driver
  Remove idev_to_dev, ihw_to_dev macros as no longer required in new print scheme.
* Do not bump ABI ver. to 6 in irdma. Maintain irdma ABI ver. at 5 for legacy i40iw user-provider compatibility.
* Add a boundary check in irdma_alloc_ucontext to fail binding with < 4 user-space provider version.
* Remove devlink from irdma. Add 2 new rdma-related devlink parameters added to ice PCI core driver.
* Use FIELD_PREP/FIELD_GET/GENMASK on get/set of descriptor fields versus home grown ones LS_*/RS_*.
* Bind 2 separate auxiliary drivers in irdma - one for gen1 and one for gen2 and future devices.
* Misc. driver fixes in irdma

[1] https://patchwork.kernel.org/project/linux-rdma/patch/20190215171107.6464-2-shiraz.saleem@intel.com/
[2] https://lore.kernel.org/linux-rdma/20200520070415.3392210-1-jeffrey.t.kirsher@intel.com/

Dave Ertman (4):
  iidc: Introduce iidc.h
  ice: Initialize RDMA support
  ice: Implement iidc operations
  ice: Register auxiliary device to provide RDMA

Michael J. Ruhl (1):
  RDMA/irdma: Add dynamic tracing for CM

Mustafa Ismail (13):
  RDMA/irdma: Register auxiliary driver and implement private channel
    OPs
  RDMA/irdma: Implement device initialization definitions
  RDMA/irdma: Implement HW Admin Queue OPs
  RDMA/irdma: Add HMC backing store setup functions
  RDMA/irdma: Add privileged UDA queue implementation
  RDMA/irdma: Add QoS definitions
  RDMA/irdma: Add connection manager
  RDMA/irdma: Add PBLE resource manager
  RDMA/irdma: Implement device supported verb APIs
  RDMA/irdma: Add RoCEv2 UD OP support
  RDMA/irdma: Add user/kernel shared libraries
  RDMA/irdma: Add miscellaneous utility definitions
  RDMA/irdma: Add ABI definitions

Shiraz Saleem (5):
  ice: Add devlink params support
  i40e: Prep i40e header for aux bus conversion
  i40e: Register auxiliary devices to provide RDMA
  RDMA/irdma: Add irdma Kconfig/Makefile and remove i40iw
  RDMA/irdma: Update MAINTAINERS file

 Documentation/ABI/stable/sysfs-class-infiniband    |   20 -
 .../networking/devlink/devlink-params.rst          |    6 +
 Documentation/networking/devlink/ice.rst           |   13 +
 MAINTAINERS                                        |   17 +-
 drivers/infiniband/Kconfig                         |    2 +-
 drivers/infiniband/hw/Makefile                     |    2 +-
 drivers/infiniband/hw/i40iw/Kconfig                |    9 -
 drivers/infiniband/hw/i40iw/Makefile               |    9 -
 drivers/infiniband/hw/i40iw/i40iw.h                |  602 --
 drivers/infiniband/hw/i40iw/i40iw_cm.c             | 4419 --------------
 drivers/infiniband/hw/i40iw/i40iw_cm.h             |  462 --
 drivers/infiniband/hw/i40iw/i40iw_ctrl.c           | 5243 -----------------
 drivers/infiniband/hw/i40iw/i40iw_d.h              | 1746 ------
 drivers/infiniband/hw/i40iw/i40iw_hmc.c            |  821 ---
 drivers/infiniband/hw/i40iw/i40iw_hmc.h            |  241 -
 drivers/infiniband/hw/i40iw/i40iw_hw.c             |  851 ---
 drivers/infiniband/hw/i40iw/i40iw_main.c           | 2066 -------
 drivers/infiniband/hw/i40iw/i40iw_osdep.h          |  195 -
 drivers/infiniband/hw/i40iw/i40iw_p.h              |  129 -
 drivers/infiniband/hw/i40iw/i40iw_pble.c           |  613 --
 drivers/infiniband/hw/i40iw/i40iw_pble.h           |  131 -
 drivers/infiniband/hw/i40iw/i40iw_puda.c           | 1496 -----
 drivers/infiniband/hw/i40iw/i40iw_puda.h           |  188 -
 drivers/infiniband/hw/i40iw/i40iw_register.h       | 1030 ----
 drivers/infiniband/hw/i40iw/i40iw_status.h         |  101 -
 drivers/infiniband/hw/i40iw/i40iw_type.h           | 1358 -----
 drivers/infiniband/hw/i40iw/i40iw_uk.c             | 1200 ----
 drivers/infiniband/hw/i40iw/i40iw_user.h           |  422 --
 drivers/infiniband/hw/i40iw/i40iw_utils.c          | 1518 -----
 drivers/infiniband/hw/i40iw/i40iw_verbs.c          | 2652 ---------
 drivers/infiniband/hw/i40iw/i40iw_verbs.h          |  179 -
 drivers/infiniband/hw/i40iw/i40iw_vf.c             |   85 -
 drivers/infiniband/hw/i40iw/i40iw_vf.h             |   62 -
 drivers/infiniband/hw/i40iw/i40iw_virtchnl.c       |  759 ---
 drivers/infiniband/hw/i40iw/i40iw_virtchnl.h       |  124 -
 drivers/infiniband/hw/irdma/Kconfig                |   11 +
 drivers/infiniband/hw/irdma/Makefile               |   27 +
 drivers/infiniband/hw/irdma/cm.c                   | 4425 ++++++++++++++
 drivers/infiniband/hw/irdma/cm.h                   |  417 ++
 drivers/infiniband/hw/irdma/ctrl.c                 | 6143 ++++++++++++++++++++
 drivers/infiniband/hw/irdma/defs.h                 | 1162 ++++
 drivers/infiniband/hw/irdma/hmc.c                  |  739 +++
 drivers/infiniband/hw/irdma/hmc.h                  |  180 +
 drivers/infiniband/hw/irdma/hw.c                   | 2755 +++++++++
 drivers/infiniband/hw/irdma/i40iw_hw.c             |  216 +
 drivers/infiniband/hw/irdma/i40iw_hw.h             |  160 +
 drivers/infiniband/hw/irdma/i40iw_if.c             |  229 +
 drivers/infiniband/hw/irdma/icrdma_hw.c            |   84 +
 drivers/infiniband/hw/irdma/icrdma_hw.h            |   71 +
 drivers/infiniband/hw/irdma/irdma.h                |  157 +
 drivers/infiniband/hw/irdma/main.c                 |  382 ++
 drivers/infiniband/hw/irdma/main.h                 |  565 ++
 drivers/infiniband/hw/irdma/osdep.h                |   86 +
 drivers/infiniband/hw/irdma/pble.c                 |  525 ++
 drivers/infiniband/hw/irdma/pble.h                 |  136 +
 drivers/infiniband/hw/irdma/protos.h               |  116 +
 drivers/infiniband/hw/irdma/puda.c                 | 1749 ++++++
 drivers/infiniband/hw/irdma/puda.h                 |  194 +
 drivers/infiniband/hw/irdma/status.h               |   71 +
 drivers/infiniband/hw/irdma/trace.c                |  112 +
 drivers/infiniband/hw/irdma/trace.h                |    3 +
 drivers/infiniband/hw/irdma/trace_cm.h             |  458 ++
 drivers/infiniband/hw/irdma/type.h                 | 1717 ++++++
 drivers/infiniband/hw/irdma/uda.c                  |  379 ++
 drivers/infiniband/hw/irdma/uda.h                  |   63 +
 drivers/infiniband/hw/irdma/uda_d.h                |  128 +
 drivers/infiniband/hw/irdma/uk.c                   | 1737 ++++++
 drivers/infiniband/hw/irdma/user.h                 |  464 ++
 drivers/infiniband/hw/irdma/utils.c                | 2543 ++++++++
 drivers/infiniband/hw/irdma/verbs.c                | 4544 +++++++++++++++
 drivers/infiniband/hw/irdma/verbs.h                |  225 +
 drivers/infiniband/hw/irdma/ws.c                   |  406 ++
 drivers/infiniband/hw/irdma/ws.h                   |   41 +
 drivers/net/ethernet/intel/Kconfig                 |    2 +
 drivers/net/ethernet/intel/i40e/i40e.h             |    2 +
 drivers/net/ethernet/intel/i40e/i40e_client.c      |  156 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c        |    1 +
 drivers/net/ethernet/intel/ice/Makefile            |    1 +
 drivers/net/ethernet/intel/ice/ice.h               |   40 +
 drivers/net/ethernet/intel/ice/ice_adminq_cmd.h    |   33 +
 drivers/net/ethernet/intel/ice/ice_common.c        |  217 +-
 drivers/net/ethernet/intel/ice/ice_common.h        |    9 +
 drivers/net/ethernet/intel/ice/ice_dcb_lib.c       |   58 +
 drivers/net/ethernet/intel/ice/ice_dcb_lib.h       |    3 +
 drivers/net/ethernet/intel/ice/ice_devlink.c       |   92 +-
 drivers/net/ethernet/intel/ice/ice_devlink.h       |    5 +
 drivers/net/ethernet/intel/ice/ice_hw_autogen.h    |    3 +-
 drivers/net/ethernet/intel/ice/ice_idc.c           |  793 +++
 drivers/net/ethernet/intel/ice/ice_idc_int.h       |   20 +
 drivers/net/ethernet/intel/ice/ice_lag.c           |    2 +
 drivers/net/ethernet/intel/ice/ice_lib.c           |   11 +
 drivers/net/ethernet/intel/ice/ice_lib.h           |    2 +-
 drivers/net/ethernet/intel/ice/ice_main.c          |  160 +-
 drivers/net/ethernet/intel/ice/ice_sched.c         |   69 +-
 drivers/net/ethernet/intel/ice/ice_switch.c        |   27 +
 drivers/net/ethernet/intel/ice/ice_switch.h        |    4 +
 drivers/net/ethernet/intel/ice/ice_type.h          |    4 +
 drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c   |   34 +
 include/linux/net/intel/i40e_client.h              |   14 +
 include/linux/net/intel/iidc.h                     |  242 +
 include/net/devlink.h                              |    4 +
 include/uapi/rdma/i40iw-abi.h                      |  107 -
 include/uapi/rdma/ib_user_ioctl_verbs.h            |    1 +
 include/uapi/rdma/irdma-abi.h                      |  116 +
 net/core/devlink.c                                 |    5 +
 105 files changed, 35528 insertions(+), 28900 deletions(-)
 delete mode 100644 drivers/infiniband/hw/i40iw/Kconfig
 delete mode 100644 drivers/infiniband/hw/i40iw/Makefile
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw.h
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_cm.c
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_cm.h
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_ctrl.c
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_d.h
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_hmc.c
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_hmc.h
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_hw.c
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_main.c
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_osdep.h
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_p.h
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_pble.c
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_pble.h
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_puda.c
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_puda.h
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_register.h
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_status.h
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_type.h
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_uk.c
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_user.h
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_utils.c
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_verbs.c
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_verbs.h
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_vf.c
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_vf.h
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_virtchnl.c
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_virtchnl.h
 create mode 100644 drivers/infiniband/hw/irdma/Kconfig
 create mode 100644 drivers/infiniband/hw/irdma/Makefile
 create mode 100644 drivers/infiniband/hw/irdma/cm.c
 create mode 100644 drivers/infiniband/hw/irdma/cm.h
 create mode 100644 drivers/infiniband/hw/irdma/ctrl.c
 create mode 100644 drivers/infiniband/hw/irdma/defs.h
 create mode 100644 drivers/infiniband/hw/irdma/hmc.c
 create mode 100644 drivers/infiniband/hw/irdma/hmc.h
 create mode 100644 drivers/infiniband/hw/irdma/hw.c
 create mode 100644 drivers/infiniband/hw/irdma/i40iw_hw.c
 create mode 100644 drivers/infiniband/hw/irdma/i40iw_hw.h
 create mode 100644 drivers/infiniband/hw/irdma/i40iw_if.c
 create mode 100644 drivers/infiniband/hw/irdma/icrdma_hw.c
 create mode 100644 drivers/infiniband/hw/irdma/icrdma_hw.h
 create mode 100644 drivers/infiniband/hw/irdma/irdma.h
 create mode 100644 drivers/infiniband/hw/irdma/main.c
 create mode 100644 drivers/infiniband/hw/irdma/main.h
 create mode 100644 drivers/infiniband/hw/irdma/osdep.h
 create mode 100644 drivers/infiniband/hw/irdma/pble.c
 create mode 100644 drivers/infiniband/hw/irdma/pble.h
 create mode 100644 drivers/infiniband/hw/irdma/protos.h
 create mode 100644 drivers/infiniband/hw/irdma/puda.c
 create mode 100644 drivers/infiniband/hw/irdma/puda.h
 create mode 100644 drivers/infiniband/hw/irdma/status.h
 create mode 100644 drivers/infiniband/hw/irdma/trace.c
 create mode 100644 drivers/infiniband/hw/irdma/trace.h
 create mode 100644 drivers/infiniband/hw/irdma/trace_cm.h
 create mode 100644 drivers/infiniband/hw/irdma/type.h
 create mode 100644 drivers/infiniband/hw/irdma/uda.c
 create mode 100644 drivers/infiniband/hw/irdma/uda.h
 create mode 100644 drivers/infiniband/hw/irdma/uda_d.h
 create mode 100644 drivers/infiniband/hw/irdma/uk.c
 create mode 100644 drivers/infiniband/hw/irdma/user.h
 create mode 100644 drivers/infiniband/hw/irdma/utils.c
 create mode 100644 drivers/infiniband/hw/irdma/verbs.c
 create mode 100644 drivers/infiniband/hw/irdma/verbs.h
 create mode 100644 drivers/infiniband/hw/irdma/ws.c
 create mode 100644 drivers/infiniband/hw/irdma/ws.h
 create mode 100644 drivers/net/ethernet/intel/ice/ice_idc.c
 create mode 100644 drivers/net/ethernet/intel/ice/ice_idc_int.h
 create mode 100644 include/linux/net/intel/iidc.h
 delete mode 100644 include/uapi/rdma/i40iw-abi.h
 create mode 100644 include/uapi/rdma/irdma-abi.h

Comments

Saleem, Shiraz April 6, 2021, 11:30 p.m. UTC | #1
> Subject: Re: [PATCH v4 00/23] Add Intel Ethernet Protocol Driver for RDMA
> (irdma)
> 
> On Tue, Apr 06, 2021 at 04:01:02PM -0500, Shiraz Saleem wrote:
> > Dave Ertman (4):
> >   iidc: Introduce iidc.h
> >   ice: Initialize RDMA support
> >   ice: Implement iidc operations
> >   ice: Register auxiliary device to provide RDMA
> >
> > Michael J. Ruhl (1):
> >   RDMA/irdma: Add dynamic tracing for CM
> >
> > Mustafa Ismail (13):
> >   RDMA/irdma: Register auxiliary driver and implement private channel
> >     OPs
> >   RDMA/irdma: Implement device initialization definitions
> >   RDMA/irdma: Implement HW Admin Queue OPs
> >   RDMA/irdma: Add HMC backing store setup functions
> >   RDMA/irdma: Add privileged UDA queue implementation
> >   RDMA/irdma: Add QoS definitions
> >   RDMA/irdma: Add connection manager
> >   RDMA/irdma: Add PBLE resource manager
> >   RDMA/irdma: Implement device supported verb APIs
> >   RDMA/irdma: Add RoCEv2 UD OP support
> >   RDMA/irdma: Add user/kernel shared libraries
> >   RDMA/irdma: Add miscellaneous utility definitions
> >   RDMA/irdma: Add ABI definitions
> >
> > Shiraz Saleem (5):
> >   ice: Add devlink params support
> >   i40e: Prep i40e header for aux bus conversion
> >   i40e: Register auxiliary devices to provide RDMA
> >   RDMA/irdma: Add irdma Kconfig/Makefile and remove i40iw
> >   RDMA/irdma: Update MAINTAINERS file
> 
> This doesn't apply, and I don't really know why:
> 
> Applying: iidc: Introduce iidc.h
> Applying: ice: Initialize RDMA support
> Applying: ice: Implement iidc operations
> Applying: ice: Register auxiliary device to provide RDMA
> Applying: ice: Add devlink params support
> Applying: i40e: Prep i40e header for aux bus conversion
> Applying: i40e: Register auxiliary devices to provide RDMA
> Applying: RDMA/irdma: Register auxiliary driver and implement private channel
> OPs
> Applying: RDMA/irdma: Implement device initialization definitions
> Applying: RDMA/irdma: Implement HW Admin Queue OPs
> Applying: RDMA/irdma: Add HMC backing store setup functions
> Applying: RDMA/irdma: Add privileged UDA queue implementation
> Applying: RDMA/irdma: Add QoS definitions
> Applying: RDMA/irdma: Add connection manager
> Applying: RDMA/irdma: Add PBLE resource manager
> Applying: RDMA/irdma: Implement device supported verb APIs
> Applying: RDMA/irdma: Add RoCEv2 UD OP support
> Applying: RDMA/irdma: Add user/kernel shared libraries
> Applying: RDMA/irdma: Add miscellaneous utility definitions
> Applying: RDMA/irdma: Add dynamic tracing for CM
> Applying: RDMA/irdma: Add ABI definitions
> Applying: RDMA/irdma: Add irdma Kconfig/Makefile and remove i40iw Using index
> info to reconstruct a base tree...
> error: removal patch leaves file contents
> error: drivers/infiniband/hw/i40iw/Kconfig: patch does not apply
> 
> Can you investigate and fix it? Perhaps using a 9 year old version of git is the
> problem?
> 

Hi Jason - I think its because I used --irreversible-delete flag in git format-patch for review that this one doesn't apply.

I can resend without it if your trying to apply.

Shiraz
Saleem, Shiraz April 7, 2021, 12:18 a.m. UTC | #2
> Subject: RE: [PATCH v4 00/23] Add Intel Ethernet Protocol Driver for RDMA
> (irdma)
> 
> > Subject: Re: [PATCH v4 00/23] Add Intel Ethernet Protocol Driver for
> > RDMA
> > (irdma)
> >
> > On Tue, Apr 06, 2021 at 04:01:02PM -0500, Shiraz Saleem wrote:
> > > Dave Ertman (4):
> > >   iidc: Introduce iidc.h
> > >   ice: Initialize RDMA support
> > >   ice: Implement iidc operations
> > >   ice: Register auxiliary device to provide RDMA
> > >
> > > Michael J. Ruhl (1):
> > >   RDMA/irdma: Add dynamic tracing for CM
> > >
> > > Mustafa Ismail (13):
> > >   RDMA/irdma: Register auxiliary driver and implement private channel
> > >     OPs
> > >   RDMA/irdma: Implement device initialization definitions
> > >   RDMA/irdma: Implement HW Admin Queue OPs
> > >   RDMA/irdma: Add HMC backing store setup functions
> > >   RDMA/irdma: Add privileged UDA queue implementation
> > >   RDMA/irdma: Add QoS definitions
> > >   RDMA/irdma: Add connection manager
> > >   RDMA/irdma: Add PBLE resource manager
> > >   RDMA/irdma: Implement device supported verb APIs
> > >   RDMA/irdma: Add RoCEv2 UD OP support
> > >   RDMA/irdma: Add user/kernel shared libraries
> > >   RDMA/irdma: Add miscellaneous utility definitions
> > >   RDMA/irdma: Add ABI definitions
> > >
> > > Shiraz Saleem (5):
> > >   ice: Add devlink params support
> > >   i40e: Prep i40e header for aux bus conversion
> > >   i40e: Register auxiliary devices to provide RDMA
> > >   RDMA/irdma: Add irdma Kconfig/Makefile and remove i40iw
> > >   RDMA/irdma: Update MAINTAINERS file
> >
> > This doesn't apply, and I don't really know why:
> >
> > Applying: iidc: Introduce iidc.h
> > Applying: ice: Initialize RDMA support
> > Applying: ice: Implement iidc operations
> > Applying: ice: Register auxiliary device to provide RDMA
> > Applying: ice: Add devlink params support
> > Applying: i40e: Prep i40e header for aux bus conversion
> > Applying: i40e: Register auxiliary devices to provide RDMA
> > Applying: RDMA/irdma: Register auxiliary driver and implement private
> > channel OPs
> > Applying: RDMA/irdma: Implement device initialization definitions
> > Applying: RDMA/irdma: Implement HW Admin Queue OPs
> > Applying: RDMA/irdma: Add HMC backing store setup functions
> > Applying: RDMA/irdma: Add privileged UDA queue implementation
> > Applying: RDMA/irdma: Add QoS definitions
> > Applying: RDMA/irdma: Add connection manager
> > Applying: RDMA/irdma: Add PBLE resource manager
> > Applying: RDMA/irdma: Implement device supported verb APIs
> > Applying: RDMA/irdma: Add RoCEv2 UD OP support
> > Applying: RDMA/irdma: Add user/kernel shared libraries
> > Applying: RDMA/irdma: Add miscellaneous utility definitions
> > Applying: RDMA/irdma: Add dynamic tracing for CM
> > Applying: RDMA/irdma: Add ABI definitions
> > Applying: RDMA/irdma: Add irdma Kconfig/Makefile and remove i40iw
> > Using index info to reconstruct a base tree...
> > error: removal patch leaves file contents
> > error: drivers/infiniband/hw/i40iw/Kconfig: patch does not apply
> >
> > Can you investigate and fix it? Perhaps using a 9 year old version of
> > git is the problem?
> >
> 
> Hi Jason - I think its because I used --irreversible-delete flag in git format-patch for
> review that this one doesn't apply.
> 
> I can resend without it if your trying to apply.
> 

I resend it without using --irreversible-delete. Hopefully, it applies cleanly now.
Jason Gunthorpe April 7, 2021, 11:31 a.m. UTC | #3
On Tue, Apr 06, 2021 at 11:30:23PM +0000, Saleem, Shiraz wrote:

> Hi Jason - I think its because I used --irreversible-delete flag in git format-patch for review that this one doesn't apply.


I doubt it

> I can resend without it if your trying to apply.


Now it is too big to go to the mailing lists

Jason
Jason Gunthorpe April 7, 2021, 2:57 p.m. UTC | #4
On Tue, Apr 06, 2021 at 04:01:07PM -0500, Shiraz Saleem wrote:
> Add a new generic runtime devlink parameter 'rdma_protocol'

> and use it in ice PCI driver. Configuration changes

> result in unplugging the auxiliary RDMA device and re-plugging

> it with updated values for irdma auxiiary driver to consume at

> drv.probe()

> 

> Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>

>  .../networking/devlink/devlink-params.rst          |  6 ++

>  Documentation/networking/devlink/ice.rst           | 13 +++

>  drivers/net/ethernet/intel/ice/ice_devlink.c       | 92 +++++++++++++++++++++-

>  drivers/net/ethernet/intel/ice/ice_devlink.h       |  5 ++

>  drivers/net/ethernet/intel/ice/ice_main.c          |  2 +

>  include/net/devlink.h                              |  4 +

>  net/core/devlink.c                                 |  5 ++

>  7 files changed, 125 insertions(+), 2 deletions(-)

> 

> diff --git a/Documentation/networking/devlink/devlink-params.rst b/Documentation/networking/devlink/devlink-params.rst

> index 54c9f10..0b454c3 100644

> +++ b/Documentation/networking/devlink/devlink-params.rst

> @@ -114,3 +114,9 @@ own name.

>         will NACK any attempt of other host to reset the device. This parameter

>         is useful for setups where a device is shared by different hosts, such

>         as multi-host setup.

> +   * - ``rdma_protocol``

> +     - string

> +     - Selects the RDMA protocol selected for multi-protocol devices.

> +        - ``iwarp`` iWARP

> +	- ``roce`` RoCE

> +	- ``ib`` Infiniband


I'm still not sure this belongs in devlink. 

What about devices that support roce and iwarp concurrently?

There is nothing at the protocol level that precludes this - doesn't
this device allow it?

I know Parav is looking at the general problem of how to customize
what aux devices are created, that may be a better fit for this.

Can you remove the devlink parts to make progress?

Jason
Saleem, Shiraz April 7, 2021, 3:06 p.m. UTC | #5
> Subject: Re: [PATCH v4 00/23] Add Intel Ethernet Protocol Driver for RDMA

> (irdma)

> 

> On Tue, Apr 06, 2021 at 11:30:23PM +0000, Saleem, Shiraz wrote:

> 

> > Hi Jason - I think its because I used --irreversible-delete flag in git format-patch

> for review that this one doesn't apply.

> 

> I doubt it


The documentation hints at it.
https://git-scm.com/docs/git-format-patch

-D
--irreversible-delete
Omit the preimage for deletes, i.e. print only the header but not the diff between the preimage and /dev/null. The resulting patch is not meant to be applied with patch or git apply; this is solely for people who want to just concentrate on reviewing the text after the change. In addition, the output obviously lacks enough information to apply such a patch in reverse, even manually, hence the name of the option.

> 

> > I can resend without it if your trying to apply.

> 

> Now it is too big to go to the mailing lists

> 


It is showing now.
https://patchwork.kernel.org/project/linux-rdma/patch/20210407001502.1890-23-shiraz.saleem@intel.com/
Jason Gunthorpe April 7, 2021, 3:44 p.m. UTC | #6
On Tue, Apr 06, 2021 at 04:01:03PM -0500, Shiraz Saleem wrote:

> +/* Following APIs are implemented by core PCI driver */

> +struct iidc_core_ops {

> +	/* APIs to allocate resources such as VEB, VSI, Doorbell queues,

> +	 * completion queues, Tx/Rx queues, etc...

> +	 */

> +	int (*alloc_res)(struct iidc_core_dev_info *cdev_info,

> +			 struct iidc_res *res,

> +			 int partial_acceptable);

> +	int (*free_res)(struct iidc_core_dev_info *cdev_info,

> +			struct iidc_res *res);

> +

> +	int (*request_reset)(struct iidc_core_dev_info *cdev_info,

> +			     enum iidc_reset_type reset_type);

> +

> +	int (*update_vport_filter)(struct iidc_core_dev_info *cdev_info,

> +				   u16 vport_id, bool enable);

> +	int (*vc_send)(struct iidc_core_dev_info *cdev_info, u32 vf_id, u8 *msg,

> +		       u16 len);

> +};


What is this? There is only one implementation:

static const struct iidc_core_ops ops = {
	.alloc_res			= ice_cdev_info_alloc_res,
	.free_res			= ice_cdev_info_free_res,
	.request_reset			= ice_cdev_info_request_reset,
	.update_vport_filter		= ice_cdev_info_update_vsi_filter,
	.vc_send			= ice_cdev_info_vc_send,
};

So export and call the functions directly.

We just had this very same discussion with Broadcom.

I notice there is no module dependency between irdma and the ethernet
driver because the above ops are avoiding it.

This entire idea was already NAK'd once:

https://lore.kernel.org/linux-rdma/20180522203831.20624-1-jeffrey.t.kirsher@intel.com/

So please remove it.

Jason
Jason Gunthorpe April 7, 2021, 5:35 p.m. UTC | #7
On Tue, Apr 06, 2021 at 04:01:03PM -0500, Shiraz Saleem wrote:

> +struct iidc_res_base {

> +	/* Union for future provision e.g. other res_type */

> +	union {

> +		struct iidc_rdma_qset_params qsets;

> +	} res;


Use an anonymous union?

There is alot of confusiong provisioning for future types, do you have
concrete plans here? I'm a bit confused why this is so different from
how mlx5 ended up when it already has multiple types.

> +};

> +

> +struct iidc_res {

> +	/* Type of resource. */

> +	enum iidc_res_type res_type;

> +	/* Count requested */

> +	u16 cnt_req;

> +

> +	/* Number of resources allocated. Filled in by callee.

> +	 * Based on this value, caller to fill up "resources"

> +	 */

> +	u16 res_allocated;

> +

> +	/* Unique handle to resources allocated. Zero if call fails.

> +	 * Allocated by callee and for now used by caller for internal

> +	 * tracking purpose.

> +	 */

> +	u32 res_handle;

> +

> +	/* Peer driver has to allocate sufficient memory, to accommodate

> +	 * cnt_requested before calling this function.


Calling what function?

> +	 * Memory has to be zero initialized. It is input/output param.

> +	 * As a result of alloc_res API, this structures will be populated.

> +	 */

> +	struct iidc_res_base res[1];


So it is a wrongly defined flex array? Confused

The usages are all using this as some super-complicated function argument:

	struct iidc_res rdma_qset_res = {};

	rdma_qset_res.res_allocated = 1;
	rdma_qset_res.res_type = IIDC_RDMA_QSETS_TXSCHED;
	rdma_qset_res.res[0].res.qsets.vport_id = vsi->vsi_idx;
	rdma_qset_res.res[0].res.qsets.teid = tc_node->l2_sched_node_id;
	rdma_qset_res.res[0].res.qsets.qs_handle = tc_node->qs_handle;

	if (cdev_info->ops->free_res(cdev_info, &rdma_qset_res))

So the answer here is to make your function calls sane and well
architected. If you have to pass a union to call a function then
something is very wrong with the design.

You aren't trying to achieve ABI decoupling of the rdma/ethernet
modules with an obfuscated complicated function pointer indirection,
are you?

Please use sane function calls 

> +/* Following APIs are implemented by auxiliary drivers and invoked by core PCI

> + * driver

> + */

> +struct iidc_auxiliary_ops {

> +	/* This event_handler is meant to be a blocking call.  For instance,

> +	 * when a BEFORE_MTU_CHANGE event comes in, the event_handler will not

> +	 * return until the auxiliary driver is ready for the MTU change to

> +	 * happen.

> +	 */

> +	void (*event_handler)(struct iidc_core_dev_info *cdev_info,

> +			      struct iidc_event *event);

> +

> +	int (*vc_receive)(struct iidc_core_dev_info *cdev_info, u32 vf_id,

> +			  u8 *msg, u16 len);

> +};


This is not the normal pattern:

> +struct iidc_auxiliary_drv {

> +	struct auxiliary_driver adrv;

> +	struct iidc_auxiliary_ops *ops;

> +};


Just put the two functions above in the drv directly:

struct iidc_auxiliary_drv {
        struct auxilary_driver adrv;
	void (*event_handler)(struct iidc_core_dev_info *cdev_info, *cdev_info,
			      struct iidc_event *event);

	int (*vc_receive)(struct iidc_core_dev_info *cdev_info, u32 vf_id,
			  u8 *msg, u16 len);
}

> +

> +#define IIDC_RDMA_NAME	"intel_rdma"

> +#define IIDC_RDMA_ID	0x00000010

> +#define IIDC_MAX_NUM_AUX	4

> +

> +/* The const struct that instantiates cdev_info_id needs to be initialized

> + * in the .c with the macro ASSIGN_IIDC_INFO.

> + * For example:

> + * static const struct cdev_info_id cdev_info_ids[] = ASSIGN_IIDC_INFO;

> + */

> +struct cdev_info_id {

> +	char *name;

> +	int id;

> +};

> +

> +#define IIDC_RDMA_INFO   { .name = IIDC_RDMA_NAME,  .id = IIDC_RDMA_ID },

> +

> +#define ASSIGN_IIDC_INFO	\

> +{				\

> +	IIDC_RDMA_INFO		\

> +}


I tried to figure out what all this was for and came up short. There
is only one user and all this seems unnecessary in this series, add it
later when you need it.

> +

> +#define iidc_priv(x) ((x)->auxiliary_priv)


Use a static inline function

Jason
Saleem, Shiraz April 7, 2021, 8:58 p.m. UTC | #8
> Subject: Re: [PATCH v4 05/23] ice: Add devlink params support

> 

> On Tue, Apr 06, 2021 at 04:01:07PM -0500, Shiraz Saleem wrote:

> > Add a new generic runtime devlink parameter 'rdma_protocol'

> > and use it in ice PCI driver. Configuration changes result in

> > unplugging the auxiliary RDMA device and re-plugging it with updated

> > values for irdma auxiiary driver to consume at

> > drv.probe()

> >

> > Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>

> >  .../networking/devlink/devlink-params.rst          |  6 ++

> >  Documentation/networking/devlink/ice.rst           | 13 +++

> >  drivers/net/ethernet/intel/ice/ice_devlink.c       | 92 +++++++++++++++++++++-

> >  drivers/net/ethernet/intel/ice/ice_devlink.h       |  5 ++

> >  drivers/net/ethernet/intel/ice/ice_main.c          |  2 +

> >  include/net/devlink.h                              |  4 +

> >  net/core/devlink.c                                 |  5 ++

> >  7 files changed, 125 insertions(+), 2 deletions(-)

> >

> > diff --git a/Documentation/networking/devlink/devlink-params.rst

> > b/Documentation/networking/devlink/devlink-params.rst

> > index 54c9f10..0b454c3 100644

> > +++ b/Documentation/networking/devlink/devlink-params.rst

> > @@ -114,3 +114,9 @@ own name.

> >         will NACK any attempt of other host to reset the device. This parameter

> >         is useful for setups where a device is shared by different hosts, such

> >         as multi-host setup.

> > +   * - ``rdma_protocol``

> > +     - string

> > +     - Selects the RDMA protocol selected for multi-protocol devices.

> > +        - ``iwarp`` iWARP

> > +	- ``roce`` RoCE

> > +	- ``ib`` Infiniband

> 

> I'm still not sure this belongs in devlink.


I believe you suggested we use devlink for protocol switch.

> 

> What about devices that support roce and iwarp concurrently?

> 

> There is nothing at the protocol level that precludes this - doesn't this device allow

> it?


Nope. This device doesn’t support both protocols concurrently on same PCI function.

Maybe then it makes sense to move this protocol switch as driver specific devlink?

> 

> I know Parav is looking at the general problem of how to customize what aux

> devices are created, that may be a better fit for this.

> 

> Can you remove the devlink parts to make progress?

> 


It is important since otherwise the customer will have no way to use RoCEv2 on this device.

Shiraz
Saleem, Shiraz April 7, 2021, 8:58 p.m. UTC | #9
> Subject: Re: [PATCH v4 01/23] iidc: Introduce iidc.h

> 

> On Tue, Apr 06, 2021 at 04:01:03PM -0500, Shiraz Saleem wrote:

> 

> > +/* Following APIs are implemented by core PCI driver */ struct

> > +iidc_core_ops {

> > +	/* APIs to allocate resources such as VEB, VSI, Doorbell queues,

> > +	 * completion queues, Tx/Rx queues, etc...

> > +	 */

> > +	int (*alloc_res)(struct iidc_core_dev_info *cdev_info,

> > +			 struct iidc_res *res,

> > +			 int partial_acceptable);

> > +	int (*free_res)(struct iidc_core_dev_info *cdev_info,

> > +			struct iidc_res *res);

> > +

> > +	int (*request_reset)(struct iidc_core_dev_info *cdev_info,

> > +			     enum iidc_reset_type reset_type);

> > +

> > +	int (*update_vport_filter)(struct iidc_core_dev_info *cdev_info,

> > +				   u16 vport_id, bool enable);

> > +	int (*vc_send)(struct iidc_core_dev_info *cdev_info, u32 vf_id, u8 *msg,

> > +		       u16 len);

> > +};

> 

> What is this? There is only one implementation:

> 

> static const struct iidc_core_ops ops = {

> 	.alloc_res			= ice_cdev_info_alloc_res,

> 	.free_res			= ice_cdev_info_free_res,

> 	.request_reset			= ice_cdev_info_request_reset,

> 	.update_vport_filter		= ice_cdev_info_update_vsi_filter,

> 	.vc_send			= ice_cdev_info_vc_send,

> };

> 

> So export and call the functions directly.


No. Then we end up requiring ice to be loaded even when just want to use irdma with x722 [whose ethernet driver is "i40e"].
This will not allow us to be a unified RDMA driver.

> 

> We just had this very same discussion with Broadcom.


Yes, but they only have a single ethernet driver today I presume.

Here we have a single RDMA driver and 2 ethernet drivers, i40e and ice.

Shiraz
Jason Gunthorpe April 7, 2021, 10:43 p.m. UTC | #10
On Wed, Apr 07, 2021 at 08:58:49PM +0000, Saleem, Shiraz wrote:
> > Subject: Re: [PATCH v4 01/23] iidc: Introduce iidc.h

> > 

> > On Tue, Apr 06, 2021 at 04:01:03PM -0500, Shiraz Saleem wrote:

> > 

> > > +/* Following APIs are implemented by core PCI driver */ struct

> > > +iidc_core_ops {

> > > +	/* APIs to allocate resources such as VEB, VSI, Doorbell queues,

> > > +	 * completion queues, Tx/Rx queues, etc...

> > > +	 */

> > > +	int (*alloc_res)(struct iidc_core_dev_info *cdev_info,

> > > +			 struct iidc_res *res,

> > > +			 int partial_acceptable);

> > > +	int (*free_res)(struct iidc_core_dev_info *cdev_info,

> > > +			struct iidc_res *res);

> > > +

> > > +	int (*request_reset)(struct iidc_core_dev_info *cdev_info,

> > > +			     enum iidc_reset_type reset_type);

> > > +

> > > +	int (*update_vport_filter)(struct iidc_core_dev_info *cdev_info,

> > > +				   u16 vport_id, bool enable);

> > > +	int (*vc_send)(struct iidc_core_dev_info *cdev_info, u32 vf_id, u8 *msg,

> > > +		       u16 len);

> > > +};

> > 

> > What is this? There is only one implementation:

> > 

> > static const struct iidc_core_ops ops = {

> > 	.alloc_res			= ice_cdev_info_alloc_res,

> > 	.free_res			= ice_cdev_info_free_res,

> > 	.request_reset			= ice_cdev_info_request_reset,

> > 	.update_vport_filter		= ice_cdev_info_update_vsi_filter,

> > 	.vc_send			= ice_cdev_info_vc_send,

> > };

> > 

> > So export and call the functions directly.

> 

> No. Then we end up requiring ice to be loaded even when just want to

> use irdma with x722 [whose ethernet driver is "i40e"].


So what? What does it matter to load a few extra kb of modules?

If you had both drivers use the same ops interface you'd have an
argument.

Jason
Jason Gunthorpe April 7, 2021, 10:46 p.m. UTC | #11
On Wed, Apr 07, 2021 at 08:58:25PM +0000, Saleem, Shiraz wrote:
> > Subject: Re: [PATCH v4 05/23] ice: Add devlink params support

> > 

> > On Tue, Apr 06, 2021 at 04:01:07PM -0500, Shiraz Saleem wrote:

> > > Add a new generic runtime devlink parameter 'rdma_protocol'

> > > and use it in ice PCI driver. Configuration changes result in

> > > unplugging the auxiliary RDMA device and re-plugging it with updated

> > > values for irdma auxiiary driver to consume at

> > > drv.probe()

> > >

> > > Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>

> > >  .../networking/devlink/devlink-params.rst          |  6 ++

> > >  Documentation/networking/devlink/ice.rst           | 13 +++

> > >  drivers/net/ethernet/intel/ice/ice_devlink.c       | 92 +++++++++++++++++++++-

> > >  drivers/net/ethernet/intel/ice/ice_devlink.h       |  5 ++

> > >  drivers/net/ethernet/intel/ice/ice_main.c          |  2 +

> > >  include/net/devlink.h                              |  4 +

> > >  net/core/devlink.c                                 |  5 ++

> > >  7 files changed, 125 insertions(+), 2 deletions(-)

> > >

> > > diff --git a/Documentation/networking/devlink/devlink-params.rst

> > > b/Documentation/networking/devlink/devlink-params.rst

> > > index 54c9f10..0b454c3 100644

> > > +++ b/Documentation/networking/devlink/devlink-params.rst

> > > @@ -114,3 +114,9 @@ own name.

> > >         will NACK any attempt of other host to reset the device. This parameter

> > >         is useful for setups where a device is shared by different hosts, such

> > >         as multi-host setup.

> > > +   * - ``rdma_protocol``

> > > +     - string

> > > +     - Selects the RDMA protocol selected for multi-protocol devices.

> > > +        - ``iwarp`` iWARP

> > > +	- ``roce`` RoCE

> > > +	- ``ib`` Infiniband

> > 

> > I'm still not sure this belongs in devlink.

> 

> I believe you suggested we use devlink for protocol switch.


Yes, devlink is the right place, but selecting a *single* protocol
doesn't seem right, or general enough.

Parav is talking about generic ways to customize the aux devices
created and that would seem to serve the same function as this.

> > I know Parav is looking at the general problem of how to customize what aux

> > devices are created, that may be a better fit for this.

> > 

> > Can you remove the devlink parts to make progress?

> 

> It is important since otherwise the customer will have no way to use RoCEv2 on this device.


I'm not saying to not having it eventually, I'm just getting tired of
looking at 23 patches. You can argue it out after

I'm also half thinking of applying this under driver/staging or
CONFIG_BROKEN or something just because I am getting sick of looking
at it.

Jason
Leon Romanovsky April 8, 2021, 7:14 a.m. UTC | #12
On Wed, Apr 07, 2021 at 07:43:24PM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 07, 2021 at 08:58:49PM +0000, Saleem, Shiraz wrote:

> > > Subject: Re: [PATCH v4 01/23] iidc: Introduce iidc.h

> > > 

> > > On Tue, Apr 06, 2021 at 04:01:03PM -0500, Shiraz Saleem wrote:

> > > 

> > > > +/* Following APIs are implemented by core PCI driver */ struct

> > > > +iidc_core_ops {

> > > > +	/* APIs to allocate resources such as VEB, VSI, Doorbell queues,

> > > > +	 * completion queues, Tx/Rx queues, etc...

> > > > +	 */

> > > > +	int (*alloc_res)(struct iidc_core_dev_info *cdev_info,

> > > > +			 struct iidc_res *res,

> > > > +			 int partial_acceptable);

> > > > +	int (*free_res)(struct iidc_core_dev_info *cdev_info,

> > > > +			struct iidc_res *res);

> > > > +

> > > > +	int (*request_reset)(struct iidc_core_dev_info *cdev_info,

> > > > +			     enum iidc_reset_type reset_type);

> > > > +

> > > > +	int (*update_vport_filter)(struct iidc_core_dev_info *cdev_info,

> > > > +				   u16 vport_id, bool enable);

> > > > +	int (*vc_send)(struct iidc_core_dev_info *cdev_info, u32 vf_id, u8 *msg,

> > > > +		       u16 len);

> > > > +};

> > > 

> > > What is this? There is only one implementation:

> > > 

> > > static const struct iidc_core_ops ops = {

> > > 	.alloc_res			= ice_cdev_info_alloc_res,

> > > 	.free_res			= ice_cdev_info_free_res,

> > > 	.request_reset			= ice_cdev_info_request_reset,

> > > 	.update_vport_filter		= ice_cdev_info_update_vsi_filter,

> > > 	.vc_send			= ice_cdev_info_vc_send,

> > > };

> > > 

> > > So export and call the functions directly.

> > 

> > No. Then we end up requiring ice to be loaded even when just want to

> > use irdma with x722 [whose ethernet driver is "i40e"].

> 

> So what? What does it matter to load a few extra kb of modules?


And if user cares about it, he will blacklist that module anyway.

Thanks
Saleem, Shiraz April 9, 2021, 1:38 a.m. UTC | #13
> Subject: Re: [PATCH v4 01/23] iidc: Introduce iidc.h

> 

> On Wed, Apr 07, 2021 at 07:43:24PM -0300, Jason Gunthorpe wrote:

> > On Wed, Apr 07, 2021 at 08:58:49PM +0000, Saleem, Shiraz wrote:

> > > > Subject: Re: [PATCH v4 01/23] iidc: Introduce iidc.h

> > > >

> > > > On Tue, Apr 06, 2021 at 04:01:03PM -0500, Shiraz Saleem wrote:

> > > >

> > > > > +/* Following APIs are implemented by core PCI driver */ struct

> > > > > +iidc_core_ops {

> > > > > +	/* APIs to allocate resources such as VEB, VSI, Doorbell queues,

> > > > > +	 * completion queues, Tx/Rx queues, etc...

> > > > > +	 */

> > > > > +	int (*alloc_res)(struct iidc_core_dev_info *cdev_info,

> > > > > +			 struct iidc_res *res,

> > > > > +			 int partial_acceptable);

> > > > > +	int (*free_res)(struct iidc_core_dev_info *cdev_info,

> > > > > +			struct iidc_res *res);

> > > > > +

> > > > > +	int (*request_reset)(struct iidc_core_dev_info *cdev_info,

> > > > > +			     enum iidc_reset_type reset_type);

> > > > > +

> > > > > +	int (*update_vport_filter)(struct iidc_core_dev_info *cdev_info,

> > > > > +				   u16 vport_id, bool enable);

> > > > > +	int (*vc_send)(struct iidc_core_dev_info *cdev_info, u32 vf_id, u8

> *msg,

> > > > > +		       u16 len);

> > > > > +};

> > > >

> > > > What is this? There is only one implementation:

> > > >

> > > > static const struct iidc_core_ops ops = {

> > > > 	.alloc_res			= ice_cdev_info_alloc_res,

> > > > 	.free_res			= ice_cdev_info_free_res,

> > > > 	.request_reset			= ice_cdev_info_request_reset,

> > > > 	.update_vport_filter		= ice_cdev_info_update_vsi_filter,

> > > > 	.vc_send			= ice_cdev_info_vc_send,

> > > > };

> > > >

> > > > So export and call the functions directly.

> > >

> > > No. Then we end up requiring ice to be loaded even when just want to

> > > use irdma with x722 [whose ethernet driver is "i40e"].

> >

> > So what? What does it matter to load a few extra kb of modules?

> 

> And if user cares about it, he will blacklist that module anyway.

> 

 blacklist ice when you just have an x722 card? How does that solve anything? You wont be able to load irdma then.

Shiraz
Leon Romanovsky April 11, 2021, 11:48 a.m. UTC | #14
On Fri, Apr 09, 2021 at 01:38:37AM +0000, Saleem, Shiraz wrote:
> > Subject: Re: [PATCH v4 01/23] iidc: Introduce iidc.h

> > 

> > On Wed, Apr 07, 2021 at 07:43:24PM -0300, Jason Gunthorpe wrote:

> > > On Wed, Apr 07, 2021 at 08:58:49PM +0000, Saleem, Shiraz wrote:

> > > > > Subject: Re: [PATCH v4 01/23] iidc: Introduce iidc.h

> > > > >

> > > > > On Tue, Apr 06, 2021 at 04:01:03PM -0500, Shiraz Saleem wrote:

> > > > >

> > > > > > +/* Following APIs are implemented by core PCI driver */ struct

> > > > > > +iidc_core_ops {

> > > > > > +	/* APIs to allocate resources such as VEB, VSI, Doorbell queues,

> > > > > > +	 * completion queues, Tx/Rx queues, etc...

> > > > > > +	 */

> > > > > > +	int (*alloc_res)(struct iidc_core_dev_info *cdev_info,

> > > > > > +			 struct iidc_res *res,

> > > > > > +			 int partial_acceptable);

> > > > > > +	int (*free_res)(struct iidc_core_dev_info *cdev_info,

> > > > > > +			struct iidc_res *res);

> > > > > > +

> > > > > > +	int (*request_reset)(struct iidc_core_dev_info *cdev_info,

> > > > > > +			     enum iidc_reset_type reset_type);

> > > > > > +

> > > > > > +	int (*update_vport_filter)(struct iidc_core_dev_info *cdev_info,

> > > > > > +				   u16 vport_id, bool enable);

> > > > > > +	int (*vc_send)(struct iidc_core_dev_info *cdev_info, u32 vf_id, u8

> > *msg,

> > > > > > +		       u16 len);

> > > > > > +};

> > > > >

> > > > > What is this? There is only one implementation:

> > > > >

> > > > > static const struct iidc_core_ops ops = {

> > > > > 	.alloc_res			= ice_cdev_info_alloc_res,

> > > > > 	.free_res			= ice_cdev_info_free_res,

> > > > > 	.request_reset			= ice_cdev_info_request_reset,

> > > > > 	.update_vport_filter		= ice_cdev_info_update_vsi_filter,

> > > > > 	.vc_send			= ice_cdev_info_vc_send,

> > > > > };

> > > > >

> > > > > So export and call the functions directly.

> > > >

> > > > No. Then we end up requiring ice to be loaded even when just want to

> > > > use irdma with x722 [whose ethernet driver is "i40e"].

> > >

> > > So what? What does it matter to load a few extra kb of modules?

> > 

> > And if user cares about it, he will blacklist that module anyway.

> > 

>  blacklist ice when you just have an x722 card? How does that solve anything? You wont be able to load irdma then.


You will blacklist i40e if you want solely irdma functionality.

Thanks

> 

> Shiraz
Saleem, Shiraz April 12, 2021, 2:50 p.m. UTC | #15
> Subject: Re: [PATCH v4 01/23] iidc: Introduce iidc.h

> 

> On Wed, Apr 07, 2021 at 08:58:49PM +0000, Saleem, Shiraz wrote:

> > > Subject: Re: [PATCH v4 01/23] iidc: Introduce iidc.h

> > >

> > > On Tue, Apr 06, 2021 at 04:01:03PM -0500, Shiraz Saleem wrote:

> > >

> > > > +/* Following APIs are implemented by core PCI driver */ struct

> > > > +iidc_core_ops {

> > > > +	/* APIs to allocate resources such as VEB, VSI, Doorbell queues,

> > > > +	 * completion queues, Tx/Rx queues, etc...

> > > > +	 */

> > > > +	int (*alloc_res)(struct iidc_core_dev_info *cdev_info,

> > > > +			 struct iidc_res *res,

> > > > +			 int partial_acceptable);

> > > > +	int (*free_res)(struct iidc_core_dev_info *cdev_info,

> > > > +			struct iidc_res *res);

> > > > +

> > > > +	int (*request_reset)(struct iidc_core_dev_info *cdev_info,

> > > > +			     enum iidc_reset_type reset_type);

> > > > +

> > > > +	int (*update_vport_filter)(struct iidc_core_dev_info *cdev_info,

> > > > +				   u16 vport_id, bool enable);

> > > > +	int (*vc_send)(struct iidc_core_dev_info *cdev_info, u32 vf_id, u8 *msg,

> > > > +		       u16 len);

> > > > +};

> > >

> > > What is this? There is only one implementation:

> > >

> > > static const struct iidc_core_ops ops = {

> > > 	.alloc_res			= ice_cdev_info_alloc_res,

> > > 	.free_res			= ice_cdev_info_free_res,

> > > 	.request_reset			= ice_cdev_info_request_reset,

> > > 	.update_vport_filter		= ice_cdev_info_update_vsi_filter,

> > > 	.vc_send			= ice_cdev_info_vc_send,

> > > };

> > >

> > > So export and call the functions directly.

> >

> > No. Then we end up requiring ice to be loaded even when just want to

> > use irdma with x722 [whose ethernet driver is "i40e"].

> 

> So what? What does it matter to load a few extra kb of modules?


Because it is an unnecessary thing to force a user to build/load drivers for
which they don't have the HW for? The problem gets compounded if we have to
do it for all future HW Intel PCI drivers, i.e. depends on ICE && ....

IIDC is Intel's converged and generic RDMA <--> PCI driver channel interface; which
we intend to use moving forward. And these .ops callbacks are part of this interface which will
have different implementations by each HW generation PCI core driver. It is extensible
with new ops added to the table for new HW and where implementations of the certain ops on some
HW will be NULL.

There is a near-term Intel ethernet VF driver which will use IIDC to provide RDMA in the VF,
and implement some of these .ops callbacks. There is also intent to move i40e to IIDC. 

And yes, it allows to load a unified irdma driver without having all the mulit-gen PCI core drivers to be
built/loaded as a pre-requisite which is solving a pain-point to the user and does not unnecessarily
add a memory foot-print.

In the past, with i40e <-> i40iw, I acknowledge such a dependency was decoupled
for the wrong reasons [1] and understand where your frustration is coming from. But in
a unified irdma driver model connecting to multiple PCI gen drivers, I do think it serves
a reason. This has also been voiced over the years in some of our discussions [2] leading to
the auxiliary bus and been part of our submissions from the get go. In fact, use of such domain
specific .ops from the parent device is an assumption baked into the design when the auxiliary bus
was conceived and in the documentation [3] (See Example Usage).

[1] https://lore.kernel.org/linux-rdma/20180522205612.GD7502@mellanox.com/
[2] https://lore.kernel.org/linux-rdma/2B0E3F215D1AB84DA946C8BEE234CCC97B2E1D29@ORSMSX101.amr.corp.intel.com/
[3] https://www.kernel.org/doc/html/latest/driver-api/auxiliary_bus.html

Shiraz
Saleem, Shiraz April 12, 2021, 2:50 p.m. UTC | #16
> Subject: Re: [PATCH v4 05/23] ice: Add devlink params support

> 

> On Wed, Apr 07, 2021 at 08:58:25PM +0000, Saleem, Shiraz wrote:

> > > Subject: Re: [PATCH v4 05/23] ice: Add devlink params support

> > >

> > > On Tue, Apr 06, 2021 at 04:01:07PM -0500, Shiraz Saleem wrote:

> > > > Add a new generic runtime devlink parameter 'rdma_protocol'

> > > > and use it in ice PCI driver. Configuration changes result in

> > > > unplugging the auxiliary RDMA device and re-plugging it with

> > > > updated values for irdma auxiiary driver to consume at

> > > > drv.probe()

> > > >

> > > > Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>

> > > >  .../networking/devlink/devlink-params.rst          |  6 ++

> > > >  Documentation/networking/devlink/ice.rst           | 13 +++

> > > >  drivers/net/ethernet/intel/ice/ice_devlink.c       | 92

> +++++++++++++++++++++-

> > > >  drivers/net/ethernet/intel/ice/ice_devlink.h       |  5 ++

> > > >  drivers/net/ethernet/intel/ice/ice_main.c          |  2 +

> > > >  include/net/devlink.h                              |  4 +

> > > >  net/core/devlink.c                                 |  5 ++

> > > >  7 files changed, 125 insertions(+), 2 deletions(-)

> > > >

> > > > diff --git a/Documentation/networking/devlink/devlink-params.rst

> > > > b/Documentation/networking/devlink/devlink-params.rst

> > > > index 54c9f10..0b454c3 100644

> > > > +++ b/Documentation/networking/devlink/devlink-params.rst

> > > > @@ -114,3 +114,9 @@ own name.

> > > >         will NACK any attempt of other host to reset the device. This parameter

> > > >         is useful for setups where a device is shared by different hosts, such

> > > >         as multi-host setup.

> > > > +   * - ``rdma_protocol``

> > > > +     - string

> > > > +     - Selects the RDMA protocol selected for multi-protocol devices.

> > > > +        - ``iwarp`` iWARP

> > > > +	- ``roce`` RoCE

> > > > +	- ``ib`` Infiniband

> > >

> > > I'm still not sure this belongs in devlink.

> >

> > I believe you suggested we use devlink for protocol switch.

> 

> Yes, devlink is the right place, but selecting a *single* protocol doesn't seem right,

> or general enough.

> 

> Parav is talking about generic ways to customize the aux devices created and that

> would seem to serve the same function as this.


Is there an RFC or something posted for us to look at?

> 

> > > I know Parav is looking at the general problem of how to customize

> > > what aux devices are created, that may be a better fit for this.

> > >

> > > Can you remove the devlink parts to make progress?

> >

> > It is important since otherwise the customer will have no way to use RoCEv2 on

> this device.

> 

> I'm not saying to not having it eventually, I'm just getting tired of looking at 23

> patches. You can argue it out after

> 


Since this device cannot do concurrent protocols per function,
is having a driver specific devlink to switch between the 2 also a NACK?

There is something similar in another PCI driver.
https://www.kernel.org/doc/html/latest/networking/devlink/qed.html

Can we not adapt to Parav's solution (if it's a good fit) when it becomes available?

Shiraz
Saleem, Shiraz April 12, 2021, 2:51 p.m. UTC | #17
> Subject: Re: [PATCH v4 01/23] iidc: Introduce iidc.h

> 

> On Tue, Apr 06, 2021 at 04:01:03PM -0500, Shiraz Saleem wrote:

> 

> > +struct iidc_res_base {

> > +	/* Union for future provision e.g. other res_type */

> > +	union {

> > +		struct iidc_rdma_qset_params qsets;

> > +	} res;

> 

> Use an anonymous union?

> 

> There is alot of confusiong provisioning for future types, do you have concrete

> plans here? I'm a bit confused why this is so different from how mlx5 ended up

> when it already has multiple types.


It was initially designed to be extensible for more resource types. But at this point,
there is no concrete plan and hence it doesn't need to be a union. 

> 

> > +};

> > +

> > +struct iidc_res {

> > +	/* Type of resource. */

> > +	enum iidc_res_type res_type;

> > +	/* Count requested */

> > +	u16 cnt_req;

> > +

> > +	/* Number of resources allocated. Filled in by callee.

> > +	 * Based on this value, caller to fill up "resources"

> > +	 */

> > +	u16 res_allocated;

> > +

> > +	/* Unique handle to resources allocated. Zero if call fails.

> > +	 * Allocated by callee and for now used by caller for internal

> > +	 * tracking purpose.

> > +	 */

> > +	u32 res_handle;

> > +

> > +	/* Peer driver has to allocate sufficient memory, to accommodate

> > +	 * cnt_requested before calling this function.

> 

> Calling what function?


Left over cruft from the re-write of IIDC in v2.
> 

> > +	 * Memory has to be zero initialized. It is input/output param.

> > +	 * As a result of alloc_res API, this structures will be populated.

> > +	 */

> > +	struct iidc_res_base res[1];

> 

> So it is a wrongly defined flex array? Confused


Needs fixing.

> 

> The usages are all using this as some super-complicated function argument:

> 

> 	struct iidc_res rdma_qset_res = {};

> 

> 	rdma_qset_res.res_allocated = 1;

> 	rdma_qset_res.res_type = IIDC_RDMA_QSETS_TXSCHED;

> 	rdma_qset_res.res[0].res.qsets.vport_id = vsi->vsi_idx;

> 	rdma_qset_res.res[0].res.qsets.teid = tc_node->l2_sched_node_id;

> 	rdma_qset_res.res[0].res.qsets.qs_handle = tc_node->qs_handle;

> 

> 	if (cdev_info->ops->free_res(cdev_info, &rdma_qset_res))

> 

> So the answer here is to make your function calls sane and well architected. If you

> have to pass a union to call a function then something is very wrong with the

> design.

> 


Based on previous comment, the union will be removed.

> You aren't trying to achieve ABI decoupling of the rdma/ethernet modules with an

> obfuscated complicated function pointer indirection, are you?


As discussed in other thread, this is part of the IIDC interface exporting the core device .ops callbacks.
> 

> > +/* Following APIs are implemented by auxiliary drivers and invoked by

> > +core PCI

> > + * driver

> > + */

> > +struct iidc_auxiliary_ops {

> > +	/* This event_handler is meant to be a blocking call.  For instance,

> > +	 * when a BEFORE_MTU_CHANGE event comes in, the event_handler will

> not

> > +	 * return until the auxiliary driver is ready for the MTU change to

> > +	 * happen.

> > +	 */

> > +	void (*event_handler)(struct iidc_core_dev_info *cdev_info,

> > +			      struct iidc_event *event);

> > +

> > +	int (*vc_receive)(struct iidc_core_dev_info *cdev_info, u32 vf_id,

> > +			  u8 *msg, u16 len);

> > +};

> 

> This is not the normal pattern:

> 

> > +struct iidc_auxiliary_drv {

> > +	struct auxiliary_driver adrv;

> > +	struct iidc_auxiliary_ops *ops;

> > +};

> 

> Just put the two functions above in the drv directly:


Ok.


> 

> struct iidc_auxiliary_drv {

>         struct auxilary_driver adrv;

> 	void (*event_handler)(struct iidc_core_dev_info *cdev_info, *cdev_info,

> 			      struct iidc_event *event);

> 

> 	int (*vc_receive)(struct iidc_core_dev_info *cdev_info, u32 vf_id,

> 			  u8 *msg, u16 len);

> }

> 

> > +

> > +#define IIDC_RDMA_NAME	"intel_rdma"

> > +#define IIDC_RDMA_ID	0x00000010

> > +#define IIDC_MAX_NUM_AUX	4

> > +

> > +/* The const struct that instantiates cdev_info_id needs to be

> > +initialized

> > + * in the .c with the macro ASSIGN_IIDC_INFO.

> > + * For example:

> > + * static const struct cdev_info_id cdev_info_ids[] =

> > +ASSIGN_IIDC_INFO;  */ struct cdev_info_id {

> > +	char *name;

> > +	int id;

> > +};

> > +

> > +#define IIDC_RDMA_INFO   { .name = IIDC_RDMA_NAME,  .id =

> IIDC_RDMA_ID },

> > +

> > +#define ASSIGN_IIDC_INFO	\

> > +{				\

> > +	IIDC_RDMA_INFO		\

> > +}

> 

> I tried to figure out what all this was for and came up short. There is only one user

> and all this seems unnecessary in this series, add it later when you need it.


No plan for new user, so this should go.

> 

> > +

> > +#define iidc_priv(x) ((x)->auxiliary_priv)

> 

> Use a static inline function

> 

Ok
Jason Gunthorpe April 12, 2021, 4:12 p.m. UTC | #18
On Mon, Apr 12, 2021 at 02:50:43PM +0000, Saleem, Shiraz wrote:

> Because it is an unnecessary thing to force a user to build/load drivers for

> which they don't have the HW for? 


Happens automatically in all distros, so I don't agree with this.

> The problem gets compounded if we have to do it for all future HW

> Intel PCI drivers, i.e. depends on ICE && ....


Then someday build a proper pluggable abstraction and put all your
ethernet drivers under it. 

Today you haven't done that and all we have a set of ops that only
work with one eth driver and a totally different set of functions that
only work with a different driver.

It is all just dead code until it gets finished and process is to not
merge dead code.

> There is a near-term Intel ethernet VF driver which will use IIDC to

> provide RDMA in the VF, and implement some of these .ops

> callbacks. There is also intent to move i40e to IIDC.


"near-term" We are now on year three of Intel talking about this
driver!

Get the bulk of the thing merged and deal with the rest in followup
patches.

> But in a unified irdma driver model connecting to multiple PCI gen

> drivers, I do think it serves a reason.


It is fine as a high level idea, but the implementation has to meet
kernel standards.

Jason
Parav Pandit April 12, 2021, 7:07 p.m. UTC | #19
> From: Saleem, Shiraz <shiraz.saleem@intel.com>

> Sent: Monday, April 12, 2021 8:21 PM

> 

> > Subject: Re: [PATCH v4 05/23] ice: Add devlink params support

> >

> > On Wed, Apr 07, 2021 at 08:58:25PM +0000, Saleem, Shiraz wrote:

> > > > Subject: Re: [PATCH v4 05/23] ice: Add devlink params support

> > > >

> > > > On Tue, Apr 06, 2021 at 04:01:07PM -0500, Shiraz Saleem wrote:

> > > > > Add a new generic runtime devlink parameter 'rdma_protocol'

> > > > > and use it in ice PCI driver. Configuration changes result in

> > > > > unplugging the auxiliary RDMA device and re-plugging it with

> > > > > updated values for irdma auxiiary driver to consume at

> > > > > drv.probe()

> > > > >

> > > > > Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>

> > > > >  .../networking/devlink/devlink-params.rst          |  6 ++

> > > > >  Documentation/networking/devlink/ice.rst           | 13 +++

> > > > >  drivers/net/ethernet/intel/ice/ice_devlink.c       | 92

> > +++++++++++++++++++++-

> > > > >  drivers/net/ethernet/intel/ice/ice_devlink.h       |  5 ++

> > > > >  drivers/net/ethernet/intel/ice/ice_main.c          |  2 +

> > > > >  include/net/devlink.h                              |  4 +

> > > > >  net/core/devlink.c                                 |  5 ++

> > > > >  7 files changed, 125 insertions(+), 2 deletions(-)

> > > > >

> > > > > diff --git a/Documentation/networking/devlink/devlink-params.rst

> > > > > b/Documentation/networking/devlink/devlink-params.rst

> > > > > index 54c9f10..0b454c3 100644

> > > > > +++ b/Documentation/networking/devlink/devlink-params.rst

> > > > > @@ -114,3 +114,9 @@ own name.

> > > > >         will NACK any attempt of other host to reset the device. This

> parameter

> > > > >         is useful for setups where a device is shared by different hosts,

> such

> > > > >         as multi-host setup.

> > > > > +   * - ``rdma_protocol``

> > > > > +     - string

> > > > > +     - Selects the RDMA protocol selected for multi-protocol devices.

> > > > > +        - ``iwarp`` iWARP

> > > > > +	- ``roce`` RoCE

> > > > > +	- ``ib`` Infiniband

> > > >

> > > > I'm still not sure this belongs in devlink.

> > >

> > > I believe you suggested we use devlink for protocol switch.

> >

> > Yes, devlink is the right place, but selecting a *single* protocol

> > doesn't seem right, or general enough.

> >

> > Parav is talking about generic ways to customize the aux devices

> > created and that would seem to serve the same function as this.

> 

> Is there an RFC or something posted for us to look at?

I do not have polished RFC content ready yet.
But coping the full config sequence snippet from the internal draft (changed for ice example) here as I like to discuss with you in this context.

# (1) show auxiliary device types supported by a given devlink device.
# applies to pci pf,vf,sf. (in general at devlink instance).
$ devlink dev auxdev show pci/0000:06.00.0
pci/0000:06.00.0:
  current:
    roce eth
  new:
  supported:
    roce eth iwarp

# (2) enable iwarp and ethernet type of aux devices and disable roce.
$ devlink dev auxdev set pci/0000:06:00.0 roce off iwarp on

# (3) now see which aux devices will be enable on next reload.
$ devlink dev auxdev show pci/0000:06:00.0
pci/0000:06:00.0:
  current:
    roce eth
  new:
    eth iwarp
  supported:
    roce eth iwarp

# (4) now reload the device and see which aux devices are created.
At this point driver undergoes reconfig for removal of roce and adding iwarp.
$ devlink reload pci/0000:06:00.0

# (5) verify which are the aux devices now activated.
$ devlink dev auxdev show pci/0000:06:00.0
pci/0000:06:00.0:
  current:
    roce eth
  new:
  supported:
    roce eth iwarp

Above 'new' section is only shown when its set. (similar to devlink resource).
In command two vendor driver can fail the call when iwarp and roce both enabled by user.
mlx5_core doesn't have iwarp, but its vdpa type of device. And for other cases we just want to enable roce disabling eth and vdpa.
Parav Pandit April 13, 2021, 4:03 a.m. UTC | #20
> From: Parav Pandit <parav@nvidia.com>

> Sent: Tuesday, April 13, 2021 12:38 AM

> 

> > From: Saleem, Shiraz <shiraz.saleem@intel.com>

> > Sent: Monday, April 12, 2021 8:21 PM

> >

> > > Subject: Re: [PATCH v4 05/23] ice: Add devlink params support

> > >

> > > On Wed, Apr 07, 2021 at 08:58:25PM +0000, Saleem, Shiraz wrote:

> > > > > Subject: Re: [PATCH v4 05/23] ice: Add devlink params support

> > > > >

> > > > > On Tue, Apr 06, 2021 at 04:01:07PM -0500, Shiraz Saleem wrote:

> > > > > > Add a new generic runtime devlink parameter 'rdma_protocol'

> > > > > > and use it in ice PCI driver. Configuration changes result in

> > > > > > unplugging the auxiliary RDMA device and re-plugging it with

> > > > > > updated values for irdma auxiiary driver to consume at

> > > > > > drv.probe()

> > > > > >

> > > > > > Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>

> > > > > >  .../networking/devlink/devlink-params.rst          |  6 ++

> > > > > >  Documentation/networking/devlink/ice.rst           | 13 +++

> > > > > >  drivers/net/ethernet/intel/ice/ice_devlink.c       | 92

> > > +++++++++++++++++++++-

> > > > > >  drivers/net/ethernet/intel/ice/ice_devlink.h       |  5 ++

> > > > > >  drivers/net/ethernet/intel/ice/ice_main.c          |  2 +

> > > > > >  include/net/devlink.h                              |  4 +

> > > > > >  net/core/devlink.c                                 |  5 ++

> > > > > >  7 files changed, 125 insertions(+), 2 deletions(-)

> > > > > >

> > > > > > diff --git

> > > > > > a/Documentation/networking/devlink/devlink-params.rst

> > > > > > b/Documentation/networking/devlink/devlink-params.rst

> > > > > > index 54c9f10..0b454c3 100644

> > > > > > +++ b/Documentation/networking/devlink/devlink-params.rst

> > > > > > @@ -114,3 +114,9 @@ own name.

> > > > > >         will NACK any attempt of other host to reset the

> > > > > > device. This

> > parameter

> > > > > >         is useful for setups where a device is shared by

> > > > > > different hosts,

> > such

> > > > > >         as multi-host setup.

> > > > > > +   * - ``rdma_protocol``

> > > > > > +     - string

> > > > > > +     - Selects the RDMA protocol selected for multi-protocol devices.

> > > > > > +        - ``iwarp`` iWARP

> > > > > > +	- ``roce`` RoCE

> > > > > > +	- ``ib`` Infiniband

> > > > >

> > > > > I'm still not sure this belongs in devlink.

> > > >

> > > > I believe you suggested we use devlink for protocol switch.

> > >

> > > Yes, devlink is the right place, but selecting a *single* protocol

> > > doesn't seem right, or general enough.

> > >

> > > Parav is talking about generic ways to customize the aux devices

> > > created and that would seem to serve the same function as this.

> >

> > Is there an RFC or something posted for us to look at?

> I do not have polished RFC content ready yet.

> But coping the full config sequence snippet from the internal draft (changed

> for ice example) here as I like to discuss with you in this context.

> 

> # (1) show auxiliary device types supported by a given devlink device.

> # applies to pci pf,vf,sf. (in general at devlink instance).

> $ devlink dev auxdev show pci/0000:06.00.0

> pci/0000:06.00.0:

>   current:

>     roce eth

>   new:

>   supported:

>     roce eth iwarp

> 

> # (2) enable iwarp and ethernet type of aux devices and disable roce.

> $ devlink dev auxdev set pci/0000:06:00.0 roce off iwarp on

> 

> # (3) now see which aux devices will be enable on next reload.

> $ devlink dev auxdev show pci/0000:06:00.0

> pci/0000:06:00.0:

>   current:

>     roce eth

>   new:

>     eth iwarp

>   supported:

>     roce eth iwarp

> 

> # (4) now reload the device and see which aux devices are created.

> At this point driver undergoes reconfig for removal of roce and adding iwarp.

> $ devlink reload pci/0000:06:00.0

> 

> # (5) verify which are the aux devices now activated.

> $ devlink dev auxdev show pci/0000:06:00.0

> pci/0000:06:00.0:

>   current:

>     roce eth

This was copy paste error from my command #1.
It should be "roce eth" should be "iwarp eth".
This is because in step #3, iwarp was enabled.

>   new:

>   supported:

>     roce eth iwarp

> 

> Above 'new' section is only shown when its set. (similar to devlink resource).

> In command two vendor driver can fail the call when iwarp and roce both

> enabled by user.

> mlx5_core doesn't have iwarp, but its vdpa type of device. And for other

> cases we just want to enable roce disabling eth and vdpa.
Saleem, Shiraz April 13, 2021, 2:40 p.m. UTC | #21
> Subject: RE: [PATCH v4 05/23] ice: Add devlink params support

> 

> 

> 

> > From: Saleem, Shiraz <shiraz.saleem@intel.com>

> > Sent: Monday, April 12, 2021 8:21 PM

> >

> > > Subject: Re: [PATCH v4 05/23] ice: Add devlink params support

> > >

> > > On Wed, Apr 07, 2021 at 08:58:25PM +0000, Saleem, Shiraz wrote:

> > > > > Subject: Re: [PATCH v4 05/23] ice: Add devlink params support

> > > > >

> > > > > On Tue, Apr 06, 2021 at 04:01:07PM -0500, Shiraz Saleem wrote:

> > > > > > Add a new generic runtime devlink parameter 'rdma_protocol'

> > > > > > and use it in ice PCI driver. Configuration changes result in

> > > > > > unplugging the auxiliary RDMA device and re-plugging it with

> > > > > > updated values for irdma auxiiary driver to consume at

> > > > > > drv.probe()

> > > > > >

> > > > > > Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>

> > > > > >  .../networking/devlink/devlink-params.rst          |  6 ++

> > > > > >  Documentation/networking/devlink/ice.rst           | 13 +++

> > > > > >  drivers/net/ethernet/intel/ice/ice_devlink.c       | 92

> > > +++++++++++++++++++++-

> > > > > >  drivers/net/ethernet/intel/ice/ice_devlink.h       |  5 ++

> > > > > >  drivers/net/ethernet/intel/ice/ice_main.c          |  2 +

> > > > > >  include/net/devlink.h                              |  4 +

> > > > > >  net/core/devlink.c                                 |  5 ++

> > > > > >  7 files changed, 125 insertions(+), 2 deletions(-)

> > > > > >

> > > > > > diff --git

> > > > > > a/Documentation/networking/devlink/devlink-params.rst

> > > > > > b/Documentation/networking/devlink/devlink-params.rst

> > > > > > index 54c9f10..0b454c3 100644

> > > > > > +++ b/Documentation/networking/devlink/devlink-params.rst

> > > > > > @@ -114,3 +114,9 @@ own name.

> > > > > >         will NACK any attempt of other host to reset the

> > > > > > device. This

> > parameter

> > > > > >         is useful for setups where a device is shared by

> > > > > > different hosts,

> > such

> > > > > >         as multi-host setup.

> > > > > > +   * - ``rdma_protocol``

> > > > > > +     - string

> > > > > > +     - Selects the RDMA protocol selected for multi-protocol devices.

> > > > > > +        - ``iwarp`` iWARP

> > > > > > +	- ``roce`` RoCE

> > > > > > +	- ``ib`` Infiniband

> > > > >

> > > > > I'm still not sure this belongs in devlink.

> > > >

> > > > I believe you suggested we use devlink for protocol switch.

> > >

> > > Yes, devlink is the right place, but selecting a *single* protocol

> > > doesn't seem right, or general enough.

> > >

> > > Parav is talking about generic ways to customize the aux devices

> > > created and that would seem to serve the same function as this.

> >

> > Is there an RFC or something posted for us to look at?

> I do not have polished RFC content ready yet.

> But coping the full config sequence snippet from the internal draft (changed for ice

> example) here as I like to discuss with you in this context.


Thanks Parav! Some comments below.

> 

> # (1) show auxiliary device types supported by a given devlink device.

> # applies to pci pf,vf,sf. (in general at devlink instance).

> $ devlink dev auxdev show pci/0000:06.00.0

> pci/0000:06.00.0:

>   current:

>     roce eth

>   new:

>   supported:

>     roce eth iwarp

> 

> # (2) enable iwarp and ethernet type of aux devices and disable roce.

> $ devlink dev auxdev set pci/0000:06:00.0 roce off iwarp on

> 

> # (3) now see which aux devices will be enable on next reload.

> $ devlink dev auxdev show pci/0000:06:00.0

> pci/0000:06:00.0:

>   current:

>     roce eth

>   new:

>     eth iwarp

>   supported:

>     roce eth iwarp

> 

> # (4) now reload the device and see which aux devices are created.

> At this point driver undergoes reconfig for removal of roce and adding iwarp.

> $ devlink reload pci/0000:06:00.0


I see this is modeled like devlink resource.

Do we really to need a PCI driver re-init to switch the type of the auxdev hanging off the PCI dev?

Why not just allow the setting to apply dynamically during a 'set' itself with an unplug/plug
of the auxdev with correct type.

Shiraz
Parav Pandit April 13, 2021, 5:36 p.m. UTC | #22
> From: Saleem, Shiraz <shiraz.saleem@intel.com>

> Sent: Tuesday, April 13, 2021 8:11 PM

[..]

> > > > Parav is talking about generic ways to customize the aux devices

> > > > created and that would seem to serve the same function as this.

> > >

> > > Is there an RFC or something posted for us to look at?

> > I do not have polished RFC content ready yet.

> > But coping the full config sequence snippet from the internal draft

> > (changed for ice

> > example) here as I like to discuss with you in this context.

> 

> Thanks Parav! Some comments below.

> 

> >

> > # (1) show auxiliary device types supported by a given devlink device.

> > # applies to pci pf,vf,sf. (in general at devlink instance).

> > $ devlink dev auxdev show pci/0000:06.00.0

> > pci/0000:06.00.0:

> >   current:

> >     roce eth

> >   new:

> >   supported:

> >     roce eth iwarp

> >

> > # (2) enable iwarp and ethernet type of aux devices and disable roce.

> > $ devlink dev auxdev set pci/0000:06:00.0 roce off iwarp on

> >

> > # (3) now see which aux devices will be enable on next reload.

> > $ devlink dev auxdev show pci/0000:06:00.0

> > pci/0000:06:00.0:

> >   current:

> >     roce eth

> >   new:

> >     eth iwarp

> >   supported:

> >     roce eth iwarp

> >

> > # (4) now reload the device and see which aux devices are created.

> > At this point driver undergoes reconfig for removal of roce and adding

> iwarp.

> > $ devlink reload pci/0000:06:00.0

> 

> I see this is modeled like devlink resource.

> 

> Do we really to need a PCI driver re-init to switch the type of the auxdev

> hanging off the PCI dev?

> 

I don't see a need to re-init the whole PCI driver. Since only aux device config is changed only that piece to get reloaded.

> Why not just allow the setting to apply dynamically during a 'set' itself with an

> unplug/plug of the auxdev with correct type.

> 

This suggestion came up in the internal discussion too.
However such task needs to synchronize with devlink reload command and also with driver remove() sequence.
So locking wise and depending on amount of config change, it is close to what reload will do.
For example other resource config or other params setting also to take effect.
So to avoid defining multiple config sequence, doing as part of already existing devlink reload, it brings simple sequence to user.

For example,
1. enable/disable desired aux devices
2. configure device resources
3. set some device params
4. do devlink reload and apply settings done in #1 to #3
Saleem, Shiraz April 14, 2021, 12:21 a.m. UTC | #23
> Subject: RE: [PATCH v4 05/23] ice: Add devlink params support

> 

> 

> 

> > From: Saleem, Shiraz <shiraz.saleem@intel.com>

> > Sent: Tuesday, April 13, 2021 8:11 PM

> [..]

> 

> > > > > Parav is talking about generic ways to customize the aux devices

> > > > > created and that would seem to serve the same function as this.

> > > >

> > > > Is there an RFC or something posted for us to look at?

> > > I do not have polished RFC content ready yet.

> > > But coping the full config sequence snippet from the internal draft

> > > (changed for ice

> > > example) here as I like to discuss with you in this context.

> >

> > Thanks Parav! Some comments below.

> >

> > >

> > > # (1) show auxiliary device types supported by a given devlink device.

> > > # applies to pci pf,vf,sf. (in general at devlink instance).

> > > $ devlink dev auxdev show pci/0000:06.00.0

> > > pci/0000:06.00.0:

> > >   current:

> > >     roce eth

> > >   new:

> > >   supported:

> > >     roce eth iwarp

> > >

> > > # (2) enable iwarp and ethernet type of aux devices and disable roce.

> > > $ devlink dev auxdev set pci/0000:06:00.0 roce off iwarp on

> > >

> > > # (3) now see which aux devices will be enable on next reload.

> > > $ devlink dev auxdev show pci/0000:06:00.0

> > > pci/0000:06:00.0:

> > >   current:

> > >     roce eth

> > >   new:

> > >     eth iwarp

> > >   supported:

> > >     roce eth iwarp

> > >

> > > # (4) now reload the device and see which aux devices are created.

> > > At this point driver undergoes reconfig for removal of roce and

> > > adding

> > iwarp.

> > > $ devlink reload pci/0000:06:00.0

> >

> > I see this is modeled like devlink resource.

> >

> > Do we really to need a PCI driver re-init to switch the type of the

> > auxdev hanging off the PCI dev?

> >

> I don't see a need to re-init the whole PCI driver. Since only aux device config is

> changed only that piece to get reloaded.


But that is what mlx5 and other implementations does on reload no? i.e. a PCI driver reinit.
I can see an ice implementation of reload morphing to similar over time to support a new config
that requires a true reinit of PCI driver entities.

> 

> > Why not just allow the setting to apply dynamically during a 'set'

> > itself with an unplug/plug of the auxdev with correct type.

> >

> This suggestion came up in the internal discussion too.

> However such task needs to synchronize with devlink reload command and also

> with driver remove() sequence.

> So locking wise and depending on amount of config change, it is close to what

> reload will do.


Holding this mutex across the auxiliary device unplug/plug in "set" wont cut it?
https://elixir.bootlin.com/linux/v5.12-rc7/source/drivers/net/ethernet/mellanox/mlx5/core/main.c#L1304

> For example other resource config or other params setting also to take effect.

> So to avoid defining multiple config sequence, doing as part of already existing

> devlink reload, it brings simple sequence to user.

> 

> For example,

> 1. enable/disable desired aux devices

> 2. configure device resources

> 3. set some device params

> 4. do devlink reload and apply settings done in #1 to #3


Sure. But a user might also just want to operate on just an auxiliary device configuration change. As in #1.
And he ends up having everything hanging off the PF to get blown out, including potentially
the VFs. That feels like too big a hammer.

Shiraz
Parav Pandit April 14, 2021, 5:27 a.m. UTC | #24
+ Jiri.

> From: Saleem, Shiraz <shiraz.saleem@intel.com>

> Sent: Wednesday, April 14, 2021 5:51 AM

> 

> > Subject: RE: [PATCH v4 05/23] ice: Add devlink params support

> >

> >

> >

> > > From: Saleem, Shiraz <shiraz.saleem@intel.com>

> > > Sent: Tuesday, April 13, 2021 8:11 PM

> > [..]

> >

> > > > > > Parav is talking about generic ways to customize the aux

> > > > > > devices created and that would seem to serve the same function as

> this.

> > > > >

> > > > > Is there an RFC or something posted for us to look at?

> > > > I do not have polished RFC content ready yet.

> > > > But coping the full config sequence snippet from the internal

> > > > draft (changed for ice

> > > > example) here as I like to discuss with you in this context.

> > >

> > > Thanks Parav! Some comments below.

> > >

> > > >

> > > > # (1) show auxiliary device types supported by a given devlink device.

> > > > # applies to pci pf,vf,sf. (in general at devlink instance).

> > > > $ devlink dev auxdev show pci/0000:06.00.0

> > > > pci/0000:06.00.0:

> > > >   current:

> > > >     roce eth

> > > >   new:

> > > >   supported:

> > > >     roce eth iwarp

> > > >

> > > > # (2) enable iwarp and ethernet type of aux devices and disable roce.

> > > > $ devlink dev auxdev set pci/0000:06:00.0 roce off iwarp on

> > > >

> > > > # (3) now see which aux devices will be enable on next reload.

> > > > $ devlink dev auxdev show pci/0000:06:00.0

> > > > pci/0000:06:00.0:

> > > >   current:

> > > >     roce eth

> > > >   new:

> > > >     eth iwarp

> > > >   supported:

> > > >     roce eth iwarp

> > > >

> > > > # (4) now reload the device and see which aux devices are created.

> > > > At this point driver undergoes reconfig for removal of roce and

> > > > adding

> > > iwarp.

> > > > $ devlink reload pci/0000:06:00.0

> > >

> > > I see this is modeled like devlink resource.

> > >

> > > Do we really to need a PCI driver re-init to switch the type of the

> > > auxdev hanging off the PCI dev?

> > >

> > I don't see a need to re-init the whole PCI driver. Since only aux

> > device config is changed only that piece to get reloaded.

> 

> But that is what mlx5 and other implementations does on reload no? i.e. a

> PCI driver reinit.

Currently yes, reload does PCI re-init.
However I am not seeing the value of reload if no config (param, resource, auxdev) is changed.

> I can see an ice implementation of reload morphing to similar over time to

> support a new config that requires a true reinit of PCI driver entities.

> 

Sure.

> >

> > > Why not just allow the setting to apply dynamically during a 'set'

> > > itself with an unplug/plug of the auxdev with correct type.

> > >

> > This suggestion came up in the internal discussion too.

> > However such task needs to synchronize with devlink reload command and

> > also with driver remove() sequence.

> > So locking wise and depending on amount of config change, it is close

> > to what reload will do.

> 

> Holding this mutex across the auxiliary device unplug/plug in "set" wont cut

> it?

> https://elixir.bootlin.com/linux/v5.12-

> rc7/source/drivers/net/ethernet/mellanox/mlx5/core/main.c#L1304

> 

Currently devlink reload for mlx5 is source of lockdep assert, use after free access and a deadlock in net ns. :-(
Multiple of us (Leon, Saeed, Moshe) working on it resolve it.
So I want to stay away from intf_mutex for now.

> > For example other resource config or other params setting also to take

> effect.

> > So to avoid defining multiple config sequence, doing as part of

> > already existing devlink reload, it brings simple sequence to user.

> >

> > For example,

> > 1. enable/disable desired aux devices

> > 2. configure device resources

> > 3. set some device params

> > 4. do devlink reload and apply settings done in #1 to #3

> 

> Sure. But a user might also just want to operate on just an auxiliary device

> configuration change. As in #1.

> And he ends up having everything hanging off the PF to get blown out,

> including potentially the VFs. That feels like too big a hammer.

This is certainly not desired.

If we want aux device enable/disable to take effect when its done without reload than above flow should be redefined as,

1. configure device resources (optional)
2. set some device params (optional)
3. enable/disable desired aux devices

Step-3 needs to apply the settings of (1) and (2) without user doing devlink reload.
devlink core doesn't know on step #3, that reload_down() and reload_up() to be done.
So driver internally needs to implement reload_down(), up() on callback of #3.
This builds parallel framework to devlink reload.

Jiri,
What do you think of it?
Saleem, Shiraz April 15, 2021, 5:36 p.m. UTC | #25
> Subject: Re: [PATCH v4 01/23] iidc: Introduce iidc.h

> 

> On Mon, Apr 12, 2021 at 02:50:43PM +0000, Saleem, Shiraz wrote:

> 


[....]

> > There is a near-term Intel ethernet VF driver which will use IIDC to

> > provide RDMA in the VF, and implement some of these .ops callbacks.

> > There is also intent to move i40e to IIDC.

> 

> "near-term" We are now on year three of Intel talking about this driver!

> 

> Get the bulk of the thing merged and deal with the rest in followup patches.


We will submit with symbols exported from ice and direct calls from irdma.

Shiraz
Leon Romanovsky April 18, 2021, 11:51 a.m. UTC | #26
On Wed, Apr 14, 2021 at 12:21:08AM +0000, Saleem, Shiraz wrote:
> > Subject: RE: [PATCH v4 05/23] ice: Add devlink params support


<...>

> > > Why not just allow the setting to apply dynamically during a 'set'

> > > itself with an unplug/plug of the auxdev with correct type.

> > >

> > This suggestion came up in the internal discussion too.

> > However such task needs to synchronize with devlink reload command and also

> > with driver remove() sequence.

> > So locking wise and depending on amount of config change, it is close to what

> > reload will do.

> 

> Holding this mutex across the auxiliary device unplug/plug in "set" wont cut it?

> https://elixir.bootlin.com/linux/v5.12-rc7/source/drivers/net/ethernet/mellanox/mlx5/core/main.c#L1304


Like Parav said, we are working to fix it and already have one working
solution, unfortunately it has one eyebrow raising change and we are
trying another one.

You can take a look here to get sense of the scope:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=devlink-core

Thanks