
[00/22] Add Intel Ethernet Protocol Driver for RDMA (irdma)

Message ID 20210122234827.1353-1-shiraz.saleem@intel.com

Message

Saleem, Shiraz Jan. 22, 2021, 11:48 p.m. UTC
From: "Shiraz, Saleem" <shiraz.saleem@intel.com>

The following patch series introduces a unified Intel Ethernet Protocol Driver
for RDMA (irdma) for the X722 iWARP device and a new E810 device which supports
iWARP and RoCEv2. The irdma driver replaces the legacy i40iw driver for X722
and extends the ABI already defined for i40iw. It is backward compatible with
legacy X722 rdma-core provider (libi40iw).

X722 and E810 are PCI network devices that are RDMA capable. The RDMA block of
this parent device is represented via an auxiliary device exported to 'irdma'
using the core auxiliary bus infrastructure recently added for 5.11 kernel.
The parent PCI netdev drivers 'i40e' and 'ice' register auxiliary RDMA devices
with private data/ops encapsulated that bind to an 'irdma' auxiliary driver. 
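
As a rough illustration of the parent-driver side of this binding (the iidc
structure name below is from this series, but the helper, field, and
release-callback names are placeholders only):

/* Illustrative sketch only: how a parent PF driver might expose its RDMA
 * block as an auxiliary device.  Assumes vo is freed by the assumed
 * ice_rdma_adev_release() callback once the device has been initialized.
 */
static int ice_plug_rdma_aux_dev(struct ice_pf *pf)
{
	struct iidc_auxiliary_object *vo;
	int err;

	vo = kzalloc(sizeof(*vo), GFP_KERNEL);
	if (!vo)
		return -ENOMEM;

	vo->peer_obj = &pf->peer_obj;		/* private data/ops handed to irdma */
	vo->adev.name = "iwarp";		/* matched by irdma as "ice.iwarp" */
	vo->adev.id = pf->aux_idx;		/* placeholder instance id */
	vo->adev.dev.parent = &pf->pdev->dev;
	vo->adev.dev.release = ice_rdma_adev_release;	/* assumed to kfree(vo) */

	err = auxiliary_device_init(&vo->adev);
	if (err) {
		kfree(vo);
		return err;
	}

	err = auxiliary_device_add(&vo->adev);	/* the 'irdma' driver binds here */
	if (err)
		auxiliary_device_uninit(&vo->adev);

	return err;
}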

This series is a follow-on to an RFC series [1]. It was built against
rdma for-next and currently includes the netdev patches for ease of review.
These include updates to the 'ice' driver to provide RDMA support and convert
the 'i40e' driver to use the auxiliary bus infrastructure.

Once the patches are closer to merging, this series will be split into a
netdev-next and an rdma-next patch series targeted at their respective
subsystems, with Patch #1 and Patch #5 included in both; these provide the
shared header files that allow each series to compile independently.

[1] https://lore.kernel.org/linux-rdma/20200520070415.3392210-1-jeffrey.t.kirsher@intel.com/

Dave Ertman (4):
  iidc: Introduce iidc.h
  ice: Initialize RDMA support
  ice: Implement iidc operations
  ice: Register auxiliary device to provide RDMA

Michael J. Ruhl (1):
  RDMA/irdma: Add dynamic tracing for CM

Mustafa Ismail (13):
  RDMA/irdma: Register an auxiliary driver and implement private channel
    OPs
  RDMA/irdma: Implement device initialization definitions
  RDMA/irdma: Implement HW Admin Queue OPs
  RDMA/irdma: Add HMC backing store setup functions
  RDMA/irdma: Add privileged UDA queue implementation
  RDMA/irdma: Add QoS definitions
  RDMA/irdma: Add connection manager
  RDMA/irdma: Add PBLE resource manager
  RDMA/irdma: Implement device supported verb APIs
  RDMA/irdma: Add RoCEv2 UD OP support
  RDMA/irdma: Add user/kernel shared libraries
  RDMA/irdma: Add miscellaneous utility definitions
  RDMA/irdma: Add ABI definitions

Shiraz Saleem (4):
  i40e: Prep i40e header for aux bus conversion
  i40e: Register auxiliary devices to provide RDMA
  RDMA/irdma: Add irdma Kconfig/Makefile and remove i40iw
  RDMA/irdma: Update MAINTAINERS file

 Documentation/ABI/stable/sysfs-class-infiniband  |   20 -
 MAINTAINERS                                      |   17 +-
 drivers/infiniband/Kconfig                       |    2 +-
 drivers/infiniband/hw/Makefile                   |    2 +-
 drivers/infiniband/hw/i40iw/Kconfig              |    9 -
 drivers/infiniband/hw/i40iw/Makefile             |    9 -
 drivers/infiniband/hw/i40iw/i40iw.h              |  611 ---
 drivers/infiniband/hw/i40iw/i40iw_cm.c           | 4419 ----------------
 drivers/infiniband/hw/i40iw/i40iw_cm.h           |  462 --
 drivers/infiniband/hw/i40iw/i40iw_ctrl.c         | 5243 -------------------
 drivers/infiniband/hw/i40iw/i40iw_d.h            | 1746 -------
 drivers/infiniband/hw/i40iw/i40iw_hmc.c          |  821 ---
 drivers/infiniband/hw/i40iw/i40iw_hmc.h          |  241 -
 drivers/infiniband/hw/i40iw/i40iw_hw.c           |  851 ---
 drivers/infiniband/hw/i40iw/i40iw_main.c         | 2066 --------
 drivers/infiniband/hw/i40iw/i40iw_osdep.h        |  217 -
 drivers/infiniband/hw/i40iw/i40iw_p.h            |  129 -
 drivers/infiniband/hw/i40iw/i40iw_pble.c         |  613 ---
 drivers/infiniband/hw/i40iw/i40iw_pble.h         |  131 -
 drivers/infiniband/hw/i40iw/i40iw_puda.c         | 1496 ------
 drivers/infiniband/hw/i40iw/i40iw_puda.h         |  188 -
 drivers/infiniband/hw/i40iw/i40iw_register.h     | 1030 ----
 drivers/infiniband/hw/i40iw/i40iw_status.h       |  101 -
 drivers/infiniband/hw/i40iw/i40iw_type.h         | 1358 -----
 drivers/infiniband/hw/i40iw/i40iw_uk.c           | 1200 -----
 drivers/infiniband/hw/i40iw/i40iw_user.h         |  422 --
 drivers/infiniband/hw/i40iw/i40iw_utils.c        | 1518 ------
 drivers/infiniband/hw/i40iw/i40iw_verbs.c        | 2652 ----------
 drivers/infiniband/hw/i40iw/i40iw_verbs.h        |  179 -
 drivers/infiniband/hw/i40iw/i40iw_vf.c           |   85 -
 drivers/infiniband/hw/i40iw/i40iw_vf.h           |   62 -
 drivers/infiniband/hw/i40iw/i40iw_virtchnl.c     |  759 ---
 drivers/infiniband/hw/i40iw/i40iw_virtchnl.h     |  124 -
 drivers/infiniband/hw/irdma/Kconfig              |   11 +
 drivers/infiniband/hw/irdma/Makefile             |   28 +
 drivers/infiniband/hw/irdma/cm.c                 | 4463 ++++++++++++++++
 drivers/infiniband/hw/irdma/cm.h                 |  429 ++
 drivers/infiniband/hw/irdma/ctrl.c               | 6103 ++++++++++++++++++++++
 drivers/infiniband/hw/irdma/defs.h               | 2210 ++++++++
 drivers/infiniband/hw/irdma/hmc.c                |  734 +++
 drivers/infiniband/hw/irdma/hmc.h                |  184 +
 drivers/infiniband/hw/irdma/hw.c                 | 2773 ++++++++++
 drivers/infiniband/hw/irdma/i40iw_hw.c           |  221 +
 drivers/infiniband/hw/irdma/i40iw_hw.h           |  162 +
 drivers/infiniband/hw/irdma/i40iw_if.c           |  226 +
 drivers/infiniband/hw/irdma/icrdma_hw.c          |   82 +
 drivers/infiniband/hw/irdma/icrdma_hw.h          |   67 +
 drivers/infiniband/hw/irdma/irdma.h              |  203 +
 drivers/infiniband/hw/irdma/irdma_if.c           |  422 ++
 drivers/infiniband/hw/irdma/main.c               |  364 ++
 drivers/infiniband/hw/irdma/main.h               |  613 +++
 drivers/infiniband/hw/irdma/osdep.h              |   99 +
 drivers/infiniband/hw/irdma/pble.c               |  525 ++
 drivers/infiniband/hw/irdma/pble.h               |  136 +
 drivers/infiniband/hw/irdma/protos.h             |  118 +
 drivers/infiniband/hw/irdma/puda.c               | 1743 ++++++
 drivers/infiniband/hw/irdma/puda.h               |  194 +
 drivers/infiniband/hw/irdma/status.h             |   70 +
 drivers/infiniband/hw/irdma/trace.c              |  112 +
 drivers/infiniband/hw/irdma/trace.h              |    3 +
 drivers/infiniband/hw/irdma/trace_cm.h           |  458 ++
 drivers/infiniband/hw/irdma/type.h               | 1726 ++++++
 drivers/infiniband/hw/irdma/uda.c                |  391 ++
 drivers/infiniband/hw/irdma/uda.h                |   63 +
 drivers/infiniband/hw/irdma/uda_d.h              |  382 ++
 drivers/infiniband/hw/irdma/uk.c                 | 1729 ++++++
 drivers/infiniband/hw/irdma/user.h               |  462 ++
 drivers/infiniband/hw/irdma/utils.c              | 2680 ++++++++++
 drivers/infiniband/hw/irdma/verbs.c              | 4617 ++++++++++++++++
 drivers/infiniband/hw/irdma/verbs.h              |  218 +
 drivers/infiniband/hw/irdma/ws.c                 |  404 ++
 drivers/infiniband/hw/irdma/ws.h                 |   41 +
 drivers/net/ethernet/intel/Kconfig               |    2 +
 drivers/net/ethernet/intel/i40e/i40e.h           |    2 +
 drivers/net/ethernet/intel/i40e/i40e_client.c    |  154 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c      |    1 +
 drivers/net/ethernet/intel/ice/Makefile          |    1 +
 drivers/net/ethernet/intel/ice/ice.h             |   17 +
 drivers/net/ethernet/intel/ice/ice_adminq_cmd.h  |   33 +
 drivers/net/ethernet/intel/ice/ice_common.c      |  217 +-
 drivers/net/ethernet/intel/ice/ice_common.h      |    9 +
 drivers/net/ethernet/intel/ice/ice_dcb_lib.c     |   64 +
 drivers/net/ethernet/intel/ice/ice_dcb_lib.h     |    3 +
 drivers/net/ethernet/intel/ice/ice_hw_autogen.h  |    3 +-
 drivers/net/ethernet/intel/ice/ice_idc.c         | 1430 +++++
 drivers/net/ethernet/intel/ice/ice_idc_int.h     |  104 +
 drivers/net/ethernet/intel/ice/ice_lib.c         |   27 +
 drivers/net/ethernet/intel/ice/ice_lib.h         |    2 +-
 drivers/net/ethernet/intel/ice/ice_main.c        |  124 +-
 drivers/net/ethernet/intel/ice/ice_sched.c       |   69 +-
 drivers/net/ethernet/intel/ice/ice_switch.c      |   27 +
 drivers/net/ethernet/intel/ice/ice_switch.h      |    4 +
 drivers/net/ethernet/intel/ice/ice_type.h        |    4 +
 drivers/net/ethernet/intel/ice/ice_virtchnl_pf.c |   34 +
 include/linux/net/intel/i40e_client.h            |   14 +
 include/linux/net/intel/iidc.h                   |  342 ++
 include/uapi/rdma/i40iw-abi.h                    |  107 -
 include/uapi/rdma/ib_user_ioctl_verbs.h          |    1 +
 include/uapi/rdma/irdma-abi.h                    |  140 +
 99 files changed, 38262 insertions(+), 28922 deletions(-)
 delete mode 100644 drivers/infiniband/hw/i40iw/Kconfig
 delete mode 100644 drivers/infiniband/hw/i40iw/Makefile
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw.h
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_cm.c
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_cm.h
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_ctrl.c
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_d.h
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_hmc.c
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_hmc.h
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_hw.c
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_main.c
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_osdep.h
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_p.h
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_pble.c
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_pble.h
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_puda.c
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_puda.h
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_register.h
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_status.h
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_type.h
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_uk.c
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_user.h
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_utils.c
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_verbs.c
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_verbs.h
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_vf.c
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_vf.h
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_virtchnl.c
 delete mode 100644 drivers/infiniband/hw/i40iw/i40iw_virtchnl.h
 create mode 100644 drivers/infiniband/hw/irdma/Kconfig
 create mode 100644 drivers/infiniband/hw/irdma/Makefile
 create mode 100644 drivers/infiniband/hw/irdma/cm.c
 create mode 100644 drivers/infiniband/hw/irdma/cm.h
 create mode 100644 drivers/infiniband/hw/irdma/ctrl.c
 create mode 100644 drivers/infiniband/hw/irdma/defs.h
 create mode 100644 drivers/infiniband/hw/irdma/hmc.c
 create mode 100644 drivers/infiniband/hw/irdma/hmc.h
 create mode 100644 drivers/infiniband/hw/irdma/hw.c
 create mode 100644 drivers/infiniband/hw/irdma/i40iw_hw.c
 create mode 100644 drivers/infiniband/hw/irdma/i40iw_hw.h
 create mode 100644 drivers/infiniband/hw/irdma/i40iw_if.c
 create mode 100644 drivers/infiniband/hw/irdma/icrdma_hw.c
 create mode 100644 drivers/infiniband/hw/irdma/icrdma_hw.h
 create mode 100644 drivers/infiniband/hw/irdma/irdma.h
 create mode 100644 drivers/infiniband/hw/irdma/irdma_if.c
 create mode 100644 drivers/infiniband/hw/irdma/main.c
 create mode 100644 drivers/infiniband/hw/irdma/main.h
 create mode 100644 drivers/infiniband/hw/irdma/osdep.h
 create mode 100644 drivers/infiniband/hw/irdma/pble.c
 create mode 100644 drivers/infiniband/hw/irdma/pble.h
 create mode 100644 drivers/infiniband/hw/irdma/protos.h
 create mode 100644 drivers/infiniband/hw/irdma/puda.c
 create mode 100644 drivers/infiniband/hw/irdma/puda.h
 create mode 100644 drivers/infiniband/hw/irdma/status.h
 create mode 100644 drivers/infiniband/hw/irdma/trace.c
 create mode 100644 drivers/infiniband/hw/irdma/trace.h
 create mode 100644 drivers/infiniband/hw/irdma/trace_cm.h
 create mode 100644 drivers/infiniband/hw/irdma/type.h
 create mode 100644 drivers/infiniband/hw/irdma/uda.c
 create mode 100644 drivers/infiniband/hw/irdma/uda.h
 create mode 100644 drivers/infiniband/hw/irdma/uda_d.h
 create mode 100644 drivers/infiniband/hw/irdma/uk.c
 create mode 100644 drivers/infiniband/hw/irdma/user.h
 create mode 100644 drivers/infiniband/hw/irdma/utils.c
 create mode 100644 drivers/infiniband/hw/irdma/verbs.c
 create mode 100644 drivers/infiniband/hw/irdma/verbs.h
 create mode 100644 drivers/infiniband/hw/irdma/ws.c
 create mode 100644 drivers/infiniband/hw/irdma/ws.h
 create mode 100644 drivers/net/ethernet/intel/ice/ice_idc.c
 create mode 100644 drivers/net/ethernet/intel/ice/ice_idc_int.h
 create mode 100644 include/linux/net/intel/iidc.h
 delete mode 100644 include/uapi/rdma/i40iw-abi.h
 create mode 100644 include/uapi/rdma/irdma-abi.h

Comments

Leon Romanovsky Jan. 24, 2021, 1:45 p.m. UTC | #1
On Fri, Jan 22, 2021 at 05:48:12PM -0600, Shiraz Saleem wrote:
> From: Mustafa Ismail <mustafa.ismail@intel.com>

>

> Register irdma as an auxiliary driver which can attach to auxiliary RDMA

> devices from Intel PCI netdev drivers i40e and ice. Implement the private

> channel ops, add basic devlink support in the driver and register net

> notifiers.


Devlink part in "the RDMA client" is interesting thing.

The idea behind auxiliary bus was that PCI logic will stay at one place
and devlink considered as the tool to manage that.

In current form every client is going to get independent devlink instance.

<...>

> +static int irdma_devlink_rsrc_limits_validate(struct devlink *dl, u32 id,

> +					      union devlink_param_value val,

> +					      struct netlink_ext_ack *extack)

> +{

> +	u8 value = val.vu8;

> +

> +	if (value > 7) {

> +		NL_SET_ERR_MSG_MOD(extack, "resource limits selector range is (0-7)");

> +		return -ERANGE;

> +	}

> +

> +	return 0;

> +}

> +

> +static int irdma_devlink_enable_roce_validate(struct devlink *dl, u32 id,

> +					      union devlink_param_value val,

> +					      struct netlink_ext_ack *extack)

> +{

> +	struct irdma_dl_priv *priv = devlink_priv(dl);

> +	bool value = val.vbool;

> +

> +	if (value && priv->drvdata->hw_ver == IRDMA_GEN_1) {

> +		NL_SET_ERR_MSG_MOD(extack, "RoCE not supported on device");

> +		return -EOPNOTSUPP;

> +	}

> +

> +	return 0;

> +}

> +

> +static int irdma_devlink_upload_ctx_get(struct devlink *devlink, u32 id,

> +					struct devlink_param_gset_ctx *ctx)

> +{

> +	ctx->val.vbool = irdma_upload_context;

> +	return 0;

> +}

> +

> +static int irdma_devlink_upload_ctx_set(struct devlink *devlink, u32 id,

> +					struct devlink_param_gset_ctx *ctx)

> +{

> +	irdma_upload_context = ctx->val.vbool;

> +	return 0;

> +}

> +

> +enum irdma_dl_param_id {

> +	IRDMA_DEVLINK_PARAM_ID_BASE = DEVLINK_PARAM_GENERIC_ID_MAX,

> +	IRDMA_DEVLINK_PARAM_ID_LIMITS_SELECTOR,

> +	IRDMA_DEVLINK_PARAM_ID_UPLOAD_CONTEXT,

> +};

> +

> +static const struct devlink_param irdma_devlink_params[] = {

> +	DEVLINK_PARAM_DRIVER(IRDMA_DEVLINK_PARAM_ID_LIMITS_SELECTOR,

> +			     "resource_limits_selector", DEVLINK_PARAM_TYPE_U8,

> +			      BIT(DEVLINK_PARAM_CMODE_DRIVERINIT),

> +			      NULL, NULL, irdma_devlink_rsrc_limits_validate),

> +	DEVLINK_PARAM_DRIVER(IRDMA_DEVLINK_PARAM_ID_UPLOAD_CONTEXT,

> +			     "upload_context", DEVLINK_PARAM_TYPE_BOOL,

> +			     BIT(DEVLINK_PARAM_CMODE_RUNTIME),

> +			     irdma_devlink_upload_ctx_get,

> +			     irdma_devlink_upload_ctx_set, NULL),

> +	DEVLINK_PARAM_GENERIC(ENABLE_ROCE, BIT(DEVLINK_PARAM_CMODE_DRIVERINIT),

> +			      NULL, NULL, irdma_devlink_enable_roce_validate),

> +};


RoCE enable knob is understandable, but others are not explained.

> +

> +static int irdma_devlink_reload_down(struct devlink *devlink, bool netns_change,

> +				     enum devlink_reload_action action,

> +				     enum devlink_reload_limit limit,

> +				     struct netlink_ext_ack *extack)

> +{

> +	struct irdma_dl_priv *priv = devlink_priv(devlink);

> +	struct auxiliary_device *aux_dev = to_auxiliary_dev(devlink->dev);

> +

> +	if (netns_change) {

> +		NL_SET_ERR_MSG_MOD(extack, "Namespace change is not supported");

> +		return -EOPNOTSUPP;

> +	}

> +

> +	priv->drvdata->deinit_dev(aux_dev);

> +

> +	return 0;

> +}

> +

> +static int irdma_devlink_reload_up(struct devlink *devlink, enum devlink_reload_action action,

> +				   enum devlink_reload_limit limit, u32 *actions_performed,

> +				   struct netlink_ext_ack *extack)

> +{

> +	struct irdma_dl_priv *priv = devlink_priv(devlink);

> +	struct auxiliary_device *aux_dev = to_auxiliary_dev(devlink->dev);

> +	union devlink_param_value saved_value;

> +	int ret;

> +

> +	*actions_performed = BIT(DEVLINK_RELOAD_ACTION_DRIVER_REINIT);

> +

> +	devlink_param_driverinit_value_get(devlink,

> +				DEVLINK_PARAM_GENERIC_ID_ENABLE_ROCE,

> +				&saved_value);

> +	priv->roce_ena = saved_value.vbool;

> +	devlink_param_driverinit_value_get(devlink,

> +				IRDMA_DEVLINK_PARAM_ID_LIMITS_SELECTOR,

> +				&saved_value);

> +	priv->limits_sel = saved_value.vbool;

> +

> +	ret = priv->drvdata->init_dev(aux_dev);

> +

> +	return ret;

> +}

> +

> +static const struct devlink_ops irdma_devlink_ops = {

> +	.reload_up = irdma_devlink_reload_up,

> +	.reload_down = irdma_devlink_reload_down,

> +	.reload_actions = BIT(DEVLINK_RELOAD_ACTION_DRIVER_REINIT),

> +};


So the plan is that every client will implement the same devlink reload logic?

Thanks
Jason Gunthorpe Jan. 25, 2021, 1:28 p.m. UTC | #2
On Sun, Jan 24, 2021 at 03:45:51PM +0200, Leon Romanovsky wrote:
> On Fri, Jan 22, 2021 at 05:48:12PM -0600, Shiraz Saleem wrote:

> > From: Mustafa Ismail <mustafa.ismail@intel.com>

> >

> > Register irdma as an auxiliary driver which can attach to auxiliary RDMA

> > devices from Intel PCI netdev drivers i40e and ice. Implement the private

> > channel ops, add basic devlink support in the driver and register net

> > notifiers.

> 

> Devlink part in "the RDMA client" is interesting thing.

> 

> The idea behind auxiliary bus was that PCI logic will stay at one place

> and devlink considered as the tool to manage that.


Yes, this doesn't seem right, I don't think these auxiliary bus
objects should have devlink instances, or at least someone from
devlink land should approve of the idea.

Especially since Mellanox is already putting them on the PCI function,
it seems like the sort of pointless difference AlexD was complaining
about.

Jason
Jason Gunthorpe Jan. 25, 2021, 1:29 p.m. UTC | #3
On Fri, Jan 22, 2021 at 05:48:05PM -0600, Shiraz Saleem wrote:

> Once the patches are closer to merging, this series will be split into a

> netdev-next and rdma-next patch series targeted at their respective subsystems

> with Patch #1 and Patch #5 included in both. This is the shared header file that

> will allow each series to independently compile.


Nope, send a proper shared pull request.

Jason
Jason Gunthorpe Jan. 25, 2021, 6:42 p.m. UTC | #4
On Fri, Jan 22, 2021 at 05:48:12PM -0600, Shiraz Saleem wrote:
> +/**

> + * irdma_init_dev - GEN_2 device init

> + * @aux_dev: auxiliary device

> + *

> + * Create device resources, set up queues, pble and hmc objects.

> + * Return 0 if successful, otherwise return error

> + */

> +int irdma_init_dev(struct auxiliary_device *aux_dev)

> +{

> +	struct iidc_auxiliary_object *vo = container_of(aux_dev,

> +							struct iidc_auxiliary_object,

> +							adev);

> +	struct iidc_peer_obj *peer_info = vo->peer_obj;

> +	struct irdma_handler *hdl;

> +	struct irdma_pci_f *rf;

> +	struct irdma_sc_dev *dev;

> +	struct irdma_priv_peer_info *priv_peer_info;

> +	int err;

> +

> +	hdl = irdma_find_handler(peer_info->pdev);

> +	if (hdl)

> +		return -EBUSY;

> +

> +	hdl = kzalloc(sizeof(*hdl), GFP_KERNEL);

> +	if (!hdl)

> +		return -ENOMEM;

> +

> +	rf = &hdl->rf;

> +	priv_peer_info = &rf->priv_peer_info;

> +	rf->aux_dev = aux_dev;

> +	rf->hdl = hdl;

> +	dev = &rf->sc_dev;

> +	dev->back_dev = rf;

> +	rf->gen_ops.init_hw = icrdma_init_hw;

> +	rf->gen_ops.request_reset = icrdma_request_reset;

> +	rf->gen_ops.register_qset = irdma_lan_register_qset;

> +	rf->gen_ops.unregister_qset = irdma_lan_unregister_qset;

> +	priv_peer_info->peer_info = peer_info;

> +	rf->rdma_ver = IRDMA_GEN_2;

> +	irdma_set_config_params(rf);

> +	dev->pci_rev = peer_info->pdev->revision;

> +	rf->default_vsi.vsi_idx = peer_info->pf_vsi_num;

> +	/* save information from peer_info to priv_peer_info*/

> +	priv_peer_info->fn_num = PCI_FUNC(peer_info->pdev->devfn);

> +	rf->hw.hw_addr = peer_info->hw_addr;

> +	rf->pcidev = peer_info->pdev;

> +	rf->netdev = peer_info->netdev;

> +	priv_peer_info->ftype = peer_info->ftype;

> +	priv_peer_info->msix_count = peer_info->msix_count;

> +	priv_peer_info->msix_entries = peer_info->msix_entries;

> +	irdma_add_handler(hdl);

> +	if (irdma_ctrl_init_hw(rf)) {

> +		err = -EIO;

> +		goto err_ctrl_init;

> +	}

> +	peer_info->peer_ops = &irdma_peer_ops;

> +	peer_info->peer_drv = &irdma_peer_drv;

> +	err = peer_info->ops->peer_register(peer_info);

> +	if (err)

> +		goto err_peer_reg;


No to this, I don't want to see aux bus layered on top of another
management framework in new drivers. When this driver uses aux bus get
rid of the old i40iw stuff. I already said this in one of the older
postings of this driver.

auxbus probe() for a RDMA driver should call ib_alloc_device() near
its start and ib_register_device() near its end.

drvdata for the aux device should point to the driver struct
containing the ib_device.

Just like any ordinary PCI based rdma driver, look at how something
like pvrdma_pci_probe() is structured.
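
For illustration only, that shape looks roughly like this; the irdma struct
and helper names here are assumptions, not code from this posting:

static int irdma_probe(struct auxiliary_device *aux_dev,
		       const struct auxiliary_device_id *id)
{
	struct irdma_device *iwdev;	/* assumed driver struct, ibdev first member */
	int err;

	/* allocate the ib_device near the start ... */
	iwdev = ib_alloc_device(irdma_device, ibdev);
	if (!iwdev)
		return -ENOMEM;

	/* ... bring up the HW owned by this aux device ... */
	err = irdma_setup_hw(iwdev, aux_dev);	/* assumed helper */
	if (err)
		goto err_dealloc;

	/* ... and register with the RDMA core near the end;
	 * DMA is done on the parent PCI function's device.
	 */
	err = ib_register_device(&iwdev->ibdev, "irdma%d", aux_dev->dev.parent);
	if (err)
		goto err_hw;

	dev_set_drvdata(&aux_dev->dev, iwdev);
	return 0;

err_hw:
	irdma_cleanup_hw(iwdev);		/* assumed helper */
err_dealloc:
	ib_dealloc_device(&iwdev->ibdev);
	return err;
}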

Jason
Jason Gunthorpe Jan. 25, 2021, 6:44 p.m. UTC | #5
On Fri, Jan 22, 2021 at 05:48:05PM -0600, Shiraz Saleem wrote:
> From: "Shiraz, Saleem" <shiraz.saleem@intel.com>

> 

> The following patch series introduces a unified Intel Ethernet Protocol Driver

> for RDMA (irdma) for the X722 iWARP device and a new E810 device which supports

> iWARP and RoCEv2. The irdma driver replaces the legacy i40iw driver for X722

> and extends the ABI already defined for i40iw. It is backward compatible with

> legacy X722 rdma-core provider (libi40iw).

> 

> X722 and E810 are PCI network devices that are RDMA capable. The RDMA block of

> this parent device is represented via an auxiliary device exported to 'irdma'

> using the core auxiliary bus infrastructure recently added for 5.11 kernel.

> The parent PCI netdev drivers 'i40e' and 'ice' register auxiliary RDMA devices

> with private data/ops encapsulated that bind to an 'irdma' auxiliary driver. 

> 

> This series is a follow on to an RFC series [1]. This series was built against

> rdma for-next and currently includes the netdev patches for ease of review.

> This include updates to 'ice' driver to provide RDMA support and converts 'i40e'

> driver to use the auxiliary bus infrastructure .

> 

> Once the patches are closer to merging, this series will be split into a

> netdev-next and rdma-next patch series targeted at their respective subsystems

> with Patch #1 and Patch #5 included in both. This is the shared header file that

> will allow each series to independently compile.

> 

> [1] https://lore.kernel.org/linux-rdma/20200520070415.3392210-1-jeffrey.t.kirsher@intel.com/

> 

> Dave Ertman (4):

>   iidc: Introduce iidc.h

>   ice: Initialize RDMA support

>   ice: Implement iidc operations

>   ice: Register auxiliary device to provide RDMA

> 

> Michael J. Ruhl (1):

>   RDMA/irdma: Add dynamic tracing for CM

> 

> Mustafa Ismail (13):

>   RDMA/irdma: Register an auxiliary driver and implement private channel

>     OPs

>   RDMA/irdma: Implement device initialization definitions

>   RDMA/irdma: Implement HW Admin Queue OPs

>   RDMA/irdma: Add HMC backing store setup functions

>   RDMA/irdma: Add privileged UDA queue implementation

>   RDMA/irdma: Add QoS definitions

>   RDMA/irdma: Add connection manager

>   RDMA/irdma: Add PBLE resource manager

>   RDMA/irdma: Implement device supported verb APIs

>   RDMA/irdma: Add RoCEv2 UD OP support

>   RDMA/irdma: Add user/kernel shared libraries

>   RDMA/irdma: Add miscellaneous utility definitions

>   RDMA/irdma: Add ABI definitions


I didn't check, but I will remind you to compile with make W=1 and
ensure this is all clean. Lee is doing good work making RDMA clean for
W=1.

Thanks,
Jason
Jason Gunthorpe Jan. 25, 2021, 6:50 p.m. UTC | #6
On Fri, Jan 22, 2021 at 05:48:26PM -0600, Shiraz Saleem wrote:
> Add Kconfig and Makefile to build irdma driver.

> 

> Remove i40iw driver and add an alias in irdma.

> irdma is the replacement driver that supports X722.

> 

> Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>

> ---


This didn't make it to patchworks or the mailing list

You will need to send it with the git format-patch flag --irreversible-delete

Jason
Jason Gunthorpe Jan. 25, 2021, 7:16 p.m. UTC | #7
On Fri, Jan 22, 2021 at 05:48:12PM -0600, Shiraz Saleem wrote:

> +static int irdma_probe(struct auxiliary_device *aux_dev,

> +		       const struct auxiliary_device_id *id)

> +{

> +	struct irdma_drvdata *drvdata;

> +	int ret;

> +

> +	drvdata = kzalloc(sizeof(*drvdata), GFP_KERNEL);

> +	if (!drvdata)

> +		return -ENOMEM;

> +

> +	switch (id->driver_data) {

> +	case IRDMA_GEN_2:

> +		drvdata->init_dev = irdma_init_dev;

> +		drvdata->deinit_dev = irdma_deinit_dev;

> +		break;

> +	case IRDMA_GEN_1:

> +		drvdata->init_dev = i40iw_init_dev;

> +		drvdata->deinit_dev = i40iw_deinit_dev;

> +		break;

> +	default:

> +		ret = -ENODEV;

> +		goto ver_err;


Also don't do this, if the drivers are so different then give them
different aux bus names and bind two drivers with the different
flow.

I suppose the old i40e can keep its weird registration thing, but ice
should not duplicate that, new code must use aux devices properly, as
in my other email.

Jason
Jason Gunthorpe Jan. 25, 2021, 7:37 p.m. UTC | #8
On Fri, Jan 22, 2021 at 05:48:23PM -0600, Shiraz Saleem wrote:
> From: Mustafa Ismail <mustafa.ismail@intel.com>

> 

> Add miscellaneous utility functions and headers.

> 

> Signed-off-by: Mustafa Ismail <mustafa.ismail@intel.com>

> Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>

>  drivers/infiniband/hw/irdma/osdep.h  |   99 ++

>  drivers/infiniband/hw/irdma/protos.h |  118 ++

>  drivers/infiniband/hw/irdma/status.h |   70 +

>  drivers/infiniband/hw/irdma/utils.c  | 2680 ++++++++++++++++++++++++++++++++++

>  4 files changed, 2967 insertions(+)

>  create mode 100644 drivers/infiniband/hw/irdma/osdep.h

>  create mode 100644 drivers/infiniband/hw/irdma/protos.h

>  create mode 100644 drivers/infiniband/hw/irdma/status.h

>  create mode 100644 drivers/infiniband/hw/irdma/utils.c

> 

> diff --git a/drivers/infiniband/hw/irdma/osdep.h b/drivers/infiniband/hw/irdma/osdep.h

> new file mode 100644

> index 0000000..10e2e02

> +++ b/drivers/infiniband/hw/irdma/osdep.h


> @@ -0,0 +1,99 @@

> +/* SPDX-License-Identifier: GPL-2.0 or Linux-OpenIB */

> +/* Copyright (c) 2015 - 2021 Intel Corporation */

> +#ifndef IRDMA_OSDEP_H

> +#define IRDMA_OSDEP_H

> +

> +#include <linux/pci.h>

> +#include <crypto/hash.h>

> +#include <rdma/ib_verbs.h>

> +

> +#define STATS_TIMER_DELAY	60000

> +

> +#define idev_to_dev(ptr) (&((ptr)->hw->pcidev->dev))

> +#define ihw_to_dev(hw)   (&(hw)->pcidev->dev)


Try to avoid defines like this, it loses the typing information of
the arguments. These should be static inline functions
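
e.g. roughly (argument types guessed from how the macros are used):

static inline struct device *idev_to_dev(struct irdma_sc_dev *dev)
{
	return &dev->hw->pcidev->dev;
}

static inline struct device *ihw_to_dev(struct irdma_hw *hw)	/* type assumed */
{
	return &hw->pcidev->dev;
}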

Jason
Jakub Kicinski Jan. 25, 2021, 8:52 p.m. UTC | #9
On Mon, 25 Jan 2021 09:28:34 -0400 Jason Gunthorpe wrote:
> On Sun, Jan 24, 2021 at 03:45:51PM +0200, Leon Romanovsky wrote:

> > On Fri, Jan 22, 2021 at 05:48:12PM -0600, Shiraz Saleem wrote:  

> > > From: Mustafa Ismail <mustafa.ismail@intel.com>

> > >

> > > Register irdma as an auxiliary driver which can attach to auxiliary RDMA

> > > devices from Intel PCI netdev drivers i40e and ice. Implement the private

> > > channel ops, add basic devlink support in the driver and register net

> > > notifiers.  

> > 

> > Devlink part in "the RDMA client" is interesting thing.

> > 

> > The idea behind auxiliary bus was that PCI logic will stay at one place

> > and devlink considered as the tool to manage that.  

> 

> Yes, this doesn't seem right, I don't think these auxiliary bus

> objects should have devlink instances


+1
Saleem, Shiraz Jan. 26, 2021, 12:39 a.m. UTC | #10
> Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver and

> implement private channel OPs

> 

> 

> > +static int irdma_devlink_rsrc_limits_validate(struct devlink *dl, u32 id,
> > +					      union devlink_param_value val,
> > +					      struct netlink_ext_ack *extack)
> > +{
> > +	u8 value = val.vu8;
> > +
> > +	if (value > 7) {
> > +		NL_SET_ERR_MSG_MOD(extack, "resource limits selector range is (0-7)");
> > +		return -ERANGE;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int irdma_devlink_enable_roce_validate(struct devlink *dl, u32 id,
> > +					      union devlink_param_value val,
> > +					      struct netlink_ext_ack *extack)
> > +{
> > +	struct irdma_dl_priv *priv = devlink_priv(dl);
> > +	bool value = val.vbool;
> > +
> > +	if (value && priv->drvdata->hw_ver == IRDMA_GEN_1) {
> > +		NL_SET_ERR_MSG_MOD(extack, "RoCE not supported on device");
> > +		return -EOPNOTSUPP;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int irdma_devlink_upload_ctx_get(struct devlink *devlink, u32 id,
> > +					struct devlink_param_gset_ctx *ctx)
> > +{
> > +	ctx->val.vbool = irdma_upload_context;
> > +	return 0;
> > +}
> > +
> > +static int irdma_devlink_upload_ctx_set(struct devlink *devlink, u32 id,
> > +					struct devlink_param_gset_ctx *ctx)
> > +{
> > +	irdma_upload_context = ctx->val.vbool;
> > +	return 0;
> > +}
> > +
> > +enum irdma_dl_param_id {
> > +	IRDMA_DEVLINK_PARAM_ID_BASE = DEVLINK_PARAM_GENERIC_ID_MAX,
> > +	IRDMA_DEVLINK_PARAM_ID_LIMITS_SELECTOR,
> > +	IRDMA_DEVLINK_PARAM_ID_UPLOAD_CONTEXT,
> > +};
> > +
> > +static const struct devlink_param irdma_devlink_params[] = {
> > +	DEVLINK_PARAM_DRIVER(IRDMA_DEVLINK_PARAM_ID_LIMITS_SELECTOR,
> > +			     "resource_limits_selector", DEVLINK_PARAM_TYPE_U8,
> > +			      BIT(DEVLINK_PARAM_CMODE_DRIVERINIT),
> > +			      NULL, NULL, irdma_devlink_rsrc_limits_validate),
> > +	DEVLINK_PARAM_DRIVER(IRDMA_DEVLINK_PARAM_ID_UPLOAD_CONTEXT,
> > +			     "upload_context", DEVLINK_PARAM_TYPE_BOOL,
> > +			     BIT(DEVLINK_PARAM_CMODE_RUNTIME),
> > +			     irdma_devlink_upload_ctx_get,
> > +			     irdma_devlink_upload_ctx_set, NULL),
> > +	DEVLINK_PARAM_GENERIC(ENABLE_ROCE, BIT(DEVLINK_PARAM_CMODE_DRIVERINIT),
> > +			      NULL, NULL, irdma_devlink_enable_roce_validate),
> > +};

> 

> RoCE enable knob is understandable, but others are not explained.

> 


OK. That can be fixed.
Saleem, Shiraz Jan. 26, 2021, 12:39 a.m. UTC | #11
> Subject: Re: [PATCH 00/22] Add Intel Ethernet Protocol Driver for RDMA (irdma)

> 

> On Fri, Jan 22, 2021 at 05:48:05PM -0600, Shiraz Saleem wrote:

> > From: "Shiraz, Saleem" <shiraz.saleem@intel.com>

> >

> > The following patch series introduces a unified Intel Ethernet

> > Protocol Driver for RDMA (irdma) for the X722 iWARP device and a new

> > E810 device which supports iWARP and RoCEv2. The irdma driver replaces

> > the legacy i40iw driver for X722 and extends the ABI already defined

> > for i40iw. It is backward compatible with legacy X722 rdma-core provider

> (libi40iw).

> >

> > X722 and E810 are PCI network devices that are RDMA capable. The RDMA

> > block of this parent device is represented via an auxiliary device exported to

> 'irdma'

> > using the core auxiliary bus infrastructure recently added for 5.11 kernel.

> > The parent PCI netdev drivers 'i40e' and 'ice' register auxiliary RDMA

> > devices with private data/ops encapsulated that bind to an 'irdma' auxiliary

> driver.

> >

> > This series is a follow on to an RFC series [1]. This series was built

> > against rdma for-next and currently includes the netdev patches for ease of

> review.

> > This include updates to 'ice' driver to provide RDMA support and converts 'i40e'

> > driver to use the auxiliary bus infrastructure .

> >

> > Once the patches are closer to merging, this series will be split into

> > a netdev-next and rdma-next patch series targeted at their respective

> > subsystems with Patch #1 and Patch #5 included in both. This is the

> > shared header file that will allow each series to independently compile.

> >

> > [1]

> > https://lore.kernel.org/linux-rdma/20200520070415.3392210-1-jeffrey.t.

> > kirsher@intel.com/

> >

> > Dave Ertman (4):

> >   iidc: Introduce iidc.h

> >   ice: Initialize RDMA support

> >   ice: Implement iidc operations

> >   ice: Register auxiliary device to provide RDMA

> >

> > Michael J. Ruhl (1):

> >   RDMA/irdma: Add dynamic tracing for CM

> >

> > Mustafa Ismail (13):

> >   RDMA/irdma: Register an auxiliary driver and implement private channel

> >     OPs

> >   RDMA/irdma: Implement device initialization definitions

> >   RDMA/irdma: Implement HW Admin Queue OPs

> >   RDMA/irdma: Add HMC backing store setup functions

> >   RDMA/irdma: Add privileged UDA queue implementation

> >   RDMA/irdma: Add QoS definitions

> >   RDMA/irdma: Add connection manager

> >   RDMA/irdma: Add PBLE resource manager

> >   RDMA/irdma: Implement device supported verb APIs

> >   RDMA/irdma: Add RoCEv2 UD OP support

> >   RDMA/irdma: Add user/kernel shared libraries

> >   RDMA/irdma: Add miscellaneous utility definitions

> >   RDMA/irdma: Add ABI definitions

> 

> I didn't check, but I will remind you to compile with make W=1 and ensure this is all

> clean. Lee is doing good work making RDMA clean for W=1.

> 


Yes. That is done.
Saleem, Shiraz Jan. 26, 2021, 12:39 a.m. UTC | #12
> Subject: Re: [PATCH 21/22] RDMA/irdma: Add irdma Kconfig/Makefile and remove

> i40iw

> 

> On Fri, Jan 22, 2021 at 05:48:26PM -0600, Shiraz Saleem wrote:

> > Add Kconfig and Makefile to build irdma driver.

> >

> > Remove i40iw driver and add an alias in irdma.

> > irdma is the replacement driver that supports X722.

> >

> > Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>

> > ---

> 

> This didn't make it to patchworks or the mailing list

> 

> You will need to send it with the git format-patch flag --irreversible-delete

> 


Ok.
Saleem, Shiraz Jan. 26, 2021, 12:39 a.m. UTC | #13
> Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver and

> implement private channel OPs

> 

> On Sun, Jan 24, 2021 at 03:45:51PM +0200, Leon Romanovsky wrote:

> > On Fri, Jan 22, 2021 at 05:48:12PM -0600, Shiraz Saleem wrote:

> > > From: Mustafa Ismail <mustafa.ismail@intel.com>

> > >

> > > Register irdma as an auxiliary driver which can attach to auxiliary

> > > RDMA devices from Intel PCI netdev drivers i40e and ice. Implement

> > > the private channel ops, add basic devlink support in the driver and

> > > register net notifiers.

> >

> > Devlink part in "the RDMA client" is interesting thing.

> >

> > The idea behind auxiliary bus was that PCI logic will stay at one

> > place and devlink considered as the tool to manage that.

> 

> Yes, this doesn't seem right, I don't think these auxiliary bus objects should have

> devlink instances, or at least someone from devlink land should approve of the

> idea.

> 


In our model, we have one auxdev (for RDMA) per PCI device function owned by netdev driver
and one devlink instance per auxdev. Plus there is an Intel netdev driver for each HW generation.
Moving the devlink logic to the PCI netdev driver would mean duplicating the same set of RDMA
params in each Intel netdev driver. Additionally, plumbing RDMA specific params in the netdev
driver sort of seems misplaced to me.

devlink is on a per 'struct device' object right? Should we be limiting ourselves in its
usage to only the PCI driver and PCI dev? And not other devices like auxdev?

To Leon's question on the number of devlink instances, I think it is going to be the same whether you implement it in the netdev or the RDMA driver in this case.

Shiraz
Saleem, Shiraz Jan. 26, 2021, 12:42 a.m. UTC | #14
> Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver and

> implement private channel OPs

> 

> On Fri, Jan 22, 2021 at 05:48:12PM -0600, Shiraz Saleem wrote:

> > +/**
> > + * irdma_init_dev - GEN_2 device init
> > + * @aux_dev: auxiliary device
> > + *
> > + * Create device resources, set up queues, pble and hmc objects.
> > + * Return 0 if successful, otherwise return error
> > + */
> > +int irdma_init_dev(struct auxiliary_device *aux_dev)
> > +{
> > +	struct iidc_auxiliary_object *vo = container_of(aux_dev,
> > +							struct iidc_auxiliary_object,
> > +							adev);
> > +	struct iidc_peer_obj *peer_info = vo->peer_obj;
> > +	struct irdma_handler *hdl;
> > +	struct irdma_pci_f *rf;
> > +	struct irdma_sc_dev *dev;
> > +	struct irdma_priv_peer_info *priv_peer_info;
> > +	int err;
> > +
> > +	hdl = irdma_find_handler(peer_info->pdev);
> > +	if (hdl)
> > +		return -EBUSY;
> > +
> > +	hdl = kzalloc(sizeof(*hdl), GFP_KERNEL);
> > +	if (!hdl)
> > +		return -ENOMEM;
> > +
> > +	rf = &hdl->rf;
> > +	priv_peer_info = &rf->priv_peer_info;
> > +	rf->aux_dev = aux_dev;
> > +	rf->hdl = hdl;
> > +	dev = &rf->sc_dev;
> > +	dev->back_dev = rf;
> > +	rf->gen_ops.init_hw = icrdma_init_hw;
> > +	rf->gen_ops.request_reset = icrdma_request_reset;
> > +	rf->gen_ops.register_qset = irdma_lan_register_qset;
> > +	rf->gen_ops.unregister_qset = irdma_lan_unregister_qset;
> > +	priv_peer_info->peer_info = peer_info;
> > +	rf->rdma_ver = IRDMA_GEN_2;
> > +	irdma_set_config_params(rf);
> > +	dev->pci_rev = peer_info->pdev->revision;
> > +	rf->default_vsi.vsi_idx = peer_info->pf_vsi_num;
> > +	/* save information from peer_info to priv_peer_info*/
> > +	priv_peer_info->fn_num = PCI_FUNC(peer_info->pdev->devfn);
> > +	rf->hw.hw_addr = peer_info->hw_addr;
> > +	rf->pcidev = peer_info->pdev;
> > +	rf->netdev = peer_info->netdev;
> > +	priv_peer_info->ftype = peer_info->ftype;
> > +	priv_peer_info->msix_count = peer_info->msix_count;
> > +	priv_peer_info->msix_entries = peer_info->msix_entries;
> > +	irdma_add_handler(hdl);
> > +	if (irdma_ctrl_init_hw(rf)) {
> > +		err = -EIO;
> > +		goto err_ctrl_init;
> > +	}
> > +	peer_info->peer_ops = &irdma_peer_ops;
> > +	peer_info->peer_drv = &irdma_peer_drv;
> > +	err = peer_info->ops->peer_register(peer_info);
> > +	if (err)
> > +		goto err_peer_reg;

> 

> No to this, I don't want to see aux bus layered on top of another management

> framework in new drivers. When this driver uses aux bus get rid of the old i40iw

> stuff. I already said this in one of the older postings of this driver.

> 

> auxbus probe() for a RDMA driver should call ib_alloc_device() near its start and

> ib_register_device() near the end its end.

> 

> drvdata for the aux device should point to the driver struct containing the

> ib_device.

> 

> Just like any ordinary PCI based rdma driver, look at how something like

> pvrdma_pci_probe() is structued.

> 


I think this essentially means doing away with the .open/.close piece. Or are you saying that is ok?
Yes, we had a discussion in the past and I thought we had concluded. But maybe I misunderstood.

https://lore.kernel.org/linux-rdma/9DD61F30A802C4429A01CA4200E302A7DCD4FD03@fmsmsx124.amr.corp.intel.com/

The concern is around losing the resources held for rdma VFs by the rdma PF due to underlying config changes which require a de-reg/re-registration of the ibdevice.
Today such config changes are handled with the netdev PCI driver using the .close() private callback into the rdma driver, which unregisters the ibdevice in the PF while
allowing the RDMA VF to survive.

Shiraz
Jason Gunthorpe Jan. 26, 2021, 12:47 a.m. UTC | #15
On Tue, Jan 26, 2021 at 12:39:53AM +0000, Saleem, Shiraz wrote:
> > Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver and

> > implement private channel OPs

> > 

> > On Sun, Jan 24, 2021 at 03:45:51PM +0200, Leon Romanovsky wrote:

> > > On Fri, Jan 22, 2021 at 05:48:12PM -0600, Shiraz Saleem wrote:

> > > > From: Mustafa Ismail <mustafa.ismail@intel.com>

> > > >

> > > > Register irdma as an auxiliary driver which can attach to auxiliary

> > > > RDMA devices from Intel PCI netdev drivers i40e and ice. Implement

> > > > the private channel ops, add basic devlink support in the driver and

> > > > register net notifiers.

> > >

> > > Devlink part in "the RDMA client" is interesting thing.

> > >

> > > The idea behind auxiliary bus was that PCI logic will stay at one

> > > place and devlink considered as the tool to manage that.

> > 

> > Yes, this doesn't seem right, I don't think these auxiliary bus objects should have

> > devlink instances, or at least someone from devlink land should approve of the

> > idea.

> > 

> 

> In our model, we have one auxdev (for RDMA) per PCI device function

> owned by netdev driver and one devlink instance per auxdev. Plus

> there is an Intel netdev driver for each HW generation.  Moving the

> devlink logic to the PCI netdev driver would mean duplicating the

> same set of RDMA params in each Intel netdev driver. Additionally,

> plumbing RDMA specific params in the netdev driver sort of seems

> misplaced to me.


That's because it is not supposed to be "the netdev driver" but the
shared physical PCI function driver, and devlink is a shared part of
the PCI function.

> devlink is on a per 'struct device' object right? Should we be

> limiting ourselves in its usage to only the PCI driver and PCI dev?

> And not other devices like auxdev?


The devlink should not be created on the aux device, devlink should be
created against PCI functions.

It is important to follow established convention here, otherwise it is a
mess for users to know what to do

Jason
Keller, Jacob E Jan. 26, 2021, 12:57 a.m. UTC | #16
> Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver and

> implement private channel OPs

> 

> On Tue, Jan 26, 2021 at 12:39:53AM +0000, Saleem, Shiraz wrote:

> > > Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver and

> > > implement private channel OPs

> > >

> > > On Sun, Jan 24, 2021 at 03:45:51PM +0200, Leon Romanovsky wrote:

> > > > On Fri, Jan 22, 2021 at 05:48:12PM -0600, Shiraz Saleem wrote:

> > > > > From: Mustafa Ismail <mustafa.ismail@intel.com>

> > > > >

> > > > > Register irdma as an auxiliary driver which can attach to auxiliary

> > > > > RDMA devices from Intel PCI netdev drivers i40e and ice. Implement

> > > > > the private channel ops, add basic devlink support in the driver and

> > > > > register net notifiers.

> > > >

> > > > Devlink part in "the RDMA client" is interesting thing.

> > > >

> > > > The idea behind auxiliary bus was that PCI logic will stay at one

> > > > place and devlink considered as the tool to manage that.

> > >

> > > Yes, this doesn't seem right, I don't think these auxiliary bus objects should have

> > > devlink instances, or at least someone from devlink land should approve of the

> > > idea.

> > >

> >

> > In our model, we have one auxdev (for RDMA) per PCI device function

> > owned by netdev driver and one devlink instance per auxdev. Plus

> > there is an Intel netdev driver for each HW generation.  Moving the

> > devlink logic to the PCI netdev driver would mean duplicating the

> > same set of RDMA params in each Intel netdev driver. Additionally,

> > plumbing RDMA specific params in the netdev driver sort of seems

> > misplaced to me.

> 

> That's because it is not supposed to be "the netdev driver" but the

> shared physical PCI function driver, and devlink is a shared part of

> the PCI function.


Well, at least in the Intel ice driver case, we have multiple PCI functions, and each function gets its own devlink (as opposed to a single devlink instance). There's no separation between the netdev driver and the PCI function driver today.

> 

> > devlink is on a per 'struct device' object right? Should we be

> > limiting ourselves in its usage to only the PCI driver and PCI dev?

> > And not other devices like auxdev?

> 

> The devlink should not be created on the aux device, devlink should be

> created against PCI functions.

> 

> It is important to follow establish convention here, otherwise it is a

> user mess to know what to do

> 

> Jason
Jason Gunthorpe Jan. 26, 2021, 12:59 a.m. UTC | #17
On Tue, Jan 26, 2021 at 12:42:16AM +0000, Saleem, Shiraz wrote:

> I think this essentially means doing away with .open/.close piece. 


Yes, that too, and probably the FSM as well.

> Or are you saying that is ok?  Yes we had a discussion in the past

> and I thought we concluded. But maybe I misunderstood.

> 

> https://lore.kernel.org/linux-rdma/9DD61F30A802C4429A01CA4200E302A7DCD4FD03@fmsmsx124.amr.corp.intel.com/


Well, having now seen how aux bus ended up and the way it affected the
mlx5 driver, I am more firmly of the opinion this needs to be
fixed. It is extremely hard to get everything right with two different
registration schemes running around.

You never answered my question:

> Still, you need to be able to cope with the user unbinding your

> drivers in any order via sysfs. What happens to the VFs when the PF is

> unbound and releases whatever resources? This is where the broadcom

> driver ran into troubles..


?

> PF due to underlying config changes which require a

> de-reg/re-registration of the ibdevice.  Today such config changes

> are handled with the netdev PCI driver using the .close() private

> callback into rdma driver which unregister ibdevice in PF while

> allowing the RDMA VF to survive.


Putting resources the device needs to function in the aux driver is no
good; managing shared resources is the role of the core driver that owns
the PCI function.

If you put the open/close on the aux device_driver struct and got rid
of the redundant life cycle stuff, it might be OK enough, assuming
there is a good answer to the question above.

Jason
Keller, Jacob E Jan. 26, 2021, 1:01 a.m. UTC | #18
On 1/25/2021 4:39 PM, Saleem, Shiraz wrote:
>> Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver and

>> implement private channel OPs

>>

>> On Sun, Jan 24, 2021 at 03:45:51PM +0200, Leon Romanovsky wrote:

>>> On Fri, Jan 22, 2021 at 05:48:12PM -0600, Shiraz Saleem wrote:

>>>> From: Mustafa Ismail <mustafa.ismail@intel.com>

>>>>

>>>> Register irdma as an auxiliary driver which can attach to auxiliary

>>>> RDMA devices from Intel PCI netdev drivers i40e and ice. Implement

>>>> the private channel ops, add basic devlink support in the driver and

>>>> register net notifiers.

>>>

>>> Devlink part in "the RDMA client" is interesting thing.

>>>

>>> The idea behind auxiliary bus was that PCI logic will stay at one

>>> place and devlink considered as the tool to manage that.

>>

>> Yes, this doesn't seem right, I don't think these auxiliary bus objects should have

>> devlink instances, or at least someone from devlink land should approve of the

>> idea.

>>

> 

> In our model, we have one auxdev (for RDMA) per PCI device function owned by netdev driver

> and one devlink instance per auxdev. Plus there is an Intel netdev driver for each HW generation.

> Moving the devlink logic to the PCI netdev driver would mean duplicating the same set of RDMA

> params in each Intel netdev driver. Additionally, plumbing RDMA specific params in the netdev

> driver sort of seems misplaced to me.

> 


I agree that plumbing these parameters at the PCI side in the devlink of
the parent device is weird. They don't seem to be parameters that the
parent driver cares about.

Maybe there is another mechanism that makes more sense? To me it is a
bit like if we were plumbing netdev-specific parameters into devlink
instead of trying to expose them through netdevice specific interfaces
like iproute2 or ethtool.
Jason Gunthorpe Jan. 26, 2021, 1:10 a.m. UTC | #19
On Mon, Jan 25, 2021 at 05:01:40PM -0800, Jacob Keller wrote:
> 

> 

> On 1/25/2021 4:39 PM, Saleem, Shiraz wrote:

> >> Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver and

> >> implement private channel OPs

> >>

> >> On Sun, Jan 24, 2021 at 03:45:51PM +0200, Leon Romanovsky wrote:

> >>> On Fri, Jan 22, 2021 at 05:48:12PM -0600, Shiraz Saleem wrote:

> >>>> From: Mustafa Ismail <mustafa.ismail@intel.com>

> >>>>

> >>>> Register irdma as an auxiliary driver which can attach to auxiliary

> >>>> RDMA devices from Intel PCI netdev drivers i40e and ice. Implement

> >>>> the private channel ops, add basic devlink support in the driver and

> >>>> register net notifiers.

> >>>

> >>> Devlink part in "the RDMA client" is interesting thing.

> >>>

> >>> The idea behind auxiliary bus was that PCI logic will stay at one

> >>> place and devlink considered as the tool to manage that.

> >>

> >> Yes, this doesn't seem right, I don't think these auxiliary bus objects should have

> >> devlink instances, or at least someone from devlink land should approve of the

> >> idea.

> >>

> > 

> > In our model, we have one auxdev (for RDMA) per PCI device function owned by netdev driver

> > and one devlink instance per auxdev. Plus there is an Intel netdev driver for each HW generation.

> > Moving the devlink logic to the PCI netdev driver would mean duplicating the same set of RDMA

> > params in each Intel netdev driver. Additionally, plumbing RDMA specific params in the netdev

> > driver sort of seems misplaced to me.

> > 

> 

> I agree that plumbing these parameters at the PCI side in the devlink of

> the parent device is weird. They don't seem to be parameters that the

> parent driver cares about.


It does, the PCI driver is not supposed to spawn any aux devices for
RDMA at all if RDMA is disabled.

For an iWarp driver I would consider ENABLE_ROCE to really be a
general ENABLE_RDMA.

Are you sure you need to implement this?

In any event, you just can't put the generic ENABLE_ROCE flag anyplace
but the PCI device for devlink, it breaks the expected user API
established by mlx5

Jason
Leon Romanovsky Jan. 26, 2021, 5:29 a.m. UTC | #20
On Mon, Jan 25, 2021 at 05:01:40PM -0800, Jacob Keller wrote:
>

>

> On 1/25/2021 4:39 PM, Saleem, Shiraz wrote:

> >> Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver and

> >> implement private channel OPs

> >>

> >> On Sun, Jan 24, 2021 at 03:45:51PM +0200, Leon Romanovsky wrote:

> >>> On Fri, Jan 22, 2021 at 05:48:12PM -0600, Shiraz Saleem wrote:

> >>>> From: Mustafa Ismail <mustafa.ismail@intel.com>

> >>>>

> >>>> Register irdma as an auxiliary driver which can attach to auxiliary

> >>>> RDMA devices from Intel PCI netdev drivers i40e and ice. Implement

> >>>> the private channel ops, add basic devlink support in the driver and

> >>>> register net notifiers.

> >>>

> >>> Devlink part in "the RDMA client" is interesting thing.

> >>>

> >>> The idea behind auxiliary bus was that PCI logic will stay at one

> >>> place and devlink considered as the tool to manage that.

> >>

> >> Yes, this doesn't seem right, I don't think these auxiliary bus objects should have

> >> devlink instances, or at least someone from devlink land should approve of the

> >> idea.

> >>

> >

> > In our model, we have one auxdev (for RDMA) per PCI device function owned by netdev driver

> > and one devlink instance per auxdev. Plus there is an Intel netdev driver for each HW generation.

> > Moving the devlink logic to the PCI netdev driver would mean duplicating the same set of RDMA

> > params in each Intel netdev driver. Additionally, plumbing RDMA specific params in the netdev

> > driver sort of seems misplaced to me.

> >

>

> I agree that plumbing these parameters at the PCI side in the devlink of

> the parent device is weird. They don't seem to be parameters that the

> parent driver cares about.

>

> Maybe there is another mechanism that makes more sense? To me it is a

> bit like if we were plumbing netdev specific paramters into devlink

> instead of trying to expose them through netdevice specific interfaces

> like iproute2 or ethtool.


I'm far from being an expert in devlink, but for me the separation is the following:
1. devlink - operates on physical device level, when PCI device already initialized.
2. ethtool - changes needed to be done on netdev layer.
3. ip - upper layer of the netdev
4. rdmatool - RDMA specific when IB device already exists.

And the ENABLE_ROCE/ENABLE_RDMA thing shouldn't be in the RDMA driver at
all, because it is a physical device property which, once toggled, will
prohibit creation of the respective aux device.
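
A rough sketch of that split, with the ice-side helper names made up purely
for illustration:

/* In the PF ('ice'/'i40e') driver, not in irdma: only create the RDMA aux
 * device when the devlink knob allows it.  ice_plug_rdma_aux_dev() is an
 * assumed name for whatever spawns the aux device.
 */
static int ice_rdma_init(struct ice_pf *pf, struct devlink *devlink)
{
	union devlink_param_value val;
	int err;

	err = devlink_param_driverinit_value_get(devlink,
					DEVLINK_PARAM_GENERIC_ID_ENABLE_ROCE,
					&val);
	if (err)
		return err;

	if (!val.vbool)
		return 0;	/* RDMA disabled: no aux device, irdma never binds */

	return ice_plug_rdma_aux_dev(pf);
}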

Thanks
Leon Romanovsky Jan. 26, 2021, 5:37 a.m. UTC | #21
On Mon, Jan 25, 2021 at 02:42:48PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 22, 2021 at 05:48:12PM -0600, Shiraz Saleem wrote:

> > +/**

> > + * irdma_init_dev - GEN_2 device init

> > + * @aux_dev: auxiliary device

> > + *

> > + * Create device resources, set up queues, pble and hmc objects.

> > + * Return 0 if successful, otherwise return error

> > + */

> > +int irdma_init_dev(struct auxiliary_device *aux_dev)

> > +{

> > +	struct iidc_auxiliary_object *vo = container_of(aux_dev,

> > +							struct iidc_auxiliary_object,

> > +							adev);

> > +	struct iidc_peer_obj *peer_info = vo->peer_obj;

> > +	struct irdma_handler *hdl;

> > +	struct irdma_pci_f *rf;

> > +	struct irdma_sc_dev *dev;

> > +	struct irdma_priv_peer_info *priv_peer_info;

> > +	int err;

> > +

> > +	hdl = irdma_find_handler(peer_info->pdev);

> > +	if (hdl)

> > +		return -EBUSY;

> > +

> > +	hdl = kzalloc(sizeof(*hdl), GFP_KERNEL);

> > +	if (!hdl)

> > +		return -ENOMEM;

> > +

> > +	rf = &hdl->rf;

> > +	priv_peer_info = &rf->priv_peer_info;

> > +	rf->aux_dev = aux_dev;

> > +	rf->hdl = hdl;

> > +	dev = &rf->sc_dev;

> > +	dev->back_dev = rf;

> > +	rf->gen_ops.init_hw = icrdma_init_hw;

> > +	rf->gen_ops.request_reset = icrdma_request_reset;

> > +	rf->gen_ops.register_qset = irdma_lan_register_qset;

> > +	rf->gen_ops.unregister_qset = irdma_lan_unregister_qset;

> > +	priv_peer_info->peer_info = peer_info;

> > +	rf->rdma_ver = IRDMA_GEN_2;

> > +	irdma_set_config_params(rf);

> > +	dev->pci_rev = peer_info->pdev->revision;

> > +	rf->default_vsi.vsi_idx = peer_info->pf_vsi_num;

> > +	/* save information from peer_info to priv_peer_info*/

> > +	priv_peer_info->fn_num = PCI_FUNC(peer_info->pdev->devfn);

> > +	rf->hw.hw_addr = peer_info->hw_addr;

> > +	rf->pcidev = peer_info->pdev;

> > +	rf->netdev = peer_info->netdev;

> > +	priv_peer_info->ftype = peer_info->ftype;

> > +	priv_peer_info->msix_count = peer_info->msix_count;

> > +	priv_peer_info->msix_entries = peer_info->msix_entries;

> > +	irdma_add_handler(hdl);

> > +	if (irdma_ctrl_init_hw(rf)) {

> > +		err = -EIO;

> > +		goto err_ctrl_init;

> > +	}

> > +	peer_info->peer_ops = &irdma_peer_ops;

> > +	peer_info->peer_drv = &irdma_peer_drv;

> > +	err = peer_info->ops->peer_register(peer_info);

> > +	if (err)

> > +		goto err_peer_reg;

>

> No to this, I don't want to see aux bus layered on top of another

> management framework in new drivers. When this driver uses aux bus get

> rid of the old i40iw stuff. I already said this in one of the older

> postings of this driver.

>

> auxbus probe() for a RDMA driver should call ib_alloc_device() near

> its start and ib_register_device() near the end its end.

>

> drvdata for the aux device should point to the driver struct

> containing the ib_device.


My other expectation is to see at least two aux_drivers, one for
RoCE and another for iWARP. It will allow easy management for
users if they decide to disable/enable specific functionality
(/sys/bus/auxiliary/device/*). It will simplify code management too.
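
Roughly like the sketch below, i.e. two id tables and two auxiliary_drivers;
the device-id and function names are only an example:

/* Example only: the parent drivers would have to expose separately named
 * devices ("iwarp" vs "roce") for this split to work.
 */
static const struct auxiliary_device_id irdma_iwarp_id_table[] = {
	{ .name = "i40e.iwarp", },
	{ .name = "ice.iwarp", },
	{},
};

static const struct auxiliary_device_id irdma_roce_id_table[] = {
	{ .name = "ice.roce", },
	{},
};

static struct auxiliary_driver irdma_iwarp_drv = {
	.name = "iwarp",
	.id_table = irdma_iwarp_id_table,
	.probe = irdma_iwarp_probe,	/* assumed probe/remove helpers */
	.remove = irdma_remove,
};

static struct auxiliary_driver irdma_roce_drv = {
	.name = "roce",
	.id_table = irdma_roce_id_table,
	.probe = irdma_roce_probe,
	.remove = irdma_remove,
};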

Thanks
Leon Romanovsky Jan. 26, 2021, 5:47 a.m. UTC | #22
On Mon, Jan 25, 2021 at 02:50:07PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 22, 2021 at 05:48:26PM -0600, Shiraz Saleem wrote:

> > Add Kconfig and Makefile to build irdma driver.

> >

> > Remove i40iw driver and add an alias in irdma.

> > irdma is the replacement driver that supports X722.

> >

> > Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>

> > ---

>

> This didn't make it to patchworks or the mailing list


It arrived - https://lore.kernel.org/linux-rdma/20210125185007.GU4147@nvidia.com/
Yesterday, VGER had some weird behaviour.

Thanks
Keller, Jacob E Jan. 26, 2021, 10:07 p.m. UTC | #23
On 1/25/2021 9:29 PM, Leon Romanovsky wrote:
> On Mon, Jan 25, 2021 at 05:01:40PM -0800, Jacob Keller wrote:

>>

>>

>> On 1/25/2021 4:39 PM, Saleem, Shiraz wrote:

>>>> Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver and

>>>> implement private channel OPs

>>>>

>>>> On Sun, Jan 24, 2021 at 03:45:51PM +0200, Leon Romanovsky wrote:

>>>>> On Fri, Jan 22, 2021 at 05:48:12PM -0600, Shiraz Saleem wrote:

>>>>>> From: Mustafa Ismail <mustafa.ismail@intel.com>

>>>>>>

>>>>>> Register irdma as an auxiliary driver which can attach to auxiliary

>>>>>> RDMA devices from Intel PCI netdev drivers i40e and ice. Implement

>>>>>> the private channel ops, add basic devlink support in the driver and

>>>>>> register net notifiers.

>>>>>

>>>>> Devlink part in "the RDMA client" is interesting thing.

>>>>>

>>>>> The idea behind auxiliary bus was that PCI logic will stay at one

>>>>> place and devlink considered as the tool to manage that.

>>>>

>>>> Yes, this doesn't seem right, I don't think these auxiliary bus objects should have

>>>> devlink instances, or at least someone from devlink land should approve of the

>>>> idea.

>>>>

>>>

>>> In our model, we have one auxdev (for RDMA) per PCI device function owned by netdev driver

>>> and one devlink instance per auxdev. Plus there is an Intel netdev driver for each HW generation.

>>> Moving the devlink logic to the PCI netdev driver would mean duplicating the same set of RDMA

>>> params in each Intel netdev driver. Additionally, plumbing RDMA specific params in the netdev

>>> driver sort of seems misplaced to me.

>>>

>>

>> I agree that plumbing these parameters at the PCI side in the devlink of

>> the parent device is weird. They don't seem to be parameters that the

>> parent driver cares about.

>>

>> Maybe there is another mechanism that makes more sense? To me it is a

>> bit like if we were plumbing netdev specific paramters into devlink

>> instead of trying to expose them through netdevice specific interfaces

>> like iproute2 or ethtool.

> 

> I'm far from being expert in devlink, but for me separation is following:

> 1. devlink - operates on physical device level, when PCI device already initialized.

> 2. ethtool - changes needed to be done on netdev layer.

> 3. ip - upper layer of the netdev

> 4. rdmatool - RDMA specific when IB device already exists.

> 

> And the ENABLE_ROCE/ENABLE_RDMA thing shouldn't be in the RDMA driver at

> all, because it is physical device property which once toggled will

> prohibit creation of respective aux device.

> 


Ok, I guess I hadn't looked quite as closely at the specifics here. I
agree that ENABLE_RDMA should go in the PF devlink.

If there's any other sort of RDMA-specific configuration that ties to
the IB device, it should go into rdmatool somehow, rather than
devlink. And thus I think I agree: we don't want the IB device or the
aux device to create a devlink instance.

> Thanks

>
Saleem, Shiraz Jan. 27, 2021, 12:41 a.m. UTC | #24
> Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver and

> implement private channel OPs

> 

> On Tue, Jan 26, 2021 at 12:42:16AM +0000, Saleem, Shiraz wrote:

> 

> > I think this essentially means doing away with .open/.close piece.

> 

> Yes, that too, and probably the FSM as well.

> 

> > Or are you saying that is ok?  Yes we had a discussion in the past and

> > I thought we concluded. But maybe I misunderstood.

> >

> > https://lore.kernel.org/linux-rdma/9DD61F30A802C4429A01CA4200E302A7DCD

> > 4FD03@fmsmsx124.amr.corp.intel.com/

> 

> Well, having now seen how aux bus ended up and the way it effected the

> mlx5 driver, I am more firmly of the opinion this needs to be fixed. It is extremly

> hard to get everything right with two different registration schemes running around.

> 

> You never answered my question:


Sorry I missed it.
> 

> > Still, you need to be able to cope with the user unbinding your

> > drivers in any order via sysfs. What happens to the VFs when the PF is

> > unbound and releases whatever resources? This is where the broadcom

> > driver ran into troubles..

> 

> ?


echo -n "ice.intel_rdma.0" > /sys/bus/auxiliary/drivers/irdma/unbind  ???

That, I believe, will trigger a drv.remove() on the rdma PF side, which requires
the rdma VFs to go down.

Yes, we currently have a requirement that the aux rdma PF driver remain bound
(at least through .probe()) for the VFs to survive.

We are doing internal review, but it appears we could potentially get rid of the
.open/.close callbacks and their associated FSM in ice.

But if we remove peer_register/unregister, how do we synchronize, say, an unload
of the rdma driver with the netdev driver, so that it stops accessing the private
channel iidc_peer_ops it uses to send events to rdma?

Shiraz
Saleem, Shiraz Jan. 27, 2021, 12:42 a.m. UTC | #25
> Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver and

> implement private channel OPs

> 

> On Mon, Jan 25, 2021 at 05:01:40PM -0800, Jacob Keller wrote:

> >

> >

> > On 1/25/2021 4:39 PM, Saleem, Shiraz wrote:

> > >> Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver

> > >> and implement private channel OPs

> > >>

> > >> On Sun, Jan 24, 2021 at 03:45:51PM +0200, Leon Romanovsky wrote:

> > >>> On Fri, Jan 22, 2021 at 05:48:12PM -0600, Shiraz Saleem wrote:

> > >>>> From: Mustafa Ismail <mustafa.ismail@intel.com>

> > >>>>

> > >>>> Register irdma as an auxiliary driver which can attach to

> > >>>> auxiliary RDMA devices from Intel PCI netdev drivers i40e and

> > >>>> ice. Implement the private channel ops, add basic devlink support

> > >>>> in the driver and register net notifiers.

> > >>>

> > >>> Devlink part in "the RDMA client" is interesting thing.

> > >>>

> > >>> The idea behind auxiliary bus was that PCI logic will stay at one

> > >>> place and devlink considered as the tool to manage that.

> > >>

> > >> Yes, this doesn't seem right, I don't think these auxiliary bus

> > >> objects should have devlink instances, or at least someone from

> > >> devlink land should approve of the idea.

> > >>

> > >

> > > In our model, we have one auxdev (for RDMA) per PCI device function

> > > owned by netdev driver and one devlink instance per auxdev. Plus there is an

> Intel netdev driver for each HW generation.

> > > Moving the devlink logic to the PCI netdev driver would mean

> > > duplicating the same set of RDMA params in each Intel netdev driver.

> > > Additionally, plumbing RDMA specific params in the netdev driver sort of

> seems misplaced to me.

> > >

> >

> > I agree that plumbing these parameters at the PCI side in the devlink

> > of the parent device is weird. They don't seem to be parameters that

> > the parent driver cares about.

> 

> It does, the PCI driver is not supposed to spawn any aux devices for RDMA at all

> if RDMA is disabled.

> 

> For an iWarp driver I would consider ENABLE_ROCE to really be a general

> ENABLE_RDMA.


Well, the driver supports iWARP and RoCE for the E810 device.
Are you saying that this generic enable_roce devlink param is really
an 'enable rdma traffic or not' param?

> 

> Are you sure you need to implement this?


What we are after is some mechanism for the user to switch the protocol, iWARP vs. RoCE
[by default the device comes up as an iWARP dev]. The protocol info is really needed early
on, in the RDMA driver's .probe(), i.e. when the rdma admin queue is created.

The same goes for the other param, resource_limits_selector. It's a profile selector that a user
can choose to get different maximums of QPs, CQs, MRs, etc.

> 

> In any event, you just can't put the generic ENABLE_ROCE flag anyplace but the

> PCI device for devlink, it breaks the expected user API established by mlx5

> 

> Jason
Saleem, Shiraz Jan. 27, 2021, 1:02 a.m. UTC | #26
> Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver and

> implement private channel OPs

> 

> 

> 

> On 1/25/2021 9:29 PM, Leon Romanovsky wrote:

> > On Mon, Jan 25, 2021 at 05:01:40PM -0800, Jacob Keller wrote:

> >>

> >>

> >> On 1/25/2021 4:39 PM, Saleem, Shiraz wrote:

> >>>> Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver

> >>>> and implement private channel OPs

> >>>>

> >>>> On Sun, Jan 24, 2021 at 03:45:51PM +0200, Leon Romanovsky wrote:

> >>>>> On Fri, Jan 22, 2021 at 05:48:12PM -0600, Shiraz Saleem wrote:

> >>>>>> From: Mustafa Ismail <mustafa.ismail@intel.com>

> >>>>>>

> >>>>>> Register irdma as an auxiliary driver which can attach to

> >>>>>> auxiliary RDMA devices from Intel PCI netdev drivers i40e and

> >>>>>> ice. Implement the private channel ops, add basic devlink support

> >>>>>> in the driver and register net notifiers.

> >>>>>

> >>>>> Devlink part in "the RDMA client" is interesting thing.

> >>>>>

> >>>>> The idea behind auxiliary bus was that PCI logic will stay at one

> >>>>> place and devlink considered as the tool to manage that.

> >>>>

> >>>> Yes, this doesn't seem right, I don't think these auxiliary bus

> >>>> objects should have devlink instances, or at least someone from

> >>>> devlink land should approve of the idea.

> >>>>

> >>>

> >>> In our model, we have one auxdev (for RDMA) per PCI device function

> >>> owned by netdev driver and one devlink instance per auxdev. Plus there is an

> Intel netdev driver for each HW generation.

> >>> Moving the devlink logic to the PCI netdev driver would mean

> >>> duplicating the same set of RDMA params in each Intel netdev driver.

> >>> Additionally, plumbing RDMA specific params in the netdev driver sort of

> seems misplaced to me.

> >>>

> >>

> >> I agree that plumbing these parameters at the PCI side in the devlink

> >> of the parent device is weird. They don't seem to be parameters that

> >> the parent driver cares about.

> >>

> >> Maybe there is another mechanism that makes more sense? To me it is a

> >> bit like if we were plumbing netdev specific paramters into devlink

> >> instead of trying to expose them through netdevice specific

> >> interfaces like iproute2 or ethtool.

> >

> > I'm far from being expert in devlink, but for me separation is following:

> > 1. devlink - operates on physical device level, when PCI device already

> initialized.

> > 2. ethtool - changes needed to be done on netdev layer.

> > 3. ip - upper layer of the netdev

> > 4. rdmatool - RDMA specific when IB device already exists.

> >

> > And the ENABLE_ROCE/ENABLE_RDMA thing shouldn't be in the RDMA driver

> > at all, because it is physical device property which once toggled will

> > prohibit creation of respective aux device.

> >

> 

> Ok. I guess I hadn't looked quite as close at the specifics here. I agree that

> ENABLE_RDMA should go in the PF devlink.

> 

> If there's any other sort of RDMA-specific configuration that ties to the IB device,

> that should go somehow into rdmatool, rather than devlink. And thus: I think I

> agree, we don't want the IB device or the aux device to create a devlink instance.

> 


I think rdma-tool might be too late for this type of param. We need the protocol info (iWARP vs. RoCE)
early on, at driver probe.
Jason Gunthorpe Jan. 27, 2021, 2:44 a.m. UTC | #27
On Wed, Jan 27, 2021 at 12:42:09AM +0000, Saleem, Shiraz wrote:

> > It does, the PCI driver is not supposed to spawn any aux devices for RDMA at all

> > if RDMA is disabled.

> > 

> > For an iWarp driver I would consider ENABLE_ROCE to really be a general

> > ENABLE_RDMA.

> 

> Well the driver supports iWARP and RoCE for E810 device.

> Are you saying that this generic enable_roce devlink param really

> is an enable 'rdma' traffic or not param?


I've thought of it that way; that is what it was created for, at least.

Overloading it to be an iwarp-not-roce switch feels wrong.
 
> > Are you sure you need to implement this?

> 

> What we are after is some mechanism for user to switch the protocols iWARP vs RoCE

> [default the device comes up as an iWARP dev]. The protocol info is really needed early-on

> in the RDMA driver.probe(). i.e. when the rdma admin queue is created.


This needs to be on the pci devlink at least; some kind of mode switch
seems appropriate.
 
> The same goes with the other param resource_limits_selector. It's a

> profile selector that a user can chose to different # of max QP,

> CQs, MRs etc.


And it must be done at init time? That also seems like pci devlink.

Generally speaking, anything that requires the rdma driver to be
reloaded should remove/restore the aux device.

A mode switch to/from roce/iwarp should create aux devices with
different names, which naturally triggers the right kind of sequences
in the driver core.
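
Roughly, from the PCI driver side that could look like the sketch below. The
ice_pf fields and the release helper are made up for illustration; only the
auxiliary bus calls are the real API:

#include <linux/auxiliary_bus.h>
#include <linux/slab.h>

static void ice_adev_release(struct device *dev)
{
	kfree(container_of(dev, struct auxiliary_device, dev));
}

/* illustrative: called by the ice PCI driver when devlink flips the mode */
static int ice_plug_rdma_dev(struct ice_pf *pf, bool roce)
{
	struct auxiliary_device *adev;
	int err;

	/* unplug the old aux device first; this unbinds whatever driver is attached */
	if (pf->rdma_adev) {
		auxiliary_device_delete(pf->rdma_adev);
		auxiliary_device_uninit(pf->rdma_adev);	/* release() frees it */
		pf->rdma_adev = NULL;
	}

	adev = kzalloc(sizeof(*adev), GFP_KERNEL);
	if (!adev)
		return -ENOMEM;

	adev->id = pf->aux_idx;			/* illustrative per-PF index */
	adev->dev.parent = &pf->pdev->dev;
	adev->dev.release = ice_adev_release;
	adev->name = roce ? "roce" : "iwarp";	/* device becomes ice.roce.N / ice.iwarp.N */

	err = auxiliary_device_init(adev);
	if (err) {
		kfree(adev);
		return err;
	}

	err = auxiliary_device_add(adev);
	if (err) {
		auxiliary_device_uninit(adev);	/* release() frees adev */
		return err;
	}

	pf->rdma_adev = adev;
	return 0;
}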

Jason
Leon Romanovsky Jan. 27, 2021, 12:18 p.m. UTC | #28
On Wed, Jan 27, 2021 at 12:41:41AM +0000, Saleem, Shiraz wrote:
> > Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver and

> > implement private channel OPs

> >

> > On Tue, Jan 26, 2021 at 12:42:16AM +0000, Saleem, Shiraz wrote:

> >

> > > I think this essentially means doing away with .open/.close piece.

> >

> > Yes, that too, and probably the FSM as well.

> >

> > > Or are you saying that is ok?  Yes we had a discussion in the past and

> > > I thought we concluded. But maybe I misunderstood.

> > >

> > > https://lore.kernel.org/linux-rdma/9DD61F30A802C4429A01CA4200E302A7DCD

> > > 4FD03@fmsmsx124.amr.corp.intel.com/

> >

> > Well, having now seen how aux bus ended up and the way it effected the

> > mlx5 driver, I am more firmly of the opinion this needs to be fixed. It is extremly

> > hard to get everything right with two different registration schemes running around.

> >

> > You never answered my question:

>

> Sorry I missed it.

> >

> > > Still, you need to be able to cope with the user unbinding your

> > > drivers in any order via sysfs. What happens to the VFs when the PF is

> > > unbound and releases whatever resources? This is where the broadcom

> > > driver ran into troubles..

> >

> > ?

>

> echo -n "ice.intel_rdma.0" > /sys/bus/auxiliary/drivers/irdma/unbind  ???

>

> That I believe will trigger a drv.remove() on the rdma PF side which require

> the rdma VFs to go down.

>

> Yes, we currently have a requirement the aux rdma PF driver remain inited at least to .probe()

> for VFs to survive.

>

> We are doing internal review, but it appears we could potentially get rid of the .open/.close callbacks.

> And its associated FSM in ice.

>

> But if we remove peer_register/unregister, how do we synchronize between say unload of the rdma driver

> and netdev driver stop accessing the priv channel iidc_peer_ops that it uses to send events to rdma?


And here we are returning to square one of the intended usage of the aux bus.
Your driver should be structured to have PCI core logic that represents the
physical device, and many small sub-devices with their respective drivers.

ETH is another sub-device that shouldn't talk directly to RDMA.

Thanks

>

> Shiraz

>

>
Saleem, Shiraz Jan. 27, 2021, 10:17 p.m. UTC | #29
> Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver and

> implement private channel OPs

> 

> On Wed, Jan 27, 2021 at 12:41:41AM +0000, Saleem, Shiraz wrote:

> > > Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver

> > > and implement private channel OPs

> > >

> > > On Tue, Jan 26, 2021 at 12:42:16AM +0000, Saleem, Shiraz wrote:

> > >

> > > > I think this essentially means doing away with .open/.close piece.

> > >

> > > Yes, that too, and probably the FSM as well.

> > >

> > > > Or are you saying that is ok?  Yes we had a discussion in the past

> > > > and I thought we concluded. But maybe I misunderstood.

> > > >

> > > > https://lore.kernel.org/linux-rdma/9DD61F30A802C4429A01CA4200E302A

> > > > 7DCD 4FD03@fmsmsx124.amr.corp.intel.com/

> > >

> > > Well, having now seen how aux bus ended up and the way it effected

> > > the

> > > mlx5 driver, I am more firmly of the opinion this needs to be fixed.

> > > It is extremly hard to get everything right with two different registration

> schemes running around.

> > >

> > > You never answered my question:

> >

> > Sorry I missed it.

> > >

> > > > Still, you need to be able to cope with the user unbinding your

> > > > drivers in any order via sysfs. What happens to the VFs when the

> > > > PF is unbound and releases whatever resources? This is where the

> > > > broadcom driver ran into troubles..

> > >

> > > ?

> >

> > echo -n "ice.intel_rdma.0" > /sys/bus/auxiliary/drivers/irdma/unbind  ???

> >

> > That I believe will trigger a drv.remove() on the rdma PF side which

> > require the rdma VFs to go down.

> >

> > Yes, we currently have a requirement the aux rdma PF driver remain

> > inited at least to .probe() for VFs to survive.

> >

> > We are doing internal review, but it appears we could potentially get rid of the

> .open/.close callbacks.

> > And its associated FSM in ice.

> >

> > But if we remove peer_register/unregister, how do we synchronize

> > between say unload of the rdma driver and netdev driver stop accessing the priv

> channel iidc_peer_ops that it uses to send events to rdma?

> 

> And here we are returning to square one of intended usage of aux bus.

> Your driver should be structured to have PCI core logic that will represent physical

> device and many small sub-devices with their respective drivers.


Even with another core PCI driver, there still needs to be a private communication channel
between the aux rdma driver and this PCI driver to pass things like QoS updates.

The split of the PCI core device into aux devices is a design choice.
Not sure what that has to do with the usage of the bus.

> 

> ETH is another sub-device that shouldn't talk directly to the RDMA.

> 


ETH should talk indirectly to the RDMA driver via the core PCI driver? And how does that solve this?
Jason Gunthorpe Jan. 27, 2021, 11:16 p.m. UTC | #30
On Wed, Jan 27, 2021 at 10:17:56PM +0000, Saleem, Shiraz wrote:

> Even with another core PCI driver, there still needs to be private

> communication channel between the aux rdma driver and this PCI

> driver to pass things like QoS updates.


Data pushed from the core driver to its aux drivers should either be
done through new callbacks in a struct device_driver or by having a
notifier chain scheme from the core driver.
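
A rough sketch of the notifier-chain variant; the event IDs, payload, and how
the chain head is shared are simplified for illustration (in practice it would
hang off data the PCI driver attaches to the aux device):

#include <linux/notifier.h>

/* owned by the core (PCI) driver */
static BLOCKING_NOTIFIER_HEAD(ice_peer_chain);

enum ice_peer_event { ICE_EVENT_QOS_CHANGE, ICE_EVENT_MTU_CHANGE };	/* illustrative */

/* core driver side: broadcast to whoever subscribed */
static void ice_notify_peers(enum ice_peer_event event, void *data)
{
	blocking_notifier_call_chain(&ice_peer_chain, event, data);
}

/* irdma side: one callback handles all events */
static int irdma_peer_event(struct notifier_block *nb, unsigned long event,
			    void *data)
{
	switch (event) {
	case ICE_EVENT_QOS_CHANGE:
		/* update QoS/qset state ... */
		return NOTIFY_OK;
	default:
		return NOTIFY_DONE;
	}
}

static struct notifier_block irdma_peer_nb = {
	.notifier_call = irdma_peer_event,
};

/* subscribe in irdma probe(), unsubscribe in remove():
 *	blocking_notifier_chain_register(&ice_peer_chain, &irdma_peer_nb);
 *	blocking_notifier_chain_unregister(&ice_peer_chain, &irdma_peer_nb);
 */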

Jason
Leon Romanovsky Jan. 28, 2021, 5:41 a.m. UTC | #31
On Wed, Jan 27, 2021 at 07:16:41PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 27, 2021 at 10:17:56PM +0000, Saleem, Shiraz wrote:

>

> > Even with another core PCI driver, there still needs to be private

> > communication channel between the aux rdma driver and this PCI

> > driver to pass things like QoS updates.

>

> Data pushed from the core driver to its aux drivers should either be

> done through new callbacks in a struct device_driver or by having a

> notifier chain scheme from the core driver.


Right, and internal to the driver core, the device_lock will protect against
parallel probe/remove and PCI flows.

I would say that all this handmade register/unregister and peer_client
dance will be gone if the driver uses the auxbus properly.
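
Concretely, the core-driver side of such a device_lock()-protected call could
look like the sketch below; the wrapper struct and its op are illustrative,
following the custom-driver-struct pattern from the auxiliary bus documentation:

#include <linux/auxiliary_bus.h>
#include <linux/device.h>

/* illustrative wrapper the aux driver registers; carries domain-specific ops */
struct irdma_aux_drv {
	struct auxiliary_driver adrv;
	int (*event_handler)(struct auxiliary_device *adev, unsigned long event);
};

/* core (PCI) driver side: call into the bound aux driver, if any */
static int ice_call_peer(struct auxiliary_device *adev, unsigned long event)
{
	struct irdma_aux_drv *iadrv;
	int ret = 0;

	device_lock(&adev->dev);	/* serializes against probe()/remove() */
	if (adev->dev.driver) {
		iadrv = container_of(to_auxiliary_drv(adev->dev.driver),
				     struct irdma_aux_drv, adrv);
		if (iadrv->event_handler)
			ret = iadrv->event_handler(adev, event);
	}
	device_unlock(&adev->dev);
	return ret;
}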

Thanks

>

> Jason
Saleem, Shiraz Jan. 30, 2021, 1:19 a.m. UTC | #32
> Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver and

> implement private channel OPs

> 

> On Wed, Jan 27, 2021 at 12:42:09AM +0000, Saleem, Shiraz wrote:

> 

> > > It does, the PCI driver is not supposed to spawn any aux devices for

> > > RDMA at all if RDMA is disabled.

> > >

> > > For an iWarp driver I would consider ENABLE_ROCE to really be a

> > > general ENABLE_RDMA.

> >

> > Well the driver supports iWARP and RoCE for E810 device.

> > Are you saying that this generic enable_roce devlink param really is

> > an enable 'rdma' traffic or not param?

> 

> I've thought of it that way, that is what it was created for at least.

> 

> Overloading it to be a iwarp not roce switch feels wrong


OK.

> 

> > > Are you sure you need to implement this?

> >

> > What we are after is some mechanism for user to switch the protocols

> > iWARP vs RoCE [default the device comes up as an iWARP dev]. The

> > protocol info is really needed early-on in the RDMA driver.probe(). i.e. when the

> rdma admin queue is created.

> 

> This needs to be a pci devlink at least, some kind of mode switch seems

> appropriate

> 

> > The same goes with the other param resource_limits_selector. It's a

> > profile selector that a user can chose to different # of max QP, CQs,

> > MRs etc.

> 

> And it must be done at init time? Also seems like pci devlink


Yes.

> 

> Generally speaking anything that requires the rdma driver to be reloaded should

> remove/restore the aux device.

> 

> Mode switch from roce to/from iwarp should create aux devices of different names

> which naturally triggers the right kind of sequences in the driver core

> 


OK, we will move devlink out of the aux rdma driver to the PCI driver.

As for separate aux dev names for iWARP and RoCE, that sounds reasonable.
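
For reference, a rough sketch of what the PF-side devlink registration could
look like; which parameter is registered, its default, and where the value is
consumed are all still open, this only shows the mechanics:

#include <net/devlink.h>

/* registered on the PF's devlink instance, not on the aux/RDMA device */
static const struct devlink_param ice_devlink_params[] = {
	DEVLINK_PARAM_GENERIC(ENABLE_ROCE, BIT(DEVLINK_PARAM_CMODE_DRIVERINIT),
			      NULL, NULL, NULL),
};

static int ice_devlink_register_params(struct devlink *devlink)
{
	union devlink_param_value value;
	int err;

	err = devlink_params_register(devlink, ice_devlink_params,
				      ARRAY_SIZE(ice_devlink_params));
	if (err)
		return err;

	value.vbool = false;	/* illustrative default */
	devlink_param_driverinit_value_set(devlink,
					   DEVLINK_PARAM_GENERIC_ID_ENABLE_ROCE,
					   value);
	devlink_params_publish(devlink);
	return 0;
}

The PF would read the value back with devlink_param_driverinit_value_get()
before deciding which (if any) RDMA aux device to spawn.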
Saleem, Shiraz Jan. 30, 2021, 1:19 a.m. UTC | #33
> Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver and

> implement private channel OPs

> 

> On Mon, Jan 25, 2021 at 02:42:48PM -0400, Jason Gunthorpe wrote:

> > On Fri, Jan 22, 2021 at 05:48:12PM -0600, Shiraz Saleem wrote:

> > > +/**

> > > + * irdma_init_dev - GEN_2 device init

> > > + * @aux_dev: auxiliary device

> > > + *

> > > + * Create device resources, set up queues, pble and hmc objects.

> > > + * Return 0 if successful, otherwise return error  */ int

> > > +irdma_init_dev(struct auxiliary_device *aux_dev) {

> > > +	struct iidc_auxiliary_object *vo = container_of(aux_dev,

> > > +							struct

> iidc_auxiliary_object,

> > > +							adev);

> > > +	struct iidc_peer_obj *peer_info = vo->peer_obj;

> > > +	struct irdma_handler *hdl;

> > > +	struct irdma_pci_f *rf;

> > > +	struct irdma_sc_dev *dev;

> > > +	struct irdma_priv_peer_info *priv_peer_info;

> > > +	int err;

> > > +

> > > +	hdl = irdma_find_handler(peer_info->pdev);

> > > +	if (hdl)

> > > +		return -EBUSY;

> > > +

> > > +	hdl = kzalloc(sizeof(*hdl), GFP_KERNEL);

> > > +	if (!hdl)

> > > +		return -ENOMEM;

> > > +

> > > +	rf = &hdl->rf;

> > > +	priv_peer_info = &rf->priv_peer_info;

> > > +	rf->aux_dev = aux_dev;

> > > +	rf->hdl = hdl;

> > > +	dev = &rf->sc_dev;

> > > +	dev->back_dev = rf;

> > > +	rf->gen_ops.init_hw = icrdma_init_hw;

> > > +	rf->gen_ops.request_reset = icrdma_request_reset;

> > > +	rf->gen_ops.register_qset = irdma_lan_register_qset;

> > > +	rf->gen_ops.unregister_qset = irdma_lan_unregister_qset;

> > > +	priv_peer_info->peer_info = peer_info;

> > > +	rf->rdma_ver = IRDMA_GEN_2;

> > > +	irdma_set_config_params(rf);

> > > +	dev->pci_rev = peer_info->pdev->revision;

> > > +	rf->default_vsi.vsi_idx = peer_info->pf_vsi_num;

> > > +	/* save information from peer_info to priv_peer_info*/

> > > +	priv_peer_info->fn_num = PCI_FUNC(peer_info->pdev->devfn);

> > > +	rf->hw.hw_addr = peer_info->hw_addr;

> > > +	rf->pcidev = peer_info->pdev;

> > > +	rf->netdev = peer_info->netdev;

> > > +	priv_peer_info->ftype = peer_info->ftype;

> > > +	priv_peer_info->msix_count = peer_info->msix_count;

> > > +	priv_peer_info->msix_entries = peer_info->msix_entries;

> > > +	irdma_add_handler(hdl);

> > > +	if (irdma_ctrl_init_hw(rf)) {

> > > +		err = -EIO;

> > > +		goto err_ctrl_init;

> > > +	}

> > > +	peer_info->peer_ops = &irdma_peer_ops;

> > > +	peer_info->peer_drv = &irdma_peer_drv;

> > > +	err = peer_info->ops->peer_register(peer_info);

> > > +	if (err)

> > > +		goto err_peer_reg;

> >

> > No to this, I don't want to see aux bus layered on top of another

> > management framework in new drivers. When this driver uses aux bus get

> > rid of the old i40iw stuff. I already said this in one of the older

> > postings of this driver.

> >

> > auxbus probe() for a RDMA driver should call ib_alloc_device() near

> > its start and ib_register_device() near the end its end.

> >

> > drvdata for the aux device should point to the driver struct

> > containing the ib_device.

> 

> My other expectation is to see at least two aux_drivers, one for the RoCE and

> another for the iWARP. It will allow easy management for the users if they decide

> to disable/enable specific functionality (/sys/bus/auxiliary/device/*). It will simplify

> code management too.

> 


Do you mean two different auxiliary device names - one for RoCE and one for iWARP?
The drv.probe() and other callbacks will be very similar for gen2, so one gen2 aux driver
which can bind to both the iWARP and RoCE aux devices should suffice.

Shiraz
Saleem, Shiraz Jan. 30, 2021, 1:19 a.m. UTC | #34
> Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver and

> implement private channel OPs

> 

> On Fri, Jan 22, 2021 at 05:48:12PM -0600, Shiraz Saleem wrote:

> 

> > +static int irdma_probe(struct auxiliary_device *aux_dev,

> > +		       const struct auxiliary_device_id *id) {

> > +	struct irdma_drvdata *drvdata;

> > +	int ret;

> > +

> > +	drvdata = kzalloc(sizeof(*drvdata), GFP_KERNEL);

> > +	if (!drvdata)

> > +		return -ENOMEM;

> > +

> > +	switch (id->driver_data) {

> > +	case IRDMA_GEN_2:

> > +		drvdata->init_dev = irdma_init_dev;

> > +		drvdata->deinit_dev = irdma_deinit_dev;

> > +		break;

> > +	case IRDMA_GEN_1:

> > +		drvdata->init_dev = i40iw_init_dev;

> > +		drvdata->deinit_dev = i40iw_deinit_dev;

> > +		break;

> > +	default:

> > +		ret = -ENODEV;

> > +		goto ver_err;

> 

> Also don't do this, if the drivers are so different then give them different aux bus

> names and bind two drivers with the different flow.


I suppose we could have a gen1 aux driver and one for gen2, with the flows being a little orthogonal to each other.

The gen2 aux driver's auxiliary_device_id table would have entries for the iWARP and RoCE
aux devs.


> 

> I suppose the old i40e can keep its weird registration thing, but ice should not

> duplicate that, new code must use aux devices properly, as in my other email.

> 

> Jason
Saleem, Shiraz Jan. 30, 2021, 1:19 a.m. UTC | #35
> Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver and

> implement private channel OPs

> 

> On Wed, Jan 27, 2021 at 07:16:41PM -0400, Jason Gunthorpe wrote:

> > On Wed, Jan 27, 2021 at 10:17:56PM +0000, Saleem, Shiraz wrote:

> >

> > > Even with another core PCI driver, there still needs to be private

> > > communication channel between the aux rdma driver and this PCI

> > > driver to pass things like QoS updates.

> >

> > Data pushed from the core driver to its aux drivers should either be

> > done through new callbacks in a struct device_driver or by having a

> > notifier chain scheme from the core driver.

> 

> Right, and internal to driver/core device_lock will protect from parallel

> probe/remove and PCI flows.

> 


OK. We will hold the device_lock while issuing the .ops callbacks from the core driver.
This should solve our synchronization issue.

There have been a few discussions in this thread, and I would like to be clear on what
to do.

So we will:

1. Remove .open/.close and .peer_register/.peer_unregister.
2. Protect ops callbacks issued from the core driver to the aux driver with device_lock.
3. Move the custom iidc_peer_op callbacks into an irdma driver struct that encapsulates the auxiliary driver struct, for the core driver to use.
4. Remove the ice FSM around open, close, etc.
5. Have the RDMA aux driver probe allocate the ib_device and register it at the end of probe.

Does this sound acceptable?
Leon Romanovsky Feb. 1, 2021, 6:09 a.m. UTC | #36
On Sat, Jan 30, 2021 at 01:19:36AM +0000, Saleem, Shiraz wrote:
> > Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver and

> > implement private channel OPs

> >

> > On Wed, Jan 27, 2021 at 07:16:41PM -0400, Jason Gunthorpe wrote:

> > > On Wed, Jan 27, 2021 at 10:17:56PM +0000, Saleem, Shiraz wrote:

> > >

> > > > Even with another core PCI driver, there still needs to be private

> > > > communication channel between the aux rdma driver and this PCI

> > > > driver to pass things like QoS updates.

> > >

> > > Data pushed from the core driver to its aux drivers should either be

> > > done through new callbacks in a struct device_driver or by having a

> > > notifier chain scheme from the core driver.

> >

> > Right, and internal to driver/core device_lock will protect from parallel

> > probe/remove and PCI flows.

> >

>

> OK. We will hold the device_lock while issuing the .ops callbacks from core driver.

> This should solve our synchronization issue.

>

> There have been a few discussions in this thread. And I would like to be clear on what

> to do.

>

> So we will,

>

> 1. Remove .open/.close, .peer_register/.peer_unregister

> 2. Protect ops callbacks issued from core driver to the aux driver with device_lock

> 3. Move the custom iidc_peer_op callbacks to an irdma driver struct that encapsulates the auxiliary driver struct. For core driver to use.

> 4. Remove ice FSM around open, close etc...

> 5. RDMA aux driver probe will allocate ib_device and register it at the end of probe.

>

> Does this sound acceptable?


I think that it will be a good start; it is just hard to say in advance
without seeing the end result.

Thanks
Jason Gunthorpe Feb. 1, 2021, 7:18 p.m. UTC | #37
On Sat, Jan 30, 2021 at 01:19:36AM +0000, Saleem, Shiraz wrote:
> > Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver and

> > implement private channel OPs

> > 

> > On Wed, Jan 27, 2021 at 07:16:41PM -0400, Jason Gunthorpe wrote:

> > > On Wed, Jan 27, 2021 at 10:17:56PM +0000, Saleem, Shiraz wrote:

> > >

> > > > Even with another core PCI driver, there still needs to be private

> > > > communication channel between the aux rdma driver and this PCI

> > > > driver to pass things like QoS updates.

> > >

> > > Data pushed from the core driver to its aux drivers should either be

> > > done through new callbacks in a struct device_driver or by having a

> > > notifier chain scheme from the core driver.

> > 

> > Right, and internal to driver/core device_lock will protect from parallel

> > probe/remove and PCI flows.

> > 

> 

> OK. We will hold the device_lock while issuing the .ops callbacks from core driver.

> This should solve our synchronization issue.

> 

> There have been a few discussions in this thread. And I would like to be clear on what

> to do.

> 

> So we will,

> 

> 1. Remove .open/.close, .peer_register/.peer_unregister

> 2. Protect ops callbacks issued from core driver to the aux driver

> with device_lock


A notifier chain is probably better, honestly.

Especially since you don't want to split the netdev side, a notifier
chain can be used by both cases equally.

Jason
Jason Gunthorpe Feb. 1, 2021, 7:19 p.m. UTC | #38
On Sat, Jan 30, 2021 at 01:19:22AM +0000, Saleem, Shiraz wrote:

> Do you mean 2 different auxiliary device names - one for RoCE and

> iWARP?  The drv.probe() and other callbacks will be very similar for

> gen2, so one gen2 aux driver which can bind to iW and RoCE aux

> device should suffice.


You probably end up with three aux device names: gen 1, gen 2 roce and
gen 2 iwarp.

Certainly flipping modes should change the device name. This helps
keep everything synchronized, since you delete a device, wait for
all drivers to unbind, then create a new device with a different
name.
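
In id-table form, one way that could be spelled out (the exact name strings
and whether gen1/gen2 end up as one driver or two are still open; the
IRDMA_GEN_* values are from the posted series, and the name prefix must match
the KBUILD_MODNAME of the module creating the device):

#include <linux/auxiliary_bus.h>
#include <linux/mod_devicetable.h>

static const struct auxiliary_device_id irdma_auxiliary_id_table[] = {
	{ .name = "i40e.iwarp", .driver_data = IRDMA_GEN_1 },
	{ .name = "ice.iwarp",  .driver_data = IRDMA_GEN_2 },
	{ .name = "ice.roce",   .driver_data = IRDMA_GEN_2 },
	{},
};
MODULE_DEVICE_TABLE(auxiliary, irdma_auxiliary_id_table);

static int irdma_probe(struct auxiliary_device *aux_dev,
		       const struct auxiliary_device_id *id)
{
	/* id->name says iWARP vs. RoCE, id->driver_data says gen 1 vs. gen 2 */
	return 0;
}

static void irdma_remove(struct auxiliary_device *aux_dev)
{
	/* unregister the ib_device, release HW resources */
}

static struct auxiliary_driver irdma_auxiliary_drv = {
	.name = "irdma",
	.id_table = irdma_auxiliary_id_table,
	.probe = irdma_probe,
	.remove = irdma_remove,
};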

Jason
Jason Gunthorpe Feb. 1, 2021, 7:21 p.m. UTC | #39
On Wed, Jan 27, 2021 at 12:41:41AM +0000, Saleem, Shiraz wrote:
> > Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver and

> > implement private channel OPs

> > 

> > On Tue, Jan 26, 2021 at 12:42:16AM +0000, Saleem, Shiraz wrote:

> > 

> > > I think this essentially means doing away with .open/.close piece.

> > 

> > Yes, that too, and probably the FSM as well.

> > 

> > > Or are you saying that is ok?  Yes we had a discussion in the past and

> > > I thought we concluded. But maybe I misunderstood.

> > >

> > > https://lore.kernel.org/linux-rdma/9DD61F30A802C4429A01CA4200E302A7DCD

> > > 4FD03@fmsmsx124.amr.corp.intel.com/

> > 

> > Well, having now seen how aux bus ended up and the way it effected the

> > mlx5 driver, I am more firmly of the opinion this needs to be fixed. It is extremly

> > hard to get everything right with two different registration schemes running around.

> > 

> > You never answered my question:

> 

> Sorry I missed it.

> > 

> > > Still, you need to be able to cope with the user unbinding your

> > > drivers in any order via sysfs. What happens to the VFs when the PF is

> > > unbound and releases whatever resources? This is where the broadcom

> > > driver ran into troubles..

> > 

> > ?

> 

> echo -n "ice.intel_rdma.0" > /sys/bus/auxiliary/drivers/irdma/unbind  ???

> 

> That I believe will trigger a drv.remove() on the rdma PF side which require

> the rdma VFs to go down.


How? They could be in VMs.

Jason
Saleem, Shiraz Feb. 2, 2021, 12:40 a.m. UTC | #40
> Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver and

> implement private channel OPs

> 

> On Sat, Jan 30, 2021 at 01:19:36AM +0000, Saleem, Shiraz wrote:

> > > Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver

> > > and implement private channel OPs

> > >

> > > On Wed, Jan 27, 2021 at 07:16:41PM -0400, Jason Gunthorpe wrote:

> > > > On Wed, Jan 27, 2021 at 10:17:56PM +0000, Saleem, Shiraz wrote:

> > > >

> > > > > Even with another core PCI driver, there still needs to be

> > > > > private communication channel between the aux rdma driver and

> > > > > this PCI driver to pass things like QoS updates.

> > > >

> > > > Data pushed from the core driver to its aux drivers should either

> > > > be done through new callbacks in a struct device_driver or by

> > > > having a notifier chain scheme from the core driver.

> > >

> > > Right, and internal to driver/core device_lock will protect from

> > > parallel probe/remove and PCI flows.

> > >

> >

> > OK. We will hold the device_lock while issuing the .ops callbacks from core

> driver.

> > This should solve our synchronization issue.

> >

> > There have been a few discussions in this thread. And I would like to

> > be clear on what to do.

> >

> > So we will,

> >

> > 1. Remove .open/.close, .peer_register/.peer_unregister 2. Protect ops

> > callbacks issued from core driver to the aux driver with device_lock

> 

> A notifier chain is probably better, honestly.

> 

> Especially since you don't want to split the netdev side, a notifier chain can be

> used by both cases equally.

> 


The device_lock seems to be a simple solution to this synchronization problem.
May I ask what makes the notifier scheme better for solving this?

Also, are you suggesting we get rid of all the iidc_peer_op callbacks used for 'ice'-to-'irdma' driver
communication and replace them with events generated by the ice driver and received
by the irdma subscriber? Or just some specific events to solve this synchronization problem?
Sorry, I am confused.

Shiraz
Dan Williams Feb. 2, 2021, 1:06 a.m. UTC | #41
On Mon, Feb 1, 2021 at 4:40 PM Saleem, Shiraz <shiraz.saleem@intel.com> wrote:
>

> > Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver and

> > implement private channel OPs

> >

> > On Sat, Jan 30, 2021 at 01:19:36AM +0000, Saleem, Shiraz wrote:

> > > > Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver

> > > > and implement private channel OPs

> > > >

> > > > On Wed, Jan 27, 2021 at 07:16:41PM -0400, Jason Gunthorpe wrote:

> > > > > On Wed, Jan 27, 2021 at 10:17:56PM +0000, Saleem, Shiraz wrote:

> > > > >

> > > > > > Even with another core PCI driver, there still needs to be

> > > > > > private communication channel between the aux rdma driver and

> > > > > > this PCI driver to pass things like QoS updates.

> > > > >

> > > > > Data pushed from the core driver to its aux drivers should either

> > > > > be done through new callbacks in a struct device_driver or by

> > > > > having a notifier chain scheme from the core driver.

> > > >

> > > > Right, and internal to driver/core device_lock will protect from

> > > > parallel probe/remove and PCI flows.

> > > >

> > >

> > > OK. We will hold the device_lock while issuing the .ops callbacks from core

> > driver.

> > > This should solve our synchronization issue.

> > >

> > > There have been a few discussions in this thread. And I would like to

> > > be clear on what to do.

> > >

> > > So we will,

> > >

> > > 1. Remove .open/.close, .peer_register/.peer_unregister 2. Protect ops

> > > callbacks issued from core driver to the aux driver with device_lock

> >

> > A notifier chain is probably better, honestly.

> >

> > Especially since you don't want to split the netdev side, a notifier chain can be

> > used by both cases equally.

> >

>

> The device_lock seems to be a simple solution to this synchronization problem.

> May I ask what makes the notifier scheme better to solve this?

>


Only loosely following the arguments here, but one of the requirements
of the driver-op scheme is that the notifying agent needs to know the
target device. With the notifier-chain approach the target device
becomes anonymous to the notifier agent.
Jason Gunthorpe Feb. 2, 2021, 5:14 p.m. UTC | #42
On Mon, Feb 01, 2021 at 05:06:58PM -0800, Dan Williams wrote:
> On Mon, Feb 1, 2021 at 4:40 PM Saleem, Shiraz <shiraz.saleem@intel.com> wrote:

> >

> > > Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver and

> > > implement private channel OPs

> > >

> > > On Sat, Jan 30, 2021 at 01:19:36AM +0000, Saleem, Shiraz wrote:

> > > > > Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver

> > > > > and implement private channel OPs

> > > > >

> > > > > On Wed, Jan 27, 2021 at 07:16:41PM -0400, Jason Gunthorpe wrote:

> > > > > > On Wed, Jan 27, 2021 at 10:17:56PM +0000, Saleem, Shiraz wrote:

> > > > > >

> > > > > > > Even with another core PCI driver, there still needs to be

> > > > > > > private communication channel between the aux rdma driver and

> > > > > > > this PCI driver to pass things like QoS updates.

> > > > > >

> > > > > > Data pushed from the core driver to its aux drivers should either

> > > > > > be done through new callbacks in a struct device_driver or by

> > > > > > having a notifier chain scheme from the core driver.

> > > > >

> > > > > Right, and internal to driver/core device_lock will protect from

> > > > > parallel probe/remove and PCI flows.

> > > > >

> > > >

> > > > OK. We will hold the device_lock while issuing the .ops callbacks from core

> > > driver.

> > > > This should solve our synchronization issue.

> > > >

> > > > There have been a few discussions in this thread. And I would like to

> > > > be clear on what to do.

> > > >

> > > > So we will,

> > > >

> > > > 1. Remove .open/.close, .peer_register/.peer_unregister 2. Protect ops

> > > > callbacks issued from core driver to the aux driver with device_lock

> > >

> > > A notifier chain is probably better, honestly.

> > >

> > > Especially since you don't want to split the netdev side, a notifier chain can be

> > > used by both cases equally.

> > >

> >

> > The device_lock seems to be a simple solution to this synchronization problem.

> > May I ask what makes the notifier scheme better to solve this?

> >

> 

> Only loosely following the arguments here, but one of the requirements

> of the driver-op scheme is that the notifying agent needs to know the

> target device. With the notifier-chain approach the target device

> becomes anonymous to the notifier agent.


Yes, and you need to have an aux device in the first place. The netdev
side has neither of these things. I think it would be a bit odd to have
extensive callbacks that are for RDMA only; that suggests something in
the core API is not general enough.

Jason
Saleem, Shiraz Feb. 2, 2021, 7:42 p.m. UTC | #43
> Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary driver and

> implement private channel OPs

> 

> On Mon, Feb 01, 2021 at 05:06:58PM -0800, Dan Williams wrote:

> > On Mon, Feb 1, 2021 at 4:40 PM Saleem, Shiraz <shiraz.saleem@intel.com>

> wrote:

> > >

> > > > Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary

> > > > driver and implement private channel OPs

> > > >

> > > > On Sat, Jan 30, 2021 at 01:19:36AM +0000, Saleem, Shiraz wrote:

> > > > > > Subject: Re: [PATCH 07/22] RDMA/irdma: Register an auxiliary

> > > > > > driver and implement private channel OPs

> > > > > >

> > > > > > On Wed, Jan 27, 2021 at 07:16:41PM -0400, Jason Gunthorpe wrote:

> > > > > > > On Wed, Jan 27, 2021 at 10:17:56PM +0000, Saleem, Shiraz wrote:

> > > > > > >

> > > > > > > > Even with another core PCI driver, there still needs to be

> > > > > > > > private communication channel between the aux rdma driver

> > > > > > > > and this PCI driver to pass things like QoS updates.

> > > > > > >

> > > > > > > Data pushed from the core driver to its aux drivers should

> > > > > > > either be done through new callbacks in a struct

> > > > > > > device_driver or by having a notifier chain scheme from the core

> driver.

> > > > > >

> > > > > > Right, and internal to driver/core device_lock will protect

> > > > > > from parallel probe/remove and PCI flows.

> > > > > >

> > > > >

> > > > > OK. We will hold the device_lock while issuing the .ops

> > > > > callbacks from core

> > > > driver.

> > > > > This should solve our synchronization issue.

> > > > >

> > > > > There have been a few discussions in this thread. And I would

> > > > > like to be clear on what to do.

> > > > >

> > > > > So we will,

> > > > >

> > > > > 1. Remove .open/.close, .peer_register/.peer_unregister 2.

> > > > > Protect ops callbacks issued from core driver to the aux driver

> > > > > with device_lock

> > > >

> > > > A notifier chain is probably better, honestly.

> > > >

> > > > Especially since you don't want to split the netdev side, a

> > > > notifier chain can be used by both cases equally.

> > > >

> > >

> > > The device_lock seems to be a simple solution to this synchronization

> problem.

> > > May I ask what makes the notifier scheme better to solve this?

> > >

> >

> > Only loosely following the arguments here, but one of the requirements

> > of the driver-op scheme is that the notifying agent needs to know the

> > target device. With the notifier-chain approach the target device

> > becomes anonymous to the notifier agent.

> 

> Yes, and you need to have an aux device in the first place. The netdev side has

> neither of this things. 


But we do. The ice PCI driver is the thing spawning the aux device. And we are trying to do
something directed here, specifically between the ice PCI driver and the irdma aux driver.
It seems the notifier chain approach, from the comment above, is less directed and meant for when
you want to broadcast events from the core driver to multiple registered subscribers.

I think there is going to be a need for some ops even if we were to use notifier chains,
such as ones that need a return code.
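
For completeness, the notifier helpers do allow a return code to flow back to
the caller; a small sketch with illustrative event and handler names:

#include <linux/notifier.h>

/* aux (irdma) side: report failure back through the chain */
static int irdma_prep_for_reset(struct notifier_block *nb, unsigned long event,
				void *data)
{
	/*
	 * Quiesce QPs, flush outstanding work, etc.  On failure:
	 *	return notifier_from_errno(-EBUSY);
	 */
	return NOTIFY_OK;
}

/* core (ice) side: turn the chain result back into an errno */
static int ice_signal_pre_reset(struct blocking_notifier_head *chain)
{
	int ret = blocking_notifier_call_chain(chain, 0 /* event id */, NULL);

	return notifier_to_errno(ret);
}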



> I think it would be a bit odd to have extensive callbacks that
> are for RDMA only, that suggests something in the core API is not general enough.

> 


Yes, there are some domain-specific ops. But it is within the boundary of how the aux bus should be used, no?

https://www.kernel.org/doc/html/latest/driver-api/auxiliary_bus.html
"The auxiliary_driver can also be encapsulated inside custom drivers that make the core device's functionality extensible by adding additional domain-specific ops as follows:"

struct my_driver {
        struct auxiliary_driver auxiliary_drv;
        const struct my_ops ops;
};
Jason Gunthorpe Feb. 2, 2021, 11:17 p.m. UTC | #44
On Tue, Feb 02, 2021 at 07:42:11PM +0000, Saleem, Shiraz wrote:

> > > Only loosely following the arguments here, but one of the requirements

> > > of the driver-op scheme is that the notifying agent needs to know the

> > > target device. With the notifier-chain approach the target device

> > > becomes anonymous to the notifier agent.

> > 

> > Yes, and you need to have an aux device in the first place. The netdev side has

> > neither of this things. 

> 

> But we do. The ice PCI driver is thing spawning the aux device. And

> we are trying to do something directed here specifically between the

> ice PCI driver and the irdma aux driver.  Seems the notifier chain

> approach, from the comment above, is less directed and when you want

> to broadcast events from core driver to multiple registered

> subscribers.


Yes, generally for good design the net and rdma drivers should be
peers, using similar interfaces; otherwise there will be trouble
answering the question of what each piece of code is for, and whether
a net change breaks rdma land.

> > I think it would be a bit odd to have extensive callbacks that

> > are for RDMA only, that suggests something in the core API is not general enough.

> 

> Yes there are some domain specific ops. But it is within the

> boundary of how the aux bus should be used no?


Sure is, but I'm not sure it is a great driver design.

In the end I don't care a lot about which thing you pick, so long as
the peer layer is fused with the aux bus and there isn't a second
registration layer for devices.

Jason