[rdma-next,0/8] Generalize if ULP supported check

Message ID	20210405055000.215792-1-leon@kernel.org
Headers	show Return-Path: <netdev-owner@kernel.org> From: Leon Romanovsky <leon@kernel.org> To: Doug Ledford <dledford@redhat.com>, Jason Gunthorpe <jgg@nvidia.com> Cc: Leon Romanovsky <leonro@nvidia.com>, "David S. Miller" <davem@davemloft.net>, Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>, Jakub Kicinski <kuba@kernel.org>, Karsten Graul <kgraul@linux.ibm.com>, linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org, linux-s390@vger.kernel.org, Mike Marciniszyn <mike.marciniszyn@cornelisnetworks.com>, netdev@vger.kernel.org, Parav Pandit <parav@nvidia.com>, rds-devel@oss.oracle.com, Santosh Shilimkar <santosh.shilimkar@oracle.com> Subject: [PATCH rdma-next 0/8] Generalize if ULP supported check Date: Mon, 5 Apr 2021 08:49:52 +0300 Message-Id: <20210405055000.215792-1-leon@kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	Generalize if ULP supported check \| expand [rdma-next,0/8] Generalize if ULP supported check [rdma-next,1/8] RDMA/core: Check if client supports IB device or not [rdma-next,2/8] RDMA/cma: Skip device which doesn't support CM [rdma-next,3/8] IB/cm: Skip device which doesn't support IB CM [rdma-next,4/8] IB/core: Skip device which doesn't have necessary capabilities [rdma-next,5/8] IB/IPoIB: Skip device which doesn't have InfiniBand port [rdma-next,6/8] IB/opa_vnic: Move to client_supported callback [rdma-next,7/8] net/smc: Move to client_supported callback [rdma-next,8/8] net/rds: Move to client_supported callback

Leon Romanovsky April 5, 2021, 5:49 a.m. UTC

From: Leon Romanovsky <leonro@nvidia.com>

Hi,

This series adds new callback to check if ib client is supported/not_supported.
Such general callback allows us to save memory footprint by not starting
on devices that not going to work on them anyway.

Thanks

Parav Pandit (8):
  RDMA/core: Check if client supports IB device or not
  RDMA/cma: Skip device which doesn't support CM
  IB/cm: Skip device which doesn't support IB CM
  IB/core: Skip device which doesn't have necessary capabilities
  IB/IPoIB: Skip device which doesn't have InfiniBand port
  IB/opa_vnic: Move to client_supported callback
  net/smc: Move to client_supported callback
  net/rds: Move to client_supported callback

 drivers/infiniband/core/cm.c                  | 15 +++++++++++++-
 drivers/infiniband/core/cma.c                 | 15 +++++++++++++-
 drivers/infiniband/core/device.c              |  3 +++
 drivers/infiniband/core/multicast.c           | 15 +++++++++++++-
 drivers/infiniband/core/sa_query.c            | 15 +++++++++++++-
 drivers/infiniband/ulp/ipoib/ipoib_main.c     | 13 ++++++++++++
 .../infiniband/ulp/opa_vnic/opa_vnic_vema.c   |  4 +---
 include/rdma/ib_verbs.h                       |  9 +++++++++
 net/rds/ib.c                                  | 20 ++++++++++++-------
 net/smc/smc_ib.c                              |  9 ++++++---
 10 files changed, 101 insertions(+), 17 deletions(-)

Gal Pressman April 5, 2021, 6:20 a.m. UTC | #1

On 05/04/2021 8:49, Leon Romanovsky wrote:
> From: Parav Pandit <parav@nvidia.com>
> 
> RDMA devices are of different transport(iWarp, IB, RoCE) and have
> different attributes.
> Not all clients are interested in all type of devices.
> 
> Implement a generic callback that each IB client can implement to decide
> if client add() or remove() should be done by the IB core or not for a
> given IB device, client combination.
> 
> Signed-off-by: Parav Pandit <parav@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  drivers/infiniband/core/device.c | 3 +++
>  include/rdma/ib_verbs.h          | 9 +++++++++
>  2 files changed, 12 insertions(+)
> 
> diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
> index c660cef66ac6..c9af2deba8c1 100644
> --- a/drivers/infiniband/core/device.c
> +++ b/drivers/infiniband/core/device.c
> @@ -691,6 +691,9 @@ static int add_client_context(struct ib_device *device,
>  	if (!device->kverbs_provider && !client->no_kverbs_req)
>  		return 0;
>  
> +	if (client->is_supported && !client->is_supported(device))
> +		return 0;

Isn't it better to remove the kverbs_provider flag (from previous if statement)
and unify it with this generic support check?

Jason Gunthorpe April 6, 2021, 3:46 p.m. UTC | #2

On Mon, Apr 05, 2021 at 08:49:56AM +0300, Leon Romanovsky wrote:
> @@ -2293,6 +2295,17 @@ static void ib_sa_event(struct ib_event_handler *handler,

>  	}

>  }

>  

> +static bool ib_sa_client_supported(struct ib_device *device)

> +{

> +	unsigned int i;

> +

> +	rdma_for_each_port(device, i) {

> +		if (rdma_cap_ib_sa(device, i))

> +			return true;

> +	}

> +	return false;

> +}


This is already done though:

	for (i = 0; i <= e - s; ++i) {
		spin_lock_init(&sa_dev->port[i].ah_lock);
		if (!rdma_cap_ib_sa(device, i + 1))
			continue;
[..]

	if (!count) {
		ret = -EOPNOTSUPP;
		goto free;

Why does it need to be duplicated? The other patches are all basically
like that too.

The add_one function should return -EOPNOTSUPP if it doesn't want to
run on this device and any supported checks should just be at the
front - this is how things work right now

Jason

Parav Pandit April 7, 2021, 3:06 p.m. UTC | #3

> From: Jason Gunthorpe <jgg@nvidia.com>

> Sent: Tuesday, April 6, 2021 9:17 PM

> 

> On Mon, Apr 05, 2021 at 08:49:56AM +0300, Leon Romanovsky wrote:

> > @@ -2293,6 +2295,17 @@ static void ib_sa_event(struct ib_event_handler

> *handler,

> >  	}

> >  }

> >

> > +static bool ib_sa_client_supported(struct ib_device *device) {

> > +	unsigned int i;

> > +

> > +	rdma_for_each_port(device, i) {

> > +		if (rdma_cap_ib_sa(device, i))

> > +			return true;

> > +	}

> > +	return false;

> > +}

> 

> This is already done though:

It is but, ib_sa_device() allocates ib_sa_device worth of struct for each port without checking the rdma_cap_ib_sa().
This results into allocating 40 * 512 = 20480 rounded of to power of 2 to 32K bytes of memory for the rdma device with 512 ports.
Other modules are also similarly wasting such memory.

> 

> 	for (i = 0; i <= e - s; ++i) {

> 		spin_lock_init(&sa_dev->port[i].ah_lock);

> 		if (!rdma_cap_ib_sa(device, i + 1))

> 			continue;

> [..]

> 

> 	if (!count) {

> 		ret = -EOPNOTSUPP;

> 		goto free;

> 

> Why does it need to be duplicated? The other patches are all basically like

> that too.

> 

> The add_one function should return -EOPNOTSUPP if it doesn't want to run

> on this device and any supported checks should just be at the front - this is

> how things work right now

> 

I am ok to fold this check at the beginning of add callback.
When 512 to 1K RoCE devices are used, they do not have SA, CM, CMA etc caps on and all the client needs to go through refcnt + xa + sem and unroll them.
Is_supported() routine helps to cut down all of it. I didn't calculate the usec saved with it.

Please let me know.

Jason Gunthorpe April 7, 2021, 3:13 p.m. UTC | #4

On Wed, Apr 07, 2021 at 03:06:35PM +0000, Parav Pandit wrote:
> 

> 

> > From: Jason Gunthorpe <jgg@nvidia.com>

> > Sent: Tuesday, April 6, 2021 9:17 PM

> > 

> > On Mon, Apr 05, 2021 at 08:49:56AM +0300, Leon Romanovsky wrote:

> > > @@ -2293,6 +2295,17 @@ static void ib_sa_event(struct ib_event_handler

> > *handler,

> > >  	}

> > >  }

> > >

> > > +static bool ib_sa_client_supported(struct ib_device *device) {

> > > +	unsigned int i;

> > > +

> > > +	rdma_for_each_port(device, i) {

> > > +		if (rdma_cap_ib_sa(device, i))

> > > +			return true;

> > > +	}

> > > +	return false;

> > > +}

> > 

> > This is already done though:


> It is but, ib_sa_device() allocates ib_sa_device worth of struct for

> each port without checking the rdma_cap_ib_sa().  This results into

> allocating 40 * 512 = 20480 rounded of to power of 2 to 32K bytes of

> memory for the rdma device with 512 ports.  Other modules are also

> similarly wasting such memory.


If it returns EOPNOTUPP then the remove is never called so if it
allocated memory and left it allocated then it is leaking memory.

If you are saying 32k bytes of temporary allocation matters during
device startup then it needs benchmarks and a use case.

> > The add_one function should return -EOPNOTSUPP if it doesn't want to run

> > on this device and any supported checks should just be at the front - this is

> > how things work right now


> I am ok to fold this check at the beginning of add callback.  When

> 512 to 1K RoCE devices are used, they do not have SA, CM, CMA etc

> caps on and all the client needs to go through refcnt + xa + sem and

> unroll them.  Is_supported() routine helps to cut down all of it. I

> didn't calculate the usec saved with it.


If that is the reason then explain in the cover letter and provide
benchmarks

Jason

Parav Pandit April 7, 2021, 3:44 p.m. UTC | #5

> From: Jason Gunthorpe <jgg@nvidia.com>

> Sent: Wednesday, April 7, 2021 8:44 PM

> 

> On Wed, Apr 07, 2021 at 03:06:35PM +0000, Parav Pandit wrote:

> >

> >

> > > From: Jason Gunthorpe <jgg@nvidia.com>

> > > Sent: Tuesday, April 6, 2021 9:17 PM

> > >

> > > On Mon, Apr 05, 2021 at 08:49:56AM +0300, Leon Romanovsky wrote:

> > > > @@ -2293,6 +2295,17 @@ static void ib_sa_event(struct

> > > > ib_event_handler

> > > *handler,

> > > >  	}

> > > >  }

> > > >

> > > > +static bool ib_sa_client_supported(struct ib_device *device) {

> > > > +	unsigned int i;

> > > > +

> > > > +	rdma_for_each_port(device, i) {

> > > > +		if (rdma_cap_ib_sa(device, i))

> > > > +			return true;

> > > > +	}

> > > > +	return false;

> > > > +}

> > >

> > > This is already done though:

> 

> > It is but, ib_sa_device() allocates ib_sa_device worth of struct for

> > each port without checking the rdma_cap_ib_sa().  This results into

> > allocating 40 * 512 = 20480 rounded of to power of 2 to 32K bytes of

> > memory for the rdma device with 512 ports.  Other modules are also

> > similarly wasting such memory.

> 

> If it returns EOPNOTUPP then the remove is never called so if it allocated

> memory and left it allocated then it is leaking memory.

> 

I probably confused you. There is no leak today because add_one allocates memory, and later on when SA/CM etc per port cap is not present, it is unused left there which is freed on remove_one().
Returning EOPNOTUPP is fine at start of add_one() before allocation.

> If you are saying 32k bytes of temporary allocation matters during device

> startup then it needs benchmarks and a use case.

> 

Use case is clear and explained in commit logs, i.e. to not allocate the memory which is never used.

> > > The add_one function should return -EOPNOTSUPP if it doesn't want to

> > > run on this device and any supported checks should just be at the

> > > front - this is how things work right now

> 

> > I am ok to fold this check at the beginning of add callback.  When

> > 512 to 1K RoCE devices are used, they do not have SA, CM, CMA etc caps

> > on and all the client needs to go through refcnt + xa + sem and unroll

> > them.  Is_supported() routine helps to cut down all of it. I didn't

> > calculate the usec saved with it.

> 

> If that is the reason then explain in the cover letter and provide benchmarks

I doubt it will be significant but I will do a benchmark.

Jason Gunthorpe April 8, 2021, 12:16 p.m. UTC | #6

On Wed, Apr 07, 2021 at 03:44:35PM +0000, Parav Pandit wrote:

> > If it returns EOPNOTUPP then the remove is never called so if it allocated

> > memory and left it allocated then it is leaking memory.

> > 

> I probably confused you. There is no leak today because add_one

> allocates memory, and later on when SA/CM etc per port cap is not

> present, it is unused left there which is freed on remove_one().

> Returning EOPNOTUPP is fine at start of add_one() before allocation.


Most of ULPs are OK, eg umad does:

	umad_dev = kzalloc(struct_size(umad_dev, ports, e - s + 1), GFP_KERNEL);
	if (!umad_dev)
		return -ENOMEM;
	for (i = s; i <= e; ++i) {
		if (!rdma_cap_ib_mad(device, i))
			continue;

	if (!count) {
		ret = -EOPNOTSUPP;
		goto free;
free:
	/* balances kref_init */
	ib_umad_dev_put(umad_dev);

It looks like only cm.c and cma.c need fixing, just fix those two.

The CM using ULPs have a different issue though..

Jason

Parav Pandit April 9, 2021, 12:31 p.m. UTC | #7

> From: Jason Gunthorpe <jgg@nvidia.com>

> Sent: Thursday, April 8, 2021 5:46 PM

> On Wed, Apr 07, 2021 at 03:44:35PM +0000, Parav Pandit wrote:

> 

> > > If it returns EOPNOTUPP then the remove is never called so if it

> > > allocated memory and left it allocated then it is leaking memory.

> > >

> > I probably confused you. There is no leak today because add_one

> > allocates memory, and later on when SA/CM etc per port cap is not

> > present, it is unused left there which is freed on remove_one().

> > Returning EOPNOTUPP is fine at start of add_one() before allocation.

> 

> Most of ULPs are OK, eg umad does:

> 

> 	umad_dev = kzalloc(struct_size(umad_dev, ports, e - s + 1),

> GFP_KERNEL);

> 	if (!umad_dev)

> 		return -ENOMEM;

> 	for (i = s; i <= e; ++i) {

> 		if (!rdma_cap_ib_mad(device, i))

> 			continue;

> 

> 	if (!count) {

> 		ret = -EOPNOTSUPP;

> 		goto free;

> free:

> 	/* balances kref_init */

> 	ib_umad_dev_put(umad_dev);

> 

> It looks like only cm.c and cma.c need fixing, just fix those two.

Only cma.c needs a fixing. cm.c also reports EOPNOTSUPP.
I will send the simplified fix through Leon.

[rdma-next,0/8] Generalize if ULP supported check

Message

Comments