[pull,request,net,V2,00/15] mlx5 fixes 2020-09-30

Message ID 20201001195247.66636-1-saeed@kernel.org

Message

Saeed Mahameed Oct. 1, 2020, 7:52 p.m. UTC
From: Saeed Mahameed <saeedm@nvidia.com>

Hi Dave,

This series introduces some fixes to the mlx5 driver.

v1->v2:
  - Patch #1: Don't return while the mutex is held. (Dave)

Please pull and let me know if there is any problem.

For -stable v4.15
 ('net/mlx5e: Fix VLAN cleanup flow')
 ('net/mlx5e: Fix VLAN create flow')

For -stable v4.16
 ('net/mlx5: Fix request_irqs error flow')

For -stable v5.4
 ('net/mlx5e: Add resiliency in Striding RQ mode for packets larger than MTU')
 ('net/mlx5: Avoid possible free of command entry while timeout comp handler')

For -stable v5.7
 ('net/mlx5e: Fix return status when setting unsupported FEC mode')

For -stable v5.8
 ('net/mlx5e: Fix race condition on nhe->n pointer in neigh update')

Thanks,
Saeed.

---
The following changes since commit a59cf619787e628b31c310367f869fde26c8ede1:

  Merge branch 'Fix-bugs-in-Octeontx2-netdev-driver' (2020-09-30 15:07:19 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-fixes-2020-09-30

for you to fetch changes up to ae2cc06daf21c2a38c6caca2c19599d61a5b3890:

  net/mlx5e: Fix race condition on nhe->n pointer in neigh update (2020-10-01 12:46:37 -0700)

----------------------------------------------------------------
mlx5-fixes-2020-09-30

----------------------------------------------------------------
Aya Levin (6):
      net/mlx5e: Fix error path for RQ alloc
      net/mlx5e: Add resiliency in Striding RQ mode for packets larger than MTU
      net/mlx5e: Fix driver's declaration to support GRE offload
      net/mlx5e: Fix return status when setting unsupported FEC mode
      net/mlx5e: Fix VLAN cleanup flow
      net/mlx5e: Fix VLAN create flow

Eran Ben Elisha (4):
      net/mlx5: Fix a race when moving command interface to polling mode
      net/mlx5: Avoid possible free of command entry while timeout comp handler
      net/mlx5: poll cmd EQ in case of command timeout
      net/mlx5: Add retry mechanism to the command entry index allocation

Maor Dickman (1):
      net/mlx5e: CT, Fix coverity issue

Maor Gottlieb (1):
      net/mlx5: Fix request_irqs error flow

Saeed Mahameed (1):
      net/mlx5: cmdif, Avoid skipping reclaim pages if FW is not accessible

Shay Drory (1):
      net/mlx5: Don't allow health work when device is uninitialized

Vlad Buslov (1):
      net/mlx5e: Fix race condition on nhe->n pointer in neigh update

 drivers/net/ethernet/mellanox/mlx5/core/cmd.c      | 198 +++++++++++++++------
 drivers/net/ethernet/mellanox/mlx5/core/en.h       |   8 +-
 drivers/net/ethernet/mellanox/mlx5/core/en/port.c  |   3 +
 .../net/ethernet/mellanox/mlx5/core/en/rep/neigh.c |  81 +++++----
 drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c |   4 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_fs.c    |  14 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 104 +++++++++--
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.h   |   6 -
 drivers/net/ethernet/mellanox/mlx5/core/eq.c       |  42 ++++-
 drivers/net/ethernet/mellanox/mlx5/core/health.c   |  11 ++
 drivers/net/ethernet/mellanox/mlx5/core/lib/eq.h   |   2 +
 drivers/net/ethernet/mellanox/mlx5/core/main.c     |   2 +
 .../net/ethernet/mellanox/mlx5/core/pagealloc.c    |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c  |   2 +-
 include/linux/mlx5/driver.h                        |   4 +
 15 files changed, 364 insertions(+), 119 deletions(-)

Comments

Jakub Kicinski Oct. 1, 2020, 11:15 p.m. UTC | #1
On Thu,  1 Oct 2020 12:52:33 -0700 saeed@kernel.org wrote:
> From: Shay Drory <shayd@mellanox.com>
> 
> On error flow due to failure on driver load, driver can be
> un-initializing while a health work is running in the background,
> health work shouldn't be allowed at this point, as it needs resources to
> be initialized and there is no point to recover on driver load failures.
> 
> Therefore, introducing a new state bit to indicate if device is
> initialized, for health work to check before trying to recover the driver.

Can't you cancel this work? Or make sure it's not scheduled?
IMHO those "INITIALIZED" bits are an anti-pattern.

> Fixes: b6e0b6bebe07 ("net/mlx5: Fix fatal error handling during device load")
> Signed-off-by: Shay Drory <shayd@mellanox.com>
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

You signed off twice :)

We should teach verify_signoff to catch that..

Saeed Mahameed Oct. 2, 2020, 4:57 p.m. UTC | #2
On Thu, 2020-10-01 at 16:15 -0700, Jakub Kicinski wrote:
> On Thu,  1 Oct 2020 12:52:33 -0700 saeed@kernel.org wrote:
> > From: Shay Drory <shayd@mellanox.com>
> > 
> > On error flow due to failure on driver load, driver can be
> > un-initializing while a health work is running in the background,
> > health work shouldn't be allowed at this point, as it needs
> > resources to
> > be initialized and there is no point to recover on driver load
> > failures.
> > 
> > Therefore, introducing a new state bit to indicate if device is
> > initialized, for health work to check before trying to recover the
> > driver.
> 
> Can't you cancel this work? Or make sure it's not scheduled?
> IMHO those "INITIALIZED" bits are an anti-pattern.
> 

Shay didn't want to complicate this patch for net: the health work
should start as early as possible and keep running after the driver is
initialized, even if the driver instance reloads later. The main design
issue is that we initialize and allocate the driver structures once, on
the first boot, and every later reload reuses the same structures, so
there is some asymmetry we need to deal with. Nothing is impossible,
though: the solution will be more complicated, but (I hope) not too big
to make it into net. I will drop this patch for now.

> > Fixes: b6e0b6bebe07 ("net/mlx5: Fix fatal error handling during
> > device load")
> > Signed-off-by: Shay Drory <shayd@mellanox.com>
> > Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> > Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> 
> You signed off twice :)
> 

Will fix this, old mellanox email :/

> We should teach verify_signoff to catch that..

It is not exactly twice; the emails are different.

Saeed Mahameed Oct. 2, 2020, 5:05 p.m. UTC | #3
On Thu, 2020-10-01 at 16:24 -0700, Jakub Kicinski wrote:
> On Thu,  1 Oct 2020 12:52:39 -0700 saeed@kernel.org wrote:
> > -	for (; i >= 0; i--) {
> > +	for (--i; i >= 0; i--) {
> 
> while (i--)

while(--i)

I like this, fewer characters to maintain :)

Mark Bloch Oct. 2, 2020, 5:19 p.m. UTC | #4
On 10/2/2020 10:05, Saeed Mahameed wrote:
> On Thu, 2020-10-01 at 16:24 -0700, Jakub Kicinski wrote:
>> On Thu,  1 Oct 2020 12:52:39 -0700 saeed@kernel.org wrote:
>>> -	for (; i >= 0; i--) {
>>> +	for (--i; i >= 0; i--) {
>>
>> while (i--)
> 
> while(--i)

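It has to be: while (i--)
Consider the case of i=0: while (--i) would decrement i to -1, which is
nonzero, so the loop body would run with a negative index.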

> 
> I like this, less characters to maintain :)
> 

Mark

Saeed Mahameed Oct. 2, 2020, 5:27 p.m. UTC | #5
On Fri, 2020-10-02 at 10:19 -0700, Mark Bloch wrote:
> 
> On 10/2/2020 10:05, Saeed Mahameed wrote:
> > On Thu, 2020-10-01 at 16:24 -0700, Jakub Kicinski wrote:
> > > On Thu,  1 Oct 2020 12:52:39 -0700 saeed@kernel.org wrote:
> > > > -	for (; i >= 0; i--) {
> > > > +	for (--i; i >= 0; i--) {
> > > 
> > > while (i--)
> > 
> > while(--i)
> 
> It has to be: while (i--)
> Case of i=0,
> 

Whoops!

while (i--) it is.

Thanks Mark.
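
For readers following the loop discussion above, here is a minimal
user-space sketch of the unwind pattern being debated. It is not the
mlx5 code: slot_held, acquire_slot(), release_slot(), acquire_all() and
count_held() are hypothetical names made up for illustration, and the
failure is injected through a fail_at parameter. It only shows why
while (i--) releases exactly the entries acquired before the failing
index, including the i == 0 case Mark pointed out.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_SLOTS 4

/* Hypothetical stand-in for per-IRQ state; not the mlx5 structures. */
static bool slot_held[NUM_SLOTS];

/* Acquire slot i; fails when i == fail_at (pass -1 for no failure). */
static int acquire_slot(int i, int fail_at)
{
	if (i == fail_at)
		return -1;
	slot_held[i] = true;
	return 0;
}

static void release_slot(int i)
{
	/* Releasing a slot that was never acquired is a bug. */
	assert(i >= 0 && i < NUM_SLOTS && slot_held[i]);
	slot_held[i] = false;
}

/* Acquire every slot, or none: on failure, undo what was acquired. */
static int acquire_all(int fail_at)
{
	int i;

	for (i = 0; i < NUM_SLOTS; i++) {
		if (acquire_slot(i, fail_at))
			goto err_unwind;
	}
	return 0;

err_unwind:
	/*
	 * i is the index that failed, so only i-1 .. 0 are held.
	 * while (i--) tests the old value and then decrements, so the
	 * body sees i-1 .. 0 and is skipped entirely when i == 0.
	 */
	while (i--)
		release_slot(i);
	return -1;
}

/* Number of slots currently held. */
static int count_held(void)
{
	int i, n = 0;

	for (i = 0; i < NUM_SLOTS; i++)
		n += slot_held[i];
	return n;
}
```

With the original for (; i >= 0; i--) form, the unwind would also
release the index that failed to be acquired. With while (--i), a
failure at index 0 would run the body with i == -1, and a failure at
index 1 would skip slot 0 altogether.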