
[net-next,00/10] mlxsw: Expose transceiver overheat counter

Message ID 20200927075015.1417714-1-idosch@idosch.org

Message

Ido Schimmel Sept. 27, 2020, 7:50 a.m. UTC
From: Ido Schimmel <idosch@nvidia.com>

Amit says:

An overheated transceiver can be the root cause of various network
problems such as link flapping. Counting the number of times a
transceiver's temperature was higher than its configured threshold can
therefore help in debugging such issues.

This patch set exposes a transceiver overheat counter via ethtool. This
is achieved by configuring the Spectrum ASIC to generate events whenever
a transceiver is overheated. The temperature thresholds are queried from
the transceiver (if available) and set to the default otherwise.
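
As a rough illustration of the counting behaviour described above, here is a
minimal Python model. It is not the driver code: the class name, the default
threshold value and the choice to count only on the transition into the
overheated state are assumptions made for the sketch.

```python
# Hypothetical default, used only when the module reports no threshold.
# The real default is defined by the driver and is not given in this letter.
DEFAULT_THRESHOLD_C = 70


class ModuleOverheatModel:
    """Toy model of a per-transceiver overheat counter (illustrative only)."""

    def __init__(self, module_threshold_c=None):
        # Use the threshold queried from the transceiver if available,
        # otherwise fall back to the default.
        if module_threshold_c is not None:
            self.threshold_c = module_threshold_c
        else:
            self.threshold_c = DEFAULT_THRESHOLD_C
        self.overheated = False
        self.counter = 0

    def on_temperature(self, temp_c):
        """Process one temperature sample, standing in for an ASIC event."""
        now_hot = temp_c > self.threshold_c
        # Assumption: count once per entry into the overheated state,
        # not once per sample taken while the module stays hot.
        if now_hot and not self.overheated:
            self.counter += 1
        self.overheated = now_hot
        return self.counter
```

Feeding the samples 60, 75, 80, 60, 90 into a model with the assumed default
threshold yields a counter of 2, since the module crosses the threshold twice.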

Example:

# ethtool -S swp1
...
transceiver_overheat: 2
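
For monitoring, the counter can be read programmatically. The sketch below
is one possible way to do it in Python, assuming the usual "name: value"
layout of `ethtool -S` output; the helper names are hypothetical and not
part of this patch set.

```python
import re
import subprocess


def parse_ethtool_stats(text):
    """Parse `ethtool -S` output into a dict of counter name -> int."""
    stats = {}
    for line in text.splitlines():
        # Counter lines look like "     name: value". Header lines such as
        # "NIC statistics:" carry no numeric value and are skipped.
        match = re.match(r"\s*(\S[^:]*):\s*(-?\d+)\s*$", line)
        if match:
            stats[match.group(1).strip()] = int(match.group(2))
    return stats


def transceiver_overheat(ifname):
    """Return the counter for ifname, or None if it is not exposed."""
    result = subprocess.run(
        ["ethtool", "-S", ifname],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ethtool_stats(result.stdout).get("transceiver_overheat")
```

Calling `transceiver_overheat("swp1")` on a system with this series applied
would be expected to return the value shown above; on drivers without the
counter it returns None.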

Patch set overview:

Patches #1-#3 add required device registers
Patches #4-#5 add required infrastructure in mlxsw to configure and
count overheat events
Patches #6-#9 gradually add support for the transceiver overheat counter
Patch #10 exposes the transceiver overheat counter via ethtool

Amit Cohen (10):
  mlxsw: reg: Add Management Temperature Warning Event Register
  mlxsw: reg: Add Port Module Plug/Unplug Event Register
  mlxsw: reg: Add Ports Module Administrative and Operational Status
    Register
  mlxsw: core_hwmon: Query MTMP before writing to set only relevant
    fields
  mlxsw: core: Add an infrastructure to track transceiver overheat
    counter
  mlxsw: Update transceiver_overheat counter according to MTWE
  mlxsw: Enable temperature event for all supported port module sensors
  mlxsw: spectrum: Initialize netdev's module overheat counter
  mlxsw: Update module's settings when module is plugged in
  mlxsw: spectrum_ethtool: Expose transceiver_overheat counter

 drivers/net/ethernet/mellanox/mlxsw/core.c    |  27 ++
 drivers/net/ethernet/mellanox/mlxsw/core.h    |   5 +
 .../net/ethernet/mellanox/mlxsw/core_env.c    | 368 ++++++++++++++++++
 .../net/ethernet/mellanox/mlxsw/core_env.h    |   6 +
 .../net/ethernet/mellanox/mlxsw/core_hwmon.c  |  21 +-
 drivers/net/ethernet/mellanox/mlxsw/reg.h     | 132 +++++++
 .../net/ethernet/mellanox/mlxsw/spectrum.c    |  44 +++
 .../net/ethernet/mellanox/mlxsw/spectrum.h    |   1 +
 .../mellanox/mlxsw/spectrum_ethtool.c         |  57 ++-
 drivers/net/ethernet/mellanox/mlxsw/trap.h    |   4 +
 10 files changed, 660 insertions(+), 5 deletions(-)

Comments

David Miller Sept. 27, 2020, 8:27 p.m. UTC | #1
From: Ido Schimmel <idosch@idosch.org>
Date: Sun, 27 Sep 2020 10:50:05 +0300

> From: Ido Schimmel <idosch@nvidia.com>
> 
> Amit says:
> 
> An overheated transceiver can be the root cause of various network
> problems such as link flapping. Counting the number of times a
> transceiver's temperature was higher than its configured threshold can
> therefore help in debugging such issues.
> 
> This patch set exposes a transceiver overheat counter via ethtool. This
> is achieved by configuring the Spectrum ASIC to generate events whenever
> a transceiver is overheated. The temperature thresholds are queried from
> the transceiver (if available) and set to the default otherwise.
> 
> Example:
> 
> # ethtool -S swp1
> ...
> transceiver_overheat: 2
> 
> Patch set overview:
> 
> Patches #1-#3 add required device registers
> Patches #4-#5 add required infrastructure in mlxsw to configure and
> count overheat events
> Patches #6-#9 gradually add support for the transceiver overheat counter
> Patch #10 exposes the transceiver overheat counter via ethtool

Series applied, thanks.