diff mbox series

[v8,7/8] thermal: Add PCIe cooling driver

Message ID 20241009095223.7093-8-ilpo.jarvinen@linux.intel.com
State Superseded
Headers show
Series [v8,1/8] PCI: Protect Link Control 2 Register with RMW locking | expand

Commit Message

Ilpo Järvinen Oct. 9, 2024, 9:52 a.m. UTC
Add a thermal cooling driver to provide path to access PCIe bandwidth
controller using the usual thermal interfaces.

A cooling device is instantiated for controllable PCIe Ports from the
bwctrl service driver.

If registering the cooling device fails, allow bwctrl's probe to
succeed regardless. As cdev in that case contains IS_ERR() pseudo
"pointer", clean that up inside the probe function so the remove side
doesn't need to suddenly make an odd looking IS_ERR() check.

The thermal side state 0 means no throttling, i.e., maximum supported
PCIe Link Speed.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: Rafael J. Wysocki <rafael@kernel.org> # From the cooling device interface perspective
---
 MAINTAINERS                    |  2 +
 drivers/pci/pcie/bwctrl.c      | 11 +++++
 drivers/thermal/Kconfig        |  9 ++++
 drivers/thermal/Makefile       |  2 +
 drivers/thermal/pcie_cooling.c | 80 ++++++++++++++++++++++++++++++++++
 include/linux/pci-bwctrl.h     | 28 ++++++++++++
 6 files changed, 132 insertions(+)
 create mode 100644 drivers/thermal/pcie_cooling.c
 create mode 100644 include/linux/pci-bwctrl.h

Comments

Jonathan Cameron Oct. 17, 2024, 11:04 a.m. UTC | #1
On Wed,  9 Oct 2024 12:52:22 +0300
Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> wrote:

> Add a thermal cooling driver to provide path to access PCIe bandwidth
> controller using the usual thermal interfaces.
> 
> A cooling device is instantiated for controllable PCIe Ports from the
> bwctrl service driver.
> 
> If registering the cooling device fails, allow bwctrl's probe to
> succeed regardless. As cdev in that case contains IS_ERR() pseudo
> "pointer", clean that up inside the probe function so the remove side
> doesn't need to suddenly make an odd looking IS_ERR() check.
> 
> The thermal side state 0 means no throttling, i.e., maximum supported
> PCIe Link Speed.
> 
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Acked-by: Rafael J. Wysocki <rafael@kernel.org> # From the cooling device interface perspective

Trivial thing noticed on a reread.


> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> index 61e7ae524b1f..d3f9686e26e7 100644
> --- a/drivers/thermal/Kconfig
> +++ b/drivers/thermal/Kconfig
> @@ -220,6 +220,15 @@ config DEVFREQ_THERMAL
>  
>  	  If you want this support, you should say Y here.
>  
> +config PCIE_THERMAL
> +	bool "PCIe cooling support"
> +	depends on PCIEPORTBUS
> +	help
> +	  This implements PCIe cooling mechanism through bandwidth reduction
> +	  for PCIe devices.

Technically links not devices, but don't think that matters much
Ilpo Järvinen Oct. 17, 2024, 12:16 p.m. UTC | #2
On Thu, 17 Oct 2024, Jonathan Cameron wrote:

> On Wed,  9 Oct 2024 12:52:22 +0300
> Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> wrote:
> 
> > Add a thermal cooling driver to provide path to access PCIe bandwidth
> > controller using the usual thermal interfaces.
> > 
> > A cooling device is instantiated for controllable PCIe Ports from the
> > bwctrl service driver.
> > 
> > If registering the cooling device fails, allow bwctrl's probe to
> > succeed regardless. As cdev in that case contains IS_ERR() pseudo
> > "pointer", clean that up inside the probe function so the remove side
> > doesn't need to suddenly make an odd looking IS_ERR() check.
> > 
> > The thermal side state 0 means no throttling, i.e., maximum supported
> > PCIe Link Speed.
> > 
> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
> > Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > Acked-by: Rafael J. Wysocki <rafael@kernel.org> # From the cooling device interface perspective
> 
> Trivial thing noticed on a reread.
> 
> 
> > diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> > index 61e7ae524b1f..d3f9686e26e7 100644
> > --- a/drivers/thermal/Kconfig
> > +++ b/drivers/thermal/Kconfig
> > @@ -220,6 +220,15 @@ config DEVFREQ_THERMAL
> >  
> >  	  If you want this support, you should say Y here.
> >  
> > +config PCIE_THERMAL
> > +	bool "PCIe cooling support"
> > +	depends on PCIEPORTBUS
> > +	help
> > +	  This implements PCIe cooling mechanism through bandwidth reduction
> > +	  for PCIe devices.
> 
> Technically links not devices, but don't think that matters much

That distinction would be splitting hairs beyond what seems useful from 
ordinary user's point of view. If there's no device attached, BW 
controller cannot do anything since the link is not going to train.
The link speed reduction is going to impact the speed the device
can communicate with even if it technically occurs on the link.
Rafael J. Wysocki Oct. 17, 2024, 12:58 p.m. UTC | #3
On Thu, Oct 17, 2024 at 2:16 PM Ilpo Järvinen
<ilpo.jarvinen@linux.intel.com> wrote:
>
> On Thu, 17 Oct 2024, Jonathan Cameron wrote:
>
> > On Wed,  9 Oct 2024 12:52:22 +0300
> > Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> wrote:
> >
> > > Add a thermal cooling driver to provide path to access PCIe bandwidth
> > > controller using the usual thermal interfaces.
> > >
> > > A cooling device is instantiated for controllable PCIe Ports from the
> > > bwctrl service driver.
> > >
> > > If registering the cooling device fails, allow bwctrl's probe to
> > > succeed regardless. As cdev in that case contains IS_ERR() pseudo
> > > "pointer", clean that up inside the probe function so the remove side
> > > doesn't need to suddenly make an odd looking IS_ERR() check.
> > >
> > > The thermal side state 0 means no throttling, i.e., maximum supported
> > > PCIe Link Speed.
> > >
> > > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
> > > Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > > Acked-by: Rafael J. Wysocki <rafael@kernel.org> # From the cooling device interface perspective
> >
> > Trivial thing noticed on a reread.
> >
> >
> > > diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> > > index 61e7ae524b1f..d3f9686e26e7 100644
> > > --- a/drivers/thermal/Kconfig
> > > +++ b/drivers/thermal/Kconfig
> > > @@ -220,6 +220,15 @@ config DEVFREQ_THERMAL
> > >
> > >       If you want this support, you should say Y here.
> > >
> > > +config PCIE_THERMAL
> > > +   bool "PCIe cooling support"
> > > +   depends on PCIEPORTBUS
> > > +   help
> > > +     This implements PCIe cooling mechanism through bandwidth reduction
> > > +     for PCIe devices.
> >
> > Technically links not devices, but don't think that matters much
>
> That distinction would be splitting hairs beyond what seems useful from
> ordinary user's point of view. If there's no device attached, BW
> controller cannot do anything since the link is not going to train.
> The link speed reduction is going to impact the speed the device
> can communicate with even if it technically occurs on the link.
Ilpo Järvinen Oct. 17, 2024, 1:02 p.m. UTC | #4
On Thu, 17 Oct 2024, Rafael J. Wysocki wrote:

> On Thu, Oct 17, 2024 at 2:16 PM Ilpo Järvinen
> <ilpo.jarvinen@linux.intel.com> wrote:
> >
> > On Thu, 17 Oct 2024, Jonathan Cameron wrote:
> >
> > > On Wed,  9 Oct 2024 12:52:22 +0300
> > > Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> wrote:
> > >
> > > > Add a thermal cooling driver to provide path to access PCIe bandwidth
> > > > controller using the usual thermal interfaces.
> > > >
> > > > A cooling device is instantiated for controllable PCIe Ports from the
> > > > bwctrl service driver.
> > > >
> > > > If registering the cooling device fails, allow bwctrl's probe to
> > > > succeed regardless. As cdev in that case contains IS_ERR() pseudo
> > > > "pointer", clean that up inside the probe function so the remove side
> > > > doesn't need to suddenly make an odd looking IS_ERR() check.
> > > >
> > > > The thermal side state 0 means no throttling, i.e., maximum supported
> > > > PCIe Link Speed.
> > > >
> > > > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
> > > > Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > > > Acked-by: Rafael J. Wysocki <rafael@kernel.org> # From the cooling device interface perspective
> > >
> > > Trivial thing noticed on a reread.
> > >
> > >
> > > > diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> > > > index 61e7ae524b1f..d3f9686e26e7 100644
> > > > --- a/drivers/thermal/Kconfig
> > > > +++ b/drivers/thermal/Kconfig
> > > > @@ -220,6 +220,15 @@ config DEVFREQ_THERMAL
> > > >
> > > >       If you want this support, you should say Y here.
> > > >
> > > > +config PCIE_THERMAL
> > > > +   bool "PCIe cooling support"
> > > > +   depends on PCIEPORTBUS
> > > > +   help
> > > > +     This implements PCIe cooling mechanism through bandwidth reduction
> > > > +     for PCIe devices.
> > >
> > > Technically links not devices, but don't think that matters much
> >
> > That distinction would be splitting hairs beyond what seems useful from
> > ordinary user's point of view. If there's no device attached, BW
> > controller cannot do anything since the link is not going to train.
> > The link speed reduction is going to impact the speed the device
> > can communicate with even if it technically occurs on the link.
> 
> From the Kconfig description perspective I think it's better to say
> "devices" even though technically it is about links, because device
> performance is what users will measure and notice any changes of.

Yes, that's what I tried to explain above. I intend to keep it as 
"devices" for this very reason.
Jonathan Cameron Oct. 17, 2024, 1:28 p.m. UTC | #5
On Thu, 17 Oct 2024 16:02:34 +0300 (EEST)
Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> wrote:

> On Thu, 17 Oct 2024, Rafael J. Wysocki wrote:
> 
> > On Thu, Oct 17, 2024 at 2:16 PM Ilpo Järvinen
> > <ilpo.jarvinen@linux.intel.com> wrote:  
> > >
> > > On Thu, 17 Oct 2024, Jonathan Cameron wrote:
> > >  
> > > > On Wed,  9 Oct 2024 12:52:22 +0300
> > > > Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> wrote:
> > > >  
> > > > > Add a thermal cooling driver to provide path to access PCIe bandwidth
> > > > > controller using the usual thermal interfaces.
> > > > >
> > > > > A cooling device is instantiated for controllable PCIe Ports from the
> > > > > bwctrl service driver.
> > > > >
> > > > > If registering the cooling device fails, allow bwctrl's probe to
> > > > > succeed regardless. As cdev in that case contains IS_ERR() pseudo
> > > > > "pointer", clean that up inside the probe function so the remove side
> > > > > doesn't need to suddenly make an odd looking IS_ERR() check.
> > > > >
> > > > > The thermal side state 0 means no throttling, i.e., maximum supported
> > > > > PCIe Link Speed.
> > > > >
> > > > > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
> > > > > Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > > > > Acked-by: Rafael J. Wysocki <rafael@kernel.org> # From the cooling device interface perspective  
> > > >
> > > > Trivial thing noticed on a reread.
> > > >
> > > >  
> > > > > diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> > > > > index 61e7ae524b1f..d3f9686e26e7 100644
> > > > > --- a/drivers/thermal/Kconfig
> > > > > +++ b/drivers/thermal/Kconfig
> > > > > @@ -220,6 +220,15 @@ config DEVFREQ_THERMAL
> > > > >
> > > > >       If you want this support, you should say Y here.
> > > > >
> > > > > +config PCIE_THERMAL
> > > > > +   bool "PCIe cooling support"
> > > > > +   depends on PCIEPORTBUS
> > > > > +   help
> > > > > +     This implements PCIe cooling mechanism through bandwidth reduction
> > > > > +     for PCIe devices.  
> > > >
> > > > Technically links not devices, but don't think that matters much  
> > >
> > > That distinction would be splitting hairs beyond what seems useful from
> > > ordinary user's point of view. If there's no device attached, BW
> > > controller cannot do anything since the link is not going to train.
> > > The link speed reduction is going to impact the speed the device
> > > can communicate with even if it technically occurs on the link.  
> > 
> > From the Kconfig description perspective I think it's better to say
> > "devices" even though technically it is about links, because device
> > performance is what users will measure and notice any changes of.  
> 
> Yes, that's what I tried to explain above. I intend to keep it as 
> "devices" for this very reason.
> 
Fine by me to keep it as things stand. Been reading too many specs
recently so my immediate thought was why is there a bandwidth control
for the engines / memory in the EP :)

Jonathan
diff mbox series

Patch

diff --git a/MAINTAINERS b/MAINTAINERS
index 8c555b3325d6..393ed7ce5ea1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17938,6 +17938,8 @@  M:	Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
 L:	linux-pci@vger.kernel.org
 S:	Supported
 F:	drivers/pci/pcie/bwctrl.c
+F:	drivers/thermal/pcie_cooling.c
+F:	include/linux/pci-bwctrl.h
 
 PCIE DRIVER FOR AMAZON ANNAPURNA LABS
 M:	Jonathan Chocron <jonnyc@amazon.com>
diff --git a/drivers/pci/pcie/bwctrl.c b/drivers/pci/pcie/bwctrl.c
index 1d3680ea8e06..0f686a2e636f 100644
--- a/drivers/pci/pcie/bwctrl.c
+++ b/drivers/pci/pcie/bwctrl.c
@@ -27,6 +27,7 @@ 
 #include <linux/interrupt.h>
 #include <linux/mutex.h>
 #include <linux/pci.h>
+#include <linux/pci-bwctrl.h>
 #include <linux/rwsem.h>
 #include <linux/slab.h>
 #include <linux/types.h>
@@ -38,10 +39,12 @@ 
  * struct pcie_bwctrl_data - PCIe bandwidth controller
  * @set_speed_mutex:	Serializes link speed changes
  * @lbms_count:		Count for LBMS (since last reset)
+ * @cdev:		thermal cooling device associated with the port
  */
 struct pcie_bwctrl_data {
 	struct mutex set_speed_mutex;
 	atomic_t lbms_count;
+	struct thermal_cooling_device *cdev;
 };
 
 /* Prevents port removal during link speed changes and LBMS count accessors */
@@ -303,6 +306,11 @@  static int pcie_bwnotif_probe(struct pcie_device *srv)
 
 	pci_dbg(port, "enabled with IRQ %d\n", srv->irq);
 
+	/* Don't fail on errors. Don't leave IS_ERR() "pointer" into ->cdev */
+	port->link_bwctrl->cdev = pcie_cooling_device_register(port);
+	if (IS_ERR(port->link_bwctrl->cdev))
+		port->link_bwctrl->cdev = NULL;
+
 	return 0;
 }
 
@@ -310,6 +318,9 @@  static void pcie_bwnotif_remove(struct pcie_device *srv)
 {
 	struct pcie_bwctrl_data *data = get_service_data(srv);
 
+	if (data->cdev)
+		pcie_cooling_device_unregister(data->cdev);
+
 	pcie_bwnotif_disable(srv->port);
 	scoped_guard(rwsem_write, &pcie_bwctrl_remove_rwsem)
 		srv->port->link_bwctrl = NULL;
diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
index 61e7ae524b1f..d3f9686e26e7 100644
--- a/drivers/thermal/Kconfig
+++ b/drivers/thermal/Kconfig
@@ -220,6 +220,15 @@  config DEVFREQ_THERMAL
 
 	  If you want this support, you should say Y here.
 
+config PCIE_THERMAL
+	bool "PCIe cooling support"
+	depends on PCIEPORTBUS
+	help
+	  This implements PCIe cooling mechanism through bandwidth reduction
+	  for PCIe devices.
+
+	  If you want this support, you should say Y here.
+
 config THERMAL_EMULATION
 	bool "Thermal emulation mode support"
 	help
diff --git a/drivers/thermal/Makefile b/drivers/thermal/Makefile
index 41c4d56beb40..210c16c91461 100644
--- a/drivers/thermal/Makefile
+++ b/drivers/thermal/Makefile
@@ -31,6 +31,8 @@  thermal_sys-$(CONFIG_CPU_IDLE_THERMAL)	+= cpuidle_cooling.o
 # devfreq cooling
 thermal_sys-$(CONFIG_DEVFREQ_THERMAL) += devfreq_cooling.o
 
+thermal_sys-$(CONFIG_PCIE_THERMAL) += pcie_cooling.o
+
 obj-$(CONFIG_K3_THERMAL)	+= k3_bandgap.o k3_j72xx_bandgap.o
 # platform thermal drivers
 obj-y				+= broadcom/
diff --git a/drivers/thermal/pcie_cooling.c b/drivers/thermal/pcie_cooling.c
new file mode 100644
index 000000000000..a876d64f1582
--- /dev/null
+++ b/drivers/thermal/pcie_cooling.c
@@ -0,0 +1,80 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * PCIe cooling device
+ *
+ * Copyright (C) 2023-2024 Intel Corporation
+ */
+
+#include <linux/build_bug.h>
+#include <linux/cleanup.h>
+#include <linux/err.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/pci-bwctrl.h>
+#include <linux/slab.h>
+#include <linux/sprintf.h>
+#include <linux/thermal.h>
+
+#define COOLING_DEV_TYPE_PREFIX		"PCIe_Port_Link_Speed_"
+
+static int pcie_cooling_get_max_level(struct thermal_cooling_device *cdev, unsigned long *state)
+{
+	struct pci_dev *port = cdev->devdata;
+
+	/* cooling state 0 is same as the maximum PCIe speed */
+	*state = port->subordinate->max_bus_speed - PCIE_SPEED_2_5GT;
+
+	return 0;
+}
+
+static int pcie_cooling_get_cur_level(struct thermal_cooling_device *cdev, unsigned long *state)
+{
+	struct pci_dev *port = cdev->devdata;
+
+	/* cooling state 0 is same as the maximum PCIe speed */
+	*state = cdev->max_state - (port->subordinate->cur_bus_speed - PCIE_SPEED_2_5GT);
+
+	return 0;
+}
+
+static int pcie_cooling_set_cur_level(struct thermal_cooling_device *cdev, unsigned long state)
+{
+	struct pci_dev *port = cdev->devdata;
+	enum pci_bus_speed speed;
+
+	/* cooling state 0 is same as the maximum PCIe speed */
+	speed = (cdev->max_state - state) + PCIE_SPEED_2_5GT;
+
+	return pcie_set_target_speed(port, speed, true);
+}
+
+static struct thermal_cooling_device_ops pcie_cooling_ops = {
+	.get_max_state = pcie_cooling_get_max_level,
+	.get_cur_state = pcie_cooling_get_cur_level,
+	.set_cur_state = pcie_cooling_set_cur_level,
+};
+
+struct thermal_cooling_device *pcie_cooling_device_register(struct pci_dev *port)
+{
+	char *name __free(kfree) =
+		kasprintf(GFP_KERNEL, COOLING_DEV_TYPE_PREFIX "%s", pci_name(port));
+	if (!name)
+		return ERR_PTR(-ENOMEM);
+
+	return thermal_cooling_device_register(name, port, &pcie_cooling_ops);
+}
+
+void pcie_cooling_device_unregister(struct thermal_cooling_device *cdev)
+{
+	thermal_cooling_device_unregister(cdev);
+}
+
+/* For bus_speed <-> state arithmetic */
+static_assert(PCIE_SPEED_2_5GT + 1 == PCIE_SPEED_5_0GT);
+static_assert(PCIE_SPEED_5_0GT + 1 == PCIE_SPEED_8_0GT);
+static_assert(PCIE_SPEED_8_0GT + 1 == PCIE_SPEED_16_0GT);
+static_assert(PCIE_SPEED_16_0GT + 1 == PCIE_SPEED_32_0GT);
+static_assert(PCIE_SPEED_32_0GT + 1 == PCIE_SPEED_64_0GT);
+
+MODULE_AUTHOR("Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>");
+MODULE_DESCRIPTION("PCIe cooling driver");
diff --git a/include/linux/pci-bwctrl.h b/include/linux/pci-bwctrl.h
new file mode 100644
index 000000000000..cee07127455b
--- /dev/null
+++ b/include/linux/pci-bwctrl.h
@@ -0,0 +1,28 @@ 
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * PCIe bandwidth controller
+ *
+ * Copyright (C) 2023-2024 Intel Corporation
+ */
+
+#ifndef LINUX_PCI_BWCTRL_H
+#define LINUX_PCI_BWCTRL_H
+
+#include <linux/pci.h>
+
+struct thermal_cooling_device;
+
+#ifdef CONFIG_PCIE_THERMAL
+struct thermal_cooling_device *pcie_cooling_device_register(struct pci_dev *port);
+void pcie_cooling_device_unregister(struct thermal_cooling_device *cdev);
+#else
+static inline struct thermal_cooling_device *pcie_cooling_device_register(struct pci_dev *port)
+{
+	return NULL;
+}
+static inline void pcie_cooling_device_unregister(struct thermal_cooling_device *cdev)
+{
+}
+#endif
+
+#endif