[v4,4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

Message ID 20200928183529.471328-5-nitesh@redhat.com
State New
Series isolation: limit msix vectors to housekeeping CPUs

Commit Message

Nitesh Narayan Lal Sept. 28, 2020, 6:35 p.m. UTC
If we have isolated CPUs dedicated for use by real-time tasks, we try to
move IRQs to housekeeping CPUs from the userspace to reduce latency
overhead on the isolated CPUs.

If we allocate too many IRQ vectors, moving them all to housekeeping CPUs
may exceed per-CPU vector limits.

When we have isolated CPUs, limit the number of vectors allocated by
pci_alloc_irq_vectors() to the minimum number required by the driver, or
to one per housekeeping CPU if that is larger.

Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
---
 drivers/pci/msi.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)
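
For context, this is roughly how a driver invokes the interface the patch
modifies. The "foo" helper below is hypothetical and only sketches the common
pattern of requesting one vector per online CPU plus one control vector; with
this patch, the PCI core may additionally cap max_vecs at the number of
housekeeping CPUs when CPU isolation is configured.

#include <linux/pci.h>
#include <linux/cpumask.h>

static int foo_setup_irqs(struct pci_dev *pdev)
{
	unsigned int min_vecs = 2;			/* control + 1 queue */
	unsigned int max_vecs = num_online_cpus() + 1;	/* control + per-CPU queues */
	int nvecs;

	/* The core returns anything in [min_vecs, max_vecs], or a negative errno. */
	nvecs = pci_alloc_irq_vectors(pdev, min_vecs, max_vecs,
				      PCI_IRQ_MSIX | PCI_IRQ_MSI);
	if (nvecs < 0)
		return nvecs;

	/* Size the queue set from what was actually granted. */
	return nvecs;
}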

Comments

Bjorn Helgaas Sept. 28, 2020, 9:59 p.m. UTC | #1
[to: Christoph in case he has comments, since I think he wrote this code]

On Mon, Sep 28, 2020 at 02:35:29PM -0400, Nitesh Narayan Lal wrote:
> If we have isolated CPUs dedicated for use by real-time tasks, we try to
> move IRQs to housekeeping CPUs from the userspace to reduce latency
> overhead on the isolated CPUs.
> 
> If we allocate too many IRQ vectors, moving them all to housekeeping CPUs
> may exceed per-CPU vector limits.
> 
> When we have isolated CPUs, limit the number of vectors allocated by
> pci_alloc_irq_vectors() to the minimum number required by the driver, or
> to one per housekeeping CPU if that is larger.
> 
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

> ---
>  drivers/pci/msi.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 30ae4ffda5c1..8c156867803c 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -23,6 +23,7 @@
>  #include <linux/slab.h>
>  #include <linux/irqdomain.h>
>  #include <linux/of_irq.h>
> +#include <linux/sched/isolation.h>
>  
>  #include "pci.h"
>  
> @@ -1191,8 +1192,25 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
>  				   struct irq_affinity *affd)
>  {
>  	struct irq_affinity msi_default_affd = {0};
> +	unsigned int hk_cpus;
>  	int nvecs = -ENOSPC;
>  
> +	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> +
> +	/*
> +	 * If we have isolated CPUs for use by real-time tasks, to keep the
> +	 * latency overhead to a minimum, device-specific IRQ vectors are moved
> +	 * to the housekeeping CPUs from the userspace by changing their
> +	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> +	 * running out of IRQ vectors.
> +	 */
> +	if (hk_cpus < num_online_cpus()) {
> +		if (hk_cpus < min_vecs)
> +			max_vecs = min_vecs;
> +		else if (hk_cpus < max_vecs)
> +			max_vecs = hk_cpus;
> +	}
> +
>  	if (flags & PCI_IRQ_AFFINITY) {
>  		if (!affd)
>  			affd = &msi_default_affd;
> -- 
> 2.18.2
>
Christoph Hellwig Sept. 29, 2020, 5:46 p.m. UTC | #2
On Mon, Sep 28, 2020 at 04:59:31PM -0500, Bjorn Helgaas wrote:
> [to: Christoph in case he has comments, since I think he wrote this code]

I think I actually suggested this a few iterations back.

> > +	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> > +
> > +	/*
> > +	 * If we have isolated CPUs for use by real-time tasks, to keep the
> > +	 * latency overhead to a minimum, device-specific IRQ vectors are moved
> > +	 * to the housekeeping CPUs from the userspace by changing their
> > +	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> > +	 * running out of IRQ vectors.
> > +	 */
> > +	if (hk_cpus < num_online_cpus()) {

I would have moved the assignment to hk_cpus below the comment and
just above the if, but that is really just a minor style preference.

Otherwise this looks good:

Acked-by: Christoph Hellwig <hch@lst.de>
Peter Zijlstra Oct. 16, 2020, 12:20 p.m. UTC | #3
On Mon, Sep 28, 2020 at 02:35:29PM -0400, Nitesh Narayan Lal wrote:
> If we have isolated CPUs dedicated for use by real-time tasks, we try to
> move IRQs to housekeeping CPUs from the userspace to reduce latency
> overhead on the isolated CPUs.
> 
> If we allocate too many IRQ vectors, moving them all to housekeeping CPUs
> may exceed per-CPU vector limits.
> 
> When we have isolated CPUs, limit the number of vectors allocated by
> pci_alloc_irq_vectors() to the minimum number required by the driver, or
> to one per housekeeping CPU if that is larger.
> 
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> ---
>  drivers/pci/msi.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 30ae4ffda5c1..8c156867803c 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -23,6 +23,7 @@
>  #include <linux/slab.h>
>  #include <linux/irqdomain.h>
>  #include <linux/of_irq.h>
> +#include <linux/sched/isolation.h>
>  
>  #include "pci.h"
>  
> @@ -1191,8 +1192,25 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
>  				   struct irq_affinity *affd)
>  {
>  	struct irq_affinity msi_default_affd = {0};
> +	unsigned int hk_cpus;
>  	int nvecs = -ENOSPC;
>  
> +	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> +
> +	/*
> +	 * If we have isolated CPUs for use by real-time tasks, to keep the
> +	 * latency overhead to a minimum, device-specific IRQ vectors are moved
> +	 * to the housekeeping CPUs from the userspace by changing their
> +	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> +	 * running out of IRQ vectors.
> +	 */
> +	if (hk_cpus < num_online_cpus()) {
> +		if (hk_cpus < min_vecs)
> +			max_vecs = min_vecs;
> +		else if (hk_cpus < max_vecs)
> +			max_vecs = hk_cpus;

is that:

		max_vecs = clamp(hk_cpus, min_vecs, max_vecs);

Also, do we really need to have that conditional on hk_cpus <
num_online_cpus()? That is, why can't we do this unconditionally?

And what are the (desired) semantics vs hotplug? Using a cpumask without
excluding hotplug is racy.

> +	}
> +
>  	if (flags & PCI_IRQ_AFFINITY) {
>  		if (!affd)
>  			affd = &msi_default_affd;
> -- 
> 2.18.2
>
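
For reference, the clamp()-based variant suggested above would reduce the
hunk to something like the following. This is only a sketch, not posted code,
and it keeps the hk_cpus < num_online_cpus() guard that the rest of the
thread goes on to debate:

	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);

	/*
	 * clamp() (from <linux/kernel.h>) picks hk_cpus unless it falls
	 * outside the [min_vecs, max_vecs] range requested by the driver,
	 * which matches the if/else-if chain in the posted patch.
	 */
	if (hk_cpus < num_online_cpus())
		max_vecs = clamp(hk_cpus, min_vecs, max_vecs);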
Nitesh Narayan Lal Oct. 18, 2020, 6:14 p.m. UTC | #4
On 10/16/20 8:20 AM, Peter Zijlstra wrote:
> On Mon, Sep 28, 2020 at 02:35:29PM -0400, Nitesh Narayan Lal wrote:
>> If we have isolated CPUs dedicated for use by real-time tasks, we try to
>> move IRQs to housekeeping CPUs from the userspace to reduce latency
>> overhead on the isolated CPUs.
>>
>> If we allocate too many IRQ vectors, moving them all to housekeeping CPUs
>> may exceed per-CPU vector limits.
>>
>> When we have isolated CPUs, limit the number of vectors allocated by
>> pci_alloc_irq_vectors() to the minimum number required by the driver, or
>> to one per housekeeping CPU if that is larger.
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>> ---
>>  drivers/pci/msi.c | 18 ++++++++++++++++++
>>  1 file changed, 18 insertions(+)
>>
>> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
>> index 30ae4ffda5c1..8c156867803c 100644
>> --- a/drivers/pci/msi.c
>> +++ b/drivers/pci/msi.c
>> @@ -23,6 +23,7 @@
>>  #include <linux/slab.h>
>>  #include <linux/irqdomain.h>
>>  #include <linux/of_irq.h>
>> +#include <linux/sched/isolation.h>
>>  
>>  #include "pci.h"
>>  
>> @@ -1191,8 +1192,25 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
>>  				   struct irq_affinity *affd)
>>  {
>>  	struct irq_affinity msi_default_affd = {0};
>> +	unsigned int hk_cpus;
>>  	int nvecs = -ENOSPC;
>>  
>> +	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
>> +
>> +	/*
>> +	 * If we have isolated CPUs for use by real-time tasks, to keep the
>> +	 * latency overhead to a minimum, device-specific IRQ vectors are moved
>> +	 * to the housekeeping CPUs from the userspace by changing their
>> +	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
>> +	 * running out of IRQ vectors.
>> +	 */
>> +	if (hk_cpus < num_online_cpus()) {
>> +		if (hk_cpus < min_vecs)
>> +			max_vecs = min_vecs;
>> +		else if (hk_cpus < max_vecs)
>> +			max_vecs = hk_cpus;
> is that:
>
> 		max_vecs = clamp(hk_cpus, min_vecs, max_vecs);

Yes, I think this will do.

>
> Also, do we really need to have that conditional on hk_cpus <
> num_online_cpus()? That is, why can't we do this unconditionally?


FWIU most of the drivers using this API already restrict the number of
vectors based on num_online_cpus(); if we do it unconditionally we can
unnecessarily duplicate that restriction for cases where we don't have any
isolated CPUs.

Also, different drivers seem to take different factors into consideration
along with num_online_cpus() while finding the max_vecs to request, for
example in the case of mlx5:
MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
               MLX5_EQ_VEC_COMP_BASE

Having hk_cpus < num_online_cpus() helps us ensure that we are only
changing the behavior when we have isolated CPUs.

Does that make sense?

>
> And what are the (desired) semantics vs hotplug? Using a cpumask without
> excluding hotplug is racy.

The housekeeping_mask should still remain constant, shouldn't it?
In any case, I can double-check this.

>
>> +	}
>> +
>>  	if (flags & PCI_IRQ_AFFINITY) {
>>  		if (!affd)
>>  			affd = &msi_default_affd;
>> -- 
>> 2.18.2
>>
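
To make the mlx5 example above concrete, the driver-side sizing looks
roughly like the sketch below. This is a simplification based on the snippet
quoted in this thread, not a verbatim copy of the mlx5 driver:

	/* One completion vector per port per online CPU, plus a fixed base
	 * of control vectors. */
	int nvec = MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
		   MLX5_EQ_VEC_COMP_BASE;

	/* The result is handed to the PCI core as the upper bound; with this
	 * patch the core may clamp it down to the housekeeping-CPU count,
	 * which is the interaction questioned in the next message. */
	nvec = pci_alloc_irq_vectors(dev->pdev, MLX5_EQ_VEC_COMP_BASE + 1,
				     nvec, PCI_IRQ_MSIX);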
Peter Zijlstra Oct. 19, 2020, 11:11 a.m. UTC | #5
On Sun, Oct 18, 2020 at 02:14:46PM -0400, Nitesh Narayan Lal wrote:
> >> +	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> >> +
> >> +	/*
> >> +	 * If we have isolated CPUs for use by real-time tasks, to keep the
> >> +	 * latency overhead to a minimum, device-specific IRQ vectors are moved
> >> +	 * to the housekeeping CPUs from the userspace by changing their
> >> +	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> >> +	 * running out of IRQ vectors.
> >> +	 */
> >> +	if (hk_cpus < num_online_cpus()) {
> >> +		if (hk_cpus < min_vecs)
> >> +			max_vecs = min_vecs;
> >> +		else if (hk_cpus < max_vecs)
> >> +			max_vecs = hk_cpus;
> > is that:
> >
> > 		max_vecs = clamp(hk_cpus, min_vecs, max_vecs);
> 
> Yes, I think this will do.
> 
> >
> > Also, do we really need to have that conditional on hk_cpus <
> > num_online_cpus()? That is, why can't we do this unconditionally?
> 
> FWIU most of the drivers using this API already restrict the number of
> vectors based on num_online_cpus(); if we do it unconditionally we can
> unnecessarily duplicate that restriction for cases where we don't have any
> isolated CPUs.

Unnecessary duplication isn't really a concern here, this is a slow path.
What's important is code clarity.

> Also, different drivers seem to take different factors into consideration
> along with num_online_cpus() while finding the max_vecs to request, for
> example in the case of mlx5:
> MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
>                MLX5_EQ_VEC_COMP_BASE
> 
> Having hk_cpus < num_online_cpus() helps us ensure that we are only
> changing the behavior when we have isolated CPUs.
> 
> Does that make sense?

That seems to want to allocate N interrupts per CPU (plus some random
static amount, which seems weird, but whatever). This patch breaks that.

So I think it is important to figure out what that driver really wants
in the nohz_full case. If it wants to retain N interrupts per CPU, and
only reduce the number of CPUs, the proposed interface is wrong.

> > And what are the (desired) semantics vs hotplug? Using a cpumask without
> > excluding hotplug is racy.
> 
> The housekeeping_mask should still remain constant, shouldn't it?
> In any case, I can double-check this.

The goal is very much to have that dynamically configurable.
Thomas Gleixner Oct. 20, 2020, 2:16 p.m. UTC | #6
On Mon, Sep 28 2020 at 14:35, Nitesh Narayan Lal wrote:
>  
> +	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> +
> +	/*
> +	 * If we have isolated CPUs for use by real-time tasks, to keep the
> +	 * latency overhead to a minimum, device-specific IRQ vectors are moved
> +	 * to the housekeeping CPUs from the userspace by changing their
> +	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> +	 * running out of IRQ vectors.
> +	 */

This is not true for managed interrupts. The affinity of those
interrupts cannot be changed by user space.

> +	if (hk_cpus < num_online_cpus()) {
> +		if (hk_cpus < min_vecs)
> +			max_vecs = min_vecs;
> +		else if (hk_cpus < max_vecs)
> +			max_vecs = hk_cpus;
> +	}

So now, with that change, assume a 16-core machine (HT off for simplicity):

17 requested interrupts (1 general, 16 queues)

Managed interrupts will allocate:

   1  general interrupt which is freely movable by user space
   16 managed interrupts for queues (one per CPU)

This allows the driver to have 16 queues, i.e. one queue per CPU. These
interrupts are only used when an application on a CPU issues I/O.

With the above change this will result in:

   1  general interrupt which is freely movable by user space
   1  managed interrupt (possible affinity to all 16 CPUs, but routed
      to a housekeeping CPU as long as there is one online)

So the device is now limited to a single queue, which also affects the
housekeeping CPUs because they now have to share that single queue.

With larger machines this gets even worse.

So no. This needs way more thought for managed interrupts and you cannot
do that at the PCI layer. Only the affinity spreading mechanism can do
the right thing here.

Thanks,

        tglx
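
To see the numbers in Thomas's example, here is a standalone user-space
illustration of the proposed clamping arithmetic (the CPU counts are
assumptions picked to reproduce the scenario he describes, not values taken
from the patch):

#include <stdio.h>

int main(void)
{
	unsigned int online_cpus = 16;	/* 16 cores, HT off */
	unsigned int hk_cpus = 2;	/* assume 2 housekeeping CPUs */
	unsigned int min_vecs = 1;
	unsigned int max_vecs = 17;	/* 1 general + 16 queue vectors */

	/* Same logic as the proposed hunk in pci_alloc_irq_vectors_affinity(). */
	if (hk_cpus < online_cpus) {
		if (hk_cpus < min_vecs)
			max_vecs = min_vecs;
		else if (hk_cpus < max_vecs)
			max_vecs = hk_cpus;
	}

	/* Prints "max_vecs = 2": one general interrupt plus a single managed
	 * queue vector, i.e. the collapse from 16 queues to 1 described above. */
	printf("max_vecs = %u\n", max_vecs);
	return 0;
}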

Patch

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 30ae4ffda5c1..8c156867803c 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -23,6 +23,7 @@ 
 #include <linux/slab.h>
 #include <linux/irqdomain.h>
 #include <linux/of_irq.h>
+#include <linux/sched/isolation.h>
 
 #include "pci.h"
 
@@ -1191,8 +1192,25 @@  int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
 				   struct irq_affinity *affd)
 {
 	struct irq_affinity msi_default_affd = {0};
+	unsigned int hk_cpus;
 	int nvecs = -ENOSPC;
 
+	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
+
+	/*
+	 * If we have isolated CPUs for use by real-time tasks, to keep the
+	 * latency overhead to a minimum, device-specific IRQ vectors are moved
+	 * to the housekeeping CPUs from the userspace by changing their
+	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
+	 * running out of IRQ vectors.
+	 */
+	if (hk_cpus < num_online_cpus()) {
+		if (hk_cpus < min_vecs)
+			max_vecs = min_vecs;
+		else if (hk_cpus < max_vecs)
+			max_vecs = hk_cpus;
+	}
+
 	if (flags & PCI_IRQ_AFFINITY) {
 		if (!affd)
 			affd = &msi_default_affd;