diff mbox

[v3,3/4] irq: Allow multiple clients to register for irq affinity notification

Message ID 1409170479-29955-4-git-send-email-lina.iyer@linaro.org
State New
Headers show

Commit Message

Lina Iyer Aug. 27, 2014, 8:14 p.m. UTC
PM QoS and other idle frameworks can do a better job of addressing power
and performance requirements for a cpu, knowing the IRQs that are
affine to that cpu. If a performance request is placed against serving
the IRQ faster and if the IRQ is affine to a set of cpus, then setting
the performance requirements only on those cpus help save power on the
rest of the cpus. PM QoS framework is one such framework interested in
knowing the smp_affinity of an IRQ and the change notificiation in this
regard. QoS requests for the CPU_DMA_LATENCY constraint currently apply
to all cpus, but when attached to an IRQ, can be applied only to the set
of cpus that IRQ's smp_affinity is set to. This allows other cpus to
enter deeper sleep states to save power. More than one framework/driver
can be interested in such information.

The current implementation allows only a single notification callback
whenever the IRQ's SMP affinity is changed. Adding a second notification
punts the existing notifier function out of registration.  Add a list of
notifiers, allowing multiple clients to register for irq affinity
notifications.

The kref object associated with the struct irq_affinity_notify was used
to prevent the notifier object from being released if there is a pending
notification. It was incremented before the work item was scheduled and
was decremented when the notification was completed. If the kref count
was zero at the end of it, the release function gets a callback allowing
the module to release the irq_affinity_notify memory. This works well
for a single notification. When multiple clients are registered, no
single kref object can be used. Hence, the work function when scheduled,
will increase the kref count using the kref_get_unless_zero(), so if the
module had already unregistered the irq_affinity_notify object while the
work function was scheduled, it will not be notified.

Signed-off-by: Lina Iyer <lina.iyer@linaro.org>

Comments

Thomas Gleixner Aug. 27, 2014, 8:56 p.m. UTC | #1
On Wed, 27 Aug 2014, Lina Iyer wrote:

> PM QoS and other idle frameworks can do a better job of addressing power
> and performance requirements for a cpu, knowing the IRQs that are
> affine to that cpu. If a performance request is placed against serving
> the IRQ faster and if the IRQ is affine to a set of cpus, then setting
> the performance requirements only on those cpus help save power on the
> rest of the cpus. PM QoS framework is one such framework interested in
> knowing the smp_affinity of an IRQ and the change notificiation in this
> regard. QoS requests for the CPU_DMA_LATENCY constraint currently apply
> to all cpus, but when attached to an IRQ, can be applied only to the set
> of cpus that IRQ's smp_affinity is set to. This allows other cpus to
> enter deeper sleep states to save power. More than one framework/driver
> can be interested in such information.

All you are describing is the fact, that there is only a single
notifier possible right now, but you completely miss to describe WHY
you need multiple ones.

The existing notifier was introduced to tell the driver that the irq
affinity has changed, so it can move its buffers to the proper NUMA
node. So if that driver gets this information then it can tell the PM
QoS code that its affinity changed so that stuff can react
accordingly, right?

This is going nowhere unless you provide a proper usecase and
arguments why this is necessary. Handwaving and I want a pony
arguments are not sufficient.

Thanks,

	tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lina Iyer Sept. 2, 2014, 6:43 p.m. UTC | #2
On Wed, Aug 27 2014 at 14:56 -0600, Thomas Gleixner wrote:
>On Wed, 27 Aug 2014, Lina Iyer wrote:
>
>> PM QoS and other idle frameworks can do a better job of addressing power
>> and performance requirements for a cpu, knowing the IRQs that are
>> affine to that cpu. If a performance request is placed against serving
>> the IRQ faster and if the IRQ is affine to a set of cpus, then setting
>> the performance requirements only on those cpus help save power on the
>> rest of the cpus. PM QoS framework is one such framework interested in
>> knowing the smp_affinity of an IRQ and the change notificiation in this
>> regard. QoS requests for the CPU_DMA_LATENCY constraint currently apply
>> to all cpus, but when attached to an IRQ, can be applied only to the set
>> of cpus that IRQ's smp_affinity is set to. This allows other cpus to
>> enter deeper sleep states to save power. More than one framework/driver
>> can be interested in such information.
>
>All you are describing is the fact, that there is only a single
>notifier possible right now, but you completely miss to describe WHY
>you need multiple ones.
>
>The existing notifier was introduced to tell the driver that the irq
>affinity has changed, so it can move its buffers to the proper NUMA
>node. So if that driver gets this information then it can tell the PM
>QoS code that its affinity changed so that stuff can react
>accordingly, right?
>
With the new PM QoS changes, multiple drivers would now be interested in
knowing the smp_affinity changes of the same IRQ. PM QoS abstracts the
notifier in the framework, so individual drivers dont have to register
for notification themselves and handle affinity notifications. But PM
QoS needs to coexist with NUMA and other arch drivers that need to
modify based on their arch specific needs upon affinity change
notifications.
Modifying the IRQ notifications to list, benefits having a simpler
mechanism to notify all arch drivers and frameworks like PM QoS.


>This is going nowhere unless you provide a proper usecase and
>arguments why this is necessary. Handwaving and I want a pony
>arguments are not sufficient.
>
>Thanks,
>
>	tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Gleixner Sept. 2, 2014, 8:56 p.m. UTC | #3
On Tue, 2 Sep 2014, Lina Iyer wrote:
> On Wed, Aug 27 2014 at 14:56 -0600, Thomas Gleixner wrote:
> > On Wed, 27 Aug 2014, Lina Iyer wrote:
> > 
> > All you are describing is the fact, that there is only a single
> > notifier possible right now, but you completely miss to describe WHY
> > you need multiple ones.
> > 
> > The existing notifier was introduced to tell the driver that the irq
> > affinity has changed, so it can move its buffers to the proper NUMA
> > node. So if that driver gets this information then it can tell the PM
> > QoS code that its affinity changed so that stuff can react
> > accordingly, right?
> > 
> With the new PM QoS changes, multiple drivers would now be interested in
> knowing the smp_affinity changes of the same IRQ. PM QoS abstracts the

Again, you fail to tell me WHY multiple drivers are interested and
which drivers are interested and what they do with that information.

> notifier in the framework, so individual drivers dont have to register
> for notification themselves and handle affinity notifications. But PM
> QoS needs to coexist with NUMA and other arch drivers that need to
> modify based on their arch specific needs upon affinity change
> notifications.

Lots of unspecified blurb. What has NUMA to do with other arch
drivers? Are you actually understanding what this is all about?

> Modifying the IRQ notifications to list, benefits having a simpler
> mechanism to notify all arch drivers and frameworks like PM QoS.

Painting my bikeshed blue benefits all neighbours and institutions
like the Royal Navy, right?

Lina, this is leading nowhere. You just make completely unspecified
claims and try to push your solution to a not explained problem
through. That's not how it works, at least not with me.

So please sit down and write up a proper description of the problem
you are trying to solve.

All I can see from your postings so far is:

    1) You want to use the notification for PM QoS

    2) You run into conflicts with the existing notifiers

    3) You want to solve that conflict

What you are not telling is:

    1) In which way is PM QoS going to use that notifier

    2) How does that conflict with the existing users. And we only
       have two of them:

       - InfiniPath 7322 chip driver

       - cpu_rmap driver, which is used by

       	- solarflare NIC driver
	- mellanox mlx4 driver
	- cisco enic driver

	So how is that conflicting with your ARM centric PM QoS work?

	Please provide proper debug info about such a conflict, if it
	exists at all.

    3) Which arch drivers are involved?

       So far none, at least not in mainline according to
       "grep irq_set_affinity_notifier arch/".

       If you have out of tree code which uses this, then we want to
       see it first to understand what the arch driver is trying to
       solve by using that notification mechanism.

The more I think about what you are not telling the more I get the
feeling that this whole PM QoS thing you try to do is a complete
design disaster.

From your changelog:

 "PM QoS and other idle frameworks can do a better job of addressing
  power and performance requirements for a cpu, knowing the IRQs that
  are affine to that cpu."

I agree with that. PM might make better decisions if it knows which
activities are targeted at a particular cpu.

So someone figured that out and you got tasked to implement it. So you
looked for a way to get to this information and you found the existing
notifier and decided to (ab)use it for your purpose. But you did not
spend a split second to figure out why this is wrong and even useless.

At the point where PM QoS or whatever decides to use that information,
it does not know at all what the current affinity mask is. So in your
patch 4/4, which I ignored so far you do:

+     	   	 struct irq_desc *desc = irq_to_desc(req->irq);
+ 		 struct cpumask *mask = desc->irq_data.affinity;

WTF?

You did not even have the courtesy to use the proper functions for
this. And of course you access all of this unlocked. So how is that
supposed to work with a concurrent affinity update? Not at
all.

Fiddling with irq_desc is a NONO: Consult your preferred search
machine or just the kernel changelogs about the consequences.

Looking at the other patches you really seem to have missed out on
some other NONOs:

 struct pm_qos_request {
+	enum pm_qos_req_type type;
+	struct cpumask cpus_affine;
+	/* Internal structure members */
 	struct plist_node node;
 	int pm_qos_class;
 	struct delayed_work work; /* for pm_qos_update_request_timeout */
@@ -80,6 +90,7 @@ enum pm_qos_type {
 struct pm_qos_constraints {
 	struct plist_head list;
 	s32 target_value; /* Do not change to 64 bit */
+	s32 target_per_cpu[NR_CPUS];
 	s32 default_value;
 	s32 no_constraint_value;
 	enum pm_qos_type type;

Go figure how well "struct cpumask" and a s32 array sizeof NR_CPUS
bodes on kernels compiled for NR_CPUS=8192 or larger.

Just get it: You are hacking core code, not some random ARM SoC
driver.

That aside, your design or the lack thereof sucks. While your hackery
might work somehow for the PM QoS problem at hand - ignoring the hard
to debug race window of the concurrent affinity update - it's
completely useless for other PM related decisions despite your claims
that its a benefit for all of that.

How would a general "keep track of the targets of all interrupts in
the system" mechanism make use of this? Not at all. Simply because it
has no idea which interrupts are requested at any given time. Sure
your answer will be "add another notifier" to solve that problem. But
that's the completely wrong answer because notifiers are a PITA and
should go away. Here is Linus' POV of notifiers and I really agree
with it:

 "Notifiers are a disgrace, and almost all of them are a major design
  mistake. They all have locking problems, the introduce internal
  arbitrary API's that are hard to fix later (because you have random
  people who decided to hook into them, which is the whole *point* of
  those notifier chains)."

And so I really fight that notifier chain tooth and nails until someone
provides me a proper argument WHY it cant be solved with proper
explicit and understandable mechanisms.

Back to your PM QoS trainwreck:

I'm really impressed that you did not notice at all how inaccurate and
therefor useless the affinity information which you are consulting is.

Did it ever hit even the outer hemisphere of your brain, that
irqdata.affinity is the affinity which was requested by the caller of
an affinity setter function and not the effective affinity?

I bet it did not, simply because you did not pay any attention and
bother to analyze that proper. Otherwise you would have tried to hack
around this as well in some disgusting way.

Really great savings, if the affinity is left at default to all cpus,
but the effective affinity is a single cpu due to irq routing
restrictions in the hardware.

Surely you tested against single CPU affinities and achieved great
results, but that's not a generic and useful solution.

Thanks,

	tglx







--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lina Iyer Sept. 25, 2014, 3:43 p.m. UTC | #4
On Tue, Sep 02 2014 at 14:56 -0600, Thomas Gleixner wrote:
>On Tue, 2 Sep 2014, Lina Iyer wrote:
>> On Wed, Aug 27 2014 at 14:56 -0600, Thomas Gleixner wrote:
>> > On Wed, 27 Aug 2014, Lina Iyer wrote:
>> >
>> > All you are describing is the fact, that there is only a single
>> > notifier possible right now, but you completely miss to describe WHY
>> > you need multiple ones.
>> >
>> > The existing notifier was introduced to tell the driver that the irq
>> > affinity has changed, so it can move its buffers to the proper NUMA
>> > node. So if that driver gets this information then it can tell the PM
>> > QoS code that its affinity changed so that stuff can react
>> > accordingly, right?
>> >
>> With the new PM QoS changes, multiple drivers would now be interested in
>> knowing the smp_affinity changes of the same IRQ. PM QoS abstracts the
>
>Again, you fail to tell me WHY multiple drivers are interested and
>which drivers are interested and what they do with that information.
>
>> notifier in the framework, so individual drivers dont have to register
>> for notification themselves and handle affinity notifications. But PM
>> QoS needs to coexist with NUMA and other arch drivers that need to
>> modify based on their arch specific needs upon affinity change
>> notifications.
>
>Lots of unspecified blurb. What has NUMA to do with other arch
>drivers? Are you actually understanding what this is all about?
>
>> Modifying the IRQ notifications to list, benefits having a simpler
>> mechanism to notify all arch drivers and frameworks like PM QoS.
>
>Painting my bikeshed blue benefits all neighbours and institutions
>like the Royal Navy, right?
Sorry about not being very clear in my responses.

>
>Lina, this is leading nowhere. You just make completely unspecified
>claims and try to push your solution to a not explained problem
>through. That's not how it works, at least not with me.
>
>So please sit down and write up a proper description of the problem
>you are trying to solve.
>
>All I can see from your postings so far is:
>
>    1) You want to use the notification for PM QoS
>
>    2) You run into conflicts with the existing notifiers
>
>    3) You want to solve that conflict
>
Here is the premise of my patchset - 

When a cpuidle governor selects the idle state of the cpu, it uses
the PM QoS CPU_DMA_LATENCY constraint value in determining the
acceptable tolerance in exit latency of the cpu, before choosing the low
power state for the cpu. Drivers specify the PM QoS request for this
constraint, indicating their tolerance. The aggregated QoS request is
used in determining the state of each cpu.
Now, when drivers make a request, they usually have a device IRQ in
mind, which generally needs to be addressed within the stipulated time
after the interrupt fires. By default, the IRQs' have affinity towards
many cpus. However, when used with a IRQ balancer, that sets the
affinity only to a subset of the available cpus, it seems inefficient
that all cpus obey the QoS requirement in terms of power.
Say, when an USB driver specifies a 2 millisecond QoS constraint, it
prevents all the cpus in the SoC from entering any power down state,
which need more than 2 ms to be effective.  Thats a lot of leakage
current that could be saved. The IRQ in many instances would just
wake up cpu0, while other cpus remain in high leakage idle states. This
information is critical to the governor to determine the best idle state
it could allow the cpu to be in.

So the change here is to specify QoS requests against a subset of cpus,
which when aggregated, the idle governor can distill out the QoS
requirement only for that cpu. When requests are made with a set of cpus
as the intended target for the QoS value, the PM QoS framework
calculates the QoS value for each cpu. The governor can then be modified
to request the QoS CPU_DMA_LATENCY constraint on an individual cpu and
can choose the deepest idle state that would respect the requirement.

By default QoS requests do not have a preference for cpus and that is
the behavior with these patches too. Using the per-cpu PM QoS, the
drivers can now specify a set of cpus that need to obey the QoS
requests, if they know the cpus or they would specify an IRQ, if the
request is to avoid latency in handling the device IRQ. The driver is
not expected to handle the IRQ's smp_affinity changes and would be done
by the PM QoS framework on behalf of the driver. This is not enforced on
the driver that they need to specify an IRQ or a subset of cpus for
their request. If there is no IRQ or cpu preference, then PM QoS
defaults to applying the request for all cpus. I would expect many of
the existing drivers still want this behavior.

In the case a driver is aware that the QoS request to guarantee a
certain performance in handling the device IRQ, the request can be made
to track the IRQ. PM QoS registers for a smp_affinity change and does
the best judgement based on the affinity. The IRQ may still fire on any
of the cpus in the affinity mask which would make it comparable to the
existing QoS behavior and no power saving is achieved.

>What you are not telling is:
>
>    1) In which way is PM QoS going to use that notifier
>
>    2) How does that conflict with the existing users. And we only
>       have two of them:
>
>       - InfiniPath 7322 chip driver
>
>       - cpu_rmap driver, which is used by
>
>       	- solarflare NIC driver
>	- mellanox mlx4 driver
>	- cisco enic driver
>
>	So how is that conflicting with your ARM centric PM QoS work?
>
>	Please provide proper debug info about such a conflict, if it
>	exists at all.
>
>    3) Which arch drivers are involved?
>
>       So far none, at least not in mainline according to
>       "grep irq_set_affinity_notifier arch/".
>
>       If you have out of tree code which uses this, then we want to
>       see it first to understand what the arch driver is trying to
>       solve by using that notification mechanism.
>

I assumed that PM QoS is also used in conjunction with NUMA. I did
notice a few systems that used the smp_affinity notifications. My
assumption was such systems may have drivers that could prefer a per-cpu
QoS request. per-cpu PM QoS is not an ARM specific change, though this
benefits quite a bit in a power constrained environment. An octa core
system with a low power or a high performance distinction amongst the
cores, clearly benefits from this feature, when used with an IRQ
balancer. 

It didnt seem right me that PM QoS gets notified from InifiniPath and NIC
drivers, while others have PM QoS registering directly.

As far my usecases are concerned in an ARM chipset, I did not see any
conflict with the existing users of the smp affinity notifications. But,
I assume the systems may want to use per-cpu PM QoS for some of their
IRQs on a later date. Hence the change to a notification mechanism.

>The more I think about what you are not telling the more I get the
>feeling that this whole PM QoS thing you try to do is a complete
>design disaster.
>
>From your changelog:
>
> "PM QoS and other idle frameworks can do a better job of addressing
>  power and performance requirements for a cpu, knowing the IRQs that
>  are affine to that cpu."
>
>I agree with that. PM might make better decisions if it knows which
>activities are targeted at a particular cpu.
>
>So someone figured that out and you got tasked to implement it. So you
>looked for a way to get to this information and you found the existing
>notifier and decided to (ab)use it for your purpose. But you did not
>spend a split second to figure out why this is wrong and even useless.
>
>At the point where PM QoS or whatever decides to use that information,
>it does not know at all what the current affinity mask is. So in your
>patch 4/4, which I ignored so far you do:
>
>+     	   	 struct irq_desc *desc = irq_to_desc(req->irq);
>+ 		 struct cpumask *mask = desc->irq_data.affinity;
>
>WTF?
>
Sorry, bad access on my part. I will fix it.

>You did not even have the courtesy to use the proper functions for
>this. And of course you access all of this unlocked. So how is that
>supposed to work with a concurrent affinity update? Not at
>all.
>
>Fiddling with irq_desc is a NONO: Consult your preferred search
>machine or just the kernel changelogs about the consequences.
>
>Looking at the other patches you really seem to have missed out on
>some other NONOs:
>
> struct pm_qos_request {
>+	enum pm_qos_req_type type;
>+	struct cpumask cpus_affine;
>+	/* Internal structure members */
> 	struct plist_node node;
> 	int pm_qos_class;
> 	struct delayed_work work; /* for pm_qos_update_request_timeout */
>@@ -80,6 +90,7 @@ enum pm_qos_type {
> struct pm_qos_constraints {
> 	struct plist_head list;
> 	s32 target_value; /* Do not change to 64 bit */
>+	s32 target_per_cpu[NR_CPUS];
> 	s32 default_value;
> 	s32 no_constraint_value;
> 	enum pm_qos_type type;
>
>Go figure how well "struct cpumask" and a s32 array sizeof NR_CPUS
>bodes on kernels compiled for NR_CPUS=8192 or larger.
>
Sorry, I did not imagine the possiblity of these failures.
I figured that I would not use up a per-cpu variable, since the access
to the values may be from one cpu where the code is being processed, but
seems like there are quite a few merits, that I overlooked.

>Just get it: You are hacking core code, not some random ARM SoC
>driver.
>
>That aside, your design or the lack thereof sucks. While your hackery
>might work somehow for the PM QoS problem at hand - ignoring the hard
>to debug race window of the concurrent affinity update - it's
>completely useless for other PM related decisions despite your claims
>that its a benefit for all of that.
Yes, there is a race condition in PM QoS, where the notification between
the time, the request is added and the notification is registered.
AFAIK, the behavior in the irq/manage.c seems to be held the same. If
there is a concurrent change in notification, the schedule_work would
not reschedule and possibly not send a notification, which happens today
as well. I will revisit the change and look, if I can make it better.

>
>How would a general "keep track of the targets of all interrupts in
>the system" mechanism make use of this? 
Sorry, I do not understand your question.
PM QoS is only interested in the IRQs specified in the QoS request. If
there are no requests that need to be associated with an IRQ, then PM
QoS will not register for an affinity change notification.

>Not at all. Simply because it
>has no idea which interrupts are requested at any given time. Sure
>your answer will be "add another notifier" to solve that problem. But
>that's the completely wrong answer because notifiers are a PITA and
>should go away. Here is Linus' POV of notifiers and I really agree
>with it:
>
> "Notifiers are a disgrace, and almost all of them are a major design
>  mistake. They all have locking problems, the introduce internal
>  arbitrary API's that are hard to fix later (because you have random
>  people who decided to hook into them, which is the whole *point* of
>  those notifier chains)."
>
>And so I really fight that notifier chain tooth and nails until someone
>provides me a proper argument WHY it cant be solved with proper
>explicit and understandable mechanisms.
>
FWIW, I do not like notifications either. I feel it adds unpredictablity
to the system. If I didnt think, that NUMA and other systems might
benefit from per-cpu PM QoS feature, I would not have added this
notification mechanism and instead registered directly. I admit to
having not much of a knowledge about NUMA, but atleast PM QoS seems to
be architecture agnostic and all systems could use this generic kernel
feature.

Also, the notification added is restricted to PM QoS framework. The
smp-affinity notifications are not relayed to the driver making the PM
QoS request.

The way it is with per-cpu PM QoS is that the framework registers for
notification, *only* if there is a QoS request tagged with an IRQ. When
a request does not specify an IRQ, the QoS framework does not register
for any IRQ affinity notifications. Also, its possible that multiple
drivers may have a QoS request and could tag the same IRQ, in which
case, the notfication list helps. But surely, I can use other methods in
PM QoS to do the same thing, without the need for a notification list in
the IRQ framework. But that still doesnt solve for other systems that
may have NUMA or NIC drivers using per-cpu PM QoS.

When notified PM QoS updates its request list and amends the request
affinity and recalculates the target value. In this patch, I currently
update requests when the irq notifications are sent out.

>Back to your PM QoS trainwreck:
>
>I'm really impressed that you did not notice at all how inaccurate and
>therefor useless the affinity information which you are consulting is.
>
>Did it ever hit even the outer hemisphere of your brain, that
>irqdata.affinity is the affinity which was requested by the caller of
>an affinity setter function and not the effective affinity?
>
Yes and the effective affinity may not match up to irqdata.affinity,
but the only reliable information we have is irqdata.affinity. We are
trying to do the best with the information available. It still is better
than having no information about affinity.

>I bet it did not, simply because you did not pay any attention and
>bother to analyze that proper. Otherwise you would have tried to hack
>around this as well in some disgusting way.
>
>Really great savings, if the affinity is left at default to all cpus,
>but the effective affinity is a single cpu due to irq routing
>restrictions in the hardware.
>
>Surely you tested against single CPU affinities and achieved great
>results, but that's not a generic and useful solution.

Here is what needs to happen, before the power savings can be
beneficial.
	- Drivers specifying PM QoS request update their drivers to
	  track the request against an IRQ
	- SoCs that have a way of constraining the IRQ affinity to a
	  subset of cpus. Or may use an IRQ balancer or may be design
	  limited to have IRQs affine to a set of cpus.
	- Menu and other cpuidle governors consider per-cpu PM QoS in
	  determining the idle state of the cpu.

We may not see any benefit without all these in play on a given system.
Like you say, with single CPU affinity and with these changes, we do see
quite a bit of power savings from the performance cores.

Obviously, I seem to be doing it wrong with this change. I am ready to
amend and start over, if there are better ways to get affinity
information.

Thanks,
Lina

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lina Iyer Sept. 25, 2014, 3:50 p.m. UTC | #5
Resending...
Hoping to retain the thread information, this time.

On Tue, Sep 02 2014 at 14:56 -0600, Thomas Gleixner wrote:
>On Tue, 2 Sep 2014, Lina Iyer wrote:
>> On Wed, Aug 27 2014 at 14:56 -0600, Thomas Gleixner wrote:
>> > On Wed, 27 Aug 2014, Lina Iyer wrote:
>> >
>> > All you are describing is the fact, that there is only a single
>> > notifier possible right now, but you completely miss to describe WHY
>> > you need multiple ones.
>> >
>> > The existing notifier was introduced to tell the driver that the irq
>> > affinity has changed, so it can move its buffers to the proper NUMA
>> > node. So if that driver gets this information then it can tell the PM
>> > QoS code that its affinity changed so that stuff can react
>> > accordingly, right?
>> >
>> With the new PM QoS changes, multiple drivers would now be interested in
>> knowing the smp_affinity changes of the same IRQ. PM QoS abstracts the
>
>Again, you fail to tell me WHY multiple drivers are interested and
>which drivers are interested and what they do with that information.
>
>> notifier in the framework, so individual drivers dont have to register
>> for notification themselves and handle affinity notifications. But PM
>> QoS needs to coexist with NUMA and other arch drivers that need to
>> modify based on their arch specific needs upon affinity change
>> notifications.
>
>Lots of unspecified blurb. What has NUMA to do with other arch
>drivers? Are you actually understanding what this is all about?
>
>> Modifying the IRQ notifications to list, benefits having a simpler
>> mechanism to notify all arch drivers and frameworks like PM QoS.
>
>Painting my bikeshed blue benefits all neighbours and institutions
>like the Royal Navy, right?
Sorry about not being very clear in my responses.

>
>Lina, this is leading nowhere. You just make completely unspecified
>claims and try to push your solution to a not explained problem
>through. That's not how it works, at least not with me.
>
>So please sit down and write up a proper description of the problem
>you are trying to solve.
>
>All I can see from your postings so far is:
>
>    1) You want to use the notification for PM QoS
>
>    2) You run into conflicts with the existing notifiers
>
>    3) You want to solve that conflict
>
Here is the premise of my patchset - 

When a cpuidle governor selects the idle state of the cpu, it uses
the PM QoS CPU_DMA_LATENCY constraint value in determining the
acceptable tolerance in exit latency of the cpu, before choosing the low
power state for the cpu. Drivers specify the PM QoS request for this
constraint, indicating their tolerance. The aggregated QoS request is
used in determining the state of each cpu.
Now, when drivers make a request, they usually have a device IRQ in
mind, which generally needs to be addressed within the stipulated time
after the interrupt fires. By default, the IRQs' have affinity towards
many cpus. However, when used with a IRQ balancer, that sets the
affinity only to a subset of the available cpus, it seems inefficient
that all cpus obey the QoS requirement in terms of power.
Say, when an USB driver specifies a 2 millisecond QoS constraint, it
prevents all the cpus in the SoC from entering any power down state,
which need more than 2 ms to be effective.  Thats a lot of leakage
current that could be saved. The IRQ in many instances would just
wake up cpu0, while other cpus remain in high leakage idle states. This
information is critical to the governor to determine the best idle state
it could allow the cpu to be in.

So the change here is to specify QoS requests against a subset of cpus,
which when aggregated, the idle governor can distill out the QoS
requirement only for that cpu. When requests are made with a set of cpus
as the intended target for the QoS value, the PM QoS framework
calculates the QoS value for each cpu. The governor can then be modified
to request the QoS CPU_DMA_LATENCY constraint on an individual cpu and
can choose the deepest idle state that would respect the requirement.

By default QoS requests do not have a preference for cpus and that is
the behavior with these patches too. Using the per-cpu PM QoS, the
drivers can now specify a set of cpus that need to obey the QoS
requests, if they know the cpus or they would specify an IRQ, if the
request is to avoid latency in handling the device IRQ. The driver is
not expected to handle the IRQ's smp_affinity changes and would be done
by the PM QoS framework on behalf of the driver. This is not enforced on
the driver that they need to specify an IRQ or a subset of cpus for
their request. If there is no IRQ or cpu preference, then PM QoS
defaults to applying the request for all cpus. I would expect many of
the existing drivers still want this behavior.

In the case a driver is aware that the QoS request to guarantee a
certain performance in handling the device IRQ, the request can be made
to track the IRQ. PM QoS registers for a smp_affinity change and does
the best judgement based on the affinity. The IRQ may still fire on any
of the cpus in the affinity mask which would make it comparable to the
existing QoS behavior and no power saving is achieved.

>What you are not telling is:
>
>    1) In which way is PM QoS going to use that notifier
>
>    2) How does that conflict with the existing users. And we only
>       have two of them:
>
>       - InfiniPath 7322 chip driver
>
>       - cpu_rmap driver, which is used by
>
>       	- solarflare NIC driver
>	- mellanox mlx4 driver
>	- cisco enic driver
>
>	So how is that conflicting with your ARM centric PM QoS work?
>
>	Please provide proper debug info about such a conflict, if it
>	exists at all.
>
>    3) Which arch drivers are involved?
>
>       So far none, at least not in mainline according to
>       "grep irq_set_affinity_notifier arch/".
>
>       If you have out of tree code which uses this, then we want to
>       see it first to understand what the arch driver is trying to
>       solve by using that notification mechanism.
>

I assumed that PM QoS is also used in conjunction with NUMA. I did
notice a few systems that used the smp_affinity notifications. My
assumption was such systems may have drivers that could prefer a per-cpu
QoS request. per-cpu PM QoS is not an ARM specific change, though this
benefits quite a bit in a power constrained environment. An octa core
system with a low power or a high performance distinction amongst the
cores, clearly benefits from this feature, when used with an IRQ
balancer. 

It didnt seem right me that PM QoS gets notified from InifiniPath and NIC
drivers, while others have PM QoS registering directly.

As far my usecases are concerned in an ARM chipset, I did not see any
conflict with the existing users of the smp affinity notifications. But,
I assume the systems may want to use per-cpu PM QoS for some of their
IRQs on a later date. Hence the change to a notification mechanism.

>The more I think about what you are not telling the more I get the
>feeling that this whole PM QoS thing you try to do is a complete
>design disaster.
>
>From your changelog:
>
> "PM QoS and other idle frameworks can do a better job of addressing
>  power and performance requirements for a cpu, knowing the IRQs that
>  are affine to that cpu."
>
>I agree with that. PM might make better decisions if it knows which
>activities are targeted at a particular cpu.
>
>So someone figured that out and you got tasked to implement it. So you
>looked for a way to get to this information and you found the existing
>notifier and decided to (ab)use it for your purpose. But you did not
>spend a split second to figure out why this is wrong and even useless.
>
>At the point where PM QoS or whatever decides to use that information,
>it does not know at all what the current affinity mask is. So in your
>patch 4/4, which I ignored so far you do:
>
>+     	   	 struct irq_desc *desc = irq_to_desc(req->irq);
>+ 		 struct cpumask *mask = desc->irq_data.affinity;
>
>WTF?
>
Sorry, bad access on my part. I will fix it.

>You did not even have the courtesy to use the proper functions for
>this. And of course you access all of this unlocked. So how is that
>supposed to work with a concurrent affinity update? Not at
>all.
>
>Fiddling with irq_desc is a NONO: Consult your preferred search
>machine or just the kernel changelogs about the consequences.
>
>Looking at the other patches you really seem to have missed out on
>some other NONOs:
>
> struct pm_qos_request {
>+	enum pm_qos_req_type type;
>+	struct cpumask cpus_affine;
>+	/* Internal structure members */
> 	struct plist_node node;
> 	int pm_qos_class;
> 	struct delayed_work work; /* for pm_qos_update_request_timeout */
>@@ -80,6 +90,7 @@ enum pm_qos_type {
> struct pm_qos_constraints {
> 	struct plist_head list;
> 	s32 target_value; /* Do not change to 64 bit */
>+	s32 target_per_cpu[NR_CPUS];
> 	s32 default_value;
> 	s32 no_constraint_value;
> 	enum pm_qos_type type;
>
>Go figure how well "struct cpumask" and a s32 array sizeof NR_CPUS
>bodes on kernels compiled for NR_CPUS=8192 or larger.
>
Sorry, I did not imagine the possiblity of these failures.
I figured that I would not use up a per-cpu variable, since the access
to the values may be from one cpu where the code is being processed, but
seems like there are quite a few merits, that I overlooked.

>Just get it: You are hacking core code, not some random ARM SoC
>driver.
>
>That aside, your design or the lack thereof sucks. While your hackery
>might work somehow for the PM QoS problem at hand - ignoring the hard
>to debug race window of the concurrent affinity update - it's
>completely useless for other PM related decisions despite your claims
>that its a benefit for all of that.
Yes, there is a race condition in PM QoS, where the notification between
the time, the request is added and the notification is registered.
AFAIK, the behavior in the irq/manage.c seems to be held the same. If
there is a concurrent change in notification, the schedule_work would
not reschedule and possibly not send a notification, which happens today
as well. I will revisit the change and look, if I can make it better.

>
>How would a general "keep track of the targets of all interrupts in
>the system" mechanism make use of this? 
Sorry, I do not understand your question.
PM QoS is only interested in the IRQs specified in the QoS request. If
there are no requests that need to be associated with an IRQ, then PM
QoS will not register for an affinity change notification.

>Not at all. Simply because it
>has no idea which interrupts are requested at any given time. Sure
>your answer will be "add another notifier" to solve that problem. But
>that's the completely wrong answer because notifiers are a PITA and
>should go away. Here is Linus' POV of notifiers and I really agree
>with it:
>
> "Notifiers are a disgrace, and almost all of them are a major design
>  mistake. They all have locking problems, the introduce internal
>  arbitrary API's that are hard to fix later (because you have random
>  people who decided to hook into them, which is the whole *point* of
>  those notifier chains)."
>
>And so I really fight that notifier chain tooth and nails until someone
>provides me a proper argument WHY it cant be solved with proper
>explicit and understandable mechanisms.
>
FWIW, I do not like notifications either. I feel it adds unpredictablity
to the system. If I didnt think, that NUMA and other systems might
benefit from per-cpu PM QoS feature, I would not have added this
notification mechanism and instead registered directly. I admit to
having not much of a knowledge about NUMA, but atleast PM QoS seems to
be architecture agnostic and all systems could use this generic kernel
feature.

Also, the notification added is restricted to PM QoS framework. The
smp-affinity notifications are not relayed to the driver making the PM
QoS request.

The way it is with per-cpu PM QoS is that the framework registers for
notification, *only* if there is a QoS request tagged with an IRQ. When
a request does not specify an IRQ, the QoS framework does not register
for any IRQ affinity notifications. Also, its possible that multiple
drivers may have a QoS request and could tag the same IRQ, in which
case, the notfication list helps. But surely, I can use other methods in
PM QoS to do the same thing, without the need for a notification list in
the IRQ framework. But that still doesnt solve for other systems that
may have NUMA or NIC drivers using per-cpu PM QoS.

When notified PM QoS updates its request list and amends the request
affinity and recalculates the target value. In this patch, I currently
update requests when the irq notifications are sent out.

>Back to your PM QoS trainwreck:
>
>I'm really impressed that you did not notice at all how inaccurate and
>therefor useless the affinity information which you are consulting is.
>
>Did it ever hit even the outer hemisphere of your brain, that
>irqdata.affinity is the affinity which was requested by the caller of
>an affinity setter function and not the effective affinity?
>
Yes and the effective affinity may not match up to irqdata.affinity,
but the only reliable information we have is irqdata.affinity. We are
trying to do the best with the information available. It still is better
than having no information about affinity.

>I bet it did not, simply because you did not pay any attention and
>bother to analyze that proper. Otherwise you would have tried to hack
>around this as well in some disgusting way.
>
>Really great savings, if the affinity is left at default to all cpus,
>but the effective affinity is a single cpu due to irq routing
>restrictions in the hardware.
>
>Surely you tested against single CPU affinities and achieved great
>results, but that's not a generic and useful solution.

Here is what needs to happen, before the power savings can be
beneficial.
	- Drivers specifying PM QoS request update their drivers to
	  track the request against an IRQ
	- SoCs that have a way of constraining the IRQ affinity to a
	  subset of cpus. Or may use an IRQ balancer or may be design
	  limited to have IRQs affine to a set of cpus.
	- Menu and other cpuidle governors consider per-cpu PM QoS in
	  determining the idle state of the cpu.

We may not see any benefit without all these in play on a given system.
Like you say, with single CPU affinity and with these changes, we do see
quite a bit of power savings from the performance cores.

Obviously, I seem to be doing it wrong with this change. I am ready to
amend and start over, if there are better ways to get affinity
information.

Thanks,
Lina


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Kevin Hilman Sept. 25, 2014, 8:35 p.m. UTC | #6
Hi Thomas,

Thomas Gleixner <tglx@linutronix.de> writes:

[...]

> All I can see from your postings so far is:
>
>     1) You want to use the notification for PM QoS
>
>     2) You run into conflicts with the existing notifiers
>
>     3) You want to solve that conflict
>
> What you are not telling is:
>
>     1) In which way is PM QoS going to use that notifier
>
>     2) How does that conflict with the existing users. And we only
>        have two of them:
>
>        - InfiniPath 7322 chip driver
>
>        - cpu_rmap driver, which is used by
>        	- solarflare NIC driver
> 		- mellanox mlx4 driver
>	 	- cisco enic driver
>
> 	So how is that conflicting with your ARM centric PM QoS work?
>
> 	Please provide proper debug info about such a conflict, if it
> 	exists at all.

Ignoring any new users of the affinity notifiers for now, aren't the
existing users you list above conflicting with each other already?

Maybe I'm missing something, or maybe we're just lucky and nobody uses
them together, but irq_set_affinity_notifier() only allows a single
notifier to be registered at any given time.  So if you had a system
with the inifinipath driver and a cpu_rmap user, only the one who's
lucky enough to be the last to call irq_set_affinity_notifier() will
ever be notified.

So I think the main question for Lina is: assuming we need more than one
co-existing driver/subsystem to get IRQ affinity notifiers (and we
already seem to have them, but are lucky they don't currently co-exist),
how should the IRQ core be changed to support that?  Is the list
approach proposed in $SUBJECT path reasonable?

Kevin
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Gleixner Sept. 26, 2014, 9:29 a.m. UTC | #7
On Thu, 25 Sep 2014, Kevin Hilman wrote:
> Maybe I'm missing something, or maybe we're just lucky and nobody uses
> them together, but irq_set_affinity_notifier() only allows a single
> notifier to be registered at any given time.  So if you had a system

A single notifier per irq .....
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Gleixner Oct. 8, 2014, 3:03 p.m. UTC | #8
On Thu, 25 Sep 2014, Lina Iyer wrote:
> > How would a general "keep track of the targets of all interrupts in
> > the system" mechanism make use of this? 
> Sorry, I do not understand your question.
> PM QoS is only interested in the IRQs specified in the QoS request. If
> there are no requests that need to be associated with an IRQ, then PM
> QoS will not register for an affinity change notification.

Right, and I really hate the whole per irq notifier. It's a rats nest
of life time issues and other problems.
 
It also does not tell you whether an irq is disabled, reenabled or
removed, which will change the qos constraints as well unless you
plaster all drivers with updates to qos for those cases.

So what about adding a qos field to irq_data itself, have a function
to update it and let the irq core keep track of the per cpu irq
relevant qos constraints and provide an evaluation function or a
notifier for the PM/idle code?

That's going to need some serious thought as well, but it should avoid
most of the nasty notifier and lifetime issue which the per irq
notifiers provide.

Thoughts?

	tglx





--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lina Iyer Oct. 10, 2014, 3:11 p.m. UTC | #9
On Wed, Oct 08 2014 at 09:03 -0600, Thomas Gleixner wrote:
>On Thu, 25 Sep 2014, Lina Iyer wrote:
>> > How would a general "keep track of the targets of all interrupts in
>> > the system" mechanism make use of this?
>> Sorry, I do not understand your question.
>> PM QoS is only interested in the IRQs specified in the QoS request. If
>> there are no requests that need to be associated with an IRQ, then PM
>> QoS will not register for an affinity change notification.
>
>Right, and I really hate the whole per irq notifier. It's a rats nest
>of life time issues and other problems.
>
>It also does not tell you whether an irq is disabled, reenabled or
>removed, which will change the qos constraints as well unless you
>plaster all drivers with updates to qos for those cases.
>
>So what about adding a qos field to irq_data itself, have a function
>to update it and let the irq core keep track of the per cpu irq
>relevant qos constraints and provide an evaluation function or a
>notifier for the PM/idle code?
If that isnt intrusive in the IRQ core, then we can make it work for PM
QoS. The issue that I am concerned is that, it might result in back and
forth between IRQ and PM QoS frameworks. If that doesnt happen, then we
are good with this approach.
>
>That's going to need some serious thought as well, but it should avoid
>most of the nasty notifier and lifetime issue which the per irq
>notifiers provide.
Sure. I will look into this.
>
>Thoughts?

Thank you.

Lina
>
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Gleixner Oct. 17, 2014, 7:29 a.m. UTC | #10
On Fri, 10 Oct 2014, Lina Iyer wrote:
> On Wed, Oct 08 2014 at 09:03 -0600, Thomas Gleixner wrote:
> > On Thu, 25 Sep 2014, Lina Iyer wrote:
> > > > How would a general "keep track of the targets of all interrupts in
> > > > the system" mechanism make use of this?
> > > Sorry, I do not understand your question.
> > > PM QoS is only interested in the IRQs specified in the QoS request. If
> > > there are no requests that need to be associated with an IRQ, then PM
> > > QoS will not register for an affinity change notification.
> > 
> > Right, and I really hate the whole per irq notifier. It's a rats nest
> > of life time issues and other problems.
> > 
> > It also does not tell you whether an irq is disabled, reenabled or
> > removed, which will change the qos constraints as well unless you
> > plaster all drivers with updates to qos for those cases.
> > 
> > So what about adding a qos field to irq_data itself, have a function
> > to update it and let the irq core keep track of the per cpu irq
> > relevant qos constraints and provide an evaluation function or a
> > notifier for the PM/idle code?
> 
> If that isnt intrusive in the IRQ core, then we can make it work for PM
> QoS. The issue that I am concerned is that, it might result in back and
> forth between IRQ and PM QoS frameworks. If that doesnt happen, then we
> are good with this approach.

I can't tell that upfront, but I think it's worth to explore it.

Thanks,

	tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lina Iyer Nov. 18, 2014, 6:22 a.m. UTC | #11
Hi Thomas,

On Fri, Oct 17 2014 at 01:29 -0600, Thomas Gleixner wrote:
>On Fri, 10 Oct 2014, Lina Iyer wrote:
>> On Wed, Oct 08 2014 at 09:03 -0600, Thomas Gleixner wrote:
>> > On Thu, 25 Sep 2014, Lina Iyer wrote:
>> > > > How would a general "keep track of the targets of all interrupts in
>> > > > the system" mechanism make use of this?
>> > > Sorry, I do not understand your question.
>> > > PM QoS is only interested in the IRQs specified in the QoS request. If
>> > > there are no requests that need to be associated with an IRQ, then PM
>> > > QoS will not register for an affinity change notification.
>> >
>> > Right, and I really hate the whole per irq notifier. It's a rats nest
>> > of life time issues and other problems.
>> >
>> > It also does not tell you whether an irq is disabled, reenabled or
>> > removed, which will change the qos constraints as well unless you
>> > plaster all drivers with updates to qos for those cases.
>> >
>> > So what about adding a qos field to irq_data itself, have a function
>> > to update it and let the irq core keep track of the per cpu irq
>> > relevant qos constraints and provide an evaluation function or a
>> > notifier for the PM/idle code?
>>
>> If that isnt intrusive in the IRQ core, then we can make it work for PM
>> QoS. The issue that I am concerned is that, it might result in back and
>> forth between IRQ and PM QoS frameworks. If that doesnt happen, then we
>> are good with this approach.
>
>I can't tell that upfront, but I think it's worth to explore it.
>
I was able to review the options and I attempted a few methods, but
off-loading the QoS onto the IRQ framework, made it quite complex to
manage it. QoS values for each of the four constraints and the
constraints could be one of 3 types - min, max or sum, makes it a whole
lot of mess handling it in IRQ code.

I was able to get QoS to be notified of IRQ affinity changes without
using notifiers, but, I still am yet to find a way to modify QoS
requests on enable/disable of IRQ.

I will send the RFC of the new patchset, would be interested in taking a
look and let me know what you think.

Thanks,
Lina
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/infiniband/hw/qib/qib_iba7322.c b/drivers/infiniband/hw/qib/qib_iba7322.c
index a7eb325..62cb77d 100644
--- a/drivers/infiniband/hw/qib/qib_iba7322.c
+++ b/drivers/infiniband/hw/qib/qib_iba7322.c
@@ -3345,9 +3345,7 @@  static void reset_dca_notifier(struct qib_devdata *dd, struct qib_msix_entry *m)
 		"Disabling notifier on HCA %d irq %d\n",
 		dd->unit,
 		m->msix.vector);
-	irq_set_affinity_notifier(
-		m->msix.vector,
-		NULL);
+	irq_release_affinity_notifier(m->notifier);
 	m->notifier = NULL;
 }
 
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 698ad05..c1e227c 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -203,7 +203,7 @@  static inline int check_wakeup_irqs(void) { return 0; }
  * struct irq_affinity_notify - context for notification of IRQ affinity changes
  * @irq:		Interrupt to which notification applies
  * @kref:		Reference count, for internal use
- * @work:		Work item, for internal use
+ * @list:		Add to the notifier list, for internal use
  * @notify:		Function to be called on change.  This will be
  *			called in process context.
  * @release:		Function to be called on release.  This will be
@@ -214,7 +214,7 @@  static inline int check_wakeup_irqs(void) { return 0; }
 struct irq_affinity_notify {
 	unsigned int irq;
 	struct kref kref;
-	struct work_struct work;
+	struct list_head list;
 	void (*notify)(struct irq_affinity_notify *, const cpumask_t *mask);
 	void (*release)(struct kref *ref);
 };
@@ -265,6 +265,8 @@  extern int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m);
 extern int
 irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify);
 
+extern int
+irq_release_affinity_notifier(struct irq_affinity_notify *notify);
 #else /* CONFIG_SMP */
 
 static inline int irq_set_affinity(unsigned int irq, const struct cpumask *m)
@@ -295,6 +297,12 @@  irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify)
 {
 	return 0;
 }
+
+static inline int
+irq_release_affinity_notifier(struct irq_affinity_notify *notify)
+{
+	return 0;
+}
 #endif /* CONFIG_SMP */
 
 /*
diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
index 472c021..db3509e 100644
--- a/include/linux/irqdesc.h
+++ b/include/linux/irqdesc.h
@@ -31,7 +31,8 @@  struct irq_desc;
  * @threads_handled_last: comparator field for deferred spurious detection of theraded handlers
  * @lock:		locking for SMP
  * @affinity_hint:	hint to user space for preferred irq affinity
- * @affinity_notify:	context for notification of affinity changes
+ * @affinity_notify:	list of notification clients for affinity changes
+ * @affinity_work:	Work queue for handling affinity change notifications
  * @pending_mask:	pending rebalanced interrupts
  * @threads_oneshot:	bitfield to handle shared oneshot threads
  * @threads_active:	number of irqaction threads currently running
@@ -60,7 +61,8 @@  struct irq_desc {
 	struct cpumask		*percpu_enabled;
 #ifdef CONFIG_SMP
 	const struct cpumask	*affinity_hint;
-	struct irq_affinity_notify *affinity_notify;
+	struct list_head	affinity_notify;
+	struct work_struct	affinity_work;
 #ifdef CONFIG_GENERIC_PENDING_IRQ
 	cpumask_var_t		pending_mask;
 #endif
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 1487a12..c95e1f3 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -91,6 +91,7 @@  static void desc_set_defaults(unsigned int irq, struct irq_desc *desc, int node,
 	for_each_possible_cpu(cpu)
 		*per_cpu_ptr(desc->kstat_irqs, cpu) = 0;
 	desc_smp_init(desc, node);
+	INIT_LIST_HEAD(&desc->affinity_notify);
 }
 
 int nr_irqs = NR_IRQS;
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 3dc6a61..b6ff79c 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -209,10 +209,9 @@  int irq_set_affinity_locked(struct irq_data *data, const struct cpumask *mask,
 		irq_copy_pending(desc, mask);
 	}
 
-	if (desc->affinity_notify) {
-		kref_get(&desc->affinity_notify->kref);
-		schedule_work(&desc->affinity_notify->work);
-	}
+	if (!list_empty(&desc->affinity_notify))
+		schedule_work(&desc->affinity_work);
+
 	irqd_set(data, IRQD_AFFINITY_SET);
 
 	return ret;
@@ -248,14 +247,14 @@  EXPORT_SYMBOL_GPL(irq_set_affinity_hint);
 
 static void irq_affinity_notify(struct work_struct *work)
 {
-	struct irq_affinity_notify *notify =
-		container_of(work, struct irq_affinity_notify, work);
-	struct irq_desc *desc = irq_to_desc(notify->irq);
+	struct irq_desc *desc =
+			container_of(work, struct irq_desc, affinity_work);
 	cpumask_var_t cpumask;
 	unsigned long flags;
+	struct irq_affinity_notify *notify;
 
 	if (!desc || !alloc_cpumask_var(&cpumask, GFP_KERNEL))
-		goto out;
+		return;
 
 	raw_spin_lock_irqsave(&desc->lock, flags);
 	if (irq_move_pending(&desc->irq_data))
@@ -264,11 +263,20 @@  static void irq_affinity_notify(struct work_struct *work)
 		cpumask_copy(cpumask, desc->irq_data.affinity);
 	raw_spin_unlock_irqrestore(&desc->lock, flags);
 
-	notify->notify(notify, cpumask);
+	list_for_each_entry(notify, &desc->affinity_notify, list) {
+		/**
+		 * Check and get the kref only if the kref has not been
+		 * released by now. Its possible that the reference count
+		 * is already 0, we dont want to notify those if they are
+		 * already released.
+		 */
+		if (!kref_get_unless_zero(&notify->kref))
+			continue;
+		notify->notify(notify, cpumask);
+		kref_put(&notify->kref, notify->release);
+	}
 
 	free_cpumask_var(cpumask);
-out:
-	kref_put(&notify->kref, notify->release);
 }
 
 /**
@@ -286,34 +294,50 @@  int
 irq_set_affinity_notifier(unsigned int irq, struct irq_affinity_notify *notify)
 {
 	struct irq_desc *desc = irq_to_desc(irq);
-	struct irq_affinity_notify *old_notify;
 	unsigned long flags;
 
-	/* The release function is promised process context */
-	might_sleep();
-
 	if (!desc)
 		return -EINVAL;
 
-	/* Complete initialisation of *notify */
-	if (notify) {
-		notify->irq = irq;
-		kref_init(&notify->kref);
-		INIT_WORK(&notify->work, irq_affinity_notify);
+	if (!notify) {
+		WARN("%s called with NULL notifier - use irq_release_affinity_notifier function instead.\n",
+				__func__);
+		return -EINVAL;
 	}
 
+	notify->irq = irq;
+	kref_init(&notify->kref);
+	INIT_LIST_HEAD(&notify->list);
 	raw_spin_lock_irqsave(&desc->lock, flags);
-	old_notify = desc->affinity_notify;
-	desc->affinity_notify = notify;
+	list_add(&notify->list, &desc->affinity_notify);
 	raw_spin_unlock_irqrestore(&desc->lock, flags);
 
-	if (old_notify)
-		kref_put(&old_notify->kref, old_notify->release);
-
 	return 0;
 }
 EXPORT_SYMBOL_GPL(irq_set_affinity_notifier);
 
+/**
+ *	irq_release_affinity_notifier - Remove us from notifications
+ *	@notify: Context for notification
+ */
+int irq_release_affinity_notifier(struct irq_affinity_notify *notify)
+{
+	struct irq_desc *desc;
+	unsigned long flags;
+
+	if (!notify)
+		return -EINVAL;
+
+	desc = irq_to_desc(notify->irq);
+	raw_spin_lock_irqsave(&desc->lock, flags);
+	list_del(&notify->list);
+	raw_spin_unlock_irqrestore(&desc->lock, flags);
+	kref_put(&notify->kref, notify->release);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(irq_release_affinity_notifier);
+
 #ifndef CONFIG_AUTO_IRQ_AFFINITY
 /*
  * Generic version of the affinity autoselector.
@@ -348,6 +372,8 @@  setup_affinity(unsigned int irq, struct irq_desc *desc, struct cpumask *mask)
 		if (cpumask_intersects(mask, nodemask))
 			cpumask_and(mask, mask, nodemask);
 	}
+	INIT_LIST_HEAD(&desc->affinity_notify);
+	INIT_WORK(&desc->affinity_work, irq_affinity_notify);
 	irq_do_set_affinity(&desc->irq_data, mask, false);
 	return 0;
 }
@@ -1413,14 +1439,15 @@  EXPORT_SYMBOL_GPL(remove_irq);
 void free_irq(unsigned int irq, void *dev_id)
 {
 	struct irq_desc *desc = irq_to_desc(irq);
+	struct irq_affinity_notify *notify;
 
 	if (!desc || WARN_ON(irq_settings_is_per_cpu_devid(desc)))
 		return;
 
-#ifdef CONFIG_SMP
-	if (WARN_ON(desc->affinity_notify))
-		desc->affinity_notify = NULL;
-#endif
+	WARN_ON(!list_empty(&desc->affinity_notify));
+
+	list_for_each_entry(notify, &desc->affinity_notify, list)
+		kref_put(&notify->kref, notify->release);
 
 	chip_bus_lock(desc);
 	kfree(__free_irq(irq, dev_id));
diff --git a/lib/cpu_rmap.c b/lib/cpu_rmap.c
index 4f134d8..0c8da50 100644
--- a/lib/cpu_rmap.c
+++ b/lib/cpu_rmap.c
@@ -235,7 +235,7 @@  void free_irq_cpu_rmap(struct cpu_rmap *rmap)
 
 	for (index = 0; index < rmap->used; index++) {
 		glue = rmap->obj[index];
-		irq_set_affinity_notifier(glue->notify.irq, NULL);
+		irq_release_affinity_notifier(&glue->notify);
 	}
 
 	cpu_rmap_put(rmap);