diff mbox series

[RESEND,v2,1/3] sched/tp: Add new tracepoint to track uclamp set from user-space

Message ID 20230522145702.2419654-2-lukasz.luba@arm.com
State New
Headers show
Series Add basic tracing for uclamp and schedutil | expand

Commit Message

Lukasz Luba May 22, 2023, 2:57 p.m. UTC
The user-space can set uclamp value for a given task. It impacts task
placement decisions made by the scheduler. This is very useful information
and helps to understand the system behavior or track improvements in
middleware and applications which start using uclamp mechanisms and report
better performance in tests.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 include/trace/events/sched.h | 4 ++++
 kernel/sched/core.c          | 5 +++++
 2 files changed, 9 insertions(+)

Comments

Masami Hiramatsu (Google) June 21, 2023, 3:25 a.m. UTC | #1
On Wed, 31 May 2023 19:26:29 +0100
Qais Yousef <qyousef@layalina.io> wrote:

> On 05/22/23 15:57, Lukasz Luba wrote:
> > The user-space can set uclamp value for a given task. It impacts task
> > placement decisions made by the scheduler. This is very useful information
> > and helps to understand the system behavior or track improvements in
> > middleware and applications which start using uclamp mechanisms and report
> > better performance in tests.
> 
> Do you mind adding a generic one instead please? And explain why we can't just
> attach to the syscall via kprobes? I think you want to bypass the permission
> checks, so maybe a generic tracepoint after that might be justifiable?

Could you tell me more about this point? I would like to know what kind of
permission checks can be bypassed with tracepoints.

> Then anyone can use it to track how userspace has changed any attributes for
> a task, not just uclamp.

I guess Uclamp is not controlled by syscall but from kernel internal
sched_setattr/setscheduler() too. Anyway I agree that it can be more generic
tracepoint, something like trace_sched_set_scheduer(task, attr).

Thank you,

> 
> 
> Thanks
> 
> --
> Qais Yousef
> 
> > 
> > Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> > ---
> >  include/trace/events/sched.h | 4 ++++
> >  kernel/sched/core.c          | 5 +++++
> >  2 files changed, 9 insertions(+)
> > 
> > diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> > index fbb99a61f714..dbfb30809f15 100644
> > --- a/include/trace/events/sched.h
> > +++ b/include/trace/events/sched.h
> > @@ -735,6 +735,10 @@ DECLARE_TRACE(sched_update_nr_running_tp,
> >  	TP_PROTO(struct rq *rq, int change),
> >  	TP_ARGS(rq, change));
> >  
> > +DECLARE_TRACE(uclamp_update_tsk_tp,
> > +	TP_PROTO(struct task_struct *tsk, int uclamp_id,  unsigned int value),
> > +	TP_ARGS(tsk, uclamp_id, value));
> > +
> >  #endif /* _TRACE_SCHED_H */
> >  
> >  /* This part must be outside protection */
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 944c3ae39861..7b9b800ebb6c 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -114,6 +114,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_overutilized_tp);
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_cfs_tp);
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_se_tp);
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
> > +EXPORT_TRACEPOINT_SYMBOL_GPL(uclamp_update_tsk_tp);
> >  
> >  DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
> >  
> > @@ -1956,12 +1957,16 @@ static void __setscheduler_uclamp(struct task_struct *p,
> >  	    attr->sched_util_min != -1) {
> >  		uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
> >  			      attr->sched_util_min, true);
> > +		trace_uclamp_update_tsk_tp(p, UCLAMP_MIN,
> > +					   attr->sched_util_min);
> >  	}
> >  
> >  	if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX &&
> >  	    attr->sched_util_max != -1) {
> >  		uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
> >  			      attr->sched_util_max, true);
> > +		trace_uclamp_update_tsk_tp(p, UCLAMP_MAX,
> > +					   attr->sched_util_max);
> >  	}
> >  }
> >  
> > -- 
> > 2.25.1
> >
Qais Yousef June 30, 2023, 11:49 a.m. UTC | #2
On 06/21/23 12:25, Masami Hiramatsu wrote:
> On Wed, 31 May 2023 19:26:29 +0100
> Qais Yousef <qyousef@layalina.io> wrote:
> 
> > On 05/22/23 15:57, Lukasz Luba wrote:
> > > The user-space can set uclamp value for a given task. It impacts task
> > > placement decisions made by the scheduler. This is very useful information
> > > and helps to understand the system behavior or track improvements in
> > > middleware and applications which start using uclamp mechanisms and report
> > > better performance in tests.
> > 
> > Do you mind adding a generic one instead please? And explain why we can't just
> > attach to the syscall via kprobes? I think you want to bypass the permission
> > checks, so maybe a generic tracepoint after that might be justifiable?
> 
> Could you tell me more about this point? I would like to know what kind of
> permission checks can be bypassed with tracepoints.

Sorry bad usage of English from my end.

The syscall can fail if the caller doesn't have permission to change the
attribute (some of them are protected with CAP_NICE) or if the boundary check
fails. The desire here is to emit a tracepoint() when the user successfully
changes an attribute of a task.

Lukasz would like to have this tracepoint to help debug and analyse workloads.
We are not really bypassing anything. So to rephrase, emit the tracepointn if
the syscall is successfully changing an attribute.

> 
> > Then anyone can use it to track how userspace has changed any attributes for
> > a task, not just uclamp.
> 
> I guess Uclamp is not controlled by syscall but from kernel internal
> sched_setattr/setscheduler() too. Anyway I agree that it can be more generic
> tracepoint, something like trace_sched_set_scheduer(task, attr).

Yes. Which is something worries me and I had a series in the past to hide it.
The uclamp range is abstracted and has no meaning in general and should be set
specifically to each system. e.g: 512 means half the system performance level,
but if the system is over powered this could be too fast, and if it's
underpowered it could be too slow. It must be set by userspace; though not sure
if kernel threads need to manage their performance level how this can be
achieved.


Thanks!

--
Qais Yousef
Lukasz Luba July 4, 2023, 7:49 a.m. UTC | #3
Hi Masami, Qais,

On 6/30/23 12:49, Qais Yousef wrote:
> On 06/21/23 12:25, Masami Hiramatsu wrote:
>> On Wed, 31 May 2023 19:26:29 +0100
>> Qais Yousef <qyousef@layalina.io> wrote:
>>
>>> On 05/22/23 15:57, Lukasz Luba wrote:
>>>> The user-space can set uclamp value for a given task. It impacts task
>>>> placement decisions made by the scheduler. This is very useful information
>>>> and helps to understand the system behavior or track improvements in
>>>> middleware and applications which start using uclamp mechanisms and report
>>>> better performance in tests.
>>>
>>> Do you mind adding a generic one instead please? And explain why we can't just
>>> attach to the syscall via kprobes? I think you want to bypass the permission
>>> checks, so maybe a generic tracepoint after that might be justifiable?
>>
>> Could you tell me more about this point? I would like to know what kind of
>> permission checks can be bypassed with tracepoints.
> 
> Sorry bad usage of English from my end.
> 
> The syscall can fail if the caller doesn't have permission to change the
> attribute (some of them are protected with CAP_NICE) or if the boundary check
> fails. The desire here is to emit a tracepoint() when the user successfully
> changes an attribute of a task.
> 
> Lukasz would like to have this tracepoint to help debug and analyse workloads.
> We are not really bypassing anything. So to rephrase, emit the tracepointn if
> the syscall is successfully changing an attribute.

Sorry, but no. As I said, I don't want to add more dependencies in our
tooling and kernel configuration.

> 
>>
>>> Then anyone can use it to track how userspace has changed any attributes for
>>> a task, not just uclamp.

Is this a real-life use case?
Is there a user-space SW that changes dynamically those attributes in a
way which affects task scheduler decisions that we have hard time to
understand it?

This syscall is quite old and I haven't heard that there is a need to
know what and how often it changes for apps.

On the other hand (the real life).
We started facing issues since some smart middle-ware very often
(a few hundred times in a few minutes) changes the uclamp for apps.
That daemon works autonomously and tries to figure out best values,
To understand those decisions we need some tricky offline processing
from trace. Since the uclamp affects a lot of mechanisms, we need to
know exactly the time and value when it is set.

If you don't point me to the SW which changes the other attributes
that often that you need to record them and post process, then
I would keep the current approach.


>>
>> I guess Uclamp is not controlled by syscall but from kernel internal
>> sched_setattr/setscheduler() too. Anyway I agree that it can be more generic
>> tracepoint, something like trace_sched_set_scheduer(task, attr).
> 
> Yes. Which is something worries me and I had a series in the past to hide it.
> The uclamp range is abstracted and has no meaning in general and should be set
> specifically to each system. e.g: 512 means half the system performance level,
> but if the system is over powered this could be too fast, and if it's
> underpowered it could be too slow. It must be set by userspace; though not sure
> if kernel threads need to manage their performance level how this can be
> achieved.

In mainline kernel I don't see any place where uclamp is set for kernel
threads. It's not use the use case and I hope it won't be anytime soon.

Regards,
Lukasz
Qais Yousef July 4, 2023, 2:02 p.m. UTC | #4
On 07/04/23 08:49, Lukasz Luba wrote:
> Hi Masami, Qais,
> 
> On 6/30/23 12:49, Qais Yousef wrote:
> > On 06/21/23 12:25, Masami Hiramatsu wrote:
> > > On Wed, 31 May 2023 19:26:29 +0100
> > > Qais Yousef <qyousef@layalina.io> wrote:
> > > 
> > > > On 05/22/23 15:57, Lukasz Luba wrote:
> > > > > The user-space can set uclamp value for a given task. It impacts task
> > > > > placement decisions made by the scheduler. This is very useful information
> > > > > and helps to understand the system behavior or track improvements in
> > > > > middleware and applications which start using uclamp mechanisms and report
> > > > > better performance in tests.
> > > > 
> > > > Do you mind adding a generic one instead please? And explain why we can't just
> > > > attach to the syscall via kprobes? I think you want to bypass the permission
> > > > checks, so maybe a generic tracepoint after that might be justifiable?
> > > 
> > > Could you tell me more about this point? I would like to know what kind of
> > > permission checks can be bypassed with tracepoints.
> > 
> > Sorry bad usage of English from my end.
> > 
> > The syscall can fail if the caller doesn't have permission to change the
> > attribute (some of them are protected with CAP_NICE) or if the boundary check
> > fails. The desire here is to emit a tracepoint() when the user successfully
> > changes an attribute of a task.
> > 
> > Lukasz would like to have this tracepoint to help debug and analyse workloads.
> > We are not really bypassing anything. So to rephrase, emit the tracepointn if
> > the syscall is successfully changing an attribute.
> 
> Sorry, but no. As I said, I don't want to add more dependencies in our
> tooling and kernel configuration.

Fair enough. But as I said before a dedicate uclamp only tracepoint doesn't
make sense to me. If maintainers are convinced, then be it. My point of view is
that We want generic tracepoints that scale to other use cases and it makes
sense to go this way to accommodate all potential future users, not just you.


Cheers

--
Qais Yousef
Lukasz Luba July 19, 2023, 1:18 p.m. UTC | #5
On 7/6/23 12:14, Peter Zijlstra wrote:
> On Wed, May 31, 2023 at 07:26:29PM +0100, Qais Yousef wrote:
>> On 05/22/23 15:57, Lukasz Luba wrote:
>>> The user-space can set uclamp value for a given task. It impacts task
>>> placement decisions made by the scheduler. This is very useful information
>>> and helps to understand the system behavior or track improvements in
>>> middleware and applications which start using uclamp mechanisms and report
>>> better performance in tests.
>>
>> Do you mind adding a generic one instead please? And explain why we can't just
>> attach to the syscall via kprobes? I think you want to bypass the permission
>> checks, so maybe a generic tracepoint after that might be justifiable?
>> Then anyone can use it to track how userspace has changed any attributes for
>> a task, not just uclamp.
> 
> Yeah, so I'm leaning towards the same, if you want to put a tracepoint
> in __sched_setscheduler(), just trace the whole attr and leave it at
> that:
> 
> 	trace_update_sched_attr_tp(p, attr);
> 
> or somesuch.
> 

OK, fair enough, I'll do that. Thanks Peter!
(I'm sorry for the delay, I was on vacation)

Lukasz
diff mbox series

Patch

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index fbb99a61f714..dbfb30809f15 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -735,6 +735,10 @@  DECLARE_TRACE(sched_update_nr_running_tp,
 	TP_PROTO(struct rq *rq, int change),
 	TP_ARGS(rq, change));
 
+DECLARE_TRACE(uclamp_update_tsk_tp,
+	TP_PROTO(struct task_struct *tsk, int uclamp_id,  unsigned int value),
+	TP_ARGS(tsk, uclamp_id, value));
+
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 944c3ae39861..7b9b800ebb6c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -114,6 +114,7 @@  EXPORT_TRACEPOINT_SYMBOL_GPL(sched_overutilized_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_cfs_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_se_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
+EXPORT_TRACEPOINT_SYMBOL_GPL(uclamp_update_tsk_tp);
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 
@@ -1956,12 +1957,16 @@  static void __setscheduler_uclamp(struct task_struct *p,
 	    attr->sched_util_min != -1) {
 		uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
 			      attr->sched_util_min, true);
+		trace_uclamp_update_tsk_tp(p, UCLAMP_MIN,
+					   attr->sched_util_min);
 	}
 
 	if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX &&
 	    attr->sched_util_max != -1) {
 		uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
 			      attr->sched_util_max, true);
+		trace_uclamp_update_tsk_tp(p, UCLAMP_MAX,
+					   attr->sched_util_max);
 	}
 }