[RFC] sched_clock: Track monotonic raw clock

Message ID 1410522334.16936.42.camel@hornet
State New
Headers show

Commit Message

Pawel Moll Sept. 12, 2014, 11:45 a.m.
On Tue, 2014-07-22 at 20:39 +0100, Richard Cochran wrote:
> So the only reasonable way, IMHO, is to simply provide the needed
> information in the traces, and then add some user space code to find
> the relationship between the perf clock and the system clock. The
> great advantage of this approach is that you don't have to create a
> perfect servo from day one. The time stamps can be re-analyzed at any
> time after the fact.

Ok, so I had yet another (already lost count ;-) look into this...

I came up with a conclusion that the simplest way it can be done is to
have a periodically repeated events marked with both "kernel timestamp"
and "user timestamp". And a scheduler switch is such a periodic event.
Moreover, it seems quite natural: a user process is scheduled in (take a
snapshot of sched clock and monotonic raw at this moment, let's call
them t_s_in and t_m_in), the process "generates" the "user timestamp" by
clock_gettime(CLOCK_MONOTONIC_RAW, t_m_x), the process is scheduled out
(again, take sched and monotonic timestamps then, t_s_out and t_m_out).

The, after extracting all the t_?_? from the trace stream, a translation
(we've know the monotonic raw value t_m_x, we're looking for its sched
clock equivalent t_s_x) should simply get down to:

t_s_x = t_s_in + ((t_m_x - t_m_1) * (t_s_out - t_s_in)) / (t_m_out - t_m_in)

I quickly made an experiment, by adding a customised variant of
sched_switch trace event (note: I'm *not* proposing it to be merged)

8<-------------
From ad772fca06da4774e6dd3a116ee2424a0dd3d80f Mon Sep 17 00:00:00 2001
From: Pawel Moll <pawel.moll@arm.com>
Date: Fri, 22 Aug 2014 15:23:13 +0100
Subject: [PATCH] trace: ched_switch event variant with raw monotonic timestamp

This patch adds a variant of sched_switch trace event,
adding a field for a monotonic raw timestamp.

Signed-off-by: Pawel Moll <pawel.moll@arm.com>
---
 include/trace/events/sched.h | 48 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c          |  1 +
 2 files changed, 49 insertions(+)

Patch hide | download patch | download mbox

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 0a68d5a..1429958 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -7,6 +7,7 @@ 
 #include <linux/sched.h>
 #include <linux/tracepoint.h>
 #include <linux/binfmts.h>
+#include <linux/time.h>
 
 /*
  * Tracepoint for calling kthread_stop, performed to end a kthread:
@@ -150,6 +151,53 @@  TRACE_EVENT(sched_switch,
 );
 
 /*
+ * Tracepoint for task switches, including raw monotonic raw clock
+ * timestamp, to perform user- to kernelspace time correlation
+ */
+TRACE_EVENT(sched_switch_raw_monotonic,
+
+	TP_PROTO(struct task_struct *prev,
+		 struct task_struct *next),
+
+	TP_ARGS(prev, next),
+
+	TP_STRUCT__entry(
+		__field(	u64,	raw_monotonic			)
+		__array(	char,	prev_comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	prev_pid			)
+		__field(	int,	prev_prio			)
+		__field(	long,	prev_state			)
+		__array(	char,	next_comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	next_pid			)
+		__field(	int,	next_prio			)
+	),
+
+	TP_fast_assign(
+		struct timespec now;
+		getrawmonotonic(&now);
+		__entry->raw_monotonic = timespec_to_ns(&now);
+		memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
+		__entry->prev_pid	= prev->pid;
+		__entry->prev_prio	= prev->prio;
+		__entry->prev_state	= __trace_sched_switch_state(prev);
+		memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
+		__entry->next_pid	= next->pid;
+		__entry->next_prio	= next->prio;
+	),
+
+	TP_printk("raw_monotonic=%Lu prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
+		__entry->raw_monotonic, __entry->prev_comm,
+		__entry->prev_pid, __entry->prev_prio,
+		__entry->prev_state & (TASK_STATE_MAX-1) ?
+		  __print_flags(__entry->prev_state & (TASK_STATE_MAX-1), "|",
+				{ 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" },
+				{ 16, "Z" }, { 32, "X" }, { 64, "x" },
+				{ 128, "K" }, { 256, "W" }, { 512, "P" }) : "R",
+		__entry->prev_state & TASK_STATE_MAX ? "+" : "",
+		__entry->next_comm, __entry->next_pid, __entry->next_prio)
+);
+
+/*
  * Tracepoint for a task being migrated:
  */
 TRACE_EVENT(sched_migrate_task,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ec1a286..0cc7a92 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2179,6 +2179,7 @@  prepare_task_switch(struct rq *rq, struct task_struct *prev,
 		    struct task_struct *next)
 {
 	trace_sched_switch(prev, next);
+	trace_sched_switch_raw_monotonic(prev, next);
 	sched_info_switch(rq, prev, next);
 	perf_event_task_sched_out(prev, next);
 	fire_sched_out_preempt_notifiers(prev, next);