From patchwork Sat Apr 5 06:26:48 2014
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Viresh Kumar
X-Patchwork-Id: 27872
Return-Path:
X-Original-To: linaro@patches.linaro.org
Delivered-To: linaro@patches.linaro.org
Received: from mail-qa0-f72.google.com (mail-qa0-f72.google.com [209.85.216.72])
 by ip-10-151-82-157.ec2.internal (Postfix) with ESMTPS id EA73720369
 for ; Sat, 5 Apr 2014 06:27:35 +0000 (UTC)
Received: by mail-qa0-f72.google.com with SMTP id hw13sf5759911qab.11
 for ; Fri, 04 Apr 2014 23:27:35 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20130820;
 h=x-gm-message-state:delivered-to:mime-version:in-reply-to:references
 :date:message-id:subject:from:to:cc:sender:precedence:list-id
 :x-original-sender:x-original-authentication-results:mailing-list
 :list-post:list-help:list-archive:list-unsubscribe:content-type;
 bh=rPsPnWY05Tk3jfJsZOYGQ/VXEgvR/SWD71yV2lGHSBw=;
 b=UTzwvFv8CwiScKzNRroQoteOLm0fPJAacjWcHB50ZxfsheyjOILr8a+7zWwQoSJduN
 9T4eyiZ1M0geifm1/QoRLhtcmIa9c3QHLCWXOdIFdKUg5fIWb5qFGiawOzcvkrid6r3W
 J0PafV+cZo7ZRUUvNf/0PISi6z1Xpkiayxt9o6a9XLYUx/D6d+4W73v+23h0L6tpbZox
 J+rAnRF26tERfe+O6XX+acWiGKRY3Z/oazD+6F/9sYXo0TlGbJ0hJEwp2EdJO7LdyTAn
 2PmhZgxlYQ7gmuW52UAiandKEE7r9AOPKVSDjixSkOr6Hg8TZYAUkY8UNEzYgg+qmdgS
 XzPA==
X-Gm-Message-State: ALoCoQnXbCGfivkxKllMd4vSX0WRy2NP5kYBev51xwJOnG4hw3F3P9WpMl8nK74AerERl+VRaKw3
X-Received: by 10.236.142.101 with SMTP id h65mr8504634yhj.1.1396679255431;
 Fri, 04 Apr 2014 23:27:35 -0700 (PDT)
X-BeenThere: patchwork-forward@linaro.org
Received: by 10.140.96.132 with SMTP id k4ls1258476qge.42.gmail;
 Fri, 04 Apr 2014 23:27:35 -0700 (PDT)
X-Received: by 10.52.142.10 with SMTP id rs10mr14050728vdb.3.1396679255351;
 Fri, 04 Apr 2014 23:27:35 -0700 (PDT)
Received: from mail-ve0-f171.google.com (mail-ve0-f171.google.com [209.85.128.171])
 by mx.google.com with ESMTPS id sc7si2041101vdc.85.2014.04.04.23.27.35
 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128);
 Fri, 04 Apr 2014 23:27:35 -0700 (PDT)
Received-SPF: neutral (google.com: 209.85.128.171 is neither permitted nor
 denied by best guess record for domain of
 patch+caf_=patchwork-forward=linaro.org@linaro.org) client-ip=209.85.128.171;
Received: by mail-ve0-f171.google.com with SMTP id jy13so2150558veb.16
 for ; Fri, 04 Apr 2014 23:27:35 -0700 (PDT)
X-Received: by 10.52.173.165 with SMTP id bl5mr13955951vdc.13.1396679255266;
 Fri, 04 Apr 2014 23:27:35 -0700 (PDT)
X-Forwarded-To: patchwork-forward@linaro.org
X-Forwarded-For: patch@linaro.org patchwork-forward@linaro.org
Delivered-To: patch@linaro.org
Received: by 10.220.12.8 with SMTP id v8csp6728vcv;
 Fri, 4 Apr 2014 23:27:34 -0700 (PDT)
X-Received: by 10.66.150.228 with SMTP id ul4mr8778888pab.16.1396679231043;
 Fri, 04 Apr 2014 23:27:11 -0700 (PDT)
Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67])
 by mx.google.com with ESMTP id a8si5511417pbs.70.2014.04.04.23.27.10;
 Fri, 04 Apr 2014 23:27:10 -0700 (PDT)
Received-SPF: pass (google.com: best guess record for domain of
 linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted
 sender) client-ip=209.132.180.67;
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S1753672AbaDEG0v (ORCPT + 27 others);
 Sat, 5 Apr 2014 02:26:51 -0400
Received: from mail-oa0-f44.google.com ([209.85.219.44]:38063 "EHLO
 mail-oa0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1753353AbaDEG0t (ORCPT );
 Sat, 5 Apr 2014 02:26:49 -0400
Received: by mail-oa0-f44.google.com with SMTP id n16so4575763oag.3
 for ; Fri, 04 Apr 2014 23:26:49 -0700 (PDT)
MIME-Version: 1.0
X-Received: by 10.60.173.233 with SMTP id bn9mr23496646oec.9.1396679209149;
 Fri, 04 Apr 2014 23:26:49 -0700 (PDT)
Received: by 10.182.28.168 with HTTP; Fri, 4 Apr 2014 23:26:48 -0700 (PDT)
In-Reply-To: <533F86D4.10007@intel.com>
References: <20140404031928.GC11828@localhost> <533E635F.9050803@intel.com>
 <533F86D4.10007@intel.com>
Date: Sat, 5 Apr 2014 11:56:48 +0530
Message-ID:
Subject: Re: WARNING: CPU: 0 PID: 1935 at kernel/timer.c:1621 migrate_timer_list()
From: Viresh Kumar
To: Jet Chen
Cc: Thomas Gleixner, Fengguang Wu, Linux Kernel Mailing List
Sender: linux-kernel-owner@vger.kernel.org
Precedence: list
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
X-Removed-Original-Auth: Dkim didn't pass.
X-Original-Sender: viresh.kumar@linaro.org
X-Original-Authentication-Results: mx.google.com; spf=neutral (google.com:
 209.85.128.171 is neither permitted nor denied by best guess record for
 domain of patch+caf_=patchwork-forward=linaro.org@linaro.org)
 smtp.mail=patch+caf_=patchwork-forward=linaro.org@linaro.org
Mailing-list: list patchwork-forward@linaro.org; contact patchwork-forward+owners@linaro.org
X-Google-Group-Id: 836684582541
List-Post:
List-Help:
List-Archive:
List-Unsubscribe:

On 5 April 2014 10:00, Jet Chen wrote:
> vmlinuz from our build system doesn't have debug information. It is hard to
> use objdump to identify which routine is timer->function.

I see...

> But after several times trials, I get below dmesg messages.
> It is clear to see address of "timer->function" is 0xffffffff810d7010.
> In calling stack, " [] ?
> clocksource_watchdog_kthread+0x40/0x40 ". So I guess timer->function is
> clocksource_watchdog_kthread.

Hmm.. not exactly this function as this isn't timer->function for any timer.
But I think I have found the right function with this hint:
clocksource_watchdog()

Can you please try to test the attached patch, which must fix it. Untested. I
will then post it with your Tested-by :)

---
viresh

>From abd38155f8293923de5953cc063f9e2d7ecb3f04 Mon Sep 17 00:00:00 2001
Message-Id:
From: Viresh Kumar
Date: Sat, 5 Apr 2014 11:43:25 +0530
Subject: [PATCH] clocksource: register cpu notifier to remove timer from dying CPU

clocksource core is using add_timer_on() to run clocksource_watchdog() on all
CPUs one by one.
But when a core is brought down, clocksource core doesn't remove this timer
from the dying CPU, and the timer core then prints the warning below. (The
warning itself is printed only with unmerged code, but the current code is
wrong here as well: the timer core migrates a pinned timer to another CPU, see
http://www.gossamer-threads.com/lists/linux/kernel/1898117)

migrate_timer_list: can't migrate pinned timer: ffffffff81f06a60, timer->function: ffffffff810d7010,deactivating it
Modules linked in:
CPU: 0 PID: 1932 Comm: 01-cpu-hotplug Not tainted 3.14.0-rc1-00088-gab3c4fd #4
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 0000000000000009 ffff88001d407c38 ffffffff817237bd ffff88001d407c80
 ffff88001d407c70 ffffffff8106a1dd 0000000000000010 ffffffff81f06a60
 ffff88001e04d040 ffffffff81e3d4c0 ffff88001e04d030 ffff88001d407cd0
Call Trace:
 [] dump_stack+0x4d/0x66
 [] warn_slowpath_common+0x7d/0xa0
 [] warn_slowpath_fmt+0x4c/0x50
 [] ? __internal_add_timer+0x113/0x130
 [] ? clocksource_watchdog_kthread+0x40/0x40
 [] migrate_timer_list+0xdb/0xf0
 [] timer_cpu_notify+0xfc/0x1f0
 [] notifier_call_chain+0x4c/0x70
 [] __raw_notifier_call_chain+0xe/0x10
 [] cpu_notify+0x23/0x50
 [] cpu_notify_nofail+0xe/0x20
 [] _cpu_down+0x1ad/0x2e0
 [] cpu_down+0x34/0x50
 [] cpu_subsys_offline+0x14/0x20
 [] device_offline+0x95/0xc0
 [] online_store+0x40/0x90
 [] dev_attr_store+0x18/0x30
 [] sysfs_kf_write+0x3d/0x50

This patch tries to fix this by registering a CPU notifier from clocksource
core, but only while the clocksource watchdog is running. If, on the CPU_DEAD
notification, the dying CPU turns out to be the CPU on which this timer is
queued, the timer is removed from that CPU and queued on the next online CPU.

Reported-by: Jet Chen
Reported-by: Fengguang Wu
Signed-off-by: Viresh Kumar
---
 kernel/time/clocksource.c | 64 +++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 53 insertions(+), 11 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index ba3e502..9e96853 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -23,16 +23,21 @@
  *   o Allow clocksource drivers to be unregistered
  */
 
+#include
 #include
 #include
 #include
 #include
+#include
 #include /* for spin_unlock_irq() using preempt_count() m68k */
 #include
 #include
 
 #include "tick-internal.h"
 
+/* Tracks next CPU to queue watchdog timer on */
+static int timer_cpu;
+
 void timecounter_init(struct timecounter *tc,
 		      const struct cyclecounter *cc,
 		      u64 start_tstamp)
@@ -246,12 +251,25 @@ void clocksource_mark_unstable(struct clocksource *cs)
 	spin_unlock_irqrestore(&watchdog_lock, flags);
 }
 
+void queue_timer_on_next_cpu(void)
+{
+	/*
+	 * Cycle through CPUs to check if the CPUs stay synchronized to each
+	 * other.
+	 */
+	timer_cpu = cpumask_next(timer_cpu, cpu_online_mask);
+	if (timer_cpu >= nr_cpu_ids)
+		timer_cpu = cpumask_first(cpu_online_mask);
+	watchdog_timer.expires = jiffies + WATCHDOG_INTERVAL;
+	add_timer_on(&watchdog_timer, timer_cpu);
+}
+
 static void clocksource_watchdog(unsigned long data)
 {
 	struct clocksource *cs;
 	cycle_t csnow, wdnow;
 	int64_t wd_nsec, cs_nsec;
-	int next_cpu, reset_pending;
+	int reset_pending;
 
 	spin_lock(&watchdog_lock);
 	if (!watchdog_running)
@@ -336,27 +354,50 @@ static void clocksource_watchdog(unsigned long data)
 	if (reset_pending)
 		atomic_dec(&watchdog_reset_pending);
 
-	/*
-	 * Cycle through CPUs to check if the CPUs stay synchronized
-	 * to each other.
-	 */
-	next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
-	if (next_cpu >= nr_cpu_ids)
-		next_cpu = cpumask_first(cpu_online_mask);
-	watchdog_timer.expires += WATCHDOG_INTERVAL;
-	add_timer_on(&watchdog_timer, next_cpu);
+	queue_timer_on_next_cpu();
 out:
 	spin_unlock(&watchdog_lock);
 }
 
+static int clocksource_cpu_notify(struct notifier_block *self,
+				  unsigned long action, void *hcpu)
+{
+	long cpu = (long)hcpu;
+
+	spin_lock(&watchdog_lock);
+	if (!watchdog_running)
+		goto notify_out;
+
+	switch (action) {
+	case CPU_DEAD:
+	case CPU_DEAD_FROZEN:
+		if (cpu != timer_cpu)
+			break;
+		del_timer(&watchdog_timer);
+		queue_timer_on_next_cpu();
+		break;
+	}
+
+notify_out:
+	spin_unlock(&watchdog_lock);
+	return NOTIFY_OK;
+}
+
+static struct notifier_block clocksource_nb = {
+	.notifier_call = clocksource_cpu_notify,
+	.priority = 1,
+};
+
 static inline void clocksource_start_watchdog(void)
 {
 	if (watchdog_running || !watchdog || list_empty(&watchdog_list))
 		return;
+	timer_cpu = cpumask_first(cpu_online_mask);
+	register_cpu_notifier(&clocksource_nb);
 	init_timer(&watchdog_timer);
 	watchdog_timer.function = clocksource_watchdog;
 	watchdog_timer.expires = jiffies + WATCHDOG_INTERVAL;
-	add_timer_on(&watchdog_timer, cpumask_first(cpu_online_mask));
+	add_timer_on(&watchdog_timer, timer_cpu);
 	watchdog_running = 1;
 }
 
@@ -365,6 +406,7 @@ static inline void clocksource_stop_watchdog(void)
 	if (!watchdog_running || (watchdog && !list_empty(&watchdog_list)))
 		return;
 	del_timer(&watchdog_timer);
+	unregister_cpu_notifier(&clocksource_nb);
 	watchdog_running = 0;
 }
 
-- 
1.7.12.rc2.18.g61b472e
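
For anyone who wants to check the round-robin logic without booting a kernel, here is a rough userspace sketch of what queue_timer_on_next_cpu() and the CPU_DEAD case are meant to do. It is not kernel code: NR_CPUS, online_mask, mask_next(), mask_first(), queue_on_next_cpu() and cpu_dead() are hypothetical stand-ins for nr_cpu_ids, cpu_online_mask, cpumask_next(), cpumask_first() and the notifier. The sketch also clears the online bit itself, whereas in the kernel the hotplug core has presumably already done that by the time CPU_DEAD is delivered.

```c
#include <assert.h>

#define NR_CPUS 8	/* hypothetical CPU count, stands in for nr_cpu_ids */

/* Hypothetical stand-in for cpu_online_mask: bit n set means CPU n is online. */
static unsigned int online_mask = 0xff;

/* CPU the (simulated) watchdog timer is currently queued on. */
static int timer_cpu;

/* Like cpumask_next(): first online CPU above 'cpu', or NR_CPUS if none. */
static int mask_next(int cpu)
{
	for (int n = cpu + 1; n < NR_CPUS; n++)
		if (online_mask & (1u << n))
			return n;
	return NR_CPUS;
}

/* Like cpumask_first(): lowest-numbered online CPU. */
static int mask_first(void)
{
	return mask_next(-1);
}

/* Mirrors queue_timer_on_next_cpu(): advance round-robin, wrap at the end. */
static int queue_on_next_cpu(void)
{
	timer_cpu = mask_next(timer_cpu);
	if (timer_cpu >= NR_CPUS)
		timer_cpu = mask_first();
	return timer_cpu;
}

/* Mirrors the CPU_DEAD case: requeue only if the timer sat on the dead CPU. */
static void cpu_dead(int cpu)
{
	online_mask &= ~(1u << cpu);	/* simplification: done by hotplug core */
	if (cpu == timer_cpu)
		queue_on_next_cpu();
}
```

Starting with timer_cpu = mask_first() and calling queue_on_next_cpu() once puts the timer on CPU 1. A following cpu_dead(1) moves it to CPU 2, while cpu_dead(5) leaves it where it is, which is the behaviour the commit message describes.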