From patchwork Thu Jan 7 07:53:08 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Wen Yang X-Patchwork-Id: 358910 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, UNPARSEABLE_RELAY, URIBL_BLOCKED, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 10597C4332E for ; Thu, 7 Jan 2021 07:54:50 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id C96B623125 for ; Thu, 7 Jan 2021 07:54:49 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726805AbhAGHyQ (ORCPT ); Thu, 7 Jan 2021 02:54:16 -0500 Received: from out30-131.freemail.mail.aliyun.com ([115.124.30.131]:49973 "EHLO out30-131.freemail.mail.aliyun.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727418AbhAGHyG (ORCPT ); Thu, 7 Jan 2021 02:54:06 -0500 X-Alimail-AntiSpam: AC=PASS; BC=-1|-1; BR=01201311R161e4; CH=green; DM=||false|; DS=||; FP=0|-1|-1|-1|0|-1|-1|-1; HT=e01e04426; MF=wenyang@linux.alibaba.com; NM=1; PH=DS; RN=20; SR=0; TI=SMTPD_---0UKysbUW_1610005996; Received: from localhost(mailfrom:wenyang@linux.alibaba.com fp:SMTPD_---0UKysbUW_1610005996) by smtp.aliyun-inc.com(127.0.0.1); Thu, 07 Jan 2021 15:53:16 +0800 From: Wen Yang To: Greg Kroah-Hartman , Sasha Levin Cc: Xunlei Pang , linux-kernel@vger.kernel.org, Christian Brauner , Linus Torvalds , Jann Horn , Oleg Nesterov , Arnd Bergmann , "Eric W. Biederman" , Kees Cook , Thomas Gleixner , David Howells , "Michael Kerrisk (man-pages)" , Andy Lutomirsky , Andrew Morton , Aleksa Sarai , Al Viro , stable@vger.kernel.org, Wen Yang Subject: [PATCH 4.19 1/7] clone: add CLONE_PIDFD Date: Thu, 7 Jan 2021 15:53:08 +0800 Message-Id: <20210107075314.62683-2-wenyang@linux.alibaba.com> X-Mailer: git-send-email 2.23.0 In-Reply-To: <20210107075314.62683-1-wenyang@linux.alibaba.com> References: <20210107075314.62683-1-wenyang@linux.alibaba.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: stable@vger.kernel.org From: Christian Brauner [ Upstream commit b3e5838252665ee4cfa76b82bdf1198dca81e5be ] This patchset makes it possible to retrieve pid file descriptors at process creation time by introducing the new flag CLONE_PIDFD to the clone() system call. Linus originally suggested to implement this as a new flag to clone() instead of making it a separate system call. As spotted by Linus, there is exactly one bit for clone() left. CLONE_PIDFD creates file descriptors based on the anonymous inode implementation in the kernel that will also be used to implement the new mount api. They serve as a simple opaque handle on pids. Logically, this makes it possible to interpret a pidfd differently, narrowing or widening the scope of various operations (e.g. signal sending). Thus, a pidfd cannot just refer to a tgid, but also a tid, or in theory - given appropriate flag arguments in relevant syscalls - a process group or session. A pidfd does not represent a privilege. This does not imply it cannot ever be that way but for now this is not the case. A pidfd comes with additional information in fdinfo if the kernel supports procfs. The fdinfo file contains the pid of the process in the callers pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d". As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the parent_tidptr argument of clone. This has the advantage that we can give back the associated pid and the pidfd at the same time. To remove worries about missing metadata access this patchset comes with a sample program that illustrates how a combination of CLONE_PIDFD, and pidfd_send_signal() can be used to gain race-free access to process metadata through /proc/. The sample program can easily be translated into a helper that would be suitable for inclusion in libc so that users don't have to worry about writing it themselves. Suggested-by: Linus Torvalds Signed-off-by: Christian Brauner Co-developed-by: Jann Horn Signed-off-by: Jann Horn Reviewed-by: Oleg Nesterov Cc: Arnd Bergmann Cc: "Eric W. Biederman" Cc: Kees Cook Cc: Thomas Gleixner Cc: David Howells Cc: "Michael Kerrisk (man-pages)" Cc: Andy Lutomirsky Cc: Andrew Morton Cc: Aleksa Sarai Cc: Linus Torvalds Cc: Al Viro Cc: # 4.19.x (clone: fix up cherry-pick conflicts for b3e583825266) Signed-off-by: Wen Yang --- include/linux/pid.h | 2 + include/uapi/linux/sched.h | 1 + kernel/fork.c | 107 +++++++++++++++++++++++++++++++++++++++++++-- 3 files changed, 106 insertions(+), 4 deletions(-) diff --git a/include/linux/pid.h b/include/linux/pid.h index 14a9a39..29c0a99 100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -66,6 +66,8 @@ struct pid extern struct pid init_struct_pid; +extern const struct file_operations pidfd_fops; + static inline struct pid *get_pid(struct pid *pid) { if (pid) diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h index 22627f8..ed4ee17 100644 --- a/include/uapi/linux/sched.h +++ b/include/uapi/linux/sched.h @@ -10,6 +10,7 @@ #define CLONE_FS 0x00000200 /* set if fs info shared between processes */ #define CLONE_FILES 0x00000400 /* set if open files shared between processes */ #define CLONE_SIGHAND 0x00000800 /* set if signal handlers and blocked signals shared */ +#define CLONE_PIDFD 0x00001000 /* set if a pidfd should be placed in parent */ #define CLONE_PTRACE 0x00002000 /* set if we want to let tracing continue on the child too */ #define CLONE_VFORK 0x00004000 /* set if the parent wants the child to wake it up on mm_release */ #define CLONE_PARENT 0x00008000 /* set if we want to have the same parent as the cloner */ diff --git a/kernel/fork.c b/kernel/fork.c index f2c92c1..e419891 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -11,6 +11,7 @@ * management can be a bitch. See 'mm/memory.c': 'copy_page_range()' */ +#include #include #include #include @@ -21,6 +22,7 @@ #include #include #include +#include #include #include #include @@ -1666,6 +1668,58 @@ static void copy_oom_score_adj(u64 clone_flags, struct task_struct *tsk) mutex_unlock(&oom_adj_mutex); } +static int pidfd_release(struct inode *inode, struct file *file) +{ + struct pid *pid = file->private_data; + + file->private_data = NULL; + put_pid(pid); + return 0; +} + +#ifdef CONFIG_PROC_FS +static void pidfd_show_fdinfo(struct seq_file *m, struct file *f) +{ + struct pid_namespace *ns = proc_pid_ns(file_inode(m->file)); + struct pid *pid = f->private_data; + + seq_put_decimal_ull(m, "Pid:\t", pid_nr_ns(pid, ns)); + seq_putc(m, '\n'); +} +#endif + +const struct file_operations pidfd_fops = { + .release = pidfd_release, +#ifdef CONFIG_PROC_FS + .show_fdinfo = pidfd_show_fdinfo, +#endif +}; + +/** + * pidfd_create() - Create a new pid file descriptor. + * + * @pid: struct pid that the pidfd will reference + * + * This creates a new pid file descriptor with the O_CLOEXEC flag set. + * + * Note, that this function can only be called after the fd table has + * been unshared to avoid leaking the pidfd to the new process. + * + * Return: On success, a cloexec pidfd is returned. + * On error, a negative errno number will be returned. + */ +static int pidfd_create(struct pid *pid) +{ + int fd; + + fd = anon_inode_getfd("[pidfd]", &pidfd_fops, get_pid(pid), + O_RDWR | O_CLOEXEC); + if (fd < 0) + put_pid(pid); + + return fd; +} + /* * This creates a new process as a copy of the old one, * but does not actually start it yet. @@ -1678,13 +1732,14 @@ static __latent_entropy struct task_struct *copy_process( unsigned long clone_flags, unsigned long stack_start, unsigned long stack_size, + int __user *parent_tidptr, int __user *child_tidptr, struct pid *pid, int trace, unsigned long tls, int node) { - int retval; + int pidfd = -1, retval; struct task_struct *p; struct multiprocess_signals delayed; @@ -1734,6 +1789,31 @@ static __latent_entropy struct task_struct *copy_process( return ERR_PTR(-EINVAL); } + if (clone_flags & CLONE_PIDFD) { + int reserved; + + /* + * - CLONE_PARENT_SETTID is useless for pidfds and also + * parent_tidptr is used to return pidfds. + * - CLONE_DETACHED is blocked so that we can potentially + * reuse it later for CLONE_PIDFD. + * - CLONE_THREAD is blocked until someone really needs it. + */ + if (clone_flags & + (CLONE_DETACHED | CLONE_PARENT_SETTID | CLONE_THREAD)) + return ERR_PTR(-EINVAL); + + /* + * Verify that parent_tidptr is sane so we can potentially + * reuse it later. + */ + if (get_user(reserved, parent_tidptr)) + return ERR_PTR(-EFAULT); + + if (reserved != 0) + return ERR_PTR(-EINVAL); + } + /* * Force any signals received before this point to be delivered * before the fork happens. Collect up signals sent to multiple @@ -1934,6 +2014,22 @@ static __latent_entropy struct task_struct *copy_process( } } + /* + * This has to happen after we've potentially unshared the file + * descriptor table (so that the pidfd doesn't leak into the child + * if the fd table isn't shared). + */ + if (clone_flags & CLONE_PIDFD) { + retval = pidfd_create(pid); + if (retval < 0) + goto bad_fork_free_pid; + + pidfd = retval; + retval = put_user(pidfd, parent_tidptr); + if (retval) + goto bad_fork_put_pidfd; + } + #ifdef CONFIG_BLOCK p->plug = NULL; #endif @@ -1989,7 +2085,7 @@ static __latent_entropy struct task_struct *copy_process( */ retval = cgroup_can_fork(p); if (retval) - goto bad_fork_free_pid; + goto bad_fork_put_pidfd; /* * From this point on we must avoid any synchronous user-space @@ -2111,6 +2207,9 @@ static __latent_entropy struct task_struct *copy_process( spin_unlock(¤t->sighand->siglock); write_unlock_irq(&tasklist_lock); cgroup_cancel_fork(p); +bad_fork_put_pidfd: + if (clone_flags & CLONE_PIDFD) + ksys_close(pidfd); bad_fork_free_pid: cgroup_threadgroup_change_end(current); if (pid != &init_struct_pid) @@ -2178,7 +2277,7 @@ static inline void init_idle_pids(struct task_struct *idle) struct task_struct *fork_idle(int cpu) { struct task_struct *task; - task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0, 0, + task = copy_process(CLONE_VM, 0, 0, NULL, NULL, &init_struct_pid, 0, 0, cpu_to_node(cpu)); if (!IS_ERR(task)) { init_idle_pids(task); @@ -2225,7 +2324,7 @@ long _do_fork(unsigned long clone_flags, trace = 0; } - p = copy_process(clone_flags, stack_start, stack_size, + p = copy_process(clone_flags, stack_start, stack_size, parent_tidptr, child_tidptr, NULL, trace, tls, NUMA_NO_NODE); add_latent_entropy(); From patchwork Thu Jan 7 07:53:09 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Wen Yang X-Patchwork-Id: 358912 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, UNPARSEABLE_RELAY, URIBL_BLOCKED, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1AE3EC433E6 for ; Thu, 7 Jan 2021 07:54:49 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id D520023125 for ; Thu, 7 Jan 2021 07:54:48 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727459AbhAGHyD (ORCPT ); Thu, 7 Jan 2021 02:54:03 -0500 Received: from out30-130.freemail.mail.aliyun.com ([115.124.30.130]:52721 "EHLO out30-130.freemail.mail.aliyun.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727415AbhAGHyD (ORCPT ); Thu, 7 Jan 2021 02:54:03 -0500 X-Alimail-AntiSpam: AC=PASS; BC=-1|-1; BR=01201311R261e4; CH=green; DM=||false|; DS=||; FP=0|-1|-1|-1|0|-1|-1|-1; HT=e01e04394; MF=wenyang@linux.alibaba.com; NM=1; PH=DS; RN=20; SR=0; TI=SMTPD_---0UKzE3SA_1610005997; Received: from localhost(mailfrom:wenyang@linux.alibaba.com fp:SMTPD_---0UKzE3SA_1610005997) by smtp.aliyun-inc.com(127.0.0.1); Thu, 07 Jan 2021 15:53:17 +0800 From: Wen Yang To: Greg Kroah-Hartman , Sasha Levin Cc: Xunlei Pang , linux-kernel@vger.kernel.org, "Joel Fernandes (Google)" , Andy Lutomirski , Steven Rostedt , Daniel Colascione , Jann Horn , Tim Murray , Jonathan Kowalski , Linus Torvalds , Al Viro , Kees Cook , David Howells , Oleg Nesterov , kernel-team@android.com, Christian Brauner , stable@vger.kernel.org, Wen Yang Subject: [PATCH 4.19 2/7] pidfd: add polling support Date: Thu, 7 Jan 2021 15:53:09 +0800 Message-Id: <20210107075314.62683-3-wenyang@linux.alibaba.com> X-Mailer: git-send-email 2.23.0 In-Reply-To: <20210107075314.62683-1-wenyang@linux.alibaba.com> References: <20210107075314.62683-1-wenyang@linux.alibaba.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: stable@vger.kernel.org From: "Joel Fernandes (Google)" [ Upstream commit b53b0b9d9a613c418057f6cb921c2f40a6f78c24 ] This patch adds polling support to pidfd. Android low memory killer (LMK) needs to know when a process dies once it is sent the kill signal. It does so by checking for the existence of /proc/pid which is both racy and slow. For example, if a PID is reused between when LMK sends a kill signal and checks for existence of the PID, since the wrong PID is now possibly checked for existence. Using the polling support, LMK will be able to get notified when a process exists in race-free and fast way, and allows the LMK to do other things (such as by polling on other fds) while awaiting the process being killed to die. For notification to polling processes, we follow the same existing mechanism in the kernel used when the parent of the task group is to be notified of a child's death (do_notify_parent). This is precisely when the tasks waiting on a poll of pidfd are also awakened in this patch. We have decided to include the waitqueue in struct pid for the following reasons: 1. The wait queue has to survive for the lifetime of the poll. Including it in task_struct would not be option in this case because the task can be reaped and destroyed before the poll returns. 2. By including the struct pid for the waitqueue means that during de_thread(), the new thread group leader automatically gets the new waitqueue/pid even though its task_struct is different. Appropriate test cases are added in the second patch to provide coverage of all the cases the patch is handling. Cc: Andy Lutomirski Cc: Steven Rostedt Cc: Daniel Colascione Cc: Jann Horn Cc: Tim Murray Cc: Jonathan Kowalski Cc: Linus Torvalds Cc: Al Viro Cc: Kees Cook Cc: David Howells Cc: Oleg Nesterov Cc: kernel-team@android.com Reviewed-by: Oleg Nesterov Co-developed-by: Daniel Colascione Signed-off-by: Daniel Colascione Signed-off-by: Joel Fernandes (Google) Signed-off-by: Christian Brauner Cc: # 4.19.x Signed-off-by: Wen Yang --- include/linux/pid.h | 3 +++ kernel/fork.c | 26 ++++++++++++++++++++++++++ kernel/pid.c | 2 ++ kernel/signal.c | 11 +++++++++++ 4 files changed, 42 insertions(+) diff --git a/include/linux/pid.h b/include/linux/pid.h index 29c0a99..a82d2f7 100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -3,6 +3,7 @@ #define _LINUX_PID_H #include +#include enum pid_type { @@ -60,6 +61,8 @@ struct pid unsigned int level; /* lists of tasks that use this pid */ struct hlist_head tasks[PIDTYPE_MAX]; + /* wait queue for pidfd notifications */ + wait_queue_head_t wait_pidfd; struct rcu_head rcu; struct upid numbers[1]; }; diff --git a/kernel/fork.c b/kernel/fork.c index e419891..33dc746 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1688,8 +1688,34 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f) } #endif +/* + * Poll support for process exit notification. + */ +static unsigned int pidfd_poll(struct file *file, struct poll_table_struct *pts) +{ + struct task_struct *task; + struct pid *pid = file->private_data; + int poll_flags = 0; + + poll_wait(file, &pid->wait_pidfd, pts); + + rcu_read_lock(); + task = pid_task(pid, PIDTYPE_PID); + /* + * Inform pollers only when the whole thread group exits. + * If the thread group leader exits before all other threads in the + * group, then poll(2) should block, similar to the wait(2) family. + */ + if (!task || (task->exit_state && thread_group_empty(task))) + poll_flags = POLLIN | POLLRDNORM; + rcu_read_unlock(); + + return poll_flags; +} + const struct file_operations pidfd_fops = { .release = pidfd_release, + .poll = pidfd_poll, #ifdef CONFIG_PROC_FS .show_fdinfo = pidfd_show_fdinfo, #endif diff --git a/kernel/pid.c b/kernel/pid.c index b88fe5e..3ba6fcb 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -214,6 +214,8 @@ struct pid *alloc_pid(struct pid_namespace *ns) for (type = 0; type < PIDTYPE_MAX; ++type) INIT_HLIST_HEAD(&pid->tasks[type]); + init_waitqueue_head(&pid->wait_pidfd); + upid = pid->numbers + ns->level; spin_lock_irq(&pidmap_lock); if (!(ns->pid_allocated & PIDNS_ADDING)) diff --git a/kernel/signal.c b/kernel/signal.c index a02a25a..22a04795 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1810,6 +1810,14 @@ int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type) return ret; } +static void do_notify_pidfd(struct task_struct *task) +{ + struct pid *pid; + + pid = task_pid(task); + wake_up_all(&pid->wait_pidfd); +} + /* * Let a parent know about the death of a child. * For a stopped/continued status change, use do_notify_parent_cldstop instead. @@ -1833,6 +1841,9 @@ bool do_notify_parent(struct task_struct *tsk, int sig) BUG_ON(!tsk->ptrace && (tsk->group_leader != tsk || !thread_group_empty(tsk))); + /* Wake up all pidfd waiters */ + do_notify_pidfd(tsk); + if (sig != SIGCHLD) { /* * This is only possible if parent == real_parent. From patchwork Thu Jan 7 07:53:10 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Wen Yang X-Patchwork-Id: 359734 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, UNPARSEABLE_RELAY, URIBL_BLOCKED, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0065EC433E9 for ; Thu, 7 Jan 2021 07:55:20 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id C51B623125 for ; Thu, 7 Jan 2021 07:55:19 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727442AbhAGHyD (ORCPT ); Thu, 7 Jan 2021 02:54:03 -0500 Received: from out30-132.freemail.mail.aliyun.com ([115.124.30.132]:43661 "EHLO out30-132.freemail.mail.aliyun.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726505AbhAGHyC (ORCPT ); Thu, 7 Jan 2021 02:54:02 -0500 X-Alimail-AntiSpam: AC=PASS; BC=-1|-1; BR=01201311R281e4; CH=green; DM=||false|; DS=||; FP=0|-1|-1|-1|0|-1|-1|-1; HT=e01e04423; MF=wenyang@linux.alibaba.com; NM=1; PH=DS; RN=7; SR=0; TI=SMTPD_---0UKzE3SW_1610005998; Received: from localhost(mailfrom:wenyang@linux.alibaba.com fp:SMTPD_---0UKzE3SW_1610005998) by smtp.aliyun-inc.com(127.0.0.1); Thu, 07 Jan 2021 15:53:18 +0800 From: Wen Yang To: Greg Kroah-Hartman , Sasha Levin Cc: Xunlei Pang , linux-kernel@vger.kernel.org, "Eric W. Biederman" , stable@vger.kernel.org, Wen Yang Subject: [PATCH 4.19 3/7] proc: Rename in proc_inode rename sysctl_inodes sibling_inodes Date: Thu, 7 Jan 2021 15:53:10 +0800 Message-Id: <20210107075314.62683-4-wenyang@linux.alibaba.com> X-Mailer: git-send-email 2.23.0 In-Reply-To: <20210107075314.62683-1-wenyang@linux.alibaba.com> References: <20210107075314.62683-1-wenyang@linux.alibaba.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: stable@vger.kernel.org From: "Eric W. Biederman" [ Upstream commit 0afa5ca82212247456f9de1468b595a111fee633 ] I about to need and use the same functionality for pid based inodes and there is no point in adding a second field when this field is already here and serving the same purporse. Just give the field a generic name so it is clear that it is no longer sysctl specific. Also for good measure initialize sibling_inodes when proc_inode is initialized. Signed-off-by: Eric W. Biederman Cc: # 4.19.x Signed-off-by: Wen Yang --- fs/proc/inode.c | 1 + fs/proc/internal.h | 2 +- fs/proc/proc_sysctl.c | 8 ++++---- 3 files changed, 6 insertions(+), 5 deletions(-) diff --git a/fs/proc/inode.c b/fs/proc/inode.c index 31bf3bb..e5334ed 100644 --- a/fs/proc/inode.c +++ b/fs/proc/inode.c @@ -70,6 +70,7 @@ static struct inode *proc_alloc_inode(struct super_block *sb) ei->pde = NULL; ei->sysctl = NULL; ei->sysctl_entry = NULL; + INIT_HLIST_NODE(&ei->sibling_inodes); ei->ns_ops = NULL; inode = &ei->vfs_inode; return inode; diff --git a/fs/proc/internal.h b/fs/proc/internal.h index 95b1419..d922c01 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -91,7 +91,7 @@ struct proc_inode { struct proc_dir_entry *pde; struct ctl_table_header *sysctl; struct ctl_table *sysctl_entry; - struct hlist_node sysctl_inodes; + struct hlist_node sibling_inodes; const struct proc_ns_operations *ns_ops; struct inode vfs_inode; } __randomize_layout; diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c index c95f32b..0f578f6 100644 --- a/fs/proc/proc_sysctl.c +++ b/fs/proc/proc_sysctl.c @@ -274,9 +274,9 @@ static void proc_sys_prune_dcache(struct ctl_table_header *head) node = hlist_first_rcu(&head->inodes); if (!node) break; - ei = hlist_entry(node, struct proc_inode, sysctl_inodes); + ei = hlist_entry(node, struct proc_inode, sibling_inodes); spin_lock(&sysctl_lock); - hlist_del_init_rcu(&ei->sysctl_inodes); + hlist_del_init_rcu(&ei->sibling_inodes); spin_unlock(&sysctl_lock); inode = &ei->vfs_inode; @@ -478,7 +478,7 @@ static struct inode *proc_sys_make_inode(struct super_block *sb, } ei->sysctl = head; ei->sysctl_entry = table; - hlist_add_head_rcu(&ei->sysctl_inodes, &head->inodes); + hlist_add_head_rcu(&ei->sibling_inodes, &head->inodes); head->count++; spin_unlock(&sysctl_lock); @@ -509,7 +509,7 @@ static struct inode *proc_sys_make_inode(struct super_block *sb, void proc_sys_evict_inode(struct inode *inode, struct ctl_table_header *head) { spin_lock(&sysctl_lock); - hlist_del_init_rcu(&PROC_I(inode)->sysctl_inodes); + hlist_del_init_rcu(&PROC_I(inode)->sibling_inodes); if (!--head->count) kfree_rcu(head, rcu); spin_unlock(&sysctl_lock); From patchwork Thu Jan 7 07:53:11 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Wen Yang X-Patchwork-Id: 359737 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, UNPARSEABLE_RELAY, URIBL_BLOCKED, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 707ECC433E9 for ; Thu, 7 Jan 2021 07:54:49 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 2AADA23123 for ; Thu, 7 Jan 2021 07:54:49 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727468AbhAGHyF (ORCPT ); Thu, 7 Jan 2021 02:54:05 -0500 Received: from out30-131.freemail.mail.aliyun.com ([115.124.30.131]:54133 "EHLO out30-131.freemail.mail.aliyun.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727416AbhAGHyE (ORCPT ); Thu, 7 Jan 2021 02:54:04 -0500 X-Alimail-AntiSpam: AC=PASS; BC=-1|-1; BR=01201311R121e4; CH=green; DM=||false|; DS=||; FP=0|-1|-1|-1|0|-1|-1|-1; HT=e01e01424; MF=wenyang@linux.alibaba.com; NM=1; PH=DS; RN=7; SR=0; TI=SMTPD_---0UKzE3St_1610005999; Received: from localhost(mailfrom:wenyang@linux.alibaba.com fp:SMTPD_---0UKzE3St_1610005999) by smtp.aliyun-inc.com(127.0.0.1); Thu, 07 Jan 2021 15:53:19 +0800 From: Wen Yang To: Greg Kroah-Hartman , Sasha Levin Cc: Xunlei Pang , linux-kernel@vger.kernel.org, "Eric W. Biederman" , stable@vger.kernel.org, Wen Yang Subject: [PATCH 4.19 4/7] proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache Date: Thu, 7 Jan 2021 15:53:11 +0800 Message-Id: <20210107075314.62683-5-wenyang@linux.alibaba.com> X-Mailer: git-send-email 2.23.0 In-Reply-To: <20210107075314.62683-1-wenyang@linux.alibaba.com> References: <20210107075314.62683-1-wenyang@linux.alibaba.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: stable@vger.kernel.org From: "Eric W. Biederman" [ Upstream commit 26dbc60f385ff9cff475ea2a3bad02e80fd6fa43 ] This prepares the way for allowing the pid part of proc to use this dcache pruning code as well. Signed-off-by: Eric W. Biederman Cc: # 4.19.x Signed-off-by: Wen Yang --- fs/proc/inode.c | 38 ++++++++++++++++++++++++++++++++++++++ fs/proc/internal.h | 1 + fs/proc/proc_sysctl.c | 35 +---------------------------------- 3 files changed, 40 insertions(+), 34 deletions(-) diff --git a/fs/proc/inode.c b/fs/proc/inode.c index e5334ed..fffc7e4 100644 --- a/fs/proc/inode.c +++ b/fs/proc/inode.c @@ -112,6 +112,44 @@ void __init proc_init_kmemcache(void) BUILD_BUG_ON(sizeof(struct proc_dir_entry) >= SIZEOF_PDE); } +void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock) +{ + struct inode *inode; + struct proc_inode *ei; + struct hlist_node *node; + struct super_block *sb; + + rcu_read_lock(); + for (;;) { + node = hlist_first_rcu(inodes); + if (!node) + break; + ei = hlist_entry(node, struct proc_inode, sibling_inodes); + spin_lock(lock); + hlist_del_init_rcu(&ei->sibling_inodes); + spin_unlock(lock); + + inode = &ei->vfs_inode; + sb = inode->i_sb; + if (!atomic_inc_not_zero(&sb->s_active)) + continue; + inode = igrab(inode); + rcu_read_unlock(); + if (unlikely(!inode)) { + deactivate_super(sb); + rcu_read_lock(); + continue; + } + + d_prune_aliases(inode); + iput(inode); + deactivate_super(sb); + + rcu_read_lock(); + } + rcu_read_unlock(); +} + static int proc_show_options(struct seq_file *seq, struct dentry *root) { struct super_block *sb = root->d_sb; diff --git a/fs/proc/internal.h b/fs/proc/internal.h index d922c01..6cae472 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -210,6 +210,7 @@ struct pde_opener { extern const struct inode_operations proc_pid_link_inode_operations; void proc_init_kmemcache(void); +void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock); void set_proc_pid_nlink(void); extern struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *); extern int proc_fill_super(struct super_block *, void *data, int flags); diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c index 0f578f6..57b16bf 100644 --- a/fs/proc/proc_sysctl.c +++ b/fs/proc/proc_sysctl.c @@ -264,40 +264,7 @@ static void unuse_table(struct ctl_table_header *p) static void proc_sys_prune_dcache(struct ctl_table_header *head) { - struct inode *inode; - struct proc_inode *ei; - struct hlist_node *node; - struct super_block *sb; - - rcu_read_lock(); - for (;;) { - node = hlist_first_rcu(&head->inodes); - if (!node) - break; - ei = hlist_entry(node, struct proc_inode, sibling_inodes); - spin_lock(&sysctl_lock); - hlist_del_init_rcu(&ei->sibling_inodes); - spin_unlock(&sysctl_lock); - - inode = &ei->vfs_inode; - sb = inode->i_sb; - if (!atomic_inc_not_zero(&sb->s_active)) - continue; - inode = igrab(inode); - rcu_read_unlock(); - if (unlikely(!inode)) { - deactivate_super(sb); - rcu_read_lock(); - continue; - } - - d_prune_aliases(inode); - iput(inode); - deactivate_super(sb); - - rcu_read_lock(); - } - rcu_read_unlock(); + proc_prune_siblings_dcache(&head->inodes, &sysctl_lock); } /* called under sysctl_lock, will reacquire if has to wait */ From patchwork Thu Jan 7 07:53:12 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Wen Yang X-Patchwork-Id: 359735 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, UNPARSEABLE_RELAY, URIBL_BLOCKED, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id EB4BDC433E0 for ; Thu, 7 Jan 2021 07:54:50 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id A809323125 for ; Thu, 7 Jan 2021 07:54:50 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727418AbhAGHye (ORCPT ); Thu, 7 Jan 2021 02:54:34 -0500 Received: from out30-54.freemail.mail.aliyun.com ([115.124.30.54]:48027 "EHLO out30-54.freemail.mail.aliyun.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727396AbhAGHyE (ORCPT ); Thu, 7 Jan 2021 02:54:04 -0500 X-Alimail-AntiSpam: AC=PASS; BC=-1|-1; BR=01201311R161e4; CH=green; DM=||false|; DS=||; FP=0|-1|-1|-1|0|-1|-1|-1; HT=e01e04420; MF=wenyang@linux.alibaba.com; NM=1; PH=DS; RN=7; SR=0; TI=SMTPD_---0UKzNwIV_1610006000; Received: from localhost(mailfrom:wenyang@linux.alibaba.com fp:SMTPD_---0UKzNwIV_1610006000) by smtp.aliyun-inc.com(127.0.0.1); Thu, 07 Jan 2021 15:53:20 +0800 From: Wen Yang To: Greg Kroah-Hartman , Sasha Levin Cc: Xunlei Pang , linux-kernel@vger.kernel.org, "Eric W. Biederman" , stable@vger.kernel.org, Wen Yang Subject: [PATCH 4.19 5/7] proc: Clear the pieces of proc_inode that proc_evict_inode cares about Date: Thu, 7 Jan 2021 15:53:12 +0800 Message-Id: <20210107075314.62683-6-wenyang@linux.alibaba.com> X-Mailer: git-send-email 2.23.0 In-Reply-To: <20210107075314.62683-1-wenyang@linux.alibaba.com> References: <20210107075314.62683-1-wenyang@linux.alibaba.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: stable@vger.kernel.org From: "Eric W. Biederman" [ Upstream commit 71448011ea2a1cd36d8f5cbdab0ed716c454d565 ] This just keeps everything tidier, and allows for using flags like SLAB_TYPESAFE_BY_RCU where slabs are not always cleared before reuse. I don't see reuse without reinitializing happening with the proc_inode but I had a false alarm while reworking flushing of proc dentries and indoes when a process dies that caused me to tidy this up. The code is a little easier to follow and reason about this way so I figured the changes might as well be kept. Signed-off-by: "Eric W. Biederman" Cc: # 4.19.x Signed-off-by: Wen Yang --- fs/proc/inode.c | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/fs/proc/inode.c b/fs/proc/inode.c index fffc7e4..45b4344 100644 --- a/fs/proc/inode.c +++ b/fs/proc/inode.c @@ -34,21 +34,27 @@ static void proc_evict_inode(struct inode *inode) { struct proc_dir_entry *de; struct ctl_table_header *head; + struct proc_inode *ei = PROC_I(inode); truncate_inode_pages_final(&inode->i_data); clear_inode(inode); /* Stop tracking associated processes */ - put_pid(PROC_I(inode)->pid); + if (ei->pid) { + put_pid(ei->pid); + ei->pid = NULL; + } /* Let go of any associated proc directory entry */ - de = PDE(inode); - if (de) + de = ei->pde; + if (de) { pde_put(de); + ei->pde = NULL; + } - head = PROC_I(inode)->sysctl; + head = ei->sysctl; if (head) { - RCU_INIT_POINTER(PROC_I(inode)->sysctl, NULL); + RCU_INIT_POINTER(ei->sysctl, NULL); proc_sys_evict_inode(inode, head); } } From patchwork Thu Jan 7 07:53:13 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Wen Yang X-Patchwork-Id: 359736 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, UNPARSEABLE_RELAY, URIBL_BLOCKED, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2E4DBC43331 for ; Thu, 7 Jan 2021 07:54:50 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id F360023123 for ; Thu, 7 Jan 2021 07:54:49 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726545AbhAGHyQ (ORCPT ); Thu, 7 Jan 2021 02:54:16 -0500 Received: from out30-44.freemail.mail.aliyun.com ([115.124.30.44]:43633 "EHLO out30-44.freemail.mail.aliyun.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727441AbhAGHyF (ORCPT ); Thu, 7 Jan 2021 02:54:05 -0500 X-Alimail-AntiSpam: AC=PASS; BC=-1|-1; BR=01201311R221e4; CH=green; DM=||false|; DS=||; FP=0|-1|-1|-1|0|-1|-1|-1; HT=alimailimapcm10staff010182156082; MF=wenyang@linux.alibaba.com; NM=1; PH=DS; RN=7; SR=0; TI=SMTPD_---0UKysbVc_1610006001; Received: from localhost(mailfrom:wenyang@linux.alibaba.com fp:SMTPD_---0UKysbVc_1610006001) by smtp.aliyun-inc.com(127.0.0.1); Thu, 07 Jan 2021 15:53:21 +0800 From: Wen Yang To: Greg Kroah-Hartman , Sasha Levin Cc: Xunlei Pang , linux-kernel@vger.kernel.org, "Eric W. Biederman" , stable@vger.kernel.org, Wen Yang Subject: [PATCH 4.19 6/7] proc: Use d_invalidate in proc_prune_siblings_dcache Date: Thu, 7 Jan 2021 15:53:13 +0800 Message-Id: <20210107075314.62683-7-wenyang@linux.alibaba.com> X-Mailer: git-send-email 2.23.0 In-Reply-To: <20210107075314.62683-1-wenyang@linux.alibaba.com> References: <20210107075314.62683-1-wenyang@linux.alibaba.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: stable@vger.kernel.org From: "Eric W. Biederman" [ Upstream commit f90f3cafe8d56d593fc509a4185da1d5800efea4 ] The function d_prune_aliases has the problem that it will only prune aliases thare are completely unused. It will not remove aliases for the dcache or even think of removing mounts from the dcache. For that behavior d_invalidate is needed. To use d_invalidate replace d_prune_aliases with d_find_alias followed by d_invalidate and dput. For completeness the directory and the non-directory cases are separated because in theory (although not in currently in practice for proc) directories can only ever have a single dentry while non-directories can have hardlinks and thus multiple dentries. As part of this separation use d_find_any_alias for directories to spare d_find_alias the extra work of doing that. Plus the differences between d_find_any_alias and d_find_alias makes it clear why the directory and non-directory code and not share code. To make it clear these routines now invalidate dentries rename proc_prune_siblings_dache to proc_invalidate_siblings_dcache, and rename proc_sys_prune_dcache proc_sys_invalidate_dcache. V2: Split the directory and non-directory cases. To make this code robust to future changes in proc. Signed-off-by: "Eric W. Biederman" Cc: # 4.19.x Signed-off-by: Wen Yang --- fs/proc/inode.c | 16 ++++++++++++++-- fs/proc/internal.h | 2 +- fs/proc/proc_sysctl.c | 8 ++++---- 3 files changed, 19 insertions(+), 7 deletions(-) diff --git a/fs/proc/inode.c b/fs/proc/inode.c index 45b4344..fad579e 100644 --- a/fs/proc/inode.c +++ b/fs/proc/inode.c @@ -118,7 +118,7 @@ void __init proc_init_kmemcache(void) BUILD_BUG_ON(sizeof(struct proc_dir_entry) >= SIZEOF_PDE); } -void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock) +void proc_invalidate_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock) { struct inode *inode; struct proc_inode *ei; @@ -147,7 +147,19 @@ void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock) continue; } - d_prune_aliases(inode); + if (S_ISDIR(inode->i_mode)) { + struct dentry *dir = d_find_any_alias(inode); + if (dir) { + d_invalidate(dir); + dput(dir); + } + } else { + struct dentry *dentry; + while ((dentry = d_find_alias(inode))) { + d_invalidate(dentry); + dput(dentry); + } + } iput(inode); deactivate_super(sb); diff --git a/fs/proc/internal.h b/fs/proc/internal.h index 6cae472..1db693b 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -210,7 +210,7 @@ struct pde_opener { extern const struct inode_operations proc_pid_link_inode_operations; void proc_init_kmemcache(void); -void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock); +void proc_invalidate_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock); void set_proc_pid_nlink(void); extern struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *); extern int proc_fill_super(struct super_block *, void *data, int flags); diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c index 57b16bf..f8f1f8a 100644 --- a/fs/proc/proc_sysctl.c +++ b/fs/proc/proc_sysctl.c @@ -262,9 +262,9 @@ static void unuse_table(struct ctl_table_header *p) complete(p->unregistering); } -static void proc_sys_prune_dcache(struct ctl_table_header *head) +static void proc_sys_invalidate_dcache(struct ctl_table_header *head) { - proc_prune_siblings_dcache(&head->inodes, &sysctl_lock); + proc_invalidate_siblings_dcache(&head->inodes, &sysctl_lock); } /* called under sysctl_lock, will reacquire if has to wait */ @@ -286,10 +286,10 @@ static void start_unregistering(struct ctl_table_header *p) spin_unlock(&sysctl_lock); } /* - * Prune dentries for unregistered sysctls: namespaced sysctls + * Invalidate dentries for unregistered sysctls: namespaced sysctls * can have duplicate names and contaminate dcache very badly. */ - proc_sys_prune_dcache(p); + proc_sys_invalidate_dcache(p); /* * do not remove from the list until nobody holds it; walking the * list in do_sysctl() relies on that. From patchwork Thu Jan 7 07:53:14 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Wen Yang X-Patchwork-Id: 358911 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, UNPARSEABLE_RELAY, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id ACA89C4332D for ; Thu, 7 Jan 2021 07:54:49 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 7A9932312A for ; Thu, 7 Jan 2021 07:54:49 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727416AbhAGHyH (ORCPT ); Thu, 7 Jan 2021 02:54:07 -0500 Received: from out30-131.freemail.mail.aliyun.com ([115.124.30.131]:45610 "EHLO out30-131.freemail.mail.aliyun.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727463AbhAGHyH (ORCPT ); Thu, 7 Jan 2021 02:54:07 -0500 X-Alimail-AntiSpam: AC=PASS; BC=-1|-1; BR=01201311R531e4; CH=green; DM=||false|; DS=||; FP=0|-1|-1|-1|0|-1|-1|-1; HT=e01e04400; MF=wenyang@linux.alibaba.com; NM=1; PH=DS; RN=7; SR=0; TI=SMTPD_---0UKzNwIl_1610006002; Received: from localhost(mailfrom:wenyang@linux.alibaba.com fp:SMTPD_---0UKzNwIl_1610006002) by smtp.aliyun-inc.com(127.0.0.1); Thu, 07 Jan 2021 15:53:22 +0800 From: Wen Yang To: Greg Kroah-Hartman , Sasha Levin Cc: Xunlei Pang , linux-kernel@vger.kernel.org, "Eric W. Biederman" , stable@vger.kernel.org, Wen Yang Subject: [PATCH 4.19 7/7] proc: Use a list of inodes to flush from proc Date: Thu, 7 Jan 2021 15:53:14 +0800 Message-Id: <20210107075314.62683-8-wenyang@linux.alibaba.com> X-Mailer: git-send-email 2.23.0 In-Reply-To: <20210107075314.62683-1-wenyang@linux.alibaba.com> References: <20210107075314.62683-1-wenyang@linux.alibaba.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: stable@vger.kernel.org From: "Eric W. Biederman" [ Upstream commit 7bc3e6e55acf065500a24621f3b313e7e5998acf ] Rework the flushing of proc to use a list of directory inodes that need to be flushed. The list is kept on struct pid not on struct task_struct, as there is a fixed connection between proc inodes and pids but at least for the case of de_thread the pid of a task_struct changes. This removes the dependency on proc_mnt which allows for different mounts of proc having different mount options even in the same pid namespace and this allows for the removal of proc_mnt which will trivially the first mount of proc to honor it's mount options. This flushing remains an optimization. The functions pid_delete_dentry and pid_revalidate ensure that ordinary dcache management will not attempt to use dentries past the point their respective task has died. When unused the shrinker will eventually be able to remove these dentries. There is a case in de_thread where proc_flush_pid can be called early for a given pid. Which winds up being safe (if suboptimal) as this is just an optiimization. Only pid directories are put on the list as the other per pid files are children of those directories and d_invalidate on the directory will get them as well. So that the pid can be used during flushing it's reference count is taken in release_task and dropped in proc_flush_pid. Further the call of proc_flush_pid is moved after the tasklist_lock is released in release_task so that it is certain that the pid has already been unhashed when flushing it taking place. This removes a small race where a dentry could recreated. As struct pid is supposed to be small and I need a per pid lock I reuse the only lock that currently exists in struct pid the the wait_pidfd.lock. The net result is that this adds all of this functionality with just a little extra list management overhead and a single extra pointer in struct pid. v2: Initialize pid->inodes. I somehow failed to get that initialization into the initial version of the patch. A boot failure was reported by "kernel test robot ", and failure to initialize that pid->inodes matches all of the reported symptoms. Signed-off-by: Eric W. Biederman Fixes: f333c700c610 ("pidns: Add a limit on the number of pid namespaces") Fixes: 60347f6716aa ("pid namespaces: prepare proc_flust_task() to flush entries from multiple proc trees") Cc: # 4.19.x: b3e583825266: clone: add CLONE_PIDFD Cc: # 4.19.x: b53b0b9d9a61: pidfd: add polling support Cc: # 4.19.x: 0afa5ca82212: proc: Rename in proc_inode rename sysctl_inodes sibling_inodes Cc: # 4.19.x: 26dbc60f385f: proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache Cc: # 4.19.x: 71448011ea2a: proc: Clear the pieces of proc_inode that proc_evict_inode cares about Cc: # 4.19.x: f90f3cafe8d5: Use d_invalidate in proc_prune_siblings_dcache Cc: # 4.19.x Signed-off-by: Wen Yang --- fs/proc/base.c | 111 ++++++++++++++++-------------------------------- fs/proc/inode.c | 2 +- fs/proc/internal.h | 1 + include/linux/pid.h | 1 + include/linux/proc_fs.h | 4 +- kernel/exit.c | 4 +- kernel/pid.c | 1 + 7 files changed, 45 insertions(+), 79 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index 5e705fa..ea74c7c 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1741,11 +1741,25 @@ void task_dump_owner(struct task_struct *task, umode_t mode, *rgid = gid; } +void proc_pid_evict_inode(struct proc_inode *ei) +{ + struct pid *pid = ei->pid; + + if (S_ISDIR(ei->vfs_inode.i_mode)) { + spin_lock(&pid->wait_pidfd.lock); + hlist_del_init_rcu(&ei->sibling_inodes); + spin_unlock(&pid->wait_pidfd.lock); + } + + put_pid(pid); +} + struct inode *proc_pid_make_inode(struct super_block * sb, struct task_struct *task, umode_t mode) { struct inode * inode; struct proc_inode *ei; + struct pid *pid; /* We need a new inode */ @@ -1763,10 +1777,18 @@ struct inode *proc_pid_make_inode(struct super_block * sb, /* * grab the reference to task. */ - ei->pid = get_task_pid(task, PIDTYPE_PID); - if (!ei->pid) + pid = get_task_pid(task, PIDTYPE_PID); + if (!pid) goto out_unlock; + /* Let the pid remember us for quick removal */ + ei->pid = pid; + if (S_ISDIR(mode)) { + spin_lock(&pid->wait_pidfd.lock); + hlist_add_head_rcu(&ei->sibling_inodes, &pid->inodes); + spin_unlock(&pid->wait_pidfd.lock); + } + task_dump_owner(task, 0, &inode->i_uid, &inode->i_gid); security_task_to_inode(task, inode); @@ -3067,90 +3089,29 @@ static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *de .permission = proc_pid_permission, }; -static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid) -{ - struct dentry *dentry, *leader, *dir; - char buf[10 + 1]; - struct qstr name; - - name.name = buf; - name.len = snprintf(buf, sizeof(buf), "%u", pid); - /* no ->d_hash() rejects on procfs */ - dentry = d_hash_and_lookup(mnt->mnt_root, &name); - if (dentry) { - d_invalidate(dentry); - dput(dentry); - } - - if (pid == tgid) - return; - - name.name = buf; - name.len = snprintf(buf, sizeof(buf), "%u", tgid); - leader = d_hash_and_lookup(mnt->mnt_root, &name); - if (!leader) - goto out; - - name.name = "task"; - name.len = strlen(name.name); - dir = d_hash_and_lookup(leader, &name); - if (!dir) - goto out_put_leader; - - name.name = buf; - name.len = snprintf(buf, sizeof(buf), "%u", pid); - dentry = d_hash_and_lookup(dir, &name); - if (dentry) { - d_invalidate(dentry); - dput(dentry); - } - - dput(dir); -out_put_leader: - dput(leader); -out: - return; -} - /** - * proc_flush_task - Remove dcache entries for @task from the /proc dcache. - * @task: task that should be flushed. + * proc_flush_pid - Remove dcache entries for @pid from the /proc dcache. + * @pid: pid that should be flushed. * - * When flushing dentries from proc, one needs to flush them from global - * proc (proc_mnt) and from all the namespaces' procs this task was seen - * in. This call is supposed to do all of this job. - * - * Looks in the dcache for - * /proc/@pid - * /proc/@tgid/task/@pid - * if either directory is present flushes it and all of it'ts children - * from the dcache. + * This function walks a list of inodes (that belong to any proc + * filesystem) that are attached to the pid and flushes them from + * the dentry cache. * * It is safe and reasonable to cache /proc entries for a task until * that task exits. After that they just clog up the dcache with * useless entries, possibly causing useful dcache entries to be - * flushed instead. This routine is proved to flush those useless - * dcache entries at process exit time. + * flushed instead. This routine is provided to flush those useless + * dcache entries when a process is reaped. * * NOTE: This routine is just an optimization so it does not guarantee - * that no dcache entries will exist at process exit time it - * just makes it very unlikely that any will persist. + * that no dcache entries will exist after a process is reaped + * it just makes it very unlikely that any will persist. */ -void proc_flush_task(struct task_struct *task) +void proc_flush_pid(struct pid *pid) { - int i; - struct pid *pid, *tgid; - struct upid *upid; - - pid = task_pid(task); - tgid = task_tgid(task); - - for (i = 0; i <= pid->level; i++) { - upid = &pid->numbers[i]; - proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr, - tgid->numbers[i].nr); - } + proc_invalidate_siblings_dcache(&pid->inodes, &pid->wait_pidfd.lock); + put_pid(pid); } static struct dentry *proc_pid_instantiate(struct dentry * dentry, diff --git a/fs/proc/inode.c b/fs/proc/inode.c index fad579e..83c640a 100644 --- a/fs/proc/inode.c +++ b/fs/proc/inode.c @@ -41,7 +41,7 @@ static void proc_evict_inode(struct inode *inode) /* Stop tracking associated processes */ if (ei->pid) { - put_pid(ei->pid); + proc_pid_evict_inode(ei); ei->pid = NULL; } diff --git a/fs/proc/internal.h b/fs/proc/internal.h index 1db693b..be55ee2 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -158,6 +158,7 @@ extern int proc_pid_statm(struct seq_file *, struct pid_namespace *, extern const struct dentry_operations pid_dentry_operations; extern int pid_getattr(const struct path *, struct kstat *, u32, unsigned int); extern int proc_setattr(struct dentry *, struct iattr *); +extern void proc_pid_evict_inode(struct proc_inode *); extern struct inode *proc_pid_make_inode(struct super_block *, struct task_struct *, umode_t); extern void pid_update_inode(struct task_struct *, struct inode *); extern int pid_delete_dentry(const struct dentry *); diff --git a/include/linux/pid.h b/include/linux/pid.h index a82d2f7..08efbe1 100644 --- a/include/linux/pid.h +++ b/include/linux/pid.h @@ -61,6 +61,7 @@ struct pid unsigned int level; /* lists of tasks that use this pid */ struct hlist_head tasks[PIDTYPE_MAX]; + struct hlist_head inodes; /* wait queue for pidfd notifications */ wait_queue_head_t wait_pidfd; struct rcu_head rcu; diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h index d0e1f15..94bf352 100644 --- a/include/linux/proc_fs.h +++ b/include/linux/proc_fs.h @@ -17,7 +17,7 @@ typedef int (*proc_write_t)(struct file *, char *, size_t); extern void proc_root_init(void); -extern void proc_flush_task(struct task_struct *); +extern void proc_flush_pid(struct pid *); extern struct proc_dir_entry *proc_symlink(const char *, struct proc_dir_entry *, const char *); @@ -80,7 +80,7 @@ static inline void proc_root_init(void) { } -static inline void proc_flush_task(struct task_struct *task) +static inline void proc_flush_pid(struct pid *pid) { } diff --git a/kernel/exit.c b/kernel/exit.c index 65133eb..55996ba 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -185,6 +185,7 @@ static void delayed_put_task_struct(struct rcu_head *rhp) void release_task(struct task_struct *p) { struct task_struct *leader; + struct pid *thread_pid; int zap_leader; repeat: /* don't need to get the RCU readlock here - the process is dead and @@ -193,11 +194,11 @@ void release_task(struct task_struct *p) atomic_dec(&__task_cred(p)->user->processes); rcu_read_unlock(); - proc_flush_task(p); cgroup_release(p); write_lock_irq(&tasklist_lock); ptrace_release_task(p); + thread_pid = get_pid(p->thread_pid); __exit_signal(p); /* @@ -220,6 +221,7 @@ void release_task(struct task_struct *p) } write_unlock_irq(&tasklist_lock); + proc_flush_pid(thread_pid); release_thread(p); call_rcu(&p->rcu, delayed_put_task_struct); diff --git a/kernel/pid.c b/kernel/pid.c index 3ba6fcb..65df035 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -215,6 +215,7 @@ struct pid *alloc_pid(struct pid_namespace *ns) INIT_HLIST_HEAD(&pid->tasks[type]); init_waitqueue_head(&pid->wait_pidfd); + INIT_HLIST_HEAD(&pid->inodes); upid = pid->numbers + ns->level; spin_lock_irq(&pidmap_lock);