From patchwork Fri May 8 23:53:19 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Andrew Morton X-Patchwork-Id: 226056 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.7 required=3.0 tests=DKIMWL_WL_HIGH, DKIM_SIGNED, DKIM_VALID, HEADER_FROM_DIFFERENT_DOMAINS, LOTS_OF_MONEY, MAILING_LIST_MULTI, SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id C7EE6C47255 for ; Fri, 8 May 2020 23:53:22 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id A7A7B2063A for ; Fri, 8 May 2020 23:53:22 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1588982002; bh=Q+zg/S+1Gx4065xyCGOd2vJrhliJjaxxiPkQceLl/fM=; h=Date:From:To:Subject:In-Reply-To:List-ID:From; b=bDY8gZaXV2HXrNumYbTYRFSZnx2sLf2MkIeGRSLdH78Ar803Vhj3j1dqE/80gB6i5 Kdq0PiWmu0u9LQsKi0ROPgNipZGgaiLCW6Yd4H1255pY2DP5BuuQR6mkcWJTJ2ms12 nvPIxmdrE/kGH2zkfAksOj+hfORNjCBahoO0BbSQ= Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728379AbgEHXxV (ORCPT ); Fri, 8 May 2020 19:53:21 -0400 Received: from mail.kernel.org ([198.145.29.99]:55836 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727878AbgEHXxV (ORCPT ); Fri, 8 May 2020 19:53:21 -0400 Received: from localhost.localdomain (c-73-231-172-41.hsd1.ca.comcast.net [73.231.172.41]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 3FB262063A; Fri, 8 May 2020 23:53:20 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1588982000; bh=Q+zg/S+1Gx4065xyCGOd2vJrhliJjaxxiPkQceLl/fM=; h=Date:From:To:Subject:In-Reply-To:From; b=W2CClFWnASUjWS3FjpdC+13aANTdykKUpo1+WB7WWdgN4ditEx5p5vZsiXCufifz+ YUneyH0/haAWWK4a5VVmTNlaNkc0tXVkMD+yQUVUrMqv9sDI+Pd7ERs3/1lEUwxtIt 1nQLNWv/NVP/skh719Bp4srlvKkwvwPXBF9AYHeI= Date: Fri, 08 May 2020 16:53:19 -0700 From: Andrew Morton To: akpm@linux-foundation.org, dan.j.williams@intel.com, dave.jiang@intel.com, david@redhat.com, mm-commits@vger.kernel.org, pasha.tatashin@soleen.com, stable@vger.kernel.org, vishal.l.verma@intel.com Subject: + device-dax-dont-leak-kernel-memory-to-user-space-after-unloading-kmem.patch added to -mm tree Message-ID: <20200508235319.0NmZ0DltL%akpm@linux-foundation.org> In-Reply-To: <20200507183509.c5ef146c5aaeb118a25a39a8@linux-foundation.org> User-Agent: s-nail v14.8.16 MIME-Version: 1.0 Sender: stable-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: stable@vger.kernel.org The patch titled Subject: device-dax: don't leak kernel memory to user space after unloading kmem has been added to the -mm tree. Its filename is device-dax-dont-leak-kernel-memory-to-user-space-after-unloading-kmem.patch This patch should soon appear at http://ozlabs.org/~akpm/mmots/broken-out/device-dax-dont-leak-kernel-memory-to-user-space-after-unloading-kmem.patch and later at http://ozlabs.org/~akpm/mmotm/broken-out/device-dax-dont-leak-kernel-memory-to-user-space-after-unloading-kmem.patch Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next and is updated there every 3-4 working days ------------------------------------------------------ From: David Hildenbrand Subject: device-dax: don't leak kernel memory to user space after unloading kmem Assume we have kmem configured and loaded: [root@localhost ~]# cat /proc/iomem ... 140000000-33fffffff : Persistent Memory$ 140000000-1481fffff : namespace0.0 150000000-33fffffff : dax0.0 150000000-33fffffff : System RAM Assume we try to unload kmem. This force-unloading will work, even if memory cannot get removed from the system. [root@localhost ~]# rmmod kmem [ 86.380228] removing memory fails, because memory [0x0000000150000000-0x0000000157ffffff] is onlined ... [ 86.431225] kmem dax0.0: DAX region [mem 0x150000000-0x33fffffff] cannot be hotremoved until the next reboot Now, we can reconfigure the namespace: [root@localhost ~]# ndctl create-namespace --force --reconfig=namespace0.0 --mode=devdax [ 131.409351] nd_pmem namespace0.0: could not reserve region [mem 0x140000000-0x33fffffff]dax [ 131.410147] nd_pmem: probe of namespace0.0 failed with error -16namespace0.0 --mode=devdax ... This fails as expected due to the busy memory resource, and the memory cannot be used. However, the dax0.0 device is removed, and along its name. The name of the memory resource now points at freed memory (name of the device). [root@localhost ~]# cat /proc/iomem ... 140000000-33fffffff : Persistent Memory 140000000-1481fffff : namespace0.0 150000000-33fffffff : �_�^7_��/_��wR��WQ���^��� ... 150000000-33fffffff : System RAM We have to make sure to duplicate the string. While at it, remove the superfluous setting of the name and fixup a stale comment. Link: http://lkml.kernel.org/r/20200508084217.9160-2-david@redhat.com Fixes: 9f960da72b25 ("device-dax: "Hotremove" persistent memory that is used like normal RAM") Signed-off-by: David Hildenbrand Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Pavel Tatashin Cc: Andrew Morton Cc: [5.3] Signed-off-by: Andrew Morton --- drivers/dax/kmem.c | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) --- a/drivers/dax/kmem.c~device-dax-dont-leak-kernel-memory-to-user-space-after-unloading-kmem +++ a/drivers/dax/kmem.c @@ -22,6 +22,7 @@ int dev_dax_kmem_probe(struct device *de resource_size_t kmem_size; resource_size_t kmem_end; struct resource *new_res; + const char *new_res_name; int numa_node; int rc; @@ -48,11 +49,16 @@ int dev_dax_kmem_probe(struct device *de kmem_size &= ~(memory_block_size_bytes() - 1); kmem_end = kmem_start + kmem_size; - /* Region is permanently reserved. Hot-remove not yet implemented. */ - new_res = request_mem_region(kmem_start, kmem_size, dev_name(dev)); + new_res_name = kstrdup(dev_name(dev), GFP_KERNEL); + if (!new_res_name) + return -ENOMEM; + + /* Region is permanently reserved if hotremove fails. */ + new_res = request_mem_region(kmem_start, kmem_size, new_res_name); if (!new_res) { dev_warn(dev, "could not reserve region [%pa-%pa]\n", &kmem_start, &kmem_end); + kfree(new_res_name); return -EBUSY; } @@ -63,12 +69,12 @@ int dev_dax_kmem_probe(struct device *de * unknown to us that will break add_memory() below. */ new_res->flags = IORESOURCE_SYSTEM_RAM; - new_res->name = dev_name(dev); rc = add_memory(numa_node, new_res->start, resource_size(new_res)); if (rc) { release_resource(new_res); kfree(new_res); + kfree(new_res_name); return rc; } dev_dax->dax_kmem_res = new_res; @@ -83,6 +89,7 @@ static int dev_dax_kmem_remove(struct de struct resource *res = dev_dax->dax_kmem_res; resource_size_t kmem_start = res->start; resource_size_t kmem_size = resource_size(res); + const char *res_name = res->name; int rc; /* @@ -102,6 +109,7 @@ static int dev_dax_kmem_remove(struct de /* Release and free dax resources */ release_resource(res); kfree(res); + kfree(res_name); dev_dax->dax_kmem_res = NULL; return 0; From patchwork Fri May 8 01:35:46 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andrew Morton X-Patchwork-Id: 226305 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.8 required=3.0 tests=DKIMWL_WL_HIGH, DKIM_SIGNED, DKIM_VALID, HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI, SIGNED_OFF_BY, SPF_HELO_NONE, SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 718CEC47247 for ; Fri, 8 May 2020 01:35:49 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 475B620A8B for ; Fri, 8 May 2020 01:35:49 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1588901749; bh=p66t++QEcmXGWyvKZsR/6dx7XY7ecH/xwXoI7luDbYo=; h=Date:From:To:Subject:In-Reply-To:List-ID:From; b=aBc2AEnvQ74wQUQMFXZ9vGerhgqZRoz5v2HRlfH5aig2BVqTfB5s+cSzn5/+2HF46 Wu8EIPRsYOkz0eJXk3/2D01W36L9YG4AcpnfdoTekQQb+Cf3GJDVfEKW3cE7mogUdI NzKSelNVVEaGy4S6uF6ugxYzOZWqqyY7q3vjbXd0= Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726661AbgEHBft (ORCPT ); Thu, 7 May 2020 21:35:49 -0400 Received: from mail.kernel.org ([198.145.29.99]:58292 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726514AbgEHBfs (ORCPT ); Thu, 7 May 2020 21:35:48 -0400 Received: from localhost.localdomain (c-73-231-172-41.hsd1.ca.comcast.net [73.231.172.41]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id BF0E6208E4; Fri, 8 May 2020 01:35:46 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1588901747; bh=p66t++QEcmXGWyvKZsR/6dx7XY7ecH/xwXoI7luDbYo=; h=Date:From:To:Subject:In-Reply-To:From; b=jkGXfNUs6QL0Wwc7prBk5L1BxOvCtHoT86Un4Li1crmdr6UsqU+3oROBDqQlwJjM7 QaOJ0NuSkmUlabvQK70dfy37qnpcgEKz7WXZSGHrF2Nx3vFQriyLyzr3W0cooJK/20 X8+IvsS4onPm5UBpCki0BJM12nrjOcPwM4Wcu42U= Date: Thu, 07 May 2020 18:35:46 -0700 From: Andrew Morton To: akpm@linux-foundation.org, alexander.duyck@gmail.com, bhe@redhat.com, daniel.m.jordan@oracle.com, david@redhat.com, ktkhai@virtuozzo.com, linux-mm@kvack.org, mhocko@kernel.org, mhocko@suse.com, mm-commits@vger.kernel.org, osalvador@suse.de, pankaj.gupta.linux@gmail.com, pasha.tatashin@soleen.com, shile.zhang@linux.alibaba.com, stable@vger.kernel.org, torvalds@linux-foundation.org Subject: [patch 03/15] mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous() Message-ID: <20200508013546.5KqzoUC9B%akpm@linux-foundation.org> In-Reply-To: <20200507183509.c5ef146c5aaeb118a25a39a8@linux-foundation.org> User-Agent: s-nail v14.8.16 Sender: stable-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: stable@vger.kernel.org From: David Hildenbrand Subject: mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous() Without CONFIG_PREEMPT, it can happen that we get soft lockups detected, e.g., while booting up. [ 105.608900] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1] [ 105.608933] Modules linked in: [ 105.608933] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.6.0-next-20200331+ #4 [ 105.608933] Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014 [ 105.608933] RIP: 0010:__pageblock_pfn_to_page+0x134/0x1c0 [ 105.608933] Code: 85 c0 74 71 4a 8b 04 d0 48 85 c0 74 68 48 01 c1 74 63 f6 01 04 74 5e 48 c1 e7 06 4c 8b 05 cc 991 [ 105.608933] RSP: 0000:ffffb6d94000fe60 EFLAGS: 00010286 ORIG_RAX: ffffffffffffff13 [ 105.608933] RAX: fffff81953250000 RBX: 000000000a4c9600 RCX: ffff8fe9ff7c1990 [ 105.608933] RDX: ffff8fe9ff7dab80 RSI: 000000000a4c95ff RDI: 0000000293250000 [ 105.608933] RBP: ffff8fe9ff7dab80 R08: fffff816c0000000 R09: 0000000000000008 [ 105.608933] R10: 0000000000000014 R11: 0000000000000014 R12: 0000000000000000 [ 105.608933] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 105.608933] FS: 0000000000000000(0000) GS:ffff8fe1ff400000(0000) knlGS:0000000000000000 [ 105.608933] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 105.608933] CR2: 000000000f613000 CR3: 00000088cf20a000 CR4: 00000000000006f0 [ 105.608933] Call Trace: [ 105.608933] set_zone_contiguous+0x56/0x70 [ 105.608933] page_alloc_init_late+0x166/0x176 [ 105.608933] kernel_init_freeable+0xfa/0x255 [ 105.608933] ? rest_init+0xaa/0xaa [ 105.608933] kernel_init+0xa/0x106 [ 105.608933] ret_from_fork+0x35/0x40 The issue becomes visible when having a lot of memory (e.g., 4TB) assigned to a single NUMA node - a system that can easily be created using QEMU. Inside VMs on a hypervisor with quite some memory overcommit, this is fairly easy to trigger. Link: http://lkml.kernel.org/r/20200416073417.5003-1-david@redhat.com Signed-off-by: David Hildenbrand Reviewed-by: Pavel Tatashin Reviewed-by: Pankaj Gupta Reviewed-by: Baoquan He Reviewed-by: Shile Zhang Acked-by: Michal Hocko Cc: Kirill Tkhai Cc: Shile Zhang Cc: Pavel Tatashin Cc: Daniel Jordan Cc: Michal Hocko Cc: Alexander Duyck Cc: Baoquan He Cc: Oscar Salvador Cc: Signed-off-by: Andrew Morton --- mm/page_alloc.c | 1 + 1 file changed, 1 insertion(+) --- a/mm/page_alloc.c~mm-page_alloc-fix-watchdog-soft-lockups-during-set_zone_contiguous +++ a/mm/page_alloc.c @@ -1607,6 +1607,7 @@ void set_zone_contiguous(struct zone *zo if (!__pageblock_pfn_to_page(block_start_pfn, block_end_pfn, zone)) return; + cond_resched(); } /* We confirm that there is no hole */ From patchwork Fri May 8 01:36:16 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andrew Morton X-Patchwork-Id: 226304 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.8 required=3.0 tests=DKIMWL_WL_HIGH, DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, MENTIONS_GIT_HOSTING, SIGNED_OFF_BY, SPF_HELO_NONE, SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id F049FC54E49 for ; Fri, 8 May 2020 01:36:18 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id C970420A8B for ; Fri, 8 May 2020 01:36:18 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1588901778; bh=uUjZswnjm9/AbyYApb/czSVRZ24ia+tN6f3TYg6/x7Y=; h=Date:From:To:Subject:In-Reply-To:List-ID:From; b=wicH+N+GH8QXSAyp9yuGAPKVCZ5sTLAZhxa7yjq/5Qtd8Z4bCuRHBUNW3YXulHysJ 1qrZy1K7h3SdvsxZYcCbWishLbDSnvKgJwIkWvQaZFLSUIjmPfumJVYSMlzjMjnlMz Qn19r/zRiweEHNQjCH/MIec17m7/1hrerFMFh0vI= Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726891AbgEHBgS (ORCPT ); Thu, 7 May 2020 21:36:18 -0400 Received: from mail.kernel.org ([198.145.29.99]:58854 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726612AbgEHBgS (ORCPT ); Thu, 7 May 2020 21:36:18 -0400 Received: from localhost.localdomain (c-73-231-172-41.hsd1.ca.comcast.net [73.231.172.41]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 5C92A208DB; Fri, 8 May 2020 01:36:17 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1588901777; bh=uUjZswnjm9/AbyYApb/czSVRZ24ia+tN6f3TYg6/x7Y=; h=Date:From:To:Subject:In-Reply-To:From; b=NuNUPiTa43wgJ07JNaiRbgrvOWr89TppyNtGP9WNBHfPDKfw3McRX5PNEwH3/eqCg NZfQaPkxh9x/VfL5Mczg6/G6xTaju/1GZSpv4RFqoC/Ka7p6MrW9JW+zOnzzRCI8Me 3z6udh1fC48FHUZ6OCMm/W+mleFr/BgsxA5zWT74= Date: Thu, 07 May 2020 18:36:16 -0700 From: Andrew Morton To: akpm@linux-foundation.org, jbaron@akamai.com, khazhy@google.com, linux-mm@kvack.org, mm-commits@vger.kernel.org, r@hev.cc, rpenyaev@suse.de, stable@vger.kernel.org, torvalds@linux-foundation.org, viro@zeniv.linux.org.uk Subject: [patch 12/15] epoll: atomically remove wait entry on wake up Message-ID: <20200508013616.UgiLheqE9%akpm@linux-foundation.org> In-Reply-To: <20200507183509.c5ef146c5aaeb118a25a39a8@linux-foundation.org> User-Agent: s-nail v14.8.16 Sender: stable-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: stable@vger.kernel.org From: Roman Penyaev Subject: epoll: atomically remove wait entry on wake up This patch does two things: 1. fixes lost wakeup introduced by: 339ddb53d373 ("fs/epoll: remove unnecessary wakeups of nested epoll") 2. improves performance for events delivery. The description of the problem is the following: if N (>1) threads are waiting on ep->wq for new events and M (>1) events come, it is quite likely that >1 wakeups hit the same wait queue entry, because there is quite a big window between __add_wait_queue_exclusive() and the following __remove_wait_queue() calls in ep_poll() function. This can lead to lost wakeups, because thread, which was woken up, can handle not all the events in ->rdllist. (in better words the problem is described here: https://lkml.org/lkml/2019/10/7/905) The idea of the current patch is to use init_wait() instead of init_waitqueue_entry(). Internally init_wait() sets autoremove_wake_function as a callback, which removes the wait entry atomically (under the wq locks) from the list, thus the next coming wakeup hits the next wait entry in the wait queue, thus preventing lost wakeups. Problem is very well reproduced by the epoll60 test case [1]. Wait entry removal on wakeup has also performance benefits, because there is no need to take a ep->lock and remove wait entry from the queue after the successful wakeup. Here is the timing output of the epoll60 test case: With explicit wakeup from ep_scan_ready_list() (the state of the code prior 339ddb53d373): real 0m6.970s user 0m49.786s sys 0m0.113s After this patch: real 0m5.220s user 0m36.879s sys 0m0.019s The other testcase is the stress-epoll [2], where one thread consumes all the events and other threads produce many events: With explicit wakeup from ep_scan_ready_list() (the state of the code prior 339ddb53d373): threads events/ms run-time ms 8 5427 1474 16 6163 2596 32 6824 4689 64 7060 9064 128 6991 18309 After this patch: threads events/ms run-time ms 8 5598 1429 16 7073 2262 32 7502 4265 64 7640 8376 128 7634 16767 (number of "events/ms" represents event bandwidth, thus higher is better; number of "run-time ms" represents overall time spent doing the benchmark, thus lower is better) [1] tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c [2] https://github.com/rouming/test-tools/blob/master/stress-epoll.c Link: http://lkml.kernel.org/r/20200430130326.1368509-2-rpenyaev@suse.de Signed-off-by: Roman Penyaev Reviewed-by: Jason Baron Cc: Khazhismel Kumykov Cc: Alexander Viro Cc: Heiher Cc: Signed-off-by: Andrew Morton --- fs/eventpoll.c | 43 ++++++++++++++++++++++++------------------- 1 file changed, 24 insertions(+), 19 deletions(-) --- a/fs/eventpoll.c~epoll-atomically-remove-wait-entry-on-wake-up +++ a/fs/eventpoll.c @@ -1822,7 +1822,6 @@ static int ep_poll(struct eventpoll *ep, { int res = 0, eavail, timed_out = 0; u64 slack = 0; - bool waiter = false; wait_queue_entry_t wait; ktime_t expires, *to = NULL; @@ -1867,21 +1866,23 @@ fetch_events: */ ep_reset_busy_poll_napi_id(ep); - /* - * We don't have any available event to return to the caller. We need - * to sleep here, and we will be woken by ep_poll_callback() when events - * become available. - */ - if (!waiter) { - waiter = true; - init_waitqueue_entry(&wait, current); - + do { + /* + * Internally init_wait() uses autoremove_wake_function(), + * thus wait entry is removed from the wait queue on each + * wakeup. Why it is important? In case of several waiters + * each new wakeup will hit the next waiter, giving it the + * chance to harvest new event. Otherwise wakeup can be + * lost. This is also good performance-wise, because on + * normal wakeup path no need to call __remove_wait_queue() + * explicitly, thus ep->lock is not taken, which halts the + * event delivery. + */ + init_wait(&wait); write_lock_irq(&ep->lock); __add_wait_queue_exclusive(&ep->wq, &wait); write_unlock_irq(&ep->lock); - } - for (;;) { /* * We don't want to sleep if the ep_poll_callback() sends us * a wakeup in between. That's why we set the task state @@ -1911,10 +1912,20 @@ fetch_events: timed_out = 1; break; } - } + + /* We were woken up, thus go and try to harvest some events */ + eavail = 1; + + } while (0); __set_current_state(TASK_RUNNING); + if (!list_empty_careful(&wait.entry)) { + write_lock_irq(&ep->lock); + __remove_wait_queue(&ep->wq, &wait); + write_unlock_irq(&ep->lock); + } + send_events: /* * Try to transfer events to user space. In case we get 0 events and @@ -1925,12 +1936,6 @@ send_events: !(res = ep_send_events(ep, events, maxevents)) && !timed_out) goto fetch_events; - if (waiter) { - write_lock_irq(&ep->lock); - __remove_wait_queue(&ep->wq, &wait); - write_unlock_irq(&ep->lock); - } - return res; }