From patchwork Wed Oct 19 18:46:08 2016
X-Patchwork-Submitter: Andrey Vagin
X-Patchwork-Id: 78327
In-Reply-To: <87eg3hy3fm.fsf@x220.int.ebiederm.org>
References: <1476293579-28582-1-git-send-email-avagin@openvz.org>
 <871szk9rl9.fsf@x220.int.ebiederm.org>
 <20161013204405.GA19836@outlook.office365.com>
 <87k2db39zf.fsf@x220.int.ebiederm.org>
 <20161014212642.GA2005@outlook.office365.com>
 <87eg3hy3fm.fsf@x220.int.ebiederm.org>
From: Andrey Vagin
Date: Wed, 19 Oct 2016 11:46:08 -0700
Subject: Re: [PATCH] net: limit a number of namespaces which can be cleaned up concurrently
To: "Eric W. Biederman", "David S. Miller"
Cc: Andrei Vagin, netdev@vger.kernel.org, Linux Containers
X-Mailing-List: netdev@vger.kernel.org

On Sat, Oct 15, 2016 at 9:36 AM, Eric W. Biederman wrote:
> Andrei Vagin writes:
>
>> On Thu, Oct 13, 2016 at 10:06:28PM -0500, Eric W.
Biederman wrote:
>>> Andrei Vagin writes:
>>>
>>> > On Thu, Oct 13, 2016 at 10:49:38AM -0500, Eric W. Biederman wrote:
>>> >> Andrei Vagin writes:
>>> >>
>>> >> > From: Andrey Vagin
>>> >> >
>>> >> > The operation of destroying netns is heavy and it is executed under
>>> >> > net_mutex. If many namespaces are destroyed concurrently, net_mutex can
>>> >> > be locked for a long time. It is impossible to create a new netns during
>>> >> > this period of time.
>>> >>
>>> >> This may be the right approach or at least the right approach to bound
>>> >> net_mutex hold times but I have to take exception to calling network
>>> >> namespace cleanup heavy.
>>> >>
>>> >> The only particularly time-consuming operations I have ever found are calls to
>>> >> synchronize_rcu/synchronize_sched/synchronize_net.
>>> >
>>> > I booted the kernel with maxcpus=1; in this case these functions work
>>> > very fast and the problem is there anyway.
>>> >
>>> > According to perf, we spend a lot of time in kobject_uevent:
>>> >
>>> > - 99.96% 0.00% kworker/u4:1 [kernel.kallsyms] [k] unregister_netdevice_many
>>> > - unregister_netdevice_many
>>> > - 99.95% rollback_registered_many
>>> > - 99.64% netdev_unregister_kobject
>>> > - 33.43% netdev_queue_update_kobjects
>>> > - 33.40% kobject_put
>>> > - kobject_release
>>> > + 33.37% kobject_uevent
>>> > + 0.03% kobject_del
>>> > + 0.03% sysfs_remove_group
>>> > - 33.13% net_rx_queue_update_kobjects
>>> > - kobject_put
>>> > - kobject_release
>>> > + 33.11% kobject_uevent
>>> > + 0.01% kobject_del
>>> > 0.00% rx_queue_release
>>> > - 33.08% device_del
>>> > + 32.75% kobject_uevent
>>> > + 0.17% device_remove_attrs
>>> > + 0.07% dpm_sysfs_remove
>>> > + 0.04% device_remove_class_symlinks
>>> > + 0.01% kobject_del
>>> > + 0.01% device_pm_remove
>>> > + 0.01% sysfs_remove_file_ns
>>> > + 0.00% klist_del
>>> > + 0.00% driver_deferred_probe_del
>>> > 0.00% cleanup_glue_dir.isra.14.part.15
>>> > 0.00% to_acpi_device_node
>>> > 0.00% sysfs_remove_group
>>> > 0.00% klist_del
>>> > 0.00% device_remove_attrs
>>> > + 0.26% call_netdevice_notifiers_info
>>> > + 0.04% rtmsg_ifinfo_build_skb
>>> > + 0.01% rtmsg_ifinfo_send
>>> > 0.00% dev_uc_flush
>>> > 0.00% netif_reset_xps_queues_gt
>>> >
>>> > Someone may be listening for these uevents, so we can't stop sending them
>>> > without breaking backward compatibility. We can try to optimize
>>> > kobject_uevent...
>>>
>>> Oh that is a surprise. We can definitely skip generating uevents for
>>> network namespaces that are exiting because by definition no one can see
>>> those network namespaces. If a socket existed that could see those
>>> uevents it would hold a reference to the network namespace and as such
>>> the network namespace could not exit.
>>>
>>> That sounds like it is worth investigating a little more deeply.
>>>
>>> I am surprised that allocation and freeing is so heavy we are spending
>>> lots of time doing that. On the other hand kobj_bcast_filter is very
>>> dumb and very late so I expect something can be moved earlier and make
>>> that code cheaper with the tiniest bit of work.
>>>
>>
>> I'm sorry, I collected this data on a kernel with debug options
>> (DEBUG_SPINLOCK, PROVE_LOCKING, DEBUG_LIST, etc). If a kernel is
>> compiled without debug options, kobject_uevent becomes less expensive,
>> but it is still expensive.
>>
>> - 98.64% 0.00% kworker/u4:2 [kernel.kallsyms] [k] cleanup_net
>> - cleanup_net
>> - 98.54% ops_exit_list.isra.4
>> - 60.48% default_device_exit_batch
>> - 60.40% unregister_netdevice_many
>> - rollback_registered_many
>> - 59.82% netdev_unregister_kobject
>> - 20.10% device_del
>> + 19.44% kobject_uevent
>> + 0.40% device_remove_attrs
>> + 0.17% dpm_sysfs_remove
>> + 0.04% device_remove_class_symlinks
>> + 0.04% kobject_del
>> + 0.01% device_pm_remove
>> + 0.01% sysfs_remove_file_ns
>> - 19.89% netdev_queue_update_kobjects
>> + 19.81% kobject_put
>> + 0.07% sysfs_remove_group
>> - 19.79% net_rx_queue_update_kobjects
>> kobject_put
>> - kobject_release
>> + 19.77% kobject_uevent
>> + 0.02% kobject_del
>> 0.01% rx_queue_release
>> + 0.02% kset_unregister
>> 0.01% pm_runtime_set_memalloc_noio
>> 0.01% bus_remove_device
>> + 0.45% call_netdevice_notifiers_info
>> + 0.07% rtmsg_ifinfo_build_skb
>> + 0.04% rtmsg_ifinfo_send
>> 0.01% kset_unregister
>> + 0.07% rtnl_unlock
>> + 19.27% rpcsec_gss_exit_net
>> + 5.45% tcp_net_metrics_exit
>> + 5.31% sunrpc_exit_net
>> + 3.18% ip6addrlbl_net_exit
>>
>>
>> So after removing kobject_uevent, cleanup_net becomes more than two times faster:
>>
>> 1000 namespaces are cleaned up in 2.8 seconds with uevents, and in 1.2 seconds
>> without uevents. I ran these experiments with maxcpus=1 to exclude synchronize_rcu.
>>
>> In summary, we can skip generating uevents, but that doesn't solve the original
>> problem. If we want to avoid the limit introduced in this patch, we have
>> to reduce the time for destroying a net namespace by a dozen times, don't
>> we?
>
> It definitely looks like optimizing kobject_uevent for this case is
> worthwhile.
>
> I would not mind getting the raw cost of network namespace cleanups
> below 2.8ms or with uevent cleanups 1.2ms. There is just a lot going on
> for a lot of good reasons in the networking stack so that can be tricky.
>
> The larger issue is that there is a trade off between latency and
> throughput in network namespace destruction. Consider the case of
> vsftpd, which creates a new network namespace for every connection.
> Something like that can wind up with a huge backlog of network
> namespaces to clean up while continually creating more. The system will
> go OOM if we don't stop and clean up what we have.
>
> And the batching is very very important for throughput. So the smallest
> batch size we could really accept is a batch size that does not hurt
> throughput when destroying network namespaces. Otherwise we will have a
> growing backlog of network namespaces to clean up and a system that
> eventually stops being usable at all. In that context I think a long
> hold time on net_mutex is preferable to a system that does not work at
> all.
>
> Now I would love to make both the throughput and the latency better. I
> would be all in favor of that, but that requires some deep changes to
> the network namespace initialization and cleanup. Unfortunately I
> haven't stared at the problem enough to know what those changes would
> need to be. But something where we would not need to serialize network
> namespace cleanup between different network namespaces. And ideally
> something we could implement incrementally as there is so much
> networking code I don't expect we could verify and change everything
> overnight.

Eric, I get your point. All of this sounds reasonable. Here is another
idea about net_mutex.

The longer I look at net_mutex, the more it looks like it can be
replaced with a read-write lock.

It protects the lists of per-namespace operations, which are modified
only when modules are loaded or unloaded. The kernel only reads these
lists when it creates or destroys a network namespace.

Eric and David, what do you think about this idea? Do you see any
reason why it would not work? I don't know this code well enough to be
sure that I haven't missed something obvious.

The attached patch shows what it looks like. If it works, we will be
able to create and destroy net namespaces concurrently, and even call
cleanup_net() from a few threads if we have a big backlog. It is only
one of the steps which may be useful to fix this problem.

>
> That plus in practice the bottleneck has always been the synchronize_rcu
> calls which tend to take at least a millisecond apiece. Being able to
> overlap those synchronize_rcu calls in the common case has reduced
> the time to run the network stack cleanup code by very dramatic amounts.
>
> Right now I am very happy that the network namespace cleanup code is
> working properly. When I started the network stack cleanup code to
> cleanup network namespaces I found actual functional bugs. I will be
> even happier if we can figure out how to make it all run fast.
>
> But ultimately we have the net_mutex and the rtnl_lock that serialize
> things on the setup and cleanup paths and to allow creation to proceed
> while cleanup is ongoing we need to find a way to avoid serialization by
> either of those, and I have honestly drawn a blank.
>
> So right now my best suggestion for making things better is to find and
> fix each little piece we can fix. Until the things are working as best
> we can make them work. It is not sexy or glamorous or fast but it makes
> things better and is the best that I can see to do.
>
> Eric
>
>
>> Here is a perf report after skipping generating uevents:
>> - 93.27% 0.00% kworker/u4:1 [kernel.kallsyms] [k] cleanup_net
>> - cleanup_net
>> - 92.97% ops_exit_list.isra.4
>> - 35.14% rpcsec_gss_exit_net
>> - gss_svc_shutdown_net
>> - 17.40% rsc_cache_destroy_net
>> + 8.64% cache_unregister_net
>> + 8.52% cache_purge
>> + 0.22% cache_destroy_net
>> + 9.00% cache_unregister_net
>> + 8.49% cache_purge
>> + 0.15% destroy_use_gss_proxy_proc_entry
>> + 0.10% cache_destroy_net
>> - 14.35% tcp_net_metrics_exit
>> - 7.32% tcp_metrics_flush_all
>> + 4.86% _raw_spin_unlock_bh
>> 0.59% __local_bh_enable_ip
>> 6.12% _raw_spin_lock_bh
>> 0.90% _raw_spin_unlock_bh
>> - 13.08% sunrpc_exit_net
>> - 6.91% ip_map_cache_destroy
>> + 3.90% cache_unregister_net
>> + 2.86% cache_purge
>> + 0.15% cache_destroy_net
>> + 5.95% unix_gid_cache_destroy
>> + 0.12% rpc_pipefs_exit_net
>> + 0.10% rpc_proc_exit
>> - 7.35% ip6addrlbl_net_exit
>> + call_rcu_sched
>> + 3.34% xfrm_net_exit
>> + 1.22% ipv6_frags_exit_net
>> + 1.17% ipv4_frags_exit_net
>> + 0.78% fib_net_exit
>> + 0.76% inet6_net_exit
>> + 0.76% devinet_exit_net
>> + 0.68% addrconf_exit_net
>> + 0.63% igmp6_net_exit
>> + 0.59% ipv4_mib_exit_net
>> + 0.59% uevent_net_exit
>>
>>> Eric

From 6fb255bed1c82a24931595502a8e393200c7dd52 Mon Sep 17 00:00:00 2001
From: Andrei Vagin
Date: Sun, 16 Oct 2016 17:18:43 -0700
Subject: [PATCH] [RFC] net: convert net_mutex into a read-write lock

net_mutex protects per-namespace lists of operations, which are modified
only when modules are loaded or unloaded. The kernel only reads these
lists when it creates or destroys a network namespace.

Signed-off-by: Andrei Vagin
---
 include/linux/rtnetlink.h |  2 +-
 net/core/net_namespace.c  | 32 ++++++++++++++++----------------
 net/core/rtnetlink.c      |  4 ++--
 3 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 57e5484..1a780a3 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -30,7 +30,7 @@ extern int rtnl_trylock(void);
 extern int rtnl_is_locked(void);
 
 extern wait_queue_head_t netdev_unregistering_wq;
-extern struct mutex net_mutex;
+extern struct rw_semaphore net_mutex;
 
 #ifdef CONFIG_PROVE_LOCKING
 extern bool lockdep_rtnl_is_held(void);
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 989434f..81dafce 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -27,7 +27,7 @@
 
 static LIST_HEAD(pernet_list);
 static struct list_head *first_device = &pernet_list;
-DEFINE_MUTEX(net_mutex);
+DECLARE_RWSEM(net_mutex);
 
 LIST_HEAD(net_namespace_list);
 EXPORT_SYMBOL_GPL(net_namespace_list);
@@ -59,7 +59,7 @@ static int net_assign_generic(struct net *net, int id, void *data)
 {
 	struct net_generic *ng, *old_ng;
 
-	BUG_ON(!mutex_is_locked(&net_mutex));
+	BUG_ON(!rwsem_is_locked(&net_mutex));
 	BUG_ON(id == 0);
 
 	old_ng = rcu_dereference_protected(net->gen,
@@ -379,7 +379,7 @@ struct net *copy_net_ns(unsigned long flags,
 
 	get_user_ns(user_ns);
 
-	mutex_lock(&net_mutex);
+	down_read(&net_mutex);
 	net->ucounts = ucounts;
 	rv = setup_net(net, user_ns);
 	if (rv == 0) {
@@ -387,7 +387,7 @@ struct net *copy_net_ns(unsigned long flags,
 		list_add_tail_rcu(&net->list, &net_namespace_list);
 		rtnl_unlock();
 	}
-	mutex_unlock(&net_mutex);
+	up_read(&net_mutex);
 	if (rv < 0) {
 		dec_net_namespaces(ucounts);
 		put_user_ns(user_ns);
@@ -412,7 +412,7 @@ static void cleanup_net(struct work_struct *work)
 	list_replace_init(&cleanup_list, &net_kill_list);
 	spin_unlock_irq(&cleanup_list_lock);
 
-	mutex_lock(&net_mutex);
+	down_read(&net_mutex);
 
 	/* Don't let anyone else find us. */
 	rtnl_lock();
@@ -452,7 +452,7 @@ static void cleanup_net(struct work_struct *work)
 	list_for_each_entry_reverse(ops, &pernet_list, list)
 		ops_free_list(ops, &net_exit_list);
 
-	mutex_unlock(&net_mutex);
+	up_read(&net_mutex);
 
 	/* Ensure there are no outstanding rcu callbacks using this
 	 * network namespace.
@@ -763,7 +763,7 @@ static int __init net_ns_init(void)
 
 	rcu_assign_pointer(init_net.gen, ng);
 
-	mutex_lock(&net_mutex);
+	down_read(&net_mutex);
 	if (setup_net(&init_net, &init_user_ns))
 		panic("Could not setup the initial network namespace");
 
@@ -773,7 +773,7 @@ static int __init net_ns_init(void)
 	list_add_tail_rcu(&init_net.list, &net_namespace_list);
 	rtnl_unlock();
 
-	mutex_unlock(&net_mutex);
+	up_read(&net_mutex);
 
 	register_pernet_subsys(&net_ns_ops);
 
@@ -912,9 +912,9 @@ static void unregister_pernet_operations(struct pernet_operations *ops)
 int register_pernet_subsys(struct pernet_operations *ops)
 {
 	int error;
-	mutex_lock(&net_mutex);
+	down_write(&net_mutex);
 	error = register_pernet_operations(first_device, ops);
-	mutex_unlock(&net_mutex);
+	up_write(&net_mutex);
 	return error;
 }
 EXPORT_SYMBOL_GPL(register_pernet_subsys);
@@ -930,9 +930,9 @@ EXPORT_SYMBOL_GPL(register_pernet_subsys);
  */
 void unregister_pernet_subsys(struct pernet_operations *ops)
 {
-	mutex_lock(&net_mutex);
+	down_write(&net_mutex);
 	unregister_pernet_operations(ops);
-	mutex_unlock(&net_mutex);
+	up_write(&net_mutex);
 }
 EXPORT_SYMBOL_GPL(unregister_pernet_subsys);
 
@@ -958,11 +958,11 @@ EXPORT_SYMBOL_GPL(unregister_pernet_subsys);
 int register_pernet_device(struct pernet_operations *ops)
 {
 	int error;
-	mutex_lock(&net_mutex);
+	down_write(&net_mutex);
 	error = register_pernet_operations(&pernet_list, ops);
 	if (!error && (first_device == &pernet_list))
 		first_device = &ops->list;
-	mutex_unlock(&net_mutex);
+	up_write(&net_mutex);
 	return error;
 }
 EXPORT_SYMBOL_GPL(register_pernet_device);
@@ -978,11 +978,11 @@ EXPORT_SYMBOL_GPL(register_pernet_device);
  */
 void unregister_pernet_device(struct pernet_operations *ops)
 {
-	mutex_lock(&net_mutex);
+	down_write(&net_mutex);
 	if (&ops->list == first_device)
 		first_device = first_device->next;
 	unregister_pernet_operations(ops);
-	mutex_unlock(&net_mutex);
+	up_write(&net_mutex);
 }
 EXPORT_SYMBOL_GPL(unregister_pernet_device);
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index b06d2f4..7533419 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -418,11 +418,11 @@ static void rtnl_lock_unregistering_all(void)
 void rtnl_link_unregister(struct rtnl_link_ops *ops)
 {
 	/* Close the race with cleanup_net() */
-	mutex_lock(&net_mutex);
+	down_write(&net_mutex);
 	rtnl_lock_unregistering_all();
 	__rtnl_link_unregister(ops);
 	rtnl_unlock();
-	mutex_unlock(&net_mutex);
+	up_write(&net_mutex);
 }
 EXPORT_SYMBOL_GPL(rtnl_link_unregister);
 
-- 
2.7.4