Message ID | cover.1631660727.git.leonro@nvidia.com |
---|---|
Headers | show |
Series | Optional counter statistics support | expand |
On Wed, Sep 15, 2021 at 02:07:23AM +0300, Leon Romanovsky wrote: > diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c > index 3f6b98a87566..67519730b1ac 100644 > +++ b/drivers/infiniband/core/nldev.c > @@ -968,15 +968,21 @@ static int fill_stat_counter_hwcounters(struct sk_buff *msg, > if (!table_attr) > return -EMSGSIZE; > @@ -601,11 +604,20 @@ static inline struct rdma_hw_stats *rdma_alloc_hw_stats_struct( > if (!stats) > return NULL; > > + stats->is_disabled = kcalloc(BITS_TO_LONGS(num_counters), > + sizeof(long), GFP_KERNEL); > + if (!stats->is_disabled) > + goto err; > + Please de-inline this function and make a rdma_free_hw_stats_struct() call to pair with it. The hw_stats_data kfree should be in there. If you do this as a precursor patch this patch will be much smaller. Also, the stats = kzalloc(sizeof(*stats) + num_counters * sizeof(u64), GFP_KERNEL); Should be using array_size Jason
On Wed, Sep 15, 2021 at 02:07:24AM +0300, Leon Romanovsky wrote: > > +int rdma_counter_modify(struct ib_device *dev, u32 port, int index, bool enable) > +{ > + struct rdma_hw_stats *stats; > + int ret; > + > + if (!dev->ops.modify_hw_stat) > + return -EOPNOTSUPP; > + > + stats = ib_get_hw_stats_port(dev, port); > + if (!stats) > + return -EINVAL; > + > + mutex_lock(&stats->lock); > + ret = dev->ops.modify_hw_stat(dev, port, index, enable); > + if (!ret) > + enable ? clear_bit(index, stats->is_disabled) : > + set_bit(index, stats->is_disabled); This is not a kernel coding style write out the if, use success oriented flow Also, shouldn't this logic protect the driver from being called on non-optional counters? > for (i = 0; i < data->stats->num_counters; i++) { > - attr = &data->attrs[i]; > + if (data->stats->descs[i].flags & IB_STAT_FLAG_OPTIONAL) > + continue; > + attr = &data->attrs[pos]; > sysfs_attr_init(&attr->attr.attr); > attr->attr.attr.name = data->stats->descs[i].name; > attr->attr.attr.mode = 0444; > attr->attr.show = hw_stat_device_show; > attr->show = show_hw_stats; > - data->group.attrs[i] = &attr->attr.attr; > + data->group.attrs[pos] = &attr->attr.attr; > + pos++; > } This isn't OK, the hw_stat_device_show() computes the stat index like this: return stat_attr->show(ibdev, ibdev->hw_stats_data->stats, stat_attr - ibdev->hw_stats_data->attrs, 0, buf); Which assumes the stats are packed contiguously. This only works because mlx5 is always putting the optional stats at the end. > /** > * struct rdma_stat_desc > * @name - The name of the counter > - * > + * @flags - Flags of the counter; For example, IB_STAT_FLAG_OPTIONAL > */ The previous patch shouldn't have had the extra blank line then? > +int rdma_counter_modify(struct ib_device *dev, u32 port, int index, > + bool is_add); index should be unsigned int The bool is called 'is_add' here but the implementation is 'enable' ? Jason
On Wed, Sep 15, 2021 at 02:07:26AM +0300, Leon Romanovsky wrote: > - return -EINVAL; > + need_enable = false; > + disabled = test_bit(i, stats->is_disabled); > + nla_for_each_nested(entry_attr, > + tb[RDMA_NLDEV_ATTR_STAT_HWCOUNTERS], rem) { > + index = nla_get_u32(entry_attr); > + if (index >= stats->num_counters) > + return -EINVAL; > + if (i == index) { > + need_enable = true; > + break; > + } > + } > > - port = nla_get_u32(tb[RDMA_NLDEV_ATTR_PORT_INDEX]); > - if (!rdma_is_port_valid(device, port)) { > - ret = -EINVAL; > - goto err; > + if (disabled && need_enable) > + ret = rdma_counter_modify(device, port, i, true); > + else if (!disabled && !need_enable) > + ret = rdma_counter_modify(device, port, i, false); This disabled check looks racy, I would do the no-change optimization inside rdma_counter_modify() Also, this is a O(N^2) algorithm, why not do it in one pass with a small memory allocation for the target state bitmap? Jason
On 9/28/2021 12:53 AM, Jason Gunthorpe wrote: > On Wed, Sep 15, 2021 at 02:07:23AM +0300, Leon Romanovsky wrote: > >> diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c >> index 3f6b98a87566..67519730b1ac 100644 >> +++ b/drivers/infiniband/core/nldev.c >> @@ -968,15 +968,21 @@ static int fill_stat_counter_hwcounters(struct sk_buff *msg, >> if (!table_attr) >> return -EMSGSIZE; >> @@ -601,11 +604,20 @@ static inline struct rdma_hw_stats *rdma_alloc_hw_stats_struct( >> if (!stats) >> return NULL; >> >> + stats->is_disabled = kcalloc(BITS_TO_LONGS(num_counters), >> + sizeof(long), GFP_KERNEL); >> + if (!stats->is_disabled) >> + goto err; >> + > > Please de-inline this function and make a rdma_free_hw_stats_struct() > call to pair with it. The hw_stats_data kfree should be in there. If > you do this as a precursor patch this patch will be much smaller. > > Also, the > > stats = kzalloc(sizeof(*stats) + num_counters * sizeof(u64), > GFP_KERNEL); > > Should be using array_size Maybe use struct_size: stats = kzalloc(struct_size(stats, value, num_counters), GFP_KERNEL);
On 9/28/2021 1:03 AM, Jason Gunthorpe wrote: > On Wed, Sep 15, 2021 at 02:07:24AM +0300, Leon Romanovsky wrote: >> >> +int rdma_counter_modify(struct ib_device *dev, u32 port, int index, bool enable) >> +{ >> + struct rdma_hw_stats *stats; >> + int ret; >> + >> + if (!dev->ops.modify_hw_stat) >> + return -EOPNOTSUPP; >> + >> + stats = ib_get_hw_stats_port(dev, port); >> + if (!stats) >> + return -EINVAL; >> + >> + mutex_lock(&stats->lock); >> + ret = dev->ops.modify_hw_stat(dev, port, index, enable); >> + if (!ret) >> + enable ? clear_bit(index, stats->is_disabled) : >> + set_bit(index, stats->is_disabled); > > This is not a kernel coding style write out the if, use success > oriented flow > > Also, shouldn't this logic protect the driver from being called on > non-optional counters? We leave it to driver, driver would return failure if modify is not supported. Is it good? >> for (i = 0; i < data->stats->num_counters; i++) { >> - attr = &data->attrs[i]; >> + if (data->stats->descs[i].flags & IB_STAT_FLAG_OPTIONAL) >> + continue; >> + attr = &data->attrs[pos]; >> sysfs_attr_init(&attr->attr.attr); >> attr->attr.attr.name = data->stats->descs[i].name; >> attr->attr.attr.mode = 0444; >> attr->attr.show = hw_stat_device_show; >> attr->show = show_hw_stats; >> - data->group.attrs[i] = &attr->attr.attr; >> + data->group.attrs[pos] = &attr->attr.attr; >> + pos++; >> } > > This isn't OK, the hw_stat_device_show() computes the stat index like > this: > > return stat_attr->show(ibdev, ibdev->hw_stats_data->stats, > stat_attr - ibdev->hw_stats_data->attrs, 0, buf); > > Which assumes the stats are packed contiguously. This only works > because mlx5 is always putting the optional stats at the end. Yes you are right, thanks. Maybe we can add an "index" field in struct hw_stats_device/port_attribute, then set it in setup and use it in show.
On Tue, Sep 28, 2021 at 05:03:24PM +0800, Mark Zhang wrote: > On 9/28/2021 1:03 AM, Jason Gunthorpe wrote: > > On Wed, Sep 15, 2021 at 02:07:24AM +0300, Leon Romanovsky wrote: > > > +int rdma_counter_modify(struct ib_device *dev, u32 port, int index, bool enable) > > > +{ > > > + struct rdma_hw_stats *stats; > > > + int ret; > > > + > > > + if (!dev->ops.modify_hw_stat) > > > + return -EOPNOTSUPP; > > > + > > > + stats = ib_get_hw_stats_port(dev, port); > > > + if (!stats) > > > + return -EINVAL; > > > + > > > + mutex_lock(&stats->lock); > > > + ret = dev->ops.modify_hw_stat(dev, port, index, enable); > > > + if (!ret) > > > + enable ? clear_bit(index, stats->is_disabled) : > > > + set_bit(index, stats->is_disabled); > > > > This is not a kernel coding style write out the if, use success > > oriented flow > > > > Also, shouldn't this logic protect the driver from being called on > > non-optional counters? > > We leave it to driver, driver would return failure if modify is not > supported. Is it good? I think the core code should do it > > > for (i = 0; i < data->stats->num_counters; i++) { > > > - attr = &data->attrs[i]; > > > + if (data->stats->descs[i].flags & IB_STAT_FLAG_OPTIONAL) > > > + continue; > > > + attr = &data->attrs[pos]; > > > sysfs_attr_init(&attr->attr.attr); > > > attr->attr.attr.name = data->stats->descs[i].name; > > > attr->attr.attr.mode = 0444; > > > attr->attr.show = hw_stat_device_show; > > > attr->show = show_hw_stats; > > > - data->group.attrs[i] = &attr->attr.attr; > > > + data->group.attrs[pos] = &attr->attr.attr; > > > + pos++; > > > } > > > > This isn't OK, the hw_stat_device_show() computes the stat index like > > this: > > > > return stat_attr->show(ibdev, ibdev->hw_stats_data->stats, > > stat_attr - ibdev->hw_stats_data->attrs, 0, buf); > > > > Which assumes the stats are packed contiguously. This only works > > because mlx5 is always putting the optional stats at the end. > > Yes you are right, thanks. Maybe we can add an "index" field in struct > hw_stats_device/port_attribute, then set it in setup and use it in show. You could just add a WARN_ON check that optional stats are at the end I suppose Jason
On Tue, Sep 28, 2021 at 11:28:39AM +0800, Mark Zhang wrote: > On 9/28/2021 12:53 AM, Jason Gunthorpe wrote: > > On Wed, Sep 15, 2021 at 02:07:23AM +0300, Leon Romanovsky wrote: > > > > > diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c > > > index 3f6b98a87566..67519730b1ac 100644 > > > +++ b/drivers/infiniband/core/nldev.c > > > @@ -968,15 +968,21 @@ static int fill_stat_counter_hwcounters(struct sk_buff *msg, > > > if (!table_attr) > > > return -EMSGSIZE; > > > @@ -601,11 +604,20 @@ static inline struct rdma_hw_stats *rdma_alloc_hw_stats_struct( > > > if (!stats) > > > return NULL; > > > + stats->is_disabled = kcalloc(BITS_TO_LONGS(num_counters), > > > + sizeof(long), GFP_KERNEL); > > > + if (!stats->is_disabled) > > > + goto err; > > > + > > > > Please de-inline this function and make a rdma_free_hw_stats_struct() > > call to pair with it. The hw_stats_data kfree should be in there. If > > you do this as a precursor patch this patch will be much smaller. > > > > Also, the > > > > stats = kzalloc(sizeof(*stats) + num_counters * sizeof(u64), > > GFP_KERNEL); > > > > Should be using array_size > Maybe use struct_size: > stats = kzalloc(struct_size(stats, value, num_counters), GFP_KERNEL); Right Jason
On Tue, Sep 28, 2021 at 08:51:35AM -0300, Jason Gunthorpe wrote: > On Tue, Sep 28, 2021 at 05:03:24PM +0800, Mark Zhang wrote: > > On 9/28/2021 1:03 AM, Jason Gunthorpe wrote: > > > On Wed, Sep 15, 2021 at 02:07:24AM +0300, Leon Romanovsky wrote: > > > > +int rdma_counter_modify(struct ib_device *dev, u32 port, int index, bool enable) > > > > +{ > > > > + struct rdma_hw_stats *stats; > > > > + int ret; > > > > + > > > > + if (!dev->ops.modify_hw_stat) > > > > + return -EOPNOTSUPP; > > > > + > > > > + stats = ib_get_hw_stats_port(dev, port); > > > > + if (!stats) > > > > + return -EINVAL; > > > > + > > > > + mutex_lock(&stats->lock); > > > > + ret = dev->ops.modify_hw_stat(dev, port, index, enable); > > > > + if (!ret) > > > > + enable ? clear_bit(index, stats->is_disabled) : > > > > + set_bit(index, stats->is_disabled); > > > > > > This is not a kernel coding style write out the if, use success > > > oriented flow > > > > > > Also, shouldn't this logic protect the driver from being called on > > > non-optional counters? > > > > We leave it to driver, driver would return failure if modify is not > > supported. Is it good? > > I think the core code should do it > > > > > for (i = 0; i < data->stats->num_counters; i++) { > > > > - attr = &data->attrs[i]; > > > > + if (data->stats->descs[i].flags & IB_STAT_FLAG_OPTIONAL) > > > > + continue; > > > > + attr = &data->attrs[pos]; > > > > sysfs_attr_init(&attr->attr.attr); > > > > attr->attr.attr.name = data->stats->descs[i].name; > > > > attr->attr.attr.mode = 0444; > > > > attr->attr.show = hw_stat_device_show; > > > > attr->show = show_hw_stats; > > > > - data->group.attrs[i] = &attr->attr.attr; > > > > + data->group.attrs[pos] = &attr->attr.attr; > > > > + pos++; > > > > } > > > > > > This isn't OK, the hw_stat_device_show() computes the stat index like > > > this: > > > > > > return stat_attr->show(ibdev, ibdev->hw_stats_data->stats, > > > stat_attr - ibdev->hw_stats_data->attrs, 0, buf); > > > > > > Which assumes the stats are packed contiguously. This only works > > > because mlx5 is always putting the optional stats at the end. > > > > Yes you are right, thanks. Maybe we can add an "index" field in struct > > hw_stats_device/port_attribute, then set it in setup and use it in show. > > You could just add a WARN_ON check that optional stats are at the end > I suppose Everyone adds their counters to the end, last example is bnxt_re 9a381f7e5aa2 ("RDMA/bnxt_re: Add extended statistics counters") Thanks > > Jason
On Mon, Sep 27, 2021 at 02:20:06PM -0300, Jason Gunthorpe wrote: > On Wed, Sep 15, 2021 at 02:07:26AM +0300, Leon Romanovsky wrote: > > - return -EINVAL; > > + need_enable = false; > > + disabled = test_bit(i, stats->is_disabled); > > + nla_for_each_nested(entry_attr, > > + tb[RDMA_NLDEV_ATTR_STAT_HWCOUNTERS], rem) { > > + index = nla_get_u32(entry_attr); > > + if (index >= stats->num_counters) > > + return -EINVAL; > > + if (i == index) { > > + need_enable = true; > > + break; > > + } > > + } > > > > - port = nla_get_u32(tb[RDMA_NLDEV_ATTR_PORT_INDEX]); > > - if (!rdma_is_port_valid(device, port)) { > > - ret = -EINVAL; > > - goto err; > > + if (disabled && need_enable) > > + ret = rdma_counter_modify(device, port, i, true); > > + else if (!disabled && !need_enable) > > + ret = rdma_counter_modify(device, port, i, false); > > This disabled check looks racy, I would do the no-change optimization inside > rdma_counter_modify() > > Also, this is a O(N^2) algorithm, why not do it in one pass with a > small memory allocation for the target state bitmap? We don't have many counters. Is this optimization really worth it? Thanks > > Jason
From: Leon Romanovsky <leonro@nvidia.com> Change Log: v1: * Add a descriptor structure to replace name in struct rdma_hw_stats; * Add a bitmap in struct rdma_hw_stats to indicate the enable/disable status of all counters; * Add a "flag" field in counter descriptor and define IB_STAT_FLAG_OPTIONAL flag; * add/remove_op_stat() are replaced by modify_op_stat(); * Use "set/unset" in command line and send full opcounters list through netlink, and send opcounter indexes instead of names; * Patches are re-ordered. v0: https://lore.kernel.org/all/20210818112428.209111-1-markzhang@nvidia.com ---------------------------------------------------------------------- Hi, This series from Neta and Aharon provides an extension to the rdma statistics tool that allows to set optional counters dynamically, using netlink. The idea of having optional counters is to provide to the users the ability to get statistics of counters that hurts performance. Once an optional counter was added, its statistics will be presented along with all the counters, using the show command. Binding objects to the optional counters is currently not supported, neither in auto mode nor in manual mode. To get the list of optional counters that are supported on this device, use "rdma statistic mode supported". To see which counters are currently enabled, use "rdma statistic mode". Examples: $ rdma statistic mode supported link rocep8s0f0/1 supported optional-counters cc_rx_ce_pkts,cc_rx_cnp_pkts,cc_tx_cnp_pkts link rocep8s0f1/1 supported optional-counters cc_rx_ce_pkts,cc_rx_cnp_pkts,cc_tx_cnp_pkts $ sudo rdma statistic set link rocep8s0f0/1 optional-counters cc_rx_ce_pkts,cc_rx_cnp_pkts $ rdma statistic mode link rocep8s0f0/1 link rocep8s0f0/1 optional-counters cc_rx_ce_pkts,cc_rx_cnp_pkts $ rdma statistic show link rocep8s0f0/1 link rocep8s0f0/1 rx_write_requests 0 rx_read_requests 0 rx_atomic_requests 0 out_of_buffer 0 out_of_sequence 0 duplicate_request 0 rnr_nak_retry_err 0 packet_seq_err 0 implied_nak_seq_err 0 local_ack_timeout_err 0 resp_local_length_error 0 resp_cqe_error 0 req_cqe_error 0 req_remote_invalid_request 0 req_remote_access_errors 0 resp_remote_access_errors 0 resp_cqe_flush_error 0 req_cqe_flush_error 0 roce_adp_retrans 0 roce_adp_retrans_to 0 roce_slow_restart 0 roce_slow_restart_cnps 0 roce_slow_restart_trans 0 rp_cnp_ignored 0 rp_cnp_handled 0 np_ecn_marked_roce_packets 0 np_cnp_sent 0 rx_icrc_encapsulated 0 cc_rx_ce_pkts 0 cc_rx_cnp_pkts 0 $ sudo rdma statistic set link rocep8s0f0/1 optional-counters cc_rx_ce_pkts $ rdma statistic mode link rocep8s0f0/1 link rocep8s0f0/1 optional-counters cc_rx_ce_pkts Thanks Aharon Landau (11): net/mlx5: Add ifc bits to support optional counters net/mlx5: Add priorities for counters in RDMA namespaces RDMA/counter: Add a descriptor in struct rdma_hw_stats RDMA/counter: Add an is_disabled field in struct rdma_hw_stats RDMA/counter: Add optional counter support RDMA/nldev: Add support to get status of all counters RDMA/nldev: Allow optional-counter status configuration through RDMA netlink RDMA/mlx5: Support optional counters in hw_stats initialization RDMA/mlx5: Add steering support in optional flow counters RDMA/mlx5: Add modify_op_stat() support RDMA/mlx5: Add optional counter support in get_hw_stats callback drivers/infiniband/core/counters.c | 30 +- drivers/infiniband/core/device.c | 1 + drivers/infiniband/core/nldev.c | 289 +++++++++++++----- drivers/infiniband/core/sysfs.c | 41 ++- drivers/infiniband/hw/bnxt_re/hw_counters.c | 114 +++---- drivers/infiniband/hw/cxgb4/provider.c | 22 +- drivers/infiniband/hw/efa/efa_verbs.c | 19 +- drivers/infiniband/hw/hfi1/verbs.c | 43 +-- drivers/infiniband/hw/irdma/verbs.c | 98 +++--- drivers/infiniband/hw/mlx4/main.c | 37 +-- drivers/infiniband/hw/mlx4/mlx4_ib.h | 2 +- drivers/infiniband/hw/mlx5/counters.c | 280 ++++++++++++++--- drivers/infiniband/hw/mlx5/fs.c | 187 ++++++++++++ drivers/infiniband/hw/mlx5/mlx5_ib.h | 28 +- drivers/infiniband/sw/rxe/rxe_hw_counters.c | 42 +-- .../net/ethernet/mellanox/mlx5/core/fs_core.c | 54 +++- include/linux/mlx5/device.h | 2 + include/linux/mlx5/fs.h | 2 + include/linux/mlx5/mlx5_ifc.h | 22 +- include/rdma/ib_hdrs.h | 1 + include/rdma/ib_verbs.h | 48 ++- include/rdma/rdma_counter.h | 2 + include/uapi/rdma/rdma_netlink.h | 3 + 23 files changed, 1041 insertions(+), 326 deletions(-)