mbox series

[wpan-next,v2,00/27] IEEE 802.15.4 scan support

Message ID 20220112173312.764660-1-miquel.raynal@bootlin.com
Headers show
Series IEEE 802.15.4 scan support | expand

Message

Miquel Raynal Jan. 12, 2022, 5:32 p.m. UTC
Hello,

Here is a series attempting to bring support for scans in the
IEEE 802.15.4 stack. A second series follows in order to align the
tooling with these changes, bringing support for a number of new
features such as:

* Sending (or stopping) beacons. Intervals ranging from 0 to 14 are
  valid for passively sending beacons at regular intervals. An interval
  of 15 would request the core to answer to received BEACON_REQ.
  # iwpan dev wpan0 beacons send interval 2 # send BEACON at a fixed rate
  # iwpan dev wpan0 beacons send interval 15 # answer BEACON_REQ only
  # iwpan dev wpan0 beacons stop # apply to both cases

* Scanning all the channels or only a subset:
  # iwpan dev wpan1 scan type passive duration 3 # will not trigger BEACON_REQ
  # iwpan dev wpan1 scan type active duration 3 # will trigger BEACON_REQ

* If a beacon is received during a scan, the internal PAN list is
  updated and can be dumped, flushed and configured with:
  # iwpan dev wpan1 pans dump
  PAN 0xffff (on wpan1)
      coordinator 0x2efefdd4cdbf9330
      page 0
      channel 13
      superframe spec. 0xcf22
      LQI 0
      seen 7156ms ago
  # iwpan dev wpan1 pans flush
  # iwpan dev wpan1 set max_pan_entries 100
  # iwpan dev wpan1 set pans_expiration 3600

* It is also possible to monitor the events with:
  # iwpan event

* As well as triggering a non blocking scan:
  # iwpan dev wpan1 scan trigger type passive duration 3
  # iwpan dev wpan1 scan done
  # iwpan dev wpan1 scan abort

The PAN list gets automatically updated by dropping the expired PANs
each time the user requests access to the list.

Internally, both requests (scan/beacons) are handled periodically by
delayed workqueues when relevant.

So far the only technical point that is missing in this series is the
possibility to grab a reference over the module driving the net device
in order to prevent module unloading during a scan or when the beacons
work is ongoing.

Finally, this series is a deep reshuffle of David Girault's original
work, hence the fact that he is almost systematically credited, either
by being the only author when I created the patches based on his changes
with almost no modification, or with a Co-developped-by tag whenever the
final code base is significantly different than his first proposal while
still being greatly inspired from it.

Cheers,
Miquèl

Changes in v2:
* Create two new netlink commands to set the maximum number of PANs that
  can be listed as well as their expiration time (in seconds).
* Added a patch to the series to avoid ignoring bad frames in hwsim as
  requested by Alexander.
* Changed the symbol duration type to receive nanoseconds instead of
  microseconds.
* Dropped most of the hwsim patches and reworked how drivers advertise
  their channels in order to be capable of deriving the symbol durations
  automatically.
* The scanning boolean gets turned into an atomic.
* The ca8210 driver does not support scanning, implement the driver
  hooks to reflect the situation.
* Reworked a bit the content of each patch to ease the introduction of
  active scans. 
* Added active scan support.

David Girault (5):
  net: ieee802154: Move IEEE 802.15.4 Kconfig main entry
  net: mac802154: Include the softMAC stack inside the IEEE 802.15.4
    menu
  net: ieee802154: Move the address structure earlier
  net: ieee802154: Add a kernel doc header to the ieee802154_addr
    structure
  net: ieee802154: Trace the registration of new PANs

Miquel Raynal (22):
  net: mac802154: Split the set channel hook implementation
  net: mac802154: Ensure proper channel selection at probe time
  net: ieee802154: Improve the way supported channels are declared
  net: ieee802154: Give more details to the core about the channel
    configurations
  net: mac802154: Convert the symbol duration into nanoseconds
  net: mac802154: Set the symbol duration automatically
  net: ieee802154: hwsim: Ensure frame checksum are valid
  net: ieee802154: Drop symbol duration settings when the core does it
    already
  net: ieee802154: Return meaningful error codes from the netlink
    helpers
  net: ieee802154: Add support for internal PAN management
  net: ieee802154: Define a beacon frame header
  net: ieee802154: Define frame types
  net: ieee802154: Add support for scanning requests
  net: mac802154: Handle scan requests
  net: ieee802154: Full PAN management
  net: ieee802154: Add support for beacon requests
  net: mac802154: Handle beacons requests
  net: mac802154: Add support for active scans
  net: mac802154: Add support for processing beacon requests
  net: mac802154: Inform device drivers about scans
  net: mac802154: Inform device drivers about beacon operations
  net: ieee802154: ca8210: Refuse most of the scan operations

 drivers/net/ieee802154/adf7242.c         |   3 +-
 drivers/net/ieee802154/at86rf230.c       |  34 +-
 drivers/net/ieee802154/atusb.c           |  34 +-
 drivers/net/ieee802154/ca8210.c          |  32 +-
 drivers/net/ieee802154/cc2520.c          |   3 +-
 drivers/net/ieee802154/fakelb.c          |  43 +-
 drivers/net/ieee802154/mac802154_hwsim.c |  78 +++-
 drivers/net/ieee802154/mcr20a.c          |   8 +-
 drivers/net/ieee802154/mrf24j40.c        |   3 +-
 include/linux/ieee802154.h               |   7 +
 include/net/cfg802154.h                  | 167 ++++++-
 include/net/ieee802154_netdev.h          |  85 ++++
 include/net/mac802154.h                  |  40 ++
 include/net/nl802154.h                   |  99 ++++
 net/Kconfig                              |   3 +-
 net/ieee802154/Kconfig                   |   1 +
 net/ieee802154/Makefile                  |   2 +-
 net/ieee802154/core.c                    |   2 +
 net/ieee802154/core.h                    |  31 ++
 net/ieee802154/header_ops.c              |  67 +++
 net/ieee802154/nl-phy.c                  |   8 +-
 net/ieee802154/nl802154.c                | 548 ++++++++++++++++++++++-
 net/ieee802154/nl802154.h                |   4 +
 net/ieee802154/pan.c                     | 234 ++++++++++
 net/ieee802154/rdev-ops.h                |  52 +++
 net/ieee802154/trace.h                   |  86 ++++
 net/mac802154/Makefile                   |   2 +-
 net/mac802154/cfg.c                      |  94 +++-
 net/mac802154/driver-ops.h               |  58 +++
 net/mac802154/ieee802154_i.h             |  52 +++
 net/mac802154/main.c                     | 113 ++++-
 net/mac802154/rx.c                       |  34 +-
 net/mac802154/scan.c                     | 446 ++++++++++++++++++
 net/mac802154/trace.h                    |  49 ++
 net/mac802154/tx.c                       |   3 +
 net/mac802154/util.c                     |  26 ++
 36 files changed, 2438 insertions(+), 113 deletions(-)
 create mode 100644 net/ieee802154/pan.c
 create mode 100644 net/mac802154/scan.c

Comments

Alexander Aring Jan. 12, 2022, 10:25 p.m. UTC | #1
Hi,

On Wed, 12 Jan 2022 at 12:33, Miquel Raynal <miquel.raynal@bootlin.com> wrote:
>
> Now that we have access to all the basic information to know which
> symbol duration should be applied, let's set the symbol duration
> automatically. The two locations that must call for the symbol duration
> to be set are:
> - when manually requesting a channel change though the netlink interface
> - at PHY creation, ieee802154_alloc_hw() already calls
>   ieee802154_change_channel() which will now update the symbol duration
>   accordingly.
>
> If an information is missing, the symbol duration is not touched, a
> debug message is eventually printed. This keeps the compatibility with
> the unconverted drivers for which it was too complicated for me to find
> their precise information. If they initially provided a symbol duration,
> it would be kept. If they don't, the symbol duration value is left
> untouched.
>
> Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
> ---
>  include/net/cfg802154.h |  2 +
>  net/mac802154/cfg.c     |  1 +
>  net/mac802154/main.c    | 93 +++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 96 insertions(+)
>
> diff --git a/include/net/cfg802154.h b/include/net/cfg802154.h
> index 286709a9dd0b..52eefc4b5b4d 100644
> --- a/include/net/cfg802154.h
> +++ b/include/net/cfg802154.h
> @@ -455,4 +455,6 @@ static inline const char *wpan_phy_name(struct wpan_phy *phy)
>         return dev_name(&phy->dev);
>  }
>
> +void ieee802154_set_symbol_duration(struct wpan_phy *phy);
> +
>  #endif /* __NET_CFG802154_H */
> diff --git a/net/mac802154/cfg.c b/net/mac802154/cfg.c
> index 6969f1330ccd..ba57da07c08e 100644
> --- a/net/mac802154/cfg.c
> +++ b/net/mac802154/cfg.c
> @@ -113,6 +113,7 @@ int ieee802154_change_channel(struct wpan_phy *wpan_phy, u8 page, u8 channel)
>         if (!ret) {
>                 wpan_phy->current_page = page;
>                 wpan_phy->current_channel = channel;
> +               ieee802154_set_symbol_duration(wpan_phy);
>         }
>
>         return ret;

We also need to do it in ieee802154_register_hw()?

> diff --git a/net/mac802154/main.c b/net/mac802154/main.c
> index 77a4943f345f..88826c5aa4ba 100644
> --- a/net/mac802154/main.c
> +++ b/net/mac802154/main.c
> @@ -113,6 +113,99 @@ ieee802154_alloc_hw(size_t priv_data_len, const struct ieee802154_ops *ops)
>  }
>  EXPORT_SYMBOL(ieee802154_alloc_hw);
>
> +void ieee802154_set_symbol_duration(struct wpan_phy *phy)
> +{
> +       struct phy_page *page = &phy->supported.page[phy->current_page];
> +       struct phy_channels *chan;
> +       unsigned int chunk;
> +       u32 duration = 0;
> +
> +       for (chunk = 0; chunk < page->nchunks; chunk++) {
> +               if (page->chunk[chunk].channels & phy->current_channel)
> +                       break;
> +       }
> +
> +       if (chunk == page->nchunks)
> +               goto set_duration;
> +
> +       chan = &page->chunk[chunk];
> +       switch (chan->protocol) {
> +       case IEEE802154_BPSK_PHY:
> +               switch (chan->band) {
> +               case IEEE802154_868_MHZ_BAND:
> +                       /* 868 MHz BPSK 802.15.4-2003: 20 ksym/s */
> +                       duration = 50 * 1000;

* NSEC_PER_USEC?

- Alex
Alexander Aring Jan. 12, 2022, 10:30 p.m. UTC | #2
Hi,

On Wed, 12 Jan 2022 at 12:33, Miquel Raynal <miquel.raynal@bootlin.com> wrote:
>
> As it is currently designed, the set_channel() cfg802154 hook
> implemented in the softMAC is doing a couple of checks before actually
> performing the channel change. However, as we enhance the support for
> automatically setting the symbol duration during channel changes, it
> will also be needed to ensure that the corresponding channel as properly
> be selected at probe time. In order to verify this, we will need to

no, we don't set channels at probe time. We set the
current_page/channel whatever the default is according to the hardware
datasheet. I think this channel should be dropped and all drivers set
the defaults before registering hw as what we do at e.g. at86rf230,
see [0].

- Alex

[0] https://elixir.bootlin.com/linux/v5.16/source/drivers/net/ieee802154/at86rf230.c#L1553
Miquel Raynal Jan. 13, 2022, 9:32 a.m. UTC | #3
Hi Alexander,

alex.aring@gmail.com wrote on Wed, 12 Jan 2022 17:30:35 -0500:

> Hi,
> 
> On Wed, 12 Jan 2022 at 12:33, Miquel Raynal <miquel.raynal@bootlin.com> wrote:
> >
> > As it is currently designed, the set_channel() cfg802154 hook
> > implemented in the softMAC is doing a couple of checks before actually
> > performing the channel change. However, as we enhance the support for
> > automatically setting the symbol duration during channel changes, it
> > will also be needed to ensure that the corresponding channel as properly
> > be selected at probe time. In order to verify this, we will need to  
> 
> no, we don't set channels at probe time. We set the
> current_page/channel whatever the default is according to the hardware
> datasheet. I think this channel should be dropped and all drivers set
> the defaults before registering hw as what we do at e.g. at86rf230,
> see [0].

Is there a reason for refusing to call ->set_channel() at probe time?

Anyway, I'll put the symbol duration setting in the registration helper
and I will fix hwsim aside.

> 
> - Alex
> 
> [0] https://elixir.bootlin.com/linux/v5.16/source/drivers/net/ieee802154/at86rf230.c#L1553


Thanks,
Miquèl
Miquel Raynal Jan. 13, 2022, 9:52 a.m. UTC | #4
Hi Alexander,

alex.aring@gmail.com wrote on Wed, 12 Jan 2022 17:25:01 -0500:

> Hi,
> 
> On Wed, 12 Jan 2022 at 12:33, Miquel Raynal <miquel.raynal@bootlin.com> wrote:
> >
> > Now that we have access to all the basic information to know which
> > symbol duration should be applied, let's set the symbol duration
> > automatically. The two locations that must call for the symbol duration
> > to be set are:
> > - when manually requesting a channel change though the netlink interface
> > - at PHY creation, ieee802154_alloc_hw() already calls
> >   ieee802154_change_channel() which will now update the symbol duration
> >   accordingly.
> >
> > If an information is missing, the symbol duration is not touched, a
> > debug message is eventually printed. This keeps the compatibility with
> > the unconverted drivers for which it was too complicated for me to find
> > their precise information. If they initially provided a symbol duration,
> > it would be kept. If they don't, the symbol duration value is left
> > untouched.
> >
> > Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
> > ---
> >  include/net/cfg802154.h |  2 +
> >  net/mac802154/cfg.c     |  1 +
> >  net/mac802154/main.c    | 93 +++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 96 insertions(+)
> >
> > diff --git a/include/net/cfg802154.h b/include/net/cfg802154.h
> > index 286709a9dd0b..52eefc4b5b4d 100644
> > --- a/include/net/cfg802154.h
> > +++ b/include/net/cfg802154.h
> > @@ -455,4 +455,6 @@ static inline const char *wpan_phy_name(struct wpan_phy *phy)
> >         return dev_name(&phy->dev);
> >  }
> >
> > +void ieee802154_set_symbol_duration(struct wpan_phy *phy);
> > +
> >  #endif /* __NET_CFG802154_H */
> > diff --git a/net/mac802154/cfg.c b/net/mac802154/cfg.c
> > index 6969f1330ccd..ba57da07c08e 100644
> > --- a/net/mac802154/cfg.c
> > +++ b/net/mac802154/cfg.c
> > @@ -113,6 +113,7 @@ int ieee802154_change_channel(struct wpan_phy *wpan_phy, u8 page, u8 channel)
> >         if (!ret) {
> >                 wpan_phy->current_page = page;
> >                 wpan_phy->current_channel = channel;
> > +               ieee802154_set_symbol_duration(wpan_phy);
> >         }
> >
> >         return ret;  
> 
> We also need to do it in ieee802154_register_hw()?

As you probably saw, my idea was to call for a channel change during
the registration but you nacked that possibility so I'll indeed have
to set the symbol duration manually there.

> > diff --git a/net/mac802154/main.c b/net/mac802154/main.c
> > index 77a4943f345f..88826c5aa4ba 100644
> > --- a/net/mac802154/main.c
> > +++ b/net/mac802154/main.c
> > @@ -113,6 +113,99 @@ ieee802154_alloc_hw(size_t priv_data_len, const struct ieee802154_ops *ops)
> >  }
> >  EXPORT_SYMBOL(ieee802154_alloc_hw);
> >
> > +void ieee802154_set_symbol_duration(struct wpan_phy *phy)
> > +{
> > +       struct phy_page *page = &phy->supported.page[phy->current_page];
> > +       struct phy_channels *chan;
> > +       unsigned int chunk;
> > +       u32 duration = 0;
> > +
> > +       for (chunk = 0; chunk < page->nchunks; chunk++) {
> > +               if (page->chunk[chunk].channels & phy->current_channel)

.channels still being a bitfield, David allegedly reported that the
above line should use "& BIT(phy->current_channel)".

> > +                       break;
> > +       }
> > +
> > +       if (chunk == page->nchunks)
> > +               goto set_duration;
> > +
> > +       chan = &page->chunk[chunk];
> > +       switch (chan->protocol) {
> > +       case IEEE802154_BPSK_PHY:
> > +               switch (chan->band) {
> > +               case IEEE802154_868_MHZ_BAND:
> > +                       /* 868 MHz BPSK 802.15.4-2003: 20 ksym/s */
> > +                       duration = 50 * 1000;  
> 
> * NSEC_PER_USEC?

Oh right, I grepped for USEC_TO_NSEC but the macro was named the other
way around, thanks.

Thanks,
Miquèl
Miquel Raynal Jan. 13, 2022, 11:12 a.m. UTC | #5
Hi Alexander,

alex.aring@gmail.com wrote on Wed, 12 Jan 2022 17:53:57 -0500:

> Hi,
> 
> On Wed, 12 Jan 2022 at 17:30, Alexander Aring <alex.aring@gmail.com> wrote:
> >
> > Hi,
> >
> > On Wed, 12 Jan 2022 at 12:33, Miquel Raynal <miquel.raynal@bootlin.com> wrote:  
> > >
> > > As it is currently designed, the set_channel() cfg802154 hook
> > > implemented in the softMAC is doing a couple of checks before actually
> > > performing the channel change. However, as we enhance the support for
> > > automatically setting the symbol duration during channel changes, it
> > > will also be needed to ensure that the corresponding channel as properly
> > > be selected at probe time. In order to verify this, we will need to  
> >
> > no, we don't set channels at probe time. We set the
> > current_page/channel whatever the default is according to the hardware
> > datasheet. I think this channel should be dropped and all drivers set  
> 
> s/channel/patch/

I've dropped the patch and put an additional call to
_set_symbol_duration() in the hw registration routine as discussed
initially.

Thanks,
Miquèl
Miquel Raynal Jan. 13, 2022, 5:07 p.m. UTC | #6
Hi Alexander,

alex.aring@gmail.com wrote on Wed, 12 Jan 2022 17:44:02 -0500:

> Hi,
> 
> On Wed, 12 Jan 2022 at 12:33, Miquel Raynal <miquel.raynal@bootlin.com> wrote:
> ...
> > +       return 0;
> > +}
> > diff --git a/net/mac802154/tx.c b/net/mac802154/tx.c
> > index c829e4a75325..40656728c624 100644
> > --- a/net/mac802154/tx.c
> > +++ b/net/mac802154/tx.c
> > @@ -54,6 +54,9 @@ ieee802154_tx(struct ieee802154_local *local, struct sk_buff *skb)
> >         struct net_device *dev = skb->dev;
> >         int ret;
> >
> > +       if (unlikely(mac802154_scan_is_ongoing(local)))
> > +               return NETDEV_TX_BUSY;
> > +  
> 
> Please look into the functions "ieee802154_wake_queue()" and
> "ieee802154_stop_queue()" which prevent this function from being
> called. Call stop before starting scanning and wake after scanning is
> done or stopped.

Mmmh all this is already done, isn't it?
- mac802154_trigger_scan_locked() stops the queue before setting the
  promiscuous mode
- mac802154_end_of_scan() wakes the queue after resetting the
  promiscuous mode to its original state

Should I drop the check which stands for an extra precaution?


But overall I think I don't understand well this part. What is
a bit foggy to me is why the (async) tx implementation does:

*Core*                           *Driver*

stop_queue()
drv_async_xmit() -------
                        \------> do something
                         ------- calls ieee802154_xmit_complete()
wakeup_queue() <--------/

So we actually disable the queue for transmitting. Why??

> Also there exists a race which exists in your way and also the one
> mentioned above. There can still be some transmissions going on... We
> need to wait until "all possible" tx completions are done... to be
> sure there are really no transmissions going on. However we need to be
> sure that a wake cannot be done if a tx completion is done, we need to
> avoid it when the scan operation is ongoing as a workaround for this
> race.
> 
> This race exists and should be fixed in future work?

Yep, this is true, do you have any pointers? Because I looked at the
code and for now it appears quite unpractical to add some kind of
flushing mechanism on that net queue. I believe we cannot use the netif
interface for that so we would have to implement our own mechanism in
the ieee802154 core.

Thanks,
Miquèl
Alexander Aring Jan. 13, 2022, 11:27 p.m. UTC | #7
Hi,

On Thu, 13 Jan 2022 at 04:32, Miquel Raynal <miquel.raynal@bootlin.com> wrote:
>
> Hi Alexander,
>
> alex.aring@gmail.com wrote on Wed, 12 Jan 2022 17:30:35 -0500:
>
> > Hi,
> >
> > On Wed, 12 Jan 2022 at 12:33, Miquel Raynal <miquel.raynal@bootlin.com> wrote:
> > >
> > > As it is currently designed, the set_channel() cfg802154 hook
> > > implemented in the softMAC is doing a couple of checks before actually
> > > performing the channel change. However, as we enhance the support for
> > > automatically setting the symbol duration during channel changes, it
> > > will also be needed to ensure that the corresponding channel as properly
> > > be selected at probe time. In order to verify this, we will need to
> >
> > no, we don't set channels at probe time. We set the
> > current_page/channel whatever the default is according to the hardware
> > datasheet. I think this channel should be dropped and all drivers set
> > the defaults before registering hw as what we do at e.g. at86rf230,
> > see [0].
>
> Is there a reason for refusing to call ->set_channel() at probe time?
>

because the current drivers work the way to not set the channel/page
during probe time. Also the 802.15.4 spec states that this default
value is hardware specific and this goes back whatever the vendor
decides. Also you drop the check that if the same channel is already
set don't set it which works like a shadow register for registers.
Is there a reason why to set a channel during probe time? Are you
setting the value which is the default one again? If the driver has a
random default value it should choose some and stick to one, the
others do whatever the datasheet has after resetting the hardware.

I really don't see the sense here why every driver should introduce on
driver level a set channel call. At probing time the transceiver
registers are already in a state which we should reflect.

> Anyway, I'll put the symbol duration setting in the registration helper
> and I will fix hwsim aside.
>

ok, thanks.

- Alex
Alexander Aring Jan. 13, 2022, 11:36 p.m. UTC | #8
Hi,

On Wed, 12 Jan 2022 at 12:33, Miquel Raynal <miquel.raynal@bootlin.com> wrote:
...
> +               case IEEE802154_4030KHZ_MEAN_PRF:
> +                       duration = 3974;
> +                       break;
> +               case IEEE802154_62890KHZ_MEAN_PRF:
> +                       duration = 1018;
> +                       break;
> +               default:
> +                       break;
> +               }
> +               break;
> +       default:
> +               break;
> +       }
> +
> +set_duration:
> +       if (!duration)
> +               pr_debug("Unknown PHY symbol duration, the driver should be fixed\n");

Why should the driver be fixed? It's more this table which needs to be fixed?

> +       else
> +               phy->symbol_duration = duration;

Can you also set the lifs/sifs period after the duration is known?

- Alex
Alexander Aring Jan. 14, 2022, 12:01 a.m. UTC | #9
Hi,

On Thu, 13 Jan 2022 at 12:07, Miquel Raynal <miquel.raynal@bootlin.com> wrote:
>
> Hi Alexander,
>
> alex.aring@gmail.com wrote on Wed, 12 Jan 2022 17:44:02 -0500:
>
> > Hi,
> >
> > On Wed, 12 Jan 2022 at 12:33, Miquel Raynal <miquel.raynal@bootlin.com> wrote:
> > ...
> > > +       return 0;
> > > +}
> > > diff --git a/net/mac802154/tx.c b/net/mac802154/tx.c
> > > index c829e4a75325..40656728c624 100644
> > > --- a/net/mac802154/tx.c
> > > +++ b/net/mac802154/tx.c
> > > @@ -54,6 +54,9 @@ ieee802154_tx(struct ieee802154_local *local, struct sk_buff *skb)
> > >         struct net_device *dev = skb->dev;
> > >         int ret;
> > >
> > > +       if (unlikely(mac802154_scan_is_ongoing(local)))
> > > +               return NETDEV_TX_BUSY;
> > > +
> >
> > Please look into the functions "ieee802154_wake_queue()" and
> > "ieee802154_stop_queue()" which prevent this function from being
> > called. Call stop before starting scanning and wake after scanning is
> > done or stopped.
>
> Mmmh all this is already done, isn't it?
> - mac802154_trigger_scan_locked() stops the queue before setting the
>   promiscuous mode
> - mac802154_end_of_scan() wakes the queue after resetting the
>   promiscuous mode to its original state
>
> Should I drop the check which stands for an extra precaution?
>

no, I think then it should be a WARN_ON() more without any return
(hopefully it will survive). This case should never happen otherwise
we have a bug that we wake the queue when we "took control about
transmissions" only.
Change the name, I think it will be in future not only scan related.
Maybe "mac802154_queue_stopped()". Everything which is queued from
socket/upperlayer(6lowpan) goes this way.

>
> But overall I think I don't understand well this part. What is
> a bit foggy to me is why the (async) tx implementation does:
>
> *Core*                           *Driver*
>
> stop_queue()
> drv_async_xmit() -------
>                         \------> do something
>                          ------- calls ieee802154_xmit_complete()
> wakeup_queue() <--------/
>
> So we actually disable the queue for transmitting. Why??
>

Because all transceivers have either _one_ transmit framebuffer or one
framebuffer for transmit and receive one time. We need to report to
stop giving us more skb's while we are busy with one to transmit.
This all will/must be changed in future if there is hardware outside
which is more powerful and the driver needs to control the flow here.

That ieee802154_xmit_complete() calls wakeup_queue need to be
forbidden when we are in "synchronous transmit mode"/the queue is
stopped. The synchronous transmit mode is not for any hotpath, it's
for MLME and I think we also need a per phy lock to avoid multiple
synchronous transmissions at one time. Please note that I don't think
here only about scan operation, also for other possible MLME-ops.

> > Also there exists a race which exists in your way and also the one
> > mentioned above. There can still be some transmissions going on... We
> > need to wait until "all possible" tx completions are done... to be
> > sure there are really no transmissions going on. However we need to be
> > sure that a wake cannot be done if a tx completion is done, we need to
> > avoid it when the scan operation is ongoing as a workaround for this
> > race.
> >
> > This race exists and should be fixed in future work?
>
> Yep, this is true, do you have any pointers? Because I looked at the
> code and for now it appears quite unpractical to add some kind of
> flushing mechanism on that net queue. I believe we cannot use the netif
> interface for that so we would have to implement our own mechanism in
> the ieee802154 core.

yes, we need some kind of "wait_for_completion()" and "complete()". We
are currently lucky that we allow only one skb to be transmitted at
one time. I think it is okay to put that on a per phy basis...

- Alex
Miquel Raynal Jan. 14, 2022, 10:18 a.m. UTC | #10
Hi Alexander,

alex.aring@gmail.com wrote on Thu, 13 Jan 2022 18:36:15 -0500:

> Hi,
> 
> On Wed, 12 Jan 2022 at 12:33, Miquel Raynal <miquel.raynal@bootlin.com> wrote:
> ...
> > +               case IEEE802154_4030KHZ_MEAN_PRF:
> > +                       duration = 3974;
> > +                       break;
> > +               case IEEE802154_62890KHZ_MEAN_PRF:
> > +                       duration = 1018;
> > +                       break;
> > +               default:
> > +                       break;
> > +               }
> > +               break;
> > +       default:
> > +               break;
> > +       }
> > +
> > +set_duration:
> > +       if (!duration)
> > +               pr_debug("Unknown PHY symbol duration, the driver should be fixed\n");  
> 
> Why should the driver be fixed? It's more this table which needs to be fixed?

Right.

> 
> > +       else
> > +               phy->symbol_duration = duration;  
> 
> Can you also set the lifs/sifs period after the duration is known?

Done.

Thanks,
Miquèl
Miquel Raynal Jan. 14, 2022, 6:44 p.m. UTC | #11
Hi Alexander,

alex.aring@gmail.com wrote on Thu, 13 Jan 2022 19:01:56 -0500:

> Hi,
> 
> On Thu, 13 Jan 2022 at 12:07, Miquel Raynal <miquel.raynal@bootlin.com> wrote:
> >
> > Hi Alexander,
> >
> > alex.aring@gmail.com wrote on Wed, 12 Jan 2022 17:44:02 -0500:
> >  
> > > Hi,
> > >
> > > On Wed, 12 Jan 2022 at 12:33, Miquel Raynal <miquel.raynal@bootlin.com> wrote:
> > > ...  
> > > > +       return 0;
> > > > +}
> > > > diff --git a/net/mac802154/tx.c b/net/mac802154/tx.c
> > > > index c829e4a75325..40656728c624 100644
> > > > --- a/net/mac802154/tx.c
> > > > +++ b/net/mac802154/tx.c
> > > > @@ -54,6 +54,9 @@ ieee802154_tx(struct ieee802154_local *local, struct sk_buff *skb)
> > > >         struct net_device *dev = skb->dev;
> > > >         int ret;
> > > >
> > > > +       if (unlikely(mac802154_scan_is_ongoing(local)))
> > > > +               return NETDEV_TX_BUSY;
> > > > +  
> > >
> > > Please look into the functions "ieee802154_wake_queue()" and
> > > "ieee802154_stop_queue()" which prevent this function from being
> > > called. Call stop before starting scanning and wake after scanning is
> > > done or stopped.  
> >
> > Mmmh all this is already done, isn't it?
> > - mac802154_trigger_scan_locked() stops the queue before setting the
> >   promiscuous mode
> > - mac802154_end_of_scan() wakes the queue after resetting the
> >   promiscuous mode to its original state
> >
> > Should I drop the check which stands for an extra precaution?
> >  
> 
> no, I think then it should be a WARN_ON() more without any return
> (hopefully it will survive). This case should never happen otherwise
> we have a bug that we wake the queue when we "took control about
> transmissions" only.
> Change the name, I think it will be in future not only scan related.
> Maybe "mac802154_queue_stopped()". Everything which is queued from
> socket/upperlayer(6lowpan) goes this way.

Got it.

I've changed the name of the helper, and used an atomic variable there
to follow the count. 

> > But overall I think I don't understand well this part. What is
> > a bit foggy to me is why the (async) tx implementation does:
> >
> > *Core*                           *Driver*
> >
> > stop_queue()
> > drv_async_xmit() -------
> >                         \------> do something
> >                          ------- calls ieee802154_xmit_complete()
> > wakeup_queue() <--------/
> >
> > So we actually disable the queue for transmitting. Why??
> >  
> 
> Because all transceivers have either _one_ transmit framebuffer or one
> framebuffer for transmit and receive one time. We need to report to
> stop giving us more skb's while we are busy with one to transmit.
> This all will/must be changed in future if there is hardware outside
> which is more powerful and the driver needs to control the flow here.
> 
> That ieee802154_xmit_complete() calls wakeup_queue need to be
> forbidden when we are in "synchronous transmit mode"/the queue is
> stopped. The synchronous transmit mode is not for any hotpath, it's
> for MLME and I think we also need a per phy lock to avoid multiple
> synchronous transmissions at one time. Please note that I don't think
> here only about scan operation, also for other possible MLME-ops.
> 

First, thank you very much for all your guidance and reviews, I think I
have a much clearer understanding now.

I've tried to follow your advices, creating:
- a way of tracking ongoing transmissions
- a synchronous API for MLME transfers

I've decided to use the wait_queue + atomic combo which looks nice.
Everything seems to work, I just need a bit of time to clean and rework
a bit the series before sending a v3.

Thanks,
Miquèl
Alexander Aring Jan. 16, 2022, 10:44 p.m. UTC | #12
Hi,

On Fri, 14 Jan 2022 at 13:44, Miquel Raynal <miquel.raynal@bootlin.com> wrote:
>
> Hi Alexander,
>
> alex.aring@gmail.com wrote on Thu, 13 Jan 2022 19:01:56 -0500:
>
> > Hi,
> >
> > On Thu, 13 Jan 2022 at 12:07, Miquel Raynal <miquel.raynal@bootlin.com> wrote:
> > >
> > > Hi Alexander,
> > >
> > > alex.aring@gmail.com wrote on Wed, 12 Jan 2022 17:44:02 -0500:
> > >
> > > > Hi,
> > > >
> > > > On Wed, 12 Jan 2022 at 12:33, Miquel Raynal <miquel.raynal@bootlin.com> wrote:
> > > > ...
> > > > > +       return 0;
> > > > > +}
> > > > > diff --git a/net/mac802154/tx.c b/net/mac802154/tx.c
> > > > > index c829e4a75325..40656728c624 100644
> > > > > --- a/net/mac802154/tx.c
> > > > > +++ b/net/mac802154/tx.c
> > > > > @@ -54,6 +54,9 @@ ieee802154_tx(struct ieee802154_local *local, struct sk_buff *skb)
> > > > >         struct net_device *dev = skb->dev;
> > > > >         int ret;
> > > > >
> > > > > +       if (unlikely(mac802154_scan_is_ongoing(local)))
> > > > > +               return NETDEV_TX_BUSY;
> > > > > +
> > > >
> > > > Please look into the functions "ieee802154_wake_queue()" and
> > > > "ieee802154_stop_queue()" which prevent this function from being
> > > > called. Call stop before starting scanning and wake after scanning is
> > > > done or stopped.
> > >
> > > Mmmh all this is already done, isn't it?
> > > - mac802154_trigger_scan_locked() stops the queue before setting the
> > >   promiscuous mode
> > > - mac802154_end_of_scan() wakes the queue after resetting the
> > >   promiscuous mode to its original state
> > >
> > > Should I drop the check which stands for an extra precaution?
> > >
> >
> > no, I think then it should be a WARN_ON() more without any return
> > (hopefully it will survive). This case should never happen otherwise
> > we have a bug that we wake the queue when we "took control about
> > transmissions" only.
> > Change the name, I think it will be in future not only scan related.
> > Maybe "mac802154_queue_stopped()". Everything which is queued from
> > socket/upperlayer(6lowpan) goes this way.
>
> Got it.
>
> I've changed the name of the helper, and used an atomic variable there
> to follow the count.
>
> > > But overall I think I don't understand well this part. What is
> > > a bit foggy to me is why the (async) tx implementation does:
> > >
> > > *Core*                           *Driver*
> > >
> > > stop_queue()
> > > drv_async_xmit() -------
> > >                         \------> do something
> > >                          ------- calls ieee802154_xmit_complete()
> > > wakeup_queue() <--------/
> > >
> > > So we actually disable the queue for transmitting. Why??
> > >
> >
> > Because all transceivers have either _one_ transmit framebuffer or one
> > framebuffer for transmit and receive one time. We need to report to
> > stop giving us more skb's while we are busy with one to transmit.
> > This all will/must be changed in future if there is hardware outside
> > which is more powerful and the driver needs to control the flow here.
> >
> > That ieee802154_xmit_complete() calls wakeup_queue need to be
> > forbidden when we are in "synchronous transmit mode"/the queue is
> > stopped. The synchronous transmit mode is not for any hotpath, it's
> > for MLME and I think we also need a per phy lock to avoid multiple
> > synchronous transmissions at one time. Please note that I don't think
> > here only about scan operation, also for other possible MLME-ops.
> >
>
> First, thank you very much for all your guidance and reviews, I think I
> have a much clearer understanding now.
>
> I've tried to follow your advices, creating:
> - a way of tracking ongoing transmissions
> - a synchronous API for MLME transfers
>

Please note that I think we cannot use netif_stop_queue() from context
outside of netif xmit() callback. It's because the atomic counter
itself is racy in xmit(), we need to be sure xmit() can't occur while
stopping the queue. I think maybe "netif_tx_disable()" is the right
call to stop from another context, because it holds the tx_lock, which
I believe is held while xmit().
Where the wake queue call should be fine to call..., maybe we can
remove some EXPORT_SYMBOL() then?

I saw that some drivers call "ieee802154_wake_queue()" in error cases,
may we introduce a new helper "?ieee802154_xmit_error?" for error
cases so you can also catch error cases in your sync tx. See `grep -r
"ieee802154_wake_queue" drivers/net/ieee802154`, if we have more
information we might add more meaning into the error cases (e.g.
proper errno).

> I've decided to use the wait_queue + atomic combo which looks nice.
> Everything seems to work, I just need a bit of time to clean and rework
> a bit the series before sending a v3.
>

Okay, sounds good to implement both requirements.

- Alex
Alexander Aring Jan. 16, 2022, 10:50 p.m. UTC | #13
Hi,

On Sun, 16 Jan 2022 at 17:44, Alexander Aring <alex.aring@gmail.com> wrote:
>
> Hi,
>
> On Fri, 14 Jan 2022 at 13:44, Miquel Raynal <miquel.raynal@bootlin.com> wrote:
> >
> > Hi Alexander,
> >
> > alex.aring@gmail.com wrote on Thu, 13 Jan 2022 19:01:56 -0500:
> >
> > > Hi,
> > >
> > > On Thu, 13 Jan 2022 at 12:07, Miquel Raynal <miquel.raynal@bootlin.com> wrote:
> > > >
> > > > Hi Alexander,
> > > >
> > > > alex.aring@gmail.com wrote on Wed, 12 Jan 2022 17:44:02 -0500:
> > > >
> > > > > Hi,
> > > > >
> > > > > On Wed, 12 Jan 2022 at 12:33, Miquel Raynal <miquel.raynal@bootlin.com> wrote:
> > > > > ...
> > > > > > +       return 0;
> > > > > > +}
> > > > > > diff --git a/net/mac802154/tx.c b/net/mac802154/tx.c
> > > > > > index c829e4a75325..40656728c624 100644
> > > > > > --- a/net/mac802154/tx.c
> > > > > > +++ b/net/mac802154/tx.c
> > > > > > @@ -54,6 +54,9 @@ ieee802154_tx(struct ieee802154_local *local, struct sk_buff *skb)
> > > > > >         struct net_device *dev = skb->dev;
> > > > > >         int ret;
> > > > > >
> > > > > > +       if (unlikely(mac802154_scan_is_ongoing(local)))
> > > > > > +               return NETDEV_TX_BUSY;
> > > > > > +
> > > > >
> > > > > Please look into the functions "ieee802154_wake_queue()" and
> > > > > "ieee802154_stop_queue()" which prevent this function from being
> > > > > called. Call stop before starting scanning and wake after scanning is
> > > > > done or stopped.
> > > >
> > > > Mmmh all this is already done, isn't it?
> > > > - mac802154_trigger_scan_locked() stops the queue before setting the
> > > >   promiscuous mode
> > > > - mac802154_end_of_scan() wakes the queue after resetting the
> > > >   promiscuous mode to its original state
> > > >
> > > > Should I drop the check which stands for an extra precaution?
> > > >
> > >
> > > no, I think then it should be a WARN_ON() more without any return
> > > (hopefully it will survive). This case should never happen otherwise
> > > we have a bug that we wake the queue when we "took control about
> > > transmissions" only.
> > > Change the name, I think it will be in future not only scan related.
> > > Maybe "mac802154_queue_stopped()". Everything which is queued from
> > > socket/upperlayer(6lowpan) goes this way.
> >
> > Got it.
> >
> > I've changed the name of the helper, and used an atomic variable there
> > to follow the count.
> >
> > > > But overall I think I don't understand well this part. What is
> > > > a bit foggy to me is why the (async) tx implementation does:
> > > >
> > > > *Core*                           *Driver*
> > > >
> > > > stop_queue()
> > > > drv_async_xmit() -------
> > > >                         \------> do something
> > > >                          ------- calls ieee802154_xmit_complete()
> > > > wakeup_queue() <--------/
> > > >
> > > > So we actually disable the queue for transmitting. Why??
> > > >
> > >
> > > Because all transceivers have either _one_ transmit framebuffer or one
> > > framebuffer for transmit and receive one time. We need to report to
> > > stop giving us more skb's while we are busy with one to transmit.
> > > This all will/must be changed in future if there is hardware outside
> > > which is more powerful and the driver needs to control the flow here.
> > >
> > > That ieee802154_xmit_complete() calls wakeup_queue need to be
> > > forbidden when we are in "synchronous transmit mode"/the queue is
> > > stopped. The synchronous transmit mode is not for any hotpath, it's
> > > for MLME and I think we also need a per phy lock to avoid multiple
> > > synchronous transmissions at one time. Please note that I don't think
> > > here only about scan operation, also for other possible MLME-ops.
> > >
> >
> > First, thank you very much for all your guidance and reviews, I think I
> > have a much clearer understanding now.
> >
> > I've tried to follow your advices, creating:
> > - a way of tracking ongoing transmissions
> > - a synchronous API for MLME transfers
> >
>
> Please note that I think we cannot use netif_stop_queue() from context
> outside of netif xmit() callback. It's because the atomic counter
> itself is racy in xmit(), we need to be sure xmit() can't occur while
> stopping the queue. I think maybe "netif_tx_disable()" is the right
> call to stop from another context, because it holds the tx_lock, which
> I believe is held while xmit().
> Where the wake queue call should be fine to call..., maybe we can
> remove some EXPORT_SYMBOL() then?
>

I am sorry, that comment should go below.

> I saw that some drivers call "ieee802154_wake_queue()" in error cases,
> may we introduce a new helper "?ieee802154_xmit_error?" for error
> cases so you can also catch error cases in your sync tx. See `grep -r
> "ieee802154_wake_queue" drivers/net/ieee802154`, if we have more
> information we might add more meaning into the error cases (e.g.
> proper errno).

maybe we can remove some EXPORT_SYMBOL() then?

- Alex
Miquel Raynal Jan. 17, 2022, 9 a.m. UTC | #14
Hi Alexander,

alex.aring@gmail.com wrote on Sun, 16 Jan 2022 17:44:18 -0500:

> Hi,
> 
> On Fri, 14 Jan 2022 at 13:44, Miquel Raynal <miquel.raynal@bootlin.com> wrote:
> >
> > Hi Alexander,
> >
> > alex.aring@gmail.com wrote on Thu, 13 Jan 2022 19:01:56 -0500:
> >  
> > > Hi,
> > >
> > > On Thu, 13 Jan 2022 at 12:07, Miquel Raynal <miquel.raynal@bootlin.com> wrote:  
> > > >
> > > > Hi Alexander,
> > > >
> > > > alex.aring@gmail.com wrote on Wed, 12 Jan 2022 17:44:02 -0500:
> > > >  
> > > > > Hi,
> > > > >
> > > > > On Wed, 12 Jan 2022 at 12:33, Miquel Raynal <miquel.raynal@bootlin.com> wrote:
> > > > > ...  
> > > > > > +       return 0;
> > > > > > +}
> > > > > > diff --git a/net/mac802154/tx.c b/net/mac802154/tx.c
> > > > > > index c829e4a75325..40656728c624 100644
> > > > > > --- a/net/mac802154/tx.c
> > > > > > +++ b/net/mac802154/tx.c
> > > > > > @@ -54,6 +54,9 @@ ieee802154_tx(struct ieee802154_local *local, struct sk_buff *skb)
> > > > > >         struct net_device *dev = skb->dev;
> > > > > >         int ret;
> > > > > >
> > > > > > +       if (unlikely(mac802154_scan_is_ongoing(local)))
> > > > > > +               return NETDEV_TX_BUSY;
> > > > > > +  
> > > > >
> > > > > Please look into the functions "ieee802154_wake_queue()" and
> > > > > "ieee802154_stop_queue()" which prevent this function from being
> > > > > called. Call stop before starting scanning and wake after scanning is
> > > > > done or stopped.  
> > > >
> > > > Mmmh all this is already done, isn't it?
> > > > - mac802154_trigger_scan_locked() stops the queue before setting the
> > > >   promiscuous mode
> > > > - mac802154_end_of_scan() wakes the queue after resetting the
> > > >   promiscuous mode to its original state
> > > >
> > > > Should I drop the check which stands for an extra precaution?
> > > >  
> > >
> > > no, I think then it should be a WARN_ON() more without any return
> > > (hopefully it will survive). This case should never happen otherwise
> > > we have a bug that we wake the queue when we "took control about
> > > transmissions" only.
> > > Change the name, I think it will be in future not only scan related.
> > > Maybe "mac802154_queue_stopped()". Everything which is queued from
> > > socket/upperlayer(6lowpan) goes this way.  
> >
> > Got it.
> >
> > I've changed the name of the helper, and used an atomic variable there
> > to follow the count.
> >  
> > > > But overall I think I don't understand well this part. What is
> > > > a bit foggy to me is why the (async) tx implementation does:
> > > >
> > > > *Core*                           *Driver*
> > > >
> > > > stop_queue()
> > > > drv_async_xmit() -------
> > > >                         \------> do something
> > > >                          ------- calls ieee802154_xmit_complete()
> > > > wakeup_queue() <--------/
> > > >
> > > > So we actually disable the queue for transmitting. Why??
> > > >  
> > >
> > > Because all transceivers have either _one_ transmit framebuffer or one
> > > framebuffer for transmit and receive one time. We need to report to
> > > stop giving us more skb's while we are busy with one to transmit.
> > > This all will/must be changed in future if there is hardware outside
> > > which is more powerful and the driver needs to control the flow here.
> > >
> > > That ieee802154_xmit_complete() calls wakeup_queue need to be
> > > forbidden when we are in "synchronous transmit mode"/the queue is
> > > stopped. The synchronous transmit mode is not for any hotpath, it's
> > > for MLME and I think we also need a per phy lock to avoid multiple
> > > synchronous transmissions at one time. Please note that I don't think
> > > here only about scan operation, also for other possible MLME-ops.
> > >  
> >
> > First, thank you very much for all your guidance and reviews, I think I
> > have a much clearer understanding now.
> >
> > I've tried to follow your advices, creating:
> > - a way of tracking ongoing transmissions
> > - a synchronous API for MLME transfers
> >  
> 
> Please note that I think we cannot use netif_stop_queue() from context
> outside of netif xmit() callback. It's because the atomic counter
> itself is racy in xmit(), we need to be sure xmit() can't occur while
> stopping the queue.

In my current implementation I don't see this as a real problem because
for me, there is no real difference between:

- a transfer is started
- we call stop_queue()
* right here a transfer is ongoing *

and 

- we call stop_queue()
- the counter is racy hence a last transfer is started
* right here a transfer is ongoing *

because stopping the queue and "flushing" it are two different things.
In the code I don't only rely on the queue being stopped but if I don't
want any more transfer to happen after that, so I also sync the queue
thanks to the new helpers introduced.

Please check v3 (which is coming very soon) and tell me what you think.
Maybe I missed something.

> I think maybe "netif_tx_disable()" is the right
> call to stop from another context, because it holds the tx_lock, which
> I believe is held while xmit().
> Where the wake queue call should be fine to call..., maybe we can
> remove some EXPORT_SYMBOL() then?
> 
> I saw that some drivers call "ieee802154_wake_queue()" in error cases,
> may we introduce a new helper "?ieee802154_xmit_error?" for error
> cases so you can also catch error cases in your sync tx. See `grep -r
> "ieee802154_wake_queue" drivers/net/ieee802154`, if we have more
> information we might add more meaning into the error cases (e.g.
> proper errno).

Most of the time the calling functions are void functions. In fact they
all simply hardcode the xmit_done helper and even worse, sometimes they
simply leak the skb. I've handled that already by updating all these
callers to be sure the only way out is to call xmit_done, which helps a
lot tracking transfers.

Also, you are right, we can certainly drop a couple of EXPORT_SYMBOLS
:-)

> > I've decided to use the wait_queue + atomic combo which looks nice.
> > Everything seems to work, I just need a bit of time to clean and rework
> > a bit the series before sending a v3.
> >  
> 
> Okay, sounds good to implement both requirements.
> 
> - Alex

Thanks,
Miquèl