Message ID | 20210809102152.719961-1-idosch@idosch.org |
---|---|
Headers | show |
Series | ethtool: Add ability to control transceiver modules | expand |
On Mon, Aug 09, 2021 at 01:21:45PM +0300, Ido Schimmel wrote: > From: Ido Schimmel <idosch@nvidia.com> > > Add a pair of new ethtool messages, 'ETHTOOL_MSG_MODULE_SET' and > 'ETHTOOL_MSG_MODULE_GET', that can be used to control transceiver > modules parameters and retrieve their status. Hi Ido I've not read all the patchset yet, but i like the general direction. > The first parameter to control is the low power mode of the module. It > is only relevant for paged memory modules, as flat memory modules always > operate in low power mode. > > When a paged memory module is in low power mode, its power consumption > is reduced to the minimum, the management interface towards the host is > available and the data path is deactivated. > > User space can choose to put modules that are not currently in use in > low power mode and transition them to high power mode before putting the > associated ports administratively up. > > Transitioning into low power mode means loss of carrier, so error is > returned when the netdev is administratively up. However, i don't get this use case. With copper PHYs, putting the link administratively down results in a call into phylib and into the driver to down the link. This effectively puts the PHY into a low power mode. The management interface, as defined by C22 and C45 remain available, but the data path is disabled. For a 1G PHY, this can save a few watts. For SFPs managed by phylink and the kernal SFP driver, the exact same happens. The TX_ENABLE pin of the SFP is set to false. The I2C bus still works, but the data path is disabled. So i would expect a driver using firmware, not Linux code to manage SFPs, to just do this on link down. Why do we need user space involved? Andrew
On Mon, Aug 09, 2021 at 04:28:32PM +0200, Andrew Lunn wrote: > On Mon, Aug 09, 2021 at 01:21:45PM +0300, Ido Schimmel wrote: > > From: Ido Schimmel <idosch@nvidia.com> > > > > Add a pair of new ethtool messages, 'ETHTOOL_MSG_MODULE_SET' and > > 'ETHTOOL_MSG_MODULE_GET', that can be used to control transceiver > > modules parameters and retrieve their status. > > Hi Ido > > I've not read all the patchset yet, but i like the general direction. > > > The first parameter to control is the low power mode of the module. It > > is only relevant for paged memory modules, as flat memory modules always > > operate in low power mode. > > > > When a paged memory module is in low power mode, its power consumption > > is reduced to the minimum, the management interface towards the host is > > available and the data path is deactivated. > > > > User space can choose to put modules that are not currently in use in > > low power mode and transition them to high power mode before putting the > > associated ports administratively up. > > > > Transitioning into low power mode means loss of carrier, so error is > > returned when the netdev is administratively up. > > However, i don't get this use case. With copper PHYs, putting the link > administratively down results in a call into phylib and into the > driver to down the link. This effectively puts the PHY into a low > power mode. The management interface, as defined by C22 and C45 remain > available, but the data path is disabled. For a 1G PHY, this can save > a few watts. > > For SFPs managed by phylink and the kernal SFP driver, the exact same > happens. The TX_ENABLE pin of the SFP is set to false. The I2C bus > still works, but the data path is disabled. > > So i would expect a driver using firmware, not Linux code to manage > SFPs, to just do this on link down. Why do we need user space > involved? The transition from low power to high power can take a few seconds with QSFP/QSFP-DD and it's likely to only get longer with future / more complex modules. Therefore, to reduce link-up time, the firmware automatically transitions modules to high power mode. There is obviously a trade-off here between power consumption and link-up time. My understanding is that Mellanox is not the only vendor favoring shorter link-up times as users have the ability to control the low power mode of the modules in other implementations. Regarding "why do we need user space involved?", by default, it does not need to be involved (the system works without this API), but if it wants to reduce the power consumption by setting unused modules to low power mode, then it will need to use this API.
On Mon, Aug 09, 2021 at 09:13:47PM +0200, Andrew Lunn wrote: > On Mon, Aug 09, 2021 at 01:21:46PM +0300, Ido Schimmel wrote: > > From: Ido Schimmel <idosch@nvidia.com> > > > > Add a new ethtool message, 'ETHTOOL_MSG_MODULE_RESET_ACT', which allows > > user space to request a reset of transceiver modules. A successful reset > > results in a notification being emitted to user space in the form of a > > 'ETHTOOL_MSG_MODULE_RESET_NTF' message. > > > > Reset can be performed by either asserting the relevant hardware signal > > ("Reset" in CMIS / "ResetL" in SFF-8636) or by writing to the relevant > > reset bit in the module's EEPROM (page 00h, byte 26, bit 3 in CMIS / > > page 00h, byte 93, bit 7 in SFF-8636). > > > > Reset is useful in order to allow a module to transition out of a fault > > state. From section 6.3.2.12 in CMIS 5.0: "Except for a power cycle, the > > only exit path from the ModuleFault state is to perform a module reset > > by taking an action that causes the ResetS transition signal to become > > TRUE (see Table 6-11)". > > > > To avoid changes to the operational state of the device, reset can only > > be performed when the device is administratively down. > > > > Example usage: > > > > # ethtool --reset-module swp11 > > netlink error: Cannot reset module when port is administratively up > > netlink error: Invalid argument > > > > # ip link set dev swp11 down > > > > # ethtool --reset-module swp11 > > > > Monitor notifications: > > > > $ ethtool --monitor > > listening... > > > > Module reset done for swp11 > > Again, i'm wondering, why is user space doing the reset? Can you think > of any other piece of hardware where Linux relies on user space > performing a reset before the kernel can properly use it? > > How long does a reset take? Table 10-1 says the reset pulse must be > 10uS and table 10-2 says the reset should not take longer than > 2000ms. Takes about 1.5ms to get an ACK on the reset request and another few seconds to ensure module is in a valid operational state (will remove RTNL in next version). > So maybe reset it on ifup if it is in a bad state? We can have multiple ports (split) using the same module and in CMIS each data path is controlled by a different state machine. Given the complexity of these modules and possible faults, it is possible to imagine a situation in which a few ports are fine and the rest are unable to obtain a carrier. Resetting the module on ifup of swp1s0 is not intuitive and it shouldn't affect other split ports (e.g., swp1s1). With the dedicated reset command we have the ability to enforce all the required restrictions from the start instead of changing the behavior of existing commands. > I assume the driver/firmware is monitoring the SFP and if it goes into > a state which requires a reset it indicates carrier down? Wasn't there > some patches which added link down reasons? It would make sense to add > enum ethtool_link_ext_substate_sfp_fault? You can then use ethtool to > see what state the module is in, and a down/ip should reset it? I will look into extending the interface with more reasons and parse the CMIS ModuleFaultCause (see table 8-15) in ethtool(8).
> The transition from low power to high power can take a few seconds with > QSFP/QSFP-DD and it's likely to only get longer with future / more > complex modules. Therefore, to reduce link-up time, the firmware > automatically transitions modules to high power mode. > > There is obviously a trade-off here between power consumption and > link-up time. My understanding is that Mellanox is not the only vendor > favoring shorter link-up times as users have the ability to control the > low power mode of the modules in other implementations. > > Regarding "why do we need user space involved?", by default, it does not > need to be involved (the system works without this API), but if it wants > to reduce the power consumption by setting unused modules to low power > mode, then it will need to use this API. O.K. Thanks for the better explanation. Some of this should go into the commit message. I suggest it gets a different name and semantics, to avoid confusion. I think we should consider this the default power mode for when the link is administratively down, rather than direct control over the modules power mode. The driver should transition the module to this setting on link down, be it high power or low power. That saves a lot of complexity, since i assume you currently need a udev script or something which sets it to low power mode on link down, where as you can avoid this be configuring the default and let the driver do it. I also wonder if a hierarchy is needed? You can set the default for the switch, and then override is per module? I _guess_ most users will decide at a switch level they want to save power and pay the penalty over longer link up times. But then we have the question, is it an ethtool option, or a devlink parameter? Andrew
On Tue, 10 Aug 2021 16:05:07 +0300 Ido Schimmel wrote: > > Again, i'm wondering, why is user space doing the reset? Can you think > > of any other piece of hardware where Linux relies on user space > > performing a reset before the kernel can properly use it? > > > > How long does a reset take? Table 10-1 says the reset pulse must be > > 10uS and table 10-2 says the reset should not take longer than > > 2000ms. > > Takes about 1.5ms to get an ACK on the reset request and another few > seconds to ensure module is in a valid operational state (will remove > RTNL in next version). Hm. RTNL-lock-less ethtool ops are a little concerning. The devlink locking was much complicated by the unclear locking rules. Can we keep ethtool simple and put this functionality in a different API or make the reset async? > > So maybe reset it on ifup if it is in a bad state? > > We can have multiple ports (split) using the same module and in CMIS > each data path is controlled by a different state machine. Given the > complexity of these modules and possible faults, it is possible to > imagine a situation in which a few ports are fine and the rest are > unable to obtain a carrier. > > Resetting the module on ifup of swp1s0 is not intuitive and it shouldn't > affect other split ports (e.g., swp1s1). With the dedicated reset > command we have the ability to enforce all the required restrictions > from the start instead of changing the behavior of existing commands. Sounds similar to what ethtool --reset was trying to address (upper 16 bits). Could we reuse that? Whether its a SFP or other part of the port being reset may not be entirely important to the user, so perhaps it's not a bad idea to abstract that away and stick to "reset levels"?
On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote: > > The transition from low power to high power can take a few seconds with > > QSFP/QSFP-DD and it's likely to only get longer with future / more > > complex modules. Therefore, to reduce link-up time, the firmware > > automatically transitions modules to high power mode. > > > > There is obviously a trade-off here between power consumption and > > link-up time. My understanding is that Mellanox is not the only vendor > > favoring shorter link-up times as users have the ability to control the > > low power mode of the modules in other implementations. > > > > Regarding "why do we need user space involved?", by default, it does not > > need to be involved (the system works without this API), but if it wants > > to reduce the power consumption by setting unused modules to low power > > mode, then it will need to use this API. > > O.K. Thanks for the better explanation. Some of this should go into > the commit message. > > I suggest it gets a different name and semantics, to avoid > confusion. I think we should consider this the default power mode for > when the link is administratively down, rather than direct control > over the modules power mode. The driver should transition the module > to this setting on link down, be it high power or low power. That > saves a lot of complexity, since i assume you currently need a udev > script or something which sets it to low power mode on link down, > where as you can avoid this be configuring the default and let the > driver do it. Good point. And actually NICs have similar knobs, exposed via ethtool priv flags today. Intel NICs for example. Maybe we should create a "really power the port down policy" API? Jake do you know what the use cases for Intel are? Are they SFP, MAC, or NC-SI related? > I also wonder if a hierarchy is needed? You can set the default for > the switch, and then override is per module? I _guess_ most users will > decide at a switch level they want to save power and pay the penalty > over longer link up times. But then we have the question, is it an > ethtool option, or a devlink parameter?
On Tue, Aug 10, 2021 at 06:54:23AM -0700, Jakub Kicinski wrote: > On Tue, 10 Aug 2021 16:05:07 +0300 Ido Schimmel wrote: > > > Again, i'm wondering, why is user space doing the reset? Can you think > > > of any other piece of hardware where Linux relies on user space > > > performing a reset before the kernel can properly use it? > > > > > > How long does a reset take? Table 10-1 says the reset pulse must be > > > 10uS and table 10-2 says the reset should not take longer than > > > 2000ms. > > > > Takes about 1.5ms to get an ACK on the reset request and another few > > seconds to ensure module is in a valid operational state (will remove > > RTNL in next version). > > Hm. RTNL-lock-less ethtool ops are a little concerning. The devlink > locking was much complicated by the unclear locking rules. Can we keep > ethtool simple and put this functionality in a different API or make > the reset async? I thought there are already RTNL-lock-less ethtool ops, but maybe I imagined it. Given that we also want to support firmware update on modules and that user space might want to update several modules simultaneously, do you have a suggestion on how to handle it from locking perspective? The ethtool netlink backend has parallel ops, but RTNL is a problem. Firmware flashing is currently synchronous in both ethtool and devlink, but the latter does not hold RTNL. > > > > So maybe reset it on ifup if it is in a bad state? > > > > We can have multiple ports (split) using the same module and in CMIS > > each data path is controlled by a different state machine. Given the > > complexity of these modules and possible faults, it is possible to > > imagine a situation in which a few ports are fine and the rest are > > unable to obtain a carrier. > > > > Resetting the module on ifup of swp1s0 is not intuitive and it shouldn't > > affect other split ports (e.g., swp1s1). With the dedicated reset > > command we have the ability to enforce all the required restrictions > > from the start instead of changing the behavior of existing commands. > > Sounds similar to what ethtool --reset was trying to address (upper > 16 bits). Could we reuse that? Whether its a SFP or other part of the > port being reset may not be entirely important to the user, so perhaps > it's not a bad idea to abstract that away and stick to "reset levels"? Wasn't aware of this API. Looks like it is only implemented by a few drivers, but man page says "phy Transceiver/PHY", so I think we can reuse it. What do you mean by "reset levels"? The split between shared / dedicated? Just to make sure I understand, you suggest the following semantics? # ethtool --reset swp1s0 phy Error since module is used by several ports (split) # ethtool --reset swp1s0 phy-shared OK # ethtool --reset swp1 phy OK (no split) # ethtool --reset swp1 phy-shared OK
> I thought there are already RTNL-lock-less ethtool ops, but maybe I > imagined it. Given that we also want to support firmware update on > modules and that user space might want to update several modules > simultaneously, do you have a suggestion on how to handle it from > locking perspective? I had a similar problem for Ethernet cable testing. It takes a few seconds to perform a cable test, and you don't want to hold the RTNL for that, if you can help it. I made changes to the PHY state machine, so it has an additional state, indicating a cable test is in operation. The ethtool netlink op simply starts the cable test and then returns. Once the cable test is complete, it async reports the results to user space. So for module upgrade, you probably need to add a per module state machine. Changes to the state need to hold RTNL, but you can release it between state changes. Andrew
On Tue, 10 Aug 2021 21:15:45 +0300 Ido Schimmel wrote: > On Tue, Aug 10, 2021 at 06:54:23AM -0700, Jakub Kicinski wrote: > > On Tue, 10 Aug 2021 16:05:07 +0300 Ido Schimmel wrote: > > > Takes about 1.5ms to get an ACK on the reset request and another few > > > seconds to ensure module is in a valid operational state (will remove > > > RTNL in next version). > > > > Hm. RTNL-lock-less ethtool ops are a little concerning. The devlink > > locking was much complicated by the unclear locking rules. Can we keep > > ethtool simple and put this functionality in a different API or make > > the reset async? > > I thought there are already RTNL-lock-less ethtool ops, but maybe I > imagined it. Given that we also want to support firmware update on > modules and that user space might want to update several modules > simultaneously, do you have a suggestion on how to handle it from > locking perspective? Hm, flashing is harder than reset. We can't unbind the driver while it's in progress. I don't have any ready solution in mind, but I'd like to make sure the locking is clear and hard to get wrong. Maybe we could have a mix of ops, one called for "preparing" the flashing called under rtnl and another for "commit" with "unlocked" in the name. Drivers which don't want to deal with dropping rtnl lock can just do everything in the first stage? Perhaps Andrew has better ideas, I'm just spit-balling. Presumably there are already locks at play, locks we would have to take in the case where Linux manages the PHY. Maybe they dictate an architecture? > The ethtool netlink backend has parallel ops, but > RTNL is a problem. Firmware flashing is currently synchronous in both > ethtool and devlink, but the latter does not hold RTNL. Yeah, drivers dropping rtnl_lock mid way thru the ethtool flashing op was one of my main motivations for moving it into devlink. > > > We can have multiple ports (split) using the same module and in CMIS > > > each data path is controlled by a different state machine. Given the > > > complexity of these modules and possible faults, it is possible to > > > imagine a situation in which a few ports are fine and the rest are > > > unable to obtain a carrier. > > > > > > Resetting the module on ifup of swp1s0 is not intuitive and it shouldn't > > > affect other split ports (e.g., swp1s1). With the dedicated reset > > > command we have the ability to enforce all the required restrictions > > > from the start instead of changing the behavior of existing commands. > > > > Sounds similar to what ethtool --reset was trying to address (upper > > 16 bits). Could we reuse that? Whether its a SFP or other part of the > > port being reset may not be entirely important to the user, so perhaps > > it's not a bad idea to abstract that away and stick to "reset levels"? > > Wasn't aware of this API. Looks like it is only implemented by a few > drivers, but man page says "phy Transceiver/PHY", so I think we can > reuse it. > > What do you mean by "reset levels"? The split between shared / > dedicated? Indeed. > Just to make sure I understand, you suggest the following semantics? > > # ethtool --reset swp1s0 phy > Error since module is used by several ports (split) > > # ethtool --reset swp1s0 phy-shared > OK > > # ethtool --reset swp1 phy > OK (no split) > > # ethtool --reset swp1 phy-shared > OK Exactly.
> Hm, flashing is harder than reset. We can't unbind the driver while > it's in progress. I don't have any ready solution in mind, but I'd > like to make sure the locking is clear and hard to get wrong. Maybe > we could have a mix of ops, one called for "preparing" the flashing > called under rtnl and another for "commit" with "unlocked" in the name. > Drivers which don't want to deal with dropping rtnl lock can just do > everything in the first stage? Perhaps Andrew has better ideas, I'm > just spit-balling. Presumably there are already locks at play, locks > we would have to take in the case where Linux manages the PHY. Maybe > they dictate an architecture? I don't think the way linux manages PHYs dictates the architecture. PHY cable test requires that the link is administratively up, so the PHY state machine is in play. It transitions into a testing state when cable test is started, and when the test is finished, it resets the PHY to put it back into running state. If you down the interface while the cable test is running, it aborts the cable test, and then downs the PHY. Flashing firmware is a bit different. You need to ensure the interface is down. And i guess that gets interesting with split modules. You really should not abort an upgrade because the user wants to up the interface. So -EBUSY to open() seems like the best option, based on the state of the SFP state machine. I suspect you are going to need a kernel thread to do the real work. So your "prepare" netlink op would pass the name of the firmware file. Some basic validation would be performed, that all the needed interfaces are down etc, and then the netlink OP would return. The thread then uses request_firmware() to get access to the firmware, and program it. Once complete, or on error, it can async notify user space that it is sorry, your module is toast, or firmware upgrade was successful. This is just throwing out ideas... Andrew
On Tue, Aug 10, 2021 at 06:59:54AM -0700, Jakub Kicinski wrote: > On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote: > > > The transition from low power to high power can take a few seconds with > > > QSFP/QSFP-DD and it's likely to only get longer with future / more > > > complex modules. Therefore, to reduce link-up time, the firmware > > > automatically transitions modules to high power mode. > > > > > > There is obviously a trade-off here between power consumption and > > > link-up time. My understanding is that Mellanox is not the only vendor > > > favoring shorter link-up times as users have the ability to control the > > > low power mode of the modules in other implementations. > > > > > > Regarding "why do we need user space involved?", by default, it does not > > > need to be involved (the system works without this API), but if it wants > > > to reduce the power consumption by setting unused modules to low power > > > mode, then it will need to use this API. > > > > O.K. Thanks for the better explanation. Some of this should go into > > the commit message. > > > > I suggest it gets a different name and semantics, to avoid > > confusion. I think we should consider this the default power mode for > > when the link is administratively down, rather than direct control > > over the modules power mode. The driver should transition the module > > to this setting on link down, be it high power or low power. That > > saves a lot of complexity, since i assume you currently need a udev > > script or something which sets it to low power mode on link down, > > where as you can avoid this be configuring the default and let the > > driver do it. > > Good point. And actually NICs have similar knobs, exposed via ethtool > priv flags today. Intel NICs for example. Maybe we should create a > "really power the port down policy" API? See below about Intel. I'm not sure it's the same thing... I'm against adding a vague "really power the port down policy" API. The API proposed in the patch is well-defined, its implementation is documented in standards, its implications are clear and we offer APIs that give user space full observability into its operation. A vague API means that it is going to be abused and user space will get different results over different implementations. After reading the *commit messages* about the private flags, I'm not sure what the flags really do, what is their true motivation, implications or how do I get observability into their operation. I'm not too hopeful about the user documentation. Also, like I mentioned in the cover letter, given the complexity of these modules and as they become more common, it is likely that we will need to extend the API to control more parameters and expose more diagnostic information. I would really like to keep it clean and contained in 'ETHTOOL_MSG_MODULE_*' messages and not spread it over different APIs. > > Jake do you know what the use cases for Intel are? Are they SFP, MAC, > or NC-SI related? I went through all the Intel drivers that implement these operations and I believe you are talking about these commits: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1 There isn't too much information about the motivation, but maybe it has something to do with multi-host controllers where you want to prevent one host from taking the physical link down for all the other hosts sharing it? I remember such issues with mlx5. > > > I also wonder if a hierarchy is needed? You can set the default for > > the switch, and then override is per module? I _guess_ most users will > > decide at a switch level they want to save power and pay the penalty > > over longer link up times. But then we have the question, is it an > > ethtool option, or a devlink parameter?
On Tue, Aug 10, 2021 at 09:28:08PM +0200, Andrew Lunn wrote: > > Hm, flashing is harder than reset. We can't unbind the driver while > > it's in progress. I don't have any ready solution in mind, but I'd > > like to make sure the locking is clear and hard to get wrong. Maybe > > we could have a mix of ops, one called for "preparing" the flashing > > called under rtnl and another for "commit" with "unlocked" in the name. > > Drivers which don't want to deal with dropping rtnl lock can just do > > everything in the first stage? Perhaps Andrew has better ideas, I'm > > just spit-balling. Presumably there are already locks at play, locks > > we would have to take in the case where Linux manages the PHY. Maybe > > they dictate an architecture? > > I don't think the way linux manages PHYs dictates the > architecture. PHY cable test requires that the link is > administratively up, so the PHY state machine is in play. It > transitions into a testing state when cable test is started, and when > the test is finished, it resets the PHY to put it back into running > state. If you down the interface while the cable test is running, it > aborts the cable test, and then downs the PHY. > > Flashing firmware is a bit different. You need to ensure the interface > is down. And i guess that gets interesting with split modules. You > really should not abort an upgrade because the user wants to up the > interface. So -EBUSY to open() seems like the best option, based on > the state of the SFP state machine. > > I suspect you are going to need a kernel thread to do the real > work. So your "prepare" netlink op would pass the name of the firmware > file. Some basic validation would be performed, that all the needed > interfaces are down etc, and then the netlink OP would return. The > thread then uses request_firmware() to get access to the firmware, and > program it. Once complete, or on error, it can async notify user space > that it is sorry, your module is toast, or firmware upgrade was > successful. > > This is just throwing out ideas... Thanks Andrew and Jakub. I will look into these suggestions more closely when I start working on modules firmware update.
> -----Original Message----- > From: Jakub Kicinski <kuba@kernel.org> > Sent: Tuesday, August 10, 2021 7:00 AM > To: Andrew Lunn <andrew@lunn.ch>; Keller, Jacob E <jacob.e.keller@intel.com> > Cc: Ido Schimmel <idosch@idosch.org>; netdev@vger.kernel.org; > davem@davemloft.net; mkubecek@suse.cz; pali@kernel.org; > vadimp@nvidia.com; mlxsw@nvidia.com; Ido Schimmel <idosch@nvidia.com> > Subject: Re: [RFC PATCH net-next 1/8] ethtool: Add ability to control transceiver > modules' low power mode > > On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote: > > > The transition from low power to high power can take a few seconds with > > > QSFP/QSFP-DD and it's likely to only get longer with future / more > > > complex modules. Therefore, to reduce link-up time, the firmware > > > automatically transitions modules to high power mode. > > > > > > There is obviously a trade-off here between power consumption and > > > link-up time. My understanding is that Mellanox is not the only vendor > > > favoring shorter link-up times as users have the ability to control the > > > low power mode of the modules in other implementations. > > > > > > Regarding "why do we need user space involved?", by default, it does not > > > need to be involved (the system works without this API), but if it wants > > > to reduce the power consumption by setting unused modules to low power > > > mode, then it will need to use this API. > > > > O.K. Thanks for the better explanation. Some of this should go into > > the commit message. > > > > I suggest it gets a different name and semantics, to avoid > > confusion. I think we should consider this the default power mode for > > when the link is administratively down, rather than direct control > > over the modules power mode. The driver should transition the module > > to this setting on link down, be it high power or low power. That > > saves a lot of complexity, since i assume you currently need a udev > > script or something which sets it to low power mode on link down, > > where as you can avoid this be configuring the default and let the > > driver do it. > > Good point. And actually NICs have similar knobs, exposed via ethtool > priv flags today. Intel NICs for example. Maybe we should create a > "really power the port down policy" API? > > Jake do you know what the use cases for Intel are? Are they SFP, MAC, > or NC-SI related? Offhand I don't know. I think we have some requirements documents I can look up. I'll try to get back to you soon if I can find any further information. (Yes, I wish the commit messages gave stronger motivation too...) Thanks, Jake > > > I also wonder if a hierarchy is needed? You can set the default for > > the switch, and then override is per module? I _guess_ most users will > > decide at a switch level they want to save power and pay the penalty > > over longer link up times. But then we have the question, is it an > > ethtool option, or a devlink parameter?
On 8/10/2021 1:46 PM, Ido Schimmel wrote: > On Tue, Aug 10, 2021 at 06:59:54AM -0700, Jakub Kicinski wrote: >> On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote: >>>> The transition from low power to high power can take a few seconds with >>>> QSFP/QSFP-DD and it's likely to only get longer with future / more >>>> complex modules. Therefore, to reduce link-up time, the firmware >>>> automatically transitions modules to high power mode. >>>> >>>> There is obviously a trade-off here between power consumption and >>>> link-up time. My understanding is that Mellanox is not the only vendor >>>> favoring shorter link-up times as users have the ability to control the >>>> low power mode of the modules in other implementations. >>>> >>>> Regarding "why do we need user space involved?", by default, it does not >>>> need to be involved (the system works without this API), but if it wants >>>> to reduce the power consumption by setting unused modules to low power >>>> mode, then it will need to use this API. >>> >>> O.K. Thanks for the better explanation. Some of this should go into >>> the commit message. >>> >>> I suggest it gets a different name and semantics, to avoid >>> confusion. I think we should consider this the default power mode for >>> when the link is administratively down, rather than direct control >>> over the modules power mode. The driver should transition the module >>> to this setting on link down, be it high power or low power. That >>> saves a lot of complexity, since i assume you currently need a udev >>> script or something which sets it to low power mode on link down, >>> where as you can avoid this be configuring the default and let the >>> driver do it. >> >> Good point. And actually NICs have similar knobs, exposed via ethtool >> priv flags today. Intel NICs for example. Maybe we should create a >> "really power the port down policy" API? > > See below about Intel. I'm not sure it's the same thing... > > I'm against adding a vague "really power the port down policy" API. The > API proposed in the patch is well-defined, its implementation is > documented in standards, its implications are clear and we offer APIs > that give user space full observability into its operation. > > A vague API means that it is going to be abused and user space will get > different results over different implementations. After reading the > *commit messages* about the private flags, I'm not sure what the flags > really do, what is their true motivation, implications or how do I get > observability into their operation. I'm not too hopeful about the user > documentation. > > Also, like I mentioned in the cover letter, given the complexity of > these modules and as they become more common, it is likely that we will > need to extend the API to control more parameters and expose more > diagnostic information. I would really like to keep it clean and > contained in 'ETHTOOL_MSG_MODULE_*' messages and not spread it over > different APIs. > >> >> Jake do you know what the use cases for Intel are? Are they SFP, MAC, >> or NC-SI related? > > I went through all the Intel drivers that implement these operations and > I believe you are talking about these commits: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070 > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408 > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1 > > There isn't too much information about the motivation, but maybe it has > something to do with multi-host controllers where you want to prevent > one host from taking the physical link down for all the other hosts > sharing it? I remember such issues with mlx5. > Ok, I found some more information here. The primary motivation of the changes in the i40e and ice drivers is from customer requests asking to have the link go down when the port is administratively disabled. This is because if the link is down then the switch on the other side will see the port not having link and will stop trying to send traffic to it. As far as I can tell, the reason its a flag is because some users wanted the behavior the other way. I'm not sure it's really related to the behavior here. For what it's worth, I'm in favor of containing things like this into ethtool as well. Thanks, Jake
On Tue, 10 Aug 2021 23:46:28 +0300 Ido Schimmel wrote: > On Tue, Aug 10, 2021 at 06:59:54AM -0700, Jakub Kicinski wrote: > > On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote: > > > O.K. Thanks for the better explanation. Some of this should go into > > > the commit message. > > > > > > I suggest it gets a different name and semantics, to avoid > > > confusion. I think we should consider this the default power mode for > > > when the link is administratively down, rather than direct control > > > over the modules power mode. The driver should transition the module > > > to this setting on link down, be it high power or low power. That > > > saves a lot of complexity, since i assume you currently need a udev > > > script or something which sets it to low power mode on link down, > > > where as you can avoid this be configuring the default and let the > > > driver do it. > > > > Good point. And actually NICs have similar knobs, exposed via ethtool > > priv flags today. Intel NICs for example. Maybe we should create a > > "really power the port down policy" API? > > See below about Intel. I'm not sure it's the same thing... > > I'm against adding a vague "really power the port down policy" API. The > API proposed in the patch is well-defined, its implementation is > documented in standards, its implications are clear and we offer APIs > that give user space full observability into its operation. > > A vague API means that it is going to be abused and user space will get > different results over different implementations. After reading the > *commit messages* about the private flags, I'm not sure what the flags > really do, what is their true motivation, implications or how do I get > observability into their operation. I'm not too hopeful about the user > documentation. > > Also, like I mentioned in the cover letter, given the complexity of > these modules and as they become more common, it is likely that we will > need to extend the API to control more parameters and expose more > diagnostic information. I would really like to keep it clean and > contained in 'ETHTOOL_MSG_MODULE_*' messages and not spread it over > different APIs. The patch is well defined but it doesn't provide user with the answer to the question "why is the SFP still up if I asked it to be down?" It's good to match specs closely but Linux may need to reconcile multiple policies. IIUC if Intel decides to keep the SFP up for "other" reasons the situation will look like this: $ ethtool --show-module eth0 Module parameters for eth0: low-power true # ethtool -m eth0 Module State : 0x03 (ModuleReady) LowPwrAllowRequestHW : Off LowPwrRequestSW : Off IOW the low-power mode is a way for user to express preference to shut down/keep up the SFP, but it's not necessarily going to be the only "policy" that matters. If other policies (e.g. NC-SI) express preference to keep the interface up it will stay up, right? The LowPwrRequestSW is not directly controlled by low-power && IF_UP. What I had in mind was something along the lines of a bitmap of reasons which are allowed to keep the interface up, and for your use case the reason would be something like SFP_ALWAYS_ON, but other reasons (well defined) may also keep the ifc up. That's just to explain what I meant, if it's going to be clear to everyone that low-power != LowPwrRequestSW I'm fine either way.
On Tue, 10 Aug 2021 22:00:51 +0000 Keller, Jacob E wrote: > >> Jake do you know what the use cases for Intel are? Are they SFP, MAC, > >> or NC-SI related? > > > > I went through all the Intel drivers that implement these operations and > > I believe you are talking about these commits: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070 > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408 > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1 > > > > There isn't too much information about the motivation, but maybe it has > > something to do with multi-host controllers where you want to prevent > > one host from taking the physical link down for all the other hosts > > sharing it? I remember such issues with mlx5. > > > > Ok, I found some more information here. The primary motivation of the > changes in the i40e and ice drivers is from customer requests asking to > have the link go down when the port is administratively disabled. This > is because if the link is down then the switch on the other side will > see the port not having link and will stop trying to send traffic to it. > > As far as I can tell, the reason its a flag is because some users wanted > the behavior the other way. > > I'm not sure it's really related to the behavior here. > > For what it's worth, I'm in favor of containing things like this into > ethtool as well. I think the question was the inverse - why not always shut down the port if the interface is brought down?
On 8/10/2021 3:06 PM, Jakub Kicinski wrote: > On Tue, 10 Aug 2021 22:00:51 +0000 Keller, Jacob E wrote: >>>> Jake do you know what the use cases for Intel are? Are they SFP, MAC, >>>> or NC-SI related? >>> >>> I went through all the Intel drivers that implement these operations and >>> I believe you are talking about these commits: >>> >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070 >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408 >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1 >>> >>> There isn't too much information about the motivation, but maybe it has >>> something to do with multi-host controllers where you want to prevent >>> one host from taking the physical link down for all the other hosts >>> sharing it? I remember such issues with mlx5. >>> >> >> Ok, I found some more information here. The primary motivation of the >> changes in the i40e and ice drivers is from customer requests asking to >> have the link go down when the port is administratively disabled. This >> is because if the link is down then the switch on the other side will >> see the port not having link and will stop trying to send traffic to it. >> >> As far as I can tell, the reason its a flag is because some users wanted >> the behavior the other way. >> >> I'm not sure it's really related to the behavior here. >> >> For what it's worth, I'm in favor of containing things like this into >> ethtool as well. > > I think the question was the inverse - why not always shut down the > port if the interface is brought down? > That... is a better question yes. Unfortunately so far I haven't found any argument for not doing this. Only a bit about many requests to have this behavior. It might just be inertia to maintain current behavior by default...
On 8/10/2021 3:06 PM, Jakub Kicinski wrote: > On Tue, 10 Aug 2021 22:00:51 +0000 Keller, Jacob E wrote: >>>> Jake do you know what the use cases for Intel are? Are they SFP, MAC, >>>> or NC-SI related? >>> >>> I went through all the Intel drivers that implement these operations and >>> I believe you are talking about these commits: >>> >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070 >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408 >>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1 >>> >>> There isn't too much information about the motivation, but maybe it has >>> something to do with multi-host controllers where you want to prevent >>> one host from taking the physical link down for all the other hosts >>> sharing it? I remember such issues with mlx5. >>> >> >> Ok, I found some more information here. The primary motivation of the >> changes in the i40e and ice drivers is from customer requests asking to >> have the link go down when the port is administratively disabled. This >> is because if the link is down then the switch on the other side will >> see the port not having link and will stop trying to send traffic to it. >> >> As far as I can tell, the reason its a flag is because some users wanted >> the behavior the other way. >> >> I'm not sure it's really related to the behavior here. >> >> For what it's worth, I'm in favor of containing things like this into >> ethtool as well. > > I think the question was the inverse - why not always shut down the > port if the interface is brought down? > So far the best I've found after digging is that forcing total shutdown causes manageability and VF traffic to stop. Other customers want this traffic to continue even when the PF port is brought down. Thanks, Jake
On Tue, Aug 10, 2021 at 03:06:36PM -0700, Jakub Kicinski wrote: > On Tue, 10 Aug 2021 22:00:51 +0000 Keller, Jacob E wrote: > > >> Jake do you know what the use cases for Intel are? Are they SFP, MAC, > > >> or NC-SI related? > > > > > > I went through all the Intel drivers that implement these operations and > > > I believe you are talking about these commits: > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070 > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408 > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1 > > > > > > There isn't too much information about the motivation, but maybe it has > > > something to do with multi-host controllers where you want to prevent > > > one host from taking the physical link down for all the other hosts > > > sharing it? I remember such issues with mlx5. > > > > > > > Ok, I found some more information here. The primary motivation of the > > changes in the i40e and ice drivers is from customer requests asking to > > have the link go down when the port is administratively disabled. This > > is because if the link is down then the switch on the other side will > > see the port not having link and will stop trying to send traffic to it. > > > > As far as I can tell, the reason its a flag is because some users wanted > > the behavior the other way. > > > > I'm not sure it's really related to the behavior here. > > I think the question was the inverse - why not always shut down the > port if the interface is brought down? Humm. Something does not seem right here. I would assume that when you administratively configure the link down, the SERDES in the MAC would stop sending anything. So the module has nothing to send. The link peer SERDES then looses sync, and reports that upwards as carrier lost. Does the i40e and ice leave its SERDES running when the link is configured down? Or is the switch FUBAR and does not consider SERDES loss of sync as carrier down? Ido's use case does seem to be different. The link is down. Do we want to leave the module active, probably sending a bit stream of all 0, maybe noise, or do we want to power the module down, so it does not send anything at all. Andrew
> The patch is well defined but it doesn't provide user with the answer > to the question "why is the SFP still up if I asked it to be down?" Can the user even see that the SFP is still up? Does ip link show give: 3: eth42: <BROADCAST,MULTICAST,LOWER_UP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 i.e. LOWER_UP despite DOWN? Roopa added: rtnetlink: add support for protodown reason Maybe we need the opposite as well? Andrew
On 8/10/2021 3:31 PM, Andrew Lunn wrote: > On Tue, Aug 10, 2021 at 03:06:36PM -0700, Jakub Kicinski wrote: >> On Tue, 10 Aug 2021 22:00:51 +0000 Keller, Jacob E wrote: >>>>> Jake do you know what the use cases for Intel are? Are they SFP, MAC, >>>>> or NC-SI related? >>>> >>>> I went through all the Intel drivers that implement these operations and >>>> I believe you are talking about these commits: >>>> >>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3880bd159d431d06b687b0b5ab22e24e6ef0070 >>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5ec9e2ce41ac198de2ee18e0e529b7ebbc67408 >>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ab4ab73fc1ec6dec548fa36c5e383ef5faa7b4c1 >>>> >>>> There isn't too much information about the motivation, but maybe it has >>>> something to do with multi-host controllers where you want to prevent >>>> one host from taking the physical link down for all the other hosts >>>> sharing it? I remember such issues with mlx5. >>>> >>> >>> Ok, I found some more information here. The primary motivation of the >>> changes in the i40e and ice drivers is from customer requests asking to >>> have the link go down when the port is administratively disabled. This >>> is because if the link is down then the switch on the other side will >>> see the port not having link and will stop trying to send traffic to it. >>> >>> As far as I can tell, the reason its a flag is because some users wanted >>> the behavior the other way. >>> >>> I'm not sure it's really related to the behavior here. >> >> I think the question was the inverse - why not always shut down the >> port if the interface is brought down? > > Humm. Something does not seem right here. I would assume that when you > administratively configure the link down, the SERDES in the MAC would > stop sending anything. So the module has nothing to send. The link > peer SERDES then looses sync, and reports that upwards as carrier > lost. > Right.... > Does the i40e and ice leave its SERDES running when the link is > configured down? Or is the switch FUBAR and does not consider SERDES > loss of sync as carrier down? > It's not clear to me. I tried to test with the driver, and it looks like upstream doesn't yet have the link-down-on-close merged into net-next, so I grabbed our out-of-tree driver. Interestingly, both ip link show and ethtool do not report link as up when the device is closed (ip link set enp175f0s0 down)..... So... whatever difference link-down-on-close makes we're definitely not reporting things up. I don't have a setup to confirm anything else right now unfortunately, but I suspect something is wrong with the implementation of link-down-on-close (at the very least it seems like we should still be reporting LOWER_UP.... no?) Unless there's some other weirdness like with QSFP or other multi-port-single-cable setups? I even tried adding some VFs and I see that regardless of whether link-down-on-close is set, I can see link up on the VF.... Hmmmmm. > Ido's use case does seem to be different. The link is down. Do we want > to leave the module active, probably sending a bit stream of all 0, > maybe noise, or do we want to power the module down, so it does not > send anything at all. > > Andrew > Right, I think these two cases are very different.
On Tue, Aug 10, 2021 at 03:05:44PM -0700, Jakub Kicinski wrote: > On Tue, 10 Aug 2021 23:46:28 +0300 Ido Schimmel wrote: > > On Tue, Aug 10, 2021 at 06:59:54AM -0700, Jakub Kicinski wrote: > > > On Tue, 10 Aug 2021 15:52:20 +0200 Andrew Lunn wrote: > > > > O.K. Thanks for the better explanation. Some of this should go into > > > > the commit message. > > > > > > > > I suggest it gets a different name and semantics, to avoid > > > > confusion. I think we should consider this the default power mode for > > > > when the link is administratively down, rather than direct control > > > > over the modules power mode. The driver should transition the module > > > > to this setting on link down, be it high power or low power. That > > > > saves a lot of complexity, since i assume you currently need a udev > > > > script or something which sets it to low power mode on link down, > > > > where as you can avoid this be configuring the default and let the > > > > driver do it. > > > > > > Good point. And actually NICs have similar knobs, exposed via ethtool > > > priv flags today. Intel NICs for example. Maybe we should create a > > > "really power the port down policy" API? > > > > See below about Intel. I'm not sure it's the same thing... > > > > I'm against adding a vague "really power the port down policy" API. The > > API proposed in the patch is well-defined, its implementation is > > documented in standards, its implications are clear and we offer APIs > > that give user space full observability into its operation. > > > > A vague API means that it is going to be abused and user space will get > > different results over different implementations. After reading the > > *commit messages* about the private flags, I'm not sure what the flags > > really do, what is their true motivation, implications or how do I get > > observability into their operation. I'm not too hopeful about the user > > documentation. > > > > Also, like I mentioned in the cover letter, given the complexity of > > these modules and as they become more common, it is likely that we will > > need to extend the API to control more parameters and expose more > > diagnostic information. I would really like to keep it clean and > > contained in 'ETHTOOL_MSG_MODULE_*' messages and not spread it over > > different APIs. > > The patch is well defined but it doesn't provide user with the answer > to the question "why is the SFP still up if I asked it to be down?" But you didn't ask the module to be "down", you asked the MAC. See more below. > It's good to match specs closely but Linux may need to reconcile > multiple policies. > > IIUC if Intel decides to keep the SFP up for "other" reasons the > situation will look like this: Intel did not decide to keep the module "up", it decided to keep both the MAC and the module up. If one of them was down, the peer would notice it, but it didn't (according to Jake's reply). This is a very problematic behavior as you are telling your peer that everything is fine, but in practice you are dropping all of its packets. I hit the exact same issue with mlx5 a few years ago and when I investigated the reason for this behavior I was referred to multi-host systems where you don't want one host to take down the shared link for all the rest. See: https://www.mellanox.com/sites/default/files/doc-2020/sb-externally-connected-multi-host.pdf > > $ ethtool --show-module eth0 > Module parameters for eth0: > low-power true > > # ethtool -m eth0 > Module State : 0x03 (ModuleReady) > LowPwrAllowRequestHW : Off > LowPwrRequestSW : Off This output is wrong. In Intel's case "ip link show" will report the link as down (according to Jake's reply) despite the MAC being up. On the other hand, the output of "ethtool --show-module eth0" will show that the module is not in low power mode and it will be correct. > > > IOW the low-power mode is a way for user to express preference to > shut down/keep up the SFP, Yes, it controls the module, not the MAC. If you want to get a carrier, both the module and the MAC need to be operational. See following example where swp13 and swp14 are connected to each other: $ ip link show dev swp13 127: swp13: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000 link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff $ ip link show dev swp14 128: swp14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000 link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff # ip link set dev swp13 down # ethtool --set-module swp13 low-power on $ ethtool --show-module swp13 Module parameters for swp13: low-power true # ip link set dev swp13 up $ ip link show dev swp13 127: swp13: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000 link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff $ ip link show dev swp14 128: swp14: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000 link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff # ip link set dev swp13 down # ethtool --set-module swp13 low-power off $ ethtool --show-module swp13 Module parameters for swp13: low-power false # ip link set dev swp13 up $ ip link show dev swp13 127: swp13: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000 link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff $ ip link show dev swp14 128: swp14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000 link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff > but it's not necessarily going to be the only "policy" that matters. > If other policies (e.g. NC-SI) express preference to keep the > interface up it will stay up, right? > > The LowPwrRequestSW is not directly controlled by low-power && IF_UP. > > What I had in mind was something along the lines of a bitmap of reasons > which are allowed to keep the interface up, and for your use case > the reason would be something like SFP_ALWAYS_ON, but other reasons > (well defined) may also keep the ifc up. > > That's just to explain what I meant, if it's going to be clear to > everyone that low-power != LowPwrRequestSW I'm fine either way. I think we mixed two different use cases here. The first is a way to make sure the link is fully operational (both MAC and module). In this case, contrary to the expected behavior, "ip link set dev eth0 down" will not take the MAC down and the peer will not lose its carrier. This is most likely motivated by special hardware designs or exotic use cases like I mentioned above. The use case for this patch is completely different. Today, when you do "ip link set dev eth0 up" you expect to gain a carrier which means that both the MAC and the module are operational. The latter can be made operational following the user request (e.g., SFP driver) or as soon as it was plugged-in, by the device's firmware. When you do "ip link set dev eth0 down" you expect the reverse to happen and this is what happens today. In implementations where the module is always operational, it stays in high power mode and continues to get warm and consume unnecessary amount of power. Some users might not be OK with that and would like more control, which is exactly what this patch is doing. This patch does not change existing behavior, the API has clear semantics and implications and user space has full observability into its operation. If in the future someone sees the need to add 'protoup', it can be done: # ip link set dev eth0 up # MAC and module are operational # ip link set dev eth0 protoup on # ip link set dev eth0 down # returns an error / ignore # ethtool --set-module eth0 low-power on # returns an error because of admin up You can obviously engineer situations that do not make any sense. Like this: # ethtool --set-module eth0 low-power on # ip link set dev eth0 up # ip link set dev eth0 protoup on The MAC is operational and you can't take it down, but you will never get a carrier because you explicitly asked the module to stay in low power mode.
On Wed, 11 Aug 2021 14:33:06 +0300 Ido Schimmel wrote: > # ethtool --set-module swp13 low-power on > > $ ethtool --show-module swp13 > Module parameters for swp13: > low-power true > > # ip link set dev swp13 up > > $ ip link show dev swp13 > 127: swp13: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000 > link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff > > $ ip link show dev swp14 > 128: swp14: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000 > link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff Oh, so if we set low-power true the carrier will never show up? I thought Andrew suggested the setting is only taken into account when netdev is down. That made so much sense to me I assumed we'll just go with that. I must have misunderstood.
On Wed, Aug 11, 2021 at 06:03:43AM -0700, Jakub Kicinski wrote: > On Wed, 11 Aug 2021 14:33:06 +0300 Ido Schimmel wrote: > > # ethtool --set-module swp13 low-power on > > > > $ ethtool --show-module swp13 > > Module parameters for swp13: > > low-power true > > > > # ip link set dev swp13 up > > > > $ ip link show dev swp13 > > 127: swp13: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000 > > link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff > > > > $ ip link show dev swp14 > > 128: swp14: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000 > > link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff > > Oh, so if we set low-power true the carrier will never show up? > I thought Andrew suggested the setting is only taken into account > when netdev is down. Yes, that was my intention. If this low power mode also applies when the interface is admin up, it sounds like a foot gun. ip link show gives you no idea why the carrier is down, and people will assume the cable or peer is broken. We at least need a new flag, LOWER_DISABLED or similar to give the poor user some chance to figure out what is going on. To me, this setting should only apply when the link is admin down. Andrew
On Wed, Aug 11, 2021 at 04:36:13PM +0200, Andrew Lunn wrote: > On Wed, Aug 11, 2021 at 06:03:43AM -0700, Jakub Kicinski wrote: > > On Wed, 11 Aug 2021 14:33:06 +0300 Ido Schimmel wrote: > > > # ethtool --set-module swp13 low-power on > > > > > > $ ethtool --show-module swp13 > > > Module parameters for swp13: > > > low-power true > > > > > > # ip link set dev swp13 up > > > > > > $ ip link show dev swp13 > > > 127: swp13: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000 > > > link/ether 1c:34:da:18:55:49 brd ff:ff:ff:ff:ff:ff > > > > > > $ ip link show dev swp14 > > > 128: swp14: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000 > > > link/ether 1c:34:da:18:55:4d brd ff:ff:ff:ff:ff:ff > > > > Oh, so if we set low-power true the carrier will never show up? > > I thought Andrew suggested the setting is only taken into account > > when netdev is down. > > Yes, that was my intention. If this low power mode also applies when > the interface is admin up, it sounds like a foot gun. ip link show > gives you no idea why the carrier is down, and people will assume the > cable or peer is broken. We at least need a new flag, LOWER_DISABLED > or similar to give the poor user some chance to figure out what is > going on. The canonical way to report such errors is via LINKSTATE_GET and I will add an extended sub-state describing the problem. > > To me, this setting should only apply when the link is admin down. I don't want to bake such an assumption into the kernel, but I have a suggestion that resolves the issue. We currently have a single attribute that encodes the desired state on SET messages and the operational state on GET_REPLY messages (ETHTOOL_A_MODULE_LOW_POWER_ENABLED): $ ethtool --show-module swp11 Module parameters for swp11: low-power true It is not defined very well when a module is not connected despite being a very interesting use case. We really need to have two attributes. The first one describing the power mode policy and the second one describing the operational power mode which is only reported when a module is plugged in. For the policy we can have these values: 1. low: Always transition the module to low power mode 2. high: Always transition the module to high power mode 3. high-on-up: Transition the module to high power mode when a port using it is administratively up. Otherwise, low A different policy for up/down seems like an overkill for me. See example usage below. No module connected: $ ethtool --show-module swp11 Module parameters for swp11: power-mode-policy high Like I mentioned before, this is the default on Mellanox systems so this new attribute allows user space to query the default policy. Change to a different policy: # ethtool --set-module swp11 power-mode-policy high-on-up $ ethtool --show-module swp11 Module parameters for swp11: power-mode-policy high-on-up After a module was connected: $ ethtool --show-module swp11 Module parameters for swp11: power-mode-policy high-on-up power-mode low # ip link set dev swp11 up $ ethtool --show-module swp11 Module parameters for swp11: power-mode-policy high-on-up low-power high # ip link set dev swp11 down # ethtool --set-module swp11 power-mode-policy low # ip link set dev swp11 up $ ethtool swp11 ... Link detected: no (Cable issue, Module is in low power mode) I'm quite happy with the above. Might change a few things as I implement it, but you get the gist. WDYT?
On Wed, 11 Aug 2021 22:37:53 +0300 Ido Schimmel wrote: > On Wed, Aug 11, 2021 at 04:36:13PM +0200, Andrew Lunn wrote: > > On Wed, Aug 11, 2021 at 06:03:43AM -0700, Jakub Kicinski wrote: > > > Oh, so if we set low-power true the carrier will never show up? > > > I thought Andrew suggested the setting is only taken into account > > > when netdev is down. > > > > Yes, that was my intention. If this low power mode also applies when > > the interface is admin up, it sounds like a foot gun. ip link show > > gives you no idea why the carrier is down, and people will assume the > > cable or peer is broken. We at least need a new flag, LOWER_DISABLED > > or similar to give the poor user some chance to figure out what is > > going on. > > The canonical way to report such errors is via LINKSTATE_GET and I will > add an extended sub-state describing the problem. > > > To me, this setting should only apply when the link is admin down. > > I don't want to bake such an assumption into the kernel, but I have a > suggestion that resolves the issue. > > We currently have a single attribute that encodes the desired state on > SET messages and the operational state on GET_REPLY messages > (ETHTOOL_A_MODULE_LOW_POWER_ENABLED): > > $ ethtool --show-module swp11 > Module parameters for swp11: > low-power true > > It is not defined very well when a module is not connected despite being > a very interesting use case. We really need to have two attributes. The > first one describing the power mode policy and the second one describing > the operational power mode which is only reported when a module is > plugged in. > > For the policy we can have these values: > > 1. low: Always transition the module to low power mode > 2. high: Always transition the module to high power mode > 3. high-on-up: Transition the module to high power mode when a port > using it is administratively up. Otherwise, low > > A different policy for up/down seems like an overkill for me. > > See example usage below. > > No module connected: > > $ ethtool --show-module swp11 > Module parameters for swp11: > power-mode-policy high > > Like I mentioned before, this is the default on Mellanox systems so this > new attribute allows user space to query the default policy. > > Change to a different policy: > > # ethtool --set-module swp11 power-mode-policy high-on-up > > $ ethtool --show-module swp11 > Module parameters for swp11: > power-mode-policy high-on-up > > After a module was connected: > > $ ethtool --show-module swp11 > Module parameters for swp11: > power-mode-policy high-on-up > power-mode low > > # ip link set dev swp11 up > > $ ethtool --show-module swp11 > Module parameters for swp11: > power-mode-policy high-on-up > low-power high > > # ip link set dev swp11 down > > # ethtool --set-module swp11 power-mode-policy low > > # ip link set dev swp11 up > > $ ethtool swp11 > ... > Link detected: no (Cable issue, Module is in low power mode) > > I'm quite happy with the above. Might change a few things as I implement > it, but you get the gist. WDYT? Isn't the "low-power" attr just duplicating the relevant bits from -m?
> For the policy we can have these values: > > 1. low: Always transition the module to low power mode > 2. high: Always transition the module to high power mode > 3. high-on-up: Transition the module to high power mode when a port > using it is administratively up. Otherwise, low > > A different policy for up/down seems like an overkill for me. O.K. The current kernel SFP driver is going to default to high-on-up, which is what kernel driven copper PHYs also do. > After a module was connected: > > $ ethtool --show-module swp11 > Module parameters for swp11: > power-mode-policy high-on-up > power-mode low > > # ip link set dev swp11 up > > $ ethtool --show-module swp11 > Module parameters for swp11: > power-mode-policy high-on-up > low-power high > > # ip link set dev swp11 down You missed a show here. I expect it to be: > $ ethtool --show-module swp11 > Module parameters for swp11: > power-mode-policy high-on-up > power-mode low since it is now down. I suppose we should consider the bigger picture. Is this feature limited to just SFP modules, or does it make sense to any other sort of network technology? CAN, bluetooth, 5G, IPoAC? Andrew
> Isn't the "low-power" attr just duplicating the relevant bits from -m?
Do all SFPs report it in the dump? I'm thinking GPON, 1G modules with
a TX_ENABLE pin? INF-8074 does not specify a bit in the 'EEPROM' data
to indicate the status. So you need to know the state of the GPIO
driving the TX_ENABLE pin.
Andrew
On Wed, Aug 11, 2021 at 01:30:06PM -0700, Jakub Kicinski wrote:
> Isn't the "low-power" attr just duplicating the relevant bits from -m?
If low power is forced by hardware, then it depends on the assertion /
de-assertion of the hardware signal which is obviously not part of the
EEPROM dump.
I knew this would come up, so I mentioned it in the commit message:
"The low power mode can be queried from the kernel. In case
LowPwrAllowRequestHW was on, the kernel would need to take into account
the state of the LowPwrRequestHW signal, which is not visible to user
space."
From: Ido Schimmel <idosch@nvidia.com> This patchset extends the ethtool netlink API to allow user space to control transceiver modules. Two specific APIs are added, but the plan is to extend the interface with more APIs in the future (see "Future plans"). This submission is a complete rework of a previous submission [1] that tried to achieve the same goal by allowing user space to write to the EEPROMs of these modules. It was rejected as it could have enabled user space binary blob drivers. However, the main issue is that by directly writing to some pages of these EEPROMs, we are interfering with the entity that is controlling the modules (kernel / device firmware). In addition, some functionality cannot be implemented solely by writing to the EEPROM, as it requires the assertion / de-assertion of hardware signals (e.g., "ResetL" pin in SFF-8636). Motivation ========== The kernel can currently dump the contents of module EEPROMs to user space via the ethtool legacy ioctl API or the new netlink API. These dumps can then be parsed by ethtool(8) according to the specification that defines the memory map of the EEPROM. For example, SFF-8636 [2] for QSFP and CMIS [3] for QSFP-DD. In addition to read-only elements, these specifications also define writeable elements that can be used to control the behavior of the module. For example, controlling whether the module is put in low or high power mode to limit its power consumption. The CMIS specification even defines a message exchange mechanism (CDB, Command Data Block) on top of the module's memory map. This allows the host to send various commands to the module. For example, to update its firmware. Implementation ============== The ethtool netlink API is extended with two new messages, 'ETHTOOL_MSG_MODULE_SET' and 'ETHTOOL_MSG_MODULE_GET', that allow user space to set and get transceiver module parameters. Specifically, the 'ETHTOOL_A_MODULE_LOW_POWER_ENABLED' attribute allows user space to control the low power mode of the module in order to limit its power consumption. See detailed description in patch #1. In addition, patch #2 adds the 'ETHTOOL_MSG_MODULE_RESET_ACT' message (with a corresponding notification) that allows user space to reset the module in order to get out of fault state. The user API is designed to be generic enough so that it could be used for modules with different memory maps (e.g., SFF-8636, CMIS). The only implementation of the device driver API in this series is for a MAC driver (mlxsw) where the module is controlled by the device's firmware, but it is designed to be generic enough so that it could also be used by implementations where the module is controlled by the kernel. Testing and introspection ========================= See detailed description in patches #1 and #2. Patchset overview ================= Patch #1 adds the initial infrastructure in ethtool along with the ability to control transceiver modules' low power mode. Patch #2 adds the ability to reset transceiver modules. Patches #3-#6 add required device registers in mlxsw. Patch #7 implements in mlxsw the ethtool operations added in patch #1. Patch #8 implements in mlxsw the ethtool operation added in patch #2. Future plans ============ * Extend 'ETHTOOL_MSG_MODULE_SET' to control Tx output among other attributes. * Add new ethtool message(s) to update firmware on transceiver modules. * Extend ethtool(8) to parse more diagnostic information from CMIS modules. No kernel changes required. [1] https://lore.kernel.org/netdev/20210623075925.2610908-1-idosch@idosch.org/ [2] https://members.snia.org/document/dl/26418 [3] http://www.qsfp-dd.com/wp-content/uploads/2021/05/CMIS5p0.pdf Ido Schimmel (8): ethtool: Add ability to control transceiver modules' low power mode ethtool: Add ability to reset transceiver modules mlxsw: reg: Add fields to PMAOS register mlxsw: Make PMAOS pack function more generic mlxsw: reg: Add Port Module Memory Map Properties register mlxsw: reg: Add Management Cable IO and Notifications register mlxsw: Add ability to control transceiver modules' low power mode mlxsw: Add ability to reset transceiver modules Documentation/networking/ethtool-netlink.rst | 83 +++++- .../net/ethernet/mellanox/mlxsw/core_env.c | 232 ++++++++++++++- .../net/ethernet/mellanox/mlxsw/core_env.h | 11 + drivers/net/ethernet/mellanox/mlxsw/minimal.c | 34 +++ drivers/net/ethernet/mellanox/mlxsw/reg.h | 140 +++++++++- .../mellanox/mlxsw/spectrum_ethtool.c | 68 +++++ include/linux/ethtool.h | 12 + include/uapi/linux/ethtool_netlink.h | 18 ++ net/ethtool/Makefile | 2 +- net/ethtool/module.c | 264 ++++++++++++++++++ net/ethtool/netlink.c | 26 ++ net/ethtool/netlink.h | 6 + 12 files changed, 887 insertions(+), 9 deletions(-) create mode 100644 net/ethtool/module.c