Message ID | 20250204143124.1252912-1-avri.altman@wdc.com |
---|---|
State | New |
Headers | show |
Series | [v2] scsi: ufs: critical health condition | expand |
On 2/4/25 6:31 AM, Avri Altman wrote: > +static ssize_t critical_health_show(struct device *dev, > + struct device_attribute *attr, char *buf) > +{ > + struct ufs_hba *hba = dev_get_drvdata(dev); > + > + return sysfs_emit(buf, "%d\n", hba->dev_info.critical_health); > +} Hi Avri, My understanding is that the UFS 4.1 standard supports reporting a critical health event repeatedly while this patch does not allow users to figure out whether a critical event has been reported once or repeatedly. Has it been considered to report the number of times a critical health event has been reported by a UFS device instead of only one bit of information? Thanks, Bart.
> > > On 2/4/25 6:31 AM, Avri Altman wrote: > > > +static ssize_t critical_health_show(struct device *dev, > > > + struct device_attribute *attr, > > > +char > > > +*buf) { > > > + struct ufs_hba *hba = dev_get_drvdata(dev); > > > + > > > + return sysfs_emit(buf, "%d\n", hba->dev_info.critical_health); > > > + } > > > > Hi Avri, > > > > My understanding is that the UFS 4.1 standard supports reporting a > > critical health event repeatedly while this patch does not allow users > > to figure out whether a critical event has been reported once or > > repeatedly. Has it been considered to report the number of times a > > critical health event has been reported by a UFS device instead of only one > bit of information? > Done. After some further internal discussions: The set conditions are vendor specific; The device may set it as many times it wants depending on its criticality. The spec does not define that nor what the host should do. So there is this concern that some vendors will report multiple times, while other wont. Hence, reading critical_health = 1 might be misleading. What do you think? Thanks, Avri > > Thanks, > Avri > > > > > Thanks, > > > > Bart.
On 2/6/25 12:54 AM, Avri Altman wrote: > After some further internal discussions: The set conditions are > vendor specific; The device may set it as many times it wants > depending on its criticality. The spec does not define that nor what > the host should do. So there is this concern that some vendors will > report multiple times, while other wont. Hence, reading > critical_health = 1 might be misleading. What do you think? How about emitting a uevent if a critical health condition has been reported by a UFS device? See also sdev_evt_send(). Thanks, Bart.
On 2/6/25 9:47 AM, Avri Altman wrote: >> On 2/6/25 12:54 AM, Avri Altman wrote: >>> After some further internal discussions: The set conditions are vendor >>> specific; The device may set it as many times it wants depending on >>> its criticality. The spec does not define that nor what the host >>> should do. So there is this concern that some vendors will report >>> multiple times, while other wont. Hence, reading critical_health = 1 >>> might be misleading. What do you think? > > Still not sure if you want this to be a counter? Since the event can be reported multiple times, a counter sounds better to me than a boolean. >> How about emitting a uevent if a critical health condition has been reported by a >> UFS device? See also sdev_evt_send(). > Thanks for pointing this out. > A ufs event in enum scsi_device_event seems misplaced - looks like it was invented for unit attention codes. > How about calling kobject_uevent() or kobject_uevent_env() directly from the driver? Please note that emitting a uevent is not the only possible approach for informing user space code. Emitting a uevent is recommended if the code that processes an event can be implemented as a shell script. Could it be more likely that critical health events will be processed by C or C++ code rather than a shell script? If so, how about making the sysfs attribute that reports the number of critical health events pollable? In C and C++ code, polling a sysfs attribute requires less code than listening for uevents. Thanks, Bart.
> On 2/6/25 9:47 AM, Avri Altman wrote: > >> On 2/6/25 12:54 AM, Avri Altman wrote: > >>> After some further internal discussions: The set conditions are > >>> vendor specific; The device may set it as many times it wants > >>> depending on its criticality. The spec does not define that nor what > >>> the host should do. So there is this concern that some vendors will > >>> report multiple times, while other wont. Hence, reading > >>> critical_health = 1 might be misleading. What do you think? > > > > Still not sure if you want this to be a counter? > > Since the event can be reported multiple times, a counter sounds better to me > than a boolean. Done. > >> How about emitting a uevent if a critical health condition has been > >> reported by a UFS device? See also sdev_evt_send(). > > Thanks for pointing this out. > > A ufs event in enum scsi_device_event seems misplaced - looks like it was > invented for unit attention codes. > > How about calling kobject_uevent() or kobject_uevent_env() directly from the > driver? > > Please note that emitting a uevent is not the only possible approach for informing > user space code. Emitting a uevent is recommended if the code that processes an > event can be implemented as a shell script. Could it be more likely that critical > health events will be processed by C or C++ code rather than a shell script? If so, > how about making the sysfs attribute that reports the number of critical health > events pollable? In C and C++ code, polling a sysfs attribute requires less code > than listening for uevents. Done. Thanks, Avri > > Thanks, > > Bart.
diff --git a/Documentation/ABI/testing/sysfs-driver-ufs b/Documentation/ABI/testing/sysfs-driver-ufs index 5fa6655aee84..a74ea70bcee4 100644 --- a/Documentation/ABI/testing/sysfs-driver-ufs +++ b/Documentation/ABI/testing/sysfs-driver-ufs @@ -1559,3 +1559,16 @@ Description: Symbol - HCMID. This file shows the UFSHCD manufacturer id. The Manufacturer ID is defined by JEDEC in JEDEC-JEP106. The file is read only. + +What: /sys/bus/platform/drivers/ufshcd/*/critical_health +What: /sys/bus/platform/devices/*.ufs/critical_health +Date: February 2025 +Contact: Avri Altman <avri.altman@wdc.com> +Description: indicate whether a critical health condition has been detected + (1) or not (0). further insight into the specific issue can be + gained by reading one of: bPreEOLInfo, bDeviceLifeTimeEstA, + bDeviceLifeTimeEstB, bWriteBoosterBufferLifeTimeEst, and + bRPMBLifeTimeEst. + + The file is read only. + diff --git a/drivers/ufs/core/ufs-sysfs.c b/drivers/ufs/core/ufs-sysfs.c index 3438269a5440..84de3e558a55 100644 --- a/drivers/ufs/core/ufs-sysfs.c +++ b/drivers/ufs/core/ufs-sysfs.c @@ -458,6 +458,14 @@ static ssize_t pm_qos_enable_store(struct device *dev, return count; } +static ssize_t critical_health_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct ufs_hba *hba = dev_get_drvdata(dev); + + return sysfs_emit(buf, "%d\n", hba->dev_info.critical_health); +} + static DEVICE_ATTR_RW(rpm_lvl); static DEVICE_ATTR_RO(rpm_target_dev_state); static DEVICE_ATTR_RO(rpm_target_link_state); @@ -470,6 +478,7 @@ static DEVICE_ATTR_RW(enable_wb_buf_flush); static DEVICE_ATTR_RW(wb_flush_threshold); static DEVICE_ATTR_RW(rtc_update_ms); static DEVICE_ATTR_RW(pm_qos_enable); +static DEVICE_ATTR_RO(critical_health); static struct attribute *ufs_sysfs_ufshcd_attrs[] = { &dev_attr_rpm_lvl.attr, @@ -484,6 +493,7 @@ static struct attribute *ufs_sysfs_ufshcd_attrs[] = { &dev_attr_wb_flush_threshold.attr, &dev_attr_rtc_update_ms.attr, &dev_attr_pm_qos_enable.attr, + &dev_attr_critical_health.attr, NULL }; diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c index cd404ade48dc..950878c0bb01 100644 --- a/drivers/ufs/core/ufshcd.c +++ b/drivers/ufs/core/ufshcd.c @@ -6216,6 +6216,9 @@ static void ufshcd_exception_event_handler(struct work_struct *work) if (status & hba->ee_drv_mask & MASK_EE_URGENT_TEMP) ufshcd_temp_exception_event_handler(hba, status); + if (status & hba->ee_drv_mask & MASK_EE_HEALTH_CRITICAL) + hba->dev_info.critical_health = true; + ufs_debugfs_exception_event(hba, status); } @@ -8308,6 +8311,9 @@ static int ufs_get_device_desc(struct ufs_hba *hba) ufshcd_temp_notif_probe(hba, desc_buf); + if (dev_info->wspecversion >= 0x410) + ufshcd_enable_ee(hba, MASK_EE_HEALTH_CRITICAL); + ufs_init_rtc(hba, desc_buf); /* diff --git a/include/ufs/ufs.h b/include/ufs/ufs.h index 89672ad8c3bb..d25083098f66 100644 --- a/include/ufs/ufs.h +++ b/include/ufs/ufs.h @@ -419,6 +419,7 @@ enum { MASK_EE_TOO_LOW_TEMP = BIT(4), MASK_EE_WRITEBOOSTER_EVENT = BIT(5), MASK_EE_PERFORMANCE_THROTTLING = BIT(6), + MASK_EE_HEALTH_CRITICAL = BIT(9), }; #define MASK_EE_URGENT_TEMP (MASK_EE_TOO_HIGH_TEMP | MASK_EE_TOO_LOW_TEMP) @@ -589,6 +590,9 @@ struct ufs_dev_info { u32 rtc_update_period; u8 rtt_cap; /* bDeviceRTTCap */ + + /* HEALTH_CRITICAL exception reported */ + bool critical_health; }; /*
Martin hi, The UFS4.1 standard, released on January 8, 2025, is adding several new features. Among them is a new exception event: HEALTH_CRITICAL, which notifies the host of a device's critical health condition. This notification implies that the device is approaching the end of its lifetime based on the amount of performed program/erase cycles. Once an EOL (End-of-Life) exception event is received, we toggle a `dev_info` designated member, which is exposed via a `sysfs` entry. This new `sysfs` entry, will indicate whether a critical health condition has been detected (1) or not (0). To handle this new `sysfs` entry, `udev` rules should be configured to monitor changes in the `critical_health` attribute. When the attribute value changes to 1, indicating a critical health condition, `udev` can trigger a user-space script or application to notify the user or take appropriate actions. The host can gain further insight into the specific issue by reading one of the following attributes: bPreEOLInfo, bDeviceLifeTimeEstA, bDeviceLifeTimeEstB, bWriteBoosterBufferLifeTimeEst, and bRPMBLifeTimeEst. All those are available for reading via the driver's sysfs entries or through an applicable utility. It is up to user-space to read these attributes if needed. Please consider this for the next merge window. Thanks, Avri Signed-off-by: Avri Altman <avri.altman@wdc.com> --- Changes in v2: - withdraw from using hw-monitor subsystem (Guenter) --- Documentation/ABI/testing/sysfs-driver-ufs | 13 +++++++++++++ drivers/ufs/core/ufs-sysfs.c | 10 ++++++++++ drivers/ufs/core/ufshcd.c | 6 ++++++ include/ufs/ufs.h | 4 ++++ 4 files changed, 33 insertions(+)