diff mbox series

[v2] scsi: ufs: critical health condition

Message ID 20250204143124.1252912-1-avri.altman@wdc.com
State New
Headers show
Series [v2] scsi: ufs: critical health condition | expand

Commit Message

Avri Altman Feb. 4, 2025, 2:31 p.m. UTC
Martin hi,

The UFS4.1 standard, released on January 8, 2025, is adding several new
features. Among them is a new exception event: HEALTH_CRITICAL, which
notifies the host of a device's critical health condition. This
notification implies that the device is approaching the end of its
lifetime based on the amount of performed program/erase cycles.

Once an EOL (End-of-Life) exception event is received, we toggle a
`dev_info` designated member, which is exposed via a `sysfs` entry.
This new `sysfs` entry, will indicate whether a critical health
condition has been detected (1) or not (0).

To handle this new `sysfs` entry, `udev` rules should be configured to
monitor changes in the `critical_health` attribute. When the attribute
value changes to 1, indicating a critical health condition, `udev` can
trigger a user-space script or application to notify the user or take
appropriate actions.

The host can gain further insight into the specific issue by reading one
of the following attributes: bPreEOLInfo, bDeviceLifeTimeEstA,
bDeviceLifeTimeEstB, bWriteBoosterBufferLifeTimeEst, and
bRPMBLifeTimeEst. All those are available for reading via the driver's
sysfs entries or through an applicable utility. It is up to user-space
to read these attributes if needed.

Please consider this for the next merge window.

Thanks,
Avri

Signed-off-by: Avri Altman <avri.altman@wdc.com>

---
Changes in v2:
 - withdraw from using hw-monitor subsystem (Guenter)
---
 Documentation/ABI/testing/sysfs-driver-ufs | 13 +++++++++++++
 drivers/ufs/core/ufs-sysfs.c               | 10 ++++++++++
 drivers/ufs/core/ufshcd.c                  |  6 ++++++
 include/ufs/ufs.h                          |  4 ++++
 4 files changed, 33 insertions(+)

Comments

Bart Van Assche Feb. 5, 2025, 6 p.m. UTC | #1
On 2/4/25 6:31 AM, Avri Altman wrote:
> +static ssize_t critical_health_show(struct device *dev,
> +				    struct device_attribute *attr, char *buf)
> +{
> +	struct ufs_hba *hba = dev_get_drvdata(dev);
> +
> +	return sysfs_emit(buf, "%d\n", hba->dev_info.critical_health);
> +}

Hi Avri,

My understanding is that the UFS 4.1 standard supports reporting a
critical health event repeatedly while this patch does not allow users
to figure out whether a critical event has been reported once or
repeatedly. Has it been considered to report the number of times a
critical health event has been reported by a UFS device instead of only
one bit of information?

Thanks,

Bart.
Avri Altman Feb. 6, 2025, 8:54 a.m. UTC | #2
> 
> > On 2/4/25 6:31 AM, Avri Altman wrote:
> > > +static ssize_t critical_health_show(struct device *dev,
> > > +                                 struct device_attribute *attr,
> > > +char
> > > +*buf) {
> > > +     struct ufs_hba *hba = dev_get_drvdata(dev);
> > > +
> > > +     return sysfs_emit(buf, "%d\n", hba->dev_info.critical_health);
> > > + }
> >
> > Hi Avri,
> >
> > My understanding is that the UFS 4.1 standard supports reporting a
> > critical health event repeatedly while this patch does not allow users
> > to figure out whether a critical event has been reported once or
> > repeatedly. Has it been considered to report the number of times a
> > critical health event has been reported by a UFS device instead of only one
> bit of information?
> Done.
After some further internal discussions:
The set conditions are vendor specific; The device may set it as many times it wants depending on its criticality.
The spec does not define that nor what the host should do.
So there is this concern that some vendors will report multiple times, while other wont.
Hence, reading critical_health = 1 might be misleading.
What do you think?

Thanks,
Avri


> 
> Thanks,
> Avri
> 
> >
> > Thanks,
> >
> > Bart.
Bart Van Assche Feb. 6, 2025, 3:54 p.m. UTC | #3
On 2/6/25 12:54 AM, Avri Altman wrote:
> After some further internal discussions: The set conditions are
> vendor specific; The device may set it as many times it wants
> depending on its criticality. The spec does not define that nor what
> the host should do. So there is this concern that some vendors will
> report multiple times, while other wont. Hence, reading
> critical_health = 1 might be misleading. What do you think?

How about emitting a uevent if a critical health condition has been
reported by a UFS device? See also sdev_evt_send().

Thanks,

Bart.
Bart Van Assche Feb. 6, 2025, 6:08 p.m. UTC | #4
On 2/6/25 9:47 AM, Avri Altman wrote:
>> On 2/6/25 12:54 AM, Avri Altman wrote:
>>> After some further internal discussions: The set conditions are vendor
>>> specific; The device may set it as many times it wants depending on
>>> its criticality. The spec does not define that nor what the host
>>> should do. So there is this concern that some vendors will report
>>> multiple times, while other wont. Hence, reading critical_health = 1
>>> might be misleading. What do you think?
 >
> Still not sure if you want this to be a counter?

Since the event can be reported multiple times, a counter sounds better
to me than a boolean.
>> How about emitting a uevent if a critical health condition has been reported by a
>> UFS device? See also sdev_evt_send().
> Thanks for pointing this out.
> A ufs event in enum scsi_device_event seems misplaced - looks like it was invented for unit attention codes.
> How about calling kobject_uevent() or  kobject_uevent_env() directly from the driver?

Please note that emitting a uevent is not the only possible approach for
informing user space code. Emitting a uevent is recommended if the code
that processes an event can be implemented as a shell script. Could it
be more likely that critical health events will be processed by C or C++
code rather than a shell script? If so, how about making the sysfs
attribute that reports the number of critical health events pollable? In
C and C++ code, polling a sysfs attribute requires less code than
listening for uevents.

Thanks,

Bart.
Avri Altman Feb. 6, 2025, 6:25 p.m. UTC | #5
> On 2/6/25 9:47 AM, Avri Altman wrote:
> >> On 2/6/25 12:54 AM, Avri Altman wrote:
> >>> After some further internal discussions: The set conditions are
> >>> vendor specific; The device may set it as many times it wants
> >>> depending on its criticality. The spec does not define that nor what
> >>> the host should do. So there is this concern that some vendors will
> >>> report multiple times, while other wont. Hence, reading
> >>> critical_health = 1 might be misleading. What do you think?
>  >
> > Still not sure if you want this to be a counter?
> 
> Since the event can be reported multiple times, a counter sounds better to me
> than a boolean.
Done.

> >> How about emitting a uevent if a critical health condition has been
> >> reported by a UFS device? See also sdev_evt_send().
> > Thanks for pointing this out.
> > A ufs event in enum scsi_device_event seems misplaced - looks like it was
> invented for unit attention codes.
> > How about calling kobject_uevent() or  kobject_uevent_env() directly from the
> driver?
> 
> Please note that emitting a uevent is not the only possible approach for informing
> user space code. Emitting a uevent is recommended if the code that processes an
> event can be implemented as a shell script. Could it be more likely that critical
> health events will be processed by C or C++ code rather than a shell script? If so,
> how about making the sysfs attribute that reports the number of critical health
> events pollable? In C and C++ code, polling a sysfs attribute requires less code
> than listening for uevents.
Done.

Thanks,
Avri

> 
> Thanks,
> 
> Bart.
diff mbox series

Patch

diff --git a/Documentation/ABI/testing/sysfs-driver-ufs b/Documentation/ABI/testing/sysfs-driver-ufs
index 5fa6655aee84..a74ea70bcee4 100644
--- a/Documentation/ABI/testing/sysfs-driver-ufs
+++ b/Documentation/ABI/testing/sysfs-driver-ufs
@@ -1559,3 +1559,16 @@  Description:
 		Symbol - HCMID. This file shows the UFSHCD manufacturer id.
 		The Manufacturer ID is defined by JEDEC in JEDEC-JEP106.
 		The file is read only.
+
+What:		/sys/bus/platform/drivers/ufshcd/*/critical_health
+What:		/sys/bus/platform/devices/*.ufs/critical_health
+Date:		February 2025
+Contact:	Avri Altman <avri.altman@wdc.com>
+Description:	indicate whether a critical health condition has been detected
+		(1) or not (0). further insight into the specific issue can be
+		gained by reading one of: bPreEOLInfo, bDeviceLifeTimeEstA,
+		bDeviceLifeTimeEstB, bWriteBoosterBufferLifeTimeEst, and
+		bRPMBLifeTimeEst.
+
+		The file is read only.
+
diff --git a/drivers/ufs/core/ufs-sysfs.c b/drivers/ufs/core/ufs-sysfs.c
index 3438269a5440..84de3e558a55 100644
--- a/drivers/ufs/core/ufs-sysfs.c
+++ b/drivers/ufs/core/ufs-sysfs.c
@@ -458,6 +458,14 @@  static ssize_t pm_qos_enable_store(struct device *dev,
 	return count;
 }
 
+static ssize_t critical_health_show(struct device *dev,
+				    struct device_attribute *attr, char *buf)
+{
+	struct ufs_hba *hba = dev_get_drvdata(dev);
+
+	return sysfs_emit(buf, "%d\n", hba->dev_info.critical_health);
+}
+
 static DEVICE_ATTR_RW(rpm_lvl);
 static DEVICE_ATTR_RO(rpm_target_dev_state);
 static DEVICE_ATTR_RO(rpm_target_link_state);
@@ -470,6 +478,7 @@  static DEVICE_ATTR_RW(enable_wb_buf_flush);
 static DEVICE_ATTR_RW(wb_flush_threshold);
 static DEVICE_ATTR_RW(rtc_update_ms);
 static DEVICE_ATTR_RW(pm_qos_enable);
+static DEVICE_ATTR_RO(critical_health);
 
 static struct attribute *ufs_sysfs_ufshcd_attrs[] = {
 	&dev_attr_rpm_lvl.attr,
@@ -484,6 +493,7 @@  static struct attribute *ufs_sysfs_ufshcd_attrs[] = {
 	&dev_attr_wb_flush_threshold.attr,
 	&dev_attr_rtc_update_ms.attr,
 	&dev_attr_pm_qos_enable.attr,
+	&dev_attr_critical_health.attr,
 	NULL
 };
 
diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index cd404ade48dc..950878c0bb01 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -6216,6 +6216,9 @@  static void ufshcd_exception_event_handler(struct work_struct *work)
 	if (status & hba->ee_drv_mask & MASK_EE_URGENT_TEMP)
 		ufshcd_temp_exception_event_handler(hba, status);
 
+	if (status & hba->ee_drv_mask & MASK_EE_HEALTH_CRITICAL)
+		hba->dev_info.critical_health = true;
+
 	ufs_debugfs_exception_event(hba, status);
 }
 
@@ -8308,6 +8311,9 @@  static int ufs_get_device_desc(struct ufs_hba *hba)
 
 	ufshcd_temp_notif_probe(hba, desc_buf);
 
+	if (dev_info->wspecversion >= 0x410)
+		ufshcd_enable_ee(hba, MASK_EE_HEALTH_CRITICAL);
+
 	ufs_init_rtc(hba, desc_buf);
 
 	/*
diff --git a/include/ufs/ufs.h b/include/ufs/ufs.h
index 89672ad8c3bb..d25083098f66 100644
--- a/include/ufs/ufs.h
+++ b/include/ufs/ufs.h
@@ -419,6 +419,7 @@  enum {
 	MASK_EE_TOO_LOW_TEMP		= BIT(4),
 	MASK_EE_WRITEBOOSTER_EVENT	= BIT(5),
 	MASK_EE_PERFORMANCE_THROTTLING	= BIT(6),
+	MASK_EE_HEALTH_CRITICAL		= BIT(9),
 };
 #define MASK_EE_URGENT_TEMP (MASK_EE_TOO_HIGH_TEMP | MASK_EE_TOO_LOW_TEMP)
 
@@ -589,6 +590,9 @@  struct ufs_dev_info {
 	u32 rtc_update_period;
 
 	u8 rtt_cap; /* bDeviceRTTCap */
+
+	/* HEALTH_CRITICAL exception reported */
+	bool critical_health;
 };
 
 /*