diff mbox series

[v6,10/10] nvme: Atomic write support

Message ID 20240326133813.3224593-11-john.g.garry@oracle.com
State New
Headers show
Series block atomic writes | expand

Commit Message

John Garry March 26, 2024, 1:38 p.m. UTC
From: Alan Adamson <alan.adamson@oracle.com>

Add support to set block layer request_queue atomic write limits. The
limits will be derived from either the namespace or controller atomic
parameters.

NVMe atomic-related parameters are grouped into "normal" and "power-fail"
(or PF) class of parameter. For atomic write support, only PF parameters
are of interest. The "normal" parameters are concerned with racing reads
and writes (which also applies to PF). See NVM Command Set Specification
Revision 1.0d section 2.1.4 for reference.

Whether to use per namespace or controller atomic parameters is decided by
NSFEAT bit 1 - see Figure 97: Identify – Identify Namespace Data
Structure, NVM Command Set.

NVMe namespaces may define an atomic boundary, whereby no atomic guarantees
are provided for a write which straddles this per-lba space boundary. The
block layer merging policy is such that no merges may occur in which the
resultant request would straddle such a boundary.

Unlike SCSI, NVMe specifies no granularity or alignment rules, apart from
atomic boundary rule. In addition, again unlike SCSI, there is no
dedicated atomic write command - a write which adheres to the atomic size
limit and boundary is implicitly atomic.

If NSFEAT bit 1 is set, the following parameters are of interest:
- NAWUPF (Namespace Atomic Write Unit Power Fail)
- NABSPF (Namespace Atomic Boundary Size Power Fail)
- NABO (Namespace Atomic Boundary Offset)

and we set request_queue limits as follows:
- atomic_write_unit_max = rounddown_pow_of_two(NAWUPF)
- atomic_write_max_bytes = NAWUPF
- atomic_write_boundary = NABSPF

If in the unlikely scenario that NABO is non-zero, then atomic writes will
not be supported at all as dealing with this adds extra complexity. This
policy may change in future.

In all cases, atomic_write_unit_min is set to the logical block size.

If NSFEAT bit 1 is unset, the following parameter is of interest:
- AWUPF (Atomic Write Unit Power Fail)

and we set request_queue limits as follows:
- atomic_write_unit_max = rounddown_pow_of_two(AWUPF)
- atomic_write_max_bytes = AWUPF
- atomic_write_boundary = 0

A new function, nvme_valid_atomic_write(), is also called from submission
path to verify that a request has been submitted to the driver will
actually be executed atomically. As mentioned, there is no dedicated NVMe
atomic write command (which may error for a command which exceeds the
controller atomic write limits).

Note on NABSPF:
There seems to be some vagueness in the spec as to whether NABSPF applies
for NSFEAT bit 1 being unset. Figure 97 does not explicitly mention NABSPF
and how it is affected by bit 1. However Figure 4 does tell to check Figure
97 for info about per-namespace parameters, which NABSPF is, so it is
implied. However currently nvme_update_disk_info() does check namespace
parameter NABO regardless of this bit.

Signed-off-by: Alan Adamson <alan.adamson@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
jpg: total rewrite
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 drivers/nvme/host/core.c | 49 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

Comments

Luis Chamberlain April 11, 2024, 12:29 a.m. UTC | #1
On Tue, Mar 26, 2024 at 01:38:13PM +0000, John Garry wrote:
> From: Alan Adamson <alan.adamson@oracle.com>
> 
> Add support to set block layer request_queue atomic write limits. The
> limits will be derived from either the namespace or controller atomic
> parameters.
> 
> NVMe atomic-related parameters are grouped into "normal" and "power-fail"
> (or PF) class of parameter. For atomic write support, only PF parameters
> are of interest. The "normal" parameters are concerned with racing reads
> and writes (which also applies to PF). See NVM Command Set Specification
> Revision 1.0d section 2.1.4 for reference.
> 
> Whether to use per namespace or controller atomic parameters is decided by
> NSFEAT bit 1 - see Figure 97: Identify – Identify Namespace Data
> Structure, NVM Command Set.
> 
> NVMe namespaces may define an atomic boundary, whereby no atomic guarantees
> are provided for a write which straddles this per-lba space boundary. The
> block layer merging policy is such that no merges may occur in which the
> resultant request would straddle such a boundary.
> 
> Unlike SCSI, NVMe specifies no granularity or alignment rules, apart from
> atomic boundary rule.

Larger IU drives a larger alignment *preference*, and it can be multiples
of the LBA format, it's called Namespace Preferred Write Granularity (NPWG)
and the NVMe driver already parses it. So say you have a 4k LBA format
but a 16k NPWG. I suspect this means we'd want atomics writes to align to 16k
but I can let Dan confirm.

> Note on NABSPF:
> There seems to be some vagueness in the spec as to whether NABSPF applies
> for NSFEAT bit 1 being unset. Figure 97 does not explicitly mention NABSPF
> and how it is affected by bit 1. However Figure 4 does tell to check Figure
> 97 for info about per-namespace parameters, which NABSPF is, so it is
> implied. However currently nvme_update_disk_info() does check namespace
> parameter NABO regardless of this bit.

Yeah that its quirky.

Also today we set the physical block size to min(npwg, atomic) and that
means for a today's average 4k IU drive if they get 16k atomic the
physical block size would still be 4k. As the physical block size in
practice can also lift the sector size filesystems used it would seem
odd only a larger npwg could lift it. So we may want to revisit this
eventually, specially if we have an API to do atomics properly across the
block layer.

  Luis
John Garry April 11, 2024, 8:59 a.m. UTC | #2
On 11/04/2024 01:29, Luis Chamberlain wrote:
> On Tue, Mar 26, 2024 at 01:38:13PM +0000, John Garry wrote:
>> From: Alan Adamson <alan.adamson@oracle.com>
>>
>> Add support to set block layer request_queue atomic write limits. The
>> limits will be derived from either the namespace or controller atomic
>> parameters.
>>
>> NVMe atomic-related parameters are grouped into "normal" and "power-fail"
>> (or PF) class of parameter. For atomic write support, only PF parameters
>> are of interest. The "normal" parameters are concerned with racing reads
>> and writes (which also applies to PF). See NVM Command Set Specification
>> Revision 1.0d section 2.1.4 for reference.
>>
>> Whether to use per namespace or controller atomic parameters is decided by
>> NSFEAT bit 1 - see Figure 97: Identify – Identify Namespace Data
>> Structure, NVM Command Set.
>>
>> NVMe namespaces may define an atomic boundary, whereby no atomic guarantees
>> are provided for a write which straddles this per-lba space boundary. The
>> block layer merging policy is such that no merges may occur in which the
>> resultant request would straddle such a boundary.
>>
>> Unlike SCSI, NVMe specifies no granularity or alignment rules, apart from
>> atomic boundary rule.
> 
> Larger IU drives a larger alignment *preference*, and it can be multiples
> of the LBA format, it's called Namespace Preferred Write Granularity (NPWG)
> and the NVMe driver already parses it. So say you have a 4k LBA format
> but a 16k NPWG. I suspect this means we'd want atomics writes to align to 16k
> but I can let Dan confirm.

If we need to be aligned to NPWG, then the min atomic write unit would 
also need to be NPWG. Any NPWG relation to atomic writes is not defined 
in the spec, AFAICS.

We simply use the LBA data size as the min atomic unit in this patch.

> 
>> Note on NABSPF:
>> There seems to be some vagueness in the spec as to whether NABSPF applies
>> for NSFEAT bit 1 being unset. Figure 97 does not explicitly mention NABSPF
>> and how it is affected by bit 1. However Figure 4 does tell to check Figure
>> 97 for info about per-namespace parameters, which NABSPF is, so it is
>> implied. However currently nvme_update_disk_info() does check namespace
>> parameter NABO regardless of this bit.
> 
> Yeah that its quirky.
> 
> Also today we set the physical block size to min(npwg, atomic) and that
> means for a today's average 4k IU drive if they get 16k atomic the
> physical block size would still be 4k. As the physical block size in
> practice can also lift the sector size filesystems used it would seem
> odd only a larger npwg could lift it.
It seems to me that if you want to provide atomic guarantees for this 
large "physical block size", then it needs to be based on (N)AWUPF and NPWG.

> So we may want to revisit this
> eventually, specially if we have an API to do atomics properly across the
> block layer.
>
Thanks,
John
Luis Chamberlain April 11, 2024, 4:22 p.m. UTC | #3
On Thu, Apr 11, 2024 at 09:59:57AM +0100, John Garry wrote:
> On 11/04/2024 01:29, Luis Chamberlain wrote:
> > On Tue, Mar 26, 2024 at 01:38:13PM +0000, John Garry wrote:
> > > From: Alan Adamson <alan.adamson@oracle.com>
> > > 
> > > Add support to set block layer request_queue atomic write limits. The
> > > limits will be derived from either the namespace or controller atomic
> > > parameters.
> > > 
> > > NVMe atomic-related parameters are grouped into "normal" and "power-fail"
> > > (or PF) class of parameter. For atomic write support, only PF parameters
> > > are of interest. The "normal" parameters are concerned with racing reads
> > > and writes (which also applies to PF). See NVM Command Set Specification
> > > Revision 1.0d section 2.1.4 for reference.
> > > 
> > > Whether to use per namespace or controller atomic parameters is decided by
> > > NSFEAT bit 1 - see Figure 97: Identify – Identify Namespace Data
> > > Structure, NVM Command Set.
> > > 
> > > NVMe namespaces may define an atomic boundary, whereby no atomic guarantees
> > > are provided for a write which straddles this per-lba space boundary. The
> > > block layer merging policy is such that no merges may occur in which the
> > > resultant request would straddle such a boundary.
> > > 
> > > Unlike SCSI, NVMe specifies no granularity or alignment rules, apart from
> > > atomic boundary rule.
> > 
> > Larger IU drives a larger alignment *preference*, and it can be multiples
> > of the LBA format, it's called Namespace Preferred Write Granularity (NPWG)
> > and the NVMe driver already parses it. So say you have a 4k LBA format
> > but a 16k NPWG. I suspect this means we'd want atomics writes to align to 16k
> > but I can let Dan confirm.
> 
> If we need to be aligned to NPWG, then the min atomic write unit would also
> need to be NPWG. Any NPWG relation to atomic writes is not defined in the
> spec, AFAICS.

NPWG is just a preference, not a requirement, so it is different than
logical block size. As far as I can tell we have no block topology
information to represent it. LBS will help users opt-in to align to
the NPWG, and a respective NAWUPF will ensure you can also atomically
write the respective sector size.

For atomics, NABSPF is what we want to use.

The above statement on the commit log just seems a bit misleading then.

> We simply use the LBA data size as the min atomic unit in this patch.

I thought NABSPF is used.

> > > Note on NABSPF:
> > > There seems to be some vagueness in the spec as to whether NABSPF applies
> > > for NSFEAT bit 1 being unset. Figure 97 does not explicitly mention NABSPF
> > > and how it is affected by bit 1. However Figure 4 does tell to check Figure
> > > 97 for info about per-namespace parameters, which NABSPF is, so it is
> > > implied. However currently nvme_update_disk_info() does check namespace
> > > parameter NABO regardless of this bit.
> > 
> > Yeah that its quirky.
> > 
> > Also today we set the physical block size to min(npwg, atomic) and that
> > means for a today's average 4k IU drive if they get 16k atomic the
> > physical block size would still be 4k. As the physical block size in
> > practice can also lift the sector size filesystems used it would seem
> > odd only a larger npwg could lift it.
> It seems to me that if you want to provide atomic guarantees for this large
> "physical block size", then it needs to be based on (N)AWUPF and NPWG.

For atomicity, I read it as needing to use NABSPF. Aligning to NPWG will just
help performance.

The NPWG comes from an internal mapping table constructed and kept on
DRAM on a drive in units of an IU size [0], and so not aligning to the
IU just causes having to work with entries in the able rather than just
one, and also incurs a read-modify-write. Contrary to the logical block
size, a write below NPWG but respecting the logical block size is allowed,
its just not optimal.

[0] https://kernelnewbies.org/KernelProjects/large-block-size#Indirection_Unit_size_increases

  Luis
Dan Helmick April 11, 2024, 11:32 p.m. UTC | #4
On Thu, April 11, 2024 10:23 AM, Luis Chaimberlain wrote:
> On Thu, Apr 11, 2024 at 09:59:57AM +0100, John Garry wrote:
> > On 11/04/2024 01:29, Luis Chamberlain wrote:
> > > On Tue, Mar 26, 2024 at 01:38:13PM +0000, John Garry wrote:
> > > > From: Alan Adamson <alan.adamson@oracle.com>
> > > >
> > > > Add support to set block layer request_queue atomic write limits.
> > > > The limits will be derived from either the namespace or controller
> > > > atomic parameters.
> > > >
> > > > NVMe atomic-related parameters are grouped into "normal" and "power-
> fail"
> > > > (or PF) class of parameter. For atomic write support, only PF
> > > > parameters are of interest. The "normal" parameters are concerned
> > > > with racing reads and writes (which also applies to PF). See NVM
> > > > Command Set Specification Revision 1.0d section 2.1.4 for reference.
> > > >
> > > > Whether to use per namespace or controller atomic parameters is
> > > > decided by NSFEAT bit 1 - see Figure 97: Identify – Identify
> > > > Namespace Data Structure, NVM Command Set.
> > > >
> > > > NVMe namespaces may define an atomic boundary, whereby no atomic
> > > > guarantees are provided for a write which straddles this per-lba
> > > > space boundary. The block layer merging policy is such that no
> > > > merges may occur in which the resultant request would straddle such a
> boundary.
> > > >
> > > > Unlike SCSI, NVMe specifies no granularity or alignment rules,
> > > > apart from atomic boundary rule.
> > >
> > > Larger IU drives a larger alignment *preference*, and it can be
> > > multiples of the LBA format, it's called Namespace Preferred Write
> > > Granularity (NPWG) and the NVMe driver already parses it. So say you
> > > have a 4k LBA format but a 16k NPWG. I suspect this means we'd want
> > > atomics writes to align to 16k but I can let Dan confirm.

Apologies for my delayed reply.  I confirm.  

FYI: I authored the first draft of the OPTPERF section and I tried also at a point to help clarify the atomics section.  After my 1st drafts on these sections, there was a fair amount of translations from English into Standards type language.  So, some drive specifics get removed to enable other medias and so-forth.  

NPWG is a preference.  It does not dictate the atomic behavior though.  So, don't go assuming you can rely on that behavior.  
 

> >
> > If we need to be aligned to NPWG, then the min atomic write unit would
> > also need to be NPWG. Any NPWG relation to atomic writes is not
> > defined in the spec, AFAICS.
> 
> NPWG is just a preference, not a requirement, so it is different than logical
> block size. As far as I can tell we have no block topology information to
> represent it. LBS will help users opt-in to align to the NPWG, and a respective
> NAWUPF will ensure you can also atomically write the respective sector size.
> 
> For atomics, NABSPF is what we want to use.
> 
> The above statement on the commit log just seems a bit misleading then.
> 
> > We simply use the LBA data size as the min atomic unit in this patch.
> 
> I thought NABSPF is used.

Yes, use NABSPF.  But most SSDs don't actually have boundaries.  This is more of a legacy SSD need.  

> 
> > > > Note on NABSPF:
> > > > There seems to be some vagueness in the spec as to whether NABSPF
> > > > applies for NSFEAT bit 1 being unset. Figure 97 does not
> > > > explicitly mention NABSPF and how it is affected by bit 1. However
> > > > Figure 4 does tell to check Figure
> > > > 97 for info about per-namespace parameters, which NABSPF is, so it
> > > > is implied. However currently nvme_update_disk_info() does check
> > > > namespace parameter NABO regardless of this bit.

NABO is a parameter that was carried forward, and it was already in the spec.  I didn't get an ability to impact that one with my changes.  

The story that was relayed to me says this parameter first existed in SATA and SCSI, and NVMe just pulled over an equivalent parameter even though the problem was resolved in the landscape NVMe SSDs ship into.  I was told that NABO was a parameter from Windows 95-ish days.  Something about the BIOS being written in 512B sectors with an ending that didn't align to 4KB.  But all the HDDs were trying to move over to 4KB for efficiencies of their ECCs.  So, there was this NABO parameter added to get the OS portion of the drive to be aligned nicely with the HDD's ECC.  

Anyways: add in the offset as queried from the drive even though it will most likely always be zero.

> > >
> > > Yeah that its quirky.
> > >
> > > Also today we set the physical block size to min(npwg, atomic) and
> > > that means for a today's average 4k IU drive if they get 16k atomic
> > > the physical block size would still be 4k. As the physical block
> > > size in practice can also lift the sector size filesystems used it
> > > would seem odd only a larger npwg could lift it.
> > It seems to me that if you want to provide atomic guarantees for this
> > large "physical block size", then it needs to be based on (N)AWUPF and NPWG.
> 
> For atomicity, I read it as needing to use NABSPF. Aligning to NPWG will just
> help performance.
> 
> The NPWG comes from an internal mapping table constructed and kept on
> DRAM on a drive in units of an IU size [0], and so not aligning to the IU just
> causes having to work with entries in the able rather than just one, and also
> incurs a read-modify-write. Contrary to the logical block size, a write below
> NPWG but respecting the logical block size is allowed, its just not optimal.
> 
> [0]
> https://urldefense.com/v3/__https://protect2.fireeye.com/v1/url?k=eae43295-
> b57ffcf1-eae5b9da-000babda0201-21ccc05e04b9be40&q=1&e=a20d17e2-
> 9e56-47e4-b5e0-
> 05494254a286&u=https*3A*2F*2Fkernelnewbies.org*2FKernelProjects*2Flarg
> e-block-size*23Indirection_Unit_size_increases__;JSUlJSUl!!EwVzqGoTKBqv-
> 0DWAJBm!Wkzd2Bo5MWgYPXDhfJYso2nocO0kAtKHvtYD1NT6p4QIkC846TclDRJ
> pPd683WDGeJSTbRBLKq5Fy-V9mBa8$
> 
>   Luis
diff mbox series

Patch

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 943d72bdd794..7d3247be5cb9 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -943,6 +943,30 @@  static inline blk_status_t nvme_setup_write_zeroes(struct nvme_ns *ns,
 	return BLK_STS_OK;
 }
 
+static bool nvme_valid_atomic_write(struct request *req)
+{
+	struct request_queue *q = req->q;
+	u32 boundary_bytes = queue_atomic_write_boundary_bytes(q);
+
+	if (blk_rq_bytes(req) > queue_atomic_write_unit_max_bytes(q))
+		return false;
+
+	if (boundary_bytes) {
+		u64 mask = boundary_bytes - 1, imask = ~mask;
+		u64 start = blk_rq_pos(req) << SECTOR_SHIFT;
+		u64 end = start + blk_rq_bytes(req) - 1;
+
+		/* If greater then must be crossing a boundary */
+		if (blk_rq_bytes(req) > boundary_bytes)
+			return false;
+
+		if ((start & imask) != (end & imask))
+			return false;
+	}
+
+	return true;
+}
+
 static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 		struct request *req, struct nvme_command *cmnd,
 		enum nvme_opcode op)
@@ -957,6 +981,12 @@  static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 
 	if (req->cmd_flags & REQ_RAHEAD)
 		dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
+	/*
+	 * Ensure that nothing has been sent which cannot be executed
+	 * atomically.
+	 */
+	if (req->cmd_flags & REQ_ATOMIC && !nvme_valid_atomic_write(req))
+		return BLK_STS_INVAL;
 
 	cmnd->rw.opcode = op;
 	cmnd->rw.flags = 0;
@@ -1937,6 +1967,23 @@  static void nvme_configure_metadata(struct nvme_ctrl *ctrl,
 	}
 }
 
+
+static void nvme_update_atomic_write_disk_info(struct nvme_ns *ns,
+			struct nvme_id_ns *id, struct queue_limits *lim,
+			u32 bs, u32 atomic_bs)
+{
+	unsigned int boundary = 0;
+
+	if (id->nsfeat & NVME_NS_FEAT_ATOMICS && id->nawupf) {
+		if (le16_to_cpu(id->nabspf))
+			boundary = (le16_to_cpu(id->nabspf) + 1) * bs;
+	}
+	lim->atomic_write_hw_max = atomic_bs;
+	lim->atomic_write_hw_boundary = boundary;
+	lim->atomic_write_hw_unit_min = bs;
+	lim->atomic_write_hw_unit_max = rounddown_pow_of_two(atomic_bs);
+}
+
 static u32 nvme_max_drv_segments(struct nvme_ctrl *ctrl)
 {
 	return ctrl->max_hw_sectors / (NVME_CTRL_PAGE_SIZE >> SECTOR_SHIFT) + 1;
@@ -1983,6 +2030,8 @@  static bool nvme_update_disk_info(struct nvme_ns *ns, struct nvme_id_ns *id,
 			atomic_bs = (1 + le16_to_cpu(id->nawupf)) * bs;
 		else
 			atomic_bs = (1 + ns->ctrl->subsys->awupf) * bs;
+
+		nvme_update_atomic_write_disk_info(ns, id, lim, bs, atomic_bs);
 	}
 
 	if (id->nsfeat & NVME_NS_FEAT_IO_OPT) {