[v2,01/21] docs/memory-barriers.txt: Rewrite "KERNEL I/O BARRIER EFFECTS" section

Message ID 20190405135936.7266-2-will.deacon@arm.com
State Accepted
Commit 4614bbdee35706ed77c130d23f12e21479b670ff
Series Remove Mysterious Macro Intended to Obscure Weird Behaviours (mmiowb())

Commit Message

Will Deacon April 5, 2019, 1:59 p.m. UTC
The "KERNEL I/O BARRIER EFFECTS" section of memory-barriers.txt is vague,
x86-centric, out-of-date, incomplete and demonstrably incorrect in places.
This is largely because I/O ordering is a horrible can of worms, but also
because the document has stagnated as our understanding has evolved.

Attempt to address some of that, by rewriting the section based on
recent(-ish) discussions with Arnd, BenH and others. Maybe one day we'll
find a way to formalise this stuff, but for now let's at least try to
make the English easier to understand.

Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Daniel Lustig <dlustig@nvidia.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Maciej W. Rozycki" <macro@linux-mips.org>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>

---
 Documentation/memory-barriers.txt | 115 +++++++++++++++++++++++---------------
 1 file changed, 70 insertions(+), 45 deletions(-)

-- 
2.11.0

Comments

Ingo Molnar April 10, 2019, 10:58 a.m. UTC | #1
Mostly minor grammar fixes:

* Will Deacon <will.deacon@arm.com> wrote:

> + (*) readX(), writeX():
>  
> +     The readX() and writeX() MMIO accessors take a pointer to the peripheral
> +     being accessed as an __iomem * parameter. For pointers mapped with the
> +     default I/O attributes (e.g. those returned by ioremap()), then the
> +     ordering guarantees are as follows:

s/then the
 /the

> +     1. All readX() and writeX() accesses to the same peripheral are ordered
> +        with respect to each other. For example, this ensures that MMIO register
> +	writes by the CPU to a particular device will arrive in program order.

Vertical alignment whitespace damage: some indentations are done via
spaces, one via tabs. Please standardize to tabs.

I'd also suggest:

s/For example, this ensures
 /For example this ensures

for the rest of the text too. The comma after the 'For example,'
introductory phrase is grammatically correct but stylistically confusing,
because in reality there's a *second* introductory phrase via "this
ensures".

>  
> +     2. A writeX() by the CPU to the peripheral will first wait for the
> +        completion of all prior CPU writes to memory. For example, this ensures
> +        that writes by the CPU to an outbound DMA buffer allocated by
> +        dma_alloc_coherent() will be visible to a DMA engine when the CPU writes
> +        to its MMIO control register to trigger the transfer.
>  
> +     3. A readX() by the CPU from the peripheral will complete before any
> +	subsequent CPU reads from memory can begin. For example, this ensures
> +	that reads by the CPU from an incoming DMA buffer allocated by
> +	dma_alloc_coherent() will not see stale data after reading from the DMA
> +	engine's MMIO status register to establish that the DMA transfer has
> +	completed.
>  
> +     4. A readX() by the CPU from the peripheral will complete before any
> +	subsequent delay() loop can begin execution. For example, this ensures
> +	that two MMIO register writes by the CPU to a peripheral will arrive at
> +	least 1us apart if the first write is immediately read back with readX()
> +	and udelay(1) is called prior to the second writeX().

This might be more readable via some short code sequence instead?

>  
> +     __iomem pointers obtained with non-default attributes (e.g. those returned
> +     by ioremap_wc()) are unlikely to provide many of these guarantees.

This part is a bit confusing I think, because it's so cryptic. "Unlikely"
as in probabilistic? ;-) So I think we should at least give some scope of
the exceptions and expected trouble, or at least direct people to those
APIs to see what the semantics are?

>  
> + (*) readX_relaxed(), writeX_relaxed():
>  
> +     These are similar to readX() and writeX(), but provide weaker memory
> +     ordering guarantees. Specifically, they do not guarantee ordering with
> +     respect to normal memory accesses or delay() loops (i.e bullets 2-4 above)
> +     but they are still guaranteed to be ordered with respect to other accesses
> +     to the same peripheral when operating on __iomem pointers mapped with the
> +     default I/O attributes.
>  
> + (*) readsX(), writesX():
>  
> +     The readsX() and writesX() MMIO accessors are designed for accessing
> +     register-based, memory-mapped FIFOs residing on peripherals that are not
> +     capable of performing DMA. Consequently, they provide only the ordering
> +     guarantees of readX_relaxed() and writeX_relaxed(), as documented above.

So is there any difference between 'X_relaxed' and 'sX' variants? What is
the 's' about?

Thanks,

	Ingo
Will Deacon April 10, 2019, 12:28 p.m. UTC | #2
Hi Ingo,

Thanks for taking a look (diff at the end).

On Wed, Apr 10, 2019 at 12:58:33PM +0200, Ingo Molnar wrote:
> * Will Deacon <will.deacon@arm.com> wrote:
> 
> > + (*) readX(), writeX():
> >  
> > +     The readX() and writeX() MMIO accessors take a pointer to the peripheral
> > +     being accessed as an __iomem * parameter. For pointers mapped with the
> > +     default I/O attributes (e.g. those returned by ioremap()), then the
> > +     ordering guarantees are as follows:
> 
> s/then the
>  /the

Fixed.

> > +     1. All readX() and writeX() accesses to the same peripheral are ordered
> > +        with respect to each other. For example, this ensures that MMIO register
> > +	writes by the CPU to a particular device will arrive in program order.
> 
> Vertical alignment whitespace damage: some indentations are done via
> spaces, one via tabs. Please standardize to tabs.

Fixed.

> I'd also suggest:
> 
> s/For example, this ensures
>  /For example this ensures

I'll just drop the "For example, " prefix altogether.

> > +     4. A readX() by the CPU from the peripheral will complete before any
> > +	subsequent delay() loop can begin execution. For example, this ensures
> > +	that two MMIO register writes by the CPU to a peripheral will arrive at
> > +	least 1us apart if the first write is immediately read back with readX()
> > +	and udelay(1) is called prior to the second writeX().
> 
> This might be more readable via some short code sequence instead?

Fixed.

> > +     __iomem pointers obtained with non-default attributes (e.g. those returned
> > +     by ioremap_wc()) are unlikely to provide many of these guarantees.
> 
> This part is a bit confusing I think, because it's so cryptic. "Unlikely"
> as in probabilistic? ;-) So I think we should at least give some scope of
> the exceptions and expected trouble, or at least direct people to those
> APIs to see what the semantics are?

Right, so I'm trying to tackle the common case of ioremap() in this patch.
There isn't an agreed portable semantics for ioremap_wc() yet, so I'm taking
a punt for now but it's something I'd like to get back to in the future.
I've tried to clean up the wording in the diff below to spell out that
this is currently arch-specific.
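
For reference, the kind of driver pattern that bullet 2 is aimed at looks
roughly like this (a sketch only; struct mydev, DEV_CTRL and CTRL_START are
invented names):

	/*
	 * Sketch: a descriptor write to a dma_alloc_coherent() buffer followed
	 * by a doorbell write through a default ioremap() mapping.
	 */
	struct mydev {
		void __iomem	*regs;	/* default I/O attributes, from ioremap() */
		u32		*ring;	/* coherent buffer from dma_alloc_coherent() */
		unsigned int	head;
	};

	#define DEV_CTRL	0x00	/* hypothetical doorbell register */
	#define CTRL_START	0x01

	static void mydev_kick(struct mydev *dev, u32 desc)
	{
		dev->ring[dev->head++] = desc;	/* plain store to coherent DMA memory */

		/*
		 * Bullet 2: writel() waits for the prior store to memory to
		 * complete, so no explicit wmb() is needed before the doorbell;
		 * writel_relaxed() would not give that ordering.
		 */
		writel(CTRL_START, dev->regs + DEV_CTRL);
	}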

> >  
> > + (*) readX_relaxed(), writeX_relaxed():
> >  
> > +     These are similar to readX() and writeX(), but provide weaker memory
> > +     ordering guarantees. Specifically, they do not guarantee ordering with
> > +     respect to normal memory accesses or delay() loops (i.e bullets 2-4 above)
> > +     but they are still guaranteed to be ordered with respect to other accesses
> > +     to the same peripheral when operating on __iomem pointers mapped with the
> > +     default I/O attributes.
> >  
> > + (*) readsX(), writesX():
> >  
> > +     The readsX() and writesX() MMIO accessors are designed for accessing
> > +     register-based, memory-mapped FIFOs residing on peripherals that are not
> > +     capable of performing DMA. Consequently, they provide only the ordering
> > +     guarantees of readX_relaxed() and writeX_relaxed(), as documented above.
> 
> So is there any difference between 'X_relaxed' and 'sX' variants? What is
> the 's' about?

From the ordering perspective, there isn't a difference, but they are very
different operations in the I/O access API so I think it's worth calling
them out separately. The 's' stands for "string", and these accessors
repeatedly access the same address, hence the comment about "memory-mapped
FIFOs".
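
To make that concrete, the kind of use they're intended for looks something
like this (sketch only; FIFO_DATA and the function name are made up):

	#define FIFO_DATA	0x10	/* hypothetical FIFO data register */

	static void mydev_drain_fifo(void __iomem *regs, u32 *buf, unsigned int words)
	{
		/* reads 'words' 32-bit values, all from the same register address */
		readsl(regs + FIFO_DATA, buf, words);
	}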

Anyway, please take a look at the diff below and let me know if you're
happy with it. The original commit is currently at the bottom of my mmiowb
tree, so I'll probably put this on top as an extra commit with your
Reported-by.

Cheers,

Will

--->8

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 3522f0cc772f..d0e332161a40 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -2517,80 +2517,88 @@ guarantees:
 
  (*) readX(), writeX():
 
-     The readX() and writeX() MMIO accessors take a pointer to the peripheral
-     being accessed as an __iomem * parameter. For pointers mapped with the
-     default I/O attributes (e.g. those returned by ioremap()), then the
-     ordering guarantees are as follows:
-
-     1. All readX() and writeX() accesses to the same peripheral are ordered
-        with respect to each other. For example, this ensures that MMIO register
-	writes by the CPU to a particular device will arrive in program order.
-
-     2. A writeX() by the CPU to the peripheral will first wait for the
-        completion of all prior CPU writes to memory. For example, this ensures
-        that writes by the CPU to an outbound DMA buffer allocated by
-        dma_alloc_coherent() will be visible to a DMA engine when the CPU writes
-        to its MMIO control register to trigger the transfer.
-
-     3. A readX() by the CPU from the peripheral will complete before any
-	subsequent CPU reads from memory can begin. For example, this ensures
-	that reads by the CPU from an incoming DMA buffer allocated by
-	dma_alloc_coherent() will not see stale data after reading from the DMA
-	engine's MMIO status register to establish that the DMA transfer has
-	completed.
-
-     4. A readX() by the CPU from the peripheral will complete before any
-	subsequent delay() loop can begin execution. For example, this ensures
-	that two MMIO register writes by the CPU to a peripheral will arrive at
-	least 1us apart if the first write is immediately read back with readX()
-	and udelay(1) is called prior to the second writeX().
-
-     __iomem pointers obtained with non-default attributes (e.g. those returned
-     by ioremap_wc()) are unlikely to provide many of these guarantees.
+	The readX() and writeX() MMIO accessors take a pointer to the
+	peripheral being accessed as an __iomem * parameter. For pointers
+	mapped with the default I/O attributes (e.g. those returned by
+	ioremap()), the ordering guarantees are as follows:
+
+	1. All readX() and writeX() accesses to the same peripheral are ordered
+	   with respect to each other. This ensures that MMIO register writes by
+	   the CPU to a particular device will arrive in program order.
+
+	2. A writeX() by the CPU to the peripheral will first wait for the
+	   completion of all prior CPU writes to memory. This ensures that
+	   writes by the CPU to an outbound DMA buffer allocated by
+	   dma_alloc_coherent() will be visible to a DMA engine when the CPU
+	   writes to its MMIO control register to trigger the transfer.
+
+	3. A readX() by the CPU from the peripheral will complete before any
+	   subsequent CPU reads from memory can begin. This ensures that reads
+	   by the CPU from an incoming DMA buffer allocated by
+	   dma_alloc_coherent() will not see stale data after reading from the
+	   DMA engine's MMIO status register to establish that the DMA transfer
+	   has completed.
+
+	4. A readX() by the CPU from the peripheral will complete before any
+	   subsequent delay() loop can begin execution. This ensures that two
+	   MMIO register writes by the CPU to a peripheral will arrive at least
+	   1us apart if the first write is immediately read back with readX()
+	   and udelay(1) is called prior to the second writeX():
+
+		writel(42, DEVICE_REGISTER_0); // Arrives at the device...
+		readl(DEVICE_REGISTER_0);
+		udelay(1);
+		writel(42, DEVICE_REGISTER_1); // ... at least 1us before this.
+
+	The ordering properties of __iomem pointers obtained with non-default
+	attributes (e.g. those returned by ioremap_wc()) are specific to the
+	underlying architecture and therefore the guarantees listed above cannot
+	generally be relied upon for these types of mappings.
 
  (*) readX_relaxed(), writeX_relaxed():
 
-     These are similar to readX() and writeX(), but provide weaker memory
-     ordering guarantees. Specifically, they do not guarantee ordering with
-     respect to normal memory accesses or delay() loops (i.e bullets 2-4 above)
-     but they are still guaranteed to be ordered with respect to other accesses
-     to the same peripheral when operating on __iomem pointers mapped with the
-     default I/O attributes.
+	These are similar to readX() and writeX(), but provide weaker memory
+	ordering guarantees. Specifically, they do not guarantee ordering with
+	respect to normal memory accesses or delay() loops (i.e. bullets 2-4
+	above) but they are still guaranteed to be ordered with respect to other
+	accesses to the same peripheral when operating on __iomem pointers
+	mapped with the default I/O attributes.
 
  (*) readsX(), writesX():
 
-     The readsX() and writesX() MMIO accessors are designed for accessing
-     register-based, memory-mapped FIFOs residing on peripherals that are not
-     capable of performing DMA. Consequently, they provide only the ordering
-     guarantees of readX_relaxed() and writeX_relaxed(), as documented above.
+	The readsX() and writesX() MMIO accessors are designed for accessing
+	register-based, memory-mapped FIFOs residing on peripherals that are not
+	capable of performing DMA. Consequently, they provide only the ordering
+	guarantees of readX_relaxed() and writeX_relaxed(), as documented above.
 
  (*) inX(), outX():
 
-     The inX() and outX() accessors are intended to access legacy port-mapped
-     I/O peripherals, which may require special instructions on some
-     architectures (notably x86). The port number of the peripheral being
-     accessed is passed as an argument.
+	The inX() and outX() accessors are intended to access legacy port-mapped
+	I/O peripherals, which may require special instructions on some
+	architectures (notably x86). The port number of the peripheral being
+	accessed is passed as an argument.
 
-     Since many CPU architectures ultimately access these peripherals via an
-     internal virtual memory mapping, the portable ordering guarantees provided
-     by inX() and outX() are the same as those provided by readX() and writeX()
-     respectively when accessing a mapping with the default I/O attributes.
+	Since many CPU architectures ultimately access these peripherals via an
+	internal virtual memory mapping, the portable ordering guarantees
+	provided by inX() and outX() are the same as those provided by readX()
+	and writeX() respectively when accessing a mapping with the default I/O
+	attributes.
 
-     Device drivers may expect outX() to emit a non-posted write transaction
-     that waits for a completion response from the I/O peripheral before
-     returning. This is not guaranteed by all architectures and is therefore
-     not part of the portable ordering semantics.
+	Device drivers may expect outX() to emit a non-posted write transaction
+	that waits for a completion response from the I/O peripheral before
+	returning. This is not guaranteed by all architectures and is therefore
+	not part of the portable ordering semantics.
 
  (*) insX(), outsX():
 
-     As above, the insX() and outsX() accessors provide the same ordering
-     guarantees as readsX() and writesX() respectively when accessing a mapping
-     with the default I/O attributes.
+	As above, the insX() and outsX() accessors provide the same ordering
+	guarantees as readsX() and writesX() respectively when accessing a
+	mapping with the default I/O attributes.
 
- (*) ioreadX(), iowriteX()
+ (*) ioreadX(), iowriteX():
 
-     These will perform appropriately for the type of access they're actually
-     doing, be it inX()/outX() or readX()/writeX().
+	These will perform appropriately for the type of access they're actually
+	doing, be it inX()/outX() or readX()/writeX().
 
 All of these accessors assume that the underlying peripheral is little-endian,
 and will therefore perform byte-swapping operations on big-endian architectures.
Ingo Molnar April 11, 2019, 11 a.m. UTC | #3
* Will Deacon <will.deacon@arm.com> wrote:

> Anyway, please take a look at the diff below and let me know if you're
> happy with it. The original commit is currently at the bottom of my mmiowb
> tree, so I'll probably put this on top as an extra commit with your
> Reported-by.

Sure - and the changes look good to me:

  Acked-by: Ingo Molnar <mingo@kernel.org>


Thanks,

	Ingo
Benjamin Herrenschmidt April 11, 2019, 10:12 p.m. UTC | #4
On Fri, 2019-04-05 at 14:59 +0100, Will Deacon wrote:
> +     1. All readX() and writeX() accesses to the same peripheral are ordered
> +        with respect to each other. For example, this ensures that MMIO register
> +       writes by the CPU to a particular device will arrive in program order.

Minor nit... I would have said "All readX() and writeX() accesses _from
the same CPU_ to the same peripheral... and then s/the CPU/this CPU.

> -     Accesses to this space may be fully synchronous (as on i386), but
> -     intermediary bridges (such as the PCI host bridge) may not fully honour
> -     that.
> +     2. A writeX() by the CPU to the peripheral will first wait for the
> +        completion of all prior CPU writes to memory. For example, this ensures
> +        that writes by the CPU to an outbound DMA buffer allocated by
> +        dma_alloc_coherent() will be visible to a DMA engine when the CPU writes
> +        to its MMIO control register to trigger the transfer.

Similarly "the CPU" -> "a CPU"
>  
> -     They are guaranteed to be fully ordered with respect to each other.
> +     3. A readX() by the CPU from the peripheral will complete before any
> +       subsequent CPU reads from memory can begin. For example, this ensures
> +       that reads by the CPU from an incoming DMA buffer allocated by
> +       dma_alloc_coherent() will not see stale data after reading from the DMA
> +       engine's MMIO status register to establish that the DMA transfer has
> +       completed.
>  
> -     They are not guaranteed to be fully ordered with respect to other types of
> -     memory and I/O operation.
> +     4. A readX() by the CPU from the peripheral will complete before any
> +       subsequent delay() loop can begin execution. For example, this ensures
> +       that two MMIO register writes by the CPU to a peripheral will arrive at
> +       least 1us apart if the first write is immediately read back with readX()
> +       and udelay(1) is called prior to the second writeX().
>  
> - (*) readX(), writeX():
> +     __iomem pointers obtained with non-default attributes (e.g. those returned
> +     by ioremap_wc()) are unlikely to provide many of these guarantees.

So we give up on defining _wc semantics ? :-) Fair enough, it's a
mess...

 .../...

> +All of these accessors assume that the underlying peripheral is little-endian,
> +and will therefore perform byte-swapping operations on big-endian architectures.

This is not true of readsX/writesX, which perform native accesses and are
intrinsically endian neutral.

> +Composing I/O ordering barriers with SMP ordering barriers and LOCK/UNLOCK
> +operations is a dangerous sport which may require the use of mmiowb(). See the
> +subsection "Acquires vs I/O accesses" for more information.

Cheers,
Ben.
Linus Torvalds April 11, 2019, 10:34 p.m. UTC | #5
On Thu, Apr 11, 2019 at 3:13 PM Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>
> Minor nit... I would have said "All readX() and writeX() accesses _from
> the same CPU_ to the same peripheral... and then s/the CPU/this CPU.

Maybe talk about "same thread" rather than "same cpu", with the
understanding that scheduling/preemption has to include the
appropriate cross-CPU IO barrier?

                   Linus
Benjamin Herrenschmidt April 12, 2019, 2:07 a.m. UTC | #6
On Thu, 2019-04-11 at 15:34 -0700, Linus Torvalds wrote:
> On Thu, Apr 11, 2019 at 3:13 PM Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> > 
> > Minor nit... I would have said "All readX() and writeX() accesses
> > _from
> > the same CPU_ to the same peripheral... and then s/the CPU/this
> > CPU.
> 
> Maybe talk about "same thread" rather than "same cpu", with the
> understanding that scheduling/preemption has to include the
> appropriate cross-CPU IO barrier?

Works for me, but why not spell all this out in the document ? We know,
but others might not.

Cheers,
Ben.
Will Deacon April 12, 2019, 1:17 p.m. UTC | #7
On Fri, Apr 12, 2019 at 12:07:09PM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2019-04-11 at 15:34 -0700, Linus Torvalds wrote:
> > On Thu, Apr 11, 2019 at 3:13 PM Benjamin Herrenschmidt
> > <benh@kernel.crashing.org> wrote:
> > > 
> > > Minor nit... I would have said "All readX() and writeX() accesses
> > > _from
> > > the same CPU_ to the same peripheral... and then s/the CPU/this
> > > CPU.
> > 
> > Maybe talk about "same thread" rather than "same cpu", with the
> > understanding that scheduling/preemption has to include the
> > appropriate cross-CPU IO barrier?
> 
> Works for me, but why not spell all this out in the document ? We know,
> but others might not.

Ok, how about the diff below on top of:

https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/mmiowb

?

I do plan to investigate ioremap_wc() and friends in the future, but it's
been painful enough just dealing with the common case! I'll almost certainly
need your help with that too.

Will

--->8

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 1660dde75e14..8ce298e09d54 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -2524,26 +2524,30 @@ guarantees:
 
 	1. All readX() and writeX() accesses to the same peripheral are ordered
 	   with respect to each other. This ensures that MMIO register writes by
-	   the CPU to a particular device will arrive in program order.
-
-	2. A writeX() by the CPU to the peripheral will first wait for the
-	   completion of all prior CPU writes to memory. This ensures that
-	   writes by the CPU to an outbound DMA buffer allocated by
-	   dma_alloc_coherent() will be visible to a DMA engine when the CPU
-	   writes to its MMIO control register to trigger the transfer.
-
-	3. A readX() by the CPU from the peripheral will complete before any
-	   subsequent CPU reads from memory can begin. This ensures that reads
-	   by the CPU from an incoming DMA buffer allocated by
-	   dma_alloc_coherent() will not see stale data after reading from the
-	   DMA engine's MMIO status register to establish that the DMA transfer
-	   has completed.
-
-	4. A readX() by the CPU from the peripheral will complete before any
-	   subsequent delay() loop can begin execution. This ensures that two
-	   MMIO register writes by the CPU to a peripheral will arrive at least
-	   1us apart if the first write is immediately read back with readX()
-	   and udelay(1) is called prior to the second writeX():
+	   the same CPU thread to a particular device will arrive in program
+	   order.
+
+	2. A writeX() by a CPU thread to the peripheral will first wait for the
+	   completion of all prior writes to memory either issued by the thread
+	   or issued while holding a spinlock that was subsequently taken by the
+	   thread. This ensures that writes by the CPU to an outbound DMA
+	   buffer allocated by dma_alloc_coherent() will be visible to a DMA
+	   engine when the CPU writes to its MMIO control register to trigger
+	   the transfer.
+
+	3. A readX() by a CPU thread from the peripheral will complete before
+	   any subsequent reads from memory by the same thread can begin. This
+	   ensures that reads by the CPU from an incoming DMA buffer allocated
+	   by dma_alloc_coherent() will not see stale data after reading from
+	   the DMA engine's MMIO status register to establish that the DMA
+	   transfer has completed.
+
+	4. A readX() by a CPU thread from the peripheral will complete before
+	   any subsequent delay() loop can begin execution on the same thread.
+	   This ensures that two MMIO register writes by the CPU to a peripheral
+	   will arrive at least 1us apart if the first write is immediately read
+	   back with readX() and udelay(1) is called prior to the second
+	   writeX():
 
 		writel(42, DEVICE_REGISTER_0); // Arrives at the device...
 		readl(DEVICE_REGISTER_0);
@@ -2600,8 +2604,10 @@ guarantees:
 	These will perform appropriately for the type of access they're actually
 	doing, be it inX()/outX() or readX()/writeX().
 
-All of these accessors assume that the underlying peripheral is little-endian,
-and will therefore perform byte-swapping operations on big-endian architectures.
+With the exception of the string accessors (insX(), outsX(), readsX() and
+writesX()), all of the above assume that the underlying peripheral is
+little-endian and will therefore perform byte-swapping operations on big-endian
+architectures.
 
 
 ========================================
Benjamin Herrenschmidt April 15, 2019, 4:05 a.m. UTC | #8
On Fri, 2019-04-12 at 14:17 +0100, Will Deacon wrote:
> 
> +	   the same CPU thread to a particular device will arrive in program
> +	   order.
> +
> +	2. A writeX() by a CPU thread to the peripheral will first wait for the
> +	   completion of all prior writes to memory either issued by the thread
> +	   or issued while holding a spinlock that was subsequently taken by the
> +	   thread. This ensures that writes by the CPU to an outbound DMA
> +	   buffer allocated by dma_alloc_coherent() will be visible to a DMA
> +	   engine when the CPU writes to its MMIO control register to trigger
> +	   the transfer.

Not particularly trying to be annoying here but I find the above
rather hard to parse :) I know what you're getting at but I'm not sure
somebody who doesn't will understand.

One way would be to instead prefix the whole thing with a blurb along
the lines of:

	readX() and writeX() provide some ordering guarantees versus
	each other and other memory accesses that are described below.
	Those guarantees apply to accesses performed either by the same
	logical thread of execution, or by different threads but while
	holding the same lock (spinlock or mutex).

Then have a simpler description of each case. No ?

> +	3. A readX() by a CPU thread from the peripheral will complete before
> +	   any subsequent reads from memory by the same thread can begin. This
> +	   ensures that reads by the CPU from an incoming DMA buffer allocated
> +	   by dma_alloc_coherent() will not see stale data after reading from
> +	   the DMA engine's MMIO status register to establish that the DMA
> +	   transfer has completed.
> +
> +	4. A readX() by a CPU thread from the peripheral will complete before
> +	   any subsequent delay() loop can begin execution on the same thread.
> +	   This ensures that two MMIO register writes by the CPU to a peripheral
> +	   will arrive at least 1us apart if the first write is immediately read
> +	   back with readX() and udelay(1) is called prior to the second
> +	   writeX():
>  
>  		writel(42, DEVICE_REGISTER_0); // Arrives at the device...
>  		readl(DEVICE_REGISTER_0);
> @@ -2600,8 +2604,10 @@ guarantees:
>  	These will perform appropriately for the type of access they're actually
>  	doing, be it inX()/outX() or readX()/writeX().
>  
> -All of these accessors assume that the underlying peripheral is little-endian,
> -and will therefore perform byte-swapping operations on big-endian architectures.
> +With the exception of the string accessors (insX(), outsX(), readsX() and
> +writesX()), all of the above assume that the underlying peripheral is
> +little-endian and will therefore perform byte-swapping operations on big-endian
> +architectures.
>  
>  
>  ========================================
Will Deacon April 16, 2019, 9:13 a.m. UTC | #9
Hi Ben,

On Mon, Apr 15, 2019 at 02:05:30PM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2019-04-12 at 14:17 +0100, Will Deacon wrote:
> > 
> > +	   the same CPU thread to a particular device will arrive in program
> > +	   order.
> > +
> > +	2. A writeX() by a CPU thread to the peripheral will first wait for the
> > +	   completion of all prior writes to memory either issued by the thread
> > +	   or issued while holding a spinlock that was subsequently taken by the
> > +	   thread. This ensures that writes by the CPU to an outbound DMA
> > +	   buffer allocated by dma_alloc_coherent() will be visible to a DMA
> > +	   engine when the CPU writes to its MMIO control register to trigger
> > +	   the transfer.
> 
> Not particularly trying to be annoying here but I find the above
> rather hard to parse :) I know what you're getting at but I'm not sure
> somebody who doesn't will understand.
> 
> One way would be to instead prefix the whole thing with a blurb along
> the lines of:
> 
> 	readX() and writeX() provide some ordering guarantees versus
> 	each other and other memory accesses that are described below.
> 	Those guarantees apply to accesses performed either by the same
> 	logical thread of execution, or by different threads but while
> 	holding the same lock (spinlock or mutex).
> 
> Then have a simpler description of each case. No ?

Argh, I think we've ended up confusing two different things in our edits:

  1. Ordering of readX()/writeX() between threads
  2. Ordering of memory accesses in one thread vs readX()/writeX() in another

and these are very different beasts.

For (1), with my mmiowb() patches we can provide some guarantees for
writeX() in conjunction with spinlocks. I'm not convinced we can provide
these same guarantees for combinations involving readX(). For example:

	CPU 1:
	val1 = readl(dev_base + REG1);
	flag = 1;
	spin_unlock(&dev_lock);

	CPU 2:
	spin_lock(&dev_lock);
	if (flag == 1)
		val2 = readl(dev_base + REG2);

In the case that CPU 2 sees the updated flag, do we require that CPU 1's readl()
reads from the device first? I'm not sure that RISC-V's implementation ensures
that readl() is ordered with a subsequent spin_unlock().
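
For writeX() in conjunction with spinlocks, the guarantee I have in mind is
roughly the following (see bullet 2 of the updated text below; REG3 is made
up):

	CPU 1:
	spin_lock(&dev_lock);
	writel(val1, dev_base + REG3);	// Arrives at the device...
	spin_unlock(&dev_lock);

	CPU 2:
	spin_lock(&dev_lock);		// ...before the write below, provided that
	writel(val2, dev_base + REG3);	// CPU 2 takes the lock after CPU 1 has
	spin_unlock(&dev_lock);		// released it.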

For (2), we would need to make this part of LKMM if we wanted to capture
the precise semantics here (e.g. by using the 'prop' relation to figure out
which writes are ordered by a writel). This is a pretty significant piece of
work, so perhaps just referring informally to propagation would be better for
the English text.

Updated diff below.

Will

--->8

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 1660dde75e14..bc4c6a76c53a 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -2524,26 +2524,36 @@ guarantees:
 
 	1. All readX() and writeX() accesses to the same peripheral are ordered
 	   with respect to each other. This ensures that MMIO register writes by
-	   the CPU to a particular device will arrive in program order.
-
-	2. A writeX() by the CPU to the peripheral will first wait for the
-	   completion of all prior CPU writes to memory. This ensures that
-	   writes by the CPU to an outbound DMA buffer allocated by
-	   dma_alloc_coherent() will be visible to a DMA engine when the CPU
-	   writes to its MMIO control register to trigger the transfer.
-
-	3. A readX() by the CPU from the peripheral will complete before any
-	   subsequent CPU reads from memory can begin. This ensures that reads
-	   by the CPU from an incoming DMA buffer allocated by
-	   dma_alloc_coherent() will not see stale data after reading from the
-	   DMA engine's MMIO status register to establish that the DMA transfer
-	   has completed.
-
-	4. A readX() by the CPU from the peripheral will complete before any
-	   subsequent delay() loop can begin execution. This ensures that two
-	   MMIO register writes by the CPU to a peripheral will arrive at least
-	   1us apart if the first write is immediately read back with readX()
-	   and udelay(1) is called prior to the second writeX():
+	   the same CPU thread to a particular device will arrive in program
+	   order.
+
+	2. A writeX() issued by a CPU thread holding a spinlock is ordered
+	   before a writeX() to the same peripheral from another CPU thread
+	   issued after a later acquisition of the same spinlock. This ensures
+	   that MMIO register writes to a particular device issued while holding
+	   a spinlock will arrive in an order consistent with acquisitions of
+	   the lock.
+
+	3. A writeX() by a CPU thread to the peripheral will first wait for the
+	   completion of all prior writes to memory either issued by, or
+	   propagated to, the same thread. This ensures that writes by the CPU
+	   to an outbound DMA buffer allocated by dma_alloc_coherent() will be
+	   visible to a DMA engine when the CPU writes to its MMIO control
+	   register to trigger the transfer.
+
+	4. A readX() by a CPU thread from the peripheral will complete before
+	   any subsequent reads from memory by the same thread can begin. This
+	   ensures that reads by the CPU from an incoming DMA buffer allocated
+	   by dma_alloc_coherent() will not see stale data after reading from
+	   the DMA engine's MMIO status register to establish that the DMA
+	   transfer has completed.
+
+	5. A readX() by a CPU thread from the peripheral will complete before
+	   any subsequent delay() loop can begin execution on the same thread.
+	   This ensures that two MMIO register writes by the CPU to a peripheral
+	   will arrive at least 1us apart if the first write is immediately read
+	   back with readX() and udelay(1) is called prior to the second
+	   writeX():
 
 		writel(42, DEVICE_REGISTER_0); // Arrives at the device...
 		readl(DEVICE_REGISTER_0);
@@ -2559,10 +2569,11 @@ guarantees:
 
 	These are similar to readX() and writeX(), but provide weaker memory
 	ordering guarantees. Specifically, they do not guarantee ordering with
-	respect to normal memory accesses or delay() loops (i.e. bullets 2-4
-	above) but they are still guaranteed to be ordered with respect to other
-	accesses to the same peripheral when operating on __iomem pointers
-	mapped with the default I/O attributes.
+	respect to locking, normal memory accesses or delay() loops (i.e.
+	bullets 2-5 above) but they are still guaranteed to be ordered with
+	respect to other accesses from the same CPU thread to the same
+	peripheral when operating on __iomem pointers mapped with the default
+	I/O attributes.
 
  (*) readsX(), writesX():
 
@@ -2600,8 +2611,10 @@ guarantees:
 	These will perform appropriately for the type of access they're actually
 	doing, be it inX()/outX() or readX()/writeX().
 
-All of these accessors assume that the underlying peripheral is little-endian,
-and will therefore perform byte-swapping operations on big-endian architectures.
+With the exception of the string accessors (insX(), outsX(), readsX() and
+writesX()), all of the above assume that the underlying peripheral is
+little-endian and will therefore perform byte-swapping operations on big-endian
+architectures.
 
 
 ========================================
Patch

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 1c22b21ae922..5eb6f4c6a133 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -2599,72 +2599,97 @@  likely, then interrupt-disabling locks should be used to guarantee ordering.
 KERNEL I/O BARRIER EFFECTS
 ==========================
 
-When accessing I/O memory, drivers should use the appropriate accessor
-functions:
+Interfacing with peripherals via I/O accesses is deeply architecture and device
+specific. Therefore, drivers which are inherently non-portable may rely on
+specific behaviours of their target systems in order to achieve synchronization
+in the most lightweight manner possible. For drivers intending to be portable
+between multiple architectures and bus implementations, the kernel offers a
+series of accessor functions that provide various degrees of ordering
+guarantees:
 
- (*) inX(), outX():
+ (*) readX(), writeX():
 
-     These are intended to talk to I/O space rather than memory space, but
-     that's primarily a CPU-specific concept.  The i386 and x86_64 processors
-     do indeed have special I/O space access cycles and instructions, but many
-     CPUs don't have such a concept.
+     The readX() and writeX() MMIO accessors take a pointer to the peripheral
+     being accessed as an __iomem * parameter. For pointers mapped with the
+     default I/O attributes (e.g. those returned by ioremap()), then the
+     ordering guarantees are as follows:
 
-     The PCI bus, amongst others, defines an I/O space concept which - on such
-     CPUs as i386 and x86_64 - readily maps to the CPU's concept of I/O
-     space.  However, it may also be mapped as a virtual I/O space in the CPU's
-     memory map, particularly on those CPUs that don't support alternate I/O
-     spaces.
+     1. All readX() and writeX() accesses to the same peripheral are ordered
+        with respect to each other. For example, this ensures that MMIO register
+	writes by the CPU to a particular device will arrive in program order.
 
-     Accesses to this space may be fully synchronous (as on i386), but
-     intermediary bridges (such as the PCI host bridge) may not fully honour
-     that.
+     2. A writeX() by the CPU to the peripheral will first wait for the
+        completion of all prior CPU writes to memory. For example, this ensures
+        that writes by the CPU to an outbound DMA buffer allocated by
+        dma_alloc_coherent() will be visible to a DMA engine when the CPU writes
+        to its MMIO control register to trigger the transfer.
 
-     They are guaranteed to be fully ordered with respect to each other.
+     3. A readX() by the CPU from the peripheral will complete before any
+	subsequent CPU reads from memory can begin. For example, this ensures
+	that reads by the CPU from an incoming DMA buffer allocated by
+	dma_alloc_coherent() will not see stale data after reading from the DMA
+	engine's MMIO status register to establish that the DMA transfer has
+	completed.
 
-     They are not guaranteed to be fully ordered with respect to other types of
-     memory and I/O operation.
+     4. A readX() by the CPU from the peripheral will complete before any
+	subsequent delay() loop can begin execution. For example, this ensures
+	that two MMIO register writes by the CPU to a peripheral will arrive at
+	least 1us apart if the first write is immediately read back with readX()
+	and udelay(1) is called prior to the second writeX().
 
- (*) readX(), writeX():
+     __iomem pointers obtained with non-default attributes (e.g. those returned
+     by ioremap_wc()) are unlikely to provide many of these guarantees.
 
-     Whether these are guaranteed to be fully ordered and uncombined with
-     respect to each other on the issuing CPU depends on the characteristics
-     defined for the memory window through which they're accessing.  On later
-     i386 architecture machines, for example, this is controlled by way of the
-     MTRR registers.
+ (*) readX_relaxed(), writeX_relaxed():
 
-     Ordinarily, these will be guaranteed to be fully ordered and uncombined,
-     provided they're not accessing a prefetchable device.
+     These are similar to readX() and writeX(), but provide weaker memory
+     ordering guarantees. Specifically, they do not guarantee ordering with
+     respect to normal memory accesses or delay() loops (i.e bullets 2-4 above)
+     but they are still guaranteed to be ordered with respect to other accesses
+     to the same peripheral when operating on __iomem pointers mapped with the
+     default I/O attributes.
 
-     However, intermediary hardware (such as a PCI bridge) may indulge in
-     deferral if it so wishes; to flush a store, a load from the same location
-     is preferred[*], but a load from the same device or from configuration
-     space should suffice for PCI.
+ (*) readsX(), writesX():
 
-     [*] NOTE! attempting to load from the same location as was written to may
-	 cause a malfunction - consider the 16550 Rx/Tx serial registers for
-	 example.
+     The readsX() and writesX() MMIO accessors are designed for accessing
+     register-based, memory-mapped FIFOs residing on peripherals that are not
+     capable of performing DMA. Consequently, they provide only the ordering
+     guarantees of readX_relaxed() and writeX_relaxed(), as documented above.
 
-     Used with prefetchable I/O memory, an mmiowb() barrier may be required to
-     force stores to be ordered.
+ (*) inX(), outX():
 
-     Please refer to the PCI specification for more information on interactions
-     between PCI transactions.
+     The inX() and outX() accessors are intended to access legacy port-mapped
+     I/O peripherals, which may require special instructions on some
+     architectures (notably x86). The port number of the peripheral being
+     accessed is passed as an argument.
 
- (*) readX_relaxed(), writeX_relaxed()
+     Since many CPU architectures ultimately access these peripherals via an
+     internal virtual memory mapping, the portable ordering guarantees provided
+     by inX() and outX() are the same as those provided by readX() and writeX()
+     respectively when accessing a mapping with the default I/O attributes.
 
-     These are similar to readX() and writeX(), but provide weaker memory
-     ordering guarantees.  Specifically, they do not guarantee ordering with
-     respect to normal memory accesses (e.g. DMA buffers) nor do they guarantee
-     ordering with respect to LOCK or UNLOCK operations.  If the latter is
-     required, an mmiowb() barrier can be used.  Note that relaxed accesses to
-     the same peripheral are guaranteed to be ordered with respect to each
-     other.
+     Device drivers may expect outX() to emit a non-posted write transaction
+     that waits for a completion response from the I/O peripheral before
+     returning. This is not guaranteed by all architectures and is therefore
+     not part of the portable ordering semantics.
+
+ (*) insX(), outsX():
+
+     As above, the insX() and outsX() accessors provide the same ordering
+     guarantees as readsX() and writesX() respectively when accessing a mapping
+     with the default I/O attributes.
 
  (*) ioreadX(), iowriteX()
 
      These will perform appropriately for the type of access they're actually
      doing, be it inX()/outX() or readX()/writeX().
 
+All of these accessors assume that the underlying peripheral is little-endian,
+and will therefore perform byte-swapping operations on big-endian architectures.
+
+Composing I/O ordering barriers with SMP ordering barriers and LOCK/UNLOCK
+operations is a dangerous sport which may require the use of mmiowb(). See the
+subsection "Acquires vs I/O accesses" for more information.
 
 ========================================
 ASSUMED MINIMUM EXECUTION ORDERING MODEL