
ARM: mm: dma: Update coherent streaming APIs with missing memory barrier

Message ID 1398103390-31968-1-git-send-email-santosh.shilimkar@ti.com
State New

Commit Message

Santosh Shilimkar April 21, 2014, 6:03 p.m. UTC
ARM coherent CPU DMA map APIs are assumed to be no-ops on cache-coherent
machines. While this is true, one still needs to ensure that no
outstanding writes are pending in the CPU write buffers. To take care
of that, we need at least a memory barrier to commit those changes
to main memory.

This patch fixes those cases. Without it, you end up patching device
drivers to work around the synchronisation issues.

Cc: Russell King <linux@arm.linux.org.uk>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
---
 arch/arm/mm/dma-mapping.c |   57 ++++++++++++++++++++++++++++++---------------
 1 file changed, 38 insertions(+), 19 deletions(-)
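
In sketch form (illustrative only, not the actual diff, function name
invented), the intent is for the coherent sync-for-device path to gain a
barrier:

	/* sketch: coherent machines need no cache maintenance, but pending
	 * CPU writes must still be committed before the device looks */
	static void arm_coherent_dma_sync_single_for_device(struct device *dev,
			dma_addr_t handle, size_t size, enum dma_data_direction dir)
	{
		if (dir != DMA_FROM_DEVICE)
			wmb();	/* drain CPU write buffers to main memory */
	}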

Comments

Will Deacon April 22, 2014, 10:28 a.m. UTC | #1
Hi Santosh,

On Mon, Apr 21, 2014 at 07:03:10PM +0100, Santosh Shilimkar wrote:
> ARM coherent CPU DMA map APIs are assumed to be no-ops on cache-coherent
> machines. While this is true, one still needs to ensure that no
> outstanding writes are pending in the CPU write buffers. To take care
> of that, we need at least a memory barrier to commit those changes
> to main memory.
> 
> This patch fixes those cases. Without it, you end up patching device
> drivers to work around the synchronisation issues.

Don't you only need these barriers if you're passing ownership of a CPU
buffer to a device? In that case, I would expect a subsequent writel to tell
the device about the new buffer, which includes the required __iowmb().
That's the reason for the relaxed accessors: to avoid this barrier when it's
not needed. Perhaps you're using the relaxed accessors where you actually
need the stronger ordering guarantees?
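
(For reference, ARM's writel() is defined roughly as

	#define writel(v, c)	({ __iowmb(); writel_relaxed(v, c); })

i.e. a write barrier ahead of the MMIO store, so a driver that kicks off
DMA with writel() gets the buffer ordering for free.)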

Will
Santosh Shilimkar April 22, 2014, 1:49 p.m. UTC | #2
Hi Will,

On Tuesday 22 April 2014 06:28 AM, Will Deacon wrote:
> Hi Santosh,
> 
> On Mon, Apr 21, 2014 at 07:03:10PM +0100, Santosh Shilimkar wrote:
>> ARM coherent CPU DMA map APIs are assumed to be no-ops on cache-coherent
>> machines. While this is true, one still needs to ensure that no
>> outstanding writes are pending in the CPU write buffers. To take care
>> of that, we need at least a memory barrier to commit those changes
>> to main memory.
>>
>> This patch fixes those cases. Without it, you end up patching device
>> drivers to work around the synchronisation issues.
> 
> Don't you only need these barriers if you're passing ownership of a CPU
> buffer to a device? In that case, I would expect a subsequent writel to tell
> the device about the new buffer, which includes the required __iowmb().
> That's the reason for the relaxed accessors: to avoid this barrier when it's
> not needed. Perhaps you're using the relaxed accessors where you actually
> need the stronger ordering guarantees?
> 
I suspected someone would bring up this point. In fact, this is how
most people have been living with the issue on coherent machines. On
Keystone too, we added explicit barriers in the respective drivers.

I have added these barriers only to the CPU-to-device streaming APIs because,
in the other direction, the memory is already up to date from the CPU's
perspective.

But if you look at the actual problem, it is really the responsibility of
the DMA streaming APIs, which we are trying to push onto the drivers. A
device driver should be independent of whether it is running on a coherent
or a non-coherent machine.

Let's take an example: an MMC controller driver running on both
non-coherent and coherent machines. The driver has the following
generic code sequence:
1. Prepare the SG list
2. Perform cache maintenance (CMO) using the DMA streaming API
3. Start the DMA transfer

Step 3 expects that step 2 has done its job and the buffer is
completely in main memory. And that is exactly what happens
on non-coherent machines.

Now, on coherent machines, as you mentioned, we are saying drivers
should add a barrier because step 2 is just a NOP, which is not correct.
Step 3 itself, which is merely supposed to start the DMA, doesn't need
any barrier as such. This is the whole rationale behind the patch.
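
A minimal sketch of that sequence (function and register names invented
for illustration):

	/* step 2: the streaming API performs any cache maintenance needed */
	dma_map_sg(dev, data->sg, data->sg_len, DMA_TO_DEVICE);

	/* step 3: kick the DMA; with a relaxed accessor, nothing here
	 * orders the earlier CPU writes to the buffer on a coherent
	 * machine, because step 2 was a NOP there */
	writel_relaxed(DMA_START, host->base + CTRL_REG);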

Regards,
Santosh
Arnd Bergmann April 22, 2014, 2:08 p.m. UTC | #3
On Tuesday 22 April 2014, Santosh Shilimkar wrote:
> On Tuesday 22 April 2014 06:28 AM, Will Deacon wrote:

> > Don't you only need these barriers if you're passing ownership of a CPU
> > buffer to a device? In that case, I would expect a subsequent writel to tell
> > the device about the new buffer, which includes the required __iowmb().
> > That's the reason for the relaxed accessors: to avoid this barrier when it's
> > not needed. Perhaps you're using the relaxed accessors where you actually
> > need the stronger ordering guarantees?
> > 
> I suspected someone would bring up this point. In fact, this is how
> most people have been living with the issue on coherent machines. On
> Keystone too, we added explicit barriers in the respective drivers.

You should not actually need explicit barriers in the drivers. As Will
said, you already do a writel() operation, which contains the
implicit wmb().

> I have added these barriers only to the CPU-to-device streaming APIs because,
> in the other direction, the memory is already up to date from the CPU's
> perspective.
> 
> But if you look at the actual problem, it is really the responsibility of
> the DMA streaming APIs, which we are trying to push onto the drivers. A
> device driver should be independent of whether it is running on a coherent
> or a non-coherent machine.
> 
> Let's take an example: an MMC controller driver running on both
> non-coherent and coherent machines. The driver has the following
> generic code sequence:
> 1. Prepare the SG list
> 2. Perform cache maintenance (CMO) using the DMA streaming API
> 3. Start the DMA transfer
> 
> Step 3 expects that step 2 has done its job and the buffer is
> completely in main memory. And that is exactly what happens
> on non-coherent machines.
> 
> Now, on coherent machines, as you mentioned, we are saying drivers
> should add a barrier because step 2 is just a NOP, which is not correct.
> Step 3 itself, which is merely supposed to start the DMA, doesn't need
> any barrier as such. This is the whole rationale behind the patch.

That's not what the API is. The entire reason for having both writel()
and writel_relaxed() is that drivers rely on writel() to do the barrier.
Doing another barrier in the DMA operations would add unnecessary
overhead for every single driver.

It's not the nicest API ever, but that's what it is and has been, mostly
for compatibility with x86, where the 'mov' instruction performing the
store to MMIO registers implies that all writes to DMA memory are
visible to the device.
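
The intended pattern is something like this (hypothetical register names):

	writel_relaxed(ring_base, regs + REG_RING_BASE);
	writel_relaxed(ring_size, regs + REG_RING_SIZE);
	writel(DMA_GO, regs + REG_CTRL);	/* the one implicit barrier */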

	Arnd
Santosh Shilimkar April 22, 2014, 2:36 p.m. UTC | #4
On Tuesday 22 April 2014 10:08 AM, Arnd Bergmann wrote:
> On Tuesday 22 April 2014, Santosh Shilimkar wrote:
>> On Tuesday 22 April 2014 06:28 AM, Will Deacon wrote:
> 
>>> Don't you only need these barriers if you're passing ownership of a CPU
>>> buffer to a device? In that case, I would expect a subsequent writel to tell
>>> the device about the new buffer, which includes the required __iowmb().
>>> That's the reason for the relaxed accessors: to avoid this barrier when it's
>>> not needed. Perhaps you're using the relaxed accessors where you actually
>>> need the stronger ordering guarantees?
>>>
>> I suspected someone would bring up this point. In fact, this is how
>> most people have been living with the issue on coherent machines. On
>> Keystone too, we added explicit barriers in the respective drivers.
> 
> You should not actually need explicit barriers in the drivers. As Will
> said, you already do a writel() operation, which contains the
> implicit wmb().
> 
>> I have added these barriers only to the CPU-to-device streaming APIs because,
>> in the other direction, the memory is already up to date from the CPU's
>> perspective.
>>
>> But if you look at the actual problem, it is really the responsibility of
>> the DMA streaming APIs, which we are trying to push onto the drivers. A
>> device driver should be independent of whether it is running on a coherent
>> or a non-coherent machine.
>>
>> Let's take an example: an MMC controller driver running on both
>> non-coherent and coherent machines. The driver has the following
>> generic code sequence:
>> 1. Prepare the SG list
>> 2. Perform cache maintenance (CMO) using the DMA streaming API
>> 3. Start the DMA transfer
>>
>> Step 3 expects that step 2 has done its job and the buffer is
>> completely in main memory. And that is exactly what happens
>> on non-coherent machines.
>>
>> Now, on coherent machines, as you mentioned, we are saying drivers
>> should add a barrier because step 2 is just a NOP, which is not correct.
>> Step 3 itself, which is merely supposed to start the DMA, doesn't need
>> any barrier as such. This is the whole rationale behind the patch.
> 
> That's not what the API is. The entire reason for having both writel()
> and writel_relaxed() is that drivers rely on writel() to do the barrier.
> Doing another barrier in the DMA operations would add unnecessary
> overhead for every single driver.
> 
> It's not the nicest API ever, but that's what it is and has been, mostly
> for compatibility with x86, where the 'mov' instruction performing the
> store to MMIO registers implies that all writes to DMA memory are
> visible to the device.
> 
This is not about writel() and writel_relaxed(). The driver doesn't
need that barrier. For example, if the actual start of the DMA
happens a bit later, that doesn't matter to the driver.

The DMA APIs already do barriers today for the non-coherent case. We
are not talking about anything new here. Sorry, but I don't see the
connection here.

Regards,
Santosh
Catalin Marinas April 22, 2014, 3:07 p.m. UTC | #5
On Tue, Apr 22, 2014 at 02:49:06PM +0100, Santosh Shilimkar wrote:
> On Tuesday 22 April 2014 06:28 AM, Will Deacon wrote:
> > On Mon, Apr 21, 2014 at 07:03:10PM +0100, Santosh Shilimkar wrote:
> >> ARM coherent CPU DMA map APIs are assumed to be no-ops on cache-coherent
> >> machines. While this is true, one still needs to ensure that no
> >> outstanding writes are pending in the CPU write buffers. To take care
> >> of that, we need at least a memory barrier to commit those changes
> >> to main memory.
> >>
> >> This patch fixes those cases. Without it, you end up patching device
> >> drivers to work around the synchronisation issues.
> > 
> > Don't you only need these barriers if you're passing ownership of a CPU
> > buffer to a device? In that case, I would expect a subsequent writel to tell
> > the device about the new buffer, which includes the required __iowmb().
> > That's the reason for the relaxed accessors: to avoid this barrier when it's
> > not needed. Perhaps you're using the relaxed accessors where you actually
> > need the stronger ordering guarantees?
> 
> I suspected someone would bring up this point. In fact, this is how
> most people have been living with the issue on coherent machines. On
> Keystone too, we added explicit barriers in the respective drivers.
> 
> I have added these barriers only to the CPU-to-device streaming APIs because,
> in the other direction, the memory is already up to date from the CPU's
> perspective.
> 
> But if you look at the actual problem, it is really the responsibility of
> the DMA streaming APIs, which we are trying to push onto the drivers. A
> device driver should be independent of whether it is running on a coherent
> or a non-coherent machine.
> 
> Let's take an example: an MMC controller driver running on both
> non-coherent and coherent machines. The driver has the following
> generic code sequence:
> 1. Prepare the SG list
> 2. Perform cache maintenance (CMO) using the DMA streaming API
> 3. Start the DMA transfer

The key here is how you start the DMA transfer. So far we assumed it's
done via an I/O operation like writel() and it has the right barriers.
If we have other ways for starting this (like writing the dma_addr in
some other memory descriptor), should we use explicit memory barriers?
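
Something like this, schematically:

	desc->buf   = dma_addr;
	desc->len   = len;
	wmb();				/* make the buffer and fields visible */
	desc->flags = DEV_OWNS_DESC;	/* the device fetches/polls this */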
Santosh Shilimkar April 22, 2014, 3:18 p.m. UTC | #6
On Tuesday 22 April 2014 11:07 AM, Catalin Marinas wrote:
> On Tue, Apr 22, 2014 at 02:49:06PM +0100, Santosh Shilimkar wrote:
>> On Tuesday 22 April 2014 06:28 AM, Will Deacon wrote:
>>> On Mon, Apr 21, 2014 at 07:03:10PM +0100, Santosh Shilimkar wrote:
> >>>> ARM coherent CPU DMA map APIs are assumed to be no-ops on cache-coherent
> >>>> machines. While this is true, one still needs to ensure that no
> >>>> outstanding writes are pending in the CPU write buffers. To take care
> >>>> of that, we need at least a memory barrier to commit those changes
> >>>> to main memory.
> >>>>
> >>>> This patch fixes those cases. Without it, you end up patching device
> >>>> drivers to work around the synchronisation issues.
>>>
>>> Don't you only need these barriers if you're passing ownership of a CPU
>>> buffer to a device? In that case, I would expect a subsequent writel to tell
>>> the device about the new buffer, which includes the required __iowmb().
>>> That's the reason for the relaxed accessors: to avoid this barrier when it's
>>> not needed. Perhaps you're using the relaxed accessors where you actually
>>> need the stronger ordering guarantees?
>>
>> I suspected someone would bring up this point. In fact, this is how
>> most people have been living with the issue on coherent machines. On
>> Keystone too, we added explicit barriers in the respective drivers.
>>
>> I have added these barriers only to the CPU-to-device streaming APIs because,
>> in the other direction, the memory is already up to date from the CPU's
>> perspective.
>>
>> But if you look at the actual problem, it is really the responsibility of
>> the DMA streaming APIs, which we are trying to push onto the drivers. A
>> device driver should be independent of whether it is running on a coherent
>> or a non-coherent machine.
>>
>> Let's take an example: an MMC controller driver running on both
>> non-coherent and coherent machines. The driver has the following
>> generic code sequence:
>> 1. Prepare the SG list
>> 2. Perform cache maintenance (CMO) using the DMA streaming API
>> 3. Start the DMA transfer
> 
> The key here is how you start the DMA transfer. So far we assumed it's
> done via an I/O operation like writel() and it has the right barriers.
> If we have other ways for starting this (like writing the dma_addr in
> some other memory descriptor), should we use explicit memory barriers?
> 
My point is: why do we want to rely on the barrier in the DMA kick
mechanism for an operation which is supposed to be completed in step 2?

Regards,
Santosh
Catalin Marinas April 22, 2014, 3:30 p.m. UTC | #7
On Tue, Apr 22, 2014 at 04:18:08PM +0100, Santosh Shilimkar wrote:
> On Tuesday 22 April 2014 11:07 AM, Catalin Marinas wrote:
> > On Tue, Apr 22, 2014 at 02:49:06PM +0100, Santosh Shilimkar wrote:
> >> On Tuesday 22 April 2014 06:28 AM, Will Deacon wrote:
> >>> On Mon, Apr 21, 2014 at 07:03:10PM +0100, Santosh Shilimkar wrote:
> >>>> ARM coherent CPU DMA map APIs are assumed to be no-ops on cache-coherent
> >>>> machines. While this is true, one still needs to ensure that no
> >>>> outstanding writes are pending in the CPU write buffers. To take care
> >>>> of that, we need at least a memory barrier to commit those changes
> >>>> to main memory.
> >>>>
> >>>> This patch fixes those cases. Without it, you end up patching device
> >>>> drivers to work around the synchronisation issues.
> >>>
> >>> Don't you only need these barriers if you're passing ownership of a CPU
> >>> buffer to a device? In that case, I would expect a subsequent writel to tell
> >>> the device about the new buffer, which includes the required __iowmb().
> >>> That's the reason for the relaxed accessors: to avoid this barrier when it's
> >>> not needed. Perhaps you're using the relaxed accessors where you actually
> >>> need the stronger ordering guarantees?
> >>
> >> I suspected someone would bring up this point. In fact, this is how
> >> most people have been living with the issue on coherent machines. On
> >> Keystone too, we added explicit barriers in the respective drivers.
> >>
> >> I have added these barriers only to the CPU-to-device streaming APIs because,
> >> in the other direction, the memory is already up to date from the CPU's
> >> perspective.
> >>
> >> But if you look at the actual problem, it is really the responsibility of
> >> the DMA streaming APIs, which we are trying to push onto the drivers. A
> >> device driver should be independent of whether it is running on a coherent
> >> or a non-coherent machine.
> >>
> >> Let's take an example: an MMC controller driver running on both
> >> non-coherent and coherent machines. The driver has the following
> >> generic code sequence:
> >> 1. Prepare the SG list
> >> 2. Perform cache maintenance (CMO) using the DMA streaming API
> >> 3. Start the DMA transfer
> > 
> > The key here is how you start the DMA transfer. So far we assumed it's
> > done via an I/O operation like writel() and it has the right barriers.
> > If we have other ways for starting this (like writing the dma_addr in
> > some other memory descriptor), should we use explicit memory barriers?
> 
> My point is: why do we want to rely on the barrier in the DMA kick
> mechanism for an operation which is supposed to be completed in step 2?

Because step 2 only applies if you use streaming DMA. Since we already
cover this with the DMA kick mechanism for coherent DMA buffers, we
would just be doubling the barriers for the streaming case.
Arnd Bergmann April 22, 2014, 7:53 p.m. UTC | #8
On Tuesday 22 April 2014, Santosh Shilimkar wrote:
> On Tuesday 22 April 2014 10:08 AM, Arnd Bergmann wrote:
> > It's not the nicest API ever, but that's what it is and has been, mostly
> > for compatibility with x86, where the 'mov' instruction performing the
> > store to MMIO registers implies that all writes to DMA memory are
> > visible to the device.
> > 
> This is not about writel() and writel_relaxed(). The driver doesn't
> need that barrier. For example, if the actual start of the DMA
> happens a bit later, that doesn't matter to the driver.
> 
> The DMA APIs already do barriers today for the non-coherent case. We
> are not talking about anything new here. Sorry, but I don't see the
> connection here.

I don't think they do, nor should they. Can you tell me where
you see a barrier in dma_sync_single_for_cpu() or
arm_dma_sync_single_for_device()? As far as I can tell, they
only deal with L1 and L2 cache maintenance in arm_dma_ops.

	Arnd
Santosh Shilimkar April 22, 2014, 7:58 p.m. UTC | #9
On Tuesday 22 April 2014 03:53 PM, Arnd Bergmann wrote:
> On Tuesday 22 April 2014, Santosh Shilimkar wrote:
>> On Tuesday 22 April 2014 10:08 AM, Arnd Bergmann wrote:
>>> It's not the nicest API ever, but that's what it is and has been, mostly
>>> for compatibility with x86, where the 'mov' instruction performing the
>>> store to MMIO registers implies that all writes to DMA memory are
>>> visible to the device.
>>>
>> This is not about writel() and writel_relaxed(). The driver doesn't
>> need that barrier. For example, if the actual start of the DMA
>> happens a bit later, that doesn't matter to the driver.
>>
>> The DMA APIs already do barriers today for the non-coherent case. We
>> are not talking about anything new here. Sorry, but I don't see the
>> connection here.
> 
> I don't think they do, nor should they. Can you tell me where
> you see a barrier in dma_sync_single_for_cpu() or
> arm_dma_sync_single_for_device()? As far as I can tell, they
> only deal with L1 and L2 cache maintenance in arm_dma_ops.
> 
The cache APIs used by dma_ops do have the necessary barriers at
the end of the cache operations. That's what I meant. So for the
end user (the device driver), it's transparent.
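
Schematically, the non-coherent call chain (from memory, v7 as the
example):

	/*
	 * dma_sync_single_for_device()
	 *   -> arm_dma_sync_single_for_device()
	 *     -> __dma_page_cpu_to_dev()
	 *       -> dmac_map_area(), where the v7 dma cache routines
	 *          finish with a dsb before returning
	 */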

Regards,
Santosh
Arnd Bergmann April 22, 2014, 8:23 p.m. UTC | #10
On Tuesday 22 April 2014 15:58:09 Santosh Shilimkar wrote:
> On Tuesday 22 April 2014 03:53 PM, Arnd Bergmann wrote:
> > On Tuesday 22 April 2014, Santosh Shilimkar wrote:
> >> On Tuesday 22 April 2014 10:08 AM, Arnd Bergmann wrote:
> >>> It's not the nicest API ever, but that's what it is and has been, mostly
> >>> for compatibility with x86, where the 'mov' instruction performing the
> >>> store to MMIO registers implies that all writes to DMA memory are
> >>> visible to the device.
> >>>
> >> This is not about writel() and writel_relaxed(). The driver doesn't
> >> need that barrier. For example, if the actual start of the DMA
> >> happens a bit later, that doesn't matter to the driver.
> >>
> >> The DMA APIs already do barriers today for the non-coherent case. We
> >> are not talking about anything new here. Sorry, but I don't see the
> >> connection here.
> > 
> > I don't think they do, nor should they. Can you tell me where
> > you see a barrier in dma_sync_single_for_cpu() or
> > arm_dma_sync_single_for_device()? As far as I can tell, they
> > only deal with L1 and L2 cache maintenance in arm_dma_ops.
> > 
> The cache APIs used by dma_ops do have the necessary barriers at
> the end of the cache operations. That's what I meant. So for the
> end user (the device driver), it's transparent.

Ok, I see it now for the noncoherent operations, and I see
the same thing on PowerPC and MIPS, which also have both coherent
and noncoherent versions of their dma_map_ops.

However, I also see that neither of those does a wmb() for the
coherent version. I don't see why ARM should be different from
the others here, so if there is a reason to do a barrier there,
we should change all architectures. I still don't see a reason
why the barrier is needed though.

Can you be more specific in your driver example why you think
the barrier in the writel() is not sufficient?

	Arnd
Santosh Shilimkar April 22, 2014, 8:30 p.m. UTC | #11
On Tuesday 22 April 2014 04:23 PM, Arnd Bergmann wrote:
> On Tuesday 22 April 2014 15:58:09 Santosh Shilimkar wrote:
>> On Tuesday 22 April 2014 03:53 PM, Arnd Bergmann wrote:
>>> On Tuesday 22 April 2014, Santosh Shilimkar wrote:
>>>> On Tuesday 22 April 2014 10:08 AM, Arnd Bergmann wrote:
>>>>> It's not the nicest API ever, but that's what it is and has been, mostly
>>>>> for compatibility with x86, where the 'mov' instruction performing the
>>>>> store to MMIO registers implies that all writes to DMA memory are
>>>>> visible to the device.
>>>>>
>>>> This is not about writel() and writel_relaxed(). The driver doesn't
>>>> need that barrier. For example, if the actual start of the DMA
>>>> happens a bit later, that doesn't matter to the driver.
>>>>
>>>> The DMA APIs already do barriers today for the non-coherent case. We
>>>> are not talking about anything new here. Sorry, but I don't see the
>>>> connection here.
>>>
>>> I don't think they do, nor should they. Can you tell me where
>>> you see a barrier in dma_sync_single_for_cpu() or
>>> arm_dma_sync_single_for_device()? As far as I can tell, they
>>> only deal with L1 and L2 cache maintenance in arm_dma_ops.
>>>
>> The cache APIs used by dma_ops do have the necessary barriers at
>> the end of the cache operations. That's what I meant. So for the
>> end user (the device driver), it's transparent.
> 
> Ok, I see it now for the noncoherent operations, and I see
> the same thing on PowerPC and MIPS, which also have both coherent
> and noncoherent versions of their dma_map_ops.
> 
> However, I also see that neither of those does a wmb() for the
> coherent version. I don't see why ARM should be different from
> the others here, so if there is a reason to do a barrier there,
> we should change all architectures. I still don't see a reason
> why the barrier is needed though.
> 
That's fair.

> Can you be more specific in your driver example why you think
> the barrier in the writel() is not sufficient?
> 
writel() or an explicit barrier in the driver will do the job. I was
just thinking that we are trying to work around the shortcomings
of the streaming API by adding barriers in the driver. For example,
on a non-coherent system I don't need that barrier because
dma_ops takes care of it.

Anyway, I think you and Catalin have convinced me to handle the
case at the driver level, so I'll step back on the patch. It makes
sense to keep ARM consistent with other architectures.

Regards,
Santosh
Will Deacon April 23, 2014, 9:02 a.m. UTC | #12
On Tue, Apr 22, 2014 at 09:30:27PM +0100, Santosh Shilimkar wrote:
> On Tuesday 22 April 2014 04:23 PM, Arnd Bergmann wrote:
> > On Tuesday 22 April 2014 15:58:09 Santosh Shilimkar wrote:
> >> On Tuesday 22 April 2014 03:53 PM, Arnd Bergmann wrote:
> >>> On Tuesday 22 April 2014, Santosh Shilimkar wrote:
> >>>> On Tuesday 22 April 2014 10:08 AM, Arnd Bergmann wrote:
> >>>>> It's not the nicest API ever, but that's what it is and has been, mostly
> >>>>> for compatibility with x86, where the 'mov' instruction performing the
> >>>>> store to MMIO registers implies that all writes to DMA memory are
> >>>>> visible to the device.
> >>>>>
> >>>> This is not about writel() and writel_relaxed(). The driver doesn't
> >>>> need that barrier. For example, if the actual start of the DMA
> >>>> happens a bit later, that doesn't matter to the driver.
> >>>>
> >>>> The DMA APIs already do barriers today for the non-coherent case. We
> >>>> are not talking about anything new here. Sorry, but I don't see the
> >>>> connection here.
> >>>
> >>> I don't think they do, nor should they. Can you tell me where
> >>> you see a barrier in dma_sync_single_for_cpu() or
> >>> arm_dma_sync_single_for_device()? As far as I can tell, they
> >>> only deal with L1 and L2 cache maintenance in arm_dma_ops.
> >>>
> >> The cache APIs used by dma_ops do have the necessary barriers at
> >> the end of the cache operations. That's what I meant. So for the
> >> end user (the device driver), it's transparent.
> > 
> > Ok, I see it now for the noncoherent operations, and I see
> > the same thing on PowerPC and MIPS, which also have both coherent
> > and noncoherent versions of their dma_map_ops.
> > 
> > However, I also see that neither of those does a wmb() for the
> > coherent version. I don't see why ARM should be different from
> > the others here, so if there is a reason to do a barrier there,
> > we should change all architectures. I still don't see a reason
> > why the barrier is needed though.
> > 
> That's fair.
> 
> > Can you be more specific in your driver example why you think
> > the barrier in the writel() is not sufficient?
> > 
> writel() or an explicit barrier in the driver will do the job. I was
> just thinking that we are trying to work around the shortcomings
> of the streaming API by adding barriers in the driver. For example,
> on a non-coherent system I don't need that barrier because
> dma_ops takes care of it.

I wonder whether we can remove those barriers altogether then (from the DMA
cache operations). For the coherent case, the driver must provide the
barrier (probably via writel) so the non-coherent case shouldn't be any
different.

I need some more coffee and a serious look at the code, but we may be able
to use dmb instructions to order the cache maintenance and avoid a final
dsb for completion.

Will
Catalin Marinas April 23, 2014, 4:02 p.m. UTC | #13
On Wed, Apr 23, 2014 at 10:02:51AM +0100, Will Deacon wrote:
> On Tue, Apr 22, 2014 at 09:30:27PM +0100, Santosh Shilimkar wrote:
> > writel() or an explicit barrier in the driver will do the job. I was
> > just thinking that we are trying to work around the shortcomings
> > of the streaming API by adding barriers in the driver. For example,
> > on a non-coherent system I don't need that barrier because
> > dma_ops takes care of it.
> 
> I wonder whether we can remove those barriers altogether then (from the DMA
> cache operations). For the coherent case, the driver must provide the
> barrier (probably via writel) so the non-coherent case shouldn't be any
> different.

For the DMA_TO_DEVICE case the effect should be the same, since wmb()
implies a dsb (and an outer_sync() for writes). But the reason we have
barriers in the DMA ops is slightly different: the completion of the
cache maintenance operation, rather than ordering with any previous
writes to the DMA buffer.

In the DMA_FROM_DEVICE scenario, for example, the CPU gets an interrupt
for a finished DMA transfer and executes dma_unmap_single() prior to
accessing the page. However, the CPU access after unmapping is done using
normal LDR/STR instructions, which do not imply any barrier. So we need
to ensure the completion of the cache invalidation in the dma operation.
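
For example, schematically:

	dma_unmap_single(dev, handle, size, DMA_FROM_DEVICE);
	first_word = *(u32 *)buf;	/* plain LDR, no implied barrier */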

In the I/O coherency case, I would say it is the responsibility of the
device/hardware to ensure that the data is visible to all observers
(CPUs) prior to issuing an interrupt for DMA-ready. Looking at the mvebu
code, I think it covers the from-device and bidirectional
scenarios.

Maybe Santosh still has a point ;) but I don't know what the right
barrier would be here. And I really *hate* per-SoC/snoop unit barriers
(I still hope a dsb would do the trick on newer/ARMv8 systems).

> I need some more coffee and a serious look at the code, but we may be able
> to use dmb instructions to order the cache maintenance and avoid a final
> dsb for completion.

Is the dmb enough (assuming no outer cache)? We need to ensure the
flushed cache lines reach the memory for device access.
Will Deacon April 23, 2014, 5:17 p.m. UTC | #14
On Wed, Apr 23, 2014 at 05:02:16PM +0100, Catalin Marinas wrote:
> On Wed, Apr 23, 2014 at 10:02:51AM +0100, Will Deacon wrote:
> > On Tue, Apr 22, 2014 at 09:30:27PM +0100, Santosh Shilimkar wrote:
> > > writel() or an explicit barrier in the driver will do the job. I was
> > > just thinking that we are trying to work around the shortcomings
> > > of the streaming API by adding barriers in the driver. For example,
> > > on a non-coherent system I don't need that barrier because
> > > dma_ops takes care of it.
> > 
> > I wonder whether we can remove those barriers altogether then (from the DMA
> > cache operations). For the coherent case, the driver must provide the
> > barrier (probably via writel) so the non-coherent case shouldn't be any
> > different.
> 
> For the DMA_TO_DEVICE case the effect should be the same, since wmb()
> implies a dsb (and an outer_sync() for writes). But the reason we have
> barriers in the DMA ops is slightly different: the completion of the
> cache maintenance operation, rather than ordering with any previous
> writes to the DMA buffer.
> 
> In the DMA_FROM_DEVICE scenario, for example, the CPU gets an interrupt
> for a finished DMA transfer and executes dma_unmap_single() prior to
> accessing the page. However, the CPU access after unmapping is done using
> normal LDR/STR instructions, which do not imply any barrier. So we need
> to ensure the completion of the cache invalidation in the dma operation.

I don't think we necessarily need completion, we just need ordering. That
is, the normal LDR/STR instructions must be observed after the cache
maintenance. I'll have to revisit the ARM ARM to be sure of this, but a dmb
should be sufficient for that guarantee.

> In the I/O coherency case, I would say it is the responsibility of the
> device/hardware to ensure that the data is visible to all observers
> (CPUs) prior to issuing an interrupt for DMA-ready. Looking at the mvebu
> code, I think it covers the from-device and bidirectional
> scenarios.
> 
> Maybe Santosh still has a point ;) but I don't know what the right
> barrier would be here. And I really *hate* per-SoC/snoop unit barriers
> (I still hope a dsb would do the trick on newer/ARMv8 systems).

If you have device interrupts which are asynchronous to memory coherency,
then you're in a world of pain. I can't think of a generic (architected)
solution to this problem, unfortunately -- it's going to be both device
and interconnect specific. Adding dsbs doesn't necessarily help at all.

> > I need some more coffee and a serious look at the code, but we may be able
> > to use dmb instructions to order the cache maintenance and avoid a final
> > dsb for completion.
> 
> Is the dmb enough (assuming no outer cache)? We need to ensure the
> flushed cache lines reach the memory for device access.

I would keep the dsb in writel for that (the same argument as we had for the
coherent case earlier in the thread). Only the DMA cache maintenance
operations would be relaxed to dmb.

Will
Russell King - ARM Linux April 23, 2014, 6:37 p.m. UTC | #15
On Wed, Apr 23, 2014 at 06:17:27PM +0100, Will Deacon wrote:
> On Wed, Apr 23, 2014 at 05:02:16PM +0100, Catalin Marinas wrote:
> > In the I/O coherency case, I would say it is the responsibility of the
> > device/hardware to ensure that the data is visible to all observers
> > (CPUs) prior to issuing an interrupt for DMA-ready. Looking at the mvebu
> > code, I think it covers the from-device and bidirectional
> > scenarios.
> > 
> > Maybe Santosh still has a point ;) but I don't know what the right
> > barrier would be here. And I really *hate* per-SoC/snoop unit barriers
> > (I still hope a dsb would do the trick on newer/ARMv8 systems).
> 
> If you have device interrupts which are asynchronous to memory coherency,
> then you're in a world of pain. I can't think of a generic (architected)
> solution to this problem, unfortunately -- it's going to be both device
> and interconnect specific. Adding dsbs doesn't necessarily help at all.

Think, network devices with NAPI handling.  There, we explicitly turn
off the device's interrupt, and switch to software polling for received
packets.

The memory for the packets has already been mapped, and we're unmapping
the buffer, and then reading from it (to locate the ether type, and/or
vlan headers) before passing it up the network stack.

So in this case, we need to ensure that the cache operations are ordered
before the subsequent loads read from the DMA'd data.  It's purely an
ordering thing, it's not a completion thing.

However, what must not happen is for the unmap to be re-ordered
before reading the descriptor and deciding whether there's a packet
present to be unmapped.  That probably implies that code _should_ be
doing this:

	status = desc->status;
	if (!(status & CPU_OWNS_THIS_DESCRIPTOR))
		no_packet;

	rmb();

	addr = desc->buf;
	len = desc->length;

	dma_unmap_single(dev, addr, len, DMA_FROM_DEVICE);

	...receive skb...reading buffer...

and there's a number of ethernet drivers which do exactly that.  For
example, drivers/net/ethernet/intel/e1000e/netdev.c, e1000_clean_rx_irq()
and various other Intel networking drivers.
Arnd Bergmann April 23, 2014, 6:58 p.m. UTC | #16
On Wednesday 23 April 2014 19:37:42 Russell King - ARM Linux wrote:
> On Wed, Apr 23, 2014 at 06:17:27PM +0100, Will Deacon wrote:
> > On Wed, Apr 23, 2014 at 05:02:16PM +0100, Catalin Marinas wrote:
> > > In the I/O coherency case, I would say it is the responsibility of the
> > > device/hardware to ensure that the data is visible to all observers
> > > (CPUs) prior to issuing an interrupt for DMA-ready. Looking at the mvebu
> > > code, I think it covers the from-device and bidirectional
> > > scenarios.
> > > 
> > > Maybe Santosh still has a point ;) but I don't know what the right
> > > barrier would be here. And I really *hate* per-SoC/snoop unit barriers
> > > (I still hope a dsb would do the trick on newer/ARMv8 systems).
> > 
> > If you have device interrupts which are asynchronous to memory coherency,
> > then you're in a world of pain. I can't think of a generic (architected)
> > solution to this problem, unfortunately -- it's going to be both device
> > and interconnect specific. Adding dsbs doesn't necessarily help at all.
> 
> Think, network devices with NAPI handling.  There, we explicitly turn
> off the device's interrupt, and switch to software polling for received
> packets.
>
> The memory for the packets has already been mapped, and we're unmapping
> the buffer, and then reading from it (to locate the ether type, and/or
> vlan headers) before passing it up the network stack.
> 
> So in this case, we need to ensure that the cache operations are ordered
> before the subsequent loads read from the DMA'd data.  It's purely an
> ordering thing, it's not a completion thing.

PCI guarantees this, but I have seen systems in the past (on PowerPC) that
would violate them on the internal interconnect: You could sometimes see the
completion DMA data in the descriptor ring before the actual user data
is there. We only ever observed it in combination with an IOMMU, when the
descriptor address had a valid IOTLB but the data address did not.

I would hope that the ARM SMMU gets this right, but there are also
a number of other IOMMU implementations.

The x-gene SATA driver apparently suffers from a related problem, and
they have to flush outstanding DMAs at the interconnect whenever they
get a completion interrupt.

Another problem is MSI processing. MSI was specifically invented to avoid
having to check an MMIO register for a DMA completion that as a side-effect
flushes pending DMAs from the same device. This breaks down if the MSI
packet gets turned into a level interrupt before it reaches the CPU's
coherency domain, which is likely the case on the dw-pcie controller that
comes with its own MSI block.

	Arnd
Russell King - ARM Linux April 23, 2014, 7:04 p.m. UTC | #17
On Wed, Apr 23, 2014 at 08:58:05PM +0200, Arnd Bergmann wrote:
> On Wednesday 23 April 2014 19:37:42 Russell King - ARM Linux wrote:
> > On Wed, Apr 23, 2014 at 06:17:27PM +0100, Will Deacon wrote:
> > > On Wed, Apr 23, 2014 at 05:02:16PM +0100, Catalin Marinas wrote:
> > > > In the I/O coherency case, I would say it is the responsibility of the
> > > > device/hardware to ensure that the data is visible to all observers
> > > > (CPUs) prior to issuing an interrupt for DMA-ready. Looking at the mvebu
> > > > code, I think it covers the from-device and bidirectional
> > > > scenarios.
> > > > 
> > > > Maybe Santosh still has a point ;) but I don't know what the right
> > > > barrier would be here. And I really *hate* per-SoC/snoop unit barriers
> > > > (I still hope a dsb would do the trick on newer/ARMv8 systems).
> > > 
> > > If you have device interrupts which are asynchronous to memory coherency,
> > > then you're in a world of pain. I can't think of a generic (architected)
> > > solution to this problem, unfortunately -- it's going to be both device
> > > and interconnect specific. Adding dsbs doesn't necessarily help at all.
> > 
> > Think, network devices with NAPI handling.  There, we explicitly turn
> > off the device's interrupt, and switch to software polling for received
> > packets.
> >
> > The memory for the packets has already been mapped, and we're unmapping
> > the buffer, and then reading from it (to locate the ether type, and/or
> > vlan headers) before passing it up the network stack.
> > 
> > So in this case, we need to ensure that the cache operations are ordered
> > before the subsequent loads read from the DMA'd data.  It's purely an
> > ordering thing, it's not a completion thing.
> 
> PCI guarantees this, but I have seen systems in the past (on PowerPC) that
> would violate them on the internal interconnect: You could sometimes see the
> completion DMA data in the descriptor ring before the actual user data
> is there. We only ever observed it in combination with an IOMMU, when the
> descriptor address had a valid IOTLB but the data address did not.

What is done on downstream buses is of no concern to the behaviour of
the CPU, which is what's being discussed here (in terms of barriers)
and the correct CPU ordering of various reads/writes to memory and
devices vs. the streaming cache operations.
Jason Gunthorpe April 23, 2014, 7:34 p.m. UTC | #18
On Wed, Apr 23, 2014 at 08:58:05PM +0200, Arnd Bergmann wrote:

> PCI guarantees this, but I have seen systems in the past (on
> PowerPC) that would violate them on the internal interconnect: You
> could sometimes see the completion DMA data in the descriptor ring
> before the actual user data is there. We only ever observed it in
> combination with an IOMMU, when the descriptor address had a valid
> IOTLB but the data address did not.

Ordering in PCI-E gets a bit fuzzy when you talk about internal ordering
within the host bridge, but AFAIK, re-ordering non-relaxed PCI-E
writes is certainly a big no-no. It breaks the entire
producer/consumer driver model.

> Another problem is MSI processing. MSI was specifically invented to avoid
> having to check an MMIO register for a DMA completion that as a side-effect
> flushes pending DMAs from the same device. This breaks down if the MSI
> packet gets turned into a level interrupt before it reaches the CPU's
> coherency domain, which is likely the case on the dw-pcie controller that
> comes with its own MSI block.

I recently implemented a PCI-E to AXI bridge HW that does MSI in an
AXI environment and it requires waiting for all AXI operations
associated with prior PCI-E packets to complete and be acked back to
the bridge before sending an MSI edge over to the GIC.

Unlike PCI, AXI provides a write completion ack back to the initiator,
which can only be sent by the completer once the transaction is
visible to all other initiators.

A bridge must similarly serialize other TLPs: eg a series of posted
writes with the relaxed ordering bit set can be pipelined into AXI,
but once a non-relaxed TLP is hit, the bridge must wait for all the
prior writes to be ack'd before forwarding the non-relaxed one.

Not doing this serialization would be the root cause of problems like
the one you described above on PPC, where the IOMMU path takes longer
than the non-IOMMU path, so the non-relaxed completion write arrives
too soon.

IMHO, if someone builds a PCI-E bridge that doesn't do this, then its
MSI support is completely broken and should not be used. Delivering an
MSI interrupt before data visibility completely violates the
transaction-ordering requirements in PCI-E.

It is also important to note that even level interrupts require bridge
serialization. When a non-relaxed read-response TLP is returned the
bridge must wait for all AXI writes to be ack'd before forwarding the
read response. Otherwise writes could be buffered within the
interconnect and still not be visible to the CPU while the read
response 'passes' them.

Jason
Catalin Marinas April 24, 2014, 9:09 a.m. UTC | #19
On Wed, Apr 23, 2014 at 06:17:27PM +0100, Will Deacon wrote:
> On Wed, Apr 23, 2014 at 05:02:16PM +0100, Catalin Marinas wrote:
> > On Wed, Apr 23, 2014 at 10:02:51AM +0100, Will Deacon wrote:
> > > On Tue, Apr 22, 2014 at 09:30:27PM +0100, Santosh Shilimkar wrote:
> > > > writel() or an explicit barrier in the driver will do the job. I was
> > > > just thinking that we are trying to work around the shortcomings
> > > > of the streaming API by adding barriers in the driver. For example,
> > > > on a non-coherent system I don't need that barrier because
> > > > dma_ops takes care of it.
> > > 
> > > I wonder whether we can remove those barriers altogether then (from the DMA
> > > cache operations). For the coherent case, the driver must provide the
> > > barrier (probably via writel) so the non-coherent case shouldn't be any
> > > different.
> > 
> > For the DMA_TO_DEVICE case the effect should be the same, since wmb()
> > implies a dsb (and an outer_sync() for writes). But the reason we have
> > barriers in the DMA ops is slightly different: the completion of the
> > cache maintenance operation, rather than ordering with any previous
> > writes to the DMA buffer.
> > 
> > In the DMA_FROM_DEVICE scenario, for example, the CPU gets an interrupt
> > for a finished DMA transfer and executes dma_unmap_single() prior to
> > accessing the page. However, the CPU access after unmapping is done using
> > normal LDR/STR instructions, which do not imply any barrier. So we need
> > to ensure the completion of the cache invalidation in the dma operation.
> 
> I don't think we necessarily need completion, we just need ordering. That
> is, the normal LDR/STR instructions must be observed after the cache
> maintenance. I'll have to revisit the ARM ARM to be sure of this, but a dmb
> should be sufficient for that guarantee.

If we only do D-cache maintenance by MVA, the ARM ARM (both v7 and v8)
claims that these are ordered relative to any explicit load/stores to
the same address. So in theory we don't even need a DMB for unmapping
with DMA_FROM_DEVICE. But in practice, we may have the outer cache,
hence a DSB is required before the outer_sync() (we could move it there
though).
Russell King - ARM Linux April 24, 2014, 9:16 a.m. UTC | #20
On Thu, Apr 24, 2014 at 10:09:27AM +0100, Catalin Marinas wrote:
> If we only do D-cache maintenance by MVA, the ARM ARM (both v7 and v8)
> claims that these are ordered relative to any explicit load/stores to
> the same address. So in theory we don't even need a DMB for unmapping
> with DMA_FROM_DEVICE. But in practice, we may have the outer cache,
> hence a DSB is required before the outer_sync() (we could move it there
> though).

The general use case for outer_sync() is "dsb(); outer_sync();". Why
would we want to change this to "dsb(); dmb(); outer_sync();" (where
the dmb is in outer_sync() itself)?
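
(Roughly, from asm/barrier.h:

	#define wmb()	do { dsb(); outer_sync(); } while (0)

modulo the exact dsb variant.)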

Seems more sensible for it to stay at the outer_sync() call site where
it's needed.
Catalin Marinas April 24, 2014, 9:54 a.m. UTC | #21
On Wed, Apr 23, 2014 at 07:37:42PM +0100, Russell King - ARM Linux wrote:
> On Wed, Apr 23, 2014 at 06:17:27PM +0100, Will Deacon wrote:
> > On Wed, Apr 23, 2014 at 05:02:16PM +0100, Catalin Marinas wrote:
> > > In the I/O coherency case, I would say it is the responsibility of the
> > > device/hardware to ensure that the data is visible to all observers
> > > (CPUs) prior to issuing an interrupt for DMA-ready. Looking at the mvebu
> > > code, I think it covers the from-device and bidirectional
> > > scenarios.
> > > 
> > > Maybe Santosh still has a point ;) but I don't know what the right
> > > barrier would be here. And I really *hate* per-SoC/snoop unit barriers
> > > (I still hope a dsb would do the trick on newer/ARMv8 systems).
> > 
> > If you have device interrupts which are asynchronous to memory coherency,
> > then you're in a world of pain. I can't think of a generic (architected)
> > solution to this problem, unfortunately -- it's going to be both device
> > and interconnect specific. Adding dsbs doesn't necessarily help at all.
> 
> Think, network devices with NAPI handling.  There, we explicitly turn
> off the device's interrupt, and switch to software polling for received
> packets.
> 
> The memory for the packets has already been mapped, and we're unmapping
> the buffer, and then reading from it (to locate the ether type, and/or
> vlan headers) before passing it up the network stack.
> 
> So in this case, we need to ensure that the cache operations are ordered
> before the subsequent loads read from the DMA'd data.  It's purely an
> ordering thing, it's not a completion thing.

Well, ordering of completed cache operations ;).

> However, what must not happen is for the unmap to be re-ordered
> before reading the descriptor and deciding whether there's a packet
> present to be unmapped.  That probably implies that code _should_ be
> doing this:
> 
> 	status = desc->status;
> 	if (!(status & CPU_OWNS_THIS_DESCRIPTOR))
> 		no_packet;
> 
> 	rmb();
> 
> 	addr = desc->buf;
> 	len = desc->length;
> 
> 	dma_unmap_single(dev, addr, len, DMA_FROM_DEVICE);

Indeed.

> 	...receive skb...reading buffer...

The point Will and I were trying to make is about ordering as
observed by the CPU (rather than ordering of CPU actions). Independently
of whether the DMA is coherent or not, the device write to the buffer
and the status update (in-memory descriptor or IRQ) *must* be
ordered by the device and the interconnects. The rmb() as per your example
above only orders the relative CPU loads and cache maintenance; it
doesn't have any effect on the transactions done by the device.

The mvebu SoC has exactly this problem. mvebu_hwcc_sync_io_barrier() is
something that should be handled by the hardware automatically,
especially in a system which claims cache coherency.
Catalin Marinas April 24, 2014, 10:13 a.m. UTC | #22
On Thu, Apr 24, 2014 at 10:16:24AM +0100, Russell King - ARM Linux wrote:
> On Thu, Apr 24, 2014 at 10:09:27AM +0100, Catalin Marinas wrote:
> > If we only do D-cache maintenance by MVA, the ARM ARM (both v7 and v8)
> > claims that these are ordered relative to any explicit load/stores to
> > the same address. So in theory we don't even need a DMB for unmapping
> > with DMA_FROM_DEVICE. But in practice, we may have the outer cache,
> > hence a DSB is required before the outer_sync() (we could move it there
> > though).
> 
> The general use case for outer_sync() is "dsb(); outer_sync();". Why
> would we want to change this to "dsb(); dmb(); outer_sync();" (where
> the dmb is in outer_sync() itself)?
> 
> Seems more sensible for it to stay at the outer_sync() call site where
> it's needed.

You are right, it gets worse for the wmb() case if we change
outer_sync(); I was thinking about cache maintenance.

An optimisation would be for functions like v7_dma_inv_range() to no
longer have the dsb, moving it instead to __dma_page_cpu_to_dev() before
the outer_*_range() ops. If we assume that a streaming DMA is started by
a writel() access, which already has a dsb, then in the absence of an
outer cache we wouldn't need any dsb at all; hence something like a
conditional sync_for_outer() barrier (a dsb if there is an outer cache,
a no-op otherwise) in __dma_page_cpu_to_dev().

In __dma_page_dev_to_cpu() we wouldn't need any dsb at all for cache
maintenance, since subsequent accesses to the same address are ordered by
the hardware (and outer cache maintenance is done before the inner
anyway).

(that's from an ARMv7 perspective, we need to check ordering on earlier
architectures)
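
A sketch of that idea (sync_for_outer() being the hypothetical helper):

	static inline void sync_for_outer(void)
	{
		if (IS_ENABLED(CONFIG_OUTER_CACHE))
			dsb();	/* only needed before outer cache ops */
	}

	/* __dma_page_cpu_to_dev() would then do:
	 *	dmac_map_area(...);	inner maintenance, no final dsb
	 *	sync_for_outer();
	 *	outer_clean_range(...);
	 */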
Catalin Marinas April 24, 2014, 10:47 a.m. UTC | #23
On Wed, Apr 23, 2014 at 08:04:48PM +0100, Russell King - ARM Linux wrote:
> On Wed, Apr 23, 2014 at 08:58:05PM +0200, Arnd Bergmann wrote:
> > On Wednesday 23 April 2014 19:37:42 Russell King - ARM Linux wrote:
> > > On Wed, Apr 23, 2014 at 06:17:27PM +0100, Will Deacon wrote:
> > > > On Wed, Apr 23, 2014 at 05:02:16PM +0100, Catalin Marinas wrote:
> > > > > In the I/O coherency case, I would say it is the responsibility of the
> > > > > device/hardware to ensure that the data is visible to all observers
> > > > > (CPUs) prior to issuing an interrupt for DMA-ready. Looking at the mvebu
> > > > > code, I think it covers the from-device and bidirectional
> > > > > scenarios.
> > > > > 
> > > > > Maybe Santosh still has a point ;) but I don't know what the right
> > > > > barrier would be here. And I really *hate* per-SoC/snoop unit barriers
> > > > > (I still hope a dsb would do the trick on newer/ARMv8 systems).
> > > > 
> > > > If you have device interrupts which are asynchronous to memory coherency,
> > > > then you're in a world of pain. I can't think of a generic (architected)
> > > > solution to this problem, unfortunately -- it's going to be both device
> > > > and interconnect specific. Adding dsbs doesn't necessarily help at all.
> > > 
> > > Think, network devices with NAPI handling.  There, we explicitly turn
> > > off the device's interrupt, and switch to software polling for received
> > > packets.
> > >
> > > The memory for the packets has already been mapped, and we're unmapping
> > > the buffer, and then reading from it (to locate the ether type, and/or
> > > vlan headers) before passing it up the network stack.
> > > 
> > > So in this case, we need to ensure that the cache operations are ordered
> > > before the subsequent loads read from the DMA'd data.  It's purely an
> > > ordering thing, it's not a completion thing.
> > 
> > PCI guarantees this, but I have seen systems in the past (on PowerPC) that
> > would violate them on the internal interconnect: You could sometimes see the
> > completion DMA data in the descriptor ring before the actual user data
> > is there. We only ever observed it in combination with an IOMMU, when the
> > descriptor address had a valid IOTLB but the data address did not.
> 
> What is done on downstream buses is of no concern to the behaviour of
> the CPU, which is what's being discussed here (in terms of barriers)
> and the correct CPU ordering of various reads/writes to memory and
> devices vs. the streaming cache operations.

It is still of concern because, for cases like NAPI, an rmb() on the CPU
side is no longer enough. If you use a common network driver, written
correctly with rmb(), but you have some weird interconnect which doesn't
ensure ordering, you have to add an interconnect-specific barrier to the
rmb() (or hack the dma ops like mvebu). I consider such hardware broken.
Will Deacon April 24, 2014, 10:58 a.m. UTC | #24
Hi Arnd,

On Wed, Apr 23, 2014 at 07:58:05PM +0100, Arnd Bergmann wrote:
> Another problem is MSI processing. MSI was specifically invented to avoid
> having to check an MMIO register for a DMA completion that as a side-effect
> flushes pending DMAs from the same device. This breaks down if the MSI
> packet gets turned into a level interrupt before it reaches the CPU's
> coherency domain, which is likely the case on the dw-pcie controller that
> comes with its own MSI block.

I'm not sure there's anything special about MSI which helps with this
problem. For GICv3, the MSI write will target the ITS (a slave device),
whereas the data produced is presumably targeting main memory. That still
requires careful ordering by the producer, in the same way as if it were
signalling a legacy interrupt.

Will
Russell King - ARM Linux April 24, 2014, 11:13 a.m. UTC | #25
On Thu, Apr 24, 2014 at 10:54:23AM +0100, Catalin Marinas wrote:
> On Wed, Apr 23, 2014 at 07:37:42PM +0100, Russell King - ARM Linux wrote:
> > However, what must not happen is for the unmap to be re-ordered
> > before reading the descriptor and deciding whether there's a packet
> > present to be unmapped.  That probably implies that code _should_ be
> > doing this:
> > 
> > 	status = desc->status;
> > 	if (!(status & CPU_OWNS_THIS_DESCRIPTOR))
> > 		no_packet;
> > 
> > 	rmb();
> > 
> > 	addr = desc->buf;
> > 	len = desc->length;
> > 
> > 	dma_unmap_single(dev, addr, len, DMA_FROM_DEVICE);
> 
> Indeed.
> 
> > 	...receive skb...reading buffer...
> 
> The point Will and I were trying to make is about ordering as
> observed by the CPU (rather than ordering of CPU actions). Independently
> of whether the DMA is coherent or not, the device write to the buffer
> and the status update (in-memory descriptor or IRQ) *must* be
> ordered by the device and the interconnects. The rmb() as per your example
> above only orders the relative CPU loads and cache maintenance; it
> doesn't have any effect on the transactions done by the device.

If the hardware does not ensure proper ordering of the writes to the
buffer vs writes to the descriptor indicating that the packet has been
received, then the hardware is quite simply broken.

> The mvebu SoC has exactly this problem. mvebu_hwcc_sync_io_barrier() is
> something that should be handled by the hardware automatically,
> especially in a system which claims cache coherency.

So what if the descriptor also contains other data, such as a received
timestamp, or separate error bits that need checking before the DMA unmap?
(There are implementations where this is true - for example the Freescale
FEC driver, but that currently omits the rmb()...)

Stuffing this into the DMA API may not be the best place for it, but
the alternative is to stuff it into drivers, and that's just not
acceptable for any generic driver - plus we want to keep things as simple
as possible for driver authors, who probably don't want to understand any
of these details.
Russell King - ARM Linux April 24, 2014, 11:15 a.m. UTC | #26
On Thu, Apr 24, 2014 at 11:47:37AM +0100, Catalin Marinas wrote:
> On Wed, Apr 23, 2014 at 08:04:48PM +0100, Russell King - ARM Linux wrote:
> > What is done on downstream buses is of no concern to the behaviour of
> > the CPU, which is what's being discussed here (in terms of barriers)
> > and the correct CPU ordering of various reads/writes to memory and
> > devices vs. the streaming cache operations.
> 
> It is still of concern because, for cases like NAPI, an rmb() on the CPU
> side is no longer enough. If you use a common network driver, written
> correctly with rmb(), but you have some weird interconnect which doesn't
> ensure ordering, you have to add an interconnect-specific barrier to the
> rmb() (or hack the dma ops like mvebu). I consider such hardware broken.

Yes, the hardware /is/ broken, but if you want to get it working in a
way that's acceptable in upstream kernels, adding that barrier to rmb()
is probably the only acceptable solution - especially if you have other
stuff going in between the rmb() and the DMA unmap.
Will Deacon April 24, 2014, 11:21 a.m. UTC | #27
On Thu, Apr 24, 2014 at 12:15:47PM +0100, Russell King - ARM Linux wrote:
> On Thu, Apr 24, 2014 at 11:47:37AM +0100, Catalin Marinas wrote:
> > On Wed, Apr 23, 2014 at 08:04:48PM +0100, Russell King - ARM Linux wrote:
> > > What is done on down-stream buses is of no concern to the behaviour of
> > > the CPU, which is what's being discussed here (in terms of barriers)
> > > and the correct CPU ordering of various read/writes to memory and
> > > devices vs the streaming cache operations.
> > 
> > It is still of concern because for cases like NAPI an rmb() on the CPU
> > side is no longer enough. If you use a common network driver, written
> > correctly with rmb(), but you have some weird interconnect which doesn't
> > ensure ordering, you have to add an interconnect-specific barrier to the
> > rmb() (or hack the dma ops like mvebu). I consider such hardware broken.
> 
> Yes, the hardware /is/ broken, but if you want to get it working in a
> way that's acceptable in upstream kernels, adding that barrier to rmb()
> is probably the only acceptable solution - especially if you have other
> stuff going in between the rmb() and the DMA unmap.

The problem then is that the additional barrier may well require
bus-specific knowledge and access to parts of bus driver code which we can't
use inside the rmb() macro. To solve this properly, the bus topology topic
once again rears its ugly head, as I think you'd need a callback from the
device driver to the bus on which it resides in order to provide the
appropriate barrier (which isn't something that can be done sanely for
the platform_bus).

Will
Arnd Bergmann April 24, 2014, 12:12 p.m. UTC | #28
On Thursday 24 April 2014 11:58:46 Will Deacon wrote:
> 
> On Wed, Apr 23, 2014 at 07:58:05PM +0100, Arnd Bergmann wrote:
> > Another problem is MSI processing. MSI was specifically invented to avoid
> > having to check an MMIO register for a DMA completion that as a side-effect
> > flushes pending DMAs from the same device. This breaks down if the MSI
> > packet gets turned into a level interrupt before it reaches the CPU's
> > coherency domain, which is likely the case on the dw-pcie controller that
> > comes with its own MSI block.
> 
> I'm not sure there's anything special about MSI which helps with this
> problem. For GICv3, the MSI write will target the ITS (a slave device),
> whereas the data produced is assumedly targeting main memory. That still
> requires careful ordering by the producer, in the same way as if it was
> signalling a legacy interrupt.

With legacy interrupts a PCI bus master has no way to order the transactions:
It initiates the DMA to memory, but does not wait for the DMA to complete.
It raises the interrupt line, which causes the interrupt handler of the
driver to start. The driver then reads a status register from the bus master
device, and this read is ordered with respect to the DMA that may still
be in progress at the time. Any PCI driver that works with legacy interrupts
and DMA has to do this, and the PCI host controller has to ensure that these
ordering semantics are maintained on the upstream buses.
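
As a sketch of that pattern (the foo_*/FOO_* names are invented for
illustration):

	static irqreturn_t foo_legacy_irq(int irq, void *dev_id)
	{
		struct foo_priv *priv = dev_id;
		u32 status;

		/*
		 * Non-relaxed MMIO read from the bus master: the host
		 * controller must not complete this read ahead of DMA
		 * writes the device initiated before raising the line.
		 */
		status = readl(priv->base + FOO_DMA_STATUS);
		if (!(status & FOO_DMA_DONE))
			return IRQ_NONE;

		/* The DMA'd data in memory is now visible to the CPU. */
		foo_process_buffer(priv);
		return IRQ_HANDLED;
	}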

The difference with MSI is that the driver does not have to do an MMIO read
transaction, and that the host controller has to ensure ordering between
the (possibly weakly ordered) data DMA and the MSI transaction, rather than
between the DMA and the MMIO read. These two are not the same, and it's
totally possible for a broken implementation to get one of them right
but not the other.

	Arnd
Will Deacon April 24, 2014, 12:37 p.m. UTC | #29
On Thu, Apr 24, 2014 at 01:12:16PM +0100, Arnd Bergmann wrote:
> On Thursday 24 April 2014 11:58:46 Will Deacon wrote:
> > On Wed, Apr 23, 2014 at 07:58:05PM +0100, Arnd Bergmann wrote:
> > > Another problem is MSI processing. MSI was specifically invented to avoid
> > > having to check an MMIO register for a DMA completion that as a side-effect
> > > flushes pending DMAs from the same device. This breaks down if the MSI
> > > packet gets turned into a level interrupt before it reaches the CPU's
> > > coherency domain, which is likely the case on the dw-pcie controller that
> > > comes with its own MSI block.
> > 
> > I'm not sure there's anything special about MSI which helps with this
> > problem. For GICv3, the MSI write will target the ITS (a slave device),
> > whereas the data produced is assumedly targeting main memory. That still
> > requires careful ordering by the producer, in the same way as if it was
> > signalling a legacy interrupt.
> 
> With legacy interrupts a PCI bus master has no way to order the transactions:
> It initiates the DMA to memory, but does not wait for the DMA to complete.
> It raises the interrupt line, which causes the interrupt handler of the
> driver to start. The driver then reads a status register from the bus master
> device, and this read is ordered with respect to the DMA that may still
> be in progress at the time. Any PCI driver that works with legacy interrupts
> and DMA has to do this, and the PCI host controller has to ensure that these
> ordering semantics are maintained on the upstream buses.
> 
> The difference with MSI is that the driver does not have to do an MMIO read
> transaction, and that the host controller has to ensure ordering between
> the (possibly weakly ordered) data DMA and the MSI transaction, rather than
> between the DMA and the MMIO read. These two are not the same, and it's
> totally possible for a broken implementation to get one of them right
> but not the other.

Ok, so the problem of enforcing ordering (from the CPU's perspective) moves
from the endpoint to the host controller. That's certainly an improvement,
but it doesn't seem unlikely for the host controller to screw that up :(

Will
Santosh Shilimkar April 24, 2014, 1:38 p.m. UTC | #30
On Thursday 24 April 2014 07:21 AM, Will Deacon wrote:
> On Thu, Apr 24, 2014 at 12:15:47PM +0100, Russell King - ARM Linux wrote:
>> On Thu, Apr 24, 2014 at 11:47:37AM +0100, Catalin Marinas wrote:
>>> On Wed, Apr 23, 2014 at 08:04:48PM +0100, Russell King - ARM Linux wrote:
>>>> What is done on down-stream buses is of no concern to the behaviour of
>>>> the CPU, which is what's being discussed here (in terms of barriers)
>>>> and the correct CPU ordering of various read/writes to memory and
>>>> devices vs the streaming cache operations.
>>>
>>> It is still of concern because for cases like NAPI an rmb() on the CPU
>>> side is no longer enough. If you use a common network driver, written
>>> correctly with rmb(), but you have some weird interconnect which doesn't
>>> ensure ordering, you have to add an interconnect-specific barrier to the
>>> rmb() (or hack the dma ops like mvebu). I consider such hardware broken.
>>
>> Yes, the hardware /is/ broken, but if you want to get it working in a
>> way that's acceptable in upstream kernels, adding that barrier to rmb()
>> is probably the only acceptable solution - especially if you have other
>> stuff going in between the rmb() and the DMA unmap.
> 
> The problem then is that the additional barrier may well require
> bus-specific knowledge and access to parts of bus driver code which we can't
> use inside the rmb() macro. To solve this properly, the bus topology topic
> once again rears its ugly head, as I think you'd need a callback from the
> device driver to the bus on which it resides in order to provide the
> appropriate barrier (which isn't something that can be done sanely for
> the platform_bus).
> 
Not exactly arguing against the bus notifier point, but we can't afford such
notifier calls in hot paths, especially gigabit network drivers' per-packet
processing paths, where even a 10-cycle cost has a huge impact on throughput.

Interconnect barriers are really needed for completion. I think CPUs within
at least the same cluster will be ordered with rmb(), but the same is not
true when you have multiple clusters and, further down, the coherent
interconnect comes into the picture, where all the other non-CPU coherent
masters are participating.

If rmb() has to reach all the way to the (non-CPU) coherent masters, then I
suspect most of the ARM coherent architectures are broken. On any typical
SoC, the ARM CPUs are bolted to other coherent masters at the AXI boundary,
or maybe via ACP interfaces. At this level rmb() isn't good enough and you
at least need a dsb() for completion.

So in my view, unless and until you have features like DVM in hardware, a
dsb() is needed to guarantee even the ordering among CPUs sitting across
clusters.

Regards,
Santosh
Will Deacon April 24, 2014, 2:09 p.m. UTC | #31
On Thu, Apr 24, 2014 at 02:38:28PM +0100, Santosh Shilimkar wrote:
> On Thursday 24 April 2014 07:21 AM, Will Deacon wrote:
> > On Thu, Apr 24, 2014 at 12:15:47PM +0100, Russell King - ARM Linux wrote:
> >> Yes, the hardware /is/ broken, but if you want to get it working in a
> >> way that's acceptable in upstream kernels, adding that barrier to rmb()
> >> is probably the only acceptable solution - especially if you have other
> >> stuff going in between the rmb() and the DMA unmap.
> > 
> > The problem then is that the additional barrier may well require
> > bus-specific knowledge and access to parts of bus driver code which we can't
> > use inside the rmb() macro. To solve this properly, the bus topology topic
> > once again rears its ugly head, as I think you'd need a callback from the
> > device driver to the bus on which it resides in order to provide the
> > appropriate barrier (which isn't something that can be done sanely for
> > the platform_bus).
> > 
> Not exactly arguing against the bus notifier point, but we can't afford such
> notifier calls in hot paths, especially gigabit network drivers' per-packet
> processing paths, where even a 10-cycle cost has a huge impact on throughput.

I don't think anybody is suggesting that you do this per-packet. This is a
per-DMA-transfer barrier, which is required anyway. The details of the
barrier are what varies, and are likely bus-specific.

> Interconnect barriers are really needed for completion. I think CPUs within
> at least the same cluster will be ordered with rmb(), but the same is not
> true when you have multiple clusters and, further down, the coherent
> interconnect comes into the picture, where all the other non-CPU coherent
> masters are participating.

You're making a lot of rash generalisations here. The architected barrier
instructions as used by Linux will work perfectly well within the
inner-shareable domain. That means you don't need to worry about
multiple-clusters of CPUs.

However, you can't read *anything* into how a barrier instruction executed
on the CPU affects writes from another master; there is inherently a race
there which must be dealt with by either the external master or some
implementation-specific action by the CPU. This is the real problem.

> If rmb() has to reach all the way to the (non-CPU) coherent masters, then I
> suspect most of the ARM coherent architectures are broken. On any typical
> SoC, the ARM CPUs are bolted to other coherent masters at the AXI boundary,
> or maybe via ACP interfaces. At this level rmb() isn't good enough and you
> at least need a dsb() for completion.

An rmb() expands to dsb, neither of which gives you anything in this scenario
as described by the architecture.

> So in my view, unless and until you have features like DVM in hardware, a
> dsb() is needed to guarantee even the ordering among CPUs sitting across
> clusters.

Firstly, you can only have multiple clusters of CPUs running with a single
Linux image if hardware coherency is supported between them. In this case,
all the CPUs will live in the same inner-shareable domain and dmb ish is
sufficient to enforce ordering between them.

Secondly, a dsb executed by a CPU is irrelevant to ordering of accesses by
an external peripheral, regardless of whether that peripheral is cache
coherent. If you think about this as a producer/consumer problem, you need
ordering at *both* ends to make any guarantees.

Will
Santosh Shilimkar April 24, 2014, 2:44 p.m. UTC | #32
On Thursday 24 April 2014 10:09 AM, Will Deacon wrote:
> On Thu, Apr 24, 2014 at 02:38:28PM +0100, Santosh Shilimkar wrote:
>> On Thursday 24 April 2014 07:21 AM, Will Deacon wrote:
>>> On Thu, Apr 24, 2014 at 12:15:47PM +0100, Russell King - ARM Linux wrote:
>>>> Yes, the hardware /is/ broken, but if you want to get it working in a
>>>> way that's acceptable in upstream kernels, adding that barrier to rmb()
>>>> is probably the only acceptable solution - especially if you have other
>>>> stuff going in between the rmb() and the DMA unmap.
>>>
>>> The problem then is that the additional barrier may well require
>>> bus-specific knowledge and access to parts of bus driver code which we can't
>>> use inside the rmb() macro. To solve this properly, the bus topology topic
>>> once again rears its ugly head, as I think you'd need a callback from the
>>> device driver to the bus on which it resides in order to provide the
>>> appropriate barrier (which isn't something that can be done sanely for
>>> the platform_bus).
>>>
>> Not exactly arguing against the bus notifier point, but we can't afford such
>> notifier calls in hot paths, especially gigabit network drivers' per-packet
>> processing paths, where even a 10-cycle cost has a huge impact on throughput.
> 
> I don't think anybody is suggesting that you do this per-packet. This is a
> per-DMA-transfer barrier, which is required anyway. The details of the
> barrier are what varies, and are likely bus-specific.
>
Fair enough.
 
>> Interconnect barriers are really needed for completion. I think CPUs within
>> at least the same cluster will be ordered with rmb(), but the same is not
>> true when you have multiple clusters and, further down, the coherent
>> interconnect comes into the picture, where all the other non-CPU coherent
>> masters are participating.
> 
> You're making a lot of rash generalisations here. The architected barrier
> instructions as used by Linux will work perfectly well within the
> inner-shareable domain. That means you don't need to worry about
> multiple-clusters of CPUs.
> 
> However, you can't read *anything* into how a barrier instruction executed
> on the CPU affects writes from another master; there is inherently a race
> there which must be dealt with by either the external master or some
> implementation-specific action by the CPU. This is the real problem.
> 
>> If rmb() has to reach all the way to the (non-CPU) coherent masters, then I
>> suspect most of the ARM coherent architectures are broken. On any typical
>> SoC, the ARM CPUs are bolted to other coherent masters at the AXI boundary,
>> or maybe via ACP interfaces. At this level rmb() isn't good enough and you
>> at least need a dsb() for completion.
> 
> An rmb() expands to dsb, neither of which gives you anything in this scenario
> as described by the architecture.
>
My bad... I don't know for what reason I thought rmb() just expands to dmb().
Just ignore that point.
 
>> So in my view, unless and until you have features like DVM in hardware, a
>> dsb() is needed to guarantee even the ordering among CPUs sitting across
>> clusters.
> 
> Firstly, you can only have multiple clusters of CPUs running with a single
> Linux image if hardware coherency is supported between them. In this case,
> all the CPUs will live in the same inner-shareable domain and dmb ish is
> sufficient to enforce ordering between them.
> 
Thanks for expanding and correcting me. The inner-shareable domain, if
implemented correctly, should take care of it.

> Secondly, a dsb executed by a CPU is irrelevant to ordering of accesses by
> an external peripheral, regardless of whether that peripheral is cache
> coherent. If you think about this as a producer/consumer problem, you need
> ordering at *both* ends to make any guarantees.
> 
Agreed. Isn't this an assumption we are making in the coherent DMA streaming
case? Of course, we are talking about io*mb() in that case, which could be
more than a dsb() if needed, but it's actually a dsb() for A15-class devices.
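
For reference, on ARMv7 these roughly expand as follows (an approximate
sketch of the asm/barrier.h and asm/io.h definitions, with dsb option
variants omitted; check the actual headers for the exact forms):

	#define rmb()		dsb()
	#define wmb()		do { dsb(); outer_sync(); } while (0)
	#define __iormb()	rmb()
	#define __iowmb()	wmb()
	#define readl(c)	({ u32 __v = readl_relaxed(c); __iormb(); __v; })
	#define writel(v, c)	({ __iowmb(); writel_relaxed(v, c); })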

Maybe I was not clear, but if we are saying that only ordering needs to be
guaranteed, and not completion, then we have all the expected behaviour,
and converting the existing non-coherent dma_ops barriers to dmb() is
right. My concern is that completion is important for the external master
cases. So what's the expectation in those cases from producer and
consumer?

DMA_FROM_DEVICE case ... DMA->producer, CPU->consumer
1. DMA updates the main memory with correct descriptors and buffers.
2. CPU performs the dma_ops() to take over buffer ownership. In the coherent
DMA case this is a NOP.
** At this point the DMA has guaranteed ordering as well as completion.
3. CPU operates on the buffer/descriptor, which is correct.

DMA_TO_DEVICE: CPU->producer, DMA->consumer
1. CPU fills a descriptor/buffer in memory for the DMA to pick up.
2. CPU performs the necessary dma_op(), which in the coherent case is a NOP...
** Here I agree the ordering across all CPUs within the cluster is guaranteed
as far as the descriptor memory view is concerned,
but what is produced by the CPU is not visible to the DMA yet, so completion
isn't guaranteed.
3. If the DMA kicks off the transfer assuming producer (CPU) completion, then
that doesn't work.
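
To make step 3 concrete, here is a minimal sketch of the failure I am
worried about (the descriptor fields and the DMA_KICK doorbell register
are hypothetical, invented for illustration):

	/* CPU produces the descriptor... */
	desc->buf = dma_addr;
	desc->len = len;
	desc->status = DEV_OWNS_THIS_DESCRIPTOR;

	/* ...the coherent sync op is a plain NOP, so no barrier here... */
	dma_sync_single_for_device(dev, desc_dma, sizeof(*desc),
				   DMA_TO_DEVICE);

	/*
	 * ...and a relaxed doorbell write can reach the device while the
	 * descriptor writes are still sitting in the CPU write buffers.
	 */
	writel_relaxed(1, priv->base + DMA_KICK);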


Regards,
Santosh
Russell King - ARM Linux April 24, 2014, 7:12 p.m. UTC | #33
On Thu, Apr 24, 2014 at 10:44:58AM -0400, Santosh Shilimkar wrote:
> DMA_TO_DEVICE: CPU->producer, DMA->consumer
> 1. CPU fills a descriptor/buffer in memory for the DMA to pick up.
> 2. CPU performs the necessary dma_op(), which in the coherent case is a NOP...
> ** Here I agree the ordering across all CPUs within the cluster is guaranteed
> as far as the descriptor memory view is concerned,
> but what is produced by the CPU is not visible to the DMA yet, so completion
> isn't guaranteed.
> 3. If the DMA kicks off the transfer assuming producer (CPU) completion, then
> that doesn't work.

Step 3 should be done via a writel(), which is a dsb() followed by an
outer_sync() followed by the actual write to the register.

The dsb and outer_sync are there to ensure that the previous writes to
things like DMA coherent memory are visible to the device before the
device sees the write to its register.

Moreover, if there are descriptors in DMA coherent memory, and there is
a bit in them which must be set to hand ownership over to the device
(e.g., as in an ethernet driver), then the driver _additionally_ has to
add a barrier between the remainder of the descriptor update and handing
the descriptor over, and that barrier should ensure that *any* effects
prior to the barrier are seen before the effects of the accesses after
it.
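
Concretely, such a driver already has to look something like this
(hypothetical descriptor fields and doorbell register):

	/* Fill in everything except the ownership bit... */
	desc->buf = dma_addr;
	desc->len = len;

	/* ...order those writes before handing the descriptor over... */
	wmb();
	desc->status = DEV_OWNS_THIS_DESCRIPTOR;

	/*
	 * writel() = dsb() + outer_sync() + the register write, so the
	 * device cannot observe the doorbell before the descriptor
	 * updates above.
	 */
	writel(1, priv->base + DMA_KICK);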

That said, in __dma_page_cpu_to_dev() we clean the L1 cache followed by
the L2 cache.  The effects of cleaning out the L1 cache must be seen by
the L2 cache before the effects of cleaning the L2 cache.  So we _do_
have an ordering requirement there, which is purely down to the
implementation and not down to any other requirements.
Joel Fernandes May 2, 2014, 9:33 p.m. UTC | #34
Hey Will,

On Wed, Apr 23, 2014 at 12:17 PM, Will Deacon <will.deacon@arm.com> wrote:
> On Wed, Apr 23, 2014 at 05:02:16PM +0100, Catalin Marinas wrote:
>> On Wed, Apr 23, 2014 at 10:02:51AM +0100, Will Deacon wrote:
>> > On Tue, Apr 22, 2014 at 09:30:27PM +0100, Santosh Shilimkar wrote:
>> > > writel() or an explicit barrier in the driver will do the job. I was
>> > > just thinking that we are trying to work around the shortcomings
>> > > of the streaming API by adding barriers in the driver. For example
>> > > on a non-coherent system, I don't need that barrier because
>> > > dma_ops does take care of that.
>> >
>> > I wonder whether we can remove those barriers altogether then (from the DMA
>> > cache operations). For the coherent case, the driver must provide the
>> > barrier (probably via writel) so the non-coherent case shouldn't be any
>> > different.
>>
>> For the DMA_TO_DEVICE case the effect should be the same as wmb()
>> implies dsb (and outer_sync() for write). But the reason we have
>> barriers in the DMA ops is slightly different - the completion of the
>> cache maintenance operation rather than ordering with any previous
>> writes to the DMA buffer.
>>
>> In the DMA_FROM_DEVICE scenario for example, the CPU gets an interrupt
>> for a finished DMA transfer and executes dma_unmap_single() prior to
>> accessing the page. However the CPU access after unmapping is done using
>> normal LDR/STR which do not imply any barrier. So we need to ensure the
>> completion of the cache invalidation in the dma operation.
>
> I don't think we necessarily need completion, we just need ordering. That
> is, the normal LDR/STR instructions must be observed after the cache
> maintenance. I'll have to revisit the ARM ARM to be sure of this, but a dmb
> should be sufficient for that guarantee.

Just wondering if you were convinced from the ARM ARM that a dsb is
not required after cache maintenance for the DMA_FROM_DEVICE case?

thanks,
-Joel
Will Deacon May 6, 2014, 10:01 a.m. UTC | #35
On Fri, May 02, 2014 at 10:33:25PM +0100, Joel Fernandes wrote:
> Hey Will,

Hi Joel,

> On Wed, Apr 23, 2014 at 12:17 PM, Will Deacon <will.deacon@arm.com> wrote:
> > On Wed, Apr 23, 2014 at 05:02:16PM +0100, Catalin Marinas wrote:
> >> On Wed, Apr 23, 2014 at 10:02:51AM +0100, Will Deacon wrote:
> >> > On Tue, Apr 22, 2014 at 09:30:27PM +0100, Santosh Shilimkar wrote:
> >> > > writel() or an explicit barrier in the driver will do the job. I was
> >> > > just thinking that we are trying to work around the shortcomings
> >> > > of the streaming API by adding barriers in the driver. For example
> >> > > on a non-coherent system, I don't need that barrier because
> >> > > dma_ops does take care of that.
> >> >
> >> > I wonder whether we can remove those barriers altogether then (from the DMA
> >> > cache operations). For the coherent case, the driver must provide the
> >> > barrier (probably via writel) so the non-coherent case shouldn't be any
> >> > different.
> >>
> >> For the DMA_TO_DEVICE case the effect should be the same as wmb()
> >> implies dsb (and outer_sync() for write). But the reason we have
> >> barriers in the DMA ops is slightly different - the completion of the
> >> cache maintenance operation rather than ordering with any previous
> >> writes to the DMA buffer.
> >>
> >> In the DMA_FROM_DEVICE scenario for example, the CPU gets an interrupt
> >> for a finished DMA transfer and executes dma_unmap_single() prior to
> >> accessing the page. However the CPU access after unmapping is done using
> >> normal LDR/STR which do not imply any barrier. So we need to ensure the
> >> completion of the cache invalidation in the dma operation.
> >
> > I don't think we necessarily need completion, we just need ordering. That
> > is, the normal LDR/STR instructions must be observed after the cache
> > maintenance. I'll have to revisit the ARM ARM to be sure of this, but a dmb
> > should be sufficient for that guarantee.
> 
> Just wondering if you were convinced from the ARM ARM that a dsb is
> not required after cache maintenance for the DMA_FROM_DEVICE case?

It's not quite as clear-cut as that. For AArch32, the cache-maintenance
operations (for inner-caches) will be ordered with respect to one another
without the need for additional barriers. Furthermore, ordering is also
guaranteed with respect to normal load/store instructions if the buffer is
mapped as normal-cacheable and accessed via the same VA with which the
maintenance was performed.

For the DMA_FROM_DEVICE case, this then starts to sound pretty good but
there are a couple of spanners thrown into the works (and these have been
discussed earlier in the thread):

  (1) An IMP DEF operation can be required to publish data from the device
      after a completion interrupt is received.

  (2) Outer cache maintenance will require a dsb before (to ensure
      completion of maintenance on the inner caches) and after (to ensure
      completion before accesses to the buffer).

(1) could be solved by either adding a new driver API function or by
piggy-backing on rmb(). (2) could be solved by adding extra barriers to our
outer_cache implementations, but that needs some careful thought to avoid
penalising performance unnecessarily.

Will
diff mbox

Patch

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 5260f43..7c9f55d 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -83,6 +83,8 @@  static dma_addr_t arm_coherent_dma_map_page(struct device *dev, struct page *pag
 	     unsigned long offset, size_t size, enum dma_data_direction dir,
 	     struct dma_attrs *attrs)
 {
+	/* Drain cpu write buffer to main memory */
+	wmb();
 	return pfn_to_dma(dev, page_to_pfn(page)) + offset;
 }
 
@@ -117,6 +119,13 @@  static void arm_dma_sync_single_for_cpu(struct device *dev,
 	__dma_page_dev_to_cpu(page, offset, size, dir);
 }
 
+static void arm_coherent_dma_sync_single_for_cpu(struct device *dev,
+		dma_addr_t handle, size_t size, enum dma_data_direction dir)
+{
+	/* Drain cpu write buffer to main memory */
+	wmb();
+}
+
 static void arm_dma_sync_single_for_device(struct device *dev,
 		dma_addr_t handle, size_t size, enum dma_data_direction dir)
 {
@@ -125,6 +134,33 @@  static void arm_dma_sync_single_for_device(struct device *dev,
 	__dma_page_cpu_to_dev(page, offset, size, dir);
 }
 
+/**
+ * arm_dma_sync_sg_for_cpu
+ * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices
+ * @sg: list of buffers
+ * @nents: number of buffers to map (returned from dma_map_sg)
+ * @dir: DMA transfer direction (same as was passed to dma_map_sg)
+ */
+void arm_dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg,
+			int nents, enum dma_data_direction dir)
+{
+	struct dma_map_ops *ops = get_dma_ops(dev);
+	struct scatterlist *s;
+	int i;
+
+	for_each_sg(sg, s, nents, i)
+		ops->sync_single_for_cpu(dev, sg_dma_address(s), s->length,
+					 dir);
+}
+
+void arm_coherent_dma_sync_sg_for_cpu(struct device *dev,
+			struct scatterlist *sg,
+			int nents, enum dma_data_direction dir)
+{
+	/* Drain cpu write buffer to main memory */
+	wmb();
+}
+
 struct dma_map_ops arm_dma_ops = {
 	.alloc			= arm_dma_alloc,
 	.free			= arm_dma_free,
@@ -154,6 +190,8 @@  struct dma_map_ops arm_coherent_dma_ops = {
 	.get_sgtable		= arm_dma_get_sgtable,
 	.map_page		= arm_coherent_dma_map_page,
 	.map_sg			= arm_dma_map_sg,
+	.sync_single_for_cpu	= arm_coherent_dma_sync_single_for_cpu,
+	.sync_sg_for_cpu	= arm_coherent_dma_sync_sg_for_cpu,
 	.set_dma_mask		= arm_dma_set_mask,
 };
 EXPORT_SYMBOL(arm_coherent_dma_ops);
@@ -994,25 +1032,6 @@  void arm_dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
 }
 
 /**
- * arm_dma_sync_sg_for_cpu
- * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices
- * @sg: list of buffers
- * @nents: number of buffers to map (returned from dma_map_sg)
- * @dir: DMA transfer direction (same as was passed to dma_map_sg)
- */
-void arm_dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg,
-			int nents, enum dma_data_direction dir)
-{
-	struct dma_map_ops *ops = get_dma_ops(dev);
-	struct scatterlist *s;
-	int i;
-
-	for_each_sg(sg, s, nents, i)
-		ops->sync_single_for_cpu(dev, sg_dma_address(s), s->length,
-					 dir);
-}
-
-/**
  * arm_dma_sync_sg_for_device
  * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices
  * @sg: list of buffers