Message ID: 1398103390-31968-1-git-send-email-santosh.shilimkar@ti.com
State: New
Hi Santosh,

On Mon, Apr 21, 2014 at 07:03:10PM +0100, Santosh Shilimkar wrote:
> ARM coherent CPU dma map APIs are assumed to be nops on cache coherent
> machines. While this is true, one still needs to ensure that no
> outstanding writes are pending in CPU write buffers. To take care
> of that, we at least need a memory barrier to commit those changes
> to main memory.
>
> The patch is trying to fix those cases. Without such a patch, you will
> end up patching device drivers to avoid the synchronisation issues.

Don't you only need these barriers if you're passing ownership of a CPU
buffer to a device? In that case, I would expect a subsequent writel() to
tell the device about the new buffer, which includes the required
__iowmb(). That's the reason for the relaxed accessors: to avoid this
barrier when it's not needed. Perhaps you're using the relaxed accessors
where you actually need the stronger ordering guarantees?

Will
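For illustration, a minimal sketch of the two accessor idioms being
contrasted above; the device, register offset and function names are
hypothetical, not taken from the patch:

	#include <linux/io.h>
	#include <linux/kernel.h>

	#define MYDEV_DOORBELL	0x00	/* hypothetical register */

	/* Hand a freshly written buffer to a hypothetical device. */
	static void mydev_kick(void __iomem *regs, dma_addr_t buf)
	{
		/* writel() includes __iowmb(), so earlier writes to the
		 * buffer are visible to the device before the doorbell. */
		writel(lower_32_bits(buf), regs + MYDEV_DOORBELL);
	}

	static void mydev_kick_relaxed(void __iomem *regs, dma_addr_t buf)
	{
		/* writel_relaxed() gives no such guarantee; the barrier
		 * must be explicit when ownership is transferred. */
		wmb();
		writel_relaxed(lower_32_bits(buf), regs + MYDEV_DOORBELL);
	}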
Hi Will,

On Tuesday 22 April 2014 06:28 AM, Will Deacon wrote:
> Hi Santosh,
>
> On Mon, Apr 21, 2014 at 07:03:10PM +0100, Santosh Shilimkar wrote:
>> ARM coherent CPU dma map APIs are assumed to be nops on cache coherent
>> machines. While this is true, one still needs to ensure that no
>> outstanding writes are pending in CPU write buffers. To take care
>> of that, we at least need a memory barrier to commit those changes
>> to main memory.
>>
>> The patch is trying to fix those cases. Without such a patch, you will
>> end up patching device drivers to avoid the synchronisation issues.
>
> Don't you only need these barriers if you're passing ownership of a CPU
> buffer to a device? In that case, I would expect a subsequent writel() to
> tell the device about the new buffer, which includes the required
> __iowmb(). That's the reason for the relaxed accessors: to avoid this
> barrier when it's not needed. Perhaps you're using the relaxed accessors
> where you actually need the stronger ordering guarantees?

I kind of guessed someone would bring up this point. In fact, this is how
people have mostly been living with the issue on coherent machines. On
Keystone too, we did explicit barriers in the respective drivers.

I have added these barriers only on the CPU-to-device streaming APIs
because in the other direction the memory is already up to date from the
CPU's perspective.

But if you look at the actual problem, it is really the responsibility of
the DMA streaming APIs, which we are trying to push onto drivers. A device
driver should be independent of whether it is running on a coherent or a
non-coherent system.

Let's take an example: an MMC controller driver running on non-coherent
and coherent machines. The driver has the below code sequence, which is
generic (see the sketch after this mail):

1. Prepare the SG list
2. Perform CMO using the DMA streaming API
3. Start the DMA transfer

Step 3 expects that step 2 has done its job and the buffer is completely
in main memory. And that is what also happens on non-coherent machines.

Now, on coherent machines, as you mentioned, we are saying drivers should
add a barrier because step 2 is just a NOP, which is not correct. Step 3
itself, which is just supposed to start the DMA, doesn't need any barrier
as such. This is the whole rationale behind the patch.

Regards,
Santosh
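For illustration, a minimal sketch of the three-step sequence described
above; the driver shape and register names are hypothetical:

	#include <linux/dma-mapping.h>
	#include <linux/io.h>
	#include <linux/kernel.h>

	#define MYDEV_DMA_ADDR		0x00	/* hypothetical registers */
	#define MYDEV_CTRL		0x04
	#define MYDEV_CTRL_START	0x1

	static int mydev_start_tx(struct device *dev, void __iomem *regs,
				  void *buf, size_t len)
	{
		dma_addr_t dma;

		/* Step 2: cache maintenance via the streaming API. On a
		 * non-coherent machine this cleans the caches (and ends
		 * with a barrier); on a coherent machine it is a NOP. */
		dma = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
		if (dma_mapping_error(dev, dma))
			return -ENOMEM;

		/* Step 3: start the transfer. Today it is the writel()
		 * here that provides the barrier on coherent machines. */
		writel(lower_32_bits(dma), regs + MYDEV_DMA_ADDR);
		writel(MYDEV_CTRL_START, regs + MYDEV_CTRL);
		return 0;
	}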
On Tuesday 22 April 2014, Santosh Shilimkar wrote:
> On Tuesday 22 April 2014 06:28 AM, Will Deacon wrote:
> > Don't you only need these barriers if you're passing ownership of a CPU
> > buffer to a device? In that case, I would expect a subsequent writel()
> > to tell the device about the new buffer, which includes the required
> > __iowmb(). That's the reason for the relaxed accessors: to avoid this
> > barrier when it's not needed. Perhaps you're using the relaxed
> > accessors where you actually need the stronger ordering guarantees?
>
> I kind of guessed someone would bring up this point. In fact, this is how
> people have mostly been living with the issue on coherent machines. On
> Keystone too, we did explicit barriers in the respective drivers.

You should not actually need explicit barriers in the drivers. As Will
said, you already do a writel() operation, which contains the implicit
wmb().

> I have added these barriers only on the CPU-to-device streaming APIs
> because in the other direction the memory is already up to date from the
> CPU's perspective.
>
> But if you look at the actual problem, it is really the responsibility of
> the DMA streaming APIs, which we are trying to push onto drivers. A
> device driver should be independent of whether it is running on a
> coherent or a non-coherent system.
>
> Let's take an example: an MMC controller driver running on non-coherent
> and coherent machines. The driver has the below code sequence, which is
> generic:
>
> 1. Prepare the SG list
> 2. Perform CMO using the DMA streaming API
> 3. Start the DMA transfer
>
> Step 3 expects that step 2 has done its job and the buffer is completely
> in main memory. And that is what also happens on non-coherent machines.
>
> Now, on coherent machines, as you mentioned, we are saying drivers
> should add a barrier because step 2 is just a NOP, which is not correct.
> Step 3 itself, which is just supposed to start the DMA, doesn't need any
> barrier as such. This is the whole rationale behind the patch.

That's not what the API is. The entire reason for having both writel()
and writel_relaxed() is that drivers rely on writel() to do the barrier.
Doing another barrier in the DMA operations would add unnecessary
overhead for every single driver.

It's not the nicest API ever, but that's what it is and has been, mostly
for compatibility with x86, where the 'mov' instruction performing the
store to MMIO registers implies that all writes to DMA memory are
visible to the device.

	Arnd
On Tuesday 22 April 2014 10:08 AM, Arnd Bergmann wrote:
> On Tuesday 22 April 2014, Santosh Shilimkar wrote:
>> [...]
>> Now, on coherent machines, as you mentioned, we are saying drivers
>> should add a barrier because step 2 is just a NOP, which is not correct.
>> Step 3 itself, which is just supposed to start the DMA, doesn't need any
>> barrier as such. This is the whole rationale behind the patch.
>
> That's not what the API is. The entire reason for having both writel()
> and writel_relaxed() is that drivers rely on writel() to do the barrier.
> Doing another barrier in the DMA operations would add unnecessary
> overhead for every single driver.
>
> It's not the nicest API ever, but that's what it is and has been, mostly
> for compatibility with x86, where the 'mov' instruction performing the
> store to MMIO registers implies that all writes to DMA memory are
> visible to the device.

This is not about writel() vs writel_relaxed(). The driver doesn't need
that barrier: for example, if the actual start of the DMA happens a bit
later, that doesn't matter to the driver.

The DMA APIs already do barriers today for the non-coherent case. We are
not talking about anything new here. Sorry, but I don't see the
connection.

Regards,
Santosh
On Tue, Apr 22, 2014 at 02:49:06PM +0100, Santosh Shilimkar wrote:
> On Tuesday 22 April 2014 06:28 AM, Will Deacon wrote:
> > Don't you only need these barriers if you're passing ownership of a
> > CPU buffer to a device? In that case, I would expect a subsequent
> > writel() to tell the device about the new buffer, which includes the
> > required __iowmb(). [...]
>
> [...]
> Let's take an example: an MMC controller driver running on non-coherent
> and coherent machines. The driver has the below code sequence, which is
> generic:
>
> 1. Prepare the SG list
> 2. Perform CMO using the DMA streaming API
> 3. Start the DMA transfer

The key here is how you start the DMA transfer. So far we have assumed
it's done via an I/O operation like writel(), which has the right
barriers. If we have other ways of starting it (like writing the dma_addr
into some other memory descriptor), should we use explicit memory
barriers?
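For illustration, a minimal sketch of the case Catalin raises: a device
that polls a descriptor ring in memory, so there is no writel() to supply
the barrier. The descriptor layout and ownership flag are hypothetical:

	#include <linux/types.h>
	#include <linux/kernel.h>
	#include <asm/barrier.h>

	#define DESC_HW_OWNED	0x1	/* hypothetical ownership flag */

	struct mydev_desc {
		__le32 addr;
		__le32 len;
		__le32 flags;
	};

	static void mydev_post(struct mydev_desc *desc, dma_addr_t dma,
			       u32 len)
	{
		desc->addr = cpu_to_le32(lower_32_bits(dma));
		desc->len = cpu_to_le32(len);

		/* No MMIO kick follows, so explicitly order the buffer
		 * and descriptor writes before handing ownership over. */
		wmb();
		desc->flags = cpu_to_le32(DESC_HW_OWNED);
	}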
On Tuesday 22 April 2014 11:07 AM, Catalin Marinas wrote:
> On Tue, Apr 22, 2014 at 02:49:06PM +0100, Santosh Shilimkar wrote:
>> [...]
>> Let's take an example: an MMC controller driver running on non-coherent
>> and coherent machines. The driver has the below code sequence, which is
>> generic:
>>
>> 1. Prepare the SG list
>> 2. Perform CMO using the DMA streaming API
>> 3. Start the DMA transfer
>
> The key here is how you start the DMA transfer. So far we have assumed
> it's done via an I/O operation like writel(), which has the right
> barriers. If we have other ways of starting it (like writing the
> dma_addr into some other memory descriptor), should we use explicit
> memory barriers?

My point is: why do we want to rely on the DMA kick mechanism's barrier
for an operation which is supposed to be completed in step 2?

Regards,
Santosh
On Tue, Apr 22, 2014 at 04:18:08PM +0100, Santosh Shilimkar wrote:
> On Tuesday 22 April 2014 11:07 AM, Catalin Marinas wrote:
> > [...]
> > The key here is how you start the DMA transfer. So far we have assumed
> > it's done via an I/O operation like writel(), which has the right
> > barriers. If we have other ways of starting it (like writing the
> > dma_addr into some other memory descriptor), should we use explicit
> > memory barriers?
>
> My point is: why do we want to rely on the DMA kick mechanism's barrier
> for an operation which is supposed to be completed in step 2?

Because step 2 only exists if you use streaming DMA. Since we already
cover this with the DMA kick mechanism for coherent DMA buffers, we would
just be doubling the barriers for the streaming case.
On Tuesday 22 April 2014, Santosh Shilimkar wrote:
> On Tuesday 22 April 2014 10:08 AM, Arnd Bergmann wrote:
> > It's not the nicest API ever, but that's what it is and has been,
> > mostly for compatibility with x86, where the 'mov' instruction
> > performing the store to MMIO registers implies that all writes to DMA
> > memory are visible to the device.
>
> This is not about writel() vs writel_relaxed(). The driver doesn't need
> that barrier: for example, if the actual start of the DMA happens a bit
> later, that doesn't matter to the driver.
>
> The DMA APIs already do barriers today for the non-coherent case. We are
> not talking about anything new here. Sorry, but I don't see the
> connection.

I don't think they do, nor should they. Can you tell me where you see a
barrier in dma_sync_single_for_cpu() or arm_dma_sync_single_for_device()?
For all I can tell, they only deal with L1 and L2 cache maintenance in
arm_dma_ops.

	Arnd
On Tuesday 22 April 2014 03:53 PM, Arnd Bergmann wrote:
> On Tuesday 22 April 2014, Santosh Shilimkar wrote:
>> [...]
>> The DMA APIs already do barriers today for the non-coherent case. We
>> are not talking about anything new here. Sorry, but I don't see the
>> connection.
>
> I don't think they do, nor should they. Can you tell me where you see a
> barrier in dma_sync_single_for_cpu() or
> arm_dma_sync_single_for_device()? For all I can tell, they only deal
> with L1 and L2 cache maintenance in arm_dma_ops.

The cache APIs used by the dma_ops do have the necessary barriers at the
end of the cache operations. That's what I meant. So for the end user
(the device driver), it's transparent.

Regards,
Santosh
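As context, a rough C rendering of the point being made: on the
non-coherent path the streaming DMA ops funnel into ranged cache
maintenance whose epilogue is a barrier. The helper and constant names
below are hypothetical stand-ins (on ARMv7 the real code is assembly in
arch/arm/mm/cache-v7.S):

	/* Ranged clean to the point of coherency, as used on the
	 * CPU-to-device path; clean_dcache_line() stands in for the
	 * per-line DCCMVAC operation. */
	static void dma_clean_range(const void *start, const void *end)
	{
		const void *p;

		for (p = start; p < end; p += CACHE_LINE_SIZE)
			clean_dcache_line(p);

		dsb();	/* the trailing barrier referred to above */
	}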
On Tuesday 22 April 2014 15:58:09 Santosh Shilimkar wrote:
> On Tuesday 22 April 2014 03:53 PM, Arnd Bergmann wrote:
> > I don't think they do, nor should they. Can you tell me where you see
> > a barrier in dma_sync_single_for_cpu() or
> > arm_dma_sync_single_for_device()? For all I can tell, they only deal
> > with L1 and L2 cache maintenance in arm_dma_ops.
>
> The cache APIs used by the dma_ops do have the necessary barriers at the
> end of the cache operations. That's what I meant. So for the end user
> (the device driver), it's transparent.

Ok, I see it now for the non-coherent operations, and I see the same thing
on PowerPC and MIPS, which also have both coherent and non-coherent
versions of their dma_map_ops.

However, I also see that neither of those does a wmb() for the coherent
version. I don't see why ARM should be different from the others here, so
if there is a reason to do a barrier there, we should change all
architectures. I still don't see a reason why the barrier is needed,
though.

Can you be more specific in your driver example about why you think the
barrier in writel() is not sufficient?

	Arnd
On Tuesday 22 April 2014 04:23 PM, Arnd Bergmann wrote:
> [...]
> However, I also see that neither of those does a wmb() for the coherent
> version. I don't see why ARM should be different from the others here,
> so if there is a reason to do a barrier there, we should change all
> architectures. I still don't see a reason why the barrier is needed,
> though.

That's fair.

> Can you be more specific in your driver example about why you think the
> barrier in writel() is not sufficient?

writel() or an explicit barrier in the driver will do the job. I was just
thinking that we are trying to work around the shortcomings of the
streaming API by adding barriers in the driver. For example, on a
non-coherent system I don't need that barrier because the dma_ops take
care of it.

Anyway, I think you and Catalin have convinced me to handle the case at
the driver level, so I step back on the patch. It makes sense to keep ARM
consistent with the other architectures.

Regards,
Santosh
On Tue, Apr 22, 2014 at 09:30:27PM +0100, Santosh Shilimkar wrote:
> writel() or an explicit barrier in the driver will do the job. I was
> just thinking that we are trying to work around the shortcomings of the
> streaming API by adding barriers in the driver. For example, on a
> non-coherent system I don't need that barrier because the dma_ops take
> care of it.

I wonder whether we can remove those barriers altogether then (from the
DMA cache operations). For the coherent case, the driver must provide the
barrier (probably via writel()), so the non-coherent case shouldn't be any
different.

I need some more coffee and a serious look at the code, but we may be able
to use dmb instructions to order the cache maintenance and avoid a final
dsb for completion.

Will
On Wed, Apr 23, 2014 at 10:02:51AM +0100, Will Deacon wrote:
> On Tue, Apr 22, 2014 at 09:30:27PM +0100, Santosh Shilimkar wrote:
> > writel() or an explicit barrier in the driver will do the job. I was
> > just thinking that we are trying to work around the shortcomings of
> > the streaming API by adding barriers in the driver. For example, on a
> > non-coherent system I don't need that barrier because the dma_ops take
> > care of it.
>
> I wonder whether we can remove those barriers altogether then (from the
> DMA cache operations). For the coherent case, the driver must provide
> the barrier (probably via writel()), so the non-coherent case shouldn't
> be any different.

For the DMA_TO_DEVICE case the effect should be the same, as wmb() implies
a dsb (and outer_sync() for write). But the reason we have barriers in the
DMA ops is slightly different: the completion of the cache maintenance
operation, rather than ordering with any previous writes to the DMA
buffer.

In the DMA_FROM_DEVICE scenario, for example, the CPU gets an interrupt
for a finished DMA transfer and executes dma_unmap_single() prior to
accessing the page. However, the CPU access after unmapping is done using
normal LDR/STR instructions, which do not imply any barrier. So we need to
ensure the completion of the cache invalidation in the dma operation.

In the I/O coherency case, I would say it is the responsibility of the
device/hardware to ensure that the data is visible to all observers (CPUs)
prior to issuing an interrupt for DMA-ready. Looking at the mvebu code, I
think it covers the from-device and bidirectional scenarios.

Maybe Santosh still has a point ;) but I don't know what the right barrier
would be here. And I really *hate* per-SoC/snoop-unit barriers (I still
hope a dsb will do the trick on newer/ARMv8 systems).

> I need some more coffee and a serious look at the code, but we may be
> able to use dmb instructions to order the cache maintenance and avoid a
> final dsb for completion.

Is a dmb enough (assuming no outer cache)? We need to ensure the flushed
cache lines reach the memory for device access.
On Wed, Apr 23, 2014 at 05:02:16PM +0100, Catalin Marinas wrote:
> For the DMA_TO_DEVICE case the effect should be the same, as wmb()
> implies a dsb (and outer_sync() for write). But the reason we have
> barriers in the DMA ops is slightly different: the completion of the
> cache maintenance operation, rather than ordering with any previous
> writes to the DMA buffer.
>
> In the DMA_FROM_DEVICE scenario, for example, the CPU gets an interrupt
> for a finished DMA transfer and executes dma_unmap_single() prior to
> accessing the page. However, the CPU access after unmapping is done
> using normal LDR/STR instructions, which do not imply any barrier. So we
> need to ensure the completion of the cache invalidation in the dma
> operation.

I don't think we necessarily need completion, we just need ordering. That
is, the normal LDR/STR instructions must be observed after the cache
maintenance. I'll have to revisit the ARM ARM to be sure of this, but a
dmb should be sufficient for that guarantee.

> In the I/O coherency case, I would say it is the responsibility of the
> device/hardware to ensure that the data is visible to all observers
> (CPUs) prior to issuing an interrupt for DMA-ready. Looking at the mvebu
> code, I think it covers the from-device and bidirectional scenarios.
>
> Maybe Santosh still has a point ;) but I don't know what the right
> barrier would be here. And I really *hate* per-SoC/snoop-unit barriers
> (I still hope a dsb will do the trick on newer/ARMv8 systems).

If you have device interrupts which are asynchronous to memory coherency,
then you're in a world of pain. I can't think of a generic (architected)
solution to this problem, unfortunately -- it's going to be both device
and interconnect specific. Adding dsbs doesn't necessarily help at all.

> > I need some more coffee and a serious look at the code, but we may be
> > able to use dmb instructions to order the cache maintenance and avoid
> > a final dsb for completion.
>
> Is a dmb enough (assuming no outer cache)? We need to ensure the flushed
> cache lines reach the memory for device access.

I would keep the dsb in writel() for that (the same argument as we had for
the coherent case earlier in the thread). Only the DMA cache maintenance
operations would be relaxed to dmb.

Will
On Wed, Apr 23, 2014 at 06:17:27PM +0100, Will Deacon wrote:
> On Wed, Apr 23, 2014 at 05:02:16PM +0100, Catalin Marinas wrote:
> > In the I/O coherency case, I would say it is the responsibility of the
> > device/hardware to ensure that the data is visible to all observers
> > (CPUs) prior to issuing an interrupt for DMA-ready. [...]
>
> If you have device interrupts which are asynchronous to memory
> coherency, then you're in a world of pain. I can't think of a generic
> (architected) solution to this problem, unfortunately -- it's going to
> be both device and interconnect specific. Adding dsbs doesn't
> necessarily help at all.

Think of network devices with NAPI handling. There, we explicitly turn off
the device's interrupt and switch to software polling for received
packets.

The memory for the packets has already been mapped, and we're unmapping
the buffer and then reading from it (to locate the ether type and/or vlan
headers) before passing it up the network stack.

So in this case, we need to ensure that the cache operations are ordered
before the subsequent loads read from the DMA'd data. It's purely an
ordering thing, it's not a completion thing.

However, what must not happen is that the unmap is re-ordered before
reading the descriptor and deciding whether there's a packet present to be
unmapped. That probably implies that code _should_ be doing this:

	status = desc->status;
	if (!(status & CPU_OWNS_THIS_DESCRIPTOR))
		no_packet;

	rmb();

	addr = desc->buf;
	len = desc->length;

	dma_unmap_single(dev, addr, len, DMA_FROM_DEVICE);

	...receive skb...reading buffer...

and there are a number of ethernet drivers which do exactly that. For
example, drivers/net/ethernet/intel/e1000e/netdev.c,
e1000_clean_rx_irq(), and various other Intel networking drivers.
On Wednesday 23 April 2014 19:37:42 Russell King - ARM Linux wrote:
> Think of network devices with NAPI handling. There, we explicitly turn
> off the device's interrupt and switch to software polling for received
> packets.
>
> The memory for the packets has already been mapped, and we're unmapping
> the buffer and then reading from it (to locate the ether type and/or
> vlan headers) before passing it up the network stack.
>
> So in this case, we need to ensure that the cache operations are ordered
> before the subsequent loads read from the DMA'd data. It's purely an
> ordering thing, it's not a completion thing.

PCI guarantees this, but I have seen systems in the past (on PowerPC) that
would violate it on the internal interconnect: you could sometimes see the
completion DMA data in the descriptor ring before the actual user data was
there. We only ever observed it in combination with an IOMMU, when the
descriptor address had a valid IOTLB entry but the data address did not. I
would hope that the ARM SMMU gets this right, but there are also a number
of other IOMMU implementations.

The x-gene SATA driver apparently suffers from a related problem, and they
have to flush outstanding DMAs at the interconnect whenever they get a
completion interrupt.

Another problem is MSI processing. MSI was specifically invented to avoid
having to check an MMIO register for a DMA completion that, as a
side-effect, flushes pending DMAs from the same device. This breaks down
if the MSI packet gets turned into a level interrupt before it reaches the
CPU's coherency domain, which is likely the case on the dw-pcie controller
that comes with its own MSI block.

	Arnd
On Wed, Apr 23, 2014 at 08:58:05PM +0200, Arnd Bergmann wrote:
> On Wednesday 23 April 2014 19:37:42 Russell King - ARM Linux wrote:
> > So in this case, we need to ensure that the cache operations are
> > ordered before the subsequent loads read from the DMA'd data. It's
> > purely an ordering thing, it's not a completion thing.
>
> PCI guarantees this, but I have seen systems in the past (on PowerPC)
> that would violate it on the internal interconnect: you could sometimes
> see the completion DMA data in the descriptor ring before the actual
> user data was there. We only ever observed it in combination with an
> IOMMU, when the descriptor address had a valid IOTLB entry but the data
> address did not.

What is done on down-stream buses is of no concern to the behaviour of the
CPU, which is what's being discussed here (in terms of barriers) and the
correct CPU ordering of various reads/writes to memory and devices vs the
streaming cache operations.
On Wed, Apr 23, 2014 at 08:58:05PM +0200, Arnd Bergmann wrote:
> PCI guarantees this, but I have seen systems in the past (on PowerPC)
> that would violate it on the internal interconnect: you could sometimes
> see the completion DMA data in the descriptor ring before the actual
> user data was there. We only ever observed it in combination with an
> IOMMU, when the descriptor address had a valid IOTLB entry but the data
> address did not.

Ordering in PCI-E gets a bit fuzzy when you talk about internal order
within the host bridge, but AFAIK, re-ordering non-relaxed PCI-E writes is
certainly a big no-no. It breaks the entire producer/consumer driver
model.

> Another problem is MSI processing. MSI was specifically invented to
> avoid having to check an MMIO register for a DMA completion that, as a
> side-effect, flushes pending DMAs from the same device. This breaks down
> if the MSI packet gets turned into a level interrupt before it reaches
> the CPU's coherency domain, which is likely the case on the dw-pcie
> controller that comes with its own MSI block.

I recently implemented PCI-E to AXI bridge HW that does MSI in an AXI
environment, and it requires waiting for all AXI operations associated
with prior PCI-E packets to complete and be acked back to the bridge
before sending an MSI edge over to the GIC.

Unlike PCI, AXI provides a write completion ack back to the initiator,
which can only be sent by the completer once the transaction is visible to
all other initiators.

A bridge must similarly serialize other TLPs: e.g. a series of posted
writes with the relaxed-ordering bit set can be pipelined into AXI, but
once a non-relaxed TLP is hit, the bridge must wait for all the prior
writes to be acked before forwarding the non-relaxed one.

Not doing this serialization would be the root cause of problems like the
one you described above on PPC, where the IOMMU path takes longer than the
non-IOMMU path, so the non-relaxed completion write arrives too soon.

IMHO, if someone builds a PCI-E bridge that doesn't do this, then its MSI
support is completely broken and should not be used. Delivering an MSI
interrupt before data visibility completely violates the requirements in
PCI-E for transaction ordering.

It is also important to note that even level interrupts require bridge
serialization: when a non-relaxed read-response TLP is returned, the
bridge must wait for all AXI writes to be acked before forwarding the read
response. Otherwise writes could be buffered within the interconnect and
still not be visible to the CPU while the read response 'passes' them.

Jason
On Wed, Apr 23, 2014 at 06:17:27PM +0100, Will Deacon wrote:
> On Wed, Apr 23, 2014 at 05:02:16PM +0100, Catalin Marinas wrote:
> > In the DMA_FROM_DEVICE scenario, for example, the CPU gets an
> > interrupt for a finished DMA transfer and executes dma_unmap_single()
> > prior to accessing the page. However, the CPU access after unmapping
> > is done using normal LDR/STR instructions, which do not imply any
> > barrier. So we need to ensure the completion of the cache invalidation
> > in the dma operation.
>
> I don't think we necessarily need completion, we just need ordering.
> That is, the normal LDR/STR instructions must be observed after the
> cache maintenance. I'll have to revisit the ARM ARM to be sure of this,
> but a dmb should be sufficient for that guarantee.

If we only do D-cache maintenance by MVA, the ARM ARM (both v7 and v8)
claims that these operations are ordered relative to any explicit
loads/stores to the same address. So in theory we don't even need a DMB
for unmapping with DMA_FROM_DEVICE. But in practice we may have an outer
cache, hence a DSB is required before the outer_sync() (we could move it
there, though).
On Thu, Apr 24, 2014 at 10:09:27AM +0100, Catalin Marinas wrote:
> If we only do D-cache maintenance by MVA, the ARM ARM (both v7 and v8)
> claims that these operations are ordered relative to any explicit
> loads/stores to the same address. So in theory we don't even need a DMB
> for unmapping with DMA_FROM_DEVICE. But in practice we may have an outer
> cache, hence a DSB is required before the outer_sync() (we could move it
> there, though).

The general use case for outer_sync() is:

	dsb();
	outer_sync();

Why would we want to change this to:

	dsb();
	dmb();
	outer_sync();

(where the dmb is in outer_sync() itself)?

It seems more sensible for it to stay at the outer_sync() call site where
it's needed.
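As context for where these barriers live: with
CONFIG_ARM_DMA_MEM_BUFFERABLE, the ARM mandatory barriers of this era were
composed roughly as below (a paraphrase, not a verbatim copy of
arch/arm/include/asm/barrier.h):

	#define mb()	do { dsb(); outer_sync(); } while (0)
	#define wmb()	mb()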
On Wed, Apr 23, 2014 at 07:37:42PM +0100, Russell King - ARM Linux wrote:
> So in this case, we need to ensure that the cache operations are ordered
> before the subsequent loads read from the DMA'd data. It's purely an
> ordering thing, it's not a completion thing.

Well, ordering of *completed* cache operations ;).

> However, what must not happen is that the unmap is re-ordered before
> reading the descriptor and deciding whether there's a packet present to
> be unmapped. That probably implies that code _should_ be doing this:
>
>	status = desc->status;
>	if (!(status & CPU_OWNS_THIS_DESCRIPTOR))
>		no_packet;
>
>	rmb();
>
>	addr = desc->buf;
>	len = desc->length;
>
>	dma_unmap_single(dev, addr, len, DMA_FROM_DEVICE);

Indeed.

> ...receive skb...reading buffer...

The point Will and myself were trying to make is about ordering as
observed by the CPU (rather than ordering of CPU actions). Independently
of whether the DMA is coherent or not, the ordering between the device's
write to the buffer and the status update (in-memory descriptor or IRQ)
*must* be ensured by the device and the interconnects. The rmb() as per
your example above only solves the relative CPU loads and cache
maintenance; it doesn't have any effect on the transactions done by the
device.

The mvebu SoC has exactly this problem. mvebu_hwcc_sync_io_barrier() is
something that should be handled by the hardware automatically, especially
in a system which claims cache coherency.
On Thu, Apr 24, 2014 at 10:16:24AM +0100, Russell King - ARM Linux wrote:
> The general use case for outer_sync() is:
>
>	dsb();
>	outer_sync();
>
> Why would we want to change this to:
>
>	dsb();
>	dmb();
>	outer_sync();
>
> (where the dmb is in outer_sync() itself)?
>
> It seems more sensible for it to stay at the outer_sync() call site
> where it's needed.

You are right, it gets worse for the wmb() case if we change
outer_sync(); I was thinking about cache maintenance. An optimisation
would be for functions like v7_dma_inv_range() to no longer have the dsb,
but to move it into __dma_page_cpu_to_dev() before the outer_*_range()
ops. If we assume that a streaming DMA transfer is started by a writel()
access, which has a dsb already, then in the absence of an outer cache we
wouldn't need any dsb at all; hence something like a conditional
sync_for_outer() barrier (a dsb if there is an outer cache, a no-op
otherwise) in __dma_page_cpu_to_dev().

In __dma_page_dev_to_cpu() we wouldn't need any dsb at all for the cache
maintenance, since subsequent accesses to the same address are ordered by
the hardware (and the outer cache maintenance is done before the inner
anyway).

(That's from an ARMv7 perspective; we need to check the ordering on
earlier architectures.)
On Wed, Apr 23, 2014 at 08:04:48PM +0100, Russell King - ARM Linux wrote:
> On Wed, Apr 23, 2014 at 08:58:05PM +0200, Arnd Bergmann wrote:
> > PCI guarantees this, but I have seen systems in the past (on PowerPC)
> > that would violate it on the internal interconnect: [...]
>
> What is done on down-stream buses is of no concern to the behaviour of
> the CPU, which is what's being discussed here (in terms of barriers) and
> the correct CPU ordering of various reads/writes to memory and devices
> vs the streaming cache operations.

It is still of concern, because for cases like NAPI an rmb() on the CPU
side is no longer enough. If you use a common network driver, written
correctly with rmb(), but you have some weird interconnect which doesn't
ensure ordering, you have to add an interconnect-specific barrier to the
rmb() (or hack the dma ops, like mvebu). I consider such hardware broken.
Hi Arnd,

On Wed, Apr 23, 2014 at 07:58:05PM +0100, Arnd Bergmann wrote:
> Another problem is MSI processing. MSI was specifically invented to
> avoid having to check an MMIO register for a DMA completion that, as a
> side-effect, flushes pending DMAs from the same device. This breaks down
> if the MSI packet gets turned into a level interrupt before it reaches
> the CPU's coherency domain, which is likely the case on the dw-pcie
> controller that comes with its own MSI block.

I'm not sure there's anything special about MSI which helps with this
problem. For GICv3, the MSI write will target the ITS (a slave device),
whereas the data produced is presumably targeting main memory. That still
requires careful ordering by the producer, in the same way as if it were
signalling a legacy interrupt.

Will
On Thu, Apr 24, 2014 at 10:54:23AM +0100, Catalin Marinas wrote:
> On Wed, Apr 23, 2014 at 07:37:42PM +0100, Russell King - ARM Linux wrote:
> > However, what must not happen is that the unmap is re-ordered before
> > reading the descriptor and deciding whether there's a packet present
> > to be unmapped. [...]
>
> The point Will and myself were trying to make is about ordering as
> observed by the CPU (rather than ordering of CPU actions). Independently
> of whether the DMA is coherent or not, the ordering between the device's
> write to the buffer and the status update (in-memory descriptor or IRQ)
> *must* be ensured by the device and the interconnects. The rmb() as per
> your example above only solves the relative CPU loads and cache
> maintenance; it doesn't have any effect on the transactions done by the
> device.

If the hardware does not ensure proper ordering of the writes to the
buffer vs the writes to the descriptor indicating that the packet has been
received, then the hardware is quite simply broken.

> The mvebu SoC has exactly this problem. mvebu_hwcc_sync_io_barrier() is
> something that should be handled by the hardware automatically,
> especially in a system which claims cache coherency.

So what if the descriptor also contains other data, such as a received
timestamp, or separate error bits that need checking before the DMA unmap?
(There are implementations where this is true - for example the Freescale
FEC driver, but that currently omits the rmb()...)

Stuffing this into the DMA API may not be the best place for it, but the
alternative is to stuff it into drivers, and that's just not acceptable
for any generic driver - plus we want to keep things as simple as possible
for driver authors, who probably don't want to understand any of these
details.
On Thu, Apr 24, 2014 at 11:47:37AM +0100, Catalin Marinas wrote:
> It is still of concern, because for cases like NAPI an rmb() on the CPU
> side is no longer enough. If you use a common network driver, written
> correctly with rmb(), but you have some weird interconnect which doesn't
> ensure ordering, you have to add an interconnect-specific barrier to the
> rmb() (or hack the dma ops, like mvebu). I consider such hardware
> broken.

Yes, the hardware /is/ broken, but if you want to get it working in a way
that's acceptable in upstream kernels, adding that barrier to rmb() is
probably the only acceptable solution - especially if you have other stuff
going on between the rmb() and the DMA unmap.
On Thu, Apr 24, 2014 at 12:15:47PM +0100, Russell King - ARM Linux wrote:
> Yes, the hardware /is/ broken, but if you want to get it working in a
> way that's acceptable in upstream kernels, adding that barrier to rmb()
> is probably the only acceptable solution - especially if you have other
> stuff going on between the rmb() and the DMA unmap.

The problem then is that the additional barrier may well require
bus-specific knowledge and access to parts of bus driver code which we
can't use inside the rmb() macro. To solve this properly, the bus topology
topic once again rears its ugly head, as I think you'd need a callback
from the device driver to the bus on which it resides in order to provide
the appropriate barrier (which isn't something that can be done sanely for
the platform_bus).

Will
On Thursday 24 April 2014 11:58:46 Will Deacon wrote:
> On Wed, Apr 23, 2014 at 07:58:05PM +0100, Arnd Bergmann wrote:
> > Another problem is MSI processing. [...]
>
> I'm not sure there's anything special about MSI which helps with this
> problem. For GICv3, the MSI write will target the ITS (a slave device),
> whereas the data produced is presumably targeting main memory. That
> still requires careful ordering by the producer, in the same way as if
> it were signalling a legacy interrupt.

With legacy interrupts, a PCI bus master has no way to order the
transactions: it initiates the DMA to memory but does not wait for the DMA
to complete. It raises the interrupt line, which causes the interrupt
handler of the driver to run. The driver then reads a status register from
the bus master device, and this read is ordered with respect to the DMA
that may still be in progress at the time. Any PCI driver that works with
legacy interrupts and DMA has to do this, and the PCI host controller has
to ensure that these ordering semantics are maintained on the upstream
buses (see the sketch after this mail).

The difference with MSI is that the driver does not have to do an MMIO
read transaction, and that the host controller has to ensure ordering
between the (possibly weakly ordered) data DMA and the MSI transaction,
rather than between the DMA and the MMIO read. These two are not the same,
and it's entirely possible for a broken implementation to get one of them
right but not the other.

	Arnd
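For illustration, a minimal sketch of the legacy-interrupt pattern
described above; the device structure and register names are hypothetical:

	#include <linux/interrupt.h>
	#include <linux/io.h>

	#define MYDEV_STATUS	0x08	/* hypothetical registers */
	#define MYDEV_DMA_DONE	0x1

	struct mydev {
		void __iomem *regs;
	};

	static irqreturn_t mydev_irq(int irq, void *data)
	{
		struct mydev *md = data;
		u32 status;

		/* The non-relaxed MMIO read is what the host controller
		 * orders against the device's still-in-flight DMA. */
		status = readl(md->regs + MYDEV_STATUS);
		if (!(status & MYDEV_DMA_DONE))
			return IRQ_NONE;

		/* Only from this point may the DMA'd memory be read. */
		return IRQ_HANDLED;
	}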
On Thu, Apr 24, 2014 at 01:12:16PM +0100, Arnd Bergmann wrote:
> On Thursday 24 April 2014 11:58:46 Will Deacon wrote:
> > On Wed, Apr 23, 2014 at 07:58:05PM +0100, Arnd Bergmann wrote:
> > > Another problem is MSI processing. MSI was specifically invented to avoid
> > > having to check an MMIO register for a DMA completion that as a side effect
> > > flushes pending DMAs from the same device. This breaks down if the MSI
> > > packet gets turned into a level interrupt before it reaches the CPU's
> > > coherency domain, which is likely the case on the dw-pcie controller that
> > > comes with its own MSI block.
> >
> > I'm not sure there's anything special about MSI which helps with this
> > problem. For GICv3, the MSI write will target the ITS (a slave device),
> > whereas the data produced is assumedly targeting main memory. That still
> > requires careful ordering by the producer, in the same way as if it was
> > signalling a legacy interrupt.
>
> With legacy interrupts a PCI bus master has no way to order the
> transactions: it initiates the DMA to memory, but does not wait for the
> DMA to complete. It raises the interrupt line, which causes the
> interrupt handler of the driver to start. The driver then reads a status
> register from the bus master device, and this read is ordered with
> respect to the DMA that may still be in progress at the time. Any PCI
> driver that works with legacy interrupts and DMA has to do this, and the
> PCI host controller has to ensure that these ordering semantics are
> maintained on the upstream buses.
>
> The difference with MSI is that the driver does not have to do an MMIO
> read transaction, and that the host controller has to ensure ordering
> between the (possibly weakly ordered) data DMA and the MSI transaction,
> rather than between the DMA and the MMIO read. These two are not the
> same, and it's totally possible for a broken implementation to get one
> of them right but not the other.

Ok, so the problem of enforcing ordering (from the CPU's perspective)
moves from the endpoint to the host controller. That's certainly an
improvement, but it doesn't seem unlikely for the host controller to
screw that up :(

Will
On Thursday 24 April 2014 07:21 AM, Will Deacon wrote:
> On Thu, Apr 24, 2014 at 12:15:47PM +0100, Russell King - ARM Linux wrote:
>> On Thu, Apr 24, 2014 at 11:47:37AM +0100, Catalin Marinas wrote:
>>> On Wed, Apr 23, 2014 at 08:04:48PM +0100, Russell King - ARM Linux wrote:
>>>> What is done on down-stream buses is of no concern to the behaviour of
>>>> the CPU, which is what's being discussed here (in terms of barriers)
>>>> and the correct CPU ordering of various read/writes to memory and
>>>> devices vs the streaming cache operations.
>>>
>>> It is still of concern because for cases like NAPI an rmb() on the CPU
>>> side is no longer enough. If you use a common network driver, written
>>> correctly with rmb(), but you have some weird interconnect which doesn't
>>> ensure ordering, you have to add an interconnect-specific barrier to the
>>> rmb() (or hack the dma ops like mvebu). I consider such hardware broken.
>>
>> Yes, the hardware /is/ broken, but if you want to get it working in a
>> way that's acceptable in upstream kernels, adding that barrier to rmb()
>> is probably the only acceptable solution - especially if you have other
>> stuff going in between the rmb() and the DMA unmap.
>
> The problem then is that the additional barrier may well require
> bus-specific knowledge and access to parts of bus driver code which we
> can't use inside the rmb() macro. To solve this properly, the bus
> topology topic once again rears its ugly head, as I think you'd need a
> callback from the device driver to the bus on which it resides in order
> to provide the appropriate barrier (which isn't something that can be
> done sanely for the platform_bus).
>
Not exactly against the bus notifier point, but we can't afford to have
such notifier calls in hot paths - especially gigabit network drivers'
per-packet processing paths, where even a 10-cycle cost makes a huge
impact on the throughput.

Interconnect barriers are really needed for completion. I think CPUs
within at least the same cluster will be ordered with rmb(). But the
same is not true when you have multiple clusters, and then further down
the coherent interconnect comes into the picture, where all the other
non-CPU coherent masters are participating.

If rmb() has to reach all the way to the coherent masters (non-CPU),
then I suspect most of the ARM coherent architectures are broken. If
you take any typical SoC, the ARM CPUs are bolted to the other coherent
masters at the AXI boundary, or maybe with ACP interfaces. At this
level rmb() isn't good enough and you at least need a dsb() for
completion.

So in my view, unless and until you have features like DVM in hardware,
a dsb() is needed to guarantee even the ordering within CPUs sitting
across clusters.

Regards,
Santosh
On Thu, Apr 24, 2014 at 02:38:28PM +0100, Santosh Shilimkar wrote:
> On Thursday 24 April 2014 07:21 AM, Will Deacon wrote:
> > On Thu, Apr 24, 2014 at 12:15:47PM +0100, Russell King - ARM Linux wrote:
> >> Yes, the hardware /is/ broken, but if you want to get it working in a
> >> way that's acceptable in upstream kernels, adding that barrier to rmb()
> >> is probably the only acceptable solution - especially if you have other
> >> stuff going in between the rmb() and the DMA unmap.
> >
> > The problem then is that the additional barrier may well require
> > bus-specific knowledge and access to parts of bus driver code which we
> > can't use inside the rmb() macro. To solve this properly, the bus
> > topology topic once again rears its ugly head, as I think you'd need a
> > callback from the device driver to the bus on which it resides in order
> > to provide the appropriate barrier (which isn't something that can be
> > done sanely for the platform_bus).
> >
> Not exactly against the bus notifier point, but we can't afford to have
> such notifier calls in hot paths - especially gigabit network drivers'
> per-packet processing paths, where even a 10-cycle cost makes a huge
> impact on the throughput.

I don't think anybody is suggesting that you do this per-packet. This is a
per-DMA-transfer barrier, which is required anyway. The details of the
barrier are what varies, and are likely bus-specific.

> Interconnect barriers are really needed for completion. I think CPUs
> within at least the same cluster will be ordered with rmb(). But the
> same is not true when you have multiple clusters, and then further down
> the coherent interconnect comes into the picture, where all the other
> non-CPU coherent masters are participating.

You're making a lot of rash generalisations here. The architected barrier
instructions as used by Linux will work perfectly well within the
inner-shareable domain. That means you don't need to worry about
multiple clusters of CPUs.

However, you can't read *anything* into how a barrier instruction executed
on the CPU affects writes from another master; there is inherently a race
there which must be dealt with by either the external master or some
implementation-specific action by the CPU. This is the real problem.

> If rmb() has to reach all the way to the coherent masters (non-CPU),
> then I suspect most of the ARM coherent architectures are broken. If
> you take any typical SoC, the ARM CPUs are bolted to the other coherent
> masters at the AXI boundary, or maybe with ACP interfaces. At this
> level rmb() isn't good enough and you at least need a dsb() for
> completion.

An rmb() expands to a dsb, neither of which gives you anything in this
scenario as described by the architecture.

> So in my view, unless and until you have features like DVM in hardware,
> a dsb() is needed to guarantee even the ordering within CPUs sitting
> across clusters.

Firstly, you can only have multiple clusters of CPUs running a single
Linux image if hardware coherency is supported between them. In this case,
all the CPUs will live in the same inner-shareable domain and dmb ish is
sufficient to enforce ordering between them.

Secondly, a dsb executed by a CPU is irrelevant to the ordering of accesses
by an external peripheral, regardless of whether that peripheral is cache
coherent. If you think about this as a producer/consumer problem, you need
ordering at *both* ends to make any guarantees.

Will
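For reference, the distinction Will draws maps onto the ARM barrier macros
roughly as follows. This is a paraphrase of the ARM barrier definitions of
that era, not the verbatim arch/arm/include/asm/barrier.h source:

/* Paraphrased mapping, not verbatim kernel source: */
#define rmb()		dsb()		/* completion barrier, full system */
#define smp_rmb()	dmb(ish)	/* ordering only, inner-shareable domain */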
On Thursday 24 April 2014 10:09 AM, Will Deacon wrote:
> On Thu, Apr 24, 2014 at 02:38:28PM +0100, Santosh Shilimkar wrote:
>> On Thursday 24 April 2014 07:21 AM, Will Deacon wrote:
>>> On Thu, Apr 24, 2014 at 12:15:47PM +0100, Russell King - ARM Linux wrote:
>>>> Yes, the hardware /is/ broken, but if you want to get it working in a
>>>> way that's acceptable in upstream kernels, adding that barrier to rmb()
>>>> is probably the only acceptable solution - especially if you have other
>>>> stuff going in between the rmb() and the DMA unmap.
>>>
>>> The problem then is that the additional barrier may well require
>>> bus-specific knowledge and access to parts of bus driver code which we
>>> can't use inside the rmb() macro. To solve this properly, the bus
>>> topology topic once again rears its ugly head, as I think you'd need a
>>> callback from the device driver to the bus on which it resides in order
>>> to provide the appropriate barrier (which isn't something that can be
>>> done sanely for the platform_bus).
>>>
>> Not exactly against the bus notifier point, but we can't afford to have
>> such notifier calls in hot paths - especially gigabit network drivers'
>> per-packet processing paths, where even a 10-cycle cost makes a huge
>> impact on the throughput.
>
> I don't think anybody is suggesting that you do this per-packet. This is a
> per-DMA-transfer barrier, which is required anyway. The details of the
> barrier are what varies, and are likely bus-specific.
>
fair enough

>> Interconnect barriers are really needed for completion. I think CPUs
>> within at least the same cluster will be ordered with rmb(). But the
>> same is not true when you have multiple clusters, and then further down
>> the coherent interconnect comes into the picture, where all the other
>> non-CPU coherent masters are participating.
>
> You're making a lot of rash generalisations here. The architected barrier
> instructions as used by Linux will work perfectly well within the
> inner-shareable domain. That means you don't need to worry about
> multiple clusters of CPUs.
>
> However, you can't read *anything* into how a barrier instruction executed
> on the CPU affects writes from another master; there is inherently a race
> there which must be dealt with by either the external master or some
> implementation-specific action by the CPU. This is the real problem.
>
>> If rmb() has to reach all the way to the coherent masters (non-CPU),
>> then I suspect most of the ARM coherent architectures are broken. If
>> you take any typical SoC, the ARM CPUs are bolted to the other coherent
>> masters at the AXI boundary, or maybe with ACP interfaces. At this
>> level rmb() isn't good enough and you at least need a dsb() for
>> completion.
>
> An rmb() expands to a dsb, neither of which gives you anything in this
> scenario as described by the architecture.
>
My bad... I don't know why I thought rmb() just expands to a dmb().
Just ignore that point.

>> So in my view, unless and until you have features like DVM in hardware,
>> a dsb() is needed to guarantee even the ordering within CPUs sitting
>> across clusters.
>
> Firstly, you can only have multiple clusters of CPUs running a single
> Linux image if hardware coherency is supported between them. In this case,
> all the CPUs will live in the same inner-shareable domain and dmb ish is
> sufficient to enforce ordering between them.
>
Thanks for expanding and correcting me. The inner-shareable domain, if
implemented correctly, should take care of it.
> Secondly, a dsb executed by a CPU is irrelevant to the ordering of accesses
> by an external peripheral, regardless of whether that peripheral is cache
> coherent. If you think about this as a producer/consumer problem, you need
> ordering at *both* ends to make any guarantees.
>
Agreed. Isn't this an assumption we are making in the coherent DMA
streaming case? Of course, we are talking about io*mb() in that case,
which could be more than a dsb() if needed, but it's actually a dsb()
for A15-class devices.

Maybe I was not clear, but if we are saying that only ordering needs to
be guaranteed and not completion, then we have all the expected
behaviour. And converting the existing non-coherent dma_ops barriers to
dmb() is right. My concern is that completion is important for the
external master cases. So what's the expectation in those cases from
producer and consumer?

DMA_FROM_DEVICE case ... DMA -> producer, CPU -> consumer
1. DMA updates the main memory with correct descriptors and buffers.
2. CPU performs the dma_ops() to take over the buffer ownership. In the
   coherent DMA case this is a NOP.
   ** At this point in time the DMA has guaranteed the ordering as well
   as the completion.
3. CPU operates on the buffer/descriptor, which is correct.

DMA_TO_DEVICE: CPU -> producer and DMA -> consumer
1. CPU fills a descriptor/buffer in memory for DMA to pick it up.
2. Performs the necessary dma_op(), which in the coherent case is a NOP...
   ** Here I agree the ordering from all CPUs within the cluster is
   guaranteed as far as the descriptor memory view is concerned.
   But what is produced by the CPU is not visible to the DMA yet, so
   completion isn't guaranteed.
3. If the DMA kicks the transfer assuming producer (CPU) completion,
   then that doesn't work.

Regards,
Santosh
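Seen from a driver, the DMA_TO_DEVICE sequence above is roughly the
following sketch; the foo_* names and the FOO_KICK doorbell register are
hypothetical. Whether step 2 may remain a pure NOP on coherent machines,
with step 3's writel() providing the barrier, is exactly the disagreement
in this thread.

#include <linux/dma-mapping.h>
#include <linux/io.h>
#include <linux/kernel.h>

#define FOO_KICK	0x10		/* hypothetical doorbell register */

struct foo_dev {
	struct device *dev;
	void __iomem *regs;
};

static int foo_start_tx(struct foo_dev *foo, void *buf, size_t len)
{
	dma_addr_t dma;

	/* Step 1: the CPU has already filled buf (a DMA-capable buffer). */

	/*
	 * Step 2: streaming map. On non-coherent machines this cleans the
	 * caches; on coherent machines it is the NOP (or, with the patch
	 * under discussion, at least a wmb()) being debated.
	 */
	dma = dma_map_single(foo->dev, buf, len, DMA_TO_DEVICE);
	if (dma_mapping_error(foo->dev, dma))
		return -ENOMEM;

	/*
	 * Step 3: kick the transfer. writel() carries the wmb() that the
	 * earlier mails point out drivers currently lean on.
	 */
	writel(lower_32_bits(dma), foo->regs + FOO_KICK);

	return 0;
}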
On Thu, Apr 24, 2014 at 10:44:58AM -0400, Santosh Shilimkar wrote:
> DMA_TO_DEVICE: CPU -> producer and DMA -> consumer
> 1. CPU fills a descriptor/buffer in memory for DMA to pick it up.
> 2. Performs the necessary dma_op(), which in the coherent case is a NOP...
>    ** Here I agree the ordering from all CPUs within the cluster is
>    guaranteed as far as the descriptor memory view is concerned.
>    But what is produced by the CPU is not visible to the DMA yet, so
>    completion isn't guaranteed.
> 3. If the DMA kicks the transfer assuming producer (CPU) completion,
>    then that doesn't work.

Step 3 should be done via a writel(), which is a dsb() followed by an
outer_sync() followed by the actual write to the register. The dsb and
outer_sync are there to ensure that the previous writes to things like
DMA coherent memory are visible to the device before the device sees
the write to its register.

Moreover, if there are descriptors in DMA coherent memory, and there is
a bit in them which must be set to hand ownership over to the device
(eg, as in an ethernet driver), then the driver _additionally_ has to
add a barrier between the remainder of the descriptor update and handing
the descriptor over, and that barrier should ensure that *any* effects
prior to the barrier are seen before the effects of the accesses after
the barrier.

That said, in __dma_page_cpu_to_dev() we do the L1 followed by the L2
cache. The effects of cleaning out the L1 cache must be seen by the L2
cache before the effects of cleaning the L2 cache. So we _do_ have an
ordering requirement there which is purely down to the implementation,
and not down to any other requirements.
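Russell's descriptor hand-over pattern, sketched with a hypothetical
descriptor layout (the OWN bit and foo_* names are illustrative, not any
particular ethernet driver):

#include <linux/bitops.h>
#include <linux/types.h>
#include <asm/barrier.h>
#include <asm/byteorder.h>

#define FOO_DESC_OWN	cpu_to_le32(BIT(31))	/* hypothetical ownership bit */

struct foo_desc {
	__le32 addr;
	__le32 len;
	__le32 flags;
};

static void foo_queue_desc(struct foo_desc *d, u32 buf, u32 len)
{
	d->addr = cpu_to_le32(buf);
	d->len = cpu_to_le32(len);

	/*
	 * Ensure the device can never observe the ownership bit set
	 * while addr/len are still stale.
	 */
	wmb();
	d->flags |= FOO_DESC_OWN;
}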
Hey Will,

On Wed, Apr 23, 2014 at 12:17 PM, Will Deacon <will.deacon@arm.com> wrote:
> On Wed, Apr 23, 2014 at 05:02:16PM +0100, Catalin Marinas wrote:
>> On Wed, Apr 23, 2014 at 10:02:51AM +0100, Will Deacon wrote:
>> > On Tue, Apr 22, 2014 at 09:30:27PM +0100, Santosh Shilimkar wrote:
>> > > writel() or an explicit barrier in the driver will do the job. I was
>> > > just thinking that we are trying to work around the shortcomings
>> > > of the streaming API by adding barriers in the driver. For example,
>> > > on a non-coherent system, I don't need that barrier because
>> > > dma_ops does take care of that.
>> >
>> > I wonder whether we can remove those barriers altogether then (from the DMA
>> > cache operations). For the coherent case, the driver must provide the
>> > barrier (probably via writel) so the non-coherent case shouldn't be any
>> > different.
>>
>> For the DMA_TO_DEVICE case the effect should be the same as wmb()
>> implies dsb (and outer_sync() for write). But the reason we have
>> barriers in the DMA ops is slightly different - the completion of the
>> cache maintenance operation rather than ordering with any previous
>> writes to the DMA buffer.
>>
>> In the DMA_FROM_DEVICE scenario for example, the CPU gets an interrupt
>> for a finished DMA transfer and executes dma_unmap_single() prior to
>> accessing the page. However the CPU access after unmapping is done using
>> normal LDR/STR which do not imply any barrier. So we need to ensure the
>> completion of the cache invalidation in the dma operation.
>
> I don't think we necessarily need completion, we just need ordering. That
> is, the normal LDR/STR instructions must be observed after the cache
> maintenance. I'll have to revisit the ARM ARM to be sure of this, but a dmb
> should be sufficient for that guarantee.

Just wondering if you were convinced from the ARM ARM that a dsb is
not required after cache maintenance for the DMA_FROM_DEVICE case?

thanks,
-Joel
On Fri, May 02, 2014 at 10:33:25PM +0100, Joel Fernandes wrote:
> Hey Will,

Hi Joel,

> On Wed, Apr 23, 2014 at 12:17 PM, Will Deacon <will.deacon@arm.com> wrote:
> > On Wed, Apr 23, 2014 at 05:02:16PM +0100, Catalin Marinas wrote:
> >> On Wed, Apr 23, 2014 at 10:02:51AM +0100, Will Deacon wrote:
> >> > On Tue, Apr 22, 2014 at 09:30:27PM +0100, Santosh Shilimkar wrote:
> >> > > writel() or an explicit barrier in the driver will do the job. I was
> >> > > just thinking that we are trying to work around the shortcomings
> >> > > of the streaming API by adding barriers in the driver. For example,
> >> > > on a non-coherent system, I don't need that barrier because
> >> > > dma_ops does take care of that.
> >> >
> >> > I wonder whether we can remove those barriers altogether then (from the DMA
> >> > cache operations). For the coherent case, the driver must provide the
> >> > barrier (probably via writel) so the non-coherent case shouldn't be any
> >> > different.
> >>
> >> For the DMA_TO_DEVICE case the effect should be the same as wmb()
> >> implies dsb (and outer_sync() for write). But the reason we have
> >> barriers in the DMA ops is slightly different - the completion of the
> >> cache maintenance operation rather than ordering with any previous
> >> writes to the DMA buffer.
> >>
> >> In the DMA_FROM_DEVICE scenario for example, the CPU gets an interrupt
> >> for a finished DMA transfer and executes dma_unmap_single() prior to
> >> accessing the page. However the CPU access after unmapping is done using
> >> normal LDR/STR which do not imply any barrier. So we need to ensure the
> >> completion of the cache invalidation in the dma operation.
> >
> > I don't think we necessarily need completion, we just need ordering. That
> > is, the normal LDR/STR instructions must be observed after the cache
> > maintenance. I'll have to revisit the ARM ARM to be sure of this, but a dmb
> > should be sufficient for that guarantee.
>
> Just wondering if you were convinced from the ARM ARM that a dsb is
> not required after cache maintenance for the DMA_FROM_DEVICE case?

It's not quite as clear-cut as that. For AArch32, the cache-maintenance
operations (for the inner caches) will be ordered with respect to one
another without the need for additional barriers. Furthermore, ordering
is also guaranteed with respect to normal load/store instructions if the
buffer is mapped as normal-cacheable and accessed via the same VA with
which the maintenance was performed.

For the DMA_FROM_DEVICE case, this then starts to sound pretty good, but
there are a couple of spanners thrown into the works (and these have
been discussed earlier in the thread):

 (1) An IMP DEF operation can be required to publish data from the
     device after a completion interrupt is received.

 (2) Outer cache maintenance will require a dsb before (to ensure
     completion of maintenance on the inner caches) and after (to ensure
     completion before accesses to the buffer).

(1) could be solved by either adding a new driver API function or by
piggy-backing on rmb(). (2) could be solved by adding extra barriers to
our outer_cache implementations, but that needs some careful thought to
avoid penalising performance unnecessarily.

Will
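For the DMA_FROM_DEVICE path Will is analysing, the driver-side shape is
roughly the sketch below (foo_* names are hypothetical):

#include <linux/dma-mapping.h>

static void foo_handle_packet(void *buf, size_t len);	/* hypothetical */

static void foo_rx_complete(struct device *dev, void *buf,
			    dma_addr_t dma, size_t len)
{
	/*
	 * Hand the buffer back to the CPU. On non-coherent systems this
	 * invalidates stale cache lines; whether it must also contain a
	 * dsb (completion) or merely a dmb (ordering) is the open
	 * question above.
	 */
	dma_unmap_single(dev, dma, len, DMA_FROM_DEVICE);

	/* Plain LDR/STR accesses from here on. */
	foo_handle_packet(buf, len);
}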
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 5260f43..7c9f55d 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -83,6 +83,8 @@ static dma_addr_t arm_coherent_dma_map_page(struct device *dev, struct page *pag
 	     unsigned long offset, size_t size, enum dma_data_direction dir,
 	     struct dma_attrs *attrs)
 {
+	/* Drain cpu write buffer to main memory */
+	wmb();
 	return pfn_to_dma(dev, page_to_pfn(page)) + offset;
 }
 
@@ -117,6 +119,13 @@ static void arm_dma_sync_single_for_cpu(struct device *dev,
 	__dma_page_dev_to_cpu(page, offset, size, dir);
 }
 
+static void arm_coherent_dma_sync_single_for_cpu(struct device *dev,
+		dma_addr_t handle, size_t size, enum dma_data_direction dir)
+{
+	/* Drain cpu write buffer to main memory */
+	wmb();
+}
+
 static void arm_dma_sync_single_for_device(struct device *dev,
 		dma_addr_t handle, size_t size, enum dma_data_direction dir)
 {
@@ -125,6 +134,33 @@ static void arm_dma_sync_single_for_device(struct device *dev,
 	__dma_page_cpu_to_dev(page, offset, size, dir);
 }
 
+/**
+ * arm_dma_sync_sg_for_cpu
+ * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices
+ * @sg: list of buffers
+ * @nents: number of buffers to map (returned from dma_map_sg)
+ * @dir: DMA transfer direction (same as was passed to dma_map_sg)
+ */
+void arm_dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg,
+			int nents, enum dma_data_direction dir)
+{
+	struct dma_map_ops *ops = get_dma_ops(dev);
+	struct scatterlist *s;
+	int i;
+
+	for_each_sg(sg, s, nents, i)
+		ops->sync_single_for_cpu(dev, sg_dma_address(s), s->length,
+					 dir);
+}
+
+void arm_coherent_dma_sync_sg_for_cpu(struct device *dev,
+			struct scatterlist *sg,
+			int nents, enum dma_data_direction dir)
+{
+	/* Drain cpu write buffer to main memory */
+	wmb();
+}
+
 struct dma_map_ops arm_dma_ops = {
 	.alloc			= arm_dma_alloc,
 	.free			= arm_dma_free,
@@ -154,6 +190,8 @@ struct dma_map_ops arm_coherent_dma_ops = {
 	.get_sgtable		= arm_dma_get_sgtable,
 	.map_page		= arm_coherent_dma_map_page,
 	.map_sg			= arm_dma_map_sg,
+	.sync_single_for_cpu	= arm_coherent_dma_sync_single_for_cpu,
+	.sync_sg_for_cpu	= arm_coherent_dma_sync_sg_for_cpu,
 	.set_dma_mask		= arm_dma_set_mask,
 };
 EXPORT_SYMBOL(arm_coherent_dma_ops);
@@ -994,25 +1032,6 @@ void arm_dma_unmap_sg(struct device *dev, struct scatterlist *sg, int nents,
 }
 
 /**
- * arm_dma_sync_sg_for_cpu
- * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices
- * @sg: list of buffers
- * @nents: number of buffers to map (returned from dma_map_sg)
- * @dir: DMA transfer direction (same as was passed to dma_map_sg)
- */
-void arm_dma_sync_sg_for_cpu(struct device *dev, struct scatterlist *sg,
-			int nents, enum dma_data_direction dir)
-{
-	struct dma_map_ops *ops = get_dma_ops(dev);
-	struct scatterlist *s;
-	int i;
-
-	for_each_sg(sg, s, nents, i)
-		ops->sync_single_for_cpu(dev, sg_dma_address(s), s->length,
-					 dir);
-}
-
-/**
  * arm_dma_sync_sg_for_device
  * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices
  * @sg: list of buffers
ARM coherent CPU dma map APIs are assumed to be nops on cache coherent
machines. While this is true, one still needs to ensure that no
outstanding writes are pending in CPU write buffers. To take care
of that, we at least need a memory barrier to commit those changes
to main memory.

Patch is trying to fix those cases. Without such a patch, you will
end up patching device drivers to avoid the synchronisation issues.

Cc: Russell King <linux@arm.linux.org.uk>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
---
 arch/arm/mm/dma-mapping.c |   57 ++++++++++++++++++++++++++++++---------------
 1 file changed, 38 insertions(+), 19 deletions(-)