
[6/7] k3dma: Fix occasional DMA ERR issue by using proper dma api

Message ID 1469073189-9167-7-git-send-email-john.stultz@linaro.org
State New

Commit Message

John Stultz July 21, 2016, 3:53 a.m. UTC
After lots of debugging on an occasional DMA ERR issue, I realized
that the desc structures which we point the dma hardware at are being
allocated out of regular memory. This means that when we fill the desc
structures, the data doesn't always get flushed out to memory by the
time we start the dma transfer, resulting in the dma engine getting
some null values and raising a DMA ERR on the first irq.

Thus, this patch adopts a mechanism similar to the zx296702_dma
driver's: the desc structures are allocated from a dma pool, so the
memory caching rules are set properly and the issue is avoided.
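
At its core this is the standard dmapool pattern; roughly (a minimal
sketch of the calls the patch below makes, with error handling
trimmed):

    #include <linux/dmapool.h>

    /* One pool per dma device; each block holds LLI_BLOCK_SIZE bytes
     * of hardware descriptors, 32-byte aligned for the engine. */
    d->pool = dmam_pool_create(DRIVER_NAME, &op->dev,
                               LLI_BLOCK_SIZE, 32, 0);

    /* dma_pool_alloc() hands back a cpu pointer into dma-coherent
     * memory and fills in the dma address the hardware is given, so
     * descriptor writes can't linger in the cpu cache. */
    ds->desc_hw = dma_pool_alloc(d->pool, GFP_NOWAIT, &ds->desc_hw_lli);

    /* ...and the matching free when the descriptor is released: */
    dma_pool_free(d->pool, ds->desc_hw, ds->desc_hw_lli);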

Cc: Zhangfei Gao <zhangfei.gao@linaro.org>
Cc: Jingoo Han <jg1.han@samsung.com>
Cc: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Cc: Maxime Ripard <maxime.ripard@free-electrons.com>
Cc: Vinod Koul <vinod.koul@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Andy Green <andy@warmcat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>

---
 drivers/dma/k3dma.c | 58 ++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 46 insertions(+), 12 deletions(-)

-- 
1.9.1

Comments

Zhangfei Gao July 21, 2016, 4:26 a.m. UTC | #1
On 07/21/2016 11:53 AM, John Stultz wrote:
> After lots of debugging on an occasional DMA ERR issue, I realized
> that the desc structures which we point the dma hardware at are being
> allocated out of regular memory. This means that when we fill the desc
> structures, the data doesn't always get flushed out to memory by the
> time we start the dma transfer, resulting in the dma engine getting
> some null values and raising a DMA ERR on the first irq.

How about using wmb() flush before start dma to sync desc?

I remember I used dma_pool at first, then did some optimization
referring to Russell's driver.

Thanks
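
The suggestion amounts to something like the following sketch (the
descriptor fields and register name are illustrative of the k3dma
code, not an exact excerpt):

    /* fill the hardware descriptors (ordinary cacheable memory) */
    ds->desc_hw[num].lli = lli;
    ds->desc_hw[num].config = ccfg;

    /* order the descriptor writes before the doorbell... */
    wmb();

    /* ...then point the engine at them */
    writel_relaxed(ds->desc_hw_lli, phy->base + CX_LLI);
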
John Stultz July 21, 2016, 5:22 a.m. UTC | #2
On Wed, Jul 20, 2016 at 9:26 PM, zhangfei <zhangfei.gao@linaro.org> wrote:
>
> On 07/21/2016 11:53 AM, John Stultz wrote:
>>
>> After lots of debugging on an occasional DMA ERR issue, I realized
>> that the desc structures which we point the dma hardware at are being
>> allocated out of regular memory. This means that when we fill the
>> desc structures, the data doesn't always get flushed out to memory by
>> the time we start the dma transfer, resulting in the dma engine
>> getting some null values and raising a DMA ERR on the first irq.
>
> How about using wmb() flush before start dma to sync desc?

So I'm not going to pretend to be an expert here, but my understanding
is that wmb() synchronizes cpu write ordering operations across cpus,
so that the cpus see all the changes made before the wmb() before they
see any changes made after it.  But I'm not sure what effect wmb() has
on cpu-cache-to-device ordering.  I don't think it works as a cache
flush to memory.

Andy's patch introducing the cyclic support actually had a wmb() in it
that I removed, as I couldn't understand clearly why it was there (and
there wasn't a comment explaining it, as required by checkpatch :).
But even with that wmb(), the DMA ERR was still seen.

Only with these two new changes have I gotten to the point where I
can't seem to trigger the DMA error.

thanks
-john
Andy Green July 21, 2016, 6:27 a.m. UTC | #3
On July 21, 2016 1:22:02 PM GMT+08:00, John Stultz <john.stultz@linaro.org> wrote:
>On Wed, Jul 20, 2016 at 9:26 PM, zhangfei <zhangfei.gao@linaro.org>
>wrote:
>>
>> On 07/21/2016 11:53 AM, John Stultz wrote:
>>>
>>> After lots of debugging on an occasional DMA ERR issue, I realized
>>> that the desc structures which we point the dma hardware at are
>>> being allocated out of regular memory. This means that when we fill
>>> the desc structures, the data doesn't always get flushed out to
>>> memory by the time we start the dma transfer, resulting in the dma
>>> engine getting some null values and raising a DMA ERR on the first
>>> irq.
>>
>> How about using wmb() flush before start dma to sync desc?
>
>So I'm not going to pretend to be an expert here, but my understanding
>is that wmb() synchronizes cpu write ordering operations across cpus,

IIUI what the memory barrier does is tell the *compiler* to actually
do any writes that the code asked for, but which otherwise might
actually be deferred past that point.  The compiler doesn't know that
buffer area has other hardware snooping it, so by default it feels it
can play tricks with what seems to it like just generally deferring
spilling registers to memory.  wmb makes sure the compiler's pending
writes actually happen right there.  (writel() etc definitions have
one built-in, so they always do what you asked when you asked).

That can be of interest when syncing with other cores but also to
other IPs that intend to use the written-to area immediately, which is
what's happening here.  Without the barrier the write may not be
issued anywhere, even to local cpu cache, until after the hw is asked
to go read the buffer, corrupting what the hw sees.

>so that the cpus see all the changes made before the wmb() before they
>see any changes made after it.  But I'm not sure what effect wmb() has
>on cpu-cache-to-device ordering.  I don't think it works as a cache
>flush to memory.
>
>Andy's patch introducing the cyclic support actually had a wmb() in it
>that I removed, as I couldn't understand clearly why it was there (and
>there wasn't a comment explaining it, as required by checkpatch :).
>But even with that wmb(), the DMA ERR was still seen.

The rule about the comment is there partially because there's a
general tendency to throw voodoo wmb()s at broken things in case it
helps.  But writing out memory descriptors that other hw is going to
read is a legit use for it, I think.  If the compiler you use wasn't
deferring the write, you won't notice any difference removing it, but
that doesn't mean it shouldn't be there.

>Only with these two new changes have I gotten to the point where I
>can't seem to trigger the DMA error.

Sounds good...

-Andy

>thanks
>-john
Mark Brown July 21, 2016, 10:40 a.m. UTC | #4
On Thu, Jul 21, 2016 at 02:27:02PM +0800, Andy Green wrote:
> On July 21, 2016 1:22:02 PM GMT+08:00, John Stultz <john.stultz@linaro.org> wrote:
> >On Wed, Jul 20, 2016 at 9:26 PM, zhangfei <zhangfei.gao@linaro.org>

Please fix your mail client to word wrap within paragraphs at something
substantially less than 80 columns.  Doing this makes your messages much
easier to read and reply to.

> >> How about using wmb() flush before start dma to sync desc?

> >So I'm not going to pretend to be an expert here, but my understanding
> >is that wmb() synchronizes cpu write ordering operations across cpus,

> IIUI what the memory barrier does is tell the *compiler* to actually
> do any writes that the code asked for, but which otherwise might
> actually be deferred past that point.  The compiler doesn't know that
> buffer area has other hardware snooping it, so by default it feels it
> can play tricks with what seems to it like just generally deferring
> spilling registers to memory.  wmb makes sure the compiler's pending
> writes actually happen right there.  (writel() etc definitions have
> one built-in, so they always do what you asked when you asked).

You might be interested in Mark Rutland's talk from ELC (Stale data, or
how we (mis-)manage modern caches):

    http://events.linuxfoundation.org/sites/events/files/slides/slides_17.pdf
    https://www.youtube.com/watch?v=F0SlIMHRnLk
Andy Green July 21, 2016, 1:12 p.m. UTC | #5
On Thu, 2016-07-21 at 11:40 +0100, Mark Brown wrote:
> On Thu, Jul 21, 2016 at 02:27:02PM +0800, Andy Green wrote:
> > On July 21, 2016 1:22:02 PM GMT+08:00, John Stultz <john.stultz@linaro.org> wrote:
> > > On Wed, Jul 20, 2016 at 9:26 PM, zhangfei <zhangfei.gao@linaro.org>
>
> Please fix your mail client to word wrap within paragraphs at
> something substantially less than 80 columns.  Doing this makes your
> messages much easier to read and reply to.
>
> > > > How about using wmb() flush before start dma to sync desc?
>
> > > So I'm not going to pretend to be an expert here, but my
> > > understanding is that wmb() synchronizes cpu write ordering
> > > operations across cpus,
>
> > IIUI what the memory barrier does is tell the *compiler* to actually
> > do any writes that the code asked for, but which otherwise might
> > actually be deferred past that point.  The compiler doesn't know
> > that buffer area has other hardware snooping it, so by default it
> > feels it can play tricks with what seems to it like just generally
> > deferring spilling registers to memory.  wmb makes sure the
> > compiler's pending writes actually happen right there.  (writel()
> > etc definitions have one built-in, so they always do what you asked
> > when you asked).
>
> You might be interested in Mark Rutland's talk from ELC (Stale data,
> or how we (mis-)manage modern caches):
>
>     http://events.linuxfoundation.org/sites/events/files/slides/slides_17.pdf

Thanks for the pointer.

That is a somewhat different animal to wmb, though: wmb is about
managing when the initiator of the write actualizes it; prior to that,
the write does not instantaneously exist anywhere, and so is not
subject to these coherency nightmares [1].

Also, the presentation is from the hw POV only; at the Linux driver
level most of the considerations can be made moot if you just use the
generic Linux DMA or related apis, which, thanks to the work of the
Arm Linux gods, already know how to deal with / wrap the issues, plus
or minus.  So it's not as dire as it sounds.

-Andy

[1] The details of some of those nightmares are unfortunately very
familiar to me, since I spent many months where Linux was being blamed
for Mali breakage via CCI on a platform that ultimately had its
problems resolved by tweaks in its secure world...

>     https://www.youtube.com/watch?v=F0SlIMHRnLk
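
As a concrete illustration of those generic DMA APIs (a hypothetical
minimal sketch, not code from this driver): a coherent allocation
hands back memory whose caching attributes already suit device access,
so the descriptor contents need no manual flushing.

    #include <linux/dma-mapping.h>

    dma_addr_t dma_handle;

    /* the cpu pointer maps coherent memory; dma_handle is the address
     * to program into the hardware */
    void *ring = dma_alloc_coherent(dev, ring_bytes, &dma_handle,
                                    GFP_KERNEL);
    if (!ring)
            return -ENOMEM;

    /* ... fill descriptors via 'ring', hand 'dma_handle' to the hw ... */

    dma_free_coherent(dev, ring_bytes, ring, dma_handle);
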
Zhangfei Gao July 21, 2016, 4:08 p.m. UTC | #6
On 07/21/2016 01:22 PM, John Stultz wrote:
> On Wed, Jul 20, 2016 at 9:26 PM, zhangfei <zhangfei.gao@linaro.org> wrote:
>>
>> On 07/21/2016 11:53 AM, John Stultz wrote:
>>>
>>> After lots of debugging on an occasional DMA ERR issue, I realized
>>> that the desc structures which we point the dma hardware at are
>>> being allocated out of regular memory. This means that when we fill
>>> the desc structures, the data doesn't always get flushed out to
>>> memory by the time we start the dma transfer, resulting in the dma
>>> engine getting some null values and raising a DMA ERR on the first
>>> irq.
>>
>> How about using wmb() flush before start dma to sync desc?
>
> So I'm not going to pretend to be an expert here, but my understanding
> is that wmb() synchronizes cpu write ordering operations across cpus,
> so that the cpus see all the changes made before the wmb() before they
> see any changes made after it.  But I'm not sure what effect wmb() has
> on cpu-cache-to-device ordering.  I don't think it works as a cache
> flush to memory.
>
> Andy's patch introducing the cyclic support actually had a wmb() in it
> that I removed, as I couldn't understand clearly why it was there (and
> there wasn't a comment explaining it, as required by checkpatch :).
> But even with that wmb(), the DMA ERR was still seen.
>
> Only with these two new changes have I gotten to the point where I
> can't seem to trigger the DMA error.

Yes, you are right.
I have double-checked: we have to use non-cached memory for the dma
descriptors here, instead of cached memory from kzalloc.

And a barrier (wmb or writel) is still needed to ensure the descriptors
are written before the dma is started.  Though since we start the dma
much later, in issue_pending -> tasklet, the chance of hitting this is
low.

Thanks
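
For contrast, the pattern the patch removes put the descriptors in
ordinary cacheable memory (these lines are taken from the old driver
code visible in the diff below; nothing ever flushed the cpu cache
before the hardware read from that address):

    /* old: descriptors in plain kzalloc'd (cacheable) memory */
    ds = kzalloc(sizeof(*ds) + num * sizeof(ds->desc_hw[0]), GFP_ATOMIC);

    /* old: physical address computed by hand and handed to the engine,
     * with no guarantee the descriptor writes had left the cache */
    ds->desc_hw_lli = __virt_to_phys((unsigned long)&ds->desc_hw[0]);
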
John Stultz July 21, 2016, 4:18 p.m. UTC | #7
On Wed, Jul 20, 2016 at 11:27 PM, Andy Green <andy@warmcat.com> wrote:
> On July 21, 2016 1:22:02 PM GMT+08:00, John Stultz <john.stultz@linaro.org> wrote:
>>On Wed, Jul 20, 2016 at 9:26 PM, zhangfei <zhangfei.gao@linaro.org>
>>wrote:
>>>
>>> On 07/21/2016 11:53 AM, John Stultz wrote:
>>>>
>>>> After lots of debugging on an occasional DMA ERR issue, I realized
>>>> that the desc structures which we point the dma hardware at are
>>>> being allocated out of regular memory. This means that when we fill
>>>> the desc structures, the data doesn't always get flushed out to
>>>> memory by the time we start the dma transfer, resulting in the dma
>>>> engine getting some null values and raising a DMA ERR on the first
>>>> irq.
>>>
>>> How about using wmb() flush before start dma to sync desc?
>>
>>So I'm not going to pretend to be an expert here, but my understanding
>>is that wmb() synchronizes cpu write ordering operations across cpus,
>
> IIUI what the memory barrier does is tell the *compiler* to actually
> do any writes that the code asked for, but which otherwise might
> actually be deferred past that point.  The compiler doesn't know that
> buffer area has other hardware snooping it, so by default it feels it
> can play tricks with what seems to it like just generally deferring
> spilling registers to memory.  wmb makes sure the compiler's pending
> writes actually happen right there.  (writel() etc definitions have
> one built-in, so they always do what you asked when you asked).
>
> That can be of interest when syncing with other cores but also to
> other IPs that intend to use the written-to area immediately, which is
> what's happening here.  Without the barrier the write may not be
> issued anywhere, even to local cpu cache, until after the hw is asked
> to go read the buffer, corrupting what the hw sees.
>
>>so that the cpus see all the changes made before the wmb() before they
>>see any changes made after it.  But I'm not sure what effect wmb() has
>>on cpu-cache-to-device ordering.  I don't think it works as a cache
>>flush to memory.
>>
>>Andy's patch introducing the cyclic support actually had a wmb() in it
>>that I removed, as I couldn't understand clearly why it was there (and
>>there wasn't a comment explaining it, as required by checkpatch :).
>>But even with that wmb(), the DMA ERR was still seen.
>
> The rule about the comment is there partially because there's a
> general tendency to throw voodoo wmb()s at broken things in case it
> helps.  But writing out memory descriptors that other hw is going to
> read is a legit use for it, I think.  If the compiler you use wasn't
> deferring the write, you won't notice any difference removing it, but
> that doesn't mean it shouldn't be there.

Though from your comments above, wouldn't using writel(), instead of
writel_relaxed() followed by a wmb(), be sufficient?

thanks
-john
Andy Green July 21, 2016, 7:57 p.m. UTC | #8
On July 22, 2016 12:18:48 AM GMT+08:00, John Stultz <john.stultz@linaro.org> wrote:
>On Wed, Jul 20, 2016 at 11:27 PM, Andy Green <andy@warmcat.com> wrote:
>> On July 21, 2016 1:22:02 PM GMT+08:00, John Stultz <john.stultz@linaro.org> wrote:
>>>On Wed, Jul 20, 2016 at 9:26 PM, zhangfei <zhangfei.gao@linaro.org>
>>>wrote:
>>>>
>>>> On 07/21/2016 11:53 AM, John Stultz wrote:
>>>>>
>>>>> After lots of debugging on an occasional DMA ERR issue, I realized
>>>>> that the desc structures which we point the dma hardware at are
>>>>> being allocated out of regular memory. This means that when we
>>>>> fill the desc structures, the data doesn't always get flushed out
>>>>> to memory by the time we start the dma transfer, resulting in the
>>>>> dma engine getting some null values and raising a DMA ERR on the
>>>>> first irq.
>>>>
>>>> How about using wmb() flush before start dma to sync desc?
>>>
>>>So I'm not going to pretend to be an expert here, but my
>>>understanding is that wmb() synchronizes cpu write ordering
>>>operations across cpus,
>>
>> IIUI what the memory barrier does is tell the *compiler* to actually
>> do any writes that the code asked for, but which otherwise might
>> actually be deferred past that point.  The compiler doesn't know that
>> buffer area has other hardware snooping it, so by default it feels it
>> can play tricks with what seems to it like just generally deferring
>> spilling registers to memory.  wmb makes sure the compiler's pending
>> writes actually happen right there.  (writel() etc definitions have
>> one built-in, so they always do what you asked when you asked).
>>
>> That can be of interest when syncing with other cores but also to
>> other IPs that intend to use the written-to area immediately, which
>> is what's happening here.  Without the barrier the write may not be
>> issued anywhere, even to local cpu cache, until after the hw is asked
>> to go read the buffer, corrupting what the hw sees.
>>
>>>so that the cpus see all the changes made before the wmb() before
>>>they see any changes made after it.  But I'm not sure what effect
>>>wmb() has on cpu-cache-to-device ordering.  I don't think it works
>>>as a cache flush to memory.
>>>
>>>Andy's patch introducing the cyclic support actually had a wmb() in
>>>it that I removed, as I couldn't understand clearly why it was there
>>>(and there wasn't a comment explaining it, as required by checkpatch
>>>:).  But even with that wmb(), the DMA ERR was still seen.
>>
>> The rule about the comment is there partially because there's a
>> general tendency to throw voodoo wmb()s at broken things in case it
>> helps.  But writing out memory descriptors that other hw is going to
>> read is a legit use for it, I think.  If the compiler you use wasn't
>> deferring the write, you won't notice any difference removing it, but
>> that doesn't mean it shouldn't be there.
>
>Though from your comments above, wouldn't using writel(), instead of
>writel_relaxed() followed by a wmb(), be sufficient?

Yes, since on Arm writel() sticks a wmb (ultimately a dsb instruction +
outer_sync()) in front of every write it does.

-Andy

>thanks
>-john
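
Concretely, the two orderings being discussed look roughly like this
(a sketch only; the register names follow k3dma.c, but this is not the
actual driver change):

    /* Option 1: relaxed MMIO writes, with one explicit barrier before
     * the write that actually starts the transfer */
    writel_relaxed(hw->lli, phy->base + CX_LLI);
    wmb();  /* descriptor/setup writes visible before the kick */
    writel_relaxed(hw->config, phy->base + CX_CFG);

    /* Option 2: use writel() for the kick; on Arm it is essentially
     * wmb() + writel_relaxed(), the barrier coming before the store,
     * so everything written earlier is ordered ahead of it */
    writel(hw->config, phy->base + CX_CFG);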

Patch

diff --git a/drivers/dma/k3dma.c b/drivers/dma/k3dma.c
index 950ed36..259bd3b 100644
--- a/drivers/dma/k3dma.c
+++ b/drivers/dma/k3dma.c
@@ -8,6 +8,8 @@ 
  */
 #include <linux/sched.h>
 #include <linux/device.h>
+#include <linux/dma-mapping.h>
+#include <linux/dmapool.h>
 #include <linux/dmaengine.h>
 #include <linux/init.h>
 #include <linux/interrupt.h>
@@ -26,6 +28,7 @@ 
 #define DRIVER_NAME		"k3-dma"
 #define DMA_MAX_SIZE		0x1ffc
 #define DMA_CYCLIC_MAX_PERIOD	0x1000
+#define LLI_BLOCK_SIZE		(4 * PAGE_SIZE)
 
 #define INT_STAT		0x00
 #define INT_TC1			0x04
@@ -74,7 +77,7 @@  struct k3_dma_desc_sw {
 	dma_addr_t		desc_hw_lli;
 	size_t			desc_num;
 	size_t			size;
-	struct k3_desc_hw	desc_hw[0];
+	struct k3_desc_hw	*desc_hw;
 };
 
 struct k3_dma_phy;
@@ -107,6 +110,7 @@  struct k3_dma_dev {
 	struct k3_dma_phy	*phy;
 	struct k3_dma_chan	*chans;
 	struct clk		*clk;
+	struct dma_pool		*pool;
 	u32			dma_channels;
 	u32			dma_requests;
 };
@@ -434,6 +438,35 @@  static void k3_dma_fill_desc(struct k3_dma_desc_sw *ds, dma_addr_t dst,
 	ds->desc_hw[num].config = ccfg;
 }
 
+static struct k3_dma_desc_sw *k3_dma_alloc_desc_resource(int num,
+							struct dma_chan *chan)
+{
+	struct k3_dma_chan *c = to_k3_chan(chan);
+	struct k3_dma_desc_sw *ds;
+	struct k3_dma_dev *d = to_k3_dma(chan->device);
+	int lli_limit = LLI_BLOCK_SIZE / sizeof(struct k3_desc_hw);
+
+	if (num > lli_limit) {
+		dev_dbg(chan->device->dev, "vch %p: sg num %d exceed max %d\n",
+			&c->vc, num, lli_limit);
+		return NULL;
+	}
+
+	ds = kzalloc(sizeof(*ds), GFP_ATOMIC);
+	if (!ds)
+		return NULL;
+
+	ds->desc_hw = dma_pool_alloc(d->pool, GFP_NOWAIT, &ds->desc_hw_lli);
+	if (!ds->desc_hw) {
+		dev_dbg(chan->device->dev, "vch %p: dma alloc fail\n", &c->vc);
+		kfree(ds);
+		return NULL;
+	}
+	memset(ds->desc_hw, 0, sizeof(struct k3_desc_hw) * num);
+	ds->desc_num = num;
+	return ds;
+}
+
 static struct dma_async_tx_descriptor *k3_dma_prep_memcpy(
 	struct dma_chan *chan,	dma_addr_t dst, dma_addr_t src,
 	size_t len, unsigned long flags)
@@ -447,15 +480,14 @@  static struct dma_async_tx_descriptor *k3_dma_prep_memcpy(
 		return NULL;
 
 	num = DIV_ROUND_UP(len, DMA_MAX_SIZE);
-	ds = kzalloc(sizeof(*ds) + num * sizeof(ds->desc_hw[0]), GFP_ATOMIC);
+
+	ds = k3_dma_alloc_desc_resource(num, chan);
 	if (!ds) {
 		dev_dbg(chan->device->dev, "vchan %p: kzalloc fail\n", &c->vc);
 		return NULL;
 	}
 	c->cyclic = 0;
-	ds->desc_hw_lli = __virt_to_phys((unsigned long)&ds->desc_hw[0]);
 	ds->size = len;
-	ds->desc_num = num;
 	num = 0;
 
 	if (!c->ccfg) {
@@ -506,12 +538,9 @@  static struct dma_async_tx_descriptor *k3_dma_prep_slave_sg(
 			num += DIV_ROUND_UP(avail, DMA_MAX_SIZE) - 1;
 	}
 
-	ds = kzalloc(sizeof(*ds) + num * sizeof(ds->desc_hw[0]), GFP_ATOMIC);
+	ds = k3_dma_alloc_desc_resource(num, chan);
 	if (!ds)
 		return NULL;
-
-	ds->desc_hw_lli = __virt_to_phys((unsigned long)&ds->desc_hw[0]);
-	ds->desc_num = num;
 	num = 0;
 
 	for_each_sg(sgl, sg, sglen, i) {
@@ -564,13 +593,10 @@  k3_dma_prep_dma_cyclic(struct dma_chan *chan, dma_addr_t buf_addr,
 	if (avail > modulo)
 		num += DIV_ROUND_UP(avail, modulo) - 1;
 
-	ds = kzalloc(sizeof(*ds) + num * sizeof(ds->desc_hw[0]), GFP_ATOMIC);
+	ds = k3_dma_alloc_desc_resource(num, chan);
 	if (!ds)
 		return NULL;
 
-	ds->desc_hw_lli = __virt_to_phys((unsigned long)&ds->desc_hw[0]);
-	ds->desc_num = num;
-
 	c->cyclic = 1;
 	addr = buf_addr;
 	avail = buf_len;
@@ -664,7 +690,9 @@  static void k3_dma_free_desc(struct virt_dma_desc *vd)
 {
 	struct k3_dma_desc_sw *ds =
 		container_of(vd, struct k3_dma_desc_sw, vd);
+	struct k3_dma_dev *d = to_k3_dma(vd->tx.chan->device);
 
+	dma_pool_free(d->pool, ds->desc_hw, ds->desc_hw_lli);
 	kfree(ds);
 }
 
@@ -810,6 +838,12 @@  static int k3_dma_probe(struct platform_device *op)
 	if (ret)
 		return ret;
 
+	/* A DMA memory pool for LLIs, align on 32-byte boundary */
+	d->pool = dmam_pool_create(DRIVER_NAME, &op->dev,
+					LLI_BLOCK_SIZE, 32, 0);
+	if (!d->pool)
+		return -ENOMEM;
+
 	/* init phy channel */
 	d->phy = devm_kzalloc(&op->dev,
 		d->dma_channels * sizeof(struct k3_dma_phy), GFP_KERNEL);