[Xen-devel] mpt3sas bug with Debian jessie kernel only under Xen - "swiotlb buffer is full"

Message ID 20161206024350.GV1804@bitfolk.com
State New

Commit Message

Andy Smith Dec. 6, 2016, 2:43 a.m.
Hi Andrew,

On Sun, Dec 04, 2016 at 03:59:20PM +0000, Andrew Cooper wrote:
> On 04/12/16 08:32, Andy Smith wrote:
> > Under the Debian jessie amd64 kernel (linux-image-3.16.0-4-amd64
> > 3.16.36-1+deb8u2) running under Xen, I cannot put the system's
> > storage under heavy load without receiving a bunch of "swiotlb
> > buffer is full" kernel error messages and severely degraded
> > performance. Sometimes the system panics and reboots itself.

[…]

> Can you try these two patches from the XenServer Patch queue?
> https://github.com/xenserver/linux-3.x.pg/blob/master/master/series#L613-L614

Looking good.

Using those patches I'm ~20 minutes into this now:

Every 2.0s: cat /proc/mdstat                                 Tue Dec  6 02:16:40 2016

Personalities : [raid1] [raid10]
md5 : active raid10 sdb[0] sda[1]
      1875243008 blocks super 1.2 512K chunks 2 far-copies [2/2] [UU]
      [==>..................]  check = 11.5% (217058176/1875243008) finish=133.9min speed=206252K/sec
      bitmap: 0/14 pages [0KB], 65536KB chunk

md4 : active raid10 sdc[0] sdd[1]
      3906886656 blocks super 1.2 512K chunks 2 far-copies [2/2] [UU]
      [>....................]  check =  2.6% (102650880/3906886656) finish=674.4min speed=94007K/sec
      bitmap: 0/30 pages [0KB], 65536KB chunk

…where previously it would have given kernel errors within 5
seconds, so I think that fixes it. I will have to perform some more
strenuous testing.
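
One way to watch for a recurrence during heavier testing is to count the error string in the kernel log. This is only a minimal sketch; the function name count_swiotlb_errors is made up for illustration, and it assumes the message text stays exactly as in the log excerpt quoted in this thread.

```shell
# Hypothetical helper: count "swiotlb buffer is full" lines read from stdin.
count_swiotlb_errors() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # there are none, so "|| true" keeps the zero case from looking like
    # a failure to the caller.
    grep -c 'swiotlb buffer is full' || true
}
```

For example, "dmesg | count_swiotlb_errors" run before and after a RAID check should print the same number if the problem is gone.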

Those two patches did not apply cleanly to the source of
linux-image-3.16.0-4-amd64 3.16.36-1+deb8u2. The last hunk of each
patch was rejected, so I removed those hunks and put equivalent
changes into a separate patch file (0003-fixup.patch attached).
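
To see which patches would be rejected before touching the tree, something like the following could be used. It is a sketch only, assuming GNU patch is installed; check_patches is a hypothetical name, and the tree and patch paths are whatever you pass in.

```shell
# Hypothetical helper: report which patches would apply cleanly to a tree.
# Usage: check_patches TREE PATCH...
check_patches() {
    tree=$1
    shift
    for p in "$@"; do
        # --dry-run tests the patch without modifying any files;
        # --force suppresses interactive questions on failure.
        if patch -d "$tree" -p1 --dry-run --force --quiet < "$p" >/dev/null 2>&1; then
            echo "clean: $p"
        else
            echo "rejects: $p"
        fi
    done
}
```

Because nothing is written during a dry run, each patch is tested against the unmodified tree, so this does not catch a patch that only fails once an earlier one in the series has been applied.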

I have not done this process in a long time so just for the
archives, my process was as per:

    https://kernel-handbook.alioth.debian.org/ch-common-tasks.html#s-common-official

# mkdir -p /data/debian
# chown andy: /data/debian
# apt-get install build-essential fakeroot
# apt-get build-dep linux
$ cd /data/debian
$ apt-get source linux
$ wget https://raw.githubusercontent.com/xenserver/linux-3.x.pg/master/master/0001-dma-add-dma_get_required_mask_from_max_pfn.patch
$ wget https://raw.githubusercontent.com/xenserver/linux-3.x.pg/master/master/0002-x86-xen-correct-dma_get_required_mask-for-Xen-PV-gue.patch
$ # remove last parts of each patch file, create 0003-fixup.patch that performs equivalent changes
$ cd linux-3.16.36
$ # applying these patches is going to change symbols so changing the abiname
$ # is necessary.
$ # See https://kernel-handbook.alioth.debian.org/ch-versions.html#s-abi-name
$ sed -i -e 's/^abiname: 4/abiname: 4bf/' debian/config/defines
$ fakeroot debian/rules debian/control-real
$ bash debian/bin/test-patches -f amd64 ../0001-dma-add-dma_get_required_mask_from_max_pfn.patch ../0002-x86-xen-correct-dma_get_required_mask-for-Xen-PV-gue.patch ../0003-fixup.patch
# dpkg -i ../linux-headers-3.16.0-4bf-amd64_3.16.36-1+deb8u2a~test_amd64.deb ../linux-image-3.16.0-4bf-amd64_3.16.36-1+deb8u2a~test_amd64.deb

Then boot into the new kernel under Xen:

$ uname -a
Linux elephant 3.16.0-4bf-amd64 #1 SMP Debian 3.16.36-1+deb8u2a~test (2016-12-05) x86_64 GNU/Linux

I think my next steps should be:

1. Do some more strenuous testing

2. Report bug against source package "linux" in Debian jessie with
   pointer to those two patches.

3. Check if those fixes are already applied in Debian backports
   and/or Debian testing linux package.
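
For step 3, one rough check is to search an unpacked source tree for a symbol that the patches introduce, such as dma_get_required_mask_from_max_pfn or xen_swiotlb_get_required_mask. A minimal sketch follows; has_symbol is a hypothetical name, and a missing symbol does not prove the bug is unfixed, since a later kernel might fix it under different names.

```shell
# Hypothetical helper: does a source tree mention the given symbol?
# Usage: has_symbol TREE SYMBOL   (exit status 0 if found)
has_symbol() {
    tree=$1
    symbol=$2
    # -r: recurse, -q: exit status only, -F: fixed string,
    # -w: whole word, so a longer identifier sharing the prefix
    # does not count as a match.
    grep -rqFw -- "$symbol" "$tree"
}
```

For example, after "apt-get source linux" for the release in question, "has_symbol linux-3.16.36 dma_get_required_mask_from_max_pfn && echo present" prints "present" only if the symbol appears somewhere in that tree.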

> > Dec  4 07:06:00 elephant kernel: [22019.373653] mpt3sas 0000:01:00.0: swiotlb buffer is full (sz: 57344 bytes)
> > Dec  4 07:06:00 elephant kernel: [22019.374707] mpt3sas 0000:01:00.0: swiotlb buffer is full
> > Dec  4 07:06:00 elephant kernel: [22019.375754] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
> > Dec  4 07:06:00 elephant kernel: [22019.376430] IP: [<ffffffffa004e779>] _base_build_sg_scmd_ieee+0x1f9/0x2d0 [mpt3sas]
> > Dec  4 07:06:00 elephant kernel: [22019.377122] PGD 0
>
> This alone is a clear error handling bug in the mpt3sas driver.  It
> hasn't checked the DMA mapping call for a successful mapping before
> following the NULL pointer it got given back.  It is collateral damage
> from the swiotlb buffer being full, but a bug none the less.

Does that need to be reported as an upstream Linux bug against
mpt3sas, then?

Thanks for your help.

Cheers,
Andy

Patch

diff -u a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
--- a/drivers/xen/swiotlb-xen.c	2016-06-15 20:29:36.000000000 +0000
+++ b/drivers/xen/swiotlb-xen.c	2016-12-05 07:05:13.009992832 +0000
@@ -673,6 +673,13 @@ 
 }
 EXPORT_SYMBOL_GPL(xen_swiotlb_dma_supported);
 
+u64
+xen_swiotlb_get_required_mask(struct device *dev)
+{
+	return DMA_BIT_MASK(64);
+}
+EXPORT_SYMBOL_GPL(xen_swiotlb_get_required_mask);
+
 int
 xen_swiotlb_set_dma_mask(struct device *dev, u64 dma_mask)
 {
diff -u a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
--- a/include/linux/dma-mapping.h	2016-06-15 20:29:36.000000000 +0000
+++ b/include/linux/dma-mapping.h	2016-12-05 07:03:13.992601404 +0000
@@ -127,6 +127,7 @@ 
 	return dma_set_mask_and_coherent(dev, mask);
 }
 
+extern u64 dma_get_required_mask_from_max_pfn(struct device *dev);
 extern u64 dma_get_required_mask(struct device *dev);
 
 #ifndef set_arch_dma_coherent_ops
diff -u a/include/xen/swiotlb-xen.h b/include/xen/swiotlb-xen.h
--- a/include/xen/swiotlb-xen.h	2016-06-15 20:29:36.000000000 +0000
+++ b/include/xen/swiotlb-xen.h	2016-12-05 07:06:01.084938801 +0000
@@ -56,6 +56,10 @@ 
 extern int
 xen_swiotlb_dma_supported(struct device *hwdev, u64 mask);
 
+extern u64
+xen_swiotlb_get_required_mask(struct device *dev);
+
+
 extern int
 xen_swiotlb_set_dma_mask(struct device *dev, u64 dma_mask);
 #endif /* __LINUX_SWIOTLB_XEN_H */