Message ID: 1400143605.1006.1.camel@kazak.uk.xensource.com
State: New
On 15.05.2014 10:46, Ian Campbell wrote:
> On Wed, 2014-05-14 at 20:49 +0100, Zoltan Kiss wrote:
>> On 13/05/14 19:21, Stefan Bader wrote:
>>> We had reports about this message being seen on EC2 for a while, but
>>> finally a reporter noticed some details about the guests and was able to
>>> provide a simple way to reproduce it [1].
>>>
>>> For my local experiments I use a Xen-4.2.2 based host (though I would say
>>> the host version is not important). The host has one NIC which is used as
>>> the outgoing port of a Linux-based (not openvswitch) bridge, and the PV
>>> guests use that bridge. I set the mtu to 9001 (which was seen on the
>>> affected instance types), and also inside the guests. As described in the
>>> report, one guest runs redis-server and the other nodejs through two
>>> scripts (for me, I had to do the two sub.js calls in separate shells).
>>> After a bit, the error messages appear on the guest running the
>>> redis-server.
>>>
>>> I added some debug printk's to show a bit more detail about the skb and
>>> got the following (<length>@<offset (after masking off complete pages)>):
>>>
>>> [ 698.108119] xen_netfront: xennet: skb rides the rocket: 19 slots
>>> [ 698.108134]  header 1490@238 -> 1 slots
>>> [ 698.108139]  frag #0 1614@2164 -> + 1 pages
>>> [ 698.108143]  frag #1 3038@1296 -> + 2 pages
>>> [ 698.108147]  frag #2 6076@1852 -> + 2 pages
>>> [ 698.108151]  frag #3 6076@292 -> + 2 pages
>>> [ 698.108156]  frag #4 6076@2828 -> + 3 pages
>>> [ 698.108160]  frag #5 3038@1268 -> + 2 pages
>>> [ 698.108164]  frag #6 2272@1824 -> + 1 pages
>>> [ 698.108168]  frag #7 3804@0 -> + 1 pages
>>> [ 698.108172]  frag #8 6076@264 -> + 2 pages
>>> [ 698.108177]  frag #9 3946@2800 -> + 2 pages
>>> [ 698.108180]  frags adding 18 slots
>>>
>>> Since I am not deeply familiar with the networking code, I wonder about
>>> two things:
>>> - is there something that should limit the skb data length from all frags
>>>   to stay below the 64K which the definition of MAX_SKB_FRAGS hints at?
>> I think netfront should be able to handle 64K packets at most.
>
> Ah, maybe this relates to this fix from Wei?

That should indeed (imo) limit the overall size, and if that limit were still
exceeded we would see a different message. The problem I see is not the
overall size but the layout of the frags: because several of them start at a
high enough offset within their page, the count of slots goes too high.

Now I have to read that code in more detail, but there have also been some
changes which introduce a make-frags function. I wonder whether some kind of
shifting (or copying) is already going on and it is just the check that needs
to improve. But right now that is speculation.

For the fun of it, I actually found an even simpler way to trigger this (still
to be confirmed without changing the mtu first; it looked like the mtu did not
actually make a difference). One only needs the redis server on one guest and
redis-benchmark run from the other with -d 1000 (or whatever option sets the
data size). On the range tests this then happens quite often.

-Stefan

> commit 9ecd1a75d977e2e8c48139c7d3efed183f898d94
> Author: Wei Liu <wei.liu2@citrix.com>
> Date:   Mon Apr 22 02:20:41 2013 +0000
>
>     xen-netfront: reduce gso_max_size to account for max TCP header
>
>     The maximum packet including header that can be handled by netfront /
>     netback wire format is 65535. Reduce gso_max_size accordingly.
>
>     Drop skb and print warning when skb->len > 65535. This can 1) save the
>     effort to send malformed packet to netback, 2) help spotting
>     misconfiguration of netfront in the future.
>
>     Signed-off-by: Wei Liu <wei.liu2@citrix.com>
>     Acked-by: Ian Campbell <ian.campbell@citrix.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
>
> [...]
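For reference, netfront estimates the slot count as DIV_ROUND_UP(offset + len,
PAGE_SIZE) for the linear header and for each frag. The standalone program
below is an editorial illustration, not part of the thread; it simply replays
that arithmetic on the sizes and offsets from the log above and reproduces the
19 slots reported for this roughly 43KB skb.

/* Reproduce netfront's slot count for the logged skb layout.
 * Build with: cc -o slots slots.c && ./slots
 */
#include <stdio.h>

#define PAGE_SIZE 4096
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

struct chunk { unsigned int len, off; };

int main(void)
{
	/* header first, then frags #0..#9, as printed in the log */
	struct chunk c[] = {
		{1490, 238}, {1614, 2164}, {3038, 1296}, {6076, 1852},
		{6076, 292}, {6076, 2828}, {3038, 1268}, {2272, 1824},
		{3804, 0},   {6076, 264},  {3946, 2800},
	};
	unsigned int i, slots = 0, bytes = 0;

	for (i = 0; i < sizeof(c) / sizeof(c[0]); i++) {
		slots += DIV_ROUND_UP(c[i].off + c[i].len, PAGE_SIZE);
		bytes += c[i].len;
	}
	/* Prints: 43506 bytes -> 19 slots (limit is MAX_SKB_FRAGS + 1 = 18) */
	printf("%u bytes -> %u slots\n", bytes, slots);
	return 0;
}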
Thursday, May 15, 2014, 10:58:38 AM, you wrote:

> On 15.05.2014 10:46, Ian Campbell wrote:
>> On Wed, 2014-05-14 at 20:49 +0100, Zoltan Kiss wrote:
>>> On 13/05/14 19:21, Stefan Bader wrote:
>>>> [...]
>>> I think netfront should be able to handle 64K packets at most.
>>
>> Ah, maybe this relates to this fix from Wei?

> That should indeed (imo) limit the overall size, and if that limit were still
> exceeded we would see a different message. The problem I see is not the
> overall size but the layout of the frags: because several of them start at a
> high enough offset within their page, the count of slots goes too high.
> [...]

Netback seems to be able to cope with it, but properly calculating and
checking the needed slots has proved difficult (see the discussion with Paul
Durrant for 3.14). From my tests there, it seemed that the maximum number of
slots was capped at MAX_SKB_FRAGS. I assume that can be explained by packets
having a maximum size: if there are frags whose offsets require more pages per
frag, that is counterbalanced by the total packet size already being limited,
so the packet would have to have fewer frags. That is at least the calculation
method the netback code now uses for the required slots: pessimistic, with a
cap at MAX_SKB_FRAGS. But I am not 100% sure that assumption is theoretically
sound.

> [...]
On Thu, May 15, 2014 at 09:46:45AM +0100, Ian Campbell wrote:
> On Wed, 2014-05-14 at 20:49 +0100, Zoltan Kiss wrote:
> > On 13/05/14 19:21, Stefan Bader wrote:
> > > [...]
> > > Since I am not deeply familiar with the networking code, I wonder about two things:
> > > - is there something that should limit the skb data length from all frags
> > >   to stay below the 64K which the definition of MAX_SKB_FRAGS hints at?
> > I think netfront should be able to handle 64K packets at most.
>
> Ah, maybe this relates to this fix from Wei?
>

Yes, that patch limits SKB size to 64KB. However, the problem here is not an
SKB exceeding 64KB; the SKB in question is actually 43KB in size. The problem
is that the guest kernel is using compound pages, so a frag which could fit
into one 4K page spans two 4K pages. The fix seems to be coalescing the SKB in
the frontend, but that will degrade performance.

Wei.
From: Wei Liu
> On Thu, May 15, 2014 at 09:46:45AM +0100, Ian Campbell wrote:
> [...]
>
> Yes, that patch limits SKB size to 64KB. However, the problem here is not an
> SKB exceeding 64KB; the SKB in question is actually 43KB in size. The problem
> is that the guest kernel is using compound pages, so a frag which could fit
> into one 4K page spans two 4K pages. The fix seems to be coalescing the SKB
> in the frontend, but that will degrade performance.

Presumably the thing to do is to copy the frag list into a locally allocated
list of 4k fragments. Unless you can support having some fragments mapped and
some local, there probably isn't much point in trying to maintain the fragment
boundaries.

	David
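A rough sketch of that suggestion, purely for illustration: the helper name
xennet_repack_frags is made up, the code is untested kernel-style C built only
on generic skb helpers, and truesize/gso accounting plus proper error
unwinding are left out. The idea is to copy the paged data into freshly
allocated pages filled from offset 0, so every fragment maps to exactly one
4k grant slot.

/* Hypothetical, untested sketch of the "copy and refragment" idea: repack the
 * paged data into freshly allocated pages filled from offset 0 so that no
 * fragment straddles a page boundary. truesize accounting, frag-list skbs and
 * error unwinding are deliberately ignored here.
 */
static int xennet_repack_frags(struct sk_buff *skb)
{
	unsigned int datalen = skb->data_len;	/* bytes held in frags */
	unsigned int headlen = skb_headlen(skb);
	struct page *pages[MAX_SKB_FRAGS];
	unsigned int copied = 0;
	int i, n = 0;

	/* Copy the paged data out while the old frags are still in place.
	 * skb->len <= 65535 at this point, so n never exceeds 16.
	 */
	while (copied < datalen) {
		unsigned int chunk = min_t(unsigned int,
					   datalen - copied, PAGE_SIZE);

		pages[n] = alloc_page(GFP_ATOMIC);
		if (!pages[n])
			return -ENOMEM;		/* leaks pages[0..n-1] */
		if (skb_copy_bits(skb, headlen + copied,
				  page_address(pages[n]), chunk))
			return -EFAULT;
		copied += chunk;
		n++;
	}

	/* Drop the old fragments and install the page-aligned copies.
	 * skb->len and skb->data_len are unchanged by this.
	 */
	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
		skb_frag_unref(skb, i);

	copied = 0;
	for (i = 0; i < n; i++) {
		unsigned int chunk = min_t(unsigned int,
					   datalen - copied, PAGE_SIZE);

		__skb_fill_page_desc(skb, i, pages[i], 0, chunk);
		copied += chunk;
	}
	skb_shinfo(skb)->nr_frags = n;

	return 0;
}

Whether something like this actually beats skb_linearize() would need
measuring; it avoids the large contiguous allocation but still copies every
paged byte once.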
On Thu, 2014-05-15 at 12:04 +0100, Wei Liu wrote:
> On Thu, May 15, 2014 at 09:46:45AM +0100, Ian Campbell wrote:
> [...]
>
> Yes, that patch limits SKB size to 64KB. However, the problem here is not an
> SKB exceeding 64KB; the SKB in question is actually 43KB in size. The problem
> is that the guest kernel is using compound pages, so a frag which could fit
> into one 4K page spans two 4K pages. The fix seems to be coalescing the SKB
> in the frontend, but that will degrade performance.

So long as it only happens when this scenario occurs, a performance
degradation would seem preferable to dropping the skb altogether.

Ian.
On 15.05.2014 13:04, Wei Liu wrote:
> On Thu, May 15, 2014 at 09:46:45AM +0100, Ian Campbell wrote:
> [...]
>
> Yes, that patch limits SKB size to 64KB. However, the problem here is not an
> SKB exceeding 64KB; the SKB in question is actually 43KB in size. The problem
> is that the guest kernel is using compound pages, so a frag which could fit
> into one 4K page spans two 4K pages. The fix seems to be coalescing the SKB
> in the frontend, but that will degrade performance.
>
> Wei.
>

Reading more of the code, I would agree. The definition of MAX_SKB_FRAGS (at
least now, with compound pages) cannot be used in any way to derive the number
of 4k slots a transfer will require.

Zoltan already commented on worst cases. I am not sure it would get as bad as
that, or "just" 16 4k frags all in the middle of compound pages; that would
end up at around 33 or 34 slots, depending on the header.

Zoltan wrote:
> I think the worst case scenario is when every frag and the linear buffer
> contains 2 bytes which overlap a page boundary (that's (17+1)*2=36 so far),
> plus 15 of them have a 4k page in the middle, so a 1+4096+1 byte buffer can
> span 3 pages. That's 51 individual pages.

I cannot claim to really know what to expect as the worst case. I was somewhat
thinking of a worst case of (16+1)*2, which would be inconvenient enough.

So, without knowing exactly how to do it, as Ian said it sounds best to come
up with some sort of exception coalescing for the cases where the slot count
goes over 18 and we know the data size is below 64K.

-Stefan
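As a sanity check of those numbers (an editorial illustration only, applying
the same DIV_ROUND_UP(offset + len, PAGE_SIZE) rule netfront uses per buffer):
Zoltan's worst case stays under the 64K limit and still needs 51 slots, far
beyond MAX_SKB_FRAGS + 1 = 18.

/* Zoltan's worst case: the linear area plus 17 frags each straddle a page
 * boundary starting at the last byte of a page; 15 of them also carry a full
 * 4k page in the middle (1+4096+1 bytes spanning 3 pages), the rest 1+1 bytes.
 */
#include <stdio.h>

#define PAGE_SIZE 4096
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

int main(void)
{
	unsigned int bytes = 0, slots = 0, i;

	for (i = 0; i < 18; i++) {			/* linear + 17 frags */
		unsigned int len = (i < 15) ? 4098 : 2;
		unsigned int off = PAGE_SIZE - 1;	/* last byte of a page */

		bytes += len;
		slots += DIV_ROUND_UP(off + len, PAGE_SIZE);
	}
	/* Prints: 61476 bytes -> 51 slots (61476 <= 65535) */
	printf("%u bytes -> %u slots\n", bytes, slots);
	return 0;
}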
On Fri, May 16, 2014 at 10:48:42AM +0100, Wei Liu wrote:
> On Thu, May 15, 2014 at 02:14:00PM +0200, Stefan Bader wrote:
> [...]
> >
> > So, without knowing exactly how to do it, as Ian said it sounds best to
> > come up with some sort of exception coalescing for the cases where the
> > slot count goes over 18 and we know the data size is below 64K.
> >
>
> I took a stab at it this morning and came up with this patch. Ran
> redis-benchmark, it seemed to fix that for me -- only saw one "failed to
> linearize skb" during
>
> redis-benchmark -h XXX -d 1000 -t lrange
>
> And before this change, a lot of "rides rocket" were triggered.
>
> Thoughts?
>

And don't take this as a formal patch, because I haven't tested it thoroughly.
It's just an idea for coalescing SKBs. We can certainly code up a loop to do
that ourselves.

Wei.
From: Wei Liu
> On Thu, May 15, 2014 at 02:14:00PM +0200, Stefan Bader wrote:
> [...]
>
> I took a stab at it this morning and came up with this patch. Ran
> redis-benchmark, it seemed to fix that for me -- only saw one "failed to
> linearize skb" during
>
> redis-benchmark -h XXX -d 1000 -t lrange
>
> And before this change, a lot of "rides rocket" were triggered.
>
> Thoughts?
>
> ---8<---
> From 743495a2b2d338fc6cfe9bfd4b6e840392b87f4a Mon Sep 17 00:00:00 2001
> From: Wei Liu <wei.liu2@citrix.com>
> Date: Fri, 16 May 2014 10:39:01 +0100
> Subject: [PATCH] xen-netfront: linearize SKB if it occupies too many slots
>
> Some workloads, such as Redis, can generate SKBs which make use of compound
> pages. Netfront doesn't quite like that, because it doesn't want to send
> excessive slots to the backend, as the backend might deem that malicious. On
> the flip side these packets are actually legit; the size check at the
> beginning of xennet_start_xmit ensures that the packet size is below 64K.
>
> So we linearize the SKB if it occupies too many slots. If the linearization
> fails then the SKB is dropped.
>
> Signed-off-by: Wei Liu <wei.liu2@citrix.com>
> ---
>  drivers/net/xen-netfront.c | 18 +++++++++++++++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> index 895355d..0361fc5 100644
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -573,9 +573,21 @@ static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  	slots = DIV_ROUND_UP(offset + len, PAGE_SIZE) +
>  		xennet_count_skb_frag_slots(skb);
>  	if (unlikely(slots > MAX_SKB_FRAGS + 1)) {
> -		net_alert_ratelimited(
> -			"xennet: skb rides the rocket: %d slots\n", slots);
> -		goto drop;
> +		if (skb_linearize(skb)) {

You don't need to actually linearize the skb here.
One with multiple fragments is fine.
I'm not sure there is a standard function to 'copy and refragment'
the skb data though.

> +			net_alert_ratelimited(
> +				"xennet: failed to linearize skb, skb dropped\n");
> +			goto drop;
> +		}
> +		data = skb->data;
> +		offset = offset_in_page(data);
> +		len = skb_headlen(skb);
> +		slots = DIV_ROUND_UP(offset + len, PAGE_SIZE) +
> +			xennet_count_skb_frag_slots(skb);

IIRC, if you have called skb_linearize then there shouldn't be any fragments.

> +		if (unlikely(slots > MAX_SKB_FRAGS + 1)) {
> +			net_alert_ratelimited(
> +				"xennet: still too many slots after linearization: %d", slots);
> +			goto drop;
> +		}
>  	}
>
>  	spin_lock_irqsave(&np->tx_lock, flags);
> --
> 1.7.10.4

	David
On 16.05.2014 11:48, Wei Liu wrote:
> On Thu, May 15, 2014 at 02:14:00PM +0200, Stefan Bader wrote:
>> [...]
>
> I took a stab at it this morning and came up with this patch. Ran
> redis-benchmark, it seemed to fix that for me -- only saw one "failed to
> linearize skb" during
>
> redis-benchmark -h XXX -d 1000 -t lrange
>
> And before this change, a lot of "rides rocket" were triggered.
>
> Thoughts?

It appears, at least to me, as something that nicely makes use of existing
code. I was wondering about what could or could not be used; trying to get
one's head around the whole thing is quite a lot to look at.

The change at least looks straightforward enough.

-Stefan

> [...]
On 16.05.2014 12:09, Stefan Bader wrote:
> On 16.05.2014 11:48, Wei Liu wrote:
>> [...]
>>
>> I took a stab at it this morning and came up with this patch. Ran
>> redis-benchmark, it seemed to fix that for me -- only saw one "failed to
>> linearize skb" during
>>
>> redis-benchmark -h XXX -d 1000 -t lrange
>>
>> And before this change, a lot of "rides rocket" were triggered.
>>
>> Thoughts?
>
> It appears, at least to me, as something that nicely makes use of existing
> code. I was wondering about what could or could not be used; trying to get
> one's head around the whole thing is quite a lot to look at.
>
> The change at least looks straightforward enough.

The only woe for me is that I am looking puzzled at the implementation of
skb_linearize(). Somehow the data_len element decides whether an skb can be
linearized, and basically how much it tries to pull from the tail. It probably
makes sense ... just not to me, without deep experience here.

-Stefan
On Fri, May 16, 2014 at 10:05:46AM +0000, David Laight wrote:
[...]
> > +		if (skb_linearize(skb)) {
>
> You don't need to actually linearize the skb here.
> One with multiple fragments is fine.

Yes, I know; but my idea is to get an SKB that doesn't use up too many slots --
a linearized SKB will not use too many slots.

> I'm not sure there is a standard function to 'copy and refragment'
> the skb data though.
>

That can only help if we can control how the fragment pages are allocated.

> > +		data = skb->data;
> > +		offset = offset_in_page(data);
> > +		len = skb_headlen(skb);
> > +		slots = DIV_ROUND_UP(offset + len, PAGE_SIZE) +
> > +			xennet_count_skb_frag_slots(skb);
>
> IIRC, if you have called skb_linearize then there shouldn't be any fragments.
>

Plain copy-and-paste from the code above. I will pay more attention if it
becomes a formal patch. ;-)

Wei.
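One way to see why linearizing is sufficient (editorial note, assuming the
linear buffer is a single physically contiguous kmalloc allocation, which it
is): after skb_linearize() the linear area is the only slot consumer, and with
skb->len already capped at XEN_NETIF_MAX_TX_SIZE = 65535 by the earlier check,

    slots = DIV_ROUND_UP(offset_in_page(skb->data) + skb->len, PAGE_SIZE)
         <= DIV_ROUND_UP(4095 + 65535, 4096)
          = 17
         <= MAX_SKB_FRAGS + 1 = 18

so a successfully linearized SKB can never trip the slot check again.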
On Fri, May 16, 2014 at 12:17:48PM +0200, Stefan Bader wrote:
> [...]
>
> The only woe for me is that I am looking puzzled at the implementation of
> skb_linearize(). Somehow the data_len element decides whether an skb can be
> linearized, and basically how much it tries to pull from the tail. It
> probably makes sense ... just not to me, without deep experience here.
>

data_len is the size of the paged data (that is, the size of the data in the
frags). If everything is pulled from the frags into the linear area, then the
SKB is linearized.

Wei.
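For anyone else puzzled by the same thing: the helpers in
include/linux/skbuff.h look roughly like this (paraphrased from the 3.x
sources, comments mine). skb_is_nonlinear() is simply skb->data_len != 0, and
linearizing means pulling all data_len paged bytes into the linear area.

/* Roughly as in include/linux/skbuff.h (3.x): pull all skb->data_len bytes of
 * paged data into the linear area; fails only if memory cannot be allocated.
 */
static inline int __skb_linearize(struct sk_buff *skb)
{
	return __pskb_pull_tail(skb, skb->data_len) ? 0 : -ENOMEM;
}

static inline int skb_linearize(struct sk_buff *skb)
{
	/* skb_is_nonlinear(skb) is just skb->data_len != 0 */
	return skb_is_nonlinear(skb) ? __skb_linearize(skb) : 0;
}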
diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 1bb2e20..1db10141 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -36,7 +36,7 @@
 #include <linux/skbuff.h>
 #include <linux/ethtool.h>
 #include <linux/if_ether.h>
-#include <linux/tcp.h>
+#include <net/tcp.h>
 #include <linux/udp.h>
 #include <linux/moduleparam.h>
 #include <linux/mm.h>
@@ -547,6 +547,16 @@ static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	unsigned int len = skb_headlen(skb);
 	unsigned long flags;
 
+	/* If skb->len is too big for wire format, drop skb and alert
+	 * user about misconfiguration.
+	 */
+	if (unlikely(skb->len > XEN_NETIF_MAX_TX_SIZE)) {
+		net_alert_ratelimited(
+			"xennet: skb->len = %u, too big for wire format\n",
+			skb->len);
+		goto drop;
+	}
+
 	slots = DIV_ROUND_UP(offset + len, PAGE_SIZE) +
 		xennet_count_skb_frag_slots(skb);
 	if (unlikely(slots > MAX_SKB_FRAGS + 1)) {
@@ -1058,7 +1068,8 @@ err:
 
 static int xennet_change_mtu(struct net_device *dev, int mtu)
 {
-	int max = xennet_can_sg(dev) ? 65535 - ETH_HLEN : ETH_DATA_LEN;
+	int max = xennet_can_sg(dev) ?
+		XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER : ETH_DATA_LEN;
 
 	if (mtu > max)
 		return -EINVAL;
@@ -1362,6 +1373,8 @@ static struct net_device *xennet_create_dev(struct xenbus_device *dev)
 	SET_ETHTOOL_OPS(netdev, &xennet_ethtool_ops);
 	SET_NETDEV_DEV(netdev, &dev->dev);
 
+	netif_set_gso_max_size(netdev, XEN_NETIF_MAX_TX_SIZE - MAX_TCP_HEADER);
+
 	np->netdev = netdev;
 
 	netif_carrier_off(netdev);
diff --git a/include/xen/interface/io/netif.h b/include/xen/interface/io/netif.h
index 9dfc120..58fadca 100644
--- a/include/xen/interface/io/netif.h
+++ b/include/xen/interface/io/netif.h
@@ -47,6 +47,7 @@
 #define _XEN_NETTXF_extra_info		(3)
 #define  XEN_NETTXF_extra_info		(1U<<_XEN_NETTXF_extra_info)
 
+#define XEN_NETIF_MAX_TX_SIZE 0xFFFF
 struct xen_netif_tx_request {
 	grant_ref_t gref;	/* Reference to buffer page */
 	uint16_t offset;	/* Offset within buffer page */