[Xen-devel,BUG] : [PATCH added] Incorrect queue counters in netback interface

Message ID 585D338C.5020905@greenhost.nl
State New
Headers show

Commit Message

Mart van Santen Dec. 23, 2016, 2:24 p.m.
Hello Xen Devel,

We encountered an issue with one of our instances pushing a lot of
(network) data, resulting in incorrect bandwidth reporting. We do poll
interface traffic every 5 minutes for every interface of our instances
(from Dom0) and store those values for reporting/graphing. At some point
we discovered reports of machines pushing a lot of data, were incorrect
and were going down in chucks of about 4 GB, while you expect this is an
always increasing number (or wraps around a 32 or 64 bit counter)

After futher analysing the problem and the source code we discovered
that the counters of the queues in the code are handled as 32-bit
integers, while the total bytes tx/rx for the while interface are
counted as 64-bit longs.


Attached a patch to solve the problem.


Regards,


Mart van Santen





Below test results:

Test existed of 2 DomU's on the same Dom0. Stats were gatherer from
Dom0. The 2 DomU's were pushing data to each other using a network
performance tool (iperf)


Non-patched kernel results:


node309 statistics # pwd
/sys/devices/vif-2-0/net/vif2.0/statistics
node309 statistics #
node309 statistics # while [ 1 == 1 ]; do cat rx_bytes; sleep 1; done
873025622
1235062978
1572261463
2146257487
2700756931
3270058797
3843728761
113171325
665259289
1190130825
1555559865
1950490941
2499614765
3062263901
3640042389
4202300253
477729635
1046926733
^C


----

Patched kernel:

node310 statistics # pwd
/sys/devices/vif-2-0/net/vif2.0/statistics
node310 statistics # while [ 1 == 1 ]; do cat rx_bytes; sleep 1; done
38069388494
38632298582
39187057066
39749628321
40316842297
40875579821
41440446165
42005641261
42539933757
42991421265
43543245209
44104655317
44666040545
45218321305
45780709593
46346749753
46897399881
47393645401
47687655485
48069251037
48463066305
48866467737
49269999633
49695523708
50119597344
50541042376
50985332831
51446717847
51856736763
52234568347
52684797743
53250974835
53828014155
54406444595
54972895475
55548898027
55894007361
56408304870
56865463506
57435079678
57998185298
58577528786
59149894510
59715804286
60292539214
60861187894
61431075562
^C
(etc)

(increasing numbers, as expected)



-- 
Mart van Santen
Greenhost
E: mart@greenhost.nl
T: +31 20 4890444
W: https://greenhost.nl

A PGP signature can be attached to this e-mail,
you need PGP software to verify it. 
My public key is available in keyserver(s)
see: http://tinyurl.com/openpgp-manual

PGP Fingerprint: CA85 EB11 2B70 042D AF66  B29A 6437 01A1 10A3 D3A5

Comments

Juergen Gross Dec. 23, 2016, 3:38 p.m. | #1
On 23/12/16 15:24, Mart van Santen wrote:
> Hello Xen Devel,
> 
> We encountered an issue with one of our instances pushing a lot of
> (network) data, resulting in incorrect bandwidth reporting. We do poll
> interface traffic every 5 minutes for every interface of our instances
> (from Dom0) and store those values for reporting/graphing. At some point
> we discovered reports of machines pushing a lot of data, were incorrect
> and were going down in chucks of about 4 GB, while you expect this is an
> always increasing number (or wraps around a 32 or 64 bit counter)
> 
> After futher analysing the problem and the source code we discovered
> that the counters of the queues in the code are handled as 32-bit
> integers, while the total bytes tx/rx for the while interface are
> counted as 64-bit longs.
> 
> 
> Attached a patch to solve the problem.

...

> --- a/drivers/net/xen-netback/common.h	2016-12-22 15:41:07.785535748 +0000
> +++ b/drivers/net/xen-netback/common.h	2016-12-23 13:08:18.123080064 +0000
> @@ -119,10 +119,10 @@
>  	 * A subset of struct net_device_stats that contains only the
>  	 * fields that are updated in netback.c for each queue.
>  	 */
> -	unsigned int rx_bytes;
> -	unsigned int rx_packets;
> -	unsigned int tx_bytes;
> -	unsigned int tx_packets;
> +	unsigned long rx_bytes;
> +	unsigned long rx_packets;
> +	unsigned long tx_bytes;
> +	unsigned long tx_packets;

So on 32 bit kernels you'll still encounter the same problem.

Please use the type "u64" as (nearly) everywhere else.

You might want to read Documentation/SubmittingPatches in the Linux
kernel source tree.


Juergen

Patch

--- a/drivers/net/xen-netback/common.h	2016-12-22 15:41:07.785535748 +0000
+++ b/drivers/net/xen-netback/common.h	2016-12-23 13:08:18.123080064 +0000
@@ -119,10 +119,10 @@ 
 	 * A subset of struct net_device_stats that contains only the
 	 * fields that are updated in netback.c for each queue.
 	 */
-	unsigned int rx_bytes;
-	unsigned int rx_packets;
-	unsigned int tx_bytes;
-	unsigned int tx_packets;
+	unsigned long rx_bytes;
+	unsigned long rx_packets;
+	unsigned long tx_bytes;
+	unsigned long tx_packets;
 
 	/* Additional stats used by xenvif */
 	unsigned long rx_gso_checksum_fixup;