libceph: give priority to sockets of MDS clients

Message ID 20221025142731.22636-1-make.dirty.code@gmail.com
State New

Commit Message

Minjong Kim Oct. 25, 2022, 2:27 p.m. UTC
MDS requests are buffered or dropped when the client's network is saturated.
Alleviate this by giving priority to the sockets of MDS connections.

Signed-off-by: Minjong Kim <make.dirty.code@gmail.com>
---
Hello. I am very new to kernel development, so please bear with my clumsy
process. I'm not sure whether I should post this as a general question or
as a patch, or whether a comment like this belongs here, but I'll write a
few words.

I've found that once the client network saturates, MDS requests drop
significantly. To solve this, I added code to the kernel that tags MDS
sockets with IP_TOS.

However, there are some problems caused by my inadequacies.

First, is it okay to use higher-level functions like ip_setsockopt? This
function works fine, but I haven't seen any other kernel code use it. Do I
have to set fields such as skb->priority manually instead? I mostly work on
high-level code, so I'm unsure whether I may access these attributes
directly.
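
For reference, the tagging itself can be sketched in userspace, where the
same option is set through setsockopt(). This is only an illustration under
stated assumptions: set_socket_tos() and get_socket_tos() are hypothetical
helper names, not part of libceph. Recent kernels also appear to provide an
in-kernel helper, ip_sock_set_tos(), declared in <net/ip.h>, which may be
the more conventional call from kernel code; that should be verified
against the target tree.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: set the TOS byte on an IPv4 socket.
 * Returns 0 on success, -1 on failure with errno set.
 */
int set_socket_tos(int fd, int tos)
{
	return setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos));
}

/*
 * Hypothetical helper: read the TOS byte back from the socket.
 * Returns the TOS value, or -1 on failure with errno set.
 */
int get_socket_tos(int fd)
{
	int tos = 0;
	socklen_t len = sizeof(tos);

	if (getsockopt(fd, IPPROTO_IP, IP_TOS, &tos, &len) < 0)
		return -1;
	return tos;
}
```

A caller would create the socket, call set_socket_tos(fd, 0xb0) before
connecting, and could confirm the setting with get_socket_tos(fd).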

Second, IP_TOS seems to be a deprecated option. Traffic classes seem to be
managed through diffserv these days (which stays compatible with IP_TOS),
but I couldn't find a function to set the DSCP directly. In that case,
using a function like ip_setsockopt(..IP_TOS) seems to be a problem, but I
couldn't solve it on my own.
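
The relationship between the two fields can be sketched as follows: the
DSCP occupies the upper six bits of the old TOS byte, so a DSCP value is
the TOS byte shifted right by two. The helper names below are hypothetical
and exist only for illustration. Assured Forwarding code points follow
AFxy = 8x + 2y, so AF32 is DSCP 28, which is TOS byte 0x70, while the TOS
byte 0xb0 decodes to DSCP 44.

```c
#include <assert.h>
#include <stdint.h>

/* DSCP lives in the upper six bits of the former IPv4 TOS byte. */
uint8_t dscp_to_tos(uint8_t dscp)
{
	return (uint8_t)((dscp & 0x3f) << 2);
}

/* Drop the two low-order (ECN) bits to recover the DSCP. */
uint8_t tos_to_dscp(uint8_t tos)
{
	return (uint8_t)(tos >> 2);
}

/* Assured Forwarding code point AFxy: cls in 1..4, drop in 1..3. */
uint8_t af_dscp(unsigned int cls, unsigned int drop)
{
	return (uint8_t)(8 * cls + 2 * drop);
}
```

With these helpers, dscp_to_tos(af_dscp(3, 2)) gives the byte to pass to
IP_TOS for AF32, and tos_to_dscp() shows which code point a given TOS byte
actually selects.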

Third, the benchmarks I ran seem to depend on many variables of the
computing environment. I repeated them several times as carefully as I
could, but the results may still be skewed by my local environment.

Finally, this doesn't seem to be a perfect way to solve the problem. MDS
packets still seem to be buffered during bursts. Also, many distributions
these days seem to use fq_codel by default, which doesn't support diffserv.
But tagging IP_TOS doesn't seem to make anything worse (since the
filesystem's workload is very small). The next version of fq_codel, cake,
supports it, so there is a possibility that this will improve.

Thanks for reading this long post. Apart from the shortcomings in my code,
please forgive my mistakes in the kernel contribution process.

 net/ceph/messenger_v1.c | 14 ++++++++++++++
 net/ceph/messenger_v2.c | 13 +++++++++++++
 2 files changed, 27 insertions(+)

Patch

diff --git a/net/ceph/messenger_v1.c b/net/ceph/messenger_v1.c
index 3ddbde87e4d6..bab6ec4af82c 100644
--- a/net/ceph/messenger_v1.c
+++ b/net/ceph/messenger_v1.c
@@ -6,6 +6,7 @@ 
 #include <linux/net.h>
 #include <linux/socket.h>
 #include <net/sock.h>
+#include <net/ip.h>
 
 #include <linux/ceph/ceph_features.h>
 #include <linux/ceph/decode.h>
@@ -1423,6 +1424,19 @@  int ceph_con_v1_try_write(struct ceph_connection *con)
 			con->error_msg = "connect error";
 			goto out;
 		}
+
+		if (con->peer_name.type == CEPH_ENTITY_TYPE_MDS) {
+			__u8 tos_mds = 0xb0; /* DSCP 44; AF32 would be 0x70 */
+
+			ret = ip_setsockopt(con->sock->sk, SOL_IP, IP_TOS,
+			                    KERNEL_SOCKPTR(&tos_mds), 1);
+
+			if (ret) {
+				pr_err("ip_setsockopt failed: %d\n", ret);
+				con->error_msg = "connect error";
+				return ret;
+			}
+		}
 	}
 
 more:
diff --git a/net/ceph/messenger_v2.c b/net/ceph/messenger_v2.c
index cc8ff81a50b7..d87430f333c9 100644
--- a/net/ceph/messenger_v2.c
+++ b/net/ceph/messenger_v2.c
@@ -3180,6 +3180,19 @@  int ceph_con_v2_try_write(struct ceph_connection *con)
 			con->error_msg = "connect error";
 			return ret;
 		}
+
+		if (con->peer_name.type == CEPH_ENTITY_TYPE_MDS) {
+			__u8 tos_mds = 0xb0; /* DSCP 44; AF32 would be 0x70 */
+
+			ret = ip_setsockopt(con->sock->sk, SOL_IP, IP_TOS,
+			                    KERNEL_SOCKPTR(&tos_mds), 1);
+
+			if (ret) {
+				pr_err("ip_setsockopt failed: %d\n", ret);
+				con->error_msg = "connect error";
+				return ret;
+			}
+		}
 	}
 
 	if (!iov_iter_count(&con->v2.out_iter)) {