[v4] locking/rwbase: Mitigate indefinite writer starvation

Message ID 20230120140847.4pjqf3oinemokcyp@techsingularity.net
State New
Series [v4] locking/rwbase: Mitigate indefinite writer starvation

Commit Message

Mel Gorman Jan. 20, 2023, 2:08 p.m. UTC
rw_semaphore and rwlock are explicitly unfair to writers in the presence
of readers by design with a PREEMPT_RT configuration. Commit 943f0edb754f
("locking/rt: Add base code for RT rw_semaphore and rwlock") notes:

        The implementation is writer unfair, as it is not feasible to do
        priority inheritance on multiple readers, but experience has shown
        that real-time workloads are not the typical workloads which are
        sensitive to writer starvation.

While atypical, it's also trivial to block writers with PREEMPT_RT
indefinitely without ever making forward progress. Since LTP-20220121,
the dio_truncate test case went from having 1 reader to having 16 readers
and the number of readers is sufficient to prevent the down_write ever
succeeding while readers exist. Eventually the test is killed after 30
minutes as a failure.

dio_truncate is not a realtime application but indefinite writer starvation
is undesirable. The test case has one writer appending and truncating files
A and B while multiple readers read file A. The readers and writer are
contending for one file's inode lock which never succeeds as the readers
keep reading until the writer is done which never happens.

This patch records a timestamp when the first writer is blocked. DL /
RT tasks can continue to take the lock for read as long as readers exist
indefinitely. Other readers can acquire the read lock unless a writer
has been blocked for a minimum of 4ms. This is sufficient to allow the
dio_truncate test case to complete within the 30 minutes timeout.

[bigeasy@linutronix.de: Fix overflow, close race against reader, match rwsem
			timeouts, better rt_task handling, simplification]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/rwbase_rt.h  |  3 +++
 kernel/locking/rwbase_rt.c | 41 ++++++++++++++++++++++++++++++++++++++---
 2 files changed, 41 insertions(+), 3 deletions(-)
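
The reader-side check described in the commit message can be modelled in
userspace roughly as below. This is only a sketch: the model_* names, the
struct and the 4-tick constant (4ms at an assumed HZ=1000) are hypothetical
stand-ins, and the wait_lock locking of the real code is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; none of these names are kernel API. */
#define MODEL_WAIT_TIMEOUT 4UL	/* ticks; 4ms at an assumed HZ=1000 */

struct rwb_model {
	unsigned long waiter_timeout;	/* 0 means "no writer is blocked" */
};

/* Wraparound-safe "a is before b", in the spirit of time_before(). */
static bool model_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Writer side: arm the deadline. "| 1" keeps it nonzero if the sum wraps. */
static void model_writer_blocked(struct rwb_model *rwb, unsigned long now)
{
	rwb->waiter_timeout = (now + MODEL_WAIT_TIMEOUT) | 1;
}

/* Writer side: clear the deadline once the lock is taken or the wait ends. */
static void model_writer_done(struct rwb_model *rwb)
{
	rwb->waiter_timeout = 0;
}

/* Reader side: may this reader still use the reader-biased path? */
static bool model_allow_reader_bias(const struct rwb_model *rwb,
				    unsigned long now, bool is_rt_task)
{
	assert(rwb);

	/* No blocked writer, or a DL / RT reader: bias stays allowed. */
	if (!rwb->waiter_timeout || is_rt_task)
		return true;

	/* Other readers keep the bias only until the deadline expires. */
	return model_time_before(now, rwb->waiter_timeout);
}
```

In this model a non-RT reader arriving at or after the deadline is refused
the bias, while an RT reader is never refused, which mirrors the behaviour
the changelog describes.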

Comments

Mel Gorman Jan. 27, 2023, 11 a.m. UTC | #1
On Fri, Jan 20, 2023 at 02:08:47PM +0000, Mel Gorman wrote:
> rw_semaphore and rwlock are explicitly unfair to writers in the presence
> of readers by design with a PREEMPT_RT configuration. Commit 943f0edb754f
> ("locking/rt: Add base code for RT rw_semaphore and rwlock") notes:
> 
>         The implementation is writer unfair, as it is not feasible to do
>         priority inheritance on multiple readers, but experience has shown
>         that real-time workloads are not the typical workloads which are
>         sensitive to writer starvation.
> 
> While atypical, it's also trivial to block writers with PREEMPT_RT
> indefinitely without ever making forward progress. Since LTP-20220121,
> the dio_truncate test case went from having 1 reader to having 16 readers
> and the number of readers is sufficient to prevent the down_write ever
> succeeding while readers exist. Eventually the test is killed after 30
> minutes as a failure.
> 
> dio_truncate is not a realtime application but indefinite writer starvation
> is undesirable. The test case has one writer appending and truncating files
> A and B while multiple readers read file A. The readers and writer are
> contending for one file's inode lock which never succeeds as the readers
> keep reading until the writer is done which never happens.
> 
> This patch records a timestamp when the first writer is blocked. DL /
> RT tasks can continue to take the lock for read as long as readers exist
> indefinitely. Other readers can acquire the read lock unless a writer
> has been blocked for a minimum of 4ms. This is sufficient to allow the
> dio_truncate test case to complete within the 30 minutes timeout.
> 
> [bigeasy@linutronix.de: Fix overflow, close race against reader, match rwsem
> 			timeouts, better rt_task handling, simplification]
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Yay/nay?

Thomas Gleixner Feb. 6, 2023, 2:30 p.m. UTC | #2
Mel!

On Fri, Jan 20 2023 at 14:08, Mel Gorman wrote:
> dio_truncate is not a realtime application but indefinite writer starvation
> is undesirable. The test case has one writer appending and truncating files
> A and B while multiple readers read file A. The readers and writer are
> contending for one file's inode lock which never succeeds as the readers
> keep reading until the writer is done which never happens.
>
> This patch records a timestamp when the first writer is blocked. DL /

git grep 'This patch' Documentation/process/

> RT tasks can continue to take the lock for read as long as readers exist
> indefinitely. Other readers can acquire the read lock unless a writer
> has been blocked for a minimum of 4ms. This is sufficient to allow the
> dio_truncate test case to complete within the 30 minutes timeout.

I'm not opposed to this, but what's the actual reason for this pulled
out of thin air timeout?

What's the downside of actually forcing !RT readers into the slowpath
once there is a writer waiting?

Thanks,

        tglx

Mel Gorman Feb. 8, 2023, 8:19 p.m. UTC | #3
On Mon, Feb 06, 2023 at 03:30:35PM +0100, Thomas Gleixner wrote:
> Mel!

Hi :)

I'm not really online for the next several weeks so further responses may
take ages. It's a coincidence that I'm online at the moment for an unrelated
matter and glancing through mail.

> 
> On Fri, Jan 20 2023 at 14:08, Mel Gorman wrote:
> > dio_truncate is not a realtime application but indefinite writer starvation
> > is undesirable. The test case has one writer appending and truncating files
> > A and B while multiple readers read file A. The readers and writer are
> > contending for one file's inode lock which never succeeds as the readers
> > keep reading until the writer is done which never happens.
> >
> > This patch records a timestamp when the first writer is blocked. DL /
> 
> git grep 'This patch' Documentation/process/
> 

I'm aware of the rule but tend to forget at times as enforcement varies
between subsystems. First sentence of the paragraph becomes:

Record a timestamp when the first writer is blocked and force all new
readers into the slow path upon expiration.

> > RT tasks can continue to take the lock for read as long as readers exist
> > indefinitely. Other readers can acquire the read lock unless a writer
> > has been blocked for a minimum of 4ms. This is sufficient to allow the
> > dio_truncate test case to complete within the 30 minutes timeout.
> 
> I'm not opposed to this, but what's the actual reason for this pulled
> out of thin air timeout?
> 

No good reason, a value had to be picked. It happens to match the rwsem
cutoff for optimistic spinning. That at least is some threshold for "a
lock failed to be acquired within a reasonable time period". It's also
arbitrary that it happened to be a value that allowed the dio_truncate
LTP test to complete in a reasonable time.
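
For a rough feel of what the value means in practice, the arithmetic behind
DIV_ROUND_UP(HZ, 250) can be worked through in userspace. This is a sketch
only: timeout_jiffies() and timeout_usecs() are hypothetical helper names,
and the tick rates mentioned afterwards are just common examples.

```c
#include <assert.h>

/* Same rounding as the kernel's DIV_ROUND_UP(), redefined for userspace. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Timeout in jiffies for a given tick rate: DIV_ROUND_UP(HZ, 250). */
static unsigned long timeout_jiffies(unsigned long hz)
{
	return DIV_ROUND_UP(hz, 250UL);
}

/* The same timeout in microseconds (truncated). */
static unsigned long timeout_usecs(unsigned long hz)
{
	assert(hz > 0);
	return timeout_jiffies(hz) * 1000000UL / hz;
}
```

With these helpers, HZ=1000 gives 4 jiffies (4ms), HZ=300 gives 2 jiffies
(about 6.7ms), HZ=250 gives 1 jiffy (4ms) and HZ=100 gives 1 jiffy (10ms),
hence "a minimum of 4ms or 1 tick".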

> What's the downside of actually forcing !RT readers into the slowpath
> once there is a writer waiting?
> 

I actually don't know for sure because it's application dependent but at
minimum, I believe it would be a deviation from how generic rwsems behave
where a writer optimistically spins for the same duration before forcing
the handoff. Whether that matters or not depends on the application,
the ratio between readers/writers and the number of concurrent readers.

Sebastian Andrzej Siewior Feb. 15, 2023, 4:02 p.m. UTC | #4
On 2023-02-06 15:30:35 [+0100], Thomas Gleixner wrote:
> What's the downside of actually forcing !RT readers into the slowpath
> once there is a writer waiting?

We always said that there are no RT users of rwsem. Therefore it
shouldn't matter because we still assume that nothing depends on this.
After all, we once had a single-reader implementation of rwsem and this is
the first report (to my knowledge) of a fallout since it was changed to
multi-reader.

That said let me update Mel's patch and resend it without this bit.
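
As a purely illustrative userspace model of the alternative Thomas asked
about (forcing !RT readers into the slowpath as soon as a writer waits), the
reader-side decision might reduce to something like the following. The
struct and function names are hypothetical and this is not the resent patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; not kernel API and not the resent patch. */
struct rwb_flag_model {
	bool writer_waiting;	/* set while a writer is blocked */
};

/*
 * Without a timeout, a non-RT reader loses the reader bias as soon as a
 * writer is waiting. DL / RT readers keep it, as in the question above.
 */
static bool flag_allow_reader_bias(const struct rwb_flag_model *rwb,
				   bool is_rt_task)
{
	assert(rwb);
	return !rwb->writer_waiting || is_rt_task;
}
```

Compared with the timestamp approach, this model needs no jiffies
comparison and no overflow handling, at the cost of dropping the grace
period for non-RT readers.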

> Thanks,
> 
>         tglx

Sebastian
Patch

diff --git a/include/linux/rwbase_rt.h b/include/linux/rwbase_rt.h
index 1d264dd08625..b969b1d9bb85 100644
--- a/include/linux/rwbase_rt.h
+++ b/include/linux/rwbase_rt.h
@@ -10,12 +10,14 @@ 
 
 struct rwbase_rt {
 	atomic_t		readers;
+	unsigned long		waiter_timeout;
 	struct rt_mutex_base	rtmutex;
 };
 
 #define __RWBASE_INITIALIZER(name)				\
 {								\
 	.readers = ATOMIC_INIT(READER_BIAS),			\
+	.waiter_timeout = 0,					\
 	.rtmutex = __RT_MUTEX_BASE_INITIALIZER(name.rtmutex),	\
 }
 
@@ -23,6 +25,7 @@  struct rwbase_rt {
 	do {							\
 		rt_mutex_base_init(&(rwbase)->rtmutex);		\
 		atomic_set(&(rwbase)->readers, READER_BIAS);	\
+		(rwbase)->waiter_timeout = 0;			\
 	} while (0)
 
 
diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
index c201aadb9301..24fafe16e008 100644
--- a/kernel/locking/rwbase_rt.c
+++ b/kernel/locking/rwbase_rt.c
@@ -39,7 +39,10 @@ 
  * major surgery for a very dubious value.
  *
  * The risk of writer starvation is there, but the pathological use cases
- * which trigger it are not necessarily the typical RT workloads.
+ * which trigger it are not necessarily the typical RT workloads. SCHED_OTHER
+ * reader acquisitions will be forced into the slow path if a writer is
+ * blocked for more than RWBASE_RT_WAIT_TIMEOUT jiffies. New DL / RT readers
+ * can still starve a writer indefinitely.
  *
  * Fast-path orderings:
  * The lock/unlock of readers can run in fast paths: lock and unlock are only
@@ -65,6 +68,27 @@  static __always_inline int rwbase_read_trylock(struct rwbase_rt *rwb)
 	return 0;
 }
 
+/*
+ * Allow reader bias for SCHED_OTHER tasks with a pending writer for a
+ * minimum of 4ms or 1 tick. This matches RWSEM_WAIT_TIMEOUT for the
+ * generic RWSEM implementation.
+ */
+#define RWBASE_RT_WAIT_TIMEOUT DIV_ROUND_UP(HZ, 250)
+
+static bool __sched rwbase_allow_reader_bias(struct rwbase_rt *rwb)
+{
+	/*
+	 * Allow reader bias if no writer is blocked or for DL / RT tasks.
+	 * Such tasks should be designed to avoid heavy writer contention
+	 * or indefinite starvation.
+	 */
+	if (!rwb->waiter_timeout || rt_task(current))
+		return true;
+
+	/* Allow reader bias unless a writer timeout has expired. */
+	return time_before(jiffies, rwb->waiter_timeout);
+}
+
 static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
 				      unsigned int state)
 {
@@ -74,9 +98,11 @@  static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
 	raw_spin_lock_irq(&rtm->wait_lock);
 	/*
 	 * Allow readers, as long as the writer has not completely
-	 * acquired the semaphore for write.
+	 * acquired the semaphore for write and reader bias is still
+	 * allowed.
 	 */
-	if (atomic_read(&rwb->readers) != WRITER_BIAS) {
+	if (atomic_read(&rwb->readers) != WRITER_BIAS &&
+	    rwbase_allow_reader_bias(rwb)) {
 		atomic_inc(&rwb->readers);
 		raw_spin_unlock_irq(&rtm->wait_lock);
 		return 0;
@@ -255,6 +281,7 @@  static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
 	for (;;) {
 		/* Optimized out for rwlocks */
 		if (rwbase_signal_pending_state(state, current)) {
+			rwb->waiter_timeout = 0;
 			rwbase_restore_current_state();
 			__rwbase_write_unlock(rwb, 0, flags);
 			trace_contention_end(rwb, -EINTR);
@@ -264,12 +291,20 @@  static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
 		if (__rwbase_write_trylock(rwb))
 			break;
 
+		/*
+		 * Record timeout when reader bias is ignored. Ensure timeout
+		 * is at least 1 in case of overflow.
+		 */
+		rwb->waiter_timeout = (jiffies + RWBASE_RT_WAIT_TIMEOUT) | 1;
+
 		raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
 		rwbase_schedule();
 		raw_spin_lock_irqsave(&rtm->wait_lock, flags);
 
 		set_current_state(state);
 	}
+
+	rwb->waiter_timeout = 0;
 	rwbase_restore_current_state();
 	trace_contention_end(rwb, 0);