[RFC] locking/rwbase: Prevent indefinite writer starvation

Message ID 20230106142743.30759-1-mgorman@techsingularity.net
State New
Series [RFC] locking/rwbase: Prevent indefinite writer starvation

Commit Message

Mel Gorman Jan. 6, 2023, 2:27 p.m. UTC
rw_semaphore and rwlock are explicitly unfair to writers in the presence
of readers by design with a PREEMPT_RT configuration. Commit 943f0edb754f
("locking/rt: Add base code for RT rw_semaphore and rwlock") notes:

	The implementation is writer unfair, as it is not feasible to do
	priority inheritance on multiple readers, but experience has shown
	that real-time workloads are not the typical workloads which are
	sensitive to writer starvation.

While atypical, it's also trivial to block writers with PREEMPT_RT
indefinitely without ever making forward progress. Since LTP-20220121,
the dio_truncate test case went from having 1 reader to having 16 readers
and the number of readers is sufficient to prevent the down_write ever
succeeding while readers exist. Ultimately the test is killed after 30
minutes as a failure.

dio_truncate is not a realtime application but indefinite writer starvation
is undesirable. The test case has one writer appending and truncating files
A and B while multiple readers read file A. The readers and writer contend
for one file's inode lock; the writer never acquires it because the readers
keep reading until the writer finishes, and the writer never finishes.

This patch records a timestamp when the first writer is blocked. Reader
bias is allowed until the first writer has been blocked for a minimum of
4ms and a maximum of (4ms + 1 jiffy). The cutoff time is arbitrary, on
the assumption that a hard realtime application missing a 4ms deadline
would not need PREEMPT_RT. It's expected that hard realtime applications
avoid such heavy reader/writer contention by design. On a test machine,
the test completed in 92 seconds.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/rwbase_rt.h  |  3 +++
 kernel/locking/rwbase_rt.c | 12 +++++++++++-
 2 files changed, 14 insertions(+), 1 deletion(-)
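
For context on the figures above: the cutoff is expressed in jiffies, so the
effective window depends on CONFIG_HZ. The arithmetic below is only an
illustration of how the proposed RW_CONTENTION_THRESHOLD macro (see the patch
at the bottom of this page) evaluates for common HZ values; it is not
additional patch content.

/*
 * Illustration only: RW_CONTENTION_THRESHOLD for common CONFIG_HZ values.
 *
 *   HZ=1000: 1000/250 + 1 = 5 ticks, cutoff between 4ms and 5ms
 *   HZ=250:   250/250 + 1 = 2 ticks, cutoff between 4ms and 8ms
 *   HZ=100:   100/250 + 1 = 1 tick,  cutoff between 0ms and 10ms
 *
 * The blocked timestamp is recorded at jiffies granularity, so the real
 * time a writer waits before reader bias is cut off lies somewhere
 * between (threshold - 1) and threshold ticks.
 */
#define RW_CONTENTION_THRESHOLD (HZ/250+1)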

Comments

Peter Zijlstra Jan. 9, 2023, 3:23 p.m. UTC | #1
On Fri, Jan 06, 2023 at 02:27:43PM +0000, Mel Gorman wrote:
> rw_semaphore and rwlock are explicitly unfair to writers in the presence
> of readers by design with a PREEMPT_RT configuration. Commit 943f0edb754f
> ("locking/rt: Add base code for RT rw_semaphore and rwlock") notes:
> 
> 	The implementation is writer unfair, as it is not feasible to do
> 	priority inheritance on multiple readers, but experience has shown
> 	that real-time workloads are not the typical workloads which are
> 	sensitive to writer starvation.
> 
> While atypical, it's also trivial to block writers with PREEMPT_RT
> indefinitely without ever making forward progress. Since LTP-20220121,
> the dio_truncate test case went from having 1 reader to having 16 readers
> and the number of readers is sufficient to prevent the down_write ever
> succeeding while readers exist. Ultimately the test is killed after 30
> minutes as a failure.
> 
> dio_truncate is not a realtime application but indefinite writer starvation
> is undesirable. The test case has one writer appending and truncating files
> A and B while multiple readers read file A. The readers and writer contend
> for one file's inode lock; the writer never acquires it because the readers
> keep reading until the writer finishes, and the writer never finishes.
> 
> This patch records a timestamp when the first writer is blocked. Reader
> bias is allowed until the first writer has been blocked for a minimum of
> 4ms and a maximum of (4ms + 1 jiffy). The cutoff time is arbitrary, on
> the assumption that a hard realtime application missing a 4ms deadline
> would not need PREEMPT_RT. It's expected that hard realtime applications
> avoid such heavy reader/writer contention by design. On a test machine,
> the test completed in 92 seconds.

>  static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
>  				      unsigned int state)
>  {
> @@ -76,7 +79,8 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
>  	 * Allow readers, as long as the writer has not completely
>  	 * acquired the semaphore for write.
>  	 */
> -	if (atomic_read(&rwb->readers) != WRITER_BIAS) {
> +	if (atomic_read(&rwb->readers) != WRITER_BIAS &&
> +	    jiffies - rwb->waiter_blocked < RW_CONTENTION_THRESHOLD) {
>  		atomic_inc(&rwb->readers);
>  		raw_spin_unlock_irq(&rtm->wait_lock);
>  		return 0;

Blergh.

So a number of comments:

 - this deserves a giant comment, not only an obscure extra condition.

 - this would be better if it were limited to only have effect
   when there are no RT/DL tasks involved.

This made me re-read the phase-fair rwlock paper and again note that RW
semaphore (eg blocking) variant was delayed to future work and AFAICT
this future hasn't happened yet :/

AFAICT it would still require boosting the readers (something tglx still
has nightmares of) and limiting reader concurrency, another thing that
hurts.
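
For readers unfamiliar with the design being referenced: a phase-fair lock
alternates read phases and write phases, so a blocked writer waits for at most
one read phase, while readers arriving during a write phase are admitted as
soon as that phase ends. A minimal user-space sketch of the spin-based ticket
variant (PF-T) along the lines of the Brandenburg/Anderson papers could look
like the following; the names are made up for illustration, this is not kernel
code, and the blocking rw_semaphore variant mentioned above is exactly what it
does not cover.

#include <stdatomic.h>

/* Reader tickets live in the upper bits, writer state in the low byte. */
#define PF_RINC		0x100u	/* reader ticket increment */
#define PF_WBITS	0x3u	/* writer present + phase bits */
#define PF_PRES		0x2u	/* a writer is present */
#define PF_PHID		0x1u	/* writer phase ID, alternates per writer */

struct pf_rwlock {
	atomic_uint rin, rout;	/* reader tickets issued / released */
	atomic_uint win, wout;	/* writer tickets issued / released */
};

static void pf_read_lock(struct pf_rwlock *l)
{
	unsigned int w = atomic_fetch_add(&l->rin, PF_RINC) & PF_WBITS;

	/* Spin only if a writer was present on arrival, and only until
	 * that writer's phase ends (the low bits of rin change). */
	while (w && (atomic_load(&l->rin) & PF_WBITS) == w)
		;
}

static void pf_read_unlock(struct pf_rwlock *l)
{
	atomic_fetch_add(&l->rout, PF_RINC);
}

static void pf_write_lock(struct pf_rwlock *l)
{
	unsigned int ticket, readers;

	/* Writers are FIFO among themselves. */
	ticket = atomic_fetch_add(&l->win, 1);
	while (atomic_load(&l->wout) != ticket)
		;

	/* Announce the write phase to readers, then wait for the readers
	 * that arrived before the announcement to drain. */
	readers = atomic_fetch_add(&l->rin, PF_PRES | (ticket & PF_PHID));
	while (atomic_load(&l->rout) != readers)
		;
}

static void pf_write_unlock(struct pf_rwlock *l)
{
	/* End the write phase: release the readers that arrived during it,
	 * then let the next queued writer (if any) start a new phase. */
	atomic_fetch_and(&l->rin, ~PF_WBITS);
	atomic_fetch_add(&l->wout, 1);
}

Making that sleep instead of spin, and lending the waiting writer's priority to
the current readers, is where the reader boosting and limited reader
concurrency mentioned above come in.
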
Davidlohr Bueso Jan. 9, 2023, 4:31 p.m. UTC | #2
On Mon, 09 Jan 2023, Peter Zijlstra wrote:

>On Fri, Jan 06, 2023 at 02:27:43PM +0000, Mel Gorman wrote:
>> rw_semaphore and rwlock are explicitly unfair to writers in the presence
>> of readers by design with a PREEMPT_RT configuration. Commit 943f0edb754f
>> ("locking/rt: Add base code for RT rw_semaphore and rwlock") notes:
>>
>>	The implementation is writer unfair, as it is not feasible to do
>>	priority inheritance on multiple readers, but experience has shown
>>	that real-time workloads are not the typical workloads which are
>>	sensitive to writer starvation.
>>
>> While atypical, it's also trivial to block writers with PREEMPT_RT
>> indefinitely without ever making forward progress. Since LTP-20220121,
>> the dio_truncate test case went from having 1 reader to having 16 readers
>> and the number of readers is sufficient to prevent the down_write ever
>> succeeding while readers exist. Ultimately the test is killed after 30
>> minutes as a failure.
>>
>> dio_truncate is not a realtime application but indefinite writer starvation
>> is undesirable. The test case has one writer appending and truncating files
>> A and B while multiple readers read file A. The readers and writer contend
>> for one file's inode lock; the writer never acquires it because the readers
>> keep reading until the writer finishes, and the writer never finishes.
>>
>> This patch records a timestamp when the first writer is blocked. Reader
>> bias is allowed until the first writer has been blocked for a minimum of
>> 4ms and a maximum of (4ms + 1 jiffy). The cutoff time is arbitrary, on
>> the assumption that a hard realtime application missing a 4ms deadline
>> would not need PREEMPT_RT. It's expected that hard realtime applications
>> avoid such heavy reader/writer contention by design. On a test machine,
>> the test completed in 92 seconds.
>
>>  static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
>>				      unsigned int state)
>>  {
>> @@ -76,7 +79,8 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
>>	 * Allow readers, as long as the writer has not completely
>>	 * acquired the semaphore for write.
>>	 */
>> -	if (atomic_read(&rwb->readers) != WRITER_BIAS) {
>> +	if (atomic_read(&rwb->readers) != WRITER_BIAS &&
>> +	    jiffies - rwb->waiter_blocked < RW_CONTENTION_THRESHOLD) {
>>		atomic_inc(&rwb->readers);
>>		raw_spin_unlock_irq(&rtm->wait_lock);
>>		return 0;
>
>Blergh.
>
>So a number of comments:
>
> - this deserves a giant comment, not only an obscure extra condition.
>
> - this would be better if it were limited to only have effect
>   when there are no RT/DL tasks involved.

Agreed.

(Sorry for hijacking this thread, also more Cc)

Hmm, this reminds me of the epoll rwlock situation[1, 2], which does the lockless
ready event list updates from irq callback context and hits the writer-unfair
scenario, which was really designed for tasklist_lock. Converting the read_lock
to RCU looks like a no-go because this is not a read-mostly pattern, far from
it actually. And in fact the read path is not at all a read path (ie: simply
traversing the list(s)). We also probably hit the "unfair is good for throughput"
condition mentioned by Linus, as these are spinning locks and thus the critical
region is too short to really benefit from actual concurrent readers.

So while the numbers in a218cc491420 ("epoll: use rwlock in order to reduce
ep_poll_callback() contention") are very nice, based on the above and the fact
that per the changelog it misassumes the fairness, I would vote for removing the
lockless stuff and returning to simply using a spinlock (epoll is wacky enough
already). It is ultimately less burden on the kernel, and I suspect that people
who really care about epoll performance will mostly be looking at io_uring.

Thanks,
Davidlohr

[1] https://lore.kernel.org/all/20210825132754.GA895675@lothringen/
[2] https://lore.kernel.org/all/20220617091039.2257083-1-eric.dumazet@gmail.com/

>
>This made me re-read the phase-fair rwlock paper and again note that RW
>semaphore (eg blocking) variant was delayed to future work and AFAICT
>this future hasn't happened yet :/
>
>AFAICT it would still require boosting the readers (something tglx still
>has nightmares of) and limiting reader concurrency, another thing that
>hurts.
>
>
Mel Gorman Jan. 10, 2023, 12:04 p.m. UTC | #3
On Mon, Jan 09, 2023 at 04:23:56PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 06, 2023 at 02:27:43PM +0000, Mel Gorman wrote:
> > rw_semaphore and rwlock are explicitly unfair to writers in the presence
> > of readers by design with a PREEMPT_RT configuration. Commit 943f0edb754f
> > ("locking/rt: Add base code for RT rw_semaphore and rwlock") notes:
> > 
> > 	The implementation is writer unfair, as it is not feasible to do
> > 	priority inheritance on multiple readers, but experience has shown
> > 	that real-time workloads are not the typical workloads which are
> > 	sensitive to writer starvation.
> > 
> > While atypical, it's also trivial to block writers with PREEMPT_RT
> > indefinitely without ever making forward progress. Since LTP-20220121,
> > the dio_truncate test case went from having 1 reader to having 16 readers
> > and the number of readers is sufficient to prevent the down_write ever
> > succeeding while readers exist. Ultimately the test is killed after 30
> > minutes as a failure.
> > 
> > dio_truncate is not a realtime application but indefinite writer starvation
> > is undesirable. The test case has one writer appending and truncating files
> > A and B while multiple readers read file A. The readers and writer contend
> > for one file's inode lock; the writer never acquires it because the readers
> > keep reading until the writer finishes, and the writer never finishes.
> > 
> > This patch records a timestamp when the first writer is blocked. Reader
> > bias is allowed until the first writer has been blocked for a minimum of
> > 4ms and a maximum of (4ms + 1 jiffy). The cutoff time is arbitrary, on
> > the assumption that a hard realtime application missing a 4ms deadline
> > would not need PREEMPT_RT. It's expected that hard realtime applications
> > avoid such heavy reader/writer contention by design. On a test machine,
> > the test completed in 92 seconds.
> 
> >  static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
> >  				      unsigned int state)
> >  {
> > @@ -76,7 +79,8 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
> >  	 * Allow readers, as long as the writer has not completely
> >  	 * acquired the semaphore for write.
> >  	 */
> > -	if (atomic_read(&rwb->readers) != WRITER_BIAS) {
> > +	if (atomic_read(&rwb->readers) != WRITER_BIAS &&
> > +	    jiffies - rwb->waiter_blocked < RW_CONTENTION_THRESHOLD) {
> >  		atomic_inc(&rwb->readers);
> >  		raw_spin_unlock_irq(&rtm->wait_lock);
> >  		return 0;
> 
> Blergh.
> 

I'm not surprised at the reaction.

> So a number of comments:
> 
>  - this deserves a giant comment, not only an obscure extra condition.
> 

Easy enough.

>  - this would be better if it were limited to only have effect
>    when there are no RT/DL tasks involved.
> 

That's a little more involved but as a bonus, it gets lots of comments
at the cost of a loss of waiter_blocked granularity.

> This made me re-read the phase-fair rwlock paper and again note that RW
> semaphore (eg blocking) variant was delayed to future work and AFAICT
> this future hasn't happened yet :/
> 

Story of our lives. Is https://www.cs.unc.edu/~anderson/papers/ecrts09b.pdf
the right paper to start with? (just a quick search, haven't actually
read it)

> AFAICT it would still require boosting the readers (something tglx still
> has nightmares of) and limiting reader concurrency, another thing that
> hurts.
> 

I guess proxy execution would be part of that puzzle but, even though I'm not
overly familiar with the problem, the more I think about it, the less fun
it gets. Even if readers were boosted, all of the readers have to complete
before the writer can go ahead, so proxy execution would only help if the
writer had higher priority. I'm no expert but it occurs to me that blocking
readers for writes or proxy execution would have to assume that writers are
inherently higher priority than readers regardless of their actual priority.
I guess that's mostly true but maybe there is a counter-example.

Take mmap_sem as an example. A design stupid enough to mix critical threads
with low-priority threads that require malloc() within the same address
space would have a problem assuming the writer is higher priority. Then
again, the critical phase faulting memory and needing mmap_sem for read is
also broken behaviour. Same with inode semaphores: critical tasks doing
IO at all, or IO to the same file, is madness. I cannot think of a case
offhand where the writer is inherently less important than the reader *and*
contending on the semaphore is reasonable behaviour for an RT task.

This is not tested or even booted but it tracks whether a dl/rt task has taken
the lock for read since the last write unlock. The major downside is additional
overhead in the read fast path the first time an rt/dl task acquires the lock
for read.
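
As a standalone illustration of the encoding ->waiter_blocked uses in the diff
below: bit 0 records that a dl/rt task took the lock for read since the last
write unlock, and the remaining bits carry the jiffies timestamp of the first
blocked writer. The helpers could be sketched as follows; the names are
illustrative only and simply mirror the diff.

/* Set the dl/rt flag; the +1 also keeps at least one tick of delay. */
static inline unsigned long mark_dlrt_reader(unsigned long wb)
{
	return (wb + 1) | 1UL;
}

/* Record when the writer blocked, preserving the dl/rt flag in bit 0. */
static inline unsigned long mark_writer_blocked(unsigned long wb,
						unsigned long now)
{
	return wb | (now & ~1UL);
}

/* Reader bias stays on for a recent dl/rt reader, or while the oldest
 * blocked writer has waited for less than the threshold. */
static inline bool reader_bias_allowed(unsigned long wb, unsigned long now,
				       unsigned long threshold)
{
	return (wb & 1UL) || (now - wb < threshold);
}

Reusing bit 0 of the timestamp is where the loss of waiter_blocked granularity
mentioned above comes from.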

diff --git a/include/linux/rwbase_rt.h b/include/linux/rwbase_rt.h
index 1d264dd08625..05c4dc74b8bd 100644
--- a/include/linux/rwbase_rt.h
+++ b/include/linux/rwbase_rt.h
@@ -10,12 +10,14 @@
 
 struct rwbase_rt {
 	atomic_t		readers;
+	unsigned long		waiter_blocked;
 	struct rt_mutex_base	rtmutex;
 };
 
 #define __RWBASE_INITIALIZER(name)				\
 {								\
 	.readers = ATOMIC_INIT(READER_BIAS),			\
+	.waiter_blocked = 0,					\
 	.rtmutex = __RT_MUTEX_BASE_INITIALIZER(name.rtmutex),	\
 }
 
@@ -23,6 +25,7 @@ struct rwbase_rt {
 	do {							\
 		rt_mutex_base_init(&(rwbase)->rtmutex);		\
 		atomic_set(&(rwbase)->readers, READER_BIAS);	\
+		(rwbase)->waiter_blocked = 0;			\
 	} while (0)
 
 
diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
index c201aadb9301..a732cbdd5a62 100644
--- a/kernel/locking/rwbase_rt.c
+++ b/kernel/locking/rwbase_rt.c
@@ -39,7 +39,11 @@
  * major surgery for a very dubious value.
  *
  * The risk of writer starvation is there, but the pathological use cases
- * which trigger it are not necessarily the typical RT workloads.
+ * which trigger it are not necessarily the typical RT workloads. The worst
+ * case of indefinite starvation of a writer will force readers into the
+ * slow path if a writer is blocked for more than RW_CONTENTION_THRESHOLD
+ * jiffies unless dl/rt tasks have taken a read lock since the last write
+ * unlock.
  *
  * Fast-path orderings:
  * The lock/unlock of readers can run in fast paths: lock and unlock are only
@@ -65,6 +69,67 @@ static __always_inline int rwbase_read_trylock(struct rwbase_rt *rwb)
 	return 0;
 }
 
+/*
+ * Allow reader bias with a pending writer for a minimum of 4ms or 1 tick.
+ * The granularity is not exact as the lowest bit in rwbase_rt->waiter_blocked
+ * is used to detect recent rt/dl tasks taking a read lock.
+ */
+#define RW_CONTENTION_THRESHOLD (HZ/250+1)
+
+static void __sched update_dlrt_reader(struct rwbase_rt *rwb)
+{
+
+	/* No update required if dl/rt tasks already identified. */
+	if (rwb->waiter_blocked & 1)
+		return;
+
+	/*
+	 * Record a dl/rt task acquiring the lock for read. This may result
+	 * in indefinite writer starvation but dl/rt tasks should avoid such
+	 * behaviour.
+	 */
+	if (dl_task(current) || rt_task(current)) {
+		struct rt_mutex_base *rtm;
+		unsigned long flags;
+
+		rtm = &rwb->rtmutex;
+		raw_spin_lock_irqsave(&rtm->wait_lock, flags);
+
+		/* Guarantees at least 1 tick of a delay. */
+		rwb->waiter_blocked = (rwb->waiter_blocked + 1) | 1;
+
+		raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
+	}
+}
+
+/* rtmutex->wait_lock must be held. */
+static void __sched set_writer_blocked(struct rwbase_rt *rwb)
+{
+	/*
+	 * Lowest bit preserved to identify recent rt/dl tasks acquiring
+	 * the lock for read.
+	 */
+	rwb->waiter_blocked |= jiffies & ~1UL;
+}
+
+static bool __sched rwbase_allow_reader_bias(struct rwbase_rt *rwb)
+{
+	/*
+	 * Allow reader bias if a dl or rt task took the lock for read
+	 * since the last write unlock. Such tasks should be designed
+	 * to avoid heavy writer contention or indefinite starvation.
+	 */
+	if (rwb->waiter_blocked & 1)
+		return true;
+
+	/*
+	 * Allow reader bias unless a writer has been blocked for more
+	 * than RW_CONTENTION_THRESHOLD jiffies. This may break priority
+	 * inheritance for readers but avoid indefinite writer starvation.
+	 */
+	return jiffies - rwb->waiter_blocked < RW_CONTENTION_THRESHOLD;
+}
+
 static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
 				      unsigned int state)
 {
@@ -74,9 +139,11 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
 	raw_spin_lock_irq(&rtm->wait_lock);
 	/*
 	 * Allow readers, as long as the writer has not completely
-	 * acquired the semaphore for write.
+	 * acquired the semaphore for write and reader bias is still
+	 * allowed.
 	 */
-	if (atomic_read(&rwb->readers) != WRITER_BIAS) {
+	if (atomic_read(&rwb->readers) != WRITER_BIAS &&
+	    rwbase_allow_reader_bias(rwb)) {
 		atomic_inc(&rwb->readers);
 		raw_spin_unlock_irq(&rtm->wait_lock);
 		return 0;
@@ -140,10 +207,18 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
 static __always_inline int rwbase_read_lock(struct rwbase_rt *rwb,
 					    unsigned int state)
 {
+	int ret;
+
 	if (rwbase_read_trylock(rwb))
-		return 0;
+		ret = 0;
+	else
+		ret = __rwbase_read_lock(rwb, state);
 
-	return __rwbase_read_lock(rwb, state);
+	/* Record if the current task acquiring the lock is a dl/rt task. */
+	if (!ret)
+		update_dlrt_reader(rwb);
+
+	return ret;
 }
 
 static void __sched __rwbase_read_unlock(struct rwbase_rt *rwb,
@@ -264,12 +339,17 @@ static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
 		if (__rwbase_write_trylock(rwb))
 			break;
 
+		/* Record first new read/write contention. */
+		set_writer_blocked(rwb);
+
 		raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
 		rwbase_schedule();
 		raw_spin_lock_irqsave(&rtm->wait_lock, flags);
 
 		set_current_state(state);
 	}
+
+	rwb->waiter_blocked = 0;
 	rwbase_restore_current_state();
 	trace_contention_end(rwb, 0);
Peter Zijlstra Jan. 10, 2023, 12:16 p.m. UTC | #4
On Tue, Jan 10, 2023 at 12:04:47PM +0000, Mel Gorman wrote:

> Story of our lives. Is https://www.cs.unc.edu/~anderson/papers/ecrts09b.pdf
> the right paper to start with? (just a quick search, haven't actually
> read it)

The paper I (re)read was: https://cs.unc.edu/~anderson/papers/rtsj10-for-web.pdf

Same people though, so might be closely related.

Patch

diff --git a/include/linux/rwbase_rt.h b/include/linux/rwbase_rt.h
index 1d264dd08625..05c4dc74b8bd 100644
--- a/include/linux/rwbase_rt.h
+++ b/include/linux/rwbase_rt.h
@@ -10,12 +10,14 @@ 
 
 struct rwbase_rt {
 	atomic_t		readers;
+	unsigned long		waiter_blocked;
 	struct rt_mutex_base	rtmutex;
 };
 
 #define __RWBASE_INITIALIZER(name)				\
 {								\
 	.readers = ATOMIC_INIT(READER_BIAS),			\
+	.waiter_blocked = 0,					\
 	.rtmutex = __RT_MUTEX_BASE_INITIALIZER(name.rtmutex),	\
 }
 
@@ -23,6 +25,7 @@  struct rwbase_rt {
 	do {							\
 		rt_mutex_base_init(&(rwbase)->rtmutex);		\
 		atomic_set(&(rwbase)->readers, READER_BIAS);	\
+		(rwbase)->waiter_blocked = 0;			\
 	} while (0)
 
 
diff --git a/kernel/locking/rwbase_rt.c b/kernel/locking/rwbase_rt.c
index c201aadb9301..492bcfa7572c 100644
--- a/kernel/locking/rwbase_rt.c
+++ b/kernel/locking/rwbase_rt.c
@@ -65,6 +65,9 @@  static __always_inline int rwbase_read_trylock(struct rwbase_rt *rwb)
 	return 0;
 }
 
+/* Allow reader bias with a pending writer for a minimum of 4ms or 1 tick. */
+#define RW_CONTENTION_THRESHOLD (HZ/250+1)
+
 static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
 				      unsigned int state)
 {
@@ -76,7 +79,8 @@  static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
 	 * Allow readers, as long as the writer has not completely
 	 * acquired the semaphore for write.
 	 */
-	if (atomic_read(&rwb->readers) != WRITER_BIAS) {
+	if (atomic_read(&rwb->readers) != WRITER_BIAS &&
+	    jiffies - rwb->waiter_blocked < RW_CONTENTION_THRESHOLD) {
 		atomic_inc(&rwb->readers);
 		raw_spin_unlock_irq(&rtm->wait_lock);
 		return 0;
@@ -264,12 +268,18 @@  static int __sched rwbase_write_lock(struct rwbase_rt *rwb,
 		if (__rwbase_write_trylock(rwb))
 			break;
 
+		/* Record first new read/write contention. */
+		if (!rwb->waiter_blocked)
+			rwb->waiter_blocked = jiffies;
+
 		raw_spin_unlock_irqrestore(&rtm->wait_lock, flags);
 		rwbase_schedule();
 		raw_spin_lock_irqsave(&rtm->wait_lock, flags);
 
 		set_current_state(state);
 	}
+
+	rwb->waiter_blocked = 0;
 	rwbase_restore_current_state();
 	trace_contention_end(rwb, 0);