diff mbox series

[2/2] fbdev: Don't sort deferred-I/O pages by default

Message ID 20220210141111.5231-3-tzimmermann@suse.de
State New
Headers show
Series fbdev: Significantly improve performance of fbdefio mmap | expand

Commit Message

Thomas Zimmermann Feb. 10, 2022, 2:11 p.m. UTC
Fbdev's deferred I/O sorts all dirty pages by default, which incurs a
significant overhead. Make the sorting step optional and update the few
drivers that require it. Use a FIFO list by default.

Sorting pages by memory offset for deferred I/O performs an implicit
bubble-sort step on the list of dirty pages. The algorithm goes through
the list of dirty pages and inserts each new page according to its
index field. Even worse, list traversal always starts at the first
entry. As video memory is most likely updated scanline by scanline, the
algorithm traverses through the complete list for each updated page.

For example, with 1024x768x32bpp a page covers exactly one scanline.
Writing a single screen update from top to bottom requires updating
768 pages. With an average list length of 384 entries, a screen update
creates (768 * 384 =) 294912 compare operation.

Fix this by making the sorting step opt-in and update the few drivers
that require it. All other drivers work with unsorted page lists. Pages
are appended to the list. Therefore, in the common case of writing the
framebuffer top to bottom, pages are still sorted by offset, which may
have a positive effect on performance.

Playing a video [1] in mplayer's benchmark mode shows the difference
(i7-4790, FullHD, simpledrm, kernel with debugging).

  mplayer -benchmark -nosound -vo fbdev ./big_buck_bunny_720p_stereo.ogg

With sorted page lists:

  BENCHMARKs: VC:  32.960s VO:  73.068s A:   0.000s Sys:   2.413s =  108.441s
  BENCHMARK%: VC: 30.3947% VO: 67.3802% A:  0.0000% Sys:  2.2251% = 100.0000%

With unsorted page lists:

  BENCHMARKs: VC:  31.005s VO:  42.889s A:   0.000s Sys:   2.256s =   76.150s
  BENCHMARK%: VC: 40.7156% VO: 56.3219% A:  0.0000% Sys:  2.9625% = 100.0000%

VC shows the overhead of video decoding, VO shows the overhead of the
video output. Using unsorted page lists reduces the benchmark's run time
by ~32s/~25%.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://download.blender.org/peach/bigbuckbunny_movies/big_buck_bunny_720p_stereo.ogg # [1]
---
 drivers/staging/fbtft/fbtft-core.c  |  1 +
 drivers/video/fbdev/broadsheetfb.c  |  1 +
 drivers/video/fbdev/core/fb_defio.c | 19 ++++++++++++-------
 drivers/video/fbdev/metronomefb.c   |  1 +
 drivers/video/fbdev/udlfb.c         |  1 +
 include/linux/fb.h                  |  1 +
 6 files changed, 17 insertions(+), 7 deletions(-)

Comments

Geert Uytterhoeven Feb. 14, 2022, 8:05 a.m. UTC | #1
Hi Thomas,

On Thu, Feb 10, 2022 at 4:24 PM Thomas Zimmermann <tzimmermann@suse.de> wrote:
> Fbdev's deferred I/O sorts all dirty pages by default, which incurs a
> significant overhead. Make the sorting step optional and update the few
> drivers that require it. Use a FIFO list by default.
>
> Sorting pages by memory offset for deferred I/O performs an implicit
> bubble-sort step on the list of dirty pages. The algorithm goes through
> the list of dirty pages and inserts each new page according to its
> index field. Even worse, list traversal always starts at the first
> entry. As video memory is most likely updated scanline by scanline, the
> algorithm traverses through the complete list for each updated page.
>
> For example, with 1024x768x32bpp a page covers exactly one scanline.
> Writing a single screen update from top to bottom requires updating
> 768 pages. With an average list length of 384 entries, a screen update
> creates (768 * 384 =) 294912 compare operation.

What about using folios?
If consecutive pages are merged into a single entry, there's much less
(or nothing in the example above) to sort.

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
Geert Uytterhoeven Feb. 14, 2022, 9:05 a.m. UTC | #2
Hi Thomas,

On Mon, Feb 14, 2022 at 9:28 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
> Am 14.02.22 um 09:05 schrieb Geert Uytterhoeven:
> > On Thu, Feb 10, 2022 at 4:24 PM Thomas Zimmermann <tzimmermann@suse.de> wrote:
> >> Fbdev's deferred I/O sorts all dirty pages by default, which incurs a
> >> significant overhead. Make the sorting step optional and update the few
> >> drivers that require it. Use a FIFO list by default.
> >>
> >> Sorting pages by memory offset for deferred I/O performs an implicit
> >> bubble-sort step on the list of dirty pages. The algorithm goes through
> >> the list of dirty pages and inserts each new page according to its
> >> index field. Even worse, list traversal always starts at the first
> >> entry. As video memory is most likely updated scanline by scanline, the
> >> algorithm traverses through the complete list for each updated page.
> >>
> >> For example, with 1024x768x32bpp a page covers exactly one scanline.
> >> Writing a single screen update from top to bottom requires updating
> >> 768 pages. With an average list length of 384 entries, a screen update
> >> creates (768 * 384 =) 294912 compare operation.
> >
> > What about using folios?
> > If consecutive pages are merged into a single entry, there's much less
> > (or nothing in the example above) to sort.
>
> How would the code know that? Calls to page_mkwrite happen
> pagefault-by-pagefault in any order AFAICT.

fb_deferred_io_mkwrite() would still be called for a page, but an
adjacent page can be merged with an existing entry while adding it
to the list.

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
Thomas Zimmermann Feb. 14, 2022, 1:29 p.m. UTC | #3
Hi

Am 14.02.22 um 10:05 schrieb Geert Uytterhoeven:
> Hi Thomas,
> 
> On Mon, Feb 14, 2022 at 9:28 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>> Am 14.02.22 um 09:05 schrieb Geert Uytterhoeven:
>>> On Thu, Feb 10, 2022 at 4:24 PM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>>> Fbdev's deferred I/O sorts all dirty pages by default, which incurs a
>>>> significant overhead. Make the sorting step optional and update the few
>>>> drivers that require it. Use a FIFO list by default.
>>>>
>>>> Sorting pages by memory offset for deferred I/O performs an implicit
>>>> bubble-sort step on the list of dirty pages. The algorithm goes through
>>>> the list of dirty pages and inserts each new page according to its
>>>> index field. Even worse, list traversal always starts at the first
>>>> entry. As video memory is most likely updated scanline by scanline, the
>>>> algorithm traverses through the complete list for each updated page.
>>>>
>>>> For example, with 1024x768x32bpp a page covers exactly one scanline.
>>>> Writing a single screen update from top to bottom requires updating
>>>> 768 pages. With an average list length of 384 entries, a screen update
>>>> creates (768 * 384 =) 294912 compare operation.
>>>
>>> What about using folios?
>>> If consecutive pages are merged into a single entry, there's much less
>>> (or nothing in the example above) to sort.
>>
>> How would the code know that? Calls to page_mkwrite happen
>> pagefault-by-pagefault in any order AFAICT.
> 
> fb_deferred_io_mkwrite() would still be called for a page, but an
> adjacent page can be merged with an existing entry while adding it
> to the list.

I still don't understand how we'd use it to our advantage. Most drivers 
don't need sorted pages at all. A folio has strong alignment 
requirements for size and offset AFAICT. We might end up flushing way 
too much of the display memory.

Best regards
Thomas

> 
> Gr{oetje,eeting}s,
> 
>                          Geert
> 
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
> 
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
>                                  -- Linus Torvalds
diff mbox series

Patch

diff --git a/drivers/staging/fbtft/fbtft-core.c b/drivers/staging/fbtft/fbtft-core.c
index f2684d2d6851..4a35347b3020 100644
--- a/drivers/staging/fbtft/fbtft-core.c
+++ b/drivers/staging/fbtft/fbtft-core.c
@@ -654,6 +654,7 @@  struct fb_info *fbtft_framebuffer_alloc(struct fbtft_display *display,
 	fbops->fb_blank     =      fbtft_fb_blank;
 
 	fbdefio->delay =           HZ / fps;
+	fbdefio->sort_pagelist =   true;
 	fbdefio->deferred_io =     fbtft_deferred_io;
 	fb_deferred_io_init(info);
 
diff --git a/drivers/video/fbdev/broadsheetfb.c b/drivers/video/fbdev/broadsheetfb.c
index fd66f4d4a621..b9054f658838 100644
--- a/drivers/video/fbdev/broadsheetfb.c
+++ b/drivers/video/fbdev/broadsheetfb.c
@@ -1059,6 +1059,7 @@  static const struct fb_ops broadsheetfb_ops = {
 
 static struct fb_deferred_io broadsheetfb_defio = {
 	.delay		= HZ/4,
+	.sort_pagelist	= true,
 	.deferred_io	= broadsheetfb_dpy_deferred_io,
 };
 
diff --git a/drivers/video/fbdev/core/fb_defio.c b/drivers/video/fbdev/core/fb_defio.c
index 3727b1ca87b1..1f672cf253b2 100644
--- a/drivers/video/fbdev/core/fb_defio.c
+++ b/drivers/video/fbdev/core/fb_defio.c
@@ -132,15 +132,20 @@  static vm_fault_t fb_deferred_io_mkwrite(struct vm_fault *vmf)
 	if (!list_empty(&page->lru))
 		goto page_already_added;
 
-	/* we loop through the pagelist before adding in order
-	to keep the pagelist sorted */
-	list_for_each_entry(cur, &fbdefio->pagelist, lru) {
-		if (cur->index > page->index)
-			break;
+	if (fbdefio->sort_pagelist) {
+		/*
+		 * We loop through the pagelist before adding in order
+		 * to keep the pagelist sorted.
+		 */
+		list_for_each_entry(cur, &fbdefio->pagelist, lru) {
+			if (cur->index > page->index)
+				break;
+		}
+		list_add_tail(&page->lru, &cur->lru);
+	} else {
+		list_add_tail(&page->lru, &fbdefio->pagelist);
 	}
 
-	list_add_tail(&page->lru, &cur->lru);
-
 page_already_added:
 	mutex_unlock(&fbdefio->lock);
 
diff --git a/drivers/video/fbdev/metronomefb.c b/drivers/video/fbdev/metronomefb.c
index 952826557a0c..af858dd23ea6 100644
--- a/drivers/video/fbdev/metronomefb.c
+++ b/drivers/video/fbdev/metronomefb.c
@@ -568,6 +568,7 @@  static const struct fb_ops metronomefb_ops = {
 
 static struct fb_deferred_io metronomefb_defio = {
 	.delay		= HZ,
+	.sort_pagelist	= true,
 	.deferred_io	= metronomefb_dpy_deferred_io,
 };
 
diff --git a/drivers/video/fbdev/udlfb.c b/drivers/video/fbdev/udlfb.c
index b9cdd02c1000..184bb8433b78 100644
--- a/drivers/video/fbdev/udlfb.c
+++ b/drivers/video/fbdev/udlfb.c
@@ -980,6 +980,7 @@  static int dlfb_ops_open(struct fb_info *info, int user)
 
 		if (fbdefio) {
 			fbdefio->delay = DL_DEFIO_WRITE_DELAY;
+			fbdefio->sort_pagelist = true;
 			fbdefio->deferred_io = dlfb_dpy_deferred_io;
 		}
 
diff --git a/include/linux/fb.h b/include/linux/fb.h
index 3d7306c9a706..9a77ab615c36 100644
--- a/include/linux/fb.h
+++ b/include/linux/fb.h
@@ -204,6 +204,7 @@  struct fb_pixmap {
 struct fb_deferred_io {
 	/* delay between mkwrite and deferred handler */
 	unsigned long delay;
+	bool sort_pagelist; /* sort pagelist by offset */
 	struct mutex lock; /* mutex that protects the page list */
 	struct list_head pagelist; /* list of touched pages */
 	/* callback */