diff mbox series

media: rkisp1: allow non-coherent video capture buffers

Message ID 20250102-b4-rkisp-noncoherent-v1-1-bba164f7132c@gmail.com
State Superseded
Headers show
Series media: rkisp1: allow non-coherent video capture buffers | expand

Commit Message

Mikhail Rudenko Jan. 2, 2025, 3:35 p.m. UTC
Currently, the rkisp1 driver always uses coherent DMA allocations for
video capture buffers. However, on some platforms, using non-coherent
buffers can improve performance, especially when CPU processing of
MMAP'ed video buffers is required.

For example, on the Rockchip RK3399 running at maximum CPU frequency,
the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
non-coherent DMA allocation. CPU usage also decreases accordingly.

This change allows userspace to request the allocation of non-coherent
buffers. Note that the behavior for existing users will remain unchanged
unless they explicitly set the V4L2_MEMORY_FLAG_NON_COHERENT flag when
allocating buffers.

Signed-off-by: Mikhail Rudenko <mike.rudenko@gmail.com>
---
 drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c | 1 +
 1 file changed, 1 insertion(+)


---
base-commit: 40ed9e9b2808beeb835bd0ed971fb364c285d39c
change-id: 20241231-b4-rkisp-noncoherent-ad6e7c7a68ba

Best regards,

Comments

Laurent Pinchart Jan. 3, 2025, 3:23 p.m. UTC | #1
On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote:
> Currently, the rkisp1 driver always uses coherent DMA allocations for
> video capture buffers. However, on some platforms, using non-coherent
> buffers can improve performance, especially when CPU processing of
> MMAP'ed video buffers is required.
> 
> For example, on the Rockchip RK3399 running at maximum CPU frequency,
> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
> non-coherent DMA allocation. CPU usage also decreases accordingly.

What's the time taken by the cache management operations ?

> This change allows userspace to request the allocation of non-coherent
> buffers. Note that the behavior for existing users will remain unchanged
> unless they explicitly set the V4L2_MEMORY_FLAG_NON_COHERENT flag when
> allocating buffers.
> 
> Signed-off-by: Mikhail Rudenko <mike.rudenko@gmail.com>
> ---
>  drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c b/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c
> index 6dcefd144d5abe358323e37ac6133c6134ac636e..c94f7d1d73a92646457a27da20726ec6f92e7717 100644
> --- a/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c
> +++ b/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c
> @@ -1563,6 +1563,7 @@ static int rkisp1_register_capture(struct rkisp1_capture *cap)
>  	q->timestamp_flags = V4L2_BUF_FLAG_TIMESTAMP_MONOTONIC;
>  	q->lock = &node->vlock;
>  	q->dev = cap->rkisp1->dev;
> +	q->allow_cache_hints = 1;
>  	ret = vb2_queue_init(q);
>  	if (ret) {
>  		dev_err(cap->rkisp1->dev,
> 
> ---
> base-commit: 40ed9e9b2808beeb835bd0ed971fb364c285d39c
> change-id: 20241231-b4-rkisp-noncoherent-ad6e7c7a68ba
Mikhail Rudenko Jan. 14, 2025, 4 p.m. UTC | #2
Hi Laurent,

On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote:

> On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote:
>> Currently, the rkisp1 driver always uses coherent DMA allocations for
>> video capture buffers. However, on some platforms, using non-coherent
>> buffers can improve performance, especially when CPU processing of
>> MMAP'ed video buffers is required.
>>
>> For example, on the Rockchip RK3399 running at maximum CPU frequency,
>> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
>> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
>> non-coherent DMA allocation. CPU usage also decreases accordingly.
>
> What's the time taken by the cache management operations ?

Sorry for the late reply, your question turned out a little more
interesting than I expected initially. :)

When capturing using Yavta with MMAP buffers under the conditions mentioned
in the commit message, ftrace gives 437.6 +- 1.1 us for
dma_sync_sgtable_for_cpu and 409 +- 14 us for
dma_sync_sgtable_for_device. Thus, it looks like using non-coherent
buffers in this case is more CPU-efficient even when considering cache
management overhead.

When trying to do the same measurements with libcamera, I failed. In a
typical libcamera use case when MMAP buffers are allocated from a
device, exported as dmabufs and then used for capture on the same device
with DMABUF memory type, cache management in kernel is skipped [1]
[2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so
DMA_BUF_IOCTL_SYNC from userspace does not work either.

So it looks like to make this change really useful, the above issue of
cache management for libcamera/DMABUF/videobuf2-dma-contig has to be
solved. I'm not an expert in this area, so any advice is kindly welcome. :)

[1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411
[2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829
[3] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-dma-contig.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n426

--
Best regards,
Mikhail Rudenko
Tomasz Figa Jan. 15, 2025, 8:31 a.m. UTC | #3
Hi Mikhail and Laurent,

On Wed, Jan 15, 2025 at 2:07 AM Mikhail Rudenko <mike.rudenko@gmail.com> wrote:
>
>
> Hi Laurent,
>
> On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote:
>
> > On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote:
> >> Currently, the rkisp1 driver always uses coherent DMA allocations for
> >> video capture buffers. However, on some platforms, using non-coherent
> >> buffers can improve performance, especially when CPU processing of
> >> MMAP'ed video buffers is required.
> >>
> >> For example, on the Rockchip RK3399 running at maximum CPU frequency,
> >> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
> >> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
> >> non-coherent DMA allocation. CPU usage also decreases accordingly.
> >
> > What's the time taken by the cache management operations ?
>
> Sorry for the late reply, your question turned out a little more
> interesting than I expected initially. :)
>
> When capturing using Yavta with MMAP buffers under the conditions mentioned
> in the commit message, ftrace gives 437.6 +- 1.1 us for
> dma_sync_sgtable_for_cpu and 409 +- 14 us for
> dma_sync_sgtable_for_device. Thus, it looks like using non-coherent
> buffers in this case is more CPU-efficient even when considering cache
> management overhead.
>
> When trying to do the same measurements with libcamera, I failed. In a
> typical libcamera use case when MMAP buffers are allocated from a
> device, exported as dmabufs and then used for capture on the same device
> with DMABUF memory type, cache management in kernel is skipped [1]
> [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so
> DMA_BUF_IOCTL_SYNC from userspace does not work either.

Oops, so I believe this is a bug. When an MMAP buffer is allocated in
the non-coherent mode, those ops should perform proper cache
maintenance.

Let me send a patch to fix this in a couple of days unless someone
does it earlier.

Best regards,
Tomasz

>
> So it looks like to make this change really useful, the above issue of
> cache management for libcamera/DMABUF/videobuf2-dma-contig has to be
> solved. I'm not an expert in this area, so any advice is kindly welcome. :)
>
> [1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411
> [2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829
> [3] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-dma-contig.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n426
>
> --
> Best regards,
> Mikhail Rudenko
>
Mikhail Rudenko Jan. 15, 2025, 5:29 p.m. UTC | #4
On 2025-01-15 at 23:46 +09, Tomasz Figa <tfiga@chromium.org> wrote:

> On Wed, Jan 15, 2025 at 10:30 PM Mikhail Rudenko <mike.rudenko@gmail.com> wrote:
>>
>> Hi Tomasz,
>>
>> On 2025-01-15 at 17:31 +09, Tomasz Figa <tfiga@chromium.org> wrote:
>>
>> > Hi Mikhail and Laurent,
>> >
>> > On Wed, Jan 15, 2025 at 2:07 AM Mikhail Rudenko <mike.rudenko@gmail.com> wrote:
>> >>
>> >>
>> >> Hi Laurent,
>> >>
>> >> On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote:
>> >>
>> >> > On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote:
>> >> >> Currently, the rkisp1 driver always uses coherent DMA allocations for
>> >> >> video capture buffers. However, on some platforms, using non-coherent
>> >> >> buffers can improve performance, especially when CPU processing of
>> >> >> MMAP'ed video buffers is required.
>> >> >>
>> >> >> For example, on the Rockchip RK3399 running at maximum CPU frequency,
>> >> >> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
>> >> >> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
>> >> >> non-coherent DMA allocation. CPU usage also decreases accordingly.
>> >> >
>> >> > What's the time taken by the cache management operations ?
>> >>
>> >> Sorry for the late reply, your question turned out a little more
>> >> interesting than I expected initially. :)
>> >>
>> >> When capturing using Yavta with MMAP buffers under the conditions mentioned
>> >> in the commit message, ftrace gives 437.6 +- 1.1 us for
>> >> dma_sync_sgtable_for_cpu and 409 +- 14 us for
>> >> dma_sync_sgtable_for_device. Thus, it looks like using non-coherent
>> >> buffers in this case is more CPU-efficient even when considering cache
>> >> management overhead.
>> >>
>> >> When trying to do the same measurements with libcamera, I failed. In a
>> >> typical libcamera use case when MMAP buffers are allocated from a
>> >> device, exported as dmabufs and then used for capture on the same device
>> >> with DMABUF memory type, cache management in kernel is skipped [1]
>> >> [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so
>> >> DMA_BUF_IOCTL_SYNC from userspace does not work either.
>> >
>> > Oops, so I believe this is a bug. When an MMAP buffer is allocated in
>> > the non-coherent mode, those ops should perform proper cache
>> > maintenance.
>>
>> Thanks for pointing this out!
>>
>> > Let me send a patch to fix this in a couple of days unless someone
>> > does it earlier.
>>
>> Now that we know that this is a bug, not an API misuse from my side, I
>> can fix this myself and send a v2. Would this be okay for you?
>
> I'd be more than happy :)

Done, see [1]. A review would be appreciated. :)

[1] https://lore.kernel.org/all/20250115-b4-rkisp-noncoherent-v2-0-0853e1a24012@gmail.com/


--
Best regards,
Mikhail Rudenko
diff mbox series

Patch

diff --git a/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c b/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c
index 6dcefd144d5abe358323e37ac6133c6134ac636e..c94f7d1d73a92646457a27da20726ec6f92e7717 100644
--- a/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c
+++ b/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c
@@ -1563,6 +1563,7 @@  static int rkisp1_register_capture(struct rkisp1_capture *cap)
 	q->timestamp_flags = V4L2_BUF_FLAG_TIMESTAMP_MONOTONIC;
 	q->lock = &node->vlock;
 	q->dev = cap->rkisp1->dev;
+	q->allow_cache_hints = 1;
 	ret = vb2_queue_init(q);
 	if (ret) {
 		dev_err(cap->rkisp1->dev,