Message ID | 20250102-b4-rkisp-noncoherent-v1-1-bba164f7132c@gmail.com |
---|---|
State | Superseded |
Headers | show |
Series | media: rkisp1: allow non-coherent video capture buffers | expand |
On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote: > Currently, the rkisp1 driver always uses coherent DMA allocations for > video capture buffers. However, on some platforms, using non-coherent > buffers can improve performance, especially when CPU processing of > MMAP'ed video buffers is required. > > For example, on the Rockchip RK3399 running at maximum CPU frequency, > the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a > malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using > non-coherent DMA allocation. CPU usage also decreases accordingly. What's the time taken by the cache management operations ? > This change allows userspace to request the allocation of non-coherent > buffers. Note that the behavior for existing users will remain unchanged > unless they explicitly set the V4L2_MEMORY_FLAG_NON_COHERENT flag when > allocating buffers. > > Signed-off-by: Mikhail Rudenko <mike.rudenko@gmail.com> > --- > drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c b/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c > index 6dcefd144d5abe358323e37ac6133c6134ac636e..c94f7d1d73a92646457a27da20726ec6f92e7717 100644 > --- a/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c > +++ b/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c > @@ -1563,6 +1563,7 @@ static int rkisp1_register_capture(struct rkisp1_capture *cap) > q->timestamp_flags = V4L2_BUF_FLAG_TIMESTAMP_MONOTONIC; > q->lock = &node->vlock; > q->dev = cap->rkisp1->dev; > + q->allow_cache_hints = 1; > ret = vb2_queue_init(q); > if (ret) { > dev_err(cap->rkisp1->dev, > > --- > base-commit: 40ed9e9b2808beeb835bd0ed971fb364c285d39c > change-id: 20241231-b4-rkisp-noncoherent-ad6e7c7a68ba
Hi Laurent, On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote: > On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote: >> Currently, the rkisp1 driver always uses coherent DMA allocations for >> video capture buffers. However, on some platforms, using non-coherent >> buffers can improve performance, especially when CPU processing of >> MMAP'ed video buffers is required. >> >> For example, on the Rockchip RK3399 running at maximum CPU frequency, >> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a >> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using >> non-coherent DMA allocation. CPU usage also decreases accordingly. > > What's the time taken by the cache management operations ? Sorry for the late reply, your question turned out a little more interesting than I expected initially. :) When capturing using Yavta with MMAP buffers under the conditions mentioned in the commit message, ftrace gives 437.6 +- 1.1 us for dma_sync_sgtable_for_cpu and 409 +- 14 us for dma_sync_sgtable_for_device. Thus, it looks like using non-coherent buffers in this case is more CPU-efficient even when considering cache management overhead. When trying to do the same measurements with libcamera, I failed. In a typical libcamera use case when MMAP buffers are allocated from a device, exported as dmabufs and then used for capture on the same device with DMABUF memory type, cache management in kernel is skipped [1] [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so DMA_BUF_IOCTL_SYNC from userspace does not work either. So it looks like to make this change really useful, the above issue of cache management for libcamera/DMABUF/videobuf2-dma-contig has to be solved. I'm not an expert in this area, so any advice is kindly welcome. :) [1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411 [2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829 [3] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-dma-contig.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n426 -- Best regards, Mikhail Rudenko
Hi Mikhail and Laurent, On Wed, Jan 15, 2025 at 2:07 AM Mikhail Rudenko <mike.rudenko@gmail.com> wrote: > > > Hi Laurent, > > On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote: > > > On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote: > >> Currently, the rkisp1 driver always uses coherent DMA allocations for > >> video capture buffers. However, on some platforms, using non-coherent > >> buffers can improve performance, especially when CPU processing of > >> MMAP'ed video buffers is required. > >> > >> For example, on the Rockchip RK3399 running at maximum CPU frequency, > >> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a > >> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using > >> non-coherent DMA allocation. CPU usage also decreases accordingly. > > > > What's the time taken by the cache management operations ? > > Sorry for the late reply, your question turned out a little more > interesting than I expected initially. :) > > When capturing using Yavta with MMAP buffers under the conditions mentioned > in the commit message, ftrace gives 437.6 +- 1.1 us for > dma_sync_sgtable_for_cpu and 409 +- 14 us for > dma_sync_sgtable_for_device. Thus, it looks like using non-coherent > buffers in this case is more CPU-efficient even when considering cache > management overhead. > > When trying to do the same measurements with libcamera, I failed. In a > typical libcamera use case when MMAP buffers are allocated from a > device, exported as dmabufs and then used for capture on the same device > with DMABUF memory type, cache management in kernel is skipped [1] > [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so > DMA_BUF_IOCTL_SYNC from userspace does not work either. Oops, so I believe this is a bug. When an MMAP buffer is allocated in the non-coherent mode, those ops should perform proper cache maintenance. Let me send a patch to fix this in a couple of days unless someone does it earlier. Best regards, Tomasz > > So it looks like to make this change really useful, the above issue of > cache management for libcamera/DMABUF/videobuf2-dma-contig has to be > solved. I'm not an expert in this area, so any advice is kindly welcome. :) > > [1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411 > [2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829 > [3] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-dma-contig.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n426 > > -- > Best regards, > Mikhail Rudenko >
On 2025-01-15 at 23:46 +09, Tomasz Figa <tfiga@chromium.org> wrote: > On Wed, Jan 15, 2025 at 10:30 PM Mikhail Rudenko <mike.rudenko@gmail.com> wrote: >> >> Hi Tomasz, >> >> On 2025-01-15 at 17:31 +09, Tomasz Figa <tfiga@chromium.org> wrote: >> >> > Hi Mikhail and Laurent, >> > >> > On Wed, Jan 15, 2025 at 2:07 AM Mikhail Rudenko <mike.rudenko@gmail.com> wrote: >> >> >> >> >> >> Hi Laurent, >> >> >> >> On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote: >> >> >> >> > On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote: >> >> >> Currently, the rkisp1 driver always uses coherent DMA allocations for >> >> >> video capture buffers. However, on some platforms, using non-coherent >> >> >> buffers can improve performance, especially when CPU processing of >> >> >> MMAP'ed video buffers is required. >> >> >> >> >> >> For example, on the Rockchip RK3399 running at maximum CPU frequency, >> >> >> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a >> >> >> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using >> >> >> non-coherent DMA allocation. CPU usage also decreases accordingly. >> >> > >> >> > What's the time taken by the cache management operations ? >> >> >> >> Sorry for the late reply, your question turned out a little more >> >> interesting than I expected initially. :) >> >> >> >> When capturing using Yavta with MMAP buffers under the conditions mentioned >> >> in the commit message, ftrace gives 437.6 +- 1.1 us for >> >> dma_sync_sgtable_for_cpu and 409 +- 14 us for >> >> dma_sync_sgtable_for_device. Thus, it looks like using non-coherent >> >> buffers in this case is more CPU-efficient even when considering cache >> >> management overhead. >> >> >> >> When trying to do the same measurements with libcamera, I failed. In a >> >> typical libcamera use case when MMAP buffers are allocated from a >> >> device, exported as dmabufs and then used for capture on the same device >> >> with DMABUF memory type, cache management in kernel is skipped [1] >> >> [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so >> >> DMA_BUF_IOCTL_SYNC from userspace does not work either. >> > >> > Oops, so I believe this is a bug. When an MMAP buffer is allocated in >> > the non-coherent mode, those ops should perform proper cache >> > maintenance. >> >> Thanks for pointing this out! >> >> > Let me send a patch to fix this in a couple of days unless someone >> > does it earlier. >> >> Now that we know that this is a bug, not an API misuse from my side, I >> can fix this myself and send a v2. Would this be okay for you? > > I'd be more than happy :) Done, see [1]. A review would be appreciated. :) [1] https://lore.kernel.org/all/20250115-b4-rkisp-noncoherent-v2-0-0853e1a24012@gmail.com/ -- Best regards, Mikhail Rudenko
diff --git a/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c b/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c index 6dcefd144d5abe358323e37ac6133c6134ac636e..c94f7d1d73a92646457a27da20726ec6f92e7717 100644 --- a/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c +++ b/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c @@ -1563,6 +1563,7 @@ static int rkisp1_register_capture(struct rkisp1_capture *cap) q->timestamp_flags = V4L2_BUF_FLAG_TIMESTAMP_MONOTONIC; q->lock = &node->vlock; q->dev = cap->rkisp1->dev; + q->allow_cache_hints = 1; ret = vb2_queue_init(q); if (ret) { dev_err(cap->rkisp1->dev,
Currently, the rkisp1 driver always uses coherent DMA allocations for video capture buffers. However, on some platforms, using non-coherent buffers can improve performance, especially when CPU processing of MMAP'ed video buffers is required. For example, on the Rockchip RK3399 running at maximum CPU frequency, the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using non-coherent DMA allocation. CPU usage also decreases accordingly. This change allows userspace to request the allocation of non-coherent buffers. Note that the behavior for existing users will remain unchanged unless they explicitly set the V4L2_MEMORY_FLAG_NON_COHERENT flag when allocating buffers. Signed-off-by: Mikhail Rudenko <mike.rudenko@gmail.com> --- drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c | 1 + 1 file changed, 1 insertion(+) --- base-commit: 40ed9e9b2808beeb835bd0ed971fb364c285d39c change-id: 20241231-b4-rkisp-noncoherent-ad6e7c7a68ba Best regards,