[00/30 for 5.4] Backport unencrypted non-blocking DMA allocations

Message ID 20200925161916.204667-1-pgonda@google.com

Message

Peter Gonda Sept. 25, 2020, 4:18 p.m. UTC
Currently, SEV-enabled guests hit might_sleep() warnings when a driver (nvme
in this case) allocates through the DMA API in a non-blockable context.
Making these unencrypted non-blocking DMA allocations come from the coherent
pools prevents this BUG.

BUG: sleeping function called from invalid context at mm/vmalloc.c:1710
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 3383, name: fio
2 locks held by fio/3383:
 #0: ffff93b6a8568348 (&sb->s_type->i_mutex_key#16){+.+.}, at: ext4_file_write_iter+0xa2/0x5d0
 #1: ffffffffa52a61a0 (rcu_read_lock){....}, at: hctx_lock+0x1a/0xe0
CPU: 0 PID: 3383 Comm: fio Tainted: G        W         5.5.10 #14
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 dump_stack+0x98/0xd5
 ___might_sleep+0x175/0x260
 __might_sleep+0x4a/0x80
 _vm_unmap_aliases+0x45/0x250
 vm_unmap_aliases+0x19/0x20
 __set_memory_enc_dec+0xa4/0x130
 set_memory_decrypted+0x10/0x20
 dma_direct_alloc_pages+0x148/0x150
 dma_direct_alloc+0xe/0x10
 dma_alloc_attrs+0x86/0xc0
 dma_pool_alloc+0x16f/0x2b0
 nvme_queue_rq+0x878/0xc30 [nvme]
 __blk_mq_try_issue_directly+0x135/0x200
 blk_mq_request_issue_directly+0x4f/0x80
 blk_mq_try_issue_list_directly+0x46/0xb0
 blk_mq_sched_insert_requests+0x19b/0x2b0
 blk_mq_flush_plug_list+0x22f/0x3b0
 blk_flush_plug_list+0xd1/0x100
 blk_finish_plug+0x2c/0x40
 iomap_dio_rw+0x427/0x490
 ext4_file_write_iter+0x181/0x5d0
 aio_write+0x109/0x1b0
 io_submit_one+0x7d0/0xfa0
 __x64_sys_io_submit+0xa2/0x280
 do_syscall_64+0x5f/0x250
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
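
The trace shows the shape of the problem: nvme_queue_rq() allocates with
GFP_ATOMIC under hctx_lock(), but the 5.4 dma-direct path decrypts the new
pages inline via set_memory_decrypted(), which can sleep.  With this series,
such requests are instead served from atomic coherent pools whose pages were
decrypted up front.  A simplified sketch of the resulting allocation-path
check (not the literal backported code; the function names are real, but the
dma_alloc_from_pool() signature is approximate and error handling is elided):

/*
 * Sketch only: the decision this series adds to the dma-direct
 * allocation path.  The atomic pools are filled (and their pages
 * decrypted once) from a blockable context, so GFP_ATOMIC callers
 * never reach the sleeping set_memory_decrypted().
 */
static void *sketch_dma_direct_alloc(struct device *dev, size_t size, gfp_t gfp)
{
	struct page *page;

	if (force_dma_unencrypted(dev) && !gfpflags_allow_blocking(gfp))
		/* pre-decrypted memory; safe to take in atomic context */
		return dma_alloc_from_pool(dev, size, &page, gfp);

	/* blockable callers keep the regular path, which may decrypt
	 * inline; that path is elided in this sketch */
	return NULL;
}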

Christoph Hellwig (9):
  dma-direct: remove __dma_direct_free_pages
  dma-direct: remove the dma_handle argument to __dma_direct_alloc_pages
  dma-direct: provide mmap and get_sgtable method overrides
  dma-mapping: merge the generic remapping helpers into dma-direct
  dma-direct: consolidate the error handling in dma_direct_alloc_pages
  xtensa: use the generic uncached segment support
  dma-direct: make uncached_kernel_address more general
  dma-mapping: DMA_COHERENT_POOL should select GENERIC_ALLOCATOR
  dma-pool: fix coherent pool allocations for IOMMU mappings

David Rientjes (13):
  dma-remap: separate DMA atomic pools from direct remap code
  dma-pool: add additional coherent pools to map to gfp mask
  dma-pool: dynamically expanding atomic pools
  dma-direct: atomic allocations must come from atomic coherent pools
  dma-pool: add pool sizes to debugfs
  x86/mm: unencrypted non-blocking DMA allocations use coherent pools
  dma-pool: scale the default DMA coherent pool size with memory
    capacity
  dma-pool: decouple DMA_REMAP from DMA_COHERENT_POOL
  dma-direct: always align allocation size in dma_direct_alloc_pages()
  dma-direct: re-encrypt memory if dma_direct_alloc_pages() fails
  dma-direct: check return value when encrypting or decrypting memory
  dma-direct: add missing set_memory_decrypted() for coherent mapping
  dma-mapping: warn when coherent pool is depleted

Geert Uytterhoeven (1):
  dma-pool: fix too large DMA pools on medium memory size systems

Huang Shijie (1):
  lib/genalloc.c: rename addr_in_gen_pool to gen_pool_has_addr

Nicolas Saenz Julienne (6):
  dma-direct: provide function to check physical memory area validity
  dma-pool: get rid of dma_in_atomic_pool()
  dma-pool: introduce dma_guess_pool()
  dma-pool: make sure atomic pool suits device
  dma-pool: do not allocate pool memory from CMA
  dma/direct: turn ARCH_ZONE_DMA_BITS into a variable

 Documentation/core-api/genalloc.rst    |   2 +-
 arch/Kconfig                           |   8 +-
 arch/arc/Kconfig                       |   1 -
 arch/arm/Kconfig                       |   1 -
 arch/arm/mm/dma-mapping.c              |   8 +-
 arch/arm64/Kconfig                     |   1 -
 arch/arm64/mm/init.c                   |   9 +-
 arch/ia64/Kconfig                      |   2 +-
 arch/ia64/kernel/dma-mapping.c         |   6 -
 arch/microblaze/Kconfig                |   3 +-
 arch/microblaze/mm/consistent.c        |   2 +-
 arch/mips/Kconfig                      |   7 +-
 arch/mips/mm/dma-noncoherent.c         |   8 +-
 arch/nios2/Kconfig                     |   3 +-
 arch/nios2/mm/dma-mapping.c            |   2 +-
 arch/powerpc/include/asm/page.h        |   9 -
 arch/powerpc/mm/mem.c                  |  20 +-
 arch/powerpc/platforms/Kconfig.cputype |   1 -
 arch/s390/include/asm/page.h           |   2 -
 arch/s390/mm/init.c                    |   1 +
 arch/x86/Kconfig                       |   1 +
 arch/xtensa/Kconfig                    |   6 +-
 arch/xtensa/include/asm/platform.h     |  27 ---
 arch/xtensa/kernel/Makefile            |   3 +-
 arch/xtensa/kernel/pci-dma.c           | 121 ++---------
 drivers/iommu/dma-iommu.c              |   5 +-
 drivers/misc/sram-exec.c               |   2 +-
 include/linux/dma-direct.h             |  12 +-
 include/linux/dma-mapping.h            |   7 +-
 include/linux/dma-noncoherent.h        |   4 +-
 include/linux/genalloc.h               |   2 +-
 kernel/dma/Kconfig                     |  20 +-
 kernel/dma/Makefile                    |   1 +
 kernel/dma/direct.c                    | 224 ++++++++++++++++----
 kernel/dma/mapping.c                   |  45 +----
 kernel/dma/pool.c                      | 270 +++++++++++++++++++++++++
 kernel/dma/remap.c                     | 176 +---------------
 lib/genalloc.c                         |   5 +-
 38 files changed, 564 insertions(+), 463 deletions(-)
 create mode 100644 kernel/dma/pool.c

Comments

Greg KH Oct. 5, 2020, 1:07 p.m. UTC | #1
On Fri, Sep 25, 2020 at 09:18:46AM -0700, Peter Gonda wrote:
> Currently, SEV-enabled guests hit might_sleep() warnings when a driver (nvme
> in this case) allocates through the DMA API in a non-blockable context.
> Making these unencrypted non-blocking DMA allocations come from the coherent
> pools prevents this BUG.
> 
> [... BUG trace, patch list, and diffstat snipped; identical to the cover letter above ...]

This is a pretty big set of changes for a stable tree.  Can I get some
acks/reviews/whatever by the developers involved here that this really
is a good idea to take into the 5.4.y tree?

thanks,

greg k-h
Peter Gonda Oct. 6, 2020, 3:26 p.m. UTC | #2
We realize this is a large set of changes, but it's the only way for us
to remove that customer-facing BUG for SEV guests. When David asked
back in May, a full series was preferred; is there another way forward?
I've CCed the authors of the changes.

On Mon, Oct 5, 2020 at 7:06 AM Greg KH <greg@kroah.com> wrote:
>
> On Fri, Sep 25, 2020 at 09:18:46AM -0700, Peter Gonda wrote:
> > Currently, SEV-enabled guests hit might_sleep() warnings when a driver (nvme
> > in this case) allocates through the DMA API in a non-blockable context.
> > Making these unencrypted non-blocking DMA allocations come from the coherent
> > pools prevents this BUG.
> >
> > [... BUG trace, patch list, and diffstat snipped ...]
>
> This is a pretty big set of changes for a stable tree.  Can I get some
> acks/reviews/whatever by the developers involved here that this really
> is a good idea to take into the 5.4.y tree?
>
> thanks,
>
> greg k-h
David Rientjes Oct. 6, 2020, 6:10 p.m. UTC | #3
Thanks Peter.

The series of commits certainly expanded from my initial set that I asked
about in a thread with the subject "DMA API stable backports for AMD SEV"
on May 19.  It turns out that switching how DMA memory is allocated based
on various characteristics of the allocation and device is trickier than
originally thought :)  There were a number of fixes needed for subtleties
and corner cases that folks ran into, but they were addressed and have been
merged by Linus.  I believe it's stable upstream and that we've been
thorough in compiling the full set of changes required for 5.4.

Note that without this series, all SEV-enabled guests will run into the 
"sleeping function called from invalid context" issue in the vmalloc layer 
that Peter cites when using certain drivers.  For such configurations, 
there is no way to avoid the "BUG" messages in the guest kernel when using 
AMD SEV unless this series is merged into an LTS kernel that the distros 
will then pick up.

For my 13 patches in the 30 patch series, I fully stand by Peter's 
backports and rationale for merge into 5.4 LTS.


On Tue, 6 Oct 2020, Peter Gonda wrote:

> We realize this is a large set of changes, but it's the only way for us
> to remove that customer-facing BUG for SEV guests. When David asked
> back in May, a full series was preferred; is there another way forward?
> I've CCed the authors of the changes.
> 
> On Mon, Oct 5, 2020 at 7:06 AM Greg KH <greg@kroah.com> wrote:
> >
> > On Fri, Sep 25, 2020 at 09:18:46AM -0700, Peter Gonda wrote:
> > > [... full cover letter snipped ...]
> >
> > This is a pretty big set of changes for a stable tree.  Can I get some
> > acks/reviews/whatever by the developers involved here that this really
> > is a good idea to take into the 5.4.y tree?
> >
> > thanks,
> >
> > greg k-h
>
Greg KH Jan. 4, 2021, 1 p.m. UTC | #4
On Tue, Oct 06, 2020 at 11:10:41AM -0700, David Rientjes wrote:
> Thanks Peter.
> 
> The series of commits certainly expanded from my initial set that I asked
> about in a thread with the subject "DMA API stable backports for AMD SEV"
> on May 19.  It turns out that switching how DMA memory is allocated based
> on various characteristics of the allocation and device is trickier than
> originally thought :)  There were a number of fixes needed for subtleties
> and corner cases that folks ran into, but they were addressed and have been
> merged by Linus.  I believe it's stable upstream and that we've been
> thorough in compiling the full set of changes required for 5.4.
> 
> Note that without this series, all SEV-enabled guests will run into the
> "sleeping function called from invalid context" issue in the vmalloc layer
> that Peter cites when using certain drivers.  For such configurations,
> there is no way to avoid the "BUG" messages in the guest kernel when using
> AMD SEV unless this series is merged into an LTS kernel that the distros
> will then pick up.
> 
> For my 13 patches in the 30 patch series, I fully stand by Peter's
> backports and rationale for merge into 5.4 LTS.

Given that this "feature" has never worked in the 5.4 or older kernels,
why should this be backported there?  This isn't a bugfix from what I
can tell, is it?  And if so, what kernel version did work properly?

And if someone really wants this new feature, why can't they just use a
newer kernel release?

thanks,

greg k-h
David Rientjes Jan. 4, 2021, 10:37 p.m. UTC | #5
On Mon, 4 Jan 2021, Greg KH wrote:

> > The series of commits certainly expanded from my initial set that I asked
> > about in a thread with the subject "DMA API stable backports for AMD SEV"
> > on May 19. [...]
> > 
> > For my 13 patches in the 30 patch series, I fully stand by Peter's
> > backports and rationale for merge into 5.4 LTS.
> 
> Given that this "feature" has never worked in the 5.4 or older kernels,
> why should this be backported there?  This isn't a bugfix from what I
> can tell, is it?  And if so, what kernel version did work properly?

I think it can be considered a bug fix.

Today, if you boot an SEV encrypted guest running 5.4 and it requires 
atomic DMA allocations, you'll get the "sleeping function called from 
invalid context" bugs.  We see this in our Cloud because there is a 
reliance on atomic allocations through the DMA API by the NVMe driver.  
Likely nobody else has triggered this because they don't have such driver 
dependencies.

No previous kernel version worked properly since SEV guest support was 
introduced in 4.14.
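
To make the trigger concrete, the pattern looks roughly like this (an
illustrative fragment only, not nvme's actual code):

/*
 * Illustrative only: a GFP_ATOMIC allocation through the DMA API from
 * a non-blockable context.  On 5.4 with SEV, the dma-direct backend
 * could still wind up in set_memory_decrypted(), which may sleep.
 */
static void *atomic_dma_alloc(struct dma_pool *pool, spinlock_t *lock,
			      dma_addr_t *dma)
{
	void *buf;

	spin_lock(lock);			/* must not sleep from here on */
	buf = dma_pool_alloc(pool, GFP_ATOMIC, dma);
	spin_unlock(lock);
	return buf;
}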

> And if someone really wants this new feature, why can't they just use a
> newer kernel release?

This is more of a product question that I'll defer to Peter and he can 
loop the necessary people in if required.

Since the SEV feature provides confidentiality for guest-managed memory,
running an unmodified guest vs. a guest the cloud provider has modified to
avoid these bugs is a very different experience from the perspective of the
customer trying to protect their data.

These customers are running standard distros that may be slow to upgrade
to new kernels released over the past few months.  We could certainly work
with the distros to backport this support directly on a case-by-case
basis, but the thought was to first fix this in 5.4 stable for everybody,
so that they all receive the fixes necessary for running a non-buggy SEV
encrypted guest, rather than having each distro do the backport separately
so they can run with SEV.
Greg KH Jan. 5, 2021, 5:58 a.m. UTC | #6
On Mon, Jan 04, 2021 at 02:37:00PM -0800, David Rientjes wrote:
> On Mon, 4 Jan 2021, Greg KH wrote:
> 
> > > The series of commits certainly expanded from my initial set that I asked
> > > about in a thread with the subject "DMA API stable backports for AMD SEV"
> > > on May 19. [...]
> > 
> > Given that this "feature" has never worked in the 5.4 or older kernels,
> > why should this be backported there?  This isn't a bugfix from what I
> > can tell, is it?  And if so, what kernel version did work properly?
> 
> I think it can be considered a bug fix.
> 
> Today, if you boot an SEV encrypted guest running 5.4 and it requires
> atomic DMA allocations, you'll get the "sleeping function called from
> invalid context" bugs.  We see this in our Cloud because there is a
> reliance on atomic allocations through the DMA API by the NVMe driver.
> Likely nobody else has triggered this because they don't have such driver
> dependencies.
> 
> No previous kernel version worked properly since SEV guest support was
> introduced in 4.14.

So since this has never worked, it is not a regression that is being
fixed, but rather, a "new feature".  And because of that, if you want it
to work properly, please use a new kernel that has all of these major
changes in it.

> > And if someone really wants this new feature, why can't they just use a
> > newer kernel release?
> 
> This is more of a product question that I'll defer to Peter and he can
> loop the necessary people in if required.

If you want to make a "product" of a new feature, using an old kernel
base, then yes, you have to backport this and you are on your own here.
That's just totally normal for all "products" that do not want to use
the latest kernel release.

> Since the SEV feature provides confidentiality for guest-managed memory,
> running an unmodified guest vs. a guest the cloud provider has modified to
> avoid these bugs is a very different experience from the perspective of the
> customer trying to protect their data.
> 
> These customers are running standard distros that may be slow to upgrade
> to new kernels released over the past few months.  We could certainly work
> with the distros to backport this support directly on a case-by-case
> basis, but the thought was to first fix this in 5.4 stable for everybody,
> so that they all receive the fixes necessary for running a non-buggy SEV
> encrypted guest, rather than having each distro do the backport separately
> so they can run with SEV.

What distro that is based on 5.4 and follows the upstream stable trees
has not already included these patches in its releases?  And what
prevents them from using a newer kernel release entirely for this new
feature their customers are requesting?

thanks,

greg k-h
David Rientjes Jan. 5, 2021, 6:38 a.m. UTC | #7
On Tue, 5 Jan 2021, Greg KH wrote:

> > I think it can be considered a bug fix.
> > 
> > Today, if you boot an SEV encrypted guest running 5.4 and it requires
> > atomic DMA allocations, you'll get the "sleeping function called from
> > invalid context" bugs.  We see this in our Cloud because there is a
> > reliance on atomic allocations through the DMA API by the NVMe driver.
> > Likely nobody else has triggered this because they don't have such driver
> > dependencies.
> > 
> > No previous kernel version worked properly since SEV guest support was
> > introduced in 4.14.
> 
> So since this has never worked, it is not a regression that is being
> fixed, but rather, a "new feature".  And because of that, if you want it
> to work properly, please use a new kernel that has all of these major
> changes in it.

Hmm, maybe :)  AMD shipped guest support in 4.14 and host support in 4.16
for the SEV feature.  It turns out that a subset of drivers (for Google,
NVMe) would run into scheduling-while-atomic bugs because they do
GFP_ATOMIC allocations through the DMA API, and that uses
set_memory_decrypted() for SEV, which can block.  I'd argue that's a bug in
the SEV feature for a subset of configs.

So this never worked correctly for a subset of drivers until I added
atomic DMA pools in 5.7, which was the preferred way of fixing it.  But
SEV as a feature works for everybody not using this subset of drivers.  I
wouldn't say that the fix is a "new feature" because it's the means by
which we provide unencrypted DMA memory for atomic allocators that can't
make the transition from encrypted to unencrypted during allocation because
of their context; it specifically addresses the bug.
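
The pools sidestep the context problem because the decrypt happens once, at
pool-fill time, from a context that is allowed to sleep.  Roughly (a rough
paraphrase of the idea in kernel/dma/pool.c, not the literal code; the real
version also handles allocation failure, remapping, and pool sizing):

/* Rough paraphrase, not the literal kernel/dma/pool.c code. */
static int sketch_pool_fill(struct gen_pool *pool, size_t size, gfp_t gfp)
{
	struct page *page = alloc_pages(gfp, get_order(size));
	void *addr = page_address(page);

	/* may sleep -- fine here, we are not in a driver's atomic path */
	set_memory_decrypted((unsigned long)addr, size >> PAGE_SHIFT);
	return gen_pool_add(pool, (unsigned long)addr, size, NUMA_NO_NODE);
}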

> What distro that is based on 5.4 and follows the upstream stable trees
> has not already included these patches in its releases?  And what
> prevents them from using a newer kernel release entirely for this new
> feature their customers are requesting?

I'll defer this to Peter, who has a far better understanding of the
base kernel versions that our customers use with SEV.

Thanks
Peter Gonda May 12, 2021, 7:06 p.m. UTC | #8
Reviving this thread.

This is reproducible on 5.4 LTS guests as long as
CONFIG_DEBUG_ATOMIC_SLEEP is set (that option only makes the debug check
fire; the underlying sleeping call happens either way). I am not sure
whether this can be formally classified as a regression, since I think
this issue would have shown up on all SEV guests using NVMe boot disks
since the start of SEV; from the guest user's perspective, though, it
still seems like a regression to me.

Currently I know of Ubuntu 20.04.1 being based on 5.4. I have also
sent this backport to 4.19, which is close to the 4.18 that RHEL 8 is
based on. I do not know what prevents the distros from using a more
recent kernel, but that seems like their choice. I was just hoping to
get these backports submitted into LTS to surface the issue and help any
distro wanting to backport these changes to prevent it. Greg, is there
more information I can provide as justification?

On Mon, Jan 4, 2021 at 11:38 PM David Rientjes <rientjes@google.com> wrote:
>
> [... full quote of the previous message snipped ...]