[v2,1/1] memory.h: Improve IOMMU related documentation

Message ID 20180501101907.22638-1-peter.maydell@linaro.org
State Superseded
Headers show
Series
  • [v2,1/1] memory.h: Improve IOMMU related documentation
Related show

Commit Message

Peter Maydell May 1, 2018, 10:19 a.m.
Add more detail to the documentation for memory_region_init_iommu()
and other IOMMU-related functions and data structures.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

---
v1 -> v2 changes:
 * documented replay method
 * added note about wanting RCU or big qemu lock while calling
   translate

 include/exec/memory.h | 96 +++++++++++++++++++++++++++++++++++++++----
 1 file changed, 87 insertions(+), 9 deletions(-)

-- 
2.17.0

Comments

Auger Eric May 1, 2018, 2:32 p.m. | #1
Hi Peter,

On 05/01/2018 12:19 PM, Peter Maydell wrote:
> Add more detail to the documentation for memory_region_init_iommu()

> and other IOMMU-related functions and data structures.

> 

> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

> ---

> v1 -> v2 changes:

>  * documented replay method

>  * added note about wanting RCU or big qemu lock while calling

>    translate

> 

>  include/exec/memory.h | 96 +++++++++++++++++++++++++++++++++++++++----

>  1 file changed, 87 insertions(+), 9 deletions(-)

> 

> diff --git a/include/exec/memory.h b/include/exec/memory.h

> index 4402ba6c0d..997598c664 100644

> --- a/include/exec/memory.h

> +++ b/include/exec/memory.h

> @@ -194,29 +194,95 @@ enum IOMMUMemoryRegionAttr {

>      IOMMU_ATTR_SPAPR_TCE_FD

>  };

>  

> +/**

> + * IOMMUMemoryRegionClass:

> + *

> + * All IOMMU implementations need to subclass TYPE_IOMMU_MEMORY_REGION

> + * and provide implementations of at least some of the methods here

> + * to handle requests to the memory region.

of at least the @translate method. Others are optional?
 The minimum requirement
> + * is a @translate method.

> + */

>  typedef struct IOMMUMemoryRegionClass {

>      /* private */

>      struct DeviceClass parent_class;

>  

>      /*

>       * Return a TLB entry that contains a given address. Flag should

> -     * be the access permission of this translation operation. We can

> -     * set flag to IOMMU_NONE to mean that we don't need any

> -     * read/write permission checks, like, when for region replay.

> +     * be the access permission of this translation operation. The

> +     * flag may be set to IOMMU_NONE to mean that we don't need any

> +     * read/write permission checks; this is used by the default

> +     * implementation of memory_region_iommu_replay(). (See the documentation

> +     * of the replay method below for more details.)

> +     *

> +     * Once the IOMMU has returned a TLB entry, it must notify

> +     * the IOMMU's users if that TLB entry changes, using

> +     * memory_region_notify_iommu() (or, if necessary, by calling

> +     * memory_region_notify_one() for each registered notifier).

if caching mode is set (VFIO integration), On the first guest map
(invalidation on map), the IOMMU must replay the mapping (notify VFIO)
to get the 1st stage of the physical IOMMU programmed. At that moment
the IOMMU may not have returned any TLB entry.

Maybe it is worth to define the term "IOMMU user", ie. someone who
registered an IOMMU notifier? In case the IOMMU is used without
vhost/VFIO, there is no registered notifier, however the IOMMU is "used".
> +     *

> +     * The returned information remains valid while the caller is

> +     * holding the big QEMU lock or is inside an RCU critical section;

> +     * if the caller wishes to cache the mapping beyond that it must

> +     * register an IOMMU notifier so it can invalidate its cached

> +     * information when the IOMMU mapping changes..

> +     *

> +     * @iommu: the IOMMUMemoryRegion

> +     * @hwaddr: address to be translated within the memory region

> +     * @flag: requested access permissions

>       */

>      IOMMUTLBEntry (*translate)(IOMMUMemoryRegion *iommu, hwaddr addr,

>                                 IOMMUAccessFlags flag);

> -    /* Returns minimum supported page size */

> +    /* Returns minimum supported page size in bytes.

> +     * If this method is not provided then the minimum is assumed to

> +     * be TARGET_PAGE_SIZE.

> +     *

> +     * @iommu: the IOMMUMemoryRegion

> +     */

>      uint64_t (*get_min_page_size)(IOMMUMemoryRegion *iommu);

> -    /* Called when IOMMU Notifier flag changed */

> +    /* Called when IOMMU Notifier flag changes (ie when the set of

> +     * events which IOMMU users are requesting notification for changes).

> +     * Optional method -- need not be provided if the IOMMU does not

> +     * need to know exactly which events must be notified.

> +     *

> +     * @iommu: the IOMMUMemoryRegion

> +     * @old_flags: events which previously needed to be notified

> +     * @new_flags: events which now need to be notified

> +     */

>      void (*notify_flag_changed)(IOMMUMemoryRegion *iommu,

>                                  IOMMUNotifierFlag old_flags,

>                                  IOMMUNotifierFlag new_flags);

> -    /* Set this up to provide customized IOMMU replay function */

> +    /* Called to handle memory_region_iommu_replay().

> +     *

> +     * The default implementation of memory_region_iommu_replay() is to

> +     * call the IOMMU translate method for every page in the address space

> +     * with flag == IOMMU_NONE

generally meaning invalidation?
 and then call the notifier if translate
I think possibly we could have several notifiers?
> +     * returns valid mapping

a valid mapping?
. If this method is implemented then it
> +     * overrides the default behaviour, and must provide the full semantics

> +     * of memory_region_iommu_replay(), by calling @notifier for every

> +     * translation present in the IOMMU.

by first invalidating the whole range and then calling @notifier for
every valid mapping?
> +     *

> +     * Optional method -- an IOMMU only needs to provide this method

> +     * if the default is inefficient or produces undesirable side effects.

> +     *

> +     * Note: this is not related to record-and-replay functionality.

The record-and-replay functionality is not obvious to me.
> +     */

>      void (*replay)(IOMMUMemoryRegion *iommu, IOMMUNotifier *notifier);

>  

> -    /* Get IOMMU misc attributes */

> -    int (*get_attr)(IOMMUMemoryRegion *iommu, enum IOMMUMemoryRegionAttr,

> +    /* Get IOMMU misc attributes. This is an optional method that

> +     * can be used to allow users of the IOMMU to get implementation-specific

> +     * information. The IOMMU implements this method to handle calls

> +     * by IOMMU users to memory_region_iommu_get_attr() by filling in

> +     * the arbitrary data pointer for any IOMMUMemoryRegionAttr values that

> +     * the IOMMU supports. If the method is unimplemented then

> +     * memory_region_iommu_get_attr() will always return -EINVAL.

> +     *

> +     * @iommu: the IOMMUMemoryRegion

> +     * @attr: attribute being queried

> +     * @data: memory to fill in with the attribute data

> +     *

> +     * Returns 0 on success, or a negative errno; in particular

> +     * returns -EINVAL for unrecognized or unimplemented attribute types.

> +     */

> +    int (*get_attr)(IOMMUMemoryRegion *iommu, enum IOMMUMemoryRegionAttr attr,

>                      void *data);

>  } IOMMUMemoryRegionClass;

>  

> @@ -705,6 +771,14 @@ static inline void memory_region_init_reservation(MemoryRegion *mr,

>   * An IOMMU region translates addresses and forwards accesses to a target

>   * memory region.

>   *

> + * The IOMMU implementation must define a subclass of TYPE_IOMMU_MEMORY_REGION.

> + * @_iommu_mr should be a pointer to enough memory for an instance of

> + * that subclass, @instance_size is the size of that subclass, and

> + * @mrtypename is its name. This function will initialize @_iommu_mr as an

> + * instance of the subclass, and its methods will then be called to handle

> + * accesses to the memory region. See the documentation of

> + * #IOMMUMemoryRegionClass for further details.

> + *

>   * @_iommu_mr: the #IOMMUMemoryRegion to be initialized

>   * @instance_size: the IOMMUMemoryRegion subclass instance size

>   * @mrtypename: the type name of the #IOMMUMemoryRegion

> @@ -953,6 +1027,8 @@ void memory_region_register_iommu_notifier(MemoryRegion *mr,

>   * a notifier with the minimum page granularity returned by

>   * mr->iommu_ops->get_page_size().

>   *

> + * Note: this is not related to record-and-replay functionality.

> + *

>   * @iommu_mr: the memory region to observe

>   * @n: the notifier to which to replay iommu mappings

>   */

> @@ -962,6 +1038,8 @@ void memory_region_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n);

>   * memory_region_iommu_replay_all: replay existing IOMMU translations

>   * to all the notifiers registered.

>   *

> + * Note: this is not related to record-and-replay functionality.

> + *

>   * @iommu_mr: the memory region to observe

>   */

>  void memory_region_iommu_replay_all(IOMMUMemoryRegion *iommu_mr);

> @@ -981,7 +1059,7 @@ void memory_region_unregister_iommu_notifier(MemoryRegion *mr,

>   * memory_region_iommu_get_attr: return an IOMMU attr if get_attr() is

>   * defined on the IOMMU.

>   *

> - * Returns 0 if succeded, error code otherwise.

> + * Returns 0 on success, or a negative errno otherwise.

>   *

>   * @iommu_mr: the memory region

>   * @attr: the requested attribute

> 


Thanks

Eric
Peter Maydell May 1, 2018, 3 p.m. | #2
On 1 May 2018 at 15:32, Auger Eric <eric.auger@redhat.com> wrote:
> Hi Peter,

>

> On 05/01/2018 12:19 PM, Peter Maydell wrote:

>> Add more detail to the documentation for memory_region_init_iommu()

>> and other IOMMU-related functions and data structures.

>>

>> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

>> ---

>> v1 -> v2 changes:

>>  * documented replay method

>>  * added note about wanting RCU or big qemu lock while calling

>>    translate


Hi; thanks for the review. I have some followup questions where
my understanding of the IOMMU is weak...

>>  include/exec/memory.h | 96 +++++++++++++++++++++++++++++++++++++++----

>>  1 file changed, 87 insertions(+), 9 deletions(-)

>>

>> diff --git a/include/exec/memory.h b/include/exec/memory.h

>> index 4402ba6c0d..997598c664 100644

>> --- a/include/exec/memory.h

>> +++ b/include/exec/memory.h

>> @@ -194,29 +194,95 @@ enum IOMMUMemoryRegionAttr {

>>      IOMMU_ATTR_SPAPR_TCE_FD

>>  };

>>

>> +/**

>> + * IOMMUMemoryRegionClass:

>> + *

>> + * All IOMMU implementations need to subclass TYPE_IOMMU_MEMORY_REGION

>> + * and provide implementations of at least some of the methods here

>> + * to handle requests to the memory region.

> of at least the @translate method. Others are optional?


Yes, that's probably a bit smoother wording.

>  The minimum requirement

>> + * is a @translate method.


>> + */

>>  typedef struct IOMMUMemoryRegionClass {

>>      /* private */

>>      struct DeviceClass parent_class;

>>

>>      /*

>>       * Return a TLB entry that contains a given address. Flag should

>> -     * be the access permission of this translation operation. We can

>> -     * set flag to IOMMU_NONE to mean that we don't need any

>> -     * read/write permission checks, like, when for region replay.

>> +     * be the access permission of this translation operation. The

>> +     * flag may be set to IOMMU_NONE to mean that we don't need any

>> +     * read/write permission checks; this is used by the default

>> +     * implementation of memory_region_iommu_replay(). (See the documentation

>> +     * of the replay method below for more details.)

>> +     *

>> +     * Once the IOMMU has returned a TLB entry, it must notify

>> +     * the IOMMU's users if that TLB entry changes, using

>> +     * memory_region_notify_iommu() (or, if necessary, by calling

>> +     * memory_region_notify_one() for each registered notifier).

> if caching mode is set (VFIO integration), On the first guest map

> (invalidation on map), the IOMMU must replay the mapping (notify VFIO)

> to get the 1st stage of the physical IOMMU programmed. At that moment

> the IOMMU may not have returned any TLB entry.


I'm afraid you've lost me. What is caching mode, how do you set it,
how does the IOMMU replay the mapping? More concretely, what change
needs to be made to this text?

> Maybe it is worth to define the term "IOMMU user", ie. someone who

> registered an IOMMU notifier? In case the IOMMU is used without

> vhost/VFIO, there is no registered notifier, however the IOMMU is "used".


I was using "IOMMU user" here in a fairly loose sense to mean anybody
using these interfaces (ie the code "outside" of the IOMMU implementation).

Maybe we should fix this by removing this paragraph and putting something
in the introductory "All IOMMU implementations need ..." instead as a
general description of the IOMMU's responsibilities regarding notifiers.

>> +     *

>> +     * The returned information remains valid while the caller is

>> +     * holding the big QEMU lock or is inside an RCU critical section;

>> +     * if the caller wishes to cache the mapping beyond that it must

>> +     * register an IOMMU notifier so it can invalidate its cached

>> +     * information when the IOMMU mapping changes..

>> +     *

>> +     * @iommu: the IOMMUMemoryRegion

>> +     * @hwaddr: address to be translated within the memory region

>> +     * @flag: requested access permissions

>>       */

>>      IOMMUTLBEntry (*translate)(IOMMUMemoryRegion *iommu, hwaddr addr,

>>                                 IOMMUAccessFlags flag);

>> -    /* Returns minimum supported page size */

>> +    /* Returns minimum supported page size in bytes.

>> +     * If this method is not provided then the minimum is assumed to

>> +     * be TARGET_PAGE_SIZE.

>> +     *

>> +     * @iommu: the IOMMUMemoryRegion

>> +     */

>>      uint64_t (*get_min_page_size)(IOMMUMemoryRegion *iommu);

>> -    /* Called when IOMMU Notifier flag changed */

>> +    /* Called when IOMMU Notifier flag changes (ie when the set of

>> +     * events which IOMMU users are requesting notification for changes).

>> +     * Optional method -- need not be provided if the IOMMU does not

>> +     * need to know exactly which events must be notified.

>> +     *

>> +     * @iommu: the IOMMUMemoryRegion

>> +     * @old_flags: events which previously needed to be notified

>> +     * @new_flags: events which now need to be notified

>> +     */

>>      void (*notify_flag_changed)(IOMMUMemoryRegion *iommu,

>>                                  IOMMUNotifierFlag old_flags,

>>                                  IOMMUNotifierFlag new_flags);

>> -    /* Set this up to provide customized IOMMU replay function */

>> +    /* Called to handle memory_region_iommu_replay().

>> +     *

>> +     * The default implementation of memory_region_iommu_replay() is to

>> +     * call the IOMMU translate method for every page in the address space

>> +     * with flag == IOMMU_NONE

> generally meaning invalidation?


The flags argument is something I find a bit confusing overall --
what does it mean to put a transaction through an IOMMU that
isn't either a read transaction or a write transaction? Do we
use IOMMU_NONE for anything except this special purpose "I am
the default replay implementation and just want to cause notifiers
to be called" ?

>  and then call the notifier if translate

> I think possibly we could have several notifiers?


This method takes one notifier pointer and is responsible
for notifying it and only it, I think.

>> +     * returns valid mapping

> a valid mapping?


Yes.

> . If this method is implemented then it

>> +     * overrides the default behaviour, and must provide the full semantics

>> +     * of memory_region_iommu_replay(), by calling @notifier for every

>> +     * translation present in the IOMMU.

> by first invalidating the whole range and then calling @notifier for

> every valid mapping?


Wait, why do we have to invalidate anything? I thought replay was
purely a "tell the notifier about the current state of the IOMMU"
function?

>> +     *

>> +     * Optional method -- an IOMMU only needs to provide this method

>> +     * if the default is inefficient or produces undesirable side effects.

>> +     *

>> +     * Note: this is not related to record-and-replay functionality.

> The record-and-replay functionality is not obvious to me.


This is just a note to say "this is not anything to do with what the
rest of the codebase calls 'replay', it's just an in-retrospect
poorly chosen method name, so don't confuse the two".

>> +     */

>>      void (*replay)(IOMMUMemoryRegion *iommu, IOMMUNotifier *notifier);


thanks
-- PMM
Auger Eric May 1, 2018, 3:53 p.m. | #3
Hi Peter,

On 05/01/2018 05:00 PM, Peter Maydell wrote:
> On 1 May 2018 at 15:32, Auger Eric <eric.auger@redhat.com> wrote:

>> Hi Peter,

>>

>> On 05/01/2018 12:19 PM, Peter Maydell wrote:

>>> Add more detail to the documentation for memory_region_init_iommu()

>>> and other IOMMU-related functions and data structures.

>>>

>>> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

>>> ---

>>> v1 -> v2 changes:

>>>  * documented replay method

>>>  * added note about wanting RCU or big qemu lock while calling

>>>    translate

> 

> Hi; thanks for the review. I have some followup questions where

> my understanding of the IOMMU is weak...

> 

>>>  include/exec/memory.h | 96 +++++++++++++++++++++++++++++++++++++++----

>>>  1 file changed, 87 insertions(+), 9 deletions(-)

>>>

>>> diff --git a/include/exec/memory.h b/include/exec/memory.h

>>> index 4402ba6c0d..997598c664 100644

>>> --- a/include/exec/memory.h

>>> +++ b/include/exec/memory.h

>>> @@ -194,29 +194,95 @@ enum IOMMUMemoryRegionAttr {

>>>      IOMMU_ATTR_SPAPR_TCE_FD

>>>  };

>>>

>>> +/**

>>> + * IOMMUMemoryRegionClass:

>>> + *

>>> + * All IOMMU implementations need to subclass TYPE_IOMMU_MEMORY_REGION

>>> + * and provide implementations of at least some of the methods here

>>> + * to handle requests to the memory region.

>> of at least the @translate method. Others are optional?

> 

> Yes, that's probably a bit smoother wording.

> 

>>  The minimum requirement

>>> + * is a @translate method.

> 

>>> + */

>>>  typedef struct IOMMUMemoryRegionClass {

>>>      /* private */

>>>      struct DeviceClass parent_class;

>>>

>>>      /*

>>>       * Return a TLB entry that contains a given address. Flag should

>>> -     * be the access permission of this translation operation. We can

>>> -     * set flag to IOMMU_NONE to mean that we don't need any

>>> -     * read/write permission checks, like, when for region replay.

>>> +     * be the access permission of this translation operation. The

>>> +     * flag may be set to IOMMU_NONE to mean that we don't need any

>>> +     * read/write permission checks; this is used by the default

>>> +     * implementation of memory_region_iommu_replay(). (See the documentation

>>> +     * of the replay method below for more details.)

>>> +     *

>>> +     * Once the IOMMU has returned a TLB entry, it must notify

>>> +     * the IOMMU's users if that TLB entry changes, using

>>> +     * memory_region_notify_iommu() (or, if necessary, by calling

>>> +     * memory_region_notify_one() for each registered notifier).

>> if caching mode is set (VFIO integration), On the first guest map

>> (invalidation on map), the IOMMU must replay the mapping (notify VFIO)

>> to get the 1st stage of the physical IOMMU programmed. At that moment

>> the IOMMU may not have returned any TLB entry.

> 

> I'm afraid you've lost me. What is caching mode, how do you set it,

> how does the IOMMU replay the mapping? More concretely, what change

> needs to be made to this text?


np. The caching mode relates to Intel IOMMU. When this bit is advertised
in the Intel IOMMU, it forces the IOMMU driver to invalidate TLB entries
also on map and not only on unmap. So when any iova is mapped, an
invalidation command is sent and this allows to trap when entries are
created. Then the IOMMU device can notify "users" that a mapping
happened and this mapping can be forwarded to the physical IOMMU.
Caching mode is set when a VFIO device is behind the IOMMU and both
stage 1 and stage 2 are combined in the physical IOMMU.

what about: the IOMMU is responsible to call the registered user
notifiers when the associated event occurs on an IOTLB entry (map/unmap).
> 

>> Maybe it is worth to define the term "IOMMU user", ie. someone who

>> registered an IOMMU notifier? In case the IOMMU is used without

>> vhost/VFIO, there is no registered notifier, however the IOMMU is "used".

> 

> I was using "IOMMU user" here in a fairly loose sense to mean anybody

> using these interfaces (ie the code "outside" of the IOMMU implementation).

> 

> Maybe we should fix this by removing this paragraph and putting something

> in the introductory "All IOMMU implementations need ..." instead as a

> general description of the IOMMU's responsibilities regarding notifiers.

> 

>>> +     *

>>> +     * The returned information remains valid while the caller is

>>> +     * holding the big QEMU lock or is inside an RCU critical section;

>>> +     * if the caller wishes to cache the mapping beyond that it must

>>> +     * register an IOMMU notifier so it can invalidate its cached

>>> +     * information when the IOMMU mapping changes..

>>> +     *

>>> +     * @iommu: the IOMMUMemoryRegion

>>> +     * @hwaddr: address to be translated within the memory region

>>> +     * @flag: requested access permissions

>>>       */

>>>      IOMMUTLBEntry (*translate)(IOMMUMemoryRegion *iommu, hwaddr addr,

>>>                                 IOMMUAccessFlags flag);

>>> -    /* Returns minimum supported page size */

>>> +    /* Returns minimum supported page size in bytes.

>>> +     * If this method is not provided then the minimum is assumed to

>>> +     * be TARGET_PAGE_SIZE.

>>> +     *

>>> +     * @iommu: the IOMMUMemoryRegion

>>> +     */

>>>      uint64_t (*get_min_page_size)(IOMMUMemoryRegion *iommu);

>>> -    /* Called when IOMMU Notifier flag changed */

>>> +    /* Called when IOMMU Notifier flag changes (ie when the set of

>>> +     * events which IOMMU users are requesting notification for changes).

>>> +     * Optional method -- need not be provided if the IOMMU does not

>>> +     * need to know exactly which events must be notified.

>>> +     *

>>> +     * @iommu: the IOMMUMemoryRegion

>>> +     * @old_flags: events which previously needed to be notified

>>> +     * @new_flags: events which now need to be notified

>>> +     */

>>>      void (*notify_flag_changed)(IOMMUMemoryRegion *iommu,

>>>                                  IOMMUNotifierFlag old_flags,

>>>                                  IOMMUNotifierFlag new_flags);

>>> -    /* Set this up to provide customized IOMMU replay function */

>>> +    /* Called to handle memory_region_iommu_replay().

>>> +     *

>>> +     * The default implementation of memory_region_iommu_replay() is to

>>> +     * call the IOMMU translate method for every page in the address space

>>> +     * with flag == IOMMU_NONE

>> generally meaning invalidation?

> 

> The flags argument is something I find a bit confusing overall --

> what does it mean to put a transaction through an IOMMU that

> isn't either a read transaction or a write transaction? Do we

> use IOMMU_NONE for anything except this special purpose "I am

> the default replay implementation and just want to cause notifiers

> to be called" ?


No I don't think so. To me the purpose of the change was for replay
(before we only had is_write and it was not obvious what it meant for a
replay action, hence the change I think).
bf55b7a  memory: tune last param of iommu_ops.translate()
> 

>>  and then call the notifier if translate

>> I think possibly we could have several notifiers?

> 

> This method takes one notifier pointer and is responsible

> for notifying it and only it, I think.

oh yes forget it.
> 

>>> +     * returns valid mapping

>> a valid mapping?

> 

> Yes.

> 

>> . If this method is implemented then it

>>> +     * overrides the default behaviour, and must provide the full semantics

>>> +     * of memory_region_iommu_replay(), by calling @notifier for every

>>> +     * translation present in the IOMMU.

>> by first invalidating the whole range and then calling @notifier for

>> every valid mapping?

> 

> Wait, why do we have to invalidate anything? I thought replay was

> purely a "tell the notifier about the current state of the IOMMU"

> function?



Hum effectively the generic implementation of memory_region_iommu_replay
only notifies when a mapping exists.

Intel's custom method also removes stale entries and make the cache
consistent with page tables for the mr's span. This is also what I did
when I implemented the tlbi on map trick in smmuv3.

see hw/i386/intel_iommu.c, vtd_iommu_replay
    /*
     * The replay can be triggered by either a invalidation or a newly
     * created entry. No matter what, we release existing mappings
     * (it means flushing caches for UNMAP-only registers).
     */

However most probably that's due to the fact the function is also used
internally by the intel iommu on invalidations (context).

So you are right invalidation is not mandated.

> 

>>> +     *

>>> +     * Optional method -- an IOMMU only needs to provide this method

>>> +     * if the default is inefficient or produces undesirable side effects.

>>> +     *

>>> +     * Note: this is not related to record-and-replay functionality.

>> The record-and-replay functionality is not obvious to me.

> 

> This is just a note to say "this is not anything to do with what the

> rest of the codebase calls 'replay', it's just an in-retrospect

> poorly chosen method name, so don't confuse the two".

ok

Thanks

Eric
> 

>>> +     */

>>>      void (*replay)(IOMMUMemoryRegion *iommu, IOMMUNotifier *notifier);

> 

> thanks

> -- PMM

>
Peter Maydell May 8, 2018, 4:25 p.m. | #4
On 1 May 2018 at 16:53, Auger Eric <eric.auger@redhat.com> wrote:
> Hi Peter,

>

> On 05/01/2018 05:00 PM, Peter Maydell wrote:

>> On 1 May 2018 at 15:32, Auger Eric <eric.auger@redhat.com> wrote:

>>> Hi Peter,

>>>

>>> On 05/01/2018 12:19 PM, Peter Maydell wrote:

>>>> +     * Once the IOMMU has returned a TLB entry, it must notify

>>>> +     * the IOMMU's users if that TLB entry changes, using

>>>> +     * memory_region_notify_iommu() (or, if necessary, by calling

>>>> +     * memory_region_notify_one() for each registered notifier).

>>> if caching mode is set (VFIO integration), On the first guest map

>>> (invalidation on map), the IOMMU must replay the mapping (notify VFIO)

>>> to get the 1st stage of the physical IOMMU programmed. At that moment

>>> the IOMMU may not have returned any TLB entry.

>>

>> I'm afraid you've lost me. What is caching mode, how do you set it,

>> how does the IOMMU replay the mapping? More concretely, what change

>> needs to be made to this text?

>

> np. The caching mode relates to Intel IOMMU. When this bit is advertised

> in the Intel IOMMU, it forces the IOMMU driver to invalidate TLB entries

> also on map and not only on unmap. So when any iova is mapped, an

> invalidation command is sent and this allows to trap when entries are

> created. Then the IOMMU device can notify "users" that a mapping

> happened and this mapping can be forwarded to the physical IOMMU.

> Caching mode is set when a VFIO device is behind the IOMMU and both

> stage 1 and stage 2 are combined in the physical IOMMU.

>

> what about: the IOMMU is responsible to call the registered user

> notifiers when the associated event occurs on an IOTLB entry (map/unmap).


This runs into something I found when I was implementing the Arm
Memory Protection Controller -- at the point when the guest changes
the config, it doesn't have enough information to be able to call a
"map" notifier, because the mapping depends on the memory transaction
attributes, it's not fixed and dependent only on the address. We can
say "unmap" to indicate that the old information is not usable any
more, but that's all. (Similarly if the mapping depends on the
read/write permission then the IOMMU isn't going to be able to
provide a new mapping to the notifier, because it doesn't have the
information to do that.)

My use-case for notifiers (getting the TCG code to throw away its
TLB entries for the now-stale information) doesn't require anything
more from the notifier, but it's not clear to me what the exact
notifier-calling requirements are.

If you can't actually use the memory transaction attributes info
to inform your translation, then there's not much point plumbing
them into the IOMMU.

thanks
-- PMM
Auger Eric May 16, 2018, 3:33 p.m. | #5
Hi Peter,

On 05/08/2018 06:25 PM, Peter Maydell wrote:
> On 1 May 2018 at 16:53, Auger Eric <eric.auger@redhat.com> wrote:

>> Hi Peter,

>>

>> On 05/01/2018 05:00 PM, Peter Maydell wrote:

>>> On 1 May 2018 at 15:32, Auger Eric <eric.auger@redhat.com> wrote:

>>>> Hi Peter,

>>>>

>>>> On 05/01/2018 12:19 PM, Peter Maydell wrote:

>>>>> +     * Once the IOMMU has returned a TLB entry, it must notify

>>>>> +     * the IOMMU's users if that TLB entry changes, using

>>>>> +     * memory_region_notify_iommu() (or, if necessary, by calling

>>>>> +     * memory_region_notify_one() for each registered notifier).

>>>> if caching mode is set (VFIO integration), On the first guest map

>>>> (invalidation on map), the IOMMU must replay the mapping (notify VFIO)

>>>> to get the 1st stage of the physical IOMMU programmed. At that moment

>>>> the IOMMU may not have returned any TLB entry.

>>>

>>> I'm afraid you've lost me. What is caching mode, how do you set it,

>>> how does the IOMMU replay the mapping? More concretely, what change

>>> needs to be made to this text?

>>

>> np. The caching mode relates to Intel IOMMU. When this bit is advertised

>> in the Intel IOMMU, it forces the IOMMU driver to invalidate TLB entries

>> also on map and not only on unmap. So when any iova is mapped, an

>> invalidation command is sent and this allows to trap when entries are

>> created. Then the IOMMU device can notify "users" that a mapping

>> happened and this mapping can be forwarded to the physical IOMMU.

>> Caching mode is set when a VFIO device is behind the IOMMU and both

>> stage 1 and stage 2 are combined in the physical IOMMU.

>>

>> what about: the IOMMU is responsible to call the registered user

>> notifiers when the associated event occurs on an IOTLB entry (map/unmap).

> 

> This runs into something I found when I was implementing the Arm

> Memory Protection Controller -- at the point when the guest changes

> the config,

do you mean the config structures (STE, CD) or the page table update?

 it doesn't have enough information to be able to call a
> "map" notifier, because the mapping depends on the memory transaction

> attributes, it's not fixed and dependent only on the address.

I am not sure I understand what you mean here. When the notifier get's
called, aren't we supposed to look for the info in the actual page table
(where memory access control attributes and others can be found at that
time, ie. ARM AP[] for instance) and send those through the notifier (as
stored in the IOTLBEnry)?

For instance in the intel iommu code, the whole table is parsed and the
replay hook is called for all valid entries.

vtd_iommu_replay/ptw/vtd_replay_hook/memory_region_notify_one

Thanks

Eric
 We can
> say "unmap" to indicate that the old information is not usable any

> more, but that's all. (Similarly if the mapping depends on the

> read/write permission then the IOMMU isn't going to be able to

> provide a new mapping to the notifier, because it doesn't have the

> information to do that.)


> 

> My use-case for notifiers (getting the TCG code to throw away its

> TLB entries for the now-stale information) doesn't require anything

> more from the notifier, but it's not clear to me what the exact

> notifier-calling requirements are.

> 

> If you can't actually use the memory transaction attributes info

> to inform your translation, then there's not much point plumbing

> them into the IOMMU.

> 

> thanks

> -- PMM

>
Peter Maydell May 16, 2018, 4:18 p.m. | #6
On 16 May 2018 at 15:33, Auger Eric <eric.auger@redhat.com> wrote:
> On 05/08/2018 06:25 PM, Peter Maydell wrote:

>> This runs into something I found when I was implementing the Arm

>> Memory Protection Controller -- at the point when the guest changes

>> the config,

> do you mean the config structures (STE, CD) or the page table update?


The Memory Protection Controller is not the SMMUv3. The MPC
config is set using some registers to write to the MPC's LUT
(lookup table).

>  it doesn't have enough information to be able to call a

>> "map" notifier, because the mapping depends on the memory transaction

>> attributes, it's not fixed and dependent only on the address.

>

> I am not sure I understand what you mean here. When the notifier get's

> called, aren't we supposed to look for the info in the actual page table

> (where memory access control attributes and others can be found at that

> time, ie. ARM AP[] for instance) and send those through the notifier (as

> stored in the IOTLBEnry)?


The problem is that if your translations depend on the tx attributes,
ie "secure access to address X should translate to Y, but nonsecure
access to address X should translate to Z", then the notifier
API doesn't let you report that, because all it knows about is
unmap events which are "address X unmapped" and map events which
are "address X translates to Y".

Paolo has suggested some API changes in another thread (roughly,
having the notifier say which tx attributes matter for the translation,
and send multiple map events with appropriate information).

> For instance in the intel iommu code, the whole table is parsed and the

> replay hook is called for all valid entries.


This works because the intel iommu does not care about memory
transaction attributes: it has one translation for the input
address, regardless.

thanks
-- PMM
Auger Eric May 17, 2018, 7:36 a.m. | #7
Hi Peter,
On 05/16/2018 06:18 PM, Peter Maydell wrote:
> On 16 May 2018 at 15:33, Auger Eric <eric.auger@redhat.com> wrote:

>> On 05/08/2018 06:25 PM, Peter Maydell wrote:

>>> This runs into something I found when I was implementing the Arm

>>> Memory Protection Controller -- at the point when the guest changes

>>> the config,

>> do you mean the config structures (STE, CD) or the page table update?

> 

> The Memory Protection Controller is not the SMMUv3. The MPC

> config is set using some registers to write to the MPC's LUT

> (lookup table).

> 

>>  it doesn't have enough information to be able to call a

>>> "map" notifier, because the mapping depends on the memory transaction

>>> attributes, it's not fixed and dependent only on the address.

>>

>> I am not sure I understand what you mean here. When the notifier get's

>> called, aren't we supposed to look for the info in the actual page table

>> (where memory access control attributes and others can be found at that

>> time, ie. ARM AP[] for instance) and send those through the notifier (as

>> stored in the IOTLBEnry)?

> 

> The problem is that if your translations depend on the tx attributes,

> ie "secure access to address X should translate to Y, but nonsecure

> access to address X should translate to Z", then the notifier

> API doesn't let you report that, because all it knows about is

> unmap events which are "address X unmapped" and map events which

> are "address X translates to Y".

> 

> Paolo has suggested some API changes in another thread (roughly,

> having the notifier say which tx attributes matter for the translation,

> and send multiple map events with appropriate information).

> 

>> For instance in the intel iommu code, the whole table is parsed and the

>> replay hook is called for all valid entries.

> 

> This works because the intel iommu does not care about memory

> transaction attributes: it has one translation for the input

> address, regardless.


OK I get it now. Thank you for the explanations.

Eric
> 

> thanks

> -- PMM

>

Patch

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 4402ba6c0d..997598c664 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -194,29 +194,95 @@  enum IOMMUMemoryRegionAttr {
     IOMMU_ATTR_SPAPR_TCE_FD
 };
 
+/**
+ * IOMMUMemoryRegionClass:
+ *
+ * All IOMMU implementations need to subclass TYPE_IOMMU_MEMORY_REGION
+ * and provide implementations of at least some of the methods here
+ * to handle requests to the memory region. The minimum requirement
+ * is a @translate method.
+ */
 typedef struct IOMMUMemoryRegionClass {
     /* private */
     struct DeviceClass parent_class;
 
     /*
      * Return a TLB entry that contains a given address. Flag should
-     * be the access permission of this translation operation. We can
-     * set flag to IOMMU_NONE to mean that we don't need any
-     * read/write permission checks, like, when for region replay.
+     * be the access permission of this translation operation. The
+     * flag may be set to IOMMU_NONE to mean that we don't need any
+     * read/write permission checks; this is used by the default
+     * implementation of memory_region_iommu_replay(). (See the documentation
+     * of the replay method below for more details.)
+     *
+     * Once the IOMMU has returned a TLB entry, it must notify
+     * the IOMMU's users if that TLB entry changes, using
+     * memory_region_notify_iommu() (or, if necessary, by calling
+     * memory_region_notify_one() for each registered notifier).
+     *
+     * The returned information remains valid while the caller is
+     * holding the big QEMU lock or is inside an RCU critical section;
+     * if the caller wishes to cache the mapping beyond that it must
+     * register an IOMMU notifier so it can invalidate its cached
+     * information when the IOMMU mapping changes..
+     *
+     * @iommu: the IOMMUMemoryRegion
+     * @hwaddr: address to be translated within the memory region
+     * @flag: requested access permissions
      */
     IOMMUTLBEntry (*translate)(IOMMUMemoryRegion *iommu, hwaddr addr,
                                IOMMUAccessFlags flag);
-    /* Returns minimum supported page size */
+    /* Returns minimum supported page size in bytes.
+     * If this method is not provided then the minimum is assumed to
+     * be TARGET_PAGE_SIZE.
+     *
+     * @iommu: the IOMMUMemoryRegion
+     */
     uint64_t (*get_min_page_size)(IOMMUMemoryRegion *iommu);
-    /* Called when IOMMU Notifier flag changed */
+    /* Called when IOMMU Notifier flag changes (ie when the set of
+     * events which IOMMU users are requesting notification for changes).
+     * Optional method -- need not be provided if the IOMMU does not
+     * need to know exactly which events must be notified.
+     *
+     * @iommu: the IOMMUMemoryRegion
+     * @old_flags: events which previously needed to be notified
+     * @new_flags: events which now need to be notified
+     */
     void (*notify_flag_changed)(IOMMUMemoryRegion *iommu,
                                 IOMMUNotifierFlag old_flags,
                                 IOMMUNotifierFlag new_flags);
-    /* Set this up to provide customized IOMMU replay function */
+    /* Called to handle memory_region_iommu_replay().
+     *
+     * The default implementation of memory_region_iommu_replay() is to
+     * call the IOMMU translate method for every page in the address space
+     * with flag == IOMMU_NONE and then call the notifier if translate
+     * returns valid mapping. If this method is implemented then it
+     * overrides the default behaviour, and must provide the full semantics
+     * of memory_region_iommu_replay(), by calling @notifier for every
+     * translation present in the IOMMU.
+     *
+     * Optional method -- an IOMMU only needs to provide this method
+     * if the default is inefficient or produces undesirable side effects.
+     *
+     * Note: this is not related to record-and-replay functionality.
+     */
     void (*replay)(IOMMUMemoryRegion *iommu, IOMMUNotifier *notifier);
 
-    /* Get IOMMU misc attributes */
-    int (*get_attr)(IOMMUMemoryRegion *iommu, enum IOMMUMemoryRegionAttr,
+    /* Get IOMMU misc attributes. This is an optional method that
+     * can be used to allow users of the IOMMU to get implementation-specific
+     * information. The IOMMU implements this method to handle calls
+     * by IOMMU users to memory_region_iommu_get_attr() by filling in
+     * the arbitrary data pointer for any IOMMUMemoryRegionAttr values that
+     * the IOMMU supports. If the method is unimplemented then
+     * memory_region_iommu_get_attr() will always return -EINVAL.
+     *
+     * @iommu: the IOMMUMemoryRegion
+     * @attr: attribute being queried
+     * @data: memory to fill in with the attribute data
+     *
+     * Returns 0 on success, or a negative errno; in particular
+     * returns -EINVAL for unrecognized or unimplemented attribute types.
+     */
+    int (*get_attr)(IOMMUMemoryRegion *iommu, enum IOMMUMemoryRegionAttr attr,
                     void *data);
 } IOMMUMemoryRegionClass;
 
@@ -705,6 +771,14 @@  static inline void memory_region_init_reservation(MemoryRegion *mr,
  * An IOMMU region translates addresses and forwards accesses to a target
  * memory region.
  *
+ * The IOMMU implementation must define a subclass of TYPE_IOMMU_MEMORY_REGION.
+ * @_iommu_mr should be a pointer to enough memory for an instance of
+ * that subclass, @instance_size is the size of that subclass, and
+ * @mrtypename is its name. This function will initialize @_iommu_mr as an
+ * instance of the subclass, and its methods will then be called to handle
+ * accesses to the memory region. See the documentation of
+ * #IOMMUMemoryRegionClass for further details.
+ *
  * @_iommu_mr: the #IOMMUMemoryRegion to be initialized
  * @instance_size: the IOMMUMemoryRegion subclass instance size
  * @mrtypename: the type name of the #IOMMUMemoryRegion
@@ -953,6 +1027,8 @@  void memory_region_register_iommu_notifier(MemoryRegion *mr,
  * a notifier with the minimum page granularity returned by
  * mr->iommu_ops->get_page_size().
  *
+ * Note: this is not related to record-and-replay functionality.
+ *
  * @iommu_mr: the memory region to observe
  * @n: the notifier to which to replay iommu mappings
  */
@@ -962,6 +1038,8 @@  void memory_region_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n);
  * memory_region_iommu_replay_all: replay existing IOMMU translations
  * to all the notifiers registered.
  *
+ * Note: this is not related to record-and-replay functionality.
+ *
  * @iommu_mr: the memory region to observe
  */
 void memory_region_iommu_replay_all(IOMMUMemoryRegion *iommu_mr);
@@ -981,7 +1059,7 @@  void memory_region_unregister_iommu_notifier(MemoryRegion *mr,
  * memory_region_iommu_get_attr: return an IOMMU attr if get_attr() is
  * defined on the IOMMU.
  *
- * Returns 0 if succeded, error code otherwise.
+ * Returns 0 on success, or a negative errno otherwise.
  *
  * @iommu_mr: the memory region
  * @attr: the requested attribute