
[v3,19/50] tcg: let plugins instrument memory accesses

Message ID 20190614171200.21078-20-alex.bennee@linaro.org
State New
Series tcg plugin support

Commit Message

Alex Bennée June 14, 2019, 5:11 p.m. UTC
From: "Emilio G. Cota" <cota@braap.org>


XXX: store hostaddr from non-i386 TCG backends (do it in a helper?)
XXX: what hostaddr to return for I/O accesses?
XXX: what hostaddr to return for cross-page accesses?

Here the trickiest feature is passing the host address to
memory callbacks that request it. Perhaps it would be more
appropriate to pass a "physical" address to plugins, but since
in QEMU host addr ~= guest physical, I'm going with that for
simplicity.

To keep the implementation simple we piggy-back on the TLB fast path,
and thus can only provide the host address _after_ memory accesses
have occurred. For the slow path, it's a bit tedious because there
are many places to update, but it's fairly simple.

However, note that cross-page accesses are tricky, since the
access might be to non-contiguous host addresses. So I'm punting
on that and just passing NULL.
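
The stash-and-read pattern the patch uses can be sketched standalone. All types and names below (`FakeEnv`, `FakeTLBCommon`, the simplified `load_helper`) are invented stand-ins for QEMU's real `CPUArchState`/`CPUTLBCommon` machinery; only the idea — the slow path stores the host address into a per-CPU slot *after* the access resolves, and only when a callback asked for it — mirrors the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for QEMU's env_tlb()/CPUTLBCommon machinery. */
typedef struct {
    void *hostaddr;        /* host address of the last guest access */
} FakeTLBCommon;

typedef struct {
    FakeTLBCommon c;
    uint8_t guest_ram[64]; /* pretend guest RAM */
} FakeEnv;

#define MO_HADDR (1 << 7)  /* "a plugin callback wants the host address" */

/* Store the host address only when a callback requested it. */
static void set_hostaddr(FakeEnv *env, int memop, void *haddr)
{
    if (memop & MO_HADDR) {
        env->c.hostaddr = haddr;
    }
}

/* A load helper records the host address after the access resolves. */
static uint8_t load_helper(FakeEnv *env, size_t guest_addr, int memop)
{
    void *haddr = &env->guest_ram[guest_addr];
    set_hostaddr(env, memop, haddr);
    return *(uint8_t *)haddr;
}
```

After a load, a memory callback can read the stashed pointer back; without the flag the slot is left untouched, which is the conditional-store choice Richard questions below.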

Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>


---
v3
  - fixes for cpu_neg()
---
 accel/tcg/atomic_template.h               |  5 +++
 accel/tcg/cpu-exec.c                      |  3 ++
 accel/tcg/cputlb.c                        | 37 +++++++++++++++++----
 accel/tcg/plugin-gen.c                    | 17 +++++-----
 include/exec/cpu-defs.h                   |  9 +++++
 include/exec/cpu_ldst.h                   |  9 +++++
 include/exec/cpu_ldst_template.h          | 40 ++++++++++++++---------
 include/exec/cpu_ldst_useronly_template.h | 34 ++++++++++++-------
 tcg/i386/tcg-target.inc.c                 |  8 +++++
 tcg/tcg-op.c                              | 40 ++++++++++++++++++-----
 tcg/tcg.h                                 |  1 +
 11 files changed, 153 insertions(+), 50 deletions(-)

-- 
2.20.1

Comments

Richard Henderson June 17, 2019, 8:51 p.m. UTC | #1
On 6/14/19 10:11 AM, Alex Bennée wrote:
> +static inline void set_hostaddr(CPUArchState *env, TCGMemOp mo, void *haddr)
> +{
> +#ifdef CONFIG_PLUGIN
> +    if (mo & MO_HADDR) {
> +        env_tlb(env)->c.hostaddr = haddr;
> +    }
> +#endif
> +}
> +

Even if we weren't talking about recomputing this in the helper, would an
unconditional store be cheaper?


r~
Aaron Lindsay OS June 28, 2019, 3:30 p.m. UTC | #2
On Jun 14 18:11, Alex Bennée wrote:
> From: "Emilio G. Cota" <cota@braap.org>
>
> Here the trickiest feature is passing the host address to
> memory callbacks that request it. Perhaps it would be more
> appropriate to pass a "physical" address to plugins, but since
> in QEMU host addr ~= guest physical, I'm going with that for
> simplicity.

How much more difficult would it be to get the true physical address (on
the guest)?

This is important enough to me that I would be willing to help if
pointed in the right direction.

-Aaron
Alex Bennée June 28, 2019, 5:11 p.m. UTC | #3
Aaron Lindsay OS <aaron@os.amperecomputing.com> writes:

> On Jun 14 18:11, Alex Bennée wrote:
>> From: "Emilio G. Cota" <cota@braap.org>
>>
>> Here the trickiest feature is passing the host address to
>> memory callbacks that request it. Perhaps it would be more
>> appropriate to pass a "physical" address to plugins, but since
>> in QEMU host addr ~= guest physical, I'm going with that for
>> simplicity.
>
> How much more difficult would it be to get the true physical address (on
> the guest)?

Previously there was a helper that converted host address (i.e. where
QEMU actually stores that value) back to the physical address (ram
offset + ram base). However the code for calculating all of this is
pretty invasive and requires tweaks to all the softmmu TCG backends as
well as hooks into a slew of memory functions.

I'm re-working this now so we just have the one memory callback and we
provide a helper function that can provide an opaque hwaddr struct which
can then be queried. The catch is you can only call this helper during a
memory callback. I'm not sure if having this restriction violates our
aim of not leaking implementation details to the plugin but it makes the
code simpler.

Internally what the helper does is simply re-query the SoftMMU TLB. As
the TLBs are per-CPU nothing else can have touched the TLB and the cache
should be hot so the cost of lookup should be minor. We could also
potentially expand the helpers so if you are interested in only IO
accesses we can do the full resolution and figure out what device we
just accessed.

> This is important enough to me that I would be willing to help if
> pointed in the right direction.

Well I'll certainly CC on the next series (hopefully posted Monday,
softfreeze starts Tuesday). I'll welcome any testing and review. Also if
you can tell us more about your use case that will help.

>
> -Aaron

--
Alex Bennée
Aaron Lindsay OS June 28, 2019, 5:58 p.m. UTC | #4
On Jun 28 18:11, Alex Bennée wrote:
> Aaron Lindsay OS <aaron@os.amperecomputing.com> writes:
> > On Jun 14 18:11, Alex Bennée wrote:
> >> From: "Emilio G. Cota" <cota@braap.org>
> >>
> >> Here the trickiest feature is passing the host address to
> >> memory callbacks that request it. Perhaps it would be more
> >> appropriate to pass a "physical" address to plugins, but since
> >> in QEMU host addr ~= guest physical, I'm going with that for
> >> simplicity.
> >
> > How much more difficult would it be to get the true physical address (on
> > the guest)?
>
> Previously there was a helper that converted host address (i.e. where
> QEMU actually stores that value) back to the physical address (ram
> offset + ram base). However the code for calculating all of this is
> pretty invasive and requires tweaks to all the softmmu TCG backends as
> well as hooks into a slew of memory functions.
>
> I'm re-working this now so we just have the one memory callback and we
> provide a helper function that can provide an opaque hwaddr struct which
> can then be queried.

To make sure I understand - you're implying that one such query will
return the PA from the guest's perspective, right?

> The catch is you can only call this helper during a
> memory callback.

Does this mean it will be difficult to get the physical address for the
bytes containing the instruction encoding itself?

> I'm not sure if having this restriction violates our
> aim of not leaking implementation details to the plugin but it makes the
> code simpler.

Assuming that the purpose of "not leaking implementation details" is to
allow the same plugin interface to work with other backend
implementations in the future, isn't this probably fine? It may add an
unnecessary limitation for another backend driving the same plugin
interface, but I don't think it likely changes the structure of the
interface itself. And that seems like the sort of restriction that could
easily be dropped in the future while remaining backwards-compatible.

> Internally what the helper does is simply re-query the SoftMMU TLB. As
> the TLBs are per-CPU nothing else can have touched the TLB and the cache
> should be hot so the cost of lookup should be minor. We could also
> potentially expand the helpers so if you are interested in only IO
> accesses we can do the full resolution and figure out what device we
> just accessed.

Oh, so you're already working on doing just what I asked about?

> > This is important enough to me that I would be willing to help if
> > pointed in the right direction.
>
> Well I'll certainly CC on the next series (hopefully posted Monday,
> softfreeze starts Tuesday). I'll welcome any testing and review. Also if
> you can tell us more about your use case that will help.

Awesome, thanks!

In terms of our use case - we use QEMU to drive studies to help us
design the next generation of processors. As you can imagine, having the
right physical addresses is important for some aspects of that. We're
currently using a version of Pavel Dovgalyuk's earlier plugin patchset
with some of our own patches/fixes on top, but it would obviously make
our lives easier to work together to get this sort of infrastructure
upstream!

-Aaron
Alex Bennée June 28, 2019, 8:52 p.m. UTC | #5
Aaron Lindsay OS <aaron@os.amperecomputing.com> writes:

> On Jun 28 18:11, Alex Bennée wrote:
>> Aaron Lindsay OS <aaron@os.amperecomputing.com> writes:
>> > On Jun 14 18:11, Alex Bennée wrote:
>> >> From: "Emilio G. Cota" <cota@braap.org>
>> >>
>> >> Here the trickiest feature is passing the host address to
>> >> memory callbacks that request it. Perhaps it would be more
>> >> appropriate to pass a "physical" address to plugins, but since
>> >> in QEMU host addr ~= guest physical, I'm going with that for
>> >> simplicity.
>> >
>> > How much more difficult would it be to get the true physical address (on
>> > the guest)?
>>
>> Previously there was a helper that converted host address (i.e. where
>> QEMU actually stores that value) back to the physical address (ram
>> offset + ram base). However the code for calculating all of this is
>> pretty invasive and requires tweaks to all the softmmu TCG backends as
>> well as hooks into a slew of memory functions.
>>
>> I'm re-working this now so we just have the one memory callback and we
>> provide a helper function that can provide an opaque hwaddr struct which
>> can then be queried.
>
> To make sure I understand - you're implying that one such query will
> return the PA from the guest's perspective, right?

Yes - although it will be two queries:

  struct qemu_plugin_hwaddr *hw = qemu_plugin_get_hwaddr(info, vaddr);

This does the actual lookup and stores enough information for the
further queries.

  uint64_t pa = qemu_plugin_hwaddr_to_raddr(hw);

will return the physical address (assuming it's a RAM reference and not
some IO location).
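
The two-step lookup described above can be mimicked in a standalone mock. The `qemu_plugin_*` names follow the draft API quoted in this thread; everything else here (the struct contents, the fixed RAM offset, the IO window, the `mock_*` prefixes) is invented purely for illustration of the handle-then-resolve pattern:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mock of the opaque handle the draft qemu_plugin_get_hwaddr() returns. */
struct qemu_plugin_hwaddr {
    bool is_io;       /* IO accesses have no meaningful RAM address */
    uint64_t raddr;   /* guest "physical" address for RAM accesses */
};

/* First query: pretend the TLB maps vaddr at a fixed RAM offset,
 * with an invented IO window at the top of the address space. */
static struct qemu_plugin_hwaddr mock_get_hwaddr(uint64_t vaddr)
{
    struct qemu_plugin_hwaddr hw;
    hw.is_io = (vaddr >= 0xf0000000ull);
    hw.raddr = hw.is_io ? 0 : vaddr + 0x40000000ull;
    return hw;
}

/* Second query: resolve the handle to a RAM address; 0 signals IO. */
static uint64_t mock_hwaddr_to_raddr(const struct qemu_plugin_hwaddr *hw)
{
    return hw->is_io ? 0 : hw->raddr;
}
```

Splitting lookup from resolution is what lets the real helper stay cheap: the first call re-queries the (hot) per-CPU TLB once, and further queries just read the cached handle.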

>> The catch is you can only call this helper during a
>> memory callback.
>
> Does this mean it will be difficult to get the physical address for the
> bytes containing the instruction encoding itself?

Hmm good question. We track the hostaddr of the instructions as we load
them so we should be able to track that back to the guest physical
address. There isn't a helper for doing that yet though.

>> I'm not sure if having this restriction violates our
>> aim of not leaking implementation details to the plugin but it makes the
>> code simpler.
>
> Assuming that the purpose of "not leaking implementation details" is to
> allow the same plugin interface to work with other backend
> implementations in the future, isn't this probably fine?

Quite. We don't want plugin authors to make any assumptions about the
internals of the TCG. It's not totally opaque because there are
translation time events where we offer the plugin a chance to instrument
individual instructions (or even a "block") which obviously exposes
there is a JIT of some sort.

> It may add an
> unnecessary limitation for another backend driving the same plugin
> interface, but I don't think it likely changes the structure of the
> interface itself. And that seems like the sort of restriction that could
> easily be dropped in the future while remaining backwards-compatible.
>
>> Internally what the helper does is simply re-query the SoftMMU TLB. As
>> the TLBs are per-CPU nothing else can have touched the TLB and the cache
>> should be hot so the cost of lookup should be minor. We could also
>> potentially expand the helpers so if you are interested in only IO
>> accesses we can do the full resolution and figure out what device we
>> just accessed.
>
> Oh, so you're already working on doing just what I asked about?

Yes.

>> > This is important enough to me that I would be willing to help if
>> > pointed in the right direction.
>>
>> Well I'll certainly CC on the next series (hopefully posted Monday,
>> softfreeze starts Tuesday). I'll welcome any testing and review. Also if
>> you can tell us more about your use case that will help.
>
> Awesome, thanks!
>
> In terms of our use case - we use QEMU to drive studies to help us
> design the next generation of processors. As you can imagine, having the
> right physical addresses is important for some aspects of that. We're
> currently using a version of Pavel Dovgalyuk's earlier plugin patchset
> with some of our own patches/fixes on top, but it would obviously make
> our lives easier to work together to get this sort of infrastructure
> upstream!

Was this:

 Date: Tue, 05 Jun 2018 13:39:15 +0300
 Message-ID: <152819515565.30857.16834004920507717324.stgit@pasha-ThinkPad-T60>
 Subject: [Qemu-devel] [RFC PATCH v2 0/7] QEMU binary instrumentation prototype

There have certainly been a lot of attempts to getting some sort of
plugin functionality into QEMU. I make no promises this one will be the
one but we shall see!

What patches did you add on top?

>
> -Aaron

--
Alex Bennée
Aaron Lindsay OS July 1, 2019, 2:40 p.m. UTC | #6
On Jun 28 21:52, Alex Bennée wrote:
> Aaron Lindsay OS <aaron@os.amperecomputing.com> writes:
> > To make sure I understand - you're implying that one such query will
> > return the PA from the guest's perspective, right?
>
> Yes - although it will be two queries:
>
>   struct qemu_plugin_hwaddr *hw = qemu_plugin_get_hwaddr(info, vaddr);
>
> This does the actual lookup and stores enough information for the
> further queries.
>
>   uint64_t pa = qemu_plugin_hwaddr_to_raddr(hw);
>
> will return the physical address (assuming it's a RAM reference and not
> some IO location).

Sounds good, as long as we have a good way to either prevent or cleanly
detect the failure mode for the IO accesses.

> > In terms of our use case - we use QEMU to drive studies to help us
> > design the next generation of processors. As you can imagine, having the
> > right physical addresses is important for some aspects of that. We're
> > currently using a version of Pavel Dovgalyuk's earlier plugin patchset
> > with some of our own patches/fixes on top, but it would obviously make
> > our lives easier to work together to get this sort of infrastructure
> > upstream!
>
> Was this:
>
>  Date: Tue, 05 Jun 2018 13:39:15 +0300
>  Message-ID: <152819515565.30857.16834004920507717324.stgit@pasha-ThinkPad-T60>
>  Subject: [Qemu-devel] [RFC PATCH v2 0/7] QEMU binary instrumentation prototype

Yes, that looks like the one.

> What patches did you add on top?


We added:
- plugin support for linux-user mode (I sent that one upstream, I think)
- memory tracing support and a VA->PA conversion helper
- a way for a plugin to request getting a callback just before QEMU
  exits to clean up any internal state
- a way for a plugin to reset any instrumentation decisions made in the
  past (essentially calls `tb_flush(cpu);` under the covers). We found
  this critical for plugins which undergo state changes during the
  course of their execution (i.e. watch for event X, then go into a more
  detailed profiling mode until you see event Y)
- instrumentation at the TB granularity (in addition to the existing
  instruction-level support)
- the ability for a plugin to trigger a checkpoint to be taken

-Aaron
Alex Bennée July 1, 2019, 3 p.m. UTC | #7
Aaron Lindsay OS <aaron@os.amperecomputing.com> writes:

> On Jun 28 21:52, Alex Bennée wrote:
>> Aaron Lindsay OS <aaron@os.amperecomputing.com> writes:
>> > To make sure I understand - you're implying that one such query will
>> > return the PA from the guest's perspective, right?
>>
>> Yes - although it will be two queries:
>>
>>   struct qemu_plugin_hwaddr *hw = qemu_plugin_get_hwaddr(info, vaddr);
>>
>> This does the actual lookup and stores enough information for the
>> further queries.
>>
>>   uint64_t pa = qemu_plugin_hwaddr_to_raddr(hw);
>>
>> will return the physical address (assuming it's a RAM reference and not
>> some IO location).
>
> Sounds good, as long as we have a good way to either prevent or cleanly
> detect the failure mode for the IO accesses.
>
>> > In terms of our use case - we use QEMU to drive studies to help us
>> > design the next generation of processors. As you can imagine, having the
>> > right physical addresses is important for some aspects of that. We're
>> > currently using a version of Pavel Dovgalyuk's earlier plugin patchset
>> > with some of our own patches/fixes on top, but it would obviously make
>> > our lives easier to work together to get this sort of infrastructure
>> > upstream!
>>
>> Was this:
>>
>>  Date: Tue, 05 Jun 2018 13:39:15 +0300
>>  Message-ID: <152819515565.30857.16834004920507717324.stgit@pasha-ThinkPad-T60>
>>  Subject: [Qemu-devel] [RFC PATCH v2 0/7] QEMU binary instrumentation prototype
>
> Yes, that looks like the one.
>
>> What patches did you add on top?
>
> We added:
> - plugin support for linux-user mode (I sent that one upstream, I think)
> - memory tracing support and a VA->PA conversion helper

check

> - a way for a plugin to request getting a callback just before QEMU
>   exits to clean up any internal state

check - qemu_plugin_register_atexit_cb

> - a way for a plugin to reset any instrumentation decisions made in the
>   past (essentially calls `tb_flush(cpu);` under the covers). We found
>   this critical for plugins which undergo state changes during the
>   course of their execution (i.e. watch for event X, then go into a more
>   detailed profiling mode until you see event Y)

check:

/**
 * qemu_plugin_reset() - Reset a plugin
 * @id: this plugin's opaque ID
 * @cb: callback to be called once the plugin has been reset
 *
 * Unregisters all callbacks for the plugin given by @id.
 *
 * Do NOT assume that the plugin has been reset once this function returns.
 * Plugins are reset asynchronously, and therefore the given plugin receives
 * callbacks until @cb is called.
 */
void qemu_plugin_reset(qemu_plugin_id_t id, qemu_plugin_simple_cb_t cb);
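
The state-machine use Aaron describes (watch for event X, reset, then profile until event Y) boils down to "drop every registered callback, then get told when that has taken effect". A tiny standalone mock of that pattern — all names here (`mock_register_cb`, `mock_plugin_reset`) are invented; only the reset semantics mirror the declaration above, minus the asynchrony:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CBS 8
typedef void (*mock_cb_t)(void);

/* Minimal callback registry standing in for a plugin's registrations. */
static mock_cb_t registered[MAX_CBS];
static size_t n_registered;

static void mock_register_cb(mock_cb_t cb)
{
    assert(n_registered < MAX_CBS);
    registered[n_registered++] = cb;
}

/* Like qemu_plugin_reset(): unregister everything, then notify.
 * In real QEMU the completion callback fires asynchronously; here it
 * runs inline for simplicity. */
static void mock_plugin_reset(void (*done)(void))
{
    n_registered = 0;   /* drop all instrumentation decisions */
    done();
}
```

A plugin flipping between profiling modes would call the reset from inside an event callback and re-register its "detailed" callbacks once the completion notification arrives.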


> - instrumentation at the TB granularity (in addition to the existing
>   instruction-level support)

check

/**
 * qemu_plugin_register_vcpu_tb_trans_cb() - register a translate cb
 * @id: plugin ID
 * @cb: callback function
 *
 * The @cb function is called every time a translation occurs. The @cb
 * function is passed an opaque qemu_plugin_type which it can query
 * for additional information including the list of translated
 * instructions. At this point the plugin can register further
 * callbacks to be triggered when the block or individual instruction
 * executes.
 */

and then you can have instruction or TB level callbacks:

/**
 * qemu_plugin_register_vcpu_tb_exec_cb() - register execution callback
 * @tb: the opaque qemu_plugin_tb handle for the translation
 * @cb: callback function
 * @flags: does the plugin read or write the CPU's registers?
 * @userdata: any plugin data to pass to the @cb?
 *
 * The @cb function is called every time a translated unit executes.
 */
void qemu_plugin_register_vcpu_tb_exec_cb(struct qemu_plugin_tb *tb,
                                          qemu_plugin_vcpu_udata_cb_t cb,
                                          enum qemu_plugin_cb_flags flags,
                                          void *userdata);

Or the inline equivalent.
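
The translate-then-execute flow above can be sketched with mocks. The `struct mock_tb` contents and the `mock_*`/`my_*` names are invented; only the shape — a translation-time hook that registers a per-TB execution callback, which the engine then fires on every execution of that block — follows the API quoted above:

```c
#include <assert.h>
#include <stdint.h>

/* Mock translation block: holds the one exec callback a plugin attached. */
struct mock_tb {
    void (*exec_cb)(void *userdata);
    void *userdata;
};

/* Shaped after qemu_plugin_register_vcpu_tb_exec_cb(). */
static void mock_register_tb_exec_cb(struct mock_tb *tb,
                                     void (*cb)(void *), void *ud)
{
    tb->exec_cb = cb;
    tb->userdata = ud;
}

/* Execution-time callback: bump a counter the plugin owns. */
static void tb_exec_count(void *ud)
{
    (*(uint64_t *)ud)++;
}

/* Translation-time hook: runs once per TB, decides what to instrument. */
static void my_tb_trans_cb(struct mock_tb *tb, void *ud)
{
    mock_register_tb_exec_cb(tb, tb_exec_count, ud);
}
```

The key property this illustrates is that translation happens once while execution happens many times, which is why hot-path work (the counter bump) is kept separate from the registration decision.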


> - the ability for a plugin to trigger a checkpoint to be taken


We don't have this at the moment. Pranith also mentioned it in his
review comments. I can see its use but I suspect it won't make the
initial implementation given the broader requirements of QEMU to do
checkpointing and how to cleanly expose that to plugins.

>
> -Aaron

--
Alex Bennée
Aaron Lindsay OS July 2, 2019, 2:07 p.m. UTC | #8
On Jul 01 16:00, Alex Bennée wrote:
> Aaron Lindsay OS <aaron@os.amperecomputing.com> writes:
> > - a way for a plugin to reset any instrumentation decisions made in the
> >   past (essentially calls `tb_flush(cpu);` under the covers). We found
> >   this critical for plugins which undergo state changes during the
> >   course of their execution (i.e. watch for event X, then go into a more
> >   detailed profiling mode until you see event Y)
>
> check:
>
> /**
>  * qemu_plugin_reset() - Reset a plugin
>  * @id: this plugin's opaque ID
>  * @cb: callback to be called once the plugin has been reset
>  *
>  * Unregisters all callbacks for the plugin given by @id.
>  *
>  * Do NOT assume that the plugin has been reset once this function returns.
>  * Plugins are reset asynchronously, and therefore the given plugin receives
>  * callbacks until @cb is called.
>  */
> void qemu_plugin_reset(qemu_plugin_id_t id, qemu_plugin_simple_cb_t cb);

Is this essentially synchronous for the current cpu, and only
asynchronous for any other running cpus that didn't trigger the callback
from which the call to qemu_plugin_reset() is being made? If not, could
the state resetting be made synchronous for the current cpu (even if the
callback doesn't happen until the others are complete)? This isn't
absolutely critical, but it is often nice to begin capturing precisely
when you mean to.

> > - the ability for a plugin to trigger a checkpoint to be taken
>
> We don't have this at the moment. Pranith also mentioned it in his
> review comments. I can see its use but I suspect it won't make the
> initial implementation given the broader requirements of QEMU to do
> checkpointing and how to cleanly expose that to plugins.

Sure. Our patch works for us, but I know we're ignoring a few things
that we can externally ensure won't happen while we're attempting a
checkpoint (i.e. migration) that may have to be considered for something
upstream.

-Aaron

Patch

diff --git a/accel/tcg/atomic_template.h b/accel/tcg/atomic_template.h
index 04c4c7b0d2..33ddfd498c 100644
--- a/accel/tcg/atomic_template.h
+++ b/accel/tcg/atomic_template.h
@@ -18,6 +18,7 @@ 
  * License along with this library; if not, see <http://www.gnu.org/licenses/>.
  */
 
+#include "qemu/plugin.h"
 #include "trace/mem.h"
 
 #if DATA_SIZE == 16
@@ -73,6 +74,8 @@  void atomic_trace_rmw_pre(CPUArchState *env, target_ulong addr, uint8_t info)
 static inline void atomic_trace_rmw_post(CPUArchState *env, target_ulong addr,
                                          void *haddr, uint8_t info)
 {
+    qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, haddr, info);
+    qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, haddr, info | TRACE_MEM_ST);
 }
 
 static inline
@@ -84,6 +87,7 @@  void atomic_trace_ld_pre(CPUArchState *env, target_ulong addr, uint8_t info)
 static inline void atomic_trace_ld_post(CPUArchState *env, target_ulong addr,
                                         void *haddr, uint8_t info)
 {
+    qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, haddr, info);
 }
 
 static inline
@@ -95,6 +99,7 @@  void atomic_trace_st_pre(CPUArchState *env, target_ulong addr, uint8_t info)
 static inline void atomic_trace_st_post(CPUArchState *env, target_ulong addr,
                                         void *haddr, uint8_t info)
 {
+    qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, haddr, info);
 }
 #endif /* ATOMIC_TEMPLATE_COMMON */
 
diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c
index 6c85c3ee1e..c21353e54f 100644
--- a/accel/tcg/cpu-exec.c
+++ b/accel/tcg/cpu-exec.c
@@ -272,6 +272,7 @@  void cpu_exec_step_atomic(CPUState *cpu)
             qemu_mutex_unlock_iothread();
         }
         assert_no_pages_locked();
+        qemu_plugin_disable_mem_helpers(cpu);
     }
 
     if (in_exclusive_region) {
@@ -705,6 +706,8 @@  int cpu_exec(CPUState *cpu)
         if (qemu_mutex_iothread_locked()) {
             qemu_mutex_unlock_iothread();
         }
+        qemu_plugin_disable_mem_helpers(cpu);
+
         assert_no_pages_locked();
     }
 
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 6a0dc438ff..b39c1f06f7 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -879,9 +879,18 @@  static void tlb_fill(CPUState *cpu, target_ulong addr, int size,
     assert(ok);
 }
 
+static inline void set_hostaddr(CPUArchState *env, TCGMemOp mo, void *haddr)
+{
+#ifdef CONFIG_PLUGIN
+    if (mo & MO_HADDR) {
+        env_tlb(env)->c.hostaddr = haddr;
+    }
+#endif
+}
+
 static uint64_t io_readx(CPUArchState *env, CPUIOTLBEntry *iotlbentry,
                          int mmu_idx, target_ulong addr, uintptr_t retaddr,
-                         MMUAccessType access_type, int size)
+                         TCGMemOp mo, MMUAccessType access_type, int size)
 {
     CPUState *cpu = env_cpu(env);
     hwaddr mr_offset;
@@ -891,6 +900,9 @@  static uint64_t io_readx(CPUArchState *env, CPUIOTLBEntry *iotlbentry,
     bool locked = false;
     MemTxResult r;
 
+    /* XXX Any sensible choice other than NULL? */
+    set_hostaddr(env, mo, NULL);
+
     section = iotlb_to_section(cpu, iotlbentry->addr, iotlbentry->attrs);
     mr = section->mr;
     mr_offset = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
@@ -925,7 +937,7 @@  static uint64_t io_readx(CPUArchState *env, CPUIOTLBEntry *iotlbentry,
 
 static void io_writex(CPUArchState *env, CPUIOTLBEntry *iotlbentry,
                       int mmu_idx, uint64_t val, target_ulong addr,
-                      uintptr_t retaddr, int size)
+                      uintptr_t retaddr, TCGMemOp mo, int size)
 {
     CPUState *cpu = env_cpu(env);
     hwaddr mr_offset;
@@ -934,6 +946,8 @@  static void io_writex(CPUArchState *env, CPUIOTLBEntry *iotlbentry,
     bool locked = false;
     MemTxResult r;
 
+    set_hostaddr(env, mo, NULL);
+
     section = iotlb_to_section(cpu, iotlbentry->addr, iotlbentry->attrs);
     mr = section->mr;
     mr_offset = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
@@ -1264,7 +1278,8 @@  load_helper(CPUArchState *env, target_ulong addr, TCGMemOpIdx oi,
         offsetof(CPUTLBEntry, addr_code) : offsetof(CPUTLBEntry, addr_read);
     const MMUAccessType access_type =
         code_read ? MMU_INST_FETCH : MMU_DATA_LOAD;
-    unsigned a_bits = get_alignment_bits(get_memop(oi));
+    TCGMemOp mo = get_memop(oi);
+    unsigned a_bits = get_alignment_bits(mo);
     void *haddr;
     uint64_t res;
 
@@ -1313,7 +1328,7 @@  load_helper(CPUArchState *env, target_ulong addr, TCGMemOpIdx oi,
         }
 
         res = io_readx(env, &env_tlb(env)->d[mmu_idx].iotlb[index],
-                       mmu_idx, addr, retaddr, access_type, size);
+                       mmu_idx, addr, retaddr, mo, access_type, size);
         return handle_bswap(res, size, big_endian);
     }
 
@@ -1331,6 +1346,12 @@  load_helper(CPUArchState *env, target_ulong addr, TCGMemOpIdx oi,
         r2 = full_load(env, addr2, oi, retaddr);
         shift = (addr & (size - 1)) * 8;
 
+        /*
+         * XXX cross-page accesses would have to be split into separate accesses
+         * for the host address to make sense. For now, just return NULL.
+         */
+        set_hostaddr(env, mo, NULL);
+
         if (big_endian) {
             /* Big-endian combine.  */
             res = (r1 << shift) | (r2 >> ((size * 8) - shift));
@@ -1343,6 +1364,7 @@  load_helper(CPUArchState *env, target_ulong addr, TCGMemOpIdx oi,
 
  do_aligned_access:
     haddr = (void *)((uintptr_t)addr + entry->addend);
+    set_hostaddr(env, mo, (void *)haddr);
     switch (size) {
     case 1:
         res = ldub_p(haddr);
@@ -1513,7 +1535,8 @@  store_helper(CPUArchState *env, target_ulong addr, uint64_t val,
     CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr);
     target_ulong tlb_addr = tlb_addr_write(entry);
     const size_t tlb_off = offsetof(CPUTLBEntry, addr_write);
-    unsigned a_bits = get_alignment_bits(get_memop(oi));
+    TCGMemOp mo = get_memop(oi);
+    unsigned a_bits = get_alignment_bits(mo);
     void *haddr;
 
     /* Handle CPU specific unaligned behaviour */
@@ -1562,7 +1585,7 @@  store_helper(CPUArchState *env, target_ulong addr, uint64_t val,
 
         io_writex(env, &env_tlb(env)->d[mmu_idx].iotlb[index], mmu_idx,
                   handle_bswap(val, size, big_endian),
-                  addr, retaddr, size);
+                  addr, retaddr, mo, size);
         return;
     }
 
@@ -1607,11 +1630,13 @@  store_helper(CPUArchState *env, target_ulong addr, uint64_t val,
             }
             helper_ret_stb_mmu(env, addr + i, val8, oi, retaddr);
         }
+        set_hostaddr(env, mo, NULL);
         return;
     }
 
  do_aligned_access:
     haddr = (void *)((uintptr_t)addr + entry->addend);
+    set_hostaddr(env, mo, (void *)haddr);
     switch (size) {
     case 1:
         stb_p(haddr, val);
diff --git a/accel/tcg/plugin-gen.c b/accel/tcg/plugin-gen.c
index 7994819fe6..9d9ec29765 100644
--- a/accel/tcg/plugin-gen.c
+++ b/accel/tcg/plugin-gen.c
@@ -95,8 +95,7 @@  static void do_gen_mem_cb(TCGv vaddr, uint8_t info, bool is_haddr)
     TCGv_ptr udata = tcg_const_ptr(NULL);
     TCGv_ptr haddr;
 
-    tcg_gen_ld_i32(cpu_index, cpu_env,
-                   -ENV_OFFSET + offsetof(CPUState, cpu_index));
+    tcg_gen_ld_i32(cpu_index, cpu_env, -offsetof(ArchCPU, env) + offsetof(CPUState, cpu_index));
     tcg_gen_extu_tl_i64(vaddr64, vaddr);
 
     if (is_haddr) {
@@ -106,7 +105,9 @@  static void do_gen_mem_cb(TCGv vaddr, uint8_t info, bool is_haddr)
          */
 #ifdef CONFIG_SOFTMMU
         haddr = tcg_temp_new_ptr();
-        tcg_gen_ld_ptr(haddr, cpu_env, offsetof(CPUArchState, hostaddr));
+        tcg_gen_ld_ptr(haddr, cpu_env,
+                       offsetof(ArchCPU, neg.tlb.c.hostaddr) -
+                       offsetof(ArchCPU, env));
 #else
         haddr = tcg_const_ptr(NULL);
 #endif
@@ -128,8 +129,8 @@  static void gen_empty_udata_cb(void)
     TCGv_i32 cpu_index = tcg_temp_new_i32();
     TCGv_ptr udata = tcg_const_ptr(NULL); /* will be overwritten later */
 
-    tcg_gen_ld_i32(cpu_index, cpu_env,
-                   -ENV_OFFSET + offsetof(CPUState, cpu_index));
+    tcg_gen_ld_i32(cpu_index, cpu_env, -offsetof(ArchCPU, env) +
+                   offsetof(CPUState, cpu_index));
     gen_helper_plugin_vcpu_udata_cb(cpu_index, udata);
 
     tcg_temp_free_ptr(udata);
@@ -172,8 +173,7 @@  static void gen_empty_mem_helper(void)
     TCGv_ptr ptr;
 
     ptr = tcg_const_ptr(NULL);
-    tcg_gen_st_ptr(ptr, cpu_env, -ENV_OFFSET + offsetof(CPUState,
-                                                        plugin_mem_cbs));
+    tcg_gen_st_ptr(ptr, cpu_env, offsetof(CPUState, plugin_mem_cbs));
     tcg_temp_free_ptr(ptr);
 }
 
@@ -784,8 +784,7 @@  void plugin_gen_disable_mem_helpers(void)
         return;
     }
     ptr = tcg_const_ptr(NULL);
-    tcg_gen_st_ptr(ptr, cpu_env, -ENV_OFFSET + offsetof(CPUState,
-                                                        plugin_mem_cbs));
+    tcg_gen_st_ptr(ptr, cpu_env, offsetof(CPUState, plugin_mem_cbs));
     tcg_temp_free_ptr(ptr);
     tcg_ctx->plugin_insn->mem_helper = false;
 }
diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index 9bc713a70b..354788385b 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -207,6 +207,14 @@  typedef struct CPUTLBCommon {
     size_t full_flush_count;
     size_t part_flush_count;
     size_t elide_flush_count;
+#ifdef CONFIG_PLUGIN
+    /*
+     * TODO: remove and calculate on the fly
+     *
+     * Stores the host address of a guest access
+     */
+    void *hostaddr;
+#endif
 } CPUTLBCommon;
 
 /*
@@ -215,6 +223,7 @@  typedef struct CPUTLBCommon {
  * Since this is placed within CPUNegativeOffsetState, the smallest
  * negative offsets are at the end of the struct.
  */
+
 typedef struct CPUTLB {
     CPUTLBCommon c;
     CPUTLBDesc d[NB_MMU_MODES];
diff --git a/include/exec/cpu_ldst.h b/include/exec/cpu_ldst.h
index a08b11bd2c..ac07556d25 100644
--- a/include/exec/cpu_ldst.h
+++ b/include/exec/cpu_ldst.h
@@ -85,6 +85,15 @@  typedef target_ulong abi_ptr;
 #define TARGET_ABI_FMT_ptr TARGET_ABI_FMT_lx
 #endif
 
+static inline void *read_hostaddr(CPUArchState *env)
+{
+#if defined(CONFIG_SOFTMMU) && defined(CONFIG_PLUGIN)
+    return env_tlb(env)->c.hostaddr;
+#else
+    return NULL;
+#endif
+}
+
 #if defined(CONFIG_USER_ONLY)
 
 extern __thread uintptr_t helper_retaddr;
diff --git a/include/exec/cpu_ldst_template.h b/include/exec/cpu_ldst_template.h
index af7e0b49f2..38df113676 100644
--- a/include/exec/cpu_ldst_template.h
+++ b/include/exec/cpu_ldst_template.h
@@ -28,6 +28,7 @@ 
 #include "trace-root.h"
 #endif
 
+#include "qemu/plugin.h"
 #include "trace/mem.h"
 
 #if DATA_SIZE == 8
@@ -86,11 +87,10 @@  glue(glue(glue(cpu_ld, USUFFIX), MEMSUFFIX), _ra)(CPUArchState *env,
     target_ulong addr;
     int mmu_idx;
     TCGMemOpIdx oi;
-
+    uintptr_t hostaddr;
 #if !defined(SOFTMMU_CODE_ACCESS)
-    trace_guest_mem_before_exec(
-        env_cpu(env), ptr,
-        trace_mem_build_info(SHIFT, false, MO_TE, false));
+    uint8_t meminfo = trace_mem_build_info(SHIFT, false, MO_TE, false);
+    trace_guest_mem_before_exec(env_cpu(env), ptr, meminfo);
 #endif
 
     addr = ptr;
@@ -101,10 +101,14 @@  glue(glue(glue(cpu_ld, USUFFIX), MEMSUFFIX), _ra)(CPUArchState *env,
         oi = make_memop_idx(SHIFT, mmu_idx);
         res = glue(glue(helper_ret_ld, URETSUFFIX), MMUSUFFIX)(env, addr,
                                                             oi, retaddr);
+        hostaddr = (uintptr_t)read_hostaddr(env);
     } else {
-        uintptr_t hostaddr = addr + entry->addend;
+        hostaddr = addr + entry->addend;
         res = glue(glue(ld, USUFFIX), _p)((uint8_t *)hostaddr);
     }
+#ifndef SOFTMMU_CODE_ACCESS
+    qemu_plugin_vcpu_mem_cb(env_cpu(env), ptr, (void *)hostaddr, meminfo);
+#endif
     return res;
 }
 
@@ -125,11 +129,10 @@  glue(glue(glue(cpu_lds, SUFFIX), MEMSUFFIX), _ra)(CPUArchState *env,
     target_ulong addr;
     int mmu_idx;
     TCGMemOpIdx oi;
-
+    uintptr_t hostaddr;
 #if !defined(SOFTMMU_CODE_ACCESS)
-    trace_guest_mem_before_exec(
-        env_cpu(env), ptr,
-        trace_mem_build_info(SHIFT, true, MO_TE, false));
+    uint8_t meminfo = trace_mem_build_info(SHIFT, true, MO_TE, false);
+    trace_guest_mem_before_exec(env_cpu(env), ptr, meminfo);
 #endif
 
     addr = ptr;
@@ -140,10 +143,14 @@  glue(glue(glue(cpu_lds, SUFFIX), MEMSUFFIX), _ra)(CPUArchState *env,
         oi = make_memop_idx(SHIFT, mmu_idx);
         res = (DATA_STYPE)glue(glue(helper_ret_ld, SRETSUFFIX),
                                MMUSUFFIX)(env, addr, oi, retaddr);
+        hostaddr = (uintptr_t)read_hostaddr(env);
     } else {
-        uintptr_t hostaddr = addr + entry->addend;
+        hostaddr = addr + entry->addend;
         res = glue(glue(lds, SUFFIX), _p)((uint8_t *)hostaddr);
     }
+#ifndef SOFTMMU_CODE_ACCESS
+    qemu_plugin_vcpu_mem_cb(env_cpu(env), ptr, (void *)hostaddr, meminfo);
+#endif
     return res;
 }
 
@@ -167,11 +174,10 @@  glue(glue(glue(cpu_st, SUFFIX), MEMSUFFIX), _ra)(CPUArchState *env,
     target_ulong addr;
     int mmu_idx;
     TCGMemOpIdx oi;
-
+    uintptr_t hostaddr;
 #if !defined(SOFTMMU_CODE_ACCESS)
-    trace_guest_mem_before_exec(
-        env_cpu(env), ptr,
-        trace_mem_build_info(SHIFT, false, MO_TE, true));
+    uint8_t meminfo = trace_mem_build_info(SHIFT, false, MO_TE, true);
+    trace_guest_mem_before_exec(env_cpu(env), ptr, meminfo);
 #endif
 
     addr = ptr;
@@ -182,10 +188,14 @@  glue(glue(glue(cpu_st, SUFFIX), MEMSUFFIX), _ra)(CPUArchState *env,
         oi = make_memop_idx(SHIFT, mmu_idx);
         glue(glue(helper_ret_st, SUFFIX), MMUSUFFIX)(env, addr, v, oi,
                                                      retaddr);
+        hostaddr = (uintptr_t)read_hostaddr(env);
     } else {
-        uintptr_t hostaddr = addr + entry->addend;
+        hostaddr = addr + entry->addend;
         glue(glue(st, SUFFIX), _p)((uint8_t *)hostaddr, v);
     }
+#ifndef SOFTMMU_CODE_ACCESS
+    qemu_plugin_vcpu_mem_cb(env_cpu(env), ptr, (void *)hostaddr, meminfo);
+#endif
 }
 
 static inline void
diff --git a/include/exec/cpu_ldst_useronly_template.h b/include/exec/cpu_ldst_useronly_template.h
index 42a95237f1..cc625a3da8 100644
--- a/include/exec/cpu_ldst_useronly_template.h
+++ b/include/exec/cpu_ldst_useronly_template.h
@@ -64,12 +64,18 @@ 
 static inline RES_TYPE
 glue(glue(cpu_ld, USUFFIX), MEMSUFFIX)(CPUArchState *env, abi_ptr ptr)
 {
+    RES_TYPE ret;
+#if !defined(CODE_ACCESS)
+    uint8_t meminfo = trace_mem_build_info(SHIFT, false, MO_TE, false);
+    trace_guest_mem_before_exec(env_cpu(env), ptr, meminfo);
+#endif
+
+    ret = glue(glue(ld, USUFFIX), _p)(g2h(ptr));
+
 #if !defined(CODE_ACCESS)
-    trace_guest_mem_before_exec(
-        env_cpu(env), ptr,
-        trace_mem_build_info(SHIFT, false, MO_TE, false));
+    qemu_plugin_vcpu_mem_cb(env_cpu(env), ptr, NULL, meminfo);
 #endif
-    return glue(glue(ld, USUFFIX), _p)(g2h(ptr));
+    return ret;
 }
 
 static inline RES_TYPE
@@ -88,12 +94,18 @@  glue(glue(glue(cpu_ld, USUFFIX), MEMSUFFIX), _ra)(CPUArchState *env,
 static inline int
 glue(glue(cpu_lds, SUFFIX), MEMSUFFIX)(CPUArchState *env, abi_ptr ptr)
 {
+    int ret;
+#if !defined(CODE_ACCESS)
+    uint8_t meminfo = trace_mem_build_info(SHIFT, true, MO_TE, false);
+    trace_guest_mem_before_exec(env_cpu(env), ptr, meminfo);
+#endif
+
+    ret = glue(glue(lds, SUFFIX), _p)(g2h(ptr));
+
 #if !defined(CODE_ACCESS)
-    trace_guest_mem_before_exec(
-        env_cpu(env), ptr,
-        trace_mem_build_info(SHIFT, true, MO_TE, false));
+    qemu_plugin_vcpu_mem_cb(env_cpu(env), ptr, NULL, meminfo);
 #endif
-    return glue(glue(lds, SUFFIX), _p)(g2h(ptr));
+    return ret;
 }
 
 static inline int
@@ -114,10 +126,10 @@  static inline void
 glue(glue(cpu_st, SUFFIX), MEMSUFFIX)(CPUArchState *env, abi_ptr ptr,
                                       RES_TYPE v)
 {
-    trace_guest_mem_before_exec(
-        env_cpu(env), ptr,
-        trace_mem_build_info(SHIFT, false, MO_TE, true));
+    uint8_t meminfo = trace_mem_build_info(SHIFT, false, MO_TE, true);
+    trace_guest_mem_before_exec(env_cpu(env), ptr, meminfo);
     glue(glue(st, SUFFIX), _p)(g2h(ptr), v);
+    qemu_plugin_vcpu_mem_cb(env_cpu(env), ptr, NULL, meminfo);
 }
 
 static inline void
diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index 6ddeebf4bc..8519fd0eb0 100644
--- a/tcg/i386/tcg-target.inc.c
+++ b/tcg/i386/tcg-target.inc.c
@@ -1775,6 +1775,14 @@  static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
     /* add addend(r0), r1 */
     tcg_out_modrm_offset(s, OPC_ADD_GvEv + hrexw, r1, r0,
                          offsetof(CPUTLBEntry, addend));
+
+#ifdef CONFIG_PLUGIN
+    if (opc & MO_HADDR) {
+        tcg_out_st(s, TCG_TYPE_PTR, r1, TCG_AREG0,
+                   offsetof(ArchCPU, neg.tlb.c.hostaddr) -
+                   offsetof(ArchCPU, env));
+    }
+#endif
 }
 
 /*
diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 587d092238..e8094e27d0 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -30,6 +30,7 @@ 
 #include "tcg-mo.h"
 #include "trace-tcg.h"
 #include "trace/mem.h"
+#include "exec/plugin-gen.h"
 
 /* Reduce the number of ifdefs below.  This assumes that all uses of
    TCGV_HIGH and TCGV_LOW are properly protected by a conditional that
@@ -2684,6 +2685,7 @@  void tcg_gen_exit_tb(TranslationBlock *tb, unsigned idx)
         tcg_debug_assert(idx == TB_EXIT_REQUESTED);
     }
 
+    plugin_gen_disable_mem_helpers();
     tcg_gen_op1i(INDEX_op_exit_tb, val);
 }
 
@@ -2696,6 +2698,7 @@  void tcg_gen_goto_tb(unsigned idx)
     tcg_debug_assert((tcg_ctx->goto_tb_issue_mask & (1 << idx)) == 0);
     tcg_ctx->goto_tb_issue_mask |= 1 << idx;
 #endif
+    plugin_gen_disable_mem_helpers();
     /* When not chaining, we simply fall through to the "fallback" exit.  */
     if (!qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
         tcg_gen_op1i(INDEX_op_goto_tb, idx);
@@ -2705,7 +2708,10 @@  void tcg_gen_goto_tb(unsigned idx)
 void tcg_gen_lookup_and_goto_ptr(void)
 {
     if (TCG_TARGET_HAS_goto_ptr && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
-        TCGv_ptr ptr = tcg_temp_new_ptr();
+        TCGv_ptr ptr;
+
+        plugin_gen_disable_mem_helpers();
+        ptr = tcg_temp_new_ptr();
         gen_helper_lookup_tb_ptr(ptr, cpu_env);
         tcg_gen_op1i(INDEX_op_goto_ptr, tcgv_ptr_arg(ptr));
         tcg_temp_free_ptr(ptr);
@@ -2788,14 +2794,24 @@  static void tcg_gen_req_mo(TCGBar type)
     }
 }
 
+static inline void plugin_gen_mem_callbacks(TCGv vaddr, uint8_t info)
+{
+#ifdef CONFIG_PLUGIN
+    if (tcg_ctx->plugin_insn == NULL) {
+        return;
+    }
+    plugin_gen_empty_mem_callback(vaddr, info);
+#endif
+}
+
 void tcg_gen_qemu_ld_i32(TCGv_i32 val, TCGv addr, TCGArg idx, TCGMemOp memop)
 {
     TCGMemOp orig_memop;
+    uint8_t info = trace_mem_get_info(memop, 0);
 
     tcg_gen_req_mo(TCG_MO_LD_LD | TCG_MO_ST_LD);
     memop = tcg_canonicalize_memop(memop, 0, 0);
-    trace_guest_mem_before_tcg(tcg_ctx->cpu, cpu_env,
-                               addr, trace_mem_get_info(memop, 0));
+    trace_guest_mem_before_tcg(tcg_ctx->cpu, cpu_env, addr, info);
 
     orig_memop = memop;
     if (!TCG_TARGET_HAS_MEMORY_BSWAP && (memop & MO_BSWAP)) {
@@ -2807,6 +2823,7 @@  void tcg_gen_qemu_ld_i32(TCGv_i32 val, TCGv addr, TCGArg idx, TCGMemOp memop)
     }
 
     gen_ldst_i32(INDEX_op_qemu_ld_i32, val, addr, memop, idx);
+    plugin_gen_mem_callbacks(addr, info);
 
     if ((orig_memop ^ memop) & MO_BSWAP) {
         switch (orig_memop & MO_SIZE) {
@@ -2828,11 +2845,11 @@  void tcg_gen_qemu_ld_i32(TCGv_i32 val, TCGv addr, TCGArg idx, TCGMemOp memop)
 void tcg_gen_qemu_st_i32(TCGv_i32 val, TCGv addr, TCGArg idx, TCGMemOp memop)
 {
     TCGv_i32 swap = NULL;
+    uint8_t info = trace_mem_get_info(memop, 1);
 
     tcg_gen_req_mo(TCG_MO_LD_ST | TCG_MO_ST_ST);
     memop = tcg_canonicalize_memop(memop, 0, 1);
-    trace_guest_mem_before_tcg(tcg_ctx->cpu, cpu_env,
-                               addr, trace_mem_get_info(memop, 1));
+    trace_guest_mem_before_tcg(tcg_ctx->cpu, cpu_env, addr, info);
 
     if (!TCG_TARGET_HAS_MEMORY_BSWAP && (memop & MO_BSWAP)) {
         swap = tcg_temp_new_i32();
@@ -2852,6 +2869,7 @@  void tcg_gen_qemu_st_i32(TCGv_i32 val, TCGv addr, TCGArg idx, TCGMemOp memop)
     }
 
     gen_ldst_i32(INDEX_op_qemu_st_i32, val, addr, memop, idx);
+    plugin_gen_mem_callbacks(addr, info);
 
     if (swap) {
         tcg_temp_free_i32(swap);
@@ -2861,6 +2879,7 @@  void tcg_gen_qemu_st_i32(TCGv_i32 val, TCGv addr, TCGArg idx, TCGMemOp memop)
 void tcg_gen_qemu_ld_i64(TCGv_i64 val, TCGv addr, TCGArg idx, TCGMemOp memop)
 {
     TCGMemOp orig_memop;
+    uint8_t info;
 
     if (TCG_TARGET_REG_BITS == 32 && (memop & MO_SIZE) < MO_64) {
         tcg_gen_qemu_ld_i32(TCGV_LOW(val), addr, idx, memop);
@@ -2874,8 +2893,8 @@  void tcg_gen_qemu_ld_i64(TCGv_i64 val, TCGv addr, TCGArg idx, TCGMemOp memop)
 
     tcg_gen_req_mo(TCG_MO_LD_LD | TCG_MO_ST_LD);
     memop = tcg_canonicalize_memop(memop, 1, 0);
-    trace_guest_mem_before_tcg(tcg_ctx->cpu, cpu_env,
-                               addr, trace_mem_get_info(memop, 0));
+    info = trace_mem_get_info(memop, 0);
+    trace_guest_mem_before_tcg(tcg_ctx->cpu, cpu_env, addr, info);
 
     orig_memop = memop;
     if (!TCG_TARGET_HAS_MEMORY_BSWAP && (memop & MO_BSWAP)) {
@@ -2887,6 +2906,7 @@  void tcg_gen_qemu_ld_i64(TCGv_i64 val, TCGv addr, TCGArg idx, TCGMemOp memop)
     }
 
     gen_ldst_i64(INDEX_op_qemu_ld_i64, val, addr, memop, idx);
+    plugin_gen_mem_callbacks(addr, info);
 
     if ((orig_memop ^ memop) & MO_BSWAP) {
         switch (orig_memop & MO_SIZE) {
@@ -2914,6 +2934,7 @@  void tcg_gen_qemu_ld_i64(TCGv_i64 val, TCGv addr, TCGArg idx, TCGMemOp memop)
 void tcg_gen_qemu_st_i64(TCGv_i64 val, TCGv addr, TCGArg idx, TCGMemOp memop)
 {
     TCGv_i64 swap = NULL;
+    uint8_t info;
 
     if (TCG_TARGET_REG_BITS == 32 && (memop & MO_SIZE) < MO_64) {
         tcg_gen_qemu_st_i32(TCGV_LOW(val), addr, idx, memop);
@@ -2922,8 +2943,8 @@  void tcg_gen_qemu_st_i64(TCGv_i64 val, TCGv addr, TCGArg idx, TCGMemOp memop)
 
     tcg_gen_req_mo(TCG_MO_LD_ST | TCG_MO_ST_ST);
     memop = tcg_canonicalize_memop(memop, 1, 1);
-    trace_guest_mem_before_tcg(tcg_ctx->cpu, cpu_env,
-                               addr, trace_mem_get_info(memop, 1));
+    info = trace_mem_get_info(memop, 1);
+    trace_guest_mem_before_tcg(tcg_ctx->cpu, cpu_env, addr, info);
 
     if (!TCG_TARGET_HAS_MEMORY_BSWAP && (memop & MO_BSWAP)) {
         swap = tcg_temp_new_i64();
@@ -2947,6 +2968,7 @@  void tcg_gen_qemu_st_i64(TCGv_i64 val, TCGv addr, TCGArg idx, TCGMemOp memop)
     }
 
     gen_ldst_i64(INDEX_op_qemu_st_i64, val, addr, memop, idx);
+    plugin_gen_mem_callbacks(addr, info);
 
     if (swap) {
         tcg_temp_free_i64(swap);
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 966e89104d..0e86e18ccb 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -28,6 +28,7 @@ 
 #include "cpu.h"
 #include "exec/tb-context.h"
 #include "qemu/bitops.h"
+#include "qemu/plugin.h"
 #include "qemu/queue.h"
 #include "tcg-mo.h"
 #include "tcg-target.h"