[RFC,v3,04/19] docs: new design document multi-thread-tcg.txt (DRAFTING)

Message ID	1464986428-6739-5-git-send-email-alex.bennee@linaro.org
State	New
Headers
Delivered-To: patch@linaro.org
Received-SPF: pass (google.com: domain of qemu-devel-bounces+patch=linaro.org@nongnu.org designates 2001:4830:134:3::11 as permitted sender) client-ip=2001:4830:134:3::11;
From: =?UTF-8?q?Alex=20Benn=C3=A9e?= <alex.bennee@linaro.org>
To: mttcg@listserver.greensocs.com, qemu-devel@nongnu.org, fred.konrad@greensocs.com, a.rigo@virtualopensystems.com, serge.fdrv@gmail.com, cota@braap.org, bobby.prani@gmail.com
Date: Fri, 3 Jun 2016 21:40:13 +0100
Message-Id: <1464986428-6739-5-git-send-email-alex.bennee@linaro.org>
In-Reply-To: <1464986428-6739-1-git-send-email-alex.bennee@linaro.org>
References: <1464986428-6739-1-git-send-email-alex.bennee@linaro.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Subject: [Qemu-devel] [RFC v3 04/19] docs: new design document multi-thread-tcg.txt (DRAFTING)
Precedence: list
Cc: peter.maydell@linaro.org, claudio.fontana@huawei.com, mark.burton@greensocs.com, jan.kiszka@siemens.com, pbonzini@redhat.com, =?UTF-8?q?Alex=20Benn=C3=A9e?= <alex.bennee@linaro.org>, rth@twiddle.net
Errors-To: qemu-devel-bounces+patch=linaro.org@nongnu.org
Sender: "Qemu-devel" <qemu-devel-bounces+patch=linaro.org@nongnu.org>

diff --git a/docs/multi-thread-tcg.txt b/docs/multi-thread-tcg.txt
new file mode 100644
index 0000000..5c88c99
--- /dev/null
+++ b/docs/multi-thread-tcg.txt
@@ -0,0 +1,225 @@
+Copyright (c) 2015 Linaro Ltd.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later. See
+the COPYING file in the top-level directory.
+
+STATUS: DRAFTING
+
+Introduction
+============
+
+This document outlines the design for multi-threaded TCG system-mode
+emulation. The current user-mode emulation mirrors the thread
+structure of the translated executable.
+
+The original system-mode TCG implementation was single threaded and
+dealt with multiple CPUs with simple round-robin scheduling. This
+simplified a lot of things but became increasingly limited as systems
+being emulated gained additional cores and per-core performance gains
+for host systems started to level off.
+
+vCPU Scheduling
+===============
+
+We introduce a new running mode where each vCPU will run on its own
+user-space thread. This will be enabled by default for all
+FE/BE combinations that have had the required work done to support
+this safely.
+
+In the general case of running translated code there should be no
+inter-vCPU dependencies and all vCPUs should be able to run at full
+speed. Synchronisation will only be required while accessing internal
+shared data structures or when the emulated architecture requires a
+coherent representation of the emulated machine state.
+
+Shared Data Structures
+======================
+
+Main Run Loop
+-------------
+
+Even when there is no code being generated there are a number of
+structures associated with the hot-path through the main run-loop.
+These are associated with looking up the next translation block to
+execute.
+These include:
+
+  tb_jmp_cache (per-vCPU, cache of recent jumps)
+  tb_phys_hash (global, phys address->tb lookup)
+
+As TB linking only occurs when blocks are in the same page this code
+is critical to performance as looking up the next TB to execute is the
+most common reason to exit the generated code.
+
+DESIGN REQUIREMENT: Make access to lookup structures safe with
+multiple reader/writer threads. Minimise any lock contention to do it.
+
+Global TCG State
+----------------
+
+We need to protect the entire code generation cycle including any
+post-generation patching of the translated code. This also implies a shared
+translation buffer which contains code running on all cores. Any
+execution path that comes to the main run loop will need to hold a
+mutex for code generation. This also includes times when we need to flush
+code or entries from any shared lookups/caches. Structures held on a
+per-vCPU basis won't need locking unless other vCPUs will need to
+modify them.
+
+DESIGN REQUIREMENT: Add locking around all code generation and TB
+patching. If possible, make shared lookups/caches able to handle multiple
+readers without locks; otherwise protect them with locks as well.
+
+Translation Blocks
+------------------
+
+Currently the whole system shares a single code generation buffer
+which when full will force a flush of all translations and start from
+scratch again.
+
+Once a basic block has been translated it will continue to be used
+until it is invalidated. These invalidation events are typically due
+to a change to the state of a physical page:
+  - code modification (self-modifying code, patching code)
+  - page changes (new mapping to physical page)
+  - debugging operations (breakpoint insertion/removal)
+
+There are several places where references to TBs exist which need to be
+cleared in a safe way.
+
+The main reference is a global page table (l1_map) which provides a 2
+level look-up for PageDesc structures which contain pointers to the
+start of a linked list of all Translation Blocks in that page (see
+page_next).
+
+When a block is invalidated any blocks which directly jump to it need
+to have those jumps removed. This requires navigating the tb_jump_list
+linked list as well as patching the jump code in a safe way.
+
+Finally there are a number of look-up mechanisms for accelerating
+lookup of the next TB. These caches and hash tables need to have
+references removed in a safe way.
+
+DESIGN REQUIREMENT: Safely handle invalidation of TBs
+  - safely patch direct jumps
+  - remove central PageDesc lookup entries
+  - ensure lookup caches/hashes are safely updated
+
+Memory maps and TLBs
+--------------------
+
+The memory handling code is fairly critical to the speed of memory
+access in the emulated system. The SoftMMU code is designed so the
+hot-path can be handled entirely within translated code. This is
+handled with a per-vCPU TLB structure which once populated will allow
+a series of accesses to the page to occur without exiting the
+translated code. It is possible to set flags in the TLB address which
+will ensure the slow-path is taken for each access. This can be done
+to support:
+
+  - Memory regions (dividing up access to PIO, MMIO and RAM)
+  - Dirty page tracking (for code gen, migration and display)
+  - Virtual TLB (for translating guest address->real address)
+
+When the TLB tables are updated we need to ensure the updates are done
+in a safe way by bringing all executing threads to a halt before making
+the modifications.
+
+DESIGN REQUIREMENTS:
+
+  - TLB Flush All/Page
+    - can be across-CPUs
+    - will need all other CPUs brought to a halt
+  - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
+    - This is a per-CPU table - by definition can't race
+    - updated by its own thread when the slow-path is forced
+
+Emulated hardware state
+-----------------------
+
+Currently thanks to KVM work any access to IO memory is automatically
+protected by the global iothread mutex. Any IO region that doesn't use
+the global mutex is expected to do its own locking.
+
+Memory Consistency
+==================
+
+Between emulated guests and host systems there is a range of memory
+consistency models. Even emulating weakly ordered systems on strongly
+ordered hosts needs to ensure things like store-after-load re-ordering
+can be prevented when the guest wants to.
+
+Memory Barriers
+---------------
+
+Barriers (sometimes known as fences) provide a mechanism for software
+to enforce a particular ordering of memory operations from the point
+of view of external observers (e.g. another processor core). They can
+apply to any memory operations as well as just loads or stores.
+
+The Linux kernel has an excellent write-up on the various forms of
+memory barrier and the guarantees they can provide [1].
+
+Barriers are often wrapped around synchronisation primitives to
+provide explicit memory ordering semantics. However they can be used
+by themselves to provide safe lockless access by ensuring, for example,
+a signal flag will always be set after a payload.
+
+DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
+
+This would enforce a strong load/store ordering so all loads/stores
+complete at the memory barrier. On single-core non-SMP strongly
+ordered backends this could become a NOP.
+
+There may be a case for further refinement if this causes performance
+bottlenecks.
+
+Memory Control and Maintenance
+------------------------------
+
+This includes a class of instructions for controlling system cache
+behaviour. While QEMU doesn't model cache behaviour these instructions
+are often seen when code modification has taken place to ensure the
+changes take effect.
+
+Synchronisation Primitives
+--------------------------
+
+There are two broad types of synchronisation primitives found in
+modern ISAs: atomic instructions and exclusive regions.
+
+The first type offers a simple atomic instruction which will guarantee
+some sort of test and conditional store will be truly atomic w.r.t.
+other cores sharing access to the memory. The classic example is the
+x86 cmpxchg instruction.
+
+The second type offers a pair of load/store instructions which
+guarantee that a region of memory has not been touched between the
+load and store instructions. An example of this is ARM's ldrex/strex
+pair where the strex instruction will return a flag indicating a
+successful store only if no other CPU has accessed the memory region
+since the ldrex.
+
+Traditionally TCG has generated a series of operations that work
+because they are within the context of a single translation block so
+will have completed before another CPU is scheduled. However with
+the ability to have multiple threads running to emulate multiple CPUs
+we will need to explicitly expose these semantics.
+
+DESIGN REQUIREMENTS:
+  - atomics
+    - Introduce some atomic TCG ops for the common semantics
+    - The default fallback helper function will use qemu_atomics
+    - Each backend can then add a more efficient implementation
+  - load/store exclusive
+    [AJB:
+        There are currently a number of proposals of interest:
+     - Greensocs tweaks to ldst ex (using locks)
+     - Slow-path for atomic instruction translation [2]
+     - Helper-based Atomic Instruction Emulation (AIE) [3]
+    ]
+
+==========
+
+[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt
+[2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561
+[3] http://thread.gmane.org/gmane.comp.emulators.qemu/335297
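
As a rough illustration of the load/store exclusive requirement in the
document above, the following C sketch emulates an ldrex/strex-style pair
on top of a host compare-and-swap from C11 <stdatomic.h>. It is not taken
from the patch and is not one of the proposals listed in the [AJB] note;
ExclusiveMonitor and the demo_* functions are hypothetical names invented
for this example. The approach is also weaker than a real exclusive
monitor: the store succeeds if another thread changed the value and then
restored it between the load and the store.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-vCPU monitor state; not a QEMU structure. */
typedef struct {
    _Atomic uint32_t *addr;  /* location marked by the last load-exclusive */
    uint32_t value;          /* value observed by that load */
    bool armed;              /* true between a load and the matching store */
} ExclusiveMonitor;

/* Load-exclusive: remember which address we read and what we saw. */
static uint32_t demo_load_exclusive(ExclusiveMonitor *mon,
                                    _Atomic uint32_t *addr)
{
    mon->addr = addr;
    mon->value = atomic_load(addr);
    mon->armed = true;
    return mon->value;
}

/*
 * Store-exclusive: returns 0 on success and 1 on failure, mirroring the
 * strex status flag. The store only happens if memory still holds the
 * value seen by the matching load (assumption: value equality stands in
 * for "no other CPU touched the region").
 */
static int demo_store_exclusive(ExclusiveMonitor *mon,
                                _Atomic uint32_t *addr, uint32_t newval)
{
    uint32_t expected = mon->value;
    bool ok = mon->armed && mon->addr == addr &&
              atomic_compare_exchange_strong(addr, &expected, newval);

    mon->armed = false;      /* the monitor is cleared either way */
    return ok ? 0 : 1;
}

/* Guest-style atomic increment: retry until the exclusive store lands. */
static uint32_t demo_atomic_inc(ExclusiveMonitor *mon, _Atomic uint32_t *addr)
{
    uint32_t old;

    do {
        old = demo_load_exclusive(mon, addr);
    } while (demo_store_exclusive(mon, addr, old + 1) != 0);
    return old + 1;
}
```

Each vCPU thread would own one ExclusiveMonitor, so the monitor itself
needs no locking; only the guest memory location is accessed atomically.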
