
[v4] kprobes: arm: enable OPTPROBES for ARM 32

Message ID 1407819388-52145-1-git-send-email-wangnan0@huawei.com
State New

Commit Message

Wang Nan Aug. 12, 2014, 4:56 a.m. UTC
This patch introduces kprobe jump optimization (kprobeopt) for 32-bit ARM.

Limitations:
 - Currently only kernels compiled with the ARM ISA are supported.

 - The offset between the probe point and the optinsn slot must not be
   larger than 32MiB. Masami Hiramatsu suggested replacing 2 words
   instead, but that would make things more complex; a further patch can
   implement that optimization.

Kprobe opt on ARM is relatively simpler than kprobe opt on x86 because
an ARM instruction is always 4 bytes long and 4-byte aligned. This patch
replaces the probed instruction with a 'b' branch to trampoline code,
which then calls optimized_callback(). optimized_callback() calls
opt_pre_handler() to execute the kprobe handler, and also
emulates/simulates the replaced instruction.

When unregistering a kprobe, the deferred unoptimizer may leave the
branch instruction in place before the optimizer is called. Different
from x86_64, which copies the probed instructions after
optprobe_template_end and re-executes them, this patch calls singlestep
to emulate/simulate the instruction directly. A further patch can
optimize this behavior.

v1 -> v2:

 - Improvement: if the replaced instruction is conditional, generate a
   conditional branch instruction for it;

 - Introduces RELATIVEJUMP_OPCODES because ARM kprobe_opcode_t is 4
   bytes;

 - Removes size field in struct arch_optimized_insn;

 - Use arm_gen_branch() to generate branch instruction;

 - Remove all recovery logic: ARM doesn't use a tail buffer, so there is
   no need to recover replaced instructions as on x86;

 - Remove incorrect CONFIG_THUMB checking;

 - can_optimize() always returns true if address is well aligned;

 - Improve optimized_callback: using opt_pre_handler();

 - Bugfix: correct range checking code and improve comments;

 - Fix commit message.

v2 -> v3:

 - Rename RELATIVEJUMP_OPCODES to MAX_COPIED_INSNS;

 - Remove unneeded checking:
      arch_check_optimized_kprobe(), can_optimize();

 - Add missing flush_icache_range() in arch_prepare_optimized_kprobe();

 - Remove unneeded 'return;'.

v3 -> v4:

 - Use __mem_to_opcode_arm() to translate copied_insn to ensure it
   works on big endian kernels;

 - Replace the 'nop' placeholder in the trampoline code template with
   '.long 0' to avoid confusion: a reader may regard 'nop' as an
   instruction, but it is in fact a value.

Signed-off-by: Wang Nan <wangnan0@huawei.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Jon Medhurst (Tixy) <tixy@linaro.org>
Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
---
 arch/arm/Kconfig               |   1 +
 arch/arm/include/asm/kprobes.h |  26 +++++
 arch/arm/kernel/Makefile       |   3 +-
 arch/arm/kernel/kprobes-opt.c  | 255 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 284 insertions(+), 1 deletion(-)
 create mode 100644 arch/arm/kernel/kprobes-opt.c

Comments

Wang Nan Aug. 12, 2014, 1:03 p.m. UTC | #1
Hi Masami and everyone,

When checking my code I found a problem: if we replace a stack-operating instruction,
it is possible that the emulated execution of such an instruction destroys the stack
used by kprobeopt:

> +
> +asm (
> +			".global optprobe_template_entry\n"
> +			"optprobe_template_entry:\n"
> +			"	sub	sp, sp, #80\n"
> +			"	stmia	sp, {r0 - r14} \n"

Here, the trampoline code subtracts 80 (0x50, a number I chose without much thought) from sp,
and then uses stmia to push r0 - r14 (all registers except pc) onto the stack. Assuming the
original sp is 0xd0000050, the stack becomes:

0xd0000000: r0
0xd0000004: r1
0xd0000008: r2
...
0xd0000038: r14
0xd000003c: r15 (place holder)
0xd0000040: cpsr (place holder)
0xd0000044: ?
0xd0000048: ?
0xd000004c: ?
0xd0000050: original stack

If the replaced code operates on the stack, for example push {r0 - r10}, it will overwrite our saved
registers. For that reason, sub sp, #80 is not enough; we need at least 64 bytes of stack space, so the
first instruction here should be sub sp, #128.

However, it increases the stack requirement. Moreover, although rare, there may be sp-relative
addressing, such as: str r1, [sp, #-132].

To make every situation safe, do you think we need to allocate a per-cpu optprobe private stack?

For example:

str sp, [pc, #??]     (store original sp first)
ldr sp, [pc, #??]     (load per-cpu stack)
sub sp, #68
stmia	sp, {r0 - r12}
...                   (fix sp and pc in stack)
ldmia   sp, {r0 - r15}
optprobe_template_sp:
1: .long 0          (placeholder for saved sp)
optprobe_template_private_stack:
2: .long 0          (placeholder for per-cpu private stack)
optprobe_template_pc:
3: .long 0          (placeholder for pc)

> +			"	add	r3, sp, #80\n"
> +			"	str	r3, [sp, #52]\n"
> +			"	mrs	r4, cpsr\n"
> +			"	str	r4, [sp, #64]\n"
> +			"	mov	r1, sp\n"
> +			"	ldr	r0, 1f\n"
> +			"	ldr	r2, 2f\n"
> +			"	blx	r2\n"
> +			"	ldr	r1, [sp, #64]\n"
> +			"	msr	cpsr_fs, r1\n"
> +			"	ldmia	sp, {r0 - r15}\n"
> +			".global optprobe_template_val\n"
> +			"optprobe_template_val:\n"
> +			"1:	.long 0\n"
> +			".global optprobe_template_call\n"
> +			"optprobe_template_call:\n"
> +			"2:	.long 0\n"
> +			".global optprobe_template_end\n"
> +			"optprobe_template_end:\n");
> +


Masami Hiramatsu Aug. 12, 2014, 3:12 p.m. UTC | #2
(2014/08/12 22:03), Wang Nan wrote:
> Hi Masami and everyone,
> 
> When checking my code I found a problem: if we replace a stack-operating instruction,
> it is possible that the emulated execution of such an instruction destroys the stack
> used by kprobeopt:
> 
>> +
>> +asm (
>> +			".global optprobe_template_entry\n"
>> +			"optprobe_template_entry:\n"
>> +			"	sub	sp, sp, #80\n"
>> +			"	stmia	sp, {r0 - r14} \n"
> 
> Here, the trampoline code subtracts 80 (0x50, a number I chose without much thought) from sp,
> and then uses stmia to push r0 - r14 (all registers except pc) onto the stack. Assuming the
> original sp is 0xd0000050, the stack becomes:
> 
> 0xd0000000: r0
> 0xd0000004: r1
> 0xd0000008: r2
> ...
> 0xd0000038: r14
> 0xd000003c: r15 (place holder)
> 0xd0000040: cpsr (place holder)
> 0xd0000044: ?
> 0xd0000048: ?
> 0xd000004c: ?
> 0xd0000050: original stack
> 
> If the replaced code operates on the stack, for example push {r0 - r10}, it will overwrite our saved
> registers. For that reason, sub sp, #80 is not enough; we need at least 64 bytes of stack space, so the
> first instruction here should be sub sp, #128.
> 
> However, it increases the stack requirement. Moreover, although rare, there may be sp-relative
> addressing, such as: str r1, [sp, #-132].

Hmm, I see that increasing the stack is clearly hard to emulate, but
why is it hard to emulate an sp-relative instruction? It should just
access the memory under the stack pointer.

> To make every situation safe, do you think we need to allocate a per-cpu optprobe private stack?

Of course, that is one possible idea, but the simplest way is just not to
optimize such instructions. Why not have can_optimize() check that? ;)

Thank you,
Masami Hiramatsu Aug. 15, 2014, 3:23 p.m. UTC | #3
(2014/08/12 13:56), Wang Nan wrote:
> +/* Caller must ensure addr & 3 == 0 */
> +static int can_optimize(unsigned long paddr)
> +{
> +	return 1;
> +}

As we have discussed on another thread, we'd better filter out all stack-pushing
instructions here, since (as you said) they will corrupt pt_regs on the stack.

Thank you,
Wang Nan Aug. 16, 2014, 1:38 a.m. UTC | #4
On 2014/8/15 23:23, Masami Hiramatsu wrote:
> (2014/08/12 13:56), Wang Nan wrote:
>> +/* Caller must ensure addr & 3 == 0 */
>> +static int can_optimize(unsigned long paddr)
>> +{
>> +	return 1;
>> +}
> 
> As we have discussed on another thread, we'd better filter out all stack-pushing
> instructions here, since (as you said) they will corrupt pt_regs on the stack.
> 
> Thank you,
> 

So we need to identify the replaced instruction. I think some improvement to the
arm instruction decoder is required; otherwise we have to implement another (although
simpler) decoder for memory-accessing instructions.

In the situation we are talking about, we need the decoder to identify the addressing
information for str/stm instructions. However, the decoder can bring up more information,
such as instruction type, source/destination registers, memory access pattern...
With such information, we can further optimize our trampoline code.
For example: don't protect destination registers, and for some (most, I think) instructions,
we can execute them directly, like x86_64 does.

What do you think?


Masami Hiramatsu Aug. 16, 2014, 2:44 a.m. UTC | #5
(2014/08/16 10:38), Wang Nan wrote:
> On 2014/8/15 23:23, Masami Hiramatsu wrote:
>> (2014/08/12 13:56), Wang Nan wrote:
>>> +/* Caller must ensure addr & 3 == 0 */
>>> +static int can_optimize(unsigned long paddr)
>>> +{
>>> +	return 1;
>>> +}
>>
>> As we have discussed on another thread, we'd better filter out all stack-pushing
>> instructions here, since (as you said) they will corrupt pt_regs on the stack.
>>
>> Thank you,
>>
> 
> So we need to identify the replaced instruction. I think some improvement to the
> arm instruction decoder is required; otherwise we have to implement another (although
> simpler) decoder for memory-accessing instructions.

Since arm32 already has an instruction emulator, I guess it's not so hard; we can
start by using the emulator code to find which instructions will change sp.

> In the situation we are talking about, we need the decoder to identify the addressing
> information for str/stm instructions.

No, the sp register must always be the top of the stack, or the code is just broken (it
breaks the stack frame). So we need to identify the stm/str instructions whose destination
is the sp register.

> However, the decoder can bring up more information, such as
> instruction type, source/destination registers, memory access pattern...
> With such information, we can further optimize our trampoline code.
> For example: don't protect destination registers, and for some (most, I think) instructions,
> we can execute them directly, like x86_64 does.

Yeah, direct execution may reduce the overhead a lot :). But anyway, since we need pt_regs,
we have to store all registers just as pt_regs does.

Thank you,

Patch

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 290f02ee..2106918 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -57,6 +57,7 @@  config ARM
 	select HAVE_MEMBLOCK
 	select HAVE_MOD_ARCH_SPECIFIC if ARM_UNWIND
 	select HAVE_OPROFILE if (HAVE_PERF_EVENTS)
+	select HAVE_OPTPROBES if (!THUMB2_KERNEL)
 	select HAVE_PERF_EVENTS
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
diff --git a/arch/arm/include/asm/kprobes.h b/arch/arm/include/asm/kprobes.h
index 49fa0df..a05297f 100644
--- a/arch/arm/include/asm/kprobes.h
+++ b/arch/arm/include/asm/kprobes.h
@@ -51,5 +51,31 @@  int kprobe_fault_handler(struct pt_regs *regs, unsigned int fsr);
 int kprobe_exceptions_notify(struct notifier_block *self,
 			     unsigned long val, void *data);
 
+/* optinsn template addresses */
+extern __visible kprobe_opcode_t optprobe_template_entry;
+extern __visible kprobe_opcode_t optprobe_template_val;
+extern __visible kprobe_opcode_t optprobe_template_call;
+extern __visible kprobe_opcode_t optprobe_template_end;
+
+#define MAX_OPTIMIZED_LENGTH	(4)
+#define MAX_OPTINSN_SIZE				\
+	(((unsigned long)&optprobe_template_end -	\
+	  (unsigned long)&optprobe_template_entry))
+#define RELATIVEJUMP_SIZE	(4)
+
+struct arch_optimized_insn {
+	/*
+	 * copy of the original instructions.
+	 * Different from x86, ARM kprobe_opcode_t is u32.
+	 */
+#define MAX_COPIED_INSN	((RELATIVEJUMP_SIZE) / sizeof(kprobe_opcode_t))
+	kprobe_opcode_t copied_insn[MAX_COPIED_INSN];
+	/* detour code buffer */
+	kprobe_opcode_t *insn;
+	/*
+	 *  we always copy one instruction on arm32;
+	 *  the size is always 4, so there is no size field.
+	 */
+};
 
 #endif /* _ARM_KPROBES_H */
diff --git a/arch/arm/kernel/Makefile b/arch/arm/kernel/Makefile
index 38ddd9f..6a38ec1 100644
--- a/arch/arm/kernel/Makefile
+++ b/arch/arm/kernel/Makefile
@@ -52,11 +52,12 @@  obj-$(CONFIG_FUNCTION_GRAPH_TRACER)	+= ftrace.o insn.o
 obj-$(CONFIG_JUMP_LABEL)	+= jump_label.o insn.o patch.o
 obj-$(CONFIG_KEXEC)		+= machine_kexec.o relocate_kernel.o
 obj-$(CONFIG_UPROBES)		+= probes.o probes-arm.o uprobes.o uprobes-arm.o
-obj-$(CONFIG_KPROBES)		+= probes.o kprobes.o kprobes-common.o patch.o
+obj-$(CONFIG_KPROBES)		+= probes.o kprobes.o kprobes-common.o patch.o insn.o
 ifdef CONFIG_THUMB2_KERNEL
 obj-$(CONFIG_KPROBES)		+= kprobes-thumb.o probes-thumb.o
 else
 obj-$(CONFIG_KPROBES)		+= kprobes-arm.o probes-arm.o
+obj-$(CONFIG_OPTPROBES)		+= kprobes-opt.o
 endif
 obj-$(CONFIG_ARM_KPROBES_TEST)	+= test-kprobes.o
 test-kprobes-objs		:= kprobes-test.o
diff --git a/arch/arm/kernel/kprobes-opt.c b/arch/arm/kernel/kprobes-opt.c
new file mode 100644
index 0000000..52330fb
--- /dev/null
+++ b/arch/arm/kernel/kprobes-opt.c
@@ -0,0 +1,255 @@ 
+/*
+ *  Kernel Probes Jump Optimization (Optprobes)
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2002, 2004
+ * Copyright (C) Hitachi Ltd., 2012
+ * Copyright (C) Huawei Inc., 2014
+ */
+
+#include <linux/kprobes.h>
+#include <linux/jump_label.h>
+#include <asm/kprobes.h>
+#include <asm/cacheflush.h>
+/* for arm_gen_branch */
+#include "insn.h"
+/* for patch_text */
+#include "patch.h"
+
+asm (
+			".global optprobe_template_entry\n"
+			"optprobe_template_entry:\n"
+			"	sub	sp, sp, #80\n"
+			"	stmia	sp, {r0 - r14} \n"
+			"	add	r3, sp, #80\n"
+			"	str	r3, [sp, #52]\n"
+			"	mrs	r4, cpsr\n"
+			"	str	r4, [sp, #64]\n"
+			"	mov	r1, sp\n"
+			"	ldr	r0, 1f\n"
+			"	ldr	r2, 2f\n"
+			"	blx	r2\n"
+			"	ldr	r1, [sp, #64]\n"
+			"	msr	cpsr_fs, r1\n"
+			"	ldmia	sp, {r0 - r15}\n"
+			".global optprobe_template_val\n"
+			"optprobe_template_val:\n"
+			"1:	.long 0\n"
+			".global optprobe_template_call\n"
+			"optprobe_template_call:\n"
+			"2:	.long 0\n"
+			".global optprobe_template_end\n"
+			"optprobe_template_end:\n");
+
+#define TMPL_VAL_IDX \
+	((long)&optprobe_template_val - (long)&optprobe_template_entry)
+#define TMPL_CALL_IDX \
+	((long)&optprobe_template_call - (long)&optprobe_template_entry)
+#define TMPL_END_IDX \
+	((long)&optprobe_template_end - (long)&optprobe_template_entry)
+
+/*
+ * ARM can always optimize an instruction when using ARM ISA.
+ */
+int arch_prepared_optinsn(struct arch_optimized_insn *optinsn)
+{
+	return 1;
+}
+
+/*
+ * In ARM ISA, kprobe opt always replace one instruction (4 bytes
+ * aligned and 4 bytes long). It is impossiable to encounter another
+ * kprobe in the address range. So always return 0.
+ */
+int arch_check_optimized_kprobe(struct optimized_kprobe *op)
+{
+	return 0;
+}
+
+/* Caller must ensure addr & 3 == 0 */
+static int can_optimize(unsigned long paddr)
+{
+	return 1;
+}
+
+/* Free optimized instruction slot */
+static void
+__arch_remove_optimized_kprobe(struct optimized_kprobe *op, int dirty)
+{
+	if (op->optinsn.insn) {
+		free_optinsn_slot(op->optinsn.insn, dirty);
+		op->optinsn.insn = NULL;
+	}
+}
+
+extern void kprobe_handler(struct pt_regs *regs);
+
+static void
+optimized_callback(struct optimized_kprobe *op, struct pt_regs *regs)
+{
+	unsigned long flags;
+	struct kprobe *p = &op->kp;
+	struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
+
+	/* Save skipped registers */
+	regs->ARM_pc = (unsigned long)op->kp.addr;
+	regs->ARM_ORIG_r0 = ~0UL;
+
+	local_irq_save(flags);
+
+	if (kprobe_running()) {
+		kprobes_inc_nmissed_count(&op->kp);
+	} else {
+		__this_cpu_write(current_kprobe, &op->kp);
+		kcb->kprobe_status = KPROBE_HIT_ACTIVE;
+		opt_pre_handler(&op->kp, regs);
+		__this_cpu_write(current_kprobe, NULL);
+	}
+
+	/* In each case, we must singlestep the replaced instruction. */
+	op->kp.ainsn.insn_singlestep(p->opcode, &p->ainsn, regs);
+
+	local_irq_restore(flags);
+}
+
+int arch_prepare_optimized_kprobe(struct optimized_kprobe *op)
+{
+	u8 *buf;
+	unsigned long rel_chk;
+	unsigned long val;
+
+	if (!can_optimize((unsigned long)op->kp.addr))
+		return -EILSEQ;
+
+	op->optinsn.insn = get_optinsn_slot();
+	if (!op->optinsn.insn)
+		return -ENOMEM;
+
+	/*
+	 * Check whether the address gap is within the 32MiB range, because
+	 * this uses a relative jump.
+	 *
+	 * kprobe opt uses a 'b' instruction to branch to optinsn.insn.
+	 * According to ARM manual, branch instruction is:
+	 *
+	 *   31  28 27           24 23             0
+	 *  +------+---+---+---+---+----------------+
+	 *  | cond | 1 | 0 | 1 | 0 |      imm24     |
+	 *  +------+---+---+---+---+----------------+
+	 *
+	 * imm24 is a signed 24-bit integer. The real branch offset is computed
+	 * by: imm32 = SignExtend(imm24:'00', 32);
+	 *
+	 * So the maximum forward branch should be:
+	 *   (0x007fffff << 2) = 0x01fffffc =  0x1fffffc
+	 * The maximum backward branch should be:
+	 *   (0xff800000 << 2) = 0xfe000000 = -0x2000000
+	 *
+	 * We can simply check (rel & 0xfe000003):
+	 *  if rel is positive, (rel & 0xfe000000) should be 0
+	 *  if rel is negative, (rel & 0xfe000000) should be 0xfe000000
+	 *  the last '3' is used for alignment checking.
+	 */
+	rel_chk = (unsigned long)((long)op->optinsn.insn -
+			(long)op->kp.addr + 8) & 0xfe000003;
+
+	if ((rel_chk != 0) && (rel_chk != 0xfe000000)) {
+		__arch_remove_optimized_kprobe(op, 0);
+		return -ERANGE;
+	}
+
+	buf = (u8 *)op->optinsn.insn;
+
+	/* Copy arch-dep-instance from template */
+	memcpy(buf, &optprobe_template_entry, TMPL_END_IDX);
+
+	/* Set probe information */
+	val = (unsigned long)op;
+	memcpy(buf + TMPL_VAL_IDX, &val, sizeof(val));
+
+	/* Set probe function call */
+	val = (unsigned long)optimized_callback;
+	memcpy(buf + TMPL_CALL_IDX, &val, sizeof(val));
+
+	flush_icache_range((unsigned long)buf,
+			   (unsigned long)buf + TMPL_END_IDX);
+	return 0;
+}
+
+void arch_optimize_kprobes(struct list_head *oplist)
+{
+	struct optimized_kprobe *op, *tmp;
+
+	list_for_each_entry_safe(op, tmp, oplist, list) {
+		unsigned long insn;
+		WARN_ON(kprobe_disabled(&op->kp));
+
+		/*
+		 * Backup instructions which will be replaced
+		 * by jump address
+		 */
+		memcpy(op->optinsn.copied_insn, op->kp.addr,
+				RELATIVEJUMP_SIZE);
+
+		insn = arm_gen_branch((unsigned long)op->kp.addr,
+				(unsigned long)op->optinsn.insn);
+		BUG_ON(insn == 0);
+
+		/*
+		 * Make it a conditional branch if replaced insn
+		 * is consitional
+		 */
+		insn = (__mem_to_opcode_arm(
+			  op->optinsn.copied_insn[0]) & 0xf0000000) |
+			(insn & 0x0fffffff);
+
+		patch_text(op->kp.addr, insn);
+
+		list_del_init(&op->list);
+	}
+}
+
+void arch_unoptimize_kprobe(struct optimized_kprobe *op)
+{
+	arch_arm_kprobe(&op->kp);
+}
+
+/*
+ * Recover original instructions and breakpoints from relative jumps.
+ * Caller must call this with kprobe_mutex held.
+ */
+void arch_unoptimize_kprobes(struct list_head *oplist,
+ 			    struct list_head *done_list)
+{
+	struct optimized_kprobe *op, *tmp;
+
+	list_for_each_entry_safe(op, tmp, oplist, list) {
+		arch_unoptimize_kprobe(op);
+		list_move(&op->list, done_list);
+	}
+}
+
+int arch_within_optimized_kprobe(struct optimized_kprobe *op,
+ 				unsigned long addr)
+{
+	return ((unsigned long)op->kp.addr <= addr &&
+		(unsigned long)op->kp.addr + RELATIVEJUMP_SIZE > addr);
+}
+
+void arch_remove_optimized_kprobe(struct optimized_kprobe *op)
+{
+	__arch_remove_optimized_kprobe(op, 1);
+}