[RFC,25/28] x86: Use PIE codegen for the core kernel

Message ID	20240925150059.3955569-55-ardb+git@google.com
State	New
Headers	show Received: from mail-wr1-f74.google.com (mail-wr1-f74.google.com [209.85.221.74]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9826119C55F for <linux-efi@vger.kernel.org>; Wed, 25 Sep 2024 15:02:55 +0000 (UTC) Date: Wed, 25 Sep 2024 17:01:25 +0200 In-Reply-To: <20240925150059.3955569-30-ardb+git@google.com> Precedence: bulk Mime-Version: 1.0 References: <20240925150059.3955569-30-ardb+git@google.com> Message-ID: <20240925150059.3955569-55-ardb+git@google.com> Subject: [RFC PATCH 25/28] x86: Use PIE codegen for the core kernel From: Ard Biesheuvel <ardb+git@google.com> To: linux-kernel@vger.kernel.org Cc: Ard Biesheuvel <ardb@kernel.org>, x86@kernel.org, "H. Peter Anvin" <hpa@zytor.com>, Andy Lutomirski <luto@kernel.org>, Peter Zijlstra <peterz@infradead.org>, Uros Bizjak <ubizjak@gmail.com>, Dennis Zhou <dennis@kernel.org>, Tejun Heo <tj@kernel.org>, Christoph Lameter <cl@linux.com>, Mathieu Desnoyers <mathieu.desnoyers@efficios.com>, Paolo Bonzini <pbonzini@redhat.com>, Vitaly Kuznetsov <vkuznets@redhat.com>, Juergen Gross <jgross@suse.com>, Boris Ostrovsky <boris.ostrovsky@oracle.com>, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, Arnd Bergmann <arnd@arndb.de>, Masahiro Yamada <masahiroy@kernel.org>, Kees Cook <kees@kernel.org>, Nathan Chancellor <nathan@kernel.org>, Keith Packard <keithp@keithp.com>, Justin Stitt <justinstitt@google.com>, Josh Poimboeuf <jpoimboe@kernel.org>, Arnaldo Carvalho de Melo <acme@kernel.org>, Namhyung Kim <namhyung@kernel.org>, Jiri Olsa <jolsa@kernel.org>, Ian Rogers <irogers@google.com>, Adrian Hunter <adrian.hunter@intel.com>, Kan Liang <kan.liang@linux.intel.com>, linux-doc@vger.kernel.org, linux-pm@vger.kernel.org, kvm@vger.kernel.org, xen-devel@lists.xenproject.org, linux-efi@vger.kernel.org, linux-arch@vger.kernel.org, linux-sparse@vger.kernel.org, linux-kbuild@vger.kernel.org, linux-perf-users@vger.kernel.org, rust-for-linux@vger.kernel.org, llvm@lists.linux.dev Content-Type: text/plain; charset="UTF-8"
Series	x86: Rely on toolchain for relocatable code \| expand [RFC,00/28] x86: Rely on toolchain for relocatable code [RFC,01/28] x86/pvh: Call C code via the kernel virtual mapping [RFC,02/28] Documentation: Bump minimum GCC version to 8.1 [RFC,03/28] x86/tools: Use mmap() to simplify relocs host tool [RFC,04/28] x86/boot: Permit GOTPCREL relocations for x86_64 builds [RFC,05/28] x86: Define the stack protector guard symbol explicitly [RFC,06/28] x86/percpu: Get rid of absolute per-CPU variable placement [RFC,07/28] scripts/kallsyms: Avoid 0x0 as the relative base [RFC,08/28] scripts/kallsyms: Remove support for absolute per-CPU variables [RFC,09/28] x86/tools: Remove special relocation handling for per-CPU variables [RFC,10/28] x86/xen: Avoid relocatable quantities in Xen ELF notes [RFC,11/28] x86/pvh: Avoid absolute symbol references in .head.text [RFC,12/28] x86/pm-trace: Use RIP-relative accesses for .tracedata [RFC,13/28] x86/kvm: Use RIP-relative addressing [RFC,14/28] x86/rethook: Use RIP-relative reference for return address [RFC,15/28] x86/sync_core: Use RIP-relative addressing [RFC,16/28] x86/entry_64: Use RIP-relative addressing [RFC,17/28] x86/hibernate: Prefer RIP-relative accesses [RFC,18/28] x86/boot/64: Determine VA/PA offset before entering C code [RFC,19/28] x86/boot/64: Avoid intentional absolute symbol references in .head.text [RFC,20/28] x64/acpi: Use PIC-compatible references in wakeup_64.S [RFC,21/28] x86/head: Use PIC-compatible symbol references in startup code [RFC,22/28] asm-generic: Treat PIC .data.rel.ro sections as .rodata [RFC,23/28] tools/objtool: Mark generated sections as writable [RFC,24/28] tools/objtool: Treat indirect ftrace calls as direct calls [RFC,25/28] x86: Use PIE codegen for the core kernel [RFC,26/28] x86/boot: Implement support for ELF RELA/RELR relocations [RFC,27/28] x86/kernel: Switch to PIE linking for the core kernel [RFC,28/28] x86/tools: Drop x86_64 support from 'relocs' tool

Ard Biesheuvel Sept. 25, 2024, 3:01 p.m. UTC

From: Ard Biesheuvel <ardb@kernel.org>

As an intermediate step towards enabling PIE linking for the 64-bit x86
kernel, enable PIE codegen for all objects that are linked into the
kernel proper.

This substantially reduces the number of relocations that need to be
processed when booting a relocatable KASLR kernel.

Before (size in bytes of the reloc table):

  797372 arch/x86/boot/compressed/vmlinux.relocs

After:

  400252 arch/x86/boot/compressed/vmlinux.relocs

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/x86/Makefile                 | 11 ++++++++++-
 arch/x86/boot/Makefile            |  1 +
 arch/x86/boot/compressed/Makefile |  2 +-
 arch/x86/entry/vdso/Makefile      |  1 +
 arch/x86/realmode/rm/Makefile     |  1 +
 include/asm-generic/vmlinux.lds.h |  1 +
 6 files changed, 15 insertions(+), 2 deletions(-)

H. Peter Anvin Oct. 1, 2024, 9:13 p.m. UTC | #1

On 9/25/24 08:01, Ard Biesheuvel wrote:
> From: Ard Biesheuvel <ardb@kernel.org>
> 
> As an intermediate step towards enabling PIE linking for the 64-bit x86
> kernel, enable PIE codegen for all objects that are linked into the
> kernel proper.
> 
> This substantially reduces the number of relocations that need to be
> processed when booting a relocatable KASLR kernel.
> 

This really seems like going completely backwards to me.

You are imposing a more restrictive code model on the kernel, optimizing 
for boot time in a way that will exert a permanent cost on the running 
kernel.

There is a *huge* difference between the kernel and user space here:

KERNEL MEMORY IS PERMANENTLY ALLOCATED, AND IS NEVER SHARED.

Dirtying user pages requires them to be unshared and dirty, which is 
undesirable. Kernel pages are *always* unshared and dirty.

> It also brings us much closer to the ordinary PIE relocation model used
> for most of user space, which is therefore much better supported and
> less likely to create problems as we increase the range of compilers and
> linkers that need to be supported.

We have been resisting *for ages* making the kernel worse to accomodate 
broken compilers. We don't "need" to support more compilers -- we need 
the compilers to support us. We have working compilers; any new compiler 
that wants to play should be expected to work correctly.

	-hpa

Ard Biesheuvel Oct. 2, 2024, 3:25 p.m. UTC | #2

Hi Peter,

Thanks for taking a look.

On Tue, 1 Oct 2024 at 23:13, H. Peter Anvin <hpa@zytor.com> wrote:
>
> On 9/25/24 08:01, Ard Biesheuvel wrote:
> > From: Ard Biesheuvel <ardb@kernel.org>
> >
> > As an intermediate step towards enabling PIE linking for the 64-bit x86
> > kernel, enable PIE codegen for all objects that are linked into the
> > kernel proper.
> >
> > This substantially reduces the number of relocations that need to be
> > processed when booting a relocatable KASLR kernel.
> >
>
> This really seems like going completely backwards to me.
>
> You are imposing a more restrictive code model on the kernel, optimizing
> for boot time in a way that will exert a permanent cost on the running
> kernel.
>

Fair point about the boot time. This is not the only concern, though,
and arguably the least important one.

As I responded to Andi before, it is also about using a code model and
relocation model that matches the reality of how the code is executed:
- the early C code runs from the 1:1 mapping, and needs special hacks
to accommodate this
- KASLR runs the kernel from a different virtual address than the one
we told the linker about

> There is a *huge* difference between the kernel and user space here:
>
> KERNEL MEMORY IS PERMANENTLY ALLOCATED, AND IS NEVER SHARED.
>

No need to shout.

> Dirtying user pages requires them to be unshared and dirty, which is
> undesirable. Kernel pages are *always* unshared and dirty.
>

I guess you are referring to the use of a GOT? That is a valid
concern, but it does not apply here. With hidden visibility and
compiler command line options like -mdirect-access-extern, all emitted
symbol references are direct. Disallowing text relocations could be
trivially enabled with this series if desired, and actually helps
avoid the tricky bugs we keep fixing in the early startup code that
executes from the 1:1 mapping (the C code in .head.text)

So it mostly comes down to minor differences in addressing modes, e.g.,

  movq $sym, %reg

actually uses more bytes than

  leaq sym(%rip), %reg

whereas

  movq sym, %reg

and

  movq sym(%rip), %reg

are the same length.

OTOH, indexing a statically allocated global array like

  movl array(,%reg1,4), %reg2

will be converted into

  leaq array(%rip), %reg2
  movl (%reg2,%reg1,4), %reg2

and is therefore less efficient in terms of code footprint. But in
general, the x86_64 ISA and psABI are quite flexible in this regard,
and extrapolating from past experiences with PIC code on i386 is not
really justified here.

As Andi also pointed out, what ultimately matters is the performance,
as well as code size where it impacts performance, through the I-cache
footprint. I'll do some testing before reposting, and maybe not bother
if the impact is negative.

> > It also brings us much closer to the ordinary PIE relocation model used
> > for most of user space, which is therefore much better supported and
> > less likely to create problems as we increase the range of compilers and
> > linkers that need to be supported.
>
> We have been resisting *for ages* making the kernel worse to accomodate
> broken compilers. We don't "need" to support more compilers -- we need
> the compilers to support us. We have working compilers; any new compiler
> that wants to play should be expected to work correctly.
>

We are in a much better place now than we were before in that regard,
which is actually how this effort came about: instead of lying to the
compiler, and maintaining our own pile of scripts and relocation
tools, we can just do what other arches are doing in Linux, and let
the toolchain do it for us.

Linus Torvalds Oct. 2, 2024, 8:01 p.m. UTC | #3

On Wed, 2 Oct 2024 at 08:31, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> I guess you are referring to the use of a GOT? That is a valid
> concern, but it does not apply here. With hidden visibility and
> compiler command line options like -mdirect-access-extern, all emitted
> symbol references are direct.

I absolutely hate GOT entries. We definitely shouldn't ever do
anything that causes them on x86-64.

I'd much rather just do boot-time relocation, and I don't think the
"we run code at a different location than we told the linker" is an
arghument against it.

Please, let's make sure we never have any of the global offset table horror.

Yes, yes, you can't avoid them on other architectures.

That said, doing changes like changing "mov $sym" to "lea sym(%rip)" I
feel are a complete no-brainer and should be done regardless of any
other code generation issues.

Let's not do relocation for no good reason.

             Linus

Ard Biesheuvel Oct. 3, 2024, 11:13 a.m. UTC | #4

On Wed, 2 Oct 2024 at 22:02, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, 2 Oct 2024 at 08:31, Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > I guess you are referring to the use of a GOT? That is a valid
> > concern, but it does not apply here. With hidden visibility and
> > compiler command line options like -mdirect-access-extern, all emitted
> > symbol references are direct.
>
> I absolutely hate GOT entries. We definitely shouldn't ever do
> anything that causes them on x86-64.
>
> I'd much rather just do boot-time relocation, and I don't think the
> "we run code at a different location than we told the linker" is an
> arghument against it.
>
> Please, let's make sure we never have any of the global offset table horror.
>
> Yes, yes, you can't avoid them on other architectures.
>

GCC/binutils never needs them. GCC 13 and older will emit some
horrible indirect fentry hook calls via the GOT, but those are NOPed
out by objtool anyway, so that is easily fixable.

Clang/LLD is slightly trickier, because Clang emits relaxable GOTPCREL
relocations, but LLD doesn't update the relocations emitted via
--emit-relocs, so the relocs tool gets confused. This is one of the
reasons I had for proposing to simply switch to PIE linking, and let
the linker deal with all of that. Note that Clang may emit these even
when not generating PIC/PIE code at all.

So this is the reason I wanted to add support for GOTPCREL relocations
in the relocs tool - it is really quite trivial to do, and makes our
jobs slightly easier when dealing with these compiler quirks. The
alternative would be to teach objtool how to relax 'movq
foo@GOTPCREL(%rip)' into 'leaq foo(%rip)' - these are GOTPCREL
relaxations described in the x86_64 psABI for ELF, which is why
compilers are assuming more and more that emitting these is fine even
without -fpic, given that the linker is supposed to elide them if
possible.

Note that there are 1 or 2 cases in the asm code where it is actually
quite useful to refer to the address of foo as 'foo@GOTPCREL(%rip)' in
instructions that take memory operands, but those individual cases are
easily converted to something else if even having a GOT with just 2
entries is a dealbreaker for you.

> That said, doing changes like changing "mov $sym" to "lea sym(%rip)" I
> feel are a complete no-brainer and should be done regardless of any
> other code generation issues.
>

Yes, this is the primary reason I ended up looking into this in the
first place. Earlier this year, we ended up having to introduce
RIP_REL_REF() to emit those RIP-relative references explicitly, in
order to prevent the C code that is called via the early 1:1 mapping
from exploding. The amount of C code called in that manner has been
growing steadily over time with the introduction of 5-level paging and
SEV-SNP and TDX support, which need to play all kinds of tricks before
the normal kernel mappings are created.

Compiling with -fpie and linking with --pie -z text produces an
executable that is guaranteed to have only RIP-relative references in
the .text segment, removing the need for RIP_REL_REF entirely (it
already does nothing when __pic__ is #define'd).

H. Peter Anvin Oct. 4, 2024, 9:06 p.m. UTC | #5

On 10/3/24 04:13, Ard Biesheuvel wrote:
> 
>> That said, doing changes like changing "mov $sym" to "lea sym(%rip)" I
>> feel are a complete no-brainer and should be done regardless of any
>> other code generation issues.
> 
> Yes, this is the primary reason I ended up looking into this in the
> first place. Earlier this year, we ended up having to introduce
> RIP_REL_REF() to emit those RIP-relative references explicitly, in
> order to prevent the C code that is called via the early 1:1 mapping
> from exploding. The amount of C code called in that manner has been
> growing steadily over time with the introduction of 5-level paging and
> SEV-SNP and TDX support, which need to play all kinds of tricks before
> the normal kernel mappings are created.
> 

movq $sym to leaq sym(%rip) which you said ought to be smaller (and in 
reality appears to be the same size, 7 bytes) seems like a no-brainer 
and can be treated as a code quality issue -- in other words, file bug 
reports against gcc and clang.

> Compiling with -fpie and linking with --pie -z text produces an
> executable that is guaranteed to have only RIP-relative references in
> the .text segment, removing the need for RIP_REL_REF entirely (it
> already does nothing when __pic__ is #define'd).

But -fpie has a considerable cost; specifically when we have indexed 
references, as in that case the base pointer needs to be manifest in a 
register, *and* it takes up a register slot in the EA, which may end 
converting one instruction into three.

Now, the "kernel" memory model is defined in the ABI document, but there 
is nothing that prevents us from making updates to it if we need to; 
e.g. the statement that movq $sym can be used is undesirable, of course.

	-hpa

Uros Bizjak Oct. 5, 2024, 8:31 a.m. UTC | #6

On Fri, Oct 4, 2024 at 11:06 PM H. Peter Anvin <hpa@zytor.com> wrote:
>
> On 10/3/24 04:13, Ard Biesheuvel wrote:
> >
> >> That said, doing changes like changing "mov $sym" to "lea sym(%rip)" I
> >> feel are a complete no-brainer and should be done regardless of any
> >> other code generation issues.
> >
> > Yes, this is the primary reason I ended up looking into this in the
> > first place. Earlier this year, we ended up having to introduce
> > RIP_REL_REF() to emit those RIP-relative references explicitly, in
> > order to prevent the C code that is called via the early 1:1 mapping
> > from exploding. The amount of C code called in that manner has been
> > growing steadily over time with the introduction of 5-level paging and
> > SEV-SNP and TDX support, which need to play all kinds of tricks before
> > the normal kernel mappings are created.
> >
>
> movq $sym to leaq sym(%rip) which you said ought to be smaller (and in
> reality appears to be the same size, 7 bytes) seems like a no-brainer
> and can be treated as a code quality issue -- in other words, file bug
> reports against gcc and clang.

It is the kernel assembly source that should be converted to
rip-relative form, gcc (and probably clang) have nothing with it.

Uros.

Linus Torvalds Oct. 6, 2024, midnight UTC | #7

On Sat, 5 Oct 2024 at 16:37, H. Peter Anvin <hpa@zytor.com> wrote:
>
> Sadly, that is not correct; neither gcc nor clang uses lea:

Looking around, this may be intentional. At least according to Agner,
several cores do better at "mov immediate" compared to "lea".

Eg a RIP-relative LEA on Zen 2 gets a throughput of two per cycle, but
a "MOV r,i" gets four. That got fixed in Zen 3 and later, but
apparently Intel had similar issues (Ivy Bridge: 1 LEA per cycle, vs 3
"mov i,r". Haswell is 1:4).

Of course, Agner's tables are good, but not necessarily always the
whole story. There are other instruction tables on the internet (eg
uops.info) with possibly more info.

And in reality, I would expect it to be a complete non-issue with any
OoO engine and real code, because you are very seldom ALU limited
particularly when there aren't any data dependencies.

But a RIP-relative LEA does seem to put a *bit* more pressure on the
core resources, so the compilers are may be right to pick a "mov".

               Linus

[RFC,25/28] x86: Use PIE codegen for the core kernel

Commit Message

Comments

Patch