From patchwork Thu Jan 4 18:51:33 2024
X-Patchwork-Submitter: Jeff Xu
X-Patchwork-Id: 760599
From: jeffxu@chromium.org
To: akpm@linux-foundation.org, keescook@chromium.org, jannh@google.com, sroettger@google.com, willy@infradead.org, gregkh@linuxfoundation.org, torvalds@linux-foundation.org, usama.anjum@collabora.com
Cc: jeffxu@google.com, jorgelo@chromium.org, groeck@chromium.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-mm@kvack.org, pedro.falcato@gmail.com, dave.hansen@intel.com, linux-hardening@vger.kernel.org, deraadt@openbsd.org, Jeff Xu
Subject: [RFC PATCH v4 0/4] Introduce mseal()
Date: Thu, 4 Jan 2024 18:51:33 +0000
Message-ID: <20240104185138.169307-1-jeffxu@chromium.org>
X-Mailing-List: linux-kselftest@vger.kernel.org

From: Jeff Xu

This is V4 of the patch. It has improved significantly since V1 thanks to diverse input. A few discussions remain open; please see the "Open discussions" section under the V4 change history.

-----------------------------------------------------------------

This patchset proposes a new mseal() syscall for the Linux kernel.

In a nutshell, mseal() protects the VMAs of a given virtual memory range against modifications, such as changes to their permission bits.

Modern CPUs support memory permissions, such as the read/write (RW) and no-execute (NX) bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1]. The memory permission feature improves the security stance on memory corruption bugs, as an attacker cannot simply write to arbitrary memory and point the code to it. The memory must be marked with the X bit, or else an exception will occur. Internally, the kernel maintains the memory permissions in a data structure called VMA (vm_area_struct). mseal() additionally protects the VMA itself against modifications of the selected seal type.

Memory sealing is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management system. For example, such an attacker primitive can break control-flow integrity guarantees, since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages, and applications can additionally seal security-critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2], and this patchset has been designed to be compatible with the Chrome use case.

Two system calls are involved in sealing the map: mmap() and mseal().

The new mseal() is a syscall available on 64-bit CPUs, with the following signature:

int mseal(void *addr, size_t len, unsigned long flags)
addr/len: memory range.
flags: reserved.

mseal() blocks the following operations for the given memory range.

1> Unmapping, moving to another location, and shrinking the size, via munmap() and mremap(). These operations can leave an empty space, which can then be replaced with a VMA with a new set of attributes.

2> Moving or expanding a different VMA into the current location, via mremap().

3> Modifying a VMA via mmap(MAP_FIXED).

4> Size expansion, via mremap(). This does not appear to pose any specific risks to sealed VMAs. It is included anyway because the use case is unclear. In any case, users can rely on merging to expand a sealed VMA.

5> mprotect() and pkey_mprotect().

6> Some destructive madvise() behaviors (e.g. MADV_DONTNEED) for anonymous memory, when users don't have write permission to the memory. Those behaviors can alter region contents by discarding pages, effectively a memset(0) for anonymous memory.

In addition, mmap() has two related changes.

The PROT_SEAL bit in the prot field of mmap(). When present, it marks the map as sealed from creation.

The MAP_SEALABLE bit in the flags field of mmap(). When present, it marks the map as sealable. A map created without MAP_SEALABLE will not support sealing, i.e. mseal() will fail.

Applications that don't care about sealing will see their behavior unchanged. Those that need sealing support opt in by adding MAP_SEALABLE in mmap().

The idea that inspired this patch comes from Stephen Röttger’s work in V8 CFI [5]. The Chrome browser in ChromeOS will be the first user of this API.

Indeed, the Chrome browser has very specific requirements for sealing, which are distinct from those of most applications. For example, in the case of libc, sealing is only applied to read-only (RO) or read-execute (RX) memory segments (such as .text and .RELRO) to prevent them from becoming writable, and the lifetime of those mappings is tied to the lifetime of the process.

Chrome wants to seal two large address space reservations that are managed by different allocators.
The memory is mapped RW- and RWX respectively, but write access to it is restricted using pkeys (or, in the future, ARM permission overlay extensions). The lifetime of those mappings is not tied to the lifetime of the process; therefore, while the memory is sealed, the allocators still need to free or discard the unused memory, for example with madvise(DONTNEED).

However, always allowing madvise(DONTNEED) on this range poses a security risk. For example, if a jump instruction crosses a page boundary and the second page gets discarded, the discard will overwrite the target bytes with zeros and change the control flow. Checking write permission before the discard operation allows us to control when the operation is valid. In this case, the madvise will only succeed if the executing thread has PKEY write permissions and PKRU changes are protected in software by control-flow integrity.

Although the initial version of this patch series targets the Chrome browser as its first user, it became evident during upstream discussions that we would also want to ensure that the patch set eventually is a complete solution for memory sealing and is compatible with other use cases. The specific scenario currently in mind is glibc's use case of loading and sealing ELF executables. To this end, Stephen is working on a change to glibc to add sealing support to the dynamic linker, which will seal all non-writable segments at startup. Once this work is completed, all applications will be able to automatically benefit from these new protections.

--------------------------------------------------------------------
Change history:
===============
V4:
(Suggested by Linus Torvalds)
- new signature: mseal(start,len,flags)
- 32 bit is not supported. vm_seal is removed, use vm_flags instead.
- single bit in vm_flags for sealed state.
- CONFIG_MSEAL kernel config is removed.
- single bit of PROT_SEAL in the "Prot" field of mmap().
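
For illustration only, the V4 signature mseal(start,len,flags) could be exercised from userspace roughly as follows. This is a minimal sketch under stated assumptions, not code from this series: try_mseal() and seal_then_mprotect() are hypothetical helper names, the raw syscall is only issued when the installed headers define __NR_mseal, and the mapping is created without MAP_SEALABLE because that flag is not available in released headers. On a kernel carrying this RFC the seal request would therefore be expected to fail, which the sketch reports as "not sealed".

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of this patchset or of any libc).
 * Issues the raw syscall when the installed kernel headers define
 * __NR_mseal, and reports ENOSYS otherwise.
 */
static int try_mseal(void *start, size_t len, unsigned long flags)
{
#ifdef __NR_mseal
	return (int)syscall(__NR_mseal, start, len, flags);
#else
	(void)start;
	(void)len;
	(void)flags;
	errno = ENOSYS;
	return -1;
#endif
}

/*
 * Hypothetical check of one blocked operation, mprotect().
 * Returns 1 if the check could not be performed (mmap() failed, or the
 * kernel refused to seal the mapping), 0 if the range was sealed and the
 * following mprotect() was rejected, and -1 if mprotect() succeeded on a
 * range that was reported as sealed.
 */
static int seal_then_mprotect(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	/* flags is reserved in V4, so pass 0. */
	if (try_mseal(p, len, 0) != 0) {
		munmap(p, len);
		return 1;
	}

	/* The range is sealed: a permission change should now be refused. */
	int rc = mprotect(p, len, PROT_READ);

	/* A sealed range cannot be unmapped, so it is left in place. */
	return rc == -1 ? 0 : -1;
}
```

A caller would treat a return value of 1 as "sealing unavailable" and continue unsealed. A return value of -1 would indicate that the seal did not block mprotect().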

Other changes:
- update selftest (Suggested by Muhammad Usama Anjum)
- update documentation.

Open discussions:
=================
The discussions below were brought up in V3 and did not receive any input. The one important to this patch is MAP_SEALABLE in mmap(), which is in the current version of the patch. They are listed here for input/comments.

---------------------------------------------------------------------
During the development of V3, I had new questions and thoughts that I wished to discuss.

1> shm/aio