From patchwork Mon Oct 28 14:13:27 2024
X-Patchwork-Submitter: Lorenzo Stoakes
X-Patchwork-Id: 839245
From: Lorenzo Stoakes
To: Andrew Morton
Cc: Suren Baghdasaryan, "Liam R. Howlett", Matthew Wilcox, Vlastimil Babka,
 "Paul E. McKenney", Jann Horn, David Hildenbrand, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, Muchun Song, Richard Henderson, Matt Turner,
 Thomas Bogendoerfer, "James E. J. Bottomley", Helge Deller, Chris Zankel,
 Max Filippov, Arnd Bergmann, linux-alpha@vger.kernel.org,
 linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org,
 linux-arch@vger.kernel.org, Shuah Khan, Christian Brauner,
 linux-kselftest@vger.kernel.org, Sidhartha Kumar, Jeff Xu,
 Christoph Hellwig, linux-api@vger.kernel.org, John Hubbard
Subject: [PATCH v4 1/5] mm: pagewalk: add the ability to install PTEs
Date: Mon, 28 Oct 2024 14:13:27 +0000
Message-ID: <51b432ebef013e3fdf9f92101533435de1bffadf.1730123433.git.lorenzo.stoakes@oracle.com>
The existing generic pagewalk logic permits the walking of page tables, invoking callbacks at individual page table levels via user-provided mm_walk_ops callbacks. This is useful for traversing existing page table entries, but precludes the ability to establish new ones.

Existing mechanisms for performing a walk which also installs page table entries where necessary are heavily duplicated throughout the kernel, each with semantic differences from one another and largely unavailable for use elsewhere.

Rather than add yet another implementation, we extend the generic pagewalk logic to enable the installation of page table entries by adding a new install_pte() callback to mm_walk_ops. If this is specified, then upon encountering a missing page table entry, we allocate and install a new one and continue the traversal.

If a THP huge page is encountered at either the PMD or PUD level, we split it only if an ops->pte_entry() handler is specified (or an ops->pmd_entry() handler at PUD level); otherwise, if there is only an ops->install_pte() handler, we avoid the unnecessary split. We do not support hugetlb at this stage.

If the install_pte() callback returns an error, or an allocation fails during the operation, we abort the operation altogether. It is up to the caller to deal appropriately with partially populated page table ranges.

If install_pte() is defined, the semantics of pte_entry() change - this callback is then only invoked if the entry already exists. This is a useful property, as it allows a caller to handle existing PTEs while installing new ones where necessary in the specified range.

If install_pte() is not defined, then there is no functional change from this patch - all existing logic will work precisely as it did before.

As we only permit the installation of PTEs where a mapping does not already exist, there is no need for TLB management; however, we do invoke update_mmu_cache() for architectures which require manual maintenance of mappings for other CPUs.

We explicitly do not allow the existing page walk API to expose this feature as it is dangerous and intended for internal mm use only. Therefore we provide a new walk_page_range_mm() function, exposed only via mm/internal.h.

We take the opportunity to additionally clean up the page walker logic to be a little easier to follow.

Reviewed-by: Jann Horn
Reviewed-by: Vlastimil Babka
Signed-off-by: Lorenzo Stoakes
---
 include/linux/pagewalk.h |  18 ++-
 mm/internal.h            |   6 +
 mm/pagewalk.c            | 246 ++++++++++++++++++++++++++++-----------
 3 files changed, 201 insertions(+), 69 deletions(-)

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h index f5eb5a32aeed..9700a29f8afb 100644 --- a/include/linux/pagewalk.h +++ b/include/linux/pagewalk.h @@ -25,12 +25,15 @@ enum page_walk_lock { * this handler is required to be able to handle * pmd_trans_huge() pmds. They may simply choose to * split_huge_page() instead of handling it explicitly. - * @pte_entry: if set, called for each PTE (lowest-level) entry, - * including empty ones + * @pte_entry: if set, called for each PTE (lowest-level) entry + * including empty ones, except if @install_pte is set. + * If @install_pte is set, @pte_entry is called only for + * existing PTEs. * @pte_hole: if set, called for each hole at all levels, * depth is -1 if not known, 0:PGD, 1:P4D, 2:PUD, 3:PMD. * Any folded depths (where PTRS_PER_P?D is equal to 1) - * are skipped. + * are skipped. If @install_pte is specified, this will + * not trigger for any populated ranges.
* @hugetlb_entry: if set, called for each hugetlb entry. This hook * function is called with the vma lock held, in order to * protect against a concurrent freeing of the pte_t* or @@ -51,6 +54,13 @@ enum page_walk_lock { * @pre_vma: if set, called before starting walk on a non-null vma. * @post_vma: if set, called after a walk on a non-null vma, provided * that @pre_vma and the vma walk succeeded. + * @install_pte: if set, missing page table entries are installed and + * thus all levels are always walked in the specified + * range. This callback is then invoked at the PTE level + * (having split any THP pages prior), providing the PTE to + * install. If allocations fail, the walk is aborted. This + * operation is only available for userland memory. Not + * usable for hugetlb ranges. * * p?d_entry callbacks are called even if those levels are folded on a * particular architecture/configuration. @@ -76,6 +86,8 @@ struct mm_walk_ops { int (*pre_vma)(unsigned long start, unsigned long end, struct mm_walk *walk); void (*post_vma)(struct mm_walk *walk); + int (*install_pte)(unsigned long addr, unsigned long next, + pte_t *ptep, struct mm_walk *walk); enum page_walk_lock walk_lock; }; diff --git a/mm/internal.h b/mm/internal.h index c4c884d61024..41b60204b059 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -12,6 +12,7 @@ #include #include #include +#include #include #include #include @@ -1502,4 +1503,9 @@ static inline void accept_page(struct page *page) } #endif /* CONFIG_UNACCEPTED_MEMORY */ +/* pagewalk.c */ +int walk_page_range_mm(struct mm_struct *mm, unsigned long start, + unsigned long end, const struct mm_walk_ops *ops, + void *private); + #endif /* __MM_INTERNAL_H */ diff --git a/mm/pagewalk.c b/mm/pagewalk.c index 5f9f01532e67..e478777c86e1 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -3,9 +3,14 @@ #include #include #include +#include #include #include +#include + +#include "internal.h" + /* * We want to know the real level where a entry is located ignoring any * folding of levels which may be happening. For example if p4d is folded then @@ -29,9 +34,23 @@ static int walk_pte_range_inner(pte_t *pte, unsigned long addr, int err = 0; for (;;) { - err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk); - if (err) - break; + if (ops->install_pte && pte_none(ptep_get(pte))) { + pte_t new_pte; + + err = ops->install_pte(addr, addr + PAGE_SIZE, &new_pte, + walk); + if (err) + break; + + set_pte_at(walk->mm, addr, pte, new_pte); + /* Non-present before, so for arches that need it. 
*/ + if (!WARN_ON_ONCE(walk->no_vma)) + update_mmu_cache(walk->vma, addr, pte); + } else { + err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk); + if (err) + break; + } if (addr >= end - PAGE_SIZE) break; addr += PAGE_SIZE; @@ -81,6 +100,8 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, pmd_t *pmd; unsigned long next; const struct mm_walk_ops *ops = walk->ops; + bool has_handler = ops->pte_entry; + bool has_install = ops->install_pte; int err = 0; int depth = real_depth(3); @@ -89,11 +110,14 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, again: next = pmd_addr_end(addr, end); if (pmd_none(*pmd)) { - if (ops->pte_hole) + if (has_install) + err = __pte_alloc(walk->mm, pmd); + else if (ops->pte_hole) err = ops->pte_hole(addr, next, depth, walk); if (err) break; - continue; + if (!has_install) + continue; } walk->action = ACTION_SUBTREE; @@ -109,18 +133,25 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, if (walk->action == ACTION_AGAIN) goto again; - - /* - * Check this here so we only break down trans_huge - * pages when we _need_ to - */ - if ((!walk->vma && (pmd_leaf(*pmd) || !pmd_present(*pmd))) || - walk->action == ACTION_CONTINUE || - !(ops->pte_entry)) + if (walk->action == ACTION_CONTINUE) continue; + if (!has_handler) { /* No handlers for lower page tables. */ + if (!has_install) + continue; /* Nothing to do. */ + /* + * We are ONLY installing, so avoid unnecessarily + * splitting a present huge page. + */ + if (pmd_present(*pmd) && + (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))) + continue; + } + if (walk->vma) split_huge_pmd(walk->vma, pmd, addr); + else if (pmd_leaf(*pmd) || !pmd_present(*pmd)) + continue; /* Nothing to do. */ err = walk_pte_range(pmd, addr, next, walk); if (err) @@ -140,6 +171,8 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end, pud_t *pud; unsigned long next; const struct mm_walk_ops *ops = walk->ops; + bool has_handler = ops->pmd_entry || ops->pte_entry; + bool has_install = ops->install_pte; int err = 0; int depth = real_depth(2); @@ -148,11 +181,14 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end, again: next = pud_addr_end(addr, end); if (pud_none(*pud)) { - if (ops->pte_hole) + if (has_install) + err = __pmd_alloc(walk->mm, pud, addr); + else if (ops->pte_hole) err = ops->pte_hole(addr, next, depth, walk); if (err) break; - continue; + if (!has_install) + continue; } walk->action = ACTION_SUBTREE; @@ -164,14 +200,26 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end, if (walk->action == ACTION_AGAIN) goto again; - - if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) || - walk->action == ACTION_CONTINUE || - !(ops->pmd_entry || ops->pte_entry)) + if (walk->action == ACTION_CONTINUE) continue; + if (!has_handler) { /* No handlers for lower page tables. */ + if (!has_install) + continue; /* Nothing to do. */ + /* + * We are ONLY installing, so avoid unnecessarily + * splitting a present huge page. + */ + if (pud_present(*pud) && + (pud_trans_huge(*pud) || pud_devmap(*pud))) + continue; + } + if (walk->vma) split_huge_pud(walk->vma, pud, addr); + else if (pud_leaf(*pud) || !pud_present(*pud)) + continue; /* Nothing to do. 
*/ + if (pud_none(*pud)) goto again; @@ -189,6 +237,8 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end, p4d_t *p4d; unsigned long next; const struct mm_walk_ops *ops = walk->ops; + bool has_handler = ops->pud_entry || ops->pmd_entry || ops->pte_entry; + bool has_install = ops->install_pte; int err = 0; int depth = real_depth(1); @@ -196,18 +246,21 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end, do { next = p4d_addr_end(addr, end); if (p4d_none_or_clear_bad(p4d)) { - if (ops->pte_hole) + if (has_install) + err = __pud_alloc(walk->mm, p4d, addr); + else if (ops->pte_hole) err = ops->pte_hole(addr, next, depth, walk); if (err) break; - continue; + if (!has_install) + continue; } if (ops->p4d_entry) { err = ops->p4d_entry(p4d, addr, next, walk); if (err) break; } - if (ops->pud_entry || ops->pmd_entry || ops->pte_entry) + if (has_handler || has_install) err = walk_pud_range(p4d, addr, next, walk); if (err) break; @@ -222,6 +275,9 @@ static int walk_pgd_range(unsigned long addr, unsigned long end, pgd_t *pgd; unsigned long next; const struct mm_walk_ops *ops = walk->ops; + bool has_handler = ops->p4d_entry || ops->pud_entry || ops->pmd_entry || + ops->pte_entry; + bool has_install = ops->install_pte; int err = 0; if (walk->pgd) @@ -231,18 +287,21 @@ static int walk_pgd_range(unsigned long addr, unsigned long end, do { next = pgd_addr_end(addr, end); if (pgd_none_or_clear_bad(pgd)) { - if (ops->pte_hole) + if (has_install) + err = __p4d_alloc(walk->mm, pgd, addr); + else if (ops->pte_hole) err = ops->pte_hole(addr, next, 0, walk); if (err) break; - continue; + if (!has_install) + continue; } if (ops->pgd_entry) { err = ops->pgd_entry(pgd, addr, next, walk); if (err) break; } - if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry || ops->pte_entry) + if (has_handler || has_install) err = walk_p4d_range(pgd, addr, next, walk); if (err) break; @@ -334,6 +393,11 @@ static int __walk_page_range(unsigned long start, unsigned long end, int err = 0; struct vm_area_struct *vma = walk->vma; const struct mm_walk_ops *ops = walk->ops; + bool is_hugetlb = is_vm_hugetlb_page(vma); + + /* We do not support hugetlb PTE installation. */ + if (ops->install_pte && is_hugetlb) + return -EINVAL; if (ops->pre_vma) { err = ops->pre_vma(start, end, walk); @@ -341,7 +405,7 @@ static int __walk_page_range(unsigned long start, unsigned long end, return err; } - if (is_vm_hugetlb_page(vma)) { + if (is_hugetlb) { if (ops->hugetlb_entry) err = walk_hugetlb_range(start, end, walk); } else @@ -380,47 +444,14 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma, #endif } -/** - * walk_page_range - walk page table with caller specific callbacks - * @mm: mm_struct representing the target process of page table walk - * @start: start address of the virtual address range - * @end: end address of the virtual address range - * @ops: operation to call during the walk - * @private: private data for callbacks' usage - * - * Recursively walk the page table tree of the process represented by @mm - * within the virtual address range [@start, @end). During walking, we can do - * some caller-specific works for each entry, by setting up pmd_entry(), - * pte_entry(), and/or hugetlb_entry(). If you don't set up for some of these - * callbacks, the associated entries/pages are just ignored. 
- * The return values of these callbacks are commonly defined like below: - * - * - 0 : succeeded to handle the current entry, and if you don't reach the - * end address yet, continue to walk. - * - >0 : succeeded to handle the current entry, and return to the caller - * with caller specific value. - * - <0 : failed to handle the current entry, and return to the caller - * with error code. - * - * Before starting to walk page table, some callers want to check whether - * they really want to walk over the current vma, typically by checking - * its vm_flags. walk_page_test() and @ops->test_walk() are used for this - * purpose. - * - * If operations need to be staged before and committed after a vma is walked, - * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(), - * since it is intended to handle commit-type operations, can't return any - * errors. - * - * struct mm_walk keeps current values of some common data like vma and pmd, - * which are useful for the access from callbacks. If you want to pass some - * caller-specific data to callbacks, @private should be helpful. +/* + * See the comment for walk_page_range(), this performs the heavy lifting of the + * operation, only sets no restrictions on how the walk proceeds. * - * Locking: - * Callers of walk_page_range() and walk_page_vma() should hold @mm->mmap_lock, - * because these function traverse vma list and/or access to vma's data. + * We usually restrict the ability to install PTEs, but this functionality is + * available to internal memory management code and provided in mm/internal.h. */ -int walk_page_range(struct mm_struct *mm, unsigned long start, +int walk_page_range_mm(struct mm_struct *mm, unsigned long start, unsigned long end, const struct mm_walk_ops *ops, void *private) { @@ -479,6 +510,80 @@ int walk_page_range(struct mm_struct *mm, unsigned long start, return err; } +/* + * Determine if the walk operations specified are permitted to be used for a + * page table walk. + * + * This check is performed on all functions which are parameterised by walk + * operations and exposed in include/linux/pagewalk.h. + * + * Internal memory management code can use the walk_page_range_mm() function to + * be able to use all page walking operations. + */ +static bool check_ops_valid(const struct mm_walk_ops *ops) +{ + /* + * The installation of PTEs is solely under the control of memory + * management logic and subject to many subtle locking, security and + * cache considerations so we cannot permit other users to do so, and + * certainly not for exported symbols. + */ + if (ops->install_pte) + return false; + + return true; +} + +/** + * walk_page_range - walk page table with caller specific callbacks + * @mm: mm_struct representing the target process of page table walk + * @start: start address of the virtual address range + * @end: end address of the virtual address range + * @ops: operation to call during the walk + * @private: private data for callbacks' usage + * + * Recursively walk the page table tree of the process represented by @mm + * within the virtual address range [@start, @end). During walking, we can do + * some caller-specific works for each entry, by setting up pmd_entry(), + * pte_entry(), and/or hugetlb_entry(). If you don't set up for some of these + * callbacks, the associated entries/pages are just ignored. 
+ * The return values of these callbacks are commonly defined like below: + * + * - 0 : succeeded to handle the current entry, and if you don't reach the + * end address yet, continue to walk. + * - >0 : succeeded to handle the current entry, and return to the caller + * with caller specific value. + * - <0 : failed to handle the current entry, and return to the caller + * with error code. + * + * Before starting to walk page table, some callers want to check whether + * they really want to walk over the current vma, typically by checking + * its vm_flags. walk_page_test() and @ops->test_walk() are used for this + * purpose. + * + * If operations need to be staged before and committed after a vma is walked, + * there are two callbacks, pre_vma() and post_vma(). Note that post_vma(), + * since it is intended to handle commit-type operations, can't return any + * errors. + * + * struct mm_walk keeps current values of some common data like vma and pmd, + * which are useful for the access from callbacks. If you want to pass some + * caller-specific data to callbacks, @private should be helpful. + * + * Locking: + * Callers of walk_page_range() and walk_page_vma() should hold @mm->mmap_lock, + * because these function traverse vma list and/or access to vma's data. + */ +int walk_page_range(struct mm_struct *mm, unsigned long start, + unsigned long end, const struct mm_walk_ops *ops, + void *private) +{ + if (!check_ops_valid(ops)) + return -EINVAL; + + return walk_page_range_mm(mm, start, end, ops, private); +} + /** * walk_page_range_novma - walk a range of pagetables not backed by a vma * @mm: mm_struct representing the target process of page table walk @@ -494,7 +599,7 @@ int walk_page_range(struct mm_struct *mm, unsigned long start, * walking the kernel pages tables or page tables for firmware. * * Note: Be careful to walk the kernel pages tables, the caller may be need to - * take other effective approache (mmap lock may be insufficient) to prevent + * take other effective approaches (mmap lock may be insufficient) to prevent * the intermediate kernel page tables belonging to the specified address range * from being freed (e.g. memory hot-remove). 
*/ @@ -513,6 +618,8 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start, if (start >= end || !walk.mm) return -EINVAL; + if (!check_ops_valid(ops)) + return -EINVAL; /* * 1) For walking the user virtual address space: @@ -556,6 +663,8 @@ int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start, return -EINVAL; if (start < vma->vm_start || end > vma->vm_end) return -EINVAL; + if (!check_ops_valid(ops)) + return -EINVAL; process_mm_walk_lock(walk.mm, ops->walk_lock); process_vma_walk_lock(vma, ops->walk_lock); @@ -574,6 +683,8 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops, if (!walk.mm) return -EINVAL; + if (!check_ops_valid(ops)) + return -EINVAL; process_mm_walk_lock(walk.mm, ops->walk_lock); process_vma_walk_lock(vma, ops->walk_lock); @@ -623,6 +734,9 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index, unsigned long start_addr, end_addr; int err = 0; + if (!check_ops_valid(ops)) + return -EINVAL; + lockdep_assert_held(&mapping->i_mmap_rwsem); vma_interval_tree_foreach(vma, &mapping->i_mmap, first_index, first_index + nr - 1) {
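
To make the install_pte() contract concrete, a minimal sketch of a hypothetical internal-mm caller follows. This is not part of the patch: the function and variable names are invented, and it assumes the make_guard_swp_entry() helper added later in this series. Note that the callback only writes the desired PTE value through the supplied pointer; the walker itself performs set_pte_at() and update_mmu_cache().

#include <linux/pagewalk.h>
#include <linux/swapops.h>

#include "internal.h"	/* walk_page_range_mm() */

/* Fill every currently-empty PTE in the range with a marker entry. */
static int example_install_pte(unsigned long addr, unsigned long next,
			       pte_t *ptep, struct mm_walk *walk)
{
	/* Hand the desired PTE value back to the walker. */
	*ptep = swp_entry_to_pte(make_guard_swp_entry());
	return 0;
}

static const struct mm_walk_ops example_ops = {
	.install_pte	= example_install_pte,
	.walk_lock	= PGWALK_WRLOCK,
};

/* Internal mm code only - install_pte is rejected by the exported API. */
static int example_populate(struct mm_struct *mm, unsigned long start,
			    unsigned long end)
{
	return walk_page_range_mm(mm, start, end, &example_ops, NULL);
}

Because example_ops has no pte_entry()/pmd_entry() handlers, any huge pages encountered in the range are left unsplit, per the splitting policy described in the commit message above.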
From patchwork Mon Oct 28 14:13:28 2024
X-Patchwork-Submitter: Lorenzo Stoakes
X-Patchwork-Id: 839246
From: Lorenzo Stoakes
To: Andrew Morton
Cc: Suren Baghdasaryan, "Liam R. Howlett", Matthew Wilcox, Vlastimil Babka,
 "Paul E. McKenney", Jann Horn, David Hildenbrand, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, Muchun Song, Richard Henderson, Matt Turner,
 Thomas Bogendoerfer, "James E. J. Bottomley", Helge Deller, Chris Zankel,
 Max Filippov, Arnd Bergmann, linux-alpha@vger.kernel.org,
 linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org,
 linux-arch@vger.kernel.org, Shuah Khan, Christian Brauner,
 linux-kselftest@vger.kernel.org, Sidhartha Kumar, Jeff Xu,
 Christoph Hellwig, linux-api@vger.kernel.org, John Hubbard
Subject: [PATCH v4 2/5] mm: add PTE_MARKER_GUARD PTE marker
Date: Mon, 28 Oct 2024 14:13:28 +0000
Add a new PTE marker that results in any access causing the accessing process to segfault. This is preferable to PTE_MARKER_POISONED, which results in the same handling as hardware poisoned memory, and is thus undesirable for cases where we simply wish to 'soft' poison a range.

This is in preparation for implementing the ability to specify guard pages at the page table level, i.e. ranges that, when accessed, should cause process termination.

Additionally, rename zap_drop_file_uffd_wp() to zap_drop_markers() - the function checks the ZAP_FLAG_DROP_MARKER flag, so naming it for this single purpose was simply incorrect. We then reuse the same logic to determine whether a zap should clear a guard entry - this should only be performed on teardown and never on MADV_DONTNEED or MADV_FREE.

We additionally add a WARN_ON_ONCE() in hugetlb logic should a guard marker be encountered there, as we explicitly do not support this operation and it should not occur.

Acked-by: Vlastimil Babka
Signed-off-by: Lorenzo Stoakes
---
 include/linux/mm_inline.h |  2 +-
 include/linux/swapops.h   | 24 +++++++++++++++++++++++-
 mm/hugetlb.c              |  4 ++++
 mm/memory.c               | 18 +++++++++++++++---
 mm/mprotect.c             |  6 ++++--
 5 files changed, 47 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h index 355cf46a01a6..1b6a917fffa4 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h @@ -544,7 +544,7 @@ static inline pte_marker copy_pte_marker( { pte_marker srcm = pte_marker_get(entry); /* Always copy error entries. */ - pte_marker dstm = srcm & PTE_MARKER_POISONED; + pte_marker dstm = srcm & (PTE_MARKER_POISONED | PTE_MARKER_GUARD); /* Only copy PTE markers if UFFD register matches. */ if ((srcm & PTE_MARKER_UFFD_WP) && userfaultfd_wp(dst_vma)) diff --git a/include/linux/swapops.h b/include/linux/swapops.h index cb468e418ea1..96f26e29fefe 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -426,9 +426,19 @@ typedef unsigned long pte_marker; * "Poisoned" here is meant in the very general sense of "future accesses are * invalid", instead of referring very specifically to hardware memory errors. * This marker is meant to represent any of various different causes of this. + * + * Note that, when encountered by the faulting logic, PTEs with this marker will + * result in VM_FAULT_HWPOISON and thus regardless trigger hardware memory error + * logic. */ #define PTE_MARKER_POISONED BIT(1) -#define PTE_MARKER_MASK (BIT(2) - 1) +/* + * Indicates that, on fault, this PTE will cause a SIGSEGV signal to be + * sent. This means guard markers behave in effect as if the region were mapped + * PROT_NONE, rather than if they were a memory hole or equivalent.
+ */ +#define PTE_MARKER_GUARD BIT(2) +#define PTE_MARKER_MASK (BIT(3) - 1) static inline swp_entry_t make_pte_marker_entry(pte_marker marker) { @@ -464,6 +474,18 @@ static inline int is_poisoned_swp_entry(swp_entry_t entry) { return is_pte_marker_entry(entry) && (pte_marker_get(entry) & PTE_MARKER_POISONED); + +} + +static inline swp_entry_t make_guard_swp_entry(void) +{ + return make_pte_marker_entry(PTE_MARKER_GUARD); +} + +static inline int is_guard_swp_entry(swp_entry_t entry) +{ + return is_pte_marker_entry(entry) && + (pte_marker_get(entry) & PTE_MARKER_GUARD); } /* diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 906294ac85dc..2c8c5da0f5d3 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -6353,6 +6353,10 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, ret = VM_FAULT_HWPOISON_LARGE | VM_FAULT_SET_HINDEX(hstate_index(h)); goto out_mutex; + } else if (WARN_ON_ONCE(marker & PTE_MARKER_GUARD)) { + /* This isn't supported in hugetlb. */ + ret = VM_FAULT_SIGSEGV; + goto out_mutex; } } diff --git a/mm/memory.c b/mm/memory.c index 2d32023d4eb8..75c2dfd04f72 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1455,7 +1455,7 @@ static inline bool should_zap_folio(struct zap_details *details, return !folio_test_anon(folio); } -static inline bool zap_drop_file_uffd_wp(struct zap_details *details) +static inline bool zap_drop_markers(struct zap_details *details) { if (!details) return false; @@ -1476,7 +1476,7 @@ zap_install_uffd_wp_if_needed(struct vm_area_struct *vma, if (vma_is_anonymous(vma)) return; - if (zap_drop_file_uffd_wp(details)) + if (zap_drop_markers(details)) return; for (;;) { @@ -1671,7 +1671,15 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, * drop the marker if explicitly requested. */ if (!vma_is_anonymous(vma) && - !zap_drop_file_uffd_wp(details)) + !zap_drop_markers(details)) + continue; + } else if (is_guard_swp_entry(entry)) { + /* + * Ordinary zapping should not remove guard PTE + * markers. Only do so if we should remove PTE markers + * in general. + */ + if (!zap_drop_markers(details)) continue; } else if (is_hwpoison_entry(entry) || is_poisoned_swp_entry(entry)) { @@ -4003,6 +4011,10 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf) if (marker & PTE_MARKER_POISONED) return VM_FAULT_HWPOISON; + /* Hitting a guard page is always a fatal condition. */ + if (marker & PTE_MARKER_GUARD) + return VM_FAULT_SIGSEGV; + if (pte_marker_entry_uffd_wp(entry)) return pte_marker_handle_uffd_wp(vmf); diff --git a/mm/mprotect.c b/mm/mprotect.c index 6f450af3252e..516b1d847e2c 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -236,9 +236,11 @@ static long change_pte_range(struct mmu_gather *tlb, } else if (is_pte_marker_entry(entry)) { /* * Ignore error swap entries unconditionally, - * because any access should sigbus anyway. + * because any access should sigbus/sigsegv + * anyway. 
*/ - if (is_poisoned_swp_entry(entry)) + if (is_poisoned_swp_entry(entry) || + is_guard_swp_entry(entry)) continue; /* * If this is uffd-wp pte marker and we'd like
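
For orientation, a minimal sketch follows of how fault-path code can distinguish the new guard marker from a poison marker using the helpers this patch adds. The function name is invented for illustration; handle_pte_marker() in the mm/memory.c hunk above is the real consumer of this logic.

#include <linux/mm.h>
#include <linux/swapops.h>

static vm_fault_t example_classify_marker(pte_t pte)
{
	swp_entry_t entry = pte_to_swp_entry(pte);
	pte_marker marker;

	if (!is_pte_marker_entry(entry))
		return 0;
	marker = pte_marker_get(entry);

	if (marker & PTE_MARKER_POISONED)
		return VM_FAULT_HWPOISON;	/* handled as hardware poison */
	if (marker & PTE_MARKER_GUARD)
		return VM_FAULT_SIGSEGV;	/* guard: always a fatal signal */
	return 0;
}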
Received: from nam12-bn8-obe.outbound.protection.outlook.com (mail-bn8nam12lp2168.outbound.protection.outlook.com [104.47.55.168]) by phxpaimrmta03.imrmtpd1.prodappphxaev1.oraclevcn.com (PPS) with ESMTPS id 42jb2ska3p-2 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Mon, 28 Oct 2024 14:13:53 +0000 ARC-Seal: i=1; a=rsa-sha256; s=arcselector10001; d=microsoft.com; cv=none; b=M3YwNxeQknatxUbbi1v47HAiKrptPOZHDR1uWT07Rttp8xUvlm3nroLD2M1KUEybUwURb0zuoeyNAHEvX+U7LtLmTEEVdgDOoKm2JH1An2ffMnHMJObks8BDl8NHtrsA6OOBLa8R++UmhlE2u1BIIGUBbj/kqeuKDzbDqGVkGq+4pdav+NCWaOMtsB3HHIKb4XJ3LKzwVZk0MmZlMJ+n2dSNySCML4gXbqqFzypo3cU/iXIeazHQ0sKwFKDpLggJ99p71JjMpCQB3GPkn+6uTh2+SwmYn8/S+aEbqPEvTZPplACUQ+9j7O3YkYsd8/Bi0RU0XhzvRb/KfeO730RpvQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector10001; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=/a/g8vrRwp3Wqsx+pVfT7qcLVrpT2JgIhmq7e+m43J4=; b=trnfwNEB05f9J6O+E8xI5WEuFmPO8kVB5Rusm4BVq1i4K83wedYOnAAYC6ZHnC/V/DW5u9S/vnVy1MLMRwxXc+NlMXXW62h8KBOrVha80nAPMjh8Z6mODfwV2cieUzphi+VcVsc9GpEX3bjukjOiPz0DnBogoAV8vFl56sqyxH1C267L9j1PCKLQ7AMQT17vy7aZoF0sZxBdxWicJ1mXRIBx7d7QC7uVGci71OaafL8f5wzJi6oTISgCJeQrZLfJ56xn/icxNComiCkZ83l8XPf5h0HYW97Cbrfjkj2qO4dffiBKReh6R1bFDBndO1S/ZDl3h76rVxO10omuNpCF+w== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=oracle.com; dmarc=pass action=none header.from=oracle.com; dkim=pass header.d=oracle.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.onmicrosoft.com; s=selector2-oracle-onmicrosoft-com; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=/a/g8vrRwp3Wqsx+pVfT7qcLVrpT2JgIhmq7e+m43J4=; b=LN/psi/+wKx+7mUVr9T+YtVaz2KTIb0gyXUTCPqeY8Isx0vCiEzIhj3BrDU7EQaHXkjwm93sqty6Az+ASBC3JP3t0lIcZaQSoiql+zabkQnOvduOqYwP7UjvH3s5+wK20y2myJzuY7opiTX19+SfZvFfu3/0rHaobR+OXTgSefI= Received: from BYAPR10MB3366.namprd10.prod.outlook.com (2603:10b6:a03:14f::25) by SJ2PR10MB7598.namprd10.prod.outlook.com (2603:10b6:a03:540::22) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.8093.24; Mon, 28 Oct 2024 14:13:46 +0000 Received: from BYAPR10MB3366.namprd10.prod.outlook.com ([fe80::baf2:dff1:d471:1c9]) by BYAPR10MB3366.namprd10.prod.outlook.com ([fe80::baf2:dff1:d471:1c9%6]) with mapi id 15.20.8093.024; Mon, 28 Oct 2024 14:13:46 +0000 From: Lorenzo Stoakes To: Andrew Morton Cc: Suren Baghdasaryan , "Liam R . Howlett" , Matthew Wilcox , Vlastimil Babka , "Paul E . McKenney" , Jann Horn , David Hildenbrand , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Muchun Song , Richard Henderson , Matt Turner , Thomas Bogendoerfer , "James E . J . 
Bottomley" , Helge Deller , Chris Zankel , Max Filippov , Arnd Bergmann , linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, linux-arch@vger.kernel.org, Shuah Khan , Christian Brauner , linux-kselftest@vger.kernel.org, Sidhartha Kumar , Jeff Xu , Christoph Hellwig , linux-api@vger.kernel.org, John Hubbard Subject: [PATCH v4 3/5] mm: madvise: implement lightweight guard page mechanism Date: Mon, 28 Oct 2024 14:13:29 +0000 Message-ID: <6aafb5821bf209f277dfae0787abb2ef87a37542.1730123433.git.lorenzo.stoakes@oracle.com> X-Mailer: git-send-email 2.47.0 In-Reply-To: References: X-ClientProxiedBy: LNXP123CA0006.GBRP123.PROD.OUTLOOK.COM (2603:10a6:600:d2::18) To BYAPR10MB3366.namprd10.prod.outlook.com (2603:10b6:a03:14f::25) Precedence: bulk X-Mailing-List: linux-kselftest@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-MS-PublicTrafficType: Email X-MS-TrafficTypeDiagnostic: BYAPR10MB3366:EE_|SJ2PR10MB7598:EE_ X-MS-Office365-Filtering-Correlation-Id: 29006ab5-2040-46ee-3f9c-08dcf75abc7b X-MS-Exchange-SenderADCheck: 1 X-MS-Exchange-AntiSpam-Relay: 0 X-Microsoft-Antispam: BCL:0;ARA:13230040|1800799024|366016|7416014|376014; X-Microsoft-Antispam-Message-Info: GFKJeSLIqDJNKzNWBdwfKnsyLVYTztLce0YrUKBTRh4Jr189G0I77EriBPmBj5CZGzu4yql4Bn7fNhMraVJeZnP8RimZqj+4XXhAAQb1iLMIhXyPs7sSvjyC1G9dOpOY7SOjBstpMSfGMhKj6mDVlc06OBuIjJz8nXQtNZbn2dpzsdyDwvksXbZzGQ648mGUWQGw/IUli2XRK5ILbXW8QZfzwBJ7coHhbkoElenCMTg+cstg9mmFavMuCzEwFH+UP3AOVapYTsLimfBWNMIA2/cobJelqgBbWNKnychgj5ulxzcJfMa3wRnJvahRdVjxbh4ixpW6oJVvWQ1ve6tW7BsFrKP41exB6uQGxgp69FHwEDJIBf1kxYoHtGWXkK8gpOEZ0ORnGG5qy9pUQVShXsPvz7UBOj1kvZCna5kO4FFs7Sa+gWDXXNUEWomK20egsBWUReGRS8CF/K4ZkgooXx3uekBY5rM61vukfAMO0alZVjQz9VNpBzCazcoxg+1YBo7a0x0RHd2rGo4sBshunvP1A2tGfyqNJVwHRS21S9DJjB6c3+phb4d0LSR+H2MPIdhopEFQn9GEaJ/RxrkM441xL/GhvNfArgt297rHuebfy48dXDdnn7dYQjrhsaYZIjjX+zrzww37eX04UdHWWkNdO0gff8ZDb480ohWfZxcR9o3+4Vj//bnJ3AufzknvYq9SLed8c6Tlqrz3RWJ3yrnsx0avTpaQD1VyqLUffU2RIuWXnfaZhNRGEhiRBbqwSjw2ePOqGnpxxJVPN4VwtiQ+Xtg592I+qo0jAR35ynZbuGacFwgneMmQ4DLKWcNXiq952ZTbgeW1PGayGrGGZNEGBo1twI8BcjTS900wA40Gay0+FxcHdiuoW33bQz7HNV5lR7/5DgrslCF4bC2DVvn92u0+JUM8MFWPYGuJ93hTu/sYlw7uW0PgSrfwI2qFFNhc/3wgfRjO0WT6PYAOnf3OAR2eih2LdQfEr+CsqqOWCBvG/ppJjIGTBWa2MXgYjdEszumLvHHjnXmiQv/jIMXsGSFUMffJifCLgvdk+MEl4OWX3XbmQFIV/H4VDAS1fTpGPy3V0UVdssQFWxw/CXcTmxiUCqLiI5k5ka5zEyHPlN6qa6BuSsAAkQtSp0VOGbyXHDF5SY93D6fkfZtWGl0kDH8OD8tvvDngTZ+gpNmYpkeyC1lPf1z/OpWqH9Ky/+L/zagHnYg3fIwPiqD/gGw4cIl8gRCSjP5cRbbZJtdtFq1VCjacp1Uj+dDAxs3ulBPT5qv/OH+gNrSW72RZMJfdnAYb+D4h8A8pfdPKrDS8IrhvN5q3r9YX1JfTVYYK X-Forefront-Antispam-Report: CIP:255.255.255.255; CTRY:; LANG:en; SCL:1; SRV:; IPV:NLI; SFV:NSPM; H:BYAPR10MB3366.namprd10.prod.outlook.com; PTR:; CAT:NONE; SFS:(13230040)(1800799024)(366016)(7416014)(376014); DIR:OUT; SFP:1101; X-MS-Exchange-AntiSpam-MessageData-ChunkCount: 1 X-MS-Exchange-AntiSpam-MessageData-0: 
Implement a new lightweight guard page feature: regions of userland virtual
memory that, when accessed, cause a fatal signal to be raised.

Currently users must establish PROT_NONE ranges to achieve this. However this
is very costly memory-wise - we need a VMA for each and every one of these
regions AND they become unmergeable with surrounding VMAs. In addition,
repeated mmap() calls require repeated kernel context switches and contention
on the mmap lock to install these ranges, potentially also having to unmap
memory if installed over existing ranges.

The lightweight guard approach eliminates the VMA cost altogether - rather
than establishing a PROT_NONE VMA, it operates at the level of page table
entries, establishing PTE markers such that accesses to them cause a fault
followed by a SIGSEGV signal being raised.

This is achieved through the PTE marker mechanism, which we have already
extended to provide PTE_MARKER_GUARD, and which we install via the generic
page walking logic, also extended for this purpose.

These guard ranges are established with MADV_GUARD_INSTALL. If the range in
which they are installed contains any existing mappings, those mappings will
be zapped, i.e. the range is freed and the memory unmapped (thus mimicking
the behaviour of MADV_DONTNEED in this respect). Any existing guard entries
will be left untouched, so there is no nesting of guarded pages.

Guarded ranges are NOT cleared by MADV_DONTNEED nor MADV_FREE (in both
instances the memory range may be reused, at which point a user would expect
guards to still be in place); they are cleared only via MADV_GUARD_REMOVE,
process teardown or unmapping of memory ranges.

The guard property can be removed from ranges via MADV_GUARD_REMOVE. Should
the ranges over which this is applied contain non-guard entries, those
entries will be untouched, with only guard entries being cleared.

We permit this operation on anonymous memory only, and only on VMAs which are
non-special, non-huge and not mlock()'d (if we permitted the latter, we would
have to drop locked pages, which would be rather counterintuitive).

Racing page faults can cause an attempt to install guard pages to be
interrupted, resulting in a zap, after which the whole process repeats. If
this happens more often than would be expected in normal operation, we
rescind locks and retry the entire operation, which avoids lock contention in
this scenario.
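To make the intended usage concrete, here is a minimal userland sketch. It is
illustrative only, not part of this patch: error handling is abbreviated, and
the MADV_GUARD_* values are defined locally on the assumption that installed
uapi headers do not yet carry them (the values match those established below).

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* As established by this patch; define locally if headers lack them. */
#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102
#define MADV_GUARD_REMOVE 103
#endif

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	/* An anonymous, non-special mapping - eligible for guard markers. */
	char *buf = mmap(NULL, 10 * psz, PROT_READ | PROT_WRITE,
			 MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Guard the final page - no VMA split, only PTE markers are set. */
	if (madvise(buf + 9 * psz, psz, MADV_GUARD_INSTALL)) {
		perror("madvise");
		return 1;
	}

	memset(buf, 'x', 9 * psz);	/* unguarded pages work as usual... */
	/* buf[9 * psz] = 'x';		   ...but this would raise SIGSEGV */

	/* Lift the guard - the page becomes usable again. */
	madvise(buf + 9 * psz, psz, MADV_GUARD_REMOVE);
	buf[9 * psz] = 'x';

	munmap(buf, 10 * psz);
	return 0;
}

Compared to mapping the final page PROT_NONE, this costs no additional VMA,
and the remaining pages stay mergeable with their neighbours.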
Suggested-by: Vlastimil Babka
Suggested-by: Jann Horn
Suggested-by: David Hildenbrand
Signed-off-by: Lorenzo Stoakes
---
 arch/alpha/include/uapi/asm/mman.h     |   3 +
 arch/mips/include/uapi/asm/mman.h      |   3 +
 arch/parisc/include/uapi/asm/mman.h    |   3 +
 arch/xtensa/include/uapi/asm/mman.h    |   3 +
 include/uapi/asm-generic/mman-common.h |   3 +
 mm/madvise.c                           | 239 +++++++++++++++++++++++++
 mm/mseal.c                             |   1 +
 7 files changed, 255 insertions(+)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 763929e814e9..1e700468a685 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -78,6 +78,9 @@
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_GUARD_INSTALL 102		/* fatal signal on access to range */
+#define MADV_GUARD_REMOVE 103		/* unguard range */
+
 /* compatibility flags */
 #define MAP_FILE	0

diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 9c48d9a21aa0..b700dae28c48 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -105,6 +105,9 @@
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_GUARD_INSTALL 102		/* fatal signal on access to range */
+#define MADV_GUARD_REMOVE 103		/* unguard range */
+
 /* compatibility flags */
 #define MAP_FILE	0

diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 68c44f99bc93..b6a709506987 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -75,6 +75,9 @@
 #define MADV_HWPOISON	100		/* poison a page for testing */
 #define MADV_SOFT_OFFLINE 101		/* soft offline page for testing */
 
+#define MADV_GUARD_INSTALL 102		/* fatal signal on access to range */
+#define MADV_GUARD_REMOVE 103		/* unguard range */
+
 /* compatibility flags */
 #define MAP_FILE	0

diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 1ff0c858544f..99d4ccee7f6e 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -113,6 +113,9 @@
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_GUARD_INSTALL 102		/* fatal signal on access to range */
+#define MADV_GUARD_REMOVE 103		/* unguard range */
+
 /* compatibility flags */
 #define MAP_FILE	0

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..1ea2c4c33b86 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -79,6 +79,9 @@
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_GUARD_INSTALL 102	/* fatal signal on access to range */
+#define MADV_GUARD_REMOVE 103	/* unguard range */
+
 /* compatibility flags */
 #define MAP_FILE	0

diff --git a/mm/madvise.c b/mm/madvise.c
index e871a72a6c32..0ceae57da7da 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -37,6 +37,12 @@
 #include "internal.h"
 #include "swap.h"
 
+/*
+ * Maximum number of attempts we make to install guard pages before we give up
+ * and return -ERESTARTNOINTR to have userspace try again.
+ */
+#define MAX_MADVISE_GUARD_RETRIES 3
+
 struct madvise_walk_private {
	struct mmu_gather *tlb;
	bool pageout;
@@ -60,6 +66,8 @@ static int madvise_need_mmap_write(int behavior)
	case MADV_POPULATE_READ:
	case MADV_POPULATE_WRITE:
	case MADV_COLLAPSE:
+	case MADV_GUARD_INSTALL:
+	case MADV_GUARD_REMOVE:
		return 0;
	default:
		/* be safe, default to 1. list exceptions explicitly */
		return 1;
	}
}
@@ -1017,6 +1025,214 @@ static long madvise_remove(struct vm_area_struct *vma,
	return error;
 }
 
+static bool is_valid_guard_vma(struct vm_area_struct *vma, bool allow_locked)
+{
+	vm_flags_t disallowed = VM_SPECIAL | VM_HUGETLB;
+
+	/*
+	 * A user could lock after setting a guard range but that's fine, as
+	 * they'd not be able to fault in. The issue arises when we try to zap
+	 * existing locked VMAs. We don't want to do that.
+	 */
+	if (!allow_locked)
+		disallowed |= VM_LOCKED;
+
+	if (!vma_is_anonymous(vma))
+		return false;
+
+	if ((vma->vm_flags & (VM_MAYWRITE | disallowed)) != VM_MAYWRITE)
+		return false;
+
+	return true;
+}
+
+static bool is_guard_pte_marker(pte_t ptent)
+{
+	return is_pte_marker(ptent) &&
+		is_guard_swp_entry(pte_to_swp_entry(ptent));
+}
+
+static int guard_install_pud_entry(pud_t *pud, unsigned long addr,
+				   unsigned long next, struct mm_walk *walk)
+{
+	pud_t pudval = pudp_get(pud);
+
+	/* If huge return >0 so we abort the operation + zap. */
+	return pud_trans_huge(pudval) || pud_devmap(pudval);
+}
+
+static int guard_install_pmd_entry(pmd_t *pmd, unsigned long addr,
+				   unsigned long next, struct mm_walk *walk)
+{
+	pmd_t pmdval = pmdp_get(pmd);
+
+	/* If huge return >0 so we abort the operation + zap. */
+	return pmd_trans_huge(pmdval) || pmd_devmap(pmdval);
+}
+
+static int guard_install_pte_entry(pte_t *pte, unsigned long addr,
+				   unsigned long next, struct mm_walk *walk)
+{
+	pte_t pteval = ptep_get(pte);
+	unsigned long *nr_pages = (unsigned long *)walk->private;
+
+	/* If there is already a guard page marker, we have nothing to do. */
+	if (is_guard_pte_marker(pteval)) {
+		(*nr_pages)++;
+
+		return 0;
+	}
+
+	/* If populated return >0 so we abort the operation + zap. */
+	return 1;
+}
+
+static int guard_install_set_pte(unsigned long addr, unsigned long next,
+				 pte_t *ptep, struct mm_walk *walk)
+{
+	unsigned long *nr_pages = (unsigned long *)walk->private;
+
+	/* Simply install a PTE marker, this causes segfault on access. */
+	*ptep = make_pte_marker(PTE_MARKER_GUARD);
+	(*nr_pages)++;
+
+	return 0;
+}
+
+static const struct mm_walk_ops guard_install_walk_ops = {
+	.pud_entry = guard_install_pud_entry,
+	.pmd_entry = guard_install_pmd_entry,
+	.pte_entry = guard_install_pte_entry,
+	.install_pte = guard_install_set_pte,
+	.walk_lock = PGWALK_RDLOCK,
+};
+
+static long madvise_guard_install(struct vm_area_struct *vma,
+				  struct vm_area_struct **prev,
+				  unsigned long start, unsigned long end)
+{
+	long err;
+	int i;
+
+	*prev = vma;
+	if (!is_valid_guard_vma(vma, /* allow_locked = */false))
+		return -EINVAL;
+
+	/*
+	 * If we install guard markers, then the range is no longer
+	 * empty from a page table perspective and therefore it's
+	 * appropriate to have an anon_vma.
+	 *
+	 * This ensures that on fork, we copy page tables correctly.
+	 */
+	err = anon_vma_prepare(vma);
+	if (err)
+		return err;
+
+	/*
+	 * Optimistically try to install the guard marker pages first. If any
+	 * non-guard pages are encountered, give up and zap the range before
+	 * trying again.
+	 *
+	 * We try a few times before giving up and releasing back to userland
+	 * to loop around, releasing locks in the process to avoid contention.
+	 * This would only happen if there were a great many racing page
+	 * faults.
+	 *
+	 * In most cases we should simply install the guard markers immediately
+	 * with no zap or looping.
+	 */
+	for (i = 0; i < MAX_MADVISE_GUARD_RETRIES; i++) {
+		unsigned long nr_pages = 0;
+
+		/* Returns < 0 on error, == 0 if success, > 0 if zap needed. */
+		err = walk_page_range_mm(vma->vm_mm, start, end,
+					 &guard_install_walk_ops, &nr_pages);
+		if (err < 0)
+			return err;
+
+		if (err == 0) {
+			unsigned long nr_expected_pages = PHYS_PFN(end - start);
+
+			VM_WARN_ON(nr_pages != nr_expected_pages);
+			return 0;
+		}
+
+		/*
+		 * OK, some of the range has non-guard pages mapped: zap
+		 * them. This leaves existing guard pages in place.
+		 */
+		zap_page_range_single(vma, start, end - start, NULL);
+	}
+
+	/*
+	 * We were unable to install the guard pages due to being raced by page
+	 * faults. This should not happen ordinarily. We return to userspace
+	 * and immediately retry, relieving lock contention.
+	 */
+	return restart_syscall();
+}
+
+static int guard_remove_pud_entry(pud_t *pud, unsigned long addr,
+				  unsigned long next, struct mm_walk *walk)
+{
+	pud_t pudval = pudp_get(pud);
+
+	/* If huge, cannot have guard pages present, so no-op - skip. */
+	if (pud_trans_huge(pudval) || pud_devmap(pudval))
+		walk->action = ACTION_CONTINUE;
+
+	return 0;
+}
+
+static int guard_remove_pmd_entry(pmd_t *pmd, unsigned long addr,
+				  unsigned long next, struct mm_walk *walk)
+{
+	pmd_t pmdval = pmdp_get(pmd);
+
+	/* If huge, cannot have guard pages present, so no-op - skip. */
+	if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval))
+		walk->action = ACTION_CONTINUE;
+
+	return 0;
+}
+
+static int guard_remove_pte_entry(pte_t *pte, unsigned long addr,
+				  unsigned long next, struct mm_walk *walk)
+{
+	pte_t ptent = ptep_get(pte);
+
+	if (is_guard_pte_marker(ptent)) {
+		/* Simply clear the PTE marker. */
+		pte_clear_not_present_full(walk->mm, addr, pte, false);
+		update_mmu_cache(walk->vma, addr, pte);
+	}
+
+	return 0;
+}
+
+static const struct mm_walk_ops guard_remove_walk_ops = {
+	.pud_entry = guard_remove_pud_entry,
+	.pmd_entry = guard_remove_pmd_entry,
+	.pte_entry = guard_remove_pte_entry,
+	.walk_lock = PGWALK_RDLOCK,
+};
+
+static long madvise_guard_remove(struct vm_area_struct *vma,
+				 struct vm_area_struct **prev,
+				 unsigned long start, unsigned long end)
+{
+	*prev = vma;
+	/*
+	 * We're ok with removing guards in mlock()'d ranges, as this is a
+	 * non-destructive action.
+	 */
+	if (!is_valid_guard_vma(vma, /* allow_locked = */true))
+		return -EINVAL;
+
+	return walk_page_range(vma->vm_mm, start, end,
+			       &guard_remove_walk_ops, NULL);
+}
+
 /*
  * Apply an madvise behavior to a region of a vma.  madvise_update_vma
  * will handle splitting a vm area into separate areas, each area with its own
@@ -1098,6 +1314,10 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
		break;
	case MADV_COLLAPSE:
		return madvise_collapse(vma, prev, start, end);
+	case MADV_GUARD_INSTALL:
+		return madvise_guard_install(vma, prev, start, end);
+	case MADV_GUARD_REMOVE:
+		return madvise_guard_remove(vma, prev, start, end);
	}
 
	anon_name = anon_vma_name(vma);
@@ -1197,6 +1417,8 @@ madvise_behavior_valid(int behavior)
	case MADV_DODUMP:
	case MADV_WIPEONFORK:
	case MADV_KEEPONFORK:
+	case MADV_GUARD_INSTALL:
+	case MADV_GUARD_REMOVE:
 #ifdef CONFIG_MEMORY_FAILURE
	case MADV_SOFT_OFFLINE:
	case MADV_HWPOISON:
@@ -1490,6 +1712,23 @@ static ssize_t vector_madvise(struct mm_struct *mm, struct iov_iter *iter,
	while (iov_iter_count(iter)) {
		ret = do_madvise(mm, (unsigned long)iter_iov_addr(iter),
				 iter_iov_len(iter), behavior);
+		/*
+		 * An madvise operation is attempting to restart the syscall,
+		 * but we cannot proceed as it would not be correct to repeat
+		 * the operation in aggregate, and would be surprising to the
+		 * user.
+		 *
+		 * As we have already dropped locks, it is safe to just loop
+		 * and try again. We check for fatal signals in case we need
+		 * to exit early anyway.
+		 */
+		if (ret == -ERESTARTNOINTR) {
+			if (fatal_signal_pending(current)) {
+				ret = -EINTR;
+				break;
+			}
+
+			continue;
+		}
		if (ret < 0)
			break;
		iov_iter_advance(iter, iter_iov_len(iter));
diff --git a/mm/mseal.c b/mm/mseal.c
index ece977bd21e1..81d6e980e8a9 100644
--- a/mm/mseal.c
+++ b/mm/mseal.c
@@ -30,6 +30,7 @@ static bool is_madv_discard(int behavior)
	case MADV_REMOVE:
	case MADV_DONTFORK:
	case MADV_WIPEONFORK:
+	case MADV_GUARD_INSTALL:
		return true;
	}
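To illustrate the interaction with the discard operations described above,
here is a hypothetical selftest-style sketch (not part of this patch, again
with locally-defined MADV_GUARD_* values) checking that a guard survives
MADV_DONTNEED but is lifted by MADV_GUARD_REMOVE. Each probe runs in a forked
child so the parent can observe the SIGSEGV without installing a signal
handler:

#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102
#define MADV_GUARD_REMOVE 103
#endif

/* Touch *p in a child; report whether the child died of SIGSEGV. */
static int access_faults(volatile char *p)
{
	int status;
	pid_t pid = fork();

	if (pid == 0) {
		*p = 'x';	/* fatal if the page is guarded */
		_exit(0);
	}
	waitpid(pid, &status, 0);
	return WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
}

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	char *p = mmap(NULL, psz, PROT_READ | PROT_WRITE,
		       MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	madvise(p, psz, MADV_GUARD_INSTALL);
	printf("guarded:            %d\n", access_faults(p));	/* 1 */

	/* MADV_DONTNEED does NOT clear guard markers... */
	madvise(p, psz, MADV_DONTNEED);
	printf("after DONTNEED:     %d\n", access_faults(p));	/* still 1 */

	/* ...only MADV_GUARD_REMOVE (or unmap/teardown) lifts them. */
	madvise(p, psz, MADV_GUARD_REMOVE);
	printf("after GUARD_REMOVE: %d\n", access_faults(p));	/* 0 */

	munmap(p, psz);
	return 0;
}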