[RFC,0/4] API for inhibiting speculative arbitrary read primitives

Message ID 20180103223827.39601-1-mark.rutland@arm.com
Series: API for inhibiting speculative arbitrary read primitives

Message

Mark Rutland Jan. 3, 2018, 10:38 p.m. UTC
Recently, Google Project Zero discovered several classes of attack
against speculative execution. One of these, known as variant-1, allows
explicit bounds checks to be bypassed under speculation, providing an
arbitrary read gadget. Further details can be found on the GPZ blog [1]
and the Documentation patch in this series.
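
Schematically, the gadget being targeted looks roughly like the following
(a sketch; names, sizes, and stride are illustrative rather than taken
from [1]): an attacker-controlled index passes a bounds check, the branch
is mispredicted, and a pair of dependent loads turns the out-of-bounds
value into an observable cache footprint.

	#include <stdint.h>

	/* Schematic variant-1 gadget; sizes and stride are illustrative. */
	uint8_t array1[16];
	unsigned long array1_size = 16;
	uint8_t array2[256 * 512];

	uint8_t gadget(unsigned long attacker_idx)
	{
		if (attacker_idx < array1_size) {		/* may be mispredicted */
			uint8_t secret = array1[attacker_idx];	/* speculative OOB read */
			return array2[secret * 512];		/* leaks secret via the cache */
		}
		return 0;
	}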

There are a number of potential gadgets in the Linux codebase, and
mitigations for these are architecture-specific.

This RFC attempts to provide a cross-architecture API for inhibiting
these primitives. Hopefully, architecture-specific mitigations can be
unified behind this. An arm64 implementation is provided following the
architecturally recommended sequence laid out in the Arm whitepaper [2].
The API is based on a proposed compiler intrinsic [3].
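
As a rough sketch of the intended usage (illustrative only; the real helper
names and signatures are the ones defined in the patches, and
nospec_array_ptr(array, idx, sz) below just stands in for a helper that
returns NULL when idx is out of bounds):

	int foo(int *array, unsigned long idx, unsigned long sz)
	{
		int *ptr;

		/*
		 * Under speculation, ptr is either a bounded element
		 * pointer or NULL, so the dependent load below cannot
		 * be turned into an arbitrary read.
		 */
		if ((ptr = nospec_array_ptr(array, idx, sz)))
			return *ptr;

		return -1;
	}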

I've provided a patch to BPF as an example use of the API. I know that
this is incomplete and less than optimal. I'd appreciate feedback from
other affected architectures as to whether this API is suitable for
their required mitigation.

I've pushed the series to my kernel.org repo [4].

[1] https://googleprojectzero.blogspot.co.uk/2018/01/reading-privileged-memory-with-side.html
[2] https://developer.arm.com/support/security-update
[3] https://developer.arm.com/support/security-update/compiler-support-for-mitigations
[4] git://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git core/nospec

Thanks,
Mark.

Mark Rutland (4):
  asm-generic/barrier: add generic nospec helpers
  Documentation: document nospec helpers
  arm64: implement nospec_{load,ptr}()
  bpf: inhibit speculated out-of-bounds pointers

 Documentation/speculation.txt    | 99 ++++++++++++++++++++++++++++++++++++++++
 arch/arm64/include/asm/barrier.h | 61 +++++++++++++++++++++++++
 include/asm-generic/barrier.h    | 76 ++++++++++++++++++++++++++++++
 kernel/bpf/arraymap.c            | 21 ++++++---
 kernel/bpf/cpumap.c              |  8 ++--
 kernel/bpf/devmap.c              |  6 ++-
 kernel/bpf/sockmap.c             |  6 ++-
 7 files changed, 265 insertions(+), 12 deletions(-)
 create mode 100644 Documentation/speculation.txt

-- 
2.11.0

Comments

Linus Torvalds Jan. 4, 2018, 12:39 a.m. UTC | #1
On Wed, Jan 3, 2018 at 4:15 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> The 'if_nospec' primitive marks locations where the kernel is disabling

> speculative execution that could potentially access privileged data. It

> is expected to be paired with a 'nospec_{ptr,load}' where the user

> controlled value is actually consumed.


I'm much less worried about these "nospec_load/if" macros, than I am
about having a sane way to determine when they should be needed.

Is there such a sane model right now, or are we talking "people will
randomly add these based on strong feelings"?

                    Linus
Alan Cox Jan. 4, 2018, 1:07 a.m. UTC | #2
On Wed, 3 Jan 2018 16:39:31 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, Jan 3, 2018 at 4:15 PM, Dan Williams <dan.j.williams@intel.com> wrote:

> > The 'if_nospec' primitive marks locations where the kernel is disabling

> > speculative execution that could potentially access privileged data. It

> > is expected to be paired with a 'nospec_{ptr,load}' where the user

> > controlled value is actually consumed.  

> 

> I'm much less worried about these "nospec_load/if" macros, than I am

> about having a sane way to determine when they should be needed.

> 

> Is there such a sane model right now, or are we talking "people will

> randomly add these based on strong feelings"?


There are people trying to tune coverity and other tool rules to identify
cases, and some of the work so far was done that way. For x86 we didn't
find too many so far so either the needed pattern is uncommon or ....  8)

Given you can execute over a hundred basic instructions in a speculation
window it does need to be a tool that can explore not just in function
but across functions. That's really tough for the compiler itself to do
without help.

What remains to be seen is if there are other patterns that affect
different processors.

In the longer term the compiler itself needs to know what is and isn't
safe (ie you need to be able to write things like

void foo(tainted __user int *x)

and have the compiler figure out what level of speculation it can do and
(on processors with those features like IA64) when it can and can't do
various kinds of non-trapping loads).

Alan
Dan Williams Jan. 4, 2018, 1:13 a.m. UTC | #3
[ adding Julia and Dan ]

On Wed, Jan 3, 2018 at 5:07 PM, Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote:
> On Wed, 3 Jan 2018 16:39:31 -0800

> Linus Torvalds <torvalds@linux-foundation.org> wrote:

>

>> On Wed, Jan 3, 2018 at 4:15 PM, Dan Williams <dan.j.williams@intel.com> wrote:

>> > The 'if_nospec' primitive marks locations where the kernel is disabling

>> > speculative execution that could potentially access privileged data. It

>> > is expected to be paired with a 'nospec_{ptr,load}' where the user

>> > controlled value is actually consumed.

>>

>> I'm much less worried about these "nospec_load/if" macros, than I am

>> about having a sane way to determine when they should be needed.

>>

>> Is there such a sane model right now, or are we talking "people will

>> randomly add these based on strong feelings"?

>

> There are people trying to tune coverity and other tool rules to identify

> cases, and some of the work so far was done that way. For x86 we didn't

> find too many so far so either the needed pattern is uncommon or ....  8)

>

> Given you can execute over a hundred basic instructions in a speculation

> window it does need to be a tool that can explore not just in function

> but across functions. That's really tough for the compiler itself to do

> without help.

>

> What remains to be seen is if there are other patterns that affect

> different processors.

>

> In the longer term the compiler itself needs to know what is and isn't

> safe (ie you need to be able to write things like

>

> void foo(tainted __user int *x)

>

> and have the compiler figure out what level of speculation it can do and

> (on processors with those features like IA64) when it can and can't do

> various kinds of non-trapping loads.

>


It would be great if coccinelle and/or smatch could be taught to catch
some of these case at least as a first pass "please audit this code
block" type of notification.
Jiri Kosina Jan. 4, 2018, 1:27 a.m. UTC | #4
On Thu, 4 Jan 2018, Alan Cox wrote:

> There are people trying to tune coverity and other tool rules to identify

> cases, 


Yeah, but that (and *especially* Coverity) is so inconvenient to be 
applied to each and every patch ... that this is not the way to go.

If the CPU speculation can cause these kinds of side-effects, it just must 
not happen, full stop. OS trying to work it around is just a whack-a-mole 
(which we can apply for old silicon, sure ... but not something that 
becomes a new standard) that's never going to lead to any ultimate 
solution.

-- 
Jiri Kosina
SUSE Labs
Alan Cox Jan. 4, 2018, 1:41 a.m. UTC | #5
On Thu, 4 Jan 2018 02:27:54 +0100 (CET)
Jiri Kosina <jikos@kernel.org> wrote:

> On Thu, 4 Jan 2018, Alan Cox wrote:

> 

> > There are people trying to tune coverity and other tool rules to identify

> > cases,   

> 

> Yeah, but that (and *especially* Coverity) is so inconvenient to be 

> applied to each and every patch ... that this is not the way to go.


Agreed entirely - and coverity is non-free, which is even worse as a
dependency. Right now we are in the 'what could be done quickly by a few
people' space. The papers are now published, so the world can work on
better solutions and extending more convenient tooling.

> If the CPU speculation can cause these kinds of side-effects, it just must 

> not happen, full stop. 


At which point your performance will resemble that of a 2012 atom
processor at best.

> OS trying to work it around is just a whack-a-mole 

> (which we can apply for old silicon, sure ... but not something that 

> becomes a new standard) that's never going to lead to any ultimate 

> solution.


In the ideal world it becomes possible for future processors to resolve
such markings down to no-ops. Will that be possible or will we get more
explicit ways to tell the processor what is unsafe - I don't
personally know but I do know that turning off speculation is not the
answer.

Clearly if the CPU must be told then C is going to have to grow some
syntax for it and some other languages are going to see 'taint' moving
from a purely software construct to a real processor one.

Alan
Jiri Kosina Jan. 4, 2018, 1:47 a.m. UTC | #6
On Thu, 4 Jan 2018, Alan Cox wrote:

> > If the CPU speculation can cause these kinds of side-effects, it just must 

> > not happen, full stop. 

> 

> At which point your performance will resemble that of a 2012 atom

> processor at best.


You know what? I'd be completely fine with that, if it's traded for "my 
ssh and internet banking keys are JUST MINE, ok?" :)

> Clearly if the CPU must be told then C is going to have to grow some

> syntax for it and some other languages are going to see 'taint' moving

> from a purely software construct to a real processor one.


Again, what is the problem with "yeah, it turned out to speed things up in 
the past, but hey, it has security issues, so we stopped doing it" 
approach?

Thanks,

-- 
Jiri Kosina
SUSE Labs
Dan Williams Jan. 4, 2018, 1:51 a.m. UTC | #7
On Wed, Jan 3, 2018 at 5:27 PM, Jiri Kosina <jikos@kernel.org> wrote:
> On Thu, 4 Jan 2018, Alan Cox wrote:

>

>> There are people trying to tune coverity and other tool rules to identify

>> cases,

>

> Yeah, but that (and *especially* Coverity) is so inconvenient to be

> applied to each and every patch ... that this is not the way to go.

>

> If the CPU speculation can cause these kinds of side-effects, it just must

> not happen, full stop. OS trying to work it around is just a whack-a-mole

> (which we can apply for old silicon, sure ... but not something that

> becomes a new standard) that's never going to lead to any ultimate

> solution.


Speaking from a purely Linux kernel maintenance process perspective we
play whack-a-mole with missed endian conversions and other bugs that
coccinelle, sparse, etc help us catch. So this is in that same
category, but yes, it's inconvenient.

Alan already mentioned potential compiler changes and annotating
variables so that the compiler can help in the future, but until then
we need these explicit checks.

Elena has done the work of auditing static analysis reports to a dozen
or so locations that need some 'nospec' handling.

The point of this RFC is to level set across architectures on the
macros/names/mechanisms that should be used for the first round of
fixes.
Linus Torvalds Jan. 4, 2018, 1:54 a.m. UTC | #8
On Wed, Jan 3, 2018 at 5:51 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>

> Elena has done the work of auditing static analysis reports to a dozen

> or so locations that need some 'nospec' handling.


I'd love to see that patch, just to see how bad things look.

Because I think that really is very relevant to the interface too.

If we're talking "a dozen locations" that are fairly well constrained,
that's very different from having thousands all over the place.

                  Linus
Jiri Kosina Jan. 4, 2018, 1:59 a.m. UTC | #9
On Wed, 3 Jan 2018, Dan Williams wrote:

> Speaking from a purely Linux kernel maintenance process perspective we

> play whack-a-mole with missed endian conversions and other bugs that

> coccinelle, sparse, etc help us catch. 


Fully agreed.

> So this is in that same category, but yes, it's inconvenient.


Disagreed, violently. CPU has to execute the instructions I ask it to 
execute, and if it executes *anything* else that reveals any information 
about the instructions that have *not* been executed, it's flawed.

> Elena has done the work of auditing static analysis reports to a dozen

> or so locations that need some 'nospec' handling.


How exactly is that related (especially in longer-term support terms) to 
BPF anyway?

Thanks,

-- 
Jiri Kosina
SUSE Labs
Alan Cox Jan. 4, 2018, 2:15 a.m. UTC | #10
> Disagreed, violently. CPU has to execute the instructions I ask it to 

> execute, and if it executes *anything* else that reveals any information 

> about the instructions that have *not* been executed, it's flawed.


Then stick to in-order processors. Just don't be in a hurry to get your
computation finished.

> > Elena has done the work of auditing static analysis reports to a dozen

> > or so locations that need some 'nospec' handling.  

> 

> How exactly is that related (especially in longer-term support terms) to 

> BPF anyway?


If you read the papers you need a very specific construct in order to not
only cause a speculative load of an address you choose but also to then
manage to cause a second operation that in some way reveals bits of data
or allows you to ask questions.

BPF allows you to construct those sequences relatively easily and it's
the one case where a user space application can fairly easily place code
it wants to execute in the kernel. Without BPF you have to find the right
construct in the kernel, prime all the right predictions and measure the
result without getting killed off. There are places you can do that but
they are not so easy and we don't (at this point) think there are that
many.

The same situation occurs in user space with interpreters and JITs, hence
the paper talking about javascript. Any JIT with the ability to do timing
is particularly vulnerable to versions of this specific attack because
the attacker gets to create the code pattern rather than have to find it.

Alan
Dan Williams Jan. 4, 2018, 3:10 a.m. UTC | #11
On Wed, 2018-01-03 at 17:54 -0800, Linus Torvalds wrote:
> On Wed, Jan 3, 2018 at 5:51 PM, Dan Williams <dan.j.williams@intel.com> wrote:

> > 

> > Elena has done the work of auditing static analysis reports to a

> > dozen

> > or so locations that need some 'nospec' handling.

> 

> I'd love to see that patch, just to see how bad things look.

> 

> Because I think that really is very relevant to the interface too.

> 

> If we're talking "a dozen locations" that are fairly well

> constrained,

> that's very different from having thousands all over the place.

> 


Note, the following is Elena's work, I'm just helping poke the upstream
discussion along while she's offline.

Elena audited the static analysis reports down to the following
locations where userspace provides a value for a conditional branch and
then later creates a dependent load on that same userspace controlled
value.

'osb()', observable memory barrier, resolves to an lfence on x86. My
concern with these changes is that it is not clear what content within
the branch block is of concern. Peter's 'if_nospec' proposal combined
with Mark's 'nospec_load' would seem to clear that up, from a self
documenting code perspective, and let archs optionally protect entire
conditional blocks or individual loads within those blocks.

Note that these are "a human looked at static analysis reports and
could not rationalize that these are false positives". Specific domain
knowledge about these paths may find that some of them are indeed false
positives.

The change to m_start in kernel/user_namespace.c is interesting because
that's an example where the nospec_load() approach by itself would need
to barrier speculation twice whereas if_nospec can do it once for the
whole block.

[ forgive any whitespace damage ]

8<---
diff --git a/drivers/media/usb/uvc/uvc_v4l2.c b/drivers/media/usb/uvc/uvc_v4l2.c
index 3e7e283a44a8..65175bbe805f 100644
--- a/drivers/media/usb/uvc/uvc_v4l2.c
+++ b/drivers/media/usb/uvc/uvc_v4l2.c
@@ -821,6 +821,7 @@ static int uvc_ioctl_enum_input(struct file *file, void *fh,
 		}
 		pin = iterm->id;
 	} else if (index < selector->bNrInPins) {
+		osb();
 		pin = selector->baSourceID[index];
 		list_for_each_entry(iterm, &chain->entities, chain) {
 			if (!UVC_ENTITY_IS_ITERM(iterm))
diff --git a/drivers/net/wireless/ath/carl9170/main.c b/drivers/net/wireless/ath/carl9170/main.c
index 988c8857d78c..cf267b709af6 100644
--- a/drivers/net/wireless/ath/carl9170/main.c
+++ b/drivers/net/wireless/ath/carl9170/main.c
@@ -1388,6 +1388,7 @@ static int carl9170_op_conf_tx(struct ieee80211_hw *hw,
 
 	mutex_lock(&ar->mutex);
 	if (queue < ar->hw->queues) {
+		osb();
 		memcpy(&ar->edcf[ar9170_qmap[queue]], param, sizeof(*param));
 		ret = carl9170_set_qos(ar);
 	} else {
diff --git a/drivers/net/wireless/intersil/p54/main.c b/drivers/net/wireless/intersil/p54/main.c
index ab6d39e12069..f024f1ad81ff 100644
--- a/drivers/net/wireless/intersil/p54/main.c
+++ b/drivers/net/wireless/intersil/p54/main.c
@@ -415,6 +415,7 @@ static int p54_conf_tx(struct ieee80211_hw *dev,
 
 	mutex_lock(&priv->conf_mutex);
 	if (queue < dev->queues) {
+		osb();
 		P54_SET_QUEUE(priv->qos_params[queue], params->aifs,
 			params->cw_min, params->cw_max, params->txop);
 		ret = p54_set_edcf(priv);
diff --git a/drivers/net/wireless/st/cw1200/sta.c b/drivers/net/wireless/st/cw1200/sta.c
index 38678e9a0562..b4a2a7ba04e8 100644
--- a/drivers/net/wireless/st/cw1200/sta.c
+++ b/drivers/net/wireless/st/cw1200/sta.c
@@ -619,6 +619,7 @@ int cw1200_conf_tx(struct ieee80211_hw *dev, struct ieee80211_vif *vif,
 	mutex_lock(&priv->conf_mutex);
 
 	if (queue < dev->queues) {
+		osb();
 		old_uapsd_flags = le16_to_cpu(priv->uapsd_info.uapsd_flags);
 
 		WSM_TX_QUEUE_SET(&priv->tx_queue_params, queue, 0, 0, 0);
diff --git a/drivers/scsi/qla2xxx/qla_mr.c b/drivers/scsi/qla2xxx/qla_mr.c
index d5da3981cefe..dec8b6e087e3 100644
--- a/drivers/scsi/qla2xxx/qla_mr.c
+++ b/drivers/scsi/qla2xxx/qla_mr.c
@@ -2304,10 +2304,12 @@ qlafx00_status_entry(scsi_qla_host_t *vha, struct rsp_que *rsp, void *pkt)
 	req = ha->req_q_map[que];
 
 	/* Validate handle. */
-	if (handle < req->num_outstanding_cmds)
+	if (handle < req->num_outstanding_cmds) {
+		osb();
 		sp = req->outstanding_cmds[handle];
-	else
+	} else {
 		sp = NULL;
+	}
 
 	if (sp == NULL) {
 		ql_dbg(ql_dbg_io, vha, 0x3034,
@@ -2655,10 +2657,12 @@ qlafx00_multistatus_entry(struct scsi_qla_host *vha,
 		req = ha->req_q_map[que];
 
 		/* Validate handle. */
-		if (handle < req->num_outstanding_cmds)
+		if (handle < req->num_outstanding_cmds) {
+			osb();
 			sp = req->outstanding_cmds[handle];
-		else
+		} else {
 			sp = NULL;
+		}
 
 		if (sp == NULL) {
 			ql_dbg(ql_dbg_io, vha, 0x3044,
diff --git a/drivers/thermal/int340x_thermal/int340x_thermal_zone.c b/drivers/thermal/int340x_thermal/int340x_thermal_zone.c
index 145a5c53ff5c..d732b34e120d 100644
--- a/drivers/thermal/int340x_thermal/int340x_thermal_zone.c
+++ b/drivers/thermal/int340x_thermal/int340x_thermal_zone.c
@@ -57,15 +57,16 @@ static int int340x_thermal_get_trip_temp(struct thermal_zone_device *zone,
 	if (d->override_ops && d->override_ops->get_trip_temp)
 		return d->override_ops->get_trip_temp(zone, trip, temp);
 
-	if (trip < d->aux_trip_nr)
+	if (trip < d->aux_trip_nr) {
+		osb();
 		*temp = d->aux_trips[trip];
-	else if (trip == d->crt_trip_id)
+	} else if (trip == d->crt_trip_id) {
 		*temp = d->crt_temp;
-	else if (trip == d->psv_trip_id)
+	} else if (trip == d->psv_trip_id) {
 		*temp = d->psv_temp;
-	else if (trip == d->hot_trip_id)
+	} else if (trip == d->hot_trip_id) {
 		*temp = d->hot_temp;
-	else {
+	} else {
 		for (i = 0; i < INT340X_THERMAL_MAX_ACT_TRIP_COUNT; i++) {
 			if (d->act_trips[i].valid &&
 			    d->act_trips[i].id == trip) {
diff --git a/fs/udf/misc.c b/fs/udf/misc.c
index 401e64cde1be..c5394760d26b 100644
--- a/fs/udf/misc.c
+++ b/fs/udf/misc.c
@@ -104,6 +104,8 @@ struct genericFormat *udf_add_extendedattr(struct inode *inode, uint32_t size,
 					iinfo->i_lenEAttr) {
 				uint32_t aal =
 					le32_to_cpu(eahd->appAttrLocation);
+
+				osb();
 				memmove(&ea[offset - aal + size],
 					&ea[aal], offset - aal);
 				offset -= aal;
@@ -114,6 +116,8 @@ struct genericFormat *udf_add_extendedattr(struct inode *inode, uint32_t size,
 					iinfo->i_lenEAttr) {
 				uint32_t ial =
 					le32_to_cpu(eahd->impAttrLocation);
+
+				osb();
 				memmove(&ea[offset - ial + size],
 					&ea[ial], offset - ial);
 				offset -= ial;
@@ -125,6 +129,8 @@ struct genericFormat *udf_add_extendedattr(struct inode *inode, uint32_t size,
 					iinfo->i_lenEAttr) {
 				uint32_t aal =
 					le32_to_cpu(eahd->appAttrLocation);
+
+				osb();
 				memmove(&ea[offset - aal + size],
 					&ea[aal], offset - aal);
 				offset -= aal;
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 1c65817673db..dbc12007da51 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -82,8 +82,10 @@ static inline struct file *__fcheck_files(struct files_struct *files, unsigned i
 {
 	struct fdtable *fdt = rcu_dereference_raw(files->fdt);
 
-	if (fd < fdt->max_fds)
+	if (fd < fdt->max_fds) {
+		osb();
 		return rcu_dereference_raw(fdt->fd[fd]);
+	}
 	return NULL;
 }
 
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 246d4d4ce5c7..aa0be8cef2d4 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -648,15 +648,18 @@ static void *m_start(struct seq_file *seq, loff_t *ppos,
 {
 	loff_t pos = *ppos;
 	unsigned extents = map->nr_extents;
-	smp_rmb();
 
-	if (pos >= extents)
-		return NULL;
+	/* paired with smp_wmb in map_write */
+	smp_rmb();
 
-	if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
-		return &map->extent[pos];
+	if (pos < extents) {
+		osb();
+		if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
+			return &map->extent[pos];
+		return &map->forward[pos];
+	}
 
-	return &map->forward[pos];
+	return NULL;
 }
 
 static void *uid_m_start(struct seq_file *seq, loff_t *ppos)
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 125c1eab3eaa..56909c8fa025 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -476,6 +476,7 @@ static int raw_getfrag(void *from, char *to, int offset, int len, int odd,
 	if (offset < rfv->hlen) {
 		int copy = min(rfv->hlen - offset, len);
 
+		osb();
 		if (skb->ip_summed == CHECKSUM_PARTIAL)
 			memcpy(to, rfv->hdr.c + offset, copy);
 		else
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 761a473a07c5..0942784f3f8d 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -729,6 +729,7 @@ static int raw6_getfrag(void *from, char *to, int offset, int len, int odd,
 	if (offset < rfv->hlen) {
 		int copy = min(rfv->hlen - offset, len);
 
+		osb();
 		if (skb->ip_summed == CHECKSUM_PARTIAL)
 			memcpy(to, rfv->c + offset, copy);
 		else
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 8ca9915befc8..7f83abdea255 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -81,6 +81,8 @@ static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
 	if (index < net->mpls.platform_labels) {
 		struct mpls_route __rcu **platform_label =
 			rcu_dereference(net->mpls.platform_label);
+
+		osb();
 		rt = rcu_dereference(platform_label[index]);
 	}
 	return rt;
Alexei Starovoitov Jan. 4, 2018, 3:12 a.m. UTC | #12
On Thu, Jan 04, 2018 at 02:15:53AM +0000, Alan Cox wrote:
> 

> > > Elena has done the work of auditing static analysis reports to a dozen

> > > or so locations that need some 'nospec' handling.  

> > 

> > How exactly is that related (especially in longer-term support terms) to 

> > BPF anyway?

> 

> If you read the papers you need a very specific construct in order to not

> only cause a speculative load of an address you choose but also to then

> manage to cause a second operation that in some way reveals bits of data

> or allows you to ask questions.

> 

> BPF allows you to construct those sequences relatively easily and it's

> the one case where a user space application can fairly easily place code

> it wants to execute in the kernel. Without BPF you have to find the right

> construct in the kernel, prime all the right predictions and measure the

> result without getting killed off. There are places you can do that but

> they are not so easy and we don't (at this point) think there are that

> many.


for BPF in particular we're thinking to do a different fix.
Instead of killing speculation we can let cpu speculate.
The fix will include rounding up bpf maps to nearest power of 2 and
inserting bpf_and operation on the index after bounds check,
so cpu can continue speculate beyond bounds check, but it will
load from zero-ed memory.
So this nospec arch dependent hack won't be necessary for bpf side,
but may still be needed in other parts of the kernel.
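
A self-contained toy sketch of that masking idea (purely illustrative, with
made-up names; this is not the actual BPF change):

	#include <stdint.h>
	#include <stddef.h>

	/*
	 * The value array is rounded up to a power of two and the index is
	 * unconditionally ANDed with (size - 1), so even a mispredicted
	 * bounds check can only load from the zeroed allocation.
	 */
	struct toy_map {
		uint32_t max_entries;
		uint32_t index_mask;	/* roundup_pow_of_two(max_entries) - 1 */
		uint64_t *values;	/* roundup_pow_of_two(max_entries) zeroed slots */
	};

	uint64_t *toy_map_lookup(struct toy_map *m, uint32_t index)
	{
		if (index >= m->max_entries)
			return NULL;
		/* the mask keeps a speculated access inside the allocation */
		return &m->values[index & m->index_mask];
	}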

Also note that variant 1 is talking about exploiting prog_array
bpf feature which had 64-bit access prior to
commit 90caccdd8cc0 ("bpf: fix bpf_tail_call() x64 JIT")
That was a fix for JIT and not related to this cpu issue, but
I believe it breaks the existing exploit.

Since it's not clear whether it's still possible to use bpf
with 32-bit speculation only, we're going to do this rounding fix
for unpriv just in case.
Al Viro Jan. 4, 2018, 4:44 a.m. UTC | #13
On Thu, Jan 04, 2018 at 03:10:51AM +0000, Williams, Dan J wrote:

> diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h

> index 1c65817673db..dbc12007da51 100644

> --- a/include/linux/fdtable.h

> +++ b/include/linux/fdtable.h

> @@ -82,8 +82,10 @@ static inline struct file *__fcheck_files(struct files_struct *files, unsigned i

>  {

>  	struct fdtable *fdt = rcu_dereference_raw(files->fdt);

>  

> -	if (fd < fdt->max_fds)

> +	if (fd < fdt->max_fds) {

> +		osb();

>  		return rcu_dereference_raw(fdt->fd[fd]);

> +	}

>  	return NULL;

>  }


... and the point of that would be?  Possibly revealing the value of files->fdt?
Why would that be a threat, assuming you manage to extract the information in
question in the first place?
Eric W. Biederman Jan. 4, 2018, 5:01 a.m. UTC | #14
"Williams, Dan J" <dan.j.williams@intel.com> writes:



> Note that these are "a human looked at static analysis reports and

> could not rationalize that these are false positives". Specific domain

> knowledge about these paths may find that some of them are indeed false

> positives.

>

> The change to m_start in kernel/user_namespace.c is interesting because

> that's an example where the nospec_load() approach by itself would need

> to barrier speculation twice whereas if_nospec can do it once for the

> whole block.



This user_namespace.c change is very convoluted for what it is trying to
do.  It simplifies to a one liner that just adds osb() after pos >=
extents. AKA:

	if (pos >= extents)
        	return NULL;
+ 	osb();

Is the intent to hide which branch we take based on extents,
after the pos check?

I suspect this implies that using a user namespace and a crafted uid
map you can hit this in stat, on the fast path.

At which point I suspect we will be better off extending struct
user_namespace by a few pointers, so there is no union and remove the
need for blocking speculation entirely.

> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c

> index 246d4d4ce5c7..aa0be8cef2d4 100644

> --- a/kernel/user_namespace.c

> +++ b/kernel/user_namespace.c

> @@ -648,15 +648,18 @@ static void *m_start(struct seq_file *seq, loff_t *ppos,

>  {

>  	loff_t pos = *ppos;

>  	unsigned extents = map->nr_extents;

> -	smp_rmb();

>  

> -	if (pos >= extents)

> -		return NULL;

> +	/* paired with smp_wmb in map_write */

> +	smp_rmb();

>  

> -	if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)

> -		return &map->extent[pos];

> +	if (pos < extents) {

> +		osb();

> +		if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)

> +			return &map->extent[pos];

> +		return &map->forward[pos];

> +	}

>  

> -	return &map->forward[pos];

> +	return NULL;

>  }

>  

>  static void *uid_m_start(struct seq_file *seq, loff_t *ppos)




> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c

> index 8ca9915befc8..7f83abdea255 100644

> --- a/net/mpls/af_mpls.c

> +++ b/net/mpls/af_mpls.c

> @@ -81,6 +81,8 @@ static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)

>  	if (index < net->mpls.platform_labels) {

>  		struct mpls_route __rcu **platform_label =

>  			rcu_dereference(net->mpls.platform_label);

> +

> +		osb();

>  		rt = rcu_dereference(platform_label[index]);

>  	}

>  	return rt;


Ouch!  This adds a barrier in the middle of an rcu lookup, on the
fast path for routing mpls packets.  Which if memory serves will
noticably slow down software processing of mpls packets.

Why does osb() fall after the branch for validity?  So that we allow
speculation up until then? 

I suspect it would be better to have those barriers in the tun/tap
interfaces where userspace can inject packets and thus time them.  Then
the code could still speculate and go fast for remote packets.

Or does the speculation stomping have to be immediately at the place
where we use data from userspace to perform a table lookup?

Eric

p.s. osb() seems to be misspelled. obs() would make much more sense.
Dan Williams Jan. 4, 2018, 5:44 a.m. UTC | #15
On Wed, Jan 3, 2018 at 8:44 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, Jan 04, 2018 at 03:10:51AM +0000, Williams, Dan J wrote:

>

>> diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h

>> index 1c65817673db..dbc12007da51 100644

>> --- a/include/linux/fdtable.h

>> +++ b/include/linux/fdtable.h

>> @@ -82,8 +82,10 @@ static inline struct file *__fcheck_files(struct files_struct *files, unsigned i

>>  {

>>       struct fdtable *fdt = rcu_dereference_raw(files->fdt);

>>

>> -     if (fd < fdt->max_fds)

>> +     if (fd < fdt->max_fds) {

>> +             osb();

>>               return rcu_dereference_raw(fdt->fd[fd]);

>> +     }

>>       return NULL;

>>  }

>

> ... and the point of that would be?  Possibly revealing the value of files->fdt?

> Why would that be a threat, assuming you manage to extract the information in

> question in the first place?


No, the concern is that an fd value >= fdt->max_fds may cause the cpu
to read arbitrary memory addresses relative to files->fdt and
userspace can observe that it got loaded. With the barrier the
speculation stops and never allows that speculative read to issue.
With the change, the cpu only issues a read for fdt->fd[fd] when fd is
valid.
Dave Hansen Jan. 4, 2018, 5:49 a.m. UTC | #16
On 01/03/2018 09:44 PM, Dan Williams wrote:
> No, the concern is that an fd value >= fdt->max_fds may cause the cpu

> to read arbitrary memory addresses relative to files->fdt and

> userspace can observe that it got loaded.


Yep, it potentially tells you someting about memory after fdt->fd[].
For instance, you might be able to observe if some random bit of memory
after the actual fd[] array had 'mask' set because the CPU is running
this code with a 'file' that actually fails the "fd < fdt->max_fds" check:

                file = __fcheck_files(files, fd);
                if (!file || unlikely(file->f_mode & mask))
                        return 0;
                return (unsigned long)file;
Al Viro Jan. 4, 2018, 5:50 a.m. UTC | #17
On Wed, Jan 03, 2018 at 09:44:33PM -0800, Dan Williams wrote:
> On Wed, Jan 3, 2018 at 8:44 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:

> > On Thu, Jan 04, 2018 at 03:10:51AM +0000, Williams, Dan J wrote:

> >

> >> diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h

> >> index 1c65817673db..dbc12007da51 100644

> >> --- a/include/linux/fdtable.h

> >> +++ b/include/linux/fdtable.h

> >> @@ -82,8 +82,10 @@ static inline struct file *__fcheck_files(struct files_struct *files, unsigned i

> >>  {

> >>       struct fdtable *fdt = rcu_dereference_raw(files->fdt);

> >>

> >> -     if (fd < fdt->max_fds)

> >> +     if (fd < fdt->max_fds) {

> >> +             osb();

> >>               return rcu_dereference_raw(fdt->fd[fd]);

> >> +     }

> >>       return NULL;

> >>  }

> >

> > ... and the point of that would be?  Possibly revealing the value of files->fdt?

> > Why would that be a threat, assuming you manage to extract the information in

> > question in the first place?

> 

> No, the concern is that an fd value >= fdt->max_fds may cause the cpu

> to read arbitrary memory addresses relative to files->fdt and

> userspace can observe that it got loaded.


Yes.  And all that might reveal is the value of files->fdt.  Who cares?
Al Viro Jan. 4, 2018, 5:55 a.m. UTC | #18
On Thu, Jan 04, 2018 at 05:50:12AM +0000, Al Viro wrote:
> On Wed, Jan 03, 2018 at 09:44:33PM -0800, Dan Williams wrote:

> > On Wed, Jan 3, 2018 at 8:44 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:

> > > On Thu, Jan 04, 2018 at 03:10:51AM +0000, Williams, Dan J wrote:

> > >

> > >> diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h

> > >> index 1c65817673db..dbc12007da51 100644

> > >> --- a/include/linux/fdtable.h

> > >> +++ b/include/linux/fdtable.h

> > >> @@ -82,8 +82,10 @@ static inline struct file *__fcheck_files(struct files_struct *files, unsigned i

> > >>  {

> > >>       struct fdtable *fdt = rcu_dereference_raw(files->fdt);

> > >>

> > >> -     if (fd < fdt->max_fds)

> > >> +     if (fd < fdt->max_fds) {

> > >> +             osb();

> > >>               return rcu_dereference_raw(fdt->fd[fd]);

> > >> +     }

> > >>       return NULL;

> > >>  }

> > >

> > > ... and the point of that would be?  Possibly revealing the value of files->fdt?

> > > Why would that be a threat, assuming you manage to extract the information in

> > > question in the first place?

> > 

> > No, the concern is that an fd value >= fdt->max_fds may cause the cpu

> > to read arbitrary memory addresses relative to files->fdt and

> > userspace can observe that it got loaded.

> 

> Yes.  And all that might reveal is the value of files->fdt.  Who cares?


Sorry, s/files->fdt/files->fdt->fd/.  Still the same question - what information
would that extract and how would attacker use that?
Julia Lawall Jan. 4, 2018, 6:28 a.m. UTC | #19
On Wed, 3 Jan 2018, Dan Williams wrote:

> [ adding Julia and Dan ]

>

> On Wed, Jan 3, 2018 at 5:07 PM, Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote:

> > On Wed, 3 Jan 2018 16:39:31 -0800

> > Linus Torvalds <torvalds@linux-foundation.org> wrote:

> >

> >> On Wed, Jan 3, 2018 at 4:15 PM, Dan Williams <dan.j.williams@intel.com> wrote:

> >> > The 'if_nospec' primitive marks locations where the kernel is disabling

> >> > speculative execution that could potentially access privileged data. It

> >> > is expected to be paired with a 'nospec_{ptr,load}' where the user

> >> > controlled value is actually consumed.

> >>

> >> I'm much less worried about these "nospec_load/if" macros, than I am

> >> about having a sane way to determine when they should be needed.

> >>

> >> Is there such a sane model right now, or are we talking "people will

> >> randomly add these based on strong feelings"?

> >

> > There are people trying to tune coverity and other tool rules to identify

> > cases, and some of the work so far was done that way. For x86 we didn't

> > find too many so far so either the needed pattern is uncommon or ....  8)

> >

> > Given you can execute over a hundred basic instructions in a speculation

> > window it does need to be a tool that can explore not just in function

> > but across functions. That's really tough for the compiler itself to do

> > without help.

> >

> > What remains to be seen is if there are other patterns that affect

> > different processors.

> >

> > In the longer term the compiler itself needs to know what is and isn't

> > safe (ie you need to be able to write things like

> >

> > void foo(tainted __user int *x)

> >

> > and have the compiler figure out what level of speculation it can do and

> > (on processors with those features like IA64) when it can and can't do

> > various kinds of non-trapping loads.

> >

>

> It would be great if coccinelle and/or smatch could be taught to catch

> some of these case at least as a first pass "please audit this code

> block" type of notification.

>


What should one be looking for.  Do you have a typical example?

julia
Dan Williams Jan. 4, 2018, 6:32 a.m. UTC | #20
On Wed, Jan 3, 2018 at 9:01 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> "Williams, Dan J" <dan.j.williams@intel.com> writes:

>

>

>

>> Note that these are "a human looked at static analysis reports and

>> could not rationalize that these are false positives". Specific domain

>> knowledge about these paths may find that some of them are indeed false

>> positives.

>>

>> The change to m_start in kernel/user_namespace.c is interesting because

>> that's an example where the nospec_load() approach by itself would need

>> to barrier speculation twice whereas if_nospec can do it once for the

>> whole block.

>

>

> This user_namespace.c change is very convoluted for what it is trying to

> do.


Sorry this was my rebase on top of commit d5e7b3c5f51f "userns: Don't
read extents twice in m_start" the original change from Elena was
simpler. Part of the complexity arises from converting the common
kernel pattern of

if (<invalid condition>)
   return NULL;
do_stuff;

...to:

if (<valid condition>) {
   barrier();
   do_stuff;
}

> It simplifies to a one liner that just adds osb() after pos >=

> extents. AKA:

>

>         if (pos >= extents)

>                 return NULL;

> +       osb();

>

> Is the intent to hide which branch we take based on extents,

> after the pos check?


The intent is to prevent speculative execution from triggering any
reads when 'pos' is invalid.

> I suspect this implies that using a user namespace and a crafted uid

> map you can hit this in stat, on the fast path.

>

> At which point I suspect we will be better off extending struct

> user_namespace by a few pointers, so there is no union and remove the

> need for blocking speculation entirely.


How does this help prevent a speculative read with an invalid 'pos'
reading arbitrary kernel addresses?

>

>> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c

>> index 246d4d4ce5c7..aa0be8cef2d4 100644

>> --- a/kernel/user_namespace.c

>> +++ b/kernel/user_namespace.c

>> @@ -648,15 +648,18 @@ static void *m_start(struct seq_file *seq, loff_t *ppos,

>>  {

>>       loff_t pos = *ppos;

>>       unsigned extents = map->nr_extents;

>> -     smp_rmb();

>>

>> -     if (pos >= extents)

>> -             return NULL;

>> +     /* paired with smp_wmb in map_write */

>> +     smp_rmb();

>>

>> -     if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)

>> -             return &map->extent[pos];

>> +     if (pos < extents) {

>> +             osb();

>> +             if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)

>> +                     return &map->extent[pos];

>> +             return &map->forward[pos];

>> +     }

>>

>> -     return &map->forward[pos];

>> +     return NULL;

>>  }

>>

>>  static void *uid_m_start(struct seq_file *seq, loff_t *ppos)

>

>

>

>> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c

>> index 8ca9915befc8..7f83abdea255 100644

>> --- a/net/mpls/af_mpls.c

>> +++ b/net/mpls/af_mpls.c

>> @@ -81,6 +81,8 @@ static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)

>>       if (index < net->mpls.platform_labels) {

>>               struct mpls_route __rcu **platform_label =

>>                       rcu_dereference(net->mpls.platform_label);

>> +

>> +             osb();

>>               rt = rcu_dereference(platform_label[index]);

>>       }

>>       return rt;

>

> Ouch!  This adds a barrier in the middle of an rcu lookup, on the

> fast path for routing mpls packets.  Which if memory serves will

> noticably slow down software processing of mpls packets.

>

> Why does osb() fall after the branch for validity?  So that we allow

> speculation up until then?


It falls there so that the cpu only issues reads with known good 'index' values.

> I suspect it would be better to have those barriers in the tun/tap

> interfaces where userspace can inject packets and thus time them.  Then

> the code could still speculate and go fast for remote packets.

>

> Or does the speculation stomping have to be immediately at the place

> where we use data from userspace to perform a table lookup?


The speculation stomping barrier has to be between where we validate
the input and when we may speculate on invalid input. So, yes, moving
the user controllable input validation earlier and out of the fast
path would be preferred. Think of this patch purely as a static
analysis warning that something might need to be done to resolve the
report.
Dan Williams Jan. 4, 2018, 6:42 a.m. UTC | #21
On Wed, Jan 3, 2018 at 9:55 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, Jan 04, 2018 at 05:50:12AM +0000, Al Viro wrote:

>> On Wed, Jan 03, 2018 at 09:44:33PM -0800, Dan Williams wrote:

>> > On Wed, Jan 3, 2018 at 8:44 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:

>> > > On Thu, Jan 04, 2018 at 03:10:51AM +0000, Williams, Dan J wrote:

>> > >

>> > >> diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h

>> > >> index 1c65817673db..dbc12007da51 100644

>> > >> --- a/include/linux/fdtable.h

>> > >> +++ b/include/linux/fdtable.h

>> > >> @@ -82,8 +82,10 @@ static inline struct file *__fcheck_files(struct files_struct *files, unsigned i

>> > >>  {

>> > >>       struct fdtable *fdt = rcu_dereference_raw(files->fdt);

>> > >>

>> > >> -     if (fd < fdt->max_fds)

>> > >> +     if (fd < fdt->max_fds) {

>> > >> +             osb();

>> > >>               return rcu_dereference_raw(fdt->fd[fd]);

>> > >> +     }

>> > >>       return NULL;

>> > >>  }

>> > >

>> > > ... and the point of that would be?  Possibly revealing the value of files->fdt?

>> > > Why would that be a threat, assuming you manage to extract the information in

>> > > question in the first place?

>> >

>> > No, the concern is that an fd value >= fdt->max_fds may cause the cpu

>> > to read arbitrary memory addresses relative to files->fdt and

>> > userspace can observe that it got loaded.

>>

>> Yes.  And all that might reveal is the value of files->fdt.  Who cares?

>

> Sorry, s/files->fdt/files->fdt->fd/.  Still the same question - what information

> would that extract and how would attacker use that?


The question is: if userspace can exfiltrate any data from the kernel
that would otherwise be blocked by a bounds check, should the kernel
close that hole?

For these patches I do not think the bar should be "can I prove an
information leak is exploitable"; it should be "can I prove that a leak
is not exploitable", especially when possibly combined with other leak
sites.
Reshetova, Elena Jan. 4, 2018, 9:16 a.m. UTC | #22
> On Thu, Jan 04, 2018 at 02:15:53AM +0000, Alan Cox wrote:

> >

> > > > Elena has done the work of auditing static analysis reports to a dozen

> > > > or so locations that need some 'nospec' handling.

> > >

> > > How exactly is that related (especially in longer-term support terms) to

> > > BPF anyway?

> >

> > If you read the papers you need a very specific construct in order to not

> > only cause a speculative load of an address you choose but also to then

> > manage to cause a second operation that in some way reveals bits of data

> > or allows you to ask questions.

> >

> > BPF allows you to construct those sequences relatively easily and it's

> > the one case where a user space application can fairly easily place code

> > it wants to execute in the kernel. Without BPF you have to find the right

> > construct in the kernel, prime all the right predictions and measure the

> > result without getting killed off. There are places you can do that but

> > they are not so easy and we don't (at this point) think there are that

> > many.

> 

> for BPF in particular we're thinking to do a different fix.

> Instead of killing speculation we can let cpu speculate.

> The fix will include rounding up bpf maps to nearest power of 2 and

> inserting bpf_and operation on the index after bounds check,

> so cpu can continue speculate beyond bounds check, but it will

> load from zero-ed memory.

> So this nospec arch dependent hack won't be necessary for bpf side,

> but may still be needed in other parts of the kernel.


I think this is a much better approach than what we have been doing internally so
far. My initial fix back in August for this was to insert a minimal set of lfence
barriers (the osb() barrier below) that prevent speculation, based on how the
attack was using constants to index eBPF maps:

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
#define BPF_R0	regs[BPF_REG_0]
@@ -939,6 +940,7 @@ static unsigned int ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn,
 		DST = IMM;
 		CONT;
 	LD_IMM_DW:
+		osb();
 		DST = (u64) (u32) insn[0].imm | ((u64) (u32) insn[1].imm) << 32;
 		insn++;
 		CONT;
@@ -1200,6 +1202,7 @@ static unsigned int ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn,
 		*(SIZE *)(unsigned long) (DST + insn->off) = IMM;	\
 		CONT;							\
 	LDX_MEM_##SIZEOP:						\
+		osb();							\
 		DST = *(SIZE *)(unsigned long) (SRC + insn->off);	\
 		CONT;
 


And similar stuff also for x86 JIT (blinding dependent):

@@ -400,7 +413,7 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
 			case BPF_ADD: b2 = 0x01; break;
 			case BPF_SUB: b2 = 0x29; break;
 			case BPF_AND: b2 = 0x21; break;
-			case BPF_OR: b2 = 0x09; break;
+			case BPF_OR: b2 = 0x09; emit_memory_barrier(&prog); break;
 			case BPF_XOR: b2 = 0x31; break;
 			}
 			if (BPF_CLASS(insn->code) == BPF_ALU64)
@@ -647,6 +660,16 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
 		case BPF_ALU64 | BPF_RSH | BPF_X:
 		case BPF_ALU64 | BPF_ARSH | BPF_X:
 
+			/* If blinding is enabled, each
+			 * BPF_LD | BPF_IMM | BPF_DW instruction
+			 * is converted to 4 eBPF instructions with
+			 * BPF_ALU64_IMM(BPF_LSH, BPF_REG_AX, 32)
+			 * always present(number 3). Detect such cases
+			 * and insert memory barriers. */
+			if ((BPF_CLASS(insn->code) == BPF_ALU64)
+				&& (BPF_OP(insn->code) == BPF_LSH)
+				&& (src_reg == BPF_REG_AX))
+				emit_memory_barrier(&prog);
 			/* check for bad case when dst_reg == rcx */
 			if (dst_reg == BPF_REG_4) {
 				/* mov r11, dst_reg */

But this is a really ugly fix IMO to prevent speculation from happening, so I
think your approach is much better.

If you need help in testing the fixes, I can do it unless you already have the
attack code yourself. 

> 

> Also note that variant 1 is talking about exploiting prog_array

> bpf feature which had 64-bit access prior to

> commit 90caccdd8cc0 ("bpf: fix bpf_tail_call() x64 JIT")

> That was a fix for JIT and not related to this cpu issue, but

> I believe it breaks the existing exploit.


Actually I could not get the existing exploit working on anything past 4.11,
but at that time I could not see any fundamental change in eBPF that
would prevent it.

Best Regards,
Elena.
Mark Rutland Jan. 4, 2018, 11:47 a.m. UTC | #23
Hi Dan,

Thanks for these examples.

On Thu, Jan 04, 2018 at 03:10:51AM +0000, Williams, Dan J wrote:
> Note, the following is Elena's work, I'm just helping poke the upstream

> discussion along while she's offline.

> 

> Elena audited the static analysis reports down to the following

> locations where userspace provides a value for a conditional branch and

> then later creates a dependent load on that same userspace controlled

> value.

> 

> 'osb()', observable memory barrier, resolves to an lfence on x86. My

> concern with these changes is that it is not clear what content within

> the branch block is of concern. Peter's 'if_nospec' proposal combined

> with Mark's 'nospec_load' would seem to clear that up, from a self

> documenting code perspective, and let archs optionally protect entire

> conditional blocks or individual loads within those blocks.


I'm a little concerned by having to use two helpers that need to be paired. It
means that we have to duplicate the bounds information, and I suspect in
practice that they'll often be paired improperly.

I think that we can avoid those problems if we use nospec_ptr() on its own. It
returns NULL if the pointer would be out-of-bounds, so we can use it in the
if-statement.

On x86, that can contain the bounds checks followed by an OSB / lfence, like
if_nospec(). On arm we can use our architected sequence. In either case,
nospec_ptr() returns NULL if the pointer would be out-of-bounds.

Then, rather than sequences like:

	if_nospec(idx < max) {
		val = nospec_array_load(array, idx, max);
		...
	}

... we could have:

	if ((elem_ptr = nospec_array_ptr(array, idx, max))) {
		val = *elem_ptr;
		...
	}

... which also means we don't have to annotate every single load in the branch
if the element is a structure with multiple fields.
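
As a purely illustrative sketch (not a patch), the __fcheck_files() case
from the list below could then end up looking something like:

	static inline struct file *__fcheck_files(struct files_struct *files,
						  unsigned int fd)
	{
		struct fdtable *fdt = rcu_dereference_raw(files->fdt);
		struct file __rcu **fdp;

		/*
		 * nospec_array_ptr() returns NULL when fd is out of bounds,
		 * so no dereference is issued, even speculatively, for an
		 * invalid fd.
		 */
		if ((fdp = nospec_array_ptr(fdt->fd, fd, fdt->max_fds)))
			return rcu_dereference_raw(*fdp);

		return NULL;
	}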

Does that sound workable to you?

From a quick scan, it looks like it would fit all of the example cases below.

Thanks,
Mark.

> Note that these are "a human looked at static analysis reports and

> could not rationalize that these are false positives". Specific domain

> knowledge about these paths may find that some of them are indeed false

> positives.

> 

> The change to m_start in kernel/user_namespace.c is interesting because

> that's an example where the nospec_load() approach by itself would need

> to barrier speculation twice whereas if_nospec can do it once for the

> whole block.

> 

> [ forgive any whitespace damage ]

> 

> 8<---

> diff --git a/drivers/media/usb/uvc/uvc_v4l2.c b/drivers/media/usb/uvc/uvc_v4l2.c

> index 3e7e283a44a8..65175bbe805f 100644

> --- a/drivers/media/usb/uvc/uvc_v4l2.c

> +++ b/drivers/media/usb/uvc/uvc_v4l2.c

> @@ -821,6 +821,7 @@ static int uvc_ioctl_enum_input(struct file *file, void *fh,

>  		}

>  		pin = iterm->id;

>  	} else if (index < selector->bNrInPins) {

> +		osb();

>  		pin = selector->baSourceID[index];

>  		list_for_each_entry(iterm, &chain->entities, chain) {

>  			if (!UVC_ENTITY_IS_ITERM(iterm))

> diff --git a/drivers/net/wireless/ath/carl9170/main.c b/drivers/net/wireless/ath/carl9170/main.c

> index 988c8857d78c..cf267b709af6 100644

> --- a/drivers/net/wireless/ath/carl9170/main.c

> +++ b/drivers/net/wireless/ath/carl9170/main.c

> @@ -1388,6 +1388,7 @@ static int carl9170_op_conf_tx(struct ieee80211_hw *hw,

>  

>  	mutex_lock(&ar->mutex);

>  	if (queue < ar->hw->queues) {

> +		osb();

>  		memcpy(&ar->edcf[ar9170_qmap[queue]], param, sizeof(*param));


>  		ret = carl9170_set_qos(ar);

>  	} else {

> diff --git a/drivers/net/wireless/intersil/p54/main.c b/drivers/net/wireless/intersil/p54/main.c

> index ab6d39e12069..f024f1ad81ff 100644

> --- a/drivers/net/wireless/intersil/p54/main.c

> +++ b/drivers/net/wireless/intersil/p54/main.c

> @@ -415,6 +415,7 @@ static int p54_conf_tx(struct ieee80211_hw *dev,

>  

>  	mutex_lock(&priv->conf_mutex);

>  	if (queue < dev->queues) {

> +		osb();

>  		P54_SET_QUEUE(priv->qos_params[queue], params->aifs,

>  			params->cw_min, params->cw_max, params->txop);

>  		ret = p54_set_edcf(priv);

> diff --git a/drivers/net/wireless/st/cw1200/sta.c b/drivers/net/wireless/st/cw1200/sta.c

> index 38678e9a0562..b4a2a7ba04e8 100644

> --- a/drivers/net/wireless/st/cw1200/sta.c

> +++ b/drivers/net/wireless/st/cw1200/sta.c

> @@ -619,6 +619,7 @@ int cw1200_conf_tx(struct ieee80211_hw *dev, struct ieee80211_vif *vif,

>  	mutex_lock(&priv->conf_mutex);

>  

>  	if (queue < dev->queues) {

> +		osb();

>  		old_uapsd_flags = le16_to_cpu(priv->uapsd_info.uapsd_flags);

>  

>  		WSM_TX_QUEUE_SET(&priv->tx_queue_params, queue, 0, 0, 0);

> diff --git a/drivers/scsi/qla2xxx/qla_mr.c b/drivers/scsi/qla2xxx/qla_mr.c

> index d5da3981cefe..dec8b6e087e3 100644

> --- a/drivers/scsi/qla2xxx/qla_mr.c

> +++ b/drivers/scsi/qla2xxx/qla_mr.c

> @@ -2304,10 +2304,12 @@ qlafx00_status_entry(scsi_qla_host_t *vha, struct rsp_que *rsp, void *pkt)

>  	req = ha->req_q_map[que];

>  

>  	/* Validate handle. */

> -	if (handle < req->num_outstanding_cmds)

> +	if (handle < req->num_outstanding_cmds) {

> +		osb();

>  		sp = req->outstanding_cmds[handle];

> -	else

> +	} else {

>  		sp = NULL;

> +	}

>  

>  	if (sp == NULL) {

>  		ql_dbg(ql_dbg_io, vha, 0x3034,

> @@ -2655,10 +2657,12 @@ qlafx00_multistatus_entry(struct scsi_qla_host *vha,

>  		req = ha->req_q_map[que];

>  

>  		/* Validate handle. */

> -		if (handle < req->num_outstanding_cmds)

> +		if (handle < req->num_outstanding_cmds) {

> +			osb();

>  			sp = req->outstanding_cmds[handle];

> -		else

> +		} else {

>  			sp = NULL;

> +		}

>  

>  		if (sp == NULL) {

>  			ql_dbg(ql_dbg_io, vha, 0x3044,

> diff --git a/drivers/thermal/int340x_thermal/int340x_thermal_zone.c b/drivers/thermal/int340x_thermal/int340x_thermal_zone.c

> index 145a5c53ff5c..d732b34e120d 100644

> --- a/drivers/thermal/int340x_thermal/int340x_thermal_zone.c

> +++ b/drivers/thermal/int340x_thermal/int340x_thermal_zone.c

> @@ -57,15 +57,16 @@ static int int340x_thermal_get_trip_temp(struct thermal_zone_device *zone,

>  	if (d->override_ops && d->override_ops->get_trip_temp)

>  		return d->override_ops->get_trip_temp(zone, trip, temp);

>  

> -	if (trip < d->aux_trip_nr)

> +	if (trip < d->aux_trip_nr) {

> +		osb();

>  		*temp = d->aux_trips[trip];

> -	else if (trip == d->crt_trip_id)

> +	} else if (trip == d->crt_trip_id) {

>  		*temp = d->crt_temp;

> -	else if (trip == d->psv_trip_id)

> +	} else if (trip == d->psv_trip_id) {

>  		*temp = d->psv_temp;

> -	else if (trip == d->hot_trip_id)

> +	} else if (trip == d->hot_trip_id) {

>  		*temp = d->hot_temp;

> -	else {

> +	} else {

>  		for (i = 0; i < INT340X_THERMAL_MAX_ACT_TRIP_COUNT; i++) {

>  			if (d->act_trips[i].valid &&

>  			    d->act_trips[i].id == trip) {

> diff --git a/fs/udf/misc.c b/fs/udf/misc.c

> index 401e64cde1be..c5394760d26b 100644

> --- a/fs/udf/misc.c

> +++ b/fs/udf/misc.c

> @@ -104,6 +104,8 @@ struct genericFormat *udf_add_extendedattr(struct inode *inode, uint32_t size,

>  					iinfo->i_lenEAttr) {

>  				uint32_t aal =

>  					le32_to_cpu(eahd->appAttrLocation);

> +

> +				osb();

>  				memmove(&ea[offset - aal + size],

>  					&ea[aal], offset - aal);

>  				offset -= aal;

> @@ -114,6 +116,8 @@ struct genericFormat *udf_add_extendedattr(struct inode *inode, uint32_t size,

>  					iinfo->i_lenEAttr) {

>  				uint32_t ial =

>  					le32_to_cpu(eahd->impAttrLocation);

> +

> +				osb();

>  				memmove(&ea[offset - ial + size],

>  					&ea[ial], offset - ial);

>  				offset -= ial;

> @@ -125,6 +129,8 @@ struct genericFormat *udf_add_extendedattr(struct inode *inode, uint32_t size,

>  					iinfo->i_lenEAttr) {

>  				uint32_t aal =

>  					le32_to_cpu(eahd->appAttrLocation);

> +

> +				osb();

>  				memmove(&ea[offset - aal + size],

>  					&ea[aal], offset - aal);

>  				offset -= aal;

> diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h

> index 1c65817673db..dbc12007da51 100644

> --- a/include/linux/fdtable.h

> +++ b/include/linux/fdtable.h

> @@ -82,8 +82,10 @@ static inline struct file *__fcheck_files(struct files_struct *files, unsigned i

>  {

>  	struct fdtable *fdt = rcu_dereference_raw(files->fdt);

>  

> -	if (fd < fdt->max_fds)

> +	if (fd < fdt->max_fds) {

> +		osb();

>  		return rcu_dereference_raw(fdt->fd[fd]);

> +	}

>  	return NULL;

>  }

>  

> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c

> index 246d4d4ce5c7..aa0be8cef2d4 100644

> --- a/kernel/user_namespace.c

> +++ b/kernel/user_namespace.c

> @@ -648,15 +648,18 @@ static void *m_start(struct seq_file *seq, loff_t *ppos,

>  {

>  	loff_t pos = *ppos;

>  	unsigned extents = map->nr_extents;

> -	smp_rmb();

>  

> -	if (pos >= extents)

> -		return NULL;

> +	/* paired with smp_wmb in map_write */

> +	smp_rmb();

>  

> -	if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)

> -		return &map->extent[pos];

> +	if (pos < extents) {

> +		osb();

> +		if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)

> +			return &map->extent[pos];

> +		return &map->forward[pos];

> +	}

>  

> -	return &map->forward[pos];

> +	return NULL;

>  }

>  

>  static void *uid_m_start(struct seq_file *seq, loff_t *ppos)

> diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c

> index 125c1eab3eaa..56909c8fa025 100644

> --- a/net/ipv4/raw.c

> +++ b/net/ipv4/raw.c

> @@ -476,6 +476,7 @@ static int raw_getfrag(void *from, char *to, int offset, int len, int odd,

>  	if (offset < rfv->hlen) {

>  		int copy = min(rfv->hlen - offset, len);

>  

> +		osb();

>  		if (skb->ip_summed == CHECKSUM_PARTIAL)

>  			memcpy(to, rfv->hdr.c + offset, copy);

>  		else

> diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c

> index 761a473a07c5..0942784f3f8d 100644

> --- a/net/ipv6/raw.c

> +++ b/net/ipv6/raw.c

> @@ -729,6 +729,7 @@ static int raw6_getfrag(void *from, char *to, int offset, int len, int odd,

>  	if (offset < rfv->hlen) {

>  		int copy = min(rfv->hlen - offset, len);

>  

> +		osb();

>  		if (skb->ip_summed == CHECKSUM_PARTIAL)

>  			memcpy(to, rfv->c + offset, copy);

>  		else

> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c

> index 8ca9915befc8..7f83abdea255 100644

> --- a/net/mpls/af_mpls.c

> +++ b/net/mpls/af_mpls.c

> @@ -81,6 +81,8 @@ static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)

>  	if (index < net->mpls.platform_labels) {

>  		struct mpls_route __rcu **platform_label =

>  			rcu_dereference(net->mpls.platform_label);

> +

> +		osb();

>  		rt = rcu_dereference(platform_label[index]);

>  	}

>  	return rt;
Eric W. Biederman Jan. 4, 2018, 2:54 p.m. UTC | #24
Dan Williams <dan.j.williams@intel.com> writes:

> On Wed, Jan 3, 2018 at 9:01 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:

>> "Williams, Dan J" <dan.j.williams@intel.com> writes:

>>

>>

>>

>>> Note that these are "a human looked at static analysis reports and

>>> could not rationalize that these are false positives". Specific domain

>>> knowledge about these paths may find that some of them are indeed false

>>> positives.

>>>

>>> The change to m_start in kernel/user_namespace.c is interesting because

>>> that's an example where the nospec_load() approach by itself would need

>>> to barrier speculation twice whereas if_nospec can do it once for the

>>> whole block.

>>

>>

>> This user_namespace.c change is very convoluted for what it is trying to

>> do.

>

> Sorry this was my rebase on top of commit d5e7b3c5f51f "userns: Don't

> read extents twice in m_start" the original change from Elena was

> simpler. Part of the complexity arises from converting the common

> kernel pattern of

>

> if (<invalid condition>)

>    return NULL;

> do_stuff;

>

> ...to:

>

> if (<valid conidtion>) {

>    barrier();

>    do_stuff;

> }

>

>> It simplifies to a one liner that just adds osb() after pos >=

>> extents. AKA:

>>

>>         if (pos >= extents)

>>                 return NULL;

>> +       osb();

>>

>> Is the intent to hide which branch branch we take based on extents,

>> after the pos check?

>

> The intent is to prevent speculative execution from triggering any

> reads when 'pos' is invalid.


If that is the intent I think the patch you posted is woefully
inadequate.  We have many many more seq files in proc than just
/proc/<pid>/uid_map.

>> I suspect this implies that using a user namespace and a crafted uid

>> map you can hit this in stat, on the fast path.

>>

>> At which point I suspect we will be better off extending struct

>> user_namespace by a few pointers, so there is no union and remove the

>> need for blocking speculation entirely.

>

> How does this help prevent a speculative read with an invalid 'pos'

> reading arbitrary kernel addresses?


I thought the concern was extents.

I am now convinced that collectively we need a much better description
of the problem than currently exists.

Either the patch you presented missed a whole lot like 90%+ of the
user/kernel interface or there is some mitigating factor that I am not
seeing.  Either way until reasonable people can read the code and
agree on the potential exploitability of it, I will be nacking these
patches.

>>> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c

>>> index 246d4d4ce5c7..aa0be8cef2d4 100644

>>> --- a/kernel/user_namespace.c

>>> +++ b/kernel/user_namespace.c

>>> @@ -648,15 +648,18 @@ static void *m_start(struct seq_file *seq, loff_t *ppos,

>>>  {

>>>       loff_t pos = *ppos;

>>>       unsigned extents = map->nr_extents;

>>> -     smp_rmb();

>>>

>>> -     if (pos >= extents)

>>> -             return NULL;

>>> +     /* paired with smp_wmb in map_write */

>>> +     smp_rmb();

>>>

>>> -     if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)

>>> -             return &map->extent[pos];

>>> +     if (pos < extents) {

>>> +             osb();

>>> +             if (extents <= UID_GID_MAP_MAX_BASE_EXTENTS)

>>> +                     return &map->extent[pos];

>>> +             return &map->forward[pos];

>>> +     }

>>>

>>> -     return &map->forward[pos];

>>> +     return NULL;

>>>  }

>>>

>>>  static void *uid_m_start(struct seq_file *seq, loff_t *ppos)

>>

>>

>>

>>> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c

>>> index 8ca9915befc8..7f83abdea255 100644

>>> --- a/net/mpls/af_mpls.c

>>> +++ b/net/mpls/af_mpls.c

>>> @@ -81,6 +81,8 @@ static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)

>>>       if (index < net->mpls.platform_labels) {

>>>               struct mpls_route __rcu **platform_label =

>>>                       rcu_dereference(net->mpls.platform_label);

>>> +

>>> +             osb();

>>>               rt = rcu_dereference(platform_label[index]);

>>>       }

>>>       return rt;

>>

>> Ouch!  This adds a barrier in the middle of an rcu lookup, on the

>> fast path for routing mpls packets.  Which if memory serves will

>> noticably slow down software processing of mpls packets.

>>

>> Why does osb() fall after the branch for validity?  So that we allow

>> speculation up until then?

>

> It falls there so that the cpu only issues reads with known good 'index' values.

>

>> I suspect it would be better to have those barriers in the tun/tap

>> interfaces where userspace can inject packets and thus time them.  Then

>> the code could still speculate and go fast for remote packets.

>>

>> Or does the speculation stomping have to be immediately at the place

>> where we use data from userspace to perform a table lookup?

>

> The speculation stomping barrier has to be between where we validate

> the input and when we may speculate on invalid input.


So a serializing instruction at the kernel/user boundary (like say
loading cr3) is not enough?  That would seem to break any chance of a
controlled timing.

> So, yes, moving

> the user controllable input validation earlier and out of the fast

> path would be preferred. Think of this patch purely as a static

> analysis warning that something might need to be done to resolve the

> report.


That isn't what I was suggesting.  I was just suggesting a serialization
instruction earlier in the pipeline.

Given what I have seen in other parts of the thread I think an and
instruction that just limits the index to a sane range is generally
applicable, and should be cheap enough to not care about.  Further
it seems to apply to the pattern the static checkers were catching,
so I suspect that is the pattern we want to stress for limiting
speculation.  Assuming of course the compiler won't just optimize the
and of the index out.

Eric
Mark Rutland Jan. 4, 2018, 4:39 p.m. UTC | #25
On Thu, Jan 04, 2018 at 08:54:11AM -0600, Eric W. Biederman wrote:
> Dan Williams <dan.j.williams@intel.com> writes:

> > On Wed, Jan 3, 2018 at 9:01 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:

> >> "Williams, Dan J" <dan.j.williams@intel.com> writes:

> Either the patch you presented missed a whole lot like 90%+ of the

> user/kernel interface or there is some mitigating factor that I am not

> seeing.  Either way until reasonable people can read the code and

> agree on the potential exploitability of it, I will be nacking these

> patches.


As Dan mentioned, this is the result of auditing some static analysis reports.
I don't think it was claimed that this was complete, just that these are
locations that we're fairly certain need attention.

Auditing the entire user/kernel interface is going to take time, and I don't
think we should ignore this corpus in the meantime (though we certainly want
to avoid a whack-a-mole game).

[...]

> >>> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c

> >>> index 8ca9915befc8..7f83abdea255 100644

> >>> --- a/net/mpls/af_mpls.c

> >>> +++ b/net/mpls/af_mpls.c

> >>> @@ -81,6 +81,8 @@ static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)

> >>>       if (index < net->mpls.platform_labels) {

> >>>               struct mpls_route __rcu **platform_label =

> >>>                       rcu_dereference(net->mpls.platform_label);

> >>> +

> >>> +             osb();

> >>>               rt = rcu_dereference(platform_label[index]);

> >>>       }

> >>>       return rt;

> >>

> >> Ouch!  This adds a barrier in the middle of an rcu lookup, on the

> >> fast path for routing mpls packets.  Which if memory serves will

> >> noticably slow down software processing of mpls packets.

> >>

> >> Why does osb() fall after the branch for validity?  So that we allow

> >> speculation up until then?

> >

> > It falls there so that the cpu only issues reads with known good 'index' values.

> >

> >> I suspect it would be better to have those barriers in the tun/tap

> >> interfaces where userspace can inject packets and thus time them.  Then

> >> the code could still speculate and go fast for remote packets.

> >>

> >> Or does the speculation stomping have to be immediately at the place

> >> where we use data from userspace to perform a table lookup?

> >

> > The speculation stomping barrier has to be between where we validate

> > the input and when we may speculate on invalid input.

> 

> So a serializing instruction at the kernel/user boundary (like say

> loading cr3) is not enough?  That would seem to break any chance of a

> controlled timing.


Unfortunately, it isn't sufficient to do this at the kernel/user boundary. Any
subsequent bounds check can be mis-speculated regardless of prior
serialization.

Such serialization has to occur *after* the relevant bounds check, but *before*
use of the value that was checked.
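
To make the placement concrete, a minimal sketch is below, using the osb()
barrier from this series; the function and variable names are purely
illustrative and not taken from any particular patch:

	/* 'idx' comes from userspace and selects an entry in 'array' */
	static int example_read(const int *array, unsigned long idx,
				unsigned long bound, int *val)
	{
		if (idx < bound) {		/* bounds check on the untrusted index */
			osb();			/* barrier after the check ... */
			*val = array[idx];	/* ... but before the dependent load */
			return 0;
		}
		return -EINVAL;
	}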

Where it's possible to audit user-provided values up front, we may be able to
batch checks to amortize the cost of such serialization, but typically bounds
checks are spread arbitrarily deep in the kernel.

[...]

> Given what I have seen in other parts of the thread I think an and

> instruction that just limits the index to a sane range is generally

> applicable, and should be cheap enough to not care about.


Where feasible, this sounds good to me.

However, since many places have dynamic bounds which aren't necessarily
powers-of-two, I'm not sure how applicable this is.
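
For comparison, a minimal sketch of the power-of-two case being described is
below (TBL_SIZE and the function are purely illustrative); it's the dynamic,
non-power-of-two bounds that don't fit this pattern directly:

	#define TBL_SIZE 256	/* illustrative; must be a power of two */

	static int table[TBL_SIZE];

	static int table_read(unsigned long idx)
	{
		if (idx >= TBL_SIZE)
			return -EINVAL;
		/*
		 * The AND keeps even a speculated-past-the-check index
		 * inside the table, so no barrier is needed here.
		 */
		return table[idx & (TBL_SIZE - 1)];
	}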

Thanks,
Mark.
Dan Williams Jan. 4, 2018, 5:58 p.m. UTC | #26
On Wed, Jan 3, 2018 at 10:28 PM, Julia Lawall <julia.lawall@lip6.fr> wrote:
>

>

> On Wed, 3 Jan 2018, Dan Williams wrote:

>

>> [ adding Julia and Dan ]

>>

>> On Wed, Jan 3, 2018 at 5:07 PM, Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote:

>> > On Wed, 3 Jan 2018 16:39:31 -0800

>> > Linus Torvalds <torvalds@linux-foundation.org> wrote:

>> >

>> >> On Wed, Jan 3, 2018 at 4:15 PM, Dan Williams <dan.j.williams@intel.com> wrote:

>> >> > The 'if_nospec' primitive marks locations where the kernel is disabling

>> >> > speculative execution that could potentially access privileged data. It

>> >> > is expected to be paired with a 'nospec_{ptr,load}' where the user

>> >> > controlled value is actually consumed.

>> >>

>> >> I'm much less worried about these "nospec_load/if" macros, than I am

>> >> about having a sane way to determine when they should be needed.

>> >>

>> >> Is there such a sane model right now, or are we talking "people will

>> >> randomly add these based on strong feelings"?

>> >

>> > There are people trying to tune coverity and other tool rules to identify

>> > cases, and some of the work so far was done that way. For x86 we didn't

>> > find too many so far so either the needed pattern is uncommon or ....  8)

>> >

>> > Given you can execute over a hundred basic instructions in a speculation

>> > window it does need to be a tool that can explore not just in function

>> > but across functions. That's really tough for the compiler itself to do

>> > without help.

>> >

>> > What remains to be seen is if there are other patterns that affect

>> > different processors.

>> >

>> > In the longer term the compiler itself needs to know what is and isn't

>> > safe (ie you need to be able to write things like

>> >

>> > void foo(tainted __user int *x)

>> >

>> > and have the compiler figure out what level of speculation it can do and

>> > (on processors with those features like IA64) when it can and can't do

>> > various kinds of non-trapping loads.

>> >

>>

>> It would be great if coccinelle and/or smatch could be taught to catch

>> some of these case at least as a first pass "please audit this code

>> block" type of notification.

>>

>

> What should one be looking for.  Do you have a typical example?

>


See "Exploiting Conditional Branch Misprediction" from the paper [1].

The typical example is an attacker controlled index used to trigger a
dependent read near a branch. Where an example of "near" from the
paper is "up to 188 simple instructions inserted in the source code
between the ‘if’ statement and the line accessing array...".

if (attacker_controlled_index < bound)
     val = array[attacker_controlled_index];
else
    return error;

...when the cpu speculates that the 'index < bound' branch is taken it
reads index and uses that value to read array[index]. The result of an
'array' relative read is potentially observable in the cache.

[1]: https://spectreattack.com/spectre.pdf
Pavel Machek Jan. 4, 2018, 7:26 p.m. UTC | #27
Hi!

> >> > What remains to be seen is if there are other patterns that affect

> >> > different processors.

> >> >

> >> > In the longer term the compiler itself needs to know what is and isn't

> >> > safe (ie you need to be able to write things like

> >> >

> >> > void foo(tainted __user int *x)

> >> >

> >> > and have the compiler figure out what level of speculation it can do and

> >> > (on processors with those features like IA64) when it can and can't do

> >> > various kinds of non-trapping loads.

> >> >

> >>

> >> It would be great if coccinelle and/or smatch could be taught to catch

> >> some of these case at least as a first pass "please audit this code

> >> block" type of notification.

> >>

> >

> > What should one be looking for.  Do you have a typical example?

> >

> 

> See "Exploiting Conditional Branch Misprediction" from the paper [1].

> 

> The typical example is an attacker controlled index used to trigger a

> dependent read near a branch. Where an example of "near" from the

> paper is "up to 188 simple instructions inserted in the source code

> between the ‘if’ statement and the line accessing array...".

> 

> if (attacker_controlled_index < bound)

>      val = array[attacker_controlled_index];

> else

>     return error;

> 

> ...when the cpu speculates that the 'index < bound' branch is taken it

> reads index and uses that value to read array[index]. The result of an

> 'array' relative read is potentially observable in the cache.


You still need

	(void) array2[val];

after that to get something observable, right?

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Pavel Machek Jan. 4, 2018, 7:39 p.m. UTC | #28
On Thu 2018-01-04 02:47:51, Jiri Kosina wrote:
> On Thu, 4 Jan 2018, Alan Cox wrote:

> 

> > > If the CPU speculation can cause these kinds of side-effects, it just must 

> > > not happen, full stop. 

> > 

> > At which point your performance will resemble that of a 2012 atom

> > processor at best.

> 

> You know what? I'd be completely fine with that, if it's traded for "my 

> ssh and internet banking keys are JUST MINE, ok?" :)


Agreed.

For the kernel, we may be able to annotate "tainted" pointers. But then
there's quite a lot of code in userspace... What will need to be
modified? Just JITs? Setuid programs?

And we can get part of the performance back by adding more of
SMT... AFAICT.

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Alan Cox Jan. 4, 2018, 8:32 p.m. UTC | #29
> For kernel, we may be able to annonate "tainted" pointers. But then

> there's quite a lot of code in userspace... What will need to be

> modified? Just JITs? Setuid programs?


You never go from one user process to another except via the kernel. We
have no hardware scheduling going on. That means that if the kernel
and/or CPU imposes the correct speculation barriers you can't attack
anyone but yourself.

Sandboxes and JITs are the critical components where the issue matters.
That IMHO is also an argument for why once the basics are in place there
is a good argument for a prctl and maybe cgroup support to run some
groups of processes with different trust levels.

For example there's not much point protecting a user process from root,
or a process from another process that could use ptrace on it instead.

> And we can get part of the performance back by adding more of

> SMT... AFAICT.


The current evidence suggests not, because most workloads are not sufficiently
parallelisable. Likewise doing the speculation in the compiler doesn't
appear to have been the success people hoped for (IA64).

Alan
Jiri Kosina Jan. 4, 2018, 8:39 p.m. UTC | #30
On Thu, 4 Jan 2018, Alan Cox wrote:

> You never go from one user process to another except via the kernel. We

> have no hardware scheduling going on. That means that if the kernel

> and/or CPU imposes the correct speculation barriers you can't attack

> anyone but yourself.


So how does this work on HT with the shared BTB? There is no context 
switch (and hence no IBPB) happening between the threads sharing it.

-- 
Jiri Kosina
SUSE Labs
Pavel Machek Jan. 4, 2018, 8:40 p.m. UTC | #31
Hi!

> > So this is in that same category, but yes, it's inconvenient.

> 

> Disagreed, violently. CPU has to execute the instructions I ask it to 

> execute, and if it executes *anything* else that reveals any information 

> about the instructions that have *not* been executed, it's flawed.


I agree that's a flaw. Unfortunately... CPUs do execute instructions
you did not ask them to execute all the time.

Plus CPU designers forgot that cache state (and active row in DRAM) is
actually an observable side-effect. ...and that's where we are today.

If you want, I have two systems with AMD Geode. One is a PC. Neither is
very fast.

All the other general purpose CPUs I have -- and that includes
smartphones -- are likely out-of-order, and that means flawed.

So... the situation is bad.

CPUs do execute instructions you did not ask them to execute. I don't
think that's reasonable to change.

I believe "right" fix would be for CPUs to treat even DRAM read as
side-effects, and adjust speculation accordingly. I'm not sure Intel/AMD
is going to do the right thing here.

Oh, I have an FPGA, too, if you want to play with RISC-V :-).

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Pavel Machek Jan. 4, 2018, 8:56 p.m. UTC | #32
Hi!


> > It falls there so that the cpu only issues reads with known good 'index' values.

> >

> >> I suspect it would be better to have those barriers in the tun/tap

> >> interfaces where userspace can inject packets and thus time them.  Then

> >> the code could still speculate and go fast for remote packets.

> >>

> >> Or does the speculation stomping have to be immediately at the place

> >> where we use data from userspace to perform a table lookup?

> >

> > The speculation stomping barrier has to be between where we validate

> > the input and when we may speculate on invalid input.

> 

> So a serializing instruction at the kernel/user boundary (like say

> loading cr3) is not enough?  That would seem to break any chance of a

> controlled timing.


A timing attack is not as straightforward as this.

You can assume cache snooping from a second CPU _while_ the kernel is
executing. Yes, that will mean timing, but....

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Alan Cox Jan. 4, 2018, 9:23 p.m. UTC | #33
On Thu, 4 Jan 2018 21:39:24 +0100 (CET)
Jiri Kosina <jikos@kernel.org> wrote:

> On Thu, 4 Jan 2018, Alan Cox wrote:

> 

> > You never go from one user process to another except via the kernel. We

> > have no hardware scheduling going on. That means that if the kernel

> > and/or CPU imposes the correct speculation barriers you can't attack

> > anyone but yourself.  

> 

> So how does this work on HT with the shared BTB? There is no context 

> switch (and hence no IBPB) happening between the threads sharing it.

> 


If you are paranoid in that case you either need to schedule things that
trust each other together or disable the speculation while that situation
occurs. However the kernel is always in the position to make that
decision.

Alan
Dan Williams Jan. 4, 2018, 9:43 p.m. UTC | #34
On Thu, Jan 4, 2018 at 11:26 AM, Pavel Machek <pavel@ucw.cz> wrote:
> Hi!

>

>> >> > What remains to be seen is if there are other patterns that affect

>> >> > different processors.

>> >> >

>> >> > In the longer term the compiler itself needs to know what is and isn't

>> >> > safe (ie you need to be able to write things like

>> >> >

>> >> > void foo(tainted __user int *x)

>> >> >

>> >> > and have the compiler figure out what level of speculation it can do and

>> >> > (on processors with those features like IA64) when it can and can't do

>> >> > various kinds of non-trapping loads.

>> >> >

>> >>

>> >> It would be great if coccinelle and/or smatch could be taught to catch

>> >> some of these case at least as a first pass "please audit this code

>> >> block" type of notification.

>> >>

>> >

>> > What should one be looking for.  Do you have a typical example?

>> >

>>

>> See "Exploiting Conditional Branch Misprediction" from the paper [1].

>>

>> The typical example is an attacker controlled index used to trigger a

>> dependent read near a branch. Where an example of "near" from the

>> paper is "up to 188 simple instructions inserted in the source code

>> between the ‘if’ statement and the line accessing array...".

>>

>> if (attacker_controlled_index < bound)

>>      val = array[attacker_controlled_index];

>> else

>>     return error;

>>

>> ...when the cpu speculates that the 'index < bound' branch is taken it

>> reads index and uses that value to read array[index]. The result of an

>> 'array' relative read is potentially observable in the cache.

>

> You still need

>

>         (void) array2[val];

>

> after that to get something observable, right?


As far as I understand, the presence of array2[val] discloses more
information, but in terms of the cpu taking an action that is
observable in the cache, that's already occurred when "val =
array[attacker_controlled_index];" is speculated. Let's err on the side
of caution and shut down all the observable actions that are already
explicitly gated by an input validation check. In other words, a
low-bandwidth information leak is still a leak.
Pavel Machek Jan. 4, 2018, 9:48 p.m. UTC | #35
On Thu 2018-01-04 21:23:59, Alan Cox wrote:
> On Thu, 4 Jan 2018 21:39:24 +0100 (CET)

> Jiri Kosina <jikos@kernel.org> wrote:

> 

> > On Thu, 4 Jan 2018, Alan Cox wrote:

> > 

> > > You never go from one user process to another except via the kernel. We

> > > have no hardware scheduling going on. That means that if the kernel

> > > and/or CPU imposes the correct speculation barriers you can't attack

> > > anyone but yourself.  

> > 

> > So how does this work on HT with the shared BTB? There is no context 

> > switch (and hence no IBPB) happening between the threads sharing it.

> > 

> 

> If you are paranoid in that case you either need to schedule things that

> trust each other together or disable the speculation while that situation

> occurs. However the kernel is always in the position to make that

> decision.


Actually... I'm not paranoid but would like to run flightgear on one
core (smt cpu #0), with smt cpu#1 being idle, while running
compilations on the second core (smt cpus #2 and #3).

Is there an easy way to do that?
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Dan Williams Jan. 4, 2018, 10:09 p.m. UTC | #36
On Thu, Jan 4, 2018 at 3:47 AM, Mark Rutland <mark.rutland@arm.com> wrote:
> Hi Dan,

>

> Thanks for these examples.

>

> On Thu, Jan 04, 2018 at 03:10:51AM +0000, Williams, Dan J wrote:

>> Note, the following is Elena's work, I'm just helping poke the upstream

>> discussion along while she's offline.

>>

>> Elena audited the static analysis reports down to the following

>> locations where userspace provides a value for a conditional branch and

>> then later creates a dependent load on that same userspace controlled

>> value.

>>

>> 'osb()', observable memory barrier, resolves to an lfence on x86. My

>> concern with these changes is that it is not clear what content within

>> the branch block is of concern. Peter's 'if_nospec' proposal combined

>> with Mark's 'nospec_load' would seem to clear that up, from a self

>> documenting code perspective, and let archs optionally protect entire

>> conditional blocks or individual loads within those blocks.

>

> I'm a little concerned by having to use two helpers that need to be paired. It

> means that we have to duplicate the bounds information, and I suspect in

> practice that they'll often be paired improperly.


Hmm, will they often be mispaired? In all of the examples to date, the
load occurs in the same code block as the bounds-checking if()
statement.

> I think that we can avoid those problems if we use nospec_ptr() on its own. It

> returns NULL if the pointer would be out-of-bounds, so we can use it in the

> if-statement.

>

> On x86, that can contain the bounds checks followed be an OSB / lfence, like

> if_nospec(). On arm we can use our architected sequence. In either case,

> nospec_ptr() returns NULL if the pointer would be out-of-bounds.

>

> Then, rather than sequences like:

>

>         if_nospec(idx < max) {

>                 val = nospec_array_load(array, idx, max);

>                 ...

>         }

>

> ... we could have:

>

>         if ((elem_ptr = nospec_array_ptr(array, idx, max)) {

>                 val = *elem_ptr;

>                 ...

>         }

>

> ... which also means we don't have to annotate every single load in the branch

> if the element is a structure with multiple fields.


We wouldn't need to annotate each load in that case, right? Just the
instance that uses idx to calculate an address.

if_nospec (idx < max_idx) {
   elem_ptr = nospec_array_ptr(array, idx, max);
   val = elem_ptr->val;
   val2 = elem_ptr->val2;
}

> Does that sound workable to you?


Relative to the urgency of getting fixes upstream I'm ok with whatever
approach generates the least debate from sub-system maintainers.

One concern is that on x86 the:

    if ((elem_ptr = nospec_array_ptr(array, idx, max)) {

...approach produces two conditional branches whereas:

    if_nospec (idx < max_idx) {
        elem_ptr = nospec_array_ptr(array, idx, max);

....can be optimized to one conditional with the barrier.

Is that convincing enough to proceed with the if_nospec + nospec_* combination?
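
(For reference, a generic fallback for nospec_array_ptr() along the lines
described above might look like the sketch below -- bounds check, then the
osb() barrier from this series, and NULL for an out-of-bounds index. This is
only an illustration, not the actual patch.)

	#define nospec_array_ptr(arr, idx, max)				\
	({								\
		typeof(&(arr)[0]) __ptr = NULL;				\
		typeof(idx) __idx = (idx);				\
									\
		if (__idx < (max)) {					\
			osb();						\
			__ptr = &(arr)[__idx];				\
		}							\
		__ptr;							\
	})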
Linus Torvalds Jan. 4, 2018, 10:20 p.m. UTC | #37
On Thu, Jan 4, 2018 at 1:43 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>

> As far as I understand the presence of array2[val] discloses more

> information, but in terms of the cpu taking an action that it is

> observable in the cache that's already occurred when "val =

> array[attacker_controlled_index];" is speculated. Lets err on the side

> of caution and shut down all the observable actions that are already

> explicitly gated by an input validation check. In other words, a low

> bandwidth information leak is still a leak.


I do think that the change to __fcheck_files() is interesting, because
that one is potentially rather performance-sensitive.

And the data access speculation annotations we presumably won't be
able to get rid of eventually with newer hardware, so to some degree
this is a bigger deal than the random hacks that affect _current_
hardware but hopefully hw designers will fix in the future.

That does make me think that we should strive to perhaps use the same
model that the BPF people are looking at: making the address
generation part of the computation, rather than using a barrier. I'm
not sure how expensive lfence is, but from every single time ever in
the history of man, barriers have been a bad idea. Which makes me
suspect that lfence is a bad idea too.

If one common pattern is going to be bounds-checking array accesses
like this, then we probably should strive to turn

    if (index < bound)
       val = array[index];

into making sure we actually have that data dependency on the address
instead. Whether that is done with a "select" instruction or by using
an "and" with -1/0 is then a small detail.

I wonder if we can make the interface be something really straightforward:

 - specify a special

        array_access(array, index, max)

    operation, and say that the access is guaranteed to not
speculatively access outside the array, and return 0 (of the type
"array entry") if outside the range.

 - further specify that it's safe as READ_ONCE() and/or RCU access
(and only defined for native data types)

That would turn that existing __fcheck_files() from

        if (fd < fdt->max_fds)
                return rcu_dereference_raw(fdt->fd[fd]);
        return NULL;

into just

        return array_access(fdt->fd, fd, ft->max_fds);

and the *implementation* might be architecture-specific, but one
particular implementation would be something like simply

#define array_access(base, idx, max) ({                         \
        union { typeof(base[0]) _val; unsigned long _bit; } __u;\
        unsigned long _i = (idx);                               \
        unsigned long _m = (max);                               \
        unsigned long _mask = _i < _m ? ~0 : 0;                 \
        OPTIMIZER_HIDE_VAR(_mask);                              \
        __u._val = base[_i & _mask];                            \
        __u._bit &= _mask;                                      \
        __u._val; })

(Ok, the above is not exhaustively tested, but you get the idea, and
gcc doesn't seem to mess it up *too* badly).

No barriers anywhere, except for the compiler barrier to make sure the
compiler doesn't see how it all depends on that comparison, and
actually uses two "and" instructions to fix the address and the data.

How slow is lfence? Maybe it's not slow, but the above looks like it
*might* be better..

                     Linus
Linus Torvalds Jan. 4, 2018, 10:23 p.m. UTC | #38
On Thu, Jan 4, 2018 at 2:20 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>

> #define array_access(base, idx, max) ({                         \

>         union { typeof(base[0]) _val; unsigned long _bit; } __u;\

>         unsigned long _i = (idx);                               \

>         unsigned long _m = (max);                               \

>         unsigned long _mask = _i < _m ? ~0 : 0;                 \

>         OPTIMIZER_HIDE_VAR(_mask);                              \

>         __u._val = base[_i & _mask];                            \

>         __u._bit &= _mask;                                      \

>         __u._val; })


That

    __u._val = base[_i & _mask];

thing would have to be READ_ONCE() to make it all ok for the special
cases without locking etc.
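
With that change folded in, the illustrative macro from the previous mail
would read something like (still a sketch, not a final implementation):

#define array_access(base, idx, max) ({                         \
        union { typeof(base[0]) _val; unsigned long _bit; } __u;\
        unsigned long _i = (idx);                               \
        unsigned long _m = (max);                               \
        unsigned long _mask = _i < _m ? ~0 : 0;                 \
        OPTIMIZER_HIDE_VAR(_mask);                              \
        __u._val = READ_ONCE(base[_i & _mask]);                 \
        __u._bit &= _mask;                                      \
        __u._val; })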

              Linus
Pavel Machek Jan. 4, 2018, 10:44 p.m. UTC | #39
Hi!

> >> > What should one be looking for.  Do you have a typical example?

> >> >

> >>

> >> See "Exploiting Conditional Branch Misprediction" from the paper [1].

> >>

> >> The typical example is an attacker controlled index used to trigger a

> >> dependent read near a branch. Where an example of "near" from the

> >> paper is "up to 188 simple instructions inserted in the source code

> >> between the ‘if’ statement and the line accessing array...".

> >>

> >> if (attacker_controlled_index < bound)

> >>      val = array[attacker_controlled_index];

> >> else

> >>     return error;

> >>

> >> ...when the cpu speculates that the 'index < bound' branch is taken it

> >> reads index and uses that value to read array[index]. The result of an

> >> 'array' relative read is potentially observable in the cache.

> >

> > You still need

> >

> >         (void) array2[val];

> >

> > after that to get something observable, right?

> 

> As far as I understand the presence of array2[val] discloses more

> information, but in terms of the cpu taking an action that it is

> observable in the cache that's already occurred when "val =

> array[attacker_controlled_index];" is speculated. Lets err on the


Well yes, the attacker can observe val =
array[attacker_controlled_index];. But that's not something he's
interested in. So the CPU cheats and the attacker has proof of it, but
he knew that before.

>side

> of caution and shut down all the observable actions that are already

> explicitly gated by an input validation check. In other words, a low

> bandwidth information leak is still a leak.


What did it leak? Nothing. The attacker had to know
array+attacker_controlled_index, and he now knows
(array+attacker_controlled_index)%CACHELINE_SIZE.

With (void) array2[val];, the attack gets interesting -- I now know
*(array+attacker_controlled_index) % CACHELINE_SIZE ... allowing me to
get information from an arbitrary place in memory -- which is useful for
.. reading ssh keys, for example.
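
Putting the two loads together, the gadget being described looks roughly like
this (a sketch, reusing the array/array2/bound names from earlier in the
thread; the paper additionally scales 'val' by a cache-line-sized stride):

	if (attacker_controlled_index < bound) {
		val = array[attacker_controlled_index];
		/*
		 * Which cache line of array2 is pulled in now depends on
		 * 'val', so probing array2 afterwards leaks information
		 * about the speculatively loaded value.
		 */
		(void) array2[val];
	}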

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Alan Cox Jan. 4, 2018, 10:55 p.m. UTC | #40
> and the *implementation* might be architecture-specific, but one

> particular implementation would be something like simply

> 

> #define array_access(base, idx, max) ({                         \

>         union { typeof(base[0]) _val; unsigned long _bit; } __u;\

>         unsigned long _i = (idx);                               \

>         unsigned long _m = (max);                               \

>         unsigned long _mask = _i < _m ? ~0 : 0;                 \

>         OPTIMIZER_HIDE_VAR(_mask);                              \

>         __u._val = base[_i & _mask];                            \

>         __u._bit &= _mask;                                      \

>         __u._val; })

> 

> (Ok, the above is not exhaustively tested, but you get the idea, and

> gcc doesn't seem to mess it up *too* badly).


How do you ensure that the CPU doesn't speculate _i < _m ? ~0 : 0, pick the
wrong mask and then reference base[]?

It's a nice idea but I think we'd need to run it past the various CPU
vendors especially before it was implemented as a generic solution.
Anding with a constant works because the constant doesn't get speculated
and nor does the and with a constant, but you've got a whole additional
conditional path in your macro.

Alan
Linus Torvalds Jan. 4, 2018, 11:06 p.m. UTC | #41
On Thu, Jan 4, 2018 at 2:55 PM, Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote:
>

> How do you ensure that the CPU doesn't speculate j < _m  ? ~0 : 0 pick the

> wrong mask and then reference base[] ?


.. yeah, that's exactly where we want to make sure that the compiler
uses a select or 'setb'.

That's what gcc does for me in testing:

        xorl    %eax, %eax
        setbe   %al
        negq    %rax

 but yes, we'd need to guarantee it somehow.

Presumably that is where we end up having some arch-specific stuff.
Possibly there is some gcc builtin. I wanted to avoid actually writing
architecture-specific asm.
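
(If it did come to arch-specific asm, one conceivable x86 sequence is a
carry-based mask with no conditional branch at all; mask_nospec() here is
purely a sketch for illustration, not a proposal:)

	/* Sketch only: ~0UL when idx < max, 0 otherwise, without a branch */
	static inline unsigned long mask_nospec(unsigned long idx, unsigned long max)
	{
		unsigned long mask;

		/* cmp sets CF when idx < max (unsigned); sbb turns CF into 0 or ~0 */
		asm ("cmp %1, %2\n\t"
		     "sbb %0, %0"
		     : "=r" (mask)
		     : "g" (max), "r" (idx)
		     : "cc");
		return mask;
	}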

> Anding with a constant works because the constant doesn't get speculated

> and nor does the and with a constant, but you've got a whole additional

> conditional path in your macro.


Absolutely. Think of it as an example, not "the solution".

It's also possible that x86 'lfence' really is so fast that it doesn't
make sense to try to do this. Agner Fog claims that it's single-cycle
(well, except for P4, surprise, surprise), but I suspect that his
timings are simply for 'lfence' in a loop or something. Which may not
show the real cost of actually halting things until they are stable.

Also, maybe that __fcheck_files() pattern where getting a NULL pointer
happens to be the right thing for out-of-range is so unusual as to be
useless, and most people end up having to have that limit check for
other reasons anyway.

          Linus
Alan Cox Jan. 4, 2018, 11:11 p.m. UTC | #42
> It's also possible that x86 'lfence' really is so fast that it doesn't

> make sense to try to do this.


If you've got a lot going on it's not 1 cycle.

Alan
Dan Williams Jan. 4, 2018, 11:12 p.m. UTC | #43
On Thu, Jan 4, 2018 at 2:44 PM, Pavel Machek <pavel@ucw.cz> wrote:
> Hi!

>

>> >> > What should one be looking for.  Do you have a typical example?

>> >> >

>> >>

>> >> See "Exploiting Conditional Branch Misprediction" from the paper [1].

>> >>

>> >> The typical example is an attacker controlled index used to trigger a

>> >> dependent read near a branch. Where an example of "near" from the

>> >> paper is "up to 188 simple instructions inserted in the source code

>> >> between the ‘if’ statement and the line accessing array...".

>> >>

>> >> if (attacker_controlled_index < bound)

>> >>      val = array[attacker_controlled_index];

>> >> else

>> >>     return error;

>> >>

>> >> ...when the cpu speculates that the 'index < bound' branch is taken it

>> >> reads index and uses that value to read array[index]. The result of an

>> >> 'array' relative read is potentially observable in the cache.

>> >

>> > You still need

>> >

>> >         (void) array2[val];

>> >

>> > after that to get something observable, right?

>>

>> As far as I understand the presence of array2[val] discloses more

>> information, but in terms of the cpu taking an action that it is

>> observable in the cache that's already occurred when "val =

>> array[attacker_controlled_index];" is speculated. Lets err on the

>

> Well yes, attacker can observe val =

> array[attacker_controlled_index]; . But that's not something he's

> interested in. So the CPU cheats and attacker has a proof. But he knew

> that before.

>

>>side

>> of caution and shut down all the observable actions that are already

>> explicitly gated by an input validation check. In other words, a low

>> bandwidth information leak is still a leak.

>

> What did it leak? Nothing. Attacker had to know

> array+attacker_controlled_index, and he now knows

> (array+attacker_controlled_index)%CACHELINE_SIZE.

>

> With (void) array2[val];, the attack gets interesting -- I now know

> *(array+attacker_controlled_index) % CACHELINE_SIZE ... allowing me to

> get information from arbitrary place in memory -- which is useful for

> .. reading ssh keys, for example.


Right, but how far away from "val = array[attacker_controlled_index];"
in the instruction stream do you need to look before you're
comfortable there are no 'val'-dependent reads in the speculation window
on all possible architectures? Until we have variable annotations and
compiler help my guess is that static analysis has an easier time
pointing us to the first potentially bad speculative access.
Alan Cox Jan. 4, 2018, 11:21 p.m. UTC | #44
> Right, but how far away from "val = array[attacker_controlled_index];"

> in the instruction stream do you need to look before you're

> comfortable there's no 'val' dependent reads in the speculation window

> on all possible architectures. Until we have variable annotations and

> compiler help my guess is that static analysis has an easier time

> pointing us to the first potentially bad speculative access.


On x86 the researchers are saying up to 180 or so simple instructions.
I've not seen any Intel or AMD official quoted value. It's enough that
you need to peer across functions.

Alan
Pavel Machek Jan. 4, 2018, 11:33 p.m. UTC | #45
Hi!

> > What did it leak? Nothing. Attacker had to know

> > array+attacker_controlled_index, and he now knows

> > (array+attacker_controlled_index)%CACHELINE_SIZE.

> >

> > With (void) array2[val];, the attack gets interesting -- I now know

> > *(array+attacker_controlled_index) % CACHELINE_SIZE ... allowing me to

> > get information from arbitrary place in memory -- which is useful for

> > .. reading ssh keys, for example.

> 

> Right, but how far away from "val = array[attacker_controlled_index];"

> in the instruction stream do you need to look before you're

> comfortable there's no 'val' dependent reads in the speculation window

> on all possible architectures. Until we have variable annotations and

> compiler help my guess is that static analysis has an easier time

> pointing us to the first potentially bad speculative access.


Well, you are already scanning for if (attacker_controlled_index <
limit) .... array[attacker_controlled_index] and those can already be
far away from each other....

Anyway, likely in the end a human should be creating the patch, and if
there's no array2[val], we do not need the patch after all.

Best regards,

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Dan Williams Jan. 5, 2018, 12:24 a.m. UTC | #46
On Thu, Jan 4, 2018 at 3:06 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jan 4, 2018 at 2:55 PM, Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote:

>>

>> How do you ensure that the CPU doesn't speculate j < _m  ? ~0 : 0 pick the

>> wrong mask and then reference base[] ?

>

> .. yeah, that's exactly where we want to make sure that the compiler

> uses a select or 'setb'.

>

> That's what gcc does for me in testing:

>

>         xorl    %eax, %eax

>         setbe   %al

>         negq    %rax

>

>  but yes, we'd need to guarantee it somehow.

>

> Presumably that is where we end up having some arch-specific stuff.

> Possibly there is some gcc builtin. I wanted to avoid actually writing

> architecture-specific asm.

>

>> Anding with a constant works because the constant doesn't get speculated

>> and nor does the and with a constant, but you've got a whole additional

>> conditional path in your macro.

>

> Absolutely. Think of it as an example, not "the solution".

>

> It's also possible that x86 'lfence' really is so fast that it doesn't

> make sense to try to do this. Agner Fog claims that it's single-cycle

> (well, except for P4, surprise, surprise), but I suspect that his

> timings are simply for 'lfence' in a loop or something. Which may not

> show the real cost of actually halting things until they are stable.

>

> Also, maybe that __fcheck_files() pattern where getting a NULL pointer

> happens to be the right thing for out-of-range is so unusual as to be

> useless, and most people end up having to have that limit check for

> other reasons anyway.


This potential barrier avoidance optimization technique is something
that could fit in the nospec_{ptr,load,array_load} interface that Mark
defined for ARM, and is constructed around a proposed compiler
builtin. Although lfence is not a full serializing instruction, so
before we spend too much effort trying to kill it we should measure
how bad it is in practice in these paths.

At this point I'll go ahead with rewriting the osb() patches into
using Mark's nospec_* accessors plus the if_nospec macro. We can kill
the barrier in if_nospec once we are sure the compiler will always "Do
the Right Thing" with the array_access() approach, and can otherwise
make the array_access() approach the 'best-effort' default fallback.
Julia Lawall Jan. 5, 2018, 8:11 a.m. UTC | #47
It looks like the problem in terms of detection is to find values that
should be annotated as __user.  Poking around a bit, it seems like this
tool is doing just that:

http://www.cs.umd.edu/~jfoster/cqual/

It dates from 2004, but perhaps the developer could be motivated to pick
it up again.

I don't think Coccinelle would be good for doing this (ie implementing
taint analysis) because the dataflow is too complicated.

julia
Mark Rutland Jan. 5, 2018, 2:40 p.m. UTC | #48
On Thu, Jan 04, 2018 at 02:09:52PM -0800, Dan Williams wrote:
> On Thu, Jan 4, 2018 at 3:47 AM, Mark Rutland <mark.rutland@arm.com> wrote:

> > Hi Dan,

> >

> > Thanks for these examples.

> >

> > On Thu, Jan 04, 2018 at 03:10:51AM +0000, Williams, Dan J wrote:

> >> Note, the following is Elena's work, I'm just helping poke the upstream

> >> discussion along while she's offline.

> >>

> >> Elena audited the static analysis reports down to the following

> >> locations where userspace provides a value for a conditional branch and

> >> then later creates a dependent load on that same userspace controlled

> >> value.

> >>

> >> 'osb()', observable memory barrier, resolves to an lfence on x86. My

> >> concern with these changes is that it is not clear what content within

> >> the branch block is of concern. Peter's 'if_nospec' proposal combined

> >> with Mark's 'nospec_load' would seem to clear that up, from a self

> >> documenting code perspective, and let archs optionally protect entire

> >> conditional blocks or individual loads within those blocks.

> >

> > I'm a little concerned by having to use two helpers that need to be paired. It

> > means that we have to duplicate the bounds information, and I suspect in

> > practice that they'll often be paired improperly.

> 

> Hmm, will they be often mispaired? All of the examples to date the

> load occurs in the same code block as the bound checking if()

> statement.


Practically speaking, barriers get misused all the time, and having a
single helper minimizes the risk of misuse.

> > I think that we can avoid those problems if we use nospec_ptr() on its own. It

> > returns NULL if the pointer would be out-of-bounds, so we can use it in the

> > if-statement.

> >

> > On x86, that can contain the bounds checks followed be an OSB / lfence, like

> > if_nospec(). On arm we can use our architected sequence. In either case,

> > nospec_ptr() returns NULL if the pointer would be out-of-bounds.

> >

> > Then, rather than sequences like:

> >

> >         if_nospec(idx < max) {

> >                 val = nospec_array_load(array, idx, max);

> >                 ...

> >         }

> >

> > ... we could have:

> >

> >         if ((elem_ptr = nospec_array_ptr(array, idx, max)) {

> >                 val = *elem_ptr;

> >                 ...

> >         }

> >

> > ... which also means we don't have to annotate every single load in the branch

> > if the element is a structure with multiple fields.

> 

> We wouldn't need to annotate each load in that case, right? Just the

> instance that uses idx to calculate an address.


Correct.

> if_nospec (idx < max_idx) {

>    elem_ptr = nospec_array_ptr(array, idx, max);

>    val = elem_ptr->val;

>    val2 = elem_ptr->val2;

> }

> 

> > Does that sound workable to you?

> 

> Relative to the urgency of getting fixes upstream I'm ok with whatever

> approach generates the least debate from sub-system maintainers.

> 

> One concern is that on x86 the:

> 

>     if ((elem_ptr = nospec_array_ptr(array, idx, max)) {

> 

> ...approach produces two conditional branches whereas:

> 

>     if_nospec (idx < max_idx) {

>         elem_ptr = nospec_array_ptr(array, idx, max);

> 

> ....can be optimized to one conditional with the barrier.


Do you mean because the NULL check is redundant? I was hoping that the
compiler would have the necessary visibility to fold that with the
bounds check (on the assumption that the array base must not be NULL).

Thanks,
Mark.
Dan Williams Jan. 5, 2018, 4:44 p.m. UTC | #49
On Fri, Jan 5, 2018 at 6:40 AM, Mark Rutland <mark.rutland@arm.com> wrote:
> On Thu, Jan 04, 2018 at 02:09:52PM -0800, Dan Williams wrote:

>> On Thu, Jan 4, 2018 at 3:47 AM, Mark Rutland <mark.rutland@arm.com> wrote:

>> > Hi Dan,

>> >

>> > Thanks for these examples.

>> >

>> > On Thu, Jan 04, 2018 at 03:10:51AM +0000, Williams, Dan J wrote:

>> >> Note, the following is Elena's work, I'm just helping poke the upstream

>> >> discussion along while she's offline.

>> >>

>> >> Elena audited the static analysis reports down to the following

>> >> locations where userspace provides a value for a conditional branch and

>> >> then later creates a dependent load on that same userspace controlled

>> >> value.

>> >>

>> >> 'osb()', observable memory barrier, resolves to an lfence on x86. My

>> >> concern with these changes is that it is not clear what content within

>> >> the branch block is of concern. Peter's 'if_nospec' proposal combined

>> >> with Mark's 'nospec_load' would seem to clear that up, from a self

>> >> documenting code perspective, and let archs optionally protect entire

>> >> conditional blocks or individual loads within those blocks.

>> >

>> > I'm a little concerned by having to use two helpers that need to be paired. It

>> > means that we have to duplicate the bounds information, and I suspect in

>> > practice that they'll often be paired improperly.

>>

>> Hmm, will they be often mispaired? All of the examples to date the

>> load occurs in the same code block as the bound checking if()

>> statement.

>

> Practically speaking, barriers get misused all the time, and having a

> single helper minimizes the risk or misuse.


I agree, but this is why if_nospec hides the barrier / makes it implicit.

>

>> > I think that we can avoid those problems if we use nospec_ptr() on its own. It

>> > returns NULL if the pointer would be out-of-bounds, so we can use it in the

>> > if-statement.

>> >

>> > On x86, that can contain the bounds checks followed be an OSB / lfence, like

>> > if_nospec(). On arm we can use our architected sequence. In either case,

>> > nospec_ptr() returns NULL if the pointer would be out-of-bounds.

>> >

>> > Then, rather than sequences like:

>> >

>> >         if_nospec(idx < max) {

>> >                 val = nospec_array_load(array, idx, max);

>> >                 ...

>> >         }

>> >

>> > ... we could have:

>> >

>> >         if ((elem_ptr = nospec_array_ptr(array, idx, max)) {

>> >                 val = *elem_ptr;

>> >                 ...

>> >         }

>> >

>> > ... which also means we don't have to annotate every single load in the branch

>> > if the element is a structure with multiple fields.

>>

>> We wouldn't need to annotate each load in that case, right? Just the

>> instance that uses idx to calculate an address.

>

> Correct.

>

>> if_nospec (idx < max_idx) {

>>    elem_ptr = nospec_array_ptr(array, idx, max);

>>    val = elem_ptr->val;

>>    val2 = elem_ptr->val2;

>> }

>>

>> > Does that sound workable to you?

>>

>> Relative to the urgency of getting fixes upstream I'm ok with whatever

>> approach generates the least debate from sub-system maintainers.

>>

>> One concern is that on x86 the:

>>

>>     if ((elem_ptr = nospec_array_ptr(array, idx, max)) {

>>

>> ...approach produces two conditional branches whereas:

>>

>>     if_nospec (idx < max_idx) {

>>         elem_ptr = nospec_array_ptr(array, idx, max);

>>

>> ....can be optimized to one conditional with the barrier.

>

> Do you mean because the NULL check is redundant? I was hoping that the

> compiler would have the necessary visibility to fold that with the

> bounds check (on the assumption that the array base must not be NULL).


...but there are legitimately two conditionals: one to control the
non-speculative flow, and one to sink the speculation a la the
array_access() approach from Linus. Keeping those separate seems to
lead to less change in the suspected areas. In any event I'll post the
reworked patches and we can iterate from there.
Dan Williams Jan. 5, 2018, 6:05 p.m. UTC | #50
On Fri, Jan 5, 2018 at 8:44 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Fri, Jan 5, 2018 at 6:40 AM, Mark Rutland <mark.rutland@arm.com> wrote:

>> On Thu, Jan 04, 2018 at 02:09:52PM -0800, Dan Williams wrote:

>>> On Thu, Jan 4, 2018 at 3:47 AM, Mark Rutland <mark.rutland@arm.com> wrote:

>>> > Hi Dan,

>>> >

>>> > Thanks for these examples.

>>> >

>>> > On Thu, Jan 04, 2018 at 03:10:51AM +0000, Williams, Dan J wrote:

>>> >> Note, the following is Elena's work, I'm just helping poke the upstream

>>> >> discussion along while she's offline.

>>> >>

>>> >> Elena audited the static analysis reports down to the following

>>> >> locations where userspace provides a value for a conditional branch and

>>> >> then later creates a dependent load on that same userspace controlled

>>> >> value.

>>> >>

>>> >> 'osb()', observable memory barrier, resolves to an lfence on x86. My

>>> >> concern with these changes is that it is not clear what content within

>>> >> the branch block is of concern. Peter's 'if_nospec' proposal combined

>>> >> with Mark's 'nospec_load' would seem to clear that up, from a self

>>> >> documenting code perspective, and let archs optionally protect entire

>>> >> conditional blocks or individual loads within those blocks.

>>> >

>>> > I'm a little concerned by having to use two helpers that need to be paired. It

>>> > means that we have to duplicate the bounds information, and I suspect in

>>> > practice that they'll often be paired improperly.

>>>

>>> Hmm, will they be often mispaired? All of the examples to date the

>>> load occurs in the same code block as the bound checking if()

>>> statement.

>>

>> Practically speaking, barriers get misused all the time, and having a

>> single helper minimizes the risk or misuse.

>

> I agree, but this is why if_nospec hides the barrier / makes it implicit.

>

>>

>>> > I think that we can avoid those problems if we use nospec_ptr() on its own. It

>>> > returns NULL if the pointer would be out-of-bounds, so we can use it in the

>>> > if-statement.

>>> >

>>> > On x86, that can contain the bounds checks followed be an OSB / lfence, like

>>> > if_nospec(). On arm we can use our architected sequence. In either case,

>>> > nospec_ptr() returns NULL if the pointer would be out-of-bounds.

>>> >

>>> > Then, rather than sequences like:

>>> >

>>> >         if_nospec(idx < max) {

>>> >                 val = nospec_array_load(array, idx, max);

>>> >                 ...

>>> >         }

>>> >

>>> > ... we could have:

>>> >

>>> >         if ((elem_ptr = nospec_array_ptr(array, idx, max)) {

>>> >                 val = *elem_ptr;

>>> >                 ...

>>> >         }

>>> >

>>> > ... which also means we don't have to annotate every single load in the branch

>>> > if the element is a structure with multiple fields.

>>>

>>> We wouldn't need to annotate each load in that case, right? Just the

>>> instance that uses idx to calculate an address.

>>

>> Correct.

>>

>>> if_nospec (idx < max_idx) {

>>>    elem_ptr = nospec_array_ptr(array, idx, max);

>>>    val = elem_ptr->val;

>>>    val2 = elem_ptr->val2;

>>> }

>>>

>>> > Does that sound workable to you?

>>>

>>> Relative to the urgency of getting fixes upstream I'm ok with whatever

>>> approach generates the least debate from sub-system maintainers.

>>>

>>> One concern is that on x86 the:

>>>

>>>     if ((elem_ptr = nospec_array_ptr(array, idx, max)) {

>>>

>>> ...approach produces two conditional branches whereas:

>>>

>>>     if_nospec (idx < max_idx) {

>>>         elem_ptr = nospec_array_ptr(array, idx, max);

>>>

>>> ....can be optimized to one conditional with the barrier.

>>

>> Do you mean because the NULL check is redundant? I was hoping that the

>> compiler would have the necessary visibility to fold that with the

>> bounds check (on the assumption that the array base must not be NULL).

>

> ...but there's legitimately 2 conditionals one to control the

> non-speculative flow, and one to sink the speculation ala the

> array_access() approach from Linus. Keeping those separate seems to

> lead to less change in the suspected areas. In any event I'll post the

> reworked patches and we can iterate from there.


Disregard this, I'm going ahead with your new nospec_array_ptr()
helper as it falls out nicely and seems to be just as self-documenting
as 'if_nospec'. Since I was already done with the if_nospec +
nospec_array_load conversions, it's not much more work to go back and
roll the nospec_array_ptr conversion on top.