From patchwork Fri Dec 14 13:54:46 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?q?Alex_Benn=C3=A9e?= X-Patchwork-Id: 153852 Delivered-To: patch@linaro.org Received: by 2002:a2e:299d:0:0:0:0:0 with SMTP id p29-v6csp2111464ljp; Fri, 14 Dec 2018 06:08:03 -0800 (PST) X-Google-Smtp-Source: AFSGD/VLQGimzD4rOx61QEYiDCu9ID+ML5Tf8MslH5ZamsxdTlZBp5d9L38Cs0IRkZC8FkhX5TGB X-Received: by 2002:a0c:b8ac:: with SMTP id y44mr2866250qvf.76.1544796482931; Fri, 14 Dec 2018 06:08:02 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1544796482; cv=none; d=google.com; s=arc-20160816; b=UasFcyYmllsHRBXUnUCjFZoB9X12Av8ylninM6TOl6bphUggrmS1fCzB8a8saMNDBN 4BamFYtX5D70TKCm9QI31P40MBaD+i16pV5CduHRwbQ5B/Xfl9mQIXlbNdUDv781XNHA Al8YmEsr9nJff4xcqugroxt2dx6mjdEixU2V9M0K1Ta28L/LZTt6qvfzaseqCQ2/on2O KInd8dK89/kQT2zraC9roSpBbKQDPQUT5eAdccj15nrINlvicOMry64fBNAdiza5AIT7 XobVmjXO+NNKCFBvtP19iFCYl2YVlh2BlM54cP1vUUNEm/trtLJflbmz66xBUkQaJ2ra 9JAQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=sender:errors-to:cc:list-subscribe:list-help:list-post:list-archive :list-unsubscribe:list-id:precedence:subject :content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:to:from:dkim-signature; bh=v2LkXUad3d8kbcUVsjTYbMWDRQeu5nyQch9NbYRGvp8=; b=STZmkH9mf/d8c2v6dMXUuv6n9m10egcI/G7i3ODdS2IoWK9JeB5zAuPrjDWMGUwqGx l9fDj/35BMZTWRnvS3qe6E6b1HUN2P5EQylVE5pPpXq3/hgdd2j+E09bBhjXWFRCy64d vnaflWhbU3RCtWTPF/pp5MbX7xFG06zTplzZR0S/0DPfQ5o6ToBHJukmnQIzNu99qObg eQg/fPEMsYjpymgcvWlKgL0mwMkkVKPdTIqQBNIVYsDBBKzK0M8X9CVP8UoF6zyKlJQ1 KKOl3tUtQHL4SlpoNF+QFiYeccL9j2+t5p1Ja4kAXVD3S27hOcwK7XxxO9Oz3HPBXtA6 Jr9A== ARC-Authentication-Results: i=1; mx.google.com; dkim=fail header.i=@linaro.org header.s=google header.b=Hb7X+6Tv; spf=pass (google.com: domain of qemu-devel-bounces+patch=linaro.org@nongnu.org designates 2001:4830:134:3::11 as permitted sender) smtp.mailfrom="qemu-devel-bounces+patch=linaro.org@nongnu.org"; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=linaro.org Return-Path: Received: from lists.gnu.org (lists.gnu.org. [2001:4830:134:3::11]) by mx.google.com with ESMTPS id 37si944764qvb.136.2018.12.14.06.08.02 for (version=TLS1 cipher=AES128-SHA bits=128/128); Fri, 14 Dec 2018 06:08:02 -0800 (PST) Received-SPF: pass (google.com: domain of qemu-devel-bounces+patch=linaro.org@nongnu.org designates 2001:4830:134:3::11 as permitted sender) client-ip=2001:4830:134:3::11; Authentication-Results: mx.google.com; dkim=fail header.i=@linaro.org header.s=google header.b=Hb7X+6Tv; spf=pass (google.com: domain of qemu-devel-bounces+patch=linaro.org@nongnu.org designates 2001:4830:134:3::11 as permitted sender) smtp.mailfrom="qemu-devel-bounces+patch=linaro.org@nongnu.org"; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=linaro.org Received: from localhost ([::1]:33721 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gXo8H-0004Fm-CQ for patch@linaro.org; Fri, 14 Dec 2018 09:08:01 -0500 Received: from eggs.gnu.org ([2001:4830:134:3::10]:49198) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gXnvj-0008N4-VO for qemu-devel@nongnu.org; Fri, 14 Dec 2018 08:55:05 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1gXnvh-0001xO-W1 for qemu-devel@nongnu.org; Fri, 14 Dec 2018 08:55:03 -0500 Received: from mail-wm1-x334.google.com ([2a00:1450:4864:20::334]:52326) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1gXnvh-0001wf-NV for qemu-devel@nongnu.org; Fri, 14 Dec 2018 08:55:01 -0500 Received: by mail-wm1-x334.google.com with SMTP id m1so5774237wml.2 for ; Fri, 14 Dec 2018 05:55:01 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=v2LkXUad3d8kbcUVsjTYbMWDRQeu5nyQch9NbYRGvp8=; b=Hb7X+6TvXABzyel4moAeK6NRZ7g9C8cEwBbz2caJHMtXHgHCd/BJsc9RgbmnlhkL9I 4GqsAjBhKbXtxiKbR9o+Wjcj0rETl9NWqhoJielTJVTH0fSvtHy8tKVJln0Hr21MqVjO kCmX9Aj//qEB2ipoXUW7NsNaJh0OT3yMkZQxc= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=v2LkXUad3d8kbcUVsjTYbMWDRQeu5nyQch9NbYRGvp8=; b=Abgxs9jE1fVIeMRXSehbev8xkj4oEwf9x6mtBN4VxSHJXiGUae1RveSn6sjZevweXG lXr/i8A3alcuGBldOtxubu5c1taZvYldDV4ZLtHctGN+01pRJ5aJik0iLjyBtfICqFmV CYdgOAH5c5iSrFY0CBANzD1QDMjARc/Y2su8cbOYJAqUzwGP3ZuHcRgNWiPF8zCXhwXh Jh+WwQYn703QuSH5lNgdrdYDEMAuqkLstjNjsyv+t65HS+wR0TPmTNmLukvuEo5VHziv Y/qWk6wPr3O/+bjZvI+sy9YJOPe3MluO8TXgz4U1D31zNI7zRUF4zM8QKY8bceQNmh+H dKVw== X-Gm-Message-State: AA+aEWb9f5VJ+PyieiIaRuS16S9MN85MMkX5jyqmBnV9ypLEwW2Yak9E TtnRiuXAJSkYgciFSWZP2CdMTQ== X-Received: by 2002:a7b:c853:: with SMTP id c19mr3181399wml.61.1544795700429; Fri, 14 Dec 2018 05:55:00 -0800 (PST) Received: from zen.linaro.local ([81.128.185.34]) by smtp.gmail.com with ESMTPSA id b129sm3356042wmd.24.2018.12.14.05.54.55 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Fri, 14 Dec 2018 05:54:56 -0800 (PST) Received: from zen.linaroharston (localhost [127.0.0.1]) by zen.linaro.local (Postfix) with ESMTP id B5C1E3E056A; Fri, 14 Dec 2018 13:54:53 +0000 (GMT) From: =?utf-8?q?Alex_Benn=C3=A9e?= To: peter.maydell@linaro.org Date: Fri, 14 Dec 2018 13:54:46 +0000 Message-Id: <20181214135452.25936-10-alex.bennee@linaro.org> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20181214135452.25936-1-alex.bennee@linaro.org> References: <20181214135452.25936-1-alex.bennee@linaro.org> MIME-Version: 1.0 X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-Received-From: 2a00:1450:4864:20::334 Subject: [Qemu-devel] [PULL 09/15] fpu: introduce hardfloat X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: "Emilio G. Cota" , =?utf-8?q?Alex_Benn=C3=A9e?= , qemu-devel@nongnu.org, Aurelien Jarno Errors-To: qemu-devel-bounces+patch=linaro.org@nongnu.org Sender: "Qemu-devel" From: "Emilio G. Cota" The appended paves the way for leveraging the host FPU for a subset of guest FP operations. For most guest workloads (e.g. FP flags aren't ever cleared, inexact occurs often and rounding is set to the default [to nearest]) this will yield sizable performance speedups. The approach followed here avoids checking the FP exception flags register. See the added comment for details. This assumes that QEMU is running on an IEEE754-compliant FPU and that the rounding is set to the default (to nearest). The implementation-dependent specifics of the FPU should not matter; things like tininess detection and snan representation are still dealt with in soft-fp. However, this approach will break on most hosts if we compile QEMU with flags that break IEEE compatibility. There is no way to detect all of these flags at compilation time, but at least we check for -ffast-math (which defines __FAST_MATH__) and disable hardfloat (plus emit a #warning) when it is set. This patch just adds common code. Some operations will be migrated to hardfloat in subsequent patches to ease bisection. Note: some architectures (at least PPC, there might be others) clear the status flags passed to softfloat before most FP operations. This precludes the use of hardfloat, so to avoid introducing a performance regression for those targets, we add a flag to disable hardfloat. In the long run though it would be good to fix the targets so that at least the inexact flag passed to softfloat is indeed sticky. Reviewed-by: Alex Bennée Signed-off-by: Emilio G. Cota Signed-off-by: Alex Bennée -- 2.17.1 diff --git a/fpu/softfloat.c b/fpu/softfloat.c index ecdc00c633..468c4554d5 100644 --- a/fpu/softfloat.c +++ b/fpu/softfloat.c @@ -83,6 +83,7 @@ this code that are retained. * target-dependent and needs the TARGET_* macros. */ #include "qemu/osdep.h" +#include #include "qemu/bitops.h" #include "fpu/softfloat.h" @@ -95,6 +96,324 @@ this code that are retained. *----------------------------------------------------------------------------*/ #include "fpu/softfloat-macros.h" +/* + * Hardfloat + * + * Fast emulation of guest FP instructions is challenging for two reasons. + * First, FP instruction semantics are similar but not identical, particularly + * when handling NaNs. Second, emulating at reasonable speed the guest FP + * exception flags is not trivial: reading the host's flags register with a + * feclearexcept & fetestexcept pair is slow [slightly slower than soft-fp], + * and trapping on every FP exception is not fast nor pleasant to work with. + * + * We address these challenges by leveraging the host FPU for a subset of the + * operations. To do this we expand on the idea presented in this paper: + * + * Guo, Yu-Chuan, et al. "Translating the ARM Neon and VFP instructions in a + * binary translator." Software: Practice and Experience 46.12 (2016):1591-1615. + * + * The idea is thus to leverage the host FPU to (1) compute FP operations + * and (2) identify whether FP exceptions occurred while avoiding + * expensive exception flag register accesses. + * + * An important optimization shown in the paper is that given that exception + * flags are rarely cleared by the guest, we can avoid recomputing some flags. + * This is particularly useful for the inexact flag, which is very frequently + * raised in floating-point workloads. + * + * We optimize the code further by deferring to soft-fp whenever FP exception + * detection might get hairy. Two examples: (1) when at least one operand is + * denormal/inf/NaN; (2) when operands are not guaranteed to lead to a 0 result + * and the result is < the minimum normal. + */ +#define GEN_INPUT_FLUSH__NOCHECK(name, soft_t) \ + static inline void name(soft_t *a, float_status *s) \ + { \ + if (unlikely(soft_t ## _is_denormal(*a))) { \ + *a = soft_t ## _set_sign(soft_t ## _zero, \ + soft_t ## _is_neg(*a)); \ + s->float_exception_flags |= float_flag_input_denormal; \ + } \ + } + +GEN_INPUT_FLUSH__NOCHECK(float32_input_flush__nocheck, float32) +GEN_INPUT_FLUSH__NOCHECK(float64_input_flush__nocheck, float64) +#undef GEN_INPUT_FLUSH__NOCHECK + +#define GEN_INPUT_FLUSH1(name, soft_t) \ + static inline void name(soft_t *a, float_status *s) \ + { \ + if (likely(!s->flush_inputs_to_zero)) { \ + return; \ + } \ + soft_t ## _input_flush__nocheck(a, s); \ + } + +GEN_INPUT_FLUSH1(float32_input_flush1, float32) +GEN_INPUT_FLUSH1(float64_input_flush1, float64) +#undef GEN_INPUT_FLUSH1 + +#define GEN_INPUT_FLUSH2(name, soft_t) \ + static inline void name(soft_t *a, soft_t *b, float_status *s) \ + { \ + if (likely(!s->flush_inputs_to_zero)) { \ + return; \ + } \ + soft_t ## _input_flush__nocheck(a, s); \ + soft_t ## _input_flush__nocheck(b, s); \ + } + +GEN_INPUT_FLUSH2(float32_input_flush2, float32) +GEN_INPUT_FLUSH2(float64_input_flush2, float64) +#undef GEN_INPUT_FLUSH2 + +#define GEN_INPUT_FLUSH3(name, soft_t) \ + static inline void name(soft_t *a, soft_t *b, soft_t *c, float_status *s) \ + { \ + if (likely(!s->flush_inputs_to_zero)) { \ + return; \ + } \ + soft_t ## _input_flush__nocheck(a, s); \ + soft_t ## _input_flush__nocheck(b, s); \ + soft_t ## _input_flush__nocheck(c, s); \ + } + +GEN_INPUT_FLUSH3(float32_input_flush3, float32) +GEN_INPUT_FLUSH3(float64_input_flush3, float64) +#undef GEN_INPUT_FLUSH3 + +/* + * Choose whether to use fpclassify or float32/64_* primitives in the generated + * hardfloat functions. Each combination of number of inputs and float size + * gets its own value. + */ +#if defined(__x86_64__) +# define QEMU_HARDFLOAT_1F32_USE_FP 0 +# define QEMU_HARDFLOAT_1F64_USE_FP 1 +# define QEMU_HARDFLOAT_2F32_USE_FP 0 +# define QEMU_HARDFLOAT_2F64_USE_FP 1 +# define QEMU_HARDFLOAT_3F32_USE_FP 0 +# define QEMU_HARDFLOAT_3F64_USE_FP 1 +#else +# define QEMU_HARDFLOAT_1F32_USE_FP 0 +# define QEMU_HARDFLOAT_1F64_USE_FP 0 +# define QEMU_HARDFLOAT_2F32_USE_FP 0 +# define QEMU_HARDFLOAT_2F64_USE_FP 0 +# define QEMU_HARDFLOAT_3F32_USE_FP 0 +# define QEMU_HARDFLOAT_3F64_USE_FP 0 +#endif + +/* + * QEMU_HARDFLOAT_USE_ISINF chooses whether to use isinf() over + * float{32,64}_is_infinity when !USE_FP. + * On x86_64/aarch64, using the former over the latter can yield a ~6% speedup. + * On power64 however, using isinf() reduces fp-bench performance by up to 50%. + */ +#if defined(__x86_64__) || defined(__aarch64__) +# define QEMU_HARDFLOAT_USE_ISINF 1 +#else +# define QEMU_HARDFLOAT_USE_ISINF 0 +#endif + +/* + * Some targets clear the FP flags before most FP operations. This prevents + * the use of hardfloat, since hardfloat relies on the inexact flag being + * already set. + */ +#if defined(TARGET_PPC) || defined(__FAST_MATH__) +# if defined(__FAST_MATH__) +# warning disabling hardfloat due to -ffast-math: hardfloat requires an exact \ + IEEE implementation +# endif +# define QEMU_NO_HARDFLOAT 1 +# define QEMU_SOFTFLOAT_ATTR QEMU_FLATTEN +#else +# define QEMU_NO_HARDFLOAT 0 +# define QEMU_SOFTFLOAT_ATTR QEMU_FLATTEN __attribute__((noinline)) +#endif + +static inline bool can_use_fpu(const float_status *s) +{ + if (QEMU_NO_HARDFLOAT) { + return false; + } + return likely(s->float_exception_flags & float_flag_inexact && + s->float_rounding_mode == float_round_nearest_even); +} + +/* + * Hardfloat generation functions. Each operation can have two flavors: + * either using softfloat primitives (e.g. float32_is_zero_or_normal) for + * most condition checks, or native ones (e.g. fpclassify). + * + * The flavor is chosen by the callers. Instead of using macros, we rely on the + * compiler to propagate constants and inline everything into the callers. + * + * We only generate functions for operations with two inputs, since only + * these are common enough to justify consolidating them into common code. + */ + +typedef union { + float32 s; + float h; +} union_float32; + +typedef union { + float64 s; + double h; +} union_float64; + +typedef bool (*f32_check_fn)(union_float32 a, union_float32 b); +typedef bool (*f64_check_fn)(union_float64 a, union_float64 b); + +typedef float32 (*soft_f32_op2_fn)(float32 a, float32 b, float_status *s); +typedef float64 (*soft_f64_op2_fn)(float64 a, float64 b, float_status *s); +typedef float (*hard_f32_op2_fn)(float a, float b); +typedef double (*hard_f64_op2_fn)(double a, double b); + +/* 2-input is-zero-or-normal */ +static inline bool f32_is_zon2(union_float32 a, union_float32 b) +{ + if (QEMU_HARDFLOAT_2F32_USE_FP) { + /* + * Not using a temp variable for consecutive fpclassify calls ends up + * generating faster code. + */ + return (fpclassify(a.h) == FP_NORMAL || fpclassify(a.h) == FP_ZERO) && + (fpclassify(b.h) == FP_NORMAL || fpclassify(b.h) == FP_ZERO); + } + return float32_is_zero_or_normal(a.s) && + float32_is_zero_or_normal(b.s); +} + +static inline bool f64_is_zon2(union_float64 a, union_float64 b) +{ + if (QEMU_HARDFLOAT_2F64_USE_FP) { + return (fpclassify(a.h) == FP_NORMAL || fpclassify(a.h) == FP_ZERO) && + (fpclassify(b.h) == FP_NORMAL || fpclassify(b.h) == FP_ZERO); + } + return float64_is_zero_or_normal(a.s) && + float64_is_zero_or_normal(b.s); +} + +/* 3-input is-zero-or-normal */ +static inline +bool f32_is_zon3(union_float32 a, union_float32 b, union_float32 c) +{ + if (QEMU_HARDFLOAT_3F32_USE_FP) { + return (fpclassify(a.h) == FP_NORMAL || fpclassify(a.h) == FP_ZERO) && + (fpclassify(b.h) == FP_NORMAL || fpclassify(b.h) == FP_ZERO) && + (fpclassify(c.h) == FP_NORMAL || fpclassify(c.h) == FP_ZERO); + } + return float32_is_zero_or_normal(a.s) && + float32_is_zero_or_normal(b.s) && + float32_is_zero_or_normal(c.s); +} + +static inline +bool f64_is_zon3(union_float64 a, union_float64 b, union_float64 c) +{ + if (QEMU_HARDFLOAT_3F64_USE_FP) { + return (fpclassify(a.h) == FP_NORMAL || fpclassify(a.h) == FP_ZERO) && + (fpclassify(b.h) == FP_NORMAL || fpclassify(b.h) == FP_ZERO) && + (fpclassify(c.h) == FP_NORMAL || fpclassify(c.h) == FP_ZERO); + } + return float64_is_zero_or_normal(a.s) && + float64_is_zero_or_normal(b.s) && + float64_is_zero_or_normal(c.s); +} + +static inline bool f32_is_inf(union_float32 a) +{ + if (QEMU_HARDFLOAT_USE_ISINF) { + return isinf(a.h); + } + return float32_is_infinity(a.s); +} + +static inline bool f64_is_inf(union_float64 a) +{ + if (QEMU_HARDFLOAT_USE_ISINF) { + return isinf(a.h); + } + return float64_is_infinity(a.s); +} + +/* Note: @fast_test and @post can be NULL */ +static inline float32 +float32_gen2(float32 xa, float32 xb, float_status *s, + hard_f32_op2_fn hard, soft_f32_op2_fn soft, + f32_check_fn pre, f32_check_fn post, + f32_check_fn fast_test, soft_f32_op2_fn fast_op) +{ + union_float32 ua, ub, ur; + + ua.s = xa; + ub.s = xb; + + if (unlikely(!can_use_fpu(s))) { + goto soft; + } + + float32_input_flush2(&ua.s, &ub.s, s); + if (unlikely(!pre(ua, ub))) { + goto soft; + } + if (fast_test && fast_test(ua, ub)) { + return fast_op(ua.s, ub.s, s); + } + + ur.h = hard(ua.h, ub.h); + if (unlikely(f32_is_inf(ur))) { + s->float_exception_flags |= float_flag_overflow; + } else if (unlikely(fabsf(ur.h) <= FLT_MIN)) { + if (post == NULL || post(ua, ub)) { + goto soft; + } + } + return ur.s; + + soft: + return soft(ua.s, ub.s, s); +} + +static inline float64 +float64_gen2(float64 xa, float64 xb, float_status *s, + hard_f64_op2_fn hard, soft_f64_op2_fn soft, + f64_check_fn pre, f64_check_fn post, + f64_check_fn fast_test, soft_f64_op2_fn fast_op) +{ + union_float64 ua, ub, ur; + + ua.s = xa; + ub.s = xb; + + if (unlikely(!can_use_fpu(s))) { + goto soft; + } + + float64_input_flush2(&ua.s, &ub.s, s); + if (unlikely(!pre(ua, ub))) { + goto soft; + } + if (fast_test && fast_test(ua, ub)) { + return fast_op(ua.s, ub.s, s); + } + + ur.h = hard(ua.h, ub.h); + if (unlikely(f64_is_inf(ur))) { + s->float_exception_flags |= float_flag_overflow; + } else if (unlikely(fabs(ur.h) <= DBL_MIN)) { + if (post == NULL || post(ua, ub)) { + goto soft; + } + } + return ur.s; + + soft: + return soft(ua.s, ub.s, s); +} + /*---------------------------------------------------------------------------- | Returns the fraction bits of the half-precision floating-point value `a'. *----------------------------------------------------------------------------*/