From patchwork Wed Dec 4 13:22:21 2013
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Ragesh Radhakrishnan
X-Patchwork-Id: 22020
Delivered-To: patches@linaro.org
From: Ragesh Radhakrishnan
To: patches@linaro.org
Cc: Ragesh Radhakrishnan
Subject: [PATCH 5/9] Add armv8 port for yuv-rgb armv7 implementation
Date: Wed, 4 Dec 2013 18:52:21 +0530
Message-Id: <1386163341-3267-1-git-send-email-ragesh.r@linaro.org>
X-Mailer: git-send-email 1.7.9.5
X-Original-Sender: ragesh.r@linaro.org
Mailing-list: list patchwork-forward@linaro.org; contact patchwork-forward+owners@linaro.org

Add an ARMv8 port of the YCbCr-to-RGB conversion. The
generate_jsimd_ycc_rgb_convert_neon macro has been modified to use ARMv8
instructions and register names. A workaround for the RTSM integer
saturation instruction issue is also added.
---
 simd/jsimd_arm_neon_64.S |  347 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 347 insertions(+)

diff --git a/simd/jsimd_arm_neon_64.S b/simd/jsimd_arm_neon_64.S
index ac38d39..9403bbe 100644
--- a/simd/jsimd_arm_neon_64.S
+++ b/simd/jsimd_arm_neon_64.S
@@ -1532,3 +1532,350 @@ asm_function jsimd_idct_2x2_neon
 .endfunc
 
 .purgem idct_helper
+
+/*****************************************************************************/
+
+/*
+ * jsimd_ycc_extrgb_convert_neon
+ * jsimd_ycc_extbgr_convert_neon
+ * jsimd_ycc_extrgbx_convert_neon
+ * jsimd_ycc_extbgrx_convert_neon
+ * jsimd_ycc_extxbgr_convert_neon
+ * jsimd_ycc_extxrgb_convert_neon
+ *
+ * Colorspace conversion YCbCr -> RGB
+ */
+
+
+.macro do_load size
+    .if \size == 8
+        ld1             {v4.8b}, [U],8
+        ld1             {v5.8b}, [V],8
+        ld1             {v0.8b}, [Y],8
+        prfm            PLDL1KEEP,[U,#64]
+        prfm            PLDL1KEEP,[V,#64]
+        prfm            PLDL1KEEP,[Y,#64]
+    .elseif \size == 4
+        ld1             {v4.b}[0], [U],1
+        ld1             {v4.b}[1], [U],1
+        ld1             {v4.b}[2], [U],1
+        ld1             {v4.b}[3], [U],1
+        ld1             {v5.b}[0], [V],1
+        ld1             {v5.b}[1], [V],1
+        ld1             {v5.b}[2], [V],1
+        ld1             {v5.b}[3], [V],1
+        ld1             {v0.b}[0], [Y],1
+        ld1             {v0.b}[1], [Y],1
+        ld1             {v0.b}[2], [Y],1
+        ld1             {v0.b}[3], [Y],1
+    .elseif \size == 2
+        ld1             {v4.b}[4], [U],1
+        ld1             {v4.b}[5], [U],1
+        ld1             {v5.b}[4], [V],1
+        ld1             {v5.b}[5], [V],1
+        ld1             {v0.b}[4], [Y],1
+        ld1             {v0.b}[5], [Y],1
+    .elseif \size == 1
+        ld1             {v4.b}[6], [U],1
+        ld1             {v5.b}[6], [V],1
+        ld1             {v0.b}[6], [Y],1
+    .else
+        .error unsupported macroblock size
+    .endif
+.endm
+
+.macro do_store bpp, size
+    .if \bpp == 24
+        .if \size == 8
+            st3         {v10.8b, v11.8b, v12.8b}, [RGB],24
+        .elseif \size == 4
+            st3         {v10.b, v11.b, v12.b}[0], [RGB],3
+            st3         {v10.b, v11.b, v12.b}[1], [RGB],3
+            st3         {v10.b, v11.b, v12.b}[2], [RGB],3
+            st3         {v10.b, v11.b, v12.b}[3], [RGB],3
+        .elseif \size == 2
+            st3         {v10.b, v11.b, v12.b}[4], [RGB],3
+            st3         {v10.b, v11.b, v12.b}[5], [RGB],3
+        .elseif \size == 1
+            st3         {v10.b, v11.b, v12.b}[6], [RGB],3
+        .else
+            .error unsupported macroblock size
+        .endif
+    .elseif \bpp == 32
+        .if \size == 8
+            st4         {v10.8b, v11.8b, v12.8b, v13.8b}, [RGB],32
+        .elseif \size == 4
+            st4         {v10.b, v11.b, v12.b, v13.b}[0], [RGB],4
+            st4         {v10.b, v11.b, v12.b, v13.b}[1], [RGB],4
+            st4         {v10.b, v11.b, v12.b, v13.b}[2], [RGB],4
+            st4         {v10.b, v11.b, v12.b, v13.b}[3], [RGB],4
+        .elseif \size == 2
+            st4         {v10.b, v11.b, v12.b, v13.b}[4], [RGB],4
+            st4         {v10.b, v11.b, v12.b, v13.b}[5], [RGB],4
+        .elseif \size == 1
+            st4         {v10.b, v11.b, v12.b, v13.b}[6], [RGB],4
+        .else
+            .error unsupported macroblock size
+        .endif
+    .else
+        .error unsupported bpp
+    .endif
+.endm
+#ifdef RTSM_SQSHRN_SIM_ISSUE
+.macro generate_jsimd_ycc_rgb_convert_neon colorid, bpp, r_offs,rsize, g_offs,gsize, b_offs,bsize,defsize
+#else
+.macro generate_jsimd_ycc_rgb_convert_neon colorid, bpp, r_offs,rsize, g_offs,gsize, b_offs,bsize
+#endif
+/*
+ * 2 stage pipelined YCbCr->RGB conversion
+ */
+
+.macro do_yuv_to_rgb_stage1
+    uaddw           v6.8h, v2.8h, v4.8b     /* v6 = u - 128 */
+    uaddw           v8.8h, v2.8h, v5.8b     /* v8 = v - 128 */
+    smull           v20.4s, v6.4h, v1.4h[1] /* multiply by -11277 */
+    smlal           v20.4s, v8.4h, v1.4h[2] /* multiply by -23401 */
+    smull2          v22.4s, v6.8h, v1.4h[1] /* multiply by -11277 */
+    smlal2          v22.4s, v8.8h, v1.4h[2] /* multiply by -23401 */
+    smull           v24.4s, v8.4h, v1.4h[0] /* multiply by 22971 */
+    smull2          v26.4s, v8.8h, v1.4h[0] /* multiply by 22971 */
+    smull           v28.4s, v6.4h, v1.4h[3] /* multiply by 29033 */
+    smull2          v30.4s, v6.8h, v1.4h[3] /* multiply by 29033 */
+.endm
+
+.macro do_yuv_to_rgb_stage2
+    rshrn           v20.4h, v20.4s, #15
+    rshrn2          v20.8h, v22.4s, #15
+    rshrn           v24.4h, v24.4s, #14
+    rshrn2          v24.8h, v26.4s, #14
+    rshrn           v28.4h, v28.4s, #14
+    rshrn2          v28.8h, v30.4s, #14
+    uaddw           v20.8h, v20.8h, v0.8b
+    uaddw           v24.8h, v24.8h, v0.8b
+    uaddw           v28.8h, v28.8h, v0.8b
+#ifdef RTSM_SQSHRN_SIM_ISSUE
+    sqxtun          v1\g_offs\defsize, v20.8h
+    sqxtun          v1\r_offs\defsize, v24.8h
+    sqxtun          v1\b_offs\defsize, v28.8h
+
+#else
+    sqxtun          v1\g_offs\gsize, v20.4s
+    sqxtun          v1\r_offs\rsize, v24.4s
+    sqxtun          v1\b_offs\bsize, v28.4s
+#endif
+.endm
+
+.macro do_yuv_to_rgb_stage2_store_load_stage1
+    ld1             {v4.8b}, [U],8
+    rshrn           v20.4h, v20.4s, #15
+    rshrn2          v20.8h, v22.4s, #15
+    rshrn           v24.4h, v24.4s, #14
+    rshrn2          v24.8h, v26.4s, #14
+    rshrn           v28.4h, v28.4s, #14
+    ld1             {v5.8b}, [V],8
+    rshrn2          v28.8h, v30.4s, #14
+    uaddw           v20.8h, v20.8h, v0.8b
+    uaddw           v24.8h, v24.8h, v0.8b
+    uaddw           v28.8h, v28.8h, v0.8b
+#ifdef RTSM_SQSHRN_SIM_ISSUE
+    sqxtun          v1\g_offs\defsize, v20.8h
+#else
+    sqxtun          v1\g_offs\gsize, v20.4s
+#endif
+    ld1             {v0.8b}, [Y],8
+#ifdef RTSM_SQSHRN_SIM_ISSUE
+    sqxtun          v1\r_offs\defsize, v24.8h
+#else
+    sqxtun          v1\r_offs\rsize, v24.4s
+#endif
+    prfm            PLDL1KEEP,[U,#64]
+    prfm            PLDL1KEEP,[V,#64]
+    prfm            PLDL1KEEP,[Y,#64]
+#ifdef RTSM_SQSHRN_SIM_ISSUE
+    sqxtun          v1\b_offs\defsize, v28.8h
+#else
+    sqxtun          v1\b_offs\bsize, v28.4s
+#endif
+    uaddw           v6.8h, v2.8h, v4.8b     /* v6 = u - 128 */
+    uaddw           v8.8h, v2.8h, v5.8b     /* v8 = v - 128 */
+    do_store        \bpp, 8
+    smull           v20.4s, v6.4h, v1.4h[1] /* multiply by -11277 */
+    smlal           v20.4s, v8.4h, v1.4h[2] /* multiply by -23401 */
+    smull2          v22.4s, v6.8h, v1.4h[1] /* multiply by -11277 */
+    smlal2          v22.4s, v8.8h, v1.4h[2] /* multiply by -23401 */
+    smull           v24.4s, v8.4h, v1.4h[0] /* multiply by 22971 */
+    smull2          v26.4s, v8.8h, v1.4h[0] /* multiply by 22971 */
+    smull           v28.4s, v6.4h, v1.4h[3] /* multiply by 29033 */
+    smull2          v30.4s, v6.8h, v1.4h[3] /* multiply by 29033 */
+.endm
+
+.macro do_yuv_to_rgb
+    do_yuv_to_rgb_stage1
+    do_yuv_to_rgb_stage2
+.endm
+
+/* Apple gas crashes on adrl, work around that by using adr.
+ * But this requires a copy of these constants for each function.
+ */
+
+.balign 16
+jsimd_ycc_\colorid\()_neon_consts:
+    .short          0,      0,     0,      0
+    .short          22971, -11277, -23401, 29033
+    .short          -128,  -128,   -128,   -128
+    .short          -128,  -128,   -128,   -128
+
+asm_function jsimd_ycc_\colorid\()_convert_neon
+    OUTPUT_WIDTH    .req x0
+    INPUT_BUF       .req x1
+    INPUT_ROW       .req x2
+    OUTPUT_BUF      .req x3
+    NUM_ROWS        .req x4
+
+    INPUT_BUF0      .req x5
+    INPUT_BUF1      .req x6
+    INPUT_BUF2      .req INPUT_BUF
+
+    RGB             .req x7
+    Y               .req x8
+    U               .req x9
+    V               .req x10
+    N               .req x15
+
+    /* Load constants to v1.4h and v2.8h (v0.4h is just used for padding) */
+    adr             x15, jsimd_ycc_\colorid\()_neon_consts
+    ld1             {v0.4h, v1.4h},[x15],16
+    ld1             {v2.8h}, [x15]
+
+    /* Save ARM registers and handle input arguments */
+    /*push            {x4, x5, x6, x7, x8, x9, x10, x30}*/
+    stp             x4, x5, [sp,-16]!
+    stp             x6, x7, [sp,-16]!
+    stp             x8, x9, [sp,-16]!
+    stp             x10, x30, [sp,-16]!
+    ldr             INPUT_BUF0, [INPUT_BUF]
+    ldr             INPUT_BUF1, [INPUT_BUF,8]
+    ldr             INPUT_BUF2, [INPUT_BUF,16]
+    .unreq          INPUT_BUF
+
+    /* Save NEON registers */
+    /*vpush           {v8.4h-v15.4h}*/
+    sub             sp, sp, #32
+    st1             {v8.4h-v11.4h}, [sp]
+    sub             sp, sp, #32
+    st1             {v12.4h-v15.4h}, [sp]
+
+    /* Initially set v10 and v13 (the filler byte of 32bpp formats) to 0xFF */
+    movi            v10.16b, #255
+    movi            v13.16b, #255
+
+    /* Outer loop over scanlines */
+    cmp             NUM_ROWS, #1
+    blt             9f
+0:
+    lsl             x16, INPUT_ROW,#3
+    ldr             Y, [INPUT_BUF0,x16]
+    ldr             U, [INPUT_BUF1,x16]
+    mov             N, OUTPUT_WIDTH
+    ldr             V, [INPUT_BUF2,x16]
+    add             INPUT_ROW, INPUT_ROW, #1
+    ldr             RGB, [OUTPUT_BUF], #8
+
+    /* Inner loop over pixels */
+    subs            N, N, #8
+    blt             3f
+    do_load         8
+    do_yuv_to_rgb_stage1
+    subs            N, N, #8
+    blt             2f
+1:
+    do_yuv_to_rgb_stage2_store_load_stage1
+    subs            N, N, #8
+    bge             1b
+2:
+    do_yuv_to_rgb_stage2
+    do_store        \bpp, 8
+    tst             N, #7
+    beq             8f
+3:
+    tst             N, #4
+    beq             3f
+    do_load         4
+3:
+    tst             N, #2
+    beq             4f
+    do_load         2
+4:
+    tst             N, #1
+    beq             5f
+    do_load         1
+5:
+    do_yuv_to_rgb
+    tst             N, #4
+    beq             6f
+    do_store        \bpp, 4
+6:
+    tst             N, #2
+    beq             7f
+    do_store        \bpp, 2
+7:
+    tst             N, #1
+    beq             8f
+    do_store        \bpp, 1
+8:
+    subs            NUM_ROWS, NUM_ROWS, #1
+    bgt             0b
+9:
+    /* Restore all registers and return */
+    /* vpop            {v8.4h-v15.4h}*/
+    ld1             {v12.4h-v15.4h}, [sp], #32
+    ld1             {v8.4h-v11.4h}, [sp], #32
+    /* pop             {r4, r5, r6, r7, r8, r9, r10, pc}*/
+    ldp             x10, x30, [sp],#16
+    ldp             x8, x9, [sp],#16
+    ldp             x6, x7, [sp],#16
+    ldp             x4, x5, [sp],#16
+    br              x30
+    .unreq          OUTPUT_WIDTH
+    .unreq          INPUT_ROW
+    .unreq          OUTPUT_BUF
+    .unreq          NUM_ROWS
+    .unreq          INPUT_BUF0
+    .unreq          INPUT_BUF1
+    .unreq          INPUT_BUF2
+    .unreq          RGB
+    .unreq          Y
+    .unreq          U
+    .unreq          V
+    .unreq          N
+.endfunc
+
+.purgem do_yuv_to_rgb
+.purgem do_yuv_to_rgb_stage1
+.purgem do_yuv_to_rgb_stage2
+.purgem do_yuv_to_rgb_stage2_store_load_stage1
+.endm
+
+/* RTSM simulator fix: integer saturation works on an 8b boundary there, so
+ * add a new parameter (defsize) as a workaround.
+ */
+#ifdef RTSM_SQSHRN_SIM_ISSUE
+/*--------------------------------- id ----- bpp R  rsize G  gsize B  bsize defsize */
+generate_jsimd_ycc_rgb_convert_neon extrgb,  24, 0, .4h,  1, .4h,  2, .4h,  .8b
+generate_jsimd_ycc_rgb_convert_neon extbgr,  24, 2, .4h,  1, .4h,  0, .4h,  .8b
+generate_jsimd_ycc_rgb_convert_neon extrgbx, 32, 0, .4h,  1, .4h,  2, .4h,  .8b
+generate_jsimd_ycc_rgb_convert_neon extbgrx, 32, 2, .4h,  1, .4h,  0, .4h,  .8b
+generate_jsimd_ycc_rgb_convert_neon extxbgr, 32, 3, .4h,  2, .4h,  1, .4h,  .8b
+generate_jsimd_ycc_rgb_convert_neon extxrgb, 32, 1, .4h,  2, .4h,  3, .4h,  .8b
+#else
+/*--------------------------------- id ----- bpp R  rsize G  gsize B  bsize */
+generate_jsimd_ycc_rgb_convert_neon extrgb,  24, 0, .4h,  1, .4h,  2, .4h
+generate_jsimd_ycc_rgb_convert_neon extbgr,  24, 2, .4h,  1, .4h,  0, .4h
+generate_jsimd_ycc_rgb_convert_neon extrgbx, 32, 0, .4h,  1, .4h,  2, .4h
+generate_jsimd_ycc_rgb_convert_neon extbgrx, 32, 2, .4h,  1, .4h,  0, .4h
+generate_jsimd_ycc_rgb_convert_neon extxbgr, 32, 3, .4h,  2, .4h,  1, .4h
+generate_jsimd_ycc_rgb_convert_neon extxrgb, 32, 1, .4h,  2, .4h,  3, .4h
+#endif
+
+.purgem do_load
+.purgem do_store
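
Reviewer's aid, not part of the patch: the per-pixel arithmetic that
do_yuv_to_rgb_stage1 and do_yuv_to_rgb_stage2 appear to implement can be
sketched in scalar C as below. This is a minimal sketch under stated
assumptions. The function names (round_shift, saturate_u8,
ycc_to_rgb_pixel, ycc_to_rgb_self_check) are hypothetical and do not
exist in the source tree. The constants and shift amounts are copied from
the constant table and the rshrn instructions in the patch; the clamp
models the intended 16-bit to 8-bit unsigned saturation of sqxtun.

```c
#include <assert.h>
#include <stdint.h>

/* Rounding right shift with floor semantics, modelling what rshrn does to
 * each signed lane. Written with division so that it does not rely on the
 * implementation-defined behaviour of >> on negative values. */
static int32_t round_shift(int32_t x, int n)
{
    int32_t d = (int32_t)1 << n;
    int32_t biased = x + d / 2;     /* add the rounding constant */
    int32_t q = biased / d;         /* C division truncates toward zero */
    if (biased % d < 0)
        q--;                        /* adjust to floor for negative values */
    return q;
}

/* Clamp to 0..255, modelling the unsigned saturating narrow (sqxtun). */
static uint8_t saturate_u8(int32_t x)
{
    if (x < 0)
        return 0;
    if (x > 255)
        return 255;
    return (uint8_t)x;
}

/* Hypothetical scalar reference for one pixel; output order is R, G, B. */
void ycc_to_rgb_pixel(uint8_t y, uint8_t cb, uint8_t cr, uint8_t rgb[3])
{
    int32_t u = (int32_t)cb - 128;  /* uaddw with the -128 constant */
    int32_t v = (int32_t)cr - 128;

    /* stage 1 (smull/smlal) followed by the stage 2 rounding shifts */
    int32_t g_off = round_shift(-11277 * u - 23401 * v, 15);
    int32_t r_off = round_shift(22971 * v, 14);
    int32_t b_off = round_shift(29033 * u, 14);

    /* stage 2: add luma, then saturate to 8 bits */
    rgb[0] = saturate_u8((int32_t)y + r_off);
    rgb[1] = saturate_u8((int32_t)y + g_off);
    rgb[2] = saturate_u8((int32_t)y + b_off);
}

/* Small self-check: neutral chroma must leave luma unchanged. */
void ycc_to_rgb_self_check(void)
{
    uint8_t px[3];
    ycc_to_rgb_pixel(128, 128, 128, px);
    assert(px[0] == 128 && px[1] == 128 && px[2] == 128);
}
```

A scalar reference like this could be used to compare the NEON output
byte for byte on a few rows, including widths that are not a multiple of
8 so that the do_load/do_store tail paths for 4, 2 and 1 pixels are
exercised.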