Commit Graph

237 Commits

Author SHA1 Message Date
Martin Storsjö 6a62795d40 aarch64: h264idct: Use the offset parameter to movrel
Signed-off-by: Martin Storsjö <martin@martin.st>
2016-11-10 11:18:22 +02:00
Martin Storsjö 383d96aa22 aarch64: vp9: Add NEON optimizations of VP9 MC functions
This work is sponsored by, and copyright, Google.

These are ported from the ARM version; it is essentially a 1:1
port with no extra added features, but with some hand tuning
(especially for the plain copy/avg functions). The ARM version
isn't very register starved to begin with, so there's not much
to be gained from having more spare registers here - we only
avoid having to clobber callee-saved registers.

Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                     ARM   AArch64
vp9_avg4_neon:                      27.2      23.7
vp9_avg8_neon:                      56.5      54.7
vp9_avg16_neon:                    169.9     167.4
vp9_avg32_neon:                    585.8     585.2
vp9_avg64_neon:                   2460.3    2294.7
vp9_avg_8tap_smooth_4h_neon:       132.7     125.2
vp9_avg_8tap_smooth_4hv_neon:      478.8     442.0
vp9_avg_8tap_smooth_4v_neon:       126.0      93.7
vp9_avg_8tap_smooth_8h_neon:       241.7     234.2
vp9_avg_8tap_smooth_8hv_neon:      690.9     646.5
vp9_avg_8tap_smooth_8v_neon:       245.0     205.5
vp9_avg_8tap_smooth_64h_neon:    11273.2   11280.1
vp9_avg_8tap_smooth_64hv_neon:   22980.6   22184.1
vp9_avg_8tap_smooth_64v_neon:    11549.7   10781.1
vp9_put4_neon:                      18.0      17.2
vp9_put8_neon:                      40.2      37.7
vp9_put16_neon:                     97.4      99.5
vp9_put32_neon/armv8:              346.0     307.4
vp9_put64_neon/armv8:             1319.0    1107.5
vp9_put_8tap_smooth_4h_neon:       126.7     118.2
vp9_put_8tap_smooth_4hv_neon:      465.7     434.0
vp9_put_8tap_smooth_4v_neon:       113.0      86.5
vp9_put_8tap_smooth_8h_neon:       229.7     221.6
vp9_put_8tap_smooth_8hv_neon:      658.9     621.3
vp9_put_8tap_smooth_8v_neon:       215.0     187.5
vp9_put_8tap_smooth_64h_neon:    10636.7   10627.8
vp9_put_8tap_smooth_64hv_neon:   21076.8   21026.9
vp9_put_8tap_smooth_64v_neon:     9635.0    9632.4

These are generally about as fast as the corresponding ARM
routines on the same CPU (at least on the A53), in most cases
marginally faster.

The speedup vs C code is pretty much the same as for the 32 bit
case; on the A53 it's around 6-13x for the larger 8tap filters.
The exact speedup varies a little, since the C versions generally
don't end up exactly as slow/fast as on 32 bit.

Signed-off-by: Martin Storsjö <martin@martin.st>
2016-11-10 11:15:56 +02:00
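
For readers unfamiliar with the "plain copy/avg functions" benchmarked above, the following is a minimal scalar C sketch of what such routines compute. It is not the FFmpeg implementation: the names `put_block` and `avg_block` and their parameter lists are hypothetical, and the rounding average `(a + b + 1) >> 1` is an assumption about the avg semantics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar reference: copy a w x h block of 8-bit pixels. */
static void put_block(uint8_t *dst, ptrdiff_t dst_stride,
                      const uint8_t *src, ptrdiff_t src_stride,
                      int w, int h)
{
    assert(w > 0 && h > 0);
    for (int y = 0; y < h; y++) {
        memcpy(dst, src, (size_t)w);
        dst += dst_stride;
        src += src_stride;
    }
}

/* Hypothetical scalar reference: average src into dst, rounding up on ties
 * (assumed semantics, see the note above the block). */
static void avg_block(uint8_t *dst, ptrdiff_t dst_stride,
                      const uint8_t *src, ptrdiff_t src_stride,
                      int w, int h)
{
    assert(w > 0 && h > 0);
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++)
            dst[x] = (uint8_t)((dst[x] + src[x] + 1) >> 1);
        dst += dst_stride;
        src += src_stride;
    }
}
```

The numeric suffix in names such as vp9_avg4_neon or vp9_put64_neon presumably corresponds to the block width `w` in this sketch.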
Diego Biurrun 72a19f4013 mpegaudiodsp: aarch64: Adjust function prototype after 2caa93b813 2016-11-10 00:13:48 +01:00
Martin Storsjö 9b2ccafb48 aarch64: Add missing sign extension in ff_h264_idct8_add_neon
Signed-off-by: Martin Storsjö <martin@martin.st>
2016-10-10 14:57:53 +03:00
James Almer 42111e8543 avcodec: fix arguments on xmm/neon clobber test wrappers
Signed-off-by: James Almer <jamrial@gmail.com>
2016-10-02 02:15:47 -03:00
James Almer 449f263f9f avcodec: add missing xmm/neon clobber test wrappers for the new encode API
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
2016-10-01 14:08:50 -03:00
Diego Biurrun 2caa93b813 mpegaudiodsp: Change type of array stride parameters to ptrdiff_t
This avoids SIMD-optimized functions having to sign-extend their
stride argument manually to be able to do pointer arithmetic.
2016-09-29 17:54:24 +02:00
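
A minimal C sketch of the problem this change avoids, under stated assumptions: on a 64-bit target, hand-written assembly that receives a 32-bit `int` stride must sign-extend it before adding it to a pointer, because the upper half of the register is not guaranteed to hold the sign. The helper names below are hypothetical, and the "upper bits are zero" case is just one illustrative possibility.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models asm that uses the full 64-bit register holding an int stride
 * WITHOUT sign-extending it (upper 32 bits assumed to be zero here). */
static int64_t offset_without_sext(int32_t stride, int rows)
{
    assert(rows >= 0);
    uint64_t raw = (uint32_t)stride;          /* no sign extension */
    return (int64_t)(raw * (uint64_t)rows);
}

/* With a ptrdiff_t parameter the caller already passes a full-width,
 * correctly signed value, so no manual extension is needed. */
static int64_t offset_ptrdiff(ptrdiff_t stride, int rows)
{
    assert(rows >= 0);
    return (int64_t)stride * rows;
}
```

For a negative stride such as -16 over two rows, `offset_ptrdiff` gives -32, while `offset_without_sext` gives a large positive offset, which is the kind of bug a missing sign extension produces.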
Diego Biurrun e4a94d8b36 h264chroma: Change type of stride parameters to ptrdiff_t
This avoids SIMD-optimized functions having to sign-extend their
stride argument manually to be able to do pointer arithmetic.
2016-09-29 14:48:04 +02:00
Anton Khirnov de2ae3c1fa lavc: add clobber tests for the new encoding/decoding API 2016-09-28 10:01:52 +02:00
Xiaolei Yu 5a70e56f2f avcodec: fix vc1dsp dependencies 2016-09-25 13:11:45 +02:00
James Almer 293484fa5e avcodec: add missing xmm/neon clobber test wrappers for the new decode API
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
2016-07-03 18:04:30 -03:00
Clément Bœsch 4a081f224e libavcodec: fix constness in clobber test avcodec_open2() wrappers
Signed-off-by: Martin Storsjö <martin@martin.st>
2016-06-26 21:34:04 +03:00
Clément Bœsch dfd0c0f981 lavc/neontest: fix constness in arm/aarch64 avcodec_open2() wrappers 2016-06-25 13:41:13 +02:00
Clément Bœsch 8ef57a0d61 Merge commit '41ed7ab45fc693f7d7fc35664c0233f4c32d69bb'
* commit '41ed7ab45fc693f7d7fc35664c0233f4c32d69bb':
  cosmetics: Fix spelling mistakes

Merged-by: Clément Bœsch <u@pkh.me>
2016-06-21 21:55:34 +02:00
James Almer c8c14d0ffc aarch64/synth_filter: fix compilation
Signed-off-by: James Almer <jamrial@gmail.com>
2016-05-10 23:33:12 -03:00
Derek Buitenhuis ca5ec2bf51 Merge commit '01621202aad7e27b2a05c71d9ad7a19dfcbe17ec'
* commit '01621202aad7e27b2a05c71d9ad7a19dfcbe17ec':
  build: miscellaneous cosmetics

Merged-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
2016-05-09 16:25:28 +01:00
Vittorio Giovara 41ed7ab45f cosmetics: Fix spelling mistakes
Signed-off-by: Diego Biurrun <diego@biurrun.de>
2016-05-04 18:16:21 +02:00
Derek Buitenhuis 87b8e95008 Merge commit 'cdb1665f70def544ddab3e3ed3763ef99c8b3873'
* commit 'cdb1665f70def544ddab3e3ed3763ef99c8b3873':
  aarch64: Make transpose_4x4H do a regular transpose

Merged-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
2016-04-24 12:51:42 +01:00
Derek Buitenhuis 197fa698c6 Merge commit '97aec6e75ef36ed0402653519daa8e1fc8ddb555'
* commit '97aec6e75ef36ed0402653519daa8e1fc8ddb555':
  fft: arm: Drop unnecessary #include, add missing ones

Merged-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
2016-04-12 15:43:09 +01:00
Diego Biurrun 01621202aa build: miscellaneous cosmetics
Restore alphabetical order in lists, break overly long lines, do some
prettyprinting, add some explanatory section comments, group parts
together that belong together logically.
2016-04-07 15:26:08 +02:00
Martin Storsjö cdb1665f70 aarch64: Make transpose_4x4H do a regular transpose
Previously, ff_h264_idct_add_neon (originally in the arm version) used
a non-regular transpose in order to be able to use more instructions
that deal with registers as 128 bit register pairs. The aarch64
translation doesn't do it to the same extent, but brought along the
same structure since it was a straight translation.

This reshuffles ff_h264_idct_add_neon, bringing it closer to
the C implementation, making the transpose_4x4H macro do a regular
transpose, usable for other algorithms as well.

Previously, the third and fourth output from transpose_4x4H were
swapped, and prior to cc29d96d5a, the same inputs as well. In
addition to just swapping the outputs, also renumber the intermediate
registers for better readability (making the register order match
transpose_4x8B).

This runs with the same number of cycles as before.

Signed-off-by: Martin Storsjö <martin@martin.st>
2016-03-26 21:25:56 +02:00
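
To illustrate the difference described in this message, here is a small C sketch (not the NEON macro itself) of a regular 4x4 transpose of 16-bit elements next to a variant whose third and fourth output rows are swapped. The function names are hypothetical, and the matrices are stored row-major in flat arrays of 16 elements.

```c
#include <assert.h>
#include <stdint.h>

/* Regular transpose: out[c][r] = in[r][c]. Out-of-place only. */
static void transpose_4x4_regular(int16_t *out, const int16_t *in)
{
    assert(out != in);
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            out[4 * c + r] = in[4 * r + c];
}

/* Non-regular variant as described in the message: same as above,
 * but with the third and fourth output rows exchanged. */
static void transpose_4x4_swapped(int16_t *out, const int16_t *in)
{
    static const int order[4] = {0, 1, 3, 2};
    int16_t tmp[16];

    assert(out != in);
    transpose_4x4_regular(tmp, in);
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            out[4 * r + c] = tmp[4 * order[r] + c];
}
```

A caller written against the swapped variant has to compensate for the row order, which is why a regular transpose is easier to reuse in other algorithms.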
Diego Biurrun 1a094af638 fft: Split MDCT bits off from FFT 2016-03-01 10:18:28 +01:00
Diego Biurrun 97aec6e75e fft: arm: Drop unnecessary #include, add missing ones 2016-02-26 14:34:58 +01:00
foo86 ae5b2c5250 avcodec/dca: add new decoder based on libdcadec 2016-01-31 17:09:38 +01:00
foo86 4608996772 avcodec/dca: remove old decoder
Remove all files and functions which are not going to be reused,
and disable all functions and FATE tests temporarily which will be.
2016-01-31 17:09:38 +01:00
James Almer 209f50e16b avcodec/synth_filter: split off remaining code from dcadec files
Signed-off-by: James Almer <jamrial@gmail.com>
2016-01-25 14:57:38 -03:00
Hendrik Leppkes d03da3e240 Merge commit '2008f76054906e9ff6bf744800af0e5a5bfe61be'
* commit '2008f76054906e9ff6bf744800af0e5a5bfe61be':
  dca: remove unused decode_hf function and quant_d tables

Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
2016-01-02 13:17:48 +01:00
Hendrik Leppkes e97e2588ca Merge commit 'a0fc780a2093784e8664f88205ee1b215e109cee'
* commit 'a0fc780a2093784e8664f88205ee1b215e109cee':
  arm64: int32_to_float_fmul neon asm

Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
2016-01-02 11:21:16 +01:00
Hendrik Leppkes 10e075c138 Merge commit '705f5e5e155f6f280a360af220fc5b30cfcee702'
* commit '705f5e5e155f6f280a360af220fc5b30cfcee702':
  arm64: port synth_filter_float_neon from arm

Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
2016-01-02 11:14:28 +01:00
Hendrik Leppkes de3a33784c Merge commit 'c33c1fa8af2b2e82418a06901b6ad17b3d61b73e'
* commit 'c33c1fa8af2b2e82418a06901b6ad17b3d61b73e':
  arm64: convert dcadsp neon asm from arm

Merged-by: Hendrik Leppkes <h.leppkes@gmail.com>
2016-01-02 11:10:24 +01:00
Alexandra Hájková 2008f76054 dca: remove unused decode_hf function and quant_d tables
They were superseded by their integer equivalents. Rename integer
decode_hf to decode_hf.
2015-12-24 13:58:18 +01:00
Janne Grunau cc29d96d5a arm64: fix inverted register order in transpose_4x4H
Fix related register order issue in ff_h264_idct_add_neon.

Found-by: zjh8890 <243186085@qq.com>
2015-12-21 13:44:20 +01:00
Janne Grunau 2dba0407fd avcodec/arm64: fix inverted register order in transpose_4x4H
Fix related register order issue in ff_h264_idct_add_neon.

Found-by: zjh8890 <243186085@qq.com>

Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2015-12-19 03:58:46 +01:00
Michael Niedermayer 95b59bfb9d Revert "avcodec/aarch64/neon.S: Update neon.s for transpose_4x4H"
The change was not correct and broke H264

This reverts commit cd83f899c94f691b045697d12efa21f83eb2329f.
2015-12-17 21:26:37 +01:00
Janne Grunau a0fc780a20 arm64: int32_to_float_fmul neon asm
3% faster dts decoding on a cortex-a57.

                                 cortex-a57   cortex-a53
int32_to_float_fmul_array8_c:    1270.9       4475.6
int32_to_float_fmul_array8_neon:  328.6        569.2
int32_to_float_fmul_scalar_c:     928.5       4119.6
int32_to_float_fmul_scalar_neon:  309.1        524.1
2015-12-14 16:45:02 +01:00
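
As a rough guide to what the benchmarked functions do, here is a scalar C sketch. The semantics are assumed from the function names (convert 32-bit integers to float and multiply by a scale factor, with the array8 variant using one factor per group of eight samples); the `_ref` names and signatures are hypothetical and not taken from the commit.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed semantics: dst[i] = src[i] * mul for len samples. */
static void int32_to_float_fmul_scalar_ref(float *dst, const int32_t *src,
                                           float mul, int len)
{
    for (int i = 0; i < len; i++)
        dst[i] = (float)src[i] * mul;
}

/* Assumed semantics: one multiplier per block of 8 samples. */
static void int32_to_float_fmul_array8_ref(float *dst, const int32_t *src,
                                           const float *mul, int len)
{
    assert(len % 8 == 0);
    for (int i = 0; i < len; i += 8)
        int32_to_float_fmul_scalar_ref(dst + i, src + i, *mul++, 8);
}
```

From the table above, the NEON version of the array8 function runs in roughly a quarter of the cycles of the C version on the Cortex-A57 (1270.9 vs 328.6).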
Janne Grunau 705f5e5e15 arm64: port synth_filter_float_neon from arm
~25% faster dts decoding overall. The checkasm CPU cycles numbers are
not that useful since synth_filter_float() calls FFTContext.imdct_half().

                         cortex-a57   cortex-a53
synth_filter_float_c:    1866.2       3490.9
synth_filter_float_neon:  915.0       1531.5

With fftc.imdct_half forced to imdct_half_neon:
                         cortex-a57   cortex-a53
synth_filter_float_c:    1718.4       3025.3
synth_filter_float_neon:  926.2       1530.1
2015-12-14 16:45:01 +01:00
Janne Grunau c33c1fa8af arm64: convert dcadsp neon asm from arm
~2% faster dts decoding overall.

                    cortex-a57   cortex-a53
dca_decode_hf_c:    474.8        1659.9
dca_decode_hf_neon: 225.2         301.1
dca_lfe_fir0_c:     913.2        1537.7
dca_lfe_fir0_neon:  286.8         451.9
dca_lfe_fir1_c:     848.7        1711.5
dca_lfe_fir1_neon:  387.1         506.4
2015-12-14 16:45:01 +01:00
zjh8890 c18176bd55 avcodec/aarch64/neon.S: Update neon.s for transpose_4x4H
The transpose_4x4H macro is wrong: the order of r2 and r3 is swapped.
Tracking this bug down cost me a lot of time while writing aarch64 assembly that uses the macro.
2015-12-12 14:20:01 +01:00
Michael Niedermayer 5d5f8b29b4 Merge commit 'f56d8d8dd72b1ab52aa814c5a0fccabf8040ef68'
* commit 'f56d8d8dd72b1ab52aa814c5a0fccabf8040ef68':
  h264: aarch64: intra prediction optimisations

Conflicts:
	libavcodec/h264pred.c

Merged-by: Michael Niedermayer <michael@niedermayer.cc>
2015-07-21 01:39:30 +02:00
Janne Grunau f56d8d8dd7 h264: aarch64: intra prediction optimisations 2015-07-20 23:10:29 +02:00
Janne Grunau c2de2cf0d2 arm64: constify src in h264qpel dsp function definitions 2015-06-24 08:41:32 +02:00
Michael Niedermayer 7b32b35bf5 Merge commit '3d5d46233cd81f78138a6d7418d480af04d3f6c8'
* commit '3d5d46233cd81f78138a6d7418d480af04d3f6c8':
  opus: Factor out imdct15 into a standalone component

Conflicts:
	configure
	libavcodec/opus_celt.c

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2015-02-02 20:43:13 +01:00
Diego Biurrun 3d5d46233c opus: Factor out imdct15 into a standalone component
It will be reused by the AAC decoder.
2015-02-02 16:07:33 +01:00
Carl Eugen Hoyos 4faea46bd9 lavc/aarch64: Do not use the neon horizontal chroma loop filter for H.264 4:2:2. 2015-01-31 10:05:10 +01:00
Michael Niedermayer 92d47e2aa3 Merge commit '780cd20b00a69e26bbfffbb8eec16fbe999ea793'
* commit '780cd20b00a69e26bbfffbb8eec16fbe999ea793':
  aarch64: Use .data.rel.ro for const data with relocations

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-12-09 12:08:29 +01:00
Martin Storsjö 780cd20b00 aarch64: Use .data.rel.ro for const data with relocations
This reverts commit c00365b46d
in addition to using a different section.

Signed-off-by: Martin Storsjö <martin@martin.st>
2014-12-09 11:43:31 +02:00
Michael Niedermayer f3cba01cce Merge commit 'c00365b46d464ce47716315c1801818d811bdb9a'
* commit 'c00365b46d464ce47716315c1801818d811bdb9a':
  aarch64: Make the function pointer tables position independent

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-11-16 01:05:31 +01:00
Martin Storsjö c00365b46d aarch64: Make the function pointer tables position independent
This allows running the code on android, where 64 bit binaries with
text relocations aren't allowed to be loaded.

Signed-off-by: Martin Storsjö <martin@martin.st>
2014-11-16 01:07:24 +02:00
Michael Niedermayer e16b7338d8 avcodec/aarch64/h264qpel_init_aarch64: mark src as const
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-08-30 12:48:31 +02:00
Michael Niedermayer 7fd60d1e7a Merge commit 'ac6b95dbc0b53b3ea461bd5e5e7f7f31d2983733'
* commit 'ac6b95dbc0b53b3ea461bd5e5e7f7f31d2983733':
  aarch64: add ',' between assembler macro arguments where missing

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2014-08-04 04:06:13 +02:00