Implement the forward transform

Question

lu-zero opened this issue 8 years ago · 24 comments

lu-zero commented 8 years ago

Functions to implement for this task:

vpx_fdct4x4_vsx
vpx_fdct8x8_vsx
vpx_fdct16x16_vsx
vpx_fdct32x32_vsx

Answer 1 · 2017-10-06T07:17:36.000Z

Is this task in progress or could I take it?

Answer 2 · 2017-10-17T11:37:37.000Z

Hello @sasshka, I had some problems at the beginning, but I managed to get the first transform working. I'll publish it soon.

I also noticed that most of the PPC code for the other transforms does not compile with high bitdepth enabled, so I'll probably narrow the scope of this issue.

Answer 3 · 2017-10-17T12:05:03.000Z

High bitdepth is out of scope here, since it would make the problem much bigger (and it wasn't even stable when this set of tasks started ^^).

Answer 4 · 2017-10-17T12:56:47.000Z

Okay! List updated x)

Answer 5 · 2017-11-19T21:03:03.000Z

So far I've measured roughly a 35% improvement with the VSX version, vpx_fdct4x4_vsx. What would be considered the minimum acceptable speedup?

Note: Google Test filter = *Trans4x4DC*.DISABLED*
[==========] Running 2 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 1 test from C/Trans4x4DCT
[ RUN      ] C/Trans4x4DCT.DISABLED_Speed/0
Fdct4x4[          10 runs]: 6 us
Fdct4x4[       10000 runs]: 600 us
Fdct4x4[    10000000 runs]: 602526 us
[       OK ] C/Trans4x4DCT.DISABLED_Speed/0 (604 ms)
[----------] 1 test from C/Trans4x4DCT (604 ms total)

[----------] 1 test from VSX/Trans4x4DCT
[ RUN      ] VSX/Trans4x4DCT.DISABLED_Speed/0
Fdct4x4[          10 runs]: 2 us
Fdct4x4[       10000 runs]: 384 us
Fdct4x4[    10000000 runs]: 383780 us
[       OK ] VSX/Trans4x4DCT.DISABLED_Speed/0 (384 ms)
[----------] 1 test from VSX/Trans4x4DCT (384 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 2 test cases ran. (988 ms total)
[  PASSED  ] 2 tests.
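
For reference, the 10,000,000-run timings in the log above (602526 us for C, 383780 us for VSX) can be turned into a percentage with a small helper. This is only a sketch: the function names are made up for illustration and are not part of libvpx or its test suite.

```c
/* Hypothetical helpers for reading the benchmark log; not libvpx API. */

/* Percentage of run time saved relative to the baseline:
 * (baseline - optimized) / baseline * 100. */
double time_saved_percent(double baseline_us, double optimized_us) {
  return (baseline_us - optimized_us) / baseline_us * 100.0;
}

/* Speedup factor: how many times faster the optimized version runs. */
double speedup_factor(double baseline_us, double optimized_us) {
  return baseline_us / optimized_us;
}
```

Called with 602526 and 383780, the first helper gives a bit over 36% of run time saved and the second a factor of about 1.57, which is in line with the "about 35%" estimate.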

Answer 6 · 2017-11-19T21:29:41.000Z

4x4 has a tiny kernel, so I'd already be happy with this initial speedup. You can compare with the x86_64 and arm64 variants to see if it is in line with those.

8x8 and 16x16 should have a more substantial speedup though.

Answer 7 · 2018-02-01T19:12:15.000Z

Any progress on this?

Answer 8 · 2018-02-02T23:09:10.000Z

I've rebased on upstream, but some adjustments are needed. I'll create the PR against the WebM repository as soon as possible.

Answer 9 · 2018-05-10T15:14:12.000Z

Any news about this? :) (Please CC me and David in Gerrit.)

Answer 10 · 2018-05-10T20:42:26.000Z

Hello Luca,

After rebasing, some things broke (especially around the store instructions). Looking at the implementations for other architectures, they build the forward transform out of column operations that can be reused for 8x8, 16x16, ...

My implementation doesn't do this, so I may struggle to move on to bigger matrices (it will require some refactoring).
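
The column-oriented structure described above could look roughly like the scalar sketch below: one 1-D pass over all columns, a transpose, the same column pass again, and a final transpose. This is an illustration under stated assumptions, not the libvpx code and not the VSX patch: the function names are hypothetical, the constants are simply cos(pi/4), cos(pi/8) and cos(3*pi/8) in 14-bit fixed point, and the input pre-scaling and final rounding that a real codec transform applies are left out.

```c
#include <stdint.h>

/* Cosines scaled by 2^14 and rounded (assumed fixed-point precision). */
enum { COS_PI_4 = 11585, COS_PI_8 = 15137, COS_3PI_8 = 6270, FRAC_BITS = 14 };

/* Round-to-nearest shift that undoes the 2^14 scaling. Relies on an
 * arithmetic right shift for negative values, as common compilers provide. */
static int32_t round_shift(int64_t x) {
  return (int32_t)((x + (1 << (FRAC_BITS - 1))) >> FRAC_BITS);
}

/* 1-D 4-point DCT butterfly applied in place to every column of a
 * row-major 4x4 block. */
void fdct4_columns(int32_t b[16]) {
  for (int c = 0; c < 4; ++c) {
    const int64_t s0 = (int64_t)b[c] + b[12 + c];     /* even part */
    const int64_t s1 = (int64_t)b[4 + c] + b[8 + c];
    const int64_t s2 = (int64_t)b[4 + c] - b[8 + c];  /* odd part */
    const int64_t s3 = (int64_t)b[c] - b[12 + c];
    b[c]      = round_shift((s0 + s1) * COS_PI_4);
    b[8 + c]  = round_shift((s0 - s1) * COS_PI_4);
    b[4 + c]  = round_shift(s2 * COS_3PI_8 + s3 * COS_PI_8);
    b[12 + c] = round_shift(s3 * COS_3PI_8 - s2 * COS_PI_8);
  }
}

/* In-place transpose of a row-major 4x4 block. */
void transpose4x4(int32_t b[16]) {
  for (int r = 0; r < 4; ++r) {
    for (int c = r + 1; c < 4; ++c) {
      const int32_t t = b[r * 4 + c];
      b[r * 4 + c] = b[c * 4 + r];
      b[c * 4 + r] = t;
    }
  }
}

/* 2-D transform: the same column routine serves both passes. */
void fdct4x4_sketch(int32_t b[16]) {
  fdct4_columns(b);
  transpose4x4(b);
  fdct4_columns(b); /* acts on the rows of the original block */
  transpose4x4(b);
}
```

The point of this shape is that only the column routine and the transpose depend on the block size, so a larger transform would swap in a larger 1-D butterfly while keeping the same two-pass driver. In a SIMD version each loop iteration over `c` would typically become one vector lane.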

For now I don't have time to complete this, maybe at the end of the month. If you want to complete it, please go ahead.

Sorry for the delay!

[]s

Answer 11 · 2018-06-04T14:04:41.000Z

Any news on that?

Answer 12 · 2018-06-05T03:26:33.000Z

I went back to work on this task, but I have not finished yet.

I'm redoing the algorithm in a simpler way after writing down the steps in Octave. I think I'll have something to deliver by the end of the week.

Answer 13 · 2018-06-05T11:12:37.000Z

That's great :) Please CC me and David when you push to gerrit, looking forward to it :)

Answer 14 · 2018-09-02T14:08:05.000Z

@rafaeldelucena usual ping.

Answer 15 · 2018-09-03T04:15:52.000Z

I have implemented the fdct4x4, but I'm still not satisfied with the performance gain.

I'll make some adjustments and create a pull request upstream.

Answer 16 · 2018-12-10T13:52:27.000Z

Hey, any news?

Answer 17 · 2018-12-12T13:19:17.000Z

Hi!

I created a PR to upstream for fdct4x4, https://chromium-review.googlesource.com/c/webm/libvpx/+/1360172

Answer 18 · 2019-02-14T10:40:37.000Z

Hey rafaeldelucena,
Are you still working on the issue? I'd take it if you don't have time to finish it.

Answer 19 · 2019-02-15T14:54:37.000Z

Hello @sasshka, you can take it!

My last PR is at https://chromium-review.googlesource.com/c/webm/libvpx/+/1404181

Answer 20 · 2019-05-15T19:47:52.000Z

vpx_fdct32x32_vsx was implemented by Luc Trudeau (@luctrudeau, luc@trud.ca) in dc93b62.

What is the status on this? Is there still a bounty for this, and what about high bit-depth?

Answer 21 · 2019-05-15T19:49:06.000Z

@sasshka is working on that, and I want to ask for additional bounties for high bit-depth.

Answer 22 · 2019-05-16T11:37:10.000Z

I want to ask for additional bounties for high bit-depth.

Please do. There are not many bounties available right now. Most have been fixed by internal people and maintainers who never claimed the bounties. That leaves the OpenBLAS ones that @quickwritereader, the man from Azerbaijan, is working on, the NumPy one, and the last libmvec one (pow, powf), which I have made progress on.

I also accelerated WireGuard with VSX on my own (using code from OpenSSL), including allowing SIMD during interrupts.

https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=107892
https://lists.zx2c4.com/pipermail/wireguard/2019-May/004149.html

Answer 23 · 2019-05-16T13:24:30.000Z

Hello Shawn,

It's good that you're working on various VSX things and making progress. But why do you feel Luca should work on getting bounties for you? Luca is a very nice but also very busy person, and I don't understand how some people got the impression that he should manage all the VSX work. Luca is not paid to do that. If you'd like to have high bit-depth bounties or any other kind of bounties available, please ask for them yourself.

Best regards,
Sasha

Answer 24 · 2019-05-16T22:04:57.000Z

Sorry for the misunderstanding. I will stick to BountySource's interface for that and keep this thread to technical discussion.