Optimized C results in faster hashrate.

Question

Optimized C results in faster hashrate.

ASICseer opened this issue 8 years ago · 22 comments

I have been looking at the C and there's a lot of stuff that can be optimized. A preliminary overview resulted in an increase of 5 h/s per gpu.

Please review my changes and try to integrate them into the .c / .cl files of your miner. Please note, I have only a preliminary knowledge of this, so don't merge my changes. Instead, use it as an example for your own work.

I have been following this guide:

http://www.it.uom.gr/teaching/c_optimization/tutorial.html

Here is what I have done (where I could):

Changed loops to count down to 0 instead of up from 0.
Unrolled loops.
Break out of loops early instead of late.
Optimized if/else/if statements to return known good results early instead of late.
Simplified math operations where possible.

This was a fun exercise for me, but I unfortunately don't have the time or knowledge for future improvements. Please take a look at my branches for ideas, and don't make fun of me too much :)

https://github.com/gpushack/silentarmy/commits/master
https://github.com/gpushack/silentarmy/commits/gpushack

Answer 1 · 2016-11-11T21:50:57.000Z

Loop unrolling should be done by the compiler and pragmas, not by the coder.
https://www.khronos.org/registry/cl/extensions/nv/cl_nv_pragma_unroll.txt

Answer 2 · 2016-11-12T02:09:21.000Z

@gpushack Cool. Have you actually run it and measured any improvement in performance?

Answer 3 · 2016-11-12T04:02:28.000Z

@mbevand Yes, my team and I tested my changes, there was a 5h/s increase compared to a version we compiled approximately 24 hours ago. I have tested some of my changes with the latest commits from yourself and padrino, and they stack, resulting in further hashrate increases.

I must stress that you should only use my commits as guidelines for increasing code efficiency before it reaches the compiler.

Answer 4 · 2016-11-12T20:34:14.000Z

Oh ok I re-read your post and somehow missed the increase you mentioned in the first sentence :-P Right now I am working on much higher perf increases (+50% or more) but eventually I'll have time to look at your changes. It would be helpful if someone could pinpoint specifically the 1 or 2 most effective changes, because you have a lot to sift through...

Answer 5 · 2016-11-12T21:30:49.000Z

Effective changes below

throw away bad result earlier:

https://github.com/gpushack/silentarmy/commit/8c0cceb95b87e8e13bdd076786a193a75f7ce891

The logic for checking that "nr_inputs - 1;" is simpler than the further down if/else/if logic, so you can break out earlier.

short pairs faster:

https://github.com/gpushack/silentarmy/commit/5ff9a3b0cad6f1a8e9e67b1158be2e1784ef80d9

Don't sort if you don't need to sort.

faster math:

https://github.com/gpushack/silentarmy/commit/f4cba690e337a010a1ac2f01915b9be6cd857144
https://github.com/gpushack/silentarmy/commit/cfdfa10c2379aa1455fa1cdbe5e35487b7c99ab5

floating point multiplication is faster than division.

Everything else is your basic loop unrolling and counting down loops instead of up.

Answer 6 · 2016-11-12T23:04:49.000Z

I don't observe any improvements:
At 1000 nonces:

master: Total 1825 solutions in 25115.3 ms (72.7 Sol/s)
with 8c0cceb95b87e8e13bdd076786a193a75f7ce891 Total 1825 solutions in 25137.8 ms (72.6 Sol/s) - no change
with 5ff9a3b0cad6f1a8e9e67b1158be2e1784ef80d9 : Total 1824 solutions in 25098.1 ms (72.7 Sol/s) - no change

The multiplication optimization will 100% be performed by compiler.

Answer 7 · 2016-11-13T00:23:57.000Z

@Kubuxu

Please try this one: https://github.com/gpushack/silentarmy/commit/75445f32d4e46eb70790ae3d9e657fd18af7718c

Answer 8 · 2016-11-13T00:39:28.000Z

The test is failing but I am running perf tests either way.

Answer 9 · 2016-11-13T00:43:54.000Z

RX480, with the patch: Total 1979 solutions in 26674.4 ms (74.2 Sol/s)
Baseline: Total 1825 solutions in 25084.7 ms (72.8 Sol/s)

The Wolf0 patch doesn't change anything for me, more testing will continue.
EDIT1: -O3 doesn't change anything (as expected)

Answer 10 · 2016-11-13T00:49:54.000Z

Sorry, this was my bad, made a mistake when I was manually applying the patch.

~~The shift of if (!i) { break; } removes some solutions form the set, it is one that breaks tests.~~

Answer 11 · 2016-11-13T00:58:16.000Z

The shift of if (!i) { break; } increases hash rate by 0.2, the swap of else conditions does nothing.

Answer 12 · 2016-11-13T01:01:51.000Z

I just switched pools and noticed a lot of duplicate and invalid shares. Please make sure you're on the right branch: "gpushack" in https://github.com/gpushack/silentarmy/tree/gpushack

Maybe you can help me find it the issue.

Answer 13 · 2016-11-13T01:03:11.000Z

I am applying your optimizations one by one trying to find those that work. I recomment you doing the same when you are creating them. ./sa-solver --use 0 --nonces 1000 is a good baseline for comparison.

Answer 14 · 2016-11-13T01:04:43.000Z

The wolf0 patch seem to decrease hash rate to 68 for me.

Answer 15 · 2016-11-13T01:08:00.000Z

Understood. This one resulted in a big increase:

uint32_t verify_sol(sols_t *sols, unsigned sol_i)
{
uint32_t *inputs = sols->values[sol_i];
uint32_t seen_len = (1 << (PREFIX + 1)) * 0.125;
uint8_t seen[seen_len];
uint32_t i;
uint8_t tmp;
// look for duplicate inputs
memset(seen, 0, seen_len);

// the valid flag is already set by the GPU, but set it again because
// I plan to change the GPU code to not set it
sols->valid[sol_i] = 1;
// sort the pairs in place
for (uint32_t level = PARAM_K - 1; level--;)
    for (i = 0; i < (1 << PARAM_K); i += (2 << level))
        sort_pair(&inputs[i], 1 << level);
return 1;


for (i = (1 << PARAM_K) -1; i--;)
  {
if (inputs[i]  >= (1 << 21))
  {
    warn("Invalid input retrieved from device: %d\n", inputs[i]);
    sols->valid[sol_i] = 0;
    return 0;
  }
tmp = seen[inputs[i] / 8];
seen[inputs[i] / 8] |= 1 << (inputs[i] & 7);
if (tmp == seen[inputs[i] / 8])
  {
    // at least one input value is a duplicate
    sols->valid[sol_i] = 0;
    return 0;
  }
  }

}

Answer 16 · 2016-11-13T01:09:22.000Z

Can you post it as a patch or diff or commit?

Answer 17 · 2016-11-13T01:13:18.000Z

By removing the wolf0 I am upto 79Sol/s but tests still fail I will be stripping it more.

Answer 18 · 2016-11-13T01:22:55.000Z

I should have read it before hand, you bypassing the check code and always returning solution as valid with early return 1, which isn't true.

Answer 19 · 2016-11-13T01:25:56.000Z

and your sort is wrong.

Answer 20 · 2016-11-13T01:28:53.000Z

@mbevand only thing that works from this optimizations is d30e022 feel free to cherry pick or manually commit, credit to @gpushack

Answer 21 · 2016-11-13T03:02:42.000Z

@Kubuxu Thanks, I have reverted the early return and will work on it some more.

Answer 22 · 2016-11-15T17:39:56.000Z

I was unable to measure perf improvements by d30e022
Also it doesn't make sense why it would improve perf. It will cause (a very small) number of invalid solutions to be reported as valid, so it's going to cause the GPU (and CPU) to perform a little more useless work. So I won't merge this change.

As to the other changes, I'll defer on Kubuxu on this, and not take any of them either, and close this issue. (I have also briefly looked and not seen any worthwhile changes.)