Work overlap in Cuda miner

Question

krypdkat opened this issue 6 years ago · 5 comments

My modified miner often sends duplicate nonces, so I wrote a Python script to mimic what the GPU does in this kernel code: https://github.com/mochimodev/mochimo/blob/master/src/trigg/cuda_trigg.cu#L258

#copy from trigg.cu
Z_PREP  = [12,13,14,15,16,17,12,13] 
Z_ING  = [18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,23,24,31,32,33,34] 
Z_INF  = [44,45,46,47,48,50,51,52,53,54,55,56,57,58,59,60] 
Z_ADJ =[61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,94,95,96,97,98,99,100,101,102,103,104,105,107,108,109,110,112,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128] 
Z_AMB  = [77,94,95,96,126,214,217,218,220,222,223,224,225,226,227,228] 
Z_TIMED = [84,243,249,250,251,252,253,255] 
Z_NS = [129,130,131,132,133,134,135,136,137,138,145,149,154,155,156,157,177,178,179,180,182,183,184,185,186,187,188,189,190,191,192,193,194,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,241,244,245,246,247,248,249,250,251,252,253,254,255] 
Z_NPL = [139,140,141,142,143,144,146,147,148,150,151,153,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,181] 
Z_MASS = [214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,242,214,215,216,219] 
Z_INGINF = [18,19,20,21,22,25,26,27,28,29,30,36,37,38,39,40,41,42,44,46,47,48,49,51,52,53,54,55,56,57,58,59] 
Z_TIME = [82,83,84,85,86,87,88,243,249,250,251,252,253,254,255,253] 
Z_INGADJ  = [18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,23,24,31,32,33,34,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92]

overlap = 0
seen = set()  # seeds already produced by an earlier thread
for thread in range(131071):
	# build the 1st-frame seed from the same bit fields the kernel uses
	seed = (
		Z_PREP[thread & 7],
		Z_TIMED[(thread >> 3) & 7],
		1,
		5,
		Z_NS[(thread >> 6) & 63],
		1,
		Z_ING[(thread >> 12) & 31],
	)

	if seed in seen:
		overlap += 1  # another thread already works on this seed
	else:
		seen.add(seed)

print(overlap)

Explanation of the code above: for each thread I generate the seed for the 1st frame, record it, and count every seed that was already produced by an earlier thread (an overlapped job).

The result is pretty high: 51199 (39%), which means we are currently wasting around 39% of GPU power on the 1st frame. I haven't tested the other frames yet, but I think they have this issue as well.
For example, overlapped work occurs at thread ID pairs (993, 999), (968, 974), ...

print(Z_PREP[993 & 7] == Z_PREP[999 & 7])  # True
print(Z_TIMED[(993 >> 3) & 7] == Z_TIMED[(999 >> 3) & 7])  # True
print(Z_NS[(993 >> 6) & 63] == Z_NS[(999 >> 6) & 63])  # True
print(Z_ING[(993 >> 12) & 31] == Z_ING[(999 >> 12) & 31])  # True
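
The pair check generalizes: grouping thread IDs by the seed they produce lists every set of colliding threads at once. This is only a minimal sketch, assuming the same four tables and bit layout as the script above; `frame1_seed` and `collision_groups` are hypothetical helper names, not functions from the miner.

```python
from collections import defaultdict

# Tables copied from the script above (only the four used by the 1st frame)
Z_PREP = [12, 13, 14, 15, 16, 17, 12, 13]
Z_TIMED = [84, 243, 249, 250, 251, 252, 253, 255]
Z_NS = [129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 145, 149, 154, 155,
        156, 157, 177, 178, 179, 180, 182, 183, 184, 185, 186, 187, 188, 189,
        190, 191, 192, 193, 194, 196, 197, 198, 199, 200, 201, 202, 203, 204,
        205, 206, 207, 208, 209, 210, 211, 212, 213, 241, 244, 245, 246, 247,
        248, 249, 250, 251, 252, 253, 254, 255]
Z_ING = [18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
         35, 36, 37, 38, 39, 40, 41, 42, 43, 23, 24, 31, 32, 33, 34]


def frame1_seed(thread):
    """1st-frame seed for a thread ID (hypothetical helper, same bit layout as above)."""
    return (
        Z_PREP[thread & 7],
        Z_TIMED[(thread >> 3) & 7],
        1,
        5,
        Z_NS[(thread >> 6) & 63],
        1,
        Z_ING[(thread >> 12) & 31],
    )


def collision_groups(n_threads):
    """Return lists of thread IDs that share a seed (groups of size >= 2 only)."""
    groups = defaultdict(list)
    for thread in range(n_threads):
        groups[frame1_seed(thread)].append(thread)
    return [ids for ids in groups.values() if len(ids) > 1]


groups = collision_groups(131071)
# Every thread beyond the first in a group repeats work already assigned.
wasted = sum(len(ids) - 1 for ids in groups)
print(wasted)
print([ids for ids in groups if 993 in ids or 968 in ids])
```

The `wasted` count is the same quantity as `overlap` in the script above, just derived from the groups, so the two can be cross-checked against each other.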

For the 2nd frame the number is 23553/131073 (18%).

Please correct me if I'm wrong.

Answer 1 · 2019-03-14T05:00:51.000Z

It turns out that the root cause comes from Z_PREP, because Z_PREP[0] == Z_PREP[6] and Z_PREP[1] == Z_PREP[7]. I guess you intentionally padded that array to a length of 2^N so that it can be indexed with a bitwise operation, but I think there's another solution. Do you have a whitepaper or article about this algorithm?
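
A per-table count of distinct values is a quick way to see where the overlap comes from. The sketch below assumes, as in the script in the question, that each table is indexed by its own independent bit field, so the number of distinct seeds is the product of the distinct values per table. Under that assumption Z_PREP is not the only contributor: the last six entries of Z_ING (23, 24, 31, 32, 33, 34) also repeat earlier values.

```python
# Tables copied from the question (only the four used by the 1st frame)
tables = {
    "Z_PREP": [12, 13, 14, 15, 16, 17, 12, 13],
    "Z_TIMED": [84, 243, 249, 250, 251, 252, 253, 255],
    "Z_NS": [129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 145, 149, 154,
             155, 156, 157, 177, 178, 179, 180, 182, 183, 184, 185, 186, 187,
             188, 189, 190, 191, 192, 193, 194, 196, 197, 198, 199, 200, 201,
             202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 241,
             244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255],
    "Z_ING": [18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
              34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 23, 24, 31, 32, 33, 34],
}

total = 1   # number of index combinations (threads)
unique = 1  # number of distinct seeds those combinations can produce
for name, values in tables.items():
    distinct = len(set(values))
    print(name, "entries:", len(values), "distinct:", distinct)
    total *= len(values)
    unique *= distinct

# 8 * 8 * 64 * 32 = 131072 combinations, 6 * 8 * 64 * 26 = 79872 distinct seeds
print("overlap:", total - unique, "of", total)
```

For a full grid of 131072 threads this gives 51200 overlapping seeds (about 39%), in line with the figure measured in the question, where the loop stops one thread short.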

Answer 2 · 2019-03-14T05:51:16.000Z

There's a CPU version in trigg.c, which gives a rundown of the algorithm in a simpler form.

https://github.com/mochimodev/mochimo/blob/master/src/trigg/trigg.c

Answer 3 · 2019-03-14T06:31:30.000Z

@chrisdigity Thanks, I went through that file already. I just want to make sure of the math before implementing a fix.

Answer 4 · 2019-04-26T23:25:18.000Z

@krypdkat Did you manage a fix for this work overlap?

Answer 5 · 2019-04-27T08:48:55.000Z

Yes, I fixed that a few days after opening this issue. I'll try to merge that change into your miner code base. Here is an idea that any dev reading this issue can try:
The goal is to remove the duplicated seeds while keeping those arrays at a size of 2^N. For example, instead of defining Z_PREP with 8 seeds (2 of them overlapping), we should define it with size 4 and fill Z_PREP at run time via a symbol (aka constant memory, which likely has the same perf as shared mem).
Pseudocode:

// old
__constant__ uint8_t Z_PREP[8] = {12,13,14,15,16,17,12,13};

// new
__constant__ uint8_t Z_PREP[4];
// At run time, once at start-up.
// random() is a placeholder for picking 4 distinct seed values.
uint8_t cpu_z_prep[4] = {random(), random(), random(), random()};
cudaMemcpyToSymbol(Z_PREP, cpu_z_prep, sizeof(cpu_z_prep), 0, cudaMemcpyHostToDevice);

Do the same for the other arrays. We should do this for the OpenCL code as well, either by passing the values as compilation flags or by replacing the array values at kernel compilation time.
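
One way to prepare such tables on the host side can be sketched in Python, the language of the script in the question: drop repeated values, then keep the largest power-of-two prefix so that bit-mask indexing still works. `dedupe_pow2` is a hypothetical helper, not part of the miner, and truncating is only one possible choice. It discards some valid values (16 and 17 in Z_PREP, for example), which a run-time fill could rotate back in.

```python
def dedupe_pow2(table):
    """Drop repeated values, then truncate to the largest power-of-two length.

    Hypothetical helper: keeps first-seen order so the result stays close
    to the original table.
    """
    unique = list(dict.fromkeys(table))  # dict keys preserve insertion order
    if not unique:
        return []
    size = 1 << (len(unique).bit_length() - 1)  # largest 2^k <= len(unique)
    return unique[:size]


# Tables copied from the question
Z_PREP = [12, 13, 14, 15, 16, 17, 12, 13]
Z_ING = [18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
         35, 36, 37, 38, 39, 40, 41, 42, 43, 23, 24, 31, 32, 33, 34]

new_prep = dedupe_pow2(Z_PREP)
new_ing = dedupe_pow2(Z_ING)
print(new_prep, "mask:", len(new_prep) - 1)
print(len(new_ing), "entries, mask:", len(new_ing) - 1)
```

Because each resulting table has a power-of-two length and no repeats, indexing it with `thread & (len(table) - 1)` maps every index to a different value. The kernel's shift amounts would have to shrink along with the tables, which this sketch does not cover.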