Generation of dense sketching operators doesn't return updated RNGState
rileyjmurray opened this issue · 11 comments
Ping @kaiwenhe7. It looks like your big PR a couple of weeks back introduced a bug. The current implementation (renamed to fill_dense_submat_impl a while back) can be found here:
Lines 203 to 270 in 9debb57
The old implementation can be found right around here:
29ddfb7#diff-454044b3f0993309c60438575edec1d146c1cdfd03f6ca6002c59fb8729bec8cL208
The problem with the new implementation is that the RNGState it returns is based on {c, k}, but c is never updated in the function. Compare this to the old implementation, which did update c, e.g., here:
29ddfb7#diff-454044b3f0993309c60438575edec1d146c1cdfd03f6ca6002c59fb8729bec8cL267
@TeachRaccooon is going to try working on this first.
Oh yes! In the parallel case, each thread creates its own copy of c. I forgot to update the actual counter at the end. It should be really simple to just advance c all the way to its final state at the end.
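For concreteness, here is a minimal sketch of the pattern being described, with stand-in types (Counter plays the role of RNG::ctr_type and cbrng stands in for the counter-based generator); it is illustrative, not the actual RandBLAS code:

```cpp
#include <cstdint>
#include <vector>

struct Counter { uint64_t v = 0; void incr(uint64_t n = 1) { v += n; } };

double cbrng(Counter /*c*/, uint64_t /*key*/) { return 0.0; } // stand-in generator

Counter fill_submat(std::vector<double> &buf, int64_t n, Counter c, uint64_t key) {
    #pragma omp parallel for
    for (int64_t i = 0; i < n; ++i) {
        Counter ci = c;           // each iteration/thread works on a private copy
        ci.incr(i);               // position the private counter for entry i
        buf[i] = cbrng(ci, key);  // only the copy is ever advanced
    }
    c.incr(n);  // the missing step: advance the caller-visible counter past all
                // generated entries before returning the state
    return c;
}
```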
@kaiwenhe7 Guess my RandBLAS skills aren't strong enough; I wasn't able to fix your code.
If you know the solution, please also add a basic test for it so that we make sure this issue is fully taken care of.
@TeachRaccooon The solution is more like a hack. Since the main body of the code only uses a private copy of c, the actual c is left untouched during the whole process of generating the submatrix. Given the position of the last entry of the submatrix (entry n_scols*n_srows), the associated counter position can be computed. So before returning the state, I was thinking of just advancing the counter all the way to that final position.
@kaiwenhe7 do you mean just literally doing
c.incr(n_srows * n_scols);
return RNGState<RNG> {c, k};
?
@TeachRaccooon Yes. But I think it would be something like
c.incr( (int64_t) (ptr + n_scols + (n_srows-1)*n_cols) / RNG::ctr_type::static_size );
It's the position of the last entry of the submatrix within the larger random matrix. Each counter value is associated with an array of random numbers of size RNG::ctr_type::static_size, and the start of the counter range is associated with the first entry of the larger random matrix.
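To spell out that arithmetic, here is a hedged sketch of the index-to-counter mapping being proposed; the names ptr, n_srows, n_scols, n_cols, and static_size follow the discussion, and the formula is the one quoted above rather than something verified against the RandBLAS source:

```cpp
#include <cstdint>

int64_t counter_offset(int64_t ptr, int64_t n_srows, int64_t n_scols,
                       int64_t n_cols, int64_t static_size) {
    // Flat index of the last submatrix entry inside the larger random matrix
    // (row-major), per the expression proposed in the comment above.
    int64_t last = ptr + n_scols + (n_srows - 1) * n_cols;
    // Each counter value yields static_size random numbers, so the counter
    // producing entry `last` sits last / static_size increments past the
    // counter associated with entry 0.
    return last / static_size;
}
```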
@rileyjmurray there's another subtle issue with the new implementation. OpenMP does not specify how loop iterations are divided among threads; there are different ways to handle the case where the number of threads does not evenly divide the loop count. Therefore there is no portable way to calculate the per-thread start index needed to initialize the counter so that the sequence is the same independent of the number of threads (line 243 above). This can lead to the same counter values being used in different threads. The implementation might give the correct result when, for instance, the number of threads evenly divides the number of rows. This will be a very subtle bug to identify and fix later on, since the behavior depends on the number of rows, the number of threads, and the OpenMP implementation. A user may not even notice that different threads are generating the same values. This is why the old implementation did not use OpenMP parallel for, and instead explicitly partitioned loop indices to threads.
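For reference, a hedged sketch of the explicit-partitioning idea mentioned at the end of that comment: each thread computes its own contiguous block of row indices, so the row-to-counter mapping does not depend on OpenMP's scheduling choices. Names are illustrative, not the actual old implementation:

```cpp
#include <omp.h>
#include <algorithm>
#include <cstdint>

void fill_rows_partitioned(int64_t n_rows /*, counter, key, buffer, ... */) {
    #pragma omp parallel
    {
        int tid      = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        // Deterministic block decomposition: the first `rem` threads take one
        // extra row when nthreads does not divide n_rows evenly.
        int64_t chunk = n_rows / nthreads;
        int64_t rem   = n_rows % nthreads;
        int64_t start = tid * chunk + std::min<int64_t>(tid, rem);
        int64_t end   = start + chunk + (tid < rem ? 1 : 0);
        for (int64_t i = start; i < end; ++i) {
            // initialize the counter from i and fill row i ...
        }
    }
}
```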
@burlen thanks for those insights. We'll be adding tests to try and stress-test the cases you mentioned. If we can't get them to pass then I'd be happy for us to consider alternative implementations, such as partitioning indices as you suggest.
@burlen @rileyjmurray For OpenMP, is it incorrect to assume that for loops are partitioned into blocks where each thread handles a block? I changed the submatrix generation test so that the dimensions are prime, and it seems to pass all the tests so far.
Actually there is a test for this: TestDenseThreads varies the number of threads and checks that the matrix is the same as with 1 thread. Since your code passes that, it must be correct. I think I was wrong about the new code duplicating values, and I think I see why. Sorry for the confusion on my part!
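For readers unfamiliar with that test, here is a hedged sketch of the kind of check being described (not the actual TestDenseThreads code): run the same fill routine with 1 thread and then with several thread counts, and require bitwise-identical output. The fill callback is a hypothetical stand-in for the routine under test:

```cpp
#include <omp.h>
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

using FillFn = std::function<void(double *buf, int64_t n_rows, int64_t n_cols)>;

void check_thread_invariance(const FillFn &fill, int64_t n_rows, int64_t n_cols) {
    std::vector<double> ref(n_rows * n_cols), out(n_rows * n_cols);
    omp_set_num_threads(1);
    fill(ref.data(), n_rows, n_cols);          // reference: single-threaded run
    for (int nthreads = 2; nthreads <= 8; ++nthreads) {
        omp_set_num_threads(nthreads);
        fill(out.data(), n_rows, n_cols);      // same routine, more threads
        assert(ref == out);                    // values must not depend on thread count
    }
}
```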
OpenMP has different scheduling strategies. Static scheduling with a fixed chunk size is what you're describing; to ensure this is what is used, you have to add schedule(static) to your directive. If you don't also set a chunk size, as in schedule(static, chunk_size), then there's ambiguity in how the implementation deals with the remainder of iterations when the number of threads doesn't divide the loop count evenly. But if you do set a chunk size, it may be sub-optimal. And if you don't specify a scheduling strategy explicitly, it's up to the implementation which one to use; it could choose dynamic scheduling. With dynamic scheduling the chunk size defaults to 1, and threads take a new index as soon as they finish processing the current one. In that case indices are processed one by one in a non-deterministic order.
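A small sketch contrasting the two clauses just described; with plain schedule(static) the runtime decides how to split any remainder iterations, while schedule(static, chunk) fixes the chunk size but may be sub-optimal:

```cpp
#include <cstdint>

void demo_schedules(int64_t n, int64_t chunk) {
    #pragma omp parallel for schedule(static)
    for (int64_t i = 0; i < n; ++i) { /* one near-equal block per thread */ }

    #pragma omp parallel for schedule(static, chunk)
    for (int64_t i = 0; i < n; ++i) { /* fixed-size chunks, dealt round-robin */ }
}
```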
At any rate, I'm seeing now that your code does not depend on how OpenMP partitions the iterations; you're using the loop index to back out the counter value. I think this is correct. My bad!
Resolved by #62.