Support sparse matrices with more than 2^31-1 elements

Currently there are a number of places in the codebase where the dtypes of indptr and indices are hardcoded to 'i' (i.e., int32), for example:

cupy/cupyx/scipy/sparse/compressed.py

Lines 330 to 331 in 22e270f

    
    indices = cupy.array(x.indices, dtype='i')
    indptr = cupy.array(x.indptr, dtype='i')

But they shouldn't be. SciPy uses a helper function, get_index_dtype(), to choose the index dtype:
https://github.com/scipy/scipy/blob/f4b5605031f738bc87ae4e193d614d525b98ffba/scipy/sparse/compressed.py#L65-L71
I think this change shouldn't be hard, but it is worth a standalone PR with sufficient test coverage added.
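To make the idea concrete, here is a minimal sketch of the kind of selection logic such a helper performs. It is not SciPy's implementation: choose_index_dtype and INT32_MAX are hypothetical names, and the sketch assumes the only inputs that matter are the matrix shape and the number of stored elements.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit index can hold


def choose_index_dtype(shape, nnz):
    """Return 'int32' if every index value fits in 32 bits, else 'int64'.

    Hypothetical helper for illustration only; not SciPy's or CuPy's code.
    """
    # Entries of `indices` are bounded by the matrix dimensions, and the
    # last entry of `indptr` equals the number of stored elements (nnz).
    largest = max(max(shape), nnz)
    return 'int32' if largest <= INT32_MAX else 'int64'


print(choose_index_dtype((1000, 1000), 50_000))    # small matrix
print(choose_index_dtype((10, 10**10), 100))       # one dimension too large
print(choose_index_dtype((2**20, 2**20), 2**31))   # too many stored elements
```

The returned name could then be passed as the dtype argument in place of the hardcoded 'i'.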

For the short term, I'd say let's encourage PR authors to take this into account, so we can make the changes gradually. In particular, when a new ElementwiseKernel or ReductionKernel is added, we just need to use a template type T as opposed to a hardcoded int32.

It seems that the newer cuSPARSE Generic API supports 64-bit indices (whereas the legacy API we currently use supports only 32-bit indices).

https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-api-reference

The cuSPARSE Generic APIs allow computing the most common sparse linear algebra operations, such as sparse matrix-vector (SpMV) and sparse matrix-matrix multiplication (SpMM), in a flexible way. The new APIs have the following capabilities and features:

Set matrix data layouts, number of batches, and storage formats (for example, CSR, COO, and so on)

Set input/output/compute data types. This also allows mixed data-type computation

Set types of sparse matrix indices

Choose the algorithm for the computation

Provide external device memory for internal operations

Provide extensive consistency checks across input matrices and vectors for a given routine. This includes the validation of matrix sizes, data types, layout, allowed operations, etc.

The cuSPARSE Generic API is available starting with CUDA 10.1.
https://docs.nvidia.com/cuda/archive/10.1/cusparse/index.html#cusparse-generic-api-reference

CuPy already has cuSPARSE Generic API support (#3129 and #3242), awesome @anaruse!

That's right, so using 64-bit indices becomes possible!

When working on #4778 I noticed one obstacle: in some index calculations we need scatter_add to handle int64, but it relies on CUDA's builtin atomicAdd, which does not have an int64 specialization. I don't see an easy workaround for the moment.
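For what it's worth, one workaround that is sometimes used in CUDA code is to reinterpret the int64 operands as unsigned long long, for which atomicAdd does have an overload, since two's-complement addition produces the same bit pattern either way. Whether that fits scatter_add here is untested. The pure-Python sketch below only checks the arithmetic claim by emulating 64-bit wraparound; the helper names are hypothetical and nothing in it touches CUDA or CuPy.

```python
MASK64 = (1 << 64) - 1  # keeps only the low 64 bits


def to_unsigned64(value):
    """Reinterpret a signed 64-bit integer as its unsigned bit pattern."""
    return value & MASK64


def to_signed64(bits):
    """Reinterpret an unsigned 64-bit bit pattern as a signed integer."""
    return bits - (1 << 64) if bits >= (1 << 63) else bits


def add_via_unsigned(a, b):
    """Add two signed 64-bit values using only unsigned 64-bit arithmetic."""
    wrapped_sum = (to_unsigned64(a) + to_unsigned64(b)) & MASK64
    return to_signed64(wrapped_sum)


# The unsigned route agrees with ordinary signed addition for in-range results.
for a, b in [(-5, 3), (2**40, -1), (-2**62, -2**62), (7, 8)]:
    assert add_via_unsigned(a, b) == a + b
print("unsigned and signed addition agree on all samples")
```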

ref: 5422788

I noticed the docs say several cuSPARSE APIs will be removed in the next major release, e.g.

https://docs.nvidia.com/cuda/cusparse/index.html#csrgemm2

11.2. cusparse<t>csrgemm2() [DEPRECATED]
[[DEPRECATED]] use cusparseSpGEMM() instead. The routine will be removed in the next major release

Maybe we should prioritize this?

Yes, there are a ton of deprecation warnings emitted when compiling the cuSPARSE modules. Even without considering 64-bit support, it would still be nice to remove these functions.

#3513 (comment) Fortunately, the oldest CTK that CuPy supports (10.2) now has the Generic API, so we don't have to think about mixing it with the older API (which supports only 32-bit indices).

Is there still work being done on 64-bit indptr support for future CuPy versions?

We haven't started working on this one.

@takagi We've received several requests for this one, so maybe we need to reconsider the priority?

Agreed. I've marked this as high priority and intend to work on it.
