deeplearning4j/deeplearning4j

BaseCudaDataBuffer.getAllocationPoint is null

adiantek opened this issue · 4 comments

[main] INFO org.nd4j.linalg.factory.Nd4jBackend - Loaded [JCublasBackend] backend
[main] INFO org.nd4j.nativeblas.NativeOpsHolder - Number of threads used for linear algebra: 32
[main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Backend used: [CUDA]; OS: [Windows 11]
[main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Cores: [12]; Memory: [16,0GB];
[main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Blas vendor: [CUBLAS]
[main] INFO org.nd4j.linalg.jcublas.JCublasBackend - ND4J CUDA build version: 11.6.55
[main] INFO org.nd4j.linalg.jcublas.JCublasBackend - CUDA device 0: [NVIDIA GeForce GTX 1050 Ti]; cc: [6.1]; Total memory: [4294836224]
[main] INFO org.nd4j.linalg.jcublas.JCublasBackend - Backend build information:
 MSVC: 192930146
STD version: 201402L
DEFAULT_ENGINE: samediff::ENGINE_CUDA
HAVE_FLATBUFFERS
HAVE_CUDNN
[main] INFO org.deeplearning4j.models.sequencevectors.SequenceVectors - Starting vocabulary building...
[main] INFO org.deeplearning4j.models.word2vec.wordstore.VocabConstructor - Sequences checked: [100000]; Current vocabulary size: [3677729]; Sequences/sec: 6567,71; Words/sec: 2053314,46;
[main] INFO org.deeplearning4j.models.word2vec.wordstore.VocabConstructor - Sequences checked: [200000]; Current vocabulary size: [6409578]; Sequences/sec: 7578,05; Words/sec: 2372158,23;
[main] INFO org.deeplearning4j.models.word2vec.wordstore.VocabConstructor - Sequences checked: [300000]; Current vocabulary size: [8891821]; Sequences/sec: 8194,03; Words/sec: 2564908,06;
[main] INFO org.deeplearning4j.models.word2vec.wordstore.VocabConstructor - Sequences checked: [400000]; Current vocabulary size: [11216023]; Sequences/sec: 8352,13; Words/sec: 2616189,59;
[main] INFO org.deeplearning4j.models.word2vec.wordstore.VocabConstructor - Sequences checked: [500000]; Current vocabulary size: [13451125]; Sequences/sec: 7713,67; Words/sec: 2415591,87;
[main] INFO org.deeplearning4j.models.word2vec.wordstore.VocabConstructor - Sequences checked: [534341], Current vocabulary size: [2744938]; Sequences/sec: [6167,23];
[main] INFO org.deeplearning4j.models.embeddings.loader.WordVectorSerializer - Projected memory use for model: [209,42 MB]
[main] INFO org.deeplearning4j.models.embeddings.inmemory.InMemoryLookupTable - Initializing syn1...
[main] INFO org.deeplearning4j.models.sequencevectors.SequenceVectors - Building learning algorithms:
[main] INFO org.deeplearning4j.models.sequencevectors.SequenceVectors -           building ElementsLearningAlgorithm: [SkipGram]
[main] INFO org.deeplearning4j.models.sequencevectors.SequenceVectors - Starting learning process...
Exception in thread "VectorCalculationsThread 0" java.lang.RuntimeException: java.lang.RuntimeException: Op [skipgram] execution failed
        at org.deeplearning4j.models.sequencevectors.SequenceVectors$VectorCalculationsThread.run(SequenceVectors.java:1328)
Caused by: java.lang.RuntimeException: Op [skipgram] execution failed
        at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:1881)
        at org.deeplearning4j.models.embeddings.learning.impl.elements.SkipGram.iterateSample(SkipGram.java:533)
        at org.deeplearning4j.models.sequencevectors.SequenceVectors$VectorCalculationsThread.run(SequenceVectors.java:1302)
Caused by: java.lang.NullPointerException: Cannot invoke "org.nd4j.linalg.jcublas.buffer.BaseCudaDataBuffer.getAllocationPoint()" because "buffer" is null
        at org.nd4j.jita.allocator.impl.AtomicAllocator.getAllocationPoint(AtomicAllocator.java:957)
        at org.nd4j.jita.allocator.impl.AtomicAllocator.tickDeviceWrite(AtomicAllocator.java:947)
        at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:2083)
        at org.nd4j.linalg.jcublas.ops.executioner.CudaExecutioner.exec(CudaExecutioner.java:1870)
        ... 2 more

The exception occurs only on the GPU backend.

GPU: GTX 1050 Ti
Dependencies:

    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-core</artifactId>
        <version>1.0.0-M2.1</version>
    </dependency>
    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-nlp</artifactId>
        <version>1.0.0-M2.1</version>
    </dependency>
    <dependency>
        <groupId>org.nd4j</groupId>
        <artifactId>nd4j-native</artifactId>
        <version>1.0.0-M2.1</version>
    </dependency>
    <dependency>
        <groupId>org.nd4j</groupId>
        <artifactId>nd4j-cuda-11.6</artifactId>
        <version>1.0.0-M2.1</version>
    </dependency>
    <dependency>
        <groupId>org.nd4j</groupId>
        <artifactId>nd4j-cuda-11.6</artifactId>
        <version>1.0.0-M2.1</version>
        <classifier>windows-x86_64-cudnn</classifier>
    </dependency>
    <dependency>
        <groupId>org.bytedeco</groupId>
        <artifactId>cuda</artifactId>
        <version>11.6-8.3-1.5.7</version>
        <classifier>windows-x86_64-redist</classifier>
    </dependency> 

When I run the same code on nd4j-native-platform, I don't see any exceptions.
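A note on the dependency list above: with both `nd4j-native` and `nd4j-cuda-11.6` on the classpath, ND4J selects the backend by priority, and the CUDA backend wins by default. If I read the ND4J backend docs correctly, the choice can be flipped with environment variables instead of editing the POM (variable names taken from the ND4J docs; treat the exact values as an assumption, not a verified fix):

```shell
# Sketch: force the CPU backend when both nd4j-native and nd4j-cuda
# are on the classpath. Higher priority wins; GPU defaults higher.
export BACKEND_PRIORITY_CPU=10
export BACKEND_PRIORITY_GPU=0
# my-app.jar is a placeholder for your application jar.
java -jar my-app.jar
```

This makes it easy to confirm whether a crash is backend-specific without changing dependencies between runs.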

@adiantek JFYI, skipgram/word2vec won't be very efficient on GPU. It's not really a supported implementation, if that's your use case. Stick to CPU for that. I'll double-check the issue separately.

Okay, thanks. Is it inefficient on GPU in any implementation, or only in deeplearning4j? What is the reason?

So I can use the CPU for skipgram, then run the MultiLayerNetwork on the GPU?

@adiantek It can sort of work in general, but the main problem with skipgram is that you do lots of small allocations plus constant communication with the GPU. It's just about the worst kind of algorithm to parallelize on a GPU: it's all sparse updates. You can see some benefit from batching, but not a lot, due to the constant allocation. If you want something more optimized, I'll be releasing M3 soon and it will be much faster. You can build from source if you want to try it.

Edit: Yes, I would suggest that. Run them in separate processes, then train the MLN on your final embeddings.
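The suggested split could look roughly like this (class names and flags are hypothetical placeholders; the point is only that each step runs as its own JVM process with a different ND4J backend on the classpath):

```shell
# Step 1: train word2vec/skipgram on CPU (jar built with nd4j-native)
# and write the final embeddings to disk, e.g. via WordVectorSerializer.
java -cp app-cpu.jar com.example.TrainWord2Vec --out vectors.txt

# Step 2: train the MultiLayerNetwork on GPU (jar built with nd4j-cuda),
# loading the frozen embeddings produced by step 1.
java -cp app-gpu.jar com.example.TrainNetwork --embeddings vectors.txt
```

Keeping the two backends in separate processes also avoids any classpath priority ambiguity between `nd4j-native` and `nd4j-cuda`.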

Closing this. Skipgram won't be supported on GPU (the op just doesn't fit there).