deeplearning4j/deeplearning4j-examples

Good approach for reading data with Spark

sjsdfg opened this issue · 2 comments

In this example: https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-spark-examples/dl4j-spark/src/main/java/org/deeplearning4j/mlp/MnistMLPExample.java

there is this comment and code:

//Load the data into memory then parallelize
//This isn't a good approach in general - but is simple to use for this example
DataSetIterator iterTrain = new MnistDataSetIterator(batchSizePerWorker, true, 12345);
DataSetIterator iterTest = new MnistDataSetIterator(batchSizePerWorker, false, 12345);
List<DataSet> trainDataList = new ArrayList<>();
List<DataSet> testDataList = new ArrayList<>();
while (iterTrain.hasNext()) {
    trainDataList.add(iterTrain.next());
}
while (iterTest.hasNext()) {
    testDataList.add(iterTest.next());
}

I know this approach is limited by the machine's memory.
Do you have any advice on a good approach for parallelizing the data?

OK, so there are two ways:
(a) use SparkContext.parallelize (a standard Spark operation) - easy, but poor performance, since all preprocessing happens on the master
(b) write a better data pipeline that does the reading and conversion in parallel on the workers

Rough sketches of both are below.
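
A minimal sketch of (a), reusing the trainDataList / testDataList built in the example above, and assuming a JavaSparkContext sc and a SparkDl4jMultiLayer sparkNet have already been created (as they are in MnistMLPExample):

// Approach (a): parallelize the in-memory lists - simple, but everything
// was loaded and preprocessed on the driver before being shipped out
JavaRDD<DataSet> trainData = sc.parallelize(trainDataList);
JavaRDD<DataSet> testData = sc.parallelize(testDataList);
sparkNet.fit(trainData);   // train on the parallelized RDD

And a minimal sketch of (b), assuming the raw data is a CSV file on HDFS (the path, labelIndex and numClasses below are placeholders) and using the DataVec Spark helpers so that parsing and DataSet conversion happen on the executors rather than on the driver:

import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.writable.Writable;
import org.datavec.spark.transform.misc.StringToWritablesFunction;
import org.deeplearning4j.spark.datavec.DataVecDataSetFunction;
import org.nd4j.linalg.dataset.DataSet;

public class ParallelCsvPipeline {

    public static JavaRDD<DataSet> buildTrainingData(JavaSparkContext sc) {
        int labelIndex = 0;    // column index of the label (placeholder)
        int numClasses = 10;   // number of output classes (placeholder)

        // Read raw lines in parallel across the cluster
        JavaRDD<String> rawLines = sc.textFile("hdfs:///data/train.csv");

        // Parse each CSV line into DataVec Writables on the workers
        JavaRDD<List<Writable>> parsed =
                rawLines.map(new StringToWritablesFunction(new CSVRecordReader()));

        // Convert each parsed record into a DataSet (features + one-hot label) on the workers
        return parsed.map(new DataVecDataSetFunction(labelIndex, numClasses, false));
    }
}

The resulting JavaRDD<DataSet> can be passed straight to sparkNet.fit(...), so no more than one partition's worth of data ever has to sit in any single JVM's memory.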