Good approach for reading data with Spark
sjsdfg opened this issue · 2 comments
sjsdfg commented
There is a comment in the example code:
//Load the data into memory then parallelize
//This isn't a good approach in general - but is simple to use for this example
DataSetIterator iterTrain = new MnistDataSetIterator(batchSizePerWorker, true, 12345);   // second argument true = training split
DataSetIterator iterTest = new MnistDataSetIterator(batchSizePerWorker, false, 12345);   // false = test split
List<DataSet> trainDataList = new ArrayList<>();
List<DataSet> testDataList = new ArrayList<>();
while (iterTrain.hasNext()) {
    trainDataList.add(iterTrain.next());
}
while (iterTest.hasNext()) {
    testDataList.add(iterTest.next());
}
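For reference, the example then parallelizes these driver-side lists into RDDs. A minimal sketch of that step, assuming a JavaSparkContext named sc is already available:

import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.nd4j.linalg.dataset.DataSet;

// Parallelize the in-memory lists built above (assumes JavaSparkContext sc exists)
JavaRDD<DataSet> trainData = sc.parallelize(trainDataList);
JavaRDD<DataSet> testData = sc.parallelize(testDataList);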
I know this approach is limited by the memory of the driver machine.
Do you have any advice on a better approach to parallelizing the data?
sjsdfg commented
Alex Black @AlexDBlack 12:41
@sjsdfg use SparkContext.textFile(path)
plus this: https://github.com/deeplearning4j/deeplearning4j/blob/344e0f49a21308540162194d1c952c8446f30318/deeplearning4j-scaleout/spark/dl4j-spark/src/test/java/org/deeplearning4j/spark/datavec/TestDataVecDataSetFunctions.java#L126-L127
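To illustrate the suggestion, here is a minimal sketch of that kind of pipeline, assuming a CSV file at path, a JavaSparkContext sc, and hypothetical labelIndex / numClasses values for the dataset:

import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.writable.Writable;
import org.datavec.spark.transform.misc.StringToWritablesFunction;
import org.deeplearning4j.spark.datavec.DataVecDataSetFunction;
import org.nd4j.linalg.dataset.DataSet;

// Read raw lines in parallel on the workers - no driver-side bottleneck
JavaRDD<String> rawLines = sc.textFile(path);

// Parse each line into DataVec Writables with a CSV record reader
JavaRDD<List<Writable>> parsed = rawLines.map(new StringToWritablesFunction(new CSVRecordReader()));

// Convert the Writables to DataSets; labelIndex and numClasses depend on your data
JavaRDD<DataSet> trainData = parsed.map(new DataVecDataSetFunction(labelIndex, numClasses, false));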
sjsdfg commented
OK, so there are two ways:
(a) use SparkContext.parallelize (a standard Spark operation) - easy, but poor performance (all preprocessing happens on the master)
(b) write a better data pipeline that does the reading and conversion in parallel on the workers
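Either way, the resulting JavaRDD<DataSet> is what gets passed to Spark training. A minimal sketch, assuming an existing MultiLayerConfiguration named conf and a JavaSparkContext sc:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.spark.impl.multilayer.SparkDl4jMultiLayer;
import org.deeplearning4j.spark.impl.paramavg.ParameterAveragingTrainingMaster;
import org.nd4j.linalg.dataset.DataSet;

// Parameter averaging training master; batchSizePerWorker matches the iterator batch size above
ParameterAveragingTrainingMaster tm = new ParameterAveragingTrainingMaster.Builder(batchSizePerWorker)
        .batchSizePerWorker(batchSizePerWorker)
        .build();

// conf is an existing MultiLayerConfiguration (network definition not shown here)
SparkDl4jMultiLayer sparkNet = new SparkDl4jMultiLayer(sc, conf, tm);

// trainData is the JavaRDD<DataSet> produced by either approach (a) or (b)
sparkNet.fit(trainData);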