Good approach for reading data with Spark
sjsdfg opened this issue · 2 comments
sjsdfg commented
There is a comment in the example code:
//Load the data into memory then parallelize
//This isn't a good approach in general - but is simple to use for this example
DataSetIterator iterTrain = new MnistDataSetIterator(batchSizePerWorker, true, 12345);   // second argument true = training split
DataSetIterator iterTest = new MnistDataSetIterator(batchSizePerWorker, false, 12345);   // false = test split
List<DataSet> trainDataList = new ArrayList<>();
List<DataSet> testDataList = new ArrayList<>();
while (iterTrain.hasNext()) {
    trainDataList.add(iterTrain.next());
}
while (iterTest.hasNext()) {
    testDataList.add(iterTest.next());
}
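For reference, the example then parallelizes these driver-side lists into RDDs. A minimal sketch of that step, assuming a JavaSparkContext named sc is already available:

import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.nd4j.linalg.dataset.DataSet;

// Parallelize the in-memory lists built above (assumes JavaSparkContext sc exists)
JavaRDD<DataSet> trainData = sc.parallelize(trainDataList);
JavaRDD<DataSet> testData = sc.parallelize(testDataList);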
I know this approach is limited by the memory of the driver machine.
Do you have any advice on a better approach to parallelizing the data?
sjsdfg commented
Alex Black @AlexDBlack 12:41
@sjsdfg use SparkContext.textFile(path)
plus this: https://github.com/deeplearning4j/deeplearning4j/blob/344e0f49a21308540162194d1c952c8446f30318/deeplearning4j-scaleout/spark/dl4j-spark/src/test/java/org/deeplearning4j/spark/datavec/TestDataVecDataSetFunctions.java#L126-L127
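To illustrate the suggestion, here is a minimal sketch of that kind of pipeline, assuming a CSV file at path, a JavaSparkContext sc, and hypothetical labelIndex / numClasses values for the dataset:

import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.writable.Writable;
import org.datavec.spark.transform.misc.StringToWritablesFunction;
import org.deeplearning4j.spark.datavec.DataVecDataSetFunction;
import org.nd4j.linalg.dataset.DataSet;

// Read raw lines in parallel on the workers - no driver-side bottleneck
JavaRDD<String> rawLines = sc.textFile(path);

// Parse each line into DataVec Writables with a CSV record reader
JavaRDD<List<Writable>> parsed = rawLines.map(new StringToWritablesFunction(new CSVRecordReader()));

// Convert the Writables to DataSets; labelIndex and numClasses depend on your data
JavaRDD<DataSet> trainData = parsed.map(new DataVecDataSetFunction(labelIndex, numClasses, false));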
sjsdfg commented
OK, so there are two ways:
(a) use SparkContext.parallelize (a standard Spark operation) - easy, but poor performance (all preprocessing happens on the master)
(b) write a better data pipeline that does the reading and conversion in parallel on the workers
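Either way, the resulting JavaRDD<DataSet> is what gets passed to Spark training. A minimal sketch, assuming an existing MultiLayerConfiguration named conf and a JavaSparkContext sc:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.spark.impl.multilayer.SparkDl4jMultiLayer;
import org.deeplearning4j.spark.impl.paramavg.ParameterAveragingTrainingMaster;
import org.nd4j.linalg.dataset.DataSet;

// Parameter averaging training master; batchSizePerWorker matches the iterator batch size above
ParameterAveragingTrainingMaster tm = new ParameterAveragingTrainingMaster.Builder(batchSizePerWorker)
        .batchSizePerWorker(batchSizePerWorker)
        .build();

// conf is an existing MultiLayerConfiguration (network definition not shown here)
SparkDl4jMultiLayer sparkNet = new SparkDl4jMultiLayer(sc, conf, tm);

// trainData is the JavaRDD<DataSet> produced by either approach (a) or (b)
sparkNet.fit(trainData);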