hclhkbu/dlbench

Contact library authors for possible enhancements

Opened this issue · 10 comments

Randl commented

I'm a TensorFlow user, so I've opened an issue regarding the performance of TensorFlow in several cases.

One of the things we found out is that the code used by dlbench is suboptimal - tensorflow/tensorflow#7187 (comment)

So I thought you might consider contacting the other libraries' authors too, to get feedback from them.

Thanks for your suggestion. We have tried to contact the authors of other tools to confirm the scripts and configuration files. Feel free to submit your pull request if you have optimal implementations of our tested networks.

@Randl We found a configuration mistake in the MXNet ResNet-50 script and have revised it to the correct one, so we are re-running the revised script to generate new results. Could you also provide TensorFlow code that avoids using feed_dict=feed_dict in the FCN, so that we can release the newer results together? Please note that the TF version should be 0.11. Thank you!

Randl commented

@shyhuai You should ask @tfboyd for optimal code.

Randl commented

@shelhamer @KeDengMS @piiswrong @soumith Sorry if you're the wrong people to tag. Do you have anything to add? Do you think the benchmark can be improved somehow, or that your framework isn't being used in the most efficient way?

Hi @shyhuai ,

I know how hard it is to run a bunch of benchmarks using a wide range of tools. I do not know if I will have time to submit any PRs in the near future, but I will if I can find time. One idea I did have would make it easier for us to help: we do not do a lot with the CIFAR data sets because the image sizes are really small, and GPUs end up processing in some cases 6,000+ samples (images)/sec. I understand moving to ImageNet could be a big change given you have done multiple rounds with CIFAR.

Good luck on future iterations. I cannot say I will always have time, but please feel free to reach out to me for code or whatever. I do not want to influence your results, but I am happy to help as impartially as I can.

Oh sorry, one more thing. We should have an MNIST example that does not use the python feed soon. It was intended as a tutorial. I will try to submit a PR or at a minimum link it to you when it is released.

edit: we will have an MNIST example soon.

@tfboyd Thank you very much for your kind response and help. We are also trying to include the real ImageNet data set in the evaluation, but it could take more time to generate results, since it takes several days to train a network model to convergence. I will inform you if we have further progress.

@shyhuai, I appreciate your effort in building benchmarks for the major DL platforms. Please let me know if you find any issues in testing CNTK.
As to CIFAR vs. ImageNet, I think having both would be beneficial to measure the speed of computation and I/O separately. CIFAR-10 is a small data set, but one can still build complex networks on it, like ResNet110 in CNTK. That would be a very good indicator of how a platform performs when computation is intensive. ImageNet would put more pressure on I/O compared to CIFAR-10.

Maybe it's worth looking at something in between ImageNet and CIFAR, like the Pascal VOC dataset?

I rewrote all of the TensorFlow examples with the exception of the RNN. I think this can be closed once the PRs are accepted. I suspect our ResNet is still off, as there should not be as large a gap between any of the platforms on one or even multiple GPUs, especially a K80. They should all be within about 5-10%, maybe 20% in some weird cases, but in general, and as tested by NVIDIA, the top frameworks are nearly identical with CNNs (yes, some are faster and some slower, but not dramatically). RNNs might be a different story, but if everyone is using cuDNN, again it should be similar and not dramatically different.