nicholas-leonard/dp

Multi-threaded loading of multi-label dataset

jrbtaylor opened this issue · 9 comments

A flat folder structure is not possible with a multi-label dataset (in this case, video). Is it possible to set up asynchronous loading (and label extraction from a .json file) in dp?

Perhaps a more specific question: which method should I override to load batches with a custom function, so I can otherwise use the dp framework for training?

Sports-1m? Maybe pre-process and save tensors to disk the hard way, and use dp just to load those tensors? What did you end up doing?

It was for the ActivityNet challenge. I ended up using optim instead of dp and then it was pretty simple with the threads package.

So while loading data asynchronously, you handled the workers yourself?

Yes, that is the only way I found to do it. The threads package is pretty easy to use once you grasp the basic concept. You pass each worker two functions: one it runs independently of the other workers, and one that is called by the main thread (not a worker) once the first completes. Data loading happens in the worker thread (the first function). GPU training has to occur in the second function: each worker has no idea what the others are doing, so they might write to overlapping GPU memory.
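The worker/callback split described above can be sketched with the torch threads package roughly like this (a minimal sketch, not the actual ActivityNet code; the batch shapes and `nbatches` are placeholders):

```lua
local threads = require 'threads'

local nthread = 4
local pool = threads.Threads(
   nthread,
   function(threadid)
      -- runs once in each worker at startup; require what the job needs
      require 'torch'
   end
)

local nbatches = 100 -- placeholder
for i = 1, nbatches do
   pool:addjob(
      -- first function: runs in a worker thread; do the slow I/O here
      -- (e.g. read video frames and parse labels from the .json file)
      function()
         local inputs = torch.randn(32, 3, 224, 224) -- stand-in for real loading
         local targets = torch.zeros(32)
         return inputs, targets
      end,
      -- second function: runs in the MAIN thread when the job finishes;
      -- this is the safe place for GPU work, since only one thread touches the GPU
      function(inputs, targets)
         -- e.g. copy inputs/targets to the GPU and run one training step
      end
   )
end

pool:synchronize() -- wait for all jobs
pool:terminate()
```

The return values of the worker function are handed to the main-thread callback, which is what keeps all GPU writes serialized in one thread.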

Thanks! I was considering ElementResearch's dataload. In particular, AsyncIterator provides overridable methods.
Perhaps close this issue with a link to your code, for future searchers?

AsyncIterator looks great. Torch was lacking a general solution like that (or maybe I missed that entirely back when I was working on it).

If you're curious, here's my code: https://github.com/jrbtaylor/ActivityNet

@jrbtaylor, that link is broken. Good example, thanks!

I fixed the link. Thanks.