EleutherAI/gpt-neo

Performance issue in tasks.py

DLPerf opened this issue · 4 comments

Describe the bug
I've found a performance issue in "tasks.py": dataset = dataset.batch(params['eval_batch_size'], drop_remainder=True) should be called before dataset = dataset.map(_get_output). Batching first lets the map function process a whole batch per call instead of one example per call, which reduces per-element overhead.
The TensorFlow documentation supports this change.

To Reproduce
Steps to reproduce the behavior:

  1. Go to "tasks.py"
  2. Scroll down to line 104
  3. See error

Expected behavior
Call dataset = dataset.batch(params['eval_batch_size'], drop_remainder=True) before dataset = dataset.map(_get_output).

Proposed solution
Swap the order of dataset = dataset.map(_get_output) and dataset = dataset.batch(params['eval_batch_size'], drop_remainder=True) in "tasks.py".
Note that _get_output, the function passed to dataset.map(), needs to be checked and possibly adapted so that the changed code works properly. For example, if _get_output took input of shape (x, y, z) before the fix, it will receive input of shape (batch_size, x, y, z) after it.
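The shape concern above can be illustrated with a minimal sketch. It assumes TensorFlow 2.x with eager execution, and per_example and per_batch are hypothetical stand-ins, not the actual _get_output from "tasks.py". The point is only that an operation written for a single example (here a shift along axis 0) has to target a different axis once a leading batch dimension is present.

```python
import tensorflow as tf

SEQ_LEN = 4      # illustrative sequence length (assumption)
BATCH_SIZE = 2   # illustrative batch size (assumption)

# Six toy "examples" of SEQ_LEN token ids each.
tokens = tf.reshape(tf.range(6 * SEQ_LEN, dtype=tf.int32), [6, SEQ_LEN])


def per_example(x):
    # Hypothetical stand-in for a map function written for one example.
    # x has shape (SEQ_LEN,), so the sequence axis is axis 0.
    return x, tf.roll(x, shift=-1, axis=0)


def per_batch(x):
    # Same logic adapted for batched input.
    # x has shape (BATCH_SIZE, SEQ_LEN), so the sequence axis is now axis 1.
    return x, tf.roll(x, shift=-1, axis=1)


# Original order: map each example, then batch.
map_then_batch = (
    tf.data.Dataset.from_tensor_slices(tokens)
    .map(per_example)
    .batch(BATCH_SIZE, drop_remainder=True)
)

# Proposed order: batch first, then map once per batch.
batch_then_map = (
    tf.data.Dataset.from_tensor_slices(tokens)
    .batch(BATCH_SIZE, drop_remainder=True)
    .map(per_batch)
)

# Both pipelines should yield identical (input, output) batches.
pipelines_match = all(
    (a_in == b_in).all() and (a_out == b_out).all()
    for (a_in, a_out), (b_in, b_out) in zip(
        map_then_batch.as_numpy_iterator(),
        batch_then_map.as_numpy_iterator(),
    )
)
```

If per_batch kept axis=0, the roll would shift whole examples within the batch rather than tokens within each sequence, so an equality check like pipelines_match is a cheap guard when swapping the two calls.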

Looking forward to your reply. Btw, I'd be glad to create a PR to fix it if you are too busy.

Thanks for letting us know! It would be awesome if you could submit a PR with plots showing the performance improvement

Thanks for your reply! Is there any benchmark that measures the performance of the function lambada_input? @StellaAthena

Maybe I’m misunderstanding, but I was expecting you to run the code both ways and use a timer to show how long it takes.
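A minimal sketch of such a timer, using only the Python standard library. The helper name time_pipeline and its parameters are hypothetical, and it assumes the pipeline can be iterated directly (for a tf.data dataset that means eager execution).

```python
import time


def time_pipeline(make_iterable, num_batches, repeats=3):
    """Return the best wall-clock time in seconds needed to pull
    num_batches items from a freshly built iterable."""
    best = float("inf")
    for _ in range(repeats):
        iterator = iter(make_iterable())  # rebuild each run to avoid reusing state
        start = time.perf_counter()
        for _ in range(num_batches):
            next(iterator)
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload; replace the lambda with a function that builds the real dataset.
elapsed = time_pipeline(lambda: range(1000), num_batches=100)
print(f"best of 3 runs: {elapsed:.6f} s")
```

Running it once with the map-then-batch order and once with the batch-then-map order, on the same data and the same number of batches, gives the two numbers to compare. Taking the best of several repeats reduces noise from one-off setup costs.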

OK,
I'll try my best to run the code and calculate the time it takes.
Thank you, Dear Stella~ @StellaAthena