minimaxir/gpt-2-simple

Allow users to use Colaboratory's TPU for finetuning

minimaxir opened this issue · 7 comments

This alone will be the single biggest improvement for gpt-2-simple.

  • 8 cores
  • ~2x speed increase relative to a K80

= 16x training speed

Unfortunately, the documentation for using Colaboratory's TPU is a bit messy.

According to what I see here, the 8x speed-up only applies if you use a batch_size of 8 or more, because the batches are distributed among the cores. However, if you're already using a batch_size of 2, training should be about 4x faster once you change batch_size to 8, which is still a very nice speed-up.
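As a rough illustration of that reasoning, here is a back-of-the-envelope sketch in Python. It assumes the batch is split evenly across cores, that all cores run at the same rate, and that communication overhead is zero; the function name `ideal_speedup` is hypothetical and not part of gpt-2-simple.

```python
def ideal_speedup(old_batch_size, new_batch_size, num_cores=8):
    """Idealized throughput ratio, measured as the ratio of busy cores.

    Assumption: a batch keeps at most `num_cores` cores busy (one or more
    examples per core), and overhead between cores is ignored.
    """
    if old_batch_size < 1 or new_batch_size < 1:
        raise ValueError("batch sizes must be positive")
    old_busy = min(old_batch_size, num_cores)
    new_busy = min(new_batch_size, num_cores)
    return new_busy / old_busy


print(ideal_speedup(1, 8))  # going from batch_size=1 to batch_size=8
print(ideal_speedup(2, 8))  # going from batch_size=2 to batch_size=8
```

Real numbers will likely be lower, since this ignores input pipeline and synchronization costs.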

All the documentation I can find on using the TPUs seems to use TensorFlow's Keras API, like this example, so the model might have to be converted to that.

I do not use batch_size=2 and I believe it is a trap, as the GPU is almost fully utilized even at batch_size=1. In testing, they have the same throughput, except batch_size=1 doesn't max out the GPU memory.

Workflows that use the TPU do use batch_size=8 (or implement it by batch_size *= 8), and it's theoretically possible with the correct TensorFlow distribution strategy.
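To make the `batch_size *= 8` idea concrete, here is a minimal, framework-free sketch of splitting one global batch into per-core shards. The helper `shard_batch` and the integer stand-ins for training samples are hypothetical; an actual TensorFlow distribution strategy would handle this internally.

```python
def shard_batch(examples, num_cores=8):
    """Split a global batch into equal, contiguous per-core shards."""
    if len(examples) % num_cores != 0:
        raise ValueError("global batch size must be divisible by num_cores")
    per_core = len(examples) // num_cores
    return [examples[i * per_core:(i + 1) * per_core] for i in range(num_cores)]


per_core_batch_size = 1
global_batch_size = per_core_batch_size * 8    # the "batch_size *= 8" step
global_batch = list(range(global_batch_size))  # stand-in for tokenized samples
shards = shard_batch(global_batch)
print(len(shards), [len(s) for s in shards])
```

Each core then sees the same per-core batch size as the single-GPU setup, while the effective batch per step is 8 times larger.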

Hi all! I've been playing with the idea of making this run in a Colaboratory TPU. So far, no luck, but I seem to be really close.

I have a mess in the code right now -- my approach was first to make it work and then simplify and clean up.

I'm currently stuck at the point of loading the initialized model so it can be finetuned. It complains that the local file system scheme is not implemented. As far as I understand, the TPU is instructed (through tf.train.Saver) to pick up the model from a local source even though we specify a Google Cloud address. It fails because, apparently, TPUs work with GCS addresses for storage.
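One way to keep the paths straight is to derive the `gs://` address from the local checkpoint path in a single place. The sketch below only builds the string; the bucket name `my-bucket`, the `prefix` argument, and the helper `to_gcs_uri` are all hypothetical, and the checkpoint files would still have to be copied to the bucket separately.

```python
import posixpath


def to_gcs_uri(local_path, bucket, prefix=""):
    """Map a local checkpoint path to a gs:// URI under `bucket`.

    Assumption: the bucket mirrors the local directory layout.
    """
    if local_path.startswith("gs://"):
        return local_path  # already a GCS address, leave untouched
    # Normalize "./models/..." and similar forms to a clean relative path.
    relative = posixpath.normpath(local_path.replace("\\", "/")).lstrip("/")
    parts = [p.strip("/") for p in (bucket, prefix, relative) if p.strip("/")]
    return "gs://" + "/".join(parts)


print(to_gcs_uri("models/117M/model.ckpt", "my-bucket"))
```

The resulting URI is what would be handed to the checkpoint restore step instead of the bare `models/117M/model.ckpt` path.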

This is where I'm currently at: https://colab.research.google.com/drive/1_WVxlRgUjfAVZ5im2LaBoQcA5XnpU0K6

A few notes on that code:

  1. I reload the code from git instead of pip so I can keep sending modifications into the code and testing them. I thought this would be the easiest way to test but it's way too black-boxy to see what's going on.
  2. I suspect Google Drive may not be needed anymore; we can probably deal with GCS and the local Colaboratory storage only. I started with the approach of making this an option that would branch out in the same code, but it would quickly become messy, so I think the best approach might be to have a whole new finetune notebook and method that is specific to TPU processing.
  3. Everything is pretty much the same as the original code. :)

I might drop out of this effort for a couple of days (weeks?) unless someone has a quick approach I might take. Regardless, if someone benefits from my advances, it has been worth it!

For reference, this is the error:

Full error and stack trace
InvalidArgumentError                      Traceback (most recent call last)
InvalidArgumentError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on models/117M/model.ckpt: Unimplemented: File system scheme '[local]' not implemented (file: 'models/117M/model.ckpt')
	 [[node save/RestoreV2 (defined at <ipython-input-8-9187be9325b3>:101) ]]

Caused by op 'save/RestoreV2', defined at:
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelapp.py", line 477, in start
    ioloop.IOLoop.instance().start()
  File "/usr/local/lib/python3.6/dist-packages/tornado/ioloop.py", line 888, in start
    handler_func(fd_obj, events)
  File "/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
    self._handle_recv()
  File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
    self._run_callback(callback, msg)
  File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
    callback(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 235, in dispatch_shell
    handler(stream, idents, msg)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/ipkernel.py", line 196, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/usr/local/lib/python3.6/dist-packages/ipykernel/zmqshell.py", line 533, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2718, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2828, in run_ast_nodes
    if self.run_code(code, result):
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-8-9187be9325b3>", line 201, in <module>
    save_every=100   # how many steps between saving checkpoint
  File "<ipython-input-8-9187be9325b3>", line 101, in finetune_tpu
    save_relative_paths=True)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 832, in __init__
    self.build()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 844, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 881, in _build
    build_save=build_save, build_restore=build_restore)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 513, in _build_internal
    restore_sequentially, reshape)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 332, in _AddRestoreOps
    restore_sequentially)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 580, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1572, in restore_v2
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to get matching files on models/117M/model.ckpt: Unimplemented: File system scheme '[local]' not implemented (file: 'models/117M/model.ckpt')
	 [[node save/RestoreV2 (defined at <ipython-input-8-9187be9325b3>:101) ]]

@AlphaGit
Also, I think the actual GPT-2 model weights are stored in GCS, so shouldn't we be able to read them directly?

Furthermore, you don't actually need to save the weights in Google Cloud Storage. Keras does not do that; instead, you would need to copy the weights over to the CPU and then save them using tf.train.Saver:
https://github.com/tensorflow/tensorflow/blob/234025c31013f5aa38b63fee5cfcd6e8d5c21e17/tensorflow/contrib/tpu/python/tpu/keras_support.py#L2098

@Skylion007 Hi! Thanks for the response.

My problem doesn't seem to be really saving the weights, but rather loading them in the first place. (At least... not so far.)

Yours is an interesting approach. I initially moved away from it because loading the model into memory and then transferring it to the TPU meant moving around 500 MB of data (word embeddings and all). But it might not be that bad; Colaboratory should be prepared to deal with bigger datasets all the time, right?

Regarding tf.train.Saver, I believe it is really tied to a filesystem. At least, that's what prevented me from using it against GCS... but I might have done it wrong. This is what I'm stuck on right now.

Loading them is pretty straightforward. I almost have a working solution using https://github.com/CyberZHG/keras-gpt-2 but I still need to debug some keras vs. tf.keras issues. I did get it working with a fixed input shape, so it is possible to at least load it on the TPU, but I need to fix the input and output layers.

@AlphaGit
Okay, it was way harder than it needed to be, but I got it running on a TPU. It needs a lot of optimization and runs out of memory, but otherwise it should work. Right now all it can do is load the model weights on the TPU and run inference on them: https://colab.research.google.com/drive/17I7VZrcxM-BfadRqAFWb2DzG3RoAtG5n
It's able to load and save model weights to the local disk.