This repository contains useful scripts for adding common services to non-persistent Colaboratory VM sessions.

See: https://colab.research.google.com/notebook
## Create a public `tensorboard` URL using secure introspective tunnels via `ngrok`

When training on Colaboratory VMs it is often useful to monitor the session via `tensorboard`. This script launches `tensorboard` on the Colaboratory VM and uses `ngrok` to create a secure introspective tunnel, so that `tensorboard` can be accessed via a public URL.
**A simple working script**

```python
import os
import colab_utils.tboard

# set paths
ROOT = %pwd
LOG_DIR = os.path.join(ROOT, 'log')

# will install `ngrok`, if necessary
# will create `log_dir` if path does not exist
colab_utils.tboard.launch_tensorboard( bin_dir=ROOT, log_dir=LOG_DIR )
```
Launch `tensorboard` on the Colaboratory VM and open a tunnel for access by public URL; automatically installs `ngrok`, if necessary:

```python
tboard.launch_tensorboard( bin_dir=ROOT, log_dir=LOG_DIR )
```

Install the `ngrok` package, if necessary:

```python
tboard.install_ngrok( bin_dir=ROOT, log_dir=LOG_DIR )
```
## Access Google Cloud from a Colaboratory VM and save/restore checkpoints to cloud storage

Note: these methods currently use IPython magic commands and therefore cannot be loaded from a module at this time. For now, you can copy/paste the entire script into a Colaboratory cell to run it.

Long-running training sessions on Colaboratory VMs are at risk of reset after 90 minutes of inactivity, or shutdown after 12 hours of training. This script allows you to save/restore checkpoints to Google Cloud Storage so you do not lose your results. You can also mount a GCS bucket on the local filesystem using the `gcsfuse` package to sync checkpoints to the cloud automatically.
**A simple working script**

```python
import os
import colab_utils.gcloud

# authorize access to Google Cloud SDK from `colaboratory` VM
project_name = "my-project-123"
colab_utils.gcloud.gcloud_auth(project_name)
# colab_utils.gcloud.config_project(project_name)

# set paths
ROOT = %pwd
LOG_DIR = os.path.join(ROOT, 'log')
TRAIN_LOG = os.path.join(LOG_DIR, 'training-run-1')

# save latest checkpoint as a zipfile to a GCS bucket `gs://my-checkpoints/`
# zipfile name = "{}.{}.zip".format(os.path.basename(TRAIN_LOG), global_step)
# e.g. "gs://my-checkpoints/training-run-1.1000.zip"
bucket_name = "my-checkpoints"
colab_utils.gcloud.save_to_bucket(TRAIN_LOG, bucket_name, project_name, save_events=True, force=False)

# restore a zipfile from a GCS bucket to a local directory, usually the
# tensorboard `log_dir`
CHECKPOINTS = os.path.join(LOG_DIR, 'training-run-2')
zipfile = os.path.basename(TRAIN_LOG)  # training-run-1
colab_utils.gcloud.load_from_bucket("training-run-1.1000.zip", bucket_name, CHECKPOINTS )

# mount a GCS bucket on the local fs using the `gcsfuse` package; installs automatically
bucket = "my-bucket"
local_path = colab_utils.gcloud.gcsfuse(bucket=bucket)
# gcsfuse(): Using mount point: /tmp/gcs-bucket/my-bucket
!ls -l {local_path}
!umount {local_path}
```
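The zipfile naming scheme described in the comments above can be sketched as a small helper. This is illustrative only; `checkpoint_zip_name` is a hypothetical name, not part of `colab_utils`:

```python
import os

def checkpoint_zip_name(train_log, global_step):
    # "{basename}.{global_step}.zip", per the naming scheme above
    return "{}.{}.zip".format(os.path.basename(train_log), global_step)

print(checkpoint_zip_name("/tensorflow/log/training-run-1", 1000))
# training-run-1.1000.zip
```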
Authorize access to the Google Cloud SDK from the Colaboratory VM and set the default project:

```python
colab_utils.gcloud.gcloud_auth(project_name)
```
**Save and restore checkpoints and events to a zipfile in a GCS bucket**

Zip the latest checkpoint files from `train_dir` and save to a GCS bucket:

```python
colab_utils.gcloud.save_to_bucket(train_dir, bucket,
                                  step=None,
                                  save_events=False,
                                  force=False)
```

Download and unzip checkpoint files from a GCS bucket and save to `train_dir`:

```python
colab_utils.gcloud.load_from_bucket(zip_filename, bucket, train_dir)
```
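To illustrate the local half of the restore step, here is a rough sketch of unzipping a downloaded archive into `train_dir`, assuming the zipfile has already been fetched from the bucket. The helper name `unzip_to_train_dir` is hypothetical, not part of `colab_utils`:

```python
import os
import zipfile

def unzip_to_train_dir(zip_path, train_dir):
    # extract a downloaded checkpoint archive into `train_dir`,
    # creating the directory if it does not exist
    os.makedirs(train_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(train_dir)
    return train_dir
```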
**`SaverWithCallback`** adds a callback to the `tf.train.Saver.save()` method. This can be used to archive checkpoint and `tensorboard` event files to a GCS bucket.
```python
import os, re
import colab_utils.gcloud

# define callback
def save_checkpoint_to_bucket(sess, save_path, **kwargs):
    # be sure to call `colab_utils.gcloud.gcloud_auth(project_id)` beforehand
    bucket = "my-bucket"
    project_name = "my-project-123"
    # e.g. model_checkpoint_path = /tensorflow/log/run1/model.ckpt-14
    train_log, checkpoint = os.path.split(kwargs['checkpoint_path'])
    bucket_path = colab_utils.gcloud.save_to_bucket(train_log, bucket, project_name,
                                                    step=kwargs['checkpoint_step'],
                                                    save_events=True)
    return bucket_path

ckpt_interval = 3600  # save a checkpoint (and archive it to the bucket) every hour

tf.reset_default_graph()
with tf.Graph().as_default():
    # ...
    # create subclassed `tf.train.Saver()`
    checkpoint_saver = colab_utils.gcloud.SaverWithCallback(save_checkpoint_to_bucket)
    loss = slim.learning.train(train_op, train_log,
                               save_interval_secs=ckpt_interval,
                               saver=checkpoint_saver,
                               )
```
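The subclass itself lives in `colab_utils.gcloud`. Assuming it simply wraps `tf.train.Saver.save()` and invokes the callback after each successful save, the pattern can be sketched framework-free; the `Saver` class below is a toy stand-in, not TensorFlow:

```python
class Saver:
    """Toy stand-in for `tf.train.Saver`: save() returns the checkpoint path."""
    def save(self, sess, save_path, global_step=None):
        return "{}-{}".format(save_path, global_step)

class SaverWithCallback(Saver):
    """Sketch of the assumed behavior: invoke
    `callback(sess, save_path, checkpoint_path=..., checkpoint_step=...)`
    after each successful save."""
    def __init__(self, callback):
        super().__init__()
        self._callback = callback

    def save(self, sess, save_path, global_step=None, **kwargs):
        checkpoint_path = super().save(sess, save_path,
                                       global_step=global_step, **kwargs)
        if checkpoint_path and self._callback:
            self._callback(sess, save_path,
                           checkpoint_path=checkpoint_path,
                           checkpoint_step=global_step)
        return checkpoint_path
```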
**`GcsArchiveHook`** is an implementation of `tf.train.SessionRunHook` that archives checkpoint and event files as a `tar.gz` archive to a Google Cloud Storage bucket. It works together with `model_fn()` and the `tf.Estimator` API.
```python
def model_fn(features, labels, mode, params):
    # params["start"] = time.time()
    # params["log_dir"] = TRAIN_LOG
    [...]
    loss = [...]
    if mode == tf.estimator.ModeKeys.TRAIN:
        train_op = [...]
        #
        # add training_hooks
        #
        bucket = "my-bucket"
        project_name = "my-project-123"
        archiveHook = GcsArchiveHook(every_n_secs=3600,
                                     start=params["start"],
                                     log=params["log_dir"],
                                     bucket=bucket,
                                     project=project_name)
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss,
                                          train_op=train_op,
                                          training_hooks=[archiveHook],
                                          )
```
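The `every_n_secs=3600` throttling that such a hook relies on can be sketched on its own. This is illustrative only; `GcsArchiveHook`'s actual implementation may differ:

```python
import time

class EveryNSecsTrigger:
    """Fires at most once every `every_n_secs` seconds -- the kind of
    throttle a periodic archiving hook would apply on each run step."""
    def __init__(self, every_n_secs, now=time.monotonic):
        self._every_n_secs = every_n_secs
        self._now = now       # injectable clock, eases testing
        self._last = None

    def should_trigger(self):
        t = self._now()
        if self._last is None or t - self._last >= self._every_n_secs:
            self._last = t
            return True
        return False
```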
**Use `gcsfuse` to sync to GCS automatically**

Note: while the latest checkpoints can be restored, `tensorboard` event files are sometimes lost (size 0) if the VM resets upon hitting the 12-hour limit. It is generally better to use `SaverWithCallback()` to archive checkpoint and event files to a GCS bucket before the VM resets.

```python
local_path = gcsfuse(bucket=None, gcs_class="regional", gcs_location="asia-east1", project_id=None)
```