Writing a large number of metrics to Google Cloud Storage via Cloud Storage FUSE is very slow
This can be trivially reproduced given that you have access to a Google Cloud project in which you have permissions to run Vertex AI custom training jobs and create Google Cloud Storage (GCS) buckets.
- In Vertex AI, submit some fake custom training job (using e.g. the `python:3.10` Docker image and `command: ["sleep", "3600"]`) with the interactive shell enabled (if you're submitting this job via the Python SDK, simply pass `enable_web_access=True` to `CustomJob`'s `.submit()` method - a minimal submission sketch is included after this list).
- Once the job reaches the "Training" state, open two terminals by clicking "Launch web terminal" in its GUI - we will use them for Python and Bash, respectively.
- In GCS, create some bucket of your choice and make sure that you can write to it.
- In the first terminal, run `pip install tensorboard` and then launch the Python interpreter.
- Copy & paste this code into the Python shell, replacing `<NAME_OF_YOUR_BUCKET>` with the name of the bucket that you created in step (3):
```python
from time import time

from tensorboard.compat.proto import event_pb2
from tensorboard.compat.proto.summary_pb2 import Summary
from tensorboard.summary.writer.event_file_writer import EventFileWriter

path = "/gcs/<NAME_OF_YOUR_BUCKET>/"

def get_fake_metric(i):
    summary = Summary(value=[Summary.Value(tag=f"val_metric{i}", simple_value=0.123)])
    event = event_pb2.Event(summary=summary)
    event.wall_time = time()
    event.step = 1
    return event
```
- In the second terminal (Bash), `cd` to `/gcs/<NAME_OF_YOUR_BUCKET>/` and make sure that there are no existing event files there.
- We will start with the default writer - paste this into the Python shell:
```python
writer = EventFileWriter(path)
```
- Switch to Bash and run the following (we will use it to monitor the size of the event files - at the beginning you should see one such file of 88 bytes):

```bash
while true; do date '+%H:%M:%S'; ls -l | grep event; sleep 1; done
```
- Switch to Python and run:

```python
for i in range(200): writer.add_event(get_fake_metric(i))
```
- Switch to Bash and observe how your event file grows - it should reach slightly above 10 KB. Once it stops growing, terminate the monitoring loop and note how long it took (for me, it was `2m23s`). Delete this file afterwards.
- Now, we will repeat the experiment with a slightly modified writer. Switch to Python and paste the following:
```python
writer = EventFileWriter(path)
assert writer._general_file_writer.fs_supports_append
writer._general_file_writer.fs_supports_append = False
```
- Switch to Bash and run the same monitoring loop as in step (8).
- Switch to Python and run the same loop as in step (9), followed by `writer.flush()` - otherwise you'd have to wait an extra 2 minutes.
- Switch to Bash and notice that your event file reaches its expected size almost instantly.
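For completeness, here is a minimal sketch of submitting such a throwaway job via the Python SDK (step 1). The project, region, machine type, and display name below are placeholders I made up, not part of the report; the only detail that matters here is `enable_web_access=True` on `.submit()`:

```python
from google.cloud import aiplatform

# Placeholders - substitute your own project, region, and staging bucket.
aiplatform.init(
    project="<YOUR_PROJECT>",
    location="us-central1",
    staging_bucket="gs://<NAME_OF_YOUR_BUCKET>",
)

job = aiplatform.CustomJob(
    display_name="gcs-fuse-append-repro",
    worker_pool_specs=[{
        "machine_spec": {"machine_type": "n1-standard-4"},
        "replica_count": 1,
        "container_spec": {"image_uri": "python:3.10", "command": ["sleep", "3600"]},
    }],
)
# enable_web_access=True is what makes "Launch web terminal" available in the GUI.
job.submit(enable_web_access=True)
```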
Q: What does this experiment show?
A: It shows that `EventFileWriter` is not aware of the Cloud Storage FUSE filesystem, which leads to suboptimal performance (and to make it clear: this monkey-patching of `fs_supports_append` doesn't mean that this filesystem doesn't support append - it's just a hack meant to fool `EventFileWriter`).
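For anyone hitting the same problem, here is a minimal sketch of the workaround wrapped in a helper. Two assumptions: the attributes `_general_file_writer` and `fs_supports_append` are the internal TensorBoard details exercised in the experiment above (not a supported API), and treating any path under `/gcs/` as a Cloud Storage FUSE mount is a Vertex AI convention, not something TensorBoard checks itself:

```python
from tensorboard.summary.writer.event_file_writer import EventFileWriter


def make_fuse_friendly_writer(logdir, **kwargs):
    """Create an EventFileWriter that avoids the slow append path on GCS FUSE mounts."""
    writer = EventFileWriter(logdir, **kwargs)
    # Assumption: a /gcs/ prefix means a Cloud Storage FUSE mount (Vertex AI convention).
    if logdir.startswith("/gcs/"):
        # Same hack as in the repro above: pretending the filesystem cannot append
        # sidesteps the slow per-record append path through the FUSE mount.
        writer._general_file_writer.fs_supports_append = False
    return writer
```

With this in place, the 200-event loop from the repro (plus an explicit `flush()`) completes almost instantly instead of taking minutes.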
Q: But writing events to disk is asynchronous, so even if it is slow, why should we care?
A: Because some training frameworks (e.g. PyTorch Lightning) explicitly flush metrics at the end of the validation loop (which is quite a reasonable thing to do, I'd say), and with hundreds of metrics (600+ in my case), training is stuck for a non-negligible amount of time after each validation (7-8 minutes in my case) - and the more frequent the validations are, the more this adds up in terms of total training time and GPU utilization. And besides, good engineers simply can't stand seeing that writing such a relatively small amount of data takes so much time ;)
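If you are logging through `torch.utils.tensorboard` (which is what PyTorch Lightning's `TensorBoardLogger` uses underneath, as far as I can tell), the same hack can be applied to the writer wrapped by a `SummaryWriter`. The attribute chain `_get_file_writer().event_writer` is an internal detail of torch and may differ across versions, so treat this as a sketch rather than a supported API:

```python
from torch.utils.tensorboard import SummaryWriter

log_dir = "/gcs/<NAME_OF_YOUR_BUCKET>/logs"  # FUSE-mounted GCS path

writer = SummaryWriter(log_dir)

# Internal torch detail (may change between versions): SummaryWriter wraps
# TensorBoard's EventFileWriter; apply the same fs_supports_append hack to it.
event_writer = writer._get_file_writer().event_writer
event_writer._general_file_writer.fs_supports_append = False

# Hypothetical logging loop mimicking the repro: many scalar tags at one step.
for i in range(200):
    writer.add_scalar(f"val_metric{i}", 0.123, global_step=1)
writer.flush()
```

For Lightning specifically, `logger.experiment` exposes such a `SummaryWriter`, so the same two patching lines could be applied there before training starts.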