Writing a large number of metrics to Google Cloud Storage via Cloud Storage FUSE is very slow
This can be trivially reproduced given that you have access to a Google Cloud project in which you have permissions to run Vertex AI custom training jobs and create Google Cloud Storage (GCS) buckets.
- In Vertex AI, submit some fake custom training job (using e.g. the `python:3.10` Docker image and `command: ["sleep", "3600"]`) with the interactive shell enabled (if you're submitting this job via the Python SDK, simply pass `enable_web_access=True` to `CustomJob`'s `.submit()` method - a minimal submission sketch is included after this list).
- Once the job reaches the "Training" state, open two terminals by clicking "Launch web terminal" in its GUI - we will use them for Python and Bash, respectively.
- In GCS, create some bucket of your choice and make sure that you can write to it.
- In the first terminal, run `pip install tensorboard` and then launch the Python interpreter.
- Copy & paste this code into the Python shell, replacing `<NAME_OF_YOUR_BUCKET>` with the name of the bucket that you created in step (3):
```python
from time import time

from tensorboard.compat.proto import event_pb2
from tensorboard.compat.proto.summary_pb2 import Summary
from tensorboard.summary.writer.event_file_writer import EventFileWriter

path = "/gcs/<NAME_OF_YOUR_BUCKET>/"

def get_fake_metric(i):
    summary = Summary(value=[Summary.Value(tag=f"val_metric{i}", simple_value=0.123)])
    event = event_pb2.Event(summary=summary)
    event.wall_time = time()
    event.step = 1
    return event
```
- In the second terminal (Bash), `cd` to `/gcs/<NAME_OF_YOUR_BUCKET>/` and make sure that there are no existing event files there.
- We will start with the default writer - paste this into the Python shell:
```python
writer = EventFileWriter(path)
```
- Switch to Bash and run the following (we will use it to monitor the size of the event files - at the beginning you should see one such file of 88 bytes):

```bash
while true; do date '+%H:%M:%S'; ls -l | grep event; sleep 1; done
```
- Switch to Python and run:

```python
for i in range(200): writer.add_event(get_fake_metric(i))
```
- Switch to Bash and observe how your event file grows - it should reach slightly above 10 KB. Once it stops growing, terminate the monitoring loop and note how long it took (for me, it was `2m23s`). Delete this file afterwards.
- Now, we will repeat the experiment with a slightly modified writer. Switch to Python and paste the following:
```python
writer = EventFileWriter(path)
assert writer._general_file_writer.fs_supports_append
writer._general_file_writer.fs_supports_append = False
```
- Switch to Bash and run the same monitoring loop as in step (8).
- Switch to Python and run the same loop as in step (9), followed by `writer.flush()` - otherwise you'd have to wait an extra 2 minutes.
- Switch to Bash and notice that your event file reaches its expected size almost instantly.
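For completeness, here is a minimal sketch of submitting such a throwaway job via the Python SDK (step 1). The project, region, machine type, and display name below are placeholders I made up, not part of the report; the only detail that matters here is `enable_web_access=True` on `.submit()`:

```python
from google.cloud import aiplatform

# Placeholders - substitute your own project, region, and staging bucket.
aiplatform.init(
    project="<YOUR_PROJECT>",
    location="us-central1",
    staging_bucket="gs://<NAME_OF_YOUR_BUCKET>",
)

job = aiplatform.CustomJob(
    display_name="gcs-fuse-append-repro",
    worker_pool_specs=[{
        "machine_spec": {"machine_type": "n1-standard-4"},
        "replica_count": 1,
        "container_spec": {"image_uri": "python:3.10", "command": ["sleep", "3600"]},
    }],
)
# enable_web_access=True is what makes "Launch web terminal" available in the GUI.
job.submit(enable_web_access=True)
```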
Q: What does this experiment show?
A: It shows that `EventFileWriter` is not aware of the Cloud Storage FUSE filesystem, which leads to suboptimal performance (and to make it clear: this monkey-patching of `fs_supports_append` doesn't mean that this filesystem doesn't support append - it's just a hack meant to fool `EventFileWriter`).
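For anyone hitting the same problem, here is a minimal sketch of the workaround wrapped in a helper. Two assumptions: the attributes `_general_file_writer` and `fs_supports_append` are the internal TensorBoard details exercised in the experiment above (not a supported API), and treating any path under `/gcs/` as a Cloud Storage FUSE mount is a Vertex AI convention, not something TensorBoard checks itself:

```python
from tensorboard.summary.writer.event_file_writer import EventFileWriter


def make_fuse_friendly_writer(logdir, **kwargs):
    """Create an EventFileWriter that avoids the slow append path on GCS FUSE mounts."""
    writer = EventFileWriter(logdir, **kwargs)
    # Assumption: a /gcs/ prefix means a Cloud Storage FUSE mount (Vertex AI convention).
    if logdir.startswith("/gcs/"):
        # Same hack as in the repro above: pretending the filesystem cannot append
        # sidesteps the slow per-record append path through the FUSE mount.
        writer._general_file_writer.fs_supports_append = False
    return writer
```

With this in place, the 200-event loop from the repro (plus an explicit `flush()`) completes almost instantly instead of taking minutes.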
Q: But writing events to disk is asynchronous, so even if it is slow, why should we care?
A: Because some training frameworks (e.g. PyTorch Lightning) explicitly flush metrics at the end of the validation loop (which is quite a reasonable thing to do, I'd say), and with hundreds of metrics (600+ in my case), training is stuck for a non-negligible amount of time after each validation (7-8 minutes in my case) - and the more frequent the validations are, the more this adds up in terms of total training time and GPU utilization. And besides, good engineers simply can't stand seeing that writing such a relatively small amount of data takes so much time ;)
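If you are logging through `torch.utils.tensorboard` (which is what PyTorch Lightning's `TensorBoardLogger` uses underneath, as far as I can tell), the same hack can be applied to the writer wrapped by a `SummaryWriter`. The attribute chain `_get_file_writer().event_writer` is an internal detail of torch and may differ across versions, so treat this as a sketch rather than a supported API:

```python
from torch.utils.tensorboard import SummaryWriter

log_dir = "/gcs/<NAME_OF_YOUR_BUCKET>/logs"  # FUSE-mounted GCS path

writer = SummaryWriter(log_dir)

# Internal torch detail (may change between versions): SummaryWriter wraps
# TensorBoard's EventFileWriter; apply the same fs_supports_append hack to it.
event_writer = writer._get_file_writer().event_writer
event_writer._general_file_writer.fs_supports_append = False

# Hypothetical logging loop mimicking the repro: many scalar tags at one step.
for i in range(200):
    writer.add_scalar(f"val_metric{i}", 0.123, global_step=1)
writer.flush()
```

For Lightning specifically, `logger.experiment` exposes such a `SummaryWriter`, so the same two patching lines could be applied there before training starts.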