neuro-inc/flow-template

Beginning of job logs is lost after update of project template

Closed this issue · 4 comments

After updating my project's Makefile to cookiecutter-neuro-project v1.3, the beginning of the job logs started disappearing. Here is the command I use to run a training job on the platform:

TRAIN_CMD="bash -c 'cd $(PROJECT_PATH_ENV) && python -u $(CODE_DIR)/train.py -c $(CODE_DIR)/configs/$(CONFIG_NAME)'"


.PHONY: train
train: _check_setup upload-code upload-config   ### Run a training job (set up env var 'RUN' to specify the training job),
	$(NEURO) run \
		--name $(TRAIN_JOB)-$(RUN) \
		--description "$(PROJECT_ID):train" \
		--preset $(PRESET) \
		--detach \
		$(TRAIN_WAIT_START_OPTION) \
		--volume $(DATA_DIR_STORAGE):$(PROJECT_PATH_ENV)/$(DATA_DIR):ro \
		--volume $(PROJECT_PATH_STORAGE)/$(CODE_DIR):$(PROJECT_PATH_ENV)/$(CODE_DIR):ro \
		--volume $(PROJECT_PATH_STORAGE)/$(CONFIG_DIR):$(PROJECT_PATH_ENV)/$(CONFIG_DIR):ro \
		--volume $(RESULTS_DIR_STORAGE):$(PROJECT_PATH_ENV)/$(RESULTS_DIR):rw \
		--env PYTHONPATH=$(PROJECT_PATH_ENV) \
		--env EXPOSE_SSH=yes \
		--env JOB_TIMEOUT=0 \
		${OPTION_GCP_CREDENTIALS} ${OPTION_AWS_CREDENTIALS} ${OPTION_WANDB_CREDENTIALS} \
		$(CUSTOM_ENV_NAME) \
		$(TRAIN_CMD)
ifeq (${TRAIN_STREAM_LOGS}, yes)
	@echo "Streaming logs of the job $(TRAIN_JOB)-$(RUN)"
	$(NEURO) exec --no-key-check -T $(TRAIN_JOB)-$(RUN) "tail -f /output" || echo -e "Stopped streaming logs.\nUse 'neuro logs <job>' to see full logs."
endif

After running this command I typically get the following output:
[Screenshot from 2020-01-15 18-36-20]
where part of my script's console output is missing. The expected script output can be viewed in the attached log file:
15-01-2020_15-34-51.log

Before the update, there was no problem like this.

I know why this happens: it is because of the tqdm support. The process goes as follows:

  • The job starts and immediately detaches; all of its output goes to a dedicated pipe '/output'.
  • neuro exec runs tail -f /output right after the job detaches.

There is a tiny window during which no one reads the logs. I assume your output is quite intensive at the beginning, so the very start of it is lost.
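The race described above can be reproduced with plain POSIX tools: `tail -f` starts from the last 10 lines by default, so any burst of output written before the reader attaches is never shown. A minimal sketch (the temp file stands in for the '/output' pipe):

```shell
# Simulate the race: the "job" writes a burst of output before any reader attaches.
OUT=$(mktemp)
for i in $(seq 1 100); do echo "line $i"; done >> "$OUT"

# A reader attaching late with plain `tail -f` begins at the last 10 lines
# (tail's default), so lines 1-90 are never shown to it.
tail -n 10 "$OUT"

rm -f "$OUT"
```

This is why bumping the line count (or starting from line 1) recovers the lost beginning.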

May I ask you to run an experiment: update the line

	$(NEURO) exec --no-key-check -T $(TRAIN_JOB)-$(RUN) "tail -f /output" || echo -e "Stopped streaming logs.\nUse 'neuro logs <job>' to see full logs."

to

	$(NEURO) exec --no-key-check -T $(TRAIN_JOB)-$(RUN) "tail -f -n 1000000 /output" || echo -e "Stopped streaming logs.\nUse 'neuro logs <job>' to see full logs."

and run your experiments again?
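As a side note (not part of the original suggestion), `tail` also accepts the `-n +K` form, which starts output at line K counted from the beginning rather than the end; `tail -f -n +1 /output` would stream from the very first line without having to guess a large enough count. A minimal sketch:

```shell
# `-n +1` means "start at line 1", so nothing is skipped no matter how
# much output accumulated before the reader attached.
OUT=$(mktemp)
printf 'line %d\n' 1 2 3 > "$OUT"
tail -n +1 "$OUT"    # prints all three lines
rm -f "$OUT"
```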

I changed the command as you suggested, and it helped. The console output now looks as follows:
[Screenshot from 2020-01-20 21-35-56]

Thank you very much!

So, I can see my logs now, but there is no sign of error output when an error occurs at the very beginning of script execution (e.g. an incorrect parser key, an import error, etc.).
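One possible explanation (an assumption, not confirmed in the thread) is that early stderr output is not redirected into the same stream as stdout, so errors printed before the reader attaches bypass the log file entirely. Merging the streams with `2>&1` would capture both; a hypothetical sketch with made-up messages:

```shell
OUT=$(mktemp)
# Without `2>&1`, the stderr line would bypass the log file entirely;
# with it, both streams land in the same file, in order.
{ echo "training started"; echo "error: unknown flag" >&2; } > "$OUT" 2>&1
cat "$OUT"   # both lines are now in the log
rm -f "$OUT"
```

Note the redirection order matters: `> "$OUT" 2>&1` sends stdout to the file first, then points stderr at the same destination.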

Outdated: we no longer write job logs to the /output file.