neuro-inc/flow-template

Beginning of job logs is lost after update of project template

Closed this issue · 4 comments

After updating my project's Makefile to cookiecutter-neuro-project v1.3, the beginning of the job logs started disappearing. Here is the command I use to run a training job on the platform:

TRAIN_CMD="bash -c 'cd $(PROJECT_PATH_ENV) && python -u $(CODE_DIR)/train.py -c $(CODE_DIR)/configs/$(CONFIG_NAME)'"


.PHONY: train
train: _check_setup upload-code upload-config   ### Run a training job (set up env var 'RUN' to specify the training job),
	$(NEURO) run \
		--name $(TRAIN_JOB)-$(RUN) \
		--description "$(PROJECT_ID):train" \
		--preset $(PRESET) \
		--detach \
		$(TRAIN_WAIT_START_OPTION) \
		--volume $(DATA_DIR_STORAGE):$(PROJECT_PATH_ENV)/$(DATA_DIR):ro \
		--volume $(PROJECT_PATH_STORAGE)/$(CODE_DIR):$(PROJECT_PATH_ENV)/$(CODE_DIR):ro \
		--volume $(PROJECT_PATH_STORAGE)/$(CONFIG_DIR):$(PROJECT_PATH_ENV)/$(CONFIG_DIR):ro \
		--volume $(RESULTS_DIR_STORAGE):$(PROJECT_PATH_ENV)/$(RESULTS_DIR):rw \
		--env PYTHONPATH=$(PROJECT_PATH_ENV) \
		--env EXPOSE_SSH=yes \
		--env JOB_TIMEOUT=0 \
		${OPTION_GCP_CREDENTIALS} ${OPTION_AWS_CREDENTIALS} ${OPTION_WANDB_CREDENTIALS} \
		$(CUSTOM_ENV_NAME) \
		$(TRAIN_CMD)
ifeq (${TRAIN_STREAM_LOGS}, yes)
	@echo "Streaming logs of the job $(TRAIN_JOB)-$(RUN)"
	$(NEURO) exec --no-key-check -T $(TRAIN_JOB)-$(RUN) "tail -f /output" || echo -e "Stopped streaming logs.\nUse 'neuro logs <job>' to see full logs."
endif

After running this command I typically get the following output:
[Screenshot from 2020-01-15 18-36-20]
where part of my script's console output is missing. The expected script output can be viewed in the attached log file:
15-01-2020_15-34-51.log

Before the update, there was no problem like this.

I know why this happens: it is because of the tqdm support. The process goes as follows:

  • The job starts and immediately detaches; all of its output goes to a dedicated pipe '/output'.
  • neuro exec runs tail -f /output right after the job detaches.

There is a tiny window during which no one reads the logs. I assume your output is quite intensive at the beginning, so the very start of it is lost.
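The race described above can be reproduced with plain POSIX tools: `tail -f` starts from the last 10 lines by default, so any burst of output written before the reader attaches is never shown. A minimal sketch (the temp file stands in for the '/output' pipe):

```shell
# Simulate the race: the "job" writes a burst of output before any reader attaches.
OUT=$(mktemp)
for i in $(seq 1 100); do echo "line $i"; done >> "$OUT"

# A reader attaching late with plain `tail -f` begins at the last 10 lines
# (tail's default), so lines 1-90 are never shown to it.
tail -n 10 "$OUT"

rm -f "$OUT"
```

This is why bumping the line count (or starting from line 1) recovers the lost beginning.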

May I ask you to run an experiment: update the line

	$(NEURO) exec --no-key-check -T $(TRAIN_JOB)-$(RUN) "tail -f /output" || echo -e "Stopped streaming logs.\nUse 'neuro logs <job>' to see full logs."

to

	$(NEURO) exec --no-key-check -T $(TRAIN_JOB)-$(RUN) "tail -f -n 1000000 /output" || echo -e "Stopped streaming logs.\nUse 'neuro logs <job>' to see full logs."

and run your experiments again?
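As a side note (not part of the original suggestion), `tail` also accepts the `-n +K` form, which starts output at line K counted from the beginning rather than the end; `tail -f -n +1 /output` would stream from the very first line without having to guess a large enough count. A minimal sketch:

```shell
# `-n +1` means "start at line 1", so nothing is skipped no matter how
# much output accumulated before the reader attached.
OUT=$(mktemp)
printf 'line %d\n' 1 2 3 > "$OUT"
tail -n +1 "$OUT"    # prints all three lines
rm -f "$OUT"
```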

I changed the command as you suggested, and it helped. The console output now looks as follows:
[Screenshot from 2020-01-20 21-35-56]

Thank you very much!

So, I can see my logs now, but there is no sign of error output when an error occurs at the very beginning of script execution (e.g. an incorrect parser key, an import error, etc.).
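One possible explanation (an assumption, not confirmed in the thread) is that early stderr output is not redirected into the same stream as stdout, so errors printed before the reader attaches bypass the log file entirely. Merging the streams with `2>&1` would capture both; a hypothetical sketch with made-up messages:

```shell
OUT=$(mktemp)
# Without `2>&1`, the stderr line would bypass the log file entirely;
# with it, both streams land in the same file, in order.
{ echo "training started"; echo "error: unknown flag" >&2; } > "$OUT" 2>&1
cat "$OUT"   # both lines are now in the log
rm -f "$OUT"
```

Note the redirection order matters: `> "$OUT" 2>&1` sends stdout to the file first, then points stderr at the same destination.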

Outdated: we no longer write job logs to the /output file.