Huggingface Trainer closes run automatically after training
Is your feature request related to a problem? Please describe.
When I use a Huggingface Trainer with a NeptuneCallback, it seems that the Trainer closes the run automatically & thus disconnects it from the python logger.
If I want to log anything to Neptune after training, I have to reinitialize the run, which makes the code complex in bigger training pipelines.
Describe the solution you'd like
Would be great if the run persists.
Describe alternatives you've considered
My workaround looks like this:
main.py:
import logging
import os  # needed for os.environ below
import neptune
from dotenv import find_dotenv, load_dotenv
from neptune.integrations.python_logger import NeptuneHandler
from training_function import training_function

def setup_main_logger(run, run_id):
    logger = logging.getLogger()  # Get the root logger
    logger.setLevel(logging.INFO)
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    run, neptune_handler = get_neptune_handler(run, run_id, formatter)
    logger.addHandler(neptune_handler)
    return run, logging.getLogger(__name__)

def get_neptune_handler(run, run_id, formatter):
    try:
        run.stop()
    finally:
        run = neptune.init_run(with_id=run_id, capture_stderr=True, capture_stdout=True)
        neptune_handler = NeptuneHandler(run=run)
        neptune_handler.setFormatter(formatter)
    return run, neptune_handler

if __name__ == "__main__":
    # load ENV variables
    load_dotenv(find_dotenv(), override=True)
    NEPTUNE_API_TOKEN = os.environ.get("NEPTUNE_API_TOKEN")
    NEPTUNE_PROJECT = os.environ.get("NEPTUNE_PROJECT")
    # Initialize Neptune run
    run = neptune.init_run(capture_stderr=True, capture_stdout=True)
    run_id = run["sys/id"].fetch()
    # Set up logging
    run, logger = setup_main_logger(run, run_id)
    ...
    logger.info("This logs perfectly to Neptune! ")
    training_function(..., run)
    logger.info("THIS NEVER GETS LOGGED TO NEPTUNE!")
    # Re-create the run and re-attach the handler after training
    run, logger = setup_main_logger(run, run_id)
    logger.info("This logs perfectly to Neptune! ")

training_function.py:
from transformers.integrations import NeptuneCallback
from transformers import Trainer
import logging

logger = logging.getLogger()  # root logger

def training_function(..., run) -> None:
    ...
    # Create neptune callback for training logs
    neptune_callback = NeptuneCallback(
        run=run,
        log_parameters=True,
        log_checkpoints="all",
    )
    logger.info("This logs perfectly to Neptune! ")
    # Initialize the trainer using our model, training args & dataset, and train
    trainer = Trainer(
        model=model,
        args=args,
        ...
        callbacks=[neptune_callback],
    )
    logger.info("This logs perfectly to Neptune! ")
    trainer.train()
    logger.info("THIS NEVER GETS LOGGED TO NEPTUNE!")

Hey @Ulipenitz,
Neptune does indeed automatically stop the run once the training loop is done. However, we do provide multiple options to log additional metadata to the run once training is over.
Here is our Transformers integration guide that lists these options: https://docs.neptune.ai/integrations/transformers/#logging-additional-metadata-after-training
Please let me know if any of these work for you.
Thanks for the answer @SiddhantSadangi!
This is indeed useful to log metadata like test metrics after training.
My problem though is that I need to set up the python logger again after the training function.
I am training on a remote machine in the cloud, and unfortunately capture_stderr=True, capture_stdout=True only captures Neptune-specific logs, but I want to have all logs in Neptune, including those from the Python logger.
My proposed workaround of calling setup_main_logger again works, but I don't think it is a nice solution.
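One way to make the re-attachment less intrusive might be to wrap the training call in a context manager that adds a fresh handler once the block exits. The following is only a sketch under assumptions: reattach_on_exit and ListHandler are hypothetical names, not part of Neptune or Transformers, and ListHandler stands in for NeptuneHandler so that the example runs with the standard library alone.

```python
import logging
from contextlib import contextmanager
from typing import Callable, Iterator, List


class ListHandler(logging.Handler):
    """Hypothetical stand-in for NeptuneHandler: stores messages in a list."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


@contextmanager
def reattach_on_exit(
    logger: logging.Logger,
    handler_factory: Callable[[], logging.Handler],
) -> Iterator[List[logging.Handler]]:
    """Run the wrapped block, then attach a newly built handler to `logger`.

    The yielded list is empty inside the block and contains the new
    handler once the block has finished (even if it raised).
    """
    attached: List[logging.Handler] = []
    try:
        yield attached
    finally:
        handler = handler_factory()
        logger.addHandler(handler)
        attached.append(handler)


# Demo on an isolated logger so the root logger is left untouched.
demo_logger = logging.getLogger("reattach_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False

with reattach_on_exit(demo_logger, ListHandler) as attached:
    # Stands in for the training call; no handler is attached yet here.
    demo_logger.info("during training")

demo_logger.info("after training")
new_handler = attached[0]
```

In the pipeline above, the factory would presumably be something like lambda: NeptuneHandler(run=neptune.init_run(with_id=run_id)), which reuses the calls already shown in main.py.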
Ah, understood!
Yes, this is definitely inconvenient.
I think your workaround handles this pretty well in the absence of official support for this use case. I'd just suggest using neptune_callback's get_run() method to access the run used by the Transformers callback. This removes the need to store the run_id and reinitialize the run.
trainer = Trainer(
    ...
    callbacks=[neptune_callback],
)
logger.info("This will be logged to Neptune")
trainer.train()
logger.info("This won't be logged to Neptune")
# Fetch the run from the callback and attach a new handler to it
run = neptune_callback.get_run(trainer)
neptune_handler = NeptuneHandler(run=run)
logger.addHandler(neptune_handler)
logger.info("This will be logged to Neptune")
Please let me know if this workaround works better for you.
I will also pass this feedback to the product team.
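One detail to keep in mind with the snippet above: logger.addHandler adds a second NeptuneHandler, while the first one, bound to the stopped run, stays on the logger. A small helper can drop handlers of a given type before adding the new one. This is only a sketch, not part of Neptune's API: swap_handler and RecordingHandler are hypothetical names, and RecordingHandler stands in for NeptuneHandler so the example runs with the standard library alone.

```python
import logging
from typing import List, Type


class RecordingHandler(logging.Handler):
    """Hypothetical stand-in for NeptuneHandler: stores messages in a list."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def swap_handler(
    logger: logging.Logger,
    handler_type: Type[logging.Handler],
    new_handler: logging.Handler,
) -> int:
    """Remove every handler of `handler_type` from `logger`, then add `new_handler`.

    Returns the number of handlers that were removed.
    """
    stale = [h for h in logger.handlers if isinstance(h, handler_type)]
    for handler in stale:
        logger.removeHandler(handler)
        handler.close()
    logger.addHandler(new_handler)
    return len(stale)


# Demo on an isolated logger so the root logger is left untouched.
swap_logger = logging.getLogger("swap_demo")
swap_logger.setLevel(logging.INFO)
swap_logger.propagate = False

old_handler = RecordingHandler()
swap_logger.addHandler(old_handler)
swap_logger.info("before swap")

fresh_handler = RecordingHandler()
removed = swap_handler(swap_logger, RecordingHandler, fresh_handler)
swap_logger.info("after swap")
```

With the snippet above, the call would be swap_handler(logger, NeptuneHandler, NeptuneHandler(run=run)) in place of the plain logger.addHandler(neptune_handler).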