neptune-ai/neptune-client

BUG: TypeError: neptune.metadata_containers.run.Run() got multiple values for keyword argument 'with_id'

Closed this issue · 6 comments

Describe the bug

When I pass with_id to NeptuneCallback (which I have to, because the Trainer in the setfit library calls on_train_begin twice and would otherwise initialize a new run), I get the error from the title when accessing neptune_callback.run after training, because the Trainer (like the Hugging Face Trainer) stops the run held by the callback.

I think this happens because the run property in NeptuneCallback calls _initialize_run with the argument with_id, while with_id is already a key in _init_run_kwargs, since I passed it to NeptuneCallback.
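The suspected mechanism can be reproduced without Neptune installed. This is a minimal sketch, not the actual transformers or neptune code: init_run here is a local stand-in, and CallbackSketch, reinitialize, and initialize_run_merged are hypothetical names that only mirror the pattern described above (stored kwargs and an extra with_id both unpacked into one call).

```python
def init_run(with_id=None, **kwargs):
    """Local stand-in for a run constructor; NOT the real neptune.init_run."""
    return {"with_id": with_id, **kwargs}


class CallbackSketch:
    """Hypothetical callback that stores user kwargs and re-creates its run later."""

    def __init__(self, **init_run_kwargs):
        self._init_run_kwargs = init_run_kwargs
        self._run_id = "RUN-1"  # assumed ID of the run that was stopped

    def _initialize_run(self, **additional_kwargs):
        # Unpacking two dicts that share a key into one call raises TypeError.
        return init_run(**self._init_run_kwargs, **additional_kwargs)

    def reinitialize(self):
        return self._initialize_run(with_id=self._run_id)

    def initialize_run_merged(self, **additional_kwargs):
        # One possible fix: merge first, so the later value wins instead of clashing.
        merged = {**self._init_run_kwargs, **additional_kwargs}
        return init_run(**merged)


callback = CallbackSketch(with_id="RUN-1")

error_message = None
try:
    callback.reinitialize()
except TypeError as exc:
    error_message = str(exc)

merged_run = callback.initialize_run_merged(with_id="RUN-1")
```

If this is indeed what happens, merging the two dictionaries before the call (as in initialize_run_merged) would avoid the clash, but whether that matches the maintainers' intent is an assumption.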

Reproduction

Init a NeptuneCallback:
neptune_callback = NeptuneCallback(run=run, with_id=neptune_run_id)

Init a Trainer object from setfit (I think this could be the same for the huggingface Trainer class, because setfit inherits from it):

from setfit import Trainer
trainer = Trainer(..., callbacks=[neptune_callback])

Train and try to access run afterwards:

trainer.train()
run = NeptuneCallback.get_run(trainer)

Expected behavior

Either I want to be able to pass with_id to the callback or I want to have the same run, even if on_train_begin is called twice (the first call sets self._initial_run to None, the second call initializes a new run).

This is my current workaround to avoid calling the run method (I use NeptuneCallbackSetFit, because I had to overwrite some class methods which I mentioned here setfit498 and here setfit464 ):

def init_neptune_logging(model_tag: str, run_id=None):
    neptune_run = neptune.init_run(with_id=run_id, capture_stderr=True, capture_stdout=True, tags=model_tag)
    neptune_run_id = neptune_run["sys/id"].fetch()
    neptune_callback = NeptuneCallbackSetFit(
        run=neptune_run,
        log_parameters=True,
        log_checkpoints="best",
        with_id=neptune_run_id 
        )
    return neptune_run, neptune_run_id, neptune_callback

_, neptune_run_id, neptune_callback = init_neptune_logging(model_tag)
        
trainer = Trainer(
    ...
    callbacks=[neptune_callback],
)

trainer.train()
metrics = trainer.evaluate(test_dataset, metric_key_prefix="test")
 
neptune_run, _, _ = init_neptune_logging(model_tag, neptune_run_id)  # instead of NeptuneCallback.get_run(trainer) or trainer.log()
neptune_run["finetuning/test"].append(metrics)

Hey @Ulipenitz 👋

Could you share a minimal end-to-end example of your script where you are facing this error, along with the details below:

  • Python version:
  • OS:
  • Output of pip list

I could not extract this from my original code, but I put together the example script from the setfit github start page and my steps for reproduction.
Unfortunately, the issue does not occur with this script, but it logs to 3 separate runs instead of 1. Maybe this is because I had to overwrite class methods.

I'm sorry, but I don't have time to try to reproduce the original bug right now.

from datasets import load_dataset
from setfit import SetFitModel, Trainer, TrainingArguments, sample_dataset
import neptune
from transformers.integrations import NeptuneCallback

import tempfile

from dotenv import find_dotenv, load_dotenv
import os

class NeptuneCallbackSetFit(NeptuneCallback):
    # NOTE: We have to overwrite the class method, because the TrainingArguments class in SetFit is missing overwrite_output_dir as an argument.
    def on_init_end(self, args, state, control, **kwargs):
        self._volatile_checkpoints_dir = None
        # original line:
        # if self._log_checkpoints and (args.overwrite_output_dir or args.save_total_limit is not None):
        if self._log_checkpoints and (args.save_total_limit is not None):
            self._volatile_checkpoints_dir = tempfile.TemporaryDirectory().name

        if self._log_checkpoints == "best" and not args.load_best_model_at_end:
            raise ValueError("To save the best model checkpoint, the load_best_model_at_end argument must be enabled.")
        
    # NOTE: We have to overwrite the class method, because the Trainer in SetFit does not use f"checkpoint-{state.global_step}" in checkpoint saving
    def on_save(self, args, state, control, **kwargs):
        if self._should_upload_checkpoint:
            # original line:
            #self._log_model_checkpoint(args.output_dir, f"checkpoint-{state.global_step}")
            self._log_model_checkpoint(args.output_dir, f"step_{state.global_step}")
    
    # NOTE: We have to overwrite the class method, because model.config can be a dict as well
    def _log_model_parameters(self, model):
        from neptune.utils import stringify_unsupported

        if model and hasattr(model, "config") and model.config is not None:
            try:
                self._metadata_namespace[NeptuneCallback.model_parameters_key] = stringify_unsupported(
                    model.config.to_dict()
                )
            except AttributeError:
                self._metadata_namespace[NeptuneCallback.model_parameters_key] = stringify_unsupported(
                    model.config
                )

def init_neptune_logging(model_tag: str, run_id=None):
    # Create neptune callback for training logs
    neptune_run = neptune.init_run(with_id=run_id, capture_stderr=True, capture_stdout=True, tags=model_tag)
    neptune_run_id = neptune_run["sys/id"].fetch()
    neptune_callback = NeptuneCallbackSetFit(
        run=neptune_run,
        log_parameters=True,
        log_checkpoints="best",
        # NOTE: Really important to add this here because the Trainer._init_ somehow kills the run & initializes new one
        # -> def _initialize_run(self, **additional_neptune_kwargs):
        # -> self._run = init_run(**self._init_run_kwargs, **additional_neptune_kwargs)
        # ==> with_id will be in _init_run_kwargs as well! => possible duplication of key with_id
        with_id=neptune_run_id 
        )
    return neptune_run, neptune_run_id, neptune_callback

# Load a dataset from the Hugging Face Hub
dataset = load_dataset("sst2")

# Simulate the few-shot regime by sampling 4 examples per class
train_dataset = sample_dataset(dataset["train"], label_column="label", num_samples=4)
eval_dataset = dataset["validation"].select(range(50))
test_dataset = dataset["validation"].select(range(50, len(dataset["validation"])))

# Load a SetFit model from Hub
model = SetFitModel.from_pretrained(
    "paraphrase-MiniLM-L3-v2",
    labels=["negative", "positive"],
)

args = TrainingArguments(
    batch_size=16,
    num_epochs=4,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Load ENV variables
load_dotenv(find_dotenv(), override=True)
NEPTUNE_API_TOKEN = os.environ.get("NEPTUNE_API_TOKEN")
NEPTUNE_PROJECT = os.environ.get("NEPTUNE_PROJECT")

_, neptune_run_id, neptune_callback = init_neptune_logging("example_run")

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    metric="accuracy",
    column_mapping={"sentence": "text", "label": "label"},  # Map dataset columns to text/label expected by trainer
    callbacks=[neptune_callback]
)

# Train and evaluate
trainer.train()

metrics = trainer.evaluate(eval_dataset, metric_key_prefix="test")

run = NeptuneCallback.get_run(trainer)
run["finetuning/test"].append(metrics)
#neptune_run, _, _ = init_neptune_logging("example_run", neptune_run_id)
#neptune_run["finetuning/test"].append(metrics)

Python version: 3.9.18
OS: Windows
Output of pip list:

Package Version


accelerate 0.27.2
aiohttp 3.9.3
aiosignal 1.3.1
alembic 1.13.1
arrow 1.3.0
asttokens 2.4.1
async-timeout 4.0.3
attrs 23.2.0
azure-common 1.1.28
azure-core 1.30.0
azure-identity 1.15.0
azure-mgmt-compute 30.5.0
azure-mgmt-core 1.4.0
backcall 0.2.0
beautifulsoup4 4.12.3
bleach 6.1.0
boto3 1.34.48
botocore 1.34.48
bravado 11.0.3
bravado-core 6.1.1
certifi 2024.2.2
cffi 1.16.0
charset-normalizer 3.3.2
click 8.1.7
colorama 0.4.6
colorlog 6.8.2
comm 0.2.2
contourpy 1.2.0
cryptography 42.0.4
cycler 0.12.1
datasets 2.17.1
debugpy 1.6.7
decorator 5.1.1
defusedxml 0.7.1
dill 0.3.8
docopt 0.6.2
et-xmlfile 1.1.0
evaluate 0.4.1
executing 2.0.1
fastjsonschema 2.19.1
filelock 3.13.1
fonttools 4.50.0
fqdn 1.5.1
frozenlist 1.4.1
fsspec 2023.10.0
future 1.0.0
gitdb 4.0.11
GitPython 3.1.42
greenlet 3.0.3
huggingface-hub 0.20.3
idna 3.6
importlib-metadata 7.0.1
importlib_resources 6.3.1
ipykernel 6.29.3
ipython 8.12.0
isodate 0.6.1
isoduration 20.11.0
jedi 0.19.1
Jinja2 3.1.3
jmespath 1.0.1
joblib 1.3.2
jsonpointer 2.4
jsonref 1.1.0
jsonschema 4.21.1
jsonschema-specifications 2023.12.1
jupyter_client 8.6.0
jupyter_core 4.12.0
jupyterlab_pygments 0.3.0
kiwisolver 1.4.5
Mako 1.3.3
MarkupSafe 2.1.5
matplotlib 3.8.3
matplotlib-inline 0.1.6
mistune 3.0.2
monotonic 1.6
mpmath 1.3.0
msal 1.27.0
msal-extensions 1.1.0
msgpack 1.0.7
multidict 6.0.5
multiprocess 0.70.16
nbclient 0.9.0
nbconvert 7.16.1
nbformat 5.9.2
neptune 1.9.1
nest_asyncio 1.6.0
networkx 3.2.1
numpy 1.26.4
oauthlib 3.2.2
openpyxl 3.1.2
optuna 3.6.1
packaging 23.2
pandas 2.2.0
pandocfilters 1.5.1
parso 0.8.3
pickleshare 0.7.5
pillow 10.2.0
pip 23.3.1
pipreqs 0.5.0
platformdirs 4.2.0
portalocker 2.8.2
prompt-toolkit 3.0.42
psutil 5.9.0
pure-eval 0.2.2
pyarrow 15.0.0
pyarrow-hotfix 0.6
pycparser 2.21
Pygments 2.17.2
PyJWT 2.8.0
PyMuPDF 1.23.25
PyMuPDFb 1.23.22
pyparsing 3.1.2
python-dateutil 2.8.2
python-dotenv 1.0.1
pytz 2024.1
pywin32 227
PyYAML 6.0.1
pyzmq 25.1.2
referencing 0.33.0
regex 2023.12.25
requests 2.31.0
requests-oauthlib 1.3.1
responses 0.18.0
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
rpds-py 0.18.0
s3transfer 0.10.0
safetensors 0.4.2
scikit-learn 1.4.1.post1
scipy 1.12.0
sentence-transformers 2.5.1
seqeval 1.2.2
setfit 1.0.3
setuptools 68.2.2
simplejson 3.19.2
six 1.16.0
smmap 5.0.1
soupsieve 2.5
span-marker 1.5.0
SQLAlchemy 2.0.29
stack-data 0.6.2
swagger-spec-validator 3.0.3
sympy 1.12
threadpoolctl 3.3.0
tinycss2 1.2.1
tokenizers 0.15.2
torch 2.2.1+cu118
torchaudio 2.2.1+cu118
torchvision 0.17.1+cu118
tornado 6.2
tqdm 4.66.2
traitlets 5.14.1
transformers 4.38.1
types-python-dateutil 2.8.19.20240106
typing_extensions 4.10.0
tzdata 2024.1
uri-template 1.3.0
urllib3 1.26.18
wcwidth 0.2.13
webcolors 1.13
webencodings 0.5.1
websocket-client 1.7.0
wheel 0.41.2
xgboost 2.0.3
XlsxWriter 3.2.0
xxhash 3.4.1
yarg 0.1.9
yarl 1.9.4
zipp 3.17.0

but it logs to 3 separate runs instead of 1. Maybe this is because I had to overwrite class methods.

Not because of the overwritten class methods... I noticed the same behaviour when I initially tried to reproduce the issue using setfit's example code.

I was able to solve this by using custom_run_id instead of with_id. Just add the below at the top of your script:

from uuid import uuid4

os.environ["NEPTUNE_CUSTOM_RUN_ID"] = str(uuid4())

and remove with_id from neptune.init_run().
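If the script may be imported or started more than once in the same environment, the ID can be set only when it is missing, so every later initialization sees the same value. A minimal sketch, where ensure_custom_run_id is a hypothetical helper name and only the NEPTUNE_CUSTOM_RUN_ID variable comes from the suggestion above:

```python
import os
from uuid import uuid4


def ensure_custom_run_id(env=os.environ):
    """Set NEPTUNE_CUSTOM_RUN_ID once and return it (hypothetical helper)."""
    # setdefault keeps an existing value, so repeated calls return the same ID.
    return env.setdefault("NEPTUNE_CUSTOM_RUN_ID", str(uuid4()))


first = ensure_custom_run_id()
second = ensure_custom_run_id()
```

Both calls return the same string, so a run created after the second call would resolve to the same custom run ID as one created after the first.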

You will see a few "Error occurred during asynchronous operation processing: X-coordinates (step) must be strictly increasing for series attribute" errors, but these are either harmless or can be handled pretty easily: https://docs.neptune.ai/help/error_step_must_be_increasing/
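One client-side way to avoid such errors, assuming they come from the same step being logged again after the run is re-initialized, is to drop points whose step does not increase before appending them. This is only a sketch: strictly_increasing is a hypothetical helper, not part of Neptune, and the linked page describes the supported options.

```python
def strictly_increasing(points):
    """Yield only the (step, value) pairs whose step exceeds every step seen so far."""
    last_step = None
    for step, value in points:
        if last_step is None or step > last_step:
            last_step = step
            yield step, value


# Example input with a repeated step (2) and a step that goes backwards (1).
raw_points = [(1, 0.5), (2, 0.6), (2, 0.61), (1, 0.4), (3, 0.7)]
logged = list(strictly_increasing(raw_points))
```

Only the filtered pairs would then be passed on to the series, for example with append(value, step=step).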

Can you try this approach and let me know if it works for you?

Sorry for not coming back earlier.
Yes, this solves the issue of initializing multiple runs in parallel!
Thank you!

Perfect, thanks for confirming!

Closing this thread, but please feel free to comment or create a new issue if there's anything I can help you with 🤗