wandb/examples

Why isn't wandb logging at the last iteration? (when I want to log figs, ckpts etc)

Closed this issue · 28 comments

I was running my experiments in PyCharm and noticed that the last iteration/epoch is not logged for some reason (until I close the PyCharm debugger). And even when I close it, my figure is not being logged for some reason...

The code is really trivial, and I saved the run + report to help with reproducibility:

https://wandb.ai/brando/playground/runs/wpupxvg1?workspace=user-brando
https://wandb.ai/brando/playground/reports/Untitled-Report--VmlldzoxMDEzNjAw

What seems to happen is that the last epoch/iteration (it=11) is not saved when the script reaches the Done! print statement. For some reason, wandb delays the last log (and even once it reaches it, it doesn't save it properly):

        # - log to wandb
        if log_to_wandb:
            if it == 0:
                wandb.watch(args.mdl)
                print('watching model')
            # log_2_wandb(train_loss=train_loss, train_acc=train_acc)
            print('inside wandb log')
            wandb.log(data={'train loss': train_loss, 'train acc': train_acc, 'val loss': val_loss, 'val acc': val_acc}, step=it)
            wandb.log(data={'it': it}, step=it)
            if it == total_its - 1:
                print(f'logging fig at {it=}')
                wandb.log(data={'fig': fig}, step=it)
        plt.close('all')

output:

it=11: train_loss=tensor(25.5445, grad_fn=<MseLossBackward>), train_acc=-0.7582679473604099
it=11: val_loss=tensor(19.8431, grad_fn=<MseLossBackward>), val_acc=-0.6233638321583634
InvalidVersionSpec: Invalid version '1<2': invalid character(s)
inside wandb log
logging fig at it=11
train_loss=tensor(25.5445, grad_fn=<MseLossBackward>), train_loss=tensor(25.5445, grad_fn=<MseLossBackward>)
val_loss=tensor(22.0630, grad_fn=<MseLossBackward>), val_acc=-0.6465475801974256
Success, time passed: hours:0.005498635040389167, minutes=0.32991810242335, seconds=19.795086145401
Done!

(note that if I kill the debug window, it seems to finally push the last epoch/it that was missing)

Note: I am trying to keep the run hanging until someone sees this, but if the run finishes by itself, then you can see that it failed to push the last epoch's information here in the report I made:

https://wandb.ai/brando/playground/reports/Untitled-Report--VmlldzoxMDEzNjAw

(related, but it seems that .watch is not working for me)

note that the figure isn't being logged unless I log ALL the figures during every iteration... :/

[Screenshot: Screen Shot 2021-09-11 at 12 04 34 PM]

Hi there!

If you provide the step argument to wandb.log, the metrics data will be accumulated but not immediately saved (i.e., sent to the server). That is, when you provide the step yourself, you have control over/responsibility for when information is saved.

There are several ways you might trigger saving after logging some information at step 0. (Below, metrics is a dictionary.)

  1. wandb.log(metrics, step=1), i.e., log metrics at a new step value, which will save anything that has accumulated for the current step (0). The metrics will be associated with step 1 but not committed yet.
  2. wandb.log(metrics, step=0, commit=True), i.e., log metrics to the same step value, but explicitly state that it is time to commit/save all the information for that step. The metrics will be associated with step 0 and committed alongside everything else you've logged to that step.
  3. wandb.log(metrics), i.e., log metrics without providing a step value. Under the hood, we'll save the information logged to step 0, increment the step counter, log the metrics to that step, and then save them.
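The three cases above can be sketched with a toy logger that mimics the accumulate-then-commit behavior described. This is purely an illustration in plain Python, not wandb's actual implementation; the name `ToyLogger` and its internals are made up for the example:

```python
class ToyLogger:
    """Toy model of wandb.log's accumulate/commit behavior (illustration only)."""

    def __init__(self):
        self.pending_step = None  # step currently accumulating metrics
        self.pending = {}         # metrics accumulated for that step
        self.committed = {}       # step -> metrics that were saved/sent

    def log(self, metrics, step=None, commit=None):
        if step is None:
            # case 3: no explicit step -- flush anything pending, then
            # commit these metrics immediately at the next step value
            self._flush()
            step = max(self.committed, default=-1) + 1
            self.committed[step] = dict(metrics)
            return
        if self.pending_step is not None and step != self.pending_step:
            # case 1: a new step value commits whatever accumulated before it
            self._flush()
        self.pending_step = step
        self.pending.update(metrics)
        if commit:
            # case 2: explicit commit saves the current step right now
            self._flush()

    def _flush(self):
        if self.pending_step is not None:
            self.committed[self.pending_step] = dict(self.pending)
            self.pending_step, self.pending = None, {}

    def finish(self):
        # the run ending always flushes anything still pending
        self._flush()


logger = ToyLogger()
logger.log({"loss": 1.0}, step=0)              # accumulated, not committed
assert 0 not in logger.committed
logger.log({"loss": 0.5}, step=1)              # case 1: commits step 0
assert logger.committed[0] == {"loss": 1.0}
logger.log({"acc": 0.9}, step=1, commit=True)  # case 2: commits step 1
assert logger.committed[1] == {"loss": 0.5, "acc": 0.9}
logger.log({"loss": 0.1}, step=5)              # pending until...
logger.finish()                                # ...finish flushes it
assert logger.committed[5] == {"loss": 0.1}
```

The last few lines also show why a figure logged at the final step can appear "delayed": with an explicit step, it sits pending until the next step or the end of the run triggers a flush.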

Here's a quick colab demonstrating this behavior.

It seems like you're using the step because you want control over the alignment of your logged metrics. For that, we actually recommend using define_metric, which lets you define the "x-axis" for the information you are logging as another metric. I think that will substantially simplify your logging. The docs also include a colab demonstrating define_metric.

  • wandb.log(metrics, step=1), i.e., log metrics at a new step value, which will save anything that has accumulated for the current step (0). The metrics will be associated with step 1 but not committed yet.
  • wandb.log(metrics, step=0, commit=True), i.e., log metrics to the same step value, but explicitly state that it is time to commit/save all the information for that step. The metrics will be associated with step 0 and committed alongside everything else you've logged to that step.

Why do we need to do 2? As a user, I find this very unintuitive; I spent at least two days trying to figure out why my code was not saving my stuff and why the final metrics were missing. I think having commit=True is a better default and is in line with user expectations (my hunch is that most people expect log to log when it's called rather than randomly delay on us...).

Thanks for the help! Will continue reading your current reply. :)

I think having commit=True is a better default and is in line with user expectations (my hunch is that most people expect log to log when it's called rather than randomly delay on us...).

I agree that it's a bit surprising. But note that this will only happen if you provide an explicit step value. If you don't provide a step, then we log immediately. Also, the information is always logged eventually -- either when you increment the step yourself or when training finishes.

And also see the section of my reply on the preferred way for controlling the time/x axis of logged metrics, define_metric.

I think having commit=True is a better default and is in line with user expectations (my hunch is that most people expect log to log when it's called rather than randomly delay on us...).

I agree that it's a bit surprising. But note that this will only happen if you provide an explicit step value.

I vote that it should always be true unless the user explicitly says it's false.

If you don't provide a step, then we log immediately. Also, the information is always logged eventually -- either when you increment the step yourself or when training finishes.

Actually, my code does call finish at the end, but my plot was not pushed; perhaps that is a bug on the wandb side?

And also see the section of my reply on the preferred way for controlling the time/x axis of logged metrics, define_metric.

I did see that, but I'm not there yet! I will check it out soon.

Btw, Appreciate your time. :)

@charlesfrye

actually there is a line in your example that is confusing me:

# logs information to the history, but does not save/commit it,
# so the run workspace above will not show the metric
for i in range(100):
  wandb.log({"a": np.random.rand()}, step=0)

What is this doing? Isn't this overwriting the value of a at step 0 every time?

What is this doing? Isn't this overwriting the value of a at step 0 every time?

Yes, and only the final value is logged. The example is just demonstrating the behavior; I didn't intend to show a realistic workflow.
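As a minimal illustration of that point (plain Python standing in for wandb's behavior, not wandb code itself): repeated logs to the same step behave like successive updates to one dict, so only the last value per key survives when the step finally commits:

```python
import random

random.seed(0)
pending = {}  # stands in for the metrics accumulating at step 0
values = []
for i in range(100):
    v = random.random()
    values.append(v)
    pending.update({"a": v})  # same key, same step -> earlier value overwritten

# when step 0 finally commits, only the final value is saved
assert pending["a"] == values[-1]
assert len(pending) == 1
```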

Actually, my code does call finish at the end, but my plot was not pushed; perhaps that is a bug on the wandb side?

That's definitely possible -- the other metrics are logged but the plot is not. And looking at your summary, a None value is logged to that key instead.

Unfortunately, I can't debug your issue directly any further because of the code's dependence on ultimate-utils.

Btw, Appreciate your time. :)

Thanks! I appreciate the thoughtful questions.

@charlesfrye apologies for the spam, but something else is acting unexpectedly. I am trying to plot a list as a histogram at each step, but the wandb server complains that I'm not logging correctly... Why might that be? Is it because I am not defining the metrics correctly? colab: https://colab.research.google.com/drive/1uegSY1HRGlKfK-07Uuw-ZxPJsNA9BN_9#scrollTo=2Lkko6FUOXqS

# define our custom x axis metric
wandb.define_metric("outer_step")
# define which metrics will be plotted against it
wandb.define_metric("outer_train_loss", step_metric="outer_step")
wandb.define_metric("inner_train_losses", step_metric="outer_step")

# outer loop
for i_o in range(10):
  # inner loop
  i_is = []
  for i_i in range(5):
    i_is.append(i_i/5)
  # log
  log_dict = {
      "outer_train_loss": i_o,
      "outer_step": i_o,
      "inner_train_losses": wandb.Histogram(i_is)
  }
  wandb.log(log_dict, commit=True)
print('Done')
# wandb.finish()

Thanks in advance for the patience and care for teaching us all.

@brando90 I think you might need to change the permissions on that colab. If you make it public, I'll be able to take a look.

Thanks in advance for the patience and care for teaching us all.

Happy to help!


@charlesfrye I think this might work: https://colab.research.google.com/drive/1pLe4peekE5ixKqCTfjVQofKB6w3mdEDP?usp=sharing

sorry sort of new to colab

sorry sort of new to colab

np! I've really leaned into it as a way to share literate code that can execute on almost any device while also being fully decomposable. If there's another solution in that space that you like (repl.it? paperspace gradient?), lmk, I'm always down to learn more.

I am trying to plot a list as a histogram at each step, but the wandb server complains that I'm not logging correctly... Why might that be?

I don't see this anywhere in your code or logs, and I don't see it when I run the same. Do you have a screenshot?


interesting, do you see it now? https://colab.research.google.com/drive/1pLe4peekE5ixKqCTfjVQofKB6w3mdEDP#scrollTo=0BIYhmSROGmq&line=1&uniqifier=1

Hmm, I still don't see any issues there. Could you screenshot?

Of course, happy to. Hopefully I screenshotted what you expected; let me know if not :)

[Screenshot: Screen Shot 2021-09-13 at 1 28 44 PM]

(note: the left plot has nothing)

code:

# define our custom x axis metric
wandb.define_metric("outer_step")
# define which metrics will be plotted against it
wandb.define_metric("outer_train_loss", step_metric="outer_step")
wandb.define_metric("inner_train_losses", step_metric="outer_step")

# outer loop
for i_o in range(10):
  # inner loop
  i_is = []
  for i_i in range(5):
    i_is.append(i_i/5)
  # log
  log_dict = {
      "outer_train_loss": i_o,
      "outer_step": i_o,
      "inner_train_losses": wandb.Histogram(i_is)
  }
  wandb.log(log_dict, commit=True)
print('Done')
# wandb.finish()

Ah interesting -- if you check my workspace, you'll see that the histograms appear, which is why I didn't see your issue.

I think this is because you previously logged information under the same key (inner_train_losses) but with a different type (my guess, a raw Python list). The chart type is fixed by the first thing logged to the key. Here's a video showing how you could get a working chart: https://share.descript.com/view/FpBABwBjrl0.

Ah interesting -- if you check my workspace, you'll see that the histograms appear, which is why I didn't see your issue.

I think this is because you previously logged information under the same key (inner_train_losses) but with a different type (my guess, a raw Python list). The chart type is fixed by the first thing logged to the key. Here's a video showing how you could get a working chart: https://share.descript.com/view/FpBABwBjrl0.

Hmmm, ok creating the plot explicitly worked (as in the video).

I am skeptical that that's why; I've created new runs each time to avoid issues. Is the code I sent wrong in any way? No matter how many times I run it, it doesn't seem to plot the histogram without me having to create it explicitly in the UI. Are you able to create the histogram automatically, directly from wandb?

I've created new runs each time to avoid issues. Is the code I sent wrong in any way?

Nope, the code is correct. But starting a new run doesn't avoid the issue, because the relevant state here is at the level of the workspace. The first time you logged that key, you used a list, so the default chart for that key is based on a list. Unfortunately, we don't know how to plot lists, so as you saw when you did your first runs here, the results aren't what you wanted.

Switching to logging Histogram makes it possible to create the chart that you want, but it still doesn't happen by default, because that key's logged values are consumed by the broken chart.

If you create the correct style of chart, as in the video, new runs should have histograms when you create them. Let me know if that's not the case.

But there's a simpler fix: change the wandb.init call to point to a new project (histo-define-metric-test, perhaps?). If you only log with wandb.Histogram to that project, you won't see this issue at all.

I guess that fixed it.

But if I may give feedback as a user: having me log things and explicitly give the right type, only for the server to refuse to do what I am explicitly asking, is very unintuitive. I would suggest that the user's command override whatever the server tries to infer, and even have it update based on our new actions. Otherwise, it's extremely confusing and puzzling.

Thanks for your time again!