trackmania-rl/tmrl

Constantly getting thrown TypeError: Could not convert [tensor(1851.6046) tensor(326.9721)] to numeric

starexplorer21 opened this issue · 7 comments

I am currently experimenting with the command-line interface for this library, but I constantly get a TypeError.

I have tried tweaking the parameters, restarting Openplanet, etc., but to no avail.

Here are the command-line logs from the trainer. The other two scripts appear to be functioning properly.

I am running this in PowerShell with Python 3.10, with no wandb connection.

INFO:root:Namespace(server=False, trainer=True, worker=False, test=False, benchmark=False, record_reward=False, check_env=False, no_wandb=True, config={})
INFO:root:10/08/23 23:58:30 server IP: 127.0.0.1
INFO:root:--- NOW RUNNING SAC on TrackMania ---
INFO:root:Loading checkpoint...
INFO:root: Loaded checkpoint in 0.0033698081970214844 seconds.
INFO:root:Updating checkpoint...
INFO:root:Target entropy: -0.5.
INFO:root:Max epochs changed to 100 (old: 10000).
INFO:root:Rounds per epoch changed to 5 (old: 10).
INFO:root:Checkpoint updated in 0.0 seconds.
INFO:root:=== epoch 0/100 ==== round 0/5 =======================================
INFO:root: Waiting for new samples
INFO:root: Resuming training
INFO:root:starting training
C:\Users\Yile0\AppData\Local\Programs\Python\Python310\lib\site-packages\tmrl\custom\utils\nn.py:44: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  assert b.storage().data_ptr() == a.storage().data_ptr()
Traceback (most recent call last):
  File "C:\Users\Yile0\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Yile0\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\Yile0\AppData\Local\Programs\Python\Python310\lib\site-packages\tmrl\__main__.py", line 82, in <module>
    main(arguments)
  File "C:\Users\Yile0\AppData\Local\Programs\Python\Python310\lib\site-packages\tmrl\__main__.py", line 56, in main
    trainer.run()
  File "C:\Users\Yile0\AppData\Local\Programs\Python\Python310\lib\site-packages\tmrl\networking.py", line 393, in run
    run(interface=self.interface,
  File "C:\Users\Yile0\AppData\Local\Programs\Python\Python310\lib\site-packages\tmrl\networking.py", line 326, in run
    for stats in iterate_epochs_tm(run_cls, interface, checkpoint_path, dump_run_instance_fn, load_run_instance_fn, 1, updater_fn):
  File "C:\Users\Yile0\AppData\Local\Programs\Python\Python310\lib\site-packages\tmrl\networking.py", line 269, in iterate_epochs_tm
    yield run_instance.run_epoch(interface=interface)  # yield stats data frame (this makes this function a generator)
  File "C:\Users\Yile0\AppData\Local\Programs\Python\Python310\lib\site-packages\tmrl\training_offline.py", line 153, in run_epoch
    stats += pandas_dict(memory_len=len(self.memory), round_time=round_time, idle_time=idle_time, **DataFrame(stats_training).mean(skipna=True)),
  File "C:\Users\Yile0\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\frame.py", line 11338, in mean
    result = super().mean(axis, skipna, numeric_only, **kwargs)
  File "C:\Users\Yile0\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\generic.py", line 11978, in mean
    return self._stat_function(
  File "C:\Users\Yile0\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\generic.py", line 11935, in _stat_function
    return self._reduce(
  File "C:\Users\Yile0\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\frame.py", line 11207, in _reduce
    res = df._mgr.reduce(blk_func)
  File "C:\Users\Yile0\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\internals\managers.py", line 1459, in reduce
    nbs = blk.reduce(func)
  File "C:\Users\Yile0\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\internals\blocks.py", line 377, in reduce
    result = func(self.values)
  File "C:\Users\Yile0\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\frame.py", line 11139, in blk_func
    return op(values, axis=axis, skipna=skipna, **kwds)
  File "C:\Users\Yile0\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\nanops.py", line 147, in f
    result = alt(values, axis=axis, skipna=skipna, **kwds)
  File "C:\Users\Yile0\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\nanops.py", line 404, in new_func
    result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
  File "C:\Users\Yile0\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\nanops.py", line 720, in nanmean
    the_sum = _ensure_numeric(the_sum)
  File "C:\Users\Yile0\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\nanops.py", line 1678, in _ensure_numeric
    raise TypeError(f"Could not convert {x} to numeric")
TypeError: Could not convert [tensor(1851.6046) tensor(326.9721)] to numeric

Hi, apparently there was a breaking change in a recent version of pandas or torch (most likely pandas). Can you provide your versions of pandas and torch so that I can reproduce the issue, please?

Also, do you still get this error if you delete the checkpoint?
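
For anyone else reporting versions on this issue, here is a minimal sketch of one way to collect them using only the standard library. The helper name `installed_versions` is made up for illustration and is not part of tmrl.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}.

    Hypothetical helper for illustration; not part of tmrl.
    """
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(["pandas", "torch", "tmrl"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

Running `pip show pandas torch` in the same environment should report the same information.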

I am receiving the exact same TypeError, and yes, tmrl --server and --worker are working properly. It's just the trainer that is currently broken.

Hi, this is probably torch or pandas, as they both released new versions recently. Can you try to downgrade torch to 2.0.1 and, if this doesn't work, pandas to, say, 1.5.3? Hopefully that should work until we publish a hotfix.

(PS: don't forget to delete the checkpoint saved by the trainer in TmrlData/checkpoints as it would otherwise be corrupted)

Okay, I uninstalled my current versions of torch and pandas and installed the versions you recommended, and it worked. The issue is with the pandas update: when I downgraded torch to 2.0.1, the TypeError was still being thrown, but after downgrading pandas to 1.5.3 it worked.
Thanks so much for the help!

Thanks for testing. I'll make tmrl compatible with the latest version of pandas in the upcoming release.

For others who have this bug in the meantime, this should fix it:

pip install pandas==1.5.3
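
For readers who cannot downgrade, the traceback suggests that `DataFrame(...).mean()` received a column of tensor objects rather than plain numbers. Below is a minimal sketch of the general idea of converting such values to Python floats before averaging. This is not the actual tmrl fix: `FakeTensor`, `to_plain_floats`, `mean_of_stats`, and the stat key names are all hypothetical, and `FakeTensor` only stands in for a zero-dimensional `torch.Tensor` so that the sketch runs without torch installed.

```python
class FakeTensor:
    """Hypothetical stand-in for a 0-dim torch.Tensor (exposes .item())."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return self._value


def to_plain_floats(stats):
    """Convert tensor-like values (anything with .item()) to Python floats."""
    plain = {}
    for key, value in stats.items():
        if hasattr(value, "item"):
            value = value.item()  # unwrap a scalar tensor into a Python number
        plain[key] = float(value)
    return plain


def mean_of_stats(stats_list):
    """Average each key over a non-empty list of stats dicts."""
    rows = [to_plain_floats(s) for s in stats_list]
    return {k: sum(r[k] for r in rows) / len(rows) for k in rows[0]}


# Assumed shape of the per-step training stats; the key names are made up.
stats_training = [
    {"loss_actor": FakeTensor(1851.6046), "loss_critic": 2.0},
    {"loss_actor": FakeTensor(326.9721), "loss_critic": 4.0},
]
print(mean_of_stats(stats_training))
```

With real tensors, `tensor.item()` returns a Python number in the same way, so a list of dicts converted like this would presumably give pandas numeric columns instead of object columns.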

Thanks so much for the help!
Since there is now a solution, I'll be closing the issue.
If it's any help, I was using torch 2.1.0+cu118 and pandas 2.1.1.

After downgrading just pandas to 1.5.3, the issue seems to be completely resolved for me as well.

Well, that was a bit early to close, but the issue should now be resolved in version 0.5.3 :)