MeteoSwiss/ldcast

Can't cache sampler_nowcaster_test

caglarkucuk opened this issue · 5 comments

Context

After successfully running
python forecast_demo.py and
python train_autoenc.py --model_dir="../models/autoenc_train",
I couldn't get the
python train_genforecast.py --model_dir="../models/genforecast_train"
command to run, because caching the sampler for the test and training sets fails.

When running python train_genforecast.py --model_dir="../models/genforecast_train":

Expected behaviour

  1. Creates the sampler files and saves them to the cache directory for the valid, test, and train datasets
  2. Trains the forecaster

Actual behaviour

  1. Creates the file ../cache/sampler_nowcaster_valid.pkl
  2. Throws an error while creating the next sampler (complete error message pasted below):
~/tmp/0606/ldcast/scripts$ python train_genforecast.py --model_dir="../models/genforecast_train"
Loading data...
/home/kucuk/tmp/0606/ldcast/ldcast/features/transform.py:80: RuntimeWarning: divide by zero encountered in log10
  log_scale = np.log10(scale).astype(np.float32)
Loading cached sampler from ../cache/sampler_nowcaster_valid.pkl.
No cached sampler found, creating a new one...
Traceback (most recent call last):
  File "train_genforecast.py", line 129, in <module>
    Fire(main)
  File "/home/kucuk/miniconda3/envs/ldcast_test/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/kucuk/miniconda3/envs/ldcast_test/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/kucuk/miniconda3/envs/ldcast_test/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "train_genforecast.py", line 125, in main
    train(**config)
  File "train_genforecast.py", line 94, in train
    datamodule = setup_data(
  File "/home/kucuk/tmp/0606/ldcast/scripts/train_nowcaster.py", line 124, in setup_data
    datamodule = split.DataModule(
  File "/home/kucuk/tmp/0606/ldcast/ldcast/features/split.py", line 127, in __init__
    self.batch_gen = {
  File "/home/kucuk/tmp/0606/ldcast/ldcast/features/split.py", line 128, in <dictcomp>
    split: batch.BatchGenerator(
  File "/home/kucuk/tmp/0606/ldcast/ldcast/features/batch.py", line 81, in __init__
    self.sampler = EqualFrequencySampler(
  File "/home/kucuk/tmp/0606/ldcast/ldcast/features/sampling.py", line 30, in __init__
    self.starting_ind = [
  File "/home/kucuk/tmp/0606/ldcast/ldcast/features/sampling.py", line 31, in <listcomp>
    starting_indices_for_centers(
  File "/home/kucuk/tmp/0606/ldcast/ldcast/features/sampling.py", line 210, in starting_indices_for_centers
    starting_ind = np.concatenate(
  File "<__array_function__ internals>", line 180, in concatenate
ValueError: need at least one array to concatenate
  3. This seems like an issue with the indexing of patches in the sampler, though I'm not sure.
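
For reference, the final ValueError in the traceback is what NumPy raises whenever np.concatenate receives an empty sequence, which would be consistent with no starting indices being collected for this split. A minimal sketch that reproduces the message and shows one possible guard; concat_or_empty is a hypothetical helper for illustration and is not part of ldcast:

```python
import numpy as np


def concat_or_empty(arrays, dtype=np.int64):
    """Concatenate a list of arrays, or return an empty 1-D array if the list is empty.

    Hypothetical helper for illustration only; not part of ldcast.
    """
    arrays = list(arrays)
    if not arrays:
        # np.concatenate([]) would raise here, so return an explicit empty result
        return np.empty((0,), dtype=dtype)
    return np.concatenate(arrays)


# np.concatenate itself refuses an empty sequence, as in the traceback above
try:
    np.concatenate([])
except ValueError as err:
    message = str(err)
    print(message)
```

A guard like this would only hide the symptom; an empty list at that point more likely means that no data matched the split, so the data directory and the chunk split file are worth checking first.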

Additional information

  • I removed the sampler_nowcaster* files from the cache folder and reran the same command, python train_genforecast.py --model_dir="../models/genforecast_train", and received the same error
  • In case it helps, the complete directory structure is below (note the size difference among the sampler_*_valid.pkl files)
~/tmp/0606/ldcast$ tree -h
.
├── [ 11K]  LICENSE
├── [5.5K]  README.md
├── [4.0K]  cache
│   ├── [501K]  sampler_autoenc_test.pkl
│   ├── [883M]  sampler_autoenc_train.pkl
│   ├── [1.9M]  sampler_autoenc_valid.pkl
│   └── [745K]  sampler_nowcaster_valid.pkl
├── [4.0K]  config
│   ├── [  65]  genforecast-radaronly-128x128-20step.yaml
│   └── [ 158]  genforecast-radaronly-256x256-20step.yaml
├── [4.0K]  data
│   ├── [ 48K]  Border_CH.dbf
│   ├── [130K]  Border_CH.shp
│   ├── [4.0K]  RV
│   │   ├── [197M]  patches_RV_202204.nc
│   │   ├── [156M]  patches_RV_202205.nc
│   │   ├── [154M]  patches_RV_202206.nc
│   │   ├── [151M]  patches_RV_202207.nc
│   │   ├── [ 84M]  patches_RV_202208.nc
│   │   └── [281M]  patches_RV_202209.nc
│   ├── [4.0K]  RZC
│   │   ├── [ 55M]  patches_RZC_201804.nc
│   │   ├── [102M]  patches_RZC_201805.nc
│   │   ├── [ 53M]  patches_RZC_201806.nc
│   │   ├── [ 53M]  patches_RZC_201807.nc
│   │   ├── [ 71M]  patches_RZC_201808.nc
│   │   ├── [ 38M]  patches_RZC_201809.nc
│   │   ├── [ 72M]  patches_RZC_201904.nc
│   │   ├── [117M]  patches_RZC_201905.nc
│   │   ├── [ 69M]  patches_RZC_201906.nc
│   │   ├── [ 57M]  patches_RZC_201907.nc
│   │   ├── [ 77M]  patches_RZC_201908.nc
│   │   ├── [ 55M]  patches_RZC_201909.nc
│   │   ├── [ 42M]  patches_RZC_202004.nc
│   │   ├── [ 67M]  patches_RZC_202005.nc
│   │   ├── [110M]  patches_RZC_202006.nc
│   │   ├── [ 51M]  patches_RZC_202007.nc
│   │   ├── [ 91M]  patches_RZC_202008.nc
│   │   ├── [ 61M]  patches_RZC_202009.nc
│   │   ├── [ 59M]  patches_RZC_202104.nc
│   │   ├── [140M]  patches_RZC_202105.nc
│   │   ├── [ 86M]  patches_RZC_202106.nc
│   │   ├── [120M]  patches_RZC_202107.nc
│   │   ├── [ 62M]  patches_RZC_202108.nc
│   │   └── [ 54M]  patches_RZC_202109.nc
│   ├── [4.0K]  demo
│   │   └── [4.0K]  20210622
│   │       ├── [214K]  RZC211731820VL.801.h5
│   │       ├── [214K]  RZC211731825VL.801.h5
│   │       ├── [216K]  RZC211731830VL.801.h5
│   │       └── [216K]  RZC211731835VL.801.h5
│   └── [4.1K]  split_chunks.pkl.gz
├── [4.0K]  figures
│   └── [4.0K]  demo
│       ├── [207K]  R_past-00.png
│       ├── [208K]  R_past-01.png
│       ├── [208K]  R_past-02.png
│       ├── [208K]  R_past-03.png
│       ├── [192K]  R_pred-00.png
│       ├── [191K]  R_pred-01.png
│       ├── [192K]  R_pred-02.png
│       ├── [194K]  R_pred-03.png
│       ├── [195K]  R_pred-04.png
│       ├── [195K]  R_pred-05.png
│       ├── [196K]  R_pred-06.png
│       ├── [196K]  R_pred-07.png
│       ├── [199K]  R_pred-08.png
│       ├── [202K]  R_pred-09.png
│       ├── [204K]  R_pred-10.png
│       ├── [204K]  R_pred-11.png
│       ├── [208K]  R_pred-12.png
│       ├── [207K]  R_pred-13.png
│       ├── [211K]  R_pred-14.png
│       ├── [211K]  R_pred-15.png
│       ├── [212K]  R_pred-16.png
│       ├── [209K]  R_pred-17.png
│       ├── [209K]  R_pred-18.png
│       └── [205K]  R_pred-19.png
├── [4.0K]  ldcast
│   ├── [4.0K]  analysis
│   │   ├── [4.8K]  crps.py
│   │   ├── [4.4K]  fss.py
│   │   ├── [3.2K]  histogram.py
│   │   └── [5.5K]  rank.py
│   ├── [4.0K]  features
│   │   ├── [4.0K]  __pycache__
│   │   │   ├── [ 11K]  batch.cpython-38.pyc
│   │   │   ├── [ 11K]  patches.cpython-38.pyc
│   │   │   ├── [7.0K]  sampling.cpython-38.pyc
│   │   │   ├── [4.8K]  split.cpython-38.pyc
│   │   │   ├── [8.2K]  transform.cpython-38.pyc
│   │   │   └── [3.1K]  utils.cpython-38.pyc
│   │   ├── [ 13K]  batch.py
│   │   ├── [3.9K]  io.py
│   │   ├── [ 13K]  patches.py
│   │   ├── [7.1K]  sampling.py
│   │   ├── [5.2K]  split.py
│   │   ├── [8.7K]  transform.py
│   │   └── [3.8K]  utils.py
│   ├── [8.5K]  forecast.py
│   ├── [4.0K]  models
│   │   ├── [4.0K]  __pycache__
│   │   │   ├── [1.1K]  distributions.cpython-38.pyc
│   │   │   └── [ 833]  utils.cpython-38.pyc
│   │   ├── [4.0K]  autoenc
│   │   │   ├── [4.0K]  __pycache__
│   │   │   │   ├── [3.4K]  autoenc.cpython-38.pyc
│   │   │   │   ├── [1.9K]  encoder.cpython-38.pyc
│   │   │   │   └── [ 960]  training.cpython-38.pyc
│   │   │   ├── [3.0K]  autoenc.py
│   │   │   ├── [1.9K]  encoder.py
│   │   │   └── [ 952]  training.py
│   │   ├── [4.0K]  benchmarks
│   │   │   ├── [2.4K]  dgmr.py
│   │   │   ├── [3.7K]  pysteps.py
│   │   │   └── [ 350]  transform.py
│   │   ├── [4.0K]  blocks
│   │   │   ├── [4.0K]  __pycache__
│   │   │   │   ├── [9.5K]  afno.cpython-38.pyc
│   │   │   │   ├── [3.2K]  attention.cpython-38.pyc
│   │   │   │   └── [2.2K]  resnet.cpython-38.pyc
│   │   │   ├── [ 13K]  afno.py
│   │   │   ├── [3.0K]  attention.py
│   │   │   └── [2.7K]  resnet.py
│   │   ├── [4.0K]  diffusion
│   │   │   ├── [4.0K]  __pycache__
│   │   │   │   ├── [6.6K]  diffusion.cpython-38.pyc
│   │   │   │   ├── [2.9K]  ema.cpython-38.pyc
│   │   │   │   └── [8.3K]  utils.cpython-38.pyc
│   │   │   ├── [7.4K]  diffusion.py
│   │   │   ├── [2.9K]  ema.py
│   │   │   ├── [ 12K]  plms.py
│   │   │   └── [8.7K]  utils.py
│   │   ├── [ 838]  distributions.py
│   │   ├── [4.0K]  genforecast
│   │   │   ├── [4.0K]  __pycache__
│   │   │   │   ├── [1.4K]  analysis.cpython-38.pyc
│   │   │   │   ├── [1.0K]  training.cpython-38.pyc
│   │   │   │   └── [ 11K]  unet.cpython-38.pyc
│   │   │   ├── [1.0K]  analysis.py
│   │   │   ├── [1.1K]  training.py
│   │   │   └── [ 17K]  unet.py
│   │   ├── [4.0K]  nowcast
│   │   │   ├── [4.0K]  __pycache__
│   │   │   │   └── [8.4K]  nowcast.cpython-38.pyc
│   │   │   └── [8.3K]  nowcast.py
│   │   └── [ 770]  utils.py
│   └── [4.0K]  visualization
│       ├── [1.2K]  cm.py
│       └── [ 11K]  plots.py
├── [4.0K]  ldcast.egg-info
│   ├── [5.9K]  PKG-INFO
│   ├── [ 175]  SOURCES.txt
│   ├── [   1]  dependency_links.txt
│   ├── [ 137]  requires.txt
│   └── [   1]  top_level.txt
├── [4.0K]  models
│   ├── [4.0K]  autoenc
│   │   └── [1.5M]  autoenc-32-0.01.pt
│   ├── [4.0K]  autoenc_train
│   │   ├── [4.6M]  epoch=0-val_rec_loss=0.2204.ckpt
│   │   ├── [4.6M]  epoch=0-val_rec_loss=nan.ckpt
│   │   └── [4.6M]  epoch=1-val_rec_loss=0.1653.ckpt
│   └── [4.0K]  genforecast
│       └── [5.0G]  genforecast-radaronly-256x256-20step.pt
├── [4.0K]  results
├── [4.0K]  scripts
│   ├── [4.0K]  __pycache__
│   │   └── [4.0K]  train_nowcaster.cpython-38.pyc
│   ├── [4.0K]  dwd_dataset.py
│   ├── [1.4K]  eval_data.py
│   ├── [1.5K]  eval_dgmr.py
│   ├── [4.4K]  eval_genforecast.py
│   ├── [1.4K]  eval_pysteps.py
│   ├── [3.8K]  forecast_demo.py
│   ├── [4.0K]  lightning_logs
│   │   └── [4.0K]  version_0
│   │       ├── [4.0K]  events.out.tfevents.1686053670.{VM_NAME}
│   │       └── [   3]  hparams.yaml
│   ├── [3.5K]  metrics.py
│   ├── [ 13K]  plots_genforecast.py
│   ├── [2.8K]  train_autoenc.py
│   ├── [3.4K]  train_genforecast.py
│   └── [4.1K]  train_nowcaster.py
├── [ 951]  setup.py
└── [   0]  tmp4_0607

37 directories, 149 files

Could it be something related to package version compatibility, e.g., dask or numba? Perhaps I'm missing something in the data directory.
@jleinonen please let me know how I can provide further information - and thanks in advance!
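
To make the version question easier to check, here is a small sketch that reports the installed versions of the packages in question using only the standard library. The package list is an assumption based on the names mentioned in this issue, and installed_versions is a hypothetical helper:

```python
from importlib import metadata


def installed_versions(packages):
    """Return a dict mapping each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in the current environment
            versions[name] = None
    return versions


# Package names below are taken from this issue; extend the list as needed
print(installed_versions(["numpy", "dask", "numba"]))
```

Running this inside the conda environment used for training and pasting the output would let the versions be compared against a working setup.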

Hi @caglarkucuk, you say above that you ran

python train_autoenc.py --model_dir="../models/autoenc_train"

successfully and then that the same command failed. Did you mean to have a different command on the second line?

Sorry, I pasted the wrong command while creating the issue.
I've edited the original post; apologies for the confusion.

This is a very strange bug. The LDM training uses the same sampler code as the autoencoder training. So I don't understand why you would get the latter to work but not the former. Could you paste the layout of your data directory and a longer traceback of the error? Also maybe remove the sampler_nowcaster_*.pkl files from the cache directory and try to run it again to see if it reoccurs?
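
The cache cleanup suggested above could be done with a short script like the following. It is a sketch that assumes the cache lives in ../cache relative to scripts/, as in the log output; remove_cached_samplers is a hypothetical helper, not an ldcast function:

```python
from pathlib import Path


def remove_cached_samplers(cache_dir, pattern="sampler_nowcaster_*.pkl"):
    """Delete cached sampler pickles matching `pattern` and return the removed file names."""
    removed = []
    for path in sorted(Path(cache_dir).glob(pattern)):
        path.unlink()
        removed.append(path.name)
    return removed


# Example call from the scripts/ directory (assumed cache location):
# print(remove_cached_samplers("../cache"))
```

Only files matching the pattern are deleted, so the sampler_autoenc_*.pkl files from the autoencoder run stay in place.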

Hi @jleinonen, I got the same error that @caglarkucuk mentioned when I ran the command python train_autoenc.py --model_dir="../models/autoenc_train" in the scripts/ directory. I downloaded the data you provided and put it in the data/ directory. No sampler files were produced in the cache/ directory.
Thanks in advance!

Thanks @jleinonen for the quick response. I updated the original issue to provide further information, based on your suggestions.