Can't cache sampler_nowcaster_test
caglarkucuk opened this issue · 5 comments
Context
After successfully running `python forecast_demo.py` and `python train_autoenc.py --model_dir="../models/autoenc_train"`, I couldn't get `python train_genforecast.py --model_dir="../models/genforecast_train"` to run, because caching the samplers for the test and training sets fails.
When running `python train_genforecast.py --model_dir="../models/genforecast_train"`:
Expected behaviour
- Creates the sampler files and saves them to the `cache` directory for the valid, test, and train datasets
- Trains the forecaster
Actual behaviour
- Creates the file `../cache/sampler_nowcaster_valid.pkl`
- Throws an error while creating the next sampler (complete error message pasted below):
~/tmp/0606/ldcast/scripts$ python train_genforecast.py --model_dir="../models/genforecast_train"
Loading data...
/home/kucuk/tmp/0606/ldcast/ldcast/features/transform.py:80: RuntimeWarning: divide by zero encountered in log10
log_scale = np.log10(scale).astype(np.float32)
Loading cached sampler from ../cache/sampler_nowcaster_valid.pkl.
No cached sampler found, creating a new one...
Traceback (most recent call last):
File "train_genforecast.py", line 129, in <module>
Fire(main)
File "/home/kucuk/miniconda3/envs/ldcast_test/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/kucuk/miniconda3/envs/ldcast_test/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/kucuk/miniconda3/envs/ldcast_test/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "train_genforecast.py", line 125, in main
train(**config)
File "train_genforecast.py", line 94, in train
datamodule = setup_data(
File "/home/kucuk/tmp/0606/ldcast/scripts/train_nowcaster.py", line 124, in setup_data
datamodule = split.DataModule(
File "/home/kucuk/tmp/0606/ldcast/ldcast/features/split.py", line 127, in __init__
self.batch_gen = {
File "/home/kucuk/tmp/0606/ldcast/ldcast/features/split.py", line 128, in <dictcomp>
split: batch.BatchGenerator(
File "/home/kucuk/tmp/0606/ldcast/ldcast/features/batch.py", line 81, in __init__
self.sampler = EqualFrequencySampler(
File "/home/kucuk/tmp/0606/ldcast/ldcast/features/sampling.py", line 30, in __init__
self.starting_ind = [
File "/home/kucuk/tmp/0606/ldcast/ldcast/features/sampling.py", line 31, in <listcomp>
starting_indices_for_centers(
File "/home/kucuk/tmp/0606/ldcast/ldcast/features/sampling.py", line 210, in starting_indices_for_centers
starting_ind = np.concatenate(
File "<__array_function__ internals>", line 180, in concatenate
ValueError: need at least one array to concatenate
- Seems like an issue with indexing of patches in the sampler, though I'm not sure...
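The last frame of the traceback can be reproduced in isolation: `np.concatenate` raises this exact `ValueError` when it receives an empty sequence. The sketch below assumes, without confirmation from the traceback, that the list built inside `starting_indices_for_centers` was empty (for example because no patches fell into the split being sampled). `concat_or_empty` is a hypothetical helper for illustration and is not part of ldcast.

```python
import numpy as np


def concat_or_empty(arrays, dtype=np.int64):
    """Concatenate 1-D arrays, returning an empty array if the list is empty.

    Hypothetical helper for illustration; not part of ldcast.
    """
    arrays = list(arrays)
    if not arrays:
        return np.empty(0, dtype=dtype)
    return np.concatenate(arrays)


# An empty list reproduces the error message seen in the traceback.
try:
    np.concatenate([])
except ValueError as err:
    message = str(err)

# The guarded version returns an empty array instead of raising.
empty = concat_or_empty([])
joined = concat_or_empty([np.array([0, 1]), np.array([5])])
```

A guard like this would only hide the symptom. If the assumption holds, the more useful question is why no starting indices were found for that split.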
Additional information
- I removed the `sampler_nowcaster*` files from the `cache` folder and ran the same command, `python train_genforecast.py --model_dir="../models/genforecast_train"`, and received the same error
- In case it helps, below is the complete directory structure (note the size difference among the `sampler_*_valid.pkl` files)
~/tmp/0606/ldcast$ tree -h
.
├── [ 11K] LICENSE
├── [5.5K] README.md
├── [4.0K] cache
│ ├── [501K] sampler_autoenc_test.pkl
│ ├── [883M] sampler_autoenc_train.pkl
│ ├── [1.9M] sampler_autoenc_valid.pkl
│ └── [745K] sampler_nowcaster_valid.pkl
├── [4.0K] config
│ ├── [ 65] genforecast-radaronly-128x128-20step.yaml
│ └── [ 158] genforecast-radaronly-256x256-20step.yaml
├── [4.0K] data
│ ├── [ 48K] Border_CH.dbf
│ ├── [130K] Border_CH.shp
│ ├── [4.0K] RV
│ │ ├── [197M] patches_RV_202204.nc
│ │ ├── [156M] patches_RV_202205.nc
│ │ ├── [154M] patches_RV_202206.nc
│ │ ├── [151M] patches_RV_202207.nc
│ │ ├── [ 84M] patches_RV_202208.nc
│ │ └── [281M] patches_RV_202209.nc
│ ├── [4.0K] RZC
│ │ ├── [ 55M] patches_RZC_201804.nc
│ │ ├── [102M] patches_RZC_201805.nc
│ │ ├── [ 53M] patches_RZC_201806.nc
│ │ ├── [ 53M] patches_RZC_201807.nc
│ │ ├── [ 71M] patches_RZC_201808.nc
│ │ ├── [ 38M] patches_RZC_201809.nc
│ │ ├── [ 72M] patches_RZC_201904.nc
│ │ ├── [117M] patches_RZC_201905.nc
│ │ ├── [ 69M] patches_RZC_201906.nc
│ │ ├── [ 57M] patches_RZC_201907.nc
│ │ ├── [ 77M] patches_RZC_201908.nc
│ │ ├── [ 55M] patches_RZC_201909.nc
│ │ ├── [ 42M] patches_RZC_202004.nc
│ │ ├── [ 67M] patches_RZC_202005.nc
│ │ ├── [110M] patches_RZC_202006.nc
│ │ ├── [ 51M] patches_RZC_202007.nc
│ │ ├── [ 91M] patches_RZC_202008.nc
│ │ ├── [ 61M] patches_RZC_202009.nc
│ │ ├── [ 59M] patches_RZC_202104.nc
│ │ ├── [140M] patches_RZC_202105.nc
│ │ ├── [ 86M] patches_RZC_202106.nc
│ │ ├── [120M] patches_RZC_202107.nc
│ │ ├── [ 62M] patches_RZC_202108.nc
│ │ └── [ 54M] patches_RZC_202109.nc
│ ├── [4.0K] demo
│ │ └── [4.0K] 20210622
│ │ ├── [214K] RZC211731820VL.801.h5
│ │ ├── [214K] RZC211731825VL.801.h5
│ │ ├── [216K] RZC211731830VL.801.h5
│ │ └── [216K] RZC211731835VL.801.h5
│ └── [4.1K] split_chunks.pkl.gz
├── [4.0K] figures
│ └── [4.0K] demo
│ ├── [207K] R_past-00.png
│ ├── [208K] R_past-01.png
│ ├── [208K] R_past-02.png
│ ├── [208K] R_past-03.png
│ ├── [192K] R_pred-00.png
│ ├── [191K] R_pred-01.png
│ ├── [192K] R_pred-02.png
│ ├── [194K] R_pred-03.png
│ ├── [195K] R_pred-04.png
│ ├── [195K] R_pred-05.png
│ ├── [196K] R_pred-06.png
│ ├── [196K] R_pred-07.png
│ ├── [199K] R_pred-08.png
│ ├── [202K] R_pred-09.png
│ ├── [204K] R_pred-10.png
│ ├── [204K] R_pred-11.png
│ ├── [208K] R_pred-12.png
│ ├── [207K] R_pred-13.png
│ ├── [211K] R_pred-14.png
│ ├── [211K] R_pred-15.png
│ ├── [212K] R_pred-16.png
│ ├── [209K] R_pred-17.png
│ ├── [209K] R_pred-18.png
│ └── [205K] R_pred-19.png
├── [4.0K] ldcast
│ ├── [4.0K] analysis
│ │ ├── [4.8K] crps.py
│ │ ├── [4.4K] fss.py
│ │ ├── [3.2K] histogram.py
│ │ └── [5.5K] rank.py
│ ├── [4.0K] features
│ │ ├── [4.0K] __pycache__
│ │ │ ├── [ 11K] batch.cpython-38.pyc
│ │ │ ├── [ 11K] patches.cpython-38.pyc
│ │ │ ├── [7.0K] sampling.cpython-38.pyc
│ │ │ ├── [4.8K] split.cpython-38.pyc
│ │ │ ├── [8.2K] transform.cpython-38.pyc
│ │ │ └── [3.1K] utils.cpython-38.pyc
│ │ ├── [ 13K] batch.py
│ │ ├── [3.9K] io.py
│ │ ├── [ 13K] patches.py
│ │ ├── [7.1K] sampling.py
│ │ ├── [5.2K] split.py
│ │ ├── [8.7K] transform.py
│ │ └── [3.8K] utils.py
│ ├── [8.5K] forecast.py
│ ├── [4.0K] models
│ │ ├── [4.0K] __pycache__
│ │ │ ├── [1.1K] distributions.cpython-38.pyc
│ │ │ └── [ 833] utils.cpython-38.pyc
│ │ ├── [4.0K] autoenc
│ │ │ ├── [4.0K] __pycache__
│ │ │ │ ├── [3.4K] autoenc.cpython-38.pyc
│ │ │ │ ├── [1.9K] encoder.cpython-38.pyc
│ │ │ │ └── [ 960] training.cpython-38.pyc
│ │ │ ├── [3.0K] autoenc.py
│ │ │ ├── [1.9K] encoder.py
│ │ │ └── [ 952] training.py
│ │ ├── [4.0K] benchmarks
│ │ │ ├── [2.4K] dgmr.py
│ │ │ ├── [3.7K] pysteps.py
│ │ │ └── [ 350] transform.py
│ │ ├── [4.0K] blocks
│ │ │ ├── [4.0K] __pycache__
│ │ │ │ ├── [9.5K] afno.cpython-38.pyc
│ │ │ │ ├── [3.2K] attention.cpython-38.pyc
│ │ │ │ └── [2.2K] resnet.cpython-38.pyc
│ │ │ ├── [ 13K] afno.py
│ │ │ ├── [3.0K] attention.py
│ │ │ └── [2.7K] resnet.py
│ │ ├── [4.0K] diffusion
│ │ │ ├── [4.0K] __pycache__
│ │ │ │ ├── [6.6K] diffusion.cpython-38.pyc
│ │ │ │ ├── [2.9K] ema.cpython-38.pyc
│ │ │ │ └── [8.3K] utils.cpython-38.pyc
│ │ │ ├── [7.4K] diffusion.py
│ │ │ ├── [2.9K] ema.py
│ │ │ ├── [ 12K] plms.py
│ │ │ └── [8.7K] utils.py
│ │ ├── [ 838] distributions.py
│ │ ├── [4.0K] genforecast
│ │ │ ├── [4.0K] __pycache__
│ │ │ │ ├── [1.4K] analysis.cpython-38.pyc
│ │ │ │ ├── [1.0K] training.cpython-38.pyc
│ │ │ │ └── [ 11K] unet.cpython-38.pyc
│ │ │ ├── [1.0K] analysis.py
│ │ │ ├── [1.1K] training.py
│ │ │ └── [ 17K] unet.py
│ │ ├── [4.0K] nowcast
│ │ │ ├── [4.0K] __pycache__
│ │ │ │ └── [8.4K] nowcast.cpython-38.pyc
│ │ │ └── [8.3K] nowcast.py
│ │ └── [ 770] utils.py
│ └── [4.0K] visualization
│ ├── [1.2K] cm.py
│ └── [ 11K] plots.py
├── [4.0K] ldcast.egg-info
│ ├── [5.9K] PKG-INFO
│ ├── [ 175] SOURCES.txt
│ ├── [ 1] dependency_links.txt
│ ├── [ 137] requires.txt
│ └── [ 1] top_level.txt
├── [4.0K] models
│ ├── [4.0K] autoenc
│ │ └── [1.5M] autoenc-32-0.01.pt
│ ├── [4.0K] autoenc_train
│ │ ├── [4.6M] epoch=0-val_rec_loss=0.2204.ckpt
│ │ ├── [4.6M] epoch=0-val_rec_loss=nan.ckpt
│ │ └── [4.6M] epoch=1-val_rec_loss=0.1653.ckpt
│ └── [4.0K] genforecast
│ └── [5.0G] genforecast-radaronly-256x256-20step.pt
├── [4.0K] results
├── [4.0K] scripts
│ ├── [4.0K] __pycache__
│ │ └── [4.0K] train_nowcaster.cpython-38.pyc
│ ├── [4.0K] dwd_dataset.py
│ ├── [1.4K] eval_data.py
│ ├── [1.5K] eval_dgmr.py
│ ├── [4.4K] eval_genforecast.py
│ ├── [1.4K] eval_pysteps.py
│ ├── [3.8K] forecast_demo.py
│ ├── [4.0K] lightning_logs
│ │ └── [4.0K] version_0
│ │ ├── [4.0K] events.out.tfevents.1686053670.{VM_NAME}
│ │ └── [ 3] hparams.yaml
│ ├── [3.5K] metrics.py
│ ├── [ 13K] plots_genforecast.py
│ ├── [2.8K] train_autoenc.py
│ ├── [3.4K] train_genforecast.py
│ └── [4.1K] train_nowcaster.py
├── [ 951] setup.py
└── [ 0] tmp4_0607
37 directories, 149 files
Could it be something related to version compatibility of packages, e.g., dask or numba? Or perhaps I'm missing something in the `data` directory.
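To make the version question easier to answer, the installed versions can be listed with the standard library. This is a minimal sketch: the package list is an assumption based on the packages named here plus numpy, and `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata


def installed_versions(packages=("numpy", "dask", "numba")):
    """Return a dict mapping each package to its version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

`importlib.metadata` is available from Python 3.8, which matches the environment shown in the traceback.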
@jleinonen please let me know how I can provide further information - and thanks in advance!
Hi @caglarkucuk, you say above that you ran `python train_autoenc.py --model_dir="../models/autoenc_train"` successfully and then that the same command failed. Did you mean to have a different command on the second line?
Sorry, I pasted the wrong command while creating the issue.
Edited the original post, apologies for the confusion
This is a very strange bug. The LDM training uses the same sampler code as the autoencoder training, so I don't understand why the latter would work but not the former. Could you paste the layout of your `data` directory and a longer traceback of the error? Also, maybe remove the `sampler_nowcaster_*.pkl` files from the `cache` directory and run it again to see if the error recurs?
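Clearing the cached samplers can be scripted so that only the nowcaster files are touched. This is a minimal sketch: `remove_cached_samplers` is a hypothetical helper, and the default pattern is the one named above.

```python
from pathlib import Path


def remove_cached_samplers(cache_dir, pattern="sampler_nowcaster_*.pkl"):
    """Delete files in cache_dir that match pattern; return the removed names."""
    removed = []
    for path in sorted(Path(cache_dir).glob(pattern)):
        if path.is_file():
            path.unlink()
            removed.append(path.name)
    return removed


# Example, run from the scripts/ directory:
# remove_cached_samplers("../cache")
```

The `sampler_autoenc_*.pkl` files do not match the pattern, so they stay in place.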
Hi @jleinonen, I hit the same error that @caglarkucuk reported when I ran `python train_autoenc.py --model_dir="../models/autoenc_train"` in the `scripts/` directory. I have downloaded the data you provided and put it in the `data/` directory. No sampler files were produced in the `cache/` directory.
Thanks in advance!
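When no sampler files are produced at all, one quick check is whether the patch files are where the scripts would look for them. The sketch below assumes the layout shown in the original post (`data/<variable>/patches_*.nc`). `count_patch_files` is a hypothetical helper and does not reflect how ldcast itself discovers files.

```python
from collections import defaultdict
from pathlib import Path


def count_patch_files(data_dir):
    """Map each subdirectory of data_dir to its number of patches_*.nc files."""
    counts = defaultdict(int)
    for path in Path(data_dir).glob("*/patches_*.nc"):
        counts[path.parent.name] += 1
    return dict(counts)


# Example, run from the scripts/ directory:
# print(count_patch_files("../data"))
```

For the directory tree in the original post, this would report 6 files under `RV` and 24 under `RZC`.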
Thanks @jleinonen for the quick response. I updated the original issue to provide further information, based on your suggestions.