RAM fills up during the training phase
alicamdal opened this issue · 7 comments
Hi,
I am trying to reproduce FLAIR-1 using this repo. I have downloaded the dataset and started training. However, after several steps, the training process starts filling up RAM. I have tried disabling the metrics, etc., but the memory usage still keeps growing.
Could you provide some details about this?
Thanks for your time.
Hello @alicamdal, could you provide your config file and the OS you are using?
```yaml
#### DATA PATHS
paths:
  out_folder: 'FLAIR-1/test1/'
  out_model_name: 'test_1'
  ######## TRAIN NEEDED
  # train_csv: '/csv_toy/flair-1-paths-toy-train_ag.csv'
  # val_csv: '/csv_toy/flair-1-paths-toy-val_ag.csv'
  train_csv: "FLAIR-1/csv_full/flair-1-paths-train.csv"
  val_csv: "FLAIR-1/csv_full/flair-1-paths-val.csv"
  ######## PREDICT (PATCH) NEEDED
  test_csv: "FLAIR-1/csv_full/flair-1-paths-test.csv"
  ckpt_model_path: "FLAIR-1/test_1/checkpoints/ckpt-epoch=03-val_loss=0.93_test_1.ckpt"
  path_metadata_aerial: '/FLAIR-1/flair_aerial_metadata.json'

#### USAGE
tasks:
  train: True
  train_load_ckpt: False
  predict: False
  metrics: False
  delete_preds: False

#### TRAINING CONF
model_architecture: 'unet'
encoder_name: 'resnet34'
use_augmentation: False
use_metadata: False   # Can be True if FLAIR dataset
channels: [1,2,3,4,5] # starts at 1
seed: 2022

#### HYPERPARAMETERS
batch_size: 16
learning_rate: 0.02
num_epochs: 50

#### DATA CONF
use_weights: True
classes:  # k = value in MSK : v = [weight, name]
  1:  [1, 'building']
  2:  [1, 'pervious surface']
  3:  [1, 'impervious surface']
  4:  [1, 'bare soil']
  5:  [1, 'water']
  6:  [1, 'coniferous']
  7:  [1, 'deciduous']
  8:  [1, 'brushwood']
  9:  [1, 'vineyard']
  10: [1, 'herbaceous vegetation']
  11: [1, 'agricultural land']
  12: [1, 'plowed land']
  13: [0, 'other']
  # 13: [1, 'swimming_pool']
  # 14: [1, 'snow']
  # 15: [0, 'clear cut']
  # 16: [0, 'mixed']
  # 17: [0, 'ligneous']
  # 18: [1, 'greenhouse']
  # 19: [0, 'other']

#### NORMALIZATION
norm_type: custom # [scaling, custom, without], default: scaling to range [0,1], see github readme
norm_means: [105.08,110.87,101.82,106.38,53.26] # same length (order) as channels
norm_stds: [52.17,45.38,44,39.69,79.3]          # same length (order) as channels

#### PREDICT CONF
georeferencing_output: False

#### COMPUTATIONAL RESSOURCES
accelerator: gpu # or cpu
num_nodes: 1
gpus_per_node: 1
strategy: "auto" # null if only one GPU, else 'ddp'
num_workers: 1

#### PRINT PROGRESS
cp_csv_and_conf_to_output: True
enable_progress_bar: True
progress_rate: 10
```
This is my training config file. I am using Ubuntu 20.04. Thanks for your time.
Thank you, @alicamdal.
Could you try to increase the number of workers (e.g. to 10) and check if the problem persists?
Also, you could try null (or None) for the strategy instead of 'auto' if you are using only 1 GPU, to rule out a distributed-training misconfiguration.
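For reference, a sketch of the relevant block of your config with those suggestions applied (values are only illustrative, based on the file you posted):

```yaml
#### COMPUTATIONAL RESSOURCES
accelerator: gpu
num_nodes: 1
gpus_per_node: 1
strategy: null   # single GPU: no distributed strategy; use 'ddp' only for multi-GPU runs
num_workers: 10  # more DataLoader workers, to check whether the memory behaviour changes
```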
Thanks for the suggestions. I have tried increasing the number of workers, but it didn't help. I will try changing the strategy parameter as you suggested. @agarioud
If you can attach the log file output, it might help, @alicamdal.
I cannot reproduce the error on a single GPU with Ubuntu 22; the memory load stays stable.
@alicamdal
Any news on your problem? Otherwise I'll close the issue, as I am unable to reproduce the error.
Feel free to reopen if the error persists.