RAM fills up during the training phase
alicamdal opened this issue · 7 comments
Hi,
I am trying to reproduce FLAIR-1 using this repo. I have downloaded the dataset and started training. However, after several steps, the training process starts filling up RAM. I have tried disabling the metrics, etc., but the memory usage still keeps growing.
Could you provide some details about this?
Thanks for your time.
Hello @alicamdal, could you provide your config file and the OS you are using?
```yaml
#### DATA PATHS
paths:
  out_folder: 'FLAIR-1/test1/'
  out_model_name: 'test_1'
  ######## TRAIN NEEDED
  # train_csv: '/csv_toy/flair-1-paths-toy-train_ag.csv'
  # val_csv: '/csv_toy/flair-1-paths-toy-val_ag.csv'
  train_csv: "FLAIR-1/csv_full/flair-1-paths-train.csv"
  val_csv: "FLAIR-1/csv_full/flair-1-paths-val.csv"
  ######## PREDICT (PATCH) NEEDED
  test_csv: "FLAIR-1/csv_full/flair-1-paths-test.csv"
  ckpt_model_path: "FLAIR-1/test_1/checkpoints/ckpt-epoch=03-val_loss=0.93_test_1.ckpt"
  path_metadata_aerial: '/FLAIR-1/flair_aerial_metadata.json'

#### USAGE
tasks:
  train: True
  train_load_ckpt: False
  predict: False
  metrics: False
  delete_preds: False

#### TRAINING CONF
model_architecture: 'unet'
encoder_name: 'resnet34'
use_augmentation: False
use_metadata: False   # Can be True if FLAIR dataset
channels: [1,2,3,4,5] # starts at 1
seed: 2022

#### HYPERPARAMETERS
batch_size: 16
learning_rate: 0.02
num_epochs: 50

#### DATA CONF
use_weights: True
classes:  # k = value in MSK : v = [weight, name]
  1:  [1, 'building']
  2:  [1, 'pervious surface']
  3:  [1, 'impervious surface']
  4:  [1, 'bare soil']
  5:  [1, 'water']
  6:  [1, 'coniferous']
  7:  [1, 'deciduous']
  8:  [1, 'brushwood']
  9:  [1, 'vineyard']
  10: [1, 'herbaceous vegetation']
  11: [1, 'agricultural land']
  12: [1, 'plowed land']
  13: [0, 'other']
  # 13: [1, 'swimming_pool']
  # 14: [1, 'snow']
  # 15: [0, 'clear cut']
  # 16: [0, 'mixed']
  # 17: [0, 'ligneous']
  # 18: [1, 'greenhouse']
  # 19: [0, 'other']

#### NORMALIZATION
norm_type: custom # [scaling, custom, without], default: scaling to range [0,1], see github readme
norm_means: [105.08,110.87,101.82,106.38,53.26] # same length (order) as channels
norm_stds: [52.17,45.38,44,39.69,79.3]          # same length (order) as channels

#### PREDICT CONF
georeferencing_output: False

#### COMPUTATIONAL RESSOURCES
accelerator: gpu # or cpu
num_nodes: 1
gpus_per_node: 1
strategy: "auto" # null if only one GPU, else 'ddp'
num_workers: 1

#### PRINT PROGRESS
cp_csv_and_conf_to_output: True
enable_progress_bar: True
progress_rate: 10
```
This is my training config file. I am using Ubuntu 20.04. Thanks for your time.
Thank you, @alicamdal.
Could you try to increase the number of workers (e.g. to 10) and check if the problem persists?
Also, you could try null (or None) for the strategy instead of 'auto' if you are using only 1 GPU, to rule out a distributed-training misconfiguration.
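For reference, a sketch of the relevant block of your config with those suggestions applied (values are only illustrative, based on the file you posted):

```yaml
#### COMPUTATIONAL RESSOURCES
accelerator: gpu
num_nodes: 1
gpus_per_node: 1
strategy: null   # single GPU: no distributed strategy; use 'ddp' only for multi-GPU runs
num_workers: 10  # more DataLoader workers, to check whether the memory behaviour changes
```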
Thanks for the suggestions. I have tried increasing the number of workers, but it didn't help. I will try changing the strategy parameter as you suggested. @agarioud
If you can attach the log file output, it might help, @alicamdal.
I cannot reproduce the error on a single GPU with Ubuntu 22; the memory load stays stable.
@alicamdal
Any news on your problem? Otherwise I'll close the issue, as I am unable to reproduce the error.
Feel free to reopen if the error persists.