jolibrain/joliGEN

Problem running docker

YoannRandon opened this issue · 17 comments

Hi,
I have some questions about how to build the Dockerfile.
As a first step, I tried to build both the "Dockerfile.build" and "Dockerfile.server" files. The "build" one builds correctly, but when I run it, the container exits immediately. Is that normal?
Moreover, I can't build the Dockerfile.server because of credentials. I have the credentials but I don't know where to put them, and if I try to connect using the URL "https://docker.joligan.com/v2/joligan_build/manifests/latest", I end up with:
"{"errors":[{"code":"MANIFEST_UNKNOWN","message":"manifest unknown","detail":{"Tag":"latest"}}]}"
Can you help me build and run those Dockerfiles correctly?

In fact I'm not that interested in the server. I would like to know if it's possible to just build the Dockerfile and do inference using models downloaded from "https://confiance.joligan.com/#/models" (it is the joligan server, so I can already get models from there). I am more interested in Dockerfile.build and how to run it. Thanks

beniz commented

I would like to know if it's possible to just build the dockerfile and do inference using models

Yes you can do this, though the build docker is not exactly designed for this, as follows:

nvidia-docker run -v /path/to/models/:/models/ -v /path/to/images/:/images/ --rm --gpus all -it --entrypoint bash jolibrain/joligan_build

This gets you a running docker with a root user inside it. The -v options mount the local models path to /models/ inside the docker, and the local images path to /images/.

From there you can run inference, e.g.

cd scripts
python3 gen_single_image.py --model-in-file /models/xxx/latest_net_G_A.pth --img-in /images/xxx.png --img-out /path/to/out/image.png

I resolved my problem running the Dockerfile.build image by appending "tail -f /dev/null" to the docker run command (see the example below).
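Something along these lines keeps the container alive so that a shell can later be opened with docker exec (paths are placeholders, following the example given above):

nvidia-docker run -d --gpus all -v /path/to/models/:/models/ -v /path/to/images/:/images/ --entrypoint tail jolibrain/joligan_build -f /dev/null
docker exec -it <container_id> bash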
One problem remains: I tried to launch an inference using a model from the joligan server with the command:
"
python3 gen_single_image.py
--model-in-file /app/pretrained_weights_models/bdd100k_weather_det_clear2snowy_mm1/latest_net_G_A.pth
--img-size 512
--img-in /app/sample_bdd100k_img/8221f03e-7a27e32f.jpg
--img-out 8221f03e-7a27e32f_snowy.jpg
--gpuid 1
"
I received a CUDA error, "invalid device ordinal", but this error seems to come from my docker/cuda setup.
So I tried to use the CPU instead (for inference) by replacing "--gpuid" with "--cpu", following the gen_single_image.py argument description, to avoid this error, but it returns 'name "device" is not defined'.

beniz commented

but it returns 'name "device" is not defined'.

This is a bug, I just fixed it on master, see bb3c70c
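For reference, the fix presumably boils down to defining the torch device before the model is loaded; a minimal sketch of the idea (not the exact code of the commit) is:

import torch

# pick the CPU when --cpu is passed, otherwise the requested GPU
device = torch.device("cpu" if args.cpu else "cuda:" + str(args.gpuid))
model, opt = load_model(modelpath, os.path.basename(args.model_in_file), device)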

beniz commented

I received a CUDA error, "invalid device ordinal", but this error seems to come from my docker/cuda setup.

try nvidia-smi and make sure you have two GPUs available since you are asking for GPU 1 (0 is the first one).
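A quick way to check what torch itself sees inside the container (a small diagnostic snippet, independent of joliGEN):

import torch

print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())  # must be >= 2 to use --gpuid 1
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))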

Hi, I still have a problem with the gen_single_image.py script. I did build the Dockerfile, and when I tried the command:

" python3 gen_single_image.py --model-in-file /app/pretrained_weights_models/bdd100k_weather_det_clear2snowy_mm1/latest_net_G_A.pth --img-in /app/sample_bdd100k_img/val/8221f03e-7a27e32f.jpg --img-out 8221f03e-7a27e32f_snowy.jpg --gpuid 1"

I got the following error:

"Traceback (most recent call last):
File "gen_single_image.py", line 60, in
model, opt = load_model(modelpath, os.path.basename(args.model_in_file), device)
File "gen_single_image.py", line 28, in load_model
opt = TrainOptions().parse_json(train_json)
File "/app/scripts/../options/base_options.py", line 925, in parse_json
self._json_parse_known_args(parser, opt, flat_json)
File "/app/scripts/../options/base_options.py", line 882, in _json_parse_known_args
raise ValueError(
ValueError: data_online_creation_mask_delta_A: Bad type (<class 'int'>, should be list of <class 'int'>)"

I already replaced "cut_semantic_mask" with "cut" in the train_config.json of the "bdd100k_weather_det_clear2snowy_mm1" model downloaded from the joligan server. It seems the problem comes from base_options.py, but I can't find what to change.

I think the problem may come from the train_config.json; I'll put it below.

I received a CUDA error, "invalid device ordinal", but this error seems to come from my docker/cuda setup.

try nvidia-smi and make sure you have two GPUs available since you are asking for GPU 1 (0 is the first one).

I already checked and both GPUs 1 and 2 are shown in nvidia-smi. For an unknown reason, that problem seems to have resolved itself, but another one occurred: the "data_online_creation_mask_delta" error mentioned just above.

train_config.json

{
"D": {
"dropout": false,
"n_layers": 3,
"ndf": 64,
"netDs": [
"projected_d",
"basic",
"vision_aided"
],
"no_antialias": false,
"no_antialias_up": false,
"norm": "instance",
"proj_config_segformer": "models/configs/segformer/segformer_config_b0.py",
"proj_interp": 512,
"proj_network_type": "vitsmall",
"proj_weight_segformer": "models/configs/segformer/pretrain/segformer_mit-b0.pth",
"spectral": false,
"temporal_every": 4,
"temporal_frame_step": 30,
"temporal_num_common_char": -1,
"temporal_number_frames": 5,
"vision_aided_backbones": "clip+dino"
},
"G": {
"attn_nb_mask_attn": 10,
"attn_nb_mask_input": 1,
"backward_compatibility_twice_resnet_blocks": false,
"config_segformer": "models/configs/segformer/segformer_config_b0.py",
"dropout": false,
"netE": "resnet_512",
"netG": "segformer_attn_conv",
"ngf": 64,
"norm": "instance",
"padding_type": "reflect",
"spectral": false,
"stylegan2_num_downsampling": 1
},
"alg": {
"cut": {
"flip_equivariance": false,
"lambda_GAN": 1.0,
"lambda_NCE": 1.0,
"nce_T": 0.07,
"nce_idt": true,
"nce_includes_all_negatives_from_minibatch": false,
"nce_layers": "0,4,8,12,16",
"netF": "mlp_sample",
"netF_dropout": false,
"netF_nc": 256,
"netF_norm": "instance",
"num_patches": 256
},
"cyclegan": {},
"re": {
"P_lr": 0.0002,
"adversarial_loss_p": false,
"netP": "unet_128",
"no_train_P_fake_images": false,
"nuplet_size": 3,
"projection_threshold": 1.0
}
},
"data": {
"online_creation": {
"crop_delta_A": 64,
"crop_delta_B": 64,
"crop_size_A": 512,
"crop_size_B": 512,
"mask_delta_A": 0,
"mask_delta_B": 0,
"mask_square_A": false,
"mask_square_B": false
},
"crop_size": 512,
"dataset_mode": "unaligned_labeled_mask_online",
"direction": "AtoB",
"load_size": 512,
"max_dataset_size": 1000000000,
"num_threads": 4,
"online_context_pixels": 0,
"preprocess": "resize_and_crop",
"relative_paths": false,
"sanitize_paths": false,
"serial_batches": false
},
"f_s": {
"all_classes_as_one": false,
"class_weights": [
1,
10,
10,
1,
5,
5,
10,
10,
30,
50,
50
],
"config_segformer": "models/configs/segformer/segformer_config_b0.py",
"dropout": false,
"net": "segformer",
"nf": 64,
"semantic_nclasses": 11,
"semantic_threshold": 1.0,
"weight_segformer": ""
},
"output": {
"display": {
"G_attention_masks": false,
"diff_fake_real": false,
"env": "bdd100k_weather_det_clear2snowy_mm1",
"freq": 200,
"id": 1,
"ncols": 4,
"networks": false,
"port": 8097,
"server": "http://localhost",
"winsize": 256
},
"no_html": false,
"print_freq": 200,
"update_html_freq": 1000,
"verbose": false
},
"model": {
"init_gain": 0.02,
"init_type": "normal",
"input_nc": 3,
"multimodal": true,
"output_nc": 3
},
"train": {
"sem": {
"cls_B": false,
"cls_pretrained": false,
"cls_template": "basic",
"idt": true,
"l1_regression": false,
"lambda": 1.0,
"lr_f_s": 0.0002,
"net_output": false,
"regression": false,
"use_label_B": true
},
"mask": {
"charbonnier_eps": 1e-06,
"disjoint_f_s": false,
"f_s_B": true,
"for_removal": false,
"lambda_out_mask": 10.0,
"loss_out_mask": "L1",
"no_train_f_s_A": false,
"out_mask": false
},
"D_accuracy_every": 1000,
"D_lr": 0.0001,
"G_ema": true,
"G_ema_beta": 0.999,
"G_lr": 0.0002,
"batch_size": 2,
"beta1": 0.9,
"beta2": 0.999,
"compute_D_accuracy": false,
"compute_fid": false,
"compute_fid_val": false,
"continue": false,
"epoch": "latest",
"epoch_count": 1,
"fid_every": 1000,
"gan_mode": "lsgan",
"iter_size": 4,
"load_iter": 0,
"lr_decay_iters": 50,
"lr_policy": "linear",
"mm_lambda_z": 0.5,
"mm_nz": 16,
"n_epochs": 100,
"n_epochs_decay": 100,
"nb_img_max_fid": 1000000000,
"optim": "adam",
"pool_size": 50,
"save_by_iter": false,
"save_epoch_freq": 1,
"save_latest_freq": 5000,
"use_contrastive_loss_D": false
},
"dataaug": {
"APA": false,
"APA_every": 4,
"APA_nimg": 50,
"APA_p": 0,
"APA_target": 0.6,
"D_label_smooth": false,
"D_noise": 0.01,
"affine": 0.0,
"affine_scale_max": 1.2,
"affine_scale_min": 0.8,
"affine_shear": 45,
"affine_translate": 0.2,
"diff_aug_policy": "",
"diff_aug_proba": 0.5,
"imgaug": false,
"no_flip": false,
"no_rotate": true
},
"checkpoints_dir": "/data1/confiance_platform/checkpoints/",
"dataroot": "/data1/confiance/datasets/bdd100k_weather_clear2snowy/",
"ddp_port": "13456",
"gpu_ids": "2",
"model_type": "cut",
"name": "bdd100k_weather_det_clear2snowy_mm1",
"phase": "train",
"suffix": "",
"warning_mode": false
}

beniz commented

the "data_online_creation_mask_delta" error mentioned just above

This is because the option has changed; you can fix it easily by editing the train_config.json file to set:

"mask_delta_A": [0],
"mask_delta_B": [0]

We had to do it ourselves on other models as well.
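If several downloaded models need the same fix, a small script like the following can patch them; this is a sketch, the path is a placeholder, and the keys are assumed to match the train_config.json shown above:

import json

path = "/models/bdd100k_weather_det_clear2snowy_mm1/train_config.json"  # placeholder path
with open(path) as f:
    cfg = json.load(f)

oc = cfg["data"]["online_creation"]
for key in ("mask_delta_A", "mask_delta_B"):
    if isinstance(oc.get(key), int):  # wrap the old scalar value into a list
        oc[key] = [oc[key]]

with open(path, "w") as f:
    json.dump(cfg, f, indent=4)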

(screenshot attachment: cuda_error)

My problem with CUDA is not gone; I think it comes from my setup, even though I built the Dockerfile.
I tested whether my GPUs were available by adding the following lines to the gen_single_image.py script:

"
modelpath = args.model_in_file.replace(os.path.basename(args.model_in_file), "")
print("modelpath=", modelpath)
use_cuda = torch.cuda.is_available()
print("cuda device is availaible :",use_cuda)
GPUtil.getAvailable()
"

and it seems that everything is OK. If you have already seen something like this, can you give me a tip to correct it? Thanks

I have the same error when I use the "--cpu" argument.

beniz commented

If you haven't done so yet, you should rebuild the docker image so that it runs the latest code. Or you can patch from within the docker, whichever you prefer.

beniz commented

can you give me a tip to correct it

First, make sure nvidia-smi works correctly from inside the docker, and look at the list of GPUs.

Try export CUDA_VISIBLE_DEVICES=1, and then use --gpuid 0. You may have to set the env variable in the Dockerfile as well...
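The idea is that CUDA_VISIBLE_DEVICES remaps the device ordinals, so the second physical GPU becomes cuda:0 as far as torch is concerned; a quick diagnostic sketch to verify this from inside the container:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # must be set before CUDA is initialized

import torch

print(torch.cuda.device_count())      # should now report 1
print(torch.cuda.get_device_name(0))  # name of the physical GPU 1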

Hi,

I checked several things and I still can't find why I get this CUDA error: invalid device ordinal.
nvidia-smi works well, and I can get my GPU names and ids with torch.

Using "export CUDA_VISIBLE_DEVICES=1" didn't solve the problem.
I also tried changing module versions but I still got the same error.
This error also occurs when I use the "--cpu" argument of gen_single_image.py.
I currently use:

python 3.9.13
torch 1.12.1+cu116
torchvision 0.13.1+cu116
cuda version (nvidia-smi) : 11.8
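For completeness, these are easy to dump from inside the container (a small diagnostic snippet; note that nvidia-smi reports the driver's CUDA version, which can legitimately differ from the cu116 toolkit torch was built against):

import torch
import torchvision

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("torch CUDA build:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())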

May I know what your config is when you run the gen_single_image.py script?
I'll try to reproduce it. Thanks

(screenshot attachment: tmp3)

Training works fine; it seems the problem only occurs during inference.

Hi @YoannRandon ,
#322 should solve your issue; please let us know if you still have any problems.

beniz commented

@YoannRandon you need to rebuild your docker though.