Reproducibility Issue
lizeyan opened this issue · 1 comment
lizeyan commented
Below are my environment, the command I ran, and the output it produced for one experiment.
Environment
Machine
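Hardware Overview:
Model Name: MacBook Pro
Model Identifier: MacBookPro14,3
Processor Name: Quad-Core Intel Core i7
Processor Speed: 2.9 GHz
Number of Processors: 1
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache: 8 MB
Hyper-Threading Technology: Enabled
Memory: 16 GB
System Firmware Version: 451.140.1.0.0
OS Loader Version: 540.120.3~19
SMC Version (system): 2.45f5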
Docker Info
$ docker info
Client:
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc., v0.8.2)
compose: Docker Compose (Docker Inc., v2.6.1)
extension: Manages Docker extensions (Docker Inc., v0.2.7)
sbom: View the packaged-based Software Bill Of Materials (SBOM) for an image (Anchore Inc., 0.6.0)
scan: Docker Scan (Docker Inc., v0.17.0)
Server:
Containers: 1
Running: 0
Paused: 0
Stopped: 1
Images: 2
Server Version: 20.10.17
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
runc version: v1.1.2-0-ga916309
init version: de40ad0
Security Options:
seccomp
Profile: default
cgroupns
Kernel Version: 5.10.104-linuxkit
Operating System: Docker Desktop
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.774GiB
Name: docker-desktop
ID: WKZ5:6KJZ:3K6S:I7WY:LTLR:3TNP:D23G:N3C7:6QPG:KG44:WEEK:CVMR
Docker Root Dir: /var/lib/docker
Debug Mode: false
HTTP Proxy: http.docker.internal:3128
HTTPS Proxy: http.docker.internal:3128
No Proxy: hubproxy.docker.internal
Username: lizytalk
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
hubproxy.docker.internal:5000
127.0.0.0/8
Live Restore Enabled: false
Docker Image Info
REPOSITORY TAG IMAGE ID CREATED SIZE
lizytalk/dejavu latest 32d6db301926 2 months ago 17.3GB
Command
docker run -it --rm -v $(realpath .):/workspace lizytalk/dejavu bash -c 'source .envrc && python exp/run_GAT_node_classification.py -H=4 -L=8 -fe=GRU -bal=True --data_dir=./data/A1 --max_epoch=20'
Note that --max_epoch=20 is set only to check quickly that the program runs; it is not a full training run.
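The command passes -bal=True, which appears in the logged config as 'balance_train_set': True. The output below reports how often each fault type's training ids are repeated ("repeat [...] for N times") and ends with len(train_list)=55 len(set(train_list))=29. Here is a minimal sketch that reproduces those two numbers. It assumes the repeat factor is the largest class size integer-divided by the class size; that rule is inferred from the log lines and is not taken from the DejaVu source. The helper name balance_by_repetition is hypothetical.

```python
from typing import Dict, List


def balance_by_repetition(train_ids_by_type: Dict[str, List[int]]) -> List[int]:
    """Repeat each fault type's training ids so class sizes become comparable.

    Assumed rule (inferred from the log, not from the DejaVu code):
    repeat = size of the largest class // size of this class.
    """
    largest = max(len(ids) for ids in train_ids_by_type.values())
    balanced: List[int] = []
    for ids in train_ids_by_type.values():
        repeat = max(1, largest // len(ids))
        balanced.extend(ids * repeat)
    return balanced


# Training ids copied from the split_failures_by_type log lines in this issue.
train_ids_by_type = {
    "Docker CPU": [5, 23, 12, 37, 0, 58, 30],
    "Docker": [35, 25, 56, 71, 53, 17, 13, 9, 6, 1, 7, 26],
    "DB Session": [36, 47],
    "DB State": [46, 11],
    "OS Network": [22, 39, 64, 61, 20, 38],
}

train_list = balance_by_repetition(train_ids_by_type)
print(len(train_list), len(set(train_list)))  # 55 29
```

Under this assumed rule the repeat factors come out as 1, 1, 6, 6 and 2, which matches the "repeat ... for N times" lines in the output.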
Output
=============
== PyTorch ==
=============
NVIDIA Release 21.11 (build 29224839)
PyTorch Version 1.11.0a0+b6df043
Container image Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright (c) 2014-2021 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.
NVIDIA Deep Learning Profiler (dlprof) Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use the NVIDIA Container Toolkit to start this container with GPU support; see
https://docs.nvidia.com/datacenter/cloud-native/ .
NOTE: MOFED driver for multi-node communication was not detected.
Multi-node communication performance may be reduced.
NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for PyTorch. NVIDIA recommends the use of the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...
Using backend: pytorch
2022-09-03 03:02:35.245 | INFO | failure_dependency_graph.FDG_config:process_args:62 - torch.cuda.is_available()=False
2022-09-03 03:02:35.332 | INFO | DejaVu.workflow:_train_exp_CFL:34 -
================================================Config=============================================
{'FI_feature_dim': 3,
'GAT_layers': 8,
'GAT_num_heads': 4,
'GAT_residual': True,
'GAT_shared_feature_mapper': False,
'augmentation': False,
'balance_train_set': True,
'batch_size': 16,
'cache_dir': PosixPath('/tmp/SSF/.cache'),
'checkpoint_metric': 'val_loss',
'cuda': False,
'data_dir': PosixPath('data/A1'),
'dataset_split_ratio': (0.4, 0.2, 0.4),
'display_epoch_freq': 10,
'display_second_freq': 5,
'drop_FDG_edges_fraction': 0.0,
'dropout': False,
'early_stopping_epoch_patience': 500,
'es': True,
'faults_path': None,
'feature_projector_type': 'GRU',
'flush_dataset_cache': True,
'gradient_clip_val': 1.0,
'graph_config_path': None,
'init_lr': 0.01,
'max_epoch': 20,
'metrics_path': None,
'output_base_path': PosixPath('/SSF/output'),
'output_dir': PosixPath('/SSF/output/run_GAT_node_classification.py.2022-09-03T03:02:35.245139'),
'p': 0.25,
'q': 0.25,
'random_walk_length': 8,
'rec_loss_weight': 1.0,
'test_batch_size': 128,
'test_epoch_freq': 100,
'test_second_freq': 30.0,
'train_set_repeat': 1,
'train_set_sampling': 1.0,
'ts_feature_mode': 'full',
'use_anomaly_direction_constraint': False,
'valid_epoch_freq': 10,
'weight_decay': 0.01,
'window_size': (10, 10)}
===================================================================================================
2022-09-03 03:02:36.842 | INFO | DejaVu.workflow:_train_exp_CFL:39 - reproducibility info: {'command_line': 'python exp/run_GAT_node_classification.py -H=4 -L=8 -fe=GRU -bal=True --data_dir=./data/A1 --max_epoch=20', 'time': 'Sat Sep 3 03:02:35 2022', 'git_root': '/workspace', 'git_url': 'https://github.com/NetManAIOps/DejaVu/tree/00d36dd07eed266840840769ecbc4abf0322319a', 'git_has_uncommitted_changes': False}
2022-09-03 03:02:37.800 | INFO | failure_dependency_graph.failure_dependency_graph:_load_FDG:206 - Loading FDG from data/A1/FDG.pkl
2022-09-03 03:02:40.971 | INFO | failure_dependency_graph.model_interface:__init__:47 - dataset_cache_dir=/tmp/SSF/.cache/faults=data_A1_faults.csv.graph=data_A1_graph.yml.metrics=data_A1_metrics.norm.pkl.use_anomaly_direction_constraint=False
2022-09-03 03:02:41.019 | INFO | failure_dependency_graph.model_interface:split_failures_by_type:209 - fault ids with multiple root causes: []
2022-09-03 03:02:41.019 | INFO | failure_dependency_graph.model_interface:split_failures_by_type:234 - fault_type=('Docker CPU',)
train_length=7 train_ids=[5, 23, 12, 37, 0, 58, 30]
validation_length=4 validation_ids=[75, 29, 3, 15]
test_length=8 test_ids=[65, 59, 18, 50, 44, 41, 8, 52]
(7 recurring faults)
2022-09-03 03:02:41.020 | INFO | failure_dependency_graph.model_interface:split_failures_by_type:234 - fault_type=('Docker',)
train_length=12 train_ids=[35, 25, 56, 71, 53, 17, 13, 9, 6, 1, 7, 26]
validation_length=6 validation_ids=[2, 21, 63, 68, 70, 54]
test_length=12 test_ids=[24, 31, 28, 77, 48, 49, 16, 67, 76, 62, 60, 69]
(12 recurring faults)
2022-09-03 03:02:41.021 | INFO | failure_dependency_graph.model_interface:split_failures_by_type:234 - fault_type=('DB Session',)
train_length=2 train_ids=[36, 47]
validation_length=2 validation_ids=[74, 45]
test_length=3 test_ids=[19, 4, 57]
(1 recurring faults)
2022-09-03 03:02:41.021 | INFO | failure_dependency_graph.model_interface:split_failures_by_type:234 - fault_type=('DB State',)
train_length=2 train_ids=[46, 11]
validation_length=1 validation_ids=[27]
test_length=2 test_ids=[34, 10]
(2 recurring faults)
2022-09-03 03:02:41.022 | INFO | failure_dependency_graph.model_interface:split_failures_by_type:234 - fault_type=('OS Network',)
train_length=6 train_ids=[22, 39, 64, 61, 20, 38]
validation_length=4 validation_ids=[51, 42, 55, 32]
test_length=7 test_ids=[40, 14, 33, 66, 43, 73, 72]
(4 recurring faults)
2022-09-03 03:02:41.023 | INFO | failure_dependency_graph.model_interface:split_failures_by_type:247 - repeat [5, 23, 12, 37, 0, 58, 30] for 1 times
2022-09-03 03:02:41.024 | INFO | failure_dependency_graph.model_interface:split_failures_by_type:247 - repeat [35, 25, 56, 71, 53, 17, 13, 9, 6, 1, 7, 26] for 1 times
2022-09-03 03:02:41.024 | INFO | failure_dependency_graph.model_interface:split_failures_by_type:247 - repeat [36, 47] for 6 times
2022-09-03 03:02:41.024 | INFO | failure_dependency_graph.model_interface:split_failures_by_type:247 - repeat [46, 11] for 6 times
2022-09-03 03:02:41.025 | INFO | failure_dependency_graph.model_interface:split_failures_by_type:247 - repeat [22, 39, 64, 61, 20, 38] for 2 times
2022-09-03 03:02:41.025 | INFO | failure_dependency_graph.model_interface:split_failures_by_type:265 - len(train_list)=55 len(set(train_list))=29 len(validation_list)=17 len(test_list)=32
2022-09-03 03:02:41.081 | INFO | DejaVu.workflow:_train_exp_CFL:51 -
==========================Model Summary================================================================================================================
Layer (type:depth-idx) Param #
=================================================================
GAT --
├─FIFeatureExtractor: 1-1 --
│ └─ModuleList: 2-1 --
│ │ └─GRUFeatureModule: 3-1 --
│ │ │ └─GRU: 4-1 81
│ │ │ └─Sequential: 4-2 --
│ │ │ │ └─Reshape: 5-1 --
│ │ │ │ └─Conv1d: 5-2 100
│ │ │ │ └─GELU: 5-3 --
│ │ │ │ └─Flatten: 5-4 --
│ │ │ │ └─Linear: 5-5 543
│ │ │ │ └─Reshape: 5-6 --
│ │ └─GRUFeatureModule: 3-2 --
│ │ │ └─GRU: 4-3 288
│ │ │ └─Sequential: 4-4 --
│ │ │ │ └─Reshape: 5-7 --
│ │ │ │ └─Conv1d: 5-8 100
│ │ │ │ └─GELU: 5-9 --
│ │ │ │ └─Flatten: 5-10 --
│ │ │ │ └─Linear: 5-11 543
│ │ │ │ └─Reshape: 5-12 --
│ │ └─GRUFeatureModule: 3-3 --
│ │ │ └─GRU: 4-5 81
│ │ │ └─Sequential: 4-6 --
│ │ │ │ └─Reshape: 5-13 --
│ │ │ │ └─Conv1d: 5-14 100
│ │ │ │ └─GELU: 5-15 --
│ │ │ │ └─Flatten: 5-16 --
│ │ │ │ └─Linear: 5-17 543
│ │ │ │ └─Reshape: 5-18 --
│ │ └─GRUFeatureModule: 3-4 --
│ │ │ └─GRU: 4-7 117
│ │ │ └─Sequential: 4-8 --
│ │ │ │ └─Reshape: 5-19 --
│ │ │ │ └─Conv1d: 5-20 100
│ │ │ │ └─GELU: 5-21 --
│ │ │ │ └─Flatten: 5-22 --
│ │ │ │ └─Linear: 5-23 543
│ │ │ │ └─Reshape: 5-24 --
│ │ └─GRUFeatureModule: 3-5 --
│ │ │ └─GRU: 4-9 90
│ │ │ └─Sequential: 4-10 --
│ │ │ │ └─Reshape: 5-25 --
│ │ │ │ └─Conv1d: 5-26 100
│ │ │ │ └─GELU: 5-27 --
│ │ │ │ └─Flatten: 5-28 --
│ │ │ │ └─Linear: 5-29 543
│ │ │ │ └─Reshape: 5-30 --
│ │ └─GRUFeatureModule: 3-6 --
│ │ │ └─GRU: 4-11 153
│ │ │ └─Sequential: 4-12 --
│ │ │ │ └─Reshape: 5-31 --
│ │ │ │ └─Conv1d: 5-32 100
│ │ │ │ └─GELU: 5-33 --
│ │ │ │ └─Flatten: 5-34 --
│ │ │ │ └─Linear: 5-35 543
│ │ │ │ └─Reshape: 5-36 --
│ │ └─GRUFeatureModule: 3-7 --
│ │ │ └─GRU: 4-13 81
│ │ │ └─Sequential: 4-14 --
│ │ │ │ └─Reshape: 5-37 --
│ │ │ │ └─Conv1d: 5-38 100
│ │ │ │ └─GELU: 5-39 --
│ │ │ │ └─Flatten: 5-40 --
│ │ │ │ └─Linear: 5-41 543
│ │ │ │ └─Reshape: 5-42 --
│ │ └─GRUFeatureModule: 3-8 --
│ │ │ └─GRU: 4-15 54
│ │ │ └─Sequential: 4-16 --
│ │ │ │ └─Reshape: 5-43 --
│ │ │ │ └─Conv1d: 5-44 100
│ │ │ │ └─GELU: 5-45 --
│ │ │ │ └─Flatten: 5-46 --
│ │ │ │ └─Linear: 5-47 543
│ │ │ │ └─Reshape: 5-48 --
│ │ └─GRUFeatureModule: 3-9 --
│ │ │ └─GRU: 4-17 63
│ │ │ └─Sequential: 4-18 --
│ │ │ │ └─Reshape: 5-49 --
│ │ │ │ └─Conv1d: 5-50 100
│ │ │ │ └─GELU: 5-51 --
│ │ │ │ └─Flatten: 5-52 --
│ │ │ │ └─Linear: 5-53 543
│ │ │ │ └─Reshape: 5-54 --
│ │ └─GRUFeatureModule: 3-10 --
│ │ │ └─GRU: 4-19 54
│ │ │ └─Sequential: 4-20 --
│ │ │ │ └─Reshape: 5-55 --
│ │ │ │ └─Conv1d: 5-56 100
│ │ │ │ └─GELU: 5-57 --
│ │ │ │ └─Flatten: 5-58 --
│ │ │ │ └─Linear: 5-59 543
│ │ │ │ └─Reshape: 5-60 --
│ │ └─GRUFeatureModule: 3-11 --
│ │ │ └─GRU: 4-21 54
│ │ │ └─Sequential: 4-22 --
│ │ │ │ └─Reshape: 5-61 --
│ │ │ │ └─Conv1d: 5-62 100
│ │ │ │ └─GELU: 5-63 --
│ │ │ │ └─Flatten: 5-64 --
│ │ │ │ └─Linear: 5-65 543
│ │ │ │ └─Reshape: 5-66 --
│ │ └─GRUFeatureModule: 3-12 --
│ │ │ └─GRU: 4-23 81
│ │ │ └─Sequential: 4-24 --
│ │ │ │ └─Reshape: 5-67 --
│ │ │ │ └─Conv1d: 5-68 100
│ │ │ │ └─GELU: 5-69 --
│ │ │ │ └─Flatten: 5-70 --
│ │ │ │ └─Linear: 5-71 543
│ │ │ │ └─Reshape: 5-72 --
│ │ └─GRUFeatureModule: 3-13 --
│ │ │ └─GRU: 4-25 54
│ │ │ └─Sequential: 4-26 --
│ │ │ │ └─Reshape: 5-73 --
│ │ │ │ └─Conv1d: 5-74 100
│ │ │ │ └─GELU: 5-75 --
│ │ │ │ └─Flatten: 5-76 --
│ │ │ │ └─Linear: 5-77 543
│ │ │ │ └─Reshape: 5-78 --
│ │ └─GRUFeatureModule: 3-14 --
│ │ │ └─GRU: 4-27 63
│ │ │ └─Sequential: 4-28 --
│ │ │ │ └─Reshape: 5-79 --
│ │ │ │ └─Conv1d: 5-80 100
│ │ │ │ └─GELU: 5-81 --
│ │ │ │ └─Flatten: 5-82 --
│ │ │ │ └─Linear: 5-83 543
│ │ │ │ └─Reshape: 5-84 --
│ │ └─GRUFeatureModule: 3-15 --
│ │ │ └─GRU: 4-29 243
│ │ │ └─Sequential: 4-30 --
│ │ │ │ └─Reshape: 5-85 --
│ │ │ │ └─Conv1d: 5-86 100
│ │ │ │ └─GELU: 5-87 --
│ │ │ │ └─Flatten: 5-88 --
│ │ │ │ └─Linear: 5-89 543
│ │ │ │ └─Reshape: 5-90 --
│ │ └─GRUFeatureModule: 3-16 --
│ │ │ └─GRU: 4-31 243
│ │ │ └─Sequential: 4-32 --
│ │ │ │ └─Reshape: 5-91 --
│ │ │ │ └─Conv1d: 5-92 100
│ │ │ │ └─GELU: 5-93 --
│ │ │ │ └─Flatten: 5-94 --
│ │ │ │ └─Linear: 5-95 543
│ │ │ │ └─Reshape: 5-96 --
│ │ └─GRUFeatureModule: 3-17 --
│ │ │ └─GRU: 4-33 153
│ │ │ └─Sequential: 4-34 --
│ │ │ │ └─Reshape: 5-97 --
│ │ │ │ └─Conv1d: 5-98 100
│ │ │ │ └─GELU: 5-99 --
│ │ │ │ └─Flatten: 5-100 --
│ │ │ │ └─Linear: 5-101 543
│ │ │ │ └─Reshape: 5-102 --
│ │ └─GRUFeatureModule: 3-18 --
│ │ │ └─GRU: 4-35 144
│ │ │ └─Sequential: 4-36 --
│ │ │ │ └─Reshape: 5-103 --
│ │ │ │ └─Conv1d: 5-104 100
│ │ │ │ └─GELU: 5-105 --
│ │ │ │ └─Flatten: 5-106 --
│ │ │ │ └─Linear: 5-107 543
│ │ │ │ └─Reshape: 5-108 --
│ │ └─GRUFeatureModule: 3-19 --
│ │ │ └─GRU: 4-37 81
│ │ │ └─Sequential: 4-38 --
│ │ │ │ └─Reshape: 5-109 --
│ │ │ │ └─Conv1d: 5-110 100
│ │ │ │ └─GELU: 5-111 --
│ │ │ │ └─Flatten: 5-112 --
│ │ │ │ └─Linear: 5-113 543
│ │ │ │ └─Reshape: 5-114 --
├─Identity: 1-2 --
├─ModuleList: 1-3 --
│ └─GATConv: 2-2 --
│ │ └─Linear: 3-20 36
│ │ └─Dropout: 3-21 --
│ │ └─Dropout: 3-22 --
│ │ └─LeakyReLU: 3-23 --
│ │ └─Linear: 3-24 36
│ └─GATConv: 2-3 --
│ │ └─Linear: 3-25 144
│ │ └─Dropout: 3-26 --
│ │ └─Dropout: 3-27 --
│ │ └─LeakyReLU: 3-28 --
│ │ └─Identity: 3-29 --
│ └─GATConv: 2-4 --
│ │ └─Linear: 3-30 144
│ │ └─Dropout: 3-31 --
│ │ └─Dropout: 3-32 --
│ │ └─LeakyReLU: 3-33 --
│ │ └─Identity: 3-34 --
│ └─GATConv: 2-5 --
│ │ └─Linear: 3-35 144
│ │ └─Dropout: 3-36 --
│ │ └─Dropout: 3-37 --
│ │ └─LeakyReLU: 3-38 --
│ │ └─Identity: 3-39 --
│ └─GATConv: 2-6 --
│ │ └─Linear: 3-40 144
│ │ └─Dropout: 3-41 --
│ │ └─Dropout: 3-42 --
│ │ └─LeakyReLU: 3-43 --
│ │ └─Identity: 3-44 --
│ └─GATConv: 2-7 --
│ │ └─Linear: 3-45 144
│ │ └─Dropout: 3-46 --
│ │ └─Dropout: 3-47 --
│ │ └─LeakyReLU: 3-48 --
│ │ └─Identity: 3-49 --
│ └─GATConv: 2-8 --
│ │ └─Linear: 3-50 144
│ │ └─Dropout: 3-51 --
│ │ └─Dropout: 3-52 --
│ │ └─LeakyReLU: 3-53 --
│ │ └─Identity: 3-54 --
│ └─GATConv: 2-9 --
│ │ └─Linear: 3-55 144
│ │ └─Dropout: 3-56 --
│ │ └─Dropout: 3-57 --
│ │ └─LeakyReLU: 3-58 --
│ │ └─Identity: 3-59 --
├─NodeWeightPredictor: 1-4 --
│ └─Sequential: 2-10 --
│ │ └─Linear: 3-60 1,664
│ │ └─GELU: 3-61 --
│ │ └─Linear: 3-62 128
=================================================================
Total params: 17,267
Trainable params: 17,267
Non-trainable params: 0
=======================================================================================================================================================
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
preprocess metrics for each instance type: 100%|████████████████████████████████████████| 19/19 [00:06<00:00, 2.95it/s]
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: /SSF/output/run_GAT_node_classification.py.2022-09-03T03:02:35.245139/lightning_logs
| Name | Type | Params
---------------------------------
0 | _module | GAT | 17.6 K
---------------------------------
17.6 K Trainable params
0 Non-trainable params
17.6 K Total params
0.070 Total estimated model params size (MB)
2022-09-03 03:02:52.634 | INFO | DejaVu.models.interface.callbacks:on_validation_epoch_end:52 - epoch=0 val_loss=1.0618 A@1=0.00 % A@2=0.00 % A@3=0.00 % A@5=0.00 % MAR=47.71
2022-09-03 03:02:54.752 | INFO | DejaVu.models.interface.callbacks:on_train_epoch_end:41 - epoch=0 loss=1.2861 A@1=7.27 % A@2=14.55% A@3=16.36% A@5=16.36% MAR=36.62
2022-09-03 03:02:59.639 | INFO | DejaVu.models.interface.callbacks:on_train_epoch_end:41 - epoch=4 loss=0.5538 A@1=16.36% A@2=21.82% A@3=47.27% A@5=49.09% MAR=8.98
2022-09-03 03:03:04.952 | INFO | DejaVu.models.interface.callbacks:on_train_epoch_end:41 - epoch=8 loss=0.1951 A@1=78.18% A@2=80.00% A@3=83.64% A@5=96.36% MAR=2.18
2022-09-03 03:03:06.374 | INFO | DejaVu.models.interface.callbacks:on_validation_epoch_end:52 - epoch=9 val_loss=0.4763 A@1=29.41% A@2=47.06% A@3=70.59% A@5=82.35% MAR=3.82
Metric val_loss improved. New best score: 0.476
2022-09-03 03:03:07.945 | INFO | DejaVu.models.interface.callbacks:on_train_epoch_end:41 - epoch=10 loss=0.1300 A@1=76.36% A@2=83.64% A@3=90.91% A@5=96.36% MAR=1.56
2022-09-03 03:03:14.117 | INFO | DejaVu.models.interface.callbacks:on_train_epoch_end:41 - epoch=15 loss=0.0551 A@1=89.09% A@2=100.00% A@3=100.00% A@5=100.00% MAR=1.11
2022-09-03 03:03:19.016 | INFO | DejaVu.models.interface.callbacks:on_validation_epoch_end:52 - epoch=19 val_loss=0.2789 A@1=64.71% A@2=88.24% A@3=94.12% A@5=94.12% MAR=2.12
Metric val_loss improved by 0.197 >= min_delta = 0.0. New best score: 0.279
2022-09-03 03:03:19.274 | INFO | utils.callbacks:on_fit_end:106 - Average epoch time: 1.33
2022-09-03 03:03:19.275 | INFO | DejaVu.workflow:_train_exp_CFL:99 - trainer.checkpoint_callback.best_model_path='/SSF/output/run_GAT_node_classification.py.2022-09-03T03:02:35.245139/lightning_logs/version_0/checkpoints/epoch=19-A@1=0.647059-val_loss=0.278938-MAR=2.117647.ckpt'
2022-09-03 03:03:19.848 | INFO | DejaVu.workflow:_train_exp_CFL:100 - {'command_line': 'python exp/run_GAT_node_classification.py -H=4 -L=8 -fe=GRU -bal=True --data_dir=./data/A1 --max_epoch=20', 'time': 'Sat Sep 3 03:03:19 2022', 'git_root': '/workspace', 'git_url': 'https://github.com/NetManAIOps/DejaVu/tree/00d36dd07eed266840840769ecbc4abf0322319a', 'git_has_uncommitted_changes': False}
2022-09-03 03:03:19.863 | WARNING | utils.load_model:best_checkpoint:35 - ckpt_path=PosixPath('/SSF/output/run_GAT_node_classification.py.2022-09-03T03:02:35.245139/lightning_logs/version_0/checkpoints/last.ckpt') not match
Restoring states from the checkpoint path at /SSF/output/run_GAT_node_classification.py.2022-09-03T03:02:35.245139/lightning_logs/version_0/checkpoints/epoch=19-A@1=0.647059-val_loss=0.278938-MAR=2.117647.ckpt
Loaded model weights from checkpoint at /SSF/output/run_GAT_node_classification.py.2022-09-03T03:02:35.245139/lightning_logs/version_0/checkpoints/epoch=19-A@1=0.647059-val_loss=0.278938-MAR=2.117647.ckpt
2022-09-03 03:03:21.327 | INFO | DejaVu.models.interface.callbacks:on_test_epoch_end:107 -
A@1=53.12% A@2=90.62% A@3=100.00% A@5=100.00% MAR=1.56
|id | |FR |AR |recurring|timestamp |root cause |rank-1 |rank-2 |rank-3 |
|65 |✅ | 1| 1|True |2020-05-30T04:13:00+08:00|docker_002 CPU |docker_002 CPU |docker_002 |db_003 Session |
|59 |✅ | 1| 1|True |2020-05-29T03:41:00+08:00|docker_001 CPU |docker_001 CPU |docker_008 |docker_007 |
|18 |✅ | 1| 1|False |2020-05-23T00:05:00+08:00|docker_004 CPU |docker_004 CPU |docker_004 |db_009 |
|50 |✅ | 1| 1|True |2020-05-27T05:09:00+08:00|docker_001 CPU |docker_001 CPU |os_020 Network |docker_001 |
|44 |✅ | 1| 1|True |2020-05-27T01:23:00+08:00|docker_006 CPU |docker_006 CPU |docker_006 |docker_002 CPU |
|41 |✅ | 1| 1|True |2020-05-26T05:15:00+08:00|docker_002 CPU |docker_002 CPU |docker_002 |os_021 |
|8 |✅ | 1| 1|True |2020-04-11T04:40:00+08:00|docker_008 CPU |docker_008 CPU |docker_008 |db_007 Session |
|52 |✅ | 1| 1|True |2020-05-28T00:47:00+08:00|docker_001 CPU |docker_001 CPU |docker_001 |os_021 |
|24 |❌ | 2| 2|True |2020-05-23T05:20:00+08:00|docker_005 |docker_005 CPU |docker_005 |docker_003 |
|31 |✅ | 1| 1|True |2020-05-24T04:47:00+08:00|docker_004 |docker_004 |os_021 Network |docker_004 CPU |
|28 |❌ | 3| 3|True |2020-05-24T02:47:00+08:00|docker_002 |db_007 Session |docker_002 CPU |docker_002 |
|77 |❌ | 2| 2|True |2020-05-31T05:48:00+08:00|docker_003 |docker_003 CPU |docker_003 |db_007 Session |
|48 |✅ | 1| 1|True |2020-05-27T03:23:00+08:00|docker_001 |docker_001 |os_022 Network |os_021 |
|49 |❌ | 2| 2|True |2020-05-27T04:39:00+08:00|docker_007 |docker_007 CPU |docker_007 |db_007 Session |
|16 |❌ | 3| 3|True |2020-05-22T05:18:00+08:00|docker_007 |docker_003 CPU |docker_007 CPU |docker_007 |
|67 |❌ | 2| 2|True |2020-05-30T05:43:00+08:00|docker_002 |os_022 Network |docker_002 |db_007 Session |
|76 |❌ | 2| 2|True |2020-05-31T04:47:00+08:00|docker_006 |docker_006 CPU |docker_006 |docker_004 CPU |
|62 |❌ | 2| 2|True |2020-05-30T00:43:00+08:00|docker_005 |docker_005 CPU |docker_005 |docker_004 CPU |
|60 |❌ | 2| 2|True |2020-05-29T05:11:00+08:00|docker_006 |docker_006 CPU |docker_006 |docker_004 CPU |
|69 |❌ | 2| 2|True |2020-05-31T00:47:00+08:00|docker_001 |os_022 Network |docker_001 |db_007 Session |
|19 |❌ | 3| 3|False |2020-05-23T00:40:00+08:00|db_003 Session |db_003 Load |db_003 |db_003 Session |
|4 |❌ | 2| 2|True |2020-04-11T02:15:00+08:00|db_007 Session |db_007 Load |db_007 Session |db_007 |
|57 |❌ | 2| 2|False |2020-05-29T02:11:00+08:00|db_003 Session |os_021 Network |db_003 Session |db_007 Session |
|34 |✅ | 1| 1|True |2020-05-25T04:47:00+08:00|db_003 State |db_003 State |db_007 Session |db_003 Session |
|10 |✅ | 1| 1|True |2020-04-11T05:45:00+08:00|db_003 State |db_003 State |db_007 Session |os_017 |
|40 |✅ | 1| 1|True |2020-05-26T04:15:00+08:00|os_020 Network |os_020 Network |os_021 Network |docker_004 |
|14 |✅ | 1| 1|True |2020-05-22T01:48:00+08:00|os_018 Network |os_018 Network |docker_002 |os_018 |
|33 |❌ | 2| 2|False |2020-05-25T03:47:00+08:00|os_017 Network |os_019 Network |os_017 Network |docker_008 CPU |
|66 |❌ | 2| 2|True |2020-05-30T05:13:00+08:00|os_018 Network |os_022 Network |os_018 Network |docker_002 |
|43 |✅ | 1| 1|False |2020-05-27T00:53:00+08:00|os_017 Network |os_017 Network |docker_001 |docker_002 |
|73 |✅ | 1| 1|False |2020-05-31T03:17:00+08:00|os_017 Network |os_017 Network |docker_005 |docker_001 |
|72 |✅ | 1| 1|True |2020-05-31T02:47:00+08:00|os_021 Network |os_021 Network |os_021 |os_022 |
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Test metric DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
A@1 0.53125
A@2 0.90625
A@3 1.0
A@5 1.0
MAR 1.5625
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
2022-09-03 03:03:21.358 | INFO | DejaVu.workflow:<lambda>:27 - Time Report:
|path |%total |%parent |count |total |mean(±std) |min-max |
|/train_exp_CFL | 100.00%| 100.00%| 1| 46.105s| 46.105(± 0.000)s| 46.105~ 46.105|
|/train_exp_CFL/DejaVuDataset.__getitem__ | 8.96%| 8.96%| 1183| 4.132s| 0.003(± 0.011)s| 0.000~ 0.134|
|/train_exp_CFL/DejaVuDataset.__getitem__/MetricPreprocessor.__call__ | 0.59%| 6.54%| 78| 0.270s| 0.003(± 0.003)s| 0.002~ 0.021|
|/train_exp_CFL/DejaVuDataset.__getitem__/_get_global_id_getter | 0.00%| 0.00%| 1| 0.000s| 0.000(± 0.000)s| 0.000~ 0.000|
|/train_exp_CFL/DejaVuDataset.__init__ | 0.01%| 0.01%| 6| 0.003s| 0.001(± 0.001)s| 0.000~ 0.003|
|/train_exp_CFL/DejaVuModelInterface.get_collate_fn.<locals>.collate_fn | 0.37%| 0.37%| 84| 0.171s| 0.002(± 0.001)s| 0.001~ 0.009|
|/train_exp_CFL/DejaVuModelInterface.test_step | 0.21%| 0.21%| 1| 0.098s| 0.098(± 0.000)s| 0.098~ 0.098|
|/train_exp_CFL/DejaVuModelInterface.training_step | 18.07%| 18.07%| 80| 8.332s| 0.104(± 0.022)s| 0.061~ 0.165|
|/train_exp_CFL/DejaVuModelInterface.validation_step | 0.63%| 0.63%| 3| 0.290s| 0.097(± 0.012)s| 0.080~ 0.105|
|/train_exp_CFL/Epoch Time | 57.58%| 57.58%| 20| 26.546s| 1.327(± 0.241)s| 0.978~ 2.106|
|/train_exp_CFL/FDG.load | 7.28%| 7.28%| 1| 3.357s| 3.357(± 0.000)s| 3.357~ 3.357|
|/train_exp_CFL/GAT.__init__ | 0.08%| 0.08%| 1| 0.036s| 0.036(± 0.000)s| 0.036~ 0.036|
|/train_exp_CFL/MetricPreprocessor.extract_features | 14.21%| 14.21%| 1| 6.553s| 6.553(± 0.000)s| 6.553~ 6.553|
|/train_exp_CFL/MetricPreprocessor.extract_features/instance type iter | 13.94%| 98.10%| 19| 6.429s| 0.338(± 0.213)s| 0.105~ 0.888|
|/train_exp_CFL/MetricPreprocessor.extract_features/instance type iter/fill na | 2.28%| 16.38%| 19| 1.053s| 0.055(± 0.069)s| 0.005~ 0.272|
|/train_exp_CFL/MetricPreprocessor.extract_features/instance type iter/fill na/metric iter | 0.63%| 27.67%| 710| 0.291s| 0.000(± 0.000)s| 0.000~ 0.005|
|/train_exp_CFL/MetricPreprocessor.extract_features/instance type iter/get idx dict | 0.00%| 0.01%| 19| 0.001s| 0.000(± 0.000)s| 0.000~ 0.000|
|/train_exp_CFL/MetricPreprocessor.extract_features/instance type iter/get values from df | 11.14%| 79.90%| 19| 5.136s| 0.270(± 0.134)s| 0.099~ 0.564|
|/train_exp_CFL/MetricPreprocessor.extract_features/instance type iter/get values from df/fill into feat| 6.20%| 55.68%| 19| 2.860s| 0.151(± 0.121)s| 0.000~ 0.377|
|/train_exp_CFL/MetricPreprocessor.extract_features/instance type iter/get values from df/index | 4.81%| 43.17%| 19| 2.217s| 0.117(± 0.025)s| 0.090~ 0.181|
|/train_exp_CFL/MetricPreprocessor.extract_features/ts select | 0.15%| 1.04%| 1| 0.068s| 0.068(± 0.000)s| 0.068~ 0.068|
|/train_exp_CFL/_get_global_id_resolver | 0.00%| 0.00%| 1| 0.000s| 0.000(± 0.000)s| 0.000~ 0.000|
2022-09-03 03:03:22.513 | INFO | DejaVu.workflow:<lambda>:124 - command output one-line summary: 53.12,90.62,100.00,100.00,1.56,46.105078504000005,,,/SSF/output/run_GAT_node_classification.py.2022-09-03T03:02:35.245139,,,,python exp/run_GAT_node_classification.py -H=4 -L=8 -fe=GRU -bal=True --data_dir=./data/A1 --max_epoch=20,https://github.com/NetManAIOps/DejaVu/tree/00d36dd07eed266840840769ecbc4abf0322319a
train finished. saved to /SSF/output/run_GAT_node_classification.py.2022-09-03T03:02:35.245139
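For anyone comparing their own run against this one, the summary metrics can be recomputed from the per-failure table above. The sketch below assumes that A@k is the fraction of test failures whose root cause is ranked within the top k, and that MAR is the mean of the AR column. Both readings are my interpretation of the log, but they reproduce the reported values exactly. The function names are hypothetical and not part of DejaVu.

```python
from typing import Sequence


def top_k_accuracy(ranks: Sequence[int], k: int) -> float:
    """Fraction of failures whose true root cause is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


def mean_average_rank(ranks: Sequence[int]) -> float:
    """Mean of the per-failure rank (the AR column in the table)."""
    return sum(ranks) / len(ranks)


# AR column copied from the test table in this issue, in table order.
ranks = [
    1, 1, 1, 1, 1, 1, 1, 1,              # ids 65, 59, 18, 50, 44, 41, 8, 52
    2, 1, 3, 2, 1, 2, 3, 2, 2, 2, 2, 2,  # ids 24, 31, 28, 77, 48, 49, 16, 67, 76, 62, 60, 69
    3, 2, 2,                             # ids 19, 4, 57
    1, 1,                                # ids 34, 10
    1, 1, 2, 2, 1, 1, 1,                 # ids 40, 14, 33, 66, 43, 73, 72
]

for k in (1, 2, 3, 5):
    print(f"A@{k}={top_k_accuracy(ranks, k):.2%}")
print(f"MAR={mean_average_rank(ranks):.4f}")
```

With the 32 test failures above this gives A@1=53.12%, A@2=90.62%, A@3=100.00%, A@5=100.00% and MAR=1.5625, the same as the "Test metric" block.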