naver-airush/NAVER-AI-RUSH

session 속도 저하

Closed this issue · 3 comments

Informations

CLI

  • OS (Mac, Windows, Linux, etc) and version:
Linux 18.04
  • client version(please show nsml --version):
Built: 2020-07-16T14:06:32+09:00
Go: go1.12.17
GOARCH: amd64
GOOS: linux
Revision: abae53dd3589a47d7f89a40128cb86176123f7ec (not-modified)
Preconfigured connection: airush-cli.nsml.navercorp.com:10001

WEB

  • Brower (chrome, firfox, etc):
  • URL:

What is your login ID?

hi-space

Question

nsml run 할 때 session 이 allocating 되고 running 하는 속도가 느립니다.

  1. allocating 로그가 반복적으로 출력됩니다.

  2. 아래와 같이 tensorflow 로딩 후 오랫동안 진행되지 않습니다.

Epoch 1/10

1/2 [==============>...............] - ETA: 6s - loss: 2.3649 - accuracy: 0.3750
2/2 [==============================] - 388s 194s/step - loss: 2.6855 - accuracy: 0.2500 - val_loss: 0.8431 - val_accuracy: 0.4404
Trying to save to /tmp/tmpoqtgdqjm/model
Using TensorFlow backend.
2020-07-21 08:47:19.502230: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
2020-07-21 08:47:21.137245: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-07-21 08:47:21.254578: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:88:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2020-07-21 08:47:21.256203: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 1 with properties: 
pciBusID: 0000:b2:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2020-07-21 08:47:21.256238: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
2020-07-21 08:47:21.259268: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.11
2020-07-21 08:47:21.260514: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-07-21 08:47:21.260812: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-07-21 08:47:21.263423: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-07-21 08:47:21.264079: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.11
2020-07-21 08:47:21.264219: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.8
2020-07-21 08:47:21.272603: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0, 1
2020-07-21 08:47:21.280824: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2200000000 Hz
2020-07-21 08:47:21.281167: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x148ed8000b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-07-21 08:47:21.281215: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-07-21 08:47:21.534657: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56d6d60 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-07-21 08:47:21.534683: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-32GB, Compute Capability 7.0
2020-07-21 08:47:21.534689: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): Tesla V100-SXM2-32GB, Compute Capability 7.0
2020-07-21 08:47:21.537118: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:88:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2020-07-21 08:47:21.538706: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 1 with properties: 
pciBusID: 0000:b2:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2020-07-21 08:47:21.538745: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
2020-07-21 08:47:21.538791: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.11
2020-07-21 08:47:21.538807: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-07-21 08:47:21.538831: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-07-21 08:47:21.538845: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-07-21 08:47:21.538857: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.11
2020-07-21 08:47:21.538885: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.8
2020-07-21 08:47:21.545017: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0, 1
2020-07-21 08:47:21.545056: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
2020-07-21 08:47:22.290694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-21 08:47:22.290735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0 1 
2020-07-21 08:47:22.290746: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N Y 
2020-07-21 08:47:22.290752: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 1:   Y N 
2020-07-21 08:47:22.295744: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30154 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:88:00.0, compute capability: 7.0)
2020-07-21 08:47:22.298013: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 30154 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:b2:00.0, compute capability: 7.0)
2020-07-21 08:48:58.759488: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.11
2020-07-21 08:48:59.175054: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.8
  1. 같은 조건으로 돌렸을 때 특정 시간대에는 굉장히 빠르게 작업이 진행되는데 보통은 꽤 느린 것 같습니다. (baseline 코드 기준)
1/2 [==============>...............] - ETA: 0s - loss: 2.5396 - accuracy: 0.5000
2/2 [==============================] - 385s 193s/step - loss: 2.2393 - accuracy: 0.5000 - val_loss: 1.0433 - val_accuracy: 0.5562
2020-07-21 08:35:41.483612: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.11
2020-07-21 08:35:41.916094: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.8

여러명의 사용자가 동시에 사용하면 session 생성,작업이 오래 걸리나요? 아니면 다른 문제가 있을까요?

@hi-space 님, allocating 로그 반복에 대한 설명을 먼저 드리면,
docker build 과정, package install과정에서 나오는 메시지 입니다. 해당 메시지는 GPU node별로 서로 다른 실행시간을 가질수 있습니다. 기존에 동일한 docker image build한 이력이 있다면, cache 역할을 하게되어 빠르게 실행됩니다. 반면 처음 실행되는 node라면, 상대적으로 긴 시간이 소요됩니다.

@bluebrush 아 gpu 노드별로 다른거였군요. 네 이해했습니다, 감사합니다!

@hi-space 님, 추가로 전체 실행 속도 분석을 위해서 의견을 드립니다
각 세션별로 아래와 같이 실행시간별 리소스 활용그래프를 제공하고 있습니다
예를 들어,
https://airush.nsml.navercorp.com/overview/hi-space/spam-1/106
세션에서는 CPU 할당이 2개(default 값)이며, 일정시간동안 MAX 값(200%)를 치고 있습니다
이 경우, -c 4 또는 -c 6같이 늘려주시면, 해당 구간에는 더 빨리 진행될 것으로 예상됩니다.
전반적으로 GPU utilization을 보면, 낮은 값을 보이고 있으며, 이는 data feeding의 속도가 느려서 발생하는 것으로 보여집니다.