Job submission sometimes fails
zsh4614 opened this issue · 1 comment
Organization Name: HIT
Short summary about the issue/question:
Submitting a job sometimes fails.
Brief what process you are following:
When I submit a job, the following error occurs:
[Exit Trigger Info]
ExitTriggerMessage: FailedTaskCount 1 has reached MinFailedTaskCount 1 in the TaskRole
ExitTriggerTaskRole: taskrole
ExitTriggerTaskIndex: 0
--------------------------------------------------------------------------------
[Exit Spec]
code: 1
phrase: PAIRuntimeExitAbnormally
issuer: PAI_RUNTIME
causer: PAI_RUNTIME
type: PLATFORM_FAILURE
stage: UNKNOWN
behavior: UNKNOWN
reaction: RETRY_TO_MAX
reason: 'PAI Runtime exit abnormally with undefined exitcode, it may have bugs'
repro:
- PAI Runtime exits with exitcode 1
solution:
- Contact PAI Dev to fix PAI Runtime bugs
--------------------------------------------------------------------------------
[Exit Diagnostics]
Pod failed: PodPattern unmatched:
containers:
- name: init
  reason: Completed
  code: 0
- name: app
  reason: Error
  message: >
    standard_init_linux.go:228: exec user process caused: no such file or
    directory
  code: 1
What is the cause of this error? I need help, thanks!
How to reproduce it:
Submit a new job.
OpenPAI Environment:
- OpenPAI version: v1.8.0
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release):
- Kernel (e.g. uname -a): Linux rsgpuserver154 4.15.0-166-generic #174-Ubuntu SMP Wed Dec 8 19:07:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- Hardware (e.g. core number, memory size, storage size, GPU type etc.): A40
- Others:
Anything else we need to know:
Which Docker image are you using? The Docker image needs to include bash.
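One way to check whether an image ships bash before submitting a job is sketched below. This is a minimal sketch and not part of OpenPAI: the helper names `bash_check_command` and `image_has_bash` and the image name `ubuntu:20.04` are assumptions for illustration. It also assumes the `docker` CLI is installed locally and that the image provides `/bin/sh`; an image without any shell is reported as lacking bash.

```python
import subprocess
from typing import List


def bash_check_command(image: str) -> List[str]:
    """Build a docker command that looks up bash inside the given image.

    The entrypoint is overridden with /bin/sh so the image's own
    entrypoint (which may itself depend on bash) is not executed.
    """
    return [
        "docker", "run", "--rm",
        "--entrypoint", "/bin/sh",
        image,
        "-c", "command -v bash",
    ]


def image_has_bash(image: str) -> bool:
    """Return True if `command -v bash` succeeds inside the image."""
    result = subprocess.run(
        bash_check_command(image),
        capture_output=True,
        text=True,
    )
    # A zero exit code plus a non-empty path means bash was found.
    return result.returncode == 0 and result.stdout.strip() != ""


if __name__ == "__main__":
    # Hypothetical image name; replace it with the image used by the job.
    image = "ubuntu:20.04"
    print(image, "has bash:", image_has_bash(image))
```

If the check reports that bash is missing, installing bash in the image or switching to a base image that includes it would be the change the reply above asks for.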