determined-ai/determined
Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
Go, Apache-2.0
Issues
🤔[question] Can not connect to master node
#9201 opened by monody1 - 6
🐛[bug]
#7909 opened by humbleearth - 1
no shutdown parameter in cloud formation template yaml with a dry run🐛[bug]
#7978 opened by humbleearth - 1
Does determined provide data / model access through gitlfs 🤔[question]
#8297 opened by humbleearth - 1
🤔 model registry - inference with pytorch model
#8806 opened by Fedege98 - 4
Does Determined AI support below features?🤔[question]
#8331 opened by humbleearth - 1
🤔[question] dialing to http://172.22.0.1:32862: dial tcp 172.22.0.1:32862: connect: connection refused
#8954 opened by mr-nealon - 14
🐛[bug] Kernel status: pending
#8976 opened by rikirolly - 12
🐛[bug] Running Mnist Tutorial distributed causes Runtime Errors and Hanging behavior
#8915 opened by samjenks - 1
🤔[question] Customize Slack Webhook?
#8581 opened by Wildshire - 1
🤔[question] Changing the default config path for the determined-agent.service
#8891 opened by samjenks - 4
🐛[bug] Master refuses to accept agents connection
#8856 opened by skynewborn - 4
🐛[bug] Resources failed with non-zero exit code: container failed with non-zero exit code: 80
#8844 opened by samjenks - 1
🐛[bug] Bad ref on requirements.rst in Docs
#8826 opened by sirredbeard - 4
🤔[question] LOGs
#8779 opened by fayjie92 - 7
🐛Update readme for @hpe.com/glide-data-grid and consider contributing back
#8617 opened by jassmith - 3
🐛[bug] Multi-node training hangs in _combine_and_average_training_metrics due to ZeroMQ bug
#8222 opened by igor0 - 5
🐛[bug] show_ssh_command error on Windows CMD: module 'os' has no attribute 'uname'
#8621 opened by agarcia-ruiz - 3
🐛[bug] det CLI tool errors on Python 3.12 because it relies on distutils which was deprecated in Python 3.10
#8666 opened by sirredbeard - 1
🤔[question] add resource_pools
#8619 opened by TdTianpo - 0
DDMScheduler parameter bug
#8392 opened by paulaserna16 - 2
🐛[bug] Failed to Deploy a Standalone Master
#8305 opened by PurRigiN - 1
Can I change the save path of checkpoint on master node?🤔[question]
#8368 opened by ghuorahgaorhga - 2
Anyway to avoid non-Admin User Ability to Delete Others' Task Containers?
#8379 opened by jeremyjiao - 0
How I get checkpoint save path🤔[question]
#8366 opened by ghuorahgaorhga - 1
Image support for newer versions of ROCm and PyTorch
#7832 opened by albertogg99 - 10
Does determined provide client server mode of execution something similar to clearML?🤔[question]
#8114 opened by humbleearth - 1
Are there terraform templates available for determined for aws?🤔[question]
#8094 opened by humbleearth - 6
Checkpoint storage validation failed
#7869 opened by r-s-4 - 2
🤔[question] duplicate key value violates unique constraint "steps_trial_id_total_batches_run_id_unique"
#7939 opened by taroTan1997 - 3
Any example of reinforcement learning using determined ai?🤔[question]
#7911 opened by humbleearth - 12
Agent instances starting / stopping🐛[bug]
#7977 opened by humbleearth - 4
🤔[question] Where is the training stage log?
#7966 opened by caiduoduo12138 - 3
master in aws not able to connect to rds🐛[bug]
#7979 opened by humbleearth - 4
[question]
#7845 opened by Satej - 2
🤔[question] llama2 test: AttributeError: 'str' object has no attribute 'type'
#7746 opened by caiduoduo12138 - 3
Deploying determined ai on premise kubernetes cluster with custom registry for the postgres db pod results in forbidden error while pulling image🐛[bug]
#7770 opened by humbleearth - 2
🤔[question] image(determinedai/environments:cuda-11.8-pytorch-2.0-gpu-mpi-0.24.0) to submit multinodes task
#7739 opened by caiduoduo12138