NVIDIA/NeMo-Framework-Launcher

Provides end-to-end model development pipelines for LLMs and Multimodal models that can be launched on-prem or cloud-native.

Python · Apache-2.0

Issues

slurm Multi-machine and multi-GPU training
#416 opened 4 months ago by yangzhipeng1108
0
Adding support for Ray as a launcher backend
#387 opened 5 months ago by Irvingwangjr
0
‘General Configuration’ Page Not Found
#373 opened 5 months ago by Pty72
0
`demonic child error` occurs when executing run_dask_stage.sh
#340 opened 7 months ago by eagle705
1
hydra-core and omegaconf raise a type error: unexpected keyword version_base
#338 opened 7 months ago by imn00b
1
[Request][Nemo-Curator] missing add_id and download_and_extract features on Launcher
#339 opened 7 months ago by leejinho610
1
NeMo FW Launcher doesn't have `add_id` and `download_and_extract` logic unlike NeMo Curator repo
#334 opened 7 months ago by eagle705
0
Data Preparation Failed in Kubernetes
#314 opened 7 months ago by Syulin7
0
fsdp not supported for Llama model
#276 opened 9 months ago by jeffnvidia
0
llama2 7b SFT encountered an error
#271 opened 9 months ago by inspurasc
0
nemo container link in the README is not accessible any more
#247 opened 10 months ago by Fizzbb
1
Failed to execute a multirun with different configurations in K8S
#238 opened 10 months ago by Syulin7
2
k8s llama2 Data_preparation error inquiry
#236 opened 10 months ago by donggyulee1
1
k8s llama2 Data preparation error inquiry
#235 opened 10 months ago by donggyulee1
0
Add support for all tasks in latest version of eval harness
#212 opened a year ago by juletx
0
How do we enable samples per second metric in NeMo training?
#55 opened 2 years ago by a-cavalcanti
2
OSError: sbatch: error: Batch job submission failed: Invalid generic resource (gres) specification
#72 opened 2 years ago by starlitsky2010
1
GPT3 126m divergence
#207 opened a year ago by ethanhe42
0
Docker Build Fails
#184 opened a year ago by TaekyungHeo
7
Nvidia nemo blog and Apples to Oranges comparison of H200 vs A100
#180 opened a year ago by Qubitium
1
Documentation Code Block Display Issue
#178 opened a year ago by KTH1234
0
Nemo Ai
#161 opened a year ago by GAN-007
0
Request of NeMo-Megatron-Launcher in other language for local inferencing
#159 opened a year ago by bm777
0
Is nemo-rlhf code available?
#53 opened 2 years ago by maxhgerlach
1
Does NeMo-Megatron-Launcher support training from bare metal environment
#132 opened a year ago by zigzagcai
1
Is a CSP (Azure, AWS, etc.) required to use the NeMo framework for generative AI work? Can't we scale/leverage it on an on-premise database with 8 GPUs?
#112 opened a year ago by Shantan243
0
Add Transformer Engine and Apex commit to support matrix
#87 opened 2 years ago by wdykas
0
AssertionError: FP8 amax reduction group is not initialized
#81 opened 2 years ago by starlitsky2010
1
requests.exceptions.SSLError: HTTPSConnectionPool
#75 opened 2 years ago by starlitsky2010
0
OSError: [Errno 30] Read-only file system: 'outputs'
#74 opened 2 years ago by starlitsky2010
0
Which versions of Pyxis, Slurm and enroot for running NeMo-Megatron-Launcher on one Node with 8 * A100?
#73 opened 2 years ago by starlitsky2010
2
Where to execute python main.py when using Slurm on only one GPU node?
#69 opened 2 years ago by starlitsky2010
2
pyxis: failed to import docker image
#28 opened 2 years ago by szhengac
1
Access to the script for evaluating the inference performance (average latency vs. model size)
#22 opened 2 years ago by songkq
3
Converted nemo-megatron-mt5-3B to a FasterTransformer binary file successfully, but tritonserver fails with an undesired shape when loading models.
#21 opened 2 years ago by songkq
2
Is there a contribution guide for this repo?
#7 opened 2 years ago by kaiyux
3
Device Support
#6 opened 2 years ago by MikeDean2367
4