EleutherAI/gpt-neox
An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
Python · Apache-2.0
Issues
How are multiple datasets loaded?
#1330 opened by fxnie - 1
Insufficient support for tokenizer_type
#1328 opened by fxnie - 1
Error when converting sequential model to HF
#1323 opened by SilverSulfide - 0
[Questions] Context length extension.
#1327 opened by iPRET - 0
AssertionError: Index file doesn't match expected format. Make sure that --dataset-impl is configured properly.
#1329 opened by fxnie - 0
Unreachable code and bug in if-else clause
#1324 opened by tomsbergmanis - 1
Can `preprocess_data.py` support Huggingface Dataset?
#1321 opened by cafeii - 1
Error with rotary embeddings and BFloat16
#1305 opened by jahatef - 1
_forward_step_fn does not always return two values so eval.py breaks if is_pipe_parallel is false
#1320 opened by markNZed - 1
Allow training without knowing num_iters
#1268 opened by StellaAthena - 0
KeyError when converting DPO weights from GPTNeoX format to HuggingFace Llama in post-training documentations
#1317 opened by jacobthebanana - 2
Training crashes when "(hidden_size * num_kv_heads) / (num_attention_heads * num_attention_heads)" is not an integer.
#1314 opened by tiandeyu-cs - 2
[Question] Running gpt-neox on AMD-based LUMI HPC centre.
#1310 opened by iPRET - 0
Latest DeepSpeed not supported
#1306 opened by jahatef - 4
batch_input and elapsed time per iteration suddenly slow down during model training
#1248 opened by Yuhanleeee - 2
For nucleus sampling, top-p sampling appears to happen on the softmax-normalized top-k logits
#1250 opened by j-frei - 2
Cannot convert neox model to HF
#1231 opened by srivassid - 3
Assertion Error when Setting pipe_parallel_size or model_parallel_size in GPT-NeoX
#1251 opened by lieh1203 - 0
How to Load Model from pytorch_model.bin into Trained Model for Text Generation?
#1254 opened by lieh1203 - 0
what's the biggest dataset you've tried?
#1253 opened by exnx - 0
too many .bin files for dataloader, crashed
#1252 opened by exnx - 2
The results of running eval show only 1 digit after decimal point for acc on all tested tasks
#1227 opened by lernerjenny - 2
My servers used for multi-node training do not have ssh. How can I launch multi-node training using the torchrun command?
#1203 opened by dingning97 - 1
Add Basic RWKV Block to GPT-NeoX
#1167 opened by Quentin-Anthony - 3
'intermediate_size' not set in tools/ckpts/convert_neox_to_hf.py for neox model architecture
#1208 opened by jvendrow - 1
LoRA Support
#1204 opened by Quentin-Anthony - 4
NCCL error in: ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
#1147 opened by mackmake - 1
How to convert gpt-neox to llama architecture..?
#1151 opened by yuri-son - 2
is there any ignore_index ability in the loss calculation?
#1193 opened by exnx - 1
The best mobile phone repair shop in holy Mashhad
#1173 opened by rezaarefi - 1
Is there a way to train on the entire dataset for N epochs without specifying train-iters?
#1164 opened by javirandor - 1
Add basic Mamba block
#1148 opened by Quentin-Anthony - 0
MoE loss variable not defined in gpt j residual code path
#1174 opened by tf-nv - 1
Converting Pythia checkpoint from HF to NeoX fails
#1161 opened by malteos - 2
Dockerfile installation fails to run pythia 14M
#1165 opened by tf-nv - 0
PyTorch Lightning Fused optimizer step
#1160 opened by jahatef - 2
files in multi-node training
#1146 opened by mackmake - 0
Update to current versions of python and pytorch
#1143 opened by segyges - 0
Add PyTorch Memory Profiler
#1152 opened by Quentin-Anthony - 2
Error on inference of huggingface
#1142 opened by mackmake - 4
calculate epoch
#1140 opened by mackmake