aws-neuron/aws-neuron-parallelcluster-samples

About the running time of building the Megatron helper module

etsurin opened this issue · 0 comments

I am using a single node to launch a pretraining job, following https://github.com/aws-neuron/aws-neuron-parallelcluster-samples/blob/master/examples/jobs/neuronx-nemo-megatron-llamav2-job.md

At the step that builds the Megatron helper module, I run:

cd ~
python3 -c "from nemo.collections.nlp.data.language_modeling.megatron.dataset_utils import compile_helper; \
compile_helper()"

The command has been running for 15 minutes since it printed the following output:

2023-Oct-03 04:58:41.0157 23005:23005 ERROR  TDRV:tdrv_get_dev_info                       No neuron device available
[NeMo W 2023-10-03 04:58:46 optimizers:70] Could not import distributed_fused_adam optimizer from Apex
[NeMo W 2023-10-03 04:58:49 experimental:27] Module <class 'nemo.collections.nlp.data.language_modeling.megatron.megatron_batch_samplers.MegatronPretrainingRandomBatchSampler'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-10-03 04:58:50 experimental:27] Module <class 'nemo.collections.nlp.models.text_normalization_as_tagging.thutmose_tagger.ThutmoseTaggerModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
make: Entering directory '/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/nemo/collections/nlp/data/language_modeling/megatron'
g++ -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color -I/usr/include/python3.8 -I/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pybind11/include helpers.cpp -o helpers.cpython-38-x86_64-linux-gnu.so

I am wondering if there is something wrong.

Update:
After 40 minutes there was still no further output, so I terminated the command.
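
One way to check whether the g++ step itself is what stalls, rather than the NeMo import around it, might be to run the compiler command from the make output directly with a time limit. The following is a minimal sketch only: the directory and flags are copied from the log above, while the helper names `run_with_timeout` and `time_helper_build` and the 300-second limit are my own assumptions, not part of NeMo or the Neuron samples.

```python
import subprocess
import time

# Directory and compiler command copied from the make output in the log.
MEGATRON_DIR = (
    "/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/"
    "nemo/collections/nlp/data/language_modeling/megatron"
)
GPP_CMD = [
    "g++", "-O3", "-Wall", "-shared", "-std=c++11", "-fPIC",
    "-fdiagnostics-color",
    "-I/usr/include/python3.8",
    "-I/opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/pybind11/include",
    "helpers.cpp",
    "-o", "helpers.cpython-38-x86_64-linux-gnu.so",
]


def run_with_timeout(cmd, timeout_s, cwd=None):
    """Run cmd and return (status, elapsed_seconds, output).

    status is the process return code, or None if the time limit was hit
    (in which case subprocess.run has already killed the child).
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd,
            cwd=cwd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout_s,
        )
        return proc.returncode, time.monotonic() - start, proc.stdout
    except subprocess.TimeoutExpired as exc:
        partial = exc.output or ""
        if isinstance(partial, bytes):
            # Partial output may arrive as bytes on some Python versions.
            partial = partial.decode(errors="replace")
        return None, time.monotonic() - start, partial


def time_helper_build(timeout_s=300):
    """Hypothetical wrapper: time the g++ step alone (300 s is an arbitrary limit)."""
    status, elapsed, output = run_with_timeout(GPP_CMD, timeout_s, cwd=MEGATRON_DIR)
    if status is None:
        print(f"g++ still running after {elapsed:.0f}s; stopped it")
    else:
        print(f"g++ exited with code {status} after {elapsed:.0f}s")
    print(output)
    return status
```

Calling `time_helper_build()` on the node would show whether the compiler returns at all within the limit. If it does, the delay is more likely somewhere else in `compile_helper()` or in the imports that precede it.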