Bus error (core dumped) when training XTransformer with PECOS
runningabcd opened this issue · 3 comments
Description
When I train an XTransformer model with PECOS, training crashes with a bus error. Full log:
Constructed training corpus len=679174, training label matrix with shape=(679174, 679174) and nnz=1429299
Constructed training feature matrix with shape=(679174, 1134376) and nnz=1195014
Downloading (…)lve/main/config.json: 100%|██████████████████████████████████████████████████████| 570/570 [00:00<00:00, 2.08MB/s]
Downloading (…)okenizer_config.json: 100%|█████████████████████████████████████████████████████| 29.0/29.0 [00:00<00:00, 104kB/s]
Downloading (…)solve/main/vocab.txt: 100%|█████████████████████████████████████████████████████| 213k/213k [00:00<00:00, 635kB/s]
Downloading (…)/main/tokenizer.json: 100%|█████████████████████████████████████████████████████| 436k/436k [00:00<00:00, 945kB/s]
Downloading pytorch_model.bin: 100%|██████████████████████████████████████████████████████████| 436M/436M [00:23<00:00, 18.3MB/s]
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForXMC: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForXMC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForXMC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:407: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
Bus error (core dumped)
How to Reproduce?
The training data looks like this (comma-separated label IDs, then the document text):
4400,1580,5174 教育培训机构.道口财富是一家教育培训机构,由清控控股旗下公司联合上海陆家嘴旗下公司发起设立,为学员提供财富管理课程和创业金融课程。
5156,1188,1459 场景营销平台.北京蜂巢天下信息技术有限公司项目团队组建于2014年,总部位于北京,是基于Beacon网络的场景营销平台。专注于为本地生活服务商户提供基于场景的优惠分发,为用户提供一键接入身边优惠内容。
5156,1459 定制品在线设计及管理平台.时代定制是一个定制品在线设计及业务管理平台,主要服务于印刷和设计类企业、网站、影楼、文印店。
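Each line appears to carry comma-separated label IDs followed by the document text. A minimal sketch of that assumed format (`parse_line` is a hypothetical helper for illustration, not part of the PECOS API):

```python
# Sketch of the assumed input format: comma-separated label IDs, then
# whitespace, then the raw text. parse_line is a hypothetical helper,
# not part of PECOS.
def parse_line(line: str):
    label_part, _, text = line.strip().partition("\t")
    if not text:
        # fall back to a single space if the file is space-delimited
        label_part, _, text = line.strip().partition(" ")
    labels = [int(label) for label in label_part.split(",")]
    return labels, text

labels, text = parse_line("5156,1459 定制品在线设计及管理平台")
```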
Steps to reproduce
from pecos.utils.featurization.text.preprocess import Preprocessor
from pecos.xmc.xtransformer.model import XTransformer
from pecos.xmc.xtransformer.module import MLProblemWithText
import os

# Parse the raw text file into a corpus and a sparse label matrix
parsed_result = Preprocessor.load_data_from_file(
    "./training-data.txt",
    "./output-labels.txt",
)
Y = parsed_result["label_matrix"]
X_txt = parsed_result["corpus"]
print(f"Constructed training corpus len={len(X_txt)}, training label matrix with shape={Y.shape} and nnz={Y.nnz}")

# Word-level TF-IDF features with unigrams and bigrams
vectorizer_config = {
    "type": "tfidf",
    "kwargs": {
        "base_vect_configs": [
            {
                "ngram_range": [1, 2],
                "max_df_ratio": 0.98,
                "analyzer": "word",
            },
        ],
    },
}
tfidf_model = Preprocessor.train(X_txt, vectorizer_config)
X_feat = tfidf_model.predict(X_txt)
print(f"Constructed training feature matrix with shape={X_feat.shape} and nnz={X_feat.nnz}")

# Train the XTransformer model on the text plus TF-IDF features
prob = MLProblemWithText(X_txt, Y, X_feat=X_feat)
custom_xtf = XTransformer.train(prob)

# Save both models for later inference
custom_model_dir = "multi_labels_model_dir"
os.makedirs(custom_model_dir, exist_ok=True)
tfidf_model.save(f"{custom_model_dir}/tfidf_model")
custom_xtf.save(f"{custom_model_dir}/xrt_model")
# custom_xtf = XTransformer.load(f"{custom_model_dir}/xrt_model")
# tfidf_model = Preprocessor.load(f"{custom_model_dir}/tfidf_model")
Error message or code output
/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:407: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
Bus error (core dumped)
-rw------- 1 root root 35G Jun 20 17:09 core.10580
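To see where the crash happened, the core file can be inspected with gdb (a sketch; the interpreter path may differ from the one that produced this core file):

```shell
# Print a backtrace from the core dump; adjust the python binary path
# to match the interpreter that produced the core file.
gdb -q /usr/bin/python3.10 core.10580 -ex bt -ex quit
```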
docker stats screenshot:
Environment
- Operating system: Ubuntu 22.04 (Docker)
- Python version: 3.10
- PECOS version: 1.0.0
- GPU: NVIDIA-SMI 515.48.07, Driver Version: 515.48.07, CUDA Version: 11.7
help
It is likely due to insufficient shared memory for GPU communication. Could you follow Nvidia's guidelines for running your Docker container? I.e. try adding `--shm-size 8G` or `--ipc=host` to your `docker run` command.
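For reference, a `docker run` invocation along those lines might look like this (the image name, mount paths, and training script are placeholders):

```shell
# Enlarge the container's shared memory; the Docker default of 64 MB is
# often too small for PyTorch DataLoader workers and NCCL.
# Image name, mount paths, and the training script are placeholders.
docker run --gpus all --shm-size 8G \
    -v /path/to/data:/workspace/data \
    my-pecos-image:latest \
    python train.py

# Alternatively, share the host's IPC namespace instead of a fixed size:
# docker run --gpus all --ipc=host my-pecos-image:latest python train.py

# Inside the container, verify the available shared memory:
df -h /dev/shm
```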
Thanks for your reply, the problem has been solved. Thanks again!