amzn/pecos

Bus error (core dumped) when using the pecos model to train xtransformer

runningabcd opened this issue · 3 comments

Description

When I train XTransformer with PECOS, the run crashes with a bus error:

Constructed training corpus len=679174, training label matrix with shape=(679174, 679174) and nnz=1429299
Constructed training feature matrix with shape=(679174, 1134376) and nnz=1195014
Downloading (…)lve/main/config.json: 100%|██████████████████████████████████████████████████████| 570/570 [00:00<00:00, 2.08MB/s]
Downloading (…)okenizer_config.json: 100%|█████████████████████████████████████████████████████| 29.0/29.0 [00:00<00:00, 104kB/s]
Downloading (…)solve/main/vocab.txt: 100%|█████████████████████████████████████████████████████| 213k/213k [00:00<00:00, 635kB/s]
Downloading (…)/main/tokenizer.json: 100%|█████████████████████████████████████████████████████| 436k/436k [00:00<00:00, 945kB/s]
Downloading pytorch_model.bin: 100%|██████████████████████████████████████████████████████████| 436M/436M [00:23<00:00, 18.3MB/s]
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForXMC: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias']

  • This IS expected if you are initializing BertForXMC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing BertForXMC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:407: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
Bus error (core dumped)

How to Reproduce?

The training data looks like this:
4400,1580,5174 教育培训机构.道口财富是一家教育培训机构,由清控控股旗下公司联合上海陆家嘴旗下公司发起设立,为学员提供财富管理课程和创业金融课程。
5156,1188,1459 场景营销平台.北京蜂巢天下信息技术有限公司项目团队组建于2014年,总部位于北京,是基于Beacon网络的场景营销平台。专注于为本地生活服务商户提供基于场景的优惠分发,为用户提供一键接入身边优惠内容。
5156,1459 定制品在线设计及管理平台.时代定制是一个定制品在线设计及业务管理平台,主要服务于印刷和设计类企业、网站、影楼、文印店。
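For reference, each line can be split into its label list and document text like this (a minimal sketch; it assumes PECOS's text format of comma-separated label indices, a tab, then the raw text — the tab renders as a space in the sample above):

```python
def parse_line(line: str):
    """Split one training line into (label_ids, text).

    Assumes the PECOS text format: comma-separated label indices,
    a tab separator, then the raw document text.
    """
    label_str, text = line.rstrip("\n").split("\t", 1)
    return [int(lbl) for lbl in label_str.split(",")], text

labels, text = parse_line("4400,1580,5174\t教育培训机构.道口财富是一家教育培训机构")
print(labels)  # [4400, 1580, 5174]
```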

Steps to reproduce

from pecos.utils.featurization.text.preprocess import Preprocessor
from pecos.xmc.xtransformer.model import XTransformer
from pecos.xmc.xtransformer.module import MLProblemWithText

import os

# Parse the text-format training file into labels and raw text.
parsed_result = Preprocessor.load_data_from_file(
    "./training-data.txt",
    "./output-labels.txt",
)
Y = parsed_result["label_matrix"]  # sparse label matrix (n_samples x n_labels)
X_txt = parsed_result["corpus"]    # list of raw text strings

print(f"Constructed training corpus len={len(X_txt)}, training label matrix with shape={Y.shape} and nnz={Y.nnz}")

vectorizer_config = {
    "type": "tfidf",
    "kwargs": {
        "base_vect_configs": [
            {
                "ngram_range": [1, 2],
                "max_df_ratio": 0.98,
                "analyzer": "word",
            },
        ],
    },
}

# Fit a word-level unigram/bigram TF-IDF vectorizer, then featurize the corpus.
tfidf_model = Preprocessor.train(X_txt, vectorizer_config)
X_feat = tfidf_model.predict(X_txt)

print(f"Constructed training feature matrix with shape={X_feat.shape} and nnz={X_feat.nnz}")

# Bundle text, labels, and numerical features, then train the transformer model.
prob = MLProblemWithText(X_txt, Y, X_feat=X_feat)
custom_xtf = XTransformer.train(prob)

custom_model_dir = "multi_labels_model_dir"
os.makedirs(custom_model_dir, exist_ok=True)

tfidf_model.save(f"{custom_model_dir}/tfidf_model")
custom_xtf.save(f"{custom_model_dir}/xrt_model")

# custom_xtf = XTransformer.load(f"{custom_model_dir}/xrt_model")
# tfidf_model = Preprocessor.load(f"{custom_model_dir}/tfidf_model")

Error message or code output

- This IS expected if you are initializing BertForXMC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForXMC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:407: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
Bus error (core dumped)
-rw------- 1 root  root   35G Jun 20 17:09 core.10580
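As an aside: each crash left a 35 GB core file behind. While iterating on a fix, you can disable core dumps for the shell that launches training (standard POSIX `ulimit`, unrelated to PECOS itself):

```shell
# Disable core dumps for this shell and its children, so repeated
# crashes don't fill the disk with multi-gigabyte core files.
ulimit -c 0
ulimit -c   # prints: 0
```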

docker stats screenshot:

(screenshot: docker stats, 2023-06-20 18:13:35)

Environment

  • Operating system: Ubuntu 22.04 (Docker)
  • Python version: 3.10
  • PECOS version: 1.0.0
  • GPU: NVIDIA-SMI 515.48.07, Driver Version 515.48.07, CUDA Version 11.7

Any help would be appreciated.

This is likely due to insufficient shared memory for GPU communication. Could you follow Nvidia's guidelines for running your Docker container? I.e., try adding `--shm-size 8G` or `--ipc=host` to your `docker run` command.

Nvidia User Guide: Setting The Shared Memory Flag
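Before re-running a long training job, you can sanity-check how much shared memory the container actually has. This is a quick standalone check using only the standard library (not part of the PECOS API): `/dev/shm` is what PyTorch DataLoader workers use for inter-process communication, and Docker's default allocation is only 64 MB.

```python
import os
import shutil

SHM = "/dev/shm"  # backs DataLoader worker IPC on Linux

if os.path.exists(SHM):
    shm_gib = shutil.disk_usage(SHM).total / 1024**3
    print(f"{SHM} size: {shm_gib:.1f} GiB")
    if shm_gib < 8:
        print("Consider restarting docker with --shm-size 8G or --ipc=host")
else:
    print(f"{SHM} not found (not a Linux container?)")
```

If the reported size is small, restart the container with e.g. `docker run --gpus all --shm-size 8G ...` or pass `--ipc=host`.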


Thanks for your reply, the problem has been solved.

Thanks, thanks!