alibaba/EasyRec

DSSM negative sampler does not support ODPS tables?

xiahouzuoxin opened this issue · 4 comments

E0930 16:31:31.917438 153 edge_loader.cc:98] Try to read next edge file failed, Not found:File system not implemented
E0930 16:31:31.917464 6 graph_store.cc:207] Load graph edges failed, Not found:File system not implemented
[2022-09-30 16:31:31.917490] Load graph edges failed.
[2022-09-30 16:31:31.917492] Not found:File system not implemented
[2022-09-30 16:31:31.917496] Server load data failed and exit now.
[2022-09-30 16:31:31.917499] Not found:File system not implemented
F0930 16:31:31.917500 6 server_impl.cc:163] Server load data failed: Not found:File system not implemented
*** Check failure stack trace: ***
@ 0x7fcf18872250 google::LogMessage::Fail()
@ 0x7fcf18872198 google::LogMessage::SendToLog()
@ 0x7fcf18871abb google::LogMessage::Flush()
@ 0x7fcf18875306 google::LogMessageFatal::~LogMessageFatal()
@ 0x7fcf187f9d6a graphlearn::DefaultServerImpl::Init()
@ 0x7fcf18eeed08 ZZN8pybind1112cpp_function10initializeIZNS0_C4IvN10graphlearn6ServerEJRKSt6vectorINS3_2io10EdgeSourceESaIS7_EERKS5_INS6_10NodeSourceESaISC_EEEJNS_4nameENS_9is_methodENS_7siblingEEEEMT0_FT_DpT1_EDpRKT2_EUlPS4_SB_SG_E_vJSU_SB_SG_EJSH_SI_SJ_EEEvOSL_PFSK_SN_EST_ENKUlRNS_6detail13function_callEE1_clES11
@ 0x7fcf18ee21b2 pybind11::cpp_function::dispatcher()
@ 0x7fd0767fdeed PyEval_EvalFrameEx
@ 0x7fd0767ff1ce PyEval_EvalCodeEx
@ 0x7fd0767fe1f6 PyEval_EvalFrameEx
@ 0x7fd0767ff1ce PyEval_EvalCodeEx
@ 0x7fd0767fe1f6 PyEval_EvalFrameEx
@ 0x7fd0767ff1ce PyEval_EvalCodeEx
@ 0x7fd0767fe1f6 PyEval_EvalFrameEx
@ 0x7fd0767ff1ce PyEval_EvalCodeEx
@ 0x7fd0767fe1f6 PyEval_EvalFrameEx
@ 0x7fd0767ff1ce PyEval_EvalCodeEx
@ 0x7fd07677a8e8 function_call
@ 0x7fd07674adc3 PyObject_Call
@ 0x7fd07675d54f instancemethod_call
@ 0x7fd07674adc3 PyObject_Call
@ 0x7fd0767b7910 slot_tp_init
@ 0x7fd0767ae328 type_call
@ 0x7fd07674adc3 PyObject_Call
@ 0x7fd0767fbf07 PyEval_EvalFrameEx
@ 0x7fd0767ff1ce PyEval_EvalCodeEx
@ 0x7fd0767fe1f6 PyEval_EvalFrameEx
@ 0x7fd0767ff1ce PyEval_EvalCodeEx
@ 0x7fd0767fe1f6 PyEval_EvalFrameEx
@ 0x7fd0767ff1ce PyEval_EvalCodeEx
@ 0x7fd0767fe1f6 PyEval_EvalFrameEx
@ 0x7fd0767ff1ce PyEval_EvalCodeEx
bash: line 1: 6 Aborted (core dumped) python run.py --cmd train --config oss://algo-recsys/algo/xiahouzuoxin/recall/dssm_trsample/dssm_trsample.config --train_tables odps://du_algo_1/tables/ec_list_homepage_recall_trainsamples_by_rankscore_v3/ds=20220928 --eval_tables odps://du_algo_1/tables/deal_rec_recall_dssm_feadump_sample_test_1d/ds=20220928 --boundary_table odps://du_algo_1/tables/deal_rec_recall_dssm_featdump_sample_train_sample_pos_train_v1_binning/ds=20220928 2>&1
7 Done | tee /logs/hostname.log

Initializing the graph from an ODPS table is natively supported on the MaxCompute platform, so what is your startup command?

It occurs when I use the EasyRec repo and follow the documentation.

It seems these lines initialize the graph:

class HardNegativeSampler(BaseSampler):

The start command using pai:

pai -name easy_rec_ext -project algo_public
-Dcmd=train
-Dtrain_tables='odps://du_algo_1/tables/ec_list_homepage_recall_dssm_trsample_train_samples/ds=${bizdate}'
-Deval_tables='odps://du_algo_1/tables/ec_list_homepage_recall_dssm_trsample_test_samples/ds=${bizdate}'
-Dboundary_table='odps://du_algo_1/tables/deal_rec_recall_dssm_featdump_sample_train_sample_pos_train_v1_binning/ds=${bizdate}'
-Dcluster='{\"ps\":{\"count\":2,\"cpu\":900,\"memory\":10000},\"worker\":{\"count\":5,\"cpu\":900,\"memory\":10000}}'
-Darn='acs:ram::1816563541899700:role/aliyunodpspaidefaultrole'
-Dbuckets='oss://algo-recsys.oss-cn-hangzhou-internal.aliyuncs.com/'
-Dconfig='oss://algo-recsys/algo/xiahouzuoxin/recall/dssm_trsample/${model_config}.config'
-Dmodel_dir="oss://algo-recsys/algo/xiahouzuoxin/recall/dssm_trsample/${model_config}/${bizdate}"
-DossHost=oss-cn-hangzhou-internal.aliyuncs.com
-- -Dedit_config_json='{\
-- "train_config.num_steps":30000,\
-- "eval_config.num_examples":409600,\
-- "train_config.fine_tune_checkpoint": "oss://algo-recsys/algo/xiahouzuoxin/recall/dssm_trsample/${model_config}/${bizdate_1}",\
-- "data_config.hard_negative_sampler.user_input_path": "odps://du_algo_1/tables/ec_list_homepage_recall_dssm_trsample_users/ds=${bizdate}",\
-- "data_config.hard_negative_sampler.item_input_path": "odps://du_algo_1/tables/ec_list_homepage_recall_hotweighted_items/ds=${bizdate}",\
-- "data_config.hard_negative_sampler.hard_neg_edge_input_path": "odps://du_algo_1/tables/ec_list_homepage_recall_dssm_trsample_hardneg_edge/ds=${bizdate}"\
-- }'
-Deval_method='separate'
-Dres_project=du_algo_1_dev
-Dversion=zuoxin_dev

And the pipeline_conf looks like:

hard_negative_sampler {
  user_input_path: 'odps://du_algo_1/tables/ec_list_homepage_recall_dssm_trsample_users/ds=20220928'
  item_input_path: 'odps://du_algo_1/tables/ec_list_homepage_recall_hotweighted_items/ds=20220928'
  hard_neg_edge_input_path: 'odps://du_algo_1/tables/ec_list_homepage_recall_dssm_trsample_hardneg_edge/ds=20220928'
  num_sample: 1000
  num_hard_sample: 2
  num_eval_sample: 1000
  attr_fields: 'cspu_id'
  attr_fields: 'level1_category_id'
  attr_fields: 'level2_category_id'
  attr_fields: 'brand_id'
  attr_fields: 'category_id'

Please add the tables specified in hard_negative_sampler to -Dtables so that the platform authorizes TensorFlow to read from these tables.
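
For illustration, here is a rough Python sketch of how the -Dtables value could be assembled from the three sampler tables in the config above. The helper name `build_dtables_arg` is hypothetical and not part of EasyRec, and the comma-separated format of -Dtables is an assumption that should be checked against the EasyRec PAI training docs.

```python
# Hypothetical helper (not part of EasyRec): build a -Dtables argument
# from the tables referenced by hard_negative_sampler.
# Assumption: -Dtables accepts a comma-separated list of odps:// paths.

def build_dtables_arg(sampler_paths):
    """Return a -Dtables argument listing each ODPS table once, in order."""
    tables = []
    for path in sampler_paths:
        # Only ODPS tables need platform authorization; skip duplicates.
        if path.startswith("odps://") and path not in tables:
            tables.append(path)
    return "-Dtables='" + ",".join(tables) + "'"


bizdate = "20220928"
prefix = "odps://du_algo_1/tables/"
sampler_paths = [
    prefix + "ec_list_homepage_recall_dssm_trsample_users/ds=" + bizdate,
    prefix + "ec_list_homepage_recall_hotweighted_items/ds=" + bizdate,
    prefix + "ec_list_homepage_recall_dssm_trsample_hardneg_edge/ds=" + bizdate,
]

dtables_arg = build_dtables_arg(sampler_paths)
print(dtables_arg)
```

The printed string would then be added to the pai command alongside -Dtrain_tables and -Deval_tables, with `${bizdate}` substituted as in the rest of the command.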