DeepRec-AI/HybridBackend
A high-performance framework for training wide-and-deep recommender systems on heterogeneous clusters
C++ · Apache-2.0
Issues
Op type not registered 'HbGetNcclId' in binary
#159 opened by ZhuYuJin - 3
Train got error died with <Signals.SIGSEGV: 11>
#153 opened by dixingxing0 - 4
Training is very slow
#155 opened by dixingxing0 - 0
Error in multi-card in a single machine mode
#154 opened by dixingxing0 - 1
No OpKernel was registered to support Op 'HbSparseSegmentMeanGrad1' used by node
#151 opened by karterotte - 9
Throughput is lower than TFRecords when there are many strings in Parquets file
#138 opened by deepllz - 0
hb.data.ParquetDataset will discard some data
#121 opened by silingtong123 - 0
Deeprec hangs in distributed mode.
#125 opened by silingtong123 - 1
hb.keras.model evaluate error
#118 opened by karterotte - 0
Failed to train with multiple GPUs in single node
#122 opened by ZhuYuJin - 0
How to place the embeddings on gpu?
#102 opened by taoyun951753 - 0
merge embedding table
#100 opened by zhaozheng09 - 0
ParquetDataset benchmark add tfrecord data
#97 opened by welsonzhang - 0
QR code is invalid
#106 opened by co63oc - 1
Error when drop_reminder=True using rebatch API
#56 opened by liurcme - 0
Row-wise shuffling required
#107 opened by 2sin18 - 0
Sync training with ParquetDataset, Use PS-Worker,The system may block because some worker stop early.
#96 opened by zhbhhb - 0
ParquetDataset should be able to skip corrupted data
#104 opened by 2sin18 - 1
DLRM model on A100 8cards training
#95 opened by zhaozheng09 - 3
Feature Request: Supports prefetching data to GPU
#74 opened by 2sin18 - 0
model.summary didn't show model layers
#87 opened by karterotte - 0
tf.keras.layers.DenseFeatures api as the candidate of hb.feature_column.DenseFeatures can not work with tf.feature_column.shared_embedding_columns
#93 opened by taoyun951753 - 0
feature_column bucket_size is 6, use 8 gpus, then worker-5 and worker-6 'save/RestoreV2' failed
#89 opened by zhbhhb - 0
the EarlyStopping callback not working well on multi worker distribute training job
#88 opened by taoyun951753 - 0
support ARROW_NUM_THREADS in ParquetDataset
#67 opened by karterotte - 1
hybridbackend 0.6.0a2 version raise ValueError when ParquetDataset wrapped by parallel_interleave ops
#62 opened by fuhailin - 0
Feature Request: Support fixed length list
#53 opened by 2sin18 - 5
Using shuffle or rebatch may cause OOM problem
#43 opened by liurcme - 0
Feature Request: Support hybrid parallelism.
#37 opened by 2sin18 - 2