Hsword/Hetu

Error when running RDMA

xudongliao opened this issue · 4 comments

Hi Hetu Authors,

I get the following errors when running Hetu with RDMA:

[libprotobuf ERROR google/protobuf/message_lite.cc:133] Can't parse message of type "ps.PBMeta" because it is missing required fields: (cannot determine missing fields for lite message)
[10-0-10-200:13:06:00] /hetu/ps-lite/include/common/logging.h:317: [13:06:00] /hetu/ps-lite/src/van.cc:544: Check failed: pb.ParseFromArray(meta_buf, buf_size) failed to parse string into protobuf

Stack trace returned 8 entries:
[bt] (0) /hetu/python/hetu/gpu_ops/../../../build/lib/libps.so(dmlc::StackTrace[abi:cxx11]()+0x17f) [0x7f8fe4c8f3bf]
[bt] (1) /hetu/python/hetu/gpu_ops/../../../build/lib/libps.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x3b) [0x7f8fe4c90b0b]
[bt] (2) /hetu/python/hetu/gpu_ops/../../../build/lib/libps.so(ps::Van::UnpackMeta(char const*, int, ps::Meta*)+0x3ae) [0x7f8fe4ccbade]
[bt] (3) /hetu/python/hetu/gpu_ops/../../../build/lib/libps.so(ps::IBVerbsVan::RecvMsg(ps::Message*)+0x182) [0x7f8fe4cdd012]
[bt] (4) /hetu/python/hetu/gpu_ops/../../../build/lib/libps.so(ps::Van::Receiving()+0x216) [0x7f8fe4ccaa36]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd44c0) [0x7f9007a054c0]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f9011b846db]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f9011ebd61f]

To compile ps-lite with RDMA (ibverbs) support, I added the following lines to ps-lite/CMakeLists.txt:

add_compile_definitions(DMLC_USE_IBVERBS)
target_link_libraries(ps PRIVATE -lrdmacm -libverbs)

Compilation then fails at ibverbs_van.h:657, because there is no data member named msg.meta.datasize.

Therefore I added a data_size() member function to struct Message in Message.h:

struct Message {
    /** \brief the meta info of this message */
    Meta meta;
    /** \brief the large chunk of data of this message */
    std::vector<SArray<char>> data;

    std::string DebugString() const {
        std::stringstream ss;
        ss << meta.DebugString();
        if (data.size()) {
            ss << " Body:";
            for (const auto &d : data)
                ss << " data_size=" << d.size();
        }
        return ss.str();
    }

    /** \brief total number of bytes across all data segments */
    size_t data_size() const {
        size_t data_len = 0;
        for (const auto &d : data)
            data_len += d.size();
        return data_len;
    }
};

and changed the failing line to size_t data_len = msg.data_size();
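For anyone who wants to check the summation logic in isolation, here is a minimal sketch that compiles without ps-lite. It assumes a plain std::vector<char> as a stand-in for SArray<char>; the names Buffer, MessageSketch and DemoDataSize are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for ps-lite's SArray<char>; only size() matters in this sketch.
using Buffer = std::vector<char>;

// Reduced, hypothetical version of Message that keeps only the data payload.
struct MessageSketch {
    std::vector<Buffer> data;

    // Total number of payload bytes across all data segments.
    std::size_t data_size() const {
        std::size_t total = 0;
        for (const auto &segment : data)
            total += segment.size();
        return total;
    }
};

// Builds a message with segments of 4, 0 and 12 bytes and returns their sum.
inline std::size_t DemoDataSize() {
    MessageSketch msg;
    msg.data.emplace_back(4, 'k');   // 4-byte segment
    msg.data.emplace_back();         // empty segment
    msg.data.emplace_back(12, 'v');  // 12-byte segment
    assert(msg.data.size() == 3);
    return msg.data_size();
}
```

An empty message reports 0 bytes, and empty segments contribute nothing to the total.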

My Hetu checkout is based on commit https://github.com/Hsword/Hetu/tree/120b776d653708adfccbadc8e1b35d633eaf1161.

The test model is wdl_criteo, run with ps_num=1 and worker_num=8 using the Hybrid communication pattern.

Hope you can help us :)

Hi, thanks for your interest in Hetu!

As a general-purpose distributed DL system, Hetu supports many novel functionalities, and some of them can use RDMA directly (e.g., submodules that rely only on collective communication operations).

HET is one of these modules, designed for communication-efficient distributed training of large-scale embedding models. It was proposed to address the communication bottleneck in existing communication architectures (e.g., parameter server). So far, we have only evaluated HET in Ethernet environments, but it should be easy to extend to RDMA environments with some modifications to the PS part.

I really appreciate your efforts to make it RDMA-capable. From the error information you provided, I think implementing the Message structure alone is not enough. To resolve these errors, I suggest looking at some related repos, e.g., https://github.com/elvinlife/ps-lite-rdma. I believe it will be helpful for you!
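One generic way to narrow down a parse failure like the one in Van::UnpackMeta is to check that the receiver is handed exactly the bytes the sender packed. The following is only a sketch of that idea under stated assumptions: FrameHeader, PackFrame and UnpackFrame are hypothetical names that are not part of ps-lite or Hetu, the meta is treated as an opaque byte string, and sender and receiver are assumed to share the same endianness and struct layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical fixed-size header placed in front of the serialized meta.
struct FrameHeader {
    std::uint32_t meta_len;  // bytes of serialized meta that follow the header
    std::uint64_t data_len;  // total payload bytes announced by the sender
};

// Sender side: writes the header followed by the meta bytes.
inline std::vector<char> PackFrame(const std::string &meta,
                                   std::uint64_t data_len) {
    FrameHeader hdr{static_cast<std::uint32_t>(meta.size()), data_len};
    std::vector<char> buf(sizeof(hdr) + meta.size());
    std::memcpy(buf.data(), &hdr, sizeof(hdr));
    std::memcpy(buf.data() + sizeof(hdr), meta.data(), meta.size());
    return buf;
}

// Receiver side: returns false if the number of received bytes does not
// match what the header announces, instead of parsing a bad buffer.
inline bool UnpackFrame(const char *buf, std::size_t received,
                        std::string *meta, std::uint64_t *data_len) {
    assert(meta != nullptr && data_len != nullptr);
    FrameHeader hdr;
    if (received < sizeof(hdr)) return false;
    std::memcpy(&hdr, buf, sizeof(hdr));
    if (received - sizeof(hdr) != hdr.meta_len) return false;
    meta->assign(buf + sizeof(hdr), hdr.meta_len);
    *data_len = hdr.data_len;
    return true;
}
```

A check of this kind separates "the bytes were truncated or misaligned in transport" from "the bytes arrived intact but the message definitions differ", which are two different problems to chase.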

If you plan to implement this part for research purposes, please let us know if we can help. We would be grateful to see your future pull requests!

Thanks a lot for your kind response. I really appreciate the provided reference repo. Could you please share the commit ID of ps-lite that Hetu used?

Hi,

Could you please give some instructions on how to upgrade the ps-lite in Hetu to the latest version from the original ps-lite repo? I would like to see whether the RDMA issue is resolved in a newer version of ps-lite. Thank you in advance!

I think ps-lite has not been updated for a long time, so there is no need to upgrade.

Besides, please note that our PS implementation is highly optimized. After refactoring the code base and designing new functionalities (e.g., sparse communication, embedding caching), our PS is largely different from ps-lite. So ps-lite-rdma cannot be expected to replace Hetu's PS directly without some adaptation effort.

I hope this helps, but please let me know if you have any other questions.