mlc-ai/tokenizers-cpp

add_library INTERFACE library requires no source arguments.

songkq opened this issue · 12 comments

Thanks for sharing the bindings for the Hugging Face tokenizers. When I build the project, it fails with the errors below. Could you please share a usage tutorial?

CMake Error at CMakeLists.txt:109 (add_library):
  add_library INTERFACE library requires no source arguments.


CMake Error at CMakeLists.txt:110 (target_link_libraries):
  Cannot specify link libraries for target "tokenizers_c" which is not built
  by this project.

Interesting, this could be due to a difference in CMake versions. Can you check your CMake version?

@tqchen Thanks. I have confirmed that CMake 3.26.4 and -std=c++17 are required.

set(TOKENIZERS_RUST_LIB "${TOKENIZERS_CPP_CARGO_BINARY_DIR}/libtokenizers_c.a")

Given the line above from CMakeLists.txt, does it mean that I need to compile the huggingface/tokenizers library into libtokenizers_c.a myself first?

It should get compiled automatically.

Did you confirm that you have Rust and Cargo installed, btw?

@tqchen @junrushao Thanks. After configuring the Cargo environment, the Rust library is compiled automatically.
The build produces two libraries: libtokenizers_cpp.a and libtokenizers_c.a.
If I only want to use the SentencePieceTokenizer interface, do I just need to add libtokenizers_cpp.a and tokenizers_cpp.h to my program?


Here is a SentencePiece test case. The returned result is empty, though. Could you please give some advice?

[debug] input_text = hello world
[debug] token_ids =
[debug] recover_text =

#include <cstdio>    // printf
#include <iostream>
#include <memory>    // std::unique_ptr, std::make_unique
#include <vector>
#include <string>
#include "tokenizers_cpp.h"
#include "sentencepiece_tokenizer.cc"

int main() {

    std::unique_ptr<tokenizers::Tokenizer> tokenizer = std::make_unique<tokenizers::SentencePieceTokenizer>("spiece.model");

    std::string text = "hello world";
    printf("[debug] input_text = %s\n", text.c_str());
    auto token_ids = tokenizer->Encode(text);
    printf("[debug] token_ids = ");
    for(const int token_id: token_ids){
        printf("%d, ", token_id);
    }
    printf("\n");
    auto recover_text = tokenizer->Decode(token_ids);
    printf("[debug] recover_text = %s\n", recover_text.c_str());
    return 0;
}

[image: CMakeLists.txt]

All the interfaces take the model's binary blob instead of a file name.
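
For anyone hitting the same empty result: a minimal sketch of reading the model file into a byte string before handing it to the tokenizer. LoadBytesFromFile is a hypothetical helper name, not something confirmed to be part of the library, and the usage in the trailing comment assumes the same SentencePieceTokenizer constructor as in the test case above, just given the file contents rather than the path.

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper: read a whole file into a std::string as raw bytes.
// Binary mode matters because the SentencePiece model is not a text file.
std::string LoadBytesFromFile(const std::string& path) {
  std::ifstream fs(path, std::ios::in | std::ios::binary);
  if (!fs) {
    throw std::runtime_error("Cannot open " + path);
  }
  return std::string(std::istreambuf_iterator<char>(fs),
                     std::istreambuf_iterator<char>());
}

// Assumed usage with the test case from this thread (not compiled here):
//   std::string blob = LoadBytesFromFile("spiece.model");
//   auto tokenizer =
//       std::make_unique<tokenizers::SentencePieceTokenizer>(blob);
```

Throwing when the file cannot be opened makes a wrong path fail loudly instead of silently producing an empty blob.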

@tqchen Thanks. Everything is OK now.

Added some examples to https://github.com/mlc-ai/tokenizers-cpp. Please check them out and send a PR to improve them further if you like.

@tqchen Thanks for sharing the examples. target_link_libraries(tokenizers_c INTERFACE ${TOKENIZERS_RUST_LIB} ${CMAKE_DL_LIBS}) needs to be set in the tokenizers-cpp CMakeLists.txt. Otherwise, linking fails with the following error.

tokenizers/release/libtokenizers_c.a(std-946b15357ac77df4.std.1ade4ed0-cgu.0.rcgu.o): In function `std::sys::unix::weak::fetch':
/rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/sys/unix/weak.rs:138: undefined reference to `dlsym'
collect2: error: ld returned 1 exit status
CMakeFiles/example.dir/build.make:99: recipe for target 'example' failed
make[2]: *** [example] Error 1
CMakeFiles/Makefile2:165: recipe for target 'CMakeFiles/example.dir/all' failed
make[1]: *** [CMakeFiles/example.dir/all] Error 2
Makefile:155: recipe for target 'all' failed
make: *** [all] Error 2

Thanks @songkq, would you mind sending a PR? I think we can detect the Linux system name and set it here https://github.com/mlc-ai/tokenizers-cpp/blob/main/CMakeLists.txt#L20 (just like Foundation for iOS).

@tqchen I have submitted a PR here (#2).