A C library that can determine the language of text
This library was my solution for the Telegram Data Clustering Contest 2021; check it out.
The library was tested on servers running Debian GNU/Linux 10 (buster), x86-64, with 8 cores and 16 GB RAM, and is expected to work correctly on a clean installation of that system. Use the following commands to build the library:
$ mkdir build
$ cd build
$ cmake -DCMAKE_BUILD_TYPE=Release ..
$ cmake --build .
You can test the resulting library file libtgcat.so on the test data using the test script from libtgcat-tester.tar.gz. To do this, copy libtgcat.so into the directory containing the test script, then build the tester with cmake in the standard way:
$ mkdir build
$ cd build
$ cmake -DCMAKE_BUILD_TYPE=Release ..
$ cmake --build .
To test the library output, launch the resulting binary file tgcat-tester with the following parameters:
$ tgcat-tester language <input_file> <output_file>
where <input_file> is the path to the file containing the input data and <output_file> is the path to the file that receives the output data.
Output data is presented as a text file where each line represents processed channel data in JSON format:
{
"lang_code": "en"
}
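To check the tester output programmatically, the detected code can be pulled out of each JSON object. The following is only a minimal sketch: extract_lang_code is a hypothetical helper that is not part of libtgcat, and it assumes the "lang_code" value is a plain quoted string with no escape sequences, as in the example above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of libtgcat: copies the value of the
 * "lang_code" field of one JSON object into out.
 * Returns 0 on success, -1 if the field is missing, malformed,
 * or does not fit into out (including the terminating NUL). */
int extract_lang_code(const char *json, char *out, size_t out_size)
{
    static const char key[] = "\"lang_code\"";
    const char *p;
    const char *end;
    size_t len;

    assert(json != NULL && out != NULL);

    /* Locate the key, then step past it. */
    p = strstr(json, key);
    if (p == NULL)
        return -1;
    p += sizeof(key) - 1;

    /* Expect optional whitespace, a colon, optional whitespace, a quote. */
    p += strspn(p, " \t\r\n");
    if (*p != ':')
        return -1;
    p++;
    p += strspn(p, " \t\r\n");
    if (*p != '"')
        return -1;
    p++;

    /* The value runs up to the next quote (escapes are not handled). */
    end = strchr(p, '"');
    if (end == NULL)
        return -1;

    len = (size_t)(end - p);
    if (len + 1 > out_size)
        return -1;

    memcpy(out, p, len);
    out[len] = '\0';
    return 0;
}
```

Called on the object shown above with an 8-byte buffer, the helper returns 0 and leaves "en" in the buffer. For anything beyond a quick check, a real JSON parser would be a safer choice.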