Primary language: C++. License: Apache License 2.0 (Apache-2.0).

pycld3

Python bindings to the Compact Language Detector v3 (CLD3).

Newer Alternative: gcld3

Note: Since the original publication of pycld3, Google's cld3 authors have published the Python package gcld3, which provides official Python bindings built with pybind. Please check that project out, as it is part of the canonical cld3 repository and will likely stay in closer lockstep with any cld3 changes over time.

Overview

This package contains Python bindings (via Cython) to Google's CLD3 library.

>>> import cld3
>>> cld3.get_language("影響包含對氣候的變化以及自然資源的枯竭程度")
LanguagePrediction(language='zh', probability=0.999969482421875, is_reliable=True, proportion=1.0)

The library outputs BCP-47-style language codes. For some languages, output is differentiated by script. Language and script names come from Unicode CLDR. The library supports over 100 languages/scripts. See the full list of supported languages/scripts in Google's CLD3 documentation.
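As a small illustration of script-differentiated codes, the sketch below splits a code such as zh-Latn into its language and script parts. The helper name split_language_code is hypothetical and not part of cld3, and it assumes a code is either a bare language subtag or a language subtag followed by one script subtag.

```python
from typing import Optional, Tuple


def split_language_code(code: str) -> Tuple[str, Optional[str]]:
    """Split a BCP-47-style code into (language, script).

    Hypothetical helper, not part of cld3. Assumes the code is either a bare
    language subtag ("zh") or a language plus a script subtag ("zh-Latn").
    """
    language, _, script = code.partition("-")
    return language, (script or None)


print(split_language_code("zh"))       # ('zh', None)
print(split_language_code("zh-Latn"))  # ('zh', 'Latn')
```

Keeping the script separate can be handy if your program wants to treat, say, romanized and native-script text of the same language differently.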

Installing with Wheels: Supported Versions and Platforms

This project supports CPython versions 3.6 through 3.9.

We publish wheels for the following matrix:

  • macOS: CPython 3.6 through 3.9
  • Linux (manylinux1): CPython 3.6 through 3.9

The wheels for both macOS and manylinux1 include the external protobuf library, copied into the wheel itself via delocate or auditwheel respectively, so that you won't need to install any extra non-PyPI dependencies.

If you are installing on one of the variants listed above, you should not need to have protoc or libprotobuf installed:

python -m pip install -U pycld3
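If you want a rough programmatic check before installing, the sketch below compares the running interpreter against the wheel matrix listed above. The function wheel_likely_available is hypothetical; it looks only at the Python minor version and the OS name, so it cannot tell apart cases such as Alpine Linux or an i686 host, where a source install is still needed.

```python
import platform
import sys


def wheel_likely_available(version_info=None, system=None) -> bool:
    """Rough check against the wheel matrix in this README.

    Hypothetical helper. Only the Python version and OS name are checked;
    CPU architecture, libc flavor, and interpreter implementation are not,
    so treat a True result as "probably".
    """
    major, minor = tuple(version_info or sys.version_info)[:2]
    system = system or platform.system()
    return major == 3 and 6 <= minor <= 9 and system in ("Darwin", "Linux")


# Check the interpreter that is running this snippet.
print(wheel_likely_available())
```

A False result suggests following the source-install instructions in the next section.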

Installing from Source: Prerequisites

If you are not on a platform variant that is eligible to use a wheel, you may still be able to use pycld3 via its source distribution (tar.gz), but a bit more work is required to install. Namely, you'll also need:

  • the Protobuf compiler (the protoc executable)
  • the Protobuf development headers and the libprotobuf library
  • a compiler, preferably g++

Please consult the official protobuf repository for information on installing Protobuf. The project contains an Installation README that covers installation on Windows and Unix.

If for whatever reason you are on a Unix host but unable to use the wheels (for instance, if you have an i686 architecture), here is a quick-and-dirty guide to installing.

Debian/Ubuntu

sudo apt-get update -y
sudo apt-get install -y --no-install-recommends \
    g++ \
    protobuf-compiler \
    libprotobuf-dev
python -m pip install -U pycld3

Alpine Linux

Note: Alpine Linux does not support PyPI wheels as of April 2020. The steps below are mandatory on Alpine Linux because you will need to install from the source distribution. If the situation permits, using a Debian distro should be much easier (and faster).

apk --update add g++ protobuf protobuf-dev
python -m pip install -U pycld3

CentOS/RHEL

Install from source, as root/UID 0:

sudo su -
set -ex
pushd /opt
PROTOBUF_VERSION='3.11.4'
yum update -y
yum install -y autoconf automake gcc-c++ glibc-headers gzip libtool make python3-devel zlib-devel
curl -Lo /opt/protobuf.tar.gz \
    "https://github.com/protocolbuffers/protobuf/releases/download/v${PROTOBUF_VERSION}/protobuf-cpp-${PROTOBUF_VERSION}.tar.gz"
tar -xzvf protobuf.tar.gz
rm -f protobuf.tar.gz
pushd "protobuf-${PROTOBUF_VERSION}"
./configure --with-zlib --disable-debug && make && make install && ldconfig --verbose
popd && rm -rf "protobuf-${PROTOBUF_VERSION}" && popd && set +ex

python -m pip install -U pycld3

Note: the steps above are for CentOS 8. For earlier versions, you may need to replace:

  • gcc-c++ with g++
  • python3-devel with python-devel

macOS/Homebrew

brew update
brew upgrade protobuf || brew install -v protobuf
python -m pip install -U pycld3

Windows

Please consult Protobuf's C++ Installation - Windows section for help with installing Protobuf on Windows.

If you would like to help contribute Windows wheels (preferably as a job within the project's CI/CD pipelines), please file an issue.

Usage

cld3 exports two module-level functions, get_language() and get_frequent_languages():

>>> import cld3

>>> cld3.get_language("影響包含對氣候的變化以及自然資源的枯竭程度")
LanguagePrediction(language='zh', probability=0.999969482421875, is_reliable=True, proportion=1.0)

>>> cld3.get_language("This is a test")
LanguagePrediction(language='en', probability=0.9999980926513672, is_reliable=True, proportion=1.0)

>>> for lang in cld3.get_frequent_languages(
...     "This piece of text is in English. Този текст е на Български.",
...     num_langs=3
... ):
...     print(lang)
...
LanguagePrediction(language='bg', probability=0.9173890948295593, is_reliable=True, proportion=0.5853658318519592)
LanguagePrediction(language='en', probability=0.9999790191650391, is_reliable=True, proportion=0.4146341383457184)
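The fields of each prediction can be combined to pick a single answer for mixed-language text. The following is a minimal sketch, not part of cld3: dominant_language is a hypothetical helper, the 0.7 probability cutoff is an arbitrary assumption, and the namedtuple is a stand-in that mirrors the fields shown in the outputs above so that the example runs without cld3 installed.

```python
from collections import namedtuple

# Stand-in with the same field names as the predictions printed above.
LanguagePrediction = namedtuple(
    "LanguagePrediction", ["language", "probability", "is_reliable", "proportion"]
)


def dominant_language(predictions, min_probability=0.7):
    """Return the code of the reliable prediction covering the most text.

    Returns None when no prediction is both reliable and above the cutoff.
    """
    candidates = [
        p for p in predictions
        if p.is_reliable and p.probability >= min_probability
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.proportion).language


# Values copied from the get_frequent_languages() output above.
sample = [
    LanguagePrediction("bg", 0.9173890948295593, True, 0.5853658318519592),
    LanguagePrediction("en", 0.9999790191650391, True, 0.4146341383457184),
]
print(dominant_language(sample))  # bg
```

With cld3 installed, the same helper should accept the result of cld3.get_frequent_languages(text, num_langs=3) directly, since it only reads the attributes shown.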

FAQ

cld3 incorrectly detects my input. How can I fix this?

A first resort is to preprocess (clean) your input text based on conditions specific to your program.

A salient example is to remove URLs and email addresses from the input. CLD3 (unlike CLD2) does almost none of this cleaning for you, in the spirit of not penalizing other users with overhead that they may not need.

Here's such an example using a simplified URL regex from Regular Expressions Cookbook, 2nd ed.:

>>> import re
>>> import cld3

# cld3 does not ignore the URL components by default
>>> s = "Je veux que: https://site.english.com/this/is/a/url/path/component#fragment"
>>> cld3.get_language(s)
LanguagePrediction(language='en', probability=0.5319557189941406, is_reliable=False, proportion=1.0)

>>> url_re = r"\b(?:https?://|www\.)[a-z0-9-]+(\.[a-z0-9-]+)+(?:[/?].*)?"
>>> new_s = re.sub(url_re, "", s)
>>> new_s
'Je veux que: '
>>> cld3.get_language(new_s)
LanguagePrediction(language='fr', probability=0.9799421429634094, is_reliable=True, proportion=1.0)

Note: This URL regex aims for simplicity. It requires a domain name, and doesn't allow a username or password; it allows the scheme (http or https) to be omitted if it can be inferred from the subdomain (www). Source: Regular Expressions Cookbook, 2nd ed. - Goyvaerts & Levithan.
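The same idea extends to email addresses. Below is a minimal sketch of a reusable cleaning step, under a few assumptions: clean_for_detection is a hypothetical name, the URL pattern is adapted from the one above (case-insensitive, and the tail stops at whitespace rather than consuming the rest of the line), and the email pattern is intentionally loose.

```python
import re

# URL pattern adapted from the example above: IGNORECASE is added and the tail
# stops at whitespace (\S*) instead of running to the end of the line (.*).
URL_RE = re.compile(
    r"\b(?:https?://|www\.)[a-z0-9-]+(?:\.[a-z0-9-]+)+(?:[/?]\S*)?",
    re.IGNORECASE,
)
# Deliberately loose email pattern (an assumption, not an RFC 5322 matcher).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def clean_for_detection(text: str) -> str:
    """Drop URLs and email addresses, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


print(clean_for_detection(
    "Je veux que: https://site.english.com/this/is/a/url/path/component#fragment"
))  # Je veux que:
```

The cleaned string can then be passed to cld3.get_language() as in the example above. Stopping the URL match at whitespace keeps any text that follows a URL on the same line, which the simpler pattern would discard.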

In some other cases, you cannot fix the incorrect detection. Language detection algorithms in general may perform poorly with very short inputs. Rarely should you trust the output of something like detect("hi"). Keep this limitation in mind regardless of what library you are using.
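One way to act on this limitation is to skip detection for very short inputs. The sketch below is an assumption-laden illustration: detect_if_long_enough is a hypothetical helper, the detector is passed in as a callable so the example runs without cld3, and the 10-character default is arbitrary and should be tuned for your data (characters carry more information in some scripts than in others).

```python
def detect_if_long_enough(text, detector, min_chars=10):
    """Call `detector` only when the stripped text has at least `min_chars` characters.

    Returns None for shorter inputs so callers can fall back to a default.
    """
    stripped = text.strip()
    if len(stripped) < min_chars:
        return None
    return detector(stripped)


# A dummy detector stands in for cld3.get_language in this sketch.
print(detect_if_long_enough("hi", lambda t: "some-prediction"))              # None
print(detect_if_long_enough("This is a test", lambda t: "some-prediction"))  # some-prediction
```

With cld3 installed, you would pass cld3.get_language as the detector argument.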

Please remember that, at the end of the day, this project is just a Python wrapper around the CLD3 C++ library, which does the actual heavy lifting.

I'm seeing an error during pip installation. How can I fix this?

First, please make sure that you have read the installation section and that you have installed Protobuf if necessary.

If that doesn't help, please file an issue in this repository. The build process for this project is somewhat complex because it involves both Cython and Protobuf, but I do my best to make it work everywhere possible.

Protobuf is installed, but I'm still seeing "cannot open shared object file"

You may have installed Protobuf but still see an error such as:

ImportError: libprotobuf.so.22: cannot open shared object file: No such file or directory

This likely means that Python is not finding the libprotobuf shared object, possibly because ldconfig didn't do what it was supposed to. You may need to tell it where to look.

You can find where the library sits via:

$ find /usr -name 'libprotoc.so' \( -type l -o -type f \)
/usr/local/lib/libprotoc.so

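Then, you can add the directory containing this file (which normally also holds libprotobuf) to LD_LIBRARY_PATH. The command below takes the first match only and avoids a trailing colon when LD_LIBRARY_PATH is unset:

export LD_LIBRARY_PATH="$(dirname "$(find /usr -name 'libprotoc.so' \( -type l -o -type f \) | head -n 1)")${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"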

You can quickly test that this worked:

$ python -c 'import cld3; print(cld3.get_language("影響包含對氣候的變化以及自然資源的枯竭程度"))'
LanguagePrediction(language='zh', probability=0.999969482421875, is_reliable=True, proportion=1.0)
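If the import still fails, it can help to look at what the current process sees. The sketch below uses only the Python standard library: protobuf_library_report is a hypothetical helper, and ctypes.util.find_library is a best-effort lookup, so a None result is a hint rather than proof that libprotobuf is missing.

```python
import ctypes.util
import os


def protobuf_library_report():
    """Collect hints about whether the dynamic loader can see libprotobuf."""
    return {
        # Library name or path found by the platform's usual lookup, or None.
        "find_library": ctypes.util.find_library("protobuf"),
        # Extra search directories currently exported to this process.
        "LD_LIBRARY_PATH": [
            d for d in os.environ.get("LD_LIBRARY_PATH", "").split(":") if d
        ],
    }


for key, value in protobuf_library_report().items():
    print(f"{key}: {value}")
```

If find_library reports None and the directory you located with find is not listed under LD_LIBRARY_PATH, the export step above most likely did not take effect in the shell that launches Python.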

Authors

This repository contains a fork of google/cld3 at commit 06f695f. The license for google/cld3 can be found at LICENSES/CLD3_LICENSE.

This repository is a combination of changes introduced by various forks of google/cld3 by the following people: