Detect the language of text.
What’s so cool about franc?
- franc supports more languages(†) than any other library, or Google;
- franc is easily forked to support 335 languages;
- franc is just as fast as the competition.
† - If humans write in the language, on the web, and the language has more than one million speakers, franc detects it.
Installation
npm:
$ npm install franc
Component:
$ component install wooorm/franc
Bower:
$ bower install franc
Duo:
var franc = require('wooorm/franc');
AMD:
require(['path/to/dist/franc.js'], function (franc) {
franc('Alle menslike wesens word vry'); // "afr"
});
Browser globals:
<script src="path/to/dist/franc.js" charset="utf-8"></script>
<script>
franc('Alle menslike wesens word vry'); // "afr"
</script>
Usage
var franc = require('franc');
franc('Alle menslike wesens word vry'); // "afr"
franc('এটি একটি ভাষা একক IBM স্ক্রিপ্ট'); // "ben"
franc('Alle mennesker er født frie og'); // "nno"
franc(''); // "und"
franc.all('O Brasil caiu 26 posições em');
/*
* [
* [ 'por', 1 ],
* [ 'glg', 0.7362599377808503 ],
* [ 'src', 0.7286553750432078 ],
* [ 'lav', 0.6944348427238161 ],
* [ 'cat', 0.6802627030763913 ],
* [ 'spa', 0.6633252678880055 ],
* [ 'bos', 0.6536467334946423 ],
* [ 'tpi', 0.6477704804701002 ],
* [ 'hrv', 0.6456965088143796 ],
* [ 'snn', 0.6374006221914967 ],
* [ 'bam', 0.5900449360525406 ],
* [ 'sco', 0.5893536121673004 ],
* ...
* ]
*/
/* "und" is returned for too-short input: */
franc('the'); // 'und'
/* You can change what’s too short (default: 10): */
franc('the', {'minLength': 3}); // 'sco'
/* Provide a whitelist: */
franc.all('O Brasil caiu 26 posições em', {
'whitelist' : ['por', 'src', 'glg', 'spa']
});
/*
* [
* [ 'por', 1 ],
* [ 'glg', 0.7362599377808503 ],
* [ 'src', 0.7286553750432078 ],
* [ 'spa', 0.6633252678880055 ]
* ]
*/
/* Provide a blacklist: */
franc.all('O Brasil caiu 26 posições em', {
'blacklist' : ['src', 'glg', 'lav']
});
/*
* [
* [ 'por', 1 ],
* [ 'cat', 0.6802627030763913 ],
* [ 'spa', 0.6633252678880055 ],
* [ 'bos', 0.6536467334946423 ],
* [ 'tpi', 0.6477704804701002 ],
* [ 'hrv', 0.6456965088143796 ],
* [ 'snn', 0.6374006221914967 ],
* [ 'bam', 0.5900449360525406 ],
* [ 'sco', 0.5893536121673004 ],
* ...
* ]
*/
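The scores returned by `franc.all` are relative to the best match, so a small gap between the first two entries suggests an uncertain guess. The following is a minimal sketch of one way to act on that; `pickLanguage` and `minMargin` are hypothetical names, not part of franc, and the sample data is copied from the whitelist example above.

```javascript
// Sample in the shape `franc.all` returns above: [code, score] pairs, best first.
var sample = [
  ['por', 1],
  ['glg', 0.7362599377808503],
  ['src', 0.7286553750432078],
  ['spa', 0.6633252678880055]
];

// Hypothetical helper (not part of franc): accept the top guess only if it
// beats the runner-up by at least `minMargin`; otherwise report "und".
function pickLanguage(results, minMargin) {
  if (!results || results.length === 0) {
    return 'und';
  }
  if (results.length === 1) {
    return results[0][0];
  }
  var margin = results[0][1] - results[1][1];
  return margin >= minMargin ? results[0][0] : 'und';
}

pickLanguage(sample, 0.2); // "por" (the margin is about 0.26)
pickLanguage(sample, 0.3); // "und"
```

In real use, the first argument would be the result of `franc.all(text)`, optionally narrowed with a whitelist or blacklist as shown above.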
CLI
Install:
$ npm install --global franc
Use:
Usage: franc [options] <string>
Detect the language of text
Options:
-h, --help output usage information
-v, --version output version number
-m, --min-length <number> minimum length to accept
-w, --whitelist <string> allow languages
-b, --blacklist <string> disallow languages
Usage:
# output language
$ franc "Alle menslike wesens word vry"
# afr
# output language from stdin (expects utf8)
$ echo "এটি একটি ভাষা একক IBM স্ক্রিপ্ট" | franc
# ben
# blacklist certain languages
$ franc --blacklist por,glg "O Brasil caiu 26 posições em"
# src
# output language from stdin with whitelist
$ echo "Alle mennesker er født frie og" | franc --whitelist nob,dan
# nob
Supported languages
franc supports 175 “languages”, by default. For a complete list, check out Supported-Languages.md.
Supporting more or fewer languages
Supporting more or fewer languages is easy: fork the project and run the following:
$ npm install # Install development dependencies.
$ export THRESHOLD=100000 # Set minimum speakers to 100,000.
$ npm run build # Run the `build` script.
The above would create a version of franc with support for any language with 100,000 or more speakers. To support all languages, even dead ones like Latin, specify -1.
Browser
I’ve compiled three versions of franc for use in the browser. They’re UMD compliant: they work with AMD, CommonJS, and <script>s.
- dist/franc.js — franc with support for languages with 8 million or more speakers (75 languages);
- dist/franc-most.js — franc with support for languages with 1 million or more speakers (175 languages, the same as the Node or Component version);
- dist/franc-all.js — franc with support for all languages (335 languages, careful, huge!).
Benchmark
On a MacBook Air, franc runs through a set of 175 paragraphs 2 times per second (175 × 2 = 350 detections per second in total).
benchmarks * 175 paragraphs in different languages
2 op/s » franc -- this module
2 op/s » guesslanguage
2 op/s » languagedetect
2 op/s » vac
(I’ll work on a better benchmark soon)
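To make the arithmetic behind these numbers concrete, here is a rough sketch of how such a figure could be measured. `benchmark` is a hypothetical helper, not the script used for the results above; it takes any detector function (for example `franc`) and a list of paragraphs, and reports both full runs per second and single detections per second.

```javascript
// Hypothetical timing helper (an assumption, not franc's own benchmark).
// Repeatedly runs `detect` over every paragraph for roughly `durationMs`.
function benchmark(detect, paragraphs, durationMs) {
  var start = Date.now();
  var runs = 0;

  do {
    paragraphs.forEach(function (paragraph) {
      detect(paragraph);
    });
    runs++;
  } while (Date.now() - start < durationMs);

  // Guard against a zero elapsed time on very fast runs.
  var seconds = Math.max(Date.now() - start, 1) / 1000;

  return {
    runsPerSecond: runs / seconds,
    detectionsPerSecond: (runs * paragraphs.length) / seconds
  };
}

// Stand-in detector so the sketch runs without franc installed.
var stub = function () {
  return 'und';
};

var result = benchmark(stub, ['one', 'two', 'three'], 20);
```

With 175 paragraphs and 2 runs per second, `detectionsPerSecond` would come out at 350, which matches the total quoted above.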
Derivation
Franc is a derivative work of guess-language (Python, LGPL), guesslanguage (C++, LGPL), and Language::Guess (Perl, GPL). Their creators, respectively Kent S. Johnson, Jacob R. Rideout, and Maciej Ceglowski, granted me the rights to distribute franc under the MIT license.