Text encoding definition class based on a range of code page character numbers.
So far, in PHP v7.* the mb_detect_encoding
function does not work well.
Therefore, you have to somehow solve this problem.
This class is one solution.
Built-in encodings and accuracy:
letters -> | 5 | 15 | 30 | 60 | 120 | 180 | 270 |
---|---|---|---|---|---|---|---|
windows-1251 | 99.13 | 98.83 | 98.54 | 99.04 | 99.73 | 99.93 | 100.0 |
koi8-r | 99.89 | 99.98 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
iso-8859-5 | 81.79 | 99.27 | 99.98 | 100.0 | 100.0 | 100.0 | 100.0 |
ibm866 | 99.81 | 99.99 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
mac-cyrillic | 12.79 | 47.49 | 73.48 | 92.15 | 99.30 | 99.94 | 100.0 |
Worst accuracy with mac-cyrillic, you need at least 60 characters to determine this encoding with an accuracy of 92.15%. Windows-1251 encoding also has very poor accuracy. This is because the numbers of their characters in the tables overlap very much.
Fortunately, mac-cyrillic and ibm866 encodings are not used to encode web pages. By default, they are disabled in the script, but you can enable them if necessary.
letters -> | 5 | 10 | 15 | 30 | 60 |
---|---|---|---|---|---|
windows-1251 | 99.40 | 99.69 | 99.86 | 99.97 | 100.0 |
koi8-r | 99.89 | 99.98 | 99.98 | 100.0 | 100.0 |
iso-8859-5 | 81.79 | 96.41 | 99.27 | 99.98 | 100.0 |
The accuracy of the determination is high even in short sentences from 5 to 10 letters. And for phrases from 60 letters, the accuracy of determination reaches 100%.
Determining the encoding is very fast, for example, text longer than 1,300,000 Cyrillic characters is checked in 0.00096 sec. (on my computer)
Link to the idea: http://patttern.blogspot.com/2012/07/php-python.html
Composer (recommended) Use Composer to install this library from Packagist: onnov/captcha
Run the following command from your project directory to add the dependency:
composer require onnov/detect-encoding
Alternatively, add the dependency directly to your composer.json file:
"require": {
"onnov/detect-encoding": "^1.0"
}
The classes in the project are structured according to the PSR-4 standard, so you can also use your own autoloader or require the needed files directly in your code.
use Onnov\DetectEncoding\EncodingDetector;
$detector = new EncodingDetector();
- Definition of text encoding:
$text = 'Проверяемый текст';
$detector->getEncoding($text)
- Method for converting text of an unknown encoding into a given encoding, by default in utf-8 optional parameters:
$extra = '//TRANSLIT' (default setting) , other options: '' or '//IGNORE'
$encoding = 'utf-8' (default setting) , other options: any encoding that is available iconv
$detector->iconvXtoEncoding($text)
- Method to enable encoding definition:
$detector->enableEncoding([
$detector::IBM866,
$detector::MAC_CYRILLIC,
]);
- Method to disable encoding definition:
$detector->disableEncoding([
$detector::ISO_8859_5,
]);
- Method for adding custom encoding:
$detector->addEncoding([
'encodingName' => [
'upper' => '1-50,200-250,253', // uppercase character number range
'lower' => '55-100,120-180,199', // lowercase character number range
],
]);
- Method to get a custom encoding range:
use Onnov\DetectEncoding\CodePage;
// utf-8 encoded alphabet
$cyrillicUppercase = 'АБВГДЕЁЖЗИЙКЛМНОПРСТУФЧЦЧШЩЪЫЬЭЮЯ';
$cyrillicLowercase = 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя';
$codePage = new CodePage();
$encodingRange = $codePage->getRange($cyrillicUppercase, $cyrillicLowercase, 'koi8-u'));
Add in services.yaml file:
services:
Onnov\DetectEncoding\EncodingDetector:
autowire: true