atom/encoding-selector

fallback encoding

risperdal opened this issue · 7 comments

I have a legacy project with mixed encodings. Encoding selector is not able to detect the encodings properly because of its limitations.

In Sublime Text there is a setting named fallback_encoding,
which is described as follows.

fallback_encoding

The encoding to use when the encoding can’t be determined automatically. ASCII, UTF-8 and UTF-16 encodings will be detected automatically.

I am setting Sublime's default_encoding to UTF-8 and fallback_encoding to ISO-8859-9.

With this setting, any file that is not detected as UTF-8 is treated as ISO-8859-9.

It would be good to see a similar option in encoding-selector.

There is a setting for this in the main Settings View page:

[Screenshot, 2015-12-28 at 10:05:22 AM]

Unless I'm misunderstanding what it is you're asking for?

What I'm talking about is this: if a file is not detected as the default file encoding (UTF-8, as shown in your image), the encoding should fall back to the fallback_encoding setting.

If the fallback_encoding option is set, any file that is not detected as default_encoding (in my case UTF-8) should be treated as fallback_encoding (in my case ISO-8859-9) instead of guessing the encoding from the file content, because in some situations the guess could be wrong.

What if the file is UTF-16 and your settings are UTF-8 and ISO-8859-9? It should treat it as ISO-8859-9 instead of UTF-16 as detected?

If the file is UTF-16 and the encoding setting is auto-detect, it should go with the auto-detected encoding no matter what.

But if the file is ISO-8859-9 and my default encoding setting is UTF-8 instead of auto-detect, it should be treated as UTF-8. If the fallback_encoding option is set, however, there is a catch: if the file content does not match UTF-8 (the default encoding), it should fall back to the fallback encoding setting.

I am asking for this because the detected encoding can sometimes be wrong.

Some of my files are windows-1254 but they are detected as windows-1252. Because of that, it would be good to have an option like a fallback encoding.
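
To illustrate why this kind of misdetection can happen, here is a minimal sketch. It assumes a runtime with the WHATWG TextDecoder and legacy encoding support (for example Node.js built with full ICU); it is not how encoding-selector itself detects anything. The same bytes decode without error as both windows-1254 and windows-1252, so a content-based guess has little to go on, while a strict UTF-8 check rejects them outright.

```javascript
// The Turkish letters "şğı" encoded as windows-1254 (ş = 0xFE, ğ = 0xF0, ı = 0xFD).
const bytes = new Uint8Array([0xfe, 0xf0, 0xfd]);

const asTurkish = new TextDecoder('windows-1254').decode(bytes);
const asWestern = new TextDecoder('windows-1252').decode(bytes);

// Both decodes succeed; only the resulting characters differ.
console.log(asTurkish); // "şğı"
console.log(asWestern); // "þðý"

// The same bytes are not valid UTF-8, so a strict (fatal) decode throws.
let validUtf8 = true;
try {
  new TextDecoder('utf-8', { fatal: true }).decode(bytes);
} catch (err) {
  validUtf8 = false;
}
console.log(validUtf8); // false
```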

I'm confused. In the case I specified, with the file being UTF-16, your proposed system would select an incorrect encoding for the file even though it was correctly detected as UTF-16. I feel like this would be really confusing for users.

I understand that the proposed system would be useful in your specific case. I'm just not convinced how useful it would be in the common case.

Okay, I'm starting over.

Think about two files in a project with different encodings.

Keep in mind that this whole idea is to overcome wrong charset detection. There should be a fallback mechanism.

1 - a.php which is UTF-8
2 - b.php which is windows-1254

If encoding selector is configured to auto-detect:
It detects a.php as UTF-8, but it detects b.php as windows-1252 (wrong detection).

I'm sure auto encoding detection is not always 100% correct. This problem may apply to other encodings as well.

To overcome this issue, the user should not configure encoding-selector to auto-detect.
The user should set the encoding to UTF-8 (I assume most of the files in the project are UTF-8).
And the user should configure a fallback encoding if needed.

If a fallback encoding is configured:
If the file (a.php) meets the requirements of the default encoding (UTF-8), it should be treated as the default encoding (UTF-8).

But if the file (b.php) does not meet the requirements of the default encoding, it should fall back to the fallback encoding setting (windows-1254).

If a fallback encoding is not configured, everything should work as it does right now.

Pseudocode for what I'm asking:

var defaultEncoding = config.get('default_encoding');
var encoding = null;

if (!config.get('auto_detect_encoding')) {
    // Auto-detection is off.
    if (config.get('fallback_encoding') != null) {
        // A fallback is configured: check whether the file is valid in the
        // default encoding (in my case UTF-8) and use it if so.
        if (checkEncoding(buffer, defaultEncoding)) {
            encoding = defaultEncoding;
        } else {
            // Not valid in the default encoding: use the fallback encoding.
            encoding = config.get('fallback_encoding');
        }
    } else {
        // No fallback configured: treat as the default encoding, as today.
        encoding = defaultEncoding;
    }
} else {
    encoding = autoDetectEncoding(buffer);
}
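
The pseudocode above could be made runnable roughly as follows. This is only a sketch under stated assumptions: `checkEncoding`, `chooseEncoding`, and the settings object are hypothetical names that mirror the pseudocode rather than real Atom or encoding-selector APIs, the auto-detector is passed in as a stub, and the validity check relies on the WHATWG TextDecoder with `fatal: true` (which is most meaningful for UTF-8, since single-byte encodings accept almost any byte sequence).

```javascript
// Hypothetical helper: true if `bytes` decodes without error as `encoding`.
function checkEncoding(bytes, encoding) {
  try {
    new TextDecoder(encoding, { fatal: true }).decode(bytes);
    return true;
  } catch (err) {
    return false;
  }
}

// Hypothetical selection logic; `autoDetect` stands in for whatever detector is in use.
function chooseEncoding(bytes, settings, autoDetect) {
  if (settings.autoDetectEncoding) {
    return autoDetect(bytes);
  }
  const { defaultEncoding, fallbackEncoding } = settings;
  // Only consult the fallback when one is configured and the default does not fit.
  if (fallbackEncoding != null && !checkEncoding(bytes, defaultEncoding)) {
    return fallbackEncoding;
  }
  return defaultEncoding;
}

const settings = {
  autoDetectEncoding: false,
  defaultEncoding: 'utf-8',
  fallbackEncoding: 'windows-1254',
};

// a.php: valid UTF-8 bytes.
const aPhp = new TextEncoder().encode('<?php echo "ş"; ?>');
// b.php: "<?ş?>" in windows-1254; the byte 0xFE is never valid in UTF-8.
const bPhp = new Uint8Array([0x3c, 0x3f, 0xfe, 0x3f, 0x3e]);

const encodingA = chooseEncoding(aPhp, settings, () => 'windows-1252');
const encodingB = chooseEncoding(bPhp, settings, () => 'windows-1252');
console.log(encodingA); // "utf-8"
console.log(encodingB); // "windows-1254"
```

With `fallbackEncoding` left unset, both files would be treated as the default encoding, which matches the current behavior described above.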

Closing this since you reopened it in atom/atom.