Not working with windows-1251 charset
huzaifa-99 opened this issue · 2 comments
Describe the bug
For HTML documents served with a windows-1251 charset, the text fields in the response have broken encoding.
To Reproduce
- Find a url with html doc that has a windows-1251 charset encoding (Example: https://www.pravda.com.ua/)
- With a basic setup of link-preview-js, query the url
- Notice the title, siteName and description fields have broken encoding
Expected behavior
The encodings should be correct
Screenshots
N/A
Desktop (please complete the following information):
- OS: MacOS Monterey 12.5.1
- Browser: Chrome
- Version: 111.0.5563.64
Additional context
This is what I tried
const { getLinkPreview, getPreviewFromContent } = require("link-preview-js");
const axios = require('axios')
const url = "https://www.pravda.com.ua/" // this url gives html document with windows-1251 charset
// Try 1
getLinkPreview(url).then((data) =>
console.log(data)
);
// response
/*
{
url: 'https://www.pravda.com.ua/',
title: '��������� ������',
siteName: '��������� ������',
description: '��������� ������ - ������ ��� ������',
mediaType: 'article',
contentType: 'text/html',
images: [ 'https://img.pravda.com/images/up_for_fb.png' ],
videos: [],
favicons: [
'https://www.pravda.com.ua/favicon-32x32.png',
'https://www.pravda.com.ua/favicon-96x96.png',
'https://www.pravda.com.ua/android-chrome-192x192.png',
'https://www.pravda.com.ua/favicon.ico',
'https://www.pravda.com.ua/apple-touch-icon-57x57.png',
'https://www.pravda.com.ua/apple-touch-icon-72x72.png',
'https://www.pravda.com.ua/apple-touch-icon-76x76.png',
'https://www.pravda.com.ua/apple-touch-icon-114x114.png',
'https://www.pravda.com.ua/apple-touch-icon-120x120.png',
'https://www.pravda.com.ua/apple-touch-icon-144x144.png',
'https://www.pravda.com.ua/apple-touch-icon-152x152.png',
'https://www.pravda.com.ua/apple-touch-icon-180x180.png'
]
}
*/
I thought maybe it was not picking up the charset correctly, so I tried passing the response headers (which include the charset) to getPreviewFromContent, but still no luck
// Try 2
axios.get(url).then(data => {
const content = {
url: url,
headers: data.headers, // headers include 'content-type': "text/html; charset=windows-1251"
data: data.data
}
getPreviewFromContent(content).then((res) => console.log("res => ", res));
})
// response
/*
{
url: 'https://www.pravda.com.ua/',
title: '��������� ������',
siteName: '��������� ������',
description: '��������� ������ - ������ ��� ������',
mediaType: 'article',
contentType: 'text/html',
images: [ 'https://img.pravda.com/images/up_for_fb.png' ],
videos: [],
favicons: [
'https://www.pravda.com.ua/favicon-32x32.png',
'https://www.pravda.com.ua/favicon-96x96.png',
'https://www.pravda.com.ua/android-chrome-192x192.png',
'https://www.pravda.com.ua/favicon.ico',
'https://www.pravda.com.ua/apple-touch-icon-57x57.png',
'https://www.pravda.com.ua/apple-touch-icon-72x72.png',
'https://www.pravda.com.ua/apple-touch-icon-76x76.png',
'https://www.pravda.com.ua/apple-touch-icon-114x114.png',
'https://www.pravda.com.ua/apple-touch-icon-120x120.png',
'https://www.pravda.com.ua/apple-touch-icon-144x144.png',
'https://www.pravda.com.ua/apple-touch-icon-152x152.png',
'https://www.pravda.com.ua/apple-touch-icon-180x180.png'
]
}
*/
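A possible explanation for why both attempts print the same thing: the body may already have been decoded as UTF-8 before link-preview-js sees it (as far as I know, axios decodes text responses as UTF-8 by default in Node), and that step is lossy. Below is a minimal sketch with hand-picked sample bytes rather than the real page; it assumes a Node build with full ICU, so that TextDecoder knows the windows-1251 label.

```javascript
// Sample bytes: "Привет" encoded as windows-1251 (illustrative, not taken from the page).
const bytes = Uint8Array.from([0xcf, 0xf0, 0xe8, 0xe2, 0xe5, 0xf2]);

// Decoding with the wrong charset turns every invalid sequence into U+FFFD.
// The original bytes cannot be recovered from this string afterwards.
const asUtf8 = new TextDecoder("utf-8").decode(bytes);

// Decoding with the charset the server declared recovers the text.
const asCp1251 = new TextDecoder("windows-1251").decode(bytes);

console.log(asUtf8, asCp1251);
```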
JavaScript handles everything in UTF-8. Adding support for the windows-1251 charset will probably require increasing the package size by adding a dependency that transcodes from windows-1251 to UTF-8 (for example this).
I don't think this is worth it; even the Wikipedia article mentions that most Russian websites use UTF-8.
I can try to extract the encoding and return it in the response, so you can implement the transcoding yourself once you receive the response?
I published a new version, 3.0.5, which includes a new field, charset. You should be able to use that to transcode the text fields. Let me know if it works for you.
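One caveat for anyone trying this: fields that have already come back as "�" probably cannot be transcoded after the fact, because the original bytes are gone by then. A rough sketch of one workaround is to fetch the raw bytes yourself, decode them with the declared charset, and hand the resulting string to getPreviewFromContent. The helper names charsetFromContentType, decodeBody and previewWithCharset are made up for illustration and are not part of link-preview-js. The sketch reads the charset from the Content-Type header; the new charset field could presumably play the same role. It also assumes a Node build with full ICU so that TextDecoder accepts windows-1251.

```javascript
// Hypothetical helper: pull the charset out of a Content-Type header value.
function charsetFromContentType(contentType, fallback = "utf-8") {
  const match = /charset\s*=\s*"?([^";\s]+)/i.exec(contentType || "");
  return match ? match[1].toLowerCase() : fallback;
}

// Hypothetical helper: decode raw bytes using the given charset label.
// TextDecoder throws a RangeError if the label is unknown to this runtime.
function decodeBody(bytes, charset) {
  return new TextDecoder(charset).decode(bytes);
}

// Hypothetical wrapper: fetch raw bytes, decode them, then build the preview.
async function previewWithCharset(url) {
  // Loaded lazily so the helpers above work without these packages installed.
  const axios = require("axios");
  const { getPreviewFromContent } = require("link-preview-js");

  // "arraybuffer" keeps axios from decoding the body as UTF-8.
  const response = await axios.get(url, { responseType: "arraybuffer" });
  const charset = charsetFromContentType(response.headers["content-type"]);

  return getPreviewFromContent({
    url,
    headers: response.headers,
    data: decodeBody(response.data, charset),
  });
}

// Usage (needs network access):
// previewWithCharset("https://www.pravda.com.ua/").then(console.log);
```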