OP-Engineering/link-preview-js

Not working with windows-1251 charset

huzaifa-99 opened this issue · 2 comments

Describe the bug
For HTML documents served with the windows-1251 charset, the text fields in the response come back with broken encoding.

To Reproduce

  1. Find a URL whose HTML document uses the windows-1251 charset (example: https://www.pravda.com.ua/)
  2. With a basic setup of link-preview-js, query the URL
  3. Notice that the title, siteName, and description fields have broken encoding

Expected behavior
The text fields should be decoded correctly.

Screenshots
N/A

Desktop (please complete the following information):

  • OS: macOS Monterey 12.5.1
  • Browser: Chrome
  • Version: 111.0.5563.64

Additional context
This is what I tried:

const { getLinkPreview, getPreviewFromContent } = require("link-preview-js");
const axios = require('axios')

const url = "https://www.pravda.com.ua/" // this url gives html document with windows-1251 charset

// Try 1
getLinkPreview(url).then((data) =>
    console.log(data)
);
// response
/*
{
  url: 'https://www.pravda.com.ua/',
  title: '��������� ������',
  siteName: '��������� ������',
  description: '��������� ������ - ������ ��� ������',
  mediaType: 'article',
  contentType: 'text/html',
  images: [ 'https://img.pravda.com/images/up_for_fb.png' ],
  videos: [],
  favicons: [
    'https://www.pravda.com.ua/favicon-32x32.png',
    'https://www.pravda.com.ua/favicon-96x96.png',
    'https://www.pravda.com.ua/android-chrome-192x192.png',
    'https://www.pravda.com.ua/favicon.ico',
    'https://www.pravda.com.ua/apple-touch-icon-57x57.png',
    'https://www.pravda.com.ua/apple-touch-icon-72x72.png',
    'https://www.pravda.com.ua/apple-touch-icon-76x76.png',
    'https://www.pravda.com.ua/apple-touch-icon-114x114.png',
    'https://www.pravda.com.ua/apple-touch-icon-120x120.png',
    'https://www.pravda.com.ua/apple-touch-icon-144x144.png',
    'https://www.pravda.com.ua/apple-touch-icon-152x152.png',
    'https://www.pravda.com.ua/apple-touch-icon-180x180.png'
  ]
}
*/

I thought it might not be picking up the charset correctly, so I tried passing the response headers (which include the charset) along with the content, but still no luck.

// Try 2
axios.get(url).then(data => {
    const content = {
        url: url,
        headers: data.headers, // headers include 'content-type': "text/html; charset=windows-1251"
        data: data.data
    }
    getPreviewFromContent(content).then((res) => console.log("res => ", res));
})
// response
/*
{
  url: 'https://www.pravda.com.ua/',
  title: '��������� ������',
  siteName: '��������� ������',
  description: '��������� ������ - ������ ��� ������',
  mediaType: 'article',
  contentType: 'text/html',
  images: [ 'https://img.pravda.com/images/up_for_fb.png' ],
  videos: [],
  favicons: [
    'https://www.pravda.com.ua/favicon-32x32.png',
    'https://www.pravda.com.ua/favicon-96x96.png',
    'https://www.pravda.com.ua/android-chrome-192x192.png',
    'https://www.pravda.com.ua/favicon.ico',
    'https://www.pravda.com.ua/apple-touch-icon-57x57.png',
    'https://www.pravda.com.ua/apple-touch-icon-72x72.png',
    'https://www.pravda.com.ua/apple-touch-icon-76x76.png',
    'https://www.pravda.com.ua/apple-touch-icon-114x114.png',
    'https://www.pravda.com.ua/apple-touch-icon-120x120.png',
    'https://www.pravda.com.ua/apple-touch-icon-144x144.png',
    'https://www.pravda.com.ua/apple-touch-icon-152x152.png',
    'https://www.pravda.com.ua/apple-touch-icon-180x180.png'
  ]
}
*/
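
A plausible explanation for why passing the headers makes no difference (an assumption, not confirmed in this thread): by the time data.data reaches getPreviewFromContent, it is already a string that the HTTP client decoded as UTF-8, so the original bytes are gone. The sketch below uses the built-in TextDecoder and hand-written sample bytes (the word "Правда" in windows-1251, chosen only for illustration) to show that such a decode is lossy:

```javascript
// "Правда" encoded as windows-1251: one byte per Cyrillic letter.
const cp1251Bytes = Uint8Array.from([0xcf, 0xf0, 0xe0, 0xe2, 0xe4, 0xe0]);

// None of these bytes forms a valid UTF-8 sequence here, so the decoder
// substitutes U+FFFD (the replacement character) for each one.
const misdecoded = new TextDecoder("utf-8").decode(cp1251Bytes);
console.log(misdecoded);

// Every character is the same replacement character, so the original
// letters cannot be recovered from this string by transcoding it later.
const allReplaced = [...misdecoded].every((ch) => ch === "\uFFFD");
console.log(allReplaced);
```

This is consistent with the output above, where every Cyrillic letter shows up as the same � character.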

JavaScript HTTP clients typically decode response text as UTF-8 by default, so adding support for the windows-1251 charset would probably require a dependency that transcodes from windows-1251 to UTF-8 (for example this), which would increase the package size.

I don't think this is worth it; even the Wikipedia article mentions that most Russian websites use UTF-8.

I can try to extract the encoding and return it in the response, so you can implement the transcoding yourself once you receive the response.

I published a new version, 3.0.5, which includes a new field, charset. You should be able to use it to transcode the text fields. Let me know if it works for you.
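
One way the transcoding could look on the caller's side is sketched below. It decodes the raw response bytes with the declared charset before any preview extraction happens. Both helper names (charsetFromContentType, decodeBody) are hypothetical and not part of link-preview-js. The exact shape of the new charset field is not shown in this thread, so the sketch reads the charset from the Content-Type header instead. It also assumes a runtime whose TextDecoder supports legacy encodings (Node.js built with full ICU data, or a browser):

```javascript
// Hypothetical helper (not part of link-preview-js): pick the charset out of
// a Content-Type header value, falling back to UTF-8 when none is declared.
function charsetFromContentType(contentType) {
  const match = /charset\s*=\s*"?([^";\s]+)/i.exec(contentType || "");
  return match ? match[1].toLowerCase() : "utf-8";
}

// Hypothetical helper: decode raw response bytes using the declared charset.
// TextDecoder throws a RangeError if the runtime does not know the label.
function decodeBody(bytes, contentType) {
  return new TextDecoder(charsetFromContentType(contentType)).decode(bytes);
}

// Sample input: "Правда" encoded as windows-1251 bytes.
const sampleBytes = Uint8Array.from([0xcf, 0xf0, 0xe0, 0xe2, 0xe4, 0xe0]);
const decodedText = decodeBody(sampleBytes, "text/html; charset=windows-1251");
console.log(decodedText);
```

With axios, requesting the page with responseType: "arraybuffer" should give the raw bytes; the decoded string could then be passed as data to getPreviewFromContent, in the same way as Try 2 above.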