wikimedia/html-metadata

Scraping foreign language site returns gibberish in metadata

abelabbesnabi opened this issue · 2 comments

When scraping a URL whose page is in a language other than English, such as Russian, the title and description in the metadata are returned as gibberish.

Here is an example link: https://pikabu.ru/story/privet_fsbshniki_mne_drug_byivshiy_sotrudnik_fsb_rasskazal_chto_vyi_tut_sidite__2821880

mvolz commented

This is a character encoding issue. By default, cheerio loads the HTML as UTF-8. We handle other charsets outside of the library.

It looks like the charset is declared in a meta tag in the HTML here: <meta charset="windows-1251">

We basically load the page as UTF-8, check the charset, and then load it again after decoding it with the iconv-lite library.

Something like

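var cheerio = require('cheerio');
var iconv = require('iconv-lite');
var parseAll = require('html-metadata').parseAll;

// response.body must be a raw Buffer here, not an already-decoded string
var str = iconv.decode(response.body, 'windows-1251');
var $ = cheerio.load(str);
return parseAll($).then(function(metadata){
    console.log(metadata);
});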

After that, scraping the page should hopefully work. Unfortunately, this means you need to load the page into cheerio twice: once to find the charset, and then again to get the metadata.
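
If the double load is a concern, one possible way around it is to look for the charset declaration in the raw bytes before handing anything to cheerio. The following is only a rough sketch under several assumptions. sniffCharset and decodeHtml are made-up helper names, not part of html-metadata or cheerio. It uses Node's built-in TextDecoder rather than iconv-lite, which only handles legacy encodings such as windows-1251 when Node is built with full ICU. The regex only looks at the first 1024 bytes for a meta charset declaration and ignores HTTP headers and byte order marks.

```javascript
// Hypothetical helpers, not part of html-metadata, cheerio, or iconv-lite.

// Look for a charset declaration in the first 1024 bytes of the raw response.
function sniffCharset(buffer) {
  // Charset declarations are plain ASCII, so a latin1 view is safe to search
  // even when the rest of the page is in another single-byte encoding.
  var head = buffer.subarray(0, 1024).toString('latin1');
  // Matches both <meta charset="x"> and <meta ... content="text/html; charset=x">.
  var match = /<meta[^>]*charset\s*=\s*["']?\s*([\w-]+)/i.exec(head);
  // Assumption: fall back to UTF-8 when nothing is declared.
  return match ? match[1].toLowerCase() : 'utf-8';
}

// Decode the raw bytes using whatever charset was declared.
function decodeHtml(buffer) {
  // TextDecoder throws a RangeError for labels it does not recognise.
  return new TextDecoder(sniffCharset(buffer)).decode(buffer);
}
```

The string returned by decodeHtml(response.body) could then be passed to cheerio.load once, in place of the iconv.decode call in the snippet above, again assuming response.body is a raw Buffer.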

Thank you Marielle. I'll give it a try.