Encoding issue with Spanish accents
Opened this issue · 3 comments
Deleted user commented
Hi, I'm a fairly new programmer and I'm not sure whether this issue is related to this library, but I'm trying to scrape a website whose content has accents. The output in my console looks like this:
{ place: 'C�rtama',
title: 'IV Torneo de F�tbol 7 Miguel Gonz�lez Santos \'Milli\'',
unqueriableDate: 'Fecha: Todo el a�o',
event_img: 'img_contenido/agenda/2019/08/365993/234100240__130x130.jpg',
location: 'Lugar: Campo Municipal Joaqu�n Mart�n D�az - C�rtama' } ] }
Is there any way to solve this? It seems like an encoding issue.
Thanks in advance
marcellkiss commented
I have the same problem
marcellkiss commented
I found a solution for my case.
I realized that the page I wanted to scrape didn't use standard UTF-8 encoding, but an ISO encoding, like this:
Content-Type: text/html; charset="iso-8859-15"
I decided to decode the HTML myself and use the scrapeIt.scrapeHTML function instead of the original scrapeIt.
Here's my code:
const axios = require('axios');
const iso88592 = require('iso-8859-2');
const scrapeIt = require('scrape-it');

// Placeholder: replace with your own scrape-it mapping (the original snippet did not define it)
const mappingConfig = {};

run();

async function run() {
  // Send the request and get the raw bytes of the response
  const axiosResponse = await axios.request({
    method: 'GET',
    url: `YOUR_URL`,
    responseType: 'arraybuffer',
    responseEncoding: 'binary'
  });
  // Decode the bytes as ISO-8859-2; pick a decoder that matches the page's declared charset
  const htmlString = iso88592.decode(axiosResponse.data.toString('binary'));
  const scrapedJson = await scrapeIt.scrapeHTML(htmlString, mappingConfig);
  // And here we are:
  console.log(scrapedJson);
}
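For anyone who would rather not pull in a per-charset package, here is a minimal sketch of the same idea using Node's built-in TextDecoder. It assumes a Node.js build with full ICU data (the default in recent releases), since smaller builds only support a few encodings. The helper names charsetFromContentType and decodeBody are hypothetical, not part of scrape-it or axios, and the sample bytes are made up to mirror the 'Cártama' value from the output above.

```javascript
// Hypothetical helper: extract the charset label from a Content-Type header value.
// Falls back to UTF-8 when no charset parameter is present.
function charsetFromContentType(contentType, fallback = 'utf-8') {
  const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType || '');
  return match ? match[1].toLowerCase() : fallback;
}

// Hypothetical helper: decode raw response bytes (Buffer or Uint8Array)
// using the given charset label.
function decodeBody(bytes, charset) {
  return new TextDecoder(charset).decode(bytes);
}

// "Cártama" encoded as ISO-8859-15: the letter á is the single byte 0xE1.
const sample = Uint8Array.from([0x43, 0xe1, 0x72, 0x74, 0x61, 0x6d, 0x61]);
const charset = charsetFromContentType('text/html; charset="iso-8859-15"');

console.log(charset);                      // iso-8859-15
console.log(decodeBody(sample, charset));  // Cártama
console.log(decodeBody(sample, 'utf-8'));  // C\uFFFDrtama, i.e. the broken output from the issue
```

In a real scraper, the bytes would be the axios response body fetched with responseType: 'arraybuffer', the header would come from the response's content-type, and the decoded string could then be passed to scrapeIt.scrapeHTML as in the code above.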
Deleted user commented
Thanks a lot for sharing, @marcellkiss