Corrupted files with special characters
diegomachado1 opened this issue · 6 comments
Hello,
I'm trying to save Portuguese-language pages. The plugin keeps the original resource name even when it contains special characters, so I'm getting image names like this:
você_Easy-Resize.com_-1024x1024.jpg
Is it possible to filter these characters when saving?
Thank you!
Edit:
I tried registerAction('saveResource'), and it did change the file name, but that didn't solve the problem because the plugin is saving a corrupted file. It seems it can't download files with names like that.
Hi @diegomachado1 👋
You can try generating a different filename for such files with the generateFilename action.
If you suspect there was an issue with saving, you can enable logs and see what happens with the resource. Or, if you can provide a reproducible example, I'll be able to take a closer look at it.
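A minimal sketch of what such a generateFilename plugin could look like. The class name AsciiFilenamePlugin and the helper toAsciiFilename are hypothetical, and using the last URL path segment as the file name is an assumption. It does not handle name collisions or malformed percent-escapes, which a real plugin would need to do:

```javascript
const path = require('path');

// Hypothetical helper: reduce a file name to a conservative ASCII set.
function toAsciiFilename(filename) {
  return filename
    .normalize('NFD')                  // split "ê" into "e" + combining accent
    .replace(/[\u0300-\u036f]/g, '')   // drop the combining accents
    .replace(/[^A-Za-z0-9._-]/g, '_'); // replace anything else unusual
}

// Hypothetical plugin: names each resource after its last URL path segment.
class AsciiFilenamePlugin {
  apply(registerAction) {
    registerAction('generateFilename', ({resource}) => {
      const {pathname} = new URL(resource.getUrl());
      // Fallback name for the site root, where the path has no last segment.
      const base = path.posix.basename(decodeURIComponent(pathname)) || 'index.html';
      return {filename: toAsciiFilename(base)};
    });
  }
}
```

It would be passed to the scraper through the plugins option, for example plugins: [new AsciiFilenamePlugin()]. Note that this only changes the name on disk; it does not change how the resource is requested.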
Hello @s0ph1e,
Now I know what's happening.
When the plugin requests the page, it generates a binary, right?!
website-scraper-puppeteer/lib/index.js:
// convert utf-8 -> binary string because website-scraper needs binary
return Buffer.from(content).toString('binary');
If there are links with Latin characters such as á é í ó ô, request.js can't access them, because the binary conversion changes the link into something like this: você_Easy-Resize.com_-1024x1024.jpg
So the request returns an error message because the page isn't found, and the plugin saves that message as the image body. This is why the images look corrupted.
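The effect of that conversion can be reproduced in plain Node.js, without the scraper. This is only an illustration of the encoding step, using the file name from this issue:

```javascript
// 'binary' is Node's alias for latin1: each UTF-8 byte becomes one character.
const originalName = 'você_Easy-Resize.com_-1024x1024.jpg';
const mangledName = Buffer.from(originalName).toString('binary');
console.log(mangledName); // vocÃª_Easy-Resize.com_-1024x1024.jpg

// The conversion is reversible as long as nothing else has altered the string.
const restoredName = Buffer.from(mangledName, 'binary').toString('utf8');
console.log(restoredName === originalName); // true
```

The two-byte UTF-8 sequence for "ê" (0xC3 0xAA) is read back as the two latin1 characters "Ã" and "ª", so any URL taken from the converted page text points to a path that does not exist on the server.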
encodeURI() looks promising, but it needs to be applied before the binary conversion.
For testing, I removed the binary conversion and used encodeURI() on the request. It worked for images, but it broke other things. So there is a way...
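A small sketch of why the order matters, using a hypothetical URL. Once the URL is percent-encoded it is pure ASCII, so the utf-8 to binary conversion leaves it untouched:

```javascript
const rawUrl = 'https://example.com/uploads/você.jpg'; // hypothetical URL
const encodedUrl = encodeURI(rawUrl);
console.log(encodedUrl); // https://example.com/uploads/voc%C3%AA.jpg

// Pure ASCII survives the utf-8 -> binary conversion unchanged.
const afterBinary = Buffer.from(encodedUrl).toString('binary');
console.log(afterBinary === encodedUrl); // true

// encodeURI is not idempotent: "%" itself is escaped on a second pass.
console.log(encodeURI(encodedUrl)); // https://example.com/uploads/voc%25C3%25AA.jpg
```

One thing to watch for is the last case: applying encodeURI() to URLs that are already percent-encoded escapes them twice. Whether that is related to the breakage mentioned above is not known from this thread.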
Yes, it generates binary. I'd also suggest trying without the puppeteer plugin to check whether it works correctly without any plugins.
If you feel you've found a bug, please share steps to reproduce, or at least an example URL that causes problems.
The problem is that requests are always binary; I had to request and save files with text content as UTF-8. So I wrote the following plugin:
// Assumed imports (not shown in the original comment):
// fs.outputFile and fs.remove come from fs-extra, not from the built-in fs module.
const path = require('path');
const fs = require('fs-extra');

class utf8Fix {
    apply(registerAction) {
        let absoluteDirectoryPath, loadedResources = [];

        // Request text resources as UTF-8 and everything else as binary.
        registerAction('beforeRequest', async ({resource, requestOptions}) => {
            const urlLower = resource.getUrl().toLowerCase();
            if (urlLower.endsWith('.html') || urlLower.endsWith('.js') || urlLower.endsWith('.css') || urlLower.endsWith('/')) {
                requestOptions.encoding = 'utf-8';
            } else {
                requestOptions.encoding = 'binary';
            }
            return {requestOptions};
        });

        // Validate the target directory and refuse to overwrite an existing one.
        registerAction('beforeStart', ({options}) => {
            if (!options.directory || typeof options.directory !== 'string') {
                throw new Error(`Incorrect directory ${options.directory}`);
            }
            absoluteDirectoryPath = path.resolve(process.cwd(), options.directory);
            if (fs.existsSync(absoluteDirectoryPath)) {
                throw new Error(`Directory ${absoluteDirectoryPath} exists`);
            }
        });

        // Write each resource with the encoding that matches its file type.
        registerAction('saveResource', async ({resource}) => {
            const filename = path.join(absoluteDirectoryPath, resource.getFilename());
            const text = resource.getText();
            const filenameLower = filename.toLowerCase();
            if (filenameLower.endsWith('.html') || filenameLower.endsWith('.css') || filenameLower.endsWith('.js')) {
                await fs.outputFile(filename, text, {encoding: 'utf-8'});
            } else {
                await fs.outputFile(filename, text, {encoding: 'binary'});
            }
            loadedResources.push(resource);
        });

        // On failure, remove the partially written output directory.
        registerAction('error', async () => {
            if (loadedResources.length > 0) {
                await fs.remove(absoluteDirectoryPath);
            }
        });
    }
}
Maybe this will help someone else
This issue has been automatically closed because there has been no response from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.
This only solves the saved-file issue, right? I'm trying to use UTF-8 for the page source. Do you know how I would use this?