Corrupted files with special characters

Question

diegomachado1 opened this issue 5 years ago · 6 comments

Hello,

I'm trying to save Portuguese-language pages. The plugin keeps the original resource name even when it contains special characters, so I'm getting image names like this:

vocÃª_Easy-Resize.com_-1024x1024.jpg

Is it possible to filter these characters when saving?

Thank you!

Edit:
I tried registerAction('saveResource'), and it did change the file name, but that didn't solve the problem because the plugin saves a corrupted file. It seems it can't download files with names like that.

Answer 1 · 2020-06-11T16:57:04.000Z

Hi @diegomachado1 👋

You can try generating a different filename for such files with the generateFilename action.
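As a rough illustration of that suggestion, here is a minimal sketch of a plugin that registers a generateFilename action and returns an ASCII-only name. The names toAsciiFilename and AsciiFilenamePlugin are made up for this example, and the sketch assumes it is acceptable to save every resource flat under the output directory using only the last URL path segment (so two resources with the same base name would collide). It only changes the name on disk and does not address whether the download itself succeeds.

```javascript
// Hypothetical helper (not part of website-scraper): strip accents,
// then replace anything outside a conservative ASCII set with "_".
function toAsciiFilename(name) {
  return name
    .normalize('NFD')                  // 'ê' becomes 'e' + a combining circumflex
    .replace(/[\u0300-\u036f]/g, '')   // drop the combining marks
    .replace(/[^A-Za-z0-9._-]/g, '_'); // replace whatever is still unsafe
}

// Hypothetical plugin name; assumes flat output with one file per base name.
class AsciiFilenamePlugin {
  apply(registerAction) {
    registerAction('generateFilename', async ({ resource }) => {
      const { pathname } = new URL(resource.getUrl());
      let base = pathname.split('/').pop() || 'index.html';
      try {
        // URL() percent-encodes non-ASCII characters, so decode them first.
        base = decodeURIComponent(base);
      } catch (err) {
        // Keep the raw segment if it is not valid percent-encoding.
      }
      return { filename: toAsciiFilename(base) };
    });
  }
}

console.log(toAsciiFilename('você_Easy-Resize.com_-1024x1024.jpg'));
// voce_Easy-Resize.com_-1024x1024.jpg
```

The plugin would be passed as an instance in the scraper's plugins option, e.g. plugins: [new AsciiFilenamePlugin()].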

If you suspect there was an issue with saving, you can enable logs and see what happens with the resource. Or, if you can provide a reproducible example, I'll be able to take a closer look at it.

Answer 2 · 2020-06-11T17:56:44.000Z

Hello @s0ph1e,

Now I know what's happening.

When the plugin requests the page, it generates a binary string, right?

website-scraper-puppeteer/lib/index.js:

// convert utf-8 -> binary string because website-scraper needs binary
return Buffer.from(content).toString('binary');

If there are links with Latin characters like á é í ó ô, request.js can't access them, because the binary conversion changes the link into something like this: vocÃª_Easy-Resize.com_-1024x1024.jpg
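The effect can be reproduced in plain Node.js, outside the scraper. This is a small sketch that applies the same Buffer conversion quoted above to the file name from this issue; the variable names are only for illustration.

```javascript
const original = 'você_Easy-Resize.com_-1024x1024.jpg';

// Same conversion as in the quoted line: encode as UTF-8 bytes,
// then read each byte back as one Latin-1 character.
const asBinary = Buffer.from(original).toString('binary');
console.log(asBinary); // vocÃª_Easy-Resize.com_-1024x1024.jpg

// The conversion is reversible as long as the binary string is left untouched.
const restored = Buffer.from(asBinary, 'binary').toString('utf8');
console.log(restored === original); // true
```

The "ê" is two bytes in UTF-8 (0xC3 0xAA), and each byte turns into its own character ("Ã" and "ª") in the binary string, which is where the garbled name comes from.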

So the request returns an error message because the page isn't found, and the plugin saves that message as the image body. This is why the images look corrupted.

encodeURI() looks promising, but it needs to be applied before the binary conversion.
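To show why the order matters, here is a small sketch using a made-up example.com URL. Once the URL is percent-encoded it contains only ASCII, so the UTF-8 to binary conversion leaves it unchanged. It only demonstrates the string behavior, not where such a call would go inside the plugin.

```javascript
// Hypothetical URL, used only for illustration.
const url = 'https://example.com/uploads/você_Easy-Resize.com_-1024x1024.jpg';

// Percent-encode first: the result is pure ASCII.
const encoded = encodeURI(url);
console.log(encoded);
// https://example.com/uploads/voc%C3%AA_Easy-Resize.com_-1024x1024.jpg

// The binary conversion no longer changes anything.
const afterBinary = Buffer.from(encoded).toString('binary');
console.log(afterBinary === encoded); // true

// Without encoding first, the same conversion garbles the URL.
const garbled = Buffer.from(url).toString('binary');
console.log(garbled === url); // false
```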

For testing, I removed the binary conversion and used encodeURI() on the request. It worked for images, but it broke other things. So there is a way...

Answer 3 · 2020-06-15T18:45:05.000Z

Hi @diegomachado1

Yes, it generates binary. I'd also suggest trying without the puppeteer plugin to check whether it works correctly without any plugins.

If you think you've found a bug, please share steps to reproduce, or at least an example URL that causes problems.

Answer 4 · 2020-06-25T06:58:31.000Z

The problem is that requests are always binary; I had to request and save files with text content as UTF-8. So I wrote the following plugin:

const path = require('path');
const fs = require('fs-extra'); // provides outputFile() and remove()

class utf8Fix {
    apply(registerAction) {
        let absoluteDirectoryPath, loadedResources = [];

        // Request text resources as UTF-8 and everything else as binary.
        registerAction('beforeRequest', async ({resource, requestOptions}) => {
            const urlLower = resource.getUrl().toLowerCase();
            if (urlLower.endsWith('.html') || urlLower.endsWith('.js') || urlLower.endsWith('.css') || urlLower.endsWith('/')) {
                requestOptions.encoding = 'utf-8';
            } else {
                requestOptions.encoding = 'binary';
            }
            return {requestOptions};
        });

        // Resolve the output directory and refuse to overwrite an existing one.
        registerAction('beforeStart', ({options}) => {
            if (!options.directory || typeof options.directory !== 'string') {
                throw new Error(`Incorrect directory ${options.directory}`);
            }

            absoluteDirectoryPath = path.resolve(process.cwd(), options.directory);

            if (fs.existsSync(absoluteDirectoryPath)) {
                throw new Error(`Directory ${absoluteDirectoryPath} exists`);
            }
        });

        // Write text resources as UTF-8 and everything else as binary.
        registerAction('saveResource', async ({resource}) => {
            const filename = path.join(absoluteDirectoryPath, resource.getFilename());
            const text = resource.getText();
            const filenameLower = filename.toLowerCase();
            if (filenameLower.endsWith('.html') || filenameLower.endsWith('.css') || filenameLower.endsWith('.js')) {
                await fs.outputFile(filename, text, {encoding: 'utf-8'});
            } else {
                await fs.outputFile(filename, text, {encoding: 'binary'});
            }
            loadedResources.push(resource);
        });

        // On error, remove the output directory if anything was written to it.
        registerAction('error', async () => {
            if (loadedResources.length > 0) {
                await fs.remove(absoluteDirectoryPath);
            }
        });
    }
}

Maybe this will help someone else
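One thing to watch with the endsWith checks in the plugin above: they run against the full URL, so something like style.css?v=2 would be treated as binary. A possible variation, sketched here with a hypothetical helper named encodingForUrl, is to look only at the URL's path. The list of text extensions is copied from the plugin; everything else is an assumption for illustration.

```javascript
// Same three text extensions the plugin checks for.
const TEXT_EXTENSIONS = ['.html', '.css', '.js'];

// Hypothetical helper: choose the request encoding from the URL path only,
// so query strings and fragments do not affect the decision.
function encodingForUrl(url) {
  const pathname = new URL(url).pathname.toLowerCase();
  if (pathname.endsWith('/')) {
    return 'utf-8'; // directory-style URLs are assumed to be HTML pages
  }
  return TEXT_EXTENSIONS.some((ext) => pathname.endsWith(ext)) ? 'utf-8' : 'binary';
}

console.log(encodingForUrl('https://example.com/'));                 // utf-8
console.log(encodingForUrl('https://example.com/style.css?v=2'));    // utf-8
console.log(encodingForUrl('https://example.com/uploads/você.jpg')); // binary
```

Inside the beforeRequest action this would replace the if/else with requestOptions.encoding = encodingForUrl(resource.getUrl()).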

Answer 5 · 2020-06-30T10:43:53.000Z

This issue has been automatically closed because there has been no response from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.

Answer 6 · 2020-08-12T21:17:26.000Z

Regarding the utf8Fix plugin from Answer 4: this only solves the saved-file issue, right? I'm trying to get the page source itself as UTF-8. Do you know how I would use the plugin for that?