rchipka/node-osmosis

Getting "script" content seems to truncate characters after a limit

oliv23 opened this issue · 2 comments

Hi,

First of all, thank you for Osmosis! I've been using it for a few years now (yikes) and have only run into a handful of edge-case issues since.

I'm trying to scrape the content of a <script type="text/json"> tag. It works to some degree, but when I save the output to a file, it's clear that the content is incomplete and was truncated at a certain point.

I've logged the length of the content in the console and apparently the portion of the script I'm able to retrieve is 45825 characters. Now the strange thing is, I just tried this again, and now the length is 45873.

I suspect something is wrong with the parsing of big strings, or that there is a hard-coded limit somewhere. Do you have any idea what this could be?

Best,
Olivier

@oliv23 glad to hear it's been working well for so long.

Could this be an issue with console.log() truncating formatted output or are you outputting the raw string into a file?

I think we already set XML_PARSE_HUGE, but we might also need to set XML_PARSE_BIG_LINES.

@rchipka Thanks for the quick reply! I thought that might be the case, so I wrote the result to a file instead. Same result, unfortunately; I forgot to mention that.

Ok, there might be something there? For reference, the page I'm scraping is this:
https://www.modaoperandi.com/acler-fw19/dalisay-draped-midi-dress?color=white

If you inspect the script's content in the browser console, you will find it is complete; however, when scraping it with osmosis, it comes back chopped. Here is the code I'm using, for reproducibility:

const fs = require('fs');
const osmosis = require('osmosis');

osmosis
.get('https://www.modaoperandi.com/acler-fw19/dalisay-draped-midi-dress?color=white')
.then(function(context, data, next) {
	// No-op step: pass the context and data through unchanged.
	next(context, data);
})
.find('#wraps-body-content')
.set({
	'category_json': '[data-react-class=SiteComponent] + script[type=text/json]'
})
.data(function(product) {
	var path = 'category_json.json';
	// Write the scraped script content to a file, then log its length.
	fs.writeFileSync(path, JSON.stringify(product.category_json, null, '\t'), 'utf8');

	console.log(product.category_json.length);
});
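To rule out the console or the file write as the cause, one option is to check the scraped string directly: a complete JSON payload should parse, while a truncated one should not. The sketch below is only an illustration; `checkJson` is a hypothetical helper name, not part of osmosis.

```javascript
// Hypothetical helper: reports the length of a scraped string and
// whether it parses as complete JSON.
function checkJson(text) {
  try {
    JSON.parse(text);
    return { length: text.length, complete: true };
  } catch (err) {
    // A payload cut off mid-way usually fails to parse.
    return { length: text.length, complete: false, error: err.message };
  }
}

// Example with a deliberately truncated payload.
const result = checkJson('{"items": [1, 2');
console.log(result.length, result.complete);
```

Inside the `.data()` callback, this could be called as `checkJson(product.category_json)` to see whether the value is already incomplete before it is written anywhere.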

I looked into XML_PARSE_HUGE; the number associated with it is 524288. I'm not sure whether that's a character count or an object size, but if it is the former, it seems to me like it should work: I manually copied the full content of the script and it amounts to 147617 characters in total.
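For context on that number: in libxml2's `xmlParserOption` enum, 524288 is `1 << 19`, which is a bit flag rather than a size, and `XML_PARSE_BIG_LINES` is `1 << 22`. The sketch below only shows how such flags combine; the constants are redefined locally for illustration, `hasFlag` is a hypothetical helper, and how osmosis actually passes options to the parser is not shown here.

```javascript
// Values as defined in libxml2's xmlParserOption enum,
// redefined here purely for illustration.
const XML_PARSE_HUGE = 1 << 19;      // 524288
const XML_PARSE_BIG_LINES = 1 << 22; // 4194304

// Flags are combined with bitwise OR.
const parseOptions = XML_PARSE_HUGE | XML_PARSE_BIG_LINES;

// Hypothetical helper: tests whether a given flag is set.
function hasFlag(options, flag) {
  return (options & flag) !== 0;
}

console.log(parseOptions);                               // 4718592
console.log(hasFlag(parseOptions, XML_PARSE_HUGE));      // true
console.log(hasFlag(XML_PARSE_HUGE, XML_PARSE_BIG_LINES)); // false
```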

Olivier