TeamHG-Memex/html-text

whitespace issues

codinguncut opened this issue · 4 comments

It appears that .xpath('normalize-space()') does not handle whitespace well in all cases.

Examples:

  • <span class="dropcap">A</span>Telephone => ATelephone
  • <span>Phone</span>1-855-445-9710 => Phone1-855-445-9710
  • <option value="156">Vifon</option><option value="157">Vinamilk</option><option value="158">Vinaphone</option> => VifonVinamilkVinaphone
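The first and third examples above can be reproduced with plain lxml (a minimal sketch; the fragments are wrapped in a container element so the string value is unambiguous):

```python
from lxml import html

# normalize-space() operates on the element's string value, which
# simply concatenates all descendant text nodes -- so inline tags
# contribute no separating whitespace.
tree = html.fromstring('<div><span class="dropcap">A</span>Telephone</div>')
print(tree.xpath('normalize-space()'))  # -> ATelephone

options = html.fromstring(
    '<select><option value="156">Vifon</option>'
    '<option value="157">Vinamilk</option>'
    '<option value="158">Vinaphone</option></select>'
)
print(options.xpath('normalize-space()'))  # -> VifonVinamilkVinaphone
```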

I understand that the behavior may be in line with how HTML treats inline tags and whitespace, but IMHO it does not work for real-world HTML documents.

I had hoped there would be a way to add ' '.join(fragments), but it doesn't look quite so easy...

I believe Rolando already addressed this previously. Maybe it was done with something along the lines of ' '.join(x.strip() for x in cleaned.xpath('//text()').extract())...
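That approach can be sketched with plain lxml (note that .extract() is parsel's API; bare lxml's xpath('//text()') already returns a list of strings):

```python
from lxml import html

tree = html.fromstring('<div><span>Phone</span>1-855-445-9710</div>')
# Join every text node with a space. Filtering out empty strings
# avoids doubled spaces from whitespace-only text nodes.
text = ' '.join(x.strip() for x in tree.xpath('//text()') if x.strip())
print(text)  # -> Phone 1-855-445-9710
```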

yeah, people use inline tags in weird ways... thanks for spotting this!
Adding whitespace in these cases looks like a better default; even if some words get split, it's easier to capture them with ngrams. And I don't think that pages that wrap every letter in a span are that common.

' '.join(x.strip() for x in cleaned.xpath('//text()').extract())

yes, this seems to work, thanks for the pointer! I'll need to check the difference on some real texts

awesome, thanks for looking into this!
just implemented something similar ;)
it may make sense to precompile the regex for speed: REX = re.compile(r'\s+'), later REX.sub(' ', string)
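A sketch of the precompiled-regex normalization (the name normalize_ws is hypothetical; note that substituting a single space, rather than an empty string, is what collapses whitespace runs without gluing words together):

```python
import re

# Compile once at module level so the pattern is reused across calls.
WHITESPACE_RE = re.compile(r'\s+')

def normalize_ws(text):
    # Collapse any run of whitespace to a single space and trim the ends.
    return WHITESPACE_RE.sub(' ', text).strip()

print(normalize_ws('  Phone \n 1-855-445-9710 '))  # -> Phone 1-855-445-9710
```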

Checked the old vs. new way on about 1000 HTML pages: on average the text is 0.2% longer in characters, with most pages showing some difference. In all cases I checked (about 10 pages) the new way is better, separating words that were previously joined without spaces, and I didn't find any unwanted splits.

The speed is almost 2x slower though: 7 s for 1000 HTML pages before, 11.5 s without the regexp, 12.5 s with the regexp (and caching). But I guess it's not that bad.