whitespace issues
codinguncut opened this issue
it appears that .xpath('normalize-space()') does not handle whitespace well in all cases.
Examples:
<span class="dropcap">A</span>Telephone
=> ATelephone
<span>Phone</span>1-855-445-9710
=> Phone1-855-445-9710
<option value="156">Vifon</option><option value="157">Vinamilk</option><option value="158">Vinaphone</option>
=> VifonVinamilkVinaphone
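For reference, here is a minimal reproduction of the first example (assuming parsel as the selector library, which matches the .xpath/.extract API used in this thread):

from parsel import Selector

html = '<span class="dropcap">A</span>Telephone'
sel = Selector(text=html)
# normalize-space() collapses runs of whitespace but does not insert
# any space at inline tag boundaries, so the fragments run together
print(sel.xpath('normalize-space()').extract_first())  # -> 'ATelephone'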
I understand that this behavior may be in line with how HTML treats inline tags and whitespace, but IMHO it does not work well for real-world HTML documents.
I had hoped there would be a way to add ' '.join(fragments), but it doesn't look quite so easy...
I believe Rolando already addressed this previously. Maybe it was done with something along the lines of ' '.join(x.strip() for x in cleaned.xpath('//text()').extract()) ...
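As a sketch, that could be wrapped up like this (the function name and the cleaned selector are illustrative, not actual library API):

from parsel import Selector

def extract_joined_text(cleaned):
    # join all text nodes with a single space, skipping whitespace-only ones
    fragments = (x.strip() for x in cleaned.xpath('//text()').extract())
    return ' '.join(f for f in fragments if f)

print(extract_joined_text(Selector(text='<span>Phone</span>1-855-445-9710')))
# -> 'Phone 1-855-445-9710'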
yeah, people use inline tags in weird ways... thanks for spotting this!
Adding whitespace in these cases looks like a better default; even if some words get split, it's easier to capture them with ngrams. And I don't think pages that wrap every letter in a span are that common.
' '.join(x.strip() for x in cleaned.xpath('//text()').extract())
yes, this seems to work, thanks for the pointer! I'll need to check the difference on some real texts
awesome, thanks for looking into this!
just implemented something similar ;)
it may make sense to precompile the regex for speed: REX = re.compile(r'\s+'), then later REX.sub(' ', string)
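For illustration, the precompiled version combined with the join above might look like this (names are assumptions; it substitutes a single space rather than an empty string, so joined fragments stay separated):

import re

_WS_RE = re.compile(r'\s+')  # compiled once at module load, reused per call

def normalize_whitespace(text):
    # collapse any run of whitespace (spaces, newlines, tabs) to one space
    return _WS_RE.sub(' ', text).strip()

print(normalize_whitespace('Phone  \n 1-855-445-9710'))
# -> 'Phone 1-855-445-9710'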
Checked the old vs. new way on about 1000 HTML pages: on average the text is longer by 0.2% (in characters), with most pages showing some difference. In all cases I checked (about 10 pages) the new way is better, separating words that were previously joined without spaces, and I didn't find any unwanted splits.
It is almost 2x slower though: 7 s for 1000 HTML pages before, 11.5 s without the regexp, 12.5 s with the regexp (and caching). But I guess that's not too bad.