TeamHG-Memex/html-text

whitespace issues

codinguncut opened this issue · 4 comments

It appears that .xpath('normalize-space()') does not handle whitespace well in all cases.

Examples:

  • <span class="dropcap">A</span>Telephone => ATelephone
  • <span>Phone</span>1-855-445-9710 => Phone1-855-445-9710
  • <option value="156">Vifon</option><option value="157">Vinamilk</option><option value="158">Vinaphone</option> => VifonVinamilkVinaphone
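The first and third examples above can be reproduced with plain lxml (a minimal sketch; the fragments are wrapped in a container element so the string value is unambiguous):

```python
from lxml import html

# normalize-space() operates on the element's string value, which
# simply concatenates all descendant text nodes -- so inline tags
# contribute no separating whitespace.
tree = html.fromstring('<div><span class="dropcap">A</span>Telephone</div>')
print(tree.xpath('normalize-space()'))  # -> ATelephone

options = html.fromstring(
    '<select><option value="156">Vifon</option>'
    '<option value="157">Vinamilk</option>'
    '<option value="158">Vinaphone</option></select>'
)
print(options.xpath('normalize-space()'))  # -> VifonVinamilkVinaphone
```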

I understand that the behavior may be in line with how HTML treats inline tags and whitespace, but IMHO it does not work for real-world HTML documents.

I had hoped there would be a way to add ' '.join(fragments), but it doesn't look quite so easy...

I believe Rolando already addressed this previously. Maybe it was done with something along the lines of ' '.join(x.strip() for x in cleaned.xpath('//text()').extract())...
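That approach can be sketched with plain lxml (note that .extract() is parsel's API; bare lxml's xpath('//text()') already returns a list of strings):

```python
from lxml import html

tree = html.fromstring('<div><span>Phone</span>1-855-445-9710</div>')
# Join every text node with a space. Filtering out empty strings
# avoids doubled spaces from whitespace-only text nodes.
text = ' '.join(x.strip() for x in tree.xpath('//text()') if x.strip())
print(text)  # -> Phone 1-855-445-9710
```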

yeah, people use inline tags in weird ways... thanks for spotting this!
Adding whitespace in these cases looks like a better default; even if some words get split, it's easier to capture them with ngrams. And I don't think that pages that wrap every letter in a span are that common.

' '.join(x.strip() for x in cleaned.xpath('//text()').extract())

yes, this seems to work, thanks for the pointer! I'll need to check the difference on some real texts

awesome, thanks for looking into this!
just implemented something similar ;)
it may make sense to precompile the regex for speed: REX = re.compile(r'\s+'), later REX.sub(' ', string)
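A sketch of the precompiled-regex normalization (the name normalize_ws is hypothetical; note that substituting a single space, rather than an empty string, is what collapses whitespace runs without gluing words together):

```python
import re

# Compile once at module level so the pattern is reused across calls.
WHITESPACE_RE = re.compile(r'\s+')

def normalize_ws(text):
    # Collapse any run of whitespace to a single space and trim the ends.
    return WHITESPACE_RE.sub(' ', text).strip()

print(normalize_ws('  Phone \n 1-855-445-9710 '))  # -> Phone 1-855-445-9710
```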

Checked the old vs. new way on about 1000 HTML pages: on average the text is 0.2% longer in characters, with most pages showing some difference. In all cases I checked (about 10 pages) the new way is better, separating words that were previously joined without spaces, and I didn't find any unwanted splits.

The speed is almost 2x slower though: 7 s for 1000 HTML pages before, 11.5 s without the regexp, 12.5 s with the regexp (and caching). But I guess it's not that bad.