ruippeixotog/scala-scraper

Duplicate text from sibling <td></td> elements is lost

drstevens opened this issue · 4 comments

Unless I'm misunderstanding something, there appears to be a bug when selecting text from columns in a table. When two sibling columns have duplicate text, one is not returned.

I have added a test that demonstrates the issue in d98f01a.

Sigh... I screwed up my original commit and squashed it. The old commits still show up here because of their commit messages. The SHA in the original post is correct.

Hi @drstevens! Thanks for using the library and for submitting the issue.

After some investigation, it seems that the bug comes from jsoup itself, and not from scala-scraper (here is some Java code demonstrating that).

The good news is that this issue seems to have already been reported and fixed there (jhy/jsoup#614), so all we have to do is wait for the next jsoup release. The bad news is that I don't really have an alternative for doing what you want other than using jsoup directly. You can use the select method and make sure it is always called on an instance of Element and not on Elements:

(doc >> element("#mytable") select "td").toList.map(_.text.toInt) mustEqual Seq(3, 15, 15, 1)
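For anyone who wants to try the same idea without scala-scraper in the picture, here is a minimal sketch that calls jsoup directly. It is only an illustration under a few assumptions: the HTML is a made-up table mirroring the values above, the object and helper names (TdTextWorkaround, cellValues) are hypothetical, and it uses scala.jdk.CollectionConverters, so it assumes Scala 2.13 or later with jsoup on the classpath. The point is that select is called on a single Element (from getElementById) rather than on an Elements collection.

```scala
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters._

object TdTextWorkaround {
  // Hypothetical table mirroring the values discussed in this thread.
  val html: String =
    """<table id="mytable">
      |  <tr><td>3</td><td>15</td><td>15</td><td>1</td></tr>
      |</table>""".stripMargin

  // Returns the integer content of every <td> under the element with the given id.
  // getElementById yields a single Element (or null), so select is never
  // invoked on an Elements collection here.
  def cellValues(html: String, tableId: String): List[Int] = {
    val doc = Jsoup.parse(html)
    Option(doc.getElementById(tableId)).toList
      .flatMap(_.select("td").asScala.map(_.text().trim.toInt))
  }

  def main(args: Array[String]): Unit =
    println(cellValues(html, "mytable"))
}
```

If no element has the requested id, the helper returns an empty list instead of throwing, which is a design choice of this sketch and not something scala-scraper or jsoup prescribes.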

I'll keep an eye on new dependency versions and I'll inform you here once this is fixed in scala-scraper.

Great! Thanks for looking into it. I spent some time late last night digging around and came to the conclusion it was likely in jsoup. I just found your project yesterday though and was unsure.

For now, I've got a Python script keeping me unblocked.

Due to some refactoring in scala-scraper, jsoup is now used in a way that no longer triggers the jsoup bug behind this issue, so I'll close it now. Thank you once again; providing a failing test for the issue was really helpful.