Table data extractor

Question

samikrc opened this issue 8 years ago · 12 comments

Hi,
Do you have code for extracting data from tables, like you have for extracting data from forms? Because of the rowspan and colspan attributes, it is difficult to parse a table from the raw HTML. Is there an easy way to do this from the in-memory rendering of the browsers?
Regards.

Answer 1 · 2016-09-17T23:56:24.000Z

Hi! Currently there is no specialized extractor for HTML tables. It would be a nice addition to have one, but it does come with its challenges. For example, what data structure did you have in mind? As you mentioned in your example, the ability of cells to cover an arbitrary number of rows and columns can make the organization rather messy...
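To make the data structure question concrete, here is a minimal sketch of one possible answer: expand every spanning cell into a rectangular grid, so that each position covered by a rowspan or colspan repeats that cell's text. This is only an illustration and is not the code later attached to this thread or the extractor that ended up in scala-scraper. The names `Cell`, `TableGrid` and `expand` are hypothetical, and the sketch assumes the cells have already been read from the HTML into plain values.

```scala
import scala.collection.mutable

// Hypothetical model of a <td>/<th> cell; not a scala-scraper type.
final case class Cell(text: String, rowspan: Int = 1, colspan: Int = 1)

object TableGrid {

  /** Expands rows of spanning cells into a rectangular grid.
    * Positions not covered by any cell are None.
    * Assumption: spans do not overlap; if they do, the later cell wins.
    */
  def expand(rows: Seq[Seq[Cell]]): Vector[Vector[Option[String]]] = {
    val filled = mutable.Map.empty[(Int, Int), String]

    for ((row, r) <- rows.zipWithIndex) {
      var c = 0
      for (cell <- row) {
        // Skip positions already claimed by a rowspan from an earlier row.
        while (filled.contains((r, c))) c += 1
        // Claim every position the cell spans, clipping rowspan to the table height.
        for {
          dr <- 0 until cell.rowspan if r + dr < rows.length
          dc <- 0 until cell.colspan
        } filled((r + dr, c + dc)) = cell.text
        c += cell.colspan
      }
    }

    val width = if (filled.isEmpty) 0 else filled.keys.map(_._2).max + 1
    Vector.tabulate(rows.length, width)((r, c) => filled.get((r, c)))
  }
}

// Example: "A" spans two rows, "B" spans two columns.
val sample = Seq(
  Seq(Cell("A", rowspan = 2), Cell("B", colspan = 2)),
  Seq(Cell("C"), Cell("D"))
)
val grid = TableGrid.expand(sample)
```

Repeating the text in every spanned position keeps the result a plain `Vector[Vector[Option[String]]]` that can be indexed by row and column. The trade-off is that the original span information is lost; keeping a reference to the originating cell instead of its text would preserve it.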

Answer 2 · 2016-09-18T06:32:28.000Z

I have something in mind - let me see if I can open a pull request in a few weeks. If I do end up coding something, where do you think it would show up in the codebase? In scalascraper/scraper/HtmlExtractor.scala?

Answer 3 · 2016-09-18T21:21:56.000Z

Yeah, I would expect it to be a new extractor in ContentExtractors. Looking forward to your pull request then!

Answer 4 · 2017-05-15T13:01:41.000Z

Hi, any update on this?

Answer 5 · 2017-05-15T13:16:19.000Z

I worked on this a bit, but was unable to directly extend the library to include this feature (mostly because of my level of Scala skills). What I have is a standalone piece of code which does this. I can publish that code, and you can either use it as is, or try integrating this code in the library. Let me know if there is any interest. Thanks.

Answer 6 · 2017-05-15T13:35:32.000Z

Hello samikrc,
Thanks a lot for the quick response. Yes, I would be happy to check it; please do publish.

Answer 7 · 2017-05-15T22:28:54.000Z

It would be awesome if you shared your approach here. Even if there were no interest now (which it seems there is), it would be a good resource for anyone dealing with this issue :) If the algorithm and data structure you chose to parse the table are general enough, I can surely help you implement it in scala-scraper.

Answer 8 · 2017-05-19T05:52:18.000Z

Guys,

Sorry for the delay. Attaching two files, one containing the source code and the other containing some test code. Note that the test code is not automated - just some prints for manually checking if things look OK.

@ruippeixotog Saw your other email about the exciting features in the next version, including the "Content Extractors". This is probably too late to get included in that, but that is probably where this stuff can be integrated.

Ready to answer questions :-)

Thanks.
-Samik

TableExtractor.scala.txt
TableExtractorTester.scala.txt

Answer 9 · 2017-05-19T05:56:46.000Z

Also note that some of the methods are just stubs, but are easy to implement. Important methods are already implemented.

Answer 10 · 2017-08-26T06:21:35.000Z

Hi, any update on this? Did the code get used somewhere?

Answer 11 · 2017-08-26T18:24:11.000Z

Hi @samikrc, I ended up not using it anywhere for now, mostly due to my lack of time lately. I took a look at your code earlier and it seemed like a good implementation; it just needs to be converted to a more idiomatic extractor, like the regex extractors. I'll try to work on it in the next two weeks :)

Answer 12 · 2017-09-17T23:43:53.000Z

I have just added a new table content extractor to scala-scraper (e7d3fe6). I ended up writing the extractor from scratch, as it seemed easier for me to integrate it with the style of the other extractors this way.

Closing this now. If you find any bugs in the implementation, feel free to open another issue!