Sotera/DatawakeDepot

HTML/Text Extractor

bwhiteman opened this issue · 5 comments

We should have a simple extractor that pulls the HTML and extracted body text of a document.

@michaelsframe don't we already have this via the StanNER extractor? I'm also getting body content persisted and viewable in the trail URL section.

I'm not sure what this is asking. The current trailing process pulls and sends the element of the page to the extractor.

What do you mean by document?

Does it persist the HTML and the plain text "body"?

For future analytics, it would be good to have an extractor that pulls the main body text with something like unfluff and returns it. I had the stanbol extractor doing it but it shouldn't be there.

Once this is done, we can run text analytics on the corpus of pages that has been scraped.
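For illustration, here is a minimal sketch of what such an extractor could return. It is not the DatawakeDepot extractor interface and it does not use unfluff: it swaps in Python's standard-library `html.parser` and simply collects visible text, which is much cruder than unfluff's main-content detection. The names `extract`, `_BodyTextParser`, and `_SKIP_TAGS` are hypothetical.

```python
from html.parser import HTMLParser

# Tags whose contents are not visible body text (assumed list, not exhaustive).
_SKIP_TAGS = {"script", "style", "noscript", "head", "template"}


class _BodyTextParser(HTMLParser):
    """Collects visible text, ignoring anything nested inside _SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in _SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in _SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace and drop empty chunks.
        cleaned = " ".join(data.split())
        if self._skip_depth == 0 and cleaned:
            self._chunks.append(cleaned)

    def text(self):
        return " ".join(self._chunks)


def extract(html):
    """Return both the raw HTML and a plain-text rendering of it."""
    parser = _BodyTextParser()
    parser.feed(html)
    parser.close()
    return {"html": html, "text": parser.text()}
```

Calling `extract(page_html)` gives a dict with the untouched `html` and the derived `text`, so a caller can persist either or both.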

If we have the HTML, it seems redundant to store the plain-text body as well: it doubles storage requirements and will slow down URL insertion. Any analytic can retrieve the HTML and unfluff it as necessary, right?

Yes, this is a question of which is more costly: writing the text once and reading it many times, or extracting the text every time it is needed.
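A rough way to frame that trade-off is a back-of-the-envelope cost model. The sketch below is only an assumption about how to compare the two options: the function names and the cost parameters are hypothetical, the units are arbitrary, and real numbers would have to be measured against the actual store and extractor.

```python
def total_cost(n_reads, extract_cost, write_cost, read_cost):
    """Compare total cost per page for the two strategies.

    store:     extract once at insertion, write the text, then read it n times.
    on_demand: re-extract from the stored HTML on every one of the n reads.
    """
    store = extract_cost + write_cost + n_reads * read_cost
    on_demand = n_reads * extract_cost
    return {"store": store, "on_demand": on_demand}


def break_even_reads(extract_cost, write_cost, read_cost):
    """Number of reads above which storing the text becomes cheaper.

    Returns None when reading stored text costs at least as much as
    extracting, in which case storing never pays off under this model.
    """
    if extract_cost <= read_cost:
        return None
    return (extract_cost + write_cost) / (extract_cost - read_cost)
```

With made-up costs of 9 per extraction, 3 per write, and 1 per read, `break_even_reads(9, 3, 1)` is 1.5, so under those assumed numbers storing the text wins once a page is analyzed twice or more. The model ignores the extra storage itself, which was the other concern raised above.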