Sotera/DatawakeDepot

Panel: Extractions may occur from page elements that are not visible

bmcdougald opened this issue · 4 comments

It looks like we are extracting everything on a given page as long as it's in the HTML. I came across pulldown menus on a site that were just lists of anchor hrefs, and the text for those pulldown menus got extracted too. When I added one of those entries to the domain, I didn't see it in the visible page body, but when you expanded the pulldown menu on the page, the entry you added was highlighted as a domain item like any other.
[Screenshot (2016-02-01): a pulldown menu entry highlighted as a domain item]
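A minimal, stdlib-only sketch of the behavior being described (the markup and the extractor here are hypothetical illustrations, not DatawakeDepot's actual code): a parser that collects every text node will happily pick up anchor text from a dropdown list that never appears in the visible page body.

```python
from html.parser import HTMLParser

class NaiveExtractor(HTMLParser):
    # Collects every text node it sees, with no notion of visibility.
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

# Hypothetical page: a hidden dropdown plus one visible paragraph.
page = """
<body>
  <ul class="dropdown" style="display:none">
    <li><a href="/contact">To contact our New York office</a></li>
  </ul>
  <p>Charlie listed several rifles for sale this week.</p>
</body>
"""

extractor = NaiveExtractor()
extractor.feed(page)
print(extractor.chunks)
# ['To contact our New York office',
#  'Charlie listed several rifles for sale this week.']
```

The hidden menu entry comes out of the extraction right next to the real body text, which is exactly why the added domain item lights up when the menu is expanded.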

Is this considered a bad thing?


My personal opinion is that it's undesirable. Pulling terms from page elements like menus and jamming them into the trails just adds noise. In most cases the content of things like menus is not directly related to the page being researched/viewed, but rather to the site as a whole. Add to this that menu items don't really have context, and it's just more irrelevant garbage. Imagine looking at a forum-style website: you're interested in this guy Charlie, who's been selling illegal guns on a forum listing site (e.g. gunbroker). The site has pulldown menus that say "To contact our New York office", and maybe some other links that break guns down by category. None of that is really relevant to the Charlie search.

I think there's this general idea that putting "everything under the sun" into our trails is a good thing. Personally I disagree. When you start adding tons of irrelevant junk to the data set, it becomes muddied and unusable, not to mention difficult for someone to follow up and manually prune the trailed info to get rid of extracted content they don't want from the viewed page. If we focus on gathering "all the stuff" without any gates for relevancy or context, our tools become less focused and less useful in helping people find patterns or information in the data. If it becomes more work for users to sort out the garbage than they get back in usefulness, they will just stop using our tool altogether.

I don't think we are qualified to make that determination. If those items in the menu were types of guns, or names of people you could hire, you might want that information. I think the extractor should be trained to weed out things like that, instead of our tool being specific to one domain or another. "To contact our New York office" could just as easily be in the body of the page.

I think this is more of a bug against the extractor, no?


We are the best qualified to decide how a tool we are designing "should" work, I'd think, especially since we don't get concrete direction or requirements from the customer. I wasn't saying the tool should be confined by domain or training; rather, that it should focus on the content viewed in the page body, not menu structure or other extraneous, unrelated garbage. To use your example: if the blurb about the New York office was in the page body, great, we're extracting that. Omitting menu garbage is totally unrelated to what we extract from the body.
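One hedged sketch of what "focus on the page body" could mean in practice, again with hypothetical markup and a made-up skip list rather than anything from the actual extractor: suppress text that lives inside nav/menu-style containers, while still extracting the very same string when it appears in the body proper.

```python
from html.parser import HTMLParser

# Hypothetical skip rules; real rules would need tuning per site.
MENU_TAGS = {"nav", "header", "footer"}
MENU_CLASS_HINTS = ("menu", "dropdown")

class BodyTextExtractor(HTMLParser):
    # Collects text nodes, but ignores everything nested inside a tag
    # that looks like site chrome (menus, nav bars, footers).
    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0  # >0 while inside a menu-like subtree

    def _looks_like_menu(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        return tag in MENU_TAGS or any(h in classes for h in MENU_CLASS_HINTS)

    def handle_starttag(self, tag, attrs):
        # Once inside a skipped subtree, count nesting so the matching
        # end tags unwind it correctly.
        if self.skip_depth or self._looks_like_menu(tag, attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

page = """
<body>
  <ul class="dropdown"><li><a href="/ny">To contact our New York office</a></li></ul>
  <p>To contact our New York office, see the body text below.</p>
  <p>Charlie listed several rifles for sale this week.</p>
</body>
"""

extractor = BodyTextExtractor()
extractor.feed(page)
print(extractor.chunks)
# ['To contact our New York office, see the body text below.',
#  'Charlie listed several rifles for sale this week.']
```

Note the filter keys off the *container*, not the content: the New York sentence is dropped from the dropdown but kept from the `<p>` body text, so nothing here is specific to one domain or another.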