VIDA-NYU/domain_discovery_tool

Start crawl sends wrong seed to the crawler

aecio opened this issue · 2 comments

aecio commented

When DDT sends the seed URL to the crawler, it appends the string ,1 to the end of the URL. Maybe that string is the count of URLs shown in the recommendations box.

This does not seem to be the case. The following ACHE crawler log output, produced when URLs are added, confirms this:

[2017-08-03 15:50:34,238] INFO [qtp597874846-15] (FrontierManager.java:236) - Adding 3 seed URL(s)...
[2017-08-03 15:50:34,320] INFO [qtp597874846-15] (FrontierManager.java:248) - Added seed URL: http://answers.yahoo.com/dir/index/discover?sid=396545327
[2017-08-03 15:50:34,320] INFO [qtp597874846-15] (FrontierManager.java:248) - Added seed URL: http://answers.yahoo.com/dir/index/discover?sid=396545433
[2017-08-03 15:50:34,321] INFO [qtp597874846-15] (FrontierManager.java:248) - Added seed URL: http://answers.yahoo.com/

aecio commented

This issue is still happening, though it is not always appending ,1. Right now I'm seeing that it appended 1 to the URLs shown in "Crawling View" -> "Deep Crawling" -> "Domains for crawling".
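
A minimal sketch of a check that could flag affected seeds before they reach the crawler. It assumes the stray suffix is always a comma followed by digits at the very end of the seed (as in the ,1 case above); the helper names `strip_count_suffix` and `find_suspicious_seeds` are hypothetical and not part of DDT or ACHE. It would not catch the variant where a bare 1 is appended without a comma.

```python
import re

# Assumption: the stray suffix is a comma followed by digits at the end of
# the seed, e.g. "http://answers.yahoo.com/,1". Hypothetical helpers only.
_COUNT_SUFFIX = re.compile(r",\d+$")


def strip_count_suffix(seed):
    """Return the seed without a trailing ",<digits>" suffix, if present."""
    return _COUNT_SUFFIX.sub("", seed)


def find_suspicious_seeds(seeds):
    """Return the seeds that end in ",<digits>" so they can be inspected."""
    return [s for s in seeds if _COUNT_SUFFIX.search(s)]
```

Since a legitimate URL can also end in a comma and digits, it is probably safer to use `find_suspicious_seeds` for logging while debugging than to strip the suffix silently.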