USCDataScience/sparkler

Argument '-i -1' does not work.

MobinRanjbar opened this issue · 4 comments

Hi there,

I wanted to crawl the whole content of a website. When I run the command below, the crawling process does not start. What is wrong?

bin/sparkler.sh crawl -id 1 -i -1

Output:
2020-06-19 12:38:06 INFO Crawler$:153 - Committing crawldb..
2020-06-19 12:38:06 INFO Crawler$:221 - Shutting down Spark CTX..

Sparkler does nothing when there are no URLs to crawl, and your output suggests there are no new URLs to be crawled.
Try injecting some new URLs and then crawl again.

Hi,

I had already injected a new URL before crawling, as shown below. The same thing happens.

bin/sparkler.sh inject -id 1 -su 'https://www.nasa.gov/'

I am guessing there is an error in your setup.
Could you please try it from the Docker image at https://hub.docker.com/r/uscdatascience/sparkler/tags ?

CC @buggtb: do you have any guesses as to why, when, or how this might happen?

Hi,

The same thing happened in Docker:

sparkler@292e25536b51:/data/sparkler$ bin/sparkler.sh inject -id 1 -su 'https://www.nasa.gov/'
2020-06-23 07:46:16 INFO Injector$:97 - Injecting 1 seeds
jobId = 1
sparkler@292e25536b51:/data/sparkler$ bin/sparkler.sh crawl -id 1 -tn 100 -i -1
2020-06-23 07:46:35 WARN NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2020-06-23 07:46:40 INFO Crawler$:153 - Committing crawldb..
2020-06-23 07:46:40 INFO Crawler$:221 - Shutting down Spark CTX..
sparkler@292e25536b51:/data/sparkler$

Have you ever tried that argument?