USCDataScience/sparkler

Caught Server refused connection at: http://localhost:8983/solr/crawldb

francesco1119 opened this issue · 19 comments

Issue Description

Please describe your issue, along with:
It's very simple: the second command I ran from your guide didn't work.

How to reproduce it

I ran bash dockler.sh and the result was:

root@DS1515:/volume3/Docker_Volume/Sparkler# bash dockler.sh
Cant find docker image sparkler-local. Going to Fetch it
Fetching uscdatascience/sparkler:latest and tagging as sparkler-local
latest: Pulling from uscdatascience/sparkler
Digest: sha256:4395aa8e69a220cd3bf52ada94aa6dc2ed3e84919470a007faf9cf80f89308eb
Status: Image is up to date for uscdatascience/sparkler:latest
docker.io/uscdatascience/sparkler:latest
Found image: 7bf3f592ca23
Going to launch the shell inside sparkler's docker container.
You can press CTRL-D to exit.
You can rerun this script to resume.
You can access solr at http://localhost:8983/solr when solr is running
You can spark master UI at http://localhost:4041/ when spark master is running

Some useful queries:

- Get stats on groups, status, depth:
    http://localhost:8983/solr/crawldb/query?q=*:*&rows=0&facet=true&&facet.field=crawl_id&facet.field=status&facet.field=group&facet.field=discover_depth

Inside docker, you can do the following:

/data/solr/bin/solr - command line tool for administering solr
    start -force -> start solr
    stop -force -> stop solr
    status -force -> get status of solr
    restart -force -> restart solr

/data/sparkler/bin/sparkler.sh - command line interface to sparkler
   inject - inject seed urls
   crawl - launch a crawl job

As the second step I ran /data/sparkler/bin/sparkler.sh inject -id 1 -su 'http://www.bbc.com/news' and got:

bash-4.2$ /data/sparkler/bin/sparkler.sh inject -id 1 -su 'http://www.bbc.com/news'
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.apache.logging.log4j.log4j-slf4j-impl-2.11.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.slf4j.slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2021-11-27 23:18:42 INFO  PluginService$:53 - Loading plugins...
2021-11-27 23:18:42 INFO  PluginService$:62 - 2 plugin(s) Active: [urlfilter-regex, urlfilter-samehost]
2021-11-27 23:18:42 WARN  PluginService$:65 - 4 extra plugin(s) available but not activated: Set(fetcher-chrome, scorer-dd-svn, fetcher-jbrowser, fetcher-htmlunit)
2021-11-27 23:18:42 DEBUG PluginService$:68 - Loading urlfilter-regex
2021-11-27 23:18:42 INFO  PluginService$:73 - Extensions found: []
2021-11-27 23:18:42 DEBUG PluginService$:68 - Loading urlfilter-samehost
2021-11-27 23:18:42 INFO  PluginService$:73 - Extensions found: []
2021-11-27 23:18:42 INFO  PluginService$:82 - Recognised Plugins: Map()
2021-11-27 23:18:42 INFO  Injector$:108 - Injecting 1 seeds
2021-11-27 23:18:43 WARN  SolrProxy:93 - Caught Server refused connection at: http://localhost:8983/solr/crawldb while adding beans, trying to add one by one
2021-11-27 23:18:43 WARN  SolrProxy:100 - (SKIPPED) Server refused connection at: http://localhost:8983/solr/crawldb while adding [!!!edu.usc.irds.sparkler.model.Resource@26a529dc=>java.util.IllegalFormatConversionException:f != java.util.HashMap!!!]
2021-11-27 23:18:43 DEBUG SolrProxy:101 - Server refused connection at: http://localhost:8983/solr/crawldb
org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8983/solr/crawldb
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:672) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:265) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:177) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
        at org.apache.solr.client.solrj.SolrClient.addBean(SolrClient.java:285) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
        at org.apache.solr.client.solrj.SolrClient.addBean(SolrClient.java:267) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
        at edu.usc.irds.sparkler.storage.solr.SolrProxy.addResources(SolrProxy.scala:97) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
        at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:111) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
        at edu.usc.irds.sparkler.base.CliTool.run(CliTool.scala:34) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
        at edu.usc.irds.sparkler.base.CliTool.run$(CliTool.scala:32) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
        at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:43) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
        at edu.usc.irds.sparkler.service.Injector$.main(Injector.scala:162) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
        at edu.usc.irds.sparkler.service.Injector.main(Injector.scala) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
        at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
        at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
        at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
        at java.lang.reflect.Method.invoke(Method.java:567) ~[?:?]
        at edu.usc.irds.sparkler.Main$.main(Main.scala:50) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
        at edu.usc.irds.sparkler.Main.main(Main.scala) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
Caused by: org.apache.http.conn.HttpHostConnectException: Connect to localhost:8983 [localhost/127.0.0.1] failed: Connection refused
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:156) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:564) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
        ... 19 more
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
        at sun.nio.ch.Net.pollConnectNow(Net.java:579) ~[?:?]
        at sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542) ~[?:?]
        at sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597) ~[?:?]
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:339) ~[?:?]
        at java.net.Socket.connect(Socket.java:603) ~[?:?]
        at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:75) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:564) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
        ... 19 more
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:567)
        at edu.usc.irds.sparkler.Main$.main(Main.scala:50)
        at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8983/solr/crawldb
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:672)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:265)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211)
        at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:504)
        at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:479)
        at edu.usc.irds.sparkler.storage.solr.SolrProxy.commitCrawlDb(SolrProxy.scala:112)
        at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:112)
        at edu.usc.irds.sparkler.base.CliTool.run(CliTool.scala:34)
        at edu.usc.irds.sparkler.base.CliTool.run$(CliTool.scala:32)
        at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:43)
        at edu.usc.irds.sparkler.service.Injector$.main(Injector.scala:162)
        at edu.usc.irds.sparkler.service.Injector.main(Injector.scala)
        ... 6 more
Caused by: org.apache.http.conn.HttpHostConnectException: Connect to localhost:8983 [localhost/127.0.0.1] failed: Connection refused
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:156)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
        at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:564)
        ... 18 more
Caused by: java.net.ConnectException: Connection refused
        at java.base/sun.nio.ch.Net.pollConnect(Native Method)
        at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:579)
        at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542)
        at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597)
        at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:339)
        at java.base/java.net.Socket.connect(Socket.java:603)
        at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:75)
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
        ... 28 more
2021-11-27 23:18:43 WARN  PluginService$:49 - Stopping all plugins... Runtime is about to exit.

Environment and Version Information

Please indicate relevant versions, including, if relevant:

  • Java Version: 1.8.0_275, but I thought it was provided inside your Docker image
  • Spark Version: I thought it was already installed inside your Docker image. If not, I haven't installed it
  • Operating System name and version: Docker installed on a Synology DS1515+. If I run docker version I get:
    Client:
     Version:       20.10.3
     API version:   1.41
     Go version:    go1.15.6
     Git commit:    b35e731
     Built:         Fri Jun 18 08:25:45 2021
     OS/Arch:       linux/amd64
     Context:       default
     Experimental:  true

    Server:
     Engine:
      Version:      20.10.3
      API version:  1.41 (minimum version 1.12)
      Go version:   go1.15.6
      Git commit:   e7f7c95
      Built:        Fri Jun 18 08:26:10 2021
      OS/Arch:      linux/amd64
      Experimental: false
     containerd:
      Version:      v1.4.3
      GitCommit:    b1dc45ec561bd867c4805eee786caab7cc83acae
     runc:
      Version:      v1.0.0-rc93
      GitCommit:    89783e1862a2cc04647ab15b6e88a0af3d66fac3
     docker-init:
      Version:      0.19.0
      GitCommit:    12b6a20

External links for reference

Nah, just tell me whether Java and Spark are inside your Docker image or not. If they are not and I have to install them myself, you can close this ticket.
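
For reference, one can check both from the container shell that dockler.sh opens (a sketch; the lib path is the one that appears in the SLF4J lines in the log above):

    java -version                                                        # JVM bundled in the image
    ls /data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/ | grep -i spark   # Spark ships as jars in the app's lib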

Contributing

I'm willing to contribute

@francesco1119 Thanks for reaching out.
Server refused connection at: http://localhost:8983/solr/crawldb says the solr service is not running.
Please check/debug why solr is not starting up, or, if it is running, why you are getting this exception:

Caused by: org.apache.http.conn.HttpHostConnectException: Connect to localhost:8983 [localhost/127.0.0.1] failed: Connection refused
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:156)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
        at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:564)
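
One quick way to check from inside the container is to reuse the admin commands from the startup banner (a sketch; the paths are the ones the banner prints):

    /data/solr/bin/solr status -force    # report whether Solr is running and on which port
    # or probe the HTTP endpoint directly:
    curl -sf http://localhost:8983/solr/admin/info/system > /dev/null && echo "solr is up" || echo "solr is down"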

Hi @thammegowda, and thank you for your help.
The documentation doesn't say that I have to install solr.
In fact I thought it came within the docker image...

I followed your documentation, and it says that installing solr is optional.

As a new user I can help you rewrite your documentation, but honestly I have no idea why solr is not starting.

@francesco1119
dockler.sh is supposed to start the solr service.
I just ran it now and got:

bash dockler.sh
Cant find docker image sparkler-local. Going to Fetch it
Fetching uscdatascience/sparkler:latest and tagging as sparkler-local
[...truncated]
Found image: 7bf3f592ca23
No container is running for 7bf3f592ca23. Starting it...
Starting solr server inside the container
Waiting up to 180 seconds to see Solr running on port 8983 [/]
Started Solr server on port 8983 (pid=61). Happy searching!

In the last part of the output, it starts solr and waits until solr is up before going to the next step.
I don't see those messages in your output.
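
If those "Starting solr server" lines never appear, starting Solr by hand with the banner's commands should surface the underlying error (a sketch, assuming the same paths):

    /data/solr/bin/solr start -force     # start solr inside the container
    /data/solr/bin/solr status -force    # confirm it is listening on port 8983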

@thammegowda, I tried again.

I ran bash dockler.sh and received:

Cant find docker image sparkler-local. Going to Fetch it
Fetching uscdatascience/sparkler:latest and tagging as sparkler-local
latest: Pulling from uscdatascience/sparkler
Digest: sha256:4395aa8e69a220cd3bf52ada94aa6dc2ed3e84919470a007faf9cf80f89308eb
Status: Image is up to date for uscdatascience/sparkler:latest
docker.io/uscdatascience/sparkler:latest
Found image: 7bf3f592ca23
No container is running for 7bf3f592ca23. Starting it...
Starting solr server inside the container
Waiting up to 180 seconds to see Solr running on port 8983 [-]
Started Solr server on port 8983 (pid=62). Happy searching!

Going to launch the shell inside sparkler's docker container.
You can press CTRL-D to exit.
You can rerun this script to resume.
You can access solr at http://localhost:8983/solr when solr is running
You can spark master UI at http://localhost:4041/ when spark master is running

Some useful queries:

- Get stats on groups, status, depth:
    http://localhost:8983/solr/crawldb/query?q=*:*&rows=0&facet=true&&facet.field=crawl_id&facet.field=status&facet.field=group&facet.field=discover_depth

Inside docker, you can do the following:

/data/solr/bin/solr - command line tool for administering solr
    start -force -> start solr
    stop -force -> stop solr
    status -force -> get status of solr
    restart -force -> restart solr

/data/sparkler/bin/sparkler.sh - command line interface to sparkler
   inject - inject seed urls
   crawl - launch a crawl job

And yes, everything is fine, solr is up and running:

[screenshot]

I then ran /data/sparkler/bin/sparkler.sh inject -id 1 -su 'http://www.bbc.com/news' and the command executed correctly. Or at least that's what I believe:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.apache.logging.log4j.log4j-slf4j-impl-2.11.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.slf4j.slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2021-12-01 19:57:12 INFO  PluginService$:53 - Loading plugins...
2021-12-01 19:57:12 INFO  PluginService$:62 - 2 plugin(s) Active: [urlfilter-regex, urlfilter-samehost]
2021-12-01 19:57:13 WARN  PluginService$:65 - 4 extra plugin(s) available but not activated: Set(fetcher-chrome, scorer-dd-svn, fetcher-jbrowser, fetcher-htmlunit)
2021-12-01 19:57:13 DEBUG PluginService$:68 - Loading urlfilter-regex
2021-12-01 19:57:13 INFO  PluginService$:73 - Extensions found: []
2021-12-01 19:57:13 DEBUG PluginService$:68 - Loading urlfilter-samehost
2021-12-01 19:57:13 INFO  PluginService$:73 - Extensions found: []
2021-12-01 19:57:13 INFO  PluginService$:82 - Recognised Plugins: Map()
2021-12-01 19:57:13 INFO  Injector$:108 - Injecting 1 seeds
>>jobId = 1
2021-12-01 19:57:13 WARN  PluginService$:49 - Stopping all plugins... Runtime is about to exit.

And when I move on to the very last step with /data/sparkler/bin/sparkler.sh crawl -id 1 -tn 100 -i 2 (id=1, top 100 URLs, -i=2 iterations):

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.apache.logging.log4j.log4j-slf4j-impl-2.11.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.slf4j.slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.apache.spark.spark-unsafe_2.12-3.0.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
2021-12-01 19:58:24 WARN  NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-12-01 19:58:26 INFO  Crawler$:160 - Setting local job: {User-Agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Sparkler/${project.version}, Accept=text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8, Accept-Language=en-US,en}
2021-12-01 19:58:26 INFO  Crawler$:174 - Committing crawldb..
2021-12-01 19:58:26 INFO  Crawler$:219 - Starting the job:1, task:906e00e1-7369-4a64-9593-17fe85d0566a
2021-12-01 19:58:26 INFO  MemexCrawlDbRDD$:54 - selecting 1 out of 1
2021-12-01 19:58:27 DEBUG SolrResultIterator$:63 - Query status:UNFETCHED, Start = 0
2021-12-01 19:58:27 DEBUG SolrResultIterator$:77 - Reached the end of result set
2021-12-01 19:58:27 DEBUG SolrResultIterator$:79 - closing solr client.
2021-12-01 19:58:27 WARN  BlockManager:69 - Block rdd_3_0 could not be removed as it was not found on disk or in memory
2021-12-01 19:58:27 ERROR Executor:94 - Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NoSuchMethodError: 'void net.jpountz.lz4.LZ4BlockInputStream.<init>(java.io.InputStream, net.jpountz.lz4.LZ4FastDecompressor, java.util.zip.Checksum, boolean)'
        at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:154) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:165) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:126) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.shuffle.BlockStoreShuffleReader.$anonfun$read$1(BlockStoreShuffleReader.scala:74) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:630) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:70) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) ~[org.scala-lang.scala-library-2.12.12.jar:?]
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) ~[org.scala-lang.scala-library-2.12.12.jar:?]
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[org.scala-lang.scala-library-2.12.12.jar:?]
        at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:155) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:116) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:362) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1371) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1298) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1362) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1186) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:360) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:311) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.scheduler.Task.run(Task.scala:127) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) [org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]
2021-12-01 19:58:27 WARN  TaskSetManager:69 - Lost task 0.0 in stage 1.0 (TID 1, 969ed83b7c3d, executor driver): java.lang.NoSuchMethodError: 'void net.jpountz.lz4.LZ4BlockInputStream.<init>(java.io.InputStream, net.jpountz.lz4.LZ4FastDecompressor, java.util.zip.Checksum, boolean)'
        at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:154)
        at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:165)
        at org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:126)
        at org.apache.spark.shuffle.BlockStoreShuffleReader.$anonfun$read$1(BlockStoreShuffleReader.scala:74)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:630)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:70)
        at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:155)
        at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41)
        at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:116)
        at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:362)
        at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1371)
        at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1298)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1362)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1186)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:360)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:311)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:830)

2021-12-01 19:58:27 ERROR TaskSetManager:73 - Task 0 in stage 1.0 failed 1 times; aborting job
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:567)
        at edu.usc.irds.sparkler.Main$.main(Main.scala:50)
        at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, 969ed83b7c3d, executor driver): java.lang.NoSuchMethodError: 'void net.jpountz.lz4.LZ4BlockInputStream.<init>(java.io.InputStream, net.jpountz.lz4.LZ4FastDecompressor, java.util.zip.Checksum, boolean)'
        at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:154)
        at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:165)
        at org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:126)
        at org.apache.spark.shuffle.BlockStoreShuffleReader.$anonfun$read$1(BlockStoreShuffleReader.scala:74)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:630)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:70)
        at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:155)
        at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41)
        at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:116)
        at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:362)
        at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1371)
        at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1298)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1362)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1186)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:360)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:311)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:830)

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
        at scala.Option.foreach(Option.scala:407)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2120)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2152)
        at edu.usc.irds.sparkler.pipeline.Crawler.score(Crawler.scala:254)
        at edu.usc.irds.sparkler.pipeline.Crawler.$anonfun$run$1(Crawler.scala:231)
        at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
        at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:179)
        at edu.usc.irds.sparkler.base.CliTool.run(CliTool.scala:34)
        at edu.usc.irds.sparkler.base.CliTool.run$(CliTool.scala:32)
        at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:50)
        at edu.usc.irds.sparkler.pipeline.Crawler$.main(Crawler.scala:338)
        at edu.usc.irds.sparkler.pipeline.Crawler.main(Crawler.scala)
        ... 6 more
Caused by: java.lang.NoSuchMethodError: 'void net.jpountz.lz4.LZ4BlockInputStream.<init>(java.io.InputStream, net.jpountz.lz4.LZ4FastDecompressor, java.util.zip.Checksum, boolean)'
        at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:154)
        at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:165)
        at org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:126)
        at org.apache.spark.shuffle.BlockStoreShuffleReader.$anonfun$read$1(BlockStoreShuffleReader.scala:74)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:630)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:70)
        at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:155)
        at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41)
        at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:116)
        at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:362)
        at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1371)
        at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1298)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1362)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1186)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:360)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:311)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:830)

I'm literally following your documentation.

Thank you @lewismc,

I installed the latest version of Docker today; this is the only thing that has changed since yesterday.
So maybe this change at the environment level triggered something that allowed me to go to the next step.

...we will never know what that was. Sorry, I haven't noted which Docker version I was testing with yesterday; it might have been six months old, but not more.

I'm watching your repository and I will definitely try your next release as soon as it's out.
Can you please confirm that you have tried this on your end and get the same result with a fresh installation?

Otherwise, if you can't reproduce it, I'll keep investigating.

@lewismc, I see where the problem is:

Caused by: java.lang.NoSuchMethodError: 'void net.jpountz.lz4.LZ4BlockInputStream.<init>(java.io.InputStream, net.jpountz.lz4.LZ4FastDecompressor, java.util.zip.Checksum, boolean)'

It's mentioned on the very first page of your GitHub project:

[screenshot]

56ad891

            <exclusions>
                <exclusion>
                    <groupId>net.jpountz.lz4</groupId>
                    <artifactId>lz4</artifactId>
                </exclusion>
            </exclusions>

The exclusion of that dependency was hardcoded.
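
A quick way to see which lz4 and Kafka jars actually ship in the image is to list the app's lib directory (a sketch; the path is the one from the SLF4J lines in the logs above):

    ls /data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/ | grep -iE 'lz4|kafka'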

I believe this issue is due to Spark and Kafka being incompatible on the lz4 dependency: https://stackoverflow.com/a/51052507/1506477
And excluding lz4 from Kafka is the right thing to do (hence the exclusion is good!).

However, on Docker Hub (https://hub.docker.com/repository/docker/uscdatascience/sparkler)
I see the docker image was last updated 6 months ago, but this exclusion commit is newer.
I think rebuilding the Docker image and releasing it should fix this:
https://github.com/USCDataScience/sparkler/wiki/Build-and-Deploy#docker-build
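
For reference, a local rebuild along the lines of that wiki page would look roughly like this (a sketch; the exact build steps and Dockerfile location are documented in the wiki, so treat these commands as placeholders):

    git clone https://github.com/USCDataScience/sparkler.git
    cd sparkler
    docker build -t sparkler-local .    # hypothetical invocation; see the Build-and-Deploy page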

Yes @thammegowda, the link you provided has an update from September that says:

Update: This appears to be an issue with Kafka 0.11.x.x and earlier version. As of 1.x.x Kafka seems to have moved away from using the problematic net.jpountz.lz4 library. Therefore, using latest Kafka (1.x) with latest Spark (2.3.x) should not have this issue.

Hence the latest Spark with the latest Kafka should give no problems.

I look forward to testing your new image.

@lewismc Docs are here https://github.com/USCDataScience/sparkler/blob/master/Release-Checklist.md
I believe @buggtb has been releasing docker images since I left IRDS/JPL.

@buggtb any chance of you performing a release of the new convenience binaries? Thanks

I am also facing this issue. Since the fix is already merged, I tried to build the docker image from it, and it fails here:
Step 8/13 : COPY ./sparkler-ui/sparkler-dashboard/sparkler-ui-*.war /data/solr/server/solr-webapp/sparkler
COPY failed: no source files were specified

Can you please share the steps to build sparkler-ui? I don't see sparkler-dashboard in sparkler-ui.

Hi @thammegowda & @lewismc, let me know when you have a stable Docker release and I will test it on my end.

Thank you

Hi all

I have my dissertation defense this month, so I am totally focused on that. I will have more availability for this project in April (after my dissertation).

@buggtb @karanjeets @chrismattmann any help or suggestions here, sir/bro?

Focus on your dissertation, I'm busy too.
Let's keep in touch.
Thank you

Hi @thammegowda, how is it going?
Have you had time to take a look at Sparkler?
I haven't experienced the same issue since, but I have new problems.

Hello @lewismc, the error seems to have changed since last year.

If I execute:

sudo docker run -v elastic:/elasticsearch-7.17.0/data ghcr.io/uscdatascience/sparkler/sparkler:main inject -id myid -su 'http://www.bbc.com/news'

the error now is:

15:52:08.623 [main] DEBUG org.pf4j.AbstractExtensionFinder - Finding extensions of extension point 'edu.usc.irds.sparkler.Config'
15:52:08.624 [main] DEBUG org.pf4j.AbstractExtensionFinder - Finding extensions of extension point 'edu.usc.irds.sparkler.Config' for plugin 'null'
15:52:08.625 [main] DEBUG org.pf4j.AbstractExtensionFinder - Finding extensions of extension point 'edu.usc.irds.sparkler.Config' for plugin 'fetcher-chrome'
15:52:08.626 [main] DEBUG org.pf4j.AbstractExtensionFinder - Finding extensions of extension point 'edu.usc.irds.sparkler.Config' for plugin 'urlfilter-regex'
15:52:08.627 [main] DEBUG org.pf4j.AbstractExtensionFinder - Loading class 'edu.usc.irds.sparkler.plugin.RegexURLFilter' using class loader 'org.pf4j.PluginClassLoader@158a8276'
15:52:08.637 [main] DEBUG org.pf4j.AbstractExtensionFinder - Checking extension type 'edu.usc.irds.sparkler.plugin.RegexURLFilter'
15:52:08.639 [main] DEBUG org.pf4j.AbstractExtensionFinder - No extensions found for extension point 'edu.usc.irds.sparkler.Config'
15:52:08.639 [main] DEBUG org.pf4j.AbstractExtensionFinder - Finding extensions of extension point 'edu.usc.irds.sparkler.Config' for plugin 'databricks-api'
15:52:08.640 [main] DEBUG org.pf4j.AbstractExtensionFinder - Finding extensions of extension point 'edu.usc.irds.sparkler.Config' for plugin 'fetcher-htmlunit'
15:52:08.641 [main] DEBUG org.pf4j.AbstractExtensionFinder - Finding extensions of extension point 'edu.usc.irds.sparkler.Config' for plugin 'url-injector'
15:52:08.642 [main] DEBUG org.pf4j.AbstractExtensionFinder - Finding extensions of extension point 'edu.usc.irds.sparkler.Config' for plugin 'urlfilter-samehost'
15:52:08.643 [main] DEBUG org.pf4j.AbstractExtensionFinder - Loading class 'edu.usc.irds.sparkler.plugin.UrlFilterSameHost' using class loader 'org.pf4j.PluginClassLoader@5fbe4146'
15:52:08.644 [main] DEBUG org.pf4j.AbstractExtensionFinder - Checking extension type 'edu.usc.irds.sparkler.plugin.UrlFilterSameHost'
15:52:08.645 [main] DEBUG org.pf4j.AbstractExtensionFinder - No extensions found for extension point 'edu.usc.irds.sparkler.Config'
15:52:08.646 [main] DEBUG org.pf4j.AbstractExtensionFinder - Finding extensions of extension point 'edu.usc.irds.sparkler.Config' for plugin 'scorer-dd-svn'
15:52:08.647 [main] DEBUG org.pf4j.AbstractExtensionFinder - No extensions found for extension point 'edu.usc.irds.sparkler.Config'
15:52:08.822 [main] INFO edu.usc.irds.sparkler.service.Injector$ - Injecting 1 seeds
15:52:12.990 [main] DEBUG org.apache.http.impl.nio.client.MainClientExec - [exchange: 1] start execution
15:52:13.007 [main] DEBUG org.apache.http.client.protocol.RequestAddCookies - CookieSpec selected: default
15:52:13.038 [main] DEBUG org.apache.http.client.protocol.RequestAuthCache - Re-using cached 'basic' auth scheme for http://localhost:9200
15:52:13.040 [main] DEBUG org.apache.http.client.protocol.RequestAuthCache - No credentials for preemptive authentication
15:52:13.041 [main] DEBUG org.apache.http.impl.nio.client.InternalHttpAsyncClient - [exchange: 1] Request connection for {}->http://localhost:9200
15:52:13.045 [main] DEBUG org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager - Connection request: [route: {}->http://localhost:9200][total kept alive: 0; route allocated: 0 of 10; total allocated: 0 of 30]
15:52:13.088 [pool-2-thread-1] DEBUG org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager - Connection request failed
java.net.ConnectException: Connection refused
        at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
        at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvent(DefaultConnectingIOReactor.java:174)
        at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:148)
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:351)
        at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:221)
        at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)
        at java.base/java.lang.Thread.run(Thread.java:829)
15:52:13.089 [pool-2-thread-1] DEBUG org.apache.http.impl.nio.client.InternalHttpAsyncClient - [exchange: 1] connection request failed
15:52:13.092 [pool-2-thread-1] DEBUG org.elasticsearch.client.RestClient - request [GET http://localhost:9200/] failed
java.net.ConnectException: Connection refused
        at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
        at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvent(DefaultConnectingIOReactor.java:174)
        at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:148)
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:351)
        at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:221)
        at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)
        at java.base/java.lang.Thread.run(Thread.java:829)
15:52:13.095 [pool-2-thread-1] DEBUG org.elasticsearch.client.RestClient - added [[host=http://localhost:9200]] to blacklist
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at edu.usc.irds.sparkler.Main$.main(Main.scala:71)
        at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: ElasticsearchException[java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused]; nested: ExecutionException[java.net.ConnectException: Connection refused]; nested: ConnectException[Connection refused];
        at org.elasticsearch.client.RestHighLevelClient.performClientRequest(RestHighLevelClient.java:2695)
        at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:2171)
        at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:2137)
        at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:2105)
        at org.elasticsearch.client.RestHighLevelClient.index(RestHighLevelClient.java:1241)
        at edu.usc.irds.sparkler.storage.elasticsearch.ElasticsearchProxy.$anonfun$commitCrawlDb$1(ElasticsearchProxy.scala:175)
        at edu.usc.irds.sparkler.storage.elasticsearch.ElasticsearchProxy.$anonfun$commitCrawlDb$1$adapted(ElasticsearchProxy.scala:172)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at edu.usc.irds.sparkler.storage.elasticsearch.ElasticsearchProxy.commitCrawlDb(ElasticsearchProxy.scala:172)
        at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:137)
        at edu.usc.irds.sparkler.base.CliTool.run(CliTool.scala:34)
        at edu.usc.irds.sparkler.base.CliTool.run$(CliTool.scala:32)
        at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:39)
        at edu.usc.irds.sparkler.service.Injector$.main(Injector.scala:188)
        at edu.usc.irds.sparkler.service.Injector.main(Injector.scala)
        ... 6 more
Caused by: java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused
        at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:257)
        at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:244)
        at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:75)
        at org.elasticsearch.client.RestHighLevelClient.performClientRequest(RestHighLevelClient.java:2692)
        ... 22 more
Caused by: java.net.ConnectException: Connection refused
        at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
        at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvent(DefaultConnectingIOReactor.java:174)
        at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:148)
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:351)
        at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:221)
        at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)
        at java.base/java.lang.Thread.run(Thread.java:829)
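
Note that localhost:9200 inside the Sparkler container is the container itself, not the host, so Elasticsearch has to run in the same container or be reachable over a shared Docker network. A sketch with a user-defined network (the network and container names here are made up, and Sparkler's storage config would still need to point at http://elastic:9200 instead of localhost):

    docker network create sparkler-net
    docker run -d --name elastic --net sparkler-net -e discovery.type=single-node \
        docker.elastic.co/elasticsearch/elasticsearch:7.17.0
    sudo docker run --net sparkler-net -v elastic:/elasticsearch-7.17.0/data \
        ghcr.io/uscdatascience/sparkler/sparkler:main inject -id myid -su 'http://www.bbc.com/news'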

I also have a doubt: I find two different Docker repositories:

  • docker pull uscdatascience/sparkler:latest
  • docker pull ghcr.io/uscdatascience/sparkler/sparkler:main (I'm using this one currently)

Which is which?

It's fixed now