Caught Server refused connection at: http://localhost:8983/solr/crawldb
francesco1119 opened this issue · 19 comments
Issue Description
Please describe your issue, along with:
It's simple: the second command in your guide did not work when I ran it.
How to reproduce it
I ran bash dockler.sh
and the result was:
root@DS1515:/volume3/Docker_Volume/Sparkler# bash dockler.sh
Cant find docker image sparkler-local. Going to Fetch it
Fetching uscdatascience/sparkler:latest and tagging as sparkler-local
latest: Pulling from uscdatascience/sparkler
Digest: sha256:4395aa8e69a220cd3bf52ada94aa6dc2ed3e84919470a007faf9cf80f89308eb
Status: Image is up to date for uscdatascience/sparkler:latest
docker.io/uscdatascience/sparkler:latest
Found image: 7bf3f592ca23
Going to launch the shell inside sparkler's docker container.
You can press CTRL-D to exit.
You can rerun this script to resume.
You can access solr at http://localhost:8983/solr when solr is running
You can spark master UI at http://localhost:4041/ when spark master is running
Some useful queries:
- Get stats on groups, status, depth:
http://localhost:8983/solr/crawldb/query?q=*:*&rows=0&facet=true&&facet.field=crawl_id&facet.field=status&facet.field=group&facet.field=discover_depth
Inside docker, you can do the following:
/data/solr/bin/solr - command line tool for administering solr
start -force -> start solr
stop -force -> stop solr
status -force -> get status of solr
restart -force -> restart solr
/data/sparkler/bin/sparkler.sh - command line interface to sparkler
inject - inject seed urls
crawl - launch a crawl job
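The stats query printed in the output above is an ordinary Solr facet query, and its parameters are easier to read when the URL is built programmatically. A small sketch in Python, assuming the same crawldb core URL that the script prints; `stats_url` is a hypothetical helper, not part of Sparkler, and it only builds the URL without contacting Solr.

```python
from urllib.parse import urlencode

# Base URL of the crawldb core, as printed by dockler.sh.
SOLR_CORE = "http://localhost:8983/solr/crawldb"


def stats_url(core_url: str = SOLR_CORE) -> str:
    """Build the facet query that counts documents per crawl_id, status, group, and depth."""
    params = [
        ("q", "*:*"),    # match every document
        ("rows", "0"),   # return no documents, only the facet counts
        ("facet", "true"),
        ("facet.field", "crawl_id"),
        ("facet.field", "status"),
        ("facet.field", "group"),
        ("facet.field", "discover_depth"),
    ]
    # safe="*:" keeps the match-all query readable instead of percent-encoding it.
    return f"{core_url}/query?{urlencode(params, safe='*:')}"


print(stats_url())
```

Once Solr is running, the printed URL can be opened in a browser or fetched with any HTTP client.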
As the second step, I ran /data/sparkler/bin/sparkler.sh inject -id 1 -su 'http://www.bbc.com/news'
and got this result:
bash-4.2$ /data/sparkler/bin/sparkler.sh inject -id 1 -su 'http://www.bbc.com/news'
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.apache.logging.log4j.log4j-slf4j-impl-2.11.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.slf4j.slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2021-11-27 23:18:42 INFO PluginService$:53 - Loading plugins...
2021-11-27 23:18:42 INFO PluginService$:62 - 2 plugin(s) Active: [urlfilter-regex, urlfilter-samehost]
2021-11-27 23:18:42 WARN PluginService$:65 - 4 extra plugin(s) available but not activated: Set(fetcher-chrome, scorer-dd-svn, fetcher-jbrowser, fetcher-htmlunit)
2021-11-27 23:18:42 DEBUG PluginService$:68 - Loading urlfilter-regex
2021-11-27 23:18:42 INFO PluginService$:73 - Extensions found: []
2021-11-27 23:18:42 DEBUG PluginService$:68 - Loading urlfilter-samehost
2021-11-27 23:18:42 INFO PluginService$:73 - Extensions found: []
2021-11-27 23:18:42 INFO PluginService$:82 - Recognised Plugins: Map()
2021-11-27 23:18:42 INFO Injector$:108 - Injecting 1 seeds
2021-11-27 23:18:43 WARN SolrProxy:93 - Caught Server refused connection at: http://localhost:8983/solr/crawldb while adding beans, trying to add one by one
2021-11-27 23:18:43 WARN SolrProxy:100 - (SKIPPED) Server refused connection at: http://localhost:8983/solr/crawldb while adding [!!!edu.usc.irds.sparkler.model.Resource@26a529dc=>java.util.IllegalFormatConversionException:f != java.util.HashMap!!!]
2021-11-27 23:18:43 DEBUG SolrProxy:101 - Server refused connection at: http://localhost:8983/solr/crawldb
org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8983/solr/crawldb
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:672) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:265) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:177) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
at org.apache.solr.client.solrj.SolrClient.addBean(SolrClient.java:285) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
at org.apache.solr.client.solrj.SolrClient.addBean(SolrClient.java:267) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
at edu.usc.irds.sparkler.storage.solr.SolrProxy.addResources(SolrProxy.scala:97) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:111) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
at edu.usc.irds.sparkler.base.CliTool.run(CliTool.scala:34) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
at edu.usc.irds.sparkler.base.CliTool.run$(CliTool.scala:32) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:43) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
at edu.usc.irds.sparkler.service.Injector$.main(Injector.scala:162) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
at edu.usc.irds.sparkler.service.Injector.main(Injector.scala) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
at java.lang.reflect.Method.invoke(Method.java:567) ~[?:?]
at edu.usc.irds.sparkler.Main$.main(Main.scala:50) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
at edu.usc.irds.sparkler.Main.main(Main.scala) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
Caused by: org.apache.http.conn.HttpHostConnectException: Connect to localhost:8983 [localhost/127.0.0.1] failed: Connection refused
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:156) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:564) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
... 19 more
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
at sun.nio.ch.Net.pollConnectNow(Net.java:579) ~[?:?]
at sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542) ~[?:?]
at sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597) ~[?:?]
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:339) ~[?:?]
at java.net.Socket.connect(Socket.java:603) ~[?:?]
at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:75) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:564) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
... 19 more
Exception in thread "main" java.lang.reflect.InvocationTargetException
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:567)
at edu.usc.irds.sparkler.Main$.main(Main.scala:50)
at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8983/solr/crawldb
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:672)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:265)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211)
at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:504)
at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:479)
at edu.usc.irds.sparkler.storage.solr.SolrProxy.commitCrawlDb(SolrProxy.scala:112)
at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:112)
at edu.usc.irds.sparkler.base.CliTool.run(CliTool.scala:34)
at edu.usc.irds.sparkler.base.CliTool.run$(CliTool.scala:32)
at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:43)
at edu.usc.irds.sparkler.service.Injector$.main(Injector.scala:162)
at edu.usc.irds.sparkler.service.Injector.main(Injector.scala)
... 6 more
Caused by: org.apache.http.conn.HttpHostConnectException: Connect to localhost:8983 [localhost/127.0.0.1] failed: Connection refused
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:156)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:564)
... 18 more
Caused by: java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.Net.pollConnect(Native Method)
at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:579)
at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542)
at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597)
at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:339)
at java.base/java.net.Socket.connect(Socket.java:603)
at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:75)
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
... 28 more
2021-11-27 23:18:43 WARN PluginService$:49 - Stopping all plugins... Runtime is about to exit.
Environment and Version Information
Please indicate relevant versions, including, if relevant:
- Java Version: 1.8.0_275, but I thought Java was provided inside your Docker image
- Spark Version: I thought it was already installed inside your Docker image. If it is not, I haven't installed it
- Operating System name and version: Docker installed on a Synology DS1515+. If I run
docker version
I receive
Client:
Version: 20.10.3
API version: 1.41
Go version: go1.15.6
Git commit: b35e731
Built: Fri Jun 18 08:25:45 2021
OS/Arch: linux/amd64
Context: default
Experimental: true
Server:
Engine:
Version: 20.10.3
API version: 1.41 (minimum version 1.12)
Go version: go1.15.6
Git commit: e7f7c95
Built: Fri Jun 18 08:26:10 2021
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: v1.4.3
GitCommit: b1dc45ec561bd867c4805eee786caab7cc83acae
runc:
Version: v1.0.0-rc93
GitCommit: 89783e1862a2cc04647ab15b6e88a0af3d66fac3
docker-init:
Version: 0.19.0
GitCommit: 12b6a20
Any external links for reference
Nah, just tell me whether Java and Spark are inside your Docker image or not. If they are not and I have to install them myself, you can close this ticket.
Contributing
I'm willing to contribute
@francesco1119 Thanks for reaching out.
Server refused connection at: http://localhost:8983/solr/crawldb
means the solr service is not running.
Please check why solr is not starting up or, if it is running, why you are still getting this exception:
Caused by: org.apache.http.conn.HttpHostConnectException: Connect to localhost:8983 [localhost/127.0.0.1] failed: Connection refused
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:156)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:564)
Hi @thammegowda and thank you for your help.
The documentation does not say that I have to install solr. In fact, I thought it came with the docker image.
I followed your documentation, and it says that installing solr is optional.
As a new user, I can help you rewrite your documentation, but honestly I have no idea why solr is not starting.
@francesco1119
dockler.sh is supposed to start the solr service.
I just ran it and got:
bash dockler.sh
Cant find docker image sparkler-local. Going to Fetch it
Fetching uscdatascience/sparkler:latest and tagging as sparkler-local
[...truncated]
Found image: 7bf3f592ca23
No container is running for 7bf3f592ca23. Starting it...
Starting solr server inside the container
Waiting up to 180 seconds to see Solr running on port 8983 [/]
Started Solr server on port 8983 (pid=61). Happy searching!
In the last part of the output, it starts solr and waits until solr is up before going on to the next step.
I don't see these messages in your output.
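The wait step described above can be approximated with a polling loop. This is a sketch in Python and not the code that dockler.sh actually runs; `wait_for_port` is a hypothetical name, and the 180 second budget and port 8983 are copied from the script output.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 180.0,
                  interval: float = 1.0) -> bool:
    """Poll a TCP port until it accepts a connection or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True  # Something is listening, so the service is up.
        except OSError:
            pass  # Not up yet (refused or timed out); retry below.
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Example call, using the values from the script output:
# wait_for_port("localhost", 8983, timeout=180)
```

Running the inject command only after such a check returns True avoids the "Server refused connection" error that appears when Solr has not been started.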
@thammegowda, I tried again.
I ran bash dockler.sh
and got:
Cant find docker image sparkler-local. Going to Fetch it
Fetching uscdatascience/sparkler:latest and tagging as sparkler-local
latest: Pulling from uscdatascience/sparkler
Digest: sha256:4395aa8e69a220cd3bf52ada94aa6dc2ed3e84919470a007faf9cf80f89308eb
Status: Image is up to date for uscdatascience/sparkler:latest
docker.io/uscdatascience/sparkler:latest
Found image: 7bf3f592ca23
No container is running for 7bf3f592ca23. Starting it...
Starting solr server inside the container
Waiting up to 180 seconds to see Solr running on port 8983 [-]
Started Solr server on port 8983 (pid=62). Happy searching!
Going to launch the shell inside sparkler's docker container.
You can press CTRL-D to exit.
You can rerun this script to resume.
You can access solr at http://localhost:8983/solr when solr is running
You can spark master UI at http://localhost:4041/ when spark master is running
Some useful queries:
- Get stats on groups, status, depth:
http://localhost:8983/solr/crawldb/query?q=*:*&rows=0&facet=true&&facet.field=crawl_id&facet.field=status&facet.field=group&facet.field=discover_depth
Inside docker, you can do the following:
/data/solr/bin/solr - command line tool for administering solr
start -force -> start solr
stop -force -> stop solr
status -force -> get status of solr
restart -force -> restart solr
/data/sparkler/bin/sparkler.sh - command line interface to sparkler
inject - inject seed urls
crawl - launch a crawl job
And yes, everything is fine, solr is up and running.
I then ran /data/sparkler/bin/sparkler.sh inject -id 1 -su 'http://www.bbc.com/news'
and the command executed correctly, or at least I believe so:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.apache.logging.log4j.log4j-slf4j-impl-2.11.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.slf4j.slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2021-12-01 19:57:12 INFO PluginService$:53 - Loading plugins...
2021-12-01 19:57:12 INFO PluginService$:62 - 2 plugin(s) Active: [urlfilter-regex, urlfilter-samehost]
2021-12-01 19:57:13 WARN PluginService$:65 - 4 extra plugin(s) available but not activated: Set(fetcher-chrome, scorer-dd-svn, fetcher-jbrowser, fetcher-htmlunit)
2021-12-01 19:57:13 DEBUG PluginService$:68 - Loading urlfilter-regex
2021-12-01 19:57:13 INFO PluginService$:73 - Extensions found: []
2021-12-01 19:57:13 DEBUG PluginService$:68 - Loading urlfilter-samehost
2021-12-01 19:57:13 INFO PluginService$:73 - Extensions found: []
2021-12-01 19:57:13 INFO PluginService$:82 - Recognised Plugins: Map()
2021-12-01 19:57:13 INFO Injector$:108 - Injecting 1 seeds
>>jobId = 1
2021-12-01 19:57:13 WARN PluginService$:49 - Stopping all plugins... Runtime is about to exit.
And when I move on to the very last step with /data/sparkler/bin/sparkler.sh crawl -id 1 -tn 100 -i 2 (id=1, top 100 URLs, 2 iterations), I get:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.apache.logging.log4j.log4j-slf4j-impl-2.11.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.slf4j.slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.apache.spark.spark-unsafe_2.12-3.0.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
2021-12-01 19:58:24 WARN NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-12-01 19:58:26 INFO Crawler$:160 - Setting local job: {User-Agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Sparkler/${project.version}, Accept=text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8, Accept-Language=en-US,en}
2021-12-01 19:58:26 INFO Crawler$:174 - Committing crawldb..
2021-12-01 19:58:26 INFO Crawler$:219 - Starting the job:1, task:906e00e1-7369-4a64-9593-17fe85d0566a
2021-12-01 19:58:26 INFO MemexCrawlDbRDD$:54 - selecting 1 out of 1
2021-12-01 19:58:27 DEBUG SolrResultIterator$:63 - Query status:UNFETCHED, Start = 0
2021-12-01 19:58:27 DEBUG SolrResultIterator$:77 - Reached the end of result set
2021-12-01 19:58:27 DEBUG SolrResultIterator$:79 - closing solr client.
2021-12-01 19:58:27 WARN BlockManager:69 - Block rdd_3_0 could not be removed as it was not found on disk or in memory
2021-12-01 19:58:27 ERROR Executor:94 - Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NoSuchMethodError: 'void net.jpountz.lz4.LZ4BlockInputStream.<init>(java.io.InputStream, net.jpountz.lz4.LZ4FastDecompressor, java.util.zip.Checksum, boolean)'
at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:154) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:165) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:126) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.shuffle.BlockStoreShuffleReader.$anonfun$read$1(BlockStoreShuffleReader.scala:74) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:630) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:70) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) ~[org.scala-lang.scala-library-2.12.12.jar:?]
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) ~[org.scala-lang.scala-library-2.12.12.jar:?]
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[org.scala-lang.scala-library-2.12.12.jar:?]
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:155) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:116) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:362) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1371) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1298) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1362) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1186) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:360) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:311) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.scheduler.Task.run(Task.scala:127) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) [org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:830) [?:?]
2021-12-01 19:58:27 WARN TaskSetManager:69 - Lost task 0.0 in stage 1.0 (TID 1, 969ed83b7c3d, executor driver): java.lang.NoSuchMethodError: 'void net.jpountz.lz4.LZ4BlockInputStream.<init>(java.io.InputStream, net.jpountz.lz4.LZ4FastDecompressor, java.util.zip.Checksum, boolean)'
at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:154)
at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:165)
at org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:126)
at org.apache.spark.shuffle.BlockStoreShuffleReader.$anonfun$read$1(BlockStoreShuffleReader.scala:74)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:630)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:70)
at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:155)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:116)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:362)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1371)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1298)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1362)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1186)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:360)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:311)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:830)
2021-12-01 19:58:27 ERROR TaskSetManager:73 - Task 0 in stage 1.0 failed 1 times; aborting job
Exception in thread "main" java.lang.reflect.InvocationTargetException
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:567)
at edu.usc.irds.sparkler.Main$.main(Main.scala:50)
at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, 969ed83b7c3d, executor driver): java.lang.NoSuchMethodError: 'void net.jpountz.lz4.LZ4BlockInputStream.<init>(java.io.InputStream, net.jpountz.lz4.LZ4FastDecompressor, java.util.zip.Checksum, boolean)'
at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:154)
at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:165)
at org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:126)
at org.apache.spark.shuffle.BlockStoreShuffleReader.$anonfun$read$1(BlockStoreShuffleReader.scala:74)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:630)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:70)
at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:155)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:116)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:362)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1371)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1298)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1362)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1186)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:360)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:311)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:830)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2120)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2152)
at edu.usc.irds.sparkler.pipeline.Crawler.score(Crawler.scala:254)
at edu.usc.irds.sparkler.pipeline.Crawler.$anonfun$run$1(Crawler.scala:231)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:179)
at edu.usc.irds.sparkler.base.CliTool.run(CliTool.scala:34)
at edu.usc.irds.sparkler.base.CliTool.run$(CliTool.scala:32)
at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:50)
at edu.usc.irds.sparkler.pipeline.Crawler$.main(Crawler.scala:338)
at edu.usc.irds.sparkler.pipeline.Crawler.main(Crawler.scala)
... 6 more
Caused by: java.lang.NoSuchMethodError: 'void net.jpountz.lz4.LZ4BlockInputStream.<init>(java.io.InputStream, net.jpountz.lz4.LZ4FastDecompressor, java.util.zip.Checksum, boolean)'
at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:154)
at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:165)
at org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:126)
at org.apache.spark.shuffle.BlockStoreShuffleReader.$anonfun$read$1(BlockStoreShuffleReader.scala:74)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:630)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:70)
at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:155)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:116)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:362)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1371)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1298)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1362)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1186)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:360)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:311)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:830)
I'm literally following your documentation.
Thank you @lewismc,
I installed the latest version of Docker today; this is the only thing that has changed since yesterday.
So maybe this change at the environment level triggered something that allowed me to get to the next step.
...we will never know what that was. Sorry, I didn't note which Docker version I tested with yesterday; it might have been about 6 months old, but not more.
I'm watching your repository and I will definitely try your next release as soon as it's out.
Can you please confirm that you have tried this on your end? I experience the same with a fresh new installation.
Otherwise, if you can't reproduce it, I will keep investigating.
@lewismc, I see where the problem is:
Caused by: java.lang.NoSuchMethodError: 'void net.jpountz.lz4.LZ4BlockInputStream.<init>(java.io.InputStream, net.jpountz.lz4.LZ4FastDecompressor, java.util.zip.Checksum, boolean)'
That library is mentioned on the very first page of your GitHub project:
<exclusions>
  <exclusion>
    <groupId>net.jpountz.lz4</groupId>
    <artifactId>lz4</artifactId>
  </exclusion>
</exclusions>
The exclusion of that dependency was hardcoded.
I believe this issue is due to Spark and Kafka being incompatible on the lz4 dependency: https://stackoverflow.com/a/51052507/1506477
Excluding lz4 from Kafka is the right thing to do (hence the exclusion is good!).
However, on Docker Hub, https://hub.docker.com/repository/docker/uscdatascience/sparkler
I see the Docker image was last updated 6 months ago, but this exclusion commit is newer.
I think rebuilding the Docker image and releasing it should fix this:
https://github.com/USCDataScience/sparkler/wiki/Build-and-Deploy#docker-build
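One possible way to check the lz4 mismatch discussed above is to ask the JVM, by reflection, whether the constructor named in the NoSuchMethodError is present on the classpath in use. This is only a minimal sketch and not part of Sparkler: the class name `Lz4ConstructorCheck` and the methods `probe` and `probeLz4` are made up for illustration, and it assumes it is run with the same jars on the classpath as the failing job.

```java
import java.io.InputStream;
import java.util.zip.Checksum;

/** Hypothetical helper: reports whether a class and one of its public constructors can be found. */
final class Lz4ConstructorCheck {

    /**
     * Returns "ok" if the class has a public constructor with exactly these parameter types,
     * "missing-constructor" if the class loads but has no such constructor,
     * and "missing-class" if the class is not on the classpath at all.
     */
    static String probe(String className, Class<?>... parameterTypes) {
        try {
            Class<?> cls = Class.forName(className);
            cls.getConstructor(parameterTypes);
            return "ok";
        } catch (ClassNotFoundException e) {
            return "missing-class";
        } catch (NoSuchMethodException e) {
            return "missing-constructor";
        }
    }

    /** Probes the exact signature quoted in the stack trace. */
    static String probeLz4() {
        try {
            // The second parameter type is itself an lz4 class, so it is looked up by name too.
            Class<?> decompressor = Class.forName("net.jpountz.lz4.LZ4FastDecompressor");
            return probe("net.jpountz.lz4.LZ4BlockInputStream",
                    InputStream.class, decompressor, Checksum.class, boolean.class);
        } catch (ClassNotFoundException e) {
            return "missing-class";
        }
    }

    public static void main(String[] args) {
        System.out.println("LZ4BlockInputStream 4-argument constructor: " + probeLz4());
    }
}
```

A result of `missing-constructor` would be consistent with an lz4 jar on the classpath that lacks the constructor Spark calls; `ok` would suggest looking elsewhere.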
Yes @thammegowda, the link you provided has an update dating back to September that says:
Update: This appears to be an issue with Kafka 0.11.x.x and earlier version. As of 1.x.x Kafka seems to have moved away from using the problematic net.jpountz.lz4 library. Therefore, using latest Kafka (1.x) with latest Spark (2.3.x) should not have this issue.
Hence the latest Spark with the latest Kafka will probably give no problems.
I look forward to testing your new image.
@lewismc Docs are here https://github.com/USCDataScience/sparkler/blob/master/Release-Checklist.md
I believe @buggtb has been releasing docker images since I left IRDS/JPL.
@buggtb any chance of you performing a release of the new convenience binaries? Thanks
I am also facing this issue. Since the fix is already merged, I tried to build the Docker image from the code, and it fails here:
Step 8/13 : COPY ./sparkler-ui/sparkler-dashboard/sparkler-ui-*.war /data/solr/server/solr-webapp/sparkler
COPY failed: no source files were specified
Can you please share the steps to build sparkler-ui? I don't see sparkler-dashboard in sparkler-ui.
Hi @thammegowda & @lewismc , let me know when you have a stable Docker release and I will test it on my end.
Thank you
Hi all
I have my dissertation defense this month, so I am totally focused on that. I will have more availability for this project in April (after my dissertation).
@buggtb @karanjeets @chrismattmann any help or suggestions here, sir/bro?
Focus on your dissertation, I'm busy too.
Let's keep in touch.
Thank you
Hi @thammegowda, how is it going?
Have you had time to have a look at Sparkler?
I haven't experienced the same issue since, but I have new problems.
Hello @lewismc , the error seems to have changed since last year.
If I execute:
sudo docker run -v elastic:/elasticsearch-7.17.0/data ghcr.io/uscdatascience/sparkler/sparkler:main inject -id myid -su 'http://www.bbc.com/news'
the error now is:
15:52:08.623 [main] DEBUG org.pf4j.AbstractExtensionFinder - Finding extensions of extension point 'edu.usc.irds.sparkler.Config'
15:52:08.624 [main] DEBUG org.pf4j.AbstractExtensionFinder - Finding extensions of extension point 'edu.usc.irds.sparkler.Config' for plugin 'null'
15:52:08.625 [main] DEBUG org.pf4j.AbstractExtensionFinder - Finding extensions of extension point 'edu.usc.irds.sparkler.Config' for plugin 'fetcher-chrome'
15:52:08.626 [main] DEBUG org.pf4j.AbstractExtensionFinder - Finding extensions of extension point 'edu.usc.irds.sparkler.Config' for plugin 'urlfilter-regex'
15:52:08.627 [main] DEBUG org.pf4j.AbstractExtensionFinder - Loading class 'edu.usc.irds.sparkler.plugin.RegexURLFilter' using class loader 'org.pf4j.PluginClassLoader@158a8276'
15:52:08.637 [main] DEBUG org.pf4j.AbstractExtensionFinder - Checking extension type 'edu.usc.irds.sparkler.plugin.RegexURLFilter'
15:52:08.639 [main] DEBUG org.pf4j.AbstractExtensionFinder - No extensions found for extension point 'edu.usc.irds.sparkler.Config'
15:52:08.639 [main] DEBUG org.pf4j.AbstractExtensionFinder - Finding extensions of extension point 'edu.usc.irds.sparkler.Config' for plugin 'databricks-api'
15:52:08.640 [main] DEBUG org.pf4j.AbstractExtensionFinder - Finding extensions of extension point 'edu.usc.irds.sparkler.Config' for plugin 'fetcher-htmlunit'
15:52:08.641 [main] DEBUG org.pf4j.AbstractExtensionFinder - Finding extensions of extension point 'edu.usc.irds.sparkler.Config' for plugin 'url-injector'
15:52:08.642 [main] DEBUG org.pf4j.AbstractExtensionFinder - Finding extensions of extension point 'edu.usc.irds.sparkler.Config' for plugin 'urlfilter-samehost'
15:52:08.643 [main] DEBUG org.pf4j.AbstractExtensionFinder - Loading class 'edu.usc.irds.sparkler.plugin.UrlFilterSameHost' using class loader 'org.pf4j.PluginClassLoader@5fbe4146'
15:52:08.644 [main] DEBUG org.pf4j.AbstractExtensionFinder - Checking extension type 'edu.usc.irds.sparkler.plugin.UrlFilterSameHost'
15:52:08.645 [main] DEBUG org.pf4j.AbstractExtensionFinder - No extensions found for extension point 'edu.usc.irds.sparkler.Config'
15:52:08.646 [main] DEBUG org.pf4j.AbstractExtensionFinder - Finding extensions of extension point 'edu.usc.irds.sparkler.Config' for plugin 'scorer-dd-svn'
15:52:08.647 [main] DEBUG org.pf4j.AbstractExtensionFinder - No extensions found for extension point 'edu.usc.irds.sparkler.Config'
15:52:08.822 [main] INFO edu.usc.irds.sparkler.service.Injector$ - Injecting 1 seeds
15:52:12.990 [main] DEBUG org.apache.http.impl.nio.client.MainClientExec - [exchange: 1] start execution
15:52:13.007 [main] DEBUG org.apache.http.client.protocol.RequestAddCookies - CookieSpec selected: default
15:52:13.038 [main] DEBUG org.apache.http.client.protocol.RequestAuthCache - Re-using cached 'basic' auth scheme for http://localhost:9200
15:52:13.040 [main] DEBUG org.apache.http.client.protocol.RequestAuthCache - No credentials for preemptive authentication
15:52:13.041 [main] DEBUG org.apache.http.impl.nio.client.InternalHttpAsyncClient - [exchange: 1] Request connection for {}->http://localhost:9200
15:52:13.045 [main] DEBUG org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager - Connection request: [route: {}->http://localhost:9200][total kept alive: 0; route allocated: 0 of 10; total allocated: 0 of 30]
15:52:13.088 [pool-2-thread-1] DEBUG org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager - Connection request failed
java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvent(DefaultConnectingIOReactor.java:174)
at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:148)
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:351)
at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:221)
at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)
at java.base/java.lang.Thread.run(Thread.java:829)
15:52:13.089 [pool-2-thread-1] DEBUG org.apache.http.impl.nio.client.InternalHttpAsyncClient - [exchange: 1] connection request failed
15:52:13.092 [pool-2-thread-1] DEBUG org.elasticsearch.client.RestClient - request [GET http://localhost:9200/] failed
java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvent(DefaultConnectingIOReactor.java:174)
at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:148)
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:351)
at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:221)
at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)
at java.base/java.lang.Thread.run(Thread.java:829)
15:52:13.095 [pool-2-thread-1] DEBUG org.elasticsearch.client.RestClient - added [[host=http://localhost:9200]] to blacklist
Exception in thread "main" java.lang.reflect.InvocationTargetException
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at edu.usc.irds.sparkler.Main$.main(Main.scala:71)
at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: ElasticsearchException[java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused]; nested: ExecutionException[java.net.ConnectException: Connection refused]; nested: ConnectException[Connection refused];
at org.elasticsearch.client.RestHighLevelClient.performClientRequest(RestHighLevelClient.java:2695)
at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:2171)
at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:2137)
at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:2105)
at org.elasticsearch.client.RestHighLevelClient.index(RestHighLevelClient.java:1241)
at edu.usc.irds.sparkler.storage.elasticsearch.ElasticsearchProxy.$anonfun$commitCrawlDb$1(ElasticsearchProxy.scala:175)
at edu.usc.irds.sparkler.storage.elasticsearch.ElasticsearchProxy.$anonfun$commitCrawlDb$1$adapted(ElasticsearchProxy.scala:172)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at edu.usc.irds.sparkler.storage.elasticsearch.ElasticsearchProxy.commitCrawlDb(ElasticsearchProxy.scala:172)
at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:137)
at edu.usc.irds.sparkler.base.CliTool.run(CliTool.scala:34)
at edu.usc.irds.sparkler.base.CliTool.run$(CliTool.scala:32)
at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:39)
at edu.usc.irds.sparkler.service.Injector$.main(Injector.scala:188)
at edu.usc.irds.sparkler.service.Injector.main(Injector.scala)
... 6 more
Caused by: java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused
at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:257)
at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:244)
at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:75)
at org.elasticsearch.client.RestHighLevelClient.performClientRequest(RestHighLevelClient.java:2692)
... 22 more
Caused by: java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvent(DefaultConnectingIOReactor.java:174)
at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:148)
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:351)
at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:221)
at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)
at java.base/java.lang.Thread.run(Thread.java:829)
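The trace above ends in a refused connection to http://localhost:9200. As a rough diagnostic, a plain TCP probe run from the same place as the inject command can show whether anything is listening on that port at all. This is a sketch under assumptions: `PortProbe` and `isReachable` are hypothetical names, and the default host and port are simply taken from the log. Note that with Docker's default networking, `localhost` inside a container refers to the container itself, not to the host machine.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper: checks whether a TCP connection to host:port can be opened. */
final class PortProbe {

    /** Returns true if a TCP connection succeeds within the timeout, false on refusal or timeout. */
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers "Connection refused", timeouts, and unresolvable hosts.
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults mirror the address in the log; pass a host and port to override them.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9200;
        boolean up = isReachable(host, port, 2000);
        System.out.println(host + ":" + port + (up ? " is accepting connections" : " is not reachable"));
    }
}
```

If the probe fails from inside the container but succeeds from the host, the storage backend is probably running somewhere the container's `localhost` does not reach.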
I also have a question: on Docker I find 2 different repositories:
docker pull uscdatascience/sparkler:latest
docker pull ghcr.io/uscdatascience/sparkler/sparkler:main
(I'm using this one currently)
Which is which?
It's fixed now