USCDataScience/sparkler

Failed to construct kafka producer

misterpilou opened this issue · 5 comments

The Kafka feature with all default args doesn't work. I have Kafka 1.1.0 running on localhost:9092.
Here is the error trace:

2018-04-27 01:35:40 ERROR Executor:95 [Executor task launch worker-0] - Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.kafka.common.KafkaException: Failed to construct kafka producer
at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:335)
at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:188)
at edu.usc.irds.sparkler.pipeline.SparklerProducer.<init>(SparklerProducer.scala:45)
at edu.usc.irds.sparkler.pipeline.Crawler$$anonfun$storeContentKafka$1.apply(Crawler.scala:235)
at edu.usc.irds.sparkler.pipeline.Crawler$$anonfun$storeContentKafka$1.apply(Crawler.scala:234)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.common.config.ConfigException: Invalid url in bootstrap.servers: -ktp
at org.apache.kafka.clients.ClientUtils.parseAndValidateAddresses(ClientUtils.java:45)
at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:275)
... 14 more

Caused by: org.apache.kafka.common.config.ConfigException: Invalid url in bootstrap.servers: -ktp

This indicates an error in the CLI args. Please paste the full command line that you used to invoke Sparkler.
Also, if you edited the Sparkler config (the .yaml file), please let me know.

build/bin/sparkler.sh crawl -id 26-04-2018-00-16 -ke -kls -ktp. After a little experimenting, it happens when -ktp comes right after -kls.

I did edit sparkler-default.yaml, but I went back to master after stashing the changes.

Hmm,

here is how things work:

  • -ke is a boolean flag. When it is missing, Kafka is disabled; when it is added to the CLI, it enables Kafka. You need to include -ke, as you have done, to enable it.
  • -kls is short for Kafka listeners. It accepts host:port. The default is localhost:9092, so if you want to keep that, you don't need to do anything. If you want to change it to, say, host1:6060, then -kls host1:6060 should be added to the command.
  • -ktp is short for Kafka topic. It is the topic name. The default is sparkler_<jobid>; since 26-04-2018-00-16 is your job id, your default topic will be sparkler_26-04-2018-00-16. If you want to change it, add -ktp topicname. If you want your job id included in a custom name, add %s to the string and it will be replaced (see the example right after this list).
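
For instance, a fully spelled-out invocation would look like this (host1:6060 and mycrawl_%s are placeholder values for illustration, not defaults):

  build/bin/sparkler.sh crawl -id 26-04-2018-00-16 -ke -kls host1:6060 -ktp mycrawl_%s

With %s in the topic name, this run would publish to mycrawl_26-04-2018-00-16.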

Here is what went wrong in your command line args:
since you tried to customize the Kafka address with -kls -ktp, -ktp was read as the value of -kls, and it does not match the host:port pattern (a correct value is, for example, localhost:9092); the two parses are contrasted below.
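
To make it concrete, here is how the parser reads your command, versus a corrected one (the topic name here is just an example):

  build/bin/sparkler.sh crawl -id 26-04-2018-00-16 -ke -kls -ktp
  # -ktp is consumed as the value of -kls, hence "Invalid url in bootstrap.servers: -ktp"

  build/bin/sparkler.sh crawl -id 26-04-2018-00-16 -ke -kls localhost:9092 -ktp sparkler_%s
  # every value-taking option gets an explicit value, so nothing is misparsed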

So,
Simple things first.

  1. Just add -ke and leave the rest to the defaults, which are defined in the config file.
  2. Listen to sparkler_26-04-2018-00-16 to receive the data (a sample consumer command follows).
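
For example, with the console consumer that ships with Kafka (assuming its bin directory is on your PATH and the broker is at localhost:9092, as in your setup):

  kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic sparkler_26-04-2018-00-16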

Ah okay, thanks for the information. So, not an issue at all.