pinterest/DoctorK

Timeout expired while fetching topic metadata

BrianGallew opened this issue · 10 comments

I'm clearly misunderstanding something, but I'm unable to get kafkastats to actually collect and report any stats.

My cluster is running kafka_2.12-0.11.0.2, with Java 1.8.0_131-b11.
I've built kafkastats at both 0.2.1 and 0.2.2 with identical results.

I launch kafkastats with this:

java -server -Xmx800M -Xms800M -verbosegc -Xloggc:./gc.log \
           -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=20 -XX:GCLogFileSize=20M \
    -XX:+UseG1GC -XX:MaxGCPauseMillis=250 -XX:G1ReservePercent=10 -XX:ConcGCThreads=4 \
    -XX:ParallelGCThreads=4 -XX:G1HeapRegionSize=8m -XX:InitiatingHeapOccupancyPercent=70 \
    -XX:ErrorFile=./jvm_error.log \
       -Dlog4j.configurationFile=file:./log4j2.xml  \
       -cp /mnt/kafkastats/kafkastats-0.2.2-jar-with-dependencies.jar \
       -Dbootstrap.servers=10.10.1.139:9092,10.10.1.195:9092,10.10.2.233:9092,10.10.2.148:9092 \
       com.pinterest.doctorkafka.stats.KafkaStatsMain \
         -broker 10.10.1.139 \
         -jmxport 9090  \
         -kafka_config /mnt/kafkastats/producer.properties \
         -ostrichport 2051 \
         -pollingintervalinseconds 5 \
         -topic doctorkafka.brokerstats \
         -tsdhostport localhost:18126 \
         -uptimeinseconds 99000 \
         -zookeeper 10.10.2.152:2181,10.10.2.237:2181,10.10.1.49:2181,10.10.2.104:2181,10.10.1.83:2181

My producer.properties file looks like this:

bootstrap.servers=10.10.1.139:9092,10.10.1.195:9092,10.10.2.233:9092,10.10.2.148:9092
client.id=kafkastats
compression.type=snappy
group.id=kafkastats-kafka-dev
linger.ms=5
log.dirs=/mnt/kafka/data

I've verified that all 4 Kafka brokers and all 5 Zookeeper IP addresses are correct.

I'll be attaching the log to this ticket in just a minute.

The argument for -kafka_config needs to be the Kafka server.properties file, not a producer properties file.

kafkastats needs to access the ZooKeeper of the managed cluster to see whether there is any ongoing partition reassignment (kafkastats/src/main/java/com/pinterest/doctorkafka/stats/BrokerStatsRetriever.java, lines 281-283). Because of this, it needs to know the ZooKeeper connection string of the managed cluster.
Based on kafkastats.log, kafkastats failed to get the zk connection string and threw a NullPointerException:

java.lang.NullPointerException: null
        at org.apache.zookeeper.client.ConnectStringParser.<init>(ConnectStringParser.java:50) ~[kafkastats-0.2.2-jar-with-dependencies.jar:?]
        at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:443) ~[kafkastats-0.2.2-jar-with-dependencies.jar:?]
        at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380) ~[kafkastats-0.2.2-jar-with-dependencies.jar:?]
        at org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:70) ~[kafkastats-0.2.2-jar-with-dependencies.jar:?]
        at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:1228) ~[kafkastats-0.2.2-jar-with-dependencies.jar:?]
        at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:157) ~[kafkastats-0.2.2-jar-with-dependencies.jar:?]
        at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:131) ~[kafkastats-0.2.2-jar-with-dependencies.jar:?]
        at kafka.utils.ZkUtils$.createZkClientAndConnection(ZkUtils.scala:115) ~[kafkastats-0.2.2-jar-with-dependencies.jar:?]
        at kafka.utils.ZkUtils.createZkClientAndConnection(ZkUtils.scala) ~[kafkastats-0.2.2-jar-with-dependencies.jar:?]
        at com.pinterest.doctorkafka.util.OperatorUtil.getZkUtils(OperatorUtil.java:204) ~[kafkastats-0.2.2-jar-with-dependencies.jar:?]
        at com.pinterest.doctorkafka.stats.BrokerStatsRetriever.retrieveStatsThroughKafkaApi(BrokerStatsRetriever.java:281) ~[kafkastats-0.2.2-jar-with-dependencies.jar:?]
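For illustration, here is a minimal sketch of the failure mode. It assumes (based on the stack trace, not on the actual kafkastats code) that the tool looks up `zookeeper.connect` in the file passed via `-kafka_config`; a producer.properties file has no such key, so the lookup returns null, and that null eventually reaches ConnectStringParser:

```java
import java.io.StringReader;
import java.util.Properties;

public class ZkConfigSketch {
    public static void main(String[] args) throws Exception {
        // A producer.properties-style file: no zookeeper.connect entry.
        Properties producerProps = new Properties();
        producerProps.load(new StringReader(
            "bootstrap.servers=10.10.1.139:9092\nclient.id=kafkastats\n"));

        // A server.properties-style file: zookeeper.connect is present.
        Properties serverProps = new Properties();
        serverProps.load(new StringReader(
            "zookeeper.connect=10.10.2.152:2181\n"));

        // With producer.properties this lookup returns null; a null
        // connection string is what triggers the NullPointerException
        // in ConnectStringParser.
        System.out.println(producerProps.getProperty("zookeeper.connect"));
        System.out.println(serverProps.getProperty("zookeeper.connect"));
    }
}
```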

I had tried that before, with similar problems. I've now reverted to using the Kafka server.properties file and see the same issue:

broker.id=10101139
reserved.broker.max.id=1099255255
host.name=10.10.1.139
auto.create.topics.enable=false
inter.broker.protocol.version=0.10.1.0
port=9092
num.io.threads=5
num.network.threads=16
num.replica.fetchers=64
num.recovery.threads.per.data.dir=4
max.socket.request.bytes=104857600
fetch.purgatory.purge.interval.requests=100
producer.purgatory.purge.interval.requests=100
controlled.shutdown.enable=true
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
zookeeper.session.timeout.ms=20000
replica.lag.time.max.ms=1000
log.dirs=/mnt/kafka/data
num.partitions=2
default.replication.factor=2
log.message.format.version=0.8.1.1
log.flush.interval=10000
log.default.flush.interval.ms=1000
log.default.flush.scheduler.interval.ms=1000
log.retention.hours=24
log.file.size=536870912
log.cleanup.interval.mins=1
enable.zookeeper=true
zookeeper.connect=10.10.2.152:2181,10.10.2.237:2181,10.10.1.49:2181,10.10.2.104:2181,10.10.1.83:2181
zookeeper.connection.timeout.ms=10000
delete.topic.enable=false
broker.rack=us-east-1a
compression.type=snappy
log.message.timestamp.type=LogAppendTime

collector.log

The exception at the beginning of the collector log is due to a log4j configuration issue. It can be ignored.

Based on that, it seems the kafkastats process got up and running, but it failed to get topic metadata from the local Kafka process. The TimeoutException is thrown at https://github.com/pinterest/doctorkafka/blob/master/kafkastats/src/main/java/com/pinterest/doctorkafka/stats/BrokerStatsRetriever.java#L275

org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata
16:10:32.288 [StatsReporter] INFO com.pinterest.doctorkafka.stats.BrokerStatsReporter - publised to kafka : {"timestamp": 1524758726336, "id": 10101139, "name": null, "zkUrl": "10.10.2.152:2181,10.10.2.237:2181,10.10.1.49:2181,10.10.2.104:2181,10.10.1.83:2181", "kafkaVersion": "0.11.0.2", "statsVersion": "0.1.15", "hasFailure": false, "failureReason": null, "availabilityZone": "us-east-1a", "instanceType": "r4.xlarge", "amiId": "ami-938a4de9", "rackId": null, "logFilesPath": "/mnt/kafka/data", "cpuUsage": 4.2, "freeDiskSpaceInBytes": 2145109954560, "totalDiskSpaceInBytes": 2145898463232, "leadersBytesIn1MinRate": 0, "leadersBytesOut1MinRate": 0, "leadersBytesIn5MinRate": 0, "leadersBytesOut5MinRate": 0, "leadersBytesIn15MinRate": 0, "leadersBytesOut15MinRate": 0, "sysBytesIn1MinRate": 0, "sysBytesOut1MinRate": 0, "numReplicas": 0, "numLeaders": 0, "topicsBytesIn1MinRate": {}, "topicsBytesOut1MinRate": {}, "topicsBytesIn5MinRate": {}, "topicsBytesOut5MinRate": {}, "topicsBytesIn15MinRate": {}, "topicsBytesOut15MinRate": {}, "leaderReplicaStats": [], "leaderReplicas": null, "followerReplicas": null, "inReassignmentReplicas": []}

Right, which is what's interesting. One of the things that confuses me is the '"name": null' in that response, because I would be shocked to discover a topic with no name. Meanwhile, this Kafka cluster is in daily use, is monitored by Burrow, and otherwise shows no issues. So I'm confused as to why I'm seeing the timeout (which brings us back to the topic of this issue).
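For what it's worth, a basic TCP-level sanity check of the bootstrap servers can help separate network problems from Kafka-level ones. This is a hypothetical pure-JDK sketch, not part of kafkastats, and a successful connect does not rule out a metadata-fetch timeout; it only confirms the port is reachable:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class BrokerReachability {
    // Plain TCP connect with a timeout; no Kafka protocol handshake.
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Broker list from this report; adjust for your environment.
        String[] brokers = {"10.10.1.139", "10.10.1.195",
                            "10.10.2.233", "10.10.2.148"};
        for (String b : brokers) {
            System.out.println(b + ":9092 reachable="
                    + canConnect(b, 9092, 1500));
        }
    }
}
```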

Can you share the command line used to run your Kafka service?

Sorry for the slow response; new work priorities.

It's pretty much the same as above.

java -server -Xmx800M -Xms800M -verbosegc -Xloggc:./gc.log \
           -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=20 -XX:GCLogFileSize=20M \
    -XX:+UseG1GC -XX:MaxGCPauseMillis=250 -XX:G1ReservePercent=10 -XX:ConcGCThreads=4 \
    -XX:ParallelGCThreads=4 -XX:G1HeapRegionSize=8m -XX:InitiatingHeapOccupancyPercent=70 \
    -XX:ErrorFile=./jvm_error.log \
       -Dlog4j.configurationFile=file:./log4j2.xml  \
       -cp /mnt/kafkastats/kafkastats-0.2.2-jar-with-dependencies.jar \
       -Dbootstrap.servers=10.10.1.139:9092,10.10.1.195:9092,10.10.2.233:9092,10.10.2.148:9092 \
       com.pinterest.doctorkafka.stats.KafkaStatsMain \
         -broker 10.10.1.139 \
         -jmxport 9090  \
         -kafka_config /mnt/kafka/config/server.properties \
         -ostrichport 2051 \
         -pollingintervalinseconds 5 \
         -topic doctorkafka.brokerstats \
         -tsdhostport localhost:18126 \
         -uptimeinseconds 99000 \
         -zookeeper 10.10.2.152:2181,10.10.2.237:2181,10.10.1.49:2181,10.10.2.104:2181,10.10.1.83:2181

Oh, duh, that was for kafkastats. This is the Kafka server command line:

/usr/lib/jvm/java-8-oracle/bin/java -Xmx8G -Xms8G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+DisableExplicitGC -Djava.awt.headless=true -verbose:gc -Xloggc:/mnt/kafka/data/../logs/kafka-broker-gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Djava.rmi.server.hostname=10.10.1.139 -Djava.net.preferIPv4Stack=true -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=9090 -Dkafka.logs.dir=/mnt/kafka/bin/../logs -Dlog4j.configuration=file:/mnt/kafka/bin/../config/log4j.properties -cp :/mnt/kafka/bin/../libs/aopalliance-repackaged-2.5.0-b05.jar:/mnt/kafka/bin/../libs/argparse4j-0.7.0.jar:/mnt/kafka/bin/../libs/commons-lang3-3.5.jar:/mnt/kafka/bin/../libs/connect-api-0.11.0.2.jar:/mnt/kafka/bin/../libs/connect-file-0.11.0.2.jar:/mnt/kafka/bin/../libs/connect-json-0.11.0.2.jar:/mnt/kafka/bin/../libs/connect-runtime-0.11.0.2.jar:/mnt/kafka/bin/../libs/connect-transforms-0.11.0.2.jar:/mnt/kafka/bin/../libs/guava-20.0.jar:/mnt/kafka/bin/../libs/hk2-api-2.5.0-b05.jar:/mnt/kafka/bin/../libs/hk2-locator-2.5.0-b05.jar:/mnt/kafka/bin/../libs/hk2-utils-2.5.0-b05.jar:/mnt/kafka/bin/../libs/jackson-annotations-2.8.5.jar:/mnt/kafka/bin/../libs/jackson-core-2.8.5.jar:/mnt/kafka/bin/../libs/jackson-databind-2.8.5.jar:/mnt/kafka/bin/../libs/jackson-jaxrs-base-2.8.5.jar:/mnt/kafka/bin/../libs/jackson-jaxrs-json-provider-2.8.5.jar:/mnt/kafka/bin/../libs/jackson-module-jaxb-annotations-2.8.5.jar:/mnt/kafka/bin/../libs/javassist-3.21.0-GA.jar:/mnt/kafka/bin/../libs/javax.annotation-api-1.2.jar:/mnt/kafka/bin/../libs/javax.inject-1.jar:/mnt/kafka/bin/../libs/javax.inject-2.5.0-b05.jar:/mnt/kafka/bin/../libs/javax.servlet-api-3.1.0.jar:/mnt/kafka/bin/../libs/javax.ws.rs-api-2.0.1.jar:/mnt/kafka/bin/../libs/jersey-client-2.24.jar:/mnt/kafka/bin/../libs/jersey-common-2.24.jar:/mnt/kafka/bin/../libs/jersey-container-servlet-2.24.jar:/mnt/kafka/bin/../libs/jersey-container-servlet-core-2.24.jar:/mnt/kafka/bin/../libs/jersey-guava-2.24.jar:/mnt/kafka/bin/../libs/jersey-media-jaxb-2.24.jar:/mnt/kafka/bin/../libs/jersey-server-2.24.jar:/mnt/kafka/bin/../libs/jetty-continuation-9.2.22.v20170606.jar:/mnt/kafka/bin/../libs/jetty-http-9.2.22.v20170606.jar:/mnt/kafka/bin/../libs/jetty-io-9.2.22.v20170606.jar:/mnt/kafka/bin/../libs/jetty-security-9.2.22.v20170606.jar:/mnt/kafka/bin/../libs/jetty-server-9.2.22.v20170606.jar:/mnt/kafka/bin/../libs/jetty-servlet-9.2.22.v20170606.jar:/mnt/kafka/bin/../libs/jetty-servlets-9.2.22.v20170606.jar:/mnt/kafka/bin/../libs/jetty-util-9.2.22.v20170606.jar:/mnt/kafka/bin/../libs/jopt-simple-5.0.3.jar:/mnt/kafka/bin/../libs/kafka_2.12-0.11.0.2.jar:/mnt/kafka/bin/../libs/kafka_2.12-0.11.0.2-sources.jar:/mnt/kafka/bin/../libs/kafka_2.12-0.11.0.2-test-sources.jar:/mnt/kafka/bin/../libs/kafka-clients-0.11.0.2.jar:/mnt/kafka/bin/../libs/kafka-log4j-appender-0.11.0.2.jar:/mnt/kafka/bin/../libs/kafka-streams-0.11.0.2.jar:/mnt/kafka/bin/../libs/kafka-streams-examples-0.11.0.2.jar:/mnt/kafka/bin/../libs/kafka-tools-0.11.0.2.jar:/mnt/kafka/bin/../libs/log4j-1.2.17.jar:/mnt/kafka/bin/../libs/lz4-1.3.0.jar:/mnt/kafka/bin/../libs/maven-artifact-3.5.0.jar:/mnt/kafka/bin/../libs/metrics-core-2.2.0.jar:/mnt/kafka/bin/../libs/osgi-resource-locator-1.0.1.jar:/mnt/kafka/bin/../libs/plexus-utils-3.0.24.jar:/mnt/kafka/bin/../libs/reflections-0.9.11.jar:/mnt/kafka/bin/../libs/rocksdbjni-5.0.1.jar:/mnt/kafka/bin/../libs/scala-library-2.12.2.jar:/mnt/kafka/bin/../libs/scala-parser-combinators_2.12-1.0.4.jar:/mnt/kafka/bin/../libs/slf4j-api-1.7.25.jar:/mnt/kafka/bin/../libs/slf4j-log4j12-1.7.25.jar:/mnt/kafka/bin/../libs/snappy-java-1.1.2.6.jar:/mnt/kafka/bin/../libs/validation-api-1.1.0.Final.jar:/mnt/kafka/bin/../libs/zkclient-0.10.jar:/mnt/kafka/bin/../libs/zookeeper-3.4.10.jar kafka.Kafka /mnt/kafka/config/server.properties

The merged PR fixes this issue.