killme2008/Metamorphosis

The Meta client enters an infinite loop when closing a connection.

Closed this issue · 6 comments

"SinkRunner-PollingRunner-DefaultSinkProcessor" prio=10 tid=0x00002aaab0653000 nid=0x4141 runnable [0x0000000044169000]
java.lang.Thread.State: RUNNABLE
at com.taobao.gecko.core.util.LinkedTransferQueue$Itr.advance(LinkedTransferQueue.java:714)
at com.taobao.gecko.core.util.LinkedTransferQueue$Itr.<init>(LinkedTransferQueue.java:691)
at com.taobao.gecko.core.util.LinkedTransferQueue.iterator(LinkedTransferQueue.java:673)
at com.taobao.gecko.core.core.impl.AbstractSession.close0(AbstractSession.java:350)
at com.taobao.gecko.core.nio.impl.AbstractNioSession.close0(AbstractNioSession.java:92)
at com.taobao.gecko.core.core.impl.AbstractController.stop(AbstractController.java:506)
at com.taobao.gecko.service.impl.BaseRemotingController.stop(BaseRemotingController.java:196)
- locked <0x00000007b153ac18> (a com.taobao.gecko.service.impl.DefaultRemotingClient)
at com.taobao.metamorphosis.client.RemotingClientWrapper.stop(RemotingClientWrapper.java:339)
at com.taobao.metamorphosis.client.MetaMessageSessionFactory.shutdown(MetaMessageSessionFactory.java:304)
at org.apache.flume.sink.MetaSink.destroyConnection(MetaSink.java:295)
at org.apache.flume.sink.MetaSink.process(MetaSink.java:496)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
at java.lang.Thread.run(Thread.java:662)

Locked ownable synchronizers:
- None
This function has been running for several weeks.
class Itr implements Iterator<E> {
    QNode nextNode;    // Next node to return next
    QNode currentNode; // last returned node, for remove()
    QNode prevNode;    // predecessor of last returned node
    E nextItem;        // Cache of next item, once committed to in next

    Itr() {
        this.nextNode = traversalHead();
        advance();
    }

    E advance() {
        this.prevNode = this.currentNode;
        this.currentNode = this.nextNode;
        E x = this.nextItem;

        QNode p = this.nextNode.next;
        for (;;) {
            if (p == null || !p.isData) {
                this.nextNode = null;
                this.nextItem = null;
                return x;
            }
            Object item = p.get();
            if (item != p && item != null) {
                this.nextNode = p;
                this.nextItem = cast(item);
                return x;
            }
            this.prevNode = p;
            p = p.next;
        }
    }

    public boolean hasNext() {
        return this.nextNode != null;
    }

    public E next() {
        if (this.nextNode == null) {
            throw new NoSuchElementException();
        }
        return advance();
    }

    public void remove() {
        QNode p = this.currentNode;
        QNode prev = this.prevNode;
        if (prev == null || p == null) {
            throw new IllegalStateException();
        }
        Object x = p.get();
        if (x != null && x != p && p.compareAndSet(x, p)) {
            clean(prev, p);
        }
    }
}

This for loop appears to be spinning forever.
Could someone help look into it?

public class LinkedTransferQueue<E> extends AbstractQueue<E> implements BlockingQueue<E> {
package com.taobao.gecko.core.util;
The LinkedTransferQueue class is in the com.taobao.gecko.core.util package.

不容易也容易(183167601) 14:57:17
Is line 714 where the loop is?
小规模(245885697) 14:58:15
It corresponds to the closing } of the for statement.
I think it keeps looping at that spot.

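The program is no longer processing messages, yet CPU usage sits at exactly 100%. Combined with the stack trace and code above, this points to an infinite loop.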

不容易也容易(183167601) 15:01:14
When item == null it might loop forever.
Can it actually be null?
小规模(245885697) 15:02:49
That queue is fairly abstract; I don't know which business messages it holds.
So it is hard to figure out.
不容易也容易(183167601) 15:03:31
Then we can only ask the author. This class has the same name as the one in the JDK.
小规模(245885697) 15:03:36
If it were ordinary logic, it would be easy to read.
小规模(245885697) 15:08:04
Let's file an issue with the community to track it.
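
Reading the advance() loop as quoted, a null item on its own only makes the loop skip ahead to p.next, so it still ends once p becomes null. One way such a loop could fail to terminate is a cycle in the next chain, for example a node whose next points back to itself. That is a hypothesis, not something this thread confirms. The sketch below reproduces the shape of the loop on a simplified, hypothetical Node class (not the gecko QNode) and adds a step bound, which the original does not have, purely so the demo can stop.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified stand-in for the queue's node type; NOT the gecko class.
final class CycleSketch {
    static final class Node extends AtomicReference<Object> {
        final boolean isData;
        volatile Node next;

        Node(Object item, boolean isData) {
            super(item);
            this.isData = isData;
        }
    }

    /**
     * Walks the chain the way the quoted advance() loop does: skip data nodes
     * whose item is null or self-referential, stop at the first live data node
     * or at the end of the chain. The maxSteps bound is an addition for this
     * demo only. Returns the number of skipped nodes, or -1 if the bound ran out.
     */
    static int stepsToLiveNode(Node start, int maxSteps) {
        Node p = start.next;
        for (int steps = 0; steps < maxSteps; steps++) {
            if (p == null || !p.isData) {
                return steps;
            }
            Object item = p.get();
            if (item != p && item != null) {
                return steps;
            }
            p = p.next;
        }
        return -1;
    }

    public static void main(String[] args) {
        Node head = new Node(null, false);
        Node consumed = new Node(null, true); // data node whose item is already gone
        Node live = new Node("msg", true);
        head.next = consumed;
        consumed.next = live;
        System.out.println("well-formed chain: " + stepsToLiveNode(head, 1000));

        consumed.next = consumed; // assumed corruption: the node links to itself
        System.out.println("self-linked chain: " + stepsToLiveNode(head, 1000));
    }
}
```

With a well-formed chain the walk skips the consumed node and stops at the live one. With the self-link the walk never leaves the consumed node, which is the behaviour an unbounded for (;;) would turn into a permanently RUNNABLE thread.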

Note the stack trace: the thread is not stuck in the queue's next method but in the constructor of the queue's iterator.
at com.taobao.gecko.core.util.LinkedTransferQueue$Itr.advance(LinkedTransferQueue.java:714)
at com.taobao.gecko.core.util.LinkedTransferQueue$Itr.<init>(LinkedTransferQueue.java:691)
at com.taobao.gecko.core.util.LinkedTransferQueue.iterator(LinkedTransferQueue.java:673)
at com.taobao.gecko.core.core.impl.AbstractSession.close0(AbstractSession.java:350)
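
A single thread dump only shows where a thread was at one instant. To back up the 100% CPU observation from earlier in the thread, one option is to measure how much CPU time a specific thread burns over a short window and then look at its top frames. The sketch below does this with the standard ThreadMXBean API. The class name SpinProbe, the helper names, and the demo spinner thread are illustrative assumptions and have nothing to do with Metamorphosis or gecko.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

final class SpinProbe {
    /**
     * Returns the CPU nanoseconds the given thread consumed during the window,
     * or -1 if the JVM cannot measure it, the thread is not alive, or the
     * calling thread is interrupted while waiting.
     */
    static long cpuNanosDuring(Thread t, long windowMillis) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported()) {
            return -1;
        }
        if (!mx.isThreadCpuTimeEnabled()) {
            mx.setThreadCpuTimeEnabled(true);
        }
        long before = mx.getThreadCpuTime(t.getId());
        try {
            Thread.sleep(windowMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
        long after = mx.getThreadCpuTime(t.getId());
        return (before < 0 || after < 0) ? -1 : after - before;
    }

    /** Starts a daemon thread that busy-loops until interrupted (demo only). */
    static Thread startDemoSpinner() {
        Thread spinner = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                // intentionally empty: simulates a loop that never exits
            }
        }, "demo-spinner");
        spinner.setDaemon(true);
        spinner.start();
        return spinner;
    }

    public static void main(String[] args) {
        Thread spinner = startDemoSpinner();
        long used = cpuNanosDuring(spinner, 200);
        System.out.println("CPU ns used in 200 ms window: " + used);
        for (StackTraceElement frame : spinner.getStackTrace()) {
            System.out.println("  at " + frame);
        }
        spinner.interrupt();
    }
}
```

A thread that uses close to the whole window as CPU time and shows the same top frame on repeated samples is very likely spinning rather than blocked.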

Can you check the process's GC activity and memory usage while it is in this state? I suspect it has run out of memory (OOM).

RES:1.6G
-Xmx=2048m

jmap output:

num #instances #bytes class name

1: 2952836 415622496 [Ljava.util.concurrent.ConcurrentHashMap$HashEntry;
2: 239339 175449112 [B
3: 2952820 118112800 java.util.concurrent.ConcurrentHashMap$Segment
4: 3127193 100070176 java.util.concurrent.locks.ReentrantLock$NonfairSync
5: 38515 77359288 [I
6: 1174647 46985880 java.util.HashMap$KeyIterator
7: 283385 38623656 [Ljava.lang.Object;
8: 207652 16604960 [Ljava.util.HashMap$Entry;
9: 184553 14764128 [Ljava.util.concurrent.ConcurrentHashMap$Segment;
10: 142390 11316576 [S
11: 172107 11014848 com.taobao.gecko.service.callback.SingleRequestCallBack
12: 335853 10747296 java.util.concurrent.ConcurrentHashMap$HashEntry
13: 81049 10012480 [C
14: 207465 9958320 java.util.HashMap
15: 172315 9649640 com.taobao.gecko.core.nio.impl.TimerRef
16: 184553 8858544 java.util.concurrent.ConcurrentHashMap
17: 179494 7179760 com.taobao.metamorphosis.network.PutCommand
18: 215756 6904192 java.util.HashMap$Entry
19: 170667 6826680 com.taobao.metamorphosis.Message
20: 180951 5790432 com.taobao.gecko.core.core.impl.FutureImpl
21: 172107 5507424 java.util.concurrent.CountDownLatch$Sync
22: 227950 5470800 java.util.Collections$UnmodifiableCollection$1
23: 170667 5461344 com.taobao.metamorphosis.client.producer.SimpleMessageProducer$1
24: 201764 4842336 org.apache.flume.event.SimpleEvent
25: 200738 4817712 org.apache.flume.sink.MetaSink$EventStat
26: 137186 4389952 com.taobao.gecko.service.exception.NotifyRemotingException
27: 172107 4130568 com.taobao.gecko.service.impl.DefaultConnection$SingleRequestCallBackRunner
28: 170667 4096008 org.apache.flume.sink.MetaSink$MyCallback
29: 24747 3374008
30: 24747 3365048
31: 198551 3176816 java.lang.Integer
32: 192142 3074272 java.lang.Object
33: 174221 2787536 java.util.concurrent.locks.ReentrantLock
34: 172107 2753712 java.util.concurrent.CountDownLatch
35: 170667 2730672 com.taobao.metamorphosis.network.PutCommand$1
36: 2173 2292160
37: 36402 1971032
38: 58488 1871616 java.lang.String
39: 2173 1621144
40: 1858 1385168
41: 39113 1251616 java.util.concurrent.locks.AbstractQueuedSynchronizer$Node
42: 15496 743808 java.nio.HeapByteBuffer
43: 30166 723984 java.util.concurrent.LinkedBlockingQueue$Node
44: 18870 603840 java.lang.StackTraceElement
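
To make the byte counts in a histogram like this easier to compare, the rows can be parsed and converted to MiB. The sketch below assumes the row shape shown above (rank with a trailing colon, instance count, byte count, optional class name). HistoRow and its methods are hypothetical names, and the sample rows are copied from the listing, including one whose class name is missing in the dump.

```java
final class HistoRow {
    final int rank;
    final long instances;
    final long bytes;
    final String className; // empty when the dump shows no class name

    HistoRow(int rank, long instances, long bytes, String className) {
        this.rank = rank;
        this.instances = instances;
        this.bytes = bytes;
        this.className = className;
    }

    /** Parses one row of the form "rank: instances bytes [className]". */
    static HistoRow parse(String line) {
        String[] parts = line.trim().split("\\s+", 4);
        if (parts.length < 3 || !parts[0].endsWith(":")) {
            throw new IllegalArgumentException("not a histogram row: " + line);
        }
        int rank = Integer.parseInt(parts[0].substring(0, parts[0].length() - 1));
        long instances = Long.parseLong(parts[1]);
        long bytes = Long.parseLong(parts[2]);
        String name = parts.length == 4 ? parts[3] : "";
        return new HistoRow(rank, instances, bytes, name);
    }

    double mebibytes() {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        String[] rows = {
            "1: 2952836 415622496 [Ljava.util.concurrent.ConcurrentHashMap$HashEntry;",
            "3: 2952820 118112800 java.util.concurrent.ConcurrentHashMap$Segment",
            "11: 172107 11014848 com.taobao.gecko.service.callback.SingleRequestCallBack",
            "29: 24747 3374008"
        };
        for (String row : rows) {
            HistoRow r = parse(row);
            System.out.printf("%3d  %10d instances  %8.1f MiB  %s%n",
                    r.rank, r.instances, r.mebibytes(), r.className);
        }
    }
}
```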

Take a look with jstat -gcutil. This is clearly an OOM, and the process is no longer working.
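
When attaching jstat is inconvenient, similar information can be read from inside the JVM through the standard java.lang.management beans. The following is a minimal sketch, not part of Metamorphosis; HeapProbe and its method names are assumptions made for illustration. A heap that stays near its maximum while collection counts and times keep climbing is consistent with the OOM diagnosis above.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

final class HeapProbe {
    /** Fraction of the maximum heap currently in use, or -1 if the maximum is undefined. */
    static double heapUsedFraction() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax(); // -1 when the JVM reports no maximum
        return max <= 0 ? -1.0 : (double) heap.getUsed() / max;
    }

    /** One line per collector: cumulative collection count and time in milliseconds. */
    static String gcSummary() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(gc.getName())
              .append(": count=").append(gc.getCollectionCount())
              .append(", timeMs=").append(gc.getCollectionTime())
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.printf("heap used: %.1f%%%n", heapUsedFraction() * 100);
        System.out.print(gcSummary());
    }
}
```

Calling these two methods periodically (for example from a logging timer) gives a rough trend without an external tool, though jstat -gcutil remains the more detailed view per heap generation.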