riemann/riemann

Meet java.util.concurrent.RejectedExecutionException after running some days

Closed this issue · 11 comments

When I run the riemann after some days, it will start to throw the error: java.util.concurrent.RejectedExecutionException , I see another resolved issue #655 change the queue-size to 10000, I have to change the queue size to 10000 and max-pool-size to 1024 but after some days the alert will show again and the alert will run out of our disk space.

riemann version: 0.2.14

riemann.conf

; -*- mode: clojure; -*-
; vim: filetype=clojure


; non-streams related configuration is located here
(include "/etc/riemann/include")

(instrumentation {:interval 60 :enabled? true})

(let [index (index)]
  ; Inbound events will be passed to these streams:
  (streams
    (default :ttl 60
      ; Index all events immediately.
      index

      ; Log expired events.
      (expired
        (fn [event] (info "expired" event)))))

  (streams
        ;Output riemann internal metrics to log
	(where  (service #"^riemann ")
	  #(info %)
	)

    (by [:host :service]
      (where (not (expired? event))
        (smap (fn [events]
          (dissoc
             (assoc events :site (get (re-find (re-pattern "([A-Za-z0-9]*)[.](ops)[.]([a-z]*)") hostname) 1) :ttl 360 )
           :okthreshold :criticalthreshold :departmentkey :alertformat :alerter :alertstate
           :alertdescription :failure_message :step_message :sqlinstance_owner :is_job_running :original_host :original_service
               :description :job_category :job_id :job_owner :failure_time :next_runtime :frequency :step_id :sqlinstanceowner
            ))
     send_to_central_riemann))))
)

include/centralriemann.clj

; -*- mode: clojure; -*-
; vim: filetype=clojure

; Load the Riemann TCP client. Were going to use this client to send events downstream.
(require 'riemann.client)

; Forward events to the central riemann server
; 1. We define the downstream symbol ('send_to_central_riemann'). When that symbol is referenced events are passed into it.
; 2. Events first go into the batch stream. Every 100 events or 1/10th of a second events are batched and sent on.
; 3. The batched events are passed to the async-queue! stream.
; 4. The async-queue! stream passes the events to the forward stream which sends them to the central riemann server over 5555
(def send_to_central_riemann (batch 1000 1/10 
				(async-queue! :agg { :queue-size 10000
					:core-pool-size 32 
					:max-pool-size 1024}
						(forward
							(riemann.client/tcp-client :host "10.x.x.x")))) ; Need to set it to loadbalancer IP

riemann.log error log

102497 INFO [2018-02-26 14:53:56,933] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server udp 0.0.0.0:5555 in latency 0.95, :state ok, :description nil, :metric nil, :tags [riemann], :t        ime 379907009233/250, :ttl 120}
 102498 INFO [2018-02-26 14:53:56,933] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server udp 0.0.0.0:5555 in latency 0.99, :state ok, :description nil, :metric nil, :tags [riemann], :t        ime 379907009233/250, :ttl 120}
 102499 INFO [2018-02-26 14:53:56,933] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server udp 0.0.0.0:5555 in latency 0.999, :state ok, :description nil, :metric nil, :tags [riemann], :        time 379907009233/250, :ttl 120}
 102500 INFO [2018-02-26 14:53:56,933] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 conns, :state ok, :description nil, :metric 0, :tags [riemann], :time 379907009        233/250, :ttl 120}
 102501 INFO [2018-02-26 14:53:56,933] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 out rate, :state ok, :description nil, :metric 0.0, :tags [riemann], :time 3799        07009233/250, :ttl 120}
 102502 INFO [2018-02-26 14:53:56,933] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 in rate, :state ok, :description nil, :metric 0.0, :tags [riemann], :time 37990        7009233/250, :ttl 120}
 102503 INFO [2018-02-26 14:53:56,933] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 out latency 0.0, :state ok, :description nil, :metric nil, :tags [riemann], :ti        me 379907009233/250, :ttl 120}
 102504 INFO [2018-02-26 14:53:56,933] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 out latency 0.5, :state ok, :description nil, :metric nil, :tags [riemann], :ti        me 379907009233/250, :ttl 120}
 102505 INFO [2018-02-26 14:53:56,933] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 out latency 0.95, :state ok, :description nil, :metric nil, :tags [riemann], :t        ime 379907009233/250, :ttl 120}
 102506 INFO [2018-02-26 14:53:56,933] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 out latency 0.99, :state ok, :description nil, :metric nil, :tags [riemann], :t        ime 379907009233/250, :ttl 120}
 102507 INFO [2018-02-26 14:53:56,933] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 out latency 0.999, :state ok, :description nil, :metric nil, :tags [riemann], :        time 379907009233/250, :ttl 120}
 102508 INFO [2018-02-26 14:53:56,933] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 in latency 0.0, :state ok, :description nil, :metric nil, :tags [riemann], :tim        e 379907009233/250, :ttl 120}
 102509 INFO [2018-02-26 14:53:56,933] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 in latency 0.5, :state ok, :description nil, :metric nil, :tags [riemann], :tim        e 379907009233/250, :ttl 120}
 102510 INFO [2018-02-26 14:53:56,934] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 in latency 0.95, :state ok, :description nil, :metric nil, :tags [riemann], :ti        me 379907009233/250, :ttl 120}
 102511 INFO [2018-02-26 14:53:56,934] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 in latency 0.99, :state ok, :description nil, :metric nil, :tags [riemann], :ti        me 379907009233/250, :ttl 120}
 102512 INFO [2018-02-26 14:53:56,934] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 in latency 0.999, :state ok, :description nil, :metric nil, :tags [riemann], :t        ime 379907009233/250, :ttl 120}
 102513 INFO [2018-02-26 14:53:56,934] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg accepted rate, :state ok, :description nil, :metric 223500/30001, :tags [riemann], :time         379907009233/250, :ttl 120}
 102514 INFO [2018-02-26 14:53:56,934] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg completed rate, :state ok, :description nil, :metric 0N, :tags [riemann], :time 379907009        233/250, :ttl 120}
 102515 INFO [2018-02-26 14:53:56,934] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg rejected rate, :state ok, :description nil, :metric 0N, :tags [riemann], :time 3799070092        33/250, :ttl 120}
 102516 INFO [2018-02-26 14:53:56,934] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg queue capacity, :state ok, :description nil, :metric 10000, :tags [riemann], :time 379907        009233/250, :ttl 120}
 102517 INFO [2018-02-26 14:53:56,934] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg queue size, :state critical, :description nil, :metric 9991, :tags [riemann], :time 37990        7009233/250, :ttl 120}
 102518 INFO [2018-02-26 14:53:56,934] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg queue used, :state critical, :description nil, :metric 9991/10000, :tags [riemann], :time         379907009233/250, :ttl 120}
 102519 INFO [2018-02-26 14:53:56,934] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg threads active, :state ok, :description nil, :metric 246, :tags [riemann], :time 37990700        9233/250, :ttl 120}
 102520 INFO [2018-02-26 14:53:56,934] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg threads alive, :state ok, :description nil, :metric 246, :tags [riemann], :time 379907009        233/250, :ttl 120}
...
...
102557 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server udp 0.0.0.0:5555 in latency 0.99, :state ok, :description nil, :metric nil, :tags [riemann], :t        ime 759814048467/500, :ttl 120}
 102558 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server udp 0.0.0.0:5555 in latency 0.999, :state ok, :description nil, :metric nil, :tags [riemann], :        time 759814048467/500, :ttl 120}
 102559 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 conns, :state ok, :description nil, :metric 0, :tags [riemann], :time 759814048        467/500, :ttl 120}
 102560 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 out rate, :state ok, :description nil, :metric 0.0, :tags [riemann], :time 7598        14048467/500, :ttl 120}
 102561 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 in rate, :state ok, :description nil, :metric 0.0, :tags [riemann], :time 75981        4048467/500, :ttl 120}
 102562 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 out latency 0.0, :state ok, :description nil, :metric nil, :tags [riemann], :ti        me 759814048467/500, :ttl 120}
 102563 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 out latency 0.5, :state ok, :description nil, :metric nil, :tags [riemann], :ti        me 759814048467/500, :ttl 120}
 102564 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 out latency 0.95, :state ok, :description nil, :metric nil, :tags [riemann], :t        ime 759814048467/500, :ttl 120}
 102565 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 out latency 0.99, :state ok, :description nil, :metric nil, :tags [riemann], :t        ime 759814048467/500, :ttl 120}
 102566 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 out latency 0.999, :state ok, :description nil, :metric nil, :tags [riemann], :        time 759814048467/500, :ttl 120}
 102567 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 in latency 0.0, :state ok, :description nil, :metric nil, :tags [riemann], :tim        e 759814048467/500, :ttl 120}
 102568 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 in latency 0.5, :state ok, :description nil, :metric nil, :tags [riemann], :tim        e 759814048467/500, :ttl 120}
 102569 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 in latency 0.95, :state ok, :description nil, :metric nil, :tags [riemann], :ti        me 759814048467/500, :ttl 120}
 102570 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 in latency 0.99, :state ok, :description nil, :metric nil, :tags [riemann], :ti        me 759814048467/500, :ttl 120}
 102571 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann server ws 0.0.0.0:5556 in latency 0.999, :state ok, :description nil, :metric nil, :tags [riemann], :t        ime 759814048467/500, :ttl 120}
 102572 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg accepted rate, :state ok, :description nil, :metric 152000/20001, :tags [riemann], :time         303925619387/200, :ttl 120}
 102573 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg completed rate, :state ok, :description nil, :metric 0N, :tags [riemann], :time 303925619        387/200, :ttl 120}
 102574 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg rejected rate, :state ok, :description nil, :metric 0N, :tags [riemann], :time 3039256193        87/200, :ttl 120}
 102575 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg queue capacity, :state ok, :description nil, :metric 10000, :tags [riemann], :time 303925        619387/200, :ttl 120}
 102576 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg queue size, :state critical, :description nil, :metric 10000, :tags [riemann], :time 3039        25619387/200, :ttl 120}
 102577 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg queue used, :state critical, :description nil, :metric 1, :tags [riemann], :time 30392561        9387/200, :ttl 120}
 102578 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg threads active, :state ok, :description nil, :metric 693, :tags [riemann], :time 30392561        9387/200, :ttl 120}
 102579 INFO [2018-02-26 14:54:56,936] Thread-9 - riemann.config - #riemann.codec.Event{:host host1.test, :service riemann executor agg threads alive, :state ok, :description nil, :metric 693, :tags [riemann], :time 303925619        387/200, :ttl 120}
...
...
102599 java.util.concurrent.RejectedExecutionException: Task clojure.core$bound_fn_STAR_$fn__4671@c1fdfcf rejected from java.util.concurrent.ThreadPoolExecutor@4b1f67c8[Running, pool size = 1024, active threads = 1024, queued tasks = 10        000, completed tasks = 8356055]
 102600         at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
 102601         at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
 102602         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
 102603         at riemann.service.ExecutorServiceService.execute(service.clj:173)
 102604         at riemann.streams$execute_on$stream__7093.invoke(streams.clj:266)
 102605         at riemann.streams$batch$flush__7957$fn__7963.invoke(streams.clj:1154)
 102606         at riemann.streams$batch$flush__7957.invoke(streams.clj:1154)
 102607         at riemann.streams$part_time_simple$tick__7301.invoke(streams.clj:610)
 102608         at riemann.time.Once.run(time.clj:42)
 102609         at riemann.time$run_tasks_BANG_$fn__5370$fn__5371.invoke(time.clj:154)
 102610         at riemann.time$run_tasks_BANG_$fn__5370.invoke(time.clj:153)
 102611         at riemann.time$run_tasks_BANG_.invokeStatic(time.clj:147)
 102612         at riemann.time$run_tasks_BANG_.invoke(time.clj:142)
 102613         at riemann.time$start_BANG_$fn__5386$fn__5387.invoke(time.clj:189)
 102614         at clojure.lang.AFn.applyToHelper(AFn.java:152)
 102615         at clojure.lang.AFn.applyTo(AFn.java:144)
 102616         at clojure.core$apply.invokeStatic(core.clj:646)
 102617         at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1881)
 102618         at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1881)
 102619         at clojure.lang.RestFn.invoke(RestFn.java:425)
 102620         at clojure.lang.AFn.applyToHelper(AFn.java:156)
 102621         at clojure.lang.RestFn.applyTo(RestFn.java:132)
 102622         at clojure.core$apply.invokeStatic(core.clj:650)
 102623         at clojure.core$bound_fn_STAR_$fn__4671.doInvoke(core.clj:1911)
 102624         at clojure.lang.RestFn.invoke(RestFn.java:397)
 102625         at clojure.lang.AFn.run(AFn.java:22)
 102626         at java.lang.Thread.run(Thread.java:748)
 102627 WARN [2018-02-26 14:55:39,096] riemann task 0 - riemann.streams - riemann.streams$execute_on$stream__7093@194d221c threw
 102628 java.util.concurrent.RejectedExecutionException: Task clojure.core$bound_fn_STAR_$fn__4671@69dab334 rejected from java.util.concurrent.ThreadPoolExecutor@4b1f67c8[Running, pool size = 1024, active threads = 1024, queued tasks = 1        0000, completed tasks = 8356055]
 102629         at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
 102630         at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
 102631         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
 102632         at riemann.service.ExecutorServiceService.execute(service.clj:173)
 102633         at riemann.streams$execute_on$stream__7093.invoke(streams.clj:266)
 102634         at riemann.streams$batch$flush__7957$fn__7963.invoke(streams.clj:1154)
 102635         at riemann.streams$batch$flush__7957.invoke(streams.clj:1154)
 102636         at riemann.streams$part_time_simple$tick__7301.invoke(streams.clj:610)
 102637         at riemann.time.Once.run(time.clj:42)
 102638         at riemann.time$run_tasks_BANG_$fn__5370$fn__5371.invoke(time.clj:154)
 102639         at riemann.time$run_tasks_BANG_$fn__5370.invoke(time.clj:153)
 102640         at riemann.time$run_tasks_BANG_.invokeStatic(time.clj:147)
 102641         at riemann.time$run_tasks_BANG_.invoke(time.clj:142)

Hi @mcorbin,
This is the status at the moment. The complele rate from 7 drop to 0. But I don't know it's network issue or what happened.
https://imgur.com/a/el3Nh

I dont know what happened. You had any other exceptions in the logs ?

Btw, in your example code:

(def send_to_central_riemann (batch 1000 1/10 
				(async-queue! :agg { :queue-size 10000
					:core-pool-size 32 
					:max-pool-size 1024}
						(forward
							(riemann.client/tcp-client :host "10.x.x.x")))) ; Need to set it to loadbalancer IP

a parenthesis is missing at the end.

HI @mcorbin
No error log in the log file, the riemann I miss copy a parenthesis, it's ok on our system. Thank you for notification. It happens sometimes, is there any way to check when it happens next time?

So when it happens, the Riemann on 10.x.x.x stop receiving events completely ?
What happen to the Riemann forwarding the events ? does it stop emitting statistics ? Your previous graph was the stats of the Riemann forwarding the events or the Riemann receiving the events on "10.x.x.x " ?

Hi @mcorbin,
We have improved our central riemann code to reduce CPU usage, and monitor if it happens again. If nothing happens again this month, I will close the ticket. Thank you for the help.

Excellent. Please re-open if your issue reappears. Thanks.

Hi @jamtur01, it happen again, could you help to reopen it?

Hi @mcorbin ,
Here is the workflow tcb0124 --> haproxy --> central riemann1,2,3 --> influxdb or logstash.
When the riemann forwarder shows the error, the central riemanns seems works fine.
image

And the riemann forwarder stats and other data seems can't send to center riemann and influxdb, so there is no metrics on the graph after the error happen. I think the forwarder stats on the graph is the Riemann forwarding the events, not receiver stats.

image
image

I think you need to use a tool like visualvm and perform thread and heap dump and analyze the threads states on the frozen Riemann process.

You can do it by adding on /etc/default/riemann (or /etc/sysconfig on RHEL) these options:

EXTRA_JAVA_OPTS="-Dcom.sun.management.jmxremote.port=4444 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"

Be careful, this example configure jmx without authentication/security (but you can also configure jmx with password auth/tls/...).

You can then analyze your Riemann process using for example https://visualvm.github.io/download.html, to see the thread states and perform thread dumps.

You can also do heap dump using jmap, with this command for example:

sudo -u riemann bash -c 'jmap -dump:format=b,file=/tmp/heapdump.bin <RIEMANN_PID>'

and then open it in visualvm.

Let me try and see if I can find any issue.

seems don't happen again, I close it.