funnel error stops normal running process
oliveagle opened this issue · 8 comments
package main

import (
    "fmt"
    "os"
    "time"

    "github.com/rs/zerolog/log" // logger assumed from the log.Info().Msg call
)

func main() {
    // Emit a line via the logger and another directly to stderr, once per second.
    for range time.Tick(time.Second) {
        log.Info().Msg("tick")
        os.Stderr.WriteString(fmt.Sprintf("stderr - now: %v\n", time.Now()))
    }
}
The funnel kafka brokers are set to invalid ports to mock a kafka broker failure. We expect the demo process to run 24x7, but funnel stops it when the kafka brokers are down.
Hi, thanks for trying funnel.
This is not technically funnel's fault though. Your demo program is writing to a broken pipe when funnel has exited. Therefore, it will exit with a SIGPIPE signal. Please see https://golang.org/pkg/os/signal/#hdr-SIGPIPE.
There are a couple of ways to handle this:
- Handle the SIGPIPE signal within your demo code: just set up a signal handler, and if the signal is SIGPIPE, you can ignore it (see the sketch after this list).
- Use nohup to execute your demo program: nohup ./demo & tailf nohup.out | funnel
- Use a systemd service to restart the entire thing in case the program exits. Probably not something you want to do, but still a solution.
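A minimal sketch of the first option, assuming the demo is a Go program (per the os/signal docs linked above, once a program receives SIGPIPE via signal.Notify, a write to a broken stdout/stderr returns an EPIPE error instead of killing the process):

package main

import (
    "fmt"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    // Receiving SIGPIPE via Notify turns broken-pipe writes into EPIPE errors
    // instead of process termination.
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGPIPE)

    for range time.Tick(time.Second) {
        if _, err := fmt.Println("tick", time.Now()); err != nil {
            continue // downstream reader is gone; drop the line and keep running
        }
    }
}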
A couple of closing notes:
- The error you are seeing happens during the initialization of funnel, not during execution. It was unable to reach a kafka broker, hence it stopped.
- You are printing to stderr, so you have to redirect that to funnel too; otherwise funnel cannot get the data. You can use |&, e.g. ./demo |& funnel.
I am going to close this issue due to inactivity. If you have some other suggestions, please feel free to comment.
Yes, we can hardcode a fix for the SIGPIPE problem. But does it mean all programs using funnel need to add this fix to their source code?
If we already have many apps running (1000+), how do we adopt funnel as a middleware to centralize logs without modifications to those apps?
Any logging, monitoring, or tracing infrastructure failure should not break business logic. This is a top-priority consideration for ops; that's why many systems use UDP instead of TCP for monitoring.
We are now in the cloud era and docker is everywhere, so systemd is not an option in this case. If an app in k8s/mesos stops working, it gets spawned on another host: pull the image, start the container, wait for it to exit with an error, spawn it on yet another host... repeated until the infrastructure failure is fixed, or until we close the shop.
It's very common for log files to fill up the disk. Maybe someone forgot to add a config to logrotate.d. Maybe there is no logrotate service running at all (inside a docker container, for instance). Maybe logs are flooding because of some unexpected error... funnel seems to provide a solution here, and this is also what makes funnel shine. So nohup is not a good option either: who is responsible for rotating that nohup.out? By introducing yet another dependency?
BTW:
- The write to os.Stderr is on purpose.
Does it mean all programs using funnel need to add this fix to their source code?
Only if you expect the target to not be there during initialization. Note that if funnel cannot initialize its config, it has to stop. It cannot begin execution without a proper config.
Any logging, monitoring, or tracing infrastructure failure should not break business logic.
Sure. But you do realize it is due to the pipe connecting them. All other logging and monitoring infrastructure, like Logplex or FluentD, reads from files. If there is a pipe connecting two processes, the target process can fail at any time and will crash the source process unless the signal is handled.
If we already have many apps running (1000+), how do we adopt funnel as a middleware to centralize logs without modifications to those apps?
Another idea that I can think of is to have another binary as a passthrough which ignores the SIGPIPE signal, and run your app like ./demo | ignore_sigpipe | funnel. Of course, ignore_sigpipe may itself crash (although that is very unlikely). And you also have to think about the case when funnel stops working but your app is still logging: how are you going to handle the logs then?
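For illustration, a minimal sketch of what such a passthrough could look like in Go (ignore_sigpipe is only the hypothetical name suggested above):

// ignore_sigpipe: copy stdin to stdout, surviving the reader (funnel) exiting.
package main

import (
    "bufio"
    "fmt"
    "os"
    "os/signal"
    "syscall"
)

func main() {
    // Receiving SIGPIPE via Notify turns broken-pipe writes into EPIPE errors.
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGPIPE)

    in := bufio.NewScanner(os.Stdin)
    for in.Scan() {
        if _, err := fmt.Println(in.Text()); err != nil {
            continue // reader is gone; keep draining stdin so the writer never blocks
        }
    }
}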
At the end of the day, funnel is a minimalistic tool. It does a very simple thing of taking logs from stdin and sending to various targets. If you have a very strict requirement that you cannot tolerate loss of a single line of log, maybe you can look at Fluent or Logstash. Otherwise I would try to put more monitoring on the kafka cluster to ensure that it does not go down.
I am open to any other solution that you have that can fix this from funnel's side.
If you have a very strict requirement that you cannot tolerate loss of a single line of log.
No, quite the opposite: we prefer losing logs to crashing.
It cannot begin execution without a proper config.
The config is correct at one point in time, but the environment can change shortly afterwards and make it unreachable.
- I think funnel works in a fan-out manner; a single broken output should not stop the funnel process.
- funnel only ever runs behind a pipe, so why bother exiting? Many Linux commands consume a pipe but don't exit themselves: wc, uniq, tail, etc.
- funnel already puts error messages into syslog (or could it print to stderr at the same time? can syslog be optional instead of a hard requirement?), so the user is already aware of the failure.
./demo | ignore_sigpipe | funnel only prevents the demo process from being killed by SIGPIPE. The streaming logs are gone for good afterwards, even when kafka comes back online.
During its initialization, the sarama client retries many times to reach the brokers. Once the max retry limit is reached, it returns an error (... out of available brokers to talk to ...) and stops trying. But sarama lets you set Metadata.Retry.Max (default 3) and Net.DialTimeout (default 30s) to control this connection-establishment behavior; funnel just ignores them.
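For illustration, this is roughly how those two knobs are set on a sarama client (the broker address and values are placeholders; the proposal is to expose them through funnel's config):

package main

import (
    "log"
    "time"

    "github.com/Shopify/sarama"
)

func main() {
    cfg := sarama.NewConfig()
    cfg.Metadata.Retry.Max = 10           // sarama default: 3
    cfg.Net.DialTimeout = 5 * time.Second // sarama default: 30s

    // Placeholder broker list; funnel would read this from its own config.
    producer, err := sarama.NewSyncProducer([]string{"localhost:9092"}, cfg)
    if err != nil {
        log.Fatalf("kafka init failed: %v", err)
    }
    defer producer.Close()
}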
Is it an option to add those two settings to the configuration? At least then we can tune these two numbers to keep funnel from stopping, and we can get the streaming logs back online ASAP automatically.
Is it an option to add those two settings to the configuration? At least then we can tune these two numbers to keep funnel from stopping,
I like that. I can add those knobs.
I have added the knobs. Please set the values in funnel.toml.
great. thx