benoitc/gunicorn

Gunicorn workers time out but do not exit

Closed this issue · 13 comments

I'm using gunicorn_thrift, an extension of gunicorn. Normally the application runs well, but after it has been running for a while, I get this error:

[2017-12-06 12:15:39 +0000] [5727] [CRITICAL] WORKER TIMEOUT (pid:7438)
[2017-12-06 12:15:39 +0000] [5727] [CRITICAL] WORKER TIMEOUT (pid:7445)
[2017-12-06 12:15:39 +0000] [5727] [CRITICAL] WORKER TIMEOUT (pid:7447)
[2017-12-06 12:15:39 +0000] [5727] [CRITICAL] WORKER TIMEOUT (pid:7448)
[2017-12-06 12:15:39 +0000] [5727] [CRITICAL] WORKER TIMEOUT (pid:7449)
[2017-12-06 12:15:39 +0000] [5727] [CRITICAL] WORKER TIMEOUT (pid:7451)
[2017-12-06 12:15:39 +0000] [5727] [CRITICAL] WORKER TIMEOUT (pid:7453)
[2017-12-06 12:15:39 +0000] [5727] [CRITICAL] WORKER TIMEOUT (pid:7456)
[2017-12-06 12:15:39 +0000] [5727] [CRITICAL] WORKER TIMEOUT (pid:7457)
[2017-12-06 12:15:39 +0000] [5727] [CRITICAL] WORKER TIMEOUT (pid:7458)
[2017-12-06 12:15:39 +0000] [5727] [CRITICAL] WORKER TIMEOUT (pid:7459)
[2017-12-06 12:15:40 +0000] [5727] [CRITICAL] WORKER TIMEOUT (pid:7434)
[2017-12-06 12:15:40 +0000] [5727] [CRITICAL] WORKER TIMEOUT (pid:7437)
[2017-12-06 12:15:40 +0000] [5727] [CRITICAL] WORKER TIMEOUT (pid:7443)
[2017-12-06 12:15:40 +0000] [7506] [INFO] Booting worker with pid: 7506
[2017-12-06 12:15:40 +0000] [7507] [INFO] Booting worker with pid: 7507
[2017-12-06 12:15:40 +0000] [7508] [INFO] Booting worker with pid: 7508
[2017-12-06 12:15:40 +0000] [7509] [INFO] Booting worker with pid: 7509
[2017-12-06 12:15:40 +0000] [7510] [INFO] Booting worker with pid: 7510
[2017-12-06 12:15:40 +0000] [7511] [INFO] Booting worker with pid: 7511
[2017-12-06 12:15:40 +0000] [7512] [INFO] Booting worker with pid: 7512
[2017-12-06 12:15:40 +0000] [7513] [INFO] Booting worker with pid: 7513
[2017-12-06 12:15:40 +0000] [7514] [INFO] Booting worker with pid: 7514
[2017-12-06 12:15:40 +0000] [7515] [INFO] Booting worker with pid: 7515
[2017-12-06 12:15:40 +0000] [7514] [ERROR] Exception in worker process:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/gunicorn/arbiter.py", line 507, in spawn_worker
    worker.init_process()
  File "/usr/local/lib/python2.7/site-packages/gunicorn_thrift/thriftpy_gevent_worker.py", line 98, in init_process
    return super(GeventThriftPyWorker, self).init_process()
  File "/usr/local/lib/python2.7/site-packages/gunicorn/workers/ggevent.py", line 192, in init_process
    super(GeventWorker, self).init_process()
  File "/usr/local/lib/python2.7/site-packages/gunicorn/workers/base.py", line 118, in init_process
    self.wsgi = self.app.wsgi()
  File "/usr/local/lib/python2.7/site-packages/gunicorn/app/base.py", line 67, in wsgi
    self.callable = self.load()
  File "/usr/local/lib/python2.7/site-packages/gunicorn_thrift/thriftapp.py", line 39, in load
    self.thrift_app = self.load_thrift_app()
  File "/usr/local/lib/python2.7/site-packages/gunicorn_thrift/thriftapp.py", line 33, in load_thrift_app
    return utils.load_obj(self.app_uri)
  File "/usr/local/lib/python2.7/site-packages/gunicorn_thrift/utils.py", line 16, in load_obj
    mod = importlib.import_module(module)
  File "/usr/local/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/usr/local/lib/python2.7/site-packages/gevent-1.1rc2-py2.7-linux-x86_64.egg/gevent/builtins.py", line 58, in __import__
    result = _import(*args, **kwargs)
  File "/home/zhouyi/heybox_script/release/pubg_thriftpy/pubg_account_server.py", line 13, in <module>
    pg_account = psycopg2.connect(database="account", user='postgres', password="", host="10.10.10.54", port='19934')
  File "/usr/local/lib/python2.7/site-packages/psycopg2/__init__.py", line 164, in connect
    conn = _connect(dsn, connection_factory=connection_factory, async=async)
OperationalError: FATAL:  sorry, too many clients already

It seems the workers time out but do not exit, so the master keeps booting new workers, and eventually a worker connecting to PostgreSQL gets the 'too many clients' error. Does anyone know what the problem is?

Gunicorn will send an abort signal to the workers that time out, start new workers, and then send a kill signal to any old workers that still have not shut down. It is possible your new workers start up quickly enough that they connect before the old ones shut down. Or it's possible that Postgres has not noticed the socket shut down because the workers are killed and do not shut down cleanly.
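For reference, the timeout path in the 19.x arbiter looks roughly like this. This is a paraphrased sketch of Arbiter.murder_workers from gunicorn/arbiter.py, not the exact source, and details vary between releases:

import signal
import time

# Paraphrased sketch of gunicorn 19.x Arbiter.murder_workers.
def murder_workers(self):
    if not self.timeout:
        return
    for pid, worker in list(self.WORKERS.items()):
        try:
            # Each worker heartbeats by touching a temp file.
            if time.time() - worker.tmp.last_update() <= self.timeout:
                continue  # heartbeat is fresh; leave the worker alone
        except (OSError, ValueError):
            continue
        if not worker.aborted:
            self.log.critical("WORKER TIMEOUT (pid:%s)", pid)
            worker.aborted = True
            self.kill_worker(pid, signal.SIGABRT)  # first ask it to abort
        else:
            self.kill_worker(pid, signal.SIGKILL)  # still around: force-kill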

I would increase the client limit on Postgres and monitor the number of connections to make sure it doesn't continually increase. You can also connect through PgBouncer or use some other solution for closing inactive connections. See: https://stackoverflow.com/questions/12391174/how-to-close-idle-connections-in-postgresql-automatically
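For instance, the cleanup from that answer can be run from Python with psycopg2. A minimal sketch, assuming Postgres 9.2+ (where pg_stat_activity exposes pid and state) and reusing the host/port from your traceback:

import psycopg2

# Connect as a superuser (or the owner of the target connections).
admin = psycopg2.connect(database="postgres", user="postgres",
                         host="10.10.10.54", port="19934")
admin.autocommit = True
cur = admin.cursor()

# Terminate connections to the 'account' database idle for over 10 minutes.
cur.execute("""
    SELECT pg_terminate_backend(pid)
    FROM pg_stat_activity
    WHERE datname = 'account'
      AND state = 'idle'
      AND state_change < now() - interval '10 minutes'
""")
print("terminated %d idle connections" % cur.rowcount)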

Also, investigate your application to figure out why your workers time out, or switch to a threaded or async worker model to avoid it.

In my case, looking at the Gunicorn error log, the '[CRITICAL] WORKER TIMEOUT' and '[INFO] Booting worker with pid' messages continued for about a minute without any exit log appearing. That should be the real problem, so it can't just be a PostgreSQL connection issue. The task in my application is a very simple SELECT query with little time cost, so it also shouldn't be the origin of the timeout. The command I used to start gunicorn was:

gunicorn_thrift pubg_account_server:app -w 60 --log-level=debug \
  --worker-connections 2000 -k thriftpy_gevent \
  --thrift-protocol-factory thriftpy.protocol:TCyBinaryProtocolFactory \
  --thrift-transport-factory thriftpy.transport:TCyBufferedTransportFactory -D \
  --error-logfile /homex2/zhouyi/log/pubg_account_thrift_error.log \
  --access-logfile /homex2/zhouyi/log/pubg_account_thrift_access.log \
  -b 0.0.0.0:16796 --reload --forwarded-allow-ips 10.10.10.55

thriftpy_gevent is an async worker for gunicorn_thrift, and the gunicorn_thrift version is 19.3.0.

For more detail, here is the server code:

import time

import psycopg2
import thriftpy
from thriftpy.thrift import TProcessor

# One module-level connection and cursor, shared by every request this
# worker process handles.
pg_account = psycopg2.connect(database="account", user='postgres', password="", host="10.10.10.8", port='19932')
pg_account_cur = pg_account.cursor()

class PubgAccountRpc(object):
    def rpc_pubg_account_sid(self, pubg_id):
        # Return the existing sid for this pubg_id if there is one.
        sql = "SELECT sid from pubg_simple_ids where pubg_id=%s;"
        pg_account_cur.execute(sql, [pubg_id, ])
        r = pg_account_cur.fetchall()
        if len(r) > 0:
            return str(r[0][0])
        try:
            # Serialize sid allocation with a table lock.
            pg_account_cur.execute('BEGIN;')
            pg_account_cur.execute('LOCK TABLE pubg_simple_ids IN SHARE ROW EXCLUSIVE MODE;')
            sql = "select nextval('seq_pubg_sid');"
            pg_account_cur.execute(sql, [])
            r = pg_account_cur.fetchall()
            sid = r[0][0]
            sql = "INSERT into pubg_simple_ids (pubg_id,sid,create_time) values (%s,%s,%s);"
            pg_account_cur.execute(sql, [pubg_id, sid, int(time.time())])
            pg_account_cur.execute('COMMIT;')
            return str(sid)
        except Exception:
            pg_account_cur.execute('ROLLBACK;')
        return str(0)

    def rpc_pubg_update_bind_info(self, heybox_id, account_id):
        try:
            pg_account_cur.execute('BEGIN;')
            sql = 'insert into accounts_bind_game_id (heybox_id,game_id,game_type,update_time) values (%s,%s,%s,%s);'
            pg_account_cur.execute(sql, [heybox_id, account_id, 'pubg', int(time.time())])
            pg_account_cur.execute('commit;')
            return '1'
        except Exception:
            pg_account_cur.execute('rollback;')
            return '0'

pubg_account = thriftpy.load("pubg_account.thrift")
app = TProcessor(pubg_account.PubgAccountRpc, PubgAccountRpc())

The server only has two very simple functions, which shouldn't cause a timeout. Most of the time it runs very well and only hits the problem occasionally, at variable intervals.

I have found the reason the workers time out: the function in my code requests a 'SHARE ROW EXCLUSIVE' table lock in PostgreSQL, and that lock level conflicts with PostgreSQL's autovacuum. When autovacuum runs, the worker blocks waiting for the lock to be released, and then the timeout happens. But I still have no idea why the worker does not exit.
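One way to keep a request from stalling the worker past gunicorn's timeout would be to fail fast on the lock. A sketch, assuming PostgreSQL 9.3+ (where lock_timeout is available):

import psycopg2

conn = psycopg2.connect(database="account", user="postgres",
                        host="10.10.10.8", port="19932")
cur = conn.cursor()
# Abort any lock wait after 5 seconds instead of hanging behind autovacuum.
cur.execute("SET lock_timeout = '5s'")
try:
    cur.execute('BEGIN;')
    cur.execute('LOCK TABLE pubg_simple_ids IN SHARE ROW EXCLUSIVE MODE;')
    # ... allocate the sid and insert as before ...
    cur.execute('COMMIT;')
except psycopg2.OperationalError:
    cur.execute('ROLLBACK;')  # lock wait timed out; give up and retry later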

Are you saying that the worker never exits when this happens, even after the WORKER TIMEOUT message?

Yes. The WORKER TIMEOUT messages kept appearing for about 1-2 minutes, with new workers booting but no worker exiting. That is what caused the pg 'FATAL: sorry, too many clients already' error: there were too many workers.

Do the workers eventually exit, or do they stay stuck forever waiting for Postgres? If the worker process does not exit 1-2 seconds after the WORKER TIMEOUT message then there may be a bug. If the worker does exit, you may just need to ensure that Postgres has enough connections to handle 2x the number of workers, since Gunicorn will start new workers before killing old ones (with -w 60, that can mean up to 120 simultaneous connections).

log.txt — the attached log.txt is the error log for the problem, from the beginning through to the final exception.

Hmm. But at the end are there many, many worker processes that are zombie or stuck, or just somewhere between n and n * 2 workers where n is the configured number of workers?

It still sounds like maybe the issue is just Postgres vacuum, as you identified, and then connection exhaustion because the workers are not shutting down cleanly and you are not using PgBouncer.

The only other consideration is whether the workers could be made to exit cleanly and shut down their connections by handling SIGABRT better, but if the Postgres client has C extension components, that could be difficult to guarantee. In that case you will get unclean shutdowns, hanging connections, and eventually hit the connection limit.
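For example, gunicorn's worker_abort server hook runs inside the worker's SIGABRT handler, so a best-effort close of the connection could live there. A sketch only: it assumes the connection is importable as pubg_account_server.pg_account, that gunicorn_thrift accepts the standard -c config file, and it cannot help if the worker is wedged inside a C call:

# gunicorn_conf.py -- load with: gunicorn_thrift -c gunicorn_conf.py ...
def worker_abort(worker):
    # Called from the worker's SIGABRT handler after a timeout.
    worker.log.info("worker aborting (pid: %s)", worker.pid)
    try:
        from pubg_account_server import pg_account
        pg_account.close()  # best effort: send Postgres a clean disconnect
    except Exception:
        pass  # never let cleanup block the abort path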

So far, from this log, I still cannot say that there is any bug.

I got time to look at the gunicorn source code, and I see that the WORKER TIMEOUT message should be followed by SIGABRT to the worker, which should then raise SystemExit and log a 'Worker exiting' message. So nothing seems wrong there. But in reality the log file never showed any 'Worker exiting' message. I can't find the reason, but my application now works well. I hope anyone else who hits this problem can get some ideas from this.
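For reference, the worker-side handler I found looks roughly like this (paraphrased from gunicorn 19.x workers/base.py; not the exact source):

import sys

# Paraphrased sketch of gunicorn 19.x Worker.handle_abort: the sys.exit(1)
# is what should surface as SystemExit and produce an exit log message.
def handle_abort(self, sig, frame):
    self.alive = False
    self.cfg.worker_abort(self)  # runs any configured worker_abort hook
    sys.exit(1)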

@zyi1992 did you observe more processes, or the same number once the workers restarted?

I think we could close this and #1663 together by adding logging in the arbiter around reaping workers.

@benoitc I ran a test with my old code. The log is the same as before when the timeout happened, but the number of workers does not seem to increase. It should just be the logging problem that #1663 talks about. Thanks for all your time; closing!