funbox/smppex

ETS ** Too many db tables ** error

long-tran opened this issue · 9 comments

Hi, I've recently run into this problem in my production environment:

CRASH REPORT==== 7-Nov-2016::09:00:05 ===
  crasher:
    initial call: ranch_conns_sup:init/7
    pid: <0.8347.1>
    registered_name: []
    exception exit: {system_limit,
                        [{ets,new,[pdu_storage_by_sequence_number,[set]],[]},
                         {'Elixir.SMPPEX.PduStorage',init,1,
                             [{file,"lib/smppex/pdu_storage.ex"},{line,43}]},
                         {gen_server,init_it,6,
                             [{file,"gen_server.erl"},{line,328}]},
                         {proc_lib,init_p_do_apply,3,
                             [{file,"proc_lib.erl"},{line,247}]}]}
      in function  ranch_conns_sup:terminate/3 (src/ranch_conns_sup.erl, line 224)
    ancestors: [<0.8346.1>,<0.8345.1>]
    messages: []
    links: []
    dictionary: [{<0.8348.1>,true}]
    trap_exit: true
    status: running
    heap_size: 610
    stack_size: 27
    reductions: 261
  neighbours: 
.....
[error] * Too many db tables

It seems to have something to do with the pdu_storage. Is there a potential misconfiguration in the SMPPEX code?

Thanks,
Long

Hello!

Thanks for the feedback.

There are two likely causes of this problem:

  • something in your code creates many ETS tables, so that the creation of the next MC session fails once the system limit is exhausted;
  • all of the ETS tables are consumed by SMPPEX itself; in that case there should be many MC sessions that, for some reason, were never stopped.

So there are a couple of questions I would like to ask to clarify the situation:

  • How many simultaneous client connections does your server have when the crash occurs? Have you specified a custom max_connections transport option when starting the MC?
  • What are the names of the ETS tables that fill up the ETS space when the crash occurs? (This info can be obtained by running :ets.i(); the snippet below gives a more compact summary.)
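For the second question, a snippet like the one below, run in an attached IEx shell, gives a more compact picture than the full :ets.i() dump. It uses only standard :ets and :erlang calls; nothing here is SMPPEX-specific.

  # How many ETS tables exist vs. the node-wide limit.
  IO.puts("ETS tables in use: #{length(:ets.all())}")
  IO.puts("ETS table limit:   #{:erlang.system_info(:ets_limit)}")

  # Count the live tables by name; a large number of tables named
  # :pdu_storage_by_sequence_number would point at leaked MC sessions
  # rather than at your own code.
  :ets.all()
  |> Enum.map(&:ets.info(&1, :name))
  |> Enum.reduce(%{}, fn name, acc -> Map.update(acc, name, 1, &(&1 + 1)) end)
  |> Enum.sort_by(fn {_name, count} -> -count end)
  |> Enum.take(10)
  |> IO.inspect(label: "most frequent table names")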

Closing due to no reply.

@savonarola Hi, we just ran into the same issue. My max_connections is set to 600 (while the ETS table limit should be around 1400 by default), and I have a health checker that opens (and closes) a socket every 10 seconds.

12:47:58.117 [info]  mc_conn #PID<0.1832.0>, socket closed, stopping

12:48:08.117 [info]  mc_conn #PID<0.1838.0>, socket closed, stopping

12:48:08.117 [info]  mc_conn #PID<0.1841.0>, socket closed, stopping

12:48:18.117 [info]  mc_conn #PID<0.1844.0>, socket closed, stopping

12:48:18.117 [info]  mc_conn #PID<0.1847.0>, socket closed, stopping

12:48:28.117 [info]  mc_conn #PID<0.1850.0>, socket closed, stopping

12:48:28.117 [info]  mc_conn #PID<0.1853.0>, socket closed, stopping

12:48:38.117 [info]  mc_conn #PID<0.1856.0>, socket closed, stopping

...

After a few hours of this, though, any time the health checker opens a socket, we encounter this error:

16:54:48.121 [error] Ranch listener #Reference<0.0.2.571> connection process start failure; SMPPEX.Session:start_link/4 returned: {:error, {{:badmatch, {:error, {:system_limit, [{:ets, :new, [:pdu_storage_by_sequence_number, [:set]], []}, {SMPPEX.PduStorage, :init, 1, [file: 'lib/smppex/pdu_storage.ex', line: 43]}, {:gen_server, :init_it, 6, [file: 'gen_server.erl', line: 328]}, {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 247]}]}}}, [{SMPPEX.MC, :init, 1, [file: 'lib/smppex/mc.ex', line: 386]}, {:gen_server, :init_it, 6, [file: 'gen_server.erl', line: 328]}, {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 247]}]}}

So it seems that the ETS table is not getting cleaned up properly when a Ranch socket is closed.

Do note that we have no active connections to the instance, except for the health-check opening and closing the socket (so this is not a case of it being over-saturated with traffic).
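For reference, the health check does nothing SMPP-specific; a plain TCP open/close loop like the one below mimics it and shows the table count climbing over time (localhost and port 2775 are just placeholders for wherever the MC listens):

  # Repeatedly open and immediately close a TCP connection to the MC port.
  # Each accepted connection starts an MC session, and every session that
  # is not cleaned up on close leaves one PduStorage ETS table behind.
  for _ <- 1..2000 do
    {:ok, socket} = :gen_tcp.connect('localhost', 2775, [active: false])
    :ok = :gen_tcp.close(socket)
  end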

Hello!

Trying to reproduce the issue.

Hello!

I have reproduced the issue; the reason was that the peer closing the socket is not considered an abnormal case, so the MC session stopped with reason :normal, leaving the child PduStorage process alive and keeping its ETS table.

I have added the necessary cleanup.
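The gist of the cleanup is to make sure the storage process, and therefore its ETS table, goes away together with the session even when the session exits with reason :normal. Below is a minimal sketch of that idea; MySession and MyPduStorage are hypothetical names, not the actual library modules.

  defmodule MyPduStorage do
    use GenServer

    def start_link, do: GenServer.start_link(__MODULE__, [])

    def init(_) do
      # The table is owned by this process and is destroyed when it exits.
      {:ok, :ets.new(:pdu_storage_by_sequence_number, [:set])}
    end
  end

  defmodule MySession do
    use GenServer

    def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts)

    def init(opts) do
      # Trap exits so terminate/2 is also called when a linked process
      # (e.g. the transport) sends us an exit signal.
      Process.flag(:trap_exit, true)
      {:ok, storage} = MyPduStorage.start_link()
      {:ok, %{storage: storage, opts: opts}}
    end

    def terminate(_reason, %{storage: storage}) do
      # Stopping with reason :normal does not take a linked child down
      # (a :normal exit signal is ignored by a process that does not trap
      # exits), so stop the storage explicitly; its ETS table is freed
      # when its owner process exits.
      GenServer.stop(storage)
      :ok
    end
  end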

@savonarola as always, thank you for the swift fix! 🍻