RTM connection fails in large org but works in small org

Question

RTM connection fails in large org but works in small org

geekdave opened this issue 8 years ago · 36 comments

Hello,

I'm attempting to use a bot to connect Slack with Pagerduty: https://github.com/stripe-contrib/pagerbot

I set up the bot with a Slack bot access token.
I verified that the token works by using "curl" from the command line and was able to see the message getting posted in Slack.
I tested the bot on a large org (5000+ people) and it consistently fails to connect.
I tested the bot on a small org (15 people) (using a different token of course), and it worked fine.

I've worked with the developer of the bot @macobo, and he suggested I file an issue here.

I've created a gist with the error logs from heroku. The first log shows the errors when running this in the large org. The second log shows the successful connection with a smaller org:

I added some comments at the bottom noting some differences, notably:

On large org I see this in the response, but not in the small org:

transfer-encoding: "chunked"

Could there be an issue with the large number of members in the org or the number of channels causing something to stall? Is there a huge payload coming back from the RTM API due to my large org?

There is a delay of 100+ seconds in the large org log, and the error:

Socket URL has expired

Thanks for looking!

asheynkman commented 7 years ago

lol

Answer 1 · 2017-02-10T16:35:11.000Z

UPDATE: I pinged Slack's API help team about this and got the following response:

My guess is the bot is timing out on the initial call to rtm.start, which is the method that's called to initialize an RTM connection — for large teams this can take a long time. Thinking aloud, it could also be a memory allocation issue (depending on your server config).

Any chance you can increase the timeout for the initial API call ?

We're working very (very) hard on improvements in this area, as you can imagine this is a performance problem across-the-board for very large teams.

/cc @macobo

Answer 2 · 2017-02-10T16:37:35.000Z

Indeed, there's a very large payload coming back from the initial rtm.start with all the org information. Tweak the call to rtm.start with simple_latest: true and no_unreads, that reduces the size by a bit.

Then disable any client-side parsing of that data with Slack::RealTime::Client.new(store_class: nil), see if that helps at all.

Answer 3 · 2017-02-10T16:40:45.000Z

There's a feature request on supplying a custom Faraday connection in #72. There's an unfinished pull request for adding a timeout option in #74. In the meantime you can probably test other connection options with some editing of https://github.com/slack-ruby/slack-ruby-client/blob/master/lib/slack/web/faraday/connection.rb or monkey-patching.

Answer 4 · 2017-02-11T00:11:15.000Z

Thanks @dblock for the suggestions - we tested with these settings, but ran into a similar issue (didn't try store_class though). Will try events API instead.

Related to this issue: I think this should make client.start! throw in a way that the program exits or attempt to reconnect. Currently the program will just hang indefinitely and the root cause is not really visible if a logger is not specified.

Answer 5 · 2017-02-11T16:57:37.000Z

Agreed @macobo, please contribute.

Answer 6 · 2017-02-17T21:39:00.000Z

Did anyone develop a short term solution to this? Seeing the same issue and the D, [2017-02-17T13:38:12.283081 #5079] DEBUG -- Slack::RealTime::Concurrency::Eventmachine::Client#parse: �B{"type":"error","error":{"msg":"Socket URL has expired","code":1}} debug error.

It did randomly work for me one time with the suggestions from @dblock with regard to reducing the size of the data coming back, about the same as the commit that @macobo referenced above. But will not work again. As far as I can tell, nothing changed between runs.

UPDATE: May have been a corp network issue. Works when on regular network.

Answer 7 · 2017-02-17T21:45:09.000Z

@tmsteen @macobo has either of you attempted to increase the timeout? maybe finish #74?

Answer 8 · 2017-02-17T22:29:15.000Z

Not familiar with how to do that, sorry. Certainly willing to assist if you can start me in the right direction. But I am starting to think that after setting simple_latest, no_unreads, and store_class that the issue may only be partially related to those.

Right now, my connection works on a seemingly random interval. From the debug log I get this in all cases D, [2017-02-17T16:11:10.540318 #6791] DEBUG -- Slack::RealTime::Concurrency::Eventmachine::Client#write: GET /websocket/ZKsL1B2ulhjnvmg2pJDOFZLMmEKTxNnzcXGLJJHGXZpNbWjd0NQI9IUwb4Z4s-QynUGzlDvIVuogXhsoAcls4rEa4LI0ny9qvtn2CnwlukVyZdC-RPYEFvSAfi-dSywgMmQ9mCQDxzypcQd9o7Iztg3yxVdz2ciRk= HTTP/1.1

You probably understand the inner workings better than I do, but it seems that after the initial call to the Slack API, it returns a unique web socket to use for the rest of the communication. I am curious if the "randomness" of when my connection works and when it doesn't is caused by that web socket timing out or only being valid within a certain window?

More testing needed, but it seems that it will work once, then fail for a while, and then work again a few minutes later. When it doesn't work, the error is still the same as noted above.

Answer 9 · 2017-02-17T22:38:12.000Z

Sadly I wasn't able to work on this as I don't have access to a large slack organization.

@tmsteen How big is the organization you're using this gem for?

Answer 10 · 2017-02-18T02:10:05.000Z

5000+ but my intended audience within the org is much smaller.

Answer 11 · 2017-02-22T20:10:36.000Z

We also well facing the same issue setting {simple_latest: true, no_unreads:true} or store_class = nil makes no difference, is there any work around on this issue ?

Answer 12 · 2017-02-22T20:32:34.000Z

I think there's multiple factors that make debugging this hard:

The developers/others interested don't have access to an account large enough to replicate the problem.
Part of this is on the side of slack - that is, the initial payload will be parsed from a string to something else so a huge payload might be causing issues.

One way to solve these issues is for someone with an admin in a larger account to either:

Debug this themselves (i.e. profile, dump the payload, etc)
Give temporary access to someone who is interested in debugging.

Number two is rather undesirable though due to the security implications for the account given access to.

Answer 13 · 2017-02-22T20:36:13.000Z

One data point I have notice , if using just simple client from the sample with large organization connects and established connection. The issue happens while using Slack::RealTime::Client

require 'http'
require 'json'
require 'faye/websocket'
require 'eventmachine'

rc = JSON.parse HTTP.post("https://slack.com/api/rtm.start", params: {
token: ENV['SLACK_API_TOKEN']
})

url=rc['url']

EM.run {
ws = Faye::WebSocket::Client.new(url)

ws.on :open do |event|
p [:open]
end

ws.on :message do |event|
p [:message, JSON.parse(event.data)]
end

ws.on :close do |event|
p [:close, event.code, event.reason]
ws = nil
end
}

Answer 14 · 2017-02-22T20:57:28.000Z

Anyone who has this issue in their team, please try and hard-code a bigger timeout just like in #74. Does it solve this? If it does, we need to finish #74 with something cleaner and allow anyone to set it, and possibly increase the default.

Answer 15 · 2017-02-22T21:47:14.000Z

is timeout expose anywhere in version 0.10.1?

Answer 16 · 2017-02-22T22:01:23.000Z

No @asheynkman.

Answer 17 · 2017-02-23T00:31:23.000Z

@dblock we have increased time out and its all work great. Let us know when you planing to push into release.

Answer 18 · 2017-02-23T02:19:25.000Z

Can you help finish #74 @asheynkman? Can you copy-paste some code increasing the timeout at least?

Answer 19 · 2017-02-23T16:48:52.000Z

@dblock need to add spec / README and CHANGELOG entries and will push it.

Answer 20 · 2017-03-08T17:12:08.000Z

Seems #74 was closed. @asheynkman did you just implement those same changes or did you increase timeout another way?

My ability to connect to the RTM is still hit or miss.

Answer 21 · 2017-03-08T17:56:44.000Z

#74 was closed by its author, it needs to be "finished", please

Answer 22 · 2017-03-08T18:11:28.000Z

I've PRed #135 for timeout options. Can you please try @tmsteen?

Answer 23 · 2017-03-08T18:18:13.000Z

@asheynkman what timeout and open_timeout values did you use? I want to recommend something in the README.

Answer 24 · 2017-03-08T19:17:35.000Z

@dblock we used 500000 for both

Answer 25 · 2017-03-08T19:19:23.000Z

I just tested #135 with a timeout and opentimeout set at 1000 just to be sure. seems to be working consistently now. Tested with both the Web and RTM clients.

Edit: just saw the comment above.

Answer 26 · 2017-03-08T19:20:02.000Z

1000 should be more then enough , the units in seconds , so should be recommended value

Answer 27 · 2017-03-08T19:27:57.000Z

A 1000 seconds timeout seems like a generally bad idea, other than an "arbitrarily high number" :)

What does everyone think here about changing the default. What would you do?

Answer 28 · 2017-03-08T19:28:39.000Z

cc: @technicalpickles

Answer 29 · 2017-03-08T19:29:19.000Z

i just tested with 180 sec , with account mix of channels and users around 12K , took like 2 min to download

Answer 30 · 2017-03-08T19:30:01.000Z

It's crazy that rtm.start takes 2 minutes, fundamentally bad design for this api :(

Answer 31 · 2017-03-08T19:32:32.000Z

i don't think there are any other options unless slack redesigns their api's

Answer 32 · 2017-03-08T19:32:55.000Z

I'm just venting @asheynkman :)

Answer 33 · 2017-03-08T19:36:31.000Z

I've merged #135. To close this issue we need to increase the default timeout value for rtm_start only (not for all web API calls).

Answer 34 · 2017-03-08T20:33:31.000Z

Alright, check out #136, comment on!

Answer 35 · 2017-04-29T20:54:29.000Z

Slack added rtm.connect, which no longer sends any more information than needed to the client on start unlike rtm.start. I've implemented support for it in slack-ruby-client in #145, you should try slack-ruby-client#HEAD, it has dramatically improved the memory footprint and connection speed of my game bot that currently has 385 active teams and should remove the need for a long timeout.