slack-ruby/slack-ruby-client

RTM connection fails in large org but works in small org

geekdave opened this issue · 36 comments

Hello,

I'm attempting to use a bot to connect Slack with Pagerduty: https://github.com/stripe-contrib/pagerbot

  • I set up the bot with a Slack bot access token.
  • I verified that the token works by using "curl" from the command line and was able to see the message getting posted in Slack.
  • I tested the bot on a large org (5000+ people) and it consistently fails to connect.
  • I tested the bot on a small org (15 people) (using a different token of course), and it worked fine.

I've worked with the developer of the bot @macobo, and he suggested I file an issue here.

I've created a gist with the error logs from heroku. The first log shows the errors when running this in the large org. The second log shows the successful connection with a smaller org:

I added some comments at the bottom noting some differences, notably:

  1. On large org I see this in the response, but not in the small org:
transfer-encoding: "chunked" 

Could there be an issue with the large number of members in the org or the number of channels causing something to stall? Is there a huge payload coming back from the RTM API due to my large org?

  1. There is a delay of 100+ seconds in the large org log, and the error:
Socket URL has expired

Thanks for looking!

UPDATE: I pinged Slack's API help team about this and got the following response:

My guess is the bot is timing out on the initial call to rtm.start, which is the method that's called to initialize an RTM connection — for large teams this can take a long time. Thinking aloud, it could also be a memory allocation issue (depending on your server config).

Any chance you can increase the timeout for the initial API call ?

We're working very (very) hard on improvements in this area, as you can imagine this is a performance problem across-the-board for very large teams.

/cc @macobo

Indeed, there's a very large payload coming back from the initial rtm.start with all the org information. Tweak the call to rtm.start with simple_latest: true and no_unreads, that reduces the size by a bit.

Then disable any client-side parsing of that data with Slack::RealTime::Client.new(store_class: nil), see if that helps at all.

There's a feature request on supplying a custom Faraday connection in #72. There's an unfinished pull request for adding a timeout option in #74. In the meantime you can probably test other connection options with some editing of https://github.com/slack-ruby/slack-ruby-client/blob/master/lib/slack/web/faraday/connection.rb or monkey-patching.

Thanks @dblock for the suggestions - we tested with these settings, but ran into a similar issue (didn't try store_class though). Will try events API instead.

Related to this issue: I think this should make client.start! throw in a way that the program exits or attempt to reconnect. Currently the program will just hang indefinitely and the root cause is not really visible if a logger is not specified.

Agreed @macobo, please contribute.

Did anyone develop a short term solution to this? Seeing the same issue and the D, [2017-02-17T13:38:12.283081 #5079] DEBUG -- Slack::RealTime::Concurrency::Eventmachine::Client#parse: �B{"type":"error","error":{"msg":"Socket URL has expired","code":1}} debug error.

It did randomly work for me one time with the suggestions from @dblock with regard to reducing the size of the data coming back, about the same as the commit that @macobo referenced above. But will not work again. As far as I can tell, nothing changed between runs.

UPDATE: May have been a corp network issue. Works when on regular network.

@tmsteen @macobo has either of you attempted to increase the timeout? maybe finish #74?

Not familiar with how to do that, sorry. Certainly willing to assist if you can start me in the right direction. But I am starting to think that after setting simple_latest, no_unreads, and store_class that the issue may only be partially related to those.

Right now, my connection works on a seemingly random interval. From the debug log I get this in all cases D, [2017-02-17T16:11:10.540318 #6791] DEBUG -- Slack::RealTime::Concurrency::Eventmachine::Client#write: GET /websocket/ZKsL1B2ulhjnvmg2pJDOFZLMmEKTxNnzcXGLJJHGXZpNbWjd0NQI9IUwb4Z4s-QynUGzlDvIVuogXhsoAcls4rEa4LI0ny9qvtn2CnwlukVyZdC-RPYEFvSAfi-dSywgMmQ9mCQDxzypcQd9o7Iztg3yxVdz2ciRk= HTTP/1.1

You probably understand the inner workings better than I do, but it seems that after the initial call to the Slack API, it returns a unique web socket to use for the rest of the communication. I am curious if the "randomness" of when my connection works and when it doesn't is caused by that web socket timing out or only being valid within a certain window?

More testing needed, but it seems that it will work once, then fail for a while, and then work again a few minutes later. When it doesn't work, the error is still the same as noted above.

Sadly I wasn't able to work on this as I don't have access to a large slack organization.

@tmsteen How big is the organization you're using this gem for?

5000+ but my intended audience within the org is much smaller.

We also well facing the same issue setting {simple_latest: true, no_unreads:true} or store_class = nil makes no difference, is there any work around on this issue ?

I think there's multiple factors that make debugging this hard:

  • The developers/others interested don't have access to an account large enough to replicate the problem.
  • Part of this is on the side of slack - that is, the initial payload will be parsed from a string to something else so a huge payload might be causing issues.

One way to solve these issues is for someone with an admin in a larger account to either:

  • Debug this themselves (i.e. profile, dump the payload, etc)
  • Give temporary access to someone who is interested in debugging.

Number two is rather undesirable though due to the security implications for the account given access to.

One data point I have notice , if using just simple client from the sample with large organization connects and established connection. The issue happens while using Slack::RealTime::Client

require 'http'
require 'json'
require 'faye/websocket'
require 'eventmachine'

rc = JSON.parse HTTP.post("https://slack.com/api/rtm.start", params: {
token: ENV['SLACK_API_TOKEN']
})

url=rc['url']

EM.run {
ws = Faye::WebSocket::Client.new(url)

ws.on :open do |event|
p [:open]
end

ws.on :message do |event|
p [:message, JSON.parse(event.data)]
end

ws.on :close do |event|
p [:close, event.code, event.reason]
ws = nil
end
}

Anyone who has this issue in their team, please try and hard-code a bigger timeout just like in #74. Does it solve this? If it does, we need to finish #74 with something cleaner and allow anyone to set it, and possibly increase the default.

is timeout expose anywhere in version 0.10.1?

@dblock we have increased time out and its all work great. Let us know when you planing to push into release.

Can you help finish #74 @asheynkman? Can you copy-paste some code increasing the timeout at least?

@dblock need to add spec / README and CHANGELOG entries and will push it.

Seems #74 was closed. @asheynkman did you just implement those same changes or did you increase timeout another way?

My ability to connect to the RTM is still hit or miss.

#74 was closed by its author, it needs to be "finished", please

I've PRed #135 for timeout options. Can you please try @tmsteen?

@asheynkman what timeout and open_timeout values did you use? I want to recommend something in the README.

@dblock we used 500000 for both

I just tested #135 with a timeout and opentimeout set at 1000 just to be sure. seems to be working consistently now. Tested with both the Web and RTM clients.

Edit: just saw the comment above.

1000 should be more then enough , the units in seconds , so should be recommended value

A 1000 seconds timeout seems like a generally bad idea, other than an "arbitrarily high number" :)

What does everyone think here about changing the default. What would you do?

i just tested with 180 sec , with account mix of channels and users around 12K , took like 2 min to download

It's crazy that rtm.start takes 2 minutes, fundamentally bad design for this api :(

i don't think there are any other options unless slack redesigns their api's

I'm just venting @asheynkman :)

lol

I've merged #135. To close this issue we need to increase the default timeout value for rtm_start only (not for all web API calls).

Alright, check out #136, comment on!

Slack added rtm.connect, which no longer sends any more information than needed to the client on start unlike rtm.start. I've implemented support for it in slack-ruby-client in #145, you should try slack-ruby-client#HEAD, it has dramatically improved the memory footprint and connection speed of my game bot that currently has 385 active teams and should remove the need for a long timeout.