DoSomething/legacy-website

Signups endpoint is timing out.

Opened this issue · 12 comments

BUG

Current Behavior

The /v1/api/signups endpoint in Phoenix seems to either timing out or returning 500s. It seems like the timeouts started to occur roughly around the same time that we deployed the fix for #7329 (the diff for that deploy is in links below).

Desired Behavior

It should return a successful response in a reasonable amount of time.

Steps to Replicate

curl "https://www.dosomething.org/api/v1/signups?count=10"
HTTP/1.1 500 Internal Server Error
Content-Type: application/json
X-Drupal-Cache: MISS
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Cache-Control: no-cache, must-revalidate
X-Content-Type-Options: nosniff
X-Frame-Options: Allow-From: http://optimizely.com
Via: 1.1 varnish
Set-Cookie: USDRUP2=us-varnish-1; path=/
Access-Control-Allow-Origin: *
Fastly-Restarts: 1
Content-Length: 2
Accept-Ranges: bytes
Date: Mon, 27 Mar 2017 19:46:57 GMT
Via: 1.1 varnish
Connection: close
X-Fastly-Original: /api/v1/signups?count=10
X-Fastly-Lookup: None
X-Served-By: cache-iad2647-IAD
X-Cache: MISS
X-Cache-Hits: 0
X-Timer: S1490643973.316112,VS0,VE44432
Vary: Accept
X-Fastly-Country-Code-Client: US

Why This Matters

This is causing errors on SMS messaging in Gambit and we want to send a broadcast this afternoon!

💣

Relevant Screenshots + Links

@blisteringherb found the error for the example 500 request that I included above:

PHP Fatal error:  Allowed memory size of 1073741824 bytes exhausted (tried to allocate
16384 bytes) in Unknown on line 0" while reading response header from upstream, client:
104.156.95.47, server: localhost, request: "GET /api/v1/signups?count=10 HTTP/1.1",
upstream: "fastcgi://unix:/run/php/php7.0-fpm.sock:", host: "www.dosomething.org"

Possibly something minor, but is there a chance the wrong data/structure is being sent in here?

Mar 27 15:24:24 ip-10-100-35-11 drupal: https://www.dosomething.org|1490642664|php|104.156.95.47|https://www.dosomething.org/api/v1/signups?count=10||2800057||Warning: Invalid argument supplied for foreach() in dosomething_helpers_extract_field_data() (line 179 of /var/www/www.dosomething.org/releases/20170327182454/lib/modules/dosomething/dosomething_helpers/dosomething_helpers.module).
23.235.46.30 - - [27/Mar/2017:14:32:41 -0400] "GET /api/v1/signups?count=100&user=8771655 HTTP/1.1" 200 2137 "-" "GuzzleHttp/6.2.1 curl/7.35.0 PHP/7.0.17-2+deb.sury.org~trusty+1"
104.156.95.47 - - [27/Mar/2017:15:24:49 -0400] "GET /api/v1/signups?count=10 HTTP/1.1" 500 12 "-" "GuzzleHttp/6.2.1 curl/7.35.0 PHP/7.0.17-2+deb.sury.org~trusty+1"
23.235.46.33 - - [27/Mar/2017:15:25:20 -0400] "GET /api/v1/signups?count=10 HTTP/1.1" 500 12 "-" "GuzzleHttp/6.2.1 curl/7.35.0 PHP/7.0.17-2+deb.sury.org~trusty+1"
23.235.46.38 - - [27/Mar/2017:15:46:35 -0400] "GET /api/v1/signups?count=10 HTTP/1.1" 500 12 "-" "Paw/3.0.16 (Macintosh; OS X/10.12.3) GCDHTTPRequest"
104.156.95.47 - - [27/Mar/2017:15:46:57 -0400] "GET /api/v1/signups?count=10 HTTP/1.1" 500 12 "-" "Paw/3.0.16 (Macintosh; OS X/10.12.3) GCDHTTPRequest"
23.235.46.25 - - [27/Mar/2017:15:51:26 -0400] "GET /api/v1/signups?count=100&user=1940652 HTTP/1.1" 200 10691 "-" "GuzzleHttp/6.2.1 curl/7.35.0 PHP/7.0.17-2+deb.sury.org~trusty+1"
23.235.46.25 - - [27/Mar/2017:15:53:14 -0400] "GET /api/v1/signups?count=100&user=1940652 HTTP/1.1" 200 10691 "-" "GuzzleHttp/6.2.1 curl/7.35.0 PHP/7.0.17-2+deb.sury.org~trusty+1"
23.235.46.25 - - [27/Mar/2017:15:53:44 -0400] "GET /api/v1/signups?count=100&user=1940652 HTTP/1.1" 200 10691 "-" "GuzzleHttp/6.2.1 curl/7.35.0 PHP/7.0.17-2+deb.sury.org~trusty+1"
104.156.95.35 - - [27/Mar/2017:15:54:23 -0400] "GET /api/v1/signups?count=100&user=1940652 HTTP/1.1" 200 10691 "-" "GuzzleHttp/6.2.1 curl/7.35.0 PHP/7.0.17-2+deb.sury.org~trusty+1"

Yeah, it seems like it works when you whittle down the query by specifying a user to filter by.

@aaronschachter What's an example request URL from Gambit?

nginx/dosomething-error.log:2017/03/27 15:24:49 [error] 1336#0: *50685 FastCGI sent in stderr: "PHP message: PHP Fatal error:  Allowed memory size of 1073741824 bytes exhausted (tried to allocate 4096 bytes) in /var/www/www.dosomething.org/releases/20170327182454/html/includes/database/database.inc on line 2227
nginx/dosomething-error.log:PHP message: PHP Fatal error:  Allowed memory size of 1073741824 bytes exhausted (tried to allocate 16384 bytes) in Unknown on line 0" while reading response header from upstream, client: 104.156.95.47, server: localhost, request: "GET /api/v1/signups?count=10 HTTP/1.1", upstream: "fastcgi://unix:/run/php/php7.0-fpm.sock:", host: "www.dosomething.org"
nginx/dosomething-error.log:2017/03/27 15:25:20 [error] 1336#0: *50608 FastCGI sent in stderr: "PHP message: PHP Fatal error:  Allowed memory size of 1073741824 bytes exhausted (tried to allocate 4096 bytes) in /var/www/www.dosomething.org/releases/20170327182454/html/includes/database/database.inc on line 2227
nginx/dosomething-error.log:PHP message: PHP Fatal error:  Allowed memory size of 1073741824 bytes exhausted (tried to allocate 16384 bytes) in Unknown on line 0" while reading response header from upstream, client: 23.235.46.33, server: localhost, request: "GET /api/v1/signups?count=10 HTTP/1.1", upstream: "fastcgi://unix:/run/php/php7.0-fpm.sock:", host: "www.dosomething.org"
nginx/dosomething-error.log:2017/03/27 15:46:35 [error] 1337#0: *52692 FastCGI sent in stderr: "PHP message: PHP Fatal error:  Allowed memory size of 1073741824 bytes exhausted (tried to allocate 4096 bytes) in /var/www/www.dosomething.org/releases/20170327182454/html/includes/database/database.inc on line 2227" while reading response header from upstream, client: 23.235.46.38, server: localhost, request: "GET /api/v1/signups?count=10 HTTP/1.1", upstream: "fastcgi://unix:/run/php/php7.0-fpm.sock:", host: "www.dosomething.org"
nginx/dosomething-error.log:2017/03/27 15:46:57 [error] 1337#0: *52692 FastCGI sent in stderr: "PHP message: PHP Fatal error:  Allowed memory size of 1073741824 bytes exhausted (tried to allocate 4096 bytes) in /var/www/www.dosomething.org/releases/20170327182454/html/includes/database/database.inc on line 2227
nginx/dosomething-error.log:PHP message: PHP Fatal error:  Allowed memory size of 1073741824 bytes exhausted (tried to allocate 16384 bytes) in Unknown on line 0" while reading response header from upstream, client: 104.156.95.47, server: localhost, request: "GET /api/v1/signups?count=10 HTTP/1.1", upstream: "fastcgi://unix:/run/php/php7.0-fpm.sock:", host: "www.dosomething.org"

Interesting – I'm getting a successful response when querying Phoenix's /api/v1/signups endpoint with my Northstar or Drupal ID, but an empty response when I use Northstar proxy...

# good, returns full response
curl "https://www.dosomething.org/api/v1/signups?count=10&user=5543dfd6469c64ec7d8b46b3" \

# good, returns full response
curl "https://www.dosomething.org/api/v1/signups?count=10&user=1258186" \

# bad, returns { "data": [] }
curl "https://northstar.dosomething.org/v1/signups?user=5543dfd6469c64ec7d8b46b3&count=10" \
     -H "Accept: application/json" \
     -H "Authorization: Bearer ***** Hidden credentials *****" \
     -H "Content-Type: application/json; charset=utf-8"

For the Northstar empty-response, my guess is that something weird is happening with how the drupalIdForNorthstarId method in Northstar interacts with the new ID logic in Phoenix. But not really sure if that's the issue since we're seeing timeouts.

I made and deployed the change in Gambit today to query Phoenix instead of Northstar (https://github.com/DoSomething/gambit/pull/830) per https://dosomething.slack.com/archives/C02BBRN5A/p1490634446294290

If the Northstar proxy stays broken, the mobile apps stay broken.

If we drop support for the mobile apps completely (which.. we kind of have if we don't fix the proxy or update the apps), we may not have a need to join on things like Reportback Items or Campaign information anymore on this Signups index -- which could be helpful for performance...

While load testing Gambit thor, it became apparent to me that using an aggressive concurrency setting ended in a high error rate of requests made to Northstar thor. Requests would start to take longer to respond very rapidly. Once it passed the 30 second mark, Heroku router interrupted the request and logged an H12 Timeout error.

@DFurnes to answer #7333 (comment) Gambit makes 2 requests to the Signups endpoint:

  • GET signups?user=:id:&campaigns=:id: for one Campaign ID, used by the chatbot route
  • GET signups/:id for a Signup ID, used by the signups route

Investigating in New Relic, and one idea that comes to mind is introducing a query parameter to exclude Reportback Files. The mobile apps (and possibly other DS apps?) need Reportback File image URL's to render User profiles, but Gambit doesn't ever use them -- it just cares about the Signup's reportback.quantity to determine whether the current User has completed the Campaign (side note: Gambit also currently requires the Signup user.id to exist)

Adding the parameter and conditionally joining on RB files could be a way to improve Gambit performance without breaking the mobile app User profiles.

Here's some examples of 5 minute GET signups requests I've found from poking around New Relic for Gambit external services

gambit-response-times

From Phoenix New Relic:
phoenix-services

Note: User profiles in mobile app are currently broken while the Northstar Signups proxy is broken.