badges/shields

Badge Images Often Fail To Load In Github README

Closed this issue · 33 comments

I've noticed that at least 50% of the time one or more badges on the READMEs of my various GitHub projects fail to display. I'm on a very fast connection (~ 100MBS).

In the error console:

Failed to load resource: the server responded with a status of 504 (Gateway Timeout)

The failing URLs are not the URLs I added to the badges in the README, but point to some kind of GitHub cache:

URL from badge: https://img.shields.io/npm/v/blain.svg
URL of error: https://camo.githubusercontent.com/fa71495d8e006d53927660ed22594c3e7097c5a6/68747470733a2f2f696d672e736869656c64732e696f2f6e706d2f762f626c61696e2e737667
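
The long trailing segment of the cache URL appears to be nothing more than the original badge URL, hex-encoded, which makes it easy to check which badge a failing request maps to. A minimal sketch in Node.js (the helper name `decodeCamoUrl` is made up for illustration, and it assumes the last path segment is always the hex-encoded source URL, as in the example above):

```javascript
// Hypothetical helper: recover the original image URL from a cache URL.
// Assumes the last path segment is the hex-encoded source URL.
function decodeCamoUrl(camoUrl) {
  const hex = new URL(camoUrl).pathname.split('/').pop();
  return Buffer.from(hex, 'hex').toString('utf8');
}

const camoUrl =
  'https://camo.githubusercontent.com/fa71495d8e006d53927660ed22594c3e7097c5a6/' +
  '68747470733a2f2f696d672e736869656c64732e696f2f6e706d2f762f626c61696e2e737667';

console.log(decodeCamoUrl(camoUrl));
// prints "https://img.shields.io/npm/v/blain.svg"
```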

  • Multiple refreshes usually get one or all of the badges to load correctly.
  • Tested in Chrome, Safari and Firefox
  • I have seen this a lot recently on other projects' READMEs

Example Repos

Hi, thanks for raising this issue. I've observed this behavior too; I'm sure many people can corroborate.

If you look at https://status.shields-server.com/ and click on one server at a time, you'll see that response times sometimes spike. It's not about the speed of your connection; rather, it's some combination of our servers' capacity and the upstream services being slow or rate limiting us. GitHub serves images through a proxy, and the 504 Gateway Timeout means that the Shields server took too long to respond to the proxy, so the proxy gave up.

I would love to put work into making Shields more reliable. I think the fix is to add server capacity and, given that we're not going to make upstream rate limiting go away, to cache much more aggressively, through several means:

  • Excess server capacity and/or elastic scaling so we can handle traffic spikes (currently there are three virtual machines, period)
  • Caching pieces of data from API responses (not just computed badge text, as we do now)
  • Basing cache priority on frequency (not just recency)
  • Sharing cache data between servers (not cache per server as now)
  • Bigger caches (requires more memory than our virtual machines have)
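
To make the "frequency, not just recency" idea concrete, here is a rough sketch of a small cache that evicts the least-requested entry when it is full. It is only an illustration under simple assumptions (a single process, a linear scan on eviction); the class name `FrequencyCache` and the keys are hypothetical, and this is not how the Shields cache is actually implemented.

```javascript
// Illustrative frequency-aware cache: when full, drop the entry with the
// fewest hits rather than the one touched least recently.
class FrequencyCache {
  constructor(capacity) {
    this.capacity = capacity;
    this.entries = new Map(); // key -> { value, hits }
  }

  get(key) {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    entry.hits += 1;
    return entry.value;
  }

  set(key, value) {
    const existing = this.entries.get(key);
    if (existing !== undefined) {
      existing.value = value;
      return;
    }
    if (this.entries.size >= this.capacity) {
      this.evictLeastUsed();
    }
    this.entries.set(key, { value, hits: 1 });
  }

  evictLeastUsed() {
    let victimKey;
    let fewestHits = Infinity;
    // Map iterates in insertion order, so ties go to the oldest entry.
    for (const [key, entry] of this.entries) {
      if (entry.hits < fewestHits) {
        fewestHits = entry.hits;
        victimKey = key;
      }
    }
    this.entries.delete(victimKey);
  }
}

const cache = new FrequencyCache(2);
cache.set('badge-a', 'A'); // placeholder keys and values
cache.set('badge-b', 'B');
cache.get('badge-a');      // badge-a is now the more popular entry
cache.set('badge-c', 'C'); // cache is full, so badge-b is evicted
```

A purely recency-based cache would make the same choice in this tiny example, but the two policies diverge as soon as a rarely requested badge happens to be the most recent one.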

Our server budget is extremely limited, and frankly we need a significantly larger budget to consider any of these options.

We ask developers who know and love shields to please make a one-time $10 donation. If you've already given, please ask your developer friends to do the same, or solicit big donations from big projects / companies who use Shields.

https://opencollective.com/shields

Also open to promotion ideas, ideas that don't take money, and in general discussing further!

In my case it's more like

  • In 90% of all cases at least one badge is not loaded.
  • In 50% of all cases at least 2 badges are not loaded.

screen shot 2018-03-22 at 15 46 17

Failed to load resource: the server responded with a status of 504 (Gateway Timeout)
Failed to load resource: the server responded with a status of 504 (Gateway Timeout)
Failed to load resource: the server responded with a status of 504 (Gateway Timeout)
Failed to load resource: the server responded with a status of 504 (Gateway Timeout)
Failed to load resource: the server responded with a status of 504 (Gateway Timeout)

g105b commented

Is this simply a case of not having enough server capacity? If so, would you mind letting us know the specifics of what server is being used, where it is located, and any details regarding bandwidth?

Just a quick test of the github timeouts:

1 second delay:
1 second
2 second delay:
2 seconds
3 second delay:
3 seconds
3.9 second delay:
3 seconds
3.95 second delay:
3 seconds
4 second delay:
4 seconds

Edit:
Removed the 5 and 6 second delays as they are not needed.
4 seconds seems to always time out;
3.95 seconds looks to be okay.
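
Put as code, the behaviour observed in this test looks roughly like the sketch below. The 4000 ms cut-off is taken from the results above and is an assumption, not a documented proxy setting; `proxyStatus` is a made-up name.

```javascript
// Assumption: the ~4000 ms cut-off comes from the informal test above,
// not from any documented proxy configuration.
const OBSERVED_TIMEOUT_MS = 4000;

// Returns the HTTP status a reader would see for a badge whose origin
// needs `upstreamDelayMs` to respond.
function proxyStatus(upstreamDelayMs, timeoutMs = OBSERVED_TIMEOUT_MS) {
  return upstreamDelayMs < timeoutMs ? 200 : 504;
}

console.log(proxyStatus(3950)); // prints 200
console.log(proxyStatus(4000)); // prints 504
```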

> Is this simply a case of not having enough server capacity? If so, would you mind letting us know the specifics of what server is being used, where it is located, and any details regarding bandwidth?

@g105b Server capacity, yes, combined with more aggressive caching. See my comment above: #1568 (comment)

There are three servers, single-core VPSes with 2 GB RAM: VPS SSD 1 from OVH. One is in Gravelines, France, and I believe the other two are in Quebec, Canada.

@RedSparr0w Thanks for those tests!

To everyone following this issue, if you know and love Shields, please make a one-time $10 donation if you haven't already, and ask your friends to do the same! https://opencollective.com/shields

I've noticed a trend over the past few days: server response times around 7am-10am and 1pm-3pm (UTC) are a lot higher than usual.
I suspect this is when most of the badges are failing (due to GitHub timing out after 4 seconds).
image
@espadrine Is there anything in the logs that would suggest a much higher amount of traffic from any particular sources during those times?

Been tracking how often the badges have a response time over 4 seconds here, and still seems to be consistent with the above.

Between 7am-10am and 1pm-3pm, response times are a lot higher than normal, causing the images to time out when loading on GitHub:
chart
During the weekend response times were pretty good:
image
On Monday and Tuesday response times were above 4 seconds for almost all of the peak hours:
image
Note: times are UTC
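
For anyone who wants to do the same kind of tracking, the number that matters is the share of requests that reach the 4 second limit. A minimal sketch, with made-up sample values and a hypothetical function name:

```javascript
// Fraction of measured response times that reach the proxy limit.
// The 4000 ms default reflects the timeout observed in this thread.
function shareOverLimit(responseTimesMs, limitMs = 4000) {
  if (responseTimesMs.length === 0) return 0;
  const slow = responseTimesMs.filter((ms) => ms >= limitMs).length;
  return slow / responseTimesMs.length;
}

// Made-up measurements in milliseconds, for illustration only.
const samples = [850, 1200, 4300, 5100, 3900];
console.log(shareOverLimit(samples)); // prints 0.4
```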

Anything new regarding this issue? This is now the case 99% of the time - I just don't see any pypi badges working for my repositories. This is fixed temporarily if I go and look at the badge directly (for example, at https://img.shields.io/pypi/v/pdpipe.svg ).

Seems like moving away from badges is a good idea (or at least reducing them to a minimum, e.g. travis-build).

I'm hoping to get better results for the "important" badges this way.

I have set maxAge=3600 for all my badges and added Shields as a GitHub application but the problem still happens.
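
For reference, `maxAge` goes on the badge URL as a query parameter. A small sketch of adding it programmatically (the helper name `withMaxAge` is hypothetical; the parameter name is the one used in the comment above):

```javascript
// Append or overwrite the maxAge query parameter on a badge URL.
function withMaxAge(badgeUrl, seconds) {
  const url = new URL(badgeUrl);
  url.searchParams.set('maxAge', String(seconds));
  return url.toString();
}

console.log(withMaxAge('https://img.shields.io/pypi/v/pdpipe.svg', 3600));
// prints "https://img.shields.io/pypi/v/pdpipe.svg?maxAge=3600"
```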

@paulmelnikow - that is a super write-up of the problem. Thanks for doing that.

> Also open to promotion ideas, ideas that don't take money, and in general discussing further!

I have a suggestion that doesn't take money, and might help the load by enabling downstream proxies (such as GitHub's) to cache the responses.

I freely accept that I'm completely out of my comfort zone with this, but rather than adding the caching on the server side I think it would be worth considering changing the response headers to include caching:

pelson@~> curl https://img.shields.io/conda/dn/conda-forge/iris.svg
> GET /conda/dn/conda-forge/iris.svg HTTP/1.1
> Host: img.shields.io
> User-Agent: curl/7.55.1
> Accept: */*
> 
< HTTP/1.1 200 OK
< Date: Wed, 06 Jun 2018 09:41:09 GMT
< Content-Type: image/svg+xml;charset=utf-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< Set-Cookie: __cfduid=dcaabaa<snip>8053; expires=Thu, 06-Jun-19 09:40:53 GMT; path=/; domain=.shields.io; HttpOnly
< Cache-Control: no-cache, no-store, must-revalidate
< Expires: Wed, 06 Jun 2018 09:40:59 GMT
< Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
< Server: cloudflare
< CF-RAY: 4269eb8a9f8334a6-LHR
< 
<svg xmlns="http://www.w3.org/2000/svg" ...

Specifically, the response < Cache-Control: no-cache, no-store, must-revalidate suggests to me that GitHub's proxy doesn't even have the option of caching existing responses.
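
A rough way to express that reading of the header in code: a shared cache is not allowed to store a response marked `no-store` or `private`. The sketch below is a simplification (it ignores quoted directive values and the other storage rules in the HTTP caching spec), and the function names are made up.

```javascript
// Simplified Cache-Control parser: splits on commas and ignores quoting.
function parseCacheControl(headerValue) {
  const directives = new Map();
  for (const part of headerValue.split(',')) {
    const [name, value = ''] = part.trim().split('=');
    if (name) directives.set(name.toLowerCase(), value);
  }
  return directives;
}

// True when nothing in the header forbids a shared cache (such as a
// proxy) from storing the response.
function sharedCacheMayStore(headerValue) {
  const directives = parseCacheControl(headerValue);
  return !directives.has('no-store') && !directives.has('private');
}

console.log(sharedCacheMayStore('no-cache, no-store, must-revalidate')); // prints false
console.log(sharedCacheMayStore('max-age=3600')); // prints true
```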

In addition, there is a stale-while-revalidate Cache-Control extension that appears to allow caches to return a stale response while the new content is being fetched.

I for one would be completely comfortable with a sensible cache period (an hour, 6 hours, etc.) along with stale-while-revalidate, so that users always get a response quickly, even if the response they are getting isn't the absolute latest information. I've no idea if GitHub's proxy supports it, but I can't see it being harmful.
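
As I understand the stale-while-revalidate extension, a cached response goes through three phases. Here is a minimal sketch of that logic, using an hour of freshness and a six hour stale window purely as example values; the function name and return labels are made up.

```javascript
// Classify a cached response by its age, for a header such as
// "Cache-Control: max-age=3600, stale-while-revalidate=21600".
function cacheState(ageSeconds, maxAge, staleWhileRevalidate) {
  if (ageSeconds < maxAge) {
    return 'fresh'; // serve from cache, no request to the origin
  }
  if (ageSeconds < maxAge + staleWhileRevalidate) {
    return 'stale-revalidate-in-background'; // serve stale copy, refresh it meanwhile
  }
  return 'expired'; // must wait for the origin
}

console.log(cacheState(120, 3600, 21600));   // prints "fresh"
console.log(cacheState(7200, 3600, 21600));  // prints "stale-revalidate-in-background"
console.log(cacheState(30000, 3600, 21600)); // prints "expired"
```

Only the last phase makes a reader wait on the origin server, which is the case that runs into the proxy timeout.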

Apologies if I've missed a conversation about caching headers - I can completely understand if there is a good reason that responses shouldn't be cached other than on the shields.io servers.

@paulmelnikow have you considered using Zeit Now instead?

I guess the relevant history on cache-control is #221, and the key line:

ask.res.setHeader('Cache-Control', 'no-cache, no-store, must-revalidate');
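
For comparison, a cache-friendly variant of that line might look like the sketch below. This is only an illustration of the idea being discussed, not the change that was eventually made in Shields; `setBadgeCacheHeaders` is a hypothetical helper, and `res` stands for any object with Node's `setHeader(name, value)` method.

```javascript
// Hypothetical helper: let downstream caches keep the badge for
// `maxAgeSeconds`, falling back to the existing no-cache header.
function setBadgeCacheHeaders(res, maxAgeSeconds) {
  if (maxAgeSeconds > 0) {
    res.setHeader('Cache-Control', `max-age=${maxAgeSeconds}`);
  } else {
    res.setHeader('Cache-Control', 'no-cache, no-store, must-revalidate');
  }
}

// Stand-in for an http.ServerResponse so the sketch runs on its own.
const fakeRes = {
  headers: {},
  setHeader(name, value) {
    this.headers[name] = value;
  },
};

setBadgeCacheHeaders(fakeRes, 3600);
console.log(fakeRes.headers['Cache-Control']); // prints "max-age=3600"
```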

FWIW, this is quite problematic. Even the shields repo is having problems:

bad badges

I donated $10 in the hope that this will get fixed soon! 🤞

kopax commented

I have the same error:

https://github.com/yeutech-lab/accept-dot-path/blob/master/README.md

image

Works fine on npm:

image

It seems that GitHub is not rendering those images correctly:

<img src="https://camo.githubusercontent.com/45aad6d50cc48a0e4ac9a1da135afdffa7795359/68747470733a2f2f696d672e736869656c64732e696f2f6e6f64652f762f40796575746563682d6c61622f6163636570742d646f742d706174682e7376673f7374796c653d666c6174" alt="npm Version" data-canonical-src="https://img.shields.io/node/v/@yeutech-lab/accept-dot-path.svg?style=flat" style="max-width:100%;">

This would be the expected value:

<img src="https://img.shields.io/node/v/@yeutech-lab/accept-dot-path.svg?style=flat" alt="npm Version" data-canonical-src="https://img.shields.io/node/v/@yeutech-lab/accept-dot-path.svg?style=flat" style="max-width:100%;">

wei commented

@kopax this is the intended behavior on GitHub. Check out https://help.github.com/articles/about-anonymized-image-urls/

Right; and the reason you’re not seeing the badges is that GitHub's Camo requests time out after ~3 seconds.

@paulmelnikow I don't think it's Camo's fault though. Requests to e.g. https://img.shields.io/hexpm/v/meck.svg?style=flat-square take 5+ seconds (Camo seems to time out after 3 seconds, thus failing to fetch the original image, which results in the missing images in READMEs).

I'm frustrated by our server capacity and that I can't act on this myself without essentially forking.

However it's not Shields or the browser that's timing out, it's camo.

No proxy = Slow badges: https://www.npmjs.com/package/react-boxplot
Proxy = Flaky badges: https://github.com/paulmelnikow/react-boxplot

@paulmelnikow I frequently see https://img.shields.io/hexpm/v/meck.svg?style=flat-square taking over 10 seconds to complete, which would point to Shields being the issue?

Shields is definitely the reason they are slow! 😛

I think shields.io needs to set more aggressive Cloudflare caching options.

Hackage
Hackage-Deps

I found out that you can use a "trick" to reduce the traffic directed to img.shields.io and get better caching (avoiding broken images) by simply using the Google Cache. Add https://images1-focus-opensocial.googleusercontent.com/gadgets/proxy?container=focus&url= in front of your shield URL, see example above.
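
A sketch of building such a URL. The comment describes plain prepending; percent-encoding the badge URL is an added assumption here, so that a badge URL with its own query string is passed through as a single `url` parameter. The helper name `viaProxy` is made up.

```javascript
// Prefix quoted from the comment above.
const PROXY_PREFIX =
  'https://images1-focus-opensocial.googleusercontent.com/gadgets/proxy?container=focus&url=';

// Assumption: the badge URL is percent-encoded so that its own "?" and
// "&" are not read as parameters of the proxy URL.
function viaProxy(badgeUrl) {
  return PROXY_PREFIX + encodeURIComponent(badgeUrl);
}

console.log(viaProxy('https://img.shields.io/hexpm/v/meck.svg?style=flat-square'));
```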

> @paulmelnikow have you considered using Zeit Now instead?

I'm a fan of that idea. I just proposed it here a couple days ago: #1742 (comment)

I work in the CDN/proxy space and can validate that @pelson's response is the correct approach. Adding server capacity for what is essentially a misconfigured HTTP response is not an efficient use of donation money.

@joshenders There is work going on with headers in #1725 which has recently been merged, with #1806 being the next step to enabling it, and hopefully getting this issue fixed 🤞

@joshenders If you have a chance to read the discussion in #1725, please do!

The recent work to set longer cache headers has just gone live. I will be curious to see how much that helps.

It is very likely we also have a capacity issue, owing to ~10% growth over the last several months. I have proposed moving to Zeit Now to fix the capacity issue and solve our sysadmin bottleneck at the same time. This proposal is blocked awaiting response from @espadrine who owns the servers and load balancer.

I’m glad to say addressing the cache headers (#1723) has had a huge effect. Today’s peak traffic is being handled like weekend traffic, with 99% of requests coming in underneath the 4 second camo timeout. The only broken badges I’m seeing today are not ours. 😁

That gives us a little time to sort out our hosting. We’re still relatively slow on a number of badges, particularly the static badges which should be instant.

Uptimes are definitely getting better:
snapshot of the last 24 hours
average response time (24 hours)

Another weekday over 99%. 👍😌

If this problem recurs, or there are any other follow-on proposals, let’s open a new issue.

Still having issues with this on several READMEs.

Going to close and lock this issue as it's long been resolved but has a reasonably high potential to elicit follow-on comments.

For anyone else that stumbles upon this one...

This 3+ year old issue (as of the time of this post) was originally reflective of the fact that the Shields project experienced a lot of growth that was overwhelming the minimal runtime environment back then, and the overloaded Shields servers were often unable to serve the requested badges within the window enforced by GitHub/Camo. That in turn would result in timeouts/badges not being rendered on GitHub readme pages.

This has long since been resolved with various runtime improvements and caching mechanisms, and today Shields is serving up more than 750 million badges per month without issue. It is of course still possible that one may see a badge that failed to render in GitHub from time to time, but this isn't related to the widespread and persistent issues that originated this issue.

If anyone has questions/reports/etc. about badges not rendering, please open a new issue and/or ping us on Discord with all the relevant details, including screenshots and the badges/badge types.

Please also note that the GitHub/Camo imposed time limits for rendering images are still in place, so it's not entirely uncommon to see rendering challenges with certain badges like the Dynamic and/or Endpoint badges, particularly if those endpoints are running on a platform that periodically shuts them down (like the Heroku free tier). This can happen because there is a rather tight time window for the entire badge request/response flow to complete, and after receiving a badge request the Shields servers almost always have to first fetch data from some upstream endpoint which does not always provide the needed data quickly enough.