Glench/ExtPay

Website down

endeffects opened this issue · 8 comments

@Glench Your website is down with an nginx error

The site is back up now. I'm actively looking into causes of the issue and how to mitigate it. I'll update here with more information.

ExtensionPay has been experiencing stability issues every night around 1am. Sometimes these lasted only a few minutes, but over the past few days the downtime has been nearly constant. The nominal cause of the issue is the server running out of memory. I've increased the server's memory, which seems to have made the site stable. I'll continue investigating to mitigate any further issues.

It turns out this helped but did not fix the issue. We are continuing to investigate.

Just went down a few minutes ago.

We believe we have found the underlying issue. We have deployed a temporary fix that should help. More details to come.

The site has been stable for many hours now. During the instability users would sometimes receive slow responses or HTTP 499 errors. We'll have a breakdown of the issue, causes, and fixes soon.

The site has continued to be stable with no appreciable downtime for a couple days now.

Details
Starting May 16, some short instabilities were detected in ExtensionPay's web services, mostly around 12:58am US/Eastern time. We upgraded our database backup software, which seemed to be contributing to the instability and causing spikes in CPU and memory usage. We also optimized some database parameters to improve performance.

Even so, each night the instabilities grew slightly longer, generally between 1am and 1:30am, until May 22, when extended instability impacted service through 9:30am. During this period, many clients received HTTP 499 or 500 errors or slow response times due to high CPU and memory usage on the server. Our response began around 6am. To mitigate the instability and buy time to investigate, we upgraded the server, which required only a few minutes of downtime and helped stabilize the site.

We discovered a bug in our caching code that caused a memory leak, slowly exhausting server resources; this appears to have been the main cause of the instability. A short-term fix was deployed at 9:30am on May 22, and a permanent fix was deployed yesterday around 10am. Additionally, we deployed more database performance optimizations, which have significantly reduced server resource usage.
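The report doesn't include the actual caching code, but the failure mode described, entries added to an in-process cache and never evicted, looks roughly like the hypothetical sketch below. The `BoundedCache` class, its size/TTL parameters, and the example keys are assumptions for illustration only, not ExtensionPay's real implementation.

```ts
// Hypothetical illustration of the leak pattern: an in-process cache that only
// ever adds entries will grow until the server runs out of memory.
const leakyCache = new Map<string, { value: unknown; cachedAt: number }>();

function cacheLeaky(key: string, value: unknown): void {
  leakyCache.set(key, { value, cachedAt: Date.now() }); // nothing ever evicts old keys
}

// One common fix (assumed here, not necessarily the fix that was deployed):
// bound the cache by entry count and age so stale entries get evicted.
class BoundedCache<V> {
  private entries = new Map<string, { value: V; cachedAt: number }>();

  constructor(private maxEntries: number, private ttlMs: number) {}

  size(): number {
    return this.entries.size;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() - entry.cachedAt > this.ttlMs) {
      this.entries.delete(key); // expired
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    // Maps iterate in insertion order, so the first key is the oldest insertion.
    if (this.entries.size >= this.maxEntries && !this.entries.has(key)) {
      const oldestKey = this.entries.keys().next().value;
      if (oldestKey !== undefined) this.entries.delete(oldestKey);
    }
    this.entries.set(key, { value, cachedAt: Date.now() });
  }
}

// Example usage with made-up keys and limits:
const cache = new BoundedCache<string>(10_000, 5 * 60 * 1000); // 10k entries, 5-minute TTL
cache.set("user:123", "cached-response");
console.log(cache.get("user:123"));
```

A bounded cache like this trades occasional recomputation for a hard ceiling on memory, which is usually the right trade-off on a small server.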

There is still a short period of instability (<1 minute) at 12:58am every night that we'll continue to investigate. Going forward, we now have more robust monitoring for instability issues as well as more automated testing for our caching layer.
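As a sketch of what an automated test for a caching layer could look like (again assuming the hypothetical `BoundedCache` from the sketch above, including its `size()` accessor), a regression test can insert far more keys than the configured capacity and assert that the entry count stays bounded:

```ts
import assert from "node:assert/strict";

// Regression test against the memory-leak pattern: hammer the cache with many
// more keys than its capacity and verify it never grows past its bound.
const MAX_ENTRIES = 1_000;
const testCache = new BoundedCache<number>(MAX_ENTRIES, 60_000);

for (let i = 0; i < 100_000; i++) {
  testCache.set(`key:${i}`, i);
}

assert.ok(testCache.size() <= MAX_ENTRIES, "cache grew past its configured bound");
console.log("ok: cache stayed within its configured bound");
```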

Thanks for the great work and the quick investigation.