UKHSA-Internal/coronavirus-dashboard-api-python-sdk

Internal server error

bhavesh0009 opened this issue · 13 comments

I am running the following code and it results in a 500 error. It may be because of high volume, but the documentation should at least specify what the maximum threshold is. Because of this, my app has stopped running; it provides insight into local spread and many other statistics.

    from uk_covid19 import Cov19API

    ltla_filter = ['areaType=ltla']
    cases_and_deaths = {
        "areaType": "areaType",
        "areaName": "areaName",
        "areaCode": "areaCode",
        "specimenDate": "date",
        "dailyLabConfirmedCases": "newCasesBySpecimenDate",
        "totalLabConfirmedCases": "cumCasesBySpecimenDate",
    }

    api = Cov19API(filters=ltla_filter, structure=cases_and_deaths)
    data = api.get_json()  # returns a dictionary

Error : "statusCode": 500, "message": "Internal server error", "activityId": "a7f790fe-e834-46d0-b8ed-0632c4b08274"

Hi, thanks for getting in touch and apologies for the delay.

When did this problem start?

We have been experiencing some infrastructure-related issues over the last 10 days. They are entirely out of our control and are under active investigation by people all over the planet. In the meantime, we have diverted part of the traffic onto the staging server, which may cause occasional timeouts, especially when traffic is high. For that reason, I have had to adjust the quota and rate limits for the API.

I will add a retry clause and increase the forwarding duration slightly - let's see if that helps.
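
In the meantime, if you need a stopgap on the client side, a simple retry with exponential backoff around get_json() should ride out the intermittent 500s. This is only a sketch - the attempt count and delays are arbitrary, and it catches exceptions broadly because the exact exception type the SDK raises on a 500 isn't pinned down here:

    import time

    from uk_covid19 import Cov19API

    def get_json_with_retries(api, attempts=5, base_delay=2):
        # Retry api.get_json() on failure, sleeping 2s, 4s, 8s, ... between
        # attempts. Exception is caught broadly because the exact exception
        # the SDK raises on a 500 is not guaranteed here.
        for attempt in range(attempts):
            try:
                return api.get_json()
            except Exception:
                if attempt == attempts - 1:
                    raise
                time.sleep(base_delay * 2 ** attempt)

    api = Cov19API(filters=['areaType=ltla'], structure={"areaName": "areaName"})
    data = get_json_with_retries(api)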

Right... the changes I made seem to have worked, and queries are running well again. I will increase the default timeout value in the SDKs and push a new version tomorrow.

@bhavesh0009 Are you still experiencing difficulty using the SDK to download the data or is the issue resolved?

I am still getting the issue. Same error.

I have observed that responses have been slow for a long time, but since yesterday it has been returning only 500 error codes.

And it started working again! Thanks so much. Hope it stays up.

Hi @xenatisch, we are currently experiencing timeouts and the occasional 500 error. We have not been able to retrieve data since yesterday. The problem is reproducible when using the API directly, e.g. by visiting https://api.coronavirus.data.gov.uk/v1/data?filters=areaType=nation;areaName=england&structure={"name":"areaName"} in a browser.
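
For reference, the equivalent request from Python, bypassing the SDK, looks like this - a sketch using requests, where the structure has to be JSON-encoded into the query string and the 30-second timeout is an arbitrary choice:

    import json

    import requests

    ENDPOINT = "https://api.coronavirus.data.gov.uk/v1/data"

    params = {
        "filters": "areaType=nation;areaName=england",
        "structure": json.dumps({"name": "areaName"}),
    }

    # 30s is an arbitrary but generous timeout, given the slow responses.
    response = requests.get(ENDPOINT, params=params, timeout=30)
    print(response.status_code)  # 200 when healthy; 500 during the outage
    print(response.json())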

Update: the API is working for us again and has been working for the past 12 hours.

@xenatisch I am again getting a 500 Internal Server Error for the same code.

Apologies @geeogi and @bhavesh0009, I've been working 18-hour shifts to get this thing fixed. I really didn't have the opportunity to check GitHub until now.

Following numerous discussions with service devs in the States, we have now identified a bug in the Cloud service that we use to manage the API. They are working to create a fix. In the meantime, I have implemented a workaround which should improve the service substantially.

It may also be worth mentioning that:

  • We have sustained a ~200% increase in traffic over the last week alone. During peak hours, the API is hit by up to 185,000 requests per minute. Now take into account the fact that we have been running on my DIY fixes - which included a makeshift load balancer - since the initial outage on the 25th of August all the way until today.

  • The latency in the UK South data centre, which started on Monday and increased our response time by 1,600% on average and by up to 26,500%, has now been resolved. There is now a mild increase in latency in the UK West data centre, but let's hope that it gets rectified soon.

  • We also had a substantial increase in Client Connection Failures between our API manager and the backend cache server (Redis), which was causing a lot of the requests to be directed straight to the servers. This filled up the capacity of the API manager pretty quickly, bringing the service to a halt. It's also not possible to restart it because it's fully managed. Under normal circumstances, each instance of this service should be able to sustain up to 6,000 requests per second - and we run 3 instances in 2 regions - but not if the requests keep getting held up in a queue due to increased latency. The number of such failures has now subsided, though I'm still keeping an eye on it.

  • There are now 60 dedicated servers - of which 22 are always warm - across 2 different regions and multiple availability zones. Requests are now routed to the best-performing endpoint at the DNS level.

  • To mitigate some of these issues, I had to impose a more aggressive throttling policy on API requests that come from sources other than the dashboard. It's still a fairly generous allowance and lets people download almost everything that they need; however, if someone starts hitting the server 3 times a second looking for new data - which, believe you me, hundreds of people do - then they're going to reach the quota and will have to wait for up to 300 seconds. A sketch of a polite client-side back-off follows this list.
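
Here is the back-off sketch mentioned above. It assumes the throttled response comes back as an HTTP 429 with an optional Retry-After header expressed in seconds - treat all of that as assumptions about the policy rather than documented behaviour:

    import time

    import requests

    def polite_get(url, params=None, max_wait=300):
        # One retry after sleeping out a throttled response. The 429 status
        # and a Retry-After header given in seconds are assumptions about
        # the throttling policy, not documented behaviour.
        response = requests.get(url, params=params, timeout=30)
        if response.status_code == 429:
            wait = int(response.headers.get("Retry-After", max_wait))
            time.sleep(min(wait, max_wait))
            response = requests.get(url, params=params, timeout=30)
        response.raise_for_status()
        return response.json()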

All in all, I think the performance should now be closer to what it used to be.

Let me know if you're still experiencing any issues or are continually getting a >=500 response code.

All the best.

@xenatisch thanks for the detailed response. We’ve been able to use the API successfully over the past few days. We do occasionally get a 500, in which case we try again later and usually get through.

Thanks so much for helping us access this data. We’re using it to power our site (covidlive.co.uk), and there must be thousands of others.

@xenatisch Thanks for the prompt and detailed response. Yes, I have noticed the API calls are super fast now. Thanks for all your hard work.

Thanks for your kind words @bhavesh0009 and @geeogi. I'm just doing my job.

We are always happy to help. Let me know if you experience any difficulty again.