UK National Charge Point Registry import is timing out
Closed this issue · 3 comments
It seems that the import for the UK National Chargepoint Registry is timing out.
What needs doing to make imports of large datasets less intensive?
Or is this just a Cloudflare caching issue, with the Cloudflare request timing out?
I'm guessing you have a script to invoke that, because I took the button away ages ago :) All large imports are subject to timeouts when run through the admin web page, and yes, you will hit the Cloudflare timeout first, but the import will keep going. There is a plan B.
When an import runs, we get all the latest data from the import, transform it into our own POI object model, then compare it against our current data set (either whole world or country specific, usually country specific) to see if we already have items we previously imported that we could now update, or if any of the new items are approximate duplicates (close distance, same network etc) of something we already have (we discard these). If items have no changes then we discard those as well. The fetch of all data to compare against and the comparison/deduplication stage are extremely brute-force and expensive.
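As a rough illustration of the comparison stage described above, here is a minimal Python sketch. It is not the actual OCM code (which is .net): the `POI` fields, the `classify` function, the match on a provider reference and the 50 m duplicate radius are all assumptions made for the example. It does show why the brute-force approach is expensive, since every unmatched imported item is checked against every existing item.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class POI:
    # Hypothetical, simplified stand-in for the real POI object model.
    provider_ref: Optional[str]  # ID assigned by the data provider, None if not imported
    title: str
    network: str
    lat: float
    lon: float


def distance_m(a: POI, b: POI) -> float:
    """Great-circle (haversine) distance between two POIs in metres."""
    r = 6371000.0
    p1, p2 = math.radians(a.lat), math.radians(b.lat)
    dphi = p2 - p1
    dlmb = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(h))


def classify(imported: List[POI], existing: List[POI],
             dup_radius_m: float = 50.0) -> Tuple[List[POI], List[POI]]:
    """Return (added, updated); unchanged items and approximate duplicates are discarded."""
    by_ref = {p.provider_ref: p for p in existing if p.provider_ref is not None}
    added, updated = [], []
    for item in imported:
        previous = by_ref.get(item.provider_ref)
        if previous is not None:
            if previous != item:
                updated.append(item)  # previously imported, now changed
            continue  # previously imported and unchanged: discard
        # Brute force: scan every existing POI for a close match on the same network.
        is_duplicate = any(
            e.network == item.network and distance_m(e, item) <= dup_radius_m
            for e in existing
        )
        if not is_duplicate:
            added.append(item)
    return added, updated
```

With N imported items and M existing items the duplicate scan is O(N × M) distance calculations, which is the part that hurts on a large country data set.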
Once we have our list of new/updated POIs we write them to the SQL database as added/updated items and refresh the POI cache (MongoDB).
A while ago I was working on an offline method to perform the import/deduplication on a different machine, then post the changes to the API as a batch. This works OK but it's not fully automated. The plan was to implement a .net worker service to run on Linux (because it's cheaper to run Linux VMs). I've run out of time/energy for that currently, which is a shame because it was nearly there; it's just a matter of pulling it all together. We currently also use a .net worker to wrap our API as a Linux systemd service. This hosts our 2 API mirrors, which constantly sync with the master API to local MongoDB instances and are load balanced through a Cloudflare worker. Our master API/website still runs on Windows/MongoDB/SQL Server.
So to make large imports less intensive we need to optimise comparison and de-duplication, and we need to fully offload the pre-import process so it can run on a different server. When we started in 2011 it ran as a little GUI/console app on the server itself, but as it got more complex it was useful to see numbers of duplicates etc and run at times of low load, so it moved to being part of the admin website.
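One common way to cut down the comparison cost, sketched here in Python purely as an illustration (this is not something the thread says OCM does or plans to do), is to bucket existing POIs into a coarse latitude/longitude grid so each imported item is only compared against items in its own and neighbouring cells. The cell size `CELL_DEG` and the `(lat, lon, id)` tuple layout are assumptions for the example.

```python
import math
from collections import defaultdict

# Assumed cell size: 0.01 degrees is roughly 1.1 km of latitude. The cell must be
# larger than the duplicate radius, which holds for a radius of tens of metres at
# UK latitudes; longitude cells get narrower towards the poles.
CELL_DEG = 0.01


def cell_of(lat, lon):
    """Map a coordinate to its integer grid cell."""
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))


def build_grid(points):
    """Index (lat, lon, id) tuples by grid cell."""
    grid = defaultdict(list)
    for point in points:
        grid[cell_of(point[0], point[1])].append(point)
    return grid


def nearby(grid, lat, lon):
    """Yield candidate points from the 3x3 block of cells around (lat, lon)."""
    cx, cy = cell_of(lat, lon)
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            # .get avoids creating empty cells in the defaultdict
            yield from grid.get((cx + dx, cy + dy), ())
```

The candidates returned by `nearby` still need the exact distance and network check, but that check then runs against a handful of items per import row rather than the whole country data set.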
Currently I've been running the imports manually, using the little GUI to prepare the batch JSON file and then uploading it, so the objective of the final process is to completely automate that. There also needs to be an API call to tell the database the Date Last Imported for the provider, as the manual process currently doesn't update that.
Well, any help is much appreciated! As background, I work full time developing https://certifytheweb.com, which is an app for Let's Encrypt/ACME certificates on Windows, and it's taking up all my time currently. I'm also not a regular explorer of new charging stations (I use 2 regularly), which makes my level of interest in public charging rather low at times. Meanwhile, the API is churning out 3 million queries a month/1TB of data, so while we don't have very many active contributors we do appear to have plenty of consumers.
The advantage of the OCM stuff having moved to .net core (which will soon be called .net 5) is that it's now running on very current technology rather than something that was heading towards legacy. Plus, most of it can now run on Docker/Linux (especially the API side), which makes it cheaper to scale, and working with very current technology is good for sharpening your skillset. The overall OCM software is a bit complex at the high level, but if you break it down into small chunks it's OK, and there are parts of it that are mostly unused or rarely updated.