cedadev/cf-checker

CF Checker execution speed

martinjuckes opened this issue · 6 comments

Discussions on the WIP about use of the CF Checker to test CMIP6 files before publication indicate that it is currently too slow to be used in the CMIP6 publication workflow.
I don't know what the requirements are here, but I've looked into it and found that we can speed it up by around a factor of 10 by caching the CF standard name table and the area type table in Python shelve files. For the test_files/cell_methods.nc file, this change reduces the execution time on my Dell laptop from 2-8 seconds to around 0.2 seconds. (I have no idea why there is such a great range; clearly it would be good to have more extensive tests in a more realistic environment.)
I'm going to create a branch to share my proof-of-concept code, and then discuss with the WIP whether such a change would make it possible to test all CMIP6 data files.
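
For readers unfamiliar with the approach, the caching idea described above can be sketched roughly as follows. This is not the proof-of-concept code from the branch: the function name `load_table_cached`, the cache key, and the stand-in parser `slow_parse` are all hypothetical, and the 600-second default simply mirrors the cache duration quoted later in this thread.

```python
import os
import shelve
import tempfile
import time


def load_table_cached(cache_path, key, parse_func, max_age=600):
    """Return a parsed table, re-parsing only when the cached copy
    is missing or older than max_age seconds.

    Hypothetical helper: the checker's real function names may differ.
    """
    with shelve.open(cache_path) as db:
        entry = db.get(key)
        if entry is not None and time.time() - entry["stored_at"] < max_age:
            # Cache hit: skip the expensive download/parse step.
            return entry["table"]
        # Cache miss or stale entry: parse again and store with a timestamp.
        table = parse_func()
        db[key] = {"stored_at": time.time(), "table": table}
        return table


# Stand-in for the expensive step (e.g. parsing the standard name table).
calls = []


def slow_parse():
    calls.append(1)
    return {"air_temperature": "K", "sea_water_salinity": "1e-3"}


cache_path = os.path.join(tempfile.mkdtemp(), "cf_tables")
first = load_table_cached(cache_path, "standard_names", slow_parse)
second = load_table_cached(cache_path, "standard_names", slow_parse)  # served from the shelve file
```

Because the shelve file lives on disk, the saving carries over between separate runs of the checker, which is where most of the gain would come from when checking many files in sequence.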

Hi Martin,
Thanks for raising the speed issue. Just to echo Bryan that we are planning a significant rewrite of the CF Checker and will definitely look at making it more efficient as part of this.

In the meantime please do share your proof-of-concept, and if the WIP are happy with the improvement and we're happy with the changes, we could hopefully incorporate it into the next release, which I'm working on as we speak. I'm hoping this will be the last major release before we rework the CF Checker.

Regards,
Ros

Hi Ros,
thanks. I don't want to mess up your development plans, but it looks as though the current execution speed would prevent the checker being used for CMIP6, which would be a shame. I've had positive feedback from Karl, but don't yet have confirmation that the gain in speed for these fairly limited changes is enough.
They are fairly simple and localised changes and do offer a substantial time saving, so it would be good if they could be incorporated.

Regards,
Martin

Hello Ros, when would you need to have the branch finalised to get it into the next release? The WIP likes it, but it may need some refinement of the cache duration settings (currently fixed at 600 seconds).

Hi Martin. My aim is to make a new release in a couple of weeks' time. I could delay a bit if required, but don't worry if it won't be ready by then as I am happy to make another release afterwards especially as this is for CMIP6.

Hi Martin, I've taken your proposed changes and applied them to the latest version. Everything runs fine and my group of test files now runs in a total of 27s as opposed to 1m 47s! I see the cache files are hard-wired to be under /tmp. I'll probably change this so that /tmp is the default but it can be overridden with an environment variable and/or option.

Added the option --cache_dir to specify where the cached tables are stored. If this option is not specified, the /tmp directory will be used.

Default cache time is 1 day and can be overridden as required.
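
As a rough illustration of how these settings could be resolved, here is a minimal sketch. Only `--cache_dir`, the /tmp default, and the 1-day default come from this thread; the environment variable name `CF_CHECKER_CACHE_DIR`, the `--cache_time_days` flag, the order of precedence, and the use of argparse are assumptions made for the example and may not match the released checker.

```python
import argparse
import os

DEFAULT_CACHE_DIR = "/tmp"
DEFAULT_CACHE_SECONDS = 24 * 60 * 60  # 1 day, as stated above


def resolve_cache_settings(argv, environ=os.environ):
    """Return (cache_dir, max_age_seconds) from the command line and environment.

    Assumed precedence: command-line option, then environment variable,
    then the built-in default.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--cache_dir", default=None)
    # Hypothetical flag name for overriding the cache lifetime.
    parser.add_argument("--cache_time_days", type=float, default=None)
    args = parser.parse_args(argv)

    # CF_CHECKER_CACHE_DIR is a hypothetical variable name.
    cache_dir = (
        args.cache_dir
        or environ.get("CF_CHECKER_CACHE_DIR")
        or DEFAULT_CACHE_DIR
    )
    if args.cache_time_days is None:
        max_age = DEFAULT_CACHE_SECONDS
    else:
        max_age = args.cache_time_days * 24 * 60 * 60
    return cache_dir, max_age


defaults = resolve_cache_settings([], environ={})
from_env = resolve_cache_settings([], environ={"CF_CHECKER_CACHE_DIR": "/scratch/cf"})
from_cli = resolve_cache_settings(
    ["--cache_dir", "/data/cache", "--cache_time_days", "0.5"],
    environ={"CF_CHECKER_CACHE_DIR": "/scratch/cf"},
)
```

Passing `environ` explicitly keeps the function easy to test without touching the real process environment.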