Update caching strategy for ungridded data
heikoklein opened this issue · 3 comments
Is your feature request related to a problem? Please describe.
The cached ungridded data objects are re-evaluated very often (more or less daily), even though creating them takes several hours. The rules for cache rejection are too strict:
```
{
 'pyaerocom_version': '0.20.0',
 'newest_file_in_read_dir': 'data',
 'newest_file_date_in_read_dir': 1719824185.0,
 'data_revision': '20240627',
 'reader_version': '0.52_0.09',
 'ungridded_data_version': '0.22',
 'cacher_version': '1.12'}
```
The fields causing problems are:
- `pyaerocom_version`: pyaerocom is nowadays updated rapidly, e.g. 4 new releases and the same number of dev-releases, in total 8 new versions within 4 weeks, without any change to the obs-data.
- `newest_file_in_read_dir`/`newest_file_date_in_read_dir`:
  - This checks, besides the files, also the timestamps of directories; `data` is not a file but a directory.
  - This is based on `ctime`, which (on unix/linux) is the time of the last metadata change, but we would need the last modification time `mtime` (illustrated in the snippet below).
  - The linux timestamps give ~second resolution. In EEA 2022 we have 50+ files with the same ctime (1718640297.0), making the `newest_file_in_read_dir` not well-defined.
  - In addition, checking the timestamps of all files can take a very long time.
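For illustration, a small Python snippet showing the `ctime`/`mtime` difference on unix/linux (the file name is just a placeholder, not a real pyaerocom path): a pure metadata change such as `chmod` bumps `ctime` but leaves `mtime` untouched, so a `ctime`-based check can invalidate the cache even though the data content has not changed.

```python
import os
import time

path = "example_obs_file.dat"  # placeholder file name, for illustration only

# create a file and record its initial timestamps
with open(path, "w") as f:
    f.write("dummy content\n")
st = os.stat(path)
print("mtime:", st.st_mtime, "ctime:", st.st_ctime)

time.sleep(1)

# a pure metadata change (chmod) updates ctime but not mtime on unix/linux,
# so a ctime-based cache check would be triggered without any data change
os.chmod(path, 0o644)
st = os.stat(path)
print("mtime:", st.st_mtime, "ctime:", st.st_ctime)

os.remove(path)
```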
Describe the solution you would like to see
- `pyaerocom_version` should not be used for cache invalidation.
- `newest_file_in_read_dir`/`newest_file_date_in_read_dir` should be based on `mtime` rather than `ctime`.
- `newest_file_in_read_dir`/`newest_file_date_in_read_dir` should only be written, but not tested for. It is the responsibility of the database maintainer to update `data_revision` if the data is to be updated (a rough sketch of such a check follows below).
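A minimal sketch of what such a check could look like (names are illustrative only, not the actual pyaerocom cacher API): the newest-file fields are still written to the cache metadata for reference, but only the revision/version fields are compared when deciding whether the cache can be reused.

```python
# Hypothetical sketch; the real cache metadata layout in pyaerocom may differ.
FIELDS_CHECKED = (
    "data_revision",
    "reader_version",
    "ungridded_data_version",
    "cacher_version",
)
FIELDS_WRITTEN_ONLY = ("newest_file_in_read_dir", "newest_file_date_in_read_dir")


def cache_is_valid(cached_meta: dict, current_meta: dict) -> bool:
    """Reuse the cache only if the revision/version fields match.

    pyaerocom_version and the newest-file fields are intentionally ignored;
    updating data_revision is the responsibility of the database maintainer.
    """
    return all(cached_meta.get(k) == current_meta.get(k) for k in FIELDS_CHECKED)
```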
I think you nailed it. This seems like a better approach to the cache invalidation.
Although some file systems have a higher mtime resolution than just one second, it seems Python translates that to one-second resolution only. But I agree that searching through thousands of files is not a good idea for cache invalidation.
I also agree that we should not use pyaerocom_version for cache validation.
We need to make sure that all obs networks really provide a revision string. Not all might do that correctly as it is not mandatory.
If all do, basing the cache invalidation on data revision number and the ungridded revision number should do the job.
Thanks for the feedback. I will remove the dependency on the `pyaerocom_version` and switch from `ctime` to `mtime`.
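Roughly, the timestamp scan could then look something like this (a sketch only, not the actual cacher code): it uses `mtime` instead of `ctime` and skips directories, so an entry like `data` no longer shows up as the newest "file".

```python
import os


def newest_file_in_dir(read_dir: str) -> tuple[str, float]:
    """Return (name, mtime) of the most recently modified regular file.

    Sketch only (non-recursive for brevity): directories are skipped and
    mtime (last content modification) is used instead of ctime (last
    metadata change).
    """
    newest_name, newest_mtime = "", 0.0
    for entry in os.scandir(read_dir):
        if entry.is_file():
            mtime = entry.stat().st_mtime
            if mtime > newest_mtime:
                newest_name, newest_mtime = entry.name, mtime
    return newest_name, newest_mtime
```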
I will open a separate ticket to make a revision string from the file-readers or observation networks mandatory. This will take some more time, because we have to check all obs-readers.