ESGF/esgf-download

error on import-synda for existing db


`(esgpull) -bash-4.2$ esgpull self import-synda /gpfscmip/gpfsdata/esgf/synda-cmn/db/CMIP5/sdt.db
Found 810229 files to import, proceed? [y/n]: y
Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:--:--AttributeError: 'NoneType' object has no attribute 'upper'
See /gpfscmip/gpfsdata/esgf/esgpull1/log/esgpull-import_synda-2023-04-11_08-46-39.log for error log.
Aborted!
Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:--:--
(esgpull) -bash-4.2$ cat /gpfscmip/gpfsdata/esgf/esgpull1/log/esgpull-import_synda-2023-04-11_08-46-39.log
[2023-04-11 10:46:46] DEBUG root
Locals:
{
'self': SyndaFile(
file_id=297,
url='http://aims3.llnl.gov/thredds/fileServer/cmip5_css01_data/cmip5/output1/LASG-CESS/FGOALS-g2/lgm/fx/atmos/fx/r0i0p0/v20130314/areacella/areacella_fx_FGOALS-g2_lgm_r0i0p0.nc',
file_functional_id='cmip5.output1.LASG-CESS.FGOALS-g2.lgm.fx.atmos.fx.r0i0p0.v20130314.areacella_fx_FGOALS-g2_lgm_r0i0p0.nc',
filename='areacella_fx_FGOALS-g2_lgm_r0i0p0.nc',
local_path='CMIP5/output1/LASG-CESS/FGOALS-g2/lgm/fx/atmos/fx/r0i0p0/v20130314/areacella/areacella_fx_FGOALS-g2_lgm_r0i0p0.nc',
data_node='aims3.llnl.gov',
checksum=None,
checksum_type=None,
duration=None,
size=42760,
rate=None,
start_date=None,
end_date=None,
crea_date='2020-11-03 14:44:18.596992',
status='done',
error_msg=None,
sdget_status=None,
sdget_error_msg=None,
priority=1000,
tracking_id='7181939e-4b39-4eaf-a4be-85eae5b5a9e9',
model='FGOALS-g2',
project='CMIP5',
variable='areacella',
last_access_date=None,
dataset_id=97,
insertion_group_id=1,
timestamp='2013-03-12T17:25:11Z'
),
'file_id': 'cmip5.output1.LASG-CESS.FGOALS-g2.lgm.fx.atmos.fx.r0i0p0.v20130314.areacella_fx_FGOALS-g2_lgm_r0i0p0.nc',
'dataset_id': 'cmip5.output1.LASG-CESS.FGOALS-g2.lgm.fx.atmos.fx.r0i0p0.v20130314',
'dataset_master': 'cmip5.output1.LASG-CESS.FGOALS-g2.lgm.fx.atmos.fx.r0i0p0',
'version': 'v20130314',
'master_id': 'cmip5.output1.LASG-CESS.FGOALS-g2.lgm.fx.atmos.fx.r0i0p0.areacella_fx_FGOALS-g2_lgm_r0i0p0.nc',
'url': 'https://aims3.llnl.gov/thredds/fileServer/cmip5_css01_data/cmip5/output1/LASG-CESS/FGOALS-g2/lgm/fx/atmos/fx/r0i0p0/v20130314/areacella/areacella_fx_FGOALS-g2_lgm_r0i0p0.nc',
'local_path': 'CMIP5/output1/LASG-CESS/FGOALS-g2/lgm/fx/atmos/fx/r0i0p0/v20130314/areacella'
}
Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:--:--

[2023-04-11 10:46:46] ERROR root

Traceback (most recent call last):
File "/gpfscmip/gpfsdata/esgf/miniconda/envs/esgpull/lib/python3.11/site-packages/esgpull/tui.py", line 154, in logging
yield
File "/gpfscmip/gpfsdata/esgf/miniconda/envs/esgpull/lib/python3.11/site-packages/esgpull/cli/self.py", line 235, in import_synda
nb_imported = esg.import_synda(url=path, track=True, ask=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/gpfscmip/gpfsdata/esgf/miniconda/envs/esgpull/lib/python3.11/site-packages/esgpull/esgpull.py", line 227, in import_synda
file = synda_file.to_file()
^^^^^^^^^^^^^^^^^^^^
File "/gpfscmip/gpfsdata/esgf/miniconda/envs/esgpull/lib/python3.11/site-packages/esgpull/models/synda_file.py", line 64, in to_file
checksum_type=self.checksum_type.upper(),
^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'upper'
`

It seems that this file (filename = areacella_fx_FGOALS-g2_lgm_r0i0p0.nc) has neither a checksum nor a checksum_type attribute in the synda CMIP5 database you are importing, and esgpull currently requires both.
The query run by esgpull on the ESGF search API confirms that the 2 attributes are also missing from the file's metadata: https://esgf-node.ipsl.upmc.fr/esg-search/search?type=File&offset=0&limit=1&format=application%2Fsolr%2Bjson&fields=%2A&query=title%3Aareacella_fx_FGOALS-g2_lgm_r0i0p0.nc&distrib=true&latest=true&retracted=false
I am guessing synda used the same query to fill its database at the time this file was added.
Now if I increase the limit parameter for this query (numFound tells us 4 replicas exist in this case), checksum and checksum_type do exist in the next 3 replicas' metadata.
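
For reference, here is a minimal sketch of how the query above can be rebuilt with a larger limit so that all replicas are returned. The base URL and parameters are copied from the URL quoted above; the function name `replica_query_url` and its default limit are only illustrative, not part of esgpull.

```python
from urllib.parse import urlencode

# Base URL taken from the search query quoted above.
SEARCH_URL = "https://esgf-node.ipsl.upmc.fr/esg-search/search"


def replica_query_url(filename: str, limit: int = 10) -> str:
    """Build a File search URL returning up to `limit` records for a filename.

    Hypothetical helper: only the parameter names and values come from the
    query quoted in this thread.
    """
    params = {
        "type": "File",
        "offset": 0,
        "limit": limit,
        "format": "application/solr+json",
        "fields": "*",
        "query": f"title:{filename}",
        "distrib": "true",
        "latest": "true",
        "retracted": "false",
    }
    # urlencode percent-encodes "/", "+", "*" and ":" like the original URL.
    return SEARCH_URL + "?" + urlencode(params)


# Ask for 4 records, since numFound reported 4 replicas for this file.
url = replica_query_url("areacella_fx_FGOALS-g2_lgm_r0i0p0.nc", limit=4)
```

With `limit=1` this reproduces the URL above; raising the limit is the only change needed to see the other replicas.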

Knowing this, 2 things could be done during import to handle missing information:

  • look up the metadata from all replicas for each incomplete file and use the most complete one, but I don't know if we can guarantee there will always be at least one index node with the full metadata,
  • skip files with incomplete metadata when importing the database.

The first solution might look more complete, but it could seriously slow down the import procedure and does not guarantee that the missing info will be filled. The 2nd solution is easy to set up, but it will definitely introduce divergence between the filesystem and the database, since skipped files stay on disk without being tracked.
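
To make the first option concrete, here is a rough sketch of the selection step only, assuming the replica records have already been fetched and parsed into dicts. The helper names are hypothetical, and the assumption that a field may arrive either as a plain string or as a single-element list is mine; it should be checked against real search results.

```python
from typing import Any, Optional

# Attributes that the import needs, per the traceback above.
REQUIRED_FIELDS = ("checksum", "checksum_type")


def first_value(doc: dict[str, Any], key: str) -> Optional[str]:
    """Return a field as a string, unwrapping a list; None if absent or empty."""
    value = doc.get(key)
    if isinstance(value, list):
        value = value[0] if value else None
    return value or None


def most_complete_replica(docs: list[dict[str, Any]]) -> Optional[dict[str, Any]]:
    """Return the first replica record that has every required field, else None."""
    for doc in docs:
        if all(first_value(doc, key) is not None for key in REQUIRED_FIELDS):
            return doc
    return None


# Made-up records: the first replica lacks the checksum fields.
docs = [
    {"data_node": "node-a.example.org"},
    {"data_node": "node-b.example.org", "checksum": ["abc123"], "checksum_type": ["SHA256"]},
]
best = most_complete_replica(docs)
```

A `None` result corresponds to the case where no index node has the full metadata, so the import would still need the skip behaviour as a fallback.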

I also encountered this error. My solution was simply to skip the files with missing metadata; since only a few files were affected, this was an acceptable loss for me. I added a try/except block that logs which file was skipped instead of letting the error halt the program. I may submit a pull request soon with my proposed code changes:

In esgpull.py

        nb_imported = 0
        for start in iter_idx_range:
            stop = min(len(synda_ids), start + size)
            ids = synda_ids[start:stop]
            synda_files = synda.scalars(sql.synda_file.with_ids(*ids))
            files: list[File] = []
            for synda_file in synda_files:
                try:
                    file = synda_file.to_file()
                except AttributeError as e:
                    # to_file() calls checksum_type.upper(), which raises
                    # when the synda record has checksum_type=None.
                    logger.warning(e)
                    warn_msg = f"Skipping {synda_file.filename} due to missing database metadata. Continuing to the next file"
                    print(warn_msg)
                    logger.warning(warn_msg)
                    continue

                if file.sha not in shas:
                    file.queries.append(self.legacy_query)
                    files.append(file)
                    synda_shas.add(file.sha)
            if files:
                nb_imported += len(files)
                self.db.add(*files)
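
As a possible variation on the same idea, the missing attributes could be checked explicitly before conversion rather than by catching `AttributeError`, which would also swallow unrelated attribute errors raised inside `to_file()`. This is only a sketch: `SyndaRecord` is a hypothetical stand-in for `SyndaFile` with just the fields involved, and `missing_metadata` / `split_importable` are illustrative names, not esgpull functions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SyndaRecord:
    """Hypothetical stand-in for SyndaFile, reduced to the relevant fields."""
    filename: str
    checksum: Optional[str]
    checksum_type: Optional[str]


def missing_metadata(record: SyndaRecord) -> list[str]:
    """List the required attributes that are None on this record."""
    required = ("checksum", "checksum_type")
    return [name for name in required if getattr(record, name) is None]


def split_importable(records: list[SyndaRecord]):
    """Separate records that can be converted from those that must be skipped."""
    importable, skipped = [], []
    for record in records:
        missing = missing_metadata(record)
        if missing:
            # Keep the reason so it can be written to the log.
            skipped.append((record.filename, missing))
        else:
            importable.append(record)
    return importable, skipped


records = [
    SyndaRecord("areacella_fx_FGOALS-g2_lgm_r0i0p0.nc", None, None),
    SyndaRecord("example_complete_file.nc", "abc123", "sha256"),
]
importable, skipped = split_importable(records)
```

Keeping the list of skipped filenames and missing attributes also gives a record of exactly where the database diverges from the filesystem.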