geomagpy/magpy

extractDateFromString() will take the last date in a file name with multiple dates

bairaelyn opened this issue · 0 comments

This is only a problem when reading files from large archives without specifying a filename.

Example: DSCOVR data archive files have the format _DATASTARTTIME_DATAENDTIME_DATACOMPILATIONTIME, and look like this:

oe_m1m_dscovr_s20170911000000_e20170911235959_p20170912023324_pub.nc

When extractDateFromString() is used to find the correct file for a date, it cycles through all available number strings in the filename but only keeps the last one (here, the compilation date, which is not relevant for the data in the file):

        for i in range(len(testunder)):
            try:
                numberstr = re.findall(r'\d+',testunder[i])[0]
            except:
                numberstr = '0'
            if len(numberstr) > 4:
                tmpdaystring = numberstr
            elif len(numberstr) == 4 and int(numberstr) > 1900: # use year at the end of string
                tmpdaystring = numberstr

        if len(tmpdaystring) > 8:
            try: # first try whether an easy pattern can be found e.g. test12014-11-22
                match = re.search(r'\d{4}-\d{2}-\d{2}', daystring)
                date = datetime.strptime(match.group(), '%Y-%m-%d').date()

This could be remedied by testing every number string with len(numberstr) > 4 and returning the first one that can be deciphered as a date, but that could break the automatic reading of other formats, so it will need extensive testing. The workaround for now is to read data with endtime + 1 day, so that the correct files are read, and then trim the result down afterwards.
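A minimal sketch of the remedy suggested above, for discussion only. The helper name `first_date_in_filename` is hypothetical and not part of magpy, and the sketch only handles compact `YYYYMMDD` and `YYYYMMDDhhmmss` number strings. It does not cover the 4-digit-year or dashed-date cases that extractDateFromString() also handles, so it is not a drop-in replacement.

```python
import re
from datetime import datetime

# Hypothetical helper (not magpy API): return the FIRST parseable date found
# in the underscore-separated parts of a file name, instead of the last one.
def first_date_in_filename(filename):
    for part in filename.split('_'):
        for numberstr in re.findall(r'\d+', part):
            # Try the longer layout first; only the leading digits are parsed.
            for fmt, width in (('%Y%m%d%H%M%S', 14), ('%Y%m%d', 8)):
                if len(numberstr) < width:
                    continue
                try:
                    return datetime.strptime(numberstr[:width], fmt).date()
                except ValueError:
                    continue  # not a valid date in this layout, keep looking
    return None  # no decipherable date in the name

name = 'oe_m1m_dscovr_s20170911000000_e20170911235959_p20170912023324_pub.nc'
print(first_date_in_filename(name))  # 2017-09-11 (data start, not compilation date)
```

Whether "first match wins" is safe for the other filename formats magpy reads automatically is exactly the open question, so any change along these lines would need to be run against the existing format tests.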