mroberge/hydrofunctions

hf.peaks ParserError month

danhamill opened this issue · 6 comments

  • HydroFunctions version: 0.2.0 (reported by hf.__version__)
  • Python version: 3.7.10
  • Operating System: Windows

Description

hf.peaks() raises a date parsing error when downloading peak data for USGS gage no. 06813500.

What I Did

hf.peaks('06813500')
Retrieving annual peak discharges for site # 06813500  from  https://nwis.waterdata.usgs.gov/nwis/peak?site_no=06813500&agency_cd=USGS&format=rdb
Traceback (most recent call last):

  File "C:\Anaconda3\envs\py37\lib\site-packages\dateutil\parser\_parser.py", line 655, in parse
    ret = self._build_naive(res, default)

  File "C:\Anaconda3\envs\py37\lib\site-packages\dateutil\parser\_parser.py", line 1241, in _build_naive
    naive = default.replace(**repl)

ValueError: month must be in 1..12


The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "pandas\_libs\tslib.pyx", line 514, in pandas._libs.tslib.array_to_datetime

  File "pandas\_libs\tslibs\parsing.pyx", line 243, in pandas._libs.tslibs.parsing.parse_datetime_string

  File "C:\Anaconda3\envs\py37\lib\site-packages\dateutil\parser\_parser.py", line 1374, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)

  File "C:\Anaconda3\envs\py37\lib\site-packages\dateutil\parser\_parser.py", line 657, in parse
    six.raise_from(ParserError(e.args[0] + ": %s", timestr), e)

  File "<string>", line 3, in raise_from

ParserError: month must be in 1..12: 1881-00-00


During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "pandas\_libs\tslib.pyx", line 525, in pandas._libs.tslib.array_to_datetime

TypeError: invalid string coercion to datetime


During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "C:\Anaconda3\envs\py37\lib\site-packages\dateutil\parser\_parser.py", line 655, in parse
    ret = self._build_naive(res, default)

  File "C:\Anaconda3\envs\py37\lib\site-packages\dateutil\parser\_parser.py", line 1241, in _build_naive
    naive = default.replace(**repl)

ValueError: month must be in 1..12


The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "<ipython-input-45-74616d452e0e>", line 1, in <module>
    hf.peaks('06813500')

  File "C:\Anaconda3\envs\py37\lib\site-packages\hydrofunctions\usgs_rdb.py", line 384, in peaks
    outputDF.peak_dt = pd.to_datetime(outputDF.peak_dt)

  File "C:\Anaconda3\envs\py37\lib\site-packages\pandas\core\tools\datetimes.py", line 805, in to_datetime
    values = convert_listlike(arg._values, format)

  File "C:\Anaconda3\envs\py37\lib\site-packages\pandas\core\tools\datetimes.py", line 472, in _convert_listlike_datetimes
    allow_object=True,

  File "C:\Anaconda3\envs\py37\lib\site-packages\pandas\core\arrays\datetimes.py", line 2090, in objects_to_datetime64ns
    raise e

  File "C:\Anaconda3\envs\py37\lib\site-packages\pandas\core\arrays\datetimes.py", line 2081, in objects_to_datetime64ns
    require_iso8601=require_iso8601,

  File "pandas\_libs\tslib.pyx", line 364, in pandas._libs.tslib.array_to_datetime

  File "pandas\_libs\tslib.pyx", line 591, in pandas._libs.tslib.array_to_datetime

  File "pandas\_libs\tslib.pyx", line 726, in pandas._libs.tslib.array_to_datetime_object

  File "pandas\_libs\tslib.pyx", line 717, in pandas._libs.tslib.array_to_datetime_object

  File "pandas\_libs\tslibs\parsing.pyx", line 243, in pandas._libs.tslibs.parsing.parse_datetime_string

  File "C:\Anaconda3\envs\py37\lib\site-packages\dateutil\parser\_parser.py", line 1374, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)

  File "C:\Anaconda3\envs\py37\lib\site-packages\dateutil\parser\_parser.py", line 657, in parse
    six.raise_from(ParserError(e.args[0] + ": %s", timestr), e)

  File "<string>", line 3, in raise_from

ParserError: month must be in 1..12: 1881-00-00

Thank you @danhamill ! This looks interesting... It looks like the first line of the RDB data has a poorly-formed date. You can see it in the RDB file, here: https://nwis.waterdata.usgs.gov/nwis/peak?site_no=06813500&agency_cd=USGS&format=rdb

The peak apparently occurred in "1881-00-00"! There is an explanatory code that goes along with the date that says the 'Month of occurrence is unknown or not exact'! I guess it was a busy year, what with Garfield getting shot and all. I suspect that really old data in these RDB files are likely to have problems like this.

As a first attempt to fix this, I may catch parsing errors and then output the weird lines for people to recover on their own, manually. So, for example:

>>> rulo = hf.peaks('06813500')
Retrieving annual peak discharges for site # 06813500  from  https://nwis.waterdata.usgs.gov/nwis/peak?site_no=06813500&agency_cd=USGS&format=rdb
Found some malformed lines of data. Use *.errors to review these lines, or *.rdb to view the entire RDB file.

>>> rulo.errors
Found 1 line with parsing errors:
USGS	06813500	1881-00-00				22.90	Bm

I created a partial solution in the bugfix-rdb-parsing branch. Now, instead of raising an error, hf.peaks() will stop trying to turn the date column into a datetime type, and it won't make this column the index. It issues a message noting the problem, but it keeps going.

Ideally, I would like to isolate the line that causes the problem so it can be fixed and added in later.
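One way to isolate the offending rows, as a rough sketch rather than what the branch actually does: let pandas coerce unparseable dates to NaT, then split the table on that mask. The DataFrame below is a hypothetical stand-in; only the "1881-00-00" date and its 22.90 gage height come from the RDB line quoted above, and the other rows and the gage_ht column name are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the peaks table. Only the first row reflects
# the RDB line quoted in this thread; the other rows are made up.
peaks = pd.DataFrame(
    {
        "peak_dt": ["1881-00-00", "1882-06-15", "1883-07-02"],
        "gage_ht": [22.90, 18.10, 19.40],
    }
)

# errors="coerce" turns dates that cannot be parsed into NaT instead of raising.
parsed = pd.to_datetime(peaks["peak_dt"], format="%Y-%m-%d", errors="coerce")
is_bad = parsed.isna()

# Rows whose date failed to parse, kept aside as strings for manual repair.
bad_rows = peaks[is_bad]

# Rows with valid dates, converted and indexed by datetime.
good_rows = peaks[~is_bad].copy()
good_rows["peak_dt"] = parsed[~is_bad]
good_rows = good_rows.set_index("peak_dt")

print(bad_rows)
```

The bad rows keep their original text, so they could be corrected by hand and appended to the good rows later.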

You can install this 'fixed' version of the software directly from the branch by typing:

pip install git+https://github.com/mroberge/hydrofunctions.git@bugfix-rdb-parsing

Hopefully I will have a better version soon. I'll close this issue when I've merged the improved solution with the Develop branch.

The latest version 9f5866f tries to convert all date columns into datetimes. If it can't, it leaves the column as a string and prints the reason. Perhaps it should raise a warning?
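For the warning idea, a minimal sketch using the standard warnings module might look like the following. The helper name to_datetime_or_warn is hypothetical and is not part of hydrofunctions; it only illustrates the "convert if possible, otherwise leave as string and warn" behavior described above.

```python
import warnings

import pandas as pd


def to_datetime_or_warn(column):
    """Return the column as datetimes, or unchanged (with a warning) on failure.

    Hypothetical helper for illustration; not the hydrofunctions implementation.
    """
    try:
        return pd.to_datetime(column, format="%Y-%m-%d")
    except (ValueError, TypeError) as err:
        # A warning can be filtered, logged, or promoted to an error by the
        # caller, which a plain print statement does not allow.
        warnings.warn(
            f"Could not convert column {column.name!r} to datetime: {err}",
            UserWarning,
            stacklevel=2,
        )
        return column


dates = pd.Series(["1881-00-00", "1882-06-15"], name="peak_dt")
result = to_datetime_or_warn(dates)  # warns and returns the original strings
```

Callers who want the old strict behavior could then opt in with warnings.simplefilter("error").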

Another idea would be to assign the peak to the beginning of the water year.

i.e. 1881-10-01

I like the idea of still having a column of datetime with a warning indicating there was an issue with that date.
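A sketch of that idea, under some assumptions: the function name fill_unknown_dates and the date_estimated flag column are hypothetical, and the placeholder month and day default to October 1 of the year in the string, simply following the 1881-10-01 example above. Whether that is the right date for a given record's water year is left to the caller.

```python
import warnings

import pandas as pd


def fill_unknown_dates(dates, month=10, day=1):
    """Parse dates, substituting a placeholder month/day where parsing fails.

    Hypothetical helper: month and day are assumed defaults that mirror the
    1881-10-01 example in this thread.
    """
    dates = pd.Series(dates, dtype="object")
    parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")
    estimated = parsed.isna()

    # Build the placeholder from the four-digit year at the start of the string.
    placeholder = pd.to_datetime(
        dates[estimated].str[:4] + f"-{month:02d}-{day:02d}",
        format="%Y-%m-%d",
        errors="coerce",
    )
    parsed = parsed.fillna(placeholder)

    if estimated.any():
        warnings.warn(
            f"{int(estimated.sum())} date(s) could not be parsed and were "
            "replaced with a placeholder; see the 'date_estimated' column."
        )
    return pd.DataFrame({"peak_dt": parsed, "date_estimated": estimated})


fixed = fill_unknown_dates(["1881-00-00", "1882-06-15"])
```

Keeping the boolean flag next to the datetime column means the substituted dates can still be excluded from any analysis that depends on the month or day.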

Closed with commit 374177b
hf.peaks() and hf.field_meas() will now print a statement saying that they were unable to convert a date into the dateTime format and will leave the column as a string. Both will preserve the original rdb file so the user can re-parse the data, or they can alter the dataframe to make the fix.
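For anyone re-parsing the preserved RDB text by hand, here is a rough sketch with plain pandas. It assumes the usual RDB layout: comment lines starting with "#", a tab-separated header row, and a column-format row before the data. The rdb_text sample is shortened and illustrative; only the site number, the "1881-00-00" date, and the 22.90 value come from this thread, and the second data row is invented.

```python
import io

import pandas as pd

# Shortened, illustrative RDB text. A real peak file has more columns,
# and the second data row here is made up.
rdb_text = (
    "# comment lines start with a hash\n"
    "agency_cd\tsite_no\tpeak_dt\tgage_ht\n"
    "5s\t15s\t10d\t8s\n"
    "USGS\t06813500\t1881-00-00\t22.90\n"
    "USGS\t06813500\t1882-06-15\t18.10\n"
)

# dtype=str keeps the leading zero in site_no and leaves the dates untouched.
df = pd.read_csv(io.StringIO(rdb_text), sep="\t", comment="#", dtype=str)

# The first row after the header describes column widths and types, not data.
df = df.iloc[1:].reset_index(drop=True)

# Keep the original strings and add a parsed column; bad dates become NaT.
df["peak_dt_parsed"] = pd.to_datetime(
    df["peak_dt"], format="%Y-%m-%d", errors="coerce"
)
```

The original peak_dt strings stay in place, so nothing is lost for the rows that fail to parse.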

Excellent. Thank you!