hf.peaks ParserError month
danhamill opened this issue · 6 comments
- HydroFunctions version:
hf.__version__
Out[46]: '0.2.0'
- Python version:
Python 3.7.10
- Operating System:
Windows
Description
hf.peaks() raises a date-parsing error when downloading peak data for USGS gage no. 06813500.
What I Did
hf.peaks('06813500')
Retrieving annual peak discharges for site # 06813500 from https://nwis.waterdata.usgs.gov/nwis/peak?site_no=06813500&agency_cd=USGS&format=rdb
Traceback (most recent call last):
File "C:\Anaconda3\envs\py37\lib\site-packages\dateutil\parser\_parser.py", line 655, in parse
ret = self._build_naive(res, default)
File "C:\Anaconda3\envs\py37\lib\site-packages\dateutil\parser\_parser.py", line 1241, in _build_naive
naive = default.replace(**repl)
ValueError: month must be in 1..12
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "pandas\_libs\tslib.pyx", line 514, in pandas._libs.tslib.array_to_datetime
File "pandas\_libs\tslibs\parsing.pyx", line 243, in pandas._libs.tslibs.parsing.parse_datetime_string
File "C:\Anaconda3\envs\py37\lib\site-packages\dateutil\parser\_parser.py", line 1374, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File "C:\Anaconda3\envs\py37\lib\site-packages\dateutil\parser\_parser.py", line 657, in parse
six.raise_from(ParserError(e.args[0] + ": %s", timestr), e)
File "<string>", line 3, in raise_from
ParserError: month must be in 1..12: 1881-00-00
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "pandas\_libs\tslib.pyx", line 525, in pandas._libs.tslib.array_to_datetime
TypeError: invalid string coercion to datetime
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Anaconda3\envs\py37\lib\site-packages\dateutil\parser\_parser.py", line 655, in parse
ret = self._build_naive(res, default)
File "C:\Anaconda3\envs\py37\lib\site-packages\dateutil\parser\_parser.py", line 1241, in _build_naive
naive = default.replace(**repl)
ValueError: month must be in 1..12
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<ipython-input-45-74616d452e0e>", line 1, in <module>
hf.peaks('06813500')
File "C:\Anaconda3\envs\py37\lib\site-packages\hydrofunctions\usgs_rdb.py", line 384, in peaks
outputDF.peak_dt = pd.to_datetime(outputDF.peak_dt)
File "C:\Anaconda3\envs\py37\lib\site-packages\pandas\core\tools\datetimes.py", line 805, in to_datetime
values = convert_listlike(arg._values, format)
File "C:\Anaconda3\envs\py37\lib\site-packages\pandas\core\tools\datetimes.py", line 472, in _convert_listlike_datetimes
allow_object=True,
File "C:\Anaconda3\envs\py37\lib\site-packages\pandas\core\arrays\datetimes.py", line 2090, in objects_to_datetime64ns
raise e
File "C:\Anaconda3\envs\py37\lib\site-packages\pandas\core\arrays\datetimes.py", line 2081, in objects_to_datetime64ns
require_iso8601=require_iso8601,
File "pandas\_libs\tslib.pyx", line 364, in pandas._libs.tslib.array_to_datetime
File "pandas\_libs\tslib.pyx", line 591, in pandas._libs.tslib.array_to_datetime
File "pandas\_libs\tslib.pyx", line 726, in pandas._libs.tslib.array_to_datetime_object
File "pandas\_libs\tslib.pyx", line 717, in pandas._libs.tslib.array_to_datetime_object
File "pandas\_libs\tslibs\parsing.pyx", line 243, in pandas._libs.tslibs.parsing.parse_datetime_string
File "C:\Anaconda3\envs\py37\lib\site-packages\dateutil\parser\_parser.py", line 1374, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File "C:\Anaconda3\envs\py37\lib\site-packages\dateutil\parser\_parser.py", line 657, in parse
six.raise_from(ParserError(e.args[0] + ": %s", timestr), e)
File "<string>", line 3, in raise_from
ParserError: month must be in 1..12: 1881-00-00
Thank you @danhamill ! This looks interesting... It looks like the first line of the RDB data has a poorly-formed date. You can see it in the RDB file, here: https://nwis.waterdata.usgs.gov/nwis/peak?site_no=06813500&agency_cd=USGS&format=rdb
The peak apparently occurred in "1881-00-00"! There is an explanatory code that goes along with the date that says the 'Month of occurrence is unknown or not exact'! I guess it was a busy year, what with Garfield getting shot and all. I suspect that really old data in these RDB files are likely to have problems like this.
As a first attempt to fix this, I may catch parsing errors and then output the weird lines for people to recover on their own, manually. So, for example:
>>> rulo = hf.peaks('06813500')
Retrieving annual peak discharges for site # 06813500 from https://nwis.waterdata.usgs.gov/nwis/peak?site_no=06813500&agency_cd=USGS&format=rdb
Found some malformed lines of data. Use *.errors to review these lines, or *.rdb to view the entire RDB file.
>>> rulo.errors
Found 1 line with parsing errors:
USGS 06813500 1881-00-00 22.90 Bm
I created a partial solution in the bugfix-rdb-parsing branch. Now, instead of raising an error, hf.peaks() will stop trying to turn the date column into a datetime type, and it won't make this column the index. It issues a message noting the problem, but it keeps going.
Ideally, I would like to isolate the line that causes the problem so it can be fixed and added in later.
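One possible way to isolate the offending lines, sketched here with plain pandas rather than the actual hydrofunctions code: parse with `errors="coerce"` so that unparseable dates become `NaT`, then split the table on that mask. The helper name `split_bad_dates`, the column names other than `peak_dt`, and every sample row except the `1881-00-00` record are assumptions made up for illustration.

```python
import pandas as pd


def split_bad_dates(df, column="peak_dt", fmt="%Y-%m-%d"):
    """Hypothetical helper: return (good_rows, bad_rows) for a date column.

    Rows whose `column` value cannot be parsed with `fmt` (including
    missing values) end up in `bad_rows` with their original strings.
    """
    parsed = pd.to_datetime(df[column], format=fmt, errors="coerce")
    bad_mask = parsed.isna()
    good = df.loc[~bad_mask].copy()
    good[column] = parsed[~bad_mask]  # good rows get real datetimes
    bad = df.loc[bad_mask].copy()     # bad rows keep the raw text
    return good, bad


# Illustrative data: only the first row mirrors the record in this issue;
# the other two rows are invented placeholders.
sample = pd.DataFrame(
    {
        "site_no": ["06813500", "06813500", "06813500"],
        "peak_dt": ["1881-00-00", "1949-06-12", "1950-05-09"],
        "gage_ht": [22.90, 10.0, 11.0],
    }
)

good, bad = split_bad_dates(sample)
print(bad)
```

The rows in `bad` could then be reported to the user, corrected by hand, and concatenated back onto `good`.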
You can install this 'fixed' version of the software directly from the branch by typing:
pip install git+https://github.com/mroberge/hydrofunctions.git@bugfix-rdb-parsing
Hopefully I will have a better version soon. I'll close this issue when I've merged the improved solution with the Develop branch.
The latest version 9f5866f tries to convert all date columns into datetimes. If it can't, it leaves the column as a string and prints the reason. Perhaps it should raise a warning?
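For comparison, here is a minimal sketch of what raising a warning instead of printing might look like. It is not the code in the commit; `to_datetime_or_warn` is a hypothetical name, and the fixed `%Y-%m-%d` format is an assumption about the date columns.

```python
import warnings

import pandas as pd


def to_datetime_or_warn(df, column, fmt="%Y-%m-%d"):
    """Hypothetical helper: convert `column` to datetimes if possible.

    If any value fails to parse, the column is left as strings and a
    UserWarning carrying the reason is issued instead of a print().
    """
    try:
        df[column] = pd.to_datetime(df[column], format=fmt)
    except (ValueError, TypeError) as err:
        warnings.warn(
            f"Could not convert column '{column}' to datetime; "
            f"leaving it as strings. Reason: {err}",
            UserWarning,
            stacklevel=2,
        )
    return df


# The second date is an invented placeholder.
demo = pd.DataFrame({"peak_dt": ["1881-00-00", "1949-06-12"]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    demo = to_datetime_or_warn(demo, "peak_dt")

for w in caught:
    print(w.category.__name__, "-", w.message)
```

A warning has the advantage that callers can filter it, log it, or turn it into an error with the standard `warnings` machinery, which a printed message does not allow.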
Another idea would be to assign the peak to the beginning of the water year, i.e. 1881-10-01.
I like the idea of still having a column of datetime with a warning indicating there was an issue with that date.
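A rough sketch of that idea, assuming plain pandas and not reflecting what hydrofunctions actually does: parse what can be parsed, substitute a placeholder date for the rest, flag the substituted rows, and warn. The function name `parse_with_placeholder` is hypothetical, and the October 1 default simply follows the suggestion above; which year and day best represent an unknown date is a convention question this sketch does not settle.

```python
import warnings

import pandas as pd


def parse_with_placeholder(dates, month=10, day=1, fmt="%Y-%m-%d"):
    """Hypothetical helper: parse YYYY-MM-DD strings into datetimes.

    Values that fail to parse (e.g. '1881-00-00') are replaced by
    <year>-<month>-<day>, using the first four characters as the year.
    Returns (parsed, was_estimated), where `was_estimated` is a boolean
    Series marking the substituted rows.
    """
    dates = pd.Series(dates, dtype="object")
    parsed = pd.to_datetime(dates, format=fmt, errors="coerce")
    was_estimated = parsed.isna() & dates.notna()
    if was_estimated.any():
        placeholder = pd.to_datetime(
            dates.str.slice(0, 4) + f"-{month:02d}-{day:02d}",
            format=fmt,
            errors="coerce",
        )
        # Keep successfully parsed values; use the placeholder elsewhere.
        parsed = parsed.where(~was_estimated, placeholder)
        warnings.warn(
            f"{int(was_estimated.sum())} date(s) could not be parsed and "
            "were replaced by placeholder dates.",
            UserWarning,
            stacklevel=2,
        )
    return parsed, was_estimated


# The second date is an invented placeholder.
peak_dt, estimated = parse_with_placeholder(["1881-00-00", "1949-06-12"])
print(peak_dt)
print(estimated)
```

Keeping the `was_estimated` flag next to the datetime column lets users drop or re-examine the placeholder dates instead of mistaking them for observed ones.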
Closed with commit 374177b
hf.peaks() and hf.field_meas() will now print a statement saying that they were unable to convert a date into the dateTime format and will leave the column as a string. Both will preserve the original rdb file so the user can re-parse the data, or they can alter the dataframe to make the fix.
Excellent. Thank you!