Prevent removing too many duplicates
Closed this issue · 0 comments
veenstrajelmer commented
- Data Distributie Laag. Service from Rijkswaterstaat for distributing water quantity data. version:
- Python version:
- Operating System:
Description
Apparently, .drop_duplicates() does not consider the index when deciding whether a row is unique. We removed the Tijdstap column from the dataframe in #51, so rows that differ only in their index are now treated as duplicates and dropped. The column therefore has to be removed after dropping the duplicates, instead of before.
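The behaviour described above can be reproduced with plain pandas. This is a minimal sketch on a toy dataframe, not ddlpy's actual data. The column name `value` and the timestamps are made up for illustration.

```python
import pandas as pd

# Toy frame: the time lives only in the index; "value" is an illustrative column name.
index = pd.to_datetime(["2014-01-01 00:00", "2014-01-01 00:10", "2014-01-01 00:20"])
df = pd.DataFrame({"value": [10.0, 10.0, 12.0]}, index=index)

# The index is ignored, so the second row counts as a duplicate of the first.
deduped_values_only = df.drop_duplicates()

# Moving the index into a column makes the timestamp part of the comparison.
deduped_with_time = df.reset_index().drop_duplicates().set_index("index")

print(len(deduped_values_only), len(deduped_with_time))
```

Two equal measurements at different times are legitimate data, so only the second variant keeps all three rows.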
TODO:
- remove Tijdstap column as part of cleanup, after .drop_duplicates()
- add testcase that catches this
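The two TODO items could look roughly like the sketch below. `clean_measurements` is a hypothetical helper and not ddlpy's actual cleanup code. The toy dataframe assumes that the Tijdstap column holds the same timestamps as the index, which is a guess based on the description above.

```python
import pandas as pd


def clean_measurements(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleanup helper, not ddlpy's actual implementation."""
    # Deduplicate while the timestamp column is still part of each row ...
    df = df.drop_duplicates()
    # ... and only then drop the column that duplicates the index.
    return df.drop(columns=["Tijdstap"])


def test_clean_keeps_equal_values_at_different_times():
    # Assumed layout: Tijdstap repeats the index; the last two rows are true duplicates.
    times = pd.to_datetime(["2014-01-01 00:00", "2014-01-01 00:10", "2014-01-01 00:10"])
    raw = pd.DataFrame(
        {"Tijdstap": times, "Meetwaarde.Waarde_Numeriek": [10.0, 10.0, 10.0]},
        index=times,
    )
    cleaned = clean_measurements(raw)
    # Only the repeated 00:10 row is removed; the equal value at 00:00 survives.
    assert len(cleaned) == 2
    assert "Tijdstap" not in cleaned.columns
```

With the order reversed (dropping Tijdstap first), the same test would fail, because all three rows would collapse into one.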
What I Did
import datetime as dt

import ddlpy

# Select the WATHTE location for station DENHDR
locations = ddlpy.locations()
location = locations[(locations['Grootheid.Code'] == 'WATHTE') &
                     (locations['Groepering.Code'] == 'NVT')].loc['DENHDR']
start_date = dt.datetime(2014, 1, 1)
end_date = dt.datetime(2014, 1, 7)

# Retrieve the same period with and without cleaning
measurements_clean = ddlpy.measurements(location, start_date=start_date, end_date=end_date, clean_df=True)
measurements_raw = ddlpy.measurements(location, start_date=start_date, end_date=end_date, clean_df=False)

# Compare the number of rows in both dataframes
print()
print(len(measurements_clean))
print(len(measurements_raw))

# Plot both series to compare them visually
measurements_clean.plot(y="Meetwaarde.Waarde_Numeriek")
measurements_raw.plot(y="Meetwaarde.Waarde_Numeriek")