Updates to 00_intro_to_plant_data notebook
jordanperr opened this issue
jordanperr commented
Hi @RHammond2, I'm just going to leave my comments on `00_intro_to_plant_data.ipynb` and `project_ENGIE.py` here too. Thanks for putting in the effort to translate all these preprocessing steps over to the v3 workflow!
00_intro_to_plant_data.ipynb
- Since this might be someone's first introduction to OpenOA, near the top of the file I think it would be helpful to explain what a `PlantData` object is and how it is used in OpenOA (e.g., organizing data; the main analysis methods in OpenOA operate on a `PlantData` object; etc.). To clarify the workflow, I think it would also be helpful to first show how a `PlantData` object can be instantiated from data frames or CSV files (maybe just as a comment in the notebook), then explain that in practice there are typically steps that need to be taken to clean the data and convert it into the format required by `PlantData`. We describe the steps required for the LHB data, but not all of these steps may be required for other plants, and other wind plants may require additional steps, depending on data quality. We could also mention that we implemented these preprocessing steps for the example data in a `project_ENGIE` module, but the user can decide how they want to implement the preprocessing.
- Step 1: Load the SCADA data: it would be good to comment on what the `extract_data` function is doing, and that it is specific to the example data.
- Step 2: Convert the timestamps to proper timestamp data objects: it might be helpful to explain that `WindToolKitQualityControlDiagnosticSuite` can only be used for US-based wind farms, because the WIND Toolkit wind data it uses are only available for the US.
- Step 3: Dive into the data: we probably don't need to show `qa.describe` for both cases, to save space.
- Step 3, "Inspecting the distributions of each column of numerical data" subsection: `ix_consecutive = (scada_df_tz.loc[single_turbine_ix].Va_avg.diff(1) != 0)`: won't this actually return non-consecutive data? Shouldn't the `!=` be `==`? Same comment for the next cell.
- Step 4: Inspecting the timestamps for DST gaps and duplications: in general this section is pretty long (the nature of timezones and DST!), and readers might get bogged down here. I wonder if there's a way to condense it? Perhaps it could also be reorganized to walk through tz-unaware data first, then tz-aware, to avoid confusion.
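The `!=` vs `==` question can be checked on a toy series: `.diff(1) != 0` flags rows whose value *changed* from the previous row, while `== 0` flags the consecutive-repeat rows. A minimal sketch (the values are invented):

```python
import pandas as pd

# Toy wind-vane signal with a stuck (repeated) stretch.
va = pd.Series([10.0, 10.0, 10.0, 12.5, 13.0, 13.0])

changed = va.diff(1) != 0  # True where the value differs from the previous row
stuck = va.diff(1) == 0    # True where the value repeats the previous row

print(changed.tolist())  # [True, False, False, True, True, False]
print(stuck.tolist())    # [False, True, True, False, False, True]
```

Note that the first element of `diff(1)` is NaN, which compares as `True` under `!=` and `False` under `==`, so the first row is always picked up by the `!=` mask.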
- Step 4, "In the below, timezone unaware data, we can see that there is a significant deviation between the local timestamps and the UTC timestamps, especially around the end of March in both 2018 and 2019, suggesting that there is something missing with the DST data.": It seems like it's the beginning of March (the 8th and 9th) where there are lots of duplicated timestamps for `utc_no_tz`. But why are there duplicates this early in the month? This wasn't the case in the v2 01a notebook; there could be a bug somewhere.
- Step 4: "Now, in the timezone aware data, it is clear that the timezone data is able to fully account for DST, so there aren't any erroneous duplications in the standardized data.": This is a little confusing because there are still a lot of duplicates.
- Step 4: "Based on the duplicated timestamps, it does seem like there is a DST correction in spring and a time gap in the fall. This is in contrast with the timezone unaware counterpart of this example where we found gaps in the spring and none in the fall for the original data, and vice versa for the UTC data." This is confusing because there are no gap results shown for the tz-aware data yet.
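A compact way to surface the duplicated timestamps being discussed is pandas' `duplicated`. A minimal sketch with invented times around a fall-back transition:

```python
import pandas as pd

# Naive (timezone-unaware) timestamps: during the fall-back DST change
# the 02:00 reading appears twice.
ts = pd.to_datetime(
    [
        "2018-10-28 01:50",
        "2018-10-28 02:00",
        "2018-10-28 02:00",  # repeated hour from the DST fall-back
        "2018-10-28 02:10",
    ]
)
scada = pd.DataFrame({"time": ts})

# keep=False marks every member of a duplicated group, not just repeats.
dupes = scada[scada["time"].duplicated(keep=False)]
print(dupes)
```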
- Step 4: "Below, we can observe the effects of having timezones not encoded (first set of plots) versus encoded (second set of plots), and what that might mean for potential analyses...." The description in this paragraph doesn’t seem to match the plots. Further, there seems to be a bug in the plots. The gap between local and UTC is 6 hours in the first plots, but for France, the difference should be 1 or 2 hours. I'm wondering if somehow local got encoded as a US time zone. This would also explain why we're seeing duplicates in early March in the tz-unaware utc data, since DST occurs earlier in the US than in Europe.
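The suspected timezone mix-up is easy to sanity-check with the standard library's `zoneinfo`: a plant in France should sit 1 h (winter) or 2 h (summer DST) ahead of UTC, while a 6 h gap matches a US zone such as `America/Chicago` (the dates below are arbitrary examples):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

paris_winter = datetime(2018, 1, 15, 12, tzinfo=ZoneInfo("Europe/Paris"))
paris_summer = datetime(2018, 7, 15, 12, tzinfo=ZoneInfo("Europe/Paris"))
chicago_winter = datetime(2018, 1, 15, 12, tzinfo=ZoneInfo("America/Chicago"))

print(paris_winter.utcoffset())    # 1:00:00
print(paris_summer.utcoffset())    # 2:00:00
print(chicago_winter.utcoffset())  # -1 day, 18:00:00  (i.e. -6 h)
```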
- `qa.dalyight_savings_plot`: typo in "daylight".
- Step 5: It would be helpful to explain that `project_ENGIE` is an example of how the preprocessing steps could be implemented. We used functions to implement the steps, but this could also be done using a standalone script or Jupyter notebooks, for example.
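One way the "functions vs. script vs. notebook" point could be illustrated is a single preprocessing function, in the spirit of a `project_ENGIE`-style module. The function and column names here are invented for the example, not OpenOA's actual implementation:

```python
import pandas as pd

def prepare_scada(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning step: parse timestamps, drop duplicate
    readings per turbine, and sort chronologically."""
    out = df.copy()
    out["time"] = pd.to_datetime(out["time"])
    out = out.drop_duplicates(subset=["time", "asset_id"])
    return out.sort_values("time").reset_index(drop=True)

raw = pd.DataFrame(
    {
        "time": ["2018-01-01 00:10", "2018-01-01 00:00", "2018-01-01 00:00"],
        "asset_id": ["R80711"] * 3,
    }
)
clean = prepare_scada(raw)
print(clean)
```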
- Step 6: `print(PlantMetaData.__doc__)`: there's some other metadata we probably want to include in `PlantMetaData`, like plant capacity and number of turbines.
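The proposed extra fields could look something like the sketch below. This is a hypothetical illustration, not OpenOA's actual `PlantMetaData` class, and the example values are only indicative of the La Haute Borne plant:

```python
from dataclasses import dataclass

@dataclass
class PlantMetaDataSketch:
    """Hypothetical metadata container showing the suggested fields."""
    name: str
    latitude: float
    longitude: float
    capacity_mw: float       # proposed addition: total plant capacity
    number_of_turbines: int  # proposed addition: turbine count

# Example values for illustration (4 turbines, ~8.2 MW total).
lhb = PlantMetaDataSketch("La Haute Borne", 48.45, 5.59, 8.2, 4)
print(lhb.capacity_mw, lhb.number_of_turbines)
```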
project_ENGIE.py
- This is my personal preference, but I think it could be useful to rename this module to something else, like "project_ENGIE_preprocessing", to avoid confusion with the "project_ENGIE" project class in v2.
- The intro documentation might need to be updated a little to reflect how the module is used in v3.
Originally posted by @ejsimley in #194 (comment)