Running analysis on data subsets
Noticed some unexpected behaviour in the code after trying to run the analysis on a very small subset of the data (~25 models).
In the modelling notation part, this line was causing an issue:
df_meta_selected = df_meta_selected.groupby('namespace').resample('Y').sum(numeric_only=True).reset_index()
As the picture above shows, when the count is 0 for a specific date, a duplicate namespace column filled with NaN values is added to the dataframe.
To work around this, I added the min_count option to the sum() call and then filled the resulting NaNs with 0.
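A minimal sketch of that workaround, on made-up data rather than the project's dataframe. The column names `count` and `date`, the toy values, and `min_count=1` are assumptions for illustration. It also selects the numeric column explicitly and uses the `"YS"` (year start) alias instead of `'Y'`, since alias handling differs between pandas versions:

```python
import pandas as pd

# Hypothetical toy data: namespace "a" has no rows in 2021.
df = pd.DataFrame(
    {
        "namespace": ["a", "a", "b", "b", "b"],
        "count": [1, 2, 3, 4, 5],
    },
    index=pd.DatetimeIndex(
        ["2020-03-01", "2022-06-01", "2020-01-15", "2021-05-20", "2022-09-30"],
        name="date",
    ),
)

yearly = (
    df.groupby("namespace")["count"]  # select the numeric column explicitly
    .resample("YS")                   # one bin per calendar year
    .sum(min_count=1)                 # empty bins become NaN instead of 0
    .fillna(0)                        # then turn those NaNs back into 0
    .reset_index()
)
```

Here the empty 2021 bin for namespace "a" ends up as an explicit 0 row, and the result has a single `namespace` column.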
In the element types section, the small number of models used for the analysis raised another issue. After the data crunching, Seaborn interprets the dataframe as having its original number of rows instead of the actual one. This raises an error of the form:
AttributeError: 'NoneType' object has no attribute 'get_bbox'
This part of the error message gives us a hint:
The palette list has fewer values (18) than needed (27) and will cycle, which may produce an uninterpretable plot.
18 containers are expected but the list has 27 containers, hence the error.
The current workaround is to remove hue="category" when the number of models is below an arbitrary threshold. This issue seems to apply only to small data subsets. A potential fix could be updating pandas to v2.
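The threshold workaround could be sketched roughly as follows. The helper name `hue_column`, the `model` column, and the cutoff of 50 are all hypothetical; the issue only says the threshold is arbitrary:

```python
from typing import Optional

import pandas as pd

# Hypothetical cutoff, not a value taken from the project.
MIN_MODELS_FOR_HUE = 50


def hue_column(df: pd.DataFrame, model_col: str = "model") -> Optional[str]:
    """Return "category" for large subsets, or None to plot without hue."""
    n_models = df[model_col].nunique()
    return "category" if n_models >= MIN_MODELS_FOR_HUE else None
```

The result can be passed straight to the plotting call, for example `hue=hue_column(df)`, since Seaborn plotting functions accept `hue=None`.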