[CORE] Getting 'Categorical categories must be unique' error during the aggregate step
What happened + What you expected to happen
Hello,
During the aggregation step, I am getting an error that says 'categorical categories must be unique'.
Y_df, S_df, tags = aggregate(df, spec)
(screenshot of the error traceback)
I also updated to the main branch of the repo (pip install --upgrade git+https://github.com/Nixtla/hierarchicalforecast.git), but I'm not sure the update actually took effect, because the location where the error occurred doesn't correspond to the current source:
line 207 of my installed utils.py doesn't contain Y_bottom_df, S_df, tags = _to_summing_dataframe(df=df, spec=spec).
Screenshot from the utils.py file
Error
---> 22 Y_df, S_df, tags = aggregate(df, spec)
23 print(f"Number of unique ids of aggregated dataframe: {S_df.shape[0]}")
File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/hierarchicalforecast/utils.py:207, in aggregate(df, spec, is_balanced)
194 """ Utils Aggregation Function.
195 Aggregates bottom level series contained in the pd.DataFrame `df` according
196 to levels defined in the `spec` list applying the `agg_fn` (sum, mean).
(...)
203 summing dataframe `S_df`, and hierarchical aggregation indexes `tags`.
204 """
205 #-------------------------------- Wrangling --------------------------------#
206 # constraints S_df and collapsed Y_bottom_df with 'unique_id'
--> 207 Y_bottom_df, S_df, tags = _to_summing_dataframe(df=df, spec=spec)
209 # Create balanced/sorted dataset for numpy aggregation (nan=0)
210 # TODO: investigate potential memory speed tradeoff
211 if not is_balanced:
File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/hierarchicalforecast/utils.py:187, in _to_summing_dataframe(df, spec)
185 Y_bottom_df = df.copy()
186 Y_bottom_df.unique_id = Y_bottom_df.unique_id.astype('category')
--> 187 Y_bottom_df.unique_id = Y_bottom_df.unique_id.cat.set_categories(S_df.columns)
188 Y_bottom_df = Y_bottom_df.groupby(['unique_id', 'ds'])['y'].sum().reset_index()
189 return Y_bottom_df, S_df, tags
File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/pandas/core/accessor.py:94, in PandasDelegate._add_delegate_accessors.<locals>._create_delegator_method.<locals>.f(self, *args, **kwargs)
93 def f(self, *args, **kwargs):
---> 94 return self._delegate_method(name, *args, **kwargs)
File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/pandas/core/arrays/categorical.py:2879, in CategoricalAccessor._delegate_method(self, name, *args, **kwargs)
2876 from pandas import Series
2878 method = getattr(self._parent, name)
-> 2879 res = method(*args, **kwargs)
2880 if res is not None:
2881 return Series(res, index=self._index, name=self._name)
File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/pandas/core/arrays/categorical.py:1021, in Categorical.set_categories(self, new_categories, ordered, rename, inplace)
1019 if ordered is None:
1020 ordered = self.dtype.ordered
-> 1021 new_dtype = CategoricalDtype(new_categories, ordered=ordered)
1023 cat = self if inplace else self.copy()
1024 if rename:
File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/pandas/core/dtypes/dtypes.py:186, in CategoricalDtype.__init__(self, categories, ordered)
185 def __init__(self, categories=None, ordered: Ordered = False) -> None:
--> 186 self._finalize(categories, ordered, fastpath=False)
File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/pandas/core/dtypes/dtypes.py:340, in CategoricalDtype._finalize(self, categories, ordered, fastpath)
337 self.validate_ordered(ordered)
339 if categories is not None:
--> 340 categories = self.validate_categories(categories, fastpath=fastpath)
342 self._categories = categories
343 self._ordered = ordered
File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/pandas/core/dtypes/dtypes.py:537, in CategoricalDtype.validate_categories(categories, fastpath)
534 raise ValueError("Categorical categories cannot be null")
536 if not categories.is_unique:
--> 537 raise ValueError("Categorical categories must be unique")
539 if isinstance(categories, ABCCategoricalIndex):
540 categories = categories.categories
ValueError: Categorical categories must be unique
Versions / Dependencies
0.3.0
but I additionally installed
pip install --upgrade git+https://github.com/Nixtla/hierarchicalforecast.git
Reproduction script
Below are two examples; adding an additional hierarchy level makes the error disappear.
Gives error:
df = pd.DataFrame(data = {'cat1': ['a', 'a', 'a'], 'cat2': ['1', '2', '3'], 'y': [10, 20, 30], 'ds': ['2020-01-01', '2020-01-01', '2020-01-01']})
df.insert(0, 'country', 'COUNTRY')
spec = [['country'], ['country', 'cat1'], ['country', 'cat2']]
Y_df, S_df, tags = aggregate(df, spec)
(screenshot of the failing run)
Does not give error:
df = pd.DataFrame(data = {'cat1': ['a', 'a', 'a'], 'cat2': ['1', '2', '3'], 'y': [10, 20, 30], 'ds': ['2020-01-01', '2020-01-01', '2020-01-01']})
df.insert(0, 'country', 'COUNTRY')
spec = [['country'], ['country', 'cat1'], ['country', 'cat2'], ['country', 'cat1', 'cat2'],]
Y_df, S_df, tags = aggregate(df, spec)
(screenshot of the successful run)
Issue Severity
None
Hey @iamyihwa,
I found a similar error last week, and its cause was that some labels were duplicated; in my case, I had a geographic hierarchy where a state and a city had the same name.
My solution was to add a suffix to the string to identify states and cities uniquely.
'ABC', 'ABC' vs 'state_[ABC]', 'city_[ABC]'
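A minimal sketch of that renaming workaround (the column names and data here are hypothetical, just to illustrate the idea):

```python
import pandas as pd

# Hypothetical frame where a state and a city share the label 'ABC'
df = pd.DataFrame({'state': ['ABC'], 'city': ['ABC'], 'y': [1.0]})

# Prefix each level's labels so they are unique across the hierarchy
df['state'] = 'state_[' + df['state'] + ']'
df['city'] = 'city_[' + df['city'] + ']'
print(df[['state', 'city']].iloc[0].tolist())  # ['state_[ABC]', 'city_[ABC]']
```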
Would you be able to confirm if that solves it?
Thanks @kdgutier for your quick response!
I am not sure it is the same issue in this case.
For example, in this very simple example there is no overlap at all between the values in the different columns (cat1, cat2).
Here it also seems related to which level gets selected as the bottom level.
With the same set of hierarchies as in the failing example above, putting cat2 (distinct values per row) before cat1 (all values in the column are the same) makes the error disappear.
Error: [['country'], ['country', 'cat1'], ['country', 'cat2']]
No Error: [['country'], ['country', 'cat2'], ['country', 'cat1']]
It works only for string / object dtypes.
@PetricaRadan In the example above, all categories are strings/objects (note the values are '1', '2', not 1, 2).
@iamyihwa can you show me the output of df.info(), please? I am referring to the data types.
Sure @PetricaRadan
Can you convert df using convert_dtypes?
df = df.convert_dtypes()
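For context, a quick illustration (mine, not from the thread) of what convert_dtypes does on the reproduction frame:

```python
import pandas as pd

df = pd.DataFrame({'cat1': ['a', 'a', 'a'], 'cat2': ['1', '2', '3'],
                   'y': [10, 20, 30], 'ds': ['2020-01-01'] * 3})
df = df.convert_dtypes()
# object columns become the nullable 'string' dtype, ints become 'Int64'
print(df.dtypes)
```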
I don't understand the reasoning behind it ...
TL;DR: Always include in the hierarchy (spec) a longest level from which all other levels can be reconstructed.
I guess the problem behind it is that the upper-level forecasts cannot be reconstructed from the bottom-level forecasts. Currently, the code selects the longest level as the bottom level.
When there is a tie, the first one gets selected. In the scenario above, 'cat1' contains only a single value, so it carries essentially no information; if that level comes before 'cat2' and gets selected as the bottom level, there is no way to get a reconstruction that includes 'cat2'.
Whereas in the other ordering, where 'cat2' comes first, 'cat1' is simply the sum over all of 'cat2', so there is no issue.
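The pandas-level mechanism can be reproduced in isolation. This is my sketch of what happens when the chosen bottom level cannot distinguish all series, so the categories passed to set_categories contain duplicates:

```python
import pandas as pd

# set_categories with a non-unique index raises the exact ValueError
# shown in the traceback above.
s = pd.Series(['a', 'b']).astype('category')
try:
    s.cat.set_categories(pd.Index(['a', 'a', 'b']))  # duplicate 'a'
except ValueError as e:
    print(e)  # Categorical categories must be unique
```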
Since I found the solution to the issue, I will close it.