[CORE] Getting 'Categorical categories must be unique' error during the aggregate step
What happened + What you expected to happen
Hello,
During the aggregation step, I am getting an error that says 'categorical categories must be unique'.
Y_df, S_df, tags = aggregate(df, spec)
(screenshot of the error traceback)
I also updated to the main branch of the repo (pip install --upgrade git+https://github.com/Nixtla/hierarchicalforecast.git), but I'm not sure the update actually took effect, because the location where the error occurred doesn't correspond to the current source:
line 207 of my installed utils.py doesn't contain Y_bottom_df, S_df, tags = _to_summing_dataframe(df=df, spec=spec).
Screenshot from the utils.py file
Error
---> 22 Y_df, S_df, tags = aggregate(df, spec)
23 print(f"Number of unique ids of aggregated dataframe: {S_df.shape[0]}")
File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/hierarchicalforecast/utils.py:207, in aggregate(df, spec, is_balanced)
194 """ Utils Aggregation Function.
195 Aggregates bottom level series contained in the pd.DataFrame `df` according
196 to levels defined in the `spec` list applying the `agg_fn` (sum, mean).
(...)
203 summing dataframe `S_df`, and hierarchical aggregation indexes `tags`.
204 """
205 #-------------------------------- Wrangling --------------------------------#
206 # constraints S_df and collapsed Y_bottom_df with 'unique_id'
--> 207 Y_bottom_df, S_df, tags = _to_summing_dataframe(df=df, spec=spec)
209 # Create balanced/sorted dataset for numpy aggregation (nan=0)
210 # TODO: investigate potential memory speed tradeoff
211 if not is_balanced:
File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/hierarchicalforecast/utils.py:187, in _to_summing_dataframe(df, spec)
185 Y_bottom_df = df.copy()
186 Y_bottom_df.unique_id = Y_bottom_df.unique_id.astype('category')
--> 187 Y_bottom_df.unique_id = Y_bottom_df.unique_id.cat.set_categories(S_df.columns)
188 Y_bottom_df = Y_bottom_df.groupby(['unique_id', 'ds'])['y'].sum().reset_index()
189 return Y_bottom_df, S_df, tags
File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/pandas/core/accessor.py:94, in PandasDelegate._add_delegate_accessors.<locals>._create_delegator_method.<locals>.f(self, *args, **kwargs)
93 def f(self, *args, **kwargs):
---> 94 return self._delegate_method(name, *args, **kwargs)
File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/pandas/core/arrays/categorical.py:2879, in CategoricalAccessor._delegate_method(self, name, *args, **kwargs)
2876 from pandas import Series
2878 method = getattr(self._parent, name)
-> 2879 res = method(*args, **kwargs)
2880 if res is not None:
2881 return Series(res, index=self._index, name=self._name)
File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/pandas/core/arrays/categorical.py:1021, in Categorical.set_categories(self, new_categories, ordered, rename, inplace)
1019 if ordered is None:
1020 ordered = self.dtype.ordered
-> 1021 new_dtype = CategoricalDtype(new_categories, ordered=ordered)
1023 cat = self if inplace else self.copy()
1024 if rename:
File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/pandas/core/dtypes/dtypes.py:186, in CategoricalDtype.__init__(self, categories, ordered)
185 def __init__(self, categories=None, ordered: Ordered = False) -> None:
--> 186 self._finalize(categories, ordered, fastpath=False)
File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/pandas/core/dtypes/dtypes.py:340, in CategoricalDtype._finalize(self, categories, ordered, fastpath)
337 self.validate_ordered(ordered)
339 if categories is not None:
--> 340 categories = self.validate_categories(categories, fastpath=fastpath)
342 self._categories = categories
343 self._ordered = ordered
File /anaconda/envs/azureml_py38/lib/python3.8/site-packages/pandas/core/dtypes/dtypes.py:537, in CategoricalDtype.validate_categories(categories, fastpath)
534 raise ValueError("Categorical categories cannot be null")
536 if not categories.is_unique:
--> 537 raise ValueError("Categorical categories must be unique")
539 if isinstance(categories, ABCCategoricalIndex):
540 categories = categories.categories
ValueError: Categorical categories must be unique
Versions / Dependencies
0.3.0
but I additionally installed
pip install --upgrade git+https://github.com/Nixtla/hierarchicalforecast.git
Reproduction script
Below are two examples; adding an additional hierarchy level makes the error disappear.
Gives error:
df = pd.DataFrame(data = {'cat1': ['a', 'a', 'a'], 'cat2': ['1', '2', '3'], 'y': [10, 20, 30], 'ds': ['2020-01-01', '2020-01-01', '2020-01-01']})
df.insert(0, 'country', 'COUNTRY')
spec = [['country'], ['country', 'cat1'], ['country', 'cat2']]
Y_df, S_df, tags = aggregate(df, spec)
(screenshot of the failing run)
Does not give error:
df = pd.DataFrame(data = {'cat1': ['a', 'a', 'a'], 'cat2': ['1', '2', '3'], 'y': [10, 20, 30], 'ds': ['2020-01-01', '2020-01-01', '2020-01-01']})
df.insert(0, 'country', 'COUNTRY')
spec = [['country'], ['country', 'cat1'], ['country', 'cat2'], ['country', 'cat1', 'cat2'],]
Y_df, S_df, tags = aggregate(df, spec)
(screenshot of the successful run)
Issue Severity
None
Hey @iamyihwa,
I found a similar error last week, and its cause was that some labels were duplicated; in my case, I had a geographic hierarchy where a state and a city had the same name.
My solution was to add a suffix to the string to identify states and cities uniquely.
'ABC', 'ABC' vs 'state_[ABC]', 'city_[ABC]'
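A minimal sketch of that renaming workaround (the column names and data here are hypothetical, just to illustrate the idea):

```python
import pandas as pd

# Hypothetical frame where a state and a city share the label 'ABC'
df = pd.DataFrame({'state': ['ABC'], 'city': ['ABC'], 'y': [1.0]})

# Prefix each level's labels so they are unique across the hierarchy
df['state'] = 'state_[' + df['state'] + ']'
df['city'] = 'city_[' + df['city'] + ']'
print(df[['state', 'city']].iloc[0].tolist())  # ['state_[ABC]', 'city_[ABC]']
```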
Would you be able to confirm if that solves it?
Thanks @kdgutier for your quick response!
I am not sure it is the same issue in this case.
For example, in this very simple example there is no overlap at all between the values in the different columns (cat1, cat2).
Here it also seems related to which level gets selected as the bottom level.
With the same set of hierarchies as in the failing example above, putting cat2 (distinct values per row) before cat1 (all values in the column are the same) makes the error disappear.
Error: [['country'], ['country', 'cat1'], ['country', 'cat2']]
No Error: [['country'], ['country', 'cat2'], ['country', 'cat1']]
It works only for string / object dtypes.
@PetricaRadan In the example above, all categories are strings/objects (note the values are '1', '2', not 1, 2).
@iamyihwa can you show me the output of df.info(), please? I am referring to the data types.
Sure @PetricaRadan
Can you convert df using convert_dtypes?
df = df.convert_dtypes()
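For context, a quick illustration (mine, not from the thread) of what convert_dtypes does on the reproduction frame:

```python
import pandas as pd

df = pd.DataFrame({'cat1': ['a', 'a', 'a'], 'cat2': ['1', '2', '3'],
                   'y': [10, 20, 30], 'ds': ['2020-01-01'] * 3})
df = df.convert_dtypes()
# object columns become the nullable 'string' dtype, ints become 'Int64'
print(df.dtypes)
```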
I don't understand the reasoning behind it ...
TL;DR: Always include in the hierarchy (spec) a longest level from which all other levels can be reconstructed.
I guess the problem behind it is that the upper-level forecasts cannot be reconstructed from the bottom-level forecasts. Currently, the code selects the longest level as the bottom level.
When there is a tie, the first one gets selected. In the scenario above, 'cat1' contains only a single value, so it carries essentially no information; if that level comes before 'cat2' and gets selected as the bottom level, there is no way to get a reconstruction that includes 'cat2'.
Whereas in the other ordering, where 'cat2' comes first, 'cat1' is simply the sum over all of 'cat2', so there is no issue.
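The pandas-level mechanism can be reproduced in isolation. This is my sketch of what happens when the chosen bottom level cannot distinguish all series, so the categories passed to set_categories contain duplicates:

```python
import pandas as pd

# set_categories with a non-unique index raises the exact ValueError
# shown in the traceback above.
s = pd.Series(['a', 'b']).astype('category')
try:
    s.cat.set_categories(pd.Index(['a', 'a', 'b']))  # duplicate 'a'
except ValueError as e:
    print(e)  # Categorical categories must be unique
```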
Since I found the solution to the issue, I will close it.