worldbank/REaLTabFormer

Could the order of columns affect the quality of synthetic data?

efstathios-chatzikyriakidis opened this issue · 2 comments

Hi @avsolatorio!

Could the order of columns affect the quality of the synthetic data, for example placing categorical columns first and numerical/datetime columns after, or the opposite? Furthermore, the categorical columns could themselves be ordered by cardinality. Correlations exist across all columns, and I am wondering whether putting the categoricals first or last, or sorting them by ascending or descending cardinality, would allow better learning.

Thanks!

I have run some tests and it seems that it doesn't matter. I observed similar results in every case: categorical columns first or last, and sorted by increasing or decreasing cardinality.
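
For anyone who wants to try the same comparison, here is a minimal sketch of how the column orderings could be generated with pandas. It is not the exact code used for the tests above. `reorder_columns` is a hypothetical helper (not part of REaLTabFormer), the toy DataFrame is made up for illustration, and treating every non-numeric, non-datetime column as categorical is an assumption that may not fit every dataset.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def reorder_columns(df, categorical_first=True, ascending=True):
    """Group categorical columns together and sort them by cardinality.

    Hypothetical helper for illustration; not part of REaLTabFormer.
    """
    # Assumption: any column that is neither numeric nor datetime is categorical.
    # Note that boolean columns count as numeric under this rule.
    cat_cols = [
        c for c in df.columns
        if not (is_numeric_dtype(df[c]) or is_datetime64_any_dtype(df[c]))
    ]
    other_cols = [c for c in df.columns if c not in cat_cols]

    # Sort categoricals by number of distinct values; ties keep their original order.
    cat_cols = sorted(cat_cols, key=lambda c: df[c].nunique(), reverse=not ascending)

    ordered = cat_cols + other_cols if categorical_first else other_cols + cat_cols
    return df[ordered]


# Toy data, made up for illustration only.
df = pd.DataFrame({
    "amount": [1.5, 2.0, 3.25, 4.0],
    "city": ["A", "B", "C", "D"],   # cardinality 4
    "flag": ["y", "n", "y", "n"],   # cardinality 2
    "when": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"]
    ),
})

# The four cases: categoricals first or last, ascending or descending cardinality.
variants = {
    (cat_first, asc): list(reorder_columns(df, cat_first, asc).columns)
    for cat_first in (True, False)
    for asc in (True, False)
}

for key, cols in variants.items():
    print(key, cols)
```

Each reordered DataFrame can then be used to train a separate model, and the resulting synthetic data compared with whatever quality metric you prefer.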

This can be closed.