KylinOLAP/Kylin

Too high cardinality is not suitable for dictionary!

Yancey1989 opened this issue · 3 comments

Hi!
Building a cube failed with the following error:

[QuartzScheduler_Worker-22]:[2015-01-08 00:21:38,468][INFO][com.kylinolap.dict.DictionaryGenerator.buildDictionaryFromValueList(DictionaryGenerator.java:72)] - Dictionary cardinality 9999956
[QuartzScheduler_Worker-22]:[2015-01-08 00:21:38,468][ERROR][com.kylinolap.job.hadoop.dict.CreateDictionaryJob.run(CreateDictionaryJob.java:55)] - Too high cardinality is not suitable for dictionary! Are the values stable enough for incremental load??
java.lang.IllegalArgumentException: Too high cardinality is not suitable for dictionary! Are the values stable enough for incremental load??
        at com.kylinolap.dict.DictionaryGenerator.buildDictionaryFromValueList(DictionaryGenerator.java:75)
        at com.kylinolap.dict.DictionaryGenerator.buildDictionary(DictionaryGenerator.java:110)
        at com.kylinolap.dict.DictionaryManager.buildDictionary(DictionaryManager.java:166)
        at com.kylinolap.cube.CubeManager.buildDictionary(CubeManager.java:171)

The relevant source code:

/**
 * @author yangli9
 */
@SuppressWarnings({ "rawtypes", "unchecked" })
public class DictionaryGenerator {

    private static final Logger logger = LoggerFactory.getLogger(DictionaryGenerator.class);

    private static final String[] DATE_PATTERNS = new String[] { "yyyy-MM-dd" };

    public static Dictionary<?> buildDictionaryFromValueList(DictionaryInfo info, List<byte[]> values) {
        info.setCardinality(values.size());
...
        // log a few samples
        StringBuilder buf = new StringBuilder();
        for (Object s : samples) {
            if (buf.length() > 0)
                buf.append(", ");
            buf.append(s.toString()).append("=>").append(dict.getIdFromValue(s));
        }
        logger.info("Dictionary value samples: " + buf.toString());
        logger.info("Dictionary cardinality " + info.getCardinality());

        if (values.size() > 1000000)
            throw new IllegalArgumentException("Too high cardinality is not suitable for dictionary! Are the values stable enough for incremental load??");

        return dict;
...

The limit here is 1000000. What does it mean?

The dictionary resides in memory. If a column has very high cardinality, the generated dictionary will occupy a lot of memory, which does not make much sense. For such columns, consider not using dictionary encoding.
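To illustrate the point, here is a minimal sketch of dictionary encoding with a cardinality guard. It is not Kylin's implementation: the class `DictionarySketch` and the method `buildDictionary` are hypothetical names, and the only detail taken from the code quoted above is the 1000000 threshold. Every distinct value gets its own in-memory entry, so memory use grows with the number of distinct values.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration only; this is not Kylin's DictionaryGenerator.
final class DictionarySketch {

    // Same threshold as the check in the quoted source.
    static final int MAX_CARDINALITY = 1_000_000;

    // Maps each distinct value to a dense integer ID, assigned in sorted order.
    // The whole map lives in memory: one entry per distinct value.
    static Map<String, Integer> buildDictionary(List<String> values, int maxCardinality) {
        TreeSet<String> distinct = new TreeSet<>(values); // sorts and de-duplicates
        if (distinct.size() > maxCardinality) {
            throw new IllegalArgumentException(
                "Cardinality " + distinct.size() + " exceeds limit " + maxCardinality);
        }
        Map<String, Integer> dict = new LinkedHashMap<>();
        int nextId = 0;
        for (String value : distinct) {
            dict.put(value, nextId++);
        }
        return dict;
    }

    public static void main(String[] args) {
        List<String> column = Arrays.asList("beijing", "shanghai", "beijing", "shenzhen");
        Map<String, Integer> dict = buildDictionary(column, MAX_CARDINALITY);
        System.out.println("cardinality = " + dict.size());
        System.out.println(dict);
    }
}
```

A column with four rows but three distinct values produces three dictionary entries. A column with roughly ten million distinct values, like the one in the log above, would need roughly ten million entries, which is why the build is rejected.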

@binmahone

  1. Does "a quite large cardinality" mean that the column has a large number of distinct values?
  2. How can I "avoid using dictionary encoding"? Can I do it when creating the cube?

1. Yes.
2. When you create the cube, open the "advanced settings" tab and set "use dictionary" to false for the dimension.