Problem writing parquet file: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainFloatDictionary

Read the per-date parquet files of a chunk and join them:

import os

# fconf['v3_combined_features'] is a per-date path template; basePath makes
# partition discovery treat the per-date files as one dataset.
chunk = [u'2017-05-21', u'2017-05-22', u'2017-05-23', u'2017-05-24', u'2017-05-25']
chunk_files = [fconf['v3_combined_features'].format(cd) for cd in chunk]
fts = sqlContext.read.options(basePath=os.path.dirname(fconf['v3_combined_features']))\
            .parquet(*chunk_files)
Job aborted due to stage failure: Task 24 in stage 110.0 failed 4 times, most recent failure: Lost task 24.3 in stage 110.0 (TID 2029, datanode02.trustiq.local, executor 2): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainFloatDictionary

Changed to a chunk of 3 input parquet files and the script ran OK:

get combined_fts for: [u'2017-05-01', u'2017-05-02', u'2017-05-03']

=> The error still happens after all, even when the chunk is reduced to 2 files!

Checked cluster resources: RAM is nearly full

=> Waited for the RAM usage to cool down and restarted the kernel, but the problem persists.

Usually, this problem is reported as: PlainFloatDictionary is not implemented.

I remembered that this is caused by a mismatched data type of the 'ac_real_age' column across the dates in our database: some dates store the column as float while the schema resolved for the combined read expects a different type, so the reader hits the unsupported PlainFloatDictionary.
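A quick way to confirm which dates carry the wrong type is to read each date's file on its own and check the column's dtype. A minimal sketch, reusing fconf['v3_combined_features'] and chunk from above and assuming double is the canonical type:

# Check each date's file separately and report any 'ac_real_age' type mismatch.
expected_type = 'double'   # assumption: double is the intended type
for cd in chunk:
    df = sqlContext.read.parquet(fconf['v3_combined_features'].format(cd))
    actual_type = dict(df.dtypes).get('ac_real_age')
    if actual_type != expected_type:
        print('%s: ac_real_age is %s, expected %s' % (cd, actual_type, expected_type))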

=> Need to cast the column for the wrongly-typed dates to the correct type and rewrite those files.
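A sketch of that fix, again assuming double is the correct type; wrong_dates is whatever the check above reports, and the '_fixed' output suffix is only illustrative (verify the rewritten files before replacing the originals):

from pyspark.sql.functions import col

# Rewrite each wrongly-typed date with 'ac_real_age' cast to double.
for cd in wrong_dates:
    path = fconf['v3_combined_features'].format(cd)
    fixed = sqlContext.read.parquet(path)\
                      .withColumn('ac_real_age', col('ac_real_age').cast('double'))
    # Write next to the original first, then swap it into place after checking.
    fixed.write.mode('overwrite').parquet(path + '_fixed')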