databricks/koalas

Why is kdf.head() much slower than sdf.show()?

Closed this issue · 3 comments

sdf = read_csv('backflow.csv')
kdf = sdf.to_koalas()

# run time 75ms
sdf.show(5)

# run time 53s
kdf.head()

kdf.head() is much slower than sdf.show(). Is there any way to speed it up in Koalas?

Very likely because of the default index: https://koalas.readthedocs.io/en/latest/user_guide/options.html#default-index-type . Can you try with ks.set_option('compute.default_index_type', 'distributed')?
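For anyone who wants to try this without changing the option globally, here is a minimal sketch of a helper that sets the default index type temporarily and restores the previous value afterwards. The helper name `default_index_type` and its `ks_module` parameter are hypothetical and not part of Koalas. It relies only on `get_option` and `set_option`, and expects `ks_module` to be `databricks.koalas`.

```python
import contextlib


@contextlib.contextmanager
def default_index_type(ks_module, index_type):
    """Temporarily set the default index type, restoring the old value on exit.

    `ks_module` is assumed to be `databricks.koalas` (usually imported as `ks`).
    It is passed in as an argument so this helper has no import-time dependency.
    """
    key = "compute.default_index_type"
    previous = ks_module.get_option(key)
    ks_module.set_option(key, index_type)
    try:
        yield
    finally:
        # Restore whatever was configured before, even if the body raised.
        ks_module.set_option(key, previous)


# Usage sketch (needs Koalas and a Spark session; `sdf` is the Spark
# DataFrame from the report above):
#
#   import databricks.koalas as ks
#   with default_index_type(ks, "distributed"):
#       kdf = sdf.to_koalas()
#       kdf.head()
```

Koalas also provides `ks.option_context`, which should serve the same purpose if it is available in your version.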

It's much faster now. Can we set it to distributed by default? The speed gap is too big.

The distributed index type disables operations between different DataFrames, so it's something we should discuss. Let me close this ticket for now, though.