Koalas.idxmin() is not picking the minimum value from a dataframe, but pandas.idxmin() does
nikeshv opened this issue · 1 comments
Hi,
I have a Koalas dataframe with Age and Income columns. I calculated the z-score of each column and then computed the norm of each (Income_zscore, Age_zscore) pair, storing it in a new column named sq_dist. When I then call idxmin() on the new column, it does not give the minimum value.
I did the same operations on a pandas dataframe, and there it gives the minimum value.
Please find attached the notebook for step by step operations I performed.
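For reference, the computation described above can be sketched on a tiny plain-pandas frame. This is only an illustration with made-up values, not the attached notebook; the z-score is computed by hand with the population standard deviation (ddof=0), which is also the default of scipy.stats.zscore.

```python
import numpy as np
import pandas as pd

# Made-up sample values, for illustration only
df_small = pd.DataFrame({
    'Age': [25, 40, 55, 70],
    'Income': [30000, 52000, 48000, 90000],
})

# Z-score per column: (x - mean) / population std (ddof=0)
zscores = (df_small - df_small.mean()) / df_small.std(ddof=0)

# sq_dist is the Euclidean norm of each row's (Age, Income) z-score pair
df_small['sq_dist'] = np.linalg.norm(zscores.to_numpy(), axis=1)

# Index label of the row with the smallest sq_dist, looked up by label
min_label = df_small['sq_dist'].idxmin()
print(min_label, df_small.loc[min_label, 'sq_dist'])
```

The value printed for the label returned by idxmin() is expected to equal df_small['sq_dist'].min(); that is the check the notebook below performs on the larger random data.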
cmd1
import databricks.koalas as ks
import pandas as pd
import random
cmd2
#Create Sample dataframe in Koalas
df = ks.DataFrame.from_dict({
'Age': [random.randint(0, 100000) for i in range(100000)],
'Income': [random.randint(0, 100000) for i in range(100000)]
})
print(df.head(5))
cmd3
import scipy.stats as stats
import numpy as np
ks.set_option('compute.ops_on_diff_frames', True)
df['Income_zscore'] = ks.Series(stats.zscore(df['Income'].to_numpy()))
df['Age_zscore'] = ks.Series(stats.zscore(df['Age'].to_numpy()))
df['sq_dist'] = [np.linalg.norm(i) for i in df[['Income_zscore','Age_zscore']].to_numpy()]
ks.set_option('compute.ops_on_diff_frames', False)
cmd4
#display(df)
cmd5
#calculate index of the min of sq_dist
minindex=df['sq_dist'].idxmin()
minindex
cmd6
#display min value of sq_dist
df['sq_dist'].iloc[minindex]
cmd7
df.to_spark().createOrReplaceTempView("koalastable")
cmd8
%sql
select min(sq_dist) from koalastable -- This doesn't match the value we got in cmd6
cmd9
#do same operations with Pandas
df_spark = df.to_spark()
stats_array = np.array(df_spark.select('Age', 'Income').collect())
normalized_data = stats.zscore(stats_array, axis=0)
df_pd = pd.DataFrame(data=normalized_data, columns=['Age', 'Income'])
df_pd['sq_dist'] = [np.linalg.norm(i) for i in normalized_data]
df_pd.head(5)
cmd10
minindex_pd=df_pd['sq_dist'].idxmin()
minindex_pd
cmd11
#display min value of sq_dist using pandas
df_pd['sq_dist'].iloc[minindex_pd]
cmd12
spark.createDataFrame(df_pd).createOrReplaceTempView("pandastable")
cmd13
%sql
select min(sq_dist) from pandastable -- This matches the value we got in cmd11
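One thing that may be worth ruling out (an assumption on my side, not something I have confirmed for Koalas): idxmin() returns an index label, while cmd6 and cmd11 look the value up by position with iloc. In plain pandas the two only coincide when the index is the default 0..n-1 range. A minimal sketch with made-up values:

```python
import pandas as pd

# Made-up values with a non-default index, for illustration only
s = pd.Series([3.0, 0.5, 2.0, 1.0], index=[10, 20, 30, 40])

label = s.idxmin()        # index label of the minimum, not its position
by_label = s.loc[label]   # label-based lookup returns the minimum

# Position-based lookup needs the integer position instead of the label
position = s.index.get_loc(label)
by_position = s.iloc[position]

print(label, by_label, by_position, s.min())
```

If the same distinction applies here, looking the value up with loc rather than iloc in cmd6 would be the comparable check.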