Question) Getting Imputation Weight
hyojin0912 opened this issue · 5 comments
Thanks for your nice package.
I have one question.
I am imputing a large matrix (90,000 by 7,000). The matrix contains a lot of NAs (over 80%), and it includes both numerical values and zero/one categorical values.
Below is my code (after loading the whole dataframe to impute):
```python
import pickle
import pandas as pd
import tensorflow as tf
import datawig

with tf.device(d):
    df = datawig.SimpleImputer.complete(df, inplace=True, num_epochs=max_epoch, verbose=1,
                                        output_path=result_dir + str(num_seed) + 'seed_imputer_model')
    with open(result_dir + str(num_seed) + "seed_Imputed_merged_cid.pickle", 'wb') as handle:
        pickle.dump(merged_cid, handle, protocol=pickle.HIGHEST_PROTOCOL)
    pd.DataFrame(df).to_csv(result_dir + str(num_seed) + 'seed_Imputed_merged_cid.csv', index=None)
```
I use "datawig.SimpleImputer.complete" for simplicity,
but is there any method to get neural network weight which used for imputation.
And "datawig.SimpleImputer.complete" function how works for train and validation
I asking because there is no decrease of accuracy
```
2020-10-27 11:14:22,355 [INFO] Epoch[49] Batch [0-34] Speed: 1651.71 samples/sec cross-entropy=0.515578 C0040436-accuracy=0.000000
2020-10-27 11:14:22,675 [INFO] Epoch[49] Train-cross-entropy=0.667427
2020-10-27 11:14:22,675 [INFO] Epoch[49] Train-C0040436-accuracy=0.000000
2020-10-27 11:14:22,676 [INFO] Epoch[49] Time cost=0.657
2020-10-27 11:14:22,688 [INFO] Saved checkpoint to "result/dtip/impute/datawig/1000seed_imputer_model/C0040436/model-0049.params"
2020-10-27 11:14:22,723 [INFO] Epoch[49] Validation-cross-entropy=0.492388
2020-10-27 11:14:22,723 [INFO] Epoch[49] Validation-C0040436-accuracy=0.000000
```
Thanks
Hyojin
Hi
I'm not sure I fully understand how to solve this, but here are a couple of remarks:
- datawig is built on mxnet; you're using a tensorflow context
- `pd.DataFrame(df)` is not needed, `df` is already a dataframe
- you can get the model weights from the mxnet model by following the mxnet API tutorials, an example is here (see also the checkpoint-loading sketch right after this list)
- Most importantly: there is no decrease in accuracy because you're already at 0 accuracy
- if you want to control the train / test split, I'd recommend using `SimpleImputer.fit` and `SimpleImputer.predict` (see the fit/predict sketch further below)
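A minimal sketch of reading the saved weights back from one of the checkpoints in your log, assuming plain mxnet (the path and epoch number are copied from the log above and will differ for your other columns):

```python
# Load a checkpoint written by mxnet's Module API and list its weights.
import mxnet as mx

# Path copied from the training log; adjust the column folder and epoch.
params_file = "result/dtip/impute/datawig/1000seed_imputer_model/C0040436/model-0049.params"

# mx.nd.load returns a dict of NDArrays; Module checkpoints prefix keys with
# "arg:" (trainable weights) and "aux:" (auxiliary states such as batch-norm stats).
params = mx.nd.load(params_file)
for name, array in params.items():
    print(name, array.shape)
```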
I would strongly recommend checking the metrics of your model; imputation results should be treated with care if the metrics indicate accuracy as low as in your case. The `predict` function of `SimpleImputer` has a `precision_threshold` for categorical values that ensures you only get high-precision imputations.
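For reference, a rough sketch of the fit/predict route for a single column; the output column name `C0040436` is only an example taken from your log, and the 50 epochs / 0.9 threshold are arbitrary placeholders:

```python
# Sketch: impute one column with an explicit train/test split.
from datawig import SimpleImputer
from datawig.utils import random_split

df_train, df_test = random_split(df)

imputer = SimpleImputer(
    input_columns=[c for c in df.columns if c != "C0040436"],
    output_column="C0040436",
    output_path="imputer_model/C0040436",
)
imputer.fit(train_df=df_train, num_epochs=50)

# precision_threshold: only return categorical imputations the model is confident about
predictions = imputer.predict(df_test, precision_threshold=0.9)
```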
Hope this helps - feel free to reopen otherwise
Thanks for your kind reply.
But I still have a few things to ask:
- That was a typo on my side. I can't understand why the accuracy doesn't increase even if the loss decreases.
- How should I set up `precision_threshold` in my case? (Related to 1.)
- Is there any recommendation for parallelizing mxnet when using tensorflow as a backend?
I used the code below:

```python
for d in ['/gpu:2', '/gpu:3', '/gpu:4', '/gpu:5', '/gpu:6', '/gpu:7']:
    with tf.device(d):
        ...
```
I ask because I have spent days stuck in the state below. There must be an error:

```
2020-10-27 20:38:31,079 [INFO] Saved checkpoint to "result/dtip/impute/datawig/1000seed_imputer_model/C0344329/model-0036.params"
2020-10-27 20:38:31,136 [INFO] No improvement detected for 20 epochs compared to 1.0773332220560405 last error obtained: 5.240848921006545, stopping here
2020-10-27 20:38:31,136 [INFO] ========== done (33.13334774971008 s) fit model
```
I have uploaded the merged_cid.csv that I used in the code above as merged_cid (= `df` for `SimpleImputer.complete`).
Thanks
The cross-entropy can still change when the accuracy doesn't; cross-entropy is just a finer-grained measure than accuracy (see the small example below).
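A tiny numpy illustration (not datawig-specific) of how the loss can keep improving while the accuracy stays at zero:

```python
# The probability of the true class rises, so the cross-entropy falls,
# but the argmax is still wrong, so the accuracy stays at 0.
import numpy as np

true_class = 1
probs_epoch_1 = np.array([0.70, 0.20, 0.10])  # argmax = 0 -> wrong
probs_epoch_2 = np.array([0.55, 0.40, 0.05])  # argmax = 0 -> still wrong

for probs in (probs_epoch_1, probs_epoch_2):
    cross_entropy = -np.log(probs[true_class])
    accuracy = int(probs.argmax() == true_class)
    print(f"cross-entropy={cross_entropy:.3f} accuracy={accuracy}")
```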
The precision threshold is a standard parameter of `SimpleImputer.predict`.
mxnet and tensorflow are usually not combined; you pick one or the other.
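To make the device point concrete: `tf.device(...)` has no effect on mxnet operations, since mxnet selects devices through its own contexts. A minimal, non-datawig illustration (whether and how datawig exposes a context argument depends on the version you have installed):

```python
# mxnet places computation via contexts such as mx.cpu() / mx.gpu(i);
# wrapping mxnet code in tf.device(...) does nothing.
import mxnet as mx

ctx = mx.gpu(2) if mx.context.num_gpus() > 2 else mx.cpu()
x = mx.nd.zeros((2, 2), ctx=ctx)
print(x.context)
```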
Thank you for the fast reply.
I understand everything now.
Then, do you have any guess about my zero accuracy, given that my matrix contains lots of NAs?
Hm, I'd probably use the `SimpleImputer.fit`/`predict` approach for single columns (like `complete` does, but writing the for loop over the columns yourself, because in `complete` the metrics/log dir is deleted immediately) and then check the metrics files to see which columns can actually be predicted well enough. Roughly along the lines of the sketch below.
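A rough sketch of that loop, under a few assumptions of mine: the `<column>_imputed` name of the prediction column and the held-out accuracy check are not datawig guarantees, and the equality comparison only makes sense for your 0/1 categorical columns:

```python
# Per-column fit/predict loop; each column keeps its own output_path so the
# metrics/log files survive and can be inspected afterwards.
import os

from datawig import SimpleImputer
from datawig.utils import random_split

df_train, df_test = random_split(df)

for col in df.columns:
    out_dir = os.path.join("imputer_models", str(col))
    imputer = SimpleImputer(
        input_columns=[c for c in df.columns if c != col],
        output_column=col,
        output_path=out_dir,
    )
    imputer.fit(train_df=df_train, num_epochs=50)
    predictions = imputer.predict(df_test)

    # Assumption: imputations land in a "<col>_imputed" column; compare only
    # rows where the held-out value is observed (the matrix is >80% NA).
    imputed_col = f"{col}_imputed"
    if imputed_col in predictions.columns:
        mask = df_test[col].notna()
        accuracy = (predictions.loc[mask, imputed_col] == df_test.loc[mask, col]).mean()
        print(col, "held-out accuracy:", accuracy)
    # the metrics/log files written into out_dir can also be inspected here
```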