feature-engine/feature_engine

SmartCorrelatedSelection may not replicate the same result

WhyYouNeedMyUsername opened this issue · 1 comment

Hi,

I found this issue while running experiments on some datasets.

Describe the bug
SmartCorrelatedSelection may not reproduce the same result on some datasets.
This is because a set is an unordered data structure, so .add() does not preserve insertion order.

When several features tie on the score used by the selection method, the selected feature can differ between runs.

It only happens after I restart my development environment (otherwise, the result may stay the same).

I solved this by converting the set to a list, _temp_set = list(set([feature])), and replacing .add(f2) with _temp_set.append(f2) in SmartCorrelatedSelection.py
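To illustrate the idea behind this change, here is a minimal sketch, not the library's actual code. The helper names build_group_as_list and pick_by_variance, the feature names, and the variance values are all hypothetical. It only shows that a list keeps insertion order, so a tie between equally scored features is always resolved the same way.

```python
def build_group_as_list(feature, correlated):
    """Collect a feature and its correlated features in insertion order.

    Hypothetical helper: a list replaces the set, with a membership
    check so that duplicates are still skipped.
    """
    group = [feature]
    for f2 in correlated:
        if f2 not in group:
            group.append(f2)
    return group


def pick_by_variance(group, variances):
    """Return the feature with the highest variance.

    max() returns the first item with the maximal key, so ties are
    resolved by the iteration order of `group`.
    """
    return max(group, key=lambda f: variances[f])


# Made-up variances: var_a and var_b tie on purpose.
variances = {"var_a": 1.0, "var_b": 1.0, "var_c": 0.5}

group = build_group_as_list("var_a", ["var_b", "var_c", "var_b"])
kept = pick_by_variance(group, variances)
print(group, kept)
```

With a list, the tie between var_a and var_b always goes to whichever was inserted first. With a set of strings, the iteration order (and therefore the winner of the tie) can change from one interpreter session to the next.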

To Reproduce
Steps to reproduce the behavior:

  1. Run SmartCorrelatedSelection on a dataset that has many correlated features.
    (In my case, I set selection_method="variance")
  2. Record features_to_keep_
  3. Restart your environment (including Python).
  4. Run it again and you will see a different features_to_keep_ result
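The restart effect in the steps above can be simulated without feature_engine. The sketch below is an assumption about the underlying cause, not a confirmed diagnosis: Python randomizes string hashes per process unless PYTHONHASHSEED is fixed, so the iteration order of a set of strings may differ between interpreter sessions. The feature names and the helpers order_under_seed and distinct_orders are hypothetical.

```python
import json
import os
import subprocess
import sys

# Hypothetical feature names standing in for a group of correlated columns.
FEATURES = ["var_a", "var_b", "var_c", "var_d", "var_e"]

# Code run in a fresh interpreter: build a set and print its iteration order.
CHILD_CODE = (
    "import json, sys; "
    "print(json.dumps(list(set(json.loads(sys.argv[1])))))"
)


def order_under_seed(seed):
    """Return the set iteration order seen by a new interpreter
    started with the given PYTHONHASHSEED value."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, json.dumps(FEATURES)],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


def distinct_orders(seeds):
    """Collect the distinct iteration orders observed across seeds."""
    return {tuple(order_under_seed(seed)) for seed in seeds}


# Each seed stands in for one "restart" of the environment.
print(distinct_orders(["1", "2", "3", "4", "5"]))
```

If more than one ordering is printed, any tie-breaking that depends on set iteration order can change between sessions. Fixing PYTHONHASHSEED before starting Python makes the order repeatable, which may serve as a workaround until the ordering no longer depends on a set.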

Expected behavior
Drop the same features when the given parameters are the same.

Desktop (please complete the following information):
NAME="CentOS Linux"
VERSION="8"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-8"
CENTOS_MANTISBT_PROJECT_VERSION="8"

Thank you for developing this wonderful tool! 🌟

@WhyYouNeedMyUsername thanks for raising this!

I will look into it over Christmas :)

Otherwise, feel free to make a PR with the suggested changes! That would be most welcome.