Datasets used for producing benchmarks in scikit-learn intelex
vineel96 opened this issue · 7 comments
Hello,
Can I get information about the datasets used to produce the benchmark results (speedup values) for the different scikit-learn algorithms, as shown in the figure under the Acceleration sub-section at https://github.com/intel/scikit-learn-intelex. The image is also attached here:

Datasets are specified in this config: https://github.com/IntelPython/scikit-learn_bench/blob/master/configs/skl_config.json
Data generation/loading functions are defined here: https://github.com/IntelPython/scikit-learn_bench/tree/master/datasets
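A quick way to see which dataset each benchmark case uses is to inspect the config directly. A minimal sketch, assuming the "cases"/"dataset" layout these scikit-learn_bench configs appeared to use; field names should be verified against the actual file:

```python
import json

# Load the benchmark config; the path assumes you are at the repo root.
with open("configs/skl_config.json") as f:
    config = json.load(f)

# Field names below ("cases", "algorithm", "dataset", "source") are an
# assumption about the config layout; verify against skl_config.json.
for case in config.get("cases", []):
    for ds in case.get("dataset", []):
        print(case.get("algorithm"), "->", ds.get("source"), ds.get("name") or ds.get("type"))
```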
Hi,
Thank you for the links. So all the experiments in the figure were done with synthetic datasets generated by sklearn's make_blobs (except for SVC and RF, where a named dataset is used), using this script: https://github.com/IntelPython/scikit-learn_bench/blob/master/datasets/make_datasets.py. Is that right?
Yes, that's right.
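For reference, a minimal sketch of the kind of make_blobs call such a script performs (the sizes and centers below are illustrative, not the values from skl_config.json):

```python
from sklearn.datasets import make_blobs

# Generate a synthetic clustering dataset; the actual n_samples/n_features/
# centers for each benchmark case come from the config, not these values.
X, y = make_blobs(
    n_samples=100_000,  # rows
    n_features=50,      # columns
    centers=10,         # blob centers
    random_state=777,   # reproducibility
)
print(X.shape, y.shape)  # (100000, 50) (100000,)
```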
Thanks for the information
Hi @Alexsandruss,
For inference:
1. Which data is used for kmeans? (There is no "testing" attribute for kmeans in skl_config.json.)
2. For knn, are training and testing samples generated separately, or are the training samples themselves used for testing?
3. For knn-kdt, linear regression, and ridge regression, no testing data info is provided, so which data is used for inference?
4. For random forest and svc, no info is provided about the train/test split. Which data is used for inference?
5. In the inference speedup graph, the dbscan algorithm is not shown. Why?
1-4. If the 'testing' field is not provided, then the same data is used for both training and inference. The train/test split for named datasets is defined in their data loaders.
5. sklearn's DBSCAN doesn't have a separate function for inference.
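To make point 5 concrete, a minimal sketch (illustrative parameters, not the benchmark setup): DBSCAN exposes fit() and fit_predict() but, unlike KMeans, no predict() for scoring new data, so there is no separate inference step to benchmark.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1_000, n_features=2, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
km_labels = km.predict(X)      # KMeans has a separate inference call

db = DBSCAN(eps=0.5, min_samples=5)
db_labels = db.fit_predict(X)  # clustering happens entirely inside fit
# db.predict(X) would fail: DBSCAN defines no predict() method
```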
Hi @Alexsandruss,
1-4. Generally, we use different data for training and inference, right? Is it OK to use the same training data for inference as well?
For named datasets, e.g. higgs_one_m for random forest, the speedup graph above shows the data size as 1M for both the inference and the training graphs. But loader_classification.py (in the datasets folder) shows different splits: (1000000, 28) for training and (500000, 28) for inference. So which split is actually used in the inference speedup graph? (The same question applies to all named datasets.)
5. So which function is used for dbscan in the training speedup graph, fit() or fit_predict()?
6. For knn kdtree, there is no fit() function. So in the training speedup graph, is only the object creation KDTree() timed, or something else? Also, for inference, which function is used? Is tree.query() used? (See the sketch after this list.)
7. Also, can you provide the parameter values that were used for each algorithm when generating the speedup graph above, e.g. for SVC and RF? I see that for the other algorithms the parameter info is given in skl_config.json.
8. Also, what are "time_method" and "time_limit" for kmeans in the skl_config.json file? And does n_clusters there refer to the initial number of clusters?
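Regarding question 6, a minimal sketch of the sklearn.neighbors.KDTree workflow being asked about (sizes and k are illustrative): tree construction plays the role of "training" and query() the role of "inference".

```python
import time
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 10))

t0 = time.perf_counter()
tree = KDTree(X, leaf_size=40)   # "training": building the tree
t1 = time.perf_counter()
dist, ind = tree.query(X, k=5)   # "inference": nearest-neighbor queries
t2 = time.perf_counter()
print(f"build: {t1 - t0:.3f} s, query: {t2 - t1:.3f} s")
```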